Ports are given two additional properties: a character encoding and
a conversion failure strategy. These properties have getters and setters.
The new properties are used to convert any locale text to/from the
internal representation of strings.
If unspecified, ports use a default value. The default value of these
properties is held in a fluid. The default character encoding can be
modified by calling setlocale.
ISO-8859-1 is treated specially. Since it is a native encoding of
strings, it can be processed more quickly. Source code is assumed to be
ISO-8859-1 unless otherwise specified. The encoding of a source code
file can be given as 'coding: XXXXX' in a magic comment at the top of a
file.
The C functions that deal with encoding often use a null pointer
as shorthand for the native Latin-1 encoding, for efficiency's sake.
* test-suite/tests/encoding-iso88591.test: new tests
* test-suite/tests/encoding-iso88597.test: new tests
* test-suite/tests/encoding-utf8.test: new tests
* test-suite/tests/encoding-escapes.test: new tests
* test-suite/tests/numbers.test: declare 'binary' encoding
* test-suite/tests/ports.test: declare 'binary' encoding
* test-suite/tests/r6rs-ports.test: declare 'binary' encoding
* module/system/base/compile.scm (compile-file): use source-code
file's self-declared encoding when compiling files
* libguile/strports.c: store string ports in locale encoding
(scm_strport_to_locale_u8vector, scm_call_with_output_locale_u8vector)
(scm_open_input_locale_u8vector, scm_get_output_locale_u8vector):
new functions
* libguile/strings.h: new declaration for scm_i_string_contains_char
* libguile/strings.c (scm_i_string_contains_char): new function
(scm_from_stringn, scm_to_stringn): use NULL for Latin-1
(scm_from_locale_stringn, scm_to_locale_stringn): respect character
encoding of input and output ports
* libguile/read.h: declaration for scm_scan_for_encoding
* libguile/read.c:
(read_token): now takes scheme string instead of C string/length
(read_complete_token): new function
(scm_read_sexp, scm_read_number, scm_read_mixed_case_symbol)
(scm_read_number_and_radix, scm_read_quote, scm_read_semicolon_comment)
(scm_read_srfi4_vector, scm_read_bytevector, scm_read_guile_bit_vector)
(scm_read_scsh_block_comment, scm_read_commented_expression)
(scm_read_extended_symbol, scm_read_sharp_extension, scm_read_shart)
(scm_read_expression): use scm_t_wchar for char type, use read_complete_token
(scm_scan_for_encoding): new function to find a file's character encoding
(scm_file_encoding): new function to find a port's character encoding
* libguile/rdelim.c: don't unpack strings
* libguile/print.h: declaration for modified function
scm_i_charprint
* libguile/print.c: use locale when printing characters and
strings
(scm_i_charprint): input parameter is now scm_t_wchar
(scm_simple_format): don't unpack strings
* libguile/posix.h: new declaration for scm_setbinary.
* libguile/posix.c (scm_setlocale): set default and stdio port
encodings based on the locale's character encoding
(scm_setbinary): new function
* libguile/ports.h (scm_t_port): add encoding and failed
conversion handler to port type. Declarations for new or modified
functions scm_getc, scm_unget_byte, scm_ungetc,
scm_i_get_port_encoding, scm_i_set_port_encoding_x,
scm_port_encoding, scm_set_port_encoding_x,
scm_i_get_conversion_strategy, scm_i_set_conversion_strategy_x,
scm_port_conversion_strategy, scm_set_port_conversion_strategy_x.
* libguile/ports.c: assign the current ports to zero on startup so
we can see if they've been set.
(scm_current_input_port, scm_current_output_port,
scm_current_error_port): return #f if the port is not yet
initialized
(scm_new_port_table_entry): set up a new port's encoding and
illegal sequence handler based on the thread's current defaults
(scm_i_remove_port): free port encoding name when port is removed
(scm_i_mode_bits_n): now takes a scheme string instead of a c
string and length. All callers changed.
(SCM_MBCHAR_BUF_SIZE): new const
(scm_getc): new function, since the scm_getc in inline.h is now
scm_get_byte_or_eof. This pulls one codepoint from a port.
(scm_lfwrite_substr, scm_lfwrite_str): now uses port's encoding
(scm_unget_byte): new function, incorportaing the low-level functionality
of scm_ungetc
(scm_ungetc): uses scm_unget_byte
* libguile/numbers.h (scm_t_wchar): compilation order problem with
scm_t_wchar being use in functions in multiple headers. Forward
declare scm_t_wchar.
* libguile/load.c (scm_primitive_load): scan for file encoding at
top of file and use it to set the load port's encoding
* libguile/inline.h (scm_get_byte_or_eof): new function
incorporating most of the functionality of scm_getc.
* libguile/fports.c (fport_fill_input): now returns scm_t_wchar
* libguile/chars.h (scm_t_wchar): avoid compilation order problem
with declaration of scm_t_wchar
* libguile/hash.c (scm_i_string_hash): new function
(scm_hasher): don't unpack string: use scm_i_string_hash
* libguile/hash.h: new declaration for scm_i_string_hash
* libguile/print.c (quote_keywordish_symbol): use symbol accessors
(scm_i_print_symbol_name): new function
(scm_print_symbol_name): call scm_i_print_symbol_name
(iprin1): use scm_i_print_symbol_name to print symbols
* libguile/print.h: new declaration for scm_i_print_symbol_name
* libguile/symbols.c (lookup_interned_symbol): now takes scheme string
instead of c string; callers changed
(lookup_interned_symbol): add wide symbol support
(scm_i_c_mem2symbol): removed
(scm_i_mem2symbol): removed and replaced with scm_i_str2symbol
(scm_i_str2symbol): new function
(scm_i_mem2uninterned_symbol): removed and replaced with
scm_i_str2uninterned_symbol
(scm_i_str2uninterned_symbol): new function
(scm_make_symbol, scm_string_to_symbol, scm_from_locale_symbol)
(scm_from_locale_symboln): use scm_i_str2symbol
* test-suite/tests/symbols.test: new tests
* libguile/tags.h (scm_tc7_program):
* libguile/programs.h: Programs now have their own tc7 code. Fix up the
macros appropriately.
* libguile/programs.c: Remove smobby bits, leaving marking, printing,
and application for other parts of Guile.
* libguile/debug.c (scm_procedure_source):
* libguile/eval.c (scm_trampoline_0, scm_trampoline_1)
(scm_trampoline_2): Add cases for tc7_program.
* libguile/eval.i.c (CEVAL, SCM_APPLY):
* libguile/evalext.c (scm_self_evaluating_p):
* libguile/gc-card.c (scm_i_sweep_card, scm_i_tag_name):
* libguile/gc-mark.c (1):
* libguile/print.c (iprin1):
* libguile/procs.c (scm_procedure_p, scm_thunk_p)
* libguile/vm-i-system.c (make-closure): Adapt to new procedure
representation.
* libguile/procprop.c (scm_i_procedure_arity): Do the right thing for
programs.
* test-suite/tests/procprop.test ("procedure-arity"): Arity test now
succeeds.
* libguile/goops.c (scm_class_of): Programs now belong to the class
<procedure>, not a smob class.
* libguile/vm.h (struct vm, struct vm_cont):
* libguile/vm-engine.c (vm_engine):
* libguile/frames.h (SCM_FRAME_BYTE_CAST, struct vm_frame):
* libguile/frames.c (scm_c_make_vm_frame): Fix usages of scm_byte_t,
changing them to scm_t_uint8.
This requres the creation of a new type
scm_t_string_failed_conversion_handler to replace libunistring's
enum iconveh_ilseq_handler.
* libguile/strings.h: don't include <uniconv.h>
(scm_t_string_failed_conversion_handler): new enum type
(SCM_FAILED_CONVERSION_ERROR, SCM_FAILED_CONVERSION_QUESTION_MARK):
(SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE): new enum type values
* libguile/strings.c (scm_to_stringn): now takes type
scm_t_string_failed_conversion_handler. All callers changed.
* libguile/print.c: include <uniconv.h>
* libguile/ports.c (scm_lfwrite_substr): use
scm_t_string_conversion_handler's constants
* libguile/gen-scmconfig.c (SCM_ICONVEH_ERROR):
(SCM_ICONVEH_QUESTION_MARK, SCM_ICONVEH_ESCAPE_SEQUENCE): store
iconveh_ilseq_hander constants as #define's
Also, scm_charprint is renamed to scm_i_charprint.
* libguile/strings.h: make scm_i_string_wide_chars internal.
* libguile/print.h: rename scm_charprint to scm_i_charprint. Make
internal.
* libguile/print.c (scm_i_charprint): renamed from scm_charprint
(scm_charprint): renamed to scm_i_charprint. All callers changed.
This adds full Unicode strings as a datatype, and it adds some
minimal functionality. The terminal and port encoding is assumed
to be ISO-8859-1. Non-ISO-8859-1 characters are written or
input as string character escapes.
The string character escapes now have 3 forms: \xXX \uXXXX and
\UXXXXXX, for unprintable characters that have 2, 4 or 6 hex digits.
The process for writing to strings has been modified. There is now a
function scm_i_string_start_writing that does the copy-on-write
conversion if necessary.
To compile strings that may be wide, the VM storage of strings and
string-likes has changed.
Most string-using functions have not yet been updated and may break
when used with wide strings.
* module/language/assembly/compile-bytecode.scm (write-bytecode):
use variable width string bytecode format
* module/language/assembly.scm (byte-length): use variable width
bytecode format
* libguile/vm-i-loader.c (load-string, load-symbol):
(load-keyword, define): use variable-width bytecode format
* libguile/vm-engine.h (FETCH_WIDTH): new macro
* libguile/strings.h: new declarations
* libguile/strings.c (make_wide_stringbuf): new function
(widen_stringbuf): new function
(scm_i_make_wide_string): new function
(scm_i_is_narrow_string): new function
(scm_i_string_wide_chars): new function
(scm_i_string_start_writing): new function
(scm_i_string_ref): new function
(scm_i_string_set_x): new function
(scm_i_is_narrow_symbol): new function
(scm_i_symbol_wide_chars, scm_i_symbol_ref): new function
(scm_string_width): new function
(unistring_escapes_to_guile_escapes): new function
(scm_to_stringn): new function
(scm_i_stringbuf_free): modify for wide strings
(scm_i_substring_copy): modify for wide strings
(scm_i_string_chars, scm_string_append): modify for wide strings
(scm_i_make_symbol, scm_to_locale_stringn): modify for wide strings
(scm_string_dump, scm_symbol_dump, scm_to_locale_stringbuf):
(scm_string, scm_i_deprecated_string_chars): modify for wide strings
(scm_from_locale_string, scm_from_locale_stringn): add null test
* libguile/srfi-13.c: add calls for scm_i_string_start_writing for
each call of scm_i_string_stop_writing
(scm_string_for_each): modify for wide strings
* libguile/socket.c: add calls for scm_i_string_start_writing for each
call of scm_i_string_stop_writing
* libguile/rw.c: add calls for scm_i_string_start_writing for each
call of scm_i_string_stop_writing
* libguile/read.c (scm_read_string): allow reading of wide strings
* libguile/print.h: add declaration for scm_charprint
* libguile/print.c (iprin1): print wide strings and add new string
escapes
(scm_charprint): new function
* libguile/ports.h: new declarations for scm_lfwrite_substr and
scm_lfwrite_str
* libguile/ports.c (update_port_lf): new function
(scm_lfwrite): use update_port_lf
(scm_lfwrite_substr): new function
(scm_lfwrite_str): new function
* test-suite/tests/asm-to-bytecode.test ("compiler"): add string
width byte to sting-like asm tests
This adds the 32-bit standalone characters. Strings are still
8-bit. Characters larger than 8-bit can only be entered or
displayed in octal format at this point. At this point, the
terminal's display encoding is expected to be Latin-1.
* module/language/assembly/compile-bytecode.scm (write-bytecode):
add 32-bit char
* module/language/assembly.scm (object->assembly): add 32-bit char
(assembly->object): add 32-bit char
* libguile/vm-i-system.c (make-char32): new op
* libguile/print.c (iprin1): print 32-bit char
* libguile/numbers.h: add type scm_t_wchar
* libguile/numbers.c: add type scm_t_wchar
* libguile/chars.h: new type scm_t_wchar
(SCM_CODEPOINT_MAX): new
(SCM_IS_UNICODE_CHAR): new
(SCM_MAKE_CHAR): operate on 32-bit char
* libguile/chars.c: comparison operators now use Unicode
codepoints
(scm_c_upcase): now receives and returns scm_t_wchar
(scm_c_downcase): now receives and returns scm_t_wchar
The global variables scm_charnames and scm_charnums are replaced with
the accessor functions scm_i_charname and scm_i_charname_to_num.
Also, the incomplete and broken EBCDIC support is removed.
* libguile/print.c (iprin1): use new func scm_i_charname
* libguile/read.c (scm_read_character): use new func
scm_i_charname_to_num
* libguile/chars.c (scm_i_charname): new function
(scm_i_charname_to_char): new function
(scm_charnames, scm_charnums): removed
* libguile/chars.h: new declarations
The idea is to introduce `gsubrs' whose arity is encoded in their type
(more precisely in the sizeof (void *) - 8 MSBs). This removes the
indirection introduced by cclos and simplifies the code.
* libguile/__scm.h (CCLO): Remove.
* libguile/debug.c (scm_procedure_source, scm_procedure_environment):
Remove references to `scm_tc7_cclo'.
* libguile/eval.c (scm_trampoline_0, scm_trampoline_1,
scm_trampoline_2): Replace `scm_tc7_cclo' with `scm_tc7_gsubr'.
* libguile/eval.i.c (CEVAL): Likewise. No longer make PROC the first
argument. Directly invoke `scm_gsubr_apply ()' instead of jump to the
`evap(N+1)' label or call to `SCM_APPLY ()'.
* libguile/evalext.c (scm_self_evaluating_p): Remove reference to
`scm_tc7_cclo'.
* libguile/gc-card.c (scm_i_sweep_card, scm_i_tag_name): Likewise.
* libguile/gc-mark.c (scm_gc_mark_dependencies): Likewise.
* libguile/goops.c (scm_class_of): Likewise.
* libguile/print.c (iprin1): Likewise.
* libguile/gsubr.c (create_gsubr): Use `unsigned int's for REQ, OPT and
RST. Use `scm_tc7_gsubr' instead of `scm_makcclo ()' in the default
case.
(scm_gsubr_apply): Remove calls to `SCM_GSUBR_PROC ()'.
(scm_f_gsubr_apply): Remove.
* libguile/gsubr.h (SCM_GSUBR_TYPE): New definition.
(SCM_GSUBR_MAX): Changed to 33.
(SCM_SET_GSUBR_TYPE, SCM_GSUBR_PROC, SCM_SET_GSUBR_PROC,
scm_f_gsubr_apply): Remove.
* libguile/procprop.c (scm_i_procedure_arity): Remove reference to
`scm_tc7_cclo'; add proper handling of `scm_tc7_gsubr'.
* libguile/procs.c (scm_makcclo, scm_make_cclo): Remove.
(scm_procedure_p): Remove reference to `scm_tc7_cclo'.
(scm_thunk_p): Likewise, plus add proper `scm_tc7_gsubr' handling.
* libguile/procs.h (SCM_CCLO_LENGTH, SCM_MAKE_CCLO_TAG,
SCM_SET_CCLO_LENGTH, SCM_CCLO_BASE, SCM_SET_CCLO_BASE, SCM_CCLO_REF,
SCM_CCLO_SET, SCM_CCLO_SUBR, SCM_SET_CCLO_SUBR, scm_makcclo,
scm_make_cclo): Remove.
* libguile/stacks.c (read_frames): Remove reference to `scm_f_gsubr_apply'.
* libguile/tags.h (scm_tc7_cclo): Remove.
(scm_tc7_gsubr): New.
(scm_tcs_subrs): Add `scm_tc7_gsubr'.
the value at its top. This fixes a reference leak.
(PUSH_REF): Perform `pstate->top++' after calling
`PSTATE_STACK_SET ()' in order to avoid undesired potential side
effects.
New.
(sym_reader): New.
(scm_print_opts): Added "quote-keywordish-symbols" option.
(quote_keywordish_symbol): New, for evaluating the option.
(scm_print_symbol_name): Use it.
(scm_init_print): Initialize new option to sym_reader.
Removed ref_stack field.
(PSTATE_STACK_REF, PSTATE_STACK_SET): New, for accessing the stack
of a print state. Use them everywhere instead of ref_stack.
print.c, ports.c, mallocs.c, hooks.c, hashtab.c, fports.c,
guardians.c, filesys.c, coop-pthreads.c, continuations.c: Use
scm_uintprint to print unsigned integers, raw heap words, and
adresses, using a cast to scm_t_bits to turn pointers into
integers.
SCM_PRINT_HIGHLIGHT_SUFFIX): New printer options.
(scm_iprin1): Use them instead of the previoulsy hardcoded
strings.
(scm_init_print): Initialize them.
* print.c (make_print_state, scm_free_print_state): Initialize it
to SCM_EOL.
(scm_iprin1): Wrap output in '{...}' when object is contained in
highlight_objects.
(SCM_VALIDATE_STRING_COPY): Deprecated. Replaced all uses with
SCM_VALIDATE_STRING plus SCM_I_STRING_CHARS or
scm_to_locale_string, etc.
(SCM_VALIDATE_SUBSTRING_SPEC_COPY): Deprecated. Replaced as
above, plus scm_i_get_substring_spec.
* regex-posix.c, read.c, random.c, ramap.c, print.c, numbers.c,
hash.c, gc.c, gc-card.c, convert.i.c, backtrace.c, strop.c,
strorder.c, strports.c, struct.c, symbols.c, unif.c, ports.c: Use
SCM_I_STRING_CHARS, SCM_I_STRING_UCHARS, and SCM_I_STRING_LENGTH
instead of SCM_STRING_CHARS, SCM_STRING_UCHARS, and
SCM_STRING_LENGTH, respectively. Also, replaced scm_return_first
with more explicit scm_remember_upto_here_1, etc, or introduced
them in the first place.
SCM_INUM): Deprecated by reenaming them to SCM_I_INUMP, SCM_I_NINUMP
and SCM_I_INUM, respectively and adding deprecated versions to
deprecated.h and deprecated.c. Changed all uses to either use the
SCM_I_ variants or scm_is_*, scm_to_*, or scm_from_*, as appropriate.
SCM_NEGATE_BOOL, SCM_BOOLP): Deprecated by moving into "deprecated.h".
Replaced all uses with scm_is_false, scm_is_true, scm_from_bool, and
scm_is_bool, respectively.
scm_i_unmemoize_expr for unmemoizing a memoized object holding a
single memoized expression.
* debug.c (memoized_print): Don't try to unmemoize the memoized
object, since we can't know whether it holds a single expression
or a body.
(scm_mem_to_proc): Removed check for lambda expression, since it
was moot anyway. Whoever uses these functions for debugging
purposes should know what they do: Creating invalid memoized code
will cause crashes, independent of whether this check is present
or not.
(scm_proc_to_mem): Take the closure's code as it is and don't
append a SCM_IM_LAMBDA isym. To allow easier debugging, the
memoized code should not be modified.
* debug.[ch] (scm_unmemoize, scm_i_unmemoize_expr): Removed
scm_unmemoize from public use, but made scm_i_unmemoize_expr
available as a guile internal function instead. However,
scm_i_unmemoize_expr will only work on memoized objects that hold
a single memoized expression. It won't work with bodies.
* debug.c (scm_procedure_source), macros.c (macro_print), print.c
(scm_iprin1): Call scm_i_unmemocopy_body for unmemoizing a body,
i. e. a list of expressions.
* eval.c (unmemoize_exprs): Drop internal body markers from the
output during unmemoization.
* eval.[ch] (scm_unmemocopy, scm_i_unmemocopy_expr,
scm_i_unmemocopy_body): Removed scm_unmemocopy from public use,
but made scm_i_unmemocopy_expr and scm_i_unmemocopy_body available
as guile internal functions instead. scm_i_unmemoize_expr will
only work on a single memoized expression, while
scm_i_unmemocopy_body will only work on bodies.
* deprecated.h (SCM_IFRINC, SCM_ICDR, SCM_IFRAME, SCM_IDIST,
SCM_ICDRP), eval.c (SCM_IFRINC, SCM_ICDR, SCM_IFRAME, SCM_IDIST,
SCM_ICDRP), eval.h (SCM_ICDR, SCM_IFRINC, SCM_IFRAME, SCM_IDIST,
SCM_ICDRP): Deprecated and added to deprecated.h. Moved from
eval.h to eval.c.
* deprecated.c (scm_isymnames), deprecated.h (scm_isymnames,
SCM_ISYMNUM, SCM_ISYMCHARS), eval.c (SCM_ISYMNUM, isymnames,
scm_unmemocopy, CEVAL), print.c (scm_isymnames), tags.h
(SCM_ISYMNUM, scm_isymnames, SCM_ISYMCHARS): Deprecated
scm_isymnames, SCM_ISYMNUM and SCM_ISYMCHARS and added to
deprecated.[hc]. Moved scm_isymnames from print.c to eval.c and
renamed to isymnames. Moved SCM_ISYMNUM from tags.h to eval.c and
renamed to ISYMNUM.
* eval.c (scm_i_print_iloc, scm_i_print_isym), eval.h
(scm_i_print_iloc, scm_i_print_isym), print.c (scm_iprin1):
Extracted printing of ilocs and isyms to guile internal functions
scm_i_print_iloc, scm_i_print_isym of eval.c.
* libguile/print.c (scm_isymnames): Add names for the new memoizer
codes.
* libguile/eval.c (s_missing_clauses, s_bad_case_clause,
s_extra_case_clause, s_bad_case_labels, s_duplicate_case_label,
literal_p): New static identifiers.
(scm_m_case): Use ASSERT_SYNTAX to signal syntax errors. Be more
specific about the kind of error that was detected. Check for
duplicate case labels. Handle bound 'else. Avoid unnecessary
consing when creating the memoized code.
(scm_m_case, unmemocopy, SCM_CEVAL): Use SCM_IM_ELSE to memoize
the syntactic keyword 'else.
* test-suite/tests/syntax.test (exception:bad-expression,
exception:missing-clauses, exception:bad-case-clause,
exception:extra-case-clause, exception:bad-case-labels): New.
Added some tests and adapted tests for 'case' to the new way of
error reporting.