This requires separate small fixes.
Readline has internal logic to deal with multi-byte characters, so
it wants bytes, not characters.
scm_c_read gets called by the vm when readline is activated, and it was
truncating multi-byte characters because soft ports didn't have the
UCS-4 capability.
Soft ports need the capability to read UCS-4 characters. Since soft ports
may have a single byte buffer, full characters need to be stored into the
pushback buffer.
This broke the optimizations in scm_c_read for using an alternate buffer
for single-byte-buffered ports, because the optimization wasn't expecting
anything in the pushback buffer.
* libguile/vports.c (sf_fill_input): store complete chars, not single bytes
* libguile/ports.c (scm_c_read): don't use optimized path for non Latin-1.
Add debug prints.
* libguile/strings.h: make scm_i_from_stringn and scm_i_string_ref public
so that readline can use them
* guile-readline/readline.c: read bytes, not complete chars, from the
input port. Convert output to the output port's locale
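A minimal sketch of why a single-byte buffer truncates multi-byte characters, assuming the port's encoding is UTF-8. The helper names `utf8_sequence_length` and `utf8_has_complete_char` are hypothetical and are not part of libguile; they only illustrate that a reader must know how many bytes a character needs before it can hand back a complete one.

```c
#include <stddef.h>

/* Hypothetical helper, not a libguile function: return the number of
   bytes in the UTF-8 sequence that starts with LEAD, or 0 if LEAD
   cannot start a sequence (continuation byte or invalid lead byte).  */
static size_t
utf8_sequence_length (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                   /* ASCII */
  if (lead >= 0xC2 && lead <= 0xDF)
    return 2;
  if (lead >= 0xE0 && lead <= 0xEF)
    return 3;
  if (lead >= 0xF0 && lead <= 0xF4)
    return 4;
  return 0;
}

/* Return nonzero if the LEN bytes at BUF begin with at least one
   complete character.  A one-byte buffer can never satisfy this for
   a non-ASCII character, which is why complete characters have to be
   stored somewhere larger.  */
static int
utf8_has_complete_char (const unsigned char *buf, size_t len)
{
  size_t need;

  if (len == 0)
    return 0;
  need = utf8_sequence_length (buf[0]);
  return need != 0 && len >= need;
}
```

With the two bytes 0xC3 0xA9 (U+00E9), a one-byte view is incomplete while the two-byte view is complete.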
String ports should be able to accept any string characters, regardless
of the current locale. Setting it to UTF-8 achieves that.
* libguile/strports.c (scm_i_mkstrport): set port's locale to UTF-8
(scm_mkstrport): convert input string to UTF-8
* libguile/unidata_to_charset.pl (designated): renamed from full
* libguile/srfi-14.c (scm_char_set_designated): new char-set
* libguile/srfi-14.i.c (cs_designated): renamed from cs_full
Since combining characters, such as accents, modify the appearance of the
previous letter, they look awkward in character literal form (#\name)
because they modify the backslash. This instead prints the combining
character on a dotted circle.
* libguile/chars.h (SCM_CODEPOINT_DOTTED_CIRCLE): new #define
* libguile/print.c (iprint1): print combining characters on dotted circles
* libguile/read.c (scm_read_character): parse the combination of combining
characters and dotted circles
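The printing rule above can be sketched as follows. This is an illustration only: `DOTTED_CIRCLE`, `is_combining_simplified`, and `char_literal_body` are hypothetical names, and the combining test is a deliberately simplified stand-in that covers just the Combining Diacritical Marks block (U+0300 to U+036F) rather than a real Unicode property lookup.

```c
#include <stddef.h>
#include <stdint.h>

#define DOTTED_CIRCLE 0x25CCu   /* U+25CC DOTTED CIRCLE */

/* Simplified assumption: treat only U+0300..U+036F as combining.
   A real implementation would consult Unicode character data.  */
static int
is_combining_simplified (uint32_t c)
{
  return c >= 0x0300 && c <= 0x036F;
}

/* Write the code points that would follow "#\" in a character
   literal into OUT (room for two) and return how many were written.
   Combining characters get a dotted circle as a base to sit on, so
   they no longer attach to the backslash.  */
static size_t
char_literal_body (uint32_t c, uint32_t out[2])
{
  size_t n = 0;

  if (is_combining_simplified (c))
    out[n++] = DOTTED_CIRCLE;
  out[n++] = c;
  return n;
}
```

The reader side would do the inverse: on seeing a dotted circle followed by a combining character, it would drop the circle and keep the combining character.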
* libguile/srfi-14.c (scm_i_ucs_range_to_char_set): new function that
contains the functionality of ucs_range_to_char_set, fixes
off-by-one, and doesn't store surrogates
(scm_ucs_range_to_char_set, scm_ucs_range_to_char_set_x): call
scm_i_ucs_range_to_char_set
(scm_i_charset_set_range): new helper function
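A rough sketch of the range handling described above, not the actual libguile code. SRFI-14 defines ucs-range->char-set over the half-open range [lower, upper), so the last code point stored is upper - 1; the sketch also splits the range around the surrogate block U+D800 to U+DFFF. The type `cp_range` and the function `range_without_surrogates` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define SURROGATE_LO 0xD800u
#define SURROGATE_HI 0xDFFFu

typedef struct
{
  uint32_t lo;                  /* first code point, inclusive */
  uint32_t hi;                  /* last code point, inclusive */
} cp_range;

/* Convert the half-open range [LOWER, UPPER) into at most two
   inclusive ranges that contain no surrogates.  Returns the number
   of ranges written to OUT.  */
static size_t
range_without_surrogates (uint32_t lower, uint32_t upper, cp_range out[2])
{
  size_t n = 0;
  uint32_t last;

  if (lower >= upper)
    return 0;                   /* empty range */
  last = upper - 1;             /* half-open to inclusive */

  if (lower < SURROGATE_LO)
    {
      out[n].lo = lower;
      out[n].hi = last < SURROGATE_LO ? last : SURROGATE_LO - 1;
      n++;
    }
  if (last > SURROGATE_HI)
    {
      out[n].lo = lower > SURROGATE_HI ? lower : SURROGATE_HI + 1;
      out[n].hi = last;
      n++;
    }
  return n;
}
```

A range that lies entirely inside the surrogate block yields no ranges at all, and a range that straddles it yields one piece on each side.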
char-set-xor! was not modifying its input parameter. The spec doesn't
technically require it to do so, but the other similar functions
do it.
* libguile/srfi-14.c (scm_char_set_xor_x): modify the input parameter
* libguile/srfi-4.c (free_user_data): New function.
* libguile/srfi-4.i.c (scm_take_TAGvector): Register `free_user_data ()'
as a finalizer for DATA.
* libguile/objcodes.c (scm_objcode_to_bytecode): Allocate with
`scm_malloc ()' since the memory taken by `scm_take_u8vector ()' will
eventually be free(3)d.
* libguile/vm.c (really_make_boot_program): Likewise.
* libguile/strings.c (STRINGBUF_HEADER_SIZE, STRINGBUF_HEADER_BYTES):
New macros.
(STRINGBUF_F_INLINE, STRINGBUF_INLINE, STRINGBUF_OUTLINE_CHARS,
STRINGBUF_OUTLINE_LENGTH, STRINGBUF_INLINE_CHARS,
STRINGBUF_INLINE_LENGTH, STRINGBUF_MAX_INLINE_LEN): Remove.
(STRINGBUF_CHARS, STRINGBUF_WIDE_CHARS): Adjust to return a fixed
location.
(STRINGBUF_LENGTH): Get the length from word 1.
(make_stringbuf, make_wide_stringbuf): Adjust to use a contiguous
memory region.
(wide_stringbuf): Renamed from `widen_stringbuf'. Adjust similarly.
Return the new stringbuf. Callers updated.
(narrow_stringbuf): Likewise.
(scm_sys_string_dump, scm_sys_symbol_dump): Remove `stringbuf-inline'
pair.
* test-suite/tests/strings.test ("string internals")["null strings are
inlined", "short Latin-1 encoded strings are inlined", "long Latin-1
encoded strings are not inlined", "short UCS-4 encoded strings are not
inlined", "long UCS-4 encoded strings are not inlined"]: Remove.
* test-suite/tests/symbols.test ("symbol internals")["null symbols are
inlined", "short Latin-1 encoded symbols are inlined", "long Latin-1
encoded symbols are not inlined", "short UCS-4 encoded symbols are not
inlined", "long UCS-4 encoded symbols are not inlined"]: Remove.
String ports, being 8-bit, store strings using the character encoding
of the port. This fixes a bug where the default character encoding, and
not the port's encoding, was being used to convert the string port data
back to a string.
* libguile/strports.c: extra comments
(scm_strport_to_string): use port's encoding when converting port data
to a string
* libguile/strings.c (scm_i_from_stringn): renamed from scm_from_stringn
and made internal. All callers changed.
(scm_from_stringn): renamed to scm_i_from_stringn.
* libguile/strings.h: declaration for scm_i_from_stringn
* libguile/bytevectors.c (SCM_BYTEVECTOR_INLINE_THRESHOLD,
SCM_BYTEVECTOR_INLINEABLE_SIZE_P, SCM_BYTEVECTOR_SET_CONTENTS,
SCM_BYTEVECTOR_SET_INLINE): Remove.
(SCM_BYTEVECTOR_HEADER_BYTES): New macro.
(SCM_BYTEVECTOR_SET_ELEMENT_TYPE): Adjust to new flag layout.
(make_bytevector): Remove content inlining machinery; use
`scm_gc_malloc_pointerless ()' in all cases; special-case zero-sized
vu8 buffers.
(make_bytevector_from_buffer): Simplified.
(scm_c_shrink_bytevector): New, formerly `scm_i_shrink_bytevector ()'.
Remove buffer inlining machinery.
(scm_bootstrap_bytevectors): Use `make_bytevector ()' for
SCM_NULL_BYTEVECTOR.
* libguile/bytevectors.h (SCM_BYTEVECTOR_HEADER_SIZE): New macro.
(SCM_BYTEVECTOR_CONTENTS): Adjust to new layout.
(SCM_SET_BYTEVECTOR_FLAGS): Properly cast F.
(SCM_F_BYTEVECTOR_INLINE, SCM_BYTEVECTOR_INLINE_P): Remove.
(SCM_BYTEVECTOR_ELEMENT_TYPE): Adjust.
(scm_c_shrink_bytevector): Remove macro, make a C function
declaration.
* libguile/srfi-14.c (charsets_complement): use surrogate #defines instead
of hardcoded numbers
* libguile/srfi-14.i.c (cs_full_ranges): remove surrogates from full
charset
* libguile/unidata_to_charset.pl (full): test for surrogates
* libguile/gc_os_dep.c (GC_linux_stack_base) [LINUX_STACKBOTTOM]: cast
input of ctype functions to int
* libguile/inet_aton.c (inet_aton): cast input of ctype functions to int
* libguile/read.c (scm_scan_for_encoding): cast input of isalnum to int
* libguile/win32-socket.c (scm_i_socket_uncomment): cast input of isspace
to int
* libguile/load.c (scm_primitive_load_path): If the compiled path was
out of date, but the fallback path was current, we correctly detected
that case, but loaded the wrong file. Fix the typo.
This script was used to generate srfi-14.i.c from the UnicodeData.txt
file supplied by ftp://www.unicode.org/Public/UNIDATA/
* libguile/unidata_to_charset.pl
On Aug 5, 2009, at 10:06, Ken Raeburn wrote:
> (1) In scm_pthread_mutex_lock, we leave and re-enter guile mode so
> that we don't block the thread while in guile mode. But we could
> use pthread_mutex_trylock first, and avoid the costs scm_leave_guile
> seems to incur on the Mac. If we can't acquire the lock, it should
> return immediately, and then we can do the expensive, blocking
> version. A quick, hack version of this changed my run time for
> A(3,8) from 17.5s to 14.5s, saving about 17%; sigaltstack and
> sigprocmask are still in the picture, because they're called from
> scm_catch_with_pre_unwind_handler. I'll work up a nicer patch
> later.
Ah, we already had scm_i_pthread_mutex_trylock lying around; that made
things easy.
A second timing test with A(3,9) and this version of the patch (based
on 1.9.1) shows the same improvement.
* libguile/threads.c (scm_pthread_mutex_lock): Try the mutex before
leaving and reentering guile mode.
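The try-first pattern can be sketched with plain pthreads as below. This is not the libguile patch itself: `leave_managed_mode` and `enter_managed_mode` are hypothetical stand-ins for leaving and re-entering guile mode, and the counter exists only to make the cost visible.

```c
#include <pthread.h>

/* Counts how often the expensive transitions were paid.  */
static int expensive_transitions = 0;

/* Hypothetical stand-ins for leaving and re-entering guile mode.  */
static void
leave_managed_mode (void)
{
  expensive_transitions++;
}

static void
enter_managed_mode (void)
{
  expensive_transitions++;
}

/* Try the mutex first.  Only when that fails do we pay for the mode
   switch and fall back to the blocking lock.  */
static int
lock_with_fast_path (pthread_mutex_t *m)
{
  int res = pthread_mutex_trylock (m);

  if (res == 0)
    return 0;                   /* uncontended: acquired right away */

  /* Slow path: the mutex is held elsewhere (or trylock failed), so
     block outside managed mode.  */
  leave_managed_mode ();
  res = pthread_mutex_lock (m);
  enter_managed_mode ();
  return res;
}
```

In the uncontended case the counter stays at zero, which is where the savings reported above would come from.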
Ports are given two additional properties: a character encoding and
a conversion failure strategy. These properties have getters and setters.
The new properties are used to convert any locale text to/from the
internal representation of strings.
If unspecified, ports use a default value. The default value of these
properties is held in a fluid. The default character encoding can be
modified by calling setlocale.
ISO-8859-1 is treated specially. Since it is a native encoding of
strings, it can be processed more quickly. Source code is assumed to be
ISO-8859-1 unless otherwise specified. The encoding of a source code
file can be given as 'coding: XXXXX' in a magic comment at the top of a
file.
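As an illustration of the magic comment, here is a simplified sketch of scanning the start of a file for a 'coding:' declaration. It is not scm_scan_for_encoding: the function name `find_declared_encoding` is hypothetical, and the sketch assumes the caller has already read the first lines of the file into a NUL-terminated buffer. It also does not check that the declaration sits inside a comment.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Look for "coding:" in HEADER and copy the encoding name that
   follows it into OUT, which has room for OUT_LEN bytes including
   the terminating NUL.  Returns 1 if a name was found, else 0.  */
static int
find_declared_encoding (const char *header, char *out, size_t out_len)
{
  const char *p = strstr (header, "coding:");
  size_t n = 0;

  if (p == NULL || out_len == 0)
    return 0;

  p += strlen ("coding:");
  while (*p == ' ' || *p == '\t')
    p++;                        /* skip blanks before the name */

  /* Accept the characters commonly used in encoding names.  */
  while (n + 1 < out_len
         && (isalnum ((unsigned char) *p)
             || *p == '-' || *p == '_' || *p == '.'))
    out[n++] = *p++;
  out[n] = '\0';

  return n > 0;
}
```

Given a first line such as `;;; -*- coding: utf-8 -*-`, the sketch extracts `utf-8`; a file without a declaration would fall back to the ISO-8859-1 default described above.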
The C functions that deal with encoding often use a null pointer
as shorthand for the native Latin-1 encoding, for efficiency's sake.
* test-suite/tests/encoding-iso88591.test: new tests
* test-suite/tests/encoding-iso88597.test: new tests
* test-suite/tests/encoding-utf8.test: new tests
* test-suite/tests/encoding-escapes.test: new tests
* test-suite/tests/numbers.test: declare 'binary' encoding
* test-suite/tests/ports.test: declare 'binary' encoding
* test-suite/tests/r6rs-ports.test: declare 'binary' encoding
* module/system/base/compile.scm (compile-file): use source-code
file's self-declared encoding when compiling files
* libguile/strports.c: store string ports in locale encoding
(scm_strport_to_locale_u8vector, scm_call_with_output_locale_u8vector)
(scm_open_input_locale_u8vector, scm_get_output_locale_u8vector):
new functions
* libguile/strings.h: new declaration for scm_i_string_contains_char
* libguile/strings.c (scm_i_string_contains_char): new function
(scm_from_stringn, scm_to_stringn): use NULL for Latin-1
(scm_from_locale_stringn, scm_to_locale_stringn): respect character
encoding of input and output ports
* libguile/read.h: declaration for scm_scan_for_encoding
* libguile/read.c:
(read_token): now takes scheme string instead of C string/length
(read_complete_token): new function
(scm_read_sexp, scm_read_number, scm_read_mixed_case_symbol)
(scm_read_number_and_radix, scm_read_quote, scm_read_semicolon_comment)
(scm_read_srfi4_vector, scm_read_bytevector, scm_read_guile_bit_vector)
(scm_read_scsh_block_comment, scm_read_commented_expression)
(scm_read_extended_symbol, scm_read_sharp_extension, scm_read_sharp)
(scm_read_expression): use scm_t_wchar for char type, use read_complete_token
(scm_scan_for_encoding): new function to find a file's character encoding
(scm_file_encoding): new function to find a port's character encoding
* libguile/rdelim.c: don't unpack strings
* libguile/print.h: declaration for modified function
scm_i_charprint
* libguile/print.c: use locale when printing characters and
strings
(scm_i_charprint): input parameter is now scm_t_wchar
(scm_simple_format): don't unpack strings
* libguile/posix.h: new declaration for scm_setbinary.
* libguile/posix.c (scm_setlocale): set default and stdio port
encodings based on the locale's character encoding
(scm_setbinary): new function
* libguile/ports.h (scm_t_port): add encoding and failed
conversion handler to port type. Declarations for new or modified
functions scm_getc, scm_unget_byte, scm_ungetc,
scm_i_get_port_encoding, scm_i_set_port_encoding_x,
scm_port_encoding, scm_set_port_encoding_x,
scm_i_get_conversion_strategy, scm_i_set_conversion_strategy_x,
scm_port_conversion_strategy, scm_set_port_conversion_strategy_x.
* libguile/ports.c: assign the current ports to zero on startup so
we can see if they've been set.
(scm_current_input_port, scm_current_output_port,
scm_current_error_port): return #f if the port is not yet
initialized
(scm_new_port_table_entry): set up a new port's encoding and
illegal sequence handler based on the thread's current defaults
(scm_i_remove_port): free port encoding name when port is removed
(scm_i_mode_bits_n): now takes a scheme string instead of a c
string and length. All callers changed.
(SCM_MBCHAR_BUF_SIZE): new const
(scm_getc): new function, since the scm_getc in inline.h is now
scm_get_byte_or_eof. This pulls one codepoint from a port.
(scm_lfwrite_substr, scm_lfwrite_str): now uses port's encoding
(scm_unget_byte): new function, incorporating the low-level functionality
of scm_ungetc
(scm_ungetc): uses scm_unget_byte
* libguile/numbers.h (scm_t_wchar): compilation order problem with
scm_t_wchar being used in functions in multiple headers. Forward
declare scm_t_wchar.
* libguile/load.c (scm_primitive_load): scan for file encoding at
top of file and use it to set the load port's encoding
* libguile/inline.h (scm_get_byte_or_eof): new function
incorporating most of the functionality of scm_getc.
* libguile/fports.c (fport_fill_input): now returns scm_t_wchar
* libguile/chars.h (scm_t_wchar): avoid compilation order problem
with declaration of scm_t_wchar
* libguile/goops.c (scm_make_extended_class_from_symbol): new function
(scm_class_of): don't unpack symbol chars
(wrap_init): don't unpack symbol chars
(make_class_from_symbol): new function
(make_struct_class): don't unpack symbol chars