mirror of
https://git.savannah.gnu.org/git/guile.git
synced 2025-06-11 06:20:23 +02:00
Improve handling of Unicode byte-order marks (BOMs).
* libguile/ports-internal.h (struct scm_port_internal): Add new members 'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'. (SCM_UNICODE_BOM): New macro. (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype. * libguile/ports.c (scm_new_port_table_entry): Initialize 'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'. (get_iconv_codepoint): Pass new 'mode' parameter to 'scm_i_port_iconv_descriptors'. (get_codepoint): After reading a codepoint at stream start, record that we're no longer at stream start, and consume a BOM where appropriate. (scm_seek): Set the stream start flags according to the new position. (looking_at_bytes): New static function. (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom, scm_utf32le_bom): New static const arrays. (decide_utf16_encoding, decide_utf32_encoding): New static functions. (scm_i_port_iconv_descriptors): Add new 'mode' parameter. If the specified encoding is UTF-16 or UTF-32, make that precise by deciding what byte order to use, and construct iconv descriptors based on the precise encoding. (scm_i_set_port_encoding_x): Record that we are now at stream start. Do not open the new iconv descriptors immediately; let them be initialized lazily. * libguile/print.c (display_string_using_iconv): Record that we're no longer at stream start. Write a BOM if appropriate. * doc/ref/api-io.texi (BOM Handling): New node. * test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"): Adapt test to cope with the fact that 'set-port-encoding!' does not immediately open the iconv descriptors. (bv-read-test): New procedure. ("unicode byte-order marks (BOMs)"): New test prefix.
This commit is contained in:
parent
45c0878b86
commit
cdd3d6c9f4
5 changed files with 515 additions and 32 deletions
|
@ -1,7 +1,7 @@
|
||||||
@c -*-texinfo-*-
|
@c -*-texinfo-*-
|
||||||
@c This is part of the GNU Guile Reference Manual.
|
@c This is part of the GNU Guile Reference Manual.
|
||||||
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009,
|
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009,
|
||||||
@c 2010, 2011 Free Software Foundation, Inc.
|
@c 2010, 2011, 2013 Free Software Foundation, Inc.
|
||||||
@c See the file guile.texi for copying conditions.
|
@c See the file guile.texi for copying conditions.
|
||||||
|
|
||||||
@node Input and Output
|
@node Input and Output
|
||||||
|
@ -19,6 +19,7 @@
|
||||||
* Port Types:: Types of port and how to make them.
|
* Port Types:: Types of port and how to make them.
|
||||||
* R6RS I/O Ports:: The R6RS port API.
|
* R6RS I/O Ports:: The R6RS port API.
|
||||||
* I/O Extensions:: Using and extending ports in C.
|
* I/O Extensions:: Using and extending ports in C.
|
||||||
|
* BOM Handling:: Handling of Unicode byte order marks.
|
||||||
@end menu
|
@end menu
|
||||||
|
|
||||||
|
|
||||||
|
@ -2373,6 +2374,84 @@ Set using
|
||||||
|
|
||||||
@end table
|
@end table
|
||||||
|
|
||||||
|
@node BOM Handling
|
||||||
|
@subsection Handling of Unicode byte order marks.
|
||||||
|
@cindex BOM
|
||||||
|
@cindex byte order mark
|
||||||
|
|
||||||
|
This section documents the finer points of Guile's handling of Unicode
|
||||||
|
byte order marks (BOMs). A byte order mark (U+FEFF) is typically found
|
||||||
|
at the start of a UTF-16 or UTF-32 stream, to allow readers to reliably
|
||||||
|
determine the byte order. Occasionally, a BOM is found at the start of
|
||||||
|
a UTF-8 stream, but this is much less common and not generally
|
||||||
|
recommended.
|
||||||
|
|
||||||
|
Guile attempts to handle BOMs automatically, and in accordance with the
|
||||||
|
recommendations of the Unicode Standard, when the port encoding is set
|
||||||
|
to @code{UTF-8}, @code{UTF-16}, or @code{UTF-32}. In brief, Guile
|
||||||
|
automatically writes a BOM at the start of a UTF-16 or UTF-32 stream,
|
||||||
|
and automatically consumes one from the start of a UTF-8, UTF-16, or
|
||||||
|
UTF-32 stream.
|
||||||
|
|
||||||
|
As specified in the Unicode Standard, a BOM is only handled specially at
|
||||||
|
the start of a stream, and only if the port encoding is set to
|
||||||
|
@code{UTF-8}, @code{UTF-16} or @code{UTF-32}. If the port encoding is
|
||||||
|
set to @code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or
|
||||||
|
@code{UTF-32LE}, then BOMs are @emph{not} handled specially, and none of
|
||||||
|
the special handling described in this section applies.
|
||||||
|
|
||||||
|
@itemize @bullet
|
||||||
|
@item
|
||||||
|
To ensure that Guile will properly detect the byte order of a UTF-16 or
|
||||||
|
UTF-32 stream, you must perform a textual read before any writes, seeks,
|
||||||
|
or binary I/O. Guile will not attempt to read a BOM unless a read is
|
||||||
|
explicitly requested at the start of the stream.
|
||||||
|
|
||||||
|
@item
|
||||||
|
If a textual write is performed before the first read, then an arbitrary
|
||||||
|
byte order will be chosen. Currently, big endian is the default on all
|
||||||
|
platforms, but that may change in the future. If you wish to explicitly
|
||||||
|
control the byte order of an output stream, set the port encoding to
|
||||||
|
@code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or @code{UTF-32LE},
|
||||||
|
and explicitly write a BOM (@code{#\xFEFF}) if desired.
|
||||||
|
|
||||||
|
@item
|
||||||
|
If @code{set-port-encoding!} is called in the middle of a stream, Guile
|
||||||
|
treats this as a new logical ``start of stream'' for purposes of BOM
|
||||||
|
handling, and will forget about any BOMs that had previously been seen.
|
||||||
|
Therefore, it may choose a different byte order than had been used
|
||||||
|
previously. This is intended to support multiple logical text streams
|
||||||
|
embedded within a larger binary stream.
|
||||||
|
|
||||||
|
@item
|
||||||
|
Binary I/O operations are not guaranteed to update Guile's notion of
|
||||||
|
whether the port is at the ``start of the stream'', nor are they
|
||||||
|
guaranteed to produce or consume BOMs.
|
||||||
|
|
||||||
|
@item
|
||||||
|
For ports that support seeking (e.g. normal files), the input and output
|
||||||
|
streams are considered linked: if the user reads first, then a BOM will
|
||||||
|
be consumed (if appropriate), but later writes will @emph{not} produce a
|
||||||
|
BOM. Similarly, if the user writes first, then later reads will
|
||||||
|
@emph{not} consume a BOM.
|
||||||
|
|
||||||
|
@item
|
||||||
|
For ports that do not support seeking (e.g. pipes, sockets, and
|
||||||
|
terminals), the input and output streams are considered
|
||||||
|
@emph{independent} for purposes of BOM handling: the first read will
|
||||||
|
consume a BOM (if appropriate), and the first write will @emph{also}
|
||||||
|
produce a BOM (if appropriate). However, the input and output streams
|
||||||
|
will always use the same byte order.
|
||||||
|
|
||||||
|
@item
|
||||||
|
Seeks to the beginning of a file will set the ``start of stream'' flags.
|
||||||
|
Therefore, a subsequent textual read or write will consume or produce a
|
||||||
|
BOM. However, unlike @code{set-port-encoding!}, if a byte order had
|
||||||
|
already been chosen for the port, it will remain in effect after a seek,
|
||||||
|
and cannot be changed by the presence of a BOM. Seeks anywhere other
|
||||||
|
than the beginning of a file clear the ``start of stream'' flags.
|
||||||
|
@end itemize
|
||||||
|
|
||||||
@c Local Variables:
|
@c Local Variables:
|
||||||
@c TeX-master: "guile.texi"
|
@c TeX-master: "guile.texi"
|
||||||
@c End:
|
@c End:
|
||||||
|
|
|
@ -46,6 +46,8 @@ typedef struct scm_iconv_descriptors scm_t_iconv_descriptors;
|
||||||
|
|
||||||
struct scm_port_internal
|
struct scm_port_internal
|
||||||
{
|
{
|
||||||
|
unsigned at_stream_start_for_bom_read : 1;
|
||||||
|
unsigned at_stream_start_for_bom_write : 1;
|
||||||
scm_t_port_encoding_mode encoding_mode;
|
scm_t_port_encoding_mode encoding_mode;
|
||||||
scm_t_iconv_descriptors *iconv_descriptors;
|
scm_t_iconv_descriptors *iconv_descriptors;
|
||||||
int pending_eof;
|
int pending_eof;
|
||||||
|
@ -54,9 +56,12 @@ struct scm_port_internal
|
||||||
|
|
||||||
typedef struct scm_port_internal scm_t_port_internal;
|
typedef struct scm_port_internal scm_t_port_internal;
|
||||||
|
|
||||||
|
#define SCM_UNICODE_BOM 0xFEFFUL /* Unicode byte-order mark */
|
||||||
|
|
||||||
#define SCM_PORT_GET_INTERNAL(x) \
|
#define SCM_PORT_GET_INTERNAL(x) \
|
||||||
((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
|
((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
|
||||||
|
|
||||||
SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
|
SCM_INTERNAL scm_t_iconv_descriptors *
|
||||||
|
scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
|
||||||
|
|
||||||
#endif
|
#endif
|
||||||
|
|
157
libguile/ports.c
157
libguile/ports.c
|
@ -651,6 +651,9 @@ scm_new_port_table_entry (scm_t_bits tag)
|
||||||
pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
|
pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
|
||||||
pti->iconv_descriptors = NULL;
|
pti->iconv_descriptors = NULL;
|
||||||
|
|
||||||
|
pti->at_stream_start_for_bom_read = 1;
|
||||||
|
pti->at_stream_start_for_bom_write = 1;
|
||||||
|
|
||||||
/* XXX These fields are not what they seem. They have been
|
/* XXX These fields are not what they seem. They have been
|
||||||
repurposed, but cannot safely be renamed in 2.0 without breaking
|
repurposed, but cannot safely be renamed in 2.0 without breaking
|
||||||
ABI compatibility. This will be cleaned up in 2.2. */
|
ABI compatibility. This will be cleaned up in 2.2. */
|
||||||
|
@ -1319,10 +1322,12 @@ static int
|
||||||
get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
|
get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
|
||||||
char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
|
char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
|
||||||
{
|
{
|
||||||
scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
|
scm_t_iconv_descriptors *id;
|
||||||
scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
|
scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
|
||||||
size_t input_size = 0;
|
size_t input_size = 0;
|
||||||
|
|
||||||
|
id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
|
||||||
|
|
||||||
for (;;)
|
for (;;)
|
||||||
{
|
{
|
||||||
int byte_read;
|
int byte_read;
|
||||||
|
@ -1409,7 +1414,25 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
|
||||||
err = get_iconv_codepoint (port, codepoint, buf, len);
|
err = get_iconv_codepoint (port, codepoint, buf, len);
|
||||||
|
|
||||||
if (SCM_LIKELY (err == 0))
|
if (SCM_LIKELY (err == 0))
|
||||||
update_port_lf (*codepoint, port);
|
{
|
||||||
|
if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
|
||||||
|
{
|
||||||
|
/* Record that we're no longer at stream start. */
|
||||||
|
pti->at_stream_start_for_bom_read = 0;
|
||||||
|
if (pt->rw_random)
|
||||||
|
pti->at_stream_start_for_bom_write = 0;
|
||||||
|
|
||||||
|
/* If we just read a BOM in an encoding that recognizes them,
|
||||||
|
then silently consume it and read another code point. */
|
||||||
|
if (SCM_UNLIKELY
|
||||||
|
(*codepoint == SCM_UNICODE_BOM
|
||||||
|
&& (pti->encoding_mode == SCM_PORT_ENCODING_MODE_UTF8
|
||||||
|
|| strcasecmp (pt->encoding, "UTF-16") == 0
|
||||||
|
|| strcasecmp (pt->encoding, "UTF-32") == 0)))
|
||||||
|
return get_codepoint (port, codepoint, buf, len);
|
||||||
|
}
|
||||||
|
update_port_lf (*codepoint, port);
|
||||||
|
}
|
||||||
else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
|
else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
|
||||||
{
|
{
|
||||||
*codepoint = '?';
|
*codepoint = '?';
|
||||||
|
@ -2037,6 +2060,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
|
||||||
|
|
||||||
if (SCM_OPPORTP (fd_port))
|
if (SCM_OPPORTP (fd_port))
|
||||||
{
|
{
|
||||||
|
scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
|
||||||
scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
|
scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
|
||||||
off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
|
off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
|
||||||
off_t_or_off64_t rv;
|
off_t_or_off64_t rv;
|
||||||
|
@ -2045,10 +2069,14 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
|
||||||
SCM_MISC_ERROR ("port is not seekable",
|
SCM_MISC_ERROR ("port is not seekable",
|
||||||
scm_cons (fd_port, SCM_EOL));
|
scm_cons (fd_port, SCM_EOL));
|
||||||
else
|
else
|
||||||
{
|
rv = ptob->seek (fd_port, off, how);
|
||||||
scm_i_clear_pending_eof (fd_port);
|
|
||||||
rv = ptob->seek (fd_port, off, how);
|
/* Set stream-start flags according to new position. */
|
||||||
}
|
pti->at_stream_start_for_bom_read = (rv == 0);
|
||||||
|
pti->at_stream_start_for_bom_write = (rv == 0);
|
||||||
|
|
||||||
|
scm_i_clear_pending_eof (fd_port);
|
||||||
|
|
||||||
return scm_from_off_t_or_off64_t (rv);
|
return scm_from_off_t_or_off64_t (rv);
|
||||||
}
|
}
|
||||||
else /* file descriptor?. */
|
else /* file descriptor?. */
|
||||||
|
@ -2301,6 +2329,69 @@ scm_i_default_port_encoding (void)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* If the next LEN bytes from PORT are equal to those in BYTES, then
|
||||||
|
return 1, else return 0. Leave the port position unchanged. */
|
||||||
|
static int
|
||||||
|
looking_at_bytes (SCM port, const unsigned char *bytes, int len)
|
||||||
|
{
|
||||||
|
scm_t_port *pt = SCM_PTAB_ENTRY (port);
|
||||||
|
int result;
|
||||||
|
int i = 0;
|
||||||
|
|
||||||
|
while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
|
||||||
|
{
|
||||||
|
pt->read_pos++;
|
||||||
|
i++;
|
||||||
|
}
|
||||||
|
|
||||||
|
result = (i == len);
|
||||||
|
|
||||||
|
while (i > 0)
|
||||||
|
scm_unget_byte (bytes[--i], port);
|
||||||
|
|
||||||
|
return result;
|
||||||
|
}
|
||||||
|
|
||||||
|
static const unsigned char scm_utf8_bom[3] = {0xEF, 0xBB, 0xBF};
|
||||||
|
static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
|
||||||
|
static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
|
||||||
|
static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
|
||||||
|
static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
|
||||||
|
|
||||||
|
/* Decide what byte order to use for a UTF-16 port. Return "UTF-16BE"
|
||||||
|
or "UTF-16LE". MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
|
||||||
|
and specifies which operation is about to be done. The MODE
|
||||||
|
determines how we will decide the byte order. We deliberately avoid
|
||||||
|
reading from the port unless the user is about to do so. If the user
|
||||||
|
is about to read, then we look for a BOM, and if present, we use it
|
||||||
|
to determine the byte order. Otherwise we choose big endian, as
|
||||||
|
recommended by the Unicode Standard. Note that the BOM (if any) is
|
||||||
|
not consumed here. */
|
||||||
|
static const char *
|
||||||
|
decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
|
||||||
|
{
|
||||||
|
if (mode == SCM_PORT_READ
|
||||||
|
&& SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
|
||||||
|
&& looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
|
||||||
|
return "UTF-16LE";
|
||||||
|
else
|
||||||
|
return "UTF-16BE";
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Decide what byte order to use for a UTF-32 port. Return "UTF-32BE"
|
||||||
|
or "UTF-32LE". See the comment above 'decide_utf16_encoding' for
|
||||||
|
details. */
|
||||||
|
static const char *
|
||||||
|
decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
|
||||||
|
{
|
||||||
|
if (mode == SCM_PORT_READ
|
||||||
|
&& SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
|
||||||
|
&& looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
|
||||||
|
return "UTF-32LE";
|
||||||
|
else
|
||||||
|
return "UTF-32BE";
|
||||||
|
}
|
||||||
|
|
||||||
static void
|
static void
|
||||||
finalize_iconv_descriptors (void *ptr, void *data)
|
finalize_iconv_descriptors (void *ptr, void *data)
|
||||||
{
|
{
|
||||||
|
@ -2377,23 +2468,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
|
||||||
id->output_cd = (void *) -1;
|
id->output_cd = (void *) -1;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* Return the iconv_descriptors, initializing them if necessary. MODE
|
||||||
|
must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
|
||||||
|
operation is about to be done. We deliberately avoid reading from
|
||||||
|
the port unless the user was about to do so. */
|
||||||
scm_t_iconv_descriptors *
|
scm_t_iconv_descriptors *
|
||||||
scm_i_port_iconv_descriptors (SCM port)
|
scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
|
||||||
{
|
{
|
||||||
scm_t_port *pt;
|
scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
|
||||||
scm_t_port_internal *pti;
|
|
||||||
|
|
||||||
pt = SCM_PTAB_ENTRY (port);
|
|
||||||
pti = SCM_PORT_GET_INTERNAL (port);
|
|
||||||
|
|
||||||
assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
|
assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
|
||||||
|
|
||||||
if (!pti->iconv_descriptors)
|
if (!pti->iconv_descriptors)
|
||||||
{
|
{
|
||||||
|
scm_t_port *pt = SCM_PTAB_ENTRY (port);
|
||||||
|
const char *precise_encoding;
|
||||||
|
|
||||||
if (!pt->encoding)
|
if (!pt->encoding)
|
||||||
pt->encoding = "ISO-8859-1";
|
pt->encoding = "ISO-8859-1";
|
||||||
|
|
||||||
|
/* If the specified encoding is UTF-16 or UTF-32, then make
|
||||||
|
that more precise by deciding what byte order to use. */
|
||||||
|
if (strcasecmp (pt->encoding, "UTF-16") == 0)
|
||||||
|
precise_encoding = decide_utf16_encoding (port, mode);
|
||||||
|
else if (strcasecmp (pt->encoding, "UTF-32") == 0)
|
||||||
|
precise_encoding = decide_utf32_encoding (port, mode);
|
||||||
|
else
|
||||||
|
precise_encoding = pt->encoding;
|
||||||
|
|
||||||
pti->iconv_descriptors =
|
pti->iconv_descriptors =
|
||||||
open_iconv_descriptors (pt->encoding,
|
open_iconv_descriptors (precise_encoding,
|
||||||
SCM_INPUT_PORT_P (port),
|
SCM_INPUT_PORT_P (port),
|
||||||
SCM_OUTPUT_PORT_P (port));
|
SCM_OUTPUT_PORT_P (port));
|
||||||
}
|
}
|
||||||
|
@ -2413,28 +2517,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
|
||||||
pti = SCM_PORT_GET_INTERNAL (port);
|
pti = SCM_PORT_GET_INTERNAL (port);
|
||||||
prev = pti->iconv_descriptors;
|
prev = pti->iconv_descriptors;
|
||||||
|
|
||||||
|
/* In order to handle cases where the encoding changes mid-stream
|
||||||
|
(e.g. within an HTTP stream, or within a file that is composed of
|
||||||
|
segments with different encodings), we consider this to be "stream
|
||||||
|
start" for purposes of BOM handling, regardless of our actual file
|
||||||
|
position. */
|
||||||
|
pti->at_stream_start_for_bom_read = 1;
|
||||||
|
pti->at_stream_start_for_bom_write = 1;
|
||||||
|
|
||||||
if (encoding == NULL)
|
if (encoding == NULL)
|
||||||
encoding = "ISO-8859-1";
|
encoding = "ISO-8859-1";
|
||||||
|
|
||||||
/* If ENCODING is UTF-8, then no conversion descriptor is opened
|
/* If ENCODING is UTF-8, then no conversion descriptor is opened
|
||||||
because we do I/O ourselves. This saves 100+ KiB for each
|
because we do I/O ourselves. This saves 100+ KiB for each
|
||||||
descriptor. */
|
descriptor. */
|
||||||
if (strcasecmp (encoding, "UTF-8") == 0)
|
|
||||||
{
|
|
||||||
pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
|
|
||||||
pti->iconv_descriptors = NULL;
|
|
||||||
}
|
|
||||||
else
|
|
||||||
{
|
|
||||||
/* Open descriptors before mutating the port. */
|
|
||||||
pti->iconv_descriptors =
|
|
||||||
open_iconv_descriptors (encoding,
|
|
||||||
SCM_INPUT_PORT_P (port),
|
|
||||||
SCM_OUTPUT_PORT_P (port));
|
|
||||||
pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
|
|
||||||
}
|
|
||||||
pt->encoding = scm_gc_strdup (encoding, "port");
|
pt->encoding = scm_gc_strdup (encoding, "port");
|
||||||
|
if (strcasecmp (encoding, "UTF-8") == 0)
|
||||||
|
pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
|
||||||
|
else
|
||||||
|
pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
|
||||||
|
|
||||||
|
pti->iconv_descriptors = NULL;
|
||||||
if (prev)
|
if (prev)
|
||||||
close_iconv_descriptors (prev);
|
close_iconv_descriptors (prev);
|
||||||
}
|
}
|
||||||
|
|
|
@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
|
||||||
{
|
{
|
||||||
size_t printed;
|
size_t printed;
|
||||||
scm_t_iconv_descriptors *id;
|
scm_t_iconv_descriptors *id;
|
||||||
|
scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
|
||||||
|
|
||||||
id = scm_i_port_iconv_descriptors (port);
|
id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
|
||||||
|
|
||||||
|
if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
|
||||||
|
{
|
||||||
|
scm_t_port *pt = SCM_PTAB_ENTRY (port);
|
||||||
|
|
||||||
|
/* Record that we're no longer at stream start. */
|
||||||
|
pti->at_stream_start_for_bom_write = 0;
|
||||||
|
if (pt->rw_random)
|
||||||
|
pti->at_stream_start_for_bom_read = 0;
|
||||||
|
|
||||||
|
/* Write a BOM if appropriate. */
|
||||||
|
if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
|
||||||
|
|| strcasecmp(pt->encoding, "UTF-32") == 0))
|
||||||
|
display_character (SCM_UNICODE_BOM, port, iconveh_error);
|
||||||
|
}
|
||||||
|
|
||||||
printed = 0;
|
printed = 0;
|
||||||
|
|
||||||
|
|
|
@ -24,7 +24,8 @@
|
||||||
#:use-module (ice-9 popen)
|
#:use-module (ice-9 popen)
|
||||||
#:use-module (ice-9 rdelim)
|
#:use-module (ice-9 rdelim)
|
||||||
#:use-module (rnrs bytevectors)
|
#:use-module (rnrs bytevectors)
|
||||||
#:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
|
#:use-module ((rnrs io ports) #:select (open-bytevector-input-port
|
||||||
|
open-bytevector-output-port)))
|
||||||
|
|
||||||
(define (display-line . args)
|
(define (display-line . args)
|
||||||
(for-each display args)
|
(for-each display args)
|
||||||
|
@ -918,7 +919,9 @@
|
||||||
|
|
||||||
(pass-if-exception "set-port-encoding!, wrong encoding"
|
(pass-if-exception "set-port-encoding!, wrong encoding"
|
||||||
exception:miscellaneous-error
|
exception:miscellaneous-error
|
||||||
(set-port-encoding! (open-input-string "") "does-not-exist"))
|
(let ((p (open-input-string "")))
|
||||||
|
(set-port-encoding! p "does-not-exist")
|
||||||
|
(read p)))
|
||||||
|
|
||||||
(pass-if-exception "%default-port-encoding, wrong encoding"
|
(pass-if-exception "%default-port-encoding, wrong encoding"
|
||||||
exception:miscellaneous-error
|
exception:miscellaneous-error
|
||||||
|
@ -1233,6 +1236,283 @@
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
(with-test-prefix "unicode byte-order marks (BOMs)"
|
||||||
|
|
||||||
|
(define (bv-read-test* encoding bv proc)
|
||||||
|
(let ((port (open-bytevector-input-port bv)))
|
||||||
|
(set-port-encoding! port encoding)
|
||||||
|
(proc port)))
|
||||||
|
|
||||||
|
(define (bv-read-test encoding bv)
|
||||||
|
(bv-read-test* encoding bv read-string))
|
||||||
|
|
||||||
|
(define (bv-write-test* encoding proc)
|
||||||
|
(call-with-values
|
||||||
|
(lambda () (open-bytevector-output-port))
|
||||||
|
(lambda (port get-bytevector)
|
||||||
|
(set-port-encoding! port encoding)
|
||||||
|
(proc port)
|
||||||
|
(get-bytevector))))
|
||||||
|
|
||||||
|
(define (bv-write-test encoding str)
|
||||||
|
(bv-write-test* encoding
|
||||||
|
(lambda (p)
|
||||||
|
(display str p))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from Latin-1 stream"
|
||||||
|
"\xEF\xBB\xBF\x61"
|
||||||
|
(bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from Latin-2 stream"
|
||||||
|
"\u010F\u0165\u017C\x61"
|
||||||
|
(bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from UTF-16BE stream"
|
||||||
|
"\uFEFF\x61"
|
||||||
|
(bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from UTF-16LE stream"
|
||||||
|
"\uFEFF\x61"
|
||||||
|
(bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from UTF-32BE stream"
|
||||||
|
"\uFEFF\x61"
|
||||||
|
(bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #x00 #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from UTF-32LE stream"
|
||||||
|
"\uFEFF\x61"
|
||||||
|
(bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
|
||||||
|
#x61 #x00 #x00 #x00)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not written to UTF-8 stream"
|
||||||
|
#vu8(#x61)
|
||||||
|
(bv-write-test "UTF-8" "a"))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not written to UTF-16BE stream"
|
||||||
|
#vu8(#x00 #x61)
|
||||||
|
(bv-write-test "UTF-16BE" "a"))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not written to UTF-16LE stream"
|
||||||
|
#vu8(#x61 #x00)
|
||||||
|
(bv-write-test "UTF-16LE" "a"))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not written to UTF-32BE stream"
|
||||||
|
#vu8(#x00 #x00 #x00 #x61)
|
||||||
|
(bv-write-test "UTF-32BE" "a"))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not written to UTF-32LE stream"
|
||||||
|
#vu8(#x61 #x00 #x00 #x00)
|
||||||
|
(bv-write-test "UTF-32LE" "a"))
|
||||||
|
|
||||||
|
(pass-if "Don't read from the port unless user asks to"
|
||||||
|
(let* ((p (make-soft-port
|
||||||
|
(vector
|
||||||
|
(lambda (c) #f) ; write char
|
||||||
|
(lambda (s) #f) ; write string
|
||||||
|
(lambda () #f) ; flush
|
||||||
|
(lambda () (throw 'fail)) ; read char
|
||||||
|
(lambda () #f))
|
||||||
|
"rw")))
|
||||||
|
(set-port-encoding! p "UTF-16")
|
||||||
|
(display "abc" p)
|
||||||
|
(set-port-encoding! p "UTF-32")
|
||||||
|
(display "def" p)
|
||||||
|
#t))
|
||||||
|
|
||||||
|
;; TODO: test that input and output streams are independent when
|
||||||
|
;; appropriate, and linked when appropriate.
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-8 stream"
|
||||||
|
"a"
|
||||||
|
(bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
|
||||||
|
'(#\a "a")
|
||||||
|
(bv-read-test* "uTf-8" #vu8(#xEF #xBB #xBF #x61)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((c (read-char p)))
|
||||||
|
(seek p 0 SEEK_SET)
|
||||||
|
(let ((s (read-string p)))
|
||||||
|
(list c s))))))
|
||||||
|
|
||||||
|
(pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
|
||||||
|
"\uFEFFa"
|
||||||
|
(bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
|
||||||
|
"\uFEFFb"
|
||||||
|
(bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
|
||||||
|
(lambda (p)
|
||||||
|
(seek p 1 SEEK_SET)
|
||||||
|
(read-string p))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
|
||||||
|
"a\uFEFFb"
|
||||||
|
(bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM (BE) written to start of UTF-16 stream"
|
||||||
|
#vu8(#xFE #xFF #x00 #x61 #x00 #x62)
|
||||||
|
(bv-write-test "UTF-16" "ab"))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
|
||||||
|
#vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
|
||||||
|
(bv-write-test* "UTF-16"
|
||||||
|
(lambda (p)
|
||||||
|
(display "ab" p)
|
||||||
|
(set-port-encoding! p "UTF-16")
|
||||||
|
(display "cd" p))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
|
||||||
|
"a"
|
||||||
|
(bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
|
||||||
|
'(#\a "a")
|
||||||
|
(bv-read-test* "utf-16" #vu8(#xFE #xFF #x00 #x61)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((c (read-char p)))
|
||||||
|
(seek p 0 SEEK_SET)
|
||||||
|
(let ((s (read-string p)))
|
||||||
|
(list c s))))))
|
||||||
|
|
||||||
|
(pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
|
||||||
|
"\uFEFFa"
|
||||||
|
(bv-read-test "Utf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
|
||||||
|
"\uFEFFa"
|
||||||
|
(bv-read-test* "uTf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
|
||||||
|
(lambda (p)
|
||||||
|
(seek p 2 SEEK_SET)
|
||||||
|
(read-string p))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
|
||||||
|
"a\uFEFFb"
|
||||||
|
(let ((be (bv-read-test "utf-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
|
||||||
|
(le (bv-read-test "utf-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
|
||||||
|
(if (char=? #\a (string-ref be 0))
|
||||||
|
be
|
||||||
|
le)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
|
||||||
|
"a"
|
||||||
|
(bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
|
||||||
|
'(#\a "a")
|
||||||
|
(bv-read-test* "Utf-16" #vu8(#xFF #xFE #x61 #x00)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((c (read-char p)))
|
||||||
|
(seek p 0 SEEK_SET)
|
||||||
|
(let ((s (read-string p)))
|
||||||
|
(list c s))))))
|
||||||
|
|
||||||
|
(pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
|
||||||
|
"\uFEFFa"
|
||||||
|
(bv-read-test "UTf-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
|
||||||
|
"a"
|
||||||
|
(bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #x00 #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
|
||||||
|
'(#\a "a")
|
||||||
|
(bv-read-test* "utF-32" #vu8(#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #x00 #x61)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((c (read-char p)))
|
||||||
|
(seek p 0 SEEK_SET)
|
||||||
|
(let ((s (read-string p)))
|
||||||
|
(list c s))))))
|
||||||
|
|
||||||
|
(pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
|
||||||
|
"\uFEFFa"
|
||||||
|
(bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #x00 #x61)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
|
||||||
|
"\uFEFFa"
|
||||||
|
(bv-read-test* "UtF-32" #vu8(#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #x00 #x61)
|
||||||
|
(lambda (p)
|
||||||
|
(seek p 4 SEEK_SET)
|
||||||
|
(read-string p))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
|
||||||
|
"ab"
|
||||||
|
(bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((a (read-char p)))
|
||||||
|
(set-port-encoding! p "UTF-16")
|
||||||
|
(string a (read-char p))))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
|
||||||
|
"ab"
|
||||||
|
(bv-read-test* "utf-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((a (read-char p)))
|
||||||
|
(set-port-encoding! p "UTF-16")
|
||||||
|
(string a (read-char p))))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
|
||||||
|
"ab"
|
||||||
|
(bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
|
||||||
|
#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #x00 #x62)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((a (read-char p)))
|
||||||
|
(set-port-encoding! p "UTF-32")
|
||||||
|
(string a (read-char p))))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
|
||||||
|
"ab"
|
||||||
|
(bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
|
||||||
|
#xFF #xFE #x00 #x00
|
||||||
|
#x62 #x00 #x00 #x00)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((a (read-char p)))
|
||||||
|
(set-port-encoding! p "UTF-32")
|
||||||
|
(string a (read-char p))))))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
|
||||||
|
"a\uFEFFb"
|
||||||
|
(let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
|
||||||
|
#x00 #x00 #xFE #xFF
|
||||||
|
#x00 #x00 #x00 #x62)))
|
||||||
|
(le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
|
||||||
|
#xFF #xFE #x00 #x00
|
||||||
|
#x62 #x00 #x00 #x00))))
|
||||||
|
(if (char=? #\a (string-ref be 0))
|
||||||
|
be
|
||||||
|
le)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
|
||||||
|
"a"
|
||||||
|
(bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
|
||||||
|
#x61 #x00 #x00 #x00)))
|
||||||
|
|
||||||
|
(pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
|
||||||
|
'(#\a "a")
|
||||||
|
(bv-read-test* "UTf-32" #vu8(#xFF #xFE #x00 #x00
|
||||||
|
#x61 #x00 #x00 #x00)
|
||||||
|
(lambda (p)
|
||||||
|
(let ((c (read-char p)))
|
||||||
|
(seek p 0 SEEK_SET)
|
||||||
|
(let ((s (read-string p)))
|
||||||
|
(list c s))))))
|
||||||
|
|
||||||
|
(pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
|
||||||
|
"\uFEFFa"
|
||||||
|
(bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
|
||||||
|
#xFF #xFE #x00 #x00
|
||||||
|
#x61 #x00 #x00 #x00))))
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
(define-syntax-rule (with-load-path path body ...)
|
(define-syntax-rule (with-load-path path body ...)
|
||||||
(let ((new path)
|
(let ((new path)
|
||||||
(old %load-path))
|
(old %load-path))
|
||||||
|
|
Loading…
Add table
Add a link
Reference in a new issue