Improve handling of Unicode byte-order marks (BOMs).

* libguile/ports-internal.h (struct scm_port_internal): Add new members 'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'. (SCM_UNICODE_BOM): New macro. (scm_i_port_iconv_descriptors): Add 'mode' parameter to prototype. * libguile/ports.c (scm_new_port_table_entry): Initialize 'at_stream_start_for_bom_read' and 'at_stream_start_for_bom_write'. (get_iconv_codepoint): Pass new 'mode' parameter to 'scm_i_port_iconv_descriptors'. (get_codepoint): After reading a codepoint at stream start, record that we're no longer at stream start, and consume a BOM where appropriate. (scm_seek): Set the stream start flags according to the new position. (looking_at_bytes): New static function. (scm_utf8_bom, scm_utf16be_bom, scm_utf16le_bom, scm_utf32be_bom, scm_utf32le_bom): New static const arrays. (decide_utf16_encoding, decide_utf32_encoding): New static functions. (scm_i_port_iconv_descriptors): Add new 'mode' parameter. If the specified encoding is UTF-16 or UTF-32, make that precise by deciding what byte order to use, and construct iconv descriptors based on the precise encoding. (scm_i_set_port_encoding_x): Record that we are now at stream start. Do not open the new iconv descriptors immediately; let them be initialized lazily. * libguile/print.c (display_string_using_iconv): Record that we're no longer at stream start. Write a BOM if appropriate. * doc/ref/api-io.texi (BOM Handling): New node. * test-suite/tests/ports.test ("set-port-encoding!, wrong encoding"): Adapt test to cope with the fact that 'set-port-encoding!' does not immediately open the iconv descriptors. (bv-read-test): New procedure. ("unicode byte-order marks (BOMs)"): New test prefix.
2025-06-11 06:20:23 +02:00 · 2013-04-03 04:22:04 -04:00 · 2013-04-03 04:22:04 -04:00 · cdd3d6c9f4
commit cdd3d6c9f4
parent 45c0878b86
5 changed files with 515 additions and 32 deletions
--- a/doc/ref/api-io.texi
+++ b/doc/ref/api-io.texi
@ -1,7 +1,7 @@
@c -*-texinfo-*-
@c This is part of the GNU Guile Reference Manual.
@c Copyright (C)  1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009,
-@c   2010, 2011  Free Software Foundation, Inc.
+@c   2010, 2011, 2013  Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.
@node Input and Output
@ -19,6 +19,7 @@
 * Port Types::                  Types of port and how to make them.
 * R6RS I/O Ports::              The R6RS port API.
 * I/O Extensions::              Using and extending ports in C.
 * BOM Handling::                Handling of Unicode byte order marks.
@end menu
@ -2373,6 +2374,84 @@ Set using
@end table
@node BOM Handling
@subsection Handling of Unicode byte order marks.
@cindex BOM
@cindex byte order mark
 This section documents the finer points of Guile's handling of Unicode
 byte order marks (BOMs).  A byte order mark (U+FEFF) is typically found
 at the start of a UTF-16 or UTF-32 stream, to allow readers to reliably
 determine the byte order.  Occasionally, a BOM is found at the start of
 a UTF-8 stream, but this is much less common and not generally
 recommended.
 Guile attempts to handle BOMs automatically, and in accordance with the
 recommendations of the Unicode Standard, when the port encoding is set
 to @code{UTF-8}, @code{UTF-16}, or @code{UTF-32}.  In brief, Guile
 automatically writes a BOM at the start of a UTF-16 or UTF-32 stream,
 and automatically consumes one from the start of a UTF-8, UTF-16, or
 UTF-32 stream.
 As specified in the Unicode Standard, a BOM is only handled specially at
 the start of a stream, and only if the port encoding is set to
@code{UTF-8}, @code{UTF-16} or @code{UTF-32}.  If the port encoding is
 set to @code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or
@code{UTF-32LE}, then BOMs are @emph{not} handled specially, and none of
 the special handling described in this section applies.
@itemize @bullet
@item
 To ensure that Guile will properly detect the byte order of a UTF-16 or
 UTF-32 stream, you must perform a textual read before any writes, seeks,
 or binary I/O.  Guile will not attempt to read a BOM unless a read is
 explicitly requested at the start of the stream.
@item
 If a textual write is performed before the first read, then an arbitrary
 byte order will be chosen.  Currently, big endian is the default on all
 platforms, but that may change in the future.  If you wish to explicitly
 control the byte order of an output stream, set the port encoding to
@code{UTF-16BE}, @code{UTF-16LE}, @code{UTF-32BE}, or @code{UTF-32LE},
 and explicitly write a BOM (@code{#\xFEFF}) if desired.
@item
 If @code{set-port-encoding!} is called in the middle of a stream, Guile
 treats this as a new logical ``start of stream'' for purposes of BOM
 handling, and will forget about any BOMs that had previously been seen.
 Therefore, it may choose a different byte order than had been used
 previously.  This is intended to support multiple logical text streams
 embedded within a larger binary stream.
@item
 Binary I/O operations are not guaranteed to update Guile's notion of
 whether the port is at the ``start of the stream'', nor are they
 guaranteed to produce or consume BOMs.
@item
 For ports that support seeking (e.g. normal files), the input and output
 streams are considered linked: if the user reads first, then a BOM will
 be consumed (if appropriate), but later writes will @emph{not} produce a
 BOM.  Similarly, if the user writes first, then later reads will
@emph{not} consume a BOM.
@item
 For ports that do not support seeking (e.g. pipes, sockets, and
 terminals), the input and output streams are considered
@emph{independent} for purposes of BOM handling: the first read will
 consume a BOM (if appropriate), and the first write will @emph{also}
 produce a BOM (if appropriate).  However, the input and output streams
 will always use the same byte order.
@item
 Seeks to the beginning of a file will set the ``start of stream'' flags.
 Therefore, a subsequent textual read or write will consume or produce a
 BOM.  However, unlike @code{set-port-encoding!}, if a byte order had
 already been chosen for the port, it will remain in effect after a seek,
 and cannot be changed by the presence of a BOM.  Seeks anywhere other
 than the beginning of a file clear the ``start of stream'' flags.
@end itemize
@c Local Variables:
@c TeX-master: "guile.texi"
@c End:
--- a/libguile/ports-internal.h
+++ b/libguile/ports-internal.h
@ -46,6 +46,8 @@ typedef struct scm_iconv_descriptors scm_t_iconv_descriptors;
 struct scm_port_internal
 {
  unsigned at_stream_start_for_bom_read  : 1;
  unsigned at_stream_start_for_bom_write : 1;
  scm_t_port_encoding_mode encoding_mode;
  scm_t_iconv_descriptors *iconv_descriptors;
  int pending_eof;
@ -54,9 +56,12 @@ struct scm_port_internal
 typedef struct scm_port_internal scm_t_port_internal;
 #define SCM_UNICODE_BOM  0xFEFFUL  /* Unicode byte-order mark */
 #define SCM_PORT_GET_INTERNAL(x)                                \
  ((scm_t_port_internal *) (SCM_PTAB_ENTRY(x)->input_cd))
-SCM_INTERNAL scm_t_iconv_descriptors *scm_i_port_iconv_descriptors (SCM port);
+SCM_INTERNAL scm_t_iconv_descriptors *
 scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode);
 #endif
--- a/libguile/ports.c
+++ b/libguile/ports.c
@ -651,6 +651,9 @@ scm_new_port_table_entry (scm_t_bits tag)
    pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
  pti->iconv_descriptors = NULL;
  pti->at_stream_start_for_bom_read  = 1;
  pti->at_stream_start_for_bom_write = 1;
  /* XXX These fields are not what they seem.  They have been
     repurposed, but cannot safely be renamed in 2.0 without breaking
     ABI compatibility.  This will be cleaned up in 2.2.  */
@ -1319,10 +1322,12 @@ static int
 get_iconv_codepoint (SCM port, scm_t_wchar *codepoint,
 		     char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  scm_t_iconv_descriptors *id = scm_i_port_iconv_descriptors (port);
+  scm_t_iconv_descriptors *id;
  scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
  size_t input_size = 0;
  id = scm_i_port_iconv_descriptors (port, SCM_PORT_READ);
  for (;;)
    {
      int byte_read;
@ -1409,7 +1414,25 @@ get_codepoint (SCM port, scm_t_wchar *codepoint,
    err = get_iconv_codepoint (port, codepoint, buf, len);
  if (SCM_LIKELY (err == 0))
-    update_port_lf (*codepoint, port);
+    {
      if (SCM_UNLIKELY (pti->at_stream_start_for_bom_read))
        {
          /* Record that we're no longer at stream start. */
          pti->at_stream_start_for_bom_read = 0;
          if (pt->rw_random)
            pti->at_stream_start_for_bom_write = 0;
          /* If we just read a BOM in an encoding that recognizes them,
             then silently consume it and read another code point. */
          if (SCM_UNLIKELY
              (*codepoint == SCM_UNICODE_BOM
               && (pti->encoding_mode == SCM_PORT_ENCODING_MODE_UTF8
                   || strcasecmp (pt->encoding, "UTF-16") == 0
                   || strcasecmp (pt->encoding, "UTF-32") == 0)))
            return get_codepoint (port, codepoint, buf, len);
        }
      update_port_lf (*codepoint, port);
    }
  else if (pt->ilseq_handler == SCM_ICONVEH_QUESTION_MARK)
    {
      *codepoint = '?';
@ -2037,6 +2060,7 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
  if (SCM_OPPORTP (fd_port))
    {
      scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (fd_port);
      scm_t_ptob_descriptor *ptob = scm_ptobs + SCM_PTOBNUM (fd_port);
      off_t_or_off64_t off = scm_to_off_t_or_off64_t (offset);
      off_t_or_off64_t rv;
@ -2045,10 +2069,14 @@ SCM_DEFINE (scm_seek, "seek", 3, 0, 0,
 	SCM_MISC_ERROR ("port is not seekable", 
                        scm_cons (fd_port, SCM_EOL));
      else
-        {
+        rv = ptob->seek (fd_port, off, how);
-          scm_i_clear_pending_eof (fd_port);
+
-          rv = ptob->seek (fd_port, off, how);
+      /* Set stream-start flags according to new position. */
-        }
+      pti->at_stream_start_for_bom_read  = (rv == 0);
      pti->at_stream_start_for_bom_write = (rv == 0);
      scm_i_clear_pending_eof (fd_port);
      return scm_from_off_t_or_off64_t (rv);
    }
  else /* file descriptor?.  */
@ -2301,6 +2329,69 @@ scm_i_default_port_encoding (void)
    }
 }
 /* If the next LEN bytes from PORT are equal to those in BYTES, then
   return 1, else return 0.  Leave the port position unchanged.  */
 static int
 looking_at_bytes (SCM port, const unsigned char *bytes, int len)
 {
  scm_t_port *pt = SCM_PTAB_ENTRY (port);
  int result;
  int i = 0;
  while (i < len && scm_peek_byte_or_eof (port) == bytes[i])
    {
      pt->read_pos++;
      i++;
    }
  result = (i == len);
  while (i > 0)
    scm_unget_byte (bytes[--i], port);
  return result;
 }
 static const unsigned char scm_utf8_bom[3]    = {0xEF, 0xBB, 0xBF};
 static const unsigned char scm_utf16be_bom[2] = {0xFE, 0xFF};
 static const unsigned char scm_utf16le_bom[2] = {0xFF, 0xFE};
 static const unsigned char scm_utf32be_bom[4] = {0x00, 0x00, 0xFE, 0xFF};
 static const unsigned char scm_utf32le_bom[4] = {0xFF, 0xFE, 0x00, 0x00};
 /* Decide what byte order to use for a UTF-16 port.  Return "UTF-16BE"
   or "UTF-16LE".  MODE must be either SCM_PORT_READ or SCM_PORT_WRITE,
   and specifies which operation is about to be done.  The MODE
   determines how we will decide the byte order.  We deliberately avoid
   reading from the port unless the user is about to do so.  If the user
   is about to read, then we look for a BOM, and if present, we use it
   to determine the byte order.  Otherwise we choose big endian, as
   recommended by the Unicode Standard.  Note that the BOM (if any) is
   not consumed here.  */
 static const char *
 decide_utf16_encoding (SCM port, scm_t_port_rw_active mode)
 {
  if (mode == SCM_PORT_READ
      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
      && looking_at_bytes (port, scm_utf16le_bom, sizeof scm_utf16le_bom))
    return "UTF-16LE";
  else
    return "UTF-16BE";
 }
 /* Decide what byte order to use for a UTF-32 port.  Return "UTF-32BE"
   or "UTF-32LE".  See the comment above 'decide_utf16_encoding' for
   details.  */
 static const char *
 decide_utf32_encoding (SCM port, scm_t_port_rw_active mode)
 {
  if (mode == SCM_PORT_READ
      && SCM_PORT_GET_INTERNAL (port)->at_stream_start_for_bom_read
      && looking_at_bytes (port, scm_utf32le_bom, sizeof scm_utf32le_bom))
    return "UTF-32LE";
  else
    return "UTF-32BE";
 }
 static void
 finalize_iconv_descriptors (void *ptr, void *data)
 {
@ -2377,23 +2468,36 @@ close_iconv_descriptors (scm_t_iconv_descriptors *id)
  id->output_cd = (void *) -1;
 }
 /* Return the iconv_descriptors, initializing them if necessary.  MODE
   must be either SCM_PORT_READ or SCM_PORT_WRITE, and specifies which
   operation is about to be done.  We deliberately avoid reading from
   the port unless the user was about to do so.  */
 scm_t_iconv_descriptors *
-scm_i_port_iconv_descriptors (SCM port)
+scm_i_port_iconv_descriptors (SCM port, scm_t_port_rw_active mode)
 {
-  scm_t_port *pt;
+  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
  scm_t_port_internal *pti;
  pt = SCM_PTAB_ENTRY (port);
  pti = SCM_PORT_GET_INTERNAL (port);
  assert (pti->encoding_mode == SCM_PORT_ENCODING_MODE_ICONV);
  if (!pti->iconv_descriptors)
    {
      scm_t_port *pt = SCM_PTAB_ENTRY (port);
      const char *precise_encoding;
      if (!pt->encoding)
        pt->encoding = "ISO-8859-1";
      /* If the specified encoding is UTF-16 or UTF-32, then make
         that more precise by deciding what byte order to use. */
      if (strcasecmp (pt->encoding, "UTF-16") == 0)
        precise_encoding = decide_utf16_encoding (port, mode);
      else if (strcasecmp (pt->encoding, "UTF-32") == 0)
        precise_encoding = decide_utf32_encoding (port, mode);
      else
        precise_encoding = pt->encoding;
      pti->iconv_descriptors =
-        open_iconv_descriptors (pt->encoding,
+        open_iconv_descriptors (precise_encoding,
                                SCM_INPUT_PORT_P (port),
                                SCM_OUTPUT_PORT_P (port));
    }
@ -2413,28 +2517,27 @@ scm_i_set_port_encoding_x (SCM port, const char *encoding)
  pti = SCM_PORT_GET_INTERNAL (port);
  prev = pti->iconv_descriptors;
  /* In order to handle cases where the encoding changes mid-stream
     (e.g. within an HTTP stream, or within a file that is composed of
     segments with different encodings), we consider this to be "stream
     start" for purposes of BOM handling, regardless of our actual file
     position. */
  pti->at_stream_start_for_bom_read  = 1;
  pti->at_stream_start_for_bom_write = 1;
  if (encoding == NULL)
    encoding = "ISO-8859-1";
  /* If ENCODING is UTF-8, then no conversion descriptor is opened
     because we do I/O ourselves.  This saves 100+ KiB for each
     descriptor.  */
  if (strcasecmp (encoding, "UTF-8") == 0)
    {
      pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
      pti->iconv_descriptors = NULL;
    }
  else
    {
      /* Open descriptors before mutating the port. */
      pti->iconv_descriptors =
        open_iconv_descriptors (encoding,
                                SCM_INPUT_PORT_P (port),
                                SCM_OUTPUT_PORT_P (port));
      pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
    }
  pt->encoding = scm_gc_strdup (encoding, "port");
  if (strcasecmp (encoding, "UTF-8") == 0)
    pti->encoding_mode = SCM_PORT_ENCODING_MODE_UTF8;
  else
    pti->encoding_mode = SCM_PORT_ENCODING_MODE_ICONV;
  pti->iconv_descriptors = NULL;
  if (prev)
    close_iconv_descriptors (prev);
 }
--- a/libguile/print.c
+++ b/libguile/print.c
@ -881,8 +881,24 @@ display_string_using_iconv (const void *str, int narrow_p, size_t len,
 {
  size_t printed;
  scm_t_iconv_descriptors *id;
  scm_t_port_internal *pti = SCM_PORT_GET_INTERNAL (port);
-  id = scm_i_port_iconv_descriptors (port);
+  id = scm_i_port_iconv_descriptors (port, SCM_PORT_WRITE);
  if (SCM_UNLIKELY (pti->at_stream_start_for_bom_write && len > 0))
    {
      scm_t_port *pt = SCM_PTAB_ENTRY (port);
      /* Record that we're no longer at stream start.  */
      pti->at_stream_start_for_bom_write = 0;
      if (pt->rw_random)
        pti->at_stream_start_for_bom_read = 0;
      /* Write a BOM if appropriate.  */
      if (SCM_UNLIKELY (strcasecmp(pt->encoding, "UTF-16") == 0
                        || strcasecmp(pt->encoding, "UTF-32") == 0))
        display_character (SCM_UNICODE_BOM, port, iconveh_error);
    }
  printed = 0;
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@ -24,7 +24,8 @@
  #:use-module (ice-9 popen)
  #:use-module (ice-9 rdelim)
  #:use-module (rnrs bytevectors)
-  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port)))
+  #:use-module ((rnrs io ports) #:select (open-bytevector-input-port
                                          open-bytevector-output-port)))
 (define (display-line . args)
  (for-each display args)
@ -918,7 +919,9 @@
  (pass-if-exception "set-port-encoding!, wrong encoding"
    exception:miscellaneous-error
-    (set-port-encoding! (open-input-string "") "does-not-exist"))
+    (let ((p (open-input-string "")))
      (set-port-encoding! p "does-not-exist")
      (read p)))
  (pass-if-exception "%default-port-encoding, wrong encoding"
    exception:miscellaneous-error
@ -1233,6 +1236,283 @@
 (with-test-prefix "unicode byte-order marks (BOMs)"
  (define (bv-read-test* encoding bv proc)
    (let ((port (open-bytevector-input-port bv)))
      (set-port-encoding! port encoding)
      (proc port)))
  (define (bv-read-test encoding bv)
    (bv-read-test* encoding bv read-string))
  (define (bv-write-test* encoding proc)
    (call-with-values
        (lambda () (open-bytevector-output-port))
      (lambda (port get-bytevector)
        (set-port-encoding! port encoding)
        (proc port)
        (get-bytevector))))
  (define (bv-write-test encoding str)
    (bv-write-test* encoding
                    (lambda (p)
                      (display str p))))
  (pass-if-equal "BOM not discarded from Latin-1 stream"
      "\xEF\xBB\xBF\x61"
    (bv-read-test "ISO-8859-1" #vu8(#xEF #xBB #xBF #x61)))
  (pass-if-equal "BOM not discarded from Latin-2 stream"
      "\u010F\u0165\u017C\x61"
    (bv-read-test "ISO-8859-2" #vu8(#xEF #xBB #xBF #x61)))
  (pass-if-equal "BOM not discarded from UTF-16BE stream"
      "\uFEFF\x61"
    (bv-read-test "UTF-16BE" #vu8(#xFE #xFF #x00 #x61)))
  (pass-if-equal "BOM not discarded from UTF-16LE stream"
      "\uFEFF\x61"
    (bv-read-test "UTF-16LE" #vu8(#xFF #xFE #x61 #x00)))
  (pass-if-equal "BOM not discarded from UTF-32BE stream"
      "\uFEFF\x61"
    (bv-read-test "UTF-32BE" #vu8(#x00 #x00 #xFE #xFF
                                       #x00 #x00 #x00 #x61)))
  (pass-if-equal "BOM not discarded from UTF-32LE stream"
      "\uFEFF\x61"
    (bv-read-test "UTF-32LE" #vu8(#xFF #xFE #x00 #x00
                                       #x61 #x00 #x00 #x00)))
  (pass-if-equal "BOM not written to UTF-8 stream"
      #vu8(#x61)
    (bv-write-test "UTF-8" "a"))
  (pass-if-equal "BOM not written to UTF-16BE stream"
      #vu8(#x00 #x61)
    (bv-write-test "UTF-16BE" "a"))
  (pass-if-equal "BOM not written to UTF-16LE stream"
      #vu8(#x61 #x00)
    (bv-write-test "UTF-16LE" "a"))
  (pass-if-equal "BOM not written to UTF-32BE stream"
      #vu8(#x00 #x00 #x00 #x61)
    (bv-write-test "UTF-32BE" "a"))
  (pass-if-equal "BOM not written to UTF-32LE stream"
      #vu8(#x61 #x00 #x00 #x00)
    (bv-write-test "UTF-32LE" "a"))
  (pass-if "Don't read from the port unless user asks to"
    (let* ((p (make-soft-port
               (vector
                (lambda (c) #f)           ; write char
                (lambda (s) #f)           ; write string
                (lambda () #f)            ; flush
                (lambda () (throw 'fail)) ; read char
                (lambda () #f))
               "rw")))
      (set-port-encoding! p "UTF-16")
      (display "abc" p)
      (set-port-encoding! p "UTF-32")
      (display "def" p)
      #t))
  ;; TODO: test that input and output streams are independent when
  ;; appropriate, and linked when appropriate.
  (pass-if-equal "BOM discarded from start of UTF-8 stream"
      "a"
    (bv-read-test "Utf-8" #vu8(#xEF #xBB #xBF #x61)))
  (pass-if-equal "BOM discarded from start of UTF-8 stream after seek to 0"
      '(#\a "a")
    (bv-read-test* "uTf-8" #vu8(#xEF #xBB #xBF #x61)
                   (lambda (p)
                     (let ((c (read-char p)))
                       (seek p 0 SEEK_SET)
                       (let ((s (read-string p)))
                         (list c s))))))
  (pass-if-equal "Only one BOM discarded from start of UTF-8 stream"
      "\uFEFFa"
    (bv-read-test "UTF-8" #vu8(#xEF #xBB #xBF #xEF #xBB #xBF #x61)))
  (pass-if-equal "BOM not discarded from UTF-8 stream after seek to > 0"
      "\uFEFFb"
    (bv-read-test* "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)
                   (lambda (p)
                     (seek p 1 SEEK_SET)
                     (read-string p))))
  (pass-if-equal "BOM not discarded unless at start of UTF-8 stream"
      "a\uFEFFb"
    (bv-read-test "UTF-8" #vu8(#x61 #xEF #xBB #xBF #x62)))
  (pass-if-equal "BOM (BE) written to start of UTF-16 stream"
      #vu8(#xFE #xFF #x00 #x61 #x00 #x62)
    (bv-write-test "UTF-16" "ab"))
  (pass-if-equal "BOM (BE) written to UTF-16 stream after set-port-encoding!"
      #vu8(#xFE #xFF #x00 #x61 #x00 #x62 #xFE #xFF #x00 #x63 #x00 #x64)
    (bv-write-test* "UTF-16"
                    (lambda (p)
                      (display "ab" p)
                      (set-port-encoding! p "UTF-16")
                      (display "cd" p))))
  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE)"
      "a"
    (bv-read-test "UTF-16" #vu8(#xFE #xFF #x00 #x61)))
  (pass-if-equal "BOM discarded from start of UTF-16 stream (BE) after seek to 0"
      '(#\a "a")
    (bv-read-test* "utf-16" #vu8(#xFE #xFF #x00 #x61)
                   (lambda (p)
                     (let ((c (read-char p)))
                       (seek p 0 SEEK_SET)
                       (let ((s (read-string p)))
                         (list c s))))))
  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (BE)"
      "\uFEFFa"
    (bv-read-test "Utf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)))
  (pass-if-equal "BOM not discarded from UTF-16 stream (BE) after seek to > 0"
      "\uFEFFa"
    (bv-read-test* "uTf-16" #vu8(#xFE #xFF #xFE #xFF #x00 #x61)
                   (lambda (p)
                     (seek p 2 SEEK_SET)
                     (read-string p))))
  (pass-if-equal "BOM not discarded unless at start of UTF-16 stream"
      "a\uFEFFb"
    (let ((be (bv-read-test "utf-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)))
          (le (bv-read-test "utf-16" #vu8(#x61 #x00 #xFF #xFE #x62 #x00))))
      (if (char=? #\a (string-ref be 0))
          be
          le)))
  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE)"
      "a"
    (bv-read-test "UTF-16" #vu8(#xFF #xFE #x61 #x00)))
  (pass-if-equal "BOM discarded from start of UTF-16 stream (LE) after seek to 0"
      '(#\a "a")
    (bv-read-test* "Utf-16" #vu8(#xFF #xFE #x61 #x00)
                   (lambda (p)
                     (let ((c (read-char p)))
                       (seek p 0 SEEK_SET)
                       (let ((s (read-string p)))
                         (list c s))))))
  (pass-if-equal "Only one BOM discarded from start of UTF-16 stream (LE)"
      "\uFEFFa"
    (bv-read-test "UTf-16" #vu8(#xFF #xFE #xFF #xFE #x61 #x00)))
  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE)"
      "a"
    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
                                     #x00 #x00 #x00 #x61)))
  (pass-if-equal "BOM discarded from start of UTF-32 stream (BE) after seek to 0"
      '(#\a "a")
    (bv-read-test* "utF-32" #vu8(#x00 #x00 #xFE #xFF
                                      #x00 #x00 #x00 #x61)
                   (lambda (p)
                     (let ((c (read-char p)))
                       (seek p 0 SEEK_SET)
                       (let ((s (read-string p)))
                         (list c s))))))
  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (BE)"
      "\uFEFFa"
    (bv-read-test "UTF-32" #vu8(#x00 #x00 #xFE #xFF
                                     #x00 #x00 #xFE #xFF
                                     #x00 #x00 #x00 #x61)))
  (pass-if-equal "BOM not discarded from UTF-32 stream (BE) after seek to > 0"
      "\uFEFFa"
    (bv-read-test* "UtF-32" #vu8(#x00 #x00 #xFE #xFF
                                      #x00 #x00 #xFE #xFF
                                      #x00 #x00 #x00 #x61)
                   (lambda (p)
                     (seek p 4 SEEK_SET)
                     (read-string p))))
  (pass-if-equal "BOM discarded within UTF-16 stream (BE) after set-port-encoding!"
      "ab"
    (bv-read-test* "UTF-16" #vu8(#x00 #x61 #xFE #xFF #x00 #x62)
                   (lambda (p)
                     (let ((a (read-char p)))
                       (set-port-encoding! p "UTF-16")
                       (string a (read-char p))))))
  (pass-if-equal "BOM discarded within UTF-16 stream (LE,BE) after set-port-encoding!"
      "ab"
    (bv-read-test* "utf-16" #vu8(#x00 #x61 #xFF #xFE #x62 #x00)
                   (lambda (p)
                     (let ((a (read-char p)))
                       (set-port-encoding! p "UTF-16")
                       (string a (read-char p))))))
  (pass-if-equal "BOM discarded within UTF-32 stream (BE) after set-port-encoding!"
      "ab"
    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
                                      #x00 #x00 #xFE #xFF
                                      #x00 #x00 #x00 #x62)
                   (lambda (p)
                     (let ((a (read-char p)))
                       (set-port-encoding! p "UTF-32")
                       (string a (read-char p))))))
  (pass-if-equal "BOM discarded within UTF-32 stream (LE,BE) after set-port-encoding!"
      "ab"
    (bv-read-test* "UTF-32" #vu8(#x00 #x00 #x00 #x61
                                      #xFF #xFE #x00 #x00
                                      #x62 #x00 #x00 #x00)
                   (lambda (p)
                     (let ((a (read-char p)))
                       (set-port-encoding! p "UTF-32")
                       (string a (read-char p))))))
  (pass-if-equal "BOM not discarded unless at start of UTF-32 stream"
      "a\uFEFFb"
    (let ((be (bv-read-test "UTF-32" #vu8(#x00 #x00 #x00 #x61
                                               #x00 #x00 #xFE #xFF
                                               #x00 #x00 #x00 #x62)))
          (le (bv-read-test "UTF-32" #vu8(#x61 #x00 #x00 #x00
                                               #xFF #xFE #x00 #x00
                                               #x62 #x00 #x00 #x00))))
      (if (char=? #\a (string-ref be 0))
          be
          le)))
  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE)"
      "a"
    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
                                     #x61 #x00 #x00 #x00)))
  (pass-if-equal "BOM discarded from start of UTF-32 stream (LE) after seek to 0"
      '(#\a "a")
    (bv-read-test* "UTf-32" #vu8(#xFF #xFE #x00 #x00
                                      #x61 #x00 #x00 #x00)
                   (lambda (p)
                     (let ((c (read-char p)))
                       (seek p 0 SEEK_SET)
                       (let ((s (read-string p)))
                         (list c s))))))
  (pass-if-equal "Only one BOM discarded from start of UTF-32 stream (LE)"
      "\uFEFFa"
    (bv-read-test "UTF-32" #vu8(#xFF #xFE #x00 #x00
                                     #xFF #xFE #x00 #x00
                                     #x61 #x00 #x00 #x00))))
 (define-syntax-rule (with-load-path path body ...)
  (let ((new path)
        (old %load-path))