mirror of https://git.savannah.gnu.org/git/guile.git synced 2025-04-30 03:40:34 +02:00

U+FFFD is the input substitution character

* libguile/ports.c (UNICODE_REPLACEMENT_CHARACTER):
* libguile/ports.c (peek_utf8_codepoint)
  (scm_port_decode_char, peek_iconv_codepoint):
* module/ice-9/sports.scm (peek-char-and-len/utf8):
  (peek-char-and-len/iconv): Return U+FFFD when we get a decoding error
  when reading, instead of '?', in accordance with Unicode
  recommendations.
* test-suite/tests/iconv.test:
* test-suite/tests/ports.test:
* test-suite/tests/rdelim.test: Update tests.
* NEWS: Update.
Andy Wingo 2016-05-16 10:44:21 +02:00
parent da456d23be
commit 1e058add7b
7 changed files with 65 additions and 44 deletions

NEWS

@@ -71,6 +71,13 @@ raise an error on bad input. Guile now raises an error without
 advancing the read pointer. To skip over a bad encoding, set the port
 conversion strategy to "substitute" and read a substitute character.
+** Decoding errors with `substitute' strategy return U+FFFD
+It used to be that decoding errors with the `substitute' conversion
+strategy would replace the bad bytes with a `?' character. This has
+been changed to use the standard U+FFFD REPLACEMENT CHARACTER, in
+accordance with the Unicode recommendations.
 ** API to define new port types from C has changed
 See the newly expanded "I/O Extensions" in the manual, for full details.

View file

@@ -78,11 +78,17 @@ string doesn't depend on its context: the same byte sequence will always
 return the same string. A couple of modal encodings are in common use,
 like ISO-2022-JP and ISO-2022-KR, and they are not yet supported.
-Each port also has an associated conversion strategy: what to do when
-a Guile character can't be converted to the port's encoded character
-representation for output. There are three possible strategies: to
-raise an error, to replace the character with a hex escape, or to
-replace the character with a substitute character.
+@cindex port conversion strategy
+@cindex conversion strategy, port
+@cindex decoding error
+@cindex encoding error
+Each port also has an associated conversion strategy, which determines
+what to do when a Guile character can't be converted to the port's
+encoded character representation for output. There are three possible
+strategies: to raise an error, to replace the character with a hex
+escape, or to replace the character with a substitute character. Port
+conversion strategies are also used when decoding characters from an
+input port.
 Finally, all ports have associated input and output buffers, as
 appropriate. Buffering is a common strategy to limit the overhead of
@@ -142,14 +148,10 @@ its input and output. The value @code{#f} is equivalent to @code{"ISO-8859-1"}.
 @deffn {Scheme Procedure} set-port-conversion-strategy! port sym
 @deffnx {C Function} scm_set_port_conversion_strategy_x (port, sym)
-Sets the behavior of the interpreter when outputting a character that
-is not representable in the port's current encoding. @var{sym} can be
-either @code{'error}, @code{'substitute}, or @code{'escape}. If it is
-@code{'error}, an error will be thrown when an nonconvertible character
-is encountered. If it is @code{'substitute}, then nonconvertible
-characters will be replaced with approximate characters, or with
-question marks if no approximately correct character is available. If
-it is @code{'escape}, it will appear as a hex escape when output.
+Sets the behavior of Guile when outputting a character that is not
+representable in the port's current encoding, or when Guile encounters a
+decoding error when trying to read a character. @var{sym} can be either
+@code{error}, @code{substitute}, or @code{escape}.
 If @var{port} is an open port, the conversion error behavior
 is set for that port. If it is @code{#f}, it is set as the
@@ -157,15 +159,27 @@ default behavior for any future ports that get created in
 this thread.
 @end deffn
+For an output port, a there are three possible port conversion
+strategies. The @code{error} strategy will throw an error when a
+nonconvertible character is encountered. The @code{substitute} strategy
+will replace nonconvertible characters with a question mark (@samp{?}).
+Finally the @code{escape} strategy will print nonconvertible characters
+as a hex escape, using the escaping that is recognized by Guile's string
+syntax. Note that if the port's encoding is a Unicode encoding, like
+@code{UTF-8}, then encoding errors are impossible.
+For an input port, the @code{error} strategy will cause Guile to throw
+an error if it encounters an invalid encoding, such as might happen if
+you tried to read @code{ISO-8859-1} as @code{UTF-8}. The error is
+thrown before advancing the read position. The @code{substitute}
+strategy will replace the bad bytes with a U+FFFD replacement character,
+in accordance with Unicode recommendations. When reading from an input
+port, the @code{escape} strategy is treated as if it were @code{error}.
 @deffn {Scheme Procedure} port-conversion-strategy port
 @deffnx {C Function} scm_port_conversion_strategy (port)
-Returns the behavior of the port when outputting a character that is
-not representable in the port's current encoding. It returns the
-symbol @code{error} if unrepresentable characters should cause
-exceptions, @code{substitute} if the port should try to replace
-unrepresentable characters with question marks or approximate
-characters, or @code{escape} if unrepresentable characters should be
-converted to string escapes.
+Returns the behavior of the port when outputting a character that is not
+representable in the port's current encoding.
 If @var{port} is @code{#f}, then the current default behavior will be
 returned. New ports will have this default behavior when they are
@@ -179,9 +193,9 @@ and for other conversion routines such as @code{scm_to_stringn},
 @code{pointer->string}.
 Its value must be one of the symbols described above, with the same
-semantics: @code{'error}, @code{'substitute}, or @code{'escape}.
+semantics: @code{error}, @code{substitute}, or @code{escape}.
-When Guile starts, its value is @code{'substitute}.
+When Guile starts, its value is @code{substitute}.
 Note that @code{(set-port-conversion-strategy! #f @var{sym})} is
 equivalent to @code{(fluid-set! %default-port-conversion-strategy
@@ -226,13 +240,10 @@ interactive port that has no ready characters.
 @rnindex read-char
 @deffn {Scheme Procedure} read-char [port]
 @deffnx {C Function} scm_read_char (port)
-Return the next character available from @var{port}, updating
-@var{port} to point to the following character. If no more
-characters are available, the end-of-file object is returned.
-When @var{port}'s data cannot be decoded according to its character
-encoding, a @code{decoding-error} is raised and @var{port} is not
-advanced past the erroneous byte sequence.
+Return the next character available from @var{port}, updating @var{port}
+to point to the following character. If no more characters are
+available, the end-of-file object is returned. A decoding error, if
+any, is handled in accordance with the port's conversion strategy.
 @end deffn
 @deftypefn {C Function} size_t scm_c_read (SCM port, void *buffer, size_t size)
@@ -262,8 +273,8 @@ return the value returned by the preceding call to
 an interactive port will hang waiting for input whenever a call
 to @code{read-char} would have hung.
-As for @code{read-char}, a @code{decoding-error} may be raised
-if such a situation occurs.
+As for @code{read-char}, decoding errors are handled in accordance with
+the port's conversion strategy.
 @end deffn
 @deffn {Scheme Procedure} unread-char cobj [port]
@@ -627,9 +638,6 @@ Push the terminating delimiter (if any) back on to the port.
 Return a pair containing the string read from the port and the
 terminating delimiter or end-of-file object.
 @end table
-Like @code{read-char}, this procedure can throw to @code{decoding-error}
-(@pxref{Reading, @code{read-char}}).
 @end deffn
 @c begin (scm-doc-string "rdelim.scm" "read-line!")

libguile/ports.c

@@ -109,6 +109,12 @@ static SCM sym_substitute;
 static SCM sym_escape;
+/* See Unicode 8.0 section 5.22, "Best Practice for U+FFFD
+   Substitution". */
+static const scm_t_wchar UNICODE_REPLACEMENT_CHARACTER = 0xFFFD;
 static SCM trampoline_to_c_read_subr;
@@ -1590,7 +1596,7 @@ peek_utf8_codepoint (SCM port, size_t *len)
  decoding_error:
   if (scm_is_eq (SCM_PORT (port)->conversion_strategy, sym_substitute))
     /* *len already set. */
-    return '?';
+    return UNICODE_REPLACEMENT_CHARACTER;
   scm_decoding_error ("peek-char", EILSEQ, "input decoding error", port);
   /* Not reached. */
@@ -1648,7 +1654,7 @@ SCM_DEFINE (scm_port_decode_char, "port-decode-char", 4, 0, 0,
     return SCM_BOOL_F;
   else if (scm_is_eq (SCM_PORT (port)->conversion_strategy,
                       sym_substitute))
-    return SCM_MAKE_CHAR ('?');
+    return SCM_MAKE_CHAR (UNICODE_REPLACEMENT_CHARACTER);
   else
     scm_decoding_error ("decode-char", err, "input decoding error", port);
 }
@@ -1699,7 +1705,7 @@ peek_iconv_codepoint (SCM port, size_t *len)
 /* EOF found in the middle of a multibyte character. */
 if (scm_is_eq (SCM_PORT (port)->conversion_strategy,
                sym_substitute))
-  return '?';
+  return UNICODE_REPLACEMENT_CHARACTER;
 scm_decoding_error ("peek-char", EILSEQ,
                     "input decoding error", port);

module/ice-9/sports.scm

@@ -291,7 +291,7 @@
 (define (peek-char-and-len/utf8 port first-byte)
   (define (bad-utf8 len)
     (if (eq? (port-conversion-strategy port) 'substitute)
-        (values #\? len)
+        (values #\xFFFD len)
         (decoding-error "peek-char" port)))
   (if (< first-byte #x80)
       (values (integer->char first-byte) 1)
@@ -308,7 +308,7 @@
 (let ((len (bad-utf8-len bv cur buffering first-byte)))
   (when (zero? len) (error "internal error"))
   (if (eq? (port-conversion-strategy port) 'substitute)
-      (values #\? len)
+      (values #\xFFFD len)
       (decoding-error "peek-char" port))))
 (decode-utf8 bv cur buffering first-byte values bad-utf8))))))
@@ -327,7 +327,7 @@
 ((zero? prev-input-size)
  (values the-eof-object 0))
 ((eq? (port-conversion-strategy port) 'substitute)
- (values #\? prev-input-size))
+ (values #\xFFFD prev-input-size))
 (else
  (decoding-error "peek-char" port))))
 ((port-decode-char port (port-buffer-bytevector buf)

test-suite/tests/iconv.test

@@ -97,7 +97,7 @@
 (pass-if "misparse latin1 as utf8 with substitutions"
   (equal? (bytevector->string (string->bytevector s "latin1")
                               "utf-8" 'substitute)
-          "?t?"))
+          "\uFFFDt\uFFFD"))
 (pass-if-exception "misparse latin1 as ascii" exception:decoding-error
   (bytevector->string (string->bytevector s "latin1") "ascii"))))

test-suite/tests/ports.test

@@ -834,7 +834,7 @@
 ;; If `proc' is `read-char', this will
 ;; skip over the bad bytes.
 (let ((c (proc p)))
-  (unless (eqv? c #\?)
+  (unless (eqv? c #\xFFFD)
     (error "unexpected char" c))
   (set-port-conversion-strategy! p strategy)
   #t)))
@@ -846,7 +846,7 @@
 ((_ port (proc -> error))
  (if (eq? 'substitute
           (port-conversion-strategy port))
-     (eqv? (proc port) #\?)
+     (eqv? (proc port) #\xFFFD)
      (decoding-error? port proc)))
 ((_ port (proc -> eof))
  (eof-object? (proc port)))

test-suite/tests/rdelim.test

@@ -87,7 +87,7 @@
 (let ((p (open-bytevector-input-port #vu8(65 255 66 67 68))))
   (set-port-encoding! p "UTF-8")
   (set-port-conversion-strategy! p 'substitute)
-  (and (string=? (read-line p) "A?BCD")
+  (and (string=? (read-line p) "A\uFFFDBCD")
        (eof-object? (read-line p))))))