mirror of https://git.savannah.gnu.org/git/guile.git synced 2025-04-30 20:00:19 +02:00
guile/doc/mbapi.texi
2001-03-09 08:22:00 +00:00


\input texinfo
@setfilename mbapi.info
@settitle Multibyte API
@setchapternewpage off
@c Open issues:
@c What's the best way to report errors? Should functions return a
@c magic value, according to C tradition, or should they signal a
@c Guile exception?
@c
@node Working With Multibyte Strings in C
@chapter Working With Multibyte Strings in C
Guile allows strings to contain characters drawn from a wide variety of
languages, including many Asian, Eastern European, and Middle Eastern
languages, in a uniform and unrestricted way. The string representation
normally used in C code --- an array of @sc{ASCII} characters --- is not
sufficient for Guile strings, since they may contain characters not
present in @sc{ASCII}.
Instead, Guile uses a very large character set, and encodes each
character as a sequence of one or more bytes. We call this
variable-width encoding a @dfn{multibyte} encoding. Guile uses this
single encoding internally for all strings, symbol names, error
messages, etc., and performs appropriate conversions upon input and
output.
The use of this variable-width encoding is almost invisible to Scheme
code. Strings are still indexed by character number, not by byte
offset; @code{string-length} still returns the length of a string in
characters, not in bytes. @code{string-ref} and @code{string-set!} are
no longer guaranteed to be constant-time operations, but Guile uses
various strategies to reduce the impact of this change.
However, the encoding is visible via Guile's C interface, which gives
the user direct access to a string's bytes. This chapter explains how
to work with Guile multibyte text in C code. Since variable-width
encodings are clumsier to work with than simple fixed-width encodings,
Guile provides a set of standard macros and functions for manipulating
multibyte text to make the job easier. Furthermore, Guile makes some
promises about the encoding which you can use in writing your own text
processing code.
While we discuss guaranteed properties of Guile's encoding, and provide
functions to operate on its character set, we do not actually specify
either the character set or encoding here. This is because we expect
both of them to change in the future: currently, Guile uses the same
encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
as well) to use Unicode and UTF-8, with some extensions. This will make
it more comfortable to use Guile with other systems which use UTF-8,
like the GTK user interface toolkit.
@menu
* Multibyte String Terminology::
* Promised Properties of the Guile Multibyte Encoding::
* Functions for Operating on Multibyte Text::
* Multibyte Text Processing Errors::
* Why Guile Does Not Use a Fixed-Width Encoding::
@end menu
@node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
@section Multibyte String Terminology
In the descriptions which follow, we make the following definitions:
@table @dfn
@item byte
A @dfn{byte} is a number between 0 and 255. It has no inherent textual
interpretation. So 65 is a byte, not a character.
@item character
A @dfn{character} is a unit of text. It has no inherent numeric value.
@samp{A} and @samp{.} are characters, not bytes. (This is different
from the C language's definition of @dfn{character}; in this chapter, we
will always use a phrase like ``the C language's @code{char} type'' when
that's what we mean.)
@item character set
A @dfn{character set} is an invertible mapping between numbers and a
given set of characters. @sc{ASCII} is a character set assigning
characters to the numbers 0 through 127. It maps @samp{A} onto the
number 65, and @samp{.} onto 46.
Note that a character set maps characters onto numbers, @emph{not
necessarily} onto bytes. For example, the Unicode character set maps
the Greek lower-case @samp{alpha} character onto the number 945, which
is not a byte.
(This is what Internet standards would call a ``coded character set''.)
@item encoding
An @dfn{encoding} maps numbers onto sequences of bytes. For example, the
UTF-8 encoding, defined in the Unicode Standard, would map the number
945 onto the sequence of bytes @samp{206 177}. When using the
@sc{ASCII} character set, every number assigned also happens to be a
byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.
(This is what Internet standards would call a ``character encoding
scheme''.)
@end table
Thus, to turn a character into a sequence of bytes, you need a character
set to assign a number to that character, and then an encoding to turn
that number into a sequence of bytes.
Likewise, to interpret a sequence of bytes as a sequence of characters,
you use an encoding to extract a sequence of numbers from the bytes, and
then a character set to turn the numbers into characters.
Errors can occur while carrying out either of these processes. For
example, under a particular encoding, a given string of bytes might not
correspond to any number: the byte sequence @samp{128 128} is not a
valid encoding of any number under UTF-8.
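The two-step process described above (character set, then encoding) can be sketched in C. This is only an illustration using UTF-8, which is not necessarily Guile's internal encoding; the function name @code{utf8_encode} is hypothetical and is not part of libguile.

```c
#include <stddef.h>

/* Hypothetical helper, not a libguile function: encode the number C
   (as assigned by the Unicode character set) in UTF-8.  OUT must have
   room for 4 bytes.  Return the number of bytes written, or 0 if C
   has no UTF-8 encoding (a surrogate, or a value above 0x10FFFF).  */
size_t
utf8_encode (unsigned long c, unsigned char *out)
{
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      if (c >= 0xD800 && c <= 0xDFFF)
        return 0;
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c < 0x110000)
    {
      out[0] = (unsigned char) (0xF0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (c & 0x3F));
      return 4;
    }
  return 0;
}
```

Calling @code{utf8_encode (945, buf)} stores the bytes @samp{206 177} and returns 2, matching the Greek @samp{alpha} example in the table above.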
Having carefully defined our terminology, we will now abuse it.
We will sometimes use the word @dfn{character} to refer to the number
assigned to a character by a character set, in contexts where it's
obvious we mean a number.
Sometimes there is a close association between a particular encoding and
a particular character set. Thus, we may sometimes refer to the
character set and encoding together as an @dfn{encoding}.
@node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
@section Promised Properties of the Guile Multibyte Encoding
Internally, Guile uses a single encoding for all text --- symbols,
strings, error messages, etc. Here we list a number of helpful
properties of Guile's encoding. It is correct to write code which
assumes these properties; code which uses these assumptions will be
portable to all future versions of Guile, as far as we know.
@b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
the obvious way.} This means that a standard C string containing only
@sc{ASCII} characters is a valid Guile string (except for the terminator;
Guile strings store the length explicitly, so they can contain null
characters).
@b{The encodings of non-@sc{ASCII} characters use only bytes between 128
and 255.} That is, when we turn a non-@sc{ASCII} character into a
series of bytes, none of those bytes can ever be mistaken for the
encoding of an @sc{ASCII} character. This means that you can search a
Guile string for an @sc{ASCII} character using the standard
@code{memchr} library function. By extension, you can search for an
@sc{ASCII} substring in a Guile string using a traditional substring
search algorithm --- you needn't add special checks to verify encoding
boundaries, etc.
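As a minimal sketch of this promise, the following helper searches multibyte text for an @sc{ASCII} character with plain @code{memchr}. The name @code{find_ascii} is hypothetical, and the example assumes an encoding (such as UTF-8) in which non-@sc{ASCII} characters use only bytes from 128 to 255.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a libguile function: return the byte offset
   of the first occurrence of the ASCII character C in the LEN bytes at
   TEXT, or -1 if it does not occur.  No decoding is needed, because a
   byte below 128 can only be a complete ASCII character.  */
long
find_ascii (const unsigned char *text, size_t len, char c)
{
  const unsigned char *hit = memchr (text, (unsigned char) c, len);

  return hit ? (long) (hit - text) : -1;
}
```

Note that the result is a byte offset, not a character index.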
@b{No character encoding is a subsequence of any other character
encoding.} (This is just a stronger version of the previous promise.)
This means that you can search for occurrences of one Guile string
within another Guile string just as if they were raw byte strings. You
can use the stock @code{memmem} function (provided on GNU systems, at
least) for such searches. If you don't need the ability to represent
null characters in your text, you can still use null-termination for
strings, and use the traditional string-handling functions like
@code{strlen}, @code{strstr}, and @code{strcat}.
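Since @code{memmem} is not available everywhere, here is a sketch of a portable stand-in that treats both strings as raw bytes. The name @code{find_bytes} is hypothetical; the point is only that no knowledge of character boundaries is required, provided both arguments are valid multibyte strings.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical portable stand-in for GNU memmem, not a libguile
   function: return a pointer to the first occurrence of the NEEDLE_LEN
   bytes at NEEDLE within the HAY_LEN bytes at HAY, or NULL if there is
   none.  Both strings may contain null bytes.  */
const unsigned char *
find_bytes (const unsigned char *hay, size_t hay_len,
            const unsigned char *needle, size_t needle_len)
{
  size_t i;

  if (needle_len == 0)
    return hay;
  if (needle_len > hay_len)
    return NULL;
  for (i = 0; i + needle_len <= hay_len; i++)
    if (memcmp (hay + i, needle, needle_len) == 0)
      return hay + i;
  return NULL;
}
```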
@b{You can always determine the full length of a character's encoding
from its first byte.} Guile provides the macro @code{scm_mb_len} which
computes the encoding's length from its first byte. Given the first
rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
@var{b} <= 127}, returns 1.
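To show what such a computation can look like, here is a sketch for UTF-8, which has the same property. It is not the definition of @code{scm_mb_len}; the name @code{utf8_len_from_lead} is hypothetical, and unlike @code{scm_mb_len} it returns 0 for an invalid leading byte instead of leaving the behavior undefined.

```c
/* Hypothetical UTF-8 analogue of scm_mb_len, not a libguile function:
   return the length in bytes of the encoding whose first byte is B,
   or 0 if B cannot begin a character.  */
int
utf8_len_from_lead (unsigned char b)
{
  if (b < 0x80)
    return 1;                   /* ASCII: always a single byte */
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return 0;                     /* continuation byte or invalid lead */
}
```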
@b{Given an arbitrary byte position in a Guile string, you can always
find the beginning and end of the character containing that byte without
scanning too far in either direction.} This means that, if you are sure
a byte sequence is a valid encoding of a character sequence, you can
find character boundaries without keeping track of the beginning and
ending of the overall string. This promise relies on the fact that, in
addition to storing the string's length explicitly, Guile always either
terminates the string's storage with a zero byte, or shares it with
another string which is terminated this way.
@node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
@section Functions for Operating on Multibyte Text
Guile provides a variety of functions, variables, and types for working
with multibyte text.
@menu
* Basic Multibyte Character Processing::
* Finding Character Encoding Boundaries::
* Multibyte String Functions::
* Exchanging Guile Text With the Outside World in C::
* Implementing Your Own Text Conversions::
@end menu
@node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
@subsection Basic Multibyte Character Processing
Here are the essential types and functions for working with Guile text.
Guile uses the C type @code{unsigned char *} to refer to text encoded
with Guile's encoding.
Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times.
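The practical consequence is that arguments with side effects should be copied into a local variable first. The sketch below uses a hypothetical macro, @code{LEAD_LEN}, written for UTF-8; it stands in for any ``Libguile Macro'' and is not part of libguile.

```c
/* Hypothetical macro in the style of a "Libguile Macro": it mentions
   its argument several times, so the argument expression may be
   evaluated several times.  It assumes B is a valid UTF-8 leading
   byte.  */
#define LEAD_LEN(b) ((b) < 0x80 ? 1 : (b) < 0xE0 ? 2 : (b) < 0xF0 ? 3 : 4)

/* Advance *PP past one character and return the number of bytes
   skipped.  Passing an expression such as *(*pp)++ to the macro
   directly could increment the pointer more than once.  */
int
skip_char (const unsigned char **pp)
{
  unsigned char lead = **pp;    /* evaluate the argument exactly once */
  int n = LEAD_LEN (lead);

  *pp += n;
  return n;
}
```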
@deftp {Libguile Type} scm_char_t
This is a signed integral type large enough to hold any character in
Guile's character set. All character numbers are non-negative.
@end deftp
@deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
Return the character whose encoding starts at @var{p}. If @var{p} does
not point at a valid character encoding, the behavior is undefined.
@end deftypefn
@deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
Place the encoded form of the Guile character @var{c} at @var{p}, and
return its length in bytes. If @var{c} is not a Guile character, the
behavior is undefined.
@end deftypefn
@deftypevr {Libguile Constant} int scm_mb_max_len
The maximum length of any character's encoding, in bytes. You may
assume this is relatively small --- less than a dozen or so.
@end deftypevr
@deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
If @var{b} is the first byte of a character's encoding, return the full
length of the character's encoding, in bytes. If @var{b} is not a valid
leading byte, the behavior is undefined.
@end deftypefn
@deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
Return the length of the encoding of the character @var{c}, in bytes.
If @var{c} is not a valid Guile character, the behavior is undefined.
@end deftypefn
@deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
@deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
@deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
@deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
These are functions identical to the corresponding macros. You can use
them in situations where the overhead of a function call is acceptable,
and the cleaner semantics of function application are desirable.
@end deftypefn
@node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
@subsection Finding Character Encoding Boundaries
These are functions for finding the boundaries between characters in
multibyte text.
Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times, unless the definition promises
otherwise.
@deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
Return non-zero iff @var{p} points to the start of a character in
multibyte text.
This macro will evaluate its argument only once.
@end deftypefn
@deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
``Round'' @var{p} to the previous character boundary. That is, if
@var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding. If @var{p} points
to the start of the encoding of a Guile character, return @var{p}
unchanged.
@end deftypefn
@deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
``Round'' @var{p} to the next character boundary. That is, if @var{p}
points to the middle of the encoding of a Guile character, return a
pointer to the first byte of the encoding of the next character. If
@var{p} points to the start of the encoding of a Guile character, return
@var{p} unchanged.
@end deftypefn
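For illustration only, here is how the three boundary operations could be written for UTF-8, where every non-leading byte has the bit pattern @samp{10xxxxxx}. The names are hypothetical and Guile's own encoding may differ. The sketch assumes @var{p} points into valid text whose storage ends with a zero byte, as promised earlier, so the forward scan always stops.

```c
/* Hypothetical UTF-8 analogues of scm_mb_boundary_p, scm_mb_floor and
   scm_mb_ceiling; none of these is a libguile function.  They assume P
   points into valid UTF-8 text terminated by a zero byte.  */

/* Return non-zero if P points at the first byte of a character.  */
int
utf8_boundary_p (const unsigned char *p)
{
  return (*p & 0xC0) != 0x80;
}

/* Round P down to the start of the character containing it.  */
const unsigned char *
utf8_floor (const unsigned char *p)
{
  while (!utf8_boundary_p (p))
    p--;
  return p;
}

/* Round P up to the next character start, or return P unchanged if it
   is already at one.  */
const unsigned char *
utf8_ceiling (const unsigned char *p)
{
  while (!utf8_boundary_p (p))
    p++;
  return p;
}
```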
Note that it is usually not friendly for functions to silently correct
byte offsets that point into the middle of a character's encoding. Such
offsets almost always indicate a programming error, and they should be
reported as early as possible. So, when you write code which operates
on multibyte text, you should not use functions like these to ``clean
up'' byte offsets which the originator believes to be correct; instead,
your code should signal a @code{text:not-char-boundary} error as soon as
it detects an invalid offset. @xref{Multibyte Text Processing Errors}.
@node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
@subsection Multibyte String Functions
These functions allow you to operate on multibyte strings: sequences of
character encodings.
@deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
Return the number of Guile characters encoded by the @var{len} bytes at
@var{p}.
If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn
@deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
Return the character whose encoding starts at @code{*@var{pp}}, and
advance @code{*@var{pp}} to the start of the next character. Return -1
if @code{*@var{pp}} does not point to a valid character encoding.
@end deftypefn
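The following sketch shows a walk-style decoder and a character count built on top of it, again using UTF-8 as a stand-in for Guile's encoding. The names @code{utf8_walk} and @code{utf8_count} are hypothetical. Unlike @code{scm_mb_walk}, this version takes an explicit @var{end} pointer so that it never reads past the buffer, and @code{utf8_count} returns -1 where @code{scm_mb_count} would signal an error.

```c
/* Hypothetical UTF-8 analogue of scm_mb_walk, not a libguile function.
   Return the character whose encoding starts at *PP and advance *PP
   past it.  Return -1, leaving *PP unchanged, if the bytes before END
   do not begin with a valid, complete encoding.  */
long
utf8_walk (const unsigned char **pp, const unsigned char *end)
{
  const unsigned char *p = *pp;
  long c;
  int n, i;

  if (p >= end)
    return -1;
  if (p[0] < 0x80)
    { c = p[0]; n = 1; }
  else if (p[0] >= 0xC2 && p[0] <= 0xDF)
    { c = p[0] & 0x1F; n = 2; }
  else if (p[0] >= 0xE0 && p[0] <= 0xEF)
    { c = p[0] & 0x0F; n = 3; }
  else if (p[0] >= 0xF0 && p[0] <= 0xF4)
    { c = p[0] & 0x07; n = 4; }
  else
    return -1;
  if (end - p < n)
    return -1;                  /* incomplete encoding */
  for (i = 1; i < n; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return -1;
      c = (c << 6) | (p[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values above 0x10FFFF.  */
  if ((n == 3 && c < 0x800) || (n == 4 && c < 0x10000)
      || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
    return -1;
  *pp = p + n;
  return c;
}

/* Hypothetical analogue of scm_mb_count: return the number of
   characters encoded by the LEN bytes at P, or -1 on a bad or
   incomplete encoding.  */
int
utf8_count (const unsigned char *p, int len)
{
  const unsigned char *end = p + len;
  int n = 0;

  while (p < end)
    {
      if (utf8_walk (&p, end) < 0)
        return -1;
      n++;
    }
  return n;
}
```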
@deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
If @var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding. If @var{p} points
to the start of the encoding of a Guile character, return the start of
the previous character's encoding.
This is like @code{scm_mb_floor}, but the returned pointer will always
be before @var{p}. If you use this function to drive an iteration, it
guarantees backward progress.
@end deftypefn
@deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
If @var{p} points to the encoding of a Guile character, return a pointer
to the first byte of the encoding of the next character.
This is like @code{scm_mb_ceiling}, but the returned pointer will always
be after @var{p}. If you use this function to drive an iteration, it
guarantees forward progress.
@end deftypefn
@deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
Assuming that the @var{len} bytes starting at @var{p} are a
concatenation of valid character encodings, return a pointer to the
start of the @var{i}'th character encoding in the sequence.
This function scans the sequence from the beginning to find the
@var{i}'th character, and will generally require time proportional to
the distance from @var{p} to the returned address.
If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn
It is common to process the characters in a string from left to right.
However, if you fetch each character using @code{scm_mb_index}, each
call will scan the text from the beginning, so your loop will require
time proportional to at least the square of the length of the text. To
avoid this poor performance, you can use an @code{scm_mb_cache}
structure and the @code{scm_mb_index_cached} macro.
@deftp {Libguile Type} {struct scm_mb_cache}
This structure holds information that allows a string scanning operation
to use the results from a previous scan of the string. It has the
following members:
@table @code
@item character
An index, in characters, into the string.
@item byte
The index, in bytes, of the start of that character.
@end table
In other words, @code{byte} is the byte offset of the
@code{character}'th character of the string. Note that if @code{byte}
and @code{character} are equal, then all characters before that point
must have encodings exactly one byte long, and the string can be indexed
normally.
All elements of a @code{struct scm_mb_cache} structure should be
initialized to zero before its first use, and whenever the string's text
changes.
@end deftp
@deftypefn {Libguile Macro} const unsigned char *scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
@deftypefnx {Libguile Function} const unsigned char *scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
This macro and this function are identical to @code{scm_mb_index},
except that they may consult and update @code{*@var{cache}} in order to avoid
scanning the string from the beginning. @code{scm_mb_index_cached} is a
macro, so it may have less overhead than
@code{scm_mb_index_cached_func}, but it may evaluate its arguments more
than once.
Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
can scan a string from left to right, or from right to left, in time
proportional to the length of the string. As long as each character
fetched is less than some constant distance before or after the previous
character fetched with @var{cache}, each access will require constant
time.
@end deftypefn
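One way such a cache could work is sketched below for UTF-8. The structure mirrors the two members described above, but the names @code{struct mb_cache} and @code{utf8_index_cached} are hypothetical, and the actual strategy used by Guile is not specified here. The sketch assumes valid text and, unlike @code{scm_mb_index}, returns @code{NULL} for an out-of-range index instead of signalling an error.

```c
#include <stddef.h>

/* Hypothetical counterpart of struct scm_mb_cache: BYTE is the byte
   offset of the CHARACTER'th character.  Initialize both to zero.  */
struct mb_cache
{
  int character;
  int byte;
};

/* Hypothetical UTF-8 analogue of scm_mb_index_cached, not a libguile
   function.  Return a pointer to the I'th character in the LEN bytes
   at P, stepping forward or backward from the cached position.  */
const unsigned char *
utf8_index_cached (const unsigned char *p, int len, int i,
                   struct mb_cache *cache)
{
  int ch = cache->character;
  int by = cache->byte;

  if (i < 0)
    return NULL;
  /* Restart from the beginning when I is nearer to 0 than to the
     cached position.  */
  if (i < ch - i)
    ch = by = 0;
  while (ch < i && by < len)    /* step forward one character */
    {
      by++;
      while (by < len && (p[by] & 0xC0) == 0x80)
        by++;
      ch++;
    }
  while (ch > i)                /* step backward one character */
    {
      by--;
      while (by > 0 && (p[by] & 0xC0) == 0x80)
        by--;
      ch--;
    }
  if (ch != i)
    return NULL;                /* I is past the end of the text */
  cache->character = ch;
  cache->byte = by;
  return p + by;
}
```

With this scheme, a loop that asks for characters 0, 1, 2, and so on moves the cached position by one character per call, so the whole scan takes time proportional to the length of the text.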
Guile also provides functions to convert between an encoded sequence of
characters, and an array of @code{scm_char_t} objects.
@deftypefn {Libguile Function} scm_char_t *scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
Convert the variable-width text in the @var{len} bytes at @var{p}
to an array of @code{scm_char_t} values. Return a pointer to the array,
and set @code{*@var{result_len}} to the number of elements it contains.
The returned array is allocated with @code{malloc}, and it is the
caller's responsibility to free it.
If the text is not a sequence of valid character encodings, this
function will signal a @code{text:bad-encoding} error.
@end deftypefn
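The allocation contract described above (a @code{malloc}'d result that the caller frees) can be sketched as follows. UTF-8 stands in for Guile's encoding, @code{long} stands in for @code{scm_char_t}, and the name @code{utf8_to_fixed} is hypothetical. The sketch checks only the structure of each encoding, and returns @code{NULL} where the real function would signal @code{text:bad-encoding}.

```c
#include <stdlib.h>

/* Hypothetical UTF-8 analogue of scm_mb_multibyte_to_fixed, not a
   libguile function.  Convert the LEN bytes at P to a malloc'd array
   of character numbers and store the element count in *RESULT_LEN.
   Return NULL on allocation failure or malformed input.  */
long *
utf8_to_fixed (const unsigned char *p, int len, int *result_len)
{
  /* Each character occupies at least one byte, so LEN elements
     always suffice.  */
  long *out = malloc ((len > 0 ? (size_t) len : 1) * sizeof *out);
  int i = 0, n = 0;

  if (out == NULL)
    return NULL;
  while (i < len)
    {
      unsigned char b = p[i++];
      int extra = b < 0x80 ? 0
                  : (b & 0xE0) == 0xC0 ? 1
                  : (b & 0xF0) == 0xE0 ? 2
                  : (b & 0xF8) == 0xF0 ? 3 : -1;
      long c;

      if (extra < 0 || len - i < extra)
        goto bad;               /* invalid lead or truncated encoding */
      c = extra == 0 ? b : (b & (0x3F >> extra));
      for (; extra > 0; extra--, i++)
        {
          if ((p[i] & 0xC0) != 0x80)
            goto bad;
          c = (c << 6) | (p[i] & 0x3F);
        }
      out[n++] = c;
    }
  *result_len = n;
  return out;

 bad:
  free (out);
  return NULL;
}
```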
@deftypefn {Libguile Function} unsigned char *scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
Convert the array of @code{scm_char_t} values to a sequence of
variable-width character encodings. Return a pointer to the array of
bytes, and set @code{*@var{result_len}} to its length, in bytes.
The returned byte sequence is terminated with a zero byte, which is not
counted in the length returned in @code{*@var{result_len}}.
The returned byte sequence is allocated with @code{malloc}; it is the
caller's responsibility to free it.
If @var{fixed} contains a value which is not a valid Guile character,
this function will signal a @code{text:bad-encoding} error.
@end deftypefn
@node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
@subsection Exchanging Guile Text With the Outside World in C
[[This is kind of a heavy-weight model, given that one end of the
conversion is always going to be the Guile encoding. Any way to shorten
things a bit?]]
Guile provides functions for converting between Guile's internal text
representation and encodings popular in the outside world. These
functions are closely modeled after the @code{iconv} functions available
on some systems.
To convert text between two encodings, you should first call
@code{scm_mb_iconv_open} to indicate the source and destination
encodings; this function returns a context object which records the
conversion to perform.
Then, you should call @code{scm_mb_iconv} to actually convert the text.
This function expects input and output buffers, and a pointer to the
context you got from @code{scm_mb_iconv_open}. You don't need to pass
all your input to @code{scm_mb_iconv} at once; you can invoke it on
successive blocks of input (as you read it from a file, say), and it
will convert as much as it can each time, indicating when you should
grow your output buffer.
An encoding may be @dfn{stateless}, or @dfn{stateful}. In most
encodings, a contiguous group of bytes from the sequence completely
specifies a particular character; these are stateless encodings.
However, some encodings require you to look back an unbounded number of
bytes in the stream to assign a meaning to a particular byte sequence;
such encodings are stateful.
For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
byte sequence @samp{27 36 66} indicates that subsequent bytes should be
taken in pairs and interpreted as characters from the JIS-0208 character
set. An arbitrary number of byte pairs may follow this sequence. The
byte sequence @samp{27 40 66} indicates that subsequent bytes should be
interpreted as @sc{ASCII}. In this encoding, you cannot tell whether a
given byte is an @sc{ASCII} character without looking back an arbitrary
distance for the most recent escape sequence, so it is a stateful
encoding.
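To make the role of the state concrete, here is a deliberately small sketch that counts the characters in an @samp{ISO-2022-JP} byte stream. It recognizes only the two escape sequences mentioned above and assumes well-formed, complete input; the names @code{enum jp_state} and @code{iso2022jp_count} are hypothetical.

```c
#include <stddef.h>

/* The shift state: how bytes that are not part of an escape sequence
   should currently be interpreted.  */
enum jp_state { JP_ASCII, JP_JIS0208 };

/* Hypothetical helper, not a libguile function: count the characters
   in the LEN bytes at P, starting in *STATE and leaving the final
   state in *STATE.  Only ESC $ B (27 36 66) and ESC ( B (27 40 66)
   are recognized.  */
size_t
iso2022jp_count (const unsigned char *p, size_t len, enum jp_state *state)
{
  size_t i = 0, count = 0;

  while (i < len)
    {
      if (p[i] == 27 && len - i >= 3 && p[i + 1] == 36 && p[i + 2] == 66)
        {
          *state = JP_JIS0208;  /* subsequent bytes come in pairs */
          i += 3;
        }
      else if (p[i] == 27 && len - i >= 3
               && p[i + 1] == 40 && p[i + 2] == 66)
        {
          *state = JP_ASCII;    /* subsequent bytes are ASCII */
          i += 3;
        }
      else if (*state == JP_JIS0208 && len - i >= 2)
        {
          count++;
          i += 2;
        }
      else
        {
          count++;
          i++;
        }
    }
  return count;
}
```

The same two bytes count as two characters when the state is @code{JP_ASCII} and as one character when it is @code{JP_JIS0208}, which is exactly why the state must be carried along with the stream.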
In Guile, if a conversion involves a stateful encoding, the context
object carries any necessary state. Thus, you can have many independent
conversions to or from stateful encodings taking place simultaneously,
as long as each data stream uses its own context object for the
conversion.
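Because the Guile functions are modeled on @code{iconv}, the overall open, convert, close sequence can be illustrated with the POSIX @code{iconv} functions themselves. This sketch does not use the @code{scm_mb_iconv} interface; the helper name @code{convert_buffer} is hypothetical, and which encoding names are accepted depends on the host system's @code{iconv} implementation.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper built on POSIX iconv, not on scm_mb_iconv.
   Convert the IN_LEN bytes at IN from FROMCODE to TOCODE, writing at
   most OUT_SIZE bytes to OUT.  Return the number of bytes written, or
   (size_t) -1 on any failure.  */
size_t
convert_buffer (const char *tocode, const char *fromcode,
                const char *in, size_t in_len, char *out, size_t out_size)
{
  iconv_t cd = iconv_open (tocode, fromcode);
  char *inp = (char *) in;      /* iconv takes char **, but only reads */
  char *outp = out;
  size_t in_left = in_len, out_left = out_size;
  size_t rc;

  if (cd == (iconv_t) -1)
    return (size_t) -1;         /* unknown conversion */
  rc = iconv (cd, &inp, &in_left, &outp, &out_left);
  if (rc != (size_t) -1)
    /* A null input asks for the output to be returned to its initial
       shift state; this matters for stateful output encodings.  */
    rc = iconv (cd, NULL, NULL, &outp, &out_left);
  iconv_close (cd);
  return rc == (size_t) -1 ? (size_t) -1 : out_size - out_left;
}
```

Each call creates its own conversion descriptor, so several conversions can run independently, in the same way as separate Guile context objects.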
@deftp {Libguile Type} {struct scm_mb_iconv}
This is the type for context objects, which represent the encodings and
current state of an ongoing text conversion. A @code{struct
scm_mb_iconv} records the source and destination encodings, and keeps
track of any information needed to handle stateful encodings.
@end deftp
@deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
Return a pointer to a new @code{struct scm_mb_iconv} context object,
ready to convert from the encoding named @var{fromcode} to the encoding
named @var{tocode}. For stateful encodings, the context object is in
some appropriate initial state, ready for use with the
@code{scm_mb_iconv} function.
When you are done using a context object, you may call
@code{scm_mb_iconv_close} to free it.
If either @var{tocode} or @var{fromcode} is not the name of a known
encoding, this function will signal the @code{text:unknown-conversion}
error, described below.
@c Try to use names here from the IANA list:
@c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Guile supports at least these encodings:
@table @samp
@item US-ASCII
@sc{US-ASCII}, in the standard one-character-per-byte encoding.
@item ISO-8859-1
The usual character set for Western European languages, in its usual
one-character-per-byte encoding.
@item Guile-MB
Guile's current internal multibyte encoding. The actual encoding this
name refers to will change from one version of Guile to the next. You
should use this when converting data between external sources and the
encoding used by Guile objects.
You should @emph{not} use this as the encoding for data presented to the
outside world, for two reasons. 1) Its meaning will change over time,
so data written using the @samp{Guile-MB} encoding with one version of
Guile might not be readable with the @samp{Guile-MB} encoding in another
version of Guile. 2) It currently corresponds to @samp{Emacs-Mule},
which was invented for Emacs's internal use, and was never intended to serve
as an exchange medium.
@item Guile-Wide
Guile's character set, as an array of @code{scm_char_t} values.
Note that this encoding is even less suitable for public use than
@samp{Guile-MB}, since the exact sequence of bytes depends heavily on the
size and endianness the host system uses for @code{scm_char_t}. Using
this encoding is very much like calling the
@code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
functions, except that @code{scm_mb_iconv} gives you more control over
buffer allocation and management.
@item Emacs-Mule
This is the variable-length encoding used for multi-lingual text by GNU
Emacs, at least through version 20.4. You probably should not use this
encoding, as it is designed only for Emacs's internal use. However, we
provide it here because it's trivial to support, and some people
probably do have @samp{emacs-mule}-format files lying around.
@end table
(At the moment, this list doesn't include any character sets suitable for
external use that can actually handle multilingual data; this is
unfortunate, as it encourages users to write data in Emacs-Mule format,
which nobody but Emacs and Guile understands. We hope to add support
for Unicode in UTF-8 soon, which should solve this problem.)
Case is not significant in encoding names.
You can define your own conversions; see @ref{Implementing Your Own Text
Conversions}.
@end deftypefn
@deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
Return a non-zero value if Guile supports the encoding named @var{encoding}, and zero otherwise.
@end deftypefn
@deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert a sequence of characters from one encoding to another. The
argument @var{context} specifies the encodings to use for the input and
output, and carries state for stateful encodings; use
@code{scm_mb_iconv_open} to create a @var{context} object for a
particular conversion.
Upon entry to the function, @code{*@var{inbuf}} should point to the
input buffer, and @code{*@var{inbytesleft}} should hold the number of
input bytes present in the buffer; @code{*@var{outbuf}} should point to
the output buffer, and @code{*@var{outbytesleft}} should hold the number
of bytes available to hold the conversion results in that buffer.
Upon exit from the function, @code{*@var{inbuf}} points to the first
unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
the last output byte, and @code{*@var{outbytesleft}} holds the number of
bytes left unused in the output buffer.
For stateful encodings, @var{context} carries encoding state from one
call to @code{scm_mb_iconv} to the next. Thus, successive calls to
@code{scm_mb_iconv} which use the same context object can convert a
stream of data one chunk at a time.
If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
taken as a request to reset the states of the input and the output
encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
buffer to put the output encoding in its initial state. If the output
buffer is not large enough to hold this byte sequence,
@code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
the shift states of @var{context}'s input and output encodings
unchanged.
The @code{scm_mb_iconv} function always consumes only complete
characters or shift sequences from the input buffer, and the output
buffer always contains a sequence of complete characters or escape
sequences.
If the input sequence contains characters which are not expressible in
the output encoding, @code{scm_mb_iconv} converts them in an
implementation-defined way. It may simply delete such characters.
Some encodings use byte sequences which do not correspond to any textual
character. For example, the escape sequence of a stateful encoding has
no textual meaning. When converting from such an encoding, a call to
@code{scm_mb_iconv} might consume input but produce no output, since the
input sequence might contain only escape sequences.
Normally, @code{scm_mb_iconv} returns the number of input characters it
could not convert perfectly to the output encoding. However, it may
return one of the @code{scm_mb_iconv_} codes described below, to
indicate an error. All of these codes are negative values.
If the input sequence contains an invalid character encoding, conversion
stops before the invalid input character, and @code{scm_mb_iconv}
returns the constant value @code{scm_mb_iconv_bad_encoding}.
If the input sequence ends with an incomplete character encoding,
@code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
return the constant value @code{scm_mb_iconv_incomplete_encoding}. This
is not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the encoding
fragment.
If the output buffer does not contain enough room to hold the converted
form of the complete input text, @code{scm_mb_iconv} converts as much as
it can, changes the input and output pointers to reflect the amount of
text successfully converted, and then returns
@code{scm_mb_iconv_too_big}.
@end deftypefn
Here are the status codes that might be returned by @code{scm_mb_iconv}.
They are all negative integers.
@table @code
@item scm_mb_iconv_too_big
The conversion needs more room in the output buffer. Some characters
may have been consumed from the input buffer, and some characters may
have been placed in the available space in the output buffer.
@item scm_mb_iconv_bad_encoding
@code{scm_mb_iconv} encountered an invalid character encoding in the
input buffer. Conversion stopped before the invalid character, so there
may be some characters consumed from the input buffer, and some
converted text in the output buffer.
@item scm_mb_iconv_incomplete_encoding
The input buffer ends with an incomplete character encoding. The
incomplete encoding is left in the input buffer, unconsumed. This is
not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the incomplete
encoding.
@end table
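A typical caller reacts to these codes by growing the output buffer and calling the converter again. The sketch below shows that loop with the POSIX @code{iconv} function, whose @code{E2BIG}, @code{EILSEQ}, and @code{EINVAL} errors play roughly the roles of @code{scm_mb_iconv_too_big}, @code{scm_mb_iconv_bad_encoding}, and @code{scm_mb_iconv_incomplete_encoding}. That correspondence is an assumption based on the stated modeling after @code{iconv}, and the helper name @code{convert_growing} is hypothetical.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper built on POSIX iconv, not on scm_mb_iconv.
   Convert the IN_LEN bytes at IN using the open descriptor CD.
   Return a malloc'd buffer holding the result and store its length in
   *OUT_LEN, or return NULL on failure.  */
char *
convert_growing (iconv_t cd, const char *in, size_t in_len, size_t *out_len)
{
  size_t size = 4, used = 0;    /* deliberately tiny, to force growth */
  char *buf = malloc (size);
  char *inp = (char *) in;      /* iconv takes char **, but only reads */
  size_t in_left = in_len;

  if (buf == NULL)
    return NULL;
  while (in_left > 0)
    {
      char *outp = buf + used;
      size_t out_left = size - used;
      size_t rc = iconv (cd, &inp, &in_left, &outp, &out_left);
      char *bigger;

      used = size - out_left;   /* keep whatever was converted */
      if (rc != (size_t) -1)
        break;                  /* all input consumed */
      if (errno != E2BIG)
        {
          free (buf);           /* EILSEQ or EINVAL: give up */
          return NULL;
        }
      bigger = realloc (buf, size * 2);
      if (bigger == NULL)
        {
          free (buf);
          return NULL;
        }
      buf = bigger;
      size *= 2;
    }
  *out_len = used;
  return buf;
}
```

A streaming caller would treat the incomplete-encoding case differently: instead of giving up, it would keep the unconsumed bytes and retry once more input has arrived.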
Finally, Guile provides a function for destroying conversion contexts.
@deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
Deallocate the conversion context object @var{context}, and all other
resources allocated by the call to @code{scm_mb_iconv_open} which
returned @var{context}.
@end deftypefn
@node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
@subsection Implementing Your Own Text Conversions
[[note that conversions to and from Guile must produce streams
containing only valid character encodings, or else Guile will crash]]
This section describes the interface for adding your own encoding
conversions for use with @code{scm_mb_iconv}. The interface here is
borrowed from the GNOME Project's @file{libunicode} library.
Guile's @code{scm_mb_iconv} function works by converting the input text
to a stream of @code{scm_char_t} characters, and then converting
those characters to the desired output encoding. This makes it easy
for Guile to choose the appropriate conversion back ends for an
arbitrary pair of input and output encodings, but it also means that the
accuracy and quality of the conversions depend on the fidelity of
Guile's internal character set to the source and destination encodings.
Since @code{scm_mb_iconv} will be used almost exclusively for converting
to and from Guile's internal character set, this shouldn't be a problem.
To add support for a particular encoding to Guile, you must provide one
function (called the @dfn{read} function) which converts from your
encoding to an array of @code{scm_char_t}'s, and another function
(called the @dfn{write} function) to convert from an array of
@code{scm_char_t}'s back into your encoding. To convert from some
encoding @var{a} to some other encoding @var{b}, Guile pairs up
@var{a}'s read function with @var{b}'s write function. Each call to
@code{scm_mb_iconv} passes text in encoding @var{a} through the read
function, to produce an array of @code{scm_char_t}'s, and then passes
that array to the write function, to produce text in encoding @var{b}.
For stateful encodings, a read or write function can hang its own data
structures off the conversion object, and provide its own functions to
allocate and destroy them; this allows read and write functions to
maintain whatever state they like.
The Guile conversion back end represents each available encoding with a
@code{struct scm_mb_encoding} object.
@deftp {Libguile Type} {struct scm_mb_encoding}
This data structure describes an encoding. It has the following
members:
@table @code
@item char **names
An array of strings, giving the various names for this encoding. The
array should be terminated by a zero pointer. Case is not significant
in encoding names.
The @code{scm_mb_iconv_open} function searches the list of registered
encodings for an encoding whose @code{names} array matches its
@var{tocode} or @var{fromcode} argument.
@item int (*init) (void **@var{cookie})
An initialization function for the encoding's private data.
@code{scm_mb_iconv_open} will call this function, passing it the address
of the cookie for this encoding in this context. (We explain cookies
below.) There is no way for the @code{init} function to tell whether
the encoding will be used for reading or writing.
Note that @code{init} receives a @emph{pointer} to the cookie, not the
cookie itself. Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.
The @code{init} member may be zero, indicating that no initialization is
necessary for this encoding.
@item int (*destroy) (void **@var{cookie})
A deallocation function for the encoding's private data.
@code{scm_mb_iconv_close} calls this function, passing it the address of
the cookie for this encoding in this context. The @code{destroy}
function should free any data the @code{init} function allocated.
Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
cookie itself. Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.
The @code{destroy} member may be zero, indicating that this encoding
doesn't need to perform any special action to destroy its local data.
@item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
Put the encoding into its initial shift state. Guile calls this
function whether the encoding is being used for input or output, so this
should take appropriate steps for both directions. If @var{outbuf} and
@var{outbytesleft} are valid, the reset function should emit an escape
sequence to reset the output stream to its initial state; @var{outbuf}
and @var{outbytesleft} should be handled just as for
@code{scm_mb_iconv}.
This function can return an @code{scm_mb_iconv_} error code
(@pxref{Exchanging Guile Text With the Outside World in C}). If it
returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
state must be left unchanged.
Note that @code{reset} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.
The @code{reset} member may be zero, indicating that this encoding
doesn't use a shift state.
@item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
Read some bytes and convert into an array of Guile characters. This is
the encoding's read function.
On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to
be converted, and *@var{outcharsleft} characters available at
*@var{outbuf} to hold the results.
On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes
still not consumed. *@var{outcharsleft} and *@var{outbuf} indicate the
output buffer space still not filled. (By exclusion, these indicate
which input bytes were consumed, and which output characters were
produced.)
Return one of the @code{enum scm_mb_read_result} values, described below.
Note that @code{read} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.
@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert an array of Guile characters to output bytes. This is
the encoding's write function.
On entry, there are *@var{incharsleft} Guile characters available at
*@var{inbuf}, and *@var{outbytesleft} bytes available to store output at
*@var{outbuf}.
On exit, *@var{incharsleft} and *@var{inbuf} indicate the Guile
characters left unconverted (because there was insufficient room in the
output buffer to hold their converted forms), and *@var{outbytesleft}
and *@var{outbuf} indicate the unused portion of the output buffer.
Return one of the @code{enum scm_mb_write_result} values, described below.
Note that @code{write} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.
@item struct scm_mb_encoding *next
This is used by Guile to maintain a linked list of encodings. It is
filled in when you call @code{scm_mb_register_encoding} to add your
encoding to the list.
@end table
@end deftp
Here is the enumerated type for the values an encoding's read function
can return:
@deftp {Libguile Type} {enum scm_mb_read_result}
This type represents the result of a call to an encoding's read
function. It has the following values:
@table @code
@item scm_mb_read_ok
The read function consumed at least one byte of input.
@item scm_mb_read_incomplete
The data present in the input buffer does not contain a complete
character encoding. No input was consumed, and no characters were
produced as output. This is not necessarily an error status, if there
is more data to pass through.
@item scm_mb_read_error
The input contains an invalid character encoding.
@end table
@end deftp
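To illustrate these result codes, here is a minimal sketch of a read
function for a hypothetical stateless encoding, invented for this
example, which stores each character number as four big-endian bytes.
The @code{scm_char_t} typedef, the order of the enumerators, the limit
@code{TOY_MAX_CHAR}, and the name @code{toy_read} are assumptions made so
that the fragment compiles on its own; they are not taken from Guile's
headers.

```c
#include <stddef.h>

/* Stand-ins for Guile's declarations, so the sketch compiles on its
   own.  The real definitions may differ.  */
typedef unsigned long scm_char_t;

enum scm_mb_read_result
{
  scm_mb_read_ok,
  scm_mb_read_incomplete,
  scm_mb_read_error
};

/* Hypothetical upper bound on character numbers, for illustration.  */
#define TOY_MAX_CHAR 0x7fffffffUL

/* Read function for the made-up encoding: four big-endian bytes per
   character.  The encoding is stateless, so COOKIE is unused.  */
enum scm_mb_read_result
toy_read (void *cookie,
          const char **inbuf, size_t *inbytesleft,
          scm_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;

  while (*inbytesleft >= 4 && *outcharsleft > 0)
    {
      const unsigned char *p = (const unsigned char *) *inbuf;
      scm_char_t c = ((scm_char_t) p[0] << 24) | ((scm_char_t) p[1] << 16)
                     | ((scm_char_t) p[2] << 8) | (scm_char_t) p[3];

      /* Stop before an invalid character; anything converted so far
         stays in the output buffer.  */
      if (c > TOY_MAX_CHAR)
        return scm_mb_read_error;

      *(*outbuf)++ = c;
      (*outcharsleft)--;
      *inbuf += 4;
      *inbytesleft -= 4;
      consumed += 4;
    }

  /* Nothing consumed and only a fragment of a character is available:
     leave it in place and ask for more data.  */
  if (consumed == 0 && *inbytesleft > 0 && *inbytesleft < 4)
    return scm_mb_read_incomplete;

  /* Assumption: the text does not say what to return when the caller
     supplies no input or no output space; this sketch reports
     scm_mb_read_ok in that case.  */
  return scm_mb_read_ok;
}
```

A trailing fragment that follows some complete characters is reported on
the next call: the first call consumes the complete characters and
returns @code{scm_mb_read_ok}, and the second, finding only the fragment,
returns @code{scm_mb_read_incomplete}.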
Here is the enumerated type for the values an encoding's write function
can return:
@deftp {Libguile Type} {enum scm_mb_write_result}
This type represents the result of a call to an encoding's write
function. It has the following values:
@table @code
@item scm_mb_write_ok
The write function was able to convert all the characters in @var{inbuf}
successfully.
@item scm_mb_write_too_big
The write function filled the output buffer, but there are still
characters in @var{inbuf} left unconsumed; @var{inbuf} and
@var{incharsleft} indicate the unconsumed portion of the input buffer.
@end table
@end deftp
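As a companion illustration, here is a minimal sketch of a write function
for a hypothetical stateless encoding that stores each character number
as four big-endian bytes. The @code{scm_char_t} typedef, the order of
the enumerators, and the name @code{toy_write} are assumptions made so
that the fragment compiles on its own; they are not taken from Guile's
headers.

```c
#include <stddef.h>

/* Stand-ins for Guile's declarations, so the sketch compiles on its
   own.  The real definitions may differ.  */
typedef unsigned long scm_char_t;

enum scm_mb_write_result
{
  scm_mb_write_ok,
  scm_mb_write_too_big
};

/* Write function for the made-up encoding: four big-endian bytes per
   character.  The encoding is stateless, so COOKIE is unused.  */
enum scm_mb_write_result
toy_write (void *cookie,
           scm_char_t **inbuf, size_t *incharsleft,
           char **outbuf, size_t *outbytesleft)
{
  (void) cookie;

  while (*incharsleft > 0)
    {
      scm_char_t c = **inbuf;
      unsigned char *q = (unsigned char *) *outbuf;

      /* No room for a whole character: leave it unconsumed, so that
         *inbuf and *incharsleft describe what remains.  */
      if (*outbytesleft < 4)
        return scm_mb_write_too_big;

      q[0] = (unsigned char) ((c >> 24) & 0xff);
      q[1] = (unsigned char) ((c >> 16) & 0xff);
      q[2] = (unsigned char) ((c >> 8) & 0xff);
      q[3] = (unsigned char) (c & 0xff);

      *outbuf += 4;
      *outbytesleft -= 4;
      (*inbuf)++;
      (*incharsleft)--;
    }

  return scm_mb_write_ok;
}
```

The sketch never writes part of a character: when fewer than four bytes
remain, it stops and returns @code{scm_mb_write_too_big}, so the caller
can supply a fresh output buffer and call again with the same
@var{inbuf}.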
Conversions to or from stateful encodings need to keep track of each
encoding's current state. Each conversion context contains two
@code{void *} variables called @dfn{cookies}, one for the input
encoding, and one for the output encoding. These cookies are passed to
the encodings' functions, for them to use however they please. A
stateful encoding can use its cookie to hold a pointer to some object
which maintains the context's current shift state. Stateless encodings
will probably not use their cookies.
The cookies' lifetime is the same as that of the context object. When
the user calls @code{scm_mb_iconv_close} to destroy a context object,
@code{scm_mb_iconv_close} calls the input and output encodings'
@code{destroy} functions, passing them their respective cookies, so each
encoding can free any data it allocated for that context.
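The following sketch shows how a stateful encoding might manage its
cookie through @code{init} and @code{destroy} functions. The
@code{struct toy_state} type and the function names are hypothetical, and
the return convention (zero for success, nonzero for failure) is an
assumption, since the member descriptions above do not say what the
@code{int} result means.

```c
#include <stdlib.h>

/* Hypothetical per-context state for a stateful encoding.  */
struct toy_state
{
  int shifted;   /* nonzero while in the alternate shift state */
};

/* init: allocate the state and store it in this context's cookie.
   Note that COOKIE is the address of the cookie, not the cookie.
   Returning zero for success is an assumption of this sketch.  */
int
toy_init (void **cookie)
{
  struct toy_state *state = malloc (sizeof *state);

  if (state == NULL)
    return -1;

  state->shifted = 0;   /* start in the initial shift state */
  *cookie = state;
  return 0;
}

/* destroy: free whatever init allocated, and clear the cookie so
   that a stale pointer cannot be reused by accident.  */
int
toy_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}
```

The encoding's @code{read}, @code{write}, and @code{reset} functions
would then receive the cookie's value itself and cast it back to
@code{struct toy_state *}.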
Note that, if a read or write function returns a successful result code
like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
input and the output must together represent the complete
input text; the encoding may not store any text temporarily in its
cookie. This is because, if @code{scm_mb_iconv} returns a successful
result to the user, it is correct for the user to assume that all the
consumed input has been converted and placed in the output buffer.
There is no ``flush'' operation to push any final results out of the
encodings' buffers.
Here is the function you call to register a new encoding with the
conversion system:
@deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
Add the encoding described by @code{*@var{encoding}} to the set
understood by @code{scm_mb_iconv_open}. Once you have registered your
encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
the names in @code{@var{encoding}->names}.
@end deftypefn
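Registration and lookup can be pictured with the sketch below. It is
not Guile's implementation: @code{struct toy_encoding} is a reduced
stand-in for @code{struct scm_mb_encoding} holding only the @code{names}
and @code{next} members, all the function names are hypothetical, and
adding new encodings at the head of the list is an assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Reduced stand-in for struct scm_mb_encoding: only the members this
   sketch needs.  */
struct toy_encoding
{
  char **names;                /* array of names, ended by a zero pointer */
  struct toy_encoding *next;   /* link in the list of registered encodings */
};

/* Head of the list of registered encodings.  */
static struct toy_encoding *toy_encodings = NULL;

/* Return nonzero if A and B are the same name, ignoring case.  */
static int
toy_names_equal (const char *a, const char *b)
{
  while (*a && *b)
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  /* Equal only if both names ended at the same point.  */
  return *a == *b;
}

/* Add ENCODING to the list; this fills in its next member.  */
void
toy_register_encoding (struct toy_encoding *encoding)
{
  encoding->next = toy_encodings;
  toy_encodings = encoding;
}

/* Find the registered encoding that has NAME among its names, or
   return a null pointer if there is none.  */
struct toy_encoding *
toy_find_encoding (const char *name)
{
  struct toy_encoding *e;
  char **n;

  for (e = toy_encodings; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; n++)
      if (toy_names_equal (*n, name))
        return e;

  return NULL;
}
```

Because the list links through the structure itself, the object passed
to the registration function must remain allocated for as long as the
encoding is in use.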
@node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
@section Multibyte Text Processing Errors
This section describes error conditions which code can signal to
indicate problems encountered while processing multibyte text. In each
case, the arguments @var{message} and @var{args} are an error format
string and arguments to be substituted into the string, as accepted by
the @code{display-error} function.
@deffn Condition text:not-char-boundary func message args object offset
By calling @var{func}, the program attempted to access a character at
byte offset @var{offset} in the Guile object @var{object}, but
@var{offset} is not the start of a character's encoding in @var{object}.
Typically, @var{object} is a string or symbol. If the function signalling
the error cannot find the Guile object that contains the text it is
inspecting, it should use @code{#f} for @var{object}.
@end deffn
@deffn Condition text:bad-encoding func message args object
By calling @var{func}, the program attempted to interpret the text in
@var{object}, but @var{object} contains a byte sequence which is not a
valid encoding for any character.
@end deffn
@deffn Condition text:not-guile-char func message args number
By calling @var{func}, the program attempted to treat @var{number} as the
number of a character in the Guile character set, but @var{number} does
not correspond to any character in the Guile character set.
@end deffn
@deffn Condition text:unknown-conversion func message args from to
By calling @var{func}, the program attempted to convert from an encoding
named @var{from} to an encoding named @var{to}, but Guile does not
support such a conversion.
@end deffn
@deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
@deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
@deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
These variables hold the Scheme symbol objects whose names are the
condition symbols above. You can use these when signalling these
errors, instead of looking them up yourself.
@end deftypevr
@node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C
@section Why Guile Does Not Use a Fixed-Width Encoding
Multibyte encodings are clumsier to work with than encodings which use a
fixed number of bytes for every character. For example, using a
fixed-width encoding, we can extract the @var{i}th character of a string
in constant time, and we can always substitute the @var{i}th character
of a string with any other character without reallocating or copying the
string.
However, there are no fixed-width encodings which include the characters
we wish to include, and also fit in a reasonable amount of space.
Despite the Unicode standard's claims to the contrary, Unicode is not
really a fixed-width encoding. Unicode uses surrogate pairs to
represent characters outside the 16-bit range; a surrogate pair must be
treated as a single character, but occupies two 16-bit spaces. As of
this writing, there are already plans to assign characters to the
surrogate character codes. Three- and four-byte encodings are
too wasteful for a majority of Guile's users, who only need @sc{ASCII}
and a few accented characters.
Another alternative would be to have several different fixed-width
string representations, each with a different element size. For each
string, Guile would use the smallest element size capable of
accommodating the string's text. This would allow users of English and
the Western European languages to use the traditional memory-efficient
encodings. However, if Guile has @var{n} string representations, then
users must write @var{n} versions of any code which manipulates text
directly --- one for each element size. And if a user wants to operate
on two strings simultaneously, and wants to avoid testing the string
sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
Most users will simply not bother. Instead, they will write code which
supports only one string size, leaving us back where we started. By
using a single internal representation, Guile makes it easier for users
to write multilingual code.
[[What about tagging each string with its encoding?
"Every extension must be written to deal with every encoding"]]
[[You don't really want to index strings anyway.]]
Finally, Guile's multibyte encoding is not so bad. Unlike a two- or
four-byte encoding, it is efficient in space for American and European
users. Furthermore, the properties described above mean that many
functions can be coded just as they would for a single-byte encoding;
see @ref{Promised Properties of the Guile Multibyte Encoding}.
@bye