\input texinfo
@setfilename mbapi.info
@settitle Multibyte API
@setchapternewpage off

@c Open issues:

@c What's the best way to report errors? Should functions return a
@c magic value, according to C tradition, or should they signal a
@c Guile exception?

@c


@node Working With Multibyte Strings in C
@chapter Working With Multibyte Strings in C

Guile allows strings to contain characters drawn from a wide variety of
languages, including many Asian, Eastern European, and Middle Eastern
languages, in a uniform and unrestricted way. The string representation
normally used in C code --- an array of @sc{ASCII} characters --- is not
sufficient for Guile strings, since they may contain characters not
present in @sc{ASCII}.

Instead, Guile uses a very large character set, and encodes each
character as a sequence of one or more bytes. We call this
variable-width encoding a @dfn{multibyte} encoding. Guile uses this
single encoding internally for all strings, symbol names, error
messages, etc., and performs appropriate conversions upon input and
output.

The use of this variable-width encoding is almost invisible to Scheme
code. Strings are still indexed by character number, not by byte
offset; @code{string-length} still returns the length of a string in
characters, not in bytes. @code{string-ref} and @code{string-set!} are
no longer guaranteed to be constant-time operations, but Guile uses
various strategies to reduce the impact of this change.

However, the encoding is visible via Guile's C interface, which gives
the user direct access to a string's bytes. This chapter explains how
to work with Guile multibyte text in C code. Since variable-width
encodings are clumsier to work with than simple fixed-width encodings,
Guile provides a set of standard macros and functions for manipulating
multibyte text to make the job easier. Furthermore, Guile makes some
promises about the encoding which you can use in writing your own text
processing code.

While we discuss guaranteed properties of Guile's encoding, and provide
functions to operate on its character set, we do not actually specify
either the character set or encoding here. This is because we expect
both of them to change in the future: currently, Guile uses the same
encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
as well) to use Unicode and UTF-8, with some extensions. This will make
it more comfortable to use Guile with other systems which use UTF-8,
like the GTK user interface toolkit.

@menu
* Multibyte String Terminology::
* Promised Properties of the Guile Multibyte Encoding::
* Functions for Operating on Multibyte Text::
* Multibyte Text Processing Errors::
* Why Guile Does Not Use a Fixed-Width Encoding::
@end menu


@node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
@section Multibyte String Terminology

In the descriptions which follow, we make the following definitions:
@table @dfn

@item byte
A @dfn{byte} is a number between 0 and 255. It has no inherent textual
interpretation. So 65 is a byte, not a character.

@item character
A @dfn{character} is a unit of text. It has no inherent numeric value.
@samp{A} and @samp{.} are characters, not bytes. (This is different
from the C language's definition of @dfn{character}; in this chapter, we
will always use a phrase like ``the C language's @code{char} type'' when
that's what we mean.)

@item character set
A @dfn{character set} is an invertible mapping between numbers and a
given set of characters. @sc{ASCII} is a character set assigning
characters to the numbers 0 through 127. It maps @samp{A} onto the
number 65, and @samp{.} onto 46.

Note that a character set maps characters onto numbers, @emph{not
necessarily} onto bytes. For example, the Unicode character set maps
the Greek lower-case @samp{alpha} character onto the number 945, which
is not a byte.

(This is what Internet standards would call a ``coded character set''.)

@item encoding
An encoding maps numbers onto sequences of bytes. For example, the
UTF-8 encoding, defined in the Unicode Standard, would map the number
945 onto the sequence of bytes @samp{206 177}. When using the
@sc{ASCII} character set, every number assigned also happens to be a
byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.

(This is what Internet standards would call a ``character encoding
scheme''.)

@end table

Thus, to turn a character into a sequence of bytes, you need a character
set to assign a number to that character, and then an encoding to turn
that number into a sequence of bytes.

Likewise, to interpret a sequence of bytes as a sequence of characters,
you use an encoding to extract a sequence of numbers from the bytes, and
then a character set to turn the numbers into characters.

Errors can occur while carrying out either of these processes. For
example, under a particular encoding, a given string of bytes might not
correspond to any number: the byte sequence @samp{128 128}
is not a valid encoding of any number under UTF-8.
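
The two steps can be sketched in C. This is an illustration, not part
of Guile: it assumes Unicode as the character set and UTF-8 as the
encoding, the helper names @code{utf8_encode} and
@code{utf8_is_lead_byte} are hypothetical, and only numbers below 65536
are handled. It reproduces the mapping of 945 onto @samp{206 177}, and
shows why @samp{128 128} is rejected: 128 is a continuation byte, so it
can never start a sequence.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Encode the number C (0 <= C < 0x10000) as UTF-8 into OUT, which must
   have room for 3 bytes.  Return the number of bytes written.
   Hypothetical helper, not a Guile function.  For brevity it does not
   reject the surrogate range 0xD800..0xDFFF.  */
static size_t
utf8_encode (unsigned long c, unsigned char *out)
{
  assert (c < 0x10000UL);
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  out[0] = (unsigned char) (0xE0 | (c >> 12));
  out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (c & 0x3F));
  return 3;
}

/* Return non-zero if B can be the first byte of a UTF-8 sequence.
   Bytes 128..191 are continuation bytes; 192, 193 and 245..255 never
   occur in valid UTF-8.  */
static int
utf8_is_lead_byte (unsigned char b)
{
  return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
}
```
@end verbatim

For example, @code{utf8_encode (945, buf)} stores the two bytes 206 and
177 in @code{buf} and returns 2, while @code{utf8_is_lead_byte (128)}
returns zero.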

Having carefully defined our terminology, we will now abuse it.

We will sometimes use the word @dfn{character} to refer to the number
assigned to a character by a character set, in contexts where it's
obvious we mean a number.

Sometimes there is a close association between a particular encoding and
a particular character set. Thus, we may sometimes refer to the
character set and encoding together as an @dfn{encoding}.


@node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
@section Promised Properties of the Guile Multibyte Encoding

Internally, Guile uses a single encoding for all text --- symbols,
strings, error messages, etc. Here we list a number of helpful
properties of Guile's encoding. It is correct to write code which
assumes these properties; code which uses these assumptions will be
portable to all future versions of Guile, as far as we know.

@b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
the obvious way.} This means that a standard C string containing only
@sc{ASCII} characters is a valid Guile string (except for the terminator;
Guile strings store the length explicitly, so they can contain null
characters).

@b{The encodings of non-@sc{ASCII} characters use only bytes between 128
and 255.} That is, when we turn a non-@sc{ASCII} character into a
series of bytes, none of those bytes can ever be mistaken for the
encoding of an @sc{ASCII} character. This means that you can search a
Guile string for an @sc{ASCII} character using the standard
@code{memchr} library function. By extension, you can search for an
@sc{ASCII} substring in a Guile string using a traditional substring
search algorithm --- you needn't add special checks to verify encoding
boundaries, etc.

@b{No character encoding is a subsequence of any other character
encoding.} (This is just a stronger version of the previous promise.)
This means that you can search for occurrences of one Guile string
within another Guile string just as if they were raw byte strings. You
can use the stock @code{memmem} function (provided on GNU systems, at
least) for such searches. If you don't need the ability to represent
null characters in your text, you can still use null-termination for
strings, and use the traditional string-handling functions like
@code{strlen}, @code{strstr}, and @code{strcat}.
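
As a rough illustration of the last two promises, the sketch below
searches multibyte text with nothing but @code{memchr} and
@code{strstr}. It assumes UTF-8 as a stand-in for Guile's encoding
(UTF-8 has both properties), and the helper names
@code{find_ascii_char} and @code{find_substring} are made up for this
example.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the byte offset of the first occurrence of the ASCII
   character C in the LEN bytes at TEXT, or -1 if it does not occur.
   No decoding is needed: bytes below 128 only ever encode ASCII.  */
static long
find_ascii_char (const unsigned char *text, size_t len, char c)
{
  const unsigned char *hit;

  assert ((unsigned char) c < 128);
  hit = memchr (text, c, len);
  return hit ? (long) (hit - text) : -1;
}

/* Return the byte offset of NEEDLE within the null-terminated string
   HAYSTACK, or -1 if it does not occur.  This works only for text
   that does not need to contain null characters.  */
static long
find_substring (const char *haystack, const char *needle)
{
  const char *hit = strstr (haystack, needle);

  return hit ? (long) (hit - haystack) : -1;
}
```
@end verbatim

Both functions return byte offsets, not character indices; a match
always begins on a character boundary because of the promises above.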

@b{You can always determine the full length of a character's encoding
from its first byte.} Guile provides the macro @code{scm_mb_len} which
computes the encoding's length from its first byte. Given the first
rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
@var{b} <= 127}, returns 1.
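
Here is a minimal sketch of that idea, assuming UTF-8 as a stand-in for
Guile's encoding. @code{demo_mb_len} is a hypothetical analogue of
@code{scm_mb_len}, and @code{demo_mb_count} is a hypothetical loop that
counts characters by hopping from one leading byte to the next. Unlike
the macro described here, @code{demo_mb_len} returns 0 for a byte that
cannot start an encoding instead of leaving the behavior undefined.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence whose first byte is B, or 0
   if B cannot start a sequence.  */
static int
demo_mb_len (unsigned char b)
{
  if (b < 0x80)
    return 1;
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return 0;
}

/* Count the characters in the LEN bytes at P, looking only at the
   first byte of each character.  Return -1 if a byte that cannot
   start a sequence is found where one is expected, or if the last
   sequence is cut short.  Continuation bytes are not validated.  */
static long
demo_mb_count (const unsigned char *p, size_t len)
{
  size_t i = 0;
  long count = 0;

  assert (p != NULL || len == 0);
  while (i < len)
    {
      int n = demo_mb_len (p[i]);

      if (n == 0 || (size_t) n > len - i)
        return -1;
      i += (size_t) n;
      count++;
    }
  return count;
}
```
@end verbatim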

@b{Given an arbitrary byte position in a Guile string, you can always
find the beginning and end of the character containing that byte without
scanning too far in either direction.} This means that, if you are sure
a byte sequence is a valid encoding of a character sequence, you can
find character boundaries without keeping track of the beginning and
ending of the overall string. This promise relies on the fact that, in
addition to storing the string's length explicitly, Guile always either
terminates the string's storage with a zero byte, or shares it with
another string which is terminated this way.


@node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
@section Functions for Operating on Multibyte Text

Guile provides a variety of functions, variables, and types for working
with multibyte text.

@menu
* Basic Multibyte Character Processing::
* Finding Character Encoding Boundaries::
* Multibyte String Functions::
* Exchanging Guile Text With the Outside World in C::
* Implementing Your Own Text Conversions::
@end menu


@node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
@subsection Basic Multibyte Character Processing

Here are the essential types and functions for working with Guile text.
Guile uses the C type @code{unsigned char *} to refer to text encoded
with Guile's encoding.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times.
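
The following sketch shows why this matters. @code{DEMO_MB_LEN} is a
hypothetical macro, not a Guile definition; it follows UTF-8 length
rules for one- to three-byte sequences as a stand-in, and mentions its
argument twice. Passing it an expression with a side effect advances
the pointer twice and computes the length from the wrong byte, whereas
reading the byte into a variable first behaves as expected.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro that mentions its argument more than once, so
   the argument may be evaluated more than once.  */
#define DEMO_MB_LEN(b) ((b) < 0x80 ? 1 : ((b) < 0xE0 ? 2 : 3))

/* Unsafe: the increment inside the argument can run twice, and the
   second test then looks at the following byte.  */
static int
len_unsafe (const unsigned char **pp)
{
  assert (pp != NULL && *pp != NULL);
  return DEMO_MB_LEN (*(*pp)++);
}

/* Safe: read the byte exactly once, then hand the macro a plain
   variable that has no side effects.  */
static int
len_safe (const unsigned char **pp)
{
  unsigned char b;

  assert (pp != NULL && *pp != NULL);
  b = *(*pp)++;
  return DEMO_MB_LEN (b);
}
```
@end verbatim

With the three-byte sequence @samp{226 130 172}, @code{len_safe}
returns 3 and advances the pointer by one byte, while
@code{len_unsafe} returns 2 and advances it by two.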

@deftp {Libguile Type} scm_char_t
This is a signed integral type large enough to hold any character in
Guile's character set. All character numbers are positive.
@end deftp

@deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
Return the character whose encoding starts at @var{p}. If @var{p} does
not point at a valid character encoding, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
Place the encoded form of the Guile character @var{c} at @var{p}, and
return its length in bytes. If @var{c} is not a Guile character, the
behavior is undefined.
@end deftypefn

@deftypevr {Libguile Constant} int scm_mb_max_len
The maximum length of any character's encoding, in bytes. You may
assume this is relatively small --- less than a dozen or so.
@end deftypevr

@deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
If @var{b} is the first byte of a character's encoding, return the full
length of the character's encoding, in bytes. If @var{b} is not a valid
leading byte, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
Return the length of the encoding of the character @var{c}, in bytes.
If @var{c} is not a valid Guile character, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
@deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
@deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
@deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
These are functions identical to the corresponding macros. You can use
them in situations where the overhead of a function call is acceptable,
and the cleaner semantics of function application are desirable.
@end deftypefn


@node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
@subsection Finding Character Encoding Boundaries

These are functions for finding the boundaries between characters in
multibyte text.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times, unless the definition promises
otherwise.

@deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
Return non-zero iff @var{p} points to the start of a character in
multibyte text.

This macro will evaluate its argument only once.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
``Round'' @var{p} to the previous character boundary. That is, if
@var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding. If @var{p} points
to the start of the encoding of a Guile character, return @var{p}
unchanged.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
``Round'' @var{p} to the next character boundary. That is, if @var{p}
points to the middle of the encoding of a Guile character, return a
pointer to the first byte of the encoding of the next character. If
@var{p} points to the start of the encoding of a Guile character, return
@var{p} unchanged.
@end deftypefn

Note that it is usually not friendly for functions to silently correct
byte offsets that point into the middle of a character's encoding. Such
offsets almost always indicate a programming error, and they should be
reported as early as possible. So, when you write code which operates
on multibyte text, you should not use functions like these to ``clean
up'' byte offsets which the originator believes to be correct; instead,
your code should signal a @code{text:not-char-boundary} error as soon as
it detects an invalid offset. @xref{Multibyte Text Processing Errors}.
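
A sketch of this policy, under the assumption that the text is UTF-8
(where every non-leading byte has the bit pattern @samp{10xxxxxx}):
@code{demo_boundary_p} and @code{demo_floor} are hypothetical stand-ins
for @code{scm_mb_boundary_p} and @code{scm_mb_floor}, and
@code{demo_check_offset} shows a caller rejecting a bad offset rather
than rounding it. Unlike @code{scm_mb_floor}, @code{demo_floor} also
takes the start of the buffer so that it never scans past it; how a
real caller would signal @code{text:not-char-boundary} is left out.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Non-zero iff P points at the first byte of a UTF-8 sequence.  */
static int
demo_boundary_p (const unsigned char *p)
{
  return (*p & 0xC0) != 0x80;
}

/* Back up from P to the first byte of the character containing it,
   never moving before START.  */
static const unsigned char *
demo_floor (const unsigned char *start, const unsigned char *p)
{
  assert (start <= p);
  while (p > start && !demo_boundary_p (p))
    p--;
  return p;
}

/* Validate a caller-supplied byte offset into the LEN bytes at TEXT.
   Return 0 if OFFSET is a character boundary (or the end of the
   text), and -1 otherwise, so the caller can report the error
   instead of silently correcting it.  */
static int
demo_check_offset (const unsigned char *text, size_t len, size_t offset)
{
  if (offset > len)
    return -1;
  if (offset < len && !demo_boundary_p (text + offset))
    return -1;
  return 0;
}
```
@end verbatim

Rounding with @code{demo_floor} remains useful for offsets your own
code derived from arbitrary byte positions, such as the end of a
fixed-size read buffer.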


@node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
@subsection Multibyte String Functions

These functions allow you to operate on multibyte strings: sequences of
character encodings.

@deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
Return the number of Guile characters encoded by the @var{len} bytes at
@var{p}.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

@deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
Return the character whose encoding starts at @code{*@var{pp}}, and
advance @code{*@var{pp}} to the start of the next character. Return -1
if @code{*@var{pp}} does not point to a valid character encoding.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
If @var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding. If @var{p} points
to the start of the encoding of a Guile character, return the start of
the previous character's encoding.

This is like @code{scm_mb_floor}, but the returned pointer will always
be before @var{p}. If you use this function to drive an iteration, it
guarantees backward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
If @var{p} points to the encoding of a Guile character, return a pointer
to the first byte of the encoding of the next character.

This is like @code{scm_mb_ceiling}, but the returned pointer will always
be after @var{p}. If you use this function to drive an iteration, it
guarantees forward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
Assuming that the @var{len} bytes starting at @var{p} are a
concatenation of valid character encodings, return a pointer to the
start of the @var{i}'th character encoding in the sequence.

This function scans the sequence from the beginning to find the
@var{i}'th character, and will generally require time proportional to
the distance from @var{p} to the returned address.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

It is common to process the characters in a string from left to right.
However, if you fetch each character using @code{scm_mb_index}, each
call will scan the text from the beginning, so your loop will require
time proportional to at least the square of the length of the text. To
avoid this poor performance, you can use an @code{scm_mb_cache}
structure and the @code{scm_mb_index_cached} macro.

@deftp {Libguile Type} {struct scm_mb_cache}
This structure holds information that allows a string scanning operation
to use the results from a previous scan of the string. It has the
following members:
@table @code

@item character
An index, in characters, into the string.

@item byte
The index, in bytes, of the start of that character.

@end table

In other words, @code{byte} is the byte offset of the
@code{character}'th character of the string. Note that if @code{byte}
and @code{character} are equal, then all characters before that point
must have encodings exactly one byte long, and the string can be indexed
normally.

All elements of a @code{struct scm_mb_cache} structure should be
initialized to zero before its first use, and whenever the string's text
changes.
@end deftp

@deftypefn {Libguile Macro} const unsigned char *scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
@deftypefnx {Libguile Function} const unsigned char *scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
This macro and this function are identical to @code{scm_mb_index},
except that they may consult and update @code{*@var{cache}} in order to avoid
scanning the string from the beginning. @code{scm_mb_index_cached} is a
macro, so it may have less overhead than
@code{scm_mb_index_cached_func}, but it may evaluate its arguments more
than once.

Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
can scan a string from left to right, or from right to left, in time
proportional to the length of the string. As long as each character
fetched is less than some constant distance before or after the previous
character fetched with @var{cache}, each access will require constant
time.
@end deftypefn
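
The cache can be pictured with the following sketch. It is not Guile's
implementation: it assumes valid UTF-8 text as a stand-in for Guile's
encoding, and @code{struct demo_mb_cache} and @code{demo_index_cached}
are hypothetical names that mirror the members and argument order
described above. Each call starts from the cached position and walks
forward or backward, so neighbouring accesses cost time proportional to
the distance moved rather than to the distance from the start.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

struct demo_mb_cache
{
  int character;   /* an index, in characters, into the string */
  int byte;        /* byte offset of the start of that character */
};

/* Sequence length from a leading byte; valid UTF-8 is assumed.  */
static int
demo_lead_len (unsigned char b)
{
  return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
}

/* Return a pointer to the I'th character in the LEN bytes at P, or
   NULL if there is no such character.  *CACHE must be zeroed before
   first use and whenever the text changes.  */
static const unsigned char *
demo_index_cached (const unsigned char *p, int len, int i,
                   struct demo_mb_cache *cache)
{
  int ch = cache->character;
  int by = cache->byte;

  assert (i >= 0 && len >= 0);

  /* Walk forward from the cached position, one character at a time.  */
  while (ch < i && by < len)
    {
      by += demo_lead_len (p[by]);
      ch++;
    }

  /* Or walk backward, skipping continuation bytes (10xxxxxx).  */
  while (ch > i)
    {
      do
        by--;
      while (by > 0 && (p[by] & 0xC0) == 0x80);
      ch--;
    }

  cache->character = ch;
  cache->byte = by;
  return (ch == i && by < len) ? p + by : NULL;
}
```
@end verbatim

A left-to-right loop that calls @code{demo_index_cached} with
@var{i} = 0, 1, 2, @dots{} therefore moves one character per call.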

Guile also provides functions to convert between an encoded sequence of
characters, and an array of @code{scm_char_t} objects.

@deftypefn {Libguile Function} scm_char_t *scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
Convert the variable-width text in the @var{len} bytes at @var{p}
to an array of @code{scm_char_t} values. Return a pointer to the array,
and set @code{*@var{result_len}} to the number of elements it contains.
The returned array is allocated with @code{malloc}, and it is the
caller's responsibility to free it.

If the text is not a sequence of valid character encodings, this
function will signal a @code{text:bad-encoding} error.
@end deftypefn

@deftypefn {Libguile Function} unsigned char *scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
Convert the array of @code{scm_char_t} values to a sequence of
variable-width character encodings. Return a pointer to the array of
bytes, and set @code{*@var{result_len}} to its length, in bytes.

The returned byte sequence is terminated with a zero byte, which is not
counted in the length returned in @code{*@var{result_len}}.

The returned byte sequence is allocated with @code{malloc}; it is the
caller's responsibility to free it.

If the text is not a sequence of valid character encodings, this
function will signal a @code{text:bad-encoding} error.
@end deftypefn


@node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
@subsection Exchanging Guile Text With the Outside World in C

[[This is kind of a heavy-weight model, given that one end of the
conversion is always going to be the Guile encoding. Any way to shorten
things a bit?]]

Guile provides functions for converting between Guile's internal text
representation and encodings popular in the outside world. These
functions are closely modeled after the @code{iconv} functions available
on some systems.

To convert text between two encodings, you should first call
@code{scm_mb_iconv_open} to indicate the source and destination
encodings; this function returns a context object which records the
conversion to perform.

Then, you should call @code{scm_mb_iconv} to actually convert the text.
This function expects input and output buffers, and a pointer to the
context you got from @code{scm_mb_iconv_open}. You don't need to pass
all your input to @code{scm_mb_iconv} at once; you can invoke it on
successive blocks of input (as you read it from a file, say), and it
will convert as much as it can each time, indicating when you should
grow your output buffer.

An encoding may be @dfn{stateless}, or @dfn{stateful}. In most
encodings, a contiguous group of bytes from the sequence completely
specifies a particular character; these are stateless encodings.
However, some encodings require you to look back an unbounded number of
bytes in the stream to assign a meaning to a particular byte sequence;
such encodings are stateful.

For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
byte sequence @samp{27 36 66} indicates that subsequent bytes should be
taken in pairs and interpreted as characters from the JIS-0208 character
set. An arbitrary number of byte pairs may follow this sequence. The
byte sequence @samp{27 40 66} indicates that subsequent bytes should be
interpreted as @sc{ASCII}. In this encoding, you cannot tell whether a
given byte is an @sc{ASCII} character without looking back an arbitrary
distance for the most recent escape sequence, so it is a stateful
encoding.
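
The following sketch makes the role of the state explicit. It is a toy
scanner, not Guile code: @code{demo_count_chars} is a hypothetical
function that recognizes only the two escape sequences mentioned above
and counts characters, with the shift state kept in a variable owned by
the caller. The same two bytes count as two characters in the
@sc{ASCII} state and as one character in the JIS-0208 state.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

enum demo_state { DEMO_ASCII, DEMO_JIS0208 };

/* Count the characters in the LEN bytes at P.  *STATE holds the shift
   state on entry and is updated on exit, so a stream can be scanned
   in chunks that end on character boundaries.  Return -1 on an
   unknown escape sequence or a truncated sequence.  */
static long
demo_count_chars (const unsigned char *p, size_t len,
                  enum demo_state *state)
{
  size_t i = 0;
  long count = 0;

  assert (state != NULL);
  while (i < len)
    {
      if (p[i] == 27)
        {
          if (len - i < 3)
            return -1;
          if (p[i + 1] == 36 && p[i + 2] == 66)
            *state = DEMO_JIS0208;      /* 27 36 66: byte pairs follow */
          else if (p[i + 1] == 40 && p[i + 2] == 66)
            *state = DEMO_ASCII;        /* 27 40 66: ASCII follows */
          else
            return -1;
          i += 3;                       /* escapes produce no character */
        }
      else if (*state == DEMO_JIS0208)
        {
          if (len - i < 2)
            return -1;
          i += 2;
          count++;
        }
      else
        {
          i += 1;
          count++;
        }
    }
  return count;
}
```
@end verbatim

Because the state lives outside the function, two independent streams
can be scanned at the same time, each with its own state variable.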

In Guile, if a conversion involves a stateful encoding, the context
object carries any necessary state. Thus, you can have many independent
conversions to or from stateful encodings taking place simultaneously,
as long as each data stream uses its own context object for the
conversion.

@deftp {Libguile Type} {struct scm_mb_iconv}
This is the type for context objects, which represent the encodings and
current state of an ongoing text conversion. A @code{struct
scm_mb_iconv} records the source and destination encodings, and keeps
track of any information needed to handle stateful encodings.
@end deftp

@deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
Return a pointer to a new @code{struct scm_mb_iconv} context object,
ready to convert from the encoding named @var{fromcode} to the encoding
named @var{tocode}. For stateful encodings, the context object is in
some appropriate initial state, ready for use with the
@code{scm_mb_iconv} function.

When you are done using a context object, you may call
@code{scm_mb_iconv_close} to free it.

If either @var{tocode} or @var{fromcode} is not the name of a known
encoding, this function will signal the @code{text:unknown-conversion}
error, described below.

@c Try to use names here from the IANA list:
@c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Guile supports at least these encodings:
@table @samp

@item US-ASCII
@sc{US-ASCII}, in the standard one-character-per-byte encoding.

@item ISO-8859-1
The usual character set for Western European languages, in its usual
one-character-per-byte encoding.

@item Guile-MB
Guile's current internal multibyte encoding. The actual encoding this
name refers to will change from one version of Guile to the next. You
should use this when converting data between external sources and the
encoding used by Guile objects.

You should @emph{not} use this as the encoding for data presented to the
outside world, for two reasons. 1) Its meaning will change over time,
so data written using the @samp{Guile-MB} encoding with one version of
Guile might not be readable with the @samp{Guile-MB} encoding in another
version of Guile. 2) It currently corresponds to @samp{Emacs-Mule},
which was invented for Emacs's internal use, and was never intended to
serve as an exchange medium.

@item Guile-Wide
Guile's character set, as an array of @code{scm_char_t} values.

Note that this encoding is even less suitable for public use than
@samp{Guile-MB}, since the exact sequence of bytes depends heavily on the
size and endianness the host system uses for @code{scm_char_t}. Using
this encoding is very much like calling the
@code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
functions, except that @code{scm_mb_iconv} gives you more control over
buffer allocation and management.

@item Emacs-Mule
This is the variable-length encoding used for multi-lingual text by GNU
Emacs, at least through version 20.4. You probably should not use this
encoding, as it is designed only for Emacs's internal use. However, we
provide it here because it's trivial to support, and some people
probably do have @samp{emacs-mule}-format files lying around.

@end table

(At the moment, this list doesn't include any character sets suitable for
external use that can actually handle multilingual data; this is
unfortunate, as it encourages users to write data in Emacs-Mule format,
which nobody but Emacs and Guile understands. We hope to add support
for Unicode in UTF-8 soon, which should solve this problem.)

Case is not significant in encoding names.

You can define your own conversions; see @ref{Implementing Your Own Text
Conversions}.
@end deftypefn

@deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
Return a non-zero value if Guile supports the encoding named @var{encoding}.
@end deftypefn

@deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
|
|
Convert a sequence of characters from one encoding to another. The
|
|
argument @var{context} specifies the encodings to use for the input and
|
|
output, and carries state for stateful encodings; use
|
|
@code{scm_mb_iconv_open} to create a @var{context} object for a
|
|
particular conversion.
|
|
|
|
Upon entry to the function, @code{*@var{inbuf}} should point to the
|
|
input buffer, and @code{*@var{inbytesleft}} should hold the number of
|
|
input bytes present in the buffer; @code{*@var{outbuf}} should point to
|
|
the output buffer, and @code{*@var{outbytesleft}} should hold the number
|
|
of bytes available to hold the conversion results in that buffer.
|
|
|
|
Upon exit from the function, @code{*@var{inbuf}} points to the first
|
|
unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
|
|
of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
|
|
the last output byte, and @code{*@var{outbyteleft}} holds the number of
|
|
bytes left unused in the output buffer.
|
|
|
|
For stateful encodings, @var{context} carries encoding state from one
|
|
call to @code{scm_mb_iconv} to the next. Thus, successive calls to
|
|
@var{scm_mb_iconv} which use the same context object can convert a
|
|
stream of data one chunk at a time.
|
|
|
|
If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
|
|
taken as a request to reset the states of the input and the output
|
|
encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
|
|
non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
|
|
buffer to put the output encoding in its initial state. If the output
|
|
buffer is not large enough to hold this byte sequence,
|
|
@code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
the shift states of @var{context}'s input and output encodings
unchanged.

The @code{scm_mb_iconv} function always consumes only complete
characters or shift sequences from the input buffer, and the output
buffer always contains a sequence of complete characters or escape
sequences.

If the input sequence contains characters which are not expressible in
the output encoding, @code{scm_mb_iconv} converts them in an
implementation-defined way. It may simply delete them.

Some encodings use byte sequences which do not correspond to any textual
character. For example, the escape sequence of a stateful encoding has
no textual meaning. When converting from such an encoding, a call to
@code{scm_mb_iconv} might consume input but produce no output, since the
input sequence might contain only escape sequences.

Normally, @code{scm_mb_iconv} returns the number of input characters it
could not convert perfectly to the output encoding. However, it may
return one of the @code{scm_mb_iconv_} codes described below, to
indicate an error. All of these codes are negative values.

If the input sequence contains an invalid character encoding, conversion
stops before the invalid input character, and @code{scm_mb_iconv}
returns the constant value @code{scm_mb_iconv_bad_encoding}.

If the input sequence ends with an incomplete character encoding,
@code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
return the constant value @code{scm_mb_iconv_incomplete_encoding}. This
is not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the encoding
fragment.

If the output buffer does not contain enough room to hold the converted
form of the complete input text, @code{scm_mb_iconv} converts as much as
it can, changes the input and output pointers to reflect the amount of
text successfully converted, and then returns
@code{scm_mb_iconv_too_big}.
@end deftypefn

Here are the status codes that might be returned by @code{scm_mb_iconv}.
They are all negative integers.
@table @code

@item scm_mb_iconv_too_big
The conversion needs more room in the output buffer. Some characters
may have been consumed from the input buffer, and some characters may
have been placed in the available space in the output buffer.

@item scm_mb_iconv_bad_encoding
@code{scm_mb_iconv} encountered an invalid character encoding in the
input buffer. Conversion stopped before the invalid character, so there
may be some characters consumed from the input buffer, and some
converted text in the output buffer.

@item scm_mb_iconv_incomplete_encoding
The input buffer ends with an incomplete character encoding. The
incomplete encoding is left in the input buffer, unconsumed. This is
not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the incomplete
encoding.

@end table
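
A caller typically has to react differently to each of these codes. The
following is a minimal sketch of that dispatch, not part of Guile. The
numeric values given to the three status codes are placeholders (the
text only promises that they are negative), and @code{enum next_action}
and @code{next_action_for} are hypothetical names introduced for the
illustration.

```c
/* Placeholder values: the manual says only that these are negative. */
enum
{
  scm_mb_iconv_too_big = -1,
  scm_mb_iconv_bad_encoding = -2,
  scm_mb_iconv_incomplete_encoding = -3
};

/* Hypothetical caller-side actions; not part of the documented API. */
enum next_action
{
  ACTION_DONE,          /* all offered input was consumed */
  ACTION_DRAIN_OUTPUT,  /* empty or enlarge the output buffer, call again */
  ACTION_READ_MORE,     /* append more input after the leftover bytes */
  ACTION_REPORT_ERROR   /* the input is malformed at the current position */
};

/* Map a return value of scm_mb_iconv to what the caller does next.
   Non-negative values count imperfectly converted characters, so the
   call itself succeeded.  */
enum next_action
next_action_for (int status)
{
  if (status >= 0)
    return ACTION_DONE;

  switch (status)
    {
    case scm_mb_iconv_too_big:
      return ACTION_DRAIN_OUTPUT;
    case scm_mb_iconv_incomplete_encoding:
      return ACTION_READ_MORE;
    case scm_mb_iconv_bad_encoding:
    default:
      return ACTION_REPORT_ERROR;
    }
}
```

Treating any unknown negative value as an error is a conservative choice
made for this sketch; the text does not say whether other codes exist.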


Finally, Guile provides a function for destroying conversion contexts.

@deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
Deallocate the conversion context object @var{context}, and all other
resources allocated by the call to @code{scm_mb_iconv_open} which
returned @var{context}.
@end deftypefn


@node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
@subsection Implementing Your Own Text Conversions

[[note that conversions to and from Guile must produce streams
containing only valid character encodings, or else Guile will crash]]

This section describes the interface for adding your own encoding
conversions for use with @code{scm_mb_iconv}. The interface here is
borrowed from the GNOME Project's @file{libunicode} library.

Guile's @code{scm_mb_iconv} function works by converting the input text
to a stream of @code{scm_char_t} characters, and then converting
those characters to the desired output encoding. This makes it easy
for Guile to choose the appropriate conversion back ends for an
arbitrary pair of input and output encodings, but it also means that the
accuracy and quality of the conversions depend on the fidelity of
Guile's internal character set to the source and destination encodings.
Since @code{scm_mb_iconv} will be used almost exclusively for converting
to and from Guile's internal character set, this shouldn't be a problem.

To add support for a particular encoding to Guile, you must provide one
function (called the @dfn{read} function) which converts from your
encoding to an array of @code{scm_char_t}'s, and another function
(called the @dfn{write} function) to convert from an array of
@code{scm_char_t}'s back into your encoding. To convert from some
encoding @var{a} to some other encoding @var{b}, Guile pairs up
@var{a}'s read function with @var{b}'s write function. Each call to
@code{scm_mb_iconv} passes text in encoding @var{a} through the read
function, to produce an array of @code{scm_char_t}'s, and then passes
that array to the write function, to produce text in encoding @var{b}.
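
The pairing of a read function with a write function can be sketched as
a small driver loop. This is an illustration under stated assumptions,
not Guile's implementation: @code{scm_char_t} is assumed to be an
integer type, the status values are placeholders, and
@code{convert_via_chars}, @code{byte_read} and @code{byte_write} are
hypothetical names. The driver moves one character at a time so that it
can un-consume input when the output buffer is full, which is only safe
when the read function keeps no state.

```c
#include <stddef.h>

typedef unsigned long scm_char_t;      /* assumption: an integer type */

enum scm_mb_read_result { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };
enum { conv_too_big = -1, conv_bad_encoding = -2, conv_incomplete = -3 }; /* placeholders */

typedef enum scm_mb_read_result (*read_fn) (void *, const char **, size_t *,
                                            scm_char_t **, size_t *);
typedef enum scm_mb_write_result (*write_fn) (void *, scm_char_t **, size_t *,
                                              char **, size_t *);

/* Toy one-byte-per-character read function, used to exercise the driver. */
enum scm_mb_read_result
byte_read (void *cookie, const char **inbuf, size_t *inbytesleft,
           scm_char_t **outbuf, size_t *outcharsleft)
{
  (void) cookie;
  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      *(*outbuf)++ = (unsigned char) *(*inbuf)++;
      --*inbytesleft;
      --*outcharsleft;
    }
  return scm_mb_read_ok;
}

/* Toy one-byte-per-character write function. */
enum scm_mb_write_result
byte_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
            char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  while (*incharsleft > 0)
    {
      if (*outbytesleft == 0)
        return scm_mb_write_too_big;
      *(*outbuf)++ = (char) (*(*inbuf)++ & 0xff);
      --*incharsleft;
      --*outbytesleft;
    }
  return scm_mb_write_ok;
}

/* Pass input through RD, then through WR, one character at a time. */
int
convert_via_chars (read_fn rd, void *rd_cookie, write_fn wr, void *wr_cookie,
                   const char **inbuf, size_t *inbytesleft,
                   char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      scm_char_t ch, *chp = &ch;
      size_t room = 1;
      const char *saved_in = *inbuf;
      size_t saved_left = *inbytesleft;
      enum scm_mb_read_result r = rd (rd_cookie, inbuf, inbytesleft, &chp, &room);

      if (r == scm_mb_read_incomplete)
        return conv_incomplete;
      if (r == scm_mb_read_error)
        return conv_bad_encoding;
      if (room == 0)                   /* the read produced one character */
        {
          scm_char_t *src = &ch;
          size_t nchars = 1;
          if (wr (wr_cookie, &src, &nchars, outbuf, outbytesleft)
              == scm_mb_write_too_big)
            {
              *inbuf = saved_in;       /* un-consume the character */
              *inbytesleft = saved_left;
              return conv_too_big;
            }
        }
    }
  return 0;
}
```

A production driver would more likely convert through a larger
intermediate array and would need a way to restore a stateful encoding's
shift state; both are omitted here for brevity.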

For stateful encodings, a read or write function can hang its own data
structures off the conversion object, and provide its own functions to
allocate and destroy them; this allows read and write functions to
maintain whatever state they like.

The Guile conversion back end represents each available encoding with a
@code{struct scm_mb_encoding} object.

@deftp {Libguile Type} {struct scm_mb_encoding}
This data structure describes an encoding. It has the following
members:

@table @code

@item char **names
An array of strings, giving the various names for this encoding. The
array should be terminated by a zero pointer. Case is not significant
in encoding names.

The @code{scm_mb_iconv_open} function searches the list of registered
encodings for an encoding whose @code{names} array matches its
@var{tocode} or @var{fromcode} argument.

@item int (*init) (void **@var{cookie})
An initialization function for the encoding's private data.
@code{scm_mb_iconv_open} will call this function, passing it the address
of the cookie for this encoding in this context. (We explain cookies
below.) There is no way for the @code{init} function to tell whether
the encoding will be used for reading or writing.

Note that @code{init} receives a @emph{pointer} to the cookie, not the
cookie itself. Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.

The @code{init} member may be zero, indicating that no initialization is
necessary for this encoding.

@item int (*destroy) (void **@var{cookie})
A deallocation function for the encoding's private data.
@code{scm_mb_iconv_close} calls this function, passing it the address of
the cookie for this encoding in this context. The @code{destroy}
function should free any data the @code{init} function allocated.

Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
cookie itself. Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.

The @code{destroy} member may be zero, indicating that this encoding
doesn't need to perform any special action to destroy its local data.

@item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
Put the encoding into its initial shift state. Guile calls this
function whether the encoding is being used for input or output, so this
should take appropriate steps for both directions. If @var{outbuf} and
@var{outbytesleft} are valid, the reset function should emit an escape
sequence to reset the output stream to its initial state; @var{outbuf}
and @var{outbytesleft} should be handled just as for
@code{scm_mb_iconv}.

This function can return an @code{scm_mb_iconv_} error code
(@pxref{Exchanging Guile Text With the Outside World in C}). If it
returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
state must be left unchanged.

Note that @code{reset} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

The @code{reset} member may be zero, indicating that this encoding
doesn't use a shift state.

@item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
Read some bytes and convert into an array of Guile characters. This is
the encoding's read function.

On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to
be converted, and *@var{outcharsleft} characters available at
*@var{outbuf} to hold the results.

On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes
still not consumed. *@var{outcharsleft} and *@var{outbuf} indicate the
output buffer space still not filled. (By exclusion, these indicate
which input bytes were consumed, and which output characters were
produced.)

Return one of the @code{enum scm_mb_read_result} values, described below.

Note that @code{read} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert an array of Guile characters to output bytes. This is
the encoding's write function.

On entry, there are *@var{incharsleft} Guile characters available at
*@var{inbuf}, and *@var{outbytesleft} bytes available to store output at
*@var{outbuf}.

On exit, *@var{incharsleft} and *@var{inbuf} indicate the number of
Guile characters left unconverted (because there was insufficient room
in the output buffer to hold their converted forms), and
*@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the
output buffer.

Return one of the @code{enum scm_mb_write_result} values, described below.

Note that @code{write} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

@item struct scm_mb_encoding *next
This is used by Guile to maintain a linked list of encodings. It is
filled in when you call @code{scm_mb_register_encoding} to add your
encoding to the list.

@end table
@end deftp
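
Taken together, the members above suggest a declaration along the
following lines. This is a sketch, not a copy of a Guile header: the
@code{scm_char_t} typedef and the enumerator order are assumptions, and
@code{ascii_read}, @code{ascii_names} and @code{ascii_encoding} are
hypothetical names for a stateless 7-bit @sc{ASCII} encoding. It also
assumes that @sc{ASCII} characters keep their usual numbers in Guile's
character set. The write member is left unset here only to keep the
example short; an encoding registered for real use would supply one.

```c
#include <stddef.h>

typedef unsigned long scm_char_t;      /* assumption: an integer type */

enum scm_mb_read_result { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

struct scm_mb_encoding
{
  char **names;
  int (*init) (void **cookie);
  int (*destroy) (void **cookie);
  int (*reset) (void *cookie, char **outbuf, size_t *outbytesleft);
  enum scm_mb_read_result (*read) (void *cookie,
                                   const char **inbuf, size_t *inbytesleft,
                                   scm_char_t **outbuf, size_t *outcharsleft);
  enum scm_mb_write_result (*write) (void *cookie,
                                     scm_char_t **inbuf, size_t *incharsleft,
                                     char **outbuf, size_t *outbytesleft);
  struct scm_mb_encoding *next;
};

/* Read function for 7-bit ASCII.  Each byte below 0x80 becomes one
   character.  A byte of 0x80 or above is invalid: if characters were
   already converted in this call we stop and report success, so the
   next call reports the error at the offending byte.  Callers are
   expected to pass non-empty input and output space.  */
enum scm_mb_read_result
ascii_read (void *cookie, const char **inbuf, size_t *inbytesleft,
            scm_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;                       /* stateless: the cookie is unused */
  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      unsigned char byte = (unsigned char) **inbuf;

      if (byte >= 0x80)
        return consumed > 0 ? scm_mb_read_ok : scm_mb_read_error;
      *(*outbuf)++ = byte;
      ++*inbuf;
      --*inbytesleft;
      --*outcharsleft;
      ++consumed;
    }
  return scm_mb_read_ok;
}

static char *ascii_names[] = { "US-ASCII", "ASCII", NULL };

/* init, destroy and reset stay zero: no private data, no shift state. */
struct scm_mb_encoding ascii_encoding = {
  .names = ascii_names,
  .read = ascii_read
};
```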

Here is the enumerated type for the values an encoding's read function
can return:

@deftp {Libguile Type} {enum scm_mb_read_result}
This type represents the result of a call to an encoding's read
function. It has the following values:

@table @code

@item scm_mb_read_ok
The read function consumed at least one byte of input.

@item scm_mb_read_incomplete
The data present in the input buffer does not contain a complete
character encoding. No input was consumed, and no characters were
produced as output. This is not necessarily an error status, if there
is more data to pass through.

@item scm_mb_read_error
The input contains an invalid character encoding.

@end table
@end deftp
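
To illustrate the difference between @code{scm_mb_read_ok} and
@code{scm_mb_read_incomplete}, here is a sketch of a read function for a
fixed two-byte, big-endian encoding. Everything in it beyond the
signature is an assumption made for the example: the name
@code{ucs2be_read}, the @code{scm_char_t} typedef, the idea that the
two-byte codes coincide with Guile character numbers, and the decision
to treat the range 0xD800 to 0xDFFF as invalid.

```c
#include <stddef.h>

typedef unsigned long scm_char_t;      /* assumption: an integer type */

enum scm_mb_read_result { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };

enum scm_mb_read_result
ucs2be_read (void *cookie, const char **inbuf, size_t *inbytesleft,
             scm_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;                       /* stateless: the cookie is unused */
  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      const unsigned char *p = (const unsigned char *) *inbuf;
      scm_char_t ch;

      if (*inbytesleft < 2)
        /* A lone trailing byte stays in the buffer.  It only counts as
           "incomplete" if this call consumed nothing at all.  */
        return consumed > 0 ? scm_mb_read_ok : scm_mb_read_incomplete;

      ch = ((scm_char_t) p[0] << 8) | p[1];
      if (ch >= 0xD800 && ch <= 0xDFFF)
        /* Stop before the invalid code; report it on the next call if
           this call already made progress.  */
        return consumed > 0 ? scm_mb_read_ok : scm_mb_read_error;

      *(*outbuf)++ = ch;
      *inbuf += 2;
      *inbytesleft -= 2;
      --*outcharsleft;
      consumed += 2;
    }
  return scm_mb_read_ok;
}
```

Given the five bytes 00 41 03 B1 00, a first call converts two
characters and leaves one byte; a second call on that single byte
returns @code{scm_mb_read_incomplete} without consuming it.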

Here is the enumerated type for the values an encoding's write function
can return:

@deftp {Libguile Type} {enum scm_mb_write_result}
This type represents the result of a call to an encoding's write
function. It has the following values:

@table @code

@item scm_mb_write_ok
The write function was able to convert all the characters in @var{inbuf}
successfully.

@item scm_mb_write_too_big
The write function filled the output buffer, but there are still
characters in @var{inbuf} left unconsumed; @var{inbuf} and
@var{incharsleft} indicate the unconsumed portion of the input buffer.

@end table
@end deftp
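
A write function that follows these two result codes might look like
the sketch below, which targets 7-bit @sc{ASCII}. The name
@code{ascii_write} and the @code{scm_char_t} typedef are assumptions, as
is the choice to substitute @samp{?} for characters that @sc{ASCII}
cannot express; the text leaves that behaviour implementation-defined.

```c
#include <stddef.h>

typedef unsigned long scm_char_t;      /* assumption: an integer type */

enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

enum scm_mb_write_result
ascii_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
             char **outbuf, size_t *outbytesleft)
{
  (void) cookie;                       /* stateless: the cookie is unused */
  while (*incharsleft > 0)
    {
      scm_char_t ch = **inbuf;

      /* Out of room: leave the remaining characters unconsumed so the
         caller can retry them with a fresh output buffer.  */
      if (*outbytesleft == 0)
        return scm_mb_write_too_big;

      /* Characters outside ASCII are replaced, not dropped.  */
      *(*outbuf)++ = ch < 0x80 ? (char) ch : '?';
      ++*inbuf;
      --*incharsleft;
      --*outbytesleft;
    }
  return scm_mb_write_ok;
}
```

Because each character becomes exactly one byte here, the function never
writes a partial character; an encoding with multi-byte output would
have to check that the whole encoded form fits before storing any of it.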


Conversions to or from stateful encodings need to keep track of each
encoding's current state. Each conversion context contains two
@code{void *} variables called @dfn{cookies}, one for the input
encoding, and one for the output encoding. These cookies are passed to
the encodings' functions, for them to use however they please. A
stateful encoding can use its cookie to hold a pointer to some object
which maintains the context's current shift state. Stateless encodings
will probably not use their cookies.

The cookies' lifetime is the same as that of the context object. When
the user calls @code{scm_mb_iconv_close} to destroy a context object,
@code{scm_mb_iconv_close} calls the input and output encodings'
@code{destroy} functions, passing them their respective cookies, so each
encoding can free any data it allocated for that context.

Note that, if a read or write function returns a successful result code
like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
input, together with the output, must represent the complete
input text; the encoding may not store any text temporarily in its
cookie. This is because, if @code{scm_mb_iconv} returns a successful
result to the user, it is correct for the user to assume that all the
consumed input has been converted and placed in the output buffer.
There is no ``flush'' operation to push any final results out of the
encodings' buffers.
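
The life cycle of a cookie can be sketched with three small functions
for an imaginary stateful encoding. All names here
(@code{struct shift_state}, @code{toy_init}, @code{toy_destroy},
@code{toy_reset}) are hypothetical. The sketch further assumes that
@code{init} and @code{destroy} return zero on success, that ``valid''
output arguments means non-null pointers, and that three bytes (ESC,
@samp{(}, @samp{B}) return the stream to its initial state; the value
of @code{scm_mb_iconv_too_big} is a placeholder.

```c
#include <stdlib.h>
#include <string.h>

enum { scm_mb_iconv_too_big = -1 };    /* placeholder value */

/* Private data: shift state only, never buffered text. */
struct shift_state
{
  int shifted;                         /* 0 means the initial state */
};

static const char reset_sequence[] = { 0x1b, '(', 'B' };

/* Receives the ADDRESS of the cookie and stores the new object in it. */
int
toy_init (void **cookie)
{
  struct shift_state *s = malloc (sizeof *s);

  if (s == NULL)
    return -1;                         /* assumption: nonzero means failure */
  s->shifted = 0;
  *cookie = s;
  return 0;
}

/* Also receives the cookie's address, so it can clear the slot. */
int
toy_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* Receives the cookie's VALUE.  Emits the reset sequence when an output
   buffer is supplied and the stream is currently shifted.  */
int
toy_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct shift_state *s = cookie;

  if (s->shifted && outbuf != NULL && *outbuf != NULL && outbytesleft != NULL)
    {
      if (*outbytesleft < sizeof reset_sequence)
        return scm_mb_iconv_too_big;   /* shift state left unchanged */
      memcpy (*outbuf, reset_sequence, sizeof reset_sequence);
      *outbuf += sizeof reset_sequence;
      *outbytesleft -= sizeof reset_sequence;
    }
  s->shifted = 0;
  return 0;
}
```

Note that the state is updated only after the escape sequence has been
stored, which is what keeps the shift state unchanged in the
@code{scm_mb_iconv_too_big} case.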

Here is the function you call to register a new encoding with the
conversion system:

@deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
Add the encoding described by @code{*@var{encoding}} to the set
understood by @code{scm_mb_iconv_open}. Once you have registered your
encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
the names in @code{@var{encoding}->names}.
@end deftypefn
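
The text describes registration as adding to a linked list through the
@code{next} member, and lookup as a case-insensitive match against each
@code{names} array. One way such a registry could work is sketched
below. It is not Guile's implementation: @code{struct encoding} is a
reduced stand-in holding only the two members the registry touches, and
@code{register_encoding}, @code{find_encoding} and @code{names_equal}
are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>

/* Reduced stand-in for struct scm_mb_encoding. */
struct encoding
{
  char **names;                        /* terminated by a zero pointer */
  struct encoding *next;               /* filled in by registration */
};

static struct encoding *registry = NULL;

/* Compare two encoding names, ignoring case. */
static int
names_equal (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      ++a;
      ++b;
    }
  return *a == *b;                     /* equal only if both strings ended */
}

/* Push ENC onto the front of the list of known encodings. */
void
register_encoding (struct encoding *enc)
{
  enc->next = registry;
  registry = enc;
}

/* Return the first registered encoding with NAME among its names,
   or a null pointer if there is none.  */
struct encoding *
find_encoding (const char *name)
{
  struct encoding *e;
  char **n;

  for (e = registry; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; ++n)
      if (names_equal (*n, name))
        return e;
  return NULL;
}
```

Pushing onto the front of the list means a later registration shadows
an earlier one with the same name; whether Guile would behave that way
is not stated in the text.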


@node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
@section Multibyte Text Processing Errors

This section describes error conditions which code can signal to
indicate problems encountered while processing multibyte text. In each
case, the arguments @var{message} and @var{args} are an error format
string and arguments to be substituted into the string, as accepted by
the @code{display-error} function.

@deffn Condition text:not-char-boundary func message args object offset
By calling @var{func}, the program attempted to access a character at
byte offset @var{offset} in the Guile object @var{object}, but
@var{offset} is not the start of a character's encoding in @var{object}.

Typically, @var{object} is a string or symbol. If the function signalling
the error cannot find the Guile object that contains the text it is
inspecting, it should use @code{#f} for @var{object}.
@end deffn

@deffn Condition text:bad-encoding func message args object
By calling @var{func}, the program attempted to interpret the text in
@var{object}, but @var{object} contains a byte sequence which is not a
valid encoding for any character.
@end deffn

@deffn Condition text:not-guile-char func message args number
By calling @var{func}, the program attempted to treat @var{number} as the
number of a character in the Guile character set, but @var{number} does
not correspond to any character in the Guile character set.
@end deffn

@deffn Condition text:unknown-conversion func message args from to
By calling @var{func}, the program attempted to convert from an encoding
named @var{from} to an encoding named @var{to}, but Guile does not
support such a conversion.
@end deffn

@deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
@deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
@deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
These variables hold the Scheme symbol objects whose names are the
condition symbols above. You can use these when signalling these
errors, instead of looking them up yourself.
@end deftypevr


@node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C
@section Why Guile Does Not Use a Fixed-Width Encoding

Multibyte encodings are clumsier to work with than encodings which use a
fixed number of bytes for every character. For example, using a
fixed-width encoding, we can extract the @var{i}th character of a string
in constant time, and we can always substitute the @var{i}th character
of a string with any other character without reallocating or copying the
string.

However, there are no fixed-width encodings which include the characters
we wish to include, and also fit in a reasonable amount of space.
Despite the Unicode standard's claims to the contrary, Unicode is not
really a fixed-width encoding. Unicode uses surrogate pairs to
represent characters outside the 16-bit range; a surrogate pair must be
treated as a single character, but occupies two 16-bit spaces. As of
this writing, there are already plans to assign characters to the
surrogate character codes. Three- and four-byte encodings are
too wasteful for a majority of Guile's users, who only need @sc{ASCII}
and a few accented characters.

Another alternative would be to have several different fixed-width
string representations, each with a different element size. For each
string, Guile would use the smallest element size capable of
accommodating the string's text. This would allow users of English and
the Western European languages to use the traditional memory-efficient
encodings. However, if Guile has @var{n} string representations, then
users must write @var{n} versions of any code which manipulates text
directly --- one for each element size. And if a user wants to operate
on two strings simultaneously, and wants to avoid testing the string
sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
Most users will simply not bother. Instead, they will write code which
supports only one string size, leaving us back where we started. By
using a single internal representation, Guile makes it easier for users
to write multilingual code.

[[What about tagging each string with its encoding?
"Every extension must be written to deal with every encoding"]]

[[You don't really want to index strings anyway.]]

Finally, Guile's multibyte encoding is not so bad. Unlike a two- or
four-byte encoding, it is efficient in space for American and European
users. Furthermore, the properties described above mean that many
functions can be coded just as they would for a single-byte encoding;
see @ref{Promised Properties of the Guile Multibyte Encoding}.

@bye