mirror of https://git.savannah.gnu.org/git/guile.git synced 2025-05-21 20:20:24 +02:00

Update docs and docstrings for Unicode characters

* doc/ref/api-data.texi: more info about characters and codepoints

* libguile/chars.c: replace 'code point' with 'Unicode code point' in
  docstrings
Michael Gran 2009-09-03 08:48:23 -07:00
parent ba8477ecce
commit bb15a36c25
2 changed files with 85 additions and 42 deletions


@ -1782,22 +1782,57 @@ another manual.
In Scheme, there is a data type to describe a single character. In Scheme, there is a data type to describe a single character.
Defining what exactly a character @emph{is} can be more complicated Defining what exactly a character @emph{is} can be more complicated
than it seems. Guile follows the advice of R6RS and just uses The than it seems. Guile follows the advice of R6RS and uses The Unicode
Unicode Standard to help define what a character is. So, for Guile, Standard to help define what a character is. So, for Guile, a
a character is anything in the Unicode Character Database. character is anything in the Unicode Character Database.
Unicode assigns each character an unique integer representation: a @cindex code point
@emph{code point}. Guile uses Unicode code points as the integer @cindex Unicode code point
representation of characters. Valid code points are in the ranges 0
to @code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF} The Unicode Character Database is basically a table of characters
inclusive. indexed using integers called 'code points'. Valid code points are in
the ranges 0 to @code{#xD7FF} inclusive or @code{#xE000} to
@code{#x10FFFF} inclusive, which is about 1.1 million code points.
@cindex designated code point
@cindex code point, designated
Any code point that has been assigned to a character or that has
otherwise been given a meaning by Unicode is called a 'designated code
point'. Most of the designated code points, about 200,000 of them,
indicate characters, accents or other combining marks that modify
other characters, symbols, whitespace, and control characters. Some
are not characters but indicators that suggest how to format or
display neighboring characters.
@cindex reserved code point
@cindex code point, reserved
If a code point is not a designated code point -- if it has not been
assigned to a character by The Unicode Standard -- it is a 'reserved
code point', meaning that they are reserved for future use. Most of
the code points, about 800,000, are 'reserved code points'.
By convention, a Unicode code point is written as
``U+XXXX'' where ``XXXX'' is a hexadecimal number. Please note that
this convenient notation is not valid code. Guile does not interpret
``U+XXXX'' as a character.
In Scheme, a character literal is written as @code{#\@var{name}} where In Scheme, a character literal is written as @code{#\@var{name}} where
@var{name} is the name of the character that you want. Printable @var{name} is the name of the character that you want. Printable
characters have their usual single character name; for example, characters have their usual single character name; for example,
@code{#\a} is a lower case @code{a}. Many of the non-printing @code{#\a} is a lower case @code{a}.
characters, such as whitespace characters and control characters, also
have names. Some of the code points are 'combining characters' that are not meant
to be printed by themselves but are instead meant to modify the
appearance of the previous character. For combining characters, an
alternate form of the character literal is @code{#\} followed by
U+25CC (a small, dotted circle), followed by the combining character.
This allows the combining character to be drawn on the circle, not on
the backslash of @code{#\}.
Many of the non-printing characters, such as whitespace characters and
control characters, also have names.
The most commonly used non-printing chararacters are space and The most commonly used non-printing chararacters are space and
newline. Their character names are @code{#\space} and newline. Their character names are @code{#\space} and
@ -1841,7 +1876,7 @@ describes the names for each character.
@item 32 = @code{#\sp} @item 32 = @code{#\sp}
@end multitable @end multitable
The ``delete'' character (code point 127) may be referred to with the The ``delete'' character (code point U+007F) may be referred to with the
name @code{#\del}. name @code{#\del}.
One might note that the space character has two names -- One might note that the space character has two names --
@ -1862,8 +1897,9 @@ sake of compatibility with previous versions.
@item @code{#\null} @tab @code{#\nul} @item @code{#\null} @tab @code{#\nul}
@end multitable @end multitable
Characters may also be referred to with an octal value, such as Characters may also be written using their code point values. They can
@code{#\10} for @code{#\bs} or @code{#\177} for @code{#\del}. be written with as an octal number, such as @code{#\10} for
@code{#\bs} or @code{#\177} for @code{#\del}.
@rnindex char? @rnindex char?
@deffn {Scheme Procedure} char? x @deffn {Scheme Procedure} char? x
@ -1871,7 +1907,7 @@ Characters may also be referred to with an octal value, such as
Return @code{#t} iff @var{x} is a character, else @code{#f}. Return @code{#t} iff @var{x} is a character, else @code{#f}.
@end deffn @end deffn
Fundamentally, the character comparisons operations below are Fundamentally, the character comparison operations below are
numeric comparisons of the character's code points. numeric comparisons of the character's code points.
@rnindex char=? @rnindex char=?
@ -1904,12 +1940,17 @@ Return @code{#t} iff the code point of @var{x} is greater than or
equal to the code point of @var{y}, else @code{#f}. equal to the code point of @var{y}, else @code{#f}.
@end deffn @end deffn
Case-insensitive character comparisons of characters use @emph{Unicode @cindex case folding
case folding}. In case folding comparisons, if a character is
lowercase and has an uppercase form that can be expressed as a single Case-insensitive character comparisons use @emph{Unicode case
character, it is converted to uppercase before comparison. Unicode folding}. In case folding comparisons, if a character is lowercase
case folding is language independent: it uses rules that are generally and has an uppercase form that can be expressed as a single character,
true, but, it cannot cover all cases for all languages. it is converted to uppercase before comparison. All other characters
undergo no conversion before the comparison occurs. This includes the
German sharp S (Eszett) which is not uppercased before conversion
because its uppercase form has two characters. Unicode case folding
is language independent: it uses rules that are generally true, but,
it cannot cover all cases for all languages.
@rnindex char-ci=? @rnindex char-ci=?
@deffn {Scheme Procedure} char-ci=? x y @deffn {Scheme Procedure} char-ci=? x y
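
Note on the api-data.texi hunks above: the new text introduces the ``U+XXXX'' notation and restates that a character may be written with an octal code point value (@code{#\10} for @code{#\bs}, @code{#\177} for @code{#\del}). The following is a minimal standalone C sketch of both conversions. The helper names format_code_point and octal_digits_to_code_point are hypothetical; they are not part of Guile or libguile and only illustrate the arithmetic.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not a libguile function): write the conventional
   "U+XXXX" spelling of code point CP into BUF.  At least four hex digits
   are printed, so 127 becomes "U+007F".  Returns snprintf's result.  */
static int
format_code_point (char *buf, size_t size, unsigned long cp)
{
  return snprintf (buf, size, "U+%04lX", cp);
}

/* Hypothetical helper (not a libguile function): interpret DIGITS, the
   part of an octal character literal after "#\", as a base-8 number.  */
static unsigned long
octal_digits_to_code_point (const char *digits)
{
  return strtoul (digits, NULL, 8);
}
```

Under these assumptions, the digits "177" give 127, which format_code_point spells as U+007F, the value the updated text uses for the ``delete'' character; the digits "10" give 8, the backspace character.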


@ -45,8 +45,8 @@ SCM_DEFINE (scm_char_p, "char?", 1, 0, 0,
SCM_DEFINE1 (scm_char_eq_p, "char=?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_eq_p, "char=?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff code point of @var{x} is equal to the code point\n" "Return @code{#t} if the Unicode code point of @var{x} is equal to the\n"
"of @var{y}, else @code{#f}.\n") "code point of @var{y}, else @code{#f}.\n")
#define FUNC_NAME s_scm_char_eq_p #define FUNC_NAME s_scm_char_eq_p
{ {
SCM_VALIDATE_CHAR (1, x); SCM_VALIDATE_CHAR (1, x);
@ -70,8 +70,8 @@ SCM_DEFINE1 (scm_char_less_p, "char<?", scm_tc7_rpsubr,
SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff the code point of @var{x} is less than or equal\n" "Return @code{#t} if the Unicode code point of @var{x} is less than or\n"
"to the code point of @var{y}, else @code{#f}.") "equal to the code point of @var{y}, else @code{#f}.")
#define FUNC_NAME s_scm_char_leq_p #define FUNC_NAME s_scm_char_leq_p
{ {
SCM_VALIDATE_CHAR (1, x); SCM_VALIDATE_CHAR (1, x);
@ -82,8 +82,8 @@ SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,
SCM_DEFINE1 (scm_char_gr_p, "char>?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_gr_p, "char>?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff the code point of @var{x} is greater than the\n" "Return @code{#t} if the Unicode code point of @var{x} is greater than\n"
"code point of @var{y}, else @code{#f}.") "the code point of @var{y}, else @code{#f}.")
#define FUNC_NAME s_scm_char_gr_p #define FUNC_NAME s_scm_char_gr_p
{ {
SCM_VALIDATE_CHAR (1, x); SCM_VALIDATE_CHAR (1, x);
@ -94,8 +94,8 @@ SCM_DEFINE1 (scm_char_gr_p, "char>?", scm_tc7_rpsubr,
SCM_DEFINE1 (scm_char_geq_p, "char>=?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_geq_p, "char>=?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff the code point of @var{x} is greater than or\n" "Return @code{#t} if the Unicode code point of @var{x} is greater than\n"
"equal to the code point of @var{y}, else @code{#f}.") "or equal to the code point of @var{y}, else @code{#f}.")
#define FUNC_NAME s_scm_char_geq_p #define FUNC_NAME s_scm_char_geq_p
{ {
SCM_VALIDATE_CHAR (1, x); SCM_VALIDATE_CHAR (1, x);
@ -113,8 +113,8 @@ SCM_DEFINE1 (scm_char_geq_p, "char>=?", scm_tc7_rpsubr,
SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff the case-folded code point of @var{x} is the same\n" "Return @code{#t} if the case-folded Unicode code point of @var{x} is\n"
"as the case-folded code point of @var{y}, else @code{#f}.") "the same as the case-folded code point of @var{y}, else @code{#f}.")
#define FUNC_NAME s_scm_char_ci_eq_p #define FUNC_NAME s_scm_char_ci_eq_p
{ {
SCM_VALIDATE_CHAR (1, x); SCM_VALIDATE_CHAR (1, x);
@ -125,8 +125,8 @@ SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,
SCM_DEFINE1 (scm_char_ci_less_p, "char-ci<?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_ci_less_p, "char-ci<?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff the case-folded code point of @var{x} is less\n" "Return @code{#t} if the case-folded Unicode code point of @var{x} is\n"
"than the case-folded code point of @var{y}, else @code{#f}.") "less than the case-folded code point of @var{y}, else @code{#f}.")
#define FUNC_NAME s_scm_char_ci_less_p #define FUNC_NAME s_scm_char_ci_less_p
{ {
SCM_VALIDATE_CHAR (1, x); SCM_VALIDATE_CHAR (1, x);
@ -137,8 +137,8 @@ SCM_DEFINE1 (scm_char_ci_less_p, "char-ci<?", scm_tc7_rpsubr,
SCM_DEFINE1 (scm_char_ci_leq_p, "char-ci<=?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_ci_leq_p, "char-ci<=?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff the case-folded code point of @var{x} is less\n" "Return @code{#t} iff the case-folded Unicodd code point of @var{x} is\n"
"than or equal to the case-folded code point of @var{y}, else\n" "less than or equal to the case-folded code point of @var{y}, else\n"
"@code{#f}") "@code{#f}")
#define FUNC_NAME s_scm_char_ci_leq_p #define FUNC_NAME s_scm_char_ci_leq_p
{ {
@ -162,8 +162,8 @@ SCM_DEFINE1 (scm_char_ci_gr_p, "char-ci>?", scm_tc7_rpsubr,
SCM_DEFINE1 (scm_char_ci_geq_p, "char-ci>=?", scm_tc7_rpsubr, SCM_DEFINE1 (scm_char_ci_geq_p, "char-ci>=?", scm_tc7_rpsubr,
(SCM x, SCM y), (SCM x, SCM y),
"Return @code{#t} iff the case-folded code point of @var{x} is greater\n" "Return @code{#t} iff the case-folded Unicode code point of @var{x} is\n"
"than or equal to the case-folded code point of @var{y}, else\n" "greater than or equal to the case-folded code point of @var{y}, else\n"
"@code{#f}.") "@code{#f}.")
#define FUNC_NAME s_scm_char_ci_geq_p #define FUNC_NAME s_scm_char_ci_geq_p
{ {
@ -224,7 +224,8 @@ SCM_DEFINE (scm_char_lower_case_p, "char-lower-case?", 1, 0, 0,
SCM_DEFINE (scm_char_is_both_p, "char-is-both?", 1, 0, 0, SCM_DEFINE (scm_char_is_both_p, "char-is-both?", 1, 0, 0,
(SCM chr), (SCM chr),
"Return @code{#t} iff @var{chr} is either uppercase or lowercase, else @code{#f}.\n") "Return @code{#t} iff @var{chr} is either uppercase or lowercase, else\n"
"@code{#f}.\n")
#define FUNC_NAME s_scm_char_is_both_p #define FUNC_NAME s_scm_char_is_both_p
{ {
if (scm_is_true (scm_char_set_contains_p (scm_char_set_lower_case, chr))) if (scm_is_true (scm_char_set_contains_p (scm_char_set_lower_case, chr)))
@ -236,7 +237,7 @@ SCM_DEFINE (scm_char_is_both_p, "char-is-both?", 1, 0, 0,
SCM_DEFINE (scm_char_to_integer, "char->integer", 1, 0, 0, SCM_DEFINE (scm_char_to_integer, "char->integer", 1, 0, 0,
(SCM chr), (SCM chr),
"Return the code point of @var{chr}.") "Return the Unicode code point of @var{chr}.")
#define FUNC_NAME s_scm_char_to_integer #define FUNC_NAME s_scm_char_to_integer
{ {
SCM_VALIDATE_CHAR (1, chr); SCM_VALIDATE_CHAR (1, chr);
@ -247,9 +248,10 @@ SCM_DEFINE (scm_char_to_integer, "char->integer", 1, 0, 0,
SCM_DEFINE (scm_integer_to_char, "integer->char", 1, 0, 0, SCM_DEFINE (scm_integer_to_char, "integer->char", 1, 0, 0,
(SCM n), (SCM n),
"Return the character that has code point @var{n}. The integer @var{n}\n" "Return the character that has Unicode code point @var{n}. The integer\n"
"must be a valid code point. Valid code points are in the ranges 0 to\n" "@var{n} must be a valid code point. Valid code points are in the\n"
"@code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF} inclusive.") "ranges 0 to @code{#xD7FF} inclusive or @code{#xE000} to\n"
"@code{#x10FFFF} inclusive.")
#define FUNC_NAME s_scm_integer_to_char #define FUNC_NAME s_scm_integer_to_char
{ {
scm_t_wchar cn; scm_t_wchar cn;
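
Note on the integer->char docstring above: it states that valid code points lie in 0 to #xD7FF inclusive or #xE000 to #x10FFFF inclusive. The following is a minimal standalone C sketch of that range check. The names is_valid_code_point and valid_code_point_count are hypothetical and are not taken from libguile; the body of scm_integer_to_char is not shown in this diff, so this may differ from how chars.c performs the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a libguile function): true when N lies in one
   of the two valid ranges quoted in the docstring.  The gap between
   0xD7FF and 0xE000 is excluded.  */
static bool
is_valid_code_point (uint32_t n)
{
  return n <= 0xD7FF || (n >= 0xE000 && n <= 0x10FFFF);
}

/* Hypothetical helper: how many code points the two ranges contain,
   computed directly from their inclusive bounds.  */
static uint32_t
valid_code_point_count (void)
{
  return (0xD7FF - 0x0000 + 1) + (0x10FFFF - 0xE000 + 1);
}
```

The count works out to 1,112,064, which agrees with the "about 1.1 million code points" figure added to api-data.texi in this commit.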