Update docs and docstrings for Unicode characters

* doc/ref/api-data.texi: more info about characters and codepoints
* libguile/chars.c: replace 'code point' with 'Unicode code point' in docstrings
2025-07-06 01:30:22 +02:00 · 2009-09-03 08:48:23 -07:00 · 2009-09-03 08:48:23 -07:00 · bb15a36c25
commit bb15a36c25
parent ba8477ecce
2 changed files with 85 additions and 42 deletions
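
Note on the code point ranges this commit documents: the valid ranges (0 to #xD7FF and #xE000 to #x10FFFF) and the "about 1.1 million" figure can be checked with a small C sketch. The helper names below are hypothetical and are not part of libguile; this only illustrates the arithmetic and is not how integer->char is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of libguile. */

/* True if n lies in one of the two valid ranges given in the docs:
   0..#xD7FF or #xE000..#x10FFFF.  The gap between them is the
   surrogate range, which does not hold characters.  */
static bool
is_valid_code_point (uint32_t n)
{
  return n <= 0xD7FF || (n >= 0xE000 && n <= 0x10FFFF);
}

/* Number of valid code points: the sizes of both ranges added up.  */
static uint32_t
count_valid_code_points (void)
{
  return (0xD7FF - 0x0000 + 1) + (0x10FFFF - 0xE000 + 1);
}
```

The count works out to 1,112,064, which matches the "about 1.1 million code points" wording added to api-data.texi.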
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -1782,22 +1782,57 @@ another manual.
 In Scheme, there is a data type to describe a single character.  

 Defining what exactly a character @emph{is} can be more complicated
-than it seems.  Guile follows the advice of R6RS and just uses The
-Unicode Standard to help define what a character is.  So, for Guile,
-a character is anything in the Unicode Character Database.
+than it seems.  Guile follows the advice of R6RS and uses The Unicode
+Standard to help define what a character is.  So, for Guile, a
+character is anything in the Unicode Character Database.

-Unicode assigns each character an unique integer representation: a
-@emph{code point}.  Guile uses Unicode code points as the integer
-representation of characters.  Valid code points are in the ranges 0
-to @code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF}
-inclusive.
+@cindex code point
+@cindex Unicode code point
+
+The Unicode Character Database is basically a table of characters
+indexed using integers called 'code points'.  Valid code points are in
+the ranges 0 to @code{#xD7FF} inclusive or @code{#xE000} to
+@code{#x10FFFF} inclusive, which is about 1.1 million code points.
+
+@cindex designated code point
+@cindex code point, designated
+
+Any code point that has been assigned to a character or that has
+otherwise been given a meaning by Unicode is called a 'designated code
+point'.  Most of the designated code points, about 200,000 of them,
+indicate characters, symbols, whitespace, control characters, and
+accents or other combining marks that modify other characters.  Some
+are not characters but indicators that suggest how to format or
+display neighboring characters.
+
+@cindex reserved code point
+@cindex code point, reserved
+
+If a code point is not a designated code point -- if it has not been
+assigned to a character by The Unicode Standard -- it is a 'reserved
+code point', meaning that it is reserved for future use.  Most of
+the code points, about 800,000, are 'reserved code points'.
+
+By convention, a Unicode code point is written as
+``U+XXXX'' where ``XXXX'' is a hexadecimal number.  Please note that
+this convenient notation is not valid code.  Guile does not interpret
+``U+XXXX'' as a character.

 In Scheme, a character literal is written as @code{#\@var{name}} where
 @var{name} is the name of the character that you want.  Printable
 characters have their usual single character name; for example,
-@code{#\a} is a lower case @code{a}.  Many of the non-printing
-characters, such as whitespace characters and control characters, also
-have names.
+@code{#\a} is a lower case @code{a}.  
+
+Some of the code points are 'combining characters' that are not meant
+to be printed by themselves but are instead meant to modify the
+appearance of the previous character.  For combining characters, an
+alternate form of the character literal is @code{#\} followed by
+U+25CC (a small, dotted circle), followed by the combining character.
+This allows the combining character to be drawn on the circle, not on
+the backslash of @code{#\}.
+
+Many of the non-printing characters, such as whitespace characters and
+control characters, also have names.

 The most commonly used non-printing characters are space and
 newline.  Their character names are @code{#\space} and
@@ -1841,7 +1876,7 @@ describes the names for each character.
 @item 32 = @code{#\sp}
 @end multitable

-The ``delete'' character (code point 127) may be referred to with the
+The ``delete'' character (code point U+007F) may be referred to with the
 name @code{#\del}.

 One might note that the space character has two names --
@@ -1862,8 +1897,9 @@ sake of compatibility with previous versions.
 @item @code{#\null} @tab @code{#\nul}
 @end multitable

-Characters may also be referred to with an octal value, such as
-@code{#\10} for @code{#\bs} or @code{#\177} for @code{#\del}.
+Characters may also be written using their code point values.  They can
+be written as an octal number, such as @code{#\10} for
+@code{#\bs} or @code{#\177} for @code{#\del}.

 @rnindex char?
 @deffn {Scheme Procedure} char? x
@@ -1871,7 +1907,7 @@ Characters may also be referred to with an octal value, such as
 Return @code{#t} iff @var{x} is a character, else @code{#f}.
 @end deffn

-Fundamentally, the character comparisons operations below are
+Fundamentally, the character comparison operations below are
 numeric comparisons of the character's code points.

 @rnindex char=?
@@ -1904,12 +1940,17 @@ Return @code{#t} iff the code point of @var{x} is greater than or
 equal to the code point of @var{y}, else @code{#f}.
@end deffn

-Case-insensitive character comparisons of characters use @emph{Unicode
-case folding}.  In case folding comparisons, if a character is
-lowercase and has an uppercase form that can be expressed as a single
-character, it is converted to uppercase before comparison.  Unicode
-case folding is language independent: it uses rules that are generally
-true, but, it cannot cover all cases for all languages.
+@cindex case folding
+
+Case-insensitive character comparisons use @emph{Unicode case
+folding}.  In case folding comparisons, if a character is lowercase
+and has an uppercase form that can be expressed as a single character,
+it is converted to uppercase before comparison.  All other characters
+undergo no conversion before the comparison occurs.  This includes the
+German sharp S (Eszett), which is not uppercased before comparison
+because its uppercase form has two characters.  Unicode case folding
+is language independent: it uses rules that are generally true, but
+it cannot cover all cases for all languages.

 @rnindex char-ci=?
 @deffn {Scheme Procedure} char-ci=? x y
--- a/libguile/chars.c
+++ b/libguile/chars.c
@@ -45,8 +45,8 @@ SCM_DEFINE (scm_char_p, "char?", 1, 0, 0,

 SCM_DEFINE1 (scm_char_eq_p, "char=?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff code point of @var{x} is equal to the code point\n"
-             "of @var{y}, else @code{#f}.\n")
+             "Return @code{#t} if the Unicode code point of @var{x} is equal to the\n"
+             "code point of @var{y}, else @code{#f}.\n")
 #define FUNC_NAME s_scm_char_eq_p
 {
  SCM_VALIDATE_CHAR (1, x);
@@ -70,8 +70,8 @@ SCM_DEFINE1 (scm_char_less_p, "char<?", scm_tc7_rpsubr,

 SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff the code point of @var{x} is less than or equal\n"
-             "to the code point of @var{y}, else @code{#f}.")
+             "Return @code{#t} if the Unicode code point of @var{x} is less than or\n"
+             "equal to the code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_leq_p
 {
  SCM_VALIDATE_CHAR (1, x);
@@ -82,8 +82,8 @@ SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,

 SCM_DEFINE1 (scm_char_gr_p, "char>?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff the code point of @var{x} is greater than the\n"
-             "code point of @var{y}, else @code{#f}.")
+             "Return @code{#t} if the Unicode code point of @var{x} is greater than\n"
+             "the code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_gr_p
 {
  SCM_VALIDATE_CHAR (1, x);
@@ -94,8 +94,8 @@ SCM_DEFINE1 (scm_char_gr_p, "char>?", scm_tc7_rpsubr,

 SCM_DEFINE1 (scm_char_geq_p, "char>=?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff the code point of @var{x} is greater than or\n"
-             "equal to the code point of @var{y}, else @code{#f}.")
+             "Return @code{#t} if the Unicode code point of @var{x} is greater than\n"
+             "or equal to the code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_geq_p
 {
  SCM_VALIDATE_CHAR (1, x);
@@ -113,8 +113,8 @@ SCM_DEFINE1 (scm_char_geq_p, "char>=?", scm_tc7_rpsubr,

 SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff the case-folded code point of @var{x} is the same\n"
-             "as the case-folded code point of @var{y}, else @code{#f}.")
+             "Return @code{#t} if the case-folded Unicode code point of @var{x} is\n"
+             "the same as the case-folded code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_ci_eq_p
 {
  SCM_VALIDATE_CHAR (1, x);
@@ -125,8 +125,8 @@ SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,

 SCM_DEFINE1 (scm_char_ci_less_p, "char-ci<?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff the case-folded code point of @var{x} is less\n"
-             "than the case-folded code point of @var{y}, else @code{#f}.")
+             "Return @code{#t} if the case-folded Unicode code point of @var{x} is\n"
+             "less than the case-folded code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_ci_less_p
 {
  SCM_VALIDATE_CHAR (1, x);
@@ -137,8 +137,8 @@ SCM_DEFINE1 (scm_char_ci_less_p, "char-ci<?", scm_tc7_rpsubr,

 SCM_DEFINE1 (scm_char_ci_leq_p, "char-ci<=?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff the case-folded code point of @var{x} is less\n"
-             "than or equal to the case-folded code point of @var{y}, else\n"
+             "Return @code{#t} iff the case-folded Unicode code point of @var{x} is\n"
+             "less than or equal to the case-folded code point of @var{y}, else\n"
             "@code{#f}")
 #define FUNC_NAME s_scm_char_ci_leq_p
 {
@@ -162,8 +162,8 @@ SCM_DEFINE1 (scm_char_ci_gr_p, "char-ci>?", scm_tc7_rpsubr,

 SCM_DEFINE1 (scm_char_ci_geq_p, "char-ci>=?", scm_tc7_rpsubr,
             (SCM x, SCM y),
-             "Return @code{#t} iff the case-folded code point of @var{x} is greater\n"
-             "than or equal to the case-folded code point of @var{y}, else\n"
+             "Return @code{#t} iff the case-folded Unicode code point of @var{x} is\n"
+             "greater than or equal to the case-folded code point of @var{y}, else\n"
             "@code{#f}.")
 #define FUNC_NAME s_scm_char_ci_geq_p
 {
@@ -224,7 +224,8 @@ SCM_DEFINE (scm_char_lower_case_p, "char-lower-case?", 1, 0, 0,

 SCM_DEFINE (scm_char_is_both_p, "char-is-both?", 1, 0, 0, 
            (SCM chr),
-	    "Return @code{#t} iff @var{chr} is either uppercase or lowercase, else @code{#f}.\n")
+	    "Return @code{#t} iff @var{chr} is either uppercase or lowercase, else\n"
+            "@code{#f}.\n")
 #define FUNC_NAME s_scm_char_is_both_p
 {
  if (scm_is_true (scm_char_set_contains_p (scm_char_set_lower_case, chr)))
@@ -236,7 +237,7 @@ SCM_DEFINE (scm_char_is_both_p, "char-is-both?", 1, 0, 0,

 SCM_DEFINE (scm_char_to_integer, "char->integer", 1, 0, 0, 
            (SCM chr),
-            "Return the code point of @var{chr}.")
+            "Return the Unicode code point of @var{chr}.")
 #define FUNC_NAME s_scm_char_to_integer
 {
  SCM_VALIDATE_CHAR (1, chr);
@@ -247,9 +248,10 @@ SCM_DEFINE (scm_char_to_integer, "char->integer", 1, 0, 0,

 SCM_DEFINE (scm_integer_to_char, "integer->char", 1, 0, 0, 
           (SCM n),
-            "Return the character that has code point @var{n}.  The integer @var{n}\n"
-            "must be a valid code point.  Valid code points are in the ranges 0 to\n"
-            "@code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF} inclusive.")
+            "Return the character that has Unicode code point @var{n}.  The integer\n"
+            "@var{n} must be a valid code point.  Valid code points are in the\n"
+            "ranges 0 to @code{#xD7FF} inclusive or @code{#xE000} to\n"
+            "@code{#x10FFFF} inclusive.")
 #define FUNC_NAME s_scm_integer_to_char
 {
  scm_t_wchar cn;