diff --git a/devel/strings/sharedstr.text b/devel/strings/sharedstr.text new file mode 100644 index 000000000..f568c3ebf --- /dev/null +++ b/devel/strings/sharedstr.text @@ -0,0 +1,143 @@ +Implementation of shared substrings with fresh-copy semantics +============================================================= + +Version: $Id: sharedstr.text,v 1.1 2000-08-26 20:55:21 mdj Exp $ + +Background +---------- + +In Guile, most string operations work on two other data types apart +from strings: shared substrings and read-only strings (which includes +symbols). One of Guile's sub-goals is to be a scripting language in +which string management is important. Read-only strings and shared +substrings were introduced in order to reduce overhead in string +manipulation. + +We now want to simplify the Guile API by removing these two data +types, but keeping performance by allowing ordinary strings to share +storage. + +The idea is to let operations like `symbol->string' and `substring' +return a pointer into the original string/symbol, thus avoiding the +need to copy the string. + +Two of the problems which then arise are: + +* If s2 is derived from s1, and therefore share storage with s1, a + modification to either s1 or s2 will affect the other. + +* Guile is supposed to interact closely with many UNIX libraries in + which the NUL character is used to terminate strings. Therefore + Guile strings contain a NUL character at the end, in addition to the + string length (the latter of which is used by Guile's string + operations). + +The solutions to these problems are to + +* Copy a string with shared storage when it's modified. + +* Copy a string with shared storage when it's being used as argument + to a C library call. (Copying implies inserting an ending NUL + character.) + +But this leads to memory management problems. When is it OK to free +a character array which was allocated for a symbol or a string? + +Abstract description of proposed solution +----------------------------------------- + +Definitions + + STRING = + + SYMBOL = + + CHARRECORD = + + PHASE = black | white + + SHAREDFLAG = private | shared + + CHARS is a character array + + CHARPTR points into it + +Memory management + +A string or symbol is initially allocated with its contents stored in +a character array in a character record. The string/symbol header +contains a pointer to this record. The initial value of the shared +flag in the character record is `private'. + +The GC mark phases alternate between black and white---every second +phase is black, the rest are white. This is used to distinguish +whether a character record has been encountered before: + +During a black mark phase, when the GC encounters a string or symbol, +it changes the PHASE and SHAREDFLAG marks of the corresponding +character record according to the following table: + + --> (white => unconditionally + --> set to ) + --> (SHAREDFLAG changed) + --> (no change) + +The behaviour of a white phase is quivalent with the color names +switched. + +The GC sweep phase frees any unmarked string or symbol header and +frees its character record either if it is marked with the "wrong" +color (not matching the color of the last mark phase) or if its +SHAREDFLAG is `private'. + +Copy-on-write + +An attempt at mutating string contents leads to copying if SHAREDFLAG +is `shared'. Copying means making a copy of the character record and +mutating the CHARRECORDPTR and CHARPTR fields of the object header to +point to the copy. + +Substring operation + +When making a substring, a new string header is allocated, with new +contents for the LENGTH and CHARPTR fields. + +Implementation details +---------------------- + +* We store the character record consecutively with the character + array and lump the PHASE and SHAREDFLAG fields together into one + byte containing an integer code for the four possible states of the + PHASE and SHAREDFLAG fields. Another way of viewing it is that + these fields are represented as bits 1 and 0 in the "header" of the + character array. We let CHARRECORDPTR point to the first character + position instead of on this header: + + CHARRECORDPTR + | + V + FCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC + + F = 0, 1, 2, 3 + +* We represent strings as the sub-types `simple-string' and + `substring'. + +* In a simple string, CHARRECORDPTR and CHARPTR are represented by a + single pointer, so a `simple-string' is an ordinary heap cell with + TYPETAG and LENGTH in the CAR and CHARPTR in the CDR. + +* substring:s are represented as double cells, with TYPETAG and LENGTH + in word 0, CHARRECORDPTR in word 1 and CHARPTR in word 2 + (alternatively, we could store an offset from CHARRECORDPTR). + +Problems with this implementation +--------------------------------- + +* How do we make copy-on-write thread-safe? Is there a different + implementation which is efficient and thread-safe? + +* If small substrings are frequently generated from large, temporary + strings and the small substrings are kept in a data structure, the + heap will still have to host the large original strings. Should we + simply accept this?