mirror of
https://git.savannah.gnu.org/git/guile.git
synced 2025-05-01 20:30:28 +02:00
* strings/sharedstr.text (sharedstr.text): New file.
parent
92ccc1f1f3
commit
9b5477f6f6
1 changed files with 143 additions and 0 deletions
143
devel/strings/sharedstr.text
Normal file
@ -0,0 +1,143 @@
Implementation of shared substrings with fresh-copy semantics
=============================================================

Version: $Id: sharedstr.text,v 1.1 2000-08-26 20:55:21 mdj Exp $

Background
----------

In Guile, most string operations work on two other data types apart
from strings: shared substrings and read-only strings (which include
symbols).  One of Guile's sub-goals is to serve as a scripting
language, a setting in which string management is important.
Read-only strings and shared substrings were introduced in order to
reduce overhead in string manipulation.

We now want to simplify the Guile API by removing these two data
types, while preserving performance by allowing ordinary strings to
share storage.

The idea is to let operations like `symbol->string' and `substring'
return a pointer into the original string/symbol, thus avoiding the
need to copy the string.

Two of the problems which then arise are:

* If s2 is derived from s1, and therefore shares storage with s1, a
  modification to either s1 or s2 will affect the other.

* Guile is supposed to interact closely with many UNIX libraries in
  which the NUL character is used to terminate strings.  Therefore
  Guile strings contain a NUL character at the end, in addition to the
  string length (the latter of which is used by Guile's string
  operations).

The solutions to these problems are to

* Copy a string with shared storage when it is modified.

* Copy a string with shared storage when it is used as an argument
  to a C library call.  (Copying implies inserting an ending NUL
  character.)

But this leads to memory management problems.  When is it OK to free
a character array which was allocated for a symbol or a string?

Abstract description of proposed solution
-----------------------------------------

Definitions

  STRING = <TYPETAG, LENGTH, CHARRECORDPTR, CHARPTR>

  SYMBOL = <TYPETAG, LENGTH, CHARRECORDPTR, CHARPTR>

  CHARRECORD = <PHASE, SHAREDFLAG, CHARS>

  PHASE = black | white

  SHAREDFLAG = private | shared

  CHARS is a character array

  CHARPTR points into it

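As a reading aid, the definitions above might be written down in C
roughly as follows.  This is only a sketch: all type, field and
function names are hypothetical and are not Guile identifiers, and the
flexible array member assumes C99.  The helper header_in_bounds is an
illustration of "CHARPTR points into it", not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; none of these are Guile identifiers. */
typedef enum { PHASE_BLACK, PHASE_WHITE } phase_t;
typedef enum { FLAG_PRIVATE, FLAG_SHARED } shared_flag_t;
typedef enum { TAG_STRING, TAG_SYMBOL } type_tag_t;

/* CHARRECORD = <PHASE, SHAREDFLAG, CHARS> */
typedef struct char_record {
    phase_t phase;
    shared_flag_t shared;
    char chars[];              /* CHARS: flexible array member (C99) */
} char_record;

/* STRING and SYMBOL have the same shape:
   <TYPETAG, LENGTH, CHARRECORDPTR, CHARPTR> */
typedef struct str_header {
    type_tag_t tag;
    size_t length;
    char_record *record;       /* CHARRECORDPTR */
    char *chars;               /* CHARPTR: points into record->chars */
} str_header;

/* Allocate a record holding a copy of `len` characters of `text`.
   The initial marks are an arbitrary choice for this sketch. */
char_record *record_new(const char *text, size_t len)
{
    char_record *r = malloc(sizeof *r + len);
    if (r == NULL)
        abort();
    r->phase = PHASE_WHITE;
    r->shared = FLAG_PRIVATE;
    memcpy(r->chars, text, len);
    return r;
}

/* Nonzero if CHARPTR and LENGTH stay inside a record that holds
   `capacity` characters. */
int header_in_bounds(const str_header *h, size_t capacity)
{
    if (h->chars < h->record->chars)
        return 0;
    size_t offset = (size_t)(h->chars - h->record->chars);
    return offset + h->length <= capacity;
}
```

Two headers that refer to the same record differ only in their LENGTH
and CHARPTR fields.
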
Memory management

A string or symbol is initially allocated with its contents stored in
a character array in a character record.  The string/symbol header
contains a pointer to this record.  The initial value of the shared
flag in the character record is `private'.

The GC mark phases alternate between black and white: every second
phase is black, the rest are white.  This makes it possible to tell
whether a character record has already been encountered during the
current mark phase.

During a black mark phase, when the GC encounters a string or symbol,
it changes the PHASE and SHAREDFLAG marks of the corresponding
character record according to the following table:

  <white, private> --> <black, private>   (white => unconditionally
  <white, shared>  --> <black, private>    set to <black, private>)
  <black, private> --> <black, shared>    (SHAREDFLAG changed)
  <black, shared>  --> <black, shared>    (no change)

The behaviour of a white phase is equivalent, with the color names
switched.

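The transition table can be expressed as one small function that
covers both black and white phases.  The sketch below uses
hypothetical names (record_marks, mark_record) and keeps PHASE and
SHAREDFLAG as two separate fields for readability.

```c
#include <assert.h>

/* Hypothetical names; PHASE and SHAREDFLAG are kept as separate
   fields here purely for readability. */
typedef enum { PHASE_BLACK, PHASE_WHITE } phase_t;
typedef enum { FLAG_PRIVATE, FLAG_SHARED } shared_flag_t;

typedef struct {
    phase_t phase;
    shared_flag_t shared;
} record_marks;

/* Called each time the GC reaches a string or symbol header whose
   character record carries the marks `m`, during a mark phase of
   color `current`. */
void mark_record(record_marks *m, phase_t current)
{
    if (m->phase != current) {
        /* First encounter in this phase: the old SHAREDFLAG is
           discarded and the record becomes <current, private>. */
        m->phase = current;
        m->shared = FLAG_PRIVATE;
    } else {
        /* Already reached through another header in this phase. */
        m->shared = FLAG_SHARED;
    }
}
```

After a complete mark phase, a record that was reached through exactly
one live header is <current, private>, and one reached through two or
more is <current, shared>.
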
The GC sweep phase frees any unmarked string or symbol header and
frees its character record either if it is marked with the "wrong"
color (not matching the color of the last mark phase) or if its
SHAREDFLAG is `private'.

Copy-on-write

An attempt at mutating string contents leads to copying if SHAREDFLAG
is `shared'.  Copying means making a copy of the character record and
mutating the CHARRECORDPTR and CHARPTR fields of the object header to
point to the copy.

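A minimal sketch of such a write barrier follows.  All names are
hypothetical.  The PHASE mark is left out, and the `size' field is an
addition made for this sketch so that the whole record can be copied;
the text above does not say how the size of a record is known.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types; PHASE is omitted because it plays no part here. */
typedef struct char_record {
    int shared;       /* SHAREDFLAG: 0 = private, 1 = shared */
    size_t size;      /* characters in chars[]; added for this sketch */
    char chars[];
} char_record;

typedef struct {
    size_t length;
    char_record *record;   /* CHARRECORDPTR */
    char *chars;           /* CHARPTR */
} str_header;

/* Allocate a private record holding a copy of `size` characters. */
char_record *record_new(const char *text, size_t size)
{
    char_record *r = malloc(sizeof *r + size);
    if (r == NULL)
        abort();
    r->shared = 0;
    r->size = size;
    memcpy(r->chars, text, size);
    return r;
}

/* Store c at index i of s, copying the record first if it is shared. */
void string_set(str_header *s, size_t i, char c)
{
    assert(i < s->length);
    if (s->record->shared) {
        /* Keep CHARPTR at the same offset inside the new record. */
        size_t offset = (size_t)(s->chars - s->record->chars);
        char_record *copy = record_new(s->record->chars, s->record->size);
        s->record = copy;
        s->chars = copy->chars + offset;
    }
    s->chars[i] = c;
}
```

The old record is not freed here; under the scheme described above
that is left to the GC sweep phase.
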
Substring operation

When making a substring, a new string header is allocated, with new
contents for the LENGTH and CHARPTR fields.  CHARRECORDPTR still
refers to the original character record, so no characters are copied.

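The operation might look as follows.  The names are hypothetical, and
the new header is returned by value to keep the sketch short; in Guile
it would be a heap-allocated cell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types, reduced to what the operation touches. */
typedef struct char_record {
    int shared;            /* SHAREDFLAG, not modified in this sketch */
    char chars[];
} char_record;

typedef struct {
    size_t length;
    char_record *record;   /* CHARRECORDPTR */
    char *chars;           /* CHARPTR */
} str_header;

/* Build a fresh string: one record plus a header covering all of it. */
str_header string_new(const char *text)
{
    size_t len = strlen(text);
    char_record *r = malloc(sizeof *r + len);
    if (r == NULL)
        abort();
    r->shared = 0;
    memcpy(r->chars, text, len);
    str_header s = { len, r, r->chars };
    return s;
}

/* Header for the characters s[start, end).  Only LENGTH and CHARPTR
   get new contents; the record pointer is copied unchanged. */
str_header make_substring(const str_header *s, size_t start, size_t end)
{
    assert(start <= end && end <= s->length);
    str_header sub;
    sub.length = end - start;
    sub.record = s->record;
    sub.chars = s->chars + start;
    return sub;
}
```

A substring of a substring works the same way, since the offset is
always taken relative to the CHARPTR of the header at hand.
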
Implementation details
----------------------

* We store the character record consecutively with the character
  array and lump the PHASE and SHAREDFLAG fields together into one
  byte containing an integer code for the four possible states of the
  PHASE and SHAREDFLAG fields.  Another way of viewing it is that
  these fields are represented as bits 1 and 0 in the "header" of the
  character array.  We let CHARRECORDPTR point to the first character
  position rather than to this header:

     CHARRECORDPTR
          |
          V
         FCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

     F = 0, 1, 2, 3

* We represent strings as the sub-types `simple-string' and
  `substring'.

* In a simple string, CHARRECORDPTR and CHARPTR are represented by a
  single pointer, so a `simple-string' is an ordinary heap cell with
  TYPETAG and LENGTH in the CAR and CHARPTR in the CDR.

* Substrings are represented as double cells, with TYPETAG and LENGTH
  in word 0, CHARRECORDPTR in word 1 and CHARPTR in word 2
  (alternatively, we could store an offset from CHARRECORDPTR).

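The packed layout from the first item can be sketched as below.  The
function and macro names are hypothetical.  Which of bits 1 and 0
holds PHASE and which holds SHAREDFLAG, and which bit value stands for
which state, are assumptions made for this sketch.  The trailing NUL
follows the convention mentioned in the Background section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed encoding of the flag byte F (not fixed by the text):
   bit 1 = PHASE, bit 0 = SHAREDFLAG. */
#define PHASE_BIT  0x2u    /* assumed: set = black, clear = white */
#define SHARED_BIT 0x1u    /* assumed: set = shared, clear = private */

/* Allocate F + len characters + NUL and return CHARRECORDPTR, i.e. a
   pointer to the first character rather than to F. */
char *chars_alloc(const char *text, size_t len)
{
    unsigned char *block = malloc(1 + len + 1);
    if (block == NULL)
        abort();
    block[0] = 0;          /* F = 0: <white, private> as encoded above */
    memcpy(block + 1, text, len);
    block[1 + len] = '\0';
    return (char *)(block + 1);
}

/* The flag byte sits immediately before the first character. */
unsigned char *flags_of(char *record_ptr)
{
    return (unsigned char *)record_ptr - 1;
}

int is_shared(char *record_ptr)
{
    return (*flags_of(record_ptr) & SHARED_BIT) != 0;
}

/* Free the whole block, starting from the flag byte. */
void chars_free(char *record_ptr)
{
    free(flags_of(record_ptr));
}
```

With this layout a simple string needs no separate CHARRECORDPTR: its
single pointer serves both as CHARPTR and, one byte back, as the way
to reach the marks.
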
Problems with this implementation
---------------------------------

* How do we make copy-on-write thread-safe?  Is there a different
  implementation which is efficient and thread-safe?

* If small substrings are frequently generated from large, temporary
  strings and the small substrings are kept in a data structure, the
  heap will still have to host the large original strings.  Should we
  simply accept this?