mirror of
https://git.savannah.gnu.org/git/guile.git
synced 2025-04-30 20:00:19 +02:00
143 lines
5 KiB
Text
143 lines
5 KiB
Text
Implementation of shared substrings with fresh-copy semantics
|
|
=============================================================
|
|
|
|
Version: $Id: sharedstr.text,v 1.1 2000-08-26 20:55:21 mdj Exp $
|
|
|
|
Background
|
|
----------
|
|
|
|
In Guile, most string operations work on two other data types apart
|
|
from strings: shared substrings and read-only strings (which includes
|
|
symbols). One of Guile's sub-goals is to be a scripting language in
|
|
which string management is important. Read-only strings and shared
|
|
substrings were introduced in order to reduce overhead in string
|
|
manipulation.
|
|
|
|
We now want to simplify the Guile API by removing these two data
|
|
types, but keeping performance by allowing ordinary strings to share
|
|
storage.
|
|
|
|
The idea is to let operations like `symbol->string' and `substring'
|
|
return a pointer into the original string/symbol, thus avoiding the
|
|
need to copy the string.
|
|
|
|
Two of the problems which then arise are:
|
|
|
|
* If s2 is derived from s1, and therefore share storage with s1, a
|
|
modification to either s1 or s2 will affect the other.
|
|
|
|
* Guile is supposed to interact closely with many UNIX libraries in
|
|
which the NUL character is used to terminate strings. Therefore
|
|
Guile strings contain a NUL character at the end, in addition to the
|
|
string length (the latter of which is used by Guile's string
|
|
operations).
|
|
|
|
The solutions to these problems are to
|
|
|
|
* Copy a string with shared storage when it's modified.
|
|
|
|
* Copy a string with shared storage when it's being used as argument
|
|
to a C library call. (Copying implies inserting an ending NUL
|
|
character.)
|
|
|
|
But this leads to memory management problems. When is it OK to free
|
|
a character array which was allocated for a symbol or a string?
|
|
|
|
Abstract description of proposed solution
|
|
-----------------------------------------
|
|
|
|
Definitions
|
|
|
|
STRING = <TYPETAG, LENGTH, CHARRECORDPTR, CHARPTR>
|
|
|
|
SYMBOL = <TYPETAG, LENGTH, CHARRECORDPTR, CHARPTR>
|
|
|
|
CHARRECORD = <PHASE, SHAREDFLAG, CHARS>
|
|
|
|
PHASE = black | white
|
|
|
|
SHAREDFLAG = private | shared
|
|
|
|
CHARS is a character array
|
|
|
|
CHARPTR points into it
|
|
|
|
Memory management
|
|
|
|
A string or symbol is initially allocated with its contents stored in
|
|
a character array in a character record. The string/symbol header
|
|
contains a pointer to this record. The initial value of the shared
|
|
flag in the character record is `private'.
|
|
|
|
The GC mark phases alternate between black and white---every second
|
|
phase is black, the rest are white. This is used to distinguish
|
|
whether a character record has been encountered before:
|
|
|
|
During a black mark phase, when the GC encounters a string or symbol,
|
|
it changes the PHASE and SHAREDFLAG marks of the corresponding
|
|
character record according to the following table:
|
|
|
|
<white, private> --> <black, private> (white => unconditionally
|
|
<white, shared> --> <black, private> set to <black, private>)
|
|
<black, private> --> <black, shared> (SHAREDFLAG changed)
|
|
<black, shared> --> <black, shared> (no change)
|
|
|
|
The behaviour of a white phase is quivalent with the color names
|
|
switched.
|
|
|
|
The GC sweep phase frees any unmarked string or symbol header and
|
|
frees its character record either if it is marked with the "wrong"
|
|
color (not matching the color of the last mark phase) or if its
|
|
SHAREDFLAG is `private'.
|
|
|
|
Copy-on-write
|
|
|
|
An attempt at mutating string contents leads to copying if SHAREDFLAG
|
|
is `shared'. Copying means making a copy of the character record and
|
|
mutating the CHARRECORDPTR and CHARPTR fields of the object header to
|
|
point to the copy.
|
|
|
|
Substring operation
|
|
|
|
When making a substring, a new string header is allocated, with new
|
|
contents for the LENGTH and CHARPTR fields.
|
|
|
|
Implementation details
|
|
----------------------
|
|
|
|
* We store the character record consecutively with the character
|
|
array and lump the PHASE and SHAREDFLAG fields together into one
|
|
byte containing an integer code for the four possible states of the
|
|
PHASE and SHAREDFLAG fields. Another way of viewing it is that
|
|
these fields are represented as bits 1 and 0 in the "header" of the
|
|
character array. We let CHARRECORDPTR point to the first character
|
|
position instead of on this header:
|
|
|
|
CHARRECORDPTR
|
|
|
|
|
V
|
|
FCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
|
|
|
|
F = 0, 1, 2, 3
|
|
|
|
* We represent strings as the sub-types `simple-string' and
|
|
`substring'.
|
|
|
|
* In a simple string, CHARRECORDPTR and CHARPTR are represented by a
|
|
single pointer, so a `simple-string' is an ordinary heap cell with
|
|
TYPETAG and LENGTH in the CAR and CHARPTR in the CDR.
|
|
|
|
* substring:s are represented as double cells, with TYPETAG and LENGTH
|
|
in word 0, CHARRECORDPTR in word 1 and CHARPTR in word 2
|
|
(alternatively, we could store an offset from CHARRECORDPTR).
|
|
|
|
Problems with this implementation
|
|
---------------------------------
|
|
|
|
* How do we make copy-on-write thread-safe? Is there a different
|
|
implementation which is efficient and thread-safe?
|
|
|
|
* If small substrings are frequently generated from large, temporary
|
|
strings and the small substrings are kept in a data structure, the
|
|
heap will still have to host the large original strings. Should we
|
|
simply accept this?
|