mirror of
https://git.savannah.gnu.org/git/guile.git
synced 2025-04-30 03:40:34 +02:00
Rework docs
This commit is contained in:
parent
3d4d4d047c
commit
ab5071f97a
8 changed files with 336 additions and 75 deletions
12
README.md
12
README.md
|
@ -8,19 +8,9 @@ Whippet is an embed-only C library, designed to be copied into a
|
|||
program's source tree. It exposes an abstract C API for managed memory
|
||||
allocation, and provides a number of implementations of that API.
|
||||
|
||||
One of the implementations is also called "whippet", and is the
|
||||
motivation for creating this library. For a detailed introduction, see
|
||||
[Whippet: Towards a new local
|
||||
maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
|
||||
a talk given at FOSDEM 2023.
|
||||
|
||||
## Documentation
|
||||
|
||||
* [Design](./doc/design.md): What's the general idea?
|
||||
* [Manual](./doc/manual.md): How do you get your program to use
|
||||
Whippet? What is the API?
|
||||
* [Guile](./doc/guile.md): Some notes on a potential rebase of Guile on
|
||||
top of Whippet.
|
||||
See the [documentation](./doc/README.md).
|
||||
|
||||
## Source repository structure
|
||||
|
||||
|
|
24
doc/README.md
Normal file
24
doc/README.md
Normal file
|
@ -0,0 +1,24 @@
|
|||
# Whippet documentation
|
||||
|
||||
* [Manual](./manual.md): How do you get your program to use
|
||||
Whippet? What is the API?
|
||||
|
||||
* [Collector implementations](./collectors.md): There are a number of
|
||||
implementations of the Whippet API with differing performance
|
||||
characteristics and which impose different requirements on the
|
||||
embedder.
|
||||
- [Semi-space collector (semi)](./collector-semi.md): For
|
||||
single-threaded embedders who are not too tight on memory.
|
||||
- [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
|
||||
but with support for multiple mutator threads. Faster than semi if
|
||||
multiple cores are available at collection-time.
|
||||
- [Whippet collector (whippet)](./collector-whippet.md):
|
||||
Immix-inspired collector. Optionally parallel, conservative (stack
|
||||
and/or heap), and/or generational.
|
||||
- [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
|
||||
Conservative mark-sweep collector, implemented by
|
||||
Boehm-Demers-Weiser library.
|
||||
|
||||
* [Guile](./doc/guile.md): Some notes on a potential rebase of Guile on
|
||||
top of Whippet.
|
||||
|
20
doc/collector-bdw.md
Normal file
20
doc/collector-bdw.md
Normal file
|
@ -0,0 +1,20 @@
|
|||
# Boehm-Demers-Weiser collector
|
||||
|
||||
Whippet's `bdw` collector is backed by a third-party garbage collector,
|
||||
the [Boehm-Demers-Weiser collector](https://github.com/ivmai/bdwgc).
|
||||
|
||||
BDW-GC is a mark-sweep collector with conservative root-finding,
|
||||
conservative heap tracing, and parallel tracing.
|
||||
|
||||
Whereas the other Whippet collectors which rely on mutators to
|
||||
[periodically check if they need to
|
||||
stop](https://github.com/wingo/whippet/blob/main/doc/manual.md#safepoints),
|
||||
`bdw` will stop mutators with a POSIX signal. Also, it doesn't really
|
||||
support ephemerons (the Whippet `bdw` collector simulates them using
|
||||
finalizers), and both ephemerons and finalizers only approximate the
|
||||
Whippet behavior, because they are implemented in terms of what BDW-GC
|
||||
provides.
|
||||
|
||||
It's a bit of an oddball from a Whippet perspective, but useful as a
|
||||
migration path if you have an embedder that is already using BDW-GC.
|
||||
And, it is a useful performance comparison.
|
69
doc/collector-pcc.md
Normal file
69
doc/collector-pcc.md
Normal file
|
@ -0,0 +1,69 @@
|
|||
# Parallel copying collector
|
||||
|
||||
Whippet's `pcc` collector is a copying collector, like the more simple
|
||||
[`semi`](./collector-semi.md), but supporting multiple mutator threads
|
||||
and multiple tracing threads.
|
||||
|
||||
Like `semi`, `pcc` traces by evacuation: it moves all live objects on
|
||||
every collection. (Exception: objects larger than 8192 bytes are
|
||||
placed into a partitioned space traces by marking in place instead of
|
||||
copying.) Evacuation requires precise roots, so if your embedder does
|
||||
not support precise roots, `pcc` is not for you.
|
||||
|
||||
Again like `semi`, `pcc` generally requires a heap size at least twice
|
||||
as large as the maximum live heap size, and performs best with ample
|
||||
heap sizes; between 3× and 5× is best.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Unlike `semi` which has a single global bump-pointer allocation region,
|
||||
`pcc` structures the heap into 64-kB blocks. In this way it supports
|
||||
multiple mutator threads: mutators do local bump-pointer allocation into
|
||||
their own block, and when their block is full, they fetch another from
|
||||
the global store.
|
||||
|
||||
The block size is 64 kB, but really it's 128 kB, because each block has
|
||||
two halves: the active region and the copy reserve. Dividing each block
|
||||
in two allows the collector to easily grow and shrink the heap while
|
||||
ensuring there is always enough reserve space.
|
||||
|
||||
Blocks are allocated in 64-MB aligned slabs, so there are 512 blocks in
|
||||
a slab. The first block in a slab is used by the collector itself, to
|
||||
keep metadata for the rest of the blocks, for example a chain pointer
|
||||
allowing blocks to be collected in lists, a saved allocation pointer for
|
||||
partially-filled blocks, whether the block is paged in or out, and so
|
||||
on.
|
||||
|
||||
`pcc` supports tracing in parallel. This mechanism works somewhat like
|
||||
allocation, in which multiple trace workers compete to evacuate objects
|
||||
into their local allocation buffers; when an allocation buffer is full,
|
||||
the trace worker grabs another, just like mutators do.
|
||||
|
||||
However, unlike the simple semi-space collector which uses a Cheney grey
|
||||
worklist, `pcc` uses the [fine-grained work-stealing parallel
|
||||
tracer](../src/parallel-tracer.h) originally developed for [Whippet's
|
||||
Immix-like collector](./collector-whippet.md). Each trace worker
|
||||
maintains a [local queue of objects that need
|
||||
tracing](../src/local-worklist.h), which currently has 1024 entries. If
|
||||
the local queue becomes full, the worker will publish 3/4 of those
|
||||
entries to the worker's [shared worklist](../src/shared-worklist.h).
|
||||
When a worker runs out of local work, it will first try to remove work
|
||||
from its own shared worklist, then will try to steal from other workers.
|
||||
|
||||
Because threads compete to evacuate objects, `pcc` uses [atomic
|
||||
compare-and-swap instead of simple forwarding pointer
|
||||
updates](./manual.md#forwarding-objects), which imposes around a ~30%
|
||||
performance penalty. `pcc` generally starts to outperform `semi` when
|
||||
it can trace with 2 threads, and gets better with each additional
|
||||
thread.
|
||||
|
||||
I sometimes wonder whether the worklist should contain grey edges or
|
||||
grey objects. [MMTk](https://www.mmtk.io/) seems to do the former, and bundles edges into work
|
||||
packets, which are the unit of work sharing. I don't know yet what is
|
||||
best and look forward to experimenting once I have better benchmarks.
|
||||
|
||||
The memory used for the external worklist is dynamically allocated from
|
||||
the OS and is not currently counted as contributing to the heap size.
|
||||
If you are targetting a microcontroller or something, probably you need
|
||||
to choose a different kind of collector that never dynamically
|
||||
allocates, such as `semi`.
|
23
doc/collector-semi.md
Normal file
23
doc/collector-semi.md
Normal file
|
@ -0,0 +1,23 @@
|
|||
# Semi-space collector
|
||||
|
||||
The `semi` collector is simple. It is mostly useful as a first
|
||||
collector to try out, to make sure that a mutator correctly records all
|
||||
roots: because `semi` moves every live object on every collection, it is
|
||||
very effective at shaking out mutator bugs.
|
||||
|
||||
If your embedder chooses to not precisely record roots, for example
|
||||
instead choosing to conservatively scan the stack, then the semi-space
|
||||
collector is not for you: `semi` requires precise roots.
|
||||
|
||||
For more on semi-space collectors, see
|
||||
https://wingolog.org/archives/2022/12/10/a-simple-semi-space-collector.
|
||||
|
||||
Whippet's `semi` collector incorporates a large-object space, which
|
||||
marks objects in place instead of moving. Otherwise, `semi` generally
|
||||
requires a heap size at least twice as large as the maximum live heap
|
||||
size, and performs best with ample heap sizes; between 3× and 5× is
|
||||
best.
|
||||
|
||||
The semi-space collector doesn't support multiple mutator threads. If
|
||||
you want a whole-heap copying collector for a multi-threaded mutator,
|
||||
look at [pcc](./collector-pcc.md).
|
158
doc/collector-whippet.md
Normal file
158
doc/collector-whippet.md
Normal file
|
@ -0,0 +1,158 @@
|
|||
# Whippet collector
|
||||
|
||||
One collector implementation in the Whippet garbage collection library
|
||||
is also called Whippet. Naming-wise this is a somewhat confusing
|
||||
situation; perhaps it will change.
|
||||
|
||||
Anyway, the `whippet` collector is mainly a mark-region collector,
|
||||
inspired by
|
||||
[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
|
||||
To a first approximation, Whippet is a whole-heap Immix collector with a
|
||||
large object space on the side.
|
||||
|
||||
When tracing, `whippet` mostly marks objects in place. If the heap is
|
||||
too fragmented, it can compact the heap by choosing to evacuate
|
||||
sparsely-populated heap blocks instead of marking in place. However
|
||||
evacuation is strictly optional, which means that `whippet` is also
|
||||
compatible with conservative root-finding, making it a good replacement
|
||||
for embedders that currently use the [Boehm-Demers-Weiser
|
||||
collector](./collector-bdw.md).
|
||||
|
||||
## Differences from Immix
|
||||
|
||||
The original Immix divides the heap into 32kB blocks, and then divides
|
||||
those blocks into 128B lines. An Immix allocation can span lines but
|
||||
not blocks; allocations larger than 8kB go into a separate large object
|
||||
space. Mutators request blocks from the global store and allocate into
|
||||
those blocks using bump-pointer allocation. When all blocks are
|
||||
consumed, Immix stops the world and traces the object graph, marking
|
||||
objects but also the lines that objects are on. After marking, blocks
|
||||
contain some lines with live objects and others that are completely
|
||||
free. Spans of free lines are called holes. When a mutator gets a
|
||||
recycled block from the global block store, it allocates into those
|
||||
holes. For an exposition of Immix, see the lovely detailed [Rust
|
||||
implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
|
||||
|
||||
The essential difference of `whippet` from Immix stems from a simple
|
||||
observation: Immix needs a side table of line mark bytes and also a mark
|
||||
bit or bits in each object (or in a side table). But if instead you
|
||||
choose to store mark bytes instead of bits (for concurrency reasons) in
|
||||
a side table, with one mark byte per granule (unit of allocation,
|
||||
perhaps 16 bytes), then you effectively have a line mark table where the
|
||||
granule size is the line size. You can bump-pointer allocate into holes
|
||||
in the mark byte table.
|
||||
|
||||
You might think this is a bad tradeoff, and perhaps it is: I don't know
|
||||
yet. If your granule size is two pointers, then one mark byte per
|
||||
granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
|
||||
32-bit, it's a lot! On the other hand, instead of the worst case of one
|
||||
survivor object wasting a line (or two, in the case of conservative line
|
||||
marking), granule-size-is-line-size instead wastes nothing. Also, you
|
||||
don't need GC bits in the object itself, and you can use the mark byte
|
||||
array to record the object end, so that finding holes in a block can
|
||||
just read the mark table and can avoid looking at object memory.
|
||||
|
||||
## Optional features
|
||||
|
||||
The `whippet` collector has a few feature flags that can be turned on or
|
||||
off. If you use the [standard embedder makefile include](../embed.mk),
|
||||
then there is a name for each combination of features: `whippet` has no
|
||||
additional features, `parallel-whippet` enables parallel marking,
|
||||
`parallel-generational-whippet` enables parallelism,
|
||||
`stack-conservative-parallel-generational-whippet` uses conservative
|
||||
root-finding, and `heap-conservative-parallel-generational-whippet`
|
||||
additionally traces the heap conservatively. You can leave off
|
||||
components of the name to get a collector without those features.
|
||||
Underneath this corresponds to some pre-processor definitions passed to
|
||||
the compiler on the command line.
|
||||
|
||||
### Generations
|
||||
|
||||
Whippet supports generational tracing via the [sticky mark-bit
|
||||
algorithm](https://wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm).
|
||||
This requires that the embedder emit [write
|
||||
barriers](https://github.com/wingo/whippet/blob/main/doc/manual.md#write-barriers);
|
||||
if your embedder cannot ensure write barriers are always invoked, then
|
||||
generational collection is not for you. (We could perhaps relax this a
|
||||
bit, following what [Ruby developers
|
||||
did](http://rvm.jp/~ko1/activities/rgengc_ismm.pdf).)
|
||||
|
||||
The write barrier is currently a card-marking barrier emitted on stores,
|
||||
with one card byte per 256 object bytes, where the card location can be
|
||||
computed from the object address because blocks are allocated in
|
||||
two-megabyte aligned slabs.
|
||||
|
||||
### Parallel tracing
|
||||
|
||||
You almost certainly want this on! `parallel-whippet` uses a the
|
||||
[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
|
||||
Each trace worker maintains a [local queue of objects that need
|
||||
tracing](../src/local-worklist.h), which currently has a capacity of
|
||||
1024 entries. If the local queue becomes full, the worker will publish
|
||||
3/4 of those entries to the worker's [shared
|
||||
worklist](../src/shared-worklist.h). When a worker runs out of local
|
||||
work, it will first try to remove work from its own shared worklist,
|
||||
then will try to steal from other workers.
|
||||
|
||||
The memory used for the external worklist is dynamically allocated from
|
||||
the OS and is not currently counted as contributing to the heap size.
|
||||
If you absolutely need to avoid dynamic allocation during GC, `whippet`
|
||||
(even serial whippet) would need some work for your use case, to
|
||||
allocate a fixed-size space for a marking queue and to gracefully handle
|
||||
mark queue overflow.
|
||||
|
||||
### Conservative stack scanning
|
||||
|
||||
With `semi` and `pcc`, embedders must precisely enumerate the set of
|
||||
*roots*: the edges into the heap from outside. Commonly, roots include
|
||||
global variables, as well as working variables from each mutator's
|
||||
stack. Whippet can optionally mark mutator stacks *conservatively*:
|
||||
treating each word on the stack as if it may be an object reference, and
|
||||
marking any object at that address.
|
||||
|
||||
After all these years, *whether* to mark stacks conservatively or not is
|
||||
still an open research question. Conservative stack scanning retain too
|
||||
much data if an integer is confused for an object reference and removes
|
||||
a layer of correctness-by-construction from a system. Sometimes it is
|
||||
required, for example if your embedder cannot enumerate roots precisely.
|
||||
But there are reasons to consider it even if you can do precise roots:
|
||||
it removes the need for the compiler to produce a stack map to store the
|
||||
precise root enumeration at every safepoint; it removes the need to look
|
||||
up a stack map when tracing; and it allows C or C++ support code to
|
||||
avoid having to place roots in traceable locations published to the
|
||||
garbage collector. And the [performance question is still
|
||||
open](https://dl.acm.org/doi/10.1145/2660193.2660198).
|
||||
|
||||
Anyway. Whippet can scan roots conservatively. Those roots are pinned
|
||||
for the collection; even if the collection will compact via evacuation,
|
||||
referents of conservative roots won't be moved. Objects not directly
|
||||
referenced by roots can be evacuated, however.
|
||||
|
||||
### Conservative heap scanning
|
||||
|
||||
In addition to stack and global references, the Boehm-Demers-Weiser
|
||||
collector scans heap objects conservatively as well, treating each word
|
||||
of each heap object as if it were a reference. Whippet can do that, if
|
||||
the embedder is unable to provide a `gc_trace_object` implementation.
|
||||
However this is generally a performance lose, and it prevents
|
||||
evacuation.
|
||||
|
||||
## Other implementation tidbits
|
||||
|
||||
`whippet` does lazy sweeping: as a mutator grabs a fresh block, it
|
||||
reclaims memory that was unmarked in the previous collection before
|
||||
making the memory available for allocation. This makes sweeping
|
||||
naturally cache-friendly and parallel.
|
||||
|
||||
The mark byte array facilitates conservative collection by being an
|
||||
oracle for "does this address start an object".
|
||||
|
||||
There is some support for concurrent marking by having three mark bit
|
||||
states (dead, survivor, marked) that rotate at each collection; some
|
||||
collector configurations can have mutators mark before waiting for other
|
||||
mutators to stop. True concurrent marking and associated barriers
|
||||
are not yet implemented.
|
||||
|
||||
For a detailed introduction, see [Whippet: Towards a new local
|
||||
maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
|
||||
a talk given at FOSDEM 2023.
|
41
doc/collectors.md
Normal file
41
doc/collectors.md
Normal file
|
@ -0,0 +1,41 @@
|
|||
# Whippet collectors
|
||||
|
||||
Whippet has four collectors currently:
|
||||
- [Semi-space collector (semi)](./collector-semi.md): For
|
||||
single-threaded embedders who are not too tight on memory.
|
||||
- [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
|
||||
but with support for multiple mutator threads. Faster than semi if
|
||||
multiple cores are available at collection-time.
|
||||
- [Whippet collector (whippet)](./collector-whippet.md):
|
||||
Immix-inspired collector. Optionally parallel, conservative (stack
|
||||
and/or heap), and/or generational.
|
||||
- [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
|
||||
Conservative mark-sweep collector, implemented by
|
||||
Boehm-Demers-Weiser library.
|
||||
|
||||
## How to choose?
|
||||
|
||||
If you are migrating an embedder off BDW-GC, then it could be reasonable
|
||||
to first go to `bdw`, then `stack-conservative-parallel-whippet`.
|
||||
|
||||
If you have an embedder with precise roots, use `semi` if
|
||||
single-threaded, or `pcc` if multi-threaded. That will shake out
|
||||
mutator/embedder bugs. Then if memory is tight, switch to
|
||||
`parallel-whippet`, possibly `parallel-generational-whippet`.
|
||||
|
||||
If you are writing a new project, you have a choice as to whether to pay
|
||||
the development cost of precise roots or not. If choose to not have
|
||||
precise roots, then go for `stack-conservative-parallel-whippet`
|
||||
directly.
|
||||
|
||||
## More collectors
|
||||
|
||||
It would be nice to have a classic generational GC, perhaps using
|
||||
parallel-whippet for the old generation but a pcc-style copying nursery.
|
||||
|
||||
Support for concurrent marking in `whippet` would be good as well,
|
||||
perhaps with a SATB barrier. (Or, if you are the sort of person to bet
|
||||
on conservative stack scanning, perhaps a retreating-wavefront barrier
|
||||
would be more appropriate.)
|
||||
|
||||
Contributions are welcome, provided they have no more dependencies!
|
|
@ -1,64 +0,0 @@
|
|||
# Design
|
||||
|
||||
Whippet is mainly a mark-region collector, like
|
||||
[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
|
||||
See also the lovely detailed [Rust
|
||||
implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
|
||||
|
||||
To a first approximation, Whippet is a whole-heap Immix collector with a
|
||||
large object space on the side. See the Immix paper for full details,
|
||||
but basically Immix divides the heap into 32kB blocks, and then divides
|
||||
those blocks into 128B lines. An Immix allocation never spans blocks;
|
||||
allocations larger than 8kB go into a separate large object space.
|
||||
Mutators request blocks from the global store and allocate into those
|
||||
blocks using bump-pointer allocation. When all blocks are consumed,
|
||||
Immix stops the world and traces the object graph, marking objects but
|
||||
also the lines that objects are on. After marking, blocks contain some
|
||||
lines with live objects and others that are completely free. Spans of
|
||||
free lines are called holes. When a mutator gets a recycled block from
|
||||
the global block store, it allocates into those holes. Also, sometimes
|
||||
Immix can choose to evacuate rather than mark. Bump-pointer-into-holes
|
||||
allocation is quite compatible with conservative roots, so it's an
|
||||
interesting option for Guile, which has a lot of legacy C API users.
|
||||
|
||||
The essential difference of Whippet from Immix stems from a simple
|
||||
observation: Immix needs a side table of line mark bytes and also a mark
|
||||
bit or bits in each object (or in a side table). But if instead you
|
||||
choose to store mark bytes instead of bits (for concurrency reasons) in
|
||||
a side table, with one mark byte per granule (unit of allocation,
|
||||
perhaps 16 bytes), then you effectively have a line mark table where the
|
||||
granule size is the line size. You can bump-pointer allocate into holes
|
||||
in the mark byte table.
|
||||
|
||||
You might think this is a bad tradeoff, and perhaps it is: I don't know
|
||||
yet. If your granule size is two pointers, then one mark byte per
|
||||
granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
|
||||
32-bit, it's a lot! On the other hand, instead of the worst case of one
|
||||
survivor object wasting a line (or two, in the case of conservative line
|
||||
marking), granule-size-is-line-size instead wastes nothing. Also, you
|
||||
don't need GC bits in the object itself, and you can use the mark byte
|
||||
array to record the object end, so that finding holes in a block can
|
||||
just read the mark table and can avoid looking at object memory.
|
||||
|
||||
Other ideas in Whippet:
|
||||
|
||||
* Minimize stop-the-world phase via parallel marking and punting all
|
||||
sweeping to mutators
|
||||
|
||||
* Enable mutator parallelism via lock-free block acquisition and lazy
|
||||
statistics collation
|
||||
|
||||
* Allocate block space using aligned 4 MB slabs, with embedded metadata
|
||||
to allow metadata bytes, slab headers, and block metadata to be
|
||||
located via address arithmetic
|
||||
|
||||
* Facilitate conservative collection via mark byte array, oracle for
|
||||
"does this address start an object"
|
||||
|
||||
* Enable in-place generational collection via card table with one entry
|
||||
per 256B or so
|
||||
|
||||
* Enable concurrent marking by having three mark bit states (dead,
|
||||
survivor, marked) that rotate at each collection, and sweeping a
|
||||
block clears metadata for dead objects; but concurrent marking and
|
||||
associated SATB barrier not yet implemented
|
Loading…
Add table
Add a link
Reference in a new issue