
Rework docs

Andy Wingo 2024-07-27 22:26:24 +02:00
parent 3d4d4d047c
commit ab5071f97a
8 changed files with 336 additions and 75 deletions

README.md

@ -8,19 +8,9 @@ Whippet is an embed-only C library, designed to be copied into a
program's source tree. It exposes an abstract C API for managed memory
allocation, and provides a number of implementations of that API.
One of the implementations is also called "whippet", and is the
motivation for creating this library. For a detailed introduction, see
[Whippet: Towards a new local
maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
a talk given at FOSDEM 2023.
## Documentation
* [Design](./doc/design.md): What's the general idea?
* [Manual](./doc/manual.md): How do you get your program to use
Whippet? What is the API?
* [Guile](./doc/guile.md): Some notes on a potential rebase of Guile on
top of Whippet.
See the [documentation](./doc/README.md).
## Source repository structure

doc/README.md Normal file

@ -0,0 +1,24 @@
# Whippet documentation
* [Manual](./manual.md): How do you get your program to use
Whippet? What is the API?
* [Collector implementations](./collectors.md): There are a number of
implementations of the Whippet API, with differing performance
characteristics and different requirements on the embedder.
- [Semi-space collector (semi)](./collector-semi.md): For
single-threaded embedders who are not too tight on memory.
- [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
but with support for multiple mutator threads. Faster than semi if
multiple cores are available at collection-time.
- [Whippet collector (whippet)](./collector-whippet.md):
Immix-inspired collector. Optionally parallel, conservative (stack
and/or heap), and/or generational.
- [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
Conservative mark-sweep collector, implemented by the
Boehm-Demers-Weiser library.
* [Guile](./guile.md): Some notes on a potential rebase of Guile on
top of Whippet.

doc/collector-bdw.md Normal file

@ -0,0 +1,20 @@
# Boehm-Demers-Weiser collector
Whippet's `bdw` collector is backed by a third-party garbage collector,
the [Boehm-Demers-Weiser collector](https://github.com/ivmai/bdwgc).
BDW-GC is a mark-sweep collector with conservative root-finding,
conservative heap tracing, and parallel tracing.
Whereas the other Whippet collectors rely on mutators to [periodically
check if they need to
stop](https://github.com/wingo/whippet/blob/main/doc/manual.md#safepoints),
`bdw` will stop mutators with a POSIX signal. Also, it doesn't really
support ephemerons (the Whippet `bdw` collector simulates them using
finalizers), and both ephemerons and finalizers only approximate the
Whippet behavior, because they are implemented in terms of what BDW-GC
provides.
It's a bit of an oddball from a Whippet perspective, but useful as a
migration path if you have an embedder that is already using BDW-GC.
And, it is a useful performance comparison.

doc/collector-pcc.md Normal file

@ -0,0 +1,69 @@
# Parallel copying collector
Whippet's `pcc` collector is a copying collector, like the more simple
[`semi`](./collector-semi.md), but supporting multiple mutator threads
and multiple tracing threads.
Like `semi`, `pcc` traces by evacuation: it moves all live objects on
every collection. (Exception: objects larger than 8192 bytes are
placed into a partitioned space which is traced by marking in place
instead of copying.) Evacuation requires precise roots, so if your
embedder does not support precise roots, `pcc` is not for you.
Again like `semi`, `pcc` generally requires a heap size at least twice
as large as the maximum live heap size, and performs best with ample
heap sizes; between 3× and 5× is best.
## Implementation notes
Unlike `semi` which has a single global bump-pointer allocation region,
`pcc` structures the heap into 64-kB blocks. In this way it supports
multiple mutator threads: mutators do local bump-pointer allocation into
their own block, and when their block is full, they fetch another from
the global store.
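As a sketch of that fast path (names invented for illustration; the
real code lives in the Whippet sources), local allocation can look like
this:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE (64 * 1024)
#define GRANULE_SIZE 16

// Hypothetical per-mutator state: a bump pointer and a limit within
// the mutator's current block.
struct mutator {
  uintptr_t hp;     // next free byte in the current block
  uintptr_t limit;  // one past the end of the current block
};

// Assumed to return a fresh, empty block from the global store,
// triggering a collection first if no block is available.
extern uintptr_t acquire_block_from_global_store(void);

static inline void* allocate(struct mutator *mut, size_t bytes) {
  // Round up to the allocation granule; objects at or above the
  // large-object threshold go to a separate space, not handled here.
  bytes = (bytes + GRANULE_SIZE - 1) & ~(size_t)(GRANULE_SIZE - 1);
  if (mut->hp + bytes > mut->limit) {
    // Local block exhausted: fetch another block.
    mut->hp = acquire_block_from_global_store();
    mut->limit = mut->hp + BLOCK_SIZE;
  }
  void *result = (void*)mut->hp;
  mut->hp += bytes;
  return result;
}
```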
The block size is 64 kB, but really it's 128 kB, because each block has
two halves: the active region and the copy reserve. Dividing each block
in two allows the collector to easily grow and shrink the heap while
ensuring there is always enough reserve space.
Blocks are allocated in 64-MB aligned slabs, so there are 512 blocks in
a slab. The first block in a slab is used by the collector itself, to
keep metadata for the rest of the blocks, for example a chain pointer
allowing blocks to be chained into lists, a saved allocation pointer for
partially-filled blocks, whether the block is paged in or out, and so
on.
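Because slabs are aligned, a block's metadata can be located by address
arithmetic alone. A sketch, with invented field names:

```c
#include <stdint.h>

#define SLAB_SIZE (64 * 1024 * 1024)              // 64 MB, aligned
#define BLOCK_SIZE (64 * 1024)                    // 64 kB
#define BLOCKS_PER_SLAB (SLAB_SIZE / BLOCK_SIZE)  // 512

// Hypothetical per-block metadata, stored in the slab's first block.
struct block_meta {
  uintptr_t chain;       // next block in whatever list this block is on
  uintptr_t saved_alloc; // saved allocation pointer, if partly full
  uint8_t paged_in;      // whether the block's memory is paged in
};

static inline struct block_meta* metadata_for_block(uintptr_t addr) {
  // Masking off the low bits finds the slab; the block's index within
  // the slab finds its metadata slot in the slab's first block.
  uintptr_t slab = addr & ~(uintptr_t)(SLAB_SIZE - 1);
  size_t index = (addr - slab) / BLOCK_SIZE;
  return &((struct block_meta*)slab)[index];
}
```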
`pcc` supports tracing in parallel. This mechanism works somewhat like
allocation, in which multiple trace workers compete to evacuate objects
into their local allocation buffers; when an allocation buffer is full,
the trace worker grabs another, just like mutators do.
However, unlike the simple semi-space collector which uses a Cheney grey
worklist, `pcc` uses the [fine-grained work-stealing parallel
tracer](../src/parallel-tracer.h) originally developed for [Whippet's
Immix-like collector](./collector-whippet.md). Each trace worker
maintains a [local queue of objects that need
tracing](../src/local-worklist.h), which currently has 1024 entries. If
the local queue becomes full, the worker will publish 3/4 of those
entries to the worker's [shared worklist](../src/shared-worklist.h).
When a worker runs out of local work, it will first try to remove work
from its own shared worklist, then will try to steal from other workers.
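A sketch of the local-queue overflow policy (sizes from the text above,
names invented):

```c
#include <stddef.h>

#define LOCAL_WORKLIST_SIZE 1024

struct shared_worklist;  // stealable deque, one per worker

// Assumed to publish a batch of entries where other workers can steal.
extern void shared_worklist_push_many(struct shared_worklist*,
                                      void **objs, size_t n);

// Hypothetical trace-worker state: a fixed-size local queue plus a
// handle on this worker's shared worklist.
struct trace_worker {
  void *local[LOCAL_WORKLIST_SIZE];
  size_t count;
  struct shared_worklist *shared;
};

static void worker_push(struct trace_worker *w, void *obj) {
  if (w->count == LOCAL_WORKLIST_SIZE) {
    // Local queue full: publish 3/4 of it to the shared worklist.
    size_t spill = LOCAL_WORKLIST_SIZE * 3 / 4;
    shared_worklist_push_many(w->shared, w->local, spill);
    // Slide the remaining quarter down to the front of the queue.
    for (size_t i = 0; i + spill < LOCAL_WORKLIST_SIZE; i++)
      w->local[i] = w->local[i + spill];
    w->count -= spill;
  }
  w->local[w->count++] = obj;
}
```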
Because threads compete to evacuate objects, `pcc` uses [atomic
compare-and-swap instead of simple forwarding pointer
updates](./manual.md#forwarding-objects), which imposes a performance
penalty of around 30%. `pcc` generally starts to outperform `semi` when
it can trace with 2 threads, and gets better with each additional
thread.
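The race between evacuating threads can be sketched as below.  This is
simplified: a real implementation must also ensure an object's contents
are fully copied before other threads can read through the forwarding
pointer.

```c
#include <stdatomic.h>
#include <stdint.h>

// Objects are aligned, so the low bit of a forwarding address is free
// to serve as a "forwarded" tag.
#define FORWARDED_BIT ((uintptr_t)1)

// Try to claim `header`'s object for evacuation to `new_addr`.  Exactly
// one thread's compare-and-swap succeeds; everyone else reads back the
// winner's forwarding pointer and uses that address instead.
static uintptr_t try_forward(_Atomic uintptr_t *header,
                             uintptr_t new_addr) {
  uintptr_t old = atomic_load_explicit(header, memory_order_acquire);
  if (old & FORWARDED_BIT)
    return old & ~FORWARDED_BIT;  // already evacuated by someone else
  if (atomic_compare_exchange_strong(header, &old,
                                     new_addr | FORWARDED_BIT))
    return new_addr;              // we won: we do the copy
  return old & ~FORWARDED_BIT;    // we lost: `old` holds the winner's tag
}
```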
I sometimes wonder whether the worklist should contain grey edges or
grey objects. [MMTk](https://www.mmtk.io/) seems to do the former, and bundles edges into work
packets, which are the unit of work sharing. I don't know yet what is
best and look forward to experimenting once I have better benchmarks.
The memory used for the external worklist is dynamically allocated from
the OS and is not currently counted as contributing to the heap size.
If you are targeting a microcontroller or something, you probably need
to choose a different kind of collector that never dynamically
allocates, such as `semi`.

doc/collector-semi.md Normal file

@ -0,0 +1,23 @@
# Semi-space collector
The `semi` collector is simple. It is mostly useful as a first
collector to try out, to make sure that a mutator correctly records all
roots: because `semi` moves every live object on every collection, it is
very effective at shaking out mutator bugs.
If your embedder does not record roots precisely, and instead for
example conservatively scans the stack, then the semi-space collector is
not for you: `semi` requires precise roots.
For more on semi-space collectors, see
https://wingolog.org/archives/2022/12/10/a-simple-semi-space-collector.
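The core of such a collector fits in a few lines. A sketch, assuming
the embedder provides object sizing, field iteration, and a forwarding
slot (all names invented):

```c
#include <stdint.h>
#include <string.h>

// Assumed embedder hooks: size and fields of an object, plus a
// forwarding slot that reads as NULL until the object is copied.
extern size_t object_size(void *obj);
extern size_t object_field_count(void *obj);
extern void** object_field(void *obj, size_t i);
extern int in_from_space(void *obj);
extern void* get_forwarded(void *obj);
extern void set_forwarded(void *obj, void *to);

static uintptr_t to_alloc, to_scan;  // bump pointer and scan pointer

static void* copy(void *obj) {
  if (!in_from_space(obj)) return obj;
  if (get_forwarded(obj)) return get_forwarded(obj);
  size_t size = object_size(obj);
  void *to = (void*)to_alloc;
  to_alloc += size;
  memcpy(to, obj, size);
  set_forwarded(obj, to);
  return to;
}

// Cheney's trick: to-space doubles as the grey worklist.  Objects
// between `to_scan` and `to_alloc` are copied but not yet traced.  A
// collection copies the roots with `copy`, then drains this region.
static void trace(void) {
  while (to_scan < to_alloc) {
    void *obj = (void*)to_scan;
    for (size_t i = 0; i < object_field_count(obj); i++)
      *object_field(obj, i) = copy(*object_field(obj, i));
    to_scan += object_size(obj);
  }
}
```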
Whippet's `semi` collector incorporates a large-object space, which
marks objects in place instead of moving them. Otherwise, `semi` generally
requires a heap size at least twice as large as the maximum live heap
size, and performs best with ample heap sizes; between 3× and 5× is
best.
The semi-space collector doesn't support multiple mutator threads. If
you want a whole-heap copying collector for a multi-threaded mutator,
look at [pcc](./collector-pcc.md).

doc/collector-whippet.md Normal file

@ -0,0 +1,158 @@
# Whippet collector
One collector implementation in the Whippet garbage collection library
is also called Whippet. Naming-wise this is a somewhat confusing
situation; perhaps it will change.
Anyway, the `whippet` collector is mainly a mark-region collector,
inspired by
[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
To a first approximation, Whippet is a whole-heap Immix collector with a
large object space on the side.
When tracing, `whippet` mostly marks objects in place. If the heap is
too fragmented, it can compact the heap by choosing to evacuate
sparsely-populated heap blocks instead of marking in place. However,
evacuation is strictly optional, which means that `whippet` is also
compatible with conservative root-finding, making it a good replacement
for embedders that currently use the [Boehm-Demers-Weiser
collector](./collector-bdw.md).
## Differences from Immix
The original Immix divides the heap into 32kB blocks, and then divides
those blocks into 128B lines. An Immix allocation can span lines but
not blocks; allocations larger than 8kB go into a separate large object
space. Mutators request blocks from the global store and allocate into
those blocks using bump-pointer allocation. When all blocks are
consumed, Immix stops the world and traces the object graph, marking
objects but also the lines that objects are on. After marking, blocks
contain some lines with live objects and others that are completely
free. Spans of free lines are called holes. When a mutator gets a
recycled block from the global block store, it allocates into those
holes. For an exposition of Immix, see the lovely detailed [Rust
implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
The essential difference of `whippet` from Immix stems from a simple
observation: Immix needs a side table of line mark bytes and also a mark
bit or bits in each object (or in a side table). But if instead you
choose to store mark bytes instead of bits (for concurrency reasons) in
a side table, with one mark byte per granule (unit of allocation,
perhaps 16 bytes), then you effectively have a line mark table where the
granule size is the line size. You can bump-pointer allocate into holes
in the mark byte table.
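Concretely, finding a hole is then a scan over mark bytes, never over
object memory. A sketch with an invented table layout:

```c
#include <stddef.h>
#include <stdint.h>

struct hole { size_t start_granule, granule_count; };

// Scan a block's mark byte table for the next run of unmarked
// granules, starting at granule `from`.  Returns 0 if none remain.
static int next_hole(const uint8_t *mark_bytes, size_t n_granules,
                     size_t from, struct hole *out) {
  size_t i = from;
  while (i < n_granules && mark_bytes[i]) i++;   // skip live granules
  if (i == n_granules) return 0;                 // no more holes
  out->start_granule = i;
  while (i < n_granules && !mark_bytes[i]) i++;  // extend the hole
  out->granule_count = i - out->start_granule;
  return 1;
}
```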
You might think this is a bad tradeoff, and perhaps it is: I don't know
yet. If your granule size is two pointers, then one mark byte per
granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
32-bit, it's a lot! On the other hand, instead of the worst case of one
survivor object wasting a line (or two, in the case of conservative line
marking), granule-size-is-line-size instead wastes nothing. Also, you
don't need GC bits in the object itself, and you can use the mark byte
array to record the object end, so that finding holes in a block can
just read the mark table and can avoid looking at object memory.
## Optional features
The `whippet` collector has a few feature flags that can be turned on or
off. If you use the [standard embedder makefile include](../embed.mk),
then there is a name for each combination of features: `whippet` has no
additional features, `parallel-whippet` enables parallel marking,
`parallel-generational-whippet` additionally enables generational collection,
`stack-conservative-parallel-generational-whippet` uses conservative
root-finding, and `heap-conservative-parallel-generational-whippet`
additionally traces the heap conservatively. You can leave off
components of the name to get a collector without those features.
Underneath this corresponds to some pre-processor definitions passed to
the compiler on the command line.
### Generations
Whippet supports generational tracing via the [sticky mark-bit
algorithm](https://wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm).
This requires that the embedder emit [write
barriers](https://github.com/wingo/whippet/blob/main/doc/manual.md#write-barriers);
if your embedder cannot ensure write barriers are always invoked, then
generational collection is not for you. (We could perhaps relax this a
bit, following what [Ruby developers
did](http://rvm.jp/~ko1/activities/rgengc_ismm.pdf).)
The write barrier is currently a card-marking barrier emitted on stores,
with one card byte per 256 object bytes, where the card location can be
computed from the object address because blocks are allocated in
two-megabyte aligned slabs.
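A sketch of such a barrier, under the assumption (invented for
illustration) that the card table sits at the start of each slab:

```c
#include <stdint.h>

#define SLAB_SIZE (2 * 1024 * 1024)  // two-megabyte aligned slabs
#define BYTES_PER_CARD 256

// Card-marking barrier: record that `obj`'s card was written, so a
// minor collection can re-scan it for old-to-young edges, then perform
// the store itself.
static inline void write_barrier(void *obj, void **field, void *value) {
  uintptr_t addr = (uintptr_t)obj;
  uintptr_t slab = addr & ~(uintptr_t)(SLAB_SIZE - 1);
  uint8_t *card_table = (uint8_t*)slab;  // assumed location, see above
  card_table[(addr - slab) / BYTES_PER_CARD] = 1;
  *field = value;
}
```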
### Parallel tracing
You almost certainly want this on! `parallel-whippet` uses the
[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
Each trace worker maintains a [local queue of objects that need
tracing](../src/local-worklist.h), which currently has a capacity of
1024 entries. If the local queue becomes full, the worker will publish
3/4 of those entries to the worker's [shared
worklist](../src/shared-worklist.h). When a worker runs out of local
work, it will first try to remove work from its own shared worklist,
then will try to steal from other workers.
The memory used for the external worklist is dynamically allocated from
the OS and is not currently counted as contributing to the heap size.
If you absolutely need to avoid dynamic allocation during GC, `whippet`
(even serial whippet) would need some work for your use case, to
allocate a fixed-size space for a marking queue and to gracefully handle
mark queue overflow.
### Conservative stack scanning
With `semi` and `pcc`, embedders must precisely enumerate the set of
*roots*: the edges into the heap from outside. Commonly, roots include
global variables, as well as working variables from each mutator's
stack. Whippet can optionally mark mutator stacks *conservatively*:
treating each word on the stack as if it may be an object reference, and
marking any object at that address.
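In outline, conservative stack scanning is just the following, assuming
a downward-growing stack and an oracle like the one described later in
this document (names invented):

```c
#include <stdint.h>

// Assumed to consult the collector's "does this address start an
// object" oracle, and to mark (and pin) the object if so.
extern void mark_and_pin_if_object(uintptr_t maybe_ref);

// Treat every aligned word between the mutator's current stack pointer
// and its stack base as a potential reference.
static void scan_stack_conservatively(uintptr_t *sp, uintptr_t *base) {
  for (uintptr_t *p = sp; p < base; p++)
    mark_and_pin_if_object(*p);
}
```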
After all these years, *whether* to mark stacks conservatively or not is
still an open research question. Conservative stack scanning can retain
too much data if an integer is confused for an object reference, and it
removes a layer of correctness-by-construction from a system. Sometimes it is
required, for example if your embedder cannot enumerate roots precisely.
But there are reasons to consider it even if you can do precise roots:
it removes the need for the compiler to produce a stack map to store the
precise root enumeration at every safepoint; it removes the need to look
up a stack map when tracing; and it allows C or C++ support code to
avoid having to place roots in traceable locations published to the
garbage collector. And the [performance question is still
open](https://dl.acm.org/doi/10.1145/2660193.2660198).
Anyway. Whippet can scan roots conservatively. Those roots are pinned
for the collection; even if the collection will compact via evacuation,
referents of conservative roots won't be moved. Objects not directly
referenced by roots can be evacuated, however.
### Conservative heap scanning
In addition to stack and global references, the Boehm-Demers-Weiser
collector scans heap objects conservatively as well, treating each word
of each heap object as if it were a reference. Whippet can do that, if
the embedder is unable to provide a `gc_trace_object` implementation.
However, this is generally a performance lose, and it prevents
evacuation.
## Other implementation tidbits
`whippet` does lazy sweeping: as a mutator grabs a fresh block, it
reclaims memory that was unmarked in the previous collection before
making the memory available for allocation. This makes sweeping
naturally cache-friendly and parallel.
The mark byte array facilitates conservative collection by being an
oracle for "does this address start an object".
There is some support for concurrent marking by having three mark bit
states (dead, survivor, marked) that rotate at each collection; some
collector configurations can have mutators mark before waiting for other
mutators to stop. True concurrent marking and associated barriers
are not yet implemented.
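The rotation itself is tiny. A sketch with invented names:

```c
// Three mark values rotate each cycle: last cycle's "marked" objects
// become this cycle's survivors, and whatever still carries the oldest
// value is dead and can be swept.
enum mark { MARK_1 = 1, MARK_2 = 2, MARK_3 = 3 };

static enum mark next_live_mark(enum mark current) {
  return (enum mark)(current % 3 + 1);  // 1 -> 2 -> 3 -> 1
}
```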
For a detailed introduction, see [Whippet: Towards a new local
maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
a talk given at FOSDEM 2023.

doc/collectors.md Normal file

@ -0,0 +1,41 @@
# Whippet collectors
Whippet has four collectors currently:
- [Semi-space collector (semi)](./collector-semi.md): For
single-threaded embedders who are not too tight on memory.
- [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
but with support for multiple mutator threads. Faster than semi if
multiple cores are available at collection-time.
- [Whippet collector (whippet)](./collector-whippet.md):
Immix-inspired collector. Optionally parallel, conservative (stack
and/or heap), and/or generational.
- [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
Conservative mark-sweep collector, implemented by the
Boehm-Demers-Weiser library.
## How to choose?
If you are migrating an embedder off BDW-GC, then it could be reasonable
to first go to `bdw`, then `stack-conservative-parallel-whippet`.
If you have an embedder with precise roots, use `semi` if
single-threaded, or `pcc` if multi-threaded. That will shake out
mutator/embedder bugs. Then if memory is tight, switch to
`parallel-whippet`, possibly `parallel-generational-whippet`.
If you are writing a new project, you have a choice as to whether to pay
the development cost of precise roots or not. If you choose not to have
precise roots, then go for `stack-conservative-parallel-whippet`
directly.
## More collectors
It would be nice to have a classic generational GC, perhaps using
parallel-whippet for the old generation but a pcc-style copying nursery.
Support for concurrent marking in `whippet` would be good as well,
perhaps with a SATB barrier. (Or, if you are the sort of person to bet
on conservative stack scanning, perhaps a retreating-wavefront barrier
would be more appropriate.)
Contributions are welcome, provided they have no more dependencies!

doc/design.md

@ -1,64 +0,0 @@
# Design
Whippet is mainly a mark-region collector, like
[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
See also the lovely detailed [Rust
implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
To a first approximation, Whippet is a whole-heap Immix collector with a
large object space on the side. See the Immix paper for full details,
but basically Immix divides the heap into 32kB blocks, and then divides
those blocks into 128B lines. An Immix allocation never spans blocks;
allocations larger than 8kB go into a separate large object space.
Mutators request blocks from the global store and allocate into those
blocks using bump-pointer allocation. When all blocks are consumed,
Immix stops the world and traces the object graph, marking objects but
also the lines that objects are on. After marking, blocks contain some
lines with live objects and others that are completely free. Spans of
free lines are called holes. When a mutator gets a recycled block from
the global block store, it allocates into those holes. Also, sometimes
Immix can choose to evacuate rather than mark. Bump-pointer-into-holes
allocation is quite compatible with conservative roots, so it's an
interesting option for Guile, which has a lot of legacy C API users.
The essential difference of Whippet from Immix stems from a simple
observation: Immix needs a side table of line mark bytes and also a mark
bit or bits in each object (or in a side table). But if instead you
choose to store mark bytes instead of bits (for concurrency reasons) in
a side table, with one mark byte per granule (unit of allocation,
perhaps 16 bytes), then you effectively have a line mark table where the
granule size is the line size. You can bump-pointer allocate into holes
in the mark byte table.
You might think this is a bad tradeoff, and perhaps it is: I don't know
yet. If your granule size is two pointers, then one mark byte per
granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
32-bit, it's a lot! On the other hand, instead of the worst case of one
survivor object wasting a line (or two, in the case of conservative line
marking), granule-size-is-line-size instead wastes nothing. Also, you
don't need GC bits in the object itself, and you can use the mark byte
array to record the object end, so that finding holes in a block can
just read the mark table and can avoid looking at object memory.
Other ideas in Whippet:
* Minimize stop-the-world phase via parallel marking and punting all
sweeping to mutators
* Enable mutator parallelism via lock-free block acquisition and lazy
statistics collation
* Allocate block space using aligned 4 MB slabs, with embedded metadata
to allow metadata bytes, slab headers, and block metadata to be
located via address arithmetic
* Facilitate conservative collection via mark byte array, oracle for
"does this address start an object"
* Enable in-place generational collection via card table with one entry
per 256B or so
* Enable concurrent marking by having three mark bit states (dead,
survivor, marked) that rotate at each collection, and sweeping a
block clears metadata for dead objects; but concurrent marking and
associated SATB barrier not yet implemented