
Rework docs

Andy Wingo 2024-07-27 22:26:24 +02:00
parent 3d4d4d047c
commit ab5071f97a
8 changed files with 336 additions and 75 deletions

README.md

@ -8,19 +8,9 @@ Whippet is an embed-only C library, designed to be copied into a
program's source tree. It exposes an abstract C API for managed memory
allocation, and provides a number of implementations of that API.
One of the implementations is also called "whippet", and is the
motivation for creating this library. For a detailed introduction, see
[Whippet: Towards a new local
maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
a talk given at FOSDEM 2023.
## Documentation
* [Design](./doc/design.md): What's the general idea?
* [Manual](./doc/manual.md): How do you get your program to use
Whippet? What is the API?
* [Guile](./doc/guile.md): Some notes on a potential rebase of Guile on
top of Whippet.
See the [documentation](./doc/README.md).
## Source repository structure

doc/README.md Normal file

@ -0,0 +1,24 @@
# Whippet documentation
* [Manual](./manual.md): How do you get your program to use
Whippet? What is the API?
* [Collector implementations](./collectors.md): There are a number of
implementations of the Whippet API, with differing performance
characteristics and different requirements on the embedder.
- [Semi-space collector (semi)](./collector-semi.md): For
single-threaded embedders who are not too tight on memory.
- [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
but with support for multiple mutator threads. Faster than semi if
multiple cores are available at collection-time.
- [Whippet collector (whippet)](./collector-whippet.md):
Immix-inspired collector. Optionally parallel, conservative (stack
and/or heap), and/or generational.
- [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
Conservative mark-sweep collector, implemented by the
Boehm-Demers-Weiser library.
* [Guile](./guile.md): Some notes on a potential rebase of Guile on
top of Whippet.

doc/collector-bdw.md Normal file

@ -0,0 +1,20 @@
# Boehm-Demers-Weiser collector
Whippet's `bdw` collector is backed by a third-party garbage collector,
the [Boehm-Demers-Weiser collector](https://github.com/ivmai/bdwgc).
BDW-GC is a mark-sweep collector with conservative root-finding,
conservative heap tracing, and parallel tracing.
Whereas the other Whippet collectors rely on mutators to [periodically
check if they need to
stop](https://github.com/wingo/whippet/blob/main/doc/manual.md#safepoints),
`bdw` will stop mutators with a POSIX signal. Also, it doesn't really
support ephemerons (the Whippet `bdw` collector simulates them using
finalizers), and both ephemerons and finalizers only approximate the
Whippet behavior, because they are implemented in terms of what BDW-GC
provides.
It's a bit of an oddball from a Whippet perspective, but useful as a
migration path if you have an embedder that is already using BDW-GC.
And, it is a useful performance comparison.

doc/collector-pcc.md Normal file

@ -0,0 +1,69 @@
# Parallel copying collector
Whippet's `pcc` collector is a copying collector, like the more simple
[`semi`](./collector-semi.md), but supporting multiple mutator threads
and multiple tracing threads.
Like `semi`, `pcc` traces by evacuation: it moves all live objects on
every collection. (Exception: objects larger than 8192 bytes are
placed into a partitioned space which is traced by marking in place
instead of copying.) Evacuation requires precise roots, so if your
embedder does not support precise roots, `pcc` is not for you.
Again like `semi`, `pcc` generally requires a heap size at least twice
as large as the maximum live heap size, and performs best with ample
heap sizes; between 3× and 5× is best.
## Implementation notes
Unlike `semi` which has a single global bump-pointer allocation region,
`pcc` structures the heap into 64-kB blocks. In this way it supports
multiple mutator threads: mutators do local bump-pointer allocation into
their own block, and when their block is full, they fetch another from
the global store.
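As a sketch of that fast path (names invented for illustration; the
real code lives in the Whippet sources), local allocation can look like
this:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE (64 * 1024)
#define GRANULE_SIZE 16

// Hypothetical per-mutator state: a bump pointer and a limit within
// the mutator's current block.
struct mutator {
  uintptr_t hp;     // next free byte in the current block
  uintptr_t limit;  // one past the end of the current block
};

// Assumed to return a fresh, empty block from the global store,
// triggering a collection first if no block is available.
extern uintptr_t acquire_block_from_global_store(void);

static inline void* allocate(struct mutator *mut, size_t bytes) {
  // Round up to the allocation granule; objects at or above the
  // large-object threshold go to a separate space, not handled here.
  bytes = (bytes + GRANULE_SIZE - 1) & ~(size_t)(GRANULE_SIZE - 1);
  if (mut->hp + bytes > mut->limit) {
    // Local block exhausted: fetch another block.
    mut->hp = acquire_block_from_global_store();
    mut->limit = mut->hp + BLOCK_SIZE;
  }
  void *result = (void*)mut->hp;
  mut->hp += bytes;
  return result;
}
```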
The block size is 64 kB, but really it's 128 kB, because each block has
two halves: the active region and the copy reserve. Dividing each block
in two allows the collector to easily grow and shrink the heap while
ensuring there is always enough reserve space.
Blocks are allocated in 64-MB aligned slabs, so there are 512 blocks in
a slab. The first block in a slab is used by the collector itself, to
keep metadata for the rest of the blocks, for example a chain pointer
allowing blocks to be chained into lists, a saved allocation pointer for
partially-filled blocks, whether the block is paged in or out, and so
on.
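Because slabs are aligned, a block's metadata can be located by address
arithmetic alone. A sketch, with invented field names:

```c
#include <stdint.h>

#define SLAB_SIZE (64 * 1024 * 1024)              // 64 MB, aligned
#define BLOCK_SIZE (64 * 1024)                    // 64 kB
#define BLOCKS_PER_SLAB (SLAB_SIZE / BLOCK_SIZE)  // 512

// Hypothetical per-block metadata, stored in the slab's first block.
struct block_meta {
  uintptr_t chain;       // next block in whatever list this block is on
  uintptr_t saved_alloc; // saved allocation pointer, if partly full
  uint8_t paged_in;      // whether the block's memory is paged in
};

static inline struct block_meta* metadata_for_block(uintptr_t addr) {
  // Masking off the low bits finds the slab; the block's index within
  // the slab finds its metadata slot in the slab's first block.
  uintptr_t slab = addr & ~(uintptr_t)(SLAB_SIZE - 1);
  size_t index = (addr - slab) / BLOCK_SIZE;
  return &((struct block_meta*)slab)[index];
}
```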
`pcc` supports tracing in parallel. This mechanism works somewhat like
allocation, in which multiple trace workers compete to evacuate objects
into their local allocation buffers; when an allocation buffer is full,
the trace worker grabs another, just like mutators do.
However, unlike the simple semi-space collector which uses a Cheney grey
worklist, `pcc` uses the [fine-grained work-stealing parallel
tracer](../src/parallel-tracer.h) originally developed for [Whippet's
Immix-like collector](./collector-whippet.md). Each trace worker
maintains a [local queue of objects that need
tracing](../src/local-worklist.h), which currently has 1024 entries. If
the local queue becomes full, the worker will publish 3/4 of those
entries to the worker's [shared worklist](../src/shared-worklist.h).
When a worker runs out of local work, it will first try to remove work
from its own shared worklist, then will try to steal from other workers.
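A sketch of the local-queue overflow policy (sizes from the text above,
names invented):

```c
#include <stddef.h>

#define LOCAL_WORKLIST_SIZE 1024

struct shared_worklist;  // stealable deque, one per worker

// Assumed to publish a batch of entries where other workers can steal.
extern void shared_worklist_push_many(struct shared_worklist*,
                                      void **objs, size_t n);

// Hypothetical trace-worker state: a fixed-size local queue plus a
// handle on this worker's shared worklist.
struct trace_worker {
  void *local[LOCAL_WORKLIST_SIZE];
  size_t count;
  struct shared_worklist *shared;
};

static void worker_push(struct trace_worker *w, void *obj) {
  if (w->count == LOCAL_WORKLIST_SIZE) {
    // Local queue full: publish 3/4 of it to the shared worklist.
    size_t spill = LOCAL_WORKLIST_SIZE * 3 / 4;
    shared_worklist_push_many(w->shared, w->local, spill);
    // Slide the remaining quarter down to the front of the queue.
    for (size_t i = 0; i + spill < LOCAL_WORKLIST_SIZE; i++)
      w->local[i] = w->local[i + spill];
    w->count -= spill;
  }
  w->local[w->count++] = obj;
}
```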
Because threads compete to evacuate objects, `pcc` uses [atomic
compare-and-swap instead of simple forwarding pointer
updates](./manual.md#forwarding-objects), which imposes a performance
penalty of around 30%. `pcc` generally starts to outperform `semi` when
it can trace with 2 threads, and gets better with each additional
thread.
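The race between evacuating threads can be sketched as below.  This is
simplified: a real implementation must also ensure an object's contents
are fully copied before other threads can read through the forwarding
pointer.

```c
#include <stdatomic.h>
#include <stdint.h>

// Objects are aligned, so the low bit of a forwarding address is free
// to serve as a "forwarded" tag.
#define FORWARDED_BIT ((uintptr_t)1)

// Try to claim `header`'s object for evacuation to `new_addr`.  Exactly
// one thread's compare-and-swap succeeds; everyone else reads back the
// winner's forwarding pointer and uses that address instead.
static uintptr_t try_forward(_Atomic uintptr_t *header,
                             uintptr_t new_addr) {
  uintptr_t old = atomic_load_explicit(header, memory_order_acquire);
  if (old & FORWARDED_BIT)
    return old & ~FORWARDED_BIT;  // already evacuated by someone else
  if (atomic_compare_exchange_strong(header, &old,
                                     new_addr | FORWARDED_BIT))
    return new_addr;              // we won: we do the copy
  return old & ~FORWARDED_BIT;    // we lost: `old` holds the winner's tag
}
```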
I sometimes wonder whether the worklist should contain grey edges or
grey objects. [MMTk](https://www.mmtk.io/) seems to do the former, and bundles edges into work
packets, which are the unit of work sharing. I don't know yet what is
best and look forward to experimenting once I have better benchmarks.
The memory used for the external worklist is dynamically allocated from
the OS and is not currently counted as contributing to the heap size.
If you are targeting a microcontroller or something, you probably need
to choose a different kind of collector that never dynamically
allocates, such as `semi`.

doc/collector-semi.md Normal file

@ -0,0 +1,23 @@
# Semi-space collector
The `semi` collector is simple. It is mostly useful as a first
collector to try out, to make sure that a mutator correctly records all
roots: because `semi` moves every live object on every collection, it is
very effective at shaking out mutator bugs.
If your embedder does not record roots precisely, and instead for
example conservatively scans the stack, then the semi-space collector is
not for you: `semi` requires precise roots.
For more on semi-space collectors, see
https://wingolog.org/archives/2022/12/10/a-simple-semi-space-collector.
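The core of such a collector fits in a few lines. A sketch, assuming
the embedder provides object sizing, field iteration, and a forwarding
slot (all names invented):

```c
#include <stdint.h>
#include <string.h>

// Assumed embedder hooks: size and fields of an object, plus a
// forwarding slot that reads as NULL until the object is copied.
extern size_t object_size(void *obj);
extern size_t object_field_count(void *obj);
extern void** object_field(void *obj, size_t i);
extern int in_from_space(void *obj);
extern void* get_forwarded(void *obj);
extern void set_forwarded(void *obj, void *to);

static uintptr_t to_alloc, to_scan;  // bump pointer and scan pointer

static void* copy(void *obj) {
  if (!in_from_space(obj)) return obj;
  if (get_forwarded(obj)) return get_forwarded(obj);
  size_t size = object_size(obj);
  void *to = (void*)to_alloc;
  to_alloc += size;
  memcpy(to, obj, size);
  set_forwarded(obj, to);
  return to;
}

// Cheney's trick: to-space doubles as the grey worklist.  Objects
// between `to_scan` and `to_alloc` are copied but not yet traced.  A
// collection copies the roots with `copy`, then drains this region.
static void trace(void) {
  while (to_scan < to_alloc) {
    void *obj = (void*)to_scan;
    for (size_t i = 0; i < object_field_count(obj); i++)
      *object_field(obj, i) = copy(*object_field(obj, i));
    to_scan += object_size(obj);
  }
}
```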
Whippet's `semi` collector incorporates a large-object space, which
marks objects in place instead of moving them. Otherwise, `semi` generally
requires a heap size at least twice as large as the maximum live heap
size, and performs best with ample heap sizes; between 3× and 5× is
best.
The semi-space collector doesn't support multiple mutator threads. If
you want a whole-heap copying collector for a multi-threaded mutator,
look at [pcc](./collector-pcc.md).

doc/collector-whippet.md Normal file

@ -0,0 +1,158 @@
# Whippet collector
One collector implementation in the Whippet garbage collection library
is also called Whippet. Naming-wise this is a somewhat confusing
situation; perhaps it will change.
Anyway, the `whippet` collector is mainly a mark-region collector,
inspired by
[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
To a first approximation, Whippet is a whole-heap Immix collector with a
large object space on the side.
When tracing, `whippet` mostly marks objects in place. If the heap is
too fragmented, it can compact the heap by choosing to evacuate
sparsely-populated heap blocks instead of marking in place. However,
evacuation is strictly optional, which means that `whippet` is also
compatible with conservative root-finding, making it a good replacement
for embedders that currently use the [Boehm-Demers-Weiser
collector](./collector-bdw.md).
## Differences from Immix
The original Immix divides the heap into 32kB blocks, and then divides
those blocks into 128B lines. An Immix allocation can span lines but
not blocks; allocations larger than 8kB go into a separate large object
space. Mutators request blocks from the global store and allocate into
those blocks using bump-pointer allocation. When all blocks are
consumed, Immix stops the world and traces the object graph, marking
objects but also the lines that objects are on. After marking, blocks
contain some lines with live objects and others that are completely
free. Spans of free lines are called holes. When a mutator gets a
recycled block from the global block store, it allocates into those
holes. For an exposition of Immix, see the lovely detailed [Rust
implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
The essential difference of `whippet` from Immix stems from a simple
observation: Immix needs a side table of line mark bytes and also a mark
bit or bits in each object (or in a side table). But if instead you
choose to store mark bytes instead of bits (for concurrency reasons) in
a side table, with one mark byte per granule (unit of allocation,
perhaps 16 bytes), then you effectively have a line mark table where the
granule size is the line size. You can bump-pointer allocate into holes
in the mark byte table.
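Concretely, finding a hole is then a scan over mark bytes, never over
object memory. A sketch with an invented table layout:

```c
#include <stddef.h>
#include <stdint.h>

struct hole { size_t start_granule, granule_count; };

// Scan a block's mark byte table for the next run of unmarked
// granules, starting at granule `from`.  Returns 0 if none remain.
static int next_hole(const uint8_t *mark_bytes, size_t n_granules,
                     size_t from, struct hole *out) {
  size_t i = from;
  while (i < n_granules && mark_bytes[i]) i++;   // skip live granules
  if (i == n_granules) return 0;                 // no more holes
  out->start_granule = i;
  while (i < n_granules && !mark_bytes[i]) i++;  // extend the hole
  out->granule_count = i - out->start_granule;
  return 1;
}
```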
You might think this is a bad tradeoff, and perhaps it is: I don't know
yet. If your granule size is two pointers, then one mark byte per
granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
32-bit, it's a lot! On the other hand, instead of the worst case of one
survivor object wasting a line (or two, in the case of conservative line
marking), granule-size-is-line-size instead wastes nothing. Also, you
don't need GC bits in the object itself, and you can use the mark byte
array to record the object end, so that finding holes in a block can
just read the mark table and can avoid looking at object memory.
## Optional features
The `whippet` collector has a few feature flags that can be turned on or
off. If you use the [standard embedder makefile include](../embed.mk),
then there is a name for each combination of features: `whippet` has no
additional features, `parallel-whippet` enables parallel marking,
`parallel-generational-whippet` additionally enables generational collection,
`stack-conservative-parallel-generational-whippet` uses conservative
root-finding, and `heap-conservative-parallel-generational-whippet`
additionally traces the heap conservatively. You can leave off
components of the name to get a collector without those features.
Underneath this corresponds to some pre-processor definitions passed to
the compiler on the command line.
### Generations
Whippet supports generational tracing via the [sticky mark-bit
algorithm](https://wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm).
This requires that the embedder emit [write
barriers](https://github.com/wingo/whippet/blob/main/doc/manual.md#write-barriers);
if your embedder cannot ensure write barriers are always invoked, then
generational collection is not for you. (We could perhaps relax this a
bit, following what [Ruby developers
did](http://rvm.jp/~ko1/activities/rgengc_ismm.pdf).)
The write barrier is currently a card-marking barrier emitted on stores,
with one card byte per 256 object bytes, where the card location can be
computed from the object address because blocks are allocated in
two-megabyte aligned slabs.
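A sketch of such a barrier, under the assumption (invented for
illustration) that the card table sits at the start of each slab:

```c
#include <stdint.h>

#define SLAB_SIZE (2 * 1024 * 1024)  // two-megabyte aligned slabs
#define BYTES_PER_CARD 256

// Card-marking barrier: record that `obj`'s card was written, so a
// minor collection can re-scan it for old-to-young edges, then perform
// the store itself.
static inline void write_barrier(void *obj, void **field, void *value) {
  uintptr_t addr = (uintptr_t)obj;
  uintptr_t slab = addr & ~(uintptr_t)(SLAB_SIZE - 1);
  uint8_t *card_table = (uint8_t*)slab;  // assumed location, see above
  card_table[(addr - slab) / BYTES_PER_CARD] = 1;
  *field = value;
}
```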
### Parallel tracing
You almost certainly want this on! `parallel-whippet` uses the
[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
Each trace worker maintains a [local queue of objects that need
tracing](../src/local-worklist.h), which currently has a capacity of
1024 entries. If the local queue becomes full, the worker will publish
3/4 of those entries to the worker's [shared
worklist](../src/shared-worklist.h). When a worker runs out of local
work, it will first try to remove work from its own shared worklist,
then will try to steal from other workers.
The memory used for the external worklist is dynamically allocated from
the OS and is not currently counted as contributing to the heap size.
If you absolutely need to avoid dynamic allocation during GC, `whippet`
(even serial whippet) would need some work for your use case, to
allocate a fixed-size space for a marking queue and to gracefully handle
mark queue overflow.
### Conservative stack scanning
With `semi` and `pcc`, embedders must precisely enumerate the set of
*roots*: the edges into the heap from outside. Commonly, roots include
global variables, as well as working variables from each mutator's
stack. Whippet can optionally mark mutator stacks *conservatively*:
treating each word on the stack as if it may be an object reference, and
marking any object at that address.
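In outline, conservative stack scanning is just the following, assuming
a downward-growing stack and an oracle like the one described later in
this document (names invented):

```c
#include <stdint.h>

// Assumed to consult the collector's "does this address start an
// object" oracle, and to mark (and pin) the object if so.
extern void mark_and_pin_if_object(uintptr_t maybe_ref);

// Treat every aligned word between the mutator's current stack pointer
// and its stack base as a potential reference.
static void scan_stack_conservatively(uintptr_t *sp, uintptr_t *base) {
  for (uintptr_t *p = sp; p < base; p++)
    mark_and_pin_if_object(*p);
}
```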
After all these years, *whether* to mark stacks conservatively or not is
still an open research question. Conservative stack scanning can retain
too much data if an integer is confused for an object reference, and it
removes a layer of correctness-by-construction from a system. Sometimes it is
required, for example if your embedder cannot enumerate roots precisely.
But there are reasons to consider it even if you can do precise roots:
it removes the need for the compiler to produce a stack map to store the
precise root enumeration at every safepoint; it removes the need to look
up a stack map when tracing; and it allows C or C++ support code to
avoid having to place roots in traceable locations published to the
garbage collector. And the [performance question is still
open](https://dl.acm.org/doi/10.1145/2660193.2660198).
Anyway. Whippet can scan roots conservatively. Those roots are pinned
for the collection; even if the collection will compact via evacuation,
referents of conservative roots won't be moved. Objects not directly
referenced by roots can be evacuated, however.
### Conservative heap scanning
In addition to stack and global references, the Boehm-Demers-Weiser
collector scans heap objects conservatively as well, treating each word
of each heap object as if it were a reference. Whippet can do that, if
the embedder is unable to provide a `gc_trace_object` implementation.
However, this is generally a performance lose, and it prevents
evacuation.
## Other implementation tidbits
`whippet` does lazy sweeping: as a mutator grabs a fresh block, it
reclaims memory that was unmarked in the previous collection before
making the memory available for allocation. This makes sweeping
naturally cache-friendly and parallel.
The mark byte array facilitates conservative collection by being an
oracle for "does this address start an object".
There is some support for concurrent marking by having three mark bit
states (dead, survivor, marked) that rotate at each collection; some
collector configurations can have mutators mark before waiting for other
mutators to stop. True concurrent marking and associated barriers
are not yet implemented.
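The rotation itself is tiny. A sketch with invented names:

```c
// Three mark values rotate each cycle: last cycle's "marked" objects
// become this cycle's survivors, and whatever still carries the oldest
// value is dead and can be swept.
enum mark { MARK_1 = 1, MARK_2 = 2, MARK_3 = 3 };

static enum mark next_live_mark(enum mark current) {
  return (enum mark)(current % 3 + 1);  // 1 -> 2 -> 3 -> 1
}
```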
For a detailed introduction, see [Whippet: Towards a new local
maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
a talk given at FOSDEM 2023.

doc/collectors.md Normal file

@ -0,0 +1,41 @@
# Whippet collectors
Whippet has four collectors currently:
- [Semi-space collector (semi)](./collector-semi.md): For
single-threaded embedders who are not too tight on memory.
- [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
but with support for multiple mutator threads. Faster than semi if
multiple cores are available at collection-time.
- [Whippet collector (whippet)](./collector-whippet.md):
Immix-inspired collector. Optionally parallel, conservative (stack
and/or heap), and/or generational.
- [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
Conservative mark-sweep collector, implemented by the
Boehm-Demers-Weiser library.
## How to choose?
If you are migrating an embedder off BDW-GC, then it could be reasonable
to first go to `bdw`, then `stack-conservative-parallel-whippet`.
If you have an embedder with precise roots, use `semi` if
single-threaded, or `pcc` if multi-threaded. That will shake out
mutator/embedder bugs. Then if memory is tight, switch to
`parallel-whippet`, possibly `parallel-generational-whippet`.
If you are writing a new project, you have a choice as to whether to pay
the development cost of precise roots or not. If you choose not to have
precise roots, then go for `stack-conservative-parallel-whippet`
directly.
## More collectors
It would be nice to have a classic generational GC, perhaps using
parallel-whippet for the old generation but a pcc-style copying nursery.
Support for concurrent marking in `whippet` would be good as well,
perhaps with a SATB barrier. (Or, if you are the sort of person to bet
on conservative stack scanning, perhaps a retreating-wavefront barrier
would be more appropriate.)
Contributions are welcome, provided they have no more dependencies!

doc/design.md

@ -1,64 +0,0 @@
# Design
Whippet is mainly a mark-region collector, like
[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
See also the lovely detailed [Rust
implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
To a first approximation, Whippet is a whole-heap Immix collector with a
large object space on the side. See the Immix paper for full details,
but basically Immix divides the heap into 32kB blocks, and then divides
those blocks into 128B lines. An Immix allocation never spans blocks;
allocations larger than 8kB go into a separate large object space.
Mutators request blocks from the global store and allocate into those
blocks using bump-pointer allocation. When all blocks are consumed,
Immix stops the world and traces the object graph, marking objects but
also the lines that objects are on. After marking, blocks contain some
lines with live objects and others that are completely free. Spans of
free lines are called holes. When a mutator gets a recycled block from
the global block store, it allocates into those holes. Also, sometimes
Immix can choose to evacuate rather than mark. Bump-pointer-into-holes
allocation is quite compatible with conservative roots, so it's an
interesting option for Guile, which has a lot of legacy C API users.
The essential difference of Whippet from Immix stems from a simple
observation: Immix needs a side table of line mark bytes and also a mark
bit or bits in each object (or in a side table). But if instead you
choose to store mark bytes instead of bits (for concurrency reasons) in
a side table, with one mark byte per granule (unit of allocation,
perhaps 16 bytes), then you effectively have a line mark table where the
granule size is the line size. You can bump-pointer allocate into holes
in the mark byte table.
You might think this is a bad tradeoff, and perhaps it is: I don't know
yet. If your granule size is two pointers, then one mark byte per
granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
32-bit, it's a lot! On the other hand, instead of the worst case of one
survivor object wasting a line (or two, in the case of conservative line
marking), granule-size-is-line-size instead wastes nothing. Also, you
don't need GC bits in the object itself, and you can use the mark byte
array to record the object end, so that finding holes in a block can
just read the mark table and can avoid looking at object memory.
Other ideas in Whippet:
* Minimize stop-the-world phase via parallel marking and punting all
sweeping to mutators
* Enable mutator parallelism via lock-free block acquisition and lazy
statistics collation
* Allocate block space using aligned 4 MB slabs, with embedded metadata
to allow metadata bytes, slab headers, and block metadata to be
located via address arithmetic
* Facilitate conservative collection via mark byte array, oracle for
"does this address start an object"
* Enable in-place generational collection via card table with one entry
per 256B or so
* Enable concurrent marking by having three mark bit states (dead,
survivor, marked) that rotate at each collection, and sweeping a
block clears metadata for dead objects; but concurrent marking and
associated SATB barrier not yet implemented