diff --git a/README.md b/README.md
index e62b0661a..c922af64d 100644
--- a/README.md
+++ b/README.md
@@ -8,19 +8,9 @@
Whippet is an embed-only C library, designed to be copied into a
program's source tree. It exposes an abstract C API for managed memory
allocation, and provides a number of implementations of that API.
-One of the implementations is also called "whippet", and is the
-motivation for creating this library. For a detailed introduction, see
-[Whippet: Towards a new local
-maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
-a talk given at FOSDEM 2023.
-
## Documentation
-
- * [Design](./doc/design.md): What's the general idea?
- * [Manual](./doc/manual.md): How do you get your program to use
- Whippet? What is the API?
- * [Guile](./doc/guile.md): Some notes on a potential rebase of Guile on
- top of Whippet.
+See the [documentation](./doc/README.md).
## Source repository structure
diff --git a/doc/README.md b/doc/README.md
new file mode 100644
index 000000000..fc5348ddb
--- /dev/null
+++ b/doc/README.md
@@ -0,0 +1,24 @@
+# Whippet documentation
+
+ * [Manual](./manual.md): How do you get your program to use
+ Whippet? What is the API?
+
+ * [Collector implementations](./collectors.md): There are a number of
+ implementations of the Whippet API, with differing performance
+ characteristics and differing requirements on the embedder.
+ - [Semi-space collector (semi)](./collector-semi.md): For
+ single-threaded embedders who are not too tight on memory.
+ - [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
+ but with support for multiple mutator threads. Faster than semi if
+ multiple cores are available at collection-time.
+ - [Whippet collector (whippet)](./collector-whippet.md):
+ Immix-inspired collector. Optionally parallel, conservative (stack
+ and/or heap), and/or generational.
+ - [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
+ Conservative mark-sweep collector, implemented by
+ the Boehm-Demers-Weiser library.
+
+ * [Guile](./guile.md): Some notes on a potential rebase of Guile on
+ top of Whippet.
+
diff --git a/doc/collector-bdw.md b/doc/collector-bdw.md
new file mode 100644
index 000000000..b86a9d3b1
--- /dev/null
+++ b/doc/collector-bdw.md
@@ -0,0 +1,20 @@
+# Boehm-Demers-Weiser collector
+
+Whippet's `bdw` collector is backed by a third-party garbage collector,
+the [Boehm-Demers-Weiser collector](https://github.com/ivmai/bdwgc).
+
+BDW-GC is a mark-sweep collector with conservative root-finding,
+conservative heap tracing, and parallel tracing.
+
+Whereas the other Whippet collectors rely on mutators to
+[periodically check if they need to
+stop](https://github.com/wingo/whippet/blob/main/doc/manual.md#safepoints),
+`bdw` stops mutators with a POSIX signal. Also, it doesn't really
+support ephemerons (the Whippet `bdw` collector simulates them using
+finalizers), and both ephemerons and finalizers only approximate the
+Whippet behavior, because they are implemented in terms of what BDW-GC
+provides.
+
+It's a bit of an oddball from a Whippet perspective, but useful as a
+migration path if you have an embedder that is already using BDW-GC.
+And, it is a useful point of comparison for performance.
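+
+As an illustration of the kind of BDW-GC primitives that the ephemeron
+approximation mentioned above can build on (and not of Whippet's actual
+`bdw` code), here is a sketch of a weak-key box using BDW-GC's
+disappearing-link API; the `weak_box` type and its functions are
+invented for this example:
+
+```c
+#include <gc.h>
+
+/* A weak reference: the key pointer is stored in "hidden" (complemented)
+   form so the collector does not trace it, and the registered
+   disappearing link makes BDW-GC clear the word when the key becomes
+   unreachable. */
+struct weak_box { void *hidden_key; };
+
+static struct weak_box *weak_box_make(void *key) {
+  struct weak_box *box = GC_MALLOC(sizeof *box);
+  box->hidden_key = (void *)GC_HIDE_POINTER(key);
+  GC_general_register_disappearing_link(&box->hidden_key, key);
+  return box;
+}
+
+static void *weak_box_deref(struct weak_box *box) {
+  /* Single-threaded sketch; a multi-threaded program should read the
+     link under the allocator lock (GC_call_with_alloc_lock). */
+  void *hidden = box->hidden_key;
+  return hidden ? GC_REVEAL_POINTER(hidden) : NULL;
+}
+```
+
+An ephemeron additionally wants its value retained only while the key is
+live, which is where finalizers come in, and where the approximation of
+Whippet's ephemeron semantics becomes lossy.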
diff --git a/doc/collector-pcc.md b/doc/collector-pcc.md
new file mode 100644
index 000000000..a55b0fa4f
--- /dev/null
+++ b/doc/collector-pcc.md
@@ -0,0 +1,69 @@
+# Parallel copying collector
+
+Whippet's `pcc` collector is a copying collector, like the simpler
+[`semi`](./collector-semi.md), but supporting multiple mutator threads
+and multiple tracing threads.
+
+Like `semi`, `pcc` traces by evacuation: it moves all live objects on
+every collection. (Exception: objects larger than 8192 bytes are
+placed into a partitioned space that is traced by marking in place
+instead of copying.) Evacuation requires precise roots, so if your
+embedder does not support precise roots, `pcc` is not for you.
+
+Again like `semi`, `pcc` generally requires a heap size at least twice
+as large as the maximum live heap size, and performs best with ample
+heap sizes; between 3× and 5× is best.
+
+## Implementation notes
+
+Unlike `semi`, which has a single global bump-pointer allocation region,
+`pcc` structures the heap into 64-kB blocks. In this way it supports
+multiple mutator threads: mutators do local bump-pointer allocation into
+their own block, and when their block is full, they fetch another from
+the global store.
+
+The block size is 64 kB, but really it's 128 kB, because each block has
+two halves: the active region and the copy reserve. Dividing each block
+in two allows the collector to easily grow and shrink the heap while
+ensuring there is always enough reserve space.
+
+Blocks are allocated in 64-MB aligned slabs, so there are 512 blocks in
+a slab. The first block in a slab is used by the collector itself, to
+keep metadata for the rest of the blocks, for example a chain pointer
+allowing blocks to be linked into lists, a saved allocation pointer for
+partially-filled blocks, whether the block is paged in or out, and so
+on.
+
+`pcc` supports tracing in parallel. This mechanism works somewhat like
+allocation, in which multiple trace workers compete to evacuate objects
+into their local allocation buffers; when an allocation buffer is full,
+the trace worker grabs another, just like mutators do.
+
+However, unlike the simple semi-space collector, which uses a Cheney grey
+worklist, `pcc` uses the [fine-grained work-stealing parallel
+tracer](../src/parallel-tracer.h) originally developed for [Whippet's
+Immix-like collector](./collector-whippet.md). Each trace worker
+maintains a [local queue of objects that need
+tracing](../src/local-worklist.h), which currently has 1024 entries. If
+the local queue becomes full, the worker will publish 3/4 of those
+entries to the worker's [shared worklist](../src/shared-worklist.h).
+When a worker runs out of local work, it will first try to remove work
+from its own shared worklist, then will try to steal from other workers.
+
+Because threads compete to evacuate objects, `pcc` uses [atomic
+compare-and-swap instead of simple forwarding pointer
+updates](./manual.md#forwarding-objects), which imposes a performance
+penalty of around 30%. `pcc` generally starts to outperform `semi` when
+it can trace with 2 threads, and gets better with each additional
+thread.
+
+I sometimes wonder whether the worklist should contain grey edges or
+grey objects. [MMTk](https://www.mmtk.io/) seems to do the former, and
+bundles edges into work packets, which are the unit of work sharing. I
+don't know yet what is best and look forward to experimenting once I
+have better benchmarks.
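+
+To make the compare-and-swap forwarding mentioned above concrete, here
+is a sketch in C11 atomics of how two tracing threads might race to
+evacuate the same object. It is illustrative only, not `pcc`'s actual
+code: the `struct obj` layout, the zero-means-unforwarded header
+encoding, and the function names are all invented for the example.
+
+```c
+#include <stdatomic.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+struct obj { _Atomic uintptr_t header; /* 0 = not yet forwarded */ };
+
+/* Copy `old` into this tracer's allocation buffer, or adopt the copy
+   made by whichever thread won the race to forward it. */
+static struct obj *try_evacuate(struct obj *old, size_t size,
+                                uint8_t **alloc, uint8_t *limit) {
+  uintptr_t fwd = atomic_load_explicit(&old->header, memory_order_acquire);
+  if (fwd)
+    return (struct obj *)fwd;              /* already evacuated */
+  if (*alloc + size > limit)
+    return NULL;                           /* caller fetches a fresh block */
+  struct obj *copy = (struct obj *)*alloc;
+  memcpy(copy, old, size);                 /* speculative copy */
+  uintptr_t expected = 0;
+  if (atomic_compare_exchange_strong_explicit(&old->header, &expected,
+                                              (uintptr_t)copy,
+                                              memory_order_acq_rel,
+                                              memory_order_acquire)) {
+    *alloc += size;                        /* we won: commit the copy */
+    return copy;
+  }
+  /* Lost the race: abandon our copy and use the winner's. */
+  return (struct obj *)expected;
+}
+```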
+
+The memory used for the external worklist is dynamically allocated from
+the OS and is not currently counted as contributing to the heap size.
+If you are targeting a microcontroller or something, you probably need
+to choose a different kind of collector that never dynamically
+allocates, such as `semi`.
diff --git a/doc/collector-semi.md b/doc/collector-semi.md
new file mode 100644
index 000000000..1900d2d27
--- /dev/null
+++ b/doc/collector-semi.md
@@ -0,0 +1,23 @@
+# Semi-space collector
+
+The `semi` collector is simple. It is mostly useful as a first
+collector to try out, to make sure that a mutator correctly records all
+roots: because `semi` moves every live object on every collection, it is
+very effective at shaking out mutator bugs.
+
+If your embedder chooses not to record roots precisely, for example
+choosing instead to scan the stack conservatively, then the semi-space
+collector is not for you: `semi` requires precise roots.
+
+For more on semi-space collectors, see
+https://wingolog.org/archives/2022/12/10/a-simple-semi-space-collector.
+
+Whippet's `semi` collector incorporates a large-object space, which
+marks objects in place instead of moving them. Otherwise, `semi`
+generally requires a heap size at least twice as large as the maximum
+live heap size, and performs best with ample heap sizes; between 3× and
+5× is best.
+
+The semi-space collector doesn't support multiple mutator threads. If
+you want a whole-heap copying collector for a multi-threaded mutator,
+look at [pcc](./collector-pcc.md).
diff --git a/doc/collector-whippet.md b/doc/collector-whippet.md
new file mode 100644
index 000000000..2975035d5
--- /dev/null
+++ b/doc/collector-whippet.md
@@ -0,0 +1,158 @@
+# Whippet collector
+
+One collector implementation in the Whippet garbage collection library
+is also called Whippet. Naming-wise this is a somewhat confusing
+situation; perhaps it will change.
+
+Anyway, the `whippet` collector is mainly a mark-region collector,
+inspired by
+[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
+To a first approximation, Whippet is a whole-heap Immix collector with a
+large object space on the side.
+
+When tracing, `whippet` mostly marks objects in place. If the heap is
+too fragmented, it can compact the heap by choosing to evacuate
+sparsely-populated heap blocks instead of marking in place. However,
+evacuation is strictly optional, which means that `whippet` is also
+compatible with conservative root-finding, making it a good replacement
+for embedders that currently use the [Boehm-Demers-Weiser
+collector](./collector-bdw.md).
+
+## Differences from Immix
+
+The original Immix divides the heap into 32 kB blocks, and then divides
+those blocks into 128 B lines. An Immix allocation can span lines but
+not blocks; allocations larger than 8 kB go into a separate large object
+space. Mutators request blocks from the global store and allocate into
+those blocks using bump-pointer allocation. When all blocks are
+consumed, Immix stops the world and traces the object graph, marking
+objects but also the lines that objects are on. After marking, blocks
+contain some lines with live objects and others that are completely
+free. Spans of free lines are called holes. When a mutator gets a
+recycled block from the global block store, it allocates into those
+holes. For an exposition of Immix, see the lovely detailed [Rust
+implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
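+
+As a concrete sketch of the line-marking scheme just described, using
+the 32 kB block and 128 B line figures from above (the names and layout
+are invented for exposition; this is not Whippet's code): one mark byte
+per line lets a mutator find the holes in a recycled block by scanning
+the block's line-mark table.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+#define BLOCK_SIZE      (32 * 1024)
+#define LINE_SIZE       128
+#define LINES_PER_BLOCK (BLOCK_SIZE / LINE_SIZE)   /* 256 lines */
+
+/* One mark byte per line; tracing sets the byte for any line that
+   holds (part of) a live object.  A hole is a maximal run of unmarked
+   lines. */
+struct block_summary { uint8_t line_mark[LINES_PER_BLOCK]; };
+
+/* Find the next hole at or after `line`: store its first line in
+   *start and return its length in lines, or 0 if the block has no
+   further holes. */
+static size_t next_hole(const struct block_summary *summary,
+                        size_t line, size_t *start) {
+  while (line < LINES_PER_BLOCK && summary->line_mark[line])
+    line++;
+  *start = line;
+  size_t len = 0;
+  while (line + len < LINES_PER_BLOCK && !summary->line_mark[line + len])
+    len++;
+  return len;
+}
+```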
+
+The essential difference of `whippet` from Immix stems from a simple
+observation: Immix needs a side table of line mark bytes and also a mark
+bit or bits in each object (or in a side table). But if you instead
+choose to store mark bytes rather than bits (for concurrency reasons) in
+a side table, with one mark byte per granule (unit of allocation,
+perhaps 16 bytes), then you effectively have a line mark table where the
+granule size is the line size. You can bump-pointer allocate into holes
+in the mark byte table.
+
+You might think this is a bad tradeoff, and perhaps it is: I don't know
+yet. If your granule size is two pointers, then one mark byte per
+granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
+32-bit, it's a lot! On the other hand, instead of the worst case of one
+survivor object wasting a line (or two, in the case of conservative line
+marking), granule-size-is-line-size instead wastes nothing. Also, you
+don't need GC bits in the object itself, and you can use the mark byte
+array to record the object end, so that finding holes in a block can
+just read the mark table and can avoid looking at object memory.
+
+## Optional features
+
+The `whippet` collector has a few feature flags that can be turned on or
+off. If you use the [standard embedder makefile include](../embed.mk),
+then there is a name for each combination of features: `whippet` has no
+additional features, `parallel-whippet` enables parallel marking,
+`parallel-generational-whippet` additionally enables generational
+collection, `stack-conservative-parallel-generational-whippet` uses
+conservative root-finding, and
+`heap-conservative-parallel-generational-whippet` additionally traces
+the heap conservatively. You can leave off components of the name to
+get a collector without those features. Underneath, this corresponds to
+preprocessor definitions passed to the compiler on the command line.
+
+### Generations
+
+Whippet supports generational tracing via the [sticky mark-bit
+algorithm](https://wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm).
+This requires that the embedder emit [write
+barriers](https://github.com/wingo/whippet/blob/main/doc/manual.md#write-barriers);
+if your embedder cannot ensure write barriers are always invoked, then
+generational collection is not for you. (We could perhaps relax this a
+bit, following what [Ruby developers
+did](http://rvm.jp/~ko1/activities/rgengc_ismm.pdf).)
+
+The write barrier is currently a card-marking barrier emitted on stores,
+with one card byte per 256 object bytes, where the card location can be
+computed from the object address because blocks are allocated in
+two-megabyte aligned slabs.
+
+### Parallel tracing
+
+You almost certainly want this on! `parallel-whippet` uses the
+[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
+Each trace worker maintains a [local queue of objects that need
+tracing](../src/local-worklist.h), which currently has a capacity of
+1024 entries. If the local queue becomes full, the worker will publish
+3/4 of those entries to the worker's [shared
+worklist](../src/shared-worklist.h). When a worker runs out of local
+work, it will first try to remove work from its own shared worklist,
+then will try to steal from other workers.
+
+The memory used for the external worklist is dynamically allocated from
+the OS and is not currently counted as contributing to the heap size.
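+
+As a sketch of the overflow policy just described (illustrative only;
+the names and the `shared_worklist_push` helper are invented, and this
+is not the code in `local-worklist.h`):
+
+```c
+#include <stddef.h>
+
+#define LOCAL_WORKLIST_CAPACITY 1024
+
+struct local_worklist {
+  void *entries[LOCAL_WORKLIST_CAPACITY];
+  size_t count;
+};
+
+/* Assumed to exist for this sketch: push one object onto this worker's
+   shared (stealable) worklist, growing it as needed. */
+void shared_worklist_push(void *obj);
+
+static void local_worklist_push(struct local_worklist *q, void *obj) {
+  if (q->count == LOCAL_WORKLIST_CAPACITY) {
+    /* Local queue is full: publish 3/4 of the entries so that idle
+       workers can steal them, keeping the rest for ourselves. */
+    size_t spill = LOCAL_WORKLIST_CAPACITY / 4 * 3;
+    while (spill--)
+      shared_worklist_push(q->entries[--q->count]);
+  }
+  q->entries[q->count++] = obj;
+}
+```
+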
+If you absolutely need to avoid dynamic allocation during GC, `whippet`
+(even serial whippet) would need some work for your use case, to
+allocate a fixed-size space for a marking queue and to gracefully handle
+mark queue overflow.
+
+### Conservative stack scanning
+
+With `semi` and `pcc`, embedders must precisely enumerate the set of
+*roots*: the edges into the heap from outside. Commonly, roots include
+global variables, as well as working variables from each mutator's
+stack. Whippet can optionally mark mutator stacks *conservatively*:
+treating each word on the stack as if it might be an object reference,
+and marking any object at that address.
+
+After all these years, *whether* to mark stacks conservatively or not is
+still an open research question. Conservative stack scanning can retain
+too much data if an integer is confused for an object reference, and it
+removes a layer of correctness-by-construction from a system. Sometimes
+it is required, for example if your embedder cannot enumerate roots
+precisely. But there are reasons to consider it even if you can do
+precise roots: it removes the need for the compiler to produce a stack
+map to store the precise root enumeration at every safepoint; it removes
+the need to look up a stack map when tracing; and it allows C or C++
+support code to avoid having to place roots in traceable locations
+published to the garbage collector. And the [performance question is
+still open](https://dl.acm.org/doi/10.1145/2660193.2660198).
+
+Anyway. Whippet can scan roots conservatively. Those roots are pinned
+for the collection; even if the collection will compact via evacuation,
+referents of conservative roots won't be moved. Objects not directly
+referenced by roots can be evacuated, however.
+
+### Conservative heap scanning
+
+In addition to stack and global references, the Boehm-Demers-Weiser
+collector scans heap objects conservatively as well, treating each word
+of each heap object as if it were a reference. Whippet can do that, if
+the embedder is unable to provide a `gc_trace_object` implementation.
+However, this is generally a performance loss, and it prevents
+evacuation.
+
+## Other implementation tidbits
+
+`whippet` does lazy sweeping: as a mutator grabs a fresh block, it
+reclaims memory that was unmarked in the previous collection before
+making the memory available for allocation. This makes sweeping
+naturally cache-friendly and parallel.
+
+The mark byte array facilitates conservative collection by being an
+oracle for "does this address start an object".
+
+There is some support for concurrent marking by having three mark bit
+states (dead, survivor, marked) that rotate at each collection; some
+collector configurations can have mutators mark before waiting for other
+mutators to stop. True concurrent marking and associated barriers
+are not yet implemented.
+
+For a detailed introduction, see [Whippet: Towards a new local
+maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
+a talk given at FOSDEM 2023.
diff --git a/doc/collectors.md b/doc/collectors.md
new file mode 100644
index 000000000..77e4206ec
--- /dev/null
+++ b/doc/collectors.md
@@ -0,0 +1,41 @@
+# Whippet collectors
+
+Whippet currently has four collectors:
+ - [Semi-space collector (semi)](./collector-semi.md): For
+ single-threaded embedders who are not too tight on memory.
+ - [Parallel copying collector (pcc)](./collector-pcc.md): Like semi,
+ but with support for multiple mutator threads. Faster than semi if
+ multiple cores are available at collection-time.
+ - [Whippet collector (whippet)](./collector-whippet.md):
+ Immix-inspired collector. Optionally parallel, conservative (stack
+ and/or heap), and/or generational.
+ - [Boehm-Demers-Weiser collector (bdw)](./collector-bdw.md):
+ Conservative mark-sweep collector, implemented by
+ the Boehm-Demers-Weiser library.
+
+## How to choose?
+
+If you are migrating an embedder off BDW-GC, then it could be reasonable
+to first go to `bdw`, then `stack-conservative-parallel-whippet`.
+
+If you have an embedder with precise roots, use `semi` if
+single-threaded, or `pcc` if multi-threaded. That will shake out
+mutator/embedder bugs. Then if memory is tight, switch to
+`parallel-whippet`, possibly `parallel-generational-whippet`.
+
+If you are writing a new project, you have a choice as to whether to pay
+the development cost of precise roots or not. If you choose not to have
+precise roots, then go for `stack-conservative-parallel-whippet`
+directly.
+
+## More collectors
+
+It would be nice to have a classic generational GC, perhaps using
+parallel-whippet for the old generation and a pcc-style copying nursery.
+
+Support for concurrent marking in `whippet` would be good as well,
+perhaps with a SATB barrier. (Or, if you are the sort of person to bet
+on conservative stack scanning, perhaps a retreating-wavefront barrier
+would be more appropriate.)
+
+Contributions are welcome, provided they have no more dependencies!
diff --git a/doc/design.md b/doc/design.md
deleted file mode 100644
index 1a1c69bee..000000000
--- a/doc/design.md
+++ /dev/null
@@ -1,64 +0,0 @@
-# Design
-
-Whippet is mainly a mark-region collector, like
-[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
-See also the lovely detailed [Rust
-implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
-
-To a first approximation, Whippet is a whole-heap Immix collector with a
-large object space on the side. See the Immix paper for full details,
-but basically Immix divides the heap into 32kB blocks, and then divides
-those blocks into 128B lines. An Immix allocation never spans blocks;
-allocations larger than 8kB go into a separate large object space.
-Mutators request blocks from the global store and allocate into those
-blocks using bump-pointer allocation. When all blocks are consumed,
-Immix stops the world and traces the object graph, marking objects but
-also the lines that objects are on. After marking, blocks contain some
-lines with live objects and others that are completely free. Spans of
-free lines are called holes. When a mutator gets a recycled block from
-the global block store, it allocates into those holes. Also, sometimes
-Immix can choose to evacuate rather than mark. Bump-pointer-into-holes
-allocation is quite compatible with conservative roots, so it's an
-interesting option for Guile, which has a lot of legacy C API users.
-
-The essential difference of Whippet from Immix stems from a simple
-observation: Immix needs a side table of line mark bytes and also a mark
-bit or bits in each object (or in a side table). But if instead you
-choose to store mark bytes instead of bits (for concurrency reasons) in
-a side table, with one mark byte per granule (unit of allocation,
-perhaps 16 bytes), then you effectively have a line mark table where the
-granule size is the line size. You can bump-pointer allocate into holes
-in the mark byte table.
- -You might think this is a bad tradeoff, and perhaps it is: I don't know -yet. If your granule size is two pointers, then one mark byte per -granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on -32-bit, it's a lot! On the other hand, instead of the worst case of one -survivor object wasting a line (or two, in the case of conservative line -marking), granule-size-is-line-size instead wastes nothing. Also, you -don't need GC bits in the object itself, and you can use the mark byte -array to record the object end, so that finding holes in a block can -just read the mark table and can avoid looking at object memory. - -Other ideas in Whippet: - - * Minimize stop-the-world phase via parallel marking and punting all - sweeping to mutators - - * Enable mutator parallelism via lock-free block acquisition and lazy - statistics collation - - * Allocate block space using aligned 4 MB slabs, with embedded metadata - to allow metadata bytes, slab headers, and block metadata to be - located via address arithmetic - - * Facilitate conservative collection via mark byte array, oracle for - "does this address start an object" - - * Enable in-place generational collection via card table with one entry - per 256B or so - - * Enable concurrent marking by having three mark bit states (dead, - survivor, marked) that rotate at each collection, and sweeping a - block clears metadata for dead objects; but concurrent marking and - associated SATB barrier not yet implemented