Merged Whippet into libguile/whippet

2025-06-26 05:00:28 +02:00 · 2025-04-11 14:10:41 +02:00 · 2025-04-11 14:10:41 +02:00 · db181e67ff
commit db181e67ff
parent af96820e07 f909438596
112 changed files with 18115 additions and 0 deletions
--- a/libguile/whippet/doc/README.md
+++ b/libguile/whippet/doc/README.md
@ -0,0 +1,13 @@
+# Whippet documentation
+
+ * [Manual](./manual.md): How do you get your program to use
+   Whippet?  What is the API?
+
+ * [Collector implementations](./collectors.md): There are a number of
+   implementations of the Whippet API with differing performance
+   characteristics and which impose different requirements on the
+   embedder.
+
+ * [Guile](./guile.md): Some notes on a potential rebase of Guile on
+   top of Whippet.
+
--- a/libguile/whippet/doc/collector-bdw.md
+++ b/libguile/whippet/doc/collector-bdw.md
@ -0,0 +1,26 @@
+# Boehm-Demers-Weiser collector
+
+Whippet's `bdw` collector is backed by a third-party garbage collector,
+the [Boehm-Demers-Weiser collector](https://github.com/ivmai/bdwgc).
+
+BDW-GC is a mark-sweep collector with conservative root-finding,
+conservative heap tracing, and parallel tracing.
+
+Whereas the other Whippet collectors which rely on mutators to
+[periodically check if they need to
+stop](https://github.com/wingo/whippet/blob/main/doc/manual.md#safepoints),
+`bdw` will stop mutators with a POSIX signal.  Also, it doesn't really
+support ephemerons (the Whippet `bdw` collector simulates them using
+finalizers), and both ephemerons and finalizers only approximate the
+Whippet behavior, because they are implemented in terms of what BDW-GC
+provides.
+
+`bdw` supports the `fixed` and `growable` heap-sizing policies, but not
+`adaptive`, as BDW-GC can't reliably return memory to the OS.  Also,
+[`growable` has an effective limit of a 3x heap
+multiplier](https://github.com/wingo/whippet/blob/main/src/bdw.c#L478).
+Oh well!
+
+It's a bit of an oddball from a Whippet perspective, but useful as a
+migration path if you have an embedder that is already using BDW-GC.
+And, it is a useful performance comparison.
--- a/libguile/whippet/doc/collector-mmc.md
+++ b/libguile/whippet/doc/collector-mmc.md
@ -0,0 +1,148 @@
+# Mostly-marking collector
+
+The `mmc` collector is mainly a mark-region collector, inspired by
+[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
+To a first approximation, `mmc` is a whole-heap Immix collector with a
+large object space on the side.
+
+When tracing, `mmc` mostly marks objects in place.  If the heap is
+too fragmented, it can compact the heap by choosing to evacuate
+sparsely-populated heap blocks instead of marking in place.  However
+evacuation is strictly optional, which means that `mmc` is also
+compatible with conservative root-finding, making it a good replacement
+for embedders that currently use the [Boehm-Demers-Weiser
+collector](./collector-bdw.md).
+
+## Differences from Immix
+
+The original Immix divides the heap into 32kB blocks, and then divides
+those blocks into 128B lines.  An Immix allocation can span lines but
+not blocks; allocations larger than 8kB go into a separate large object
+space.  Mutators request blocks from the global store and allocate into
+those blocks using bump-pointer allocation.  When all blocks are
+consumed, Immix stops the world and traces the object graph, marking
+objects but also the lines that objects are on.  After marking, blocks
+contain some lines with live objects and others that are completely
+free.  Spans of free lines are called holes.  When a mutator gets a
+recycled block from the global block store, it allocates into those
+holes.  For an exposition of Immix, see the lovely detailed [Rust
+implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
+
+The essential difference of `mmc` from Immix stems from a simple
+observation: Immix needs a side table of line mark bytes and also a mark
+bit or bits in each object (or in a side table).  But if instead you
+choose to store mark bytes instead of bits (for concurrency reasons) in
+a side table, with one mark byte per granule (unit of allocation,
+perhaps 16 bytes), then you effectively have a line mark table where the
+granule size is the line size.  You can bump-pointer allocate into holes
+in the mark byte table.
+
+You might think this is a bad tradeoff, and perhaps it is: I don't know
+yet.  If your granule size is two pointers, then one mark byte per
+granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit.  Especially on
+32-bit, it's a lot!  On the other hand, instead of the worst case of one
+survivor object wasting a line (or two, in the case of conservative line
+marking), granule-size-is-line-size instead wastes nothing.  Also, you
+don't need GC bits in the object itself, and you can use the mark byte
+array to record the object end, so that finding holes in a block can
+just read the mark table and can avoid looking at object memory.
+
+## Optional features
+
+The `mmc` collector has a few feature flags that can be turned on or
+off.  If you use the [standard embedder makefile include](../embed.mk),
+then there is a name for each combination of features: `mmc` has no
+additional features, `parallel-mmc` enables parallel marking,
+`parallel-generational-mmc` enables generations,
+`stack-conservative-parallel-generational-mmc` uses conservative
+root-finding, and `heap-conservative-parallel-generational-mmc`
+additionally traces the heap conservatively.  You can leave off
+components of the name to get a collector without those features.
+Underneath this corresponds to some pre-processor definitions passed to
+the compiler on the command line.
+
+### Generations
+
+`mmc` supports generational tracing via the [sticky mark-bit
+algorithm](https://wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm).
+This requires that the embedder emit [write
+barriers](https://github.com/wingo/whippet/blob/main/doc/manual.md#write-barriers);
+if your embedder cannot ensure write barriers are always invoked, then
+generational collection is not for you.  (We could perhaps relax this a
+bit, following what [Ruby developers
+did](http://rvm.jp/~ko1/activities/rgengc_ismm.pdf).)
+
+The write barrier is currently a card-marking barrier emitted on stores,
+with one card byte per 256 object bytes, where the card location can be
+computed from the object address because blocks are allocated in
+two-megabyte aligned slabs.
+
+### Parallel tracing
+
+You almost certainly want this on!  `parallel-mmc` uses a the
+[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
+Each trace worker maintains a [local queue of objects that need
+tracing](../src/local-worklist.h), which currently has a capacity of
+1024 entries.  If the local queue becomes full, the worker will publish
+3/4 of those entries to the worker's [shared
+worklist](../src/shared-worklist.h).  When a worker runs out of local
+work, it will first try to remove work from its own shared worklist,
+then will try to steal from other workers.
+
+The memory used for the external worklist is dynamically allocated from
+the OS and is not currently counted as contributing to the heap size.
+If you absolutely need to avoid dynamic allocation during GC, `mmc`
+(even `serial-mmc`) would need some work for your use case, to allocate
+a fixed-size space for a marking queue and to gracefully handle mark
+queue overflow.
+
+### Conservative stack scanning
+
+With `semi` and `pcc`, embedders must precisely enumerate the set of
+*roots*: the edges into the heap from outside.  Commonly, roots include
+global variables, as well as working variables from each mutator's
+stack.  `mmc` can optionally mark mutator stacks *conservatively*:
+treating each word on the stack as if it may be an object reference, and
+marking any object at that address.
+
+After all these years, *whether* to mark stacks conservatively or not is
+still an open research question.  Conservative stack scanning can retain
+too much data if an integer is confused for an object reference and
+removes a layer of correctness-by-construction from a system.  Sometimes
+conservative stack-scanning is required, for example if your embedder
+cannot enumerate roots precisely.  But there are reasons to consider it
+even if you can do precise roots: conservative scanning removes the need
+for the compiler to produce a stack map to store the precise root
+enumeration at every safepoint; it removes the need to look up a stack
+map when tracing; and it allows C or C++ support code to avoid having to
+place roots in traceable locations published to the garbage collector.
+And the [performance question is still
+open](https://dl.acm.org/doi/10.1145/2660193.2660198).
+
+Anyway.  `mmc` can scan roots conservatively.  Those roots are pinned
+for the collection; even if the collection will compact via evacuation,
+referents of conservative roots won't be moved.  Objects not directly
+referenced by roots can be evacuated, however.
+
+### Conservative heap scanning
+
+In addition to stack and global references, the Boehm-Demers-Weiser
+collector scans heap objects conservatively as well, treating each word
+of each heap object as if it were a reference.  `mmc` can do that, if
+the embedder is unable to provide a `gc_trace_object` implementation.
+However this is generally a performance lose, and it prevents
+evacuation.
+
+## Other implementation tidbits
+
+`mmc` does lazy sweeping: as a mutator grabs a fresh block, it
+reclaims memory that was unmarked in the previous collection before
+making the memory available for allocation.  This makes sweeping
+naturally cache-friendly and parallel.
+
+The mark byte array facilitates conservative collection by being an
+oracle for "does this address start an object".
+
+For a detailed introduction, see [Whippet: Towards a new local
+maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
+a talk given at FOSDEM 2023.
--- a/libguile/whippet/doc/collector-pcc.md
+++ b/libguile/whippet/doc/collector-pcc.md
@ -0,0 +1,84 @@
+# Parallel copying collector
+
+Whippet's `pcc` collector is a copying collector, like the more simple
+[`semi`](./collector-semi.md), but supporting multiple mutator threads,
+multiple tracing threads, and using an external FIFO worklist instead of
+a Cheney worklist.
+
+Like `semi`, `pcc` traces by evacuation: it moves all live objects on
+every collection.  (Exception:  objects larger than 8192 bytes are
+placed into a partitioned space which traces by marking in place instead
+of copying.)  Evacuation requires precise roots, so if your embedder
+does not support precise roots, `pcc` is not for you.
+
+Again like `semi`, `pcc` generally requires a heap size at least twice
+as large as the maximum live heap size, and performs best with ample
+heap sizes; between 3× and 5× is best.
+
+Overall, `pcc` is a better version of `semi`.  It should have broadly
+the same performance characteristics with a single mutator and with
+parallelism disabled, additionally allowing multiple mutators, and
+scaling better with multiple tracing threads.
+
+`pcc` has a generational configuration, conventionally referred to as
+`generational-pcc`, in which both the nursery and the old generation are
+copy spaces.  Objects stay in the nursery for one cycle before moving on
+to the old generation.  This configuration is a bit new (January 2025)
+and still needs some tuning.
+
+## Implementation notes
+
+Unlike `semi` which has a single global bump-pointer allocation region,
+`pcc` structures the heap into 64-kB blocks.  In this way it supports
+multiple mutator threads: mutators do local bump-pointer allocation into
+their own block, and when their block is full, they fetch another from
+the global store.
+
+The block size is 64 kB, but really it's 128 kB, because each block has
+two halves: the active region and the copy reserve.  Dividing each block
+in two allows the collector to easily grow and shrink the heap while
+ensuring there is always enough reserve space.
+
+Blocks are allocated in 64-MB aligned slabs, so there are 512 blocks in
+a slab.  The first block in a slab is used by the collector itself, to
+keep metadata for the rest of the blocks, for example a chain pointer
+allowing blocks to be collected in lists, a saved allocation pointer for
+partially-filled blocks, whether the block is paged in or out, and so
+on.
+
+`pcc` supports tracing in parallel.  This mechanism works somewhat like
+allocation, in which multiple trace workers compete to evacuate objects
+into their local allocation buffers; when an allocation buffer is full,
+the trace worker grabs another, just like mutators do.
+
+Unlike the simple semi-space collector which uses a Cheney grey
+worklist, `pcc` uses an external worklist.  If parallelism is disabled
+at compile-time, it uses a simple first-in, first-out queue of objects
+to be traced.  Like a Cheney worklist, this should result in objects
+being copied in breadth-first order.  The literature would suggest that
+depth-first is generally better for locality, but that preserving
+allocation order is generally best.  This is something to experiment
+with in the future.
+
+If parallelism is enabled, as it is by default, `pcc` uses a
+[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
+Each trace worker maintains a [local queue of objects that need
+tracing](../src/local-worklist.h), which currently has 1024 entries.  If
+the local queue becomes full, the worker will publish 3/4 of those
+entries to the worker's [shared worklist](../src/shared-worklist.h).
+When a worker runs out of local work, it will first try to remove work
+from its own shared worklist, then will try to steal from other workers.
+
+If only one tracing thread is enabled at run-time (`parallelism=1`) (or
+if parallelism is disabled at compile-time), `pcc` will evacuate by
+non-atomic forwarding, but if multiple threads compete to evacuate
+objects, `pcc` uses [atomic compare-and-swap instead of simple
+forwarding pointer updates](./manual.md#forwarding-objects).  This
+imposes around a ~30% performance penalty but having multiple tracing
+threads is generally worth it, unless the object graph is itself serial.
+
+The memory used for the external worklist is dynamically allocated from
+the OS and is not currently counted as contributing to the heap size.
+If you are targetting a microcontroller or something, probably you need
+to choose a different kind of collector that never dynamically
+allocates, such as `semi`.
--- a/libguile/whippet/doc/collector-semi.md
+++ b/libguile/whippet/doc/collector-semi.md
@ -0,0 +1,23 @@
+# Semi-space collector
+
+The `semi` collector is simple.  It is mostly useful as a first
+collector to try out, to make sure that a mutator correctly records all
+roots: because `semi` moves every live object on every collection, it is
+very effective at shaking out mutator bugs.
+
+If your embedder chooses to not precisely record roots, for example
+instead choosing to conservatively scan the stack, then the semi-space
+collector is not for you: `semi` requires precise roots.
+
+For more on semi-space collectors, see
+https://wingolog.org/archives/2022/12/10/a-simple-semi-space-collector.
+
+Whippet's `semi` collector incorporates a large-object space, which
+marks objects in place instead of moving.  Otherwise, `semi` generally
+requires a heap size at least twice as large as the maximum live heap
+size, and performs best with ample heap sizes; between 3× and 5× is
+best.
+
+The semi-space collector doesn't support multiple mutator threads.  If
+you want a copying collector for a multi-threaded mutator, look at
+[pcc](./collector-pcc.md).
--- a/libguile/whippet/doc/collectors.md
+++ b/libguile/whippet/doc/collectors.md
@ -0,0 +1,43 @@
+# Whippet collectors
+
+Whippet has four collectors currently:
+ - [Semi-space collector (`semi`)](./collector-semi.md): For
+   single-threaded embedders who are not too tight on memory.
+ - [Parallel copying collector (`pcc`)](./collector-pcc.md): Like
+   `semi`, but with support for multiple mutator and tracing threads and
+   generational collection.
+ - [Mostly marking collector (`mmc`)](./collector-mmc.md):
+   Immix-inspired collector.  Optionally parallel, conservative (stack
+   and/or heap), and/or generational.
+ - [Boehm-Demers-Weiser collector (`bdw`)](./collector-bdw.md):
+   Conservative mark-sweep collector, implemented by
+   Boehm-Demers-Weiser library.
+
+## How to choose?
+
+If you are migrating an embedder off BDW-GC, then it could be reasonable
+to first go to `bdw`, then `stack-conservative-parallel-mmc`.
+
+If you have an embedder with precise roots, use `pcc`.  That will shake
+out mutator/embedder bugs.  Then if memory is tight, switch to
+`parallel-mmc`, possibly `parallel-generational-mmc`.
+
+If you are aiming for maximum simplicity and minimal code size (ten
+kilobytes or so), use `semi`.
+
+If you are writing a new project, you have a choice as to whether to pay
+the development cost of precise roots or not.  If you choose to not have
+precise roots, then go for `stack-conservative-parallel-mmc` directly.
+
+## More collectors
+
+It would be nice to have a generational GC that uses the space from
+`parallel-mmc` for the old generation but a pcc-style copying nursery.
+We have `generational-pcc` now, so this should be possible.
+
+Support for concurrent marking in `mmc` would be good as well, perhaps
+with a SATB barrier.  (Or, if you are the sort of person to bet on
+conservative stack scanning, perhaps a retreating-wavefront barrier
+would be more appropriate.)
+
+Contributions are welcome, provided they have no more dependencies!
--- a/libguile/whippet/doc/guile.md
+++ b/libguile/whippet/doc/guile.md
@ -0,0 +1,26 @@
+# Whippet and Guile
+
+If the `mmc` collector works out, it could replace Guile's garbage
+collector.  Guile currently uses BDW-GC.  Guile has a widely used C API
+and implements part of its run-time in C.  For this reason it may be
+infeasible to require precise enumeration of GC roots -- we may need to
+allow GC roots to be conservatively identified from data sections and
+from stacks.  Such conservative roots would be pinned, but other objects
+can be moved by the collector if it chooses to do so.  We assume that
+object references within a heap object can be precisely identified.
+(However, Guile currently uses BDW-GC in its default configuration,
+which scans for references conservatively even on the heap.)
+
+The existing C API allows direct access to mutable object fields,
+without the mediation of read or write barriers.  Therefore it may be
+impossible to switch to collector strategies that need barriers, such as
+generational or concurrent collectors.  However, we shouldn't write off
+this possibility entirely; an ideal replacement for Guile's GC will
+offer the possibility of migration to other GC designs without imposing
+new requirements on C API users in the initial phase.
+
+In this regard, the Whippet experiment also has the goal of identifying
+a smallish GC abstraction in Guile, so that we might consider evolving
+GC implementation in the future without too much pain.  If we switch
+away from BDW-GC, we should be able to evaluate that it's a win for a
+large majority of use cases.
--- a/libguile/whippet/doc/manual.md
+++ b/libguile/whippet/doc/manual.md
@ -0,0 +1,718 @@
+# Whippet user's guide
+
+Whippet is an embed-only library: it should be copied into the source
+tree of the program that uses it.  The program's build system needs to
+be wired up to compile Whippet, then link it into the program that uses
+it.
+
+## Subtree merges
+
+One way is get Whippet is just to manually copy the files present in a
+Whippet checkout into your project.  However probably the best way is to
+perform a [subtree
+merge](https://docs.github.com/en/get-started/using-git/about-git-subtree-merges)
+of Whippet into your project's Git repository, so that you can easily
+update your copy of Whippet in the future.
+
+Performing the first subtree merge is annoying and full of arcane
+incantations.  Follow the [subtree merge
+page](https://docs.github.com/en/get-started/using-git/about-git-subtree-merges)
+for full details, but for a cheat sheet, you might do something like
+this to copy Whippet into the `whippet/` directory of your project root:
+
+```
+git remote add whippet https://github.com/wingo/whippet
+git fetch whippet
+git merge -s ours --no-commit --allow-unrelated-histories whippet/main
+git read-tree --prefix=whippet/ -u whippet/main
+git commit -m 'Added initial Whippet merge'
+```
+
+Then to later update your copy of whippet, assuming you still have the
+`whippet` remote, just do:
+
+```
+git pull -s subtree whippet main
+```
+
+## `gc-embedder-api.h`
+
+To determine the live set of objects, a tracing garbage collector starts
+with a set of root objects, and then transitively visits all reachable
+object edges.  Exactly how it goes about doing this depends on the
+program that is using the garbage collector; different programs will
+have different object representations, different strategies for
+recording roots, and so on.
+
+To traverse the heap in a program-specific way but without imposing an
+abstraction overhead, Whippet requires that a number of data types and
+inline functions be implemented by the program, for use by Whippet
+itself.  This is the *embedder API*, and this document describes what
+Whippet requires from a program.
+
+A program should provide a header file implementing the API in
+[`gc-embedder-api.h`](../api/gc-embedder-api.h).  This header should only be
+included when compiling Whippet itself; it is not part of the API that
+Whippet exposes to the program.
+
+### Identifying roots
+
+The collector uses two opaque struct types, `struct gc_mutator_roots`
+and `struct gc_heap_roots`, that are used by the program to record
+object roots.  Probably you should put the definition of these data
+types in a separate header that is included both by Whippet, via the
+embedder API, and via users of Whippet, so that programs can populate
+the root set.  In any case the embedder-API use of these structs is via
+`gc_trace_mutator_roots` and `gc_trace_heap_roots`, two functions that
+are passed a trace visitor function `trace_edge`, and which should call
+that function on all edges from a given mutator or heap.  (Usually
+mutator roots are per-thread roots, such as from the stack, and heap
+roots are global roots.)
+
+### Tracing objects
+
+The `gc_trace_object` is responsible for calling the `trace_edge`
+visitor function on all outgoing edges in an object.  It also includes a
+`size` out-parameter, for when the collector wants to measure the size
+of an object.  `trace_edge` and `size` may be `NULL`, in which case no
+tracing or size computation should be performed.
+
+### Tracing ephemerons and finalizers
+
+Most kinds of GC-managed object are defined by the program, but the GC
+itself has support for two specific object kind: ephemerons and
+finalizers.  If the program allocates ephemerons, it should trace them
+in the `gc_trace_object` function by calling `gc_trace_ephemeron` from
+[`gc-ephemerons.h`](../api/gc-ephemerons.h).  Likewise if the program
+allocates finalizers, it should trace them by calling
+`gc_trace_finalizer` from [`gc-finalizer.h`](../api/gc-finalizer.h).
+
+### Forwarding objects
+
+When built with a collector that moves objects, the embedder must also
+allow for forwarding pointers to be installed in an object.  There are
+two forwarding APIs: one that is atomic and one that isn't.
+
+The nonatomic API is relatively simple; there is a
+`gc_object_forwarded_nonatomic` function that returns an embedded
+forwarding address, or 0 if the object is not yet forwarded, and
+`gc_object_forward_nonatomic`, which installs a forwarding pointer.
+
+The atomic API is gnarly.  It is used by parallel collectors, in which
+multiple collector threads can race to evacuate an object.
+
+There is a state machine associated with the `gc_atomic_forward`
+structure from [`gc-forwarding.h`](../api/gc-forwarding.h); the embedder API
+implements the state changes.  The collector calls
+`gc_atomic_forward_begin` on an object to begin a forwarding attempt,
+and the resulting `gc_atomic_forward` can be in the `NOT_FORWARDED`,
+`FORWARDED`, or `BUSY` state.
+
+If the `gc_atomic_forward`'s state is `BUSY`, the collector will call
+`gc_atomic_forward_retry_busy`; a return value of 0 means the object is
+still busy, because another thread is attempting to forward it.
+Otherwise the forwarding state becomes either `FORWARDED`, if the other
+thread succeeded in forwarding it, or go back to `NOT_FORWARDED`,
+indicating that the other thread failed to forward it.
+
+If the forwarding state is `FORWARDED`, the collector will call
+`gc_atomic_forward_address` to get the new address.
+
+If the forwarding state is `NOT_FORWARDED`, the collector may begin a
+forwarding attempt by calling `gc_atomic_forward_acquire`.  The
+resulting state is `ACQUIRED` on success, or `BUSY` if another thread
+acquired the object in the meantime, or `FORWARDED` if another thread
+acquired and completed the forwarding attempt.
+
+An `ACQUIRED` object can then be forwarded via
+`gc_atomic_forward_commit`, or the forwarding attempt can be aborted via
+`gc_atomic_forward_abort`.  Also, when an object is acquired, the
+collector may call `gc_atomic_forward_object_size` to compute how many
+bytes to copy.  (The collector may choose instead to record object sizes
+in a different way.)
+
+All of these `gc_atomic_forward` functions are to be implemented by the
+embedder.  Some programs may allocate a dedicated forwarding word in all
+objects; some will manage to store the forwarding word in an initial
+"tag" word, via a specific pattern for the low 3 bits of the tag that no
+non-forwarded object will have.  The low-bits approach takes advantage
+of the collector's minimum object alignment, in which objects are
+aligned at least to an 8-byte boundary, so all objects have 0 for the
+low 3 bits of their address.
+
+### Conservative references
+
+Finally, when configured in a mode in which root edges or intra-object
+edges are *conservative*, the embedder can filter out which bit patterns
+might be an object reference by implementing
+`gc_is_valid_conservative_ref_displacement`.  Here, the collector masks
+off the low bits of a conservative reference, and asks the embedder if a
+value with those low bits might point to an object.  Usually the
+embedder should return 1 only if the displacement is 0, but if the
+program allows low-bit tagged pointers, then it should also return 1 for
+those pointer tags.
+
+### External objects
+
+Sometimes a system will allocate objects outside the GC, for example on
+the stack or in static data sections.  To support this use case, Whippet
+allows the embedder to provide a `struct gc_extern_space`
+implementation.  Whippet will call `gc_extern_space_start_gc` at the
+start of each collection, and `gc_extern_space_finish_gc` at the end.
+External objects will be visited by `gc_extern_space_mark`, which should
+return nonzero if the object hasn't been seen before and needs to be
+traced via `gc_trace_object` (coloring the object grey).  Note,
+`gc_extern_space_mark` may be called concurrently from many threads; be
+prepared!
+
+## Configuration, compilation, and linking
+
+To the user, Whippet presents an abstract API that does not encode the
+specificities of any given collector.  Whippet currently includes four
+implementations of that API: `semi`, a simple semi-space collector;
+`pcc`, a parallel copying collector (like semi but multithreaded);
+`bdw`, an implementation via the third-party
+[Boehm-Demers-Weiser](https://github.com/ivmai/bdwgc) conservative
+collector; and `mmc`, a mostly-marking collector inspired by Immix.
+
+The program that embeds Whippet selects the collector implementation at
+build-time.  For `pcc`, the program can also choose whether to be
+generational or not.  For `mmc` collector, the program configures a
+specific collector mode, again at build-time: generational or not,
+parallel or not, stack-conservative or not, and heap-conservative or
+not.  It may be nice in the future to be able to configure these at
+run-time, but for the time being they are compile-time options so that
+adding new features doesn't change the footprint of a more minimal
+collector.
+
+Different collectors have different allocation strategies: for example,
+the BDW collector allocates from thread-local freelists, whereas the
+semi-space collector has a bump-pointer allocator.  A collector may also
+expose a write barrier, for example to enable generational collection.
+For performance reasons, many of these details can't be hidden behind an
+opaque functional API: they must be inlined into call sites.  Whippet's
+approach is to expose fast paths as part of its inline API, but which
+are *parameterized* on attributes of the selected garbage collector.
+The goal is to keep the user's code generic and avoid any code
+dependency on the choice of garbage collector.  Because of inlining,
+however, the choice of garbage collector does need to be specified when
+compiling user code.
+
+### Compiling the collector
+
+As an embed-only library, Whippet needs to be integrated into the build
+system of its host (embedder).  There are two build systems supported
+currently; we would be happy to add other systems over time.
+
+#### GNU make
+
+At a high level, first the embedder chooses a collector and defines how
+to specialize the collector against the embedder.  Whippet's `embed.mk`
+Makefile snippet then defines how to build the set of object files that
+define the collector, and how to specialize the embedder against the
+chosen collector.
+
+As an example, say you have a file `program.c`, and you want to compile
+it against a Whippet checkout in `whippet/`.  Your headers are in
+`include/`, and you have written an implementation of the embedder
+interface in `host-gc.h`.  In that case you would have a Makefile like
+this:
+
+```
+HOST_DIR:=$(dir $(lastword $(MAKEFILE_LIST)))
+WHIPPET_DIR=$(HOST_DIR)whippet/
+
+all: out
+
+# The collector to choose: e.g. semi, bdw, pcc, generational-pcc, mmc,
+# parallel-mmc, etc.
+GC_COLLECTOR=pcc
+
+include $(WHIPPET_DIR)embed.mk
+
+# Host cflags go here...
+HOST_CFLAGS=
+
+# Whippet's embed.mk uses this variable when it compiles code that
+# should be specialized against the embedder.
+EMBEDDER_TO_GC_CFLAGS=$(HOST_CFLAGS) -include $(HOST_DIR)host-gc.h
+
+program.o: program.c
+	$(GC_COMPILE) $(HOST_CFLAGS) $(GC_TO_EMBEDDER_CFLAGS) -c $<
+program: program.o $(GC_OBJS)
+	$(GC_LINK) $^ $(GC_LIBS)
+```
+
+The optimization settings passed to the C compiler are taken from
+`GC_BUILD_CFLAGS`.  Embedders can override this variable directly, or
+via the shorthand `GC_BUILD` variable.  A `GC_BUILD` of `opt` indicates
+maximum optimization and no debugging assertions; `optdebug` adds
+debugging assertions; and `debug` removes optimizations.
+
+Though Whippet tries to put performance-sensitive interfaces in header
+files, users should also compile with link-time optimization (LTO) to
+remove any overhead imposed by the division of code into separate
+compilation units.  `embed.mk` includes the necessary LTO flags in
+`GC_CFLAGS` and `GC_LDFLAGS`.
+
+#### GNU Autotools
+
+To use Whippet from an autotools project, the basic idea is to include a
+`Makefile.am` snippet from the subdirectory containing the Whippet
+checkout.  That will build `libwhippet.la`, which you should link into
+your binary.  There are some `m4` autoconf macros that need to be
+invoked, for example to select the collector.
+
+Let us imagine you have checked out Whippet in `whippet/`.  Let us also
+assume for the moment that we are going to build `mt-gcbench`, a program
+included in Whippet itself.
+
+A top-level autoconf file (`configure.ac`) might look like this:
+
+```autoconf
+AC_PREREQ([2.69])
+AC_INIT([whippet-autotools-example],[0.1.0])
+AC_CONFIG_SRCDIR([whippet/benchmarks/mt-gcbench.c])
+AC_CONFIG_AUX_DIR([build-aux])
+AC_CONFIG_MACRO_DIRS([m4 whippet])
+AM_INIT_AUTOMAKE([subdir-objects foreign])
+
+WHIPPET_ENABLE_LTO
+
+LT_INIT
+
+WARN_CFLAGS=-Wall
+AC_ARG_ENABLE([Werror],
+  AS_HELP_STRING([--disable-Werror],
+                 [Don't stop the build on errors]),
+  [],
+  WARN_CFLAGS="-Wall -Werror")
+CFLAGS="$CFLAGS $WARN_CFLAGS"
+
+WHIPPET_PKG
+
+AC_CONFIG_FILES(Makefile)
+AC_OUTPUT
+```
+
+Then your `Makefile.am` might look like this:
+
+```automake
+noinst_LTLIBRARIES =
+WHIPPET_EMBEDDER_CPPFLAGS = -include $(srcdir)/whippet/benchmarks/mt-gcbench-embedder.h
+include whippet/embed.am
+
+noinst_PROGRAMS = whippet/benchmarks/mt-gcbench
+whippet_benchmarks_mt_gcbench_SOURCES = \
+  whippet/benchmarks/heap-objects.h \
+  whippet/benchmarks/mt-gcbench-embedder.h \
+  whippet/benchmarks/mt-gcbench-types.h \
+  whippet/benchmarks/mt-gcbench.c \
+  whippet/benchmarks/simple-allocator.h \
+  whippet/benchmarks/simple-gc-embedder.h \
+  whippet/benchmarks/simple-roots-api.h \
+  whippet/benchmarks/simple-roots-types.h \
+  whippet/benchmarks/simple-tagging-scheme.h
+
+AM_CFLAGS = $(WHIPPET_CPPFLAGS) $(WHIPPET_CFLAGS) $(WHIPPET_TO_EMBEDDER_CPPFLAGS)
+LDADD = libwhippet.la
+```
+
+We have to list all the little header files it uses because, well,
+autotools.
+
+To actually build, you do the usual autotools dance:
+
+```bash
+autoreconf -vif && ./configure && make
+```
+
+See `./configure --help` for a list of user-facing options.  Before the
+`WHIPPET_PKG`, you can run e.g. `WHIPPET_PKG_COLLECTOR(mmc)` to set the
+default collector to `mmc`; if you don't do that, the default collector
+is `pcc`.  There are also `WHIPPET_PKG_DEBUG`, `WHIPPET_PKG_TRACING`,
+and `WHIPPET_PKG_PLATFORM`; see [`whippet.m4`](../whippet.m4) for more
+details.  See also
+[`whippet-autotools`](https://github.com/wingo/whippet-autotools) for an
+example of how this works.
+
+#### Compile-time options
+
+There are a number of pre-processor definitions that can parameterize
+the collector at build-time:
+
+ * `GC_DEBUG`: If nonzero, then enable debugging assertions.
+ * `NDEBUG`: This one is a bit weird; if not defined, then enable
+   debugging assertions and some debugging printouts.  Probably
+   Whippet's use of `NDEBUG` should be folded in to `GC_DEBUG`.
+ * `GC_PARALLEL`: If nonzero, then enable parallelism in the collector.
+   Defaults to 0.
+ * `GC_GENERATIONAL`: If nonzero, then enable generational collection.
+   Defaults to zero.
+ * `GC_PRECISE_ROOTS`: If nonzero, then collect precise roots via
+   `gc_heap_roots` and `gc_mutator_roots`.  Defaults to zero.
+ * `GC_CONSERVATIVE_ROOTS`: If nonzero, then scan the stack and static
+   data sections for conservative roots.  Defaults to zero.  Not
+   mutually exclusive with `GC_PRECISE_ROOTS`.
+ * `GC_CONSERVATIVE_TRACE`: If nonzero, heap edges are scanned
+   conservatively.  Defaults to zero.
+
+Some collectors require specific compile-time options.  For example, the
+semi-space collector has to be able to move all objects; this is not
+compatible with conservative roots or heap edges.
+
+#### Tracing support
+
+Whippet includes support for low-overhead run-time tracing via
+[LTTng](https://lttng.org/).  If the support library `lttng-ust` is
+present when Whippet is compiled (as checked via `pkg-config`),
+tracepoint support will be present.  See
+[tracepoints.md](./tracepoints.md) for more information on how to get
+performance traces out of Whippet.
+
+## Using the collector
+
+Whew!  So you finally built the thing!  Did you also link it into your
+program?  No, because your program isn't written yet?  Well this section
+is for you: we describe the user-facing API of Whippet, where "user" in
+this case denotes the embedding program.
+
+What is the API, you ask?  It is in [`gc-api.h`](../api/gc-api.h).
+
+### Heaps and mutators
+
+To start with, you create a *heap*.  Usually an application will create
+just one heap.  A heap has one or more associated *mutators*.  A mutator
+is a thread-specific handle on the heap.  Allocating objects requires a
+mutator.
+
+The initial heap and mutator are created via `gc_init`, which takes
+three logical input parameters: the *options*, a stack base address, and
+an *event listener*.  The options specify the initial heap size and so
+on.  The event listener is mostly for gathering statistics; see below
+for more.  `gc_init` returns the new heap as an out parameter, and also
+returns a mutator for the current thread.
+
+To make a new mutator for a new thread, use `gc_init_for_thread`.  When
+a thread is finished with its mutator, call `gc_finish_for_thread`.
+Each thread that allocates or accesses GC-managed objects should have
+its own mutator.
+
+The stack base address allows the collector to scan the mutator's stack,
+if conservative root-finding is enabled.  It may be omitted in the call
+to `gc_init` and `gc_init_for_thread`; passing `NULL` tells Whippet to
+ask the platform for the stack bounds of the current thread.  Generally
+speaking, this works on all platforms for the main thread, but not
+necessarily on other threads.  The most reliable solution is to
+explicitly obtain a base address by trampolining through
+`gc_call_with_stack_addr`.
+
+### Options
+
+There are some run-time parameters that programs and users might want to
+set explicitly; these are encapsulated in the *options*.  Make an
+options object with `gc_allocate_options()`; this object will be
+consumed by its `gc_init`.  Then, the most convenient thing is to set
+those options from `gc_options_parse_and_set_many` from a string passed
+on the command line or an environment variable, but to get there we have
+to explain the low-level first.  There are a few options that are
+defined for all collectors:
+
+ * `GC_OPTION_HEAP_SIZE_POLICY`: How should we size the heap?  Either
+   it's `GC_HEAP_SIZE_FIXED` (which is 0), in which the heap size is
+   fixed at startup; or `GC_HEAP_SIZE_GROWABLE` (1), in which the heap
+   may grow but will never shrink; or `GC_HEAP_SIZE_ADAPTIVE` (2), in
+   which we take an
+   [adaptive](https://wingolog.org/archives/2023/01/27/three-approaches-to-heap-sizing)
+   approach, depending on the rate of allocation and the cost of
+   collection.  Really you want the adaptive strategy, but if you are
+   benchmarking you definitely want the fixed policy.
+ * `GC_OPTION_HEAP_SIZE`: The initial heap size.  For a
+   `GC_HEAP_SIZE_FIXED` policy, this is also the final heap size.  In
+   bytes.
+ * `GC_OPTION_MAXIMUM_HEAP_SIZE`: For growable and adaptive heaps, the
+   maximum heap size, in bytes.
+ * `GC_OPTION_HEAP_SIZE_MULTIPLIER`: For growable heaps, the target heap
+   multiplier.  A heap multiplier of 2.5 means that for 100 MB of live
+   data, the heap should be 250 MB.
+ * `GC_OPTION_HEAP_EXPANSIVENESS`: For adaptive heap sizing, an
+   indication of how much free space will be given to heaps, as a
+   proportion of the square root of the live data size.
+ * `GC_OPTION_PARALLELISM`: How many threads to devote to collection
+   tasks during GC pauses.  By default, the current number of
+   processors, with a maximum of 8.
+
+You can set these options via `gc_option_set_int` and so on; see
+[`gc-options.h`](../api/gc-options.h).  Or, you can parse options from
+strings: `heap-size-policy`, `heap-size`, `maximum-heap-size`, and so
+on.  Use `gc_option_from_string` to determine if a string is really an
+option.  Use `gc_option_parse_and_set` to parse a value for an option.
+Use `gc_options_parse_and_set_many` to parse a number of comma-delimited
+*key=value* settings from a string.
+
+### Allocation
+
+So you have a heap and a mutator; great!  Let's allocate!  Call
+`gc_allocate`, passing the mutator and the number of bytes to allocate.
+
+There is also `gc_allocate_fast`, which is an inlined fast-path.  If
+that returns NULL, you need to call `gc_allocate_slow`.  The advantage
+of this API is that you can punt some root-saving overhead to the slow
+path.
+
+Allocation always succeeds.  If it doesn't, it kills your program.  The
+bytes in the resulting allocation will be initialized to 0.
+
+The allocation fast path is parameterized by collector-specific
+attributes.  JIT compilers can also read those attributes to emit
+appropriate inline code that replicates the logic of `gc_allocate_fast`.
+
+### Write barriers
+
+For some collectors, mutators have to tell the collector whenever they
+mutate an object.  They tell the collector by calling a *write barrier*;
+in Whippet this is currently the case only for generational collectors.
+
+The write barrier is `gc_write_barrier`; see `gc-api.h` for its
+parameters.
+
+As with allocation, the fast path for the write barrier is parameterized
+by collector-specific attributes, to allow JIT compilers to inline write
+barriers.
+
+### Safepoints
+
+Sometimes Whippet will need to synchronize all threads, for example as
+part of the "stop" phase of a stop-and-copy semi-space collector.
+Whippet stops at *safepoints*.  At a safepoint, all mutators must be
+able to enumerate all of their edges to live objects.
+
+Whippet has cooperative safepoints: mutators have to periodically call
+into the collector to potentially synchronize with other mutators.
+`gc_allocate_slow` is a safepoint, so if you a bunch of threads that are
+all allocating, usually safepoints are reached in a more-or-less prompt
+fashion.  But if a mutator isn't allocating, it either needs to
+temporarily mark itself as inactive by trampolining through
+`gc_call_without_gc`, or it should arrange to periodically call
+`gc_safepoint`.  Marking a mutator as inactive is the right strategy
+for, for example, system calls that might block.  Periodic safepoints is
+better for code that is active but not allocating.
+
+Also, the BDW collector actually uses pre-emptive safepoints: it stops
+threads via POSIX signals.  `gc_safepoint` is a no-op with BDW.
+
+Embedders can inline safepoint checks.  If
+`gc_cooperative_safepoint_kind()` is `GC_COOPERATIVE_SAFEPOINT_NONE`,
+then the collector doesn't need safepoints, as is the case for `bdw`
+which uses signals and `semi` which is single-threaded.  If it is
+`GC_COOPERATIVE_SAFEPOINT_HEAP_FLAG`, then calling
+`gc_safepoint_flag_loc` on a mutator will return the address of an `int`
+in memory, which if nonzero when loaded using relaxed atomics indicates
+that the mutator should call `gc_safepoint_slow`.  Similarly for
+`GC_COOPERATIVE_SAFEPOINT_MUTATOR_FLAG`, except that the address is
+per-mutator rather than global.
+
+### Pinning
+
+Sometimes a mutator or embedder would like to tell the collector to not
+move a particular object.  This can happen for example during a foreign
+function call, or if the embedder allows programs to access the address
+of an object, for example to compute an identity hash code.  To support
+this use case, some Whippet collectors allow the embedder to *pin*
+objects.  Call `gc_pin_object` to prevent the collector from relocating
+an object.
+
+Pinning is currently supported by the `bdw` collector, which never moves
+objects, and also by the various `mmc` collectors, which can move
+objects that have no inbound conservative references.
+
+Pinning is not supported on `semi` or `pcc`.
+
+Call `gc_can_pin_objects` to determine whether the current collector can
+pin objects.
+
+### Statistics
+
+Sometimes a program would like some information from the GC: how many
+bytes and objects have been allocated?  How much time has been spent in
+the GC?  How many times has GC run, and how many of those were minor
+collections?  What's the maximum pause time?  Stuff like that.
+
+Instead of collecting a fixed set of information, Whippet emits
+callbacks when the collector reaches specific states.  The embedder
+provides a *listener* for these events when initializing the collector.
+
+The listener interface is defined in
+[`gc-event-listener.h`](../api/gc-event-listener.h).  Whippet ships with
+two listener implementations,
+[`GC_NULL_EVENT_LISTENER`](../api/gc-null-event-listener.h), and
+[`GC_BASIC_STATS`](../api/gc-basic-stats.h).  Most embedders will want
+their own listener, but starting with the basic stats listener is not a
+bad option:
+
+```
+#include "gc-api.h"
+#include "gc-basic-stats.h"
+#include <stdio.h>
+
+int main() {
+  struct gc_options *options = NULL;
+  struct gc_heap *heap;
+  struct gc_mutator *mut;
+  struct gc_basic_stats stats;
+  gc_init(options, NULL, &heap, &mut, GC_BASIC_STATS, &stats);
+  // ...
+  gc_basic_stats_finish(&stats);
+  gc_basic_stats_print(&stats, stdout);
+}
+```
+
+As you can see, `GC_BASIC_STATS` expands to a `struct gc_event_listener`
+definition.  We pass an associated pointer to a `struct gc_basic_stats`
+instance which will be passed to the listener at every event.
+
+The output of this program might be something like:
+
+```
+Completed 19 major collections (0 minor).
+654.597 ms total time (385.235 stopped).
+Heap size is 167.772 MB (max 167.772 MB); peak live data 55.925 MB.
+```
+
+There are currently three different sorts of events: heap events to
+track heap growth, collector events to time different parts of
+collection, and mutator events to indicate when specific mutators are
+stopped.
+
+There are three heap events:
+
+ * `init(void* data, size_t heap_size)`: Called during `gc_init`, to
+   allow the listener to initialize its associated state.
+ * `heap_resized(void* data, size_t new_size)`: Called if the heap grows
+   or shrinks.
+ * `live_data_size(void* data, size_t size)`: Called periodically when
+   the collector learns about live data size.
+ 
+The collection events form a kind of state machine, and are called in
+this order:
+
+ * `requesting_stop(void* data)`: Called when the collector asks
+   mutators to stop.
+ * `waiting_for_stop(void* data)`: Called when the collector has done
+   all the pre-stop work that it is able to and is just waiting on
+   mutators to stop.
+ * `mutators_stopped(void* data)`: Called when all mutators have
+   stopped; the trace phase follows.
+ * `prepare_gc(void* data, enum gc_collection_kind gc_kind)`: Called 
+   to indicate which kind of collection is happening.
+ * `roots_traced(void* data)`: Called when roots have been visited.
+ * `heap_traced(void* data)`: Called when the whole heap has been
+   traced.
+ * `ephemerons_traced(void* data)`: Called when the [ephemeron
+   fixpoint](https://wingolog.org/archives/2023/01/24/parallel-ephemeron-tracing)
+   has been reached.
+ * `restarting_mutators(void* data)`: Called right before the collector
+   restarts mutators.
+
+The collectors in Whippet will call all of these event handlers, but it
+may be that they are called conservatively: for example, the
+single-mutator, single-collector semi-space collector will never have to
+wait for mutators to stop.  It will still call the functions, though!
+
+Finally, there are the mutator events:
+ * `mutator_added(void* data) -> void*`: The only event handler that
+   returns a value, called when a new mutator is added.  The parameter
+   is the overall event listener data, and the result is
+   mutator-specific data.  The rest of the mutator events pass this
+   mutator-specific data instead.
+ * `mutator_cause_gc(void* mutator_data)`: Called when a mutator causes
+   GC, either via allocation or an explicit `gc_collect` call.
+ * `mutator_stopping(void* mutator_data)`: Called when a mutator has
+   received the signal to stop.  It may perform some marking work before
+   it stops.
+ * `mutator_stopped(void* mutator_data)`: Called when a mutator parks
+   itself.
+ * `mutator_restarted(void* mutator_data)`: Called when a mutator
+   restarts.
+ * `mutator_removed(void* mutator_data)`: Called when a mutator goes
+   away.
+
+Note that these events handlers shouldn't really do much.  In
+particular, they shouldn't call into the Whippet API, and they shouldn't
+even access GC-managed objects.  Event listeners are really about
+statistics and profiling and aren't a place to mutate the object graph.
+
+### Ephemerons
+
+Whippet supports ephemerons, first-class objects that weakly associate
+keys with values.  If the an ephemeron's key ever becomes unreachable,
+the ephemeron becomes dead and loses its value.
+
+The user-facing API is in [`gc-ephemeron.h`](../api/gc-ephemeron.h).  To
+allocate an ephemeron, call `gc_allocate_ephemeron`, then initialize its
+key and value via `gc_ephemeron_init`.  Get the key and value via
+`gc_ephemeron_key` and `gc_ephemeron_value`, respectively.
+
+In Whippet, ephemerons can be linked together in a chain.  During GC, if
+an ephemeron's chain points to a dead ephemeron, that link will be
+elided, allowing the dead ephemeron itself to be collected.  In that
+way, ephemerons can be used to build weak data structures such as weak
+maps.
+
+Weak data structures are often shared across multiple threads, so all
+routines to access and modify chain links are atomic.  Use
+`gc_ephemeron_chain_head` to access the head of a storage location that
+points to an ephemeron; push a new ephemeron on a location with
+`gc_ephemeron_chain_push`; and traverse a chain with
+`gc_ephemeron_chain_next`.
+
+An ephemeron association can be removed via `gc_ephemeron_mark_dead`.
+
+### Finalizers
+
+A finalizer allows the embedder to be notified when an object becomes
+unreachable.
+
+A finalizer has a priority.  When the heap is created, the embedder
+should declare how many priorities there are.  Lower-numbered priorities
+take precedence; if an object has a priority-0 finalizer outstanding,
+that will prevent any finalizer at level 1 (or 2, ...)  from firing
+until no priority-0 finalizer remains.
+
+Call `gc_attach_finalizer`, from `gc-finalizer.h`, to attach a finalizer
+to an object.
+
+A finalizer also references an associated GC-managed closure object.
+A finalizer's reference to the closure object is strong:  if a
+finalizer's closure closure references its finalizable object,
+directly or indirectly, the finalizer will never fire.
+
+When an object with a finalizer becomes unreachable, it is added to a
+queue.  The embedder can call `gc_pop_finalizable` to get the next
+finalizable object and its associated closure.  At that point the
+embedder can do anything with the object, including keeping it alive.
+Ephemeron associations will still be present while the finalizable
+object is live.  Note however that any objects referenced by the
+finalizable object may themselves be already finalized; finalizers are
+enqueued for objects when they become unreachable, which can concern
+whole subgraphs of objects at once.
+
+The usual way for an embedder to know when the queue of finalizable
+object is non-empty is to call `gc_set_finalizer_callback` to
+provide a function that will be invoked when there are pending
+finalizers.
+
+Arranging to call `gc_pop_finalizable` and doing something with the
+finalizable object and closure is the responsibility of the embedder.
+The embedder's finalization action can end up invoking arbitrary code,
+so unless the embedder imposes some kind of restriction on what
+finalizers can do, generally speaking finalizers should be run in a
+dedicated thread instead of recursively from within whatever mutator
+thread caused GC.  Setting up such a thread is the responsibility of the
+mutator.  `gc_pop_finalizable` is thread-safe, allowing multiple
+finalization threads if that is appropriate.
+
+`gc_allocate_finalizer` returns a finalizer, which is a fresh GC-managed
+heap object.  The mutator should then directly attach it to an object
+using `gc_finalizer_attach`.  When the finalizer is fired, it becomes
+available to the mutator via `gc_pop_finalizable`.
--- a/libguile/whippet/doc/perfetto-minor-gc.png
+++ b/libguile/whippet/doc/perfetto-minor-gc.png
--- a/libguile/whippet/doc/tracepoints.md
+++ b/libguile/whippet/doc/tracepoints.md
@ -0,0 +1,127 @@
+# Whippet performance tracing
+
+Whippet includes support for run-time tracing via
+[LTTng](https://LTTng.org) user-space tracepoints.  This allows you to
+get a detailed look at how Whippet is performing on your system.
+Tracing support is currently limited to Linux systems.
+
+## Getting started
+
+First, you need to build Whippet with LTTng support.  Usually this is as
+easy as building it in an environment where the `lttng-ust` library is
+present, as determined by `pkg-config --libs lttng-ust`.  You can know
+if your Whippet has tracing support by seeing if the resulting binaries
+are dynamically linked to `liblttng-ust`.
+
+If we take as an example the `mt-gcbench` test in the Whippet source
+tree, we would have:
+
+```
+$ ldd bin/mt-gcbench.pcc | grep lttng
+...
+liblttng-ust.so.1 => ...
+...
+```
+
+### Capturing traces
+
+Actually capturing traces is a little annoying; it's not as easy as
+`perf run`.  The [LTTng
+documentation](https://lttng.org/docs/v2.13/#doc-controlling-tracing) is
+quite thorough, but here is a summary.
+
+First, create your tracing session:
+
+```
+$ lttng create
+Session auto-20250214-091153 created.
+Traces will be output to ~/lttng-traces/auto-20250214-091153
+```
+
+You run all these commands as your own user; they don't require root
+permissions or system-wide modifications, as all of the Whippet
+tracepoints are user-space tracepoints (UST).
+
+Just having an LTTng session created won't do anything though; you need
+to configure the session.  Monotonic nanosecond-resolution timestamps
+are already implicitly part of each event.  We also want to have process
+and thread IDs for all events:
+
+```
+$ lttng add-context --userspace --type=vpid --type=vtid
+ust context vpid added to all channels
+ust context vtid added to all channels
+```
+
+Now enable Whippet events:
+
+```
+$ lttng enable-event --userspace 'whippet:*'
+ust event whippet:* created in channel channel0
+```
+
+And now, start recording:
+
+```
+$ lttng start
+Tracing started for session auto-20250214-091153
+```
+
+With this, traces will be captured for our program of interest:
+
+```
+$ bin/mt-gcbench.pcc 2.5 8
+...
+```
+
+Now stop the trace:
+
+```
+$ lttng stop
+Waiting for data availability
+Tracing stopped for session auto-20250214-091153
+```
+
+Whew.  If we did it right, our data is now in
+`~/lttng-traces/auto-20250214-091153`.
+
+### Visualizing traces
+
+LTTng produces traces in the [Common Trace Format
+(CTF)](https://diamon.org/ctf/).  My favorite trace viewing tool is the
+family of web-based trace viewers derived from `chrome://tracing`.  The
+best of these appear to be [the Firefox
+profiler](https://profiler.firefox.com) and
+[Perfetto](https://ui.perfetto.dev).  Unfortunately neither of these can
+work with CTF directly, so we instead need to run a trace converter.
+
+Oddly, there is no trace converter that can read CTF and write something
+that Perfetto (e.g.) can read.  However there is a [JSON-based tracing
+format that these tools can
+read](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw),
+and [Python bindings for Babeltrace, a library that works with
+CTF](https://babeltrace.org/), so that's what we do:
+
+```
+$ python3 ctf_to_json.py ~/lttng-traces/auto-20250214-091153 > trace.json
+```
+
+While Firefox Profiler can load this file, it works better on Perfetto,
+as the Whippet events are visually rendered on their respective threads.
+
+![Screenshot of part of Perfetto UI showing a minor GC](./perfetto-minor-gc.png)
+
+### Expanding the set of events
+
+As of February 2025,
+the current set of tracepoints includes the [heap
+events](https://github.com/wingo/whippet/blob/main/doc/manual.md#statistics)
+and some detailed internals of the parallel tracer.  We expect this set
+of tracepoints to expand over time.
+
+### Overhead of tracepoints
+
+When tracepoints are compiled in but no events are enabled, tracepoints
+appear to have no impact on run-time.  When event collection is on, for
+x86-64 hardware, [emitting a tracepoint event takes about
+100ns](https://discuss.systems/@DesnoyersMa/113986344940256872).