mirror of https://git.savannah.gnu.org/git/guile.git synced 2025-06-26 05:00:28 +02:00

Merged Whippet into libguile/whippet

Andy Wingo 2025-04-11 14:10:41 +02:00
commit db181e67ff
112 changed files with 18115 additions and 0 deletions


@ -0,0 +1,13 @@
# Whippet documentation
* [Manual](./manual.md): How do you get your program to use
Whippet? What is the API?
* [Collector implementations](./collectors.md): There are a number of
implementations of the Whippet API with differing performance
characteristics and which impose different requirements on the
embedder.
* [Guile](./guile.md): Some notes on a potential rebase of Guile on
top of Whippet.


@ -0,0 +1,26 @@
# Boehm-Demers-Weiser collector
Whippet's `bdw` collector is backed by a third-party garbage collector,
the [Boehm-Demers-Weiser collector](https://github.com/ivmai/bdwgc).
BDW-GC is a mark-sweep collector with conservative root-finding,
conservative heap tracing, and parallel tracing.
Whereas the other Whippet collectors rely on mutators to
[periodically check if they need to
stop](https://github.com/wingo/whippet/blob/main/doc/manual.md#safepoints),
`bdw` will stop mutators with a POSIX signal. Also, it doesn't really
support ephemerons (the Whippet `bdw` collector simulates them using
finalizers), and both ephemerons and finalizers only approximate the
Whippet behavior, because they are implemented in terms of what BDW-GC
provides.
`bdw` supports the `fixed` and `growable` heap-sizing policies, but not
`adaptive`, as BDW-GC can't reliably return memory to the OS. Also,
[`growable` has an effective limit of a 3x heap
multiplier](https://github.com/wingo/whippet/blob/main/src/bdw.c#L478).
Oh well!
It's a bit of an oddball from a Whippet perspective, but useful as a
migration path if you have an embedder that is already using BDW-GC.
And, it is a useful performance comparison.


@ -0,0 +1,148 @@
# Mostly-marking collector
The `mmc` collector is mainly a mark-region collector, inspired by
[Immix](http://users.cecs.anu.edu.au/~steveb/pubs/papers/immix-pldi-2008.pdf).
To a first approximation, `mmc` is a whole-heap Immix collector with a
large object space on the side.
When tracing, `mmc` mostly marks objects in place. If the heap is
too fragmented, it can compact the heap by choosing to evacuate
sparsely-populated heap blocks instead of marking in place. However,
evacuation is strictly optional, which means that `mmc` is also
compatible with conservative root-finding, making it a good replacement
for embedders that currently use the [Boehm-Demers-Weiser
collector](./collector-bdw.md).
## Differences from Immix
The original Immix divides the heap into 32kB blocks, and then divides
those blocks into 128B lines. An Immix allocation can span lines but
not blocks; allocations larger than 8kB go into a separate large object
space. Mutators request blocks from the global store and allocate into
those blocks using bump-pointer allocation. When all blocks are
consumed, Immix stops the world and traces the object graph, marking
objects but also the lines that objects are on. After marking, blocks
contain some lines with live objects and others that are completely
free. Spans of free lines are called holes. When a mutator gets a
recycled block from the global block store, it allocates into those
holes. For an exposition of Immix, see the lovely detailed [Rust
implementation](http://users.cecs.anu.edu.au/~steveb/pubs/papers/rust-ismm-2016.pdf).
The essential difference of `mmc` from Immix stems from a simple
observation: Immix needs a side table of line mark bytes and also a mark
bit or bits in each object (or in a side table). But if you instead
choose to store mark bytes rather than bits (for concurrency reasons) in
a side table, with one mark byte per granule (unit of allocation,
perhaps 16 bytes), then you effectively have a line mark table where the
granule size is the line size. You can bump-pointer allocate into holes
in the mark byte table.
You might think this is a bad tradeoff, and perhaps it is: I don't know
yet. If your granule size is two pointers, then one mark byte per
granule is 6.25% overhead on 64-bit, or 12.5% on 32-bit. Especially on
32-bit, it's a lot! On the other hand, instead of the worst case of one
survivor object wasting a line (or two, in the case of conservative line
marking), granule-size-is-line-size instead wastes nothing. Also, you
don't need GC bits in the object itself, and you can use the mark byte
array to record the object end, so that finding holes in a block can
just read the mark table and can avoid looking at object memory.
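To make the "holes in the mark byte table" idea concrete, here is a minimal sketch of hole-finding over a mark byte array. It assumes 16-byte granules and, for simplicity, that every granule covered by a live object has a nonzero mark byte; the real collector may encode object extents differently. All `hole_sketch_` names are hypothetical and not part of Whippet.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: 16-byte granules, one mark byte per granule. */
#define HOLE_SKETCH_GRANULE_SIZE 16

/* A hole is a run of consecutive unmarked (zero) mark bytes. */
struct hole_sketch_hole {
  size_t start_granule;  /* index of the first free granule */
  size_t granule_count;  /* number of free granules; 0 means no hole found */
};

/* Scan marks[from..count) and return the first hole at or after `from`.
   Only the mark table is read; object memory is never touched. */
static inline struct hole_sketch_hole
hole_sketch_next_hole(const uint8_t *marks, size_t count, size_t from) {
  struct hole_sketch_hole hole = { 0, 0 };
  size_t i = from;
  while (i < count && marks[i] != 0) i++;  /* skip granules in use */
  hole.start_granule = i;
  while (i < count && marks[i] == 0) i++;  /* measure the free run */
  hole.granule_count = i - hole.start_granule;
  return hole;
}

/* Bytes available for bump-pointer allocation in a hole. */
static inline size_t hole_sketch_bytes(struct hole_sketch_hole hole) {
  return hole.granule_count * HOLE_SKETCH_GRANULE_SIZE;
}
```

A mutator would call `hole_sketch_next_hole` repeatedly, bump-allocating into each hole until the block is exhausted.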
## Optional features
The `mmc` collector has a few feature flags that can be turned on or
off. If you use the [standard embedder makefile include](../embed.mk),
then there is a name for each combination of features: `mmc` has no
additional features, `parallel-mmc` enables parallel marking,
`parallel-generational-mmc` enables generations,
`stack-conservative-parallel-generational-mmc` uses conservative
root-finding, and `heap-conservative-parallel-generational-mmc`
additionally traces the heap conservatively. You can leave off
components of the name to get a collector without those features.
Underneath this corresponds to some pre-processor definitions passed to
the compiler on the command line.
### Generations
`mmc` supports generational tracing via the [sticky mark-bit
algorithm](https://wingolog.org/archives/2022/10/22/the-sticky-mark-bit-algorithm).
This requires that the embedder emit [write
barriers](https://github.com/wingo/whippet/blob/main/doc/manual.md#write-barriers);
if your embedder cannot ensure write barriers are always invoked, then
generational collection is not for you. (We could perhaps relax this a
bit, following what [Ruby developers
did](http://rvm.jp/~ko1/activities/rgengc_ismm.pdf).)
The write barrier is currently a card-marking barrier emitted on stores,
with one card byte per 256 object bytes, where the card location can be
computed from the object address because blocks are allocated in
two-megabyte aligned slabs.
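The address arithmetic behind such a barrier might look like the following sketch. The slab size and card size come from the text above; where the card table lives within a slab is not specified here, so the sketch takes it as a hypothetical `card_table` parameter. The `card_sketch_` names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CARD_SKETCH_SLAB_SIZE ((uintptr_t)2 * 1024 * 1024) /* 2 MB slabs */
#define CARD_SKETCH_CARD_SIZE ((uintptr_t)256) /* object bytes per card */
#define CARD_SKETCH_CARDS_PER_SLAB \
  (CARD_SKETCH_SLAB_SIZE / CARD_SKETCH_CARD_SIZE)

/* Base address of the slab containing addr: clear the low 21 bits.
   This only works because slabs are aligned to their size. */
static inline uintptr_t card_sketch_slab_base(uintptr_t addr) {
  return addr & ~(CARD_SKETCH_SLAB_SIZE - 1);
}

/* Index of the card covering addr within its slab. */
static inline size_t card_sketch_card_index(uintptr_t addr) {
  return (size_t)((addr & (CARD_SKETCH_SLAB_SIZE - 1)) / CARD_SKETCH_CARD_SIZE);
}

/* Write barrier sketch: dirty the card of the object being written to.
   card_table is a hypothetical per-slab array with one byte per card. */
static inline void card_sketch_write_barrier(uint8_t *card_table,
                                             uintptr_t obj_addr) {
  card_table[card_sketch_card_index(obj_addr)] = 1;
}
```

With these sizes a slab has 8192 cards, so the card bytes for one slab occupy 8 kB.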
### Parallel tracing
You almost certainly want this on! `parallel-mmc` uses a
[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
Each trace worker maintains a [local queue of objects that need
tracing](../src/local-worklist.h), which currently has a capacity of
1024 entries. If the local queue becomes full, the worker will publish
3/4 of those entries to the worker's [shared
worklist](../src/shared-worklist.h). When a worker runs out of local
work, it will first try to remove work from its own shared worklist,
then will try to steal from other workers.
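As a rough, single-threaded sketch of that overflow policy: the capacity (1024) and the 3/4 fraction are from the text, while the data structures are stand-ins. A real shared worklist has to support concurrent stealing, which is omitted here, and the removal order below is an arbitrary choice. All `wl_sketch_` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define WL_SKETCH_LOCAL_CAPACITY 1024

struct wl_sketch_local {
  uintptr_t items[WL_SKETCH_LOCAL_CAPACITY];
  size_t count;
};

/* Stand-in for a worker's shared worklist: a fixed array, no concurrency. */
struct wl_sketch_shared {
  uintptr_t items[4 * WL_SKETCH_LOCAL_CAPACITY];
  size_t count;
};

/* Move 3/4 of the local entries to the shared list; returns how many moved. */
static inline size_t wl_sketch_publish(struct wl_sketch_local *local,
                                       struct wl_sketch_shared *shared) {
  size_t n = local->count * 3 / 4;
  size_t room = sizeof shared->items / sizeof shared->items[0] - shared->count;
  if (n > room) n = room;
  for (size_t i = 0; i < n; i++)
    shared->items[shared->count++] = local->items[--local->count];
  return n;
}

/* Push an object, publishing first if the local queue is full.
   Returns 0 only if there is no room anywhere. */
static inline int wl_sketch_push(struct wl_sketch_local *local,
                                 struct wl_sketch_shared *shared,
                                 uintptr_t obj) {
  if (local->count == WL_SKETCH_LOCAL_CAPACITY
      && wl_sketch_publish(local, shared) == 0)
    return 0;
  local->items[local->count++] = obj;
  return 1;
}

/* Take local work first, then fall back to the worker's own shared list.
   Stealing from other workers would be the third step. */
static inline int wl_sketch_pop(struct wl_sketch_local *local,
                                struct wl_sketch_shared *shared,
                                uintptr_t *out) {
  if (local->count) { *out = local->items[--local->count]; return 1; }
  if (shared->count) { *out = shared->items[--shared->count]; return 1; }
  return 0;
}
```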
The memory used for the external worklist is dynamically allocated from
the OS and is not currently counted as contributing to the heap size.
If you absolutely need to avoid dynamic allocation during GC, `mmc`
(even `serial-mmc`) would need some work for your use case, to allocate
a fixed-size space for a marking queue and to gracefully handle mark
queue overflow.
### Conservative stack scanning
With `semi` and `pcc`, embedders must precisely enumerate the set of
*roots*: the edges into the heap from outside. Commonly, roots include
global variables, as well as working variables from each mutator's
stack. `mmc` can optionally mark mutator stacks *conservatively*:
treating each word on the stack as if it may be an object reference, and
marking any object at that address.
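A minimal sketch of the word-by-word scan, assuming the heap can be described by a single address range and a minimum alignment. In practice the collector would also consult its own metadata to decide whether an address really starts an object. The `scan_sketch_` names and the callback shape are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the heap: one address range plus a
   minimum object alignment. */
struct scan_sketch_heap {
  uintptr_t start;
  uintptr_t end;        /* exclusive */
  uintptr_t alignment;  /* power of two, e.g. 8 */
};

/* Could this word plausibly be a reference into the heap? */
static inline int scan_sketch_maybe_ref(const struct scan_sketch_heap *heap,
                                        uintptr_t word) {
  return word >= heap->start && word < heap->end
    && (word & (heap->alignment - 1)) == 0;
}

/* Treat every word in [lo, hi) as a potential reference and report the
   plausible ones to `visit`.  Returns the number of candidates found. */
static inline size_t
scan_sketch_words(const uintptr_t *lo, const uintptr_t *hi,
                  const struct scan_sketch_heap *heap,
                  void (*visit)(uintptr_t candidate, void *data),
                  void *data) {
  size_t found = 0;
  for (const uintptr_t *p = lo; p < hi; p++) {
    if (scan_sketch_maybe_ref(heap, *p)) {
      if (visit) visit(*p, data);
      found++;
    }
  }
  return found;
}
```

Note that an integer that happens to look like an aligned heap address passes the filter, which is exactly how conservative scanning can retain garbage.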
After all these years, *whether* to mark stacks conservatively or not is
still an open research question. Conservative stack scanning can retain
too much data if an integer is confused for an object reference, and it
removes a layer of correctness-by-construction from a system. Sometimes
conservative stack-scanning is required, for example if your embedder
cannot enumerate roots precisely. But there are reasons to consider it
even if you can do precise roots: conservative scanning removes the need
for the compiler to produce a stack map to store the precise root
enumeration at every safepoint; it removes the need to look up a stack
map when tracing; and it allows C or C++ support code to avoid having to
place roots in traceable locations published to the garbage collector.
And the [performance question is still
open](https://dl.acm.org/doi/10.1145/2660193.2660198).
Anyway. `mmc` can scan roots conservatively. Those roots are pinned
for the collection; even if the collection will compact via evacuation,
referents of conservative roots won't be moved. Objects not directly
referenced by roots can be evacuated, however.
### Conservative heap scanning
In addition to stack and global references, the Boehm-Demers-Weiser
collector scans heap objects conservatively as well, treating each word
of each heap object as if it were a reference. `mmc` can do that, if
the embedder is unable to provide a `gc_trace_object` implementation.
However, this is generally a performance loss, and it prevents
evacuation.
## Other implementation tidbits
`mmc` does lazy sweeping: as a mutator grabs a fresh block, it
reclaims memory that was unmarked in the previous collection before
making the memory available for allocation. This makes sweeping
naturally cache-friendly and parallel.
The mark byte array facilitates conservative collection by being an
oracle for "does this address start an object".
For a detailed introduction, see [Whippet: Towards a new local
maximum](https://wingolog.org/archives/2023/02/07/whippet-towards-a-new-local-maximum),
a talk given at FOSDEM 2023.


@ -0,0 +1,84 @@
# Parallel copying collector
Whippet's `pcc` collector is a copying collector, like the simpler
[`semi`](./collector-semi.md), but supporting multiple mutator threads,
multiple tracing threads, and using an external FIFO worklist instead of
a Cheney worklist.
Like `semi`, `pcc` traces by evacuation: it moves all live objects on
every collection. (Exception: objects larger than 8192 bytes are
placed into a partitioned space which traces by marking in place instead
of copying.) Evacuation requires precise roots, so if your embedder
does not support precise roots, `pcc` is not for you.
Again like `semi`, `pcc` generally requires a heap size at least twice
as large as the maximum live heap size, and performs best with ample
heap sizes; between 3× and 5× is best.
Overall, `pcc` is a better version of `semi`. It should have broadly
the same performance characteristics with a single mutator and with
parallelism disabled, additionally allowing multiple mutators, and
scaling better with multiple tracing threads.
`pcc` has a generational configuration, conventionally referred to as
`generational-pcc`, in which both the nursery and the old generation are
copy spaces. Objects stay in the nursery for one cycle before moving on
to the old generation. This configuration is a bit new (January 2025)
and still needs some tuning.
## Implementation notes
Unlike `semi` which has a single global bump-pointer allocation region,
`pcc` structures the heap into 64-kB blocks. In this way it supports
multiple mutator threads: mutators do local bump-pointer allocation into
their own block, and when their block is full, they fetch another from
the global store.
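The allocation scheme could be sketched roughly as follows. The 64 kB block size is from the text; the global store is modelled as a simple cursor over an address range, without the synchronization a real multi-threaded store would need. Addresses are plain numbers here and no memory is touched. The `alloc_sketch_` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ALLOC_SKETCH_BLOCK_SIZE ((size_t)64 * 1024)

/* Hypothetical global store: hands out blocks from a range, in order.
   A real store shared between threads would need atomics or a lock. */
struct alloc_sketch_store {
  uintptr_t next_block;
  uintptr_t limit;
};

/* Per-mutator allocation state: a bump pointer into the current block. */
struct alloc_sketch_mutator {
  uintptr_t hp;     /* next free byte in the current block */
  uintptr_t limit;  /* end of the current block */
};

/* Slow path: fetch a fresh block; returns 0 if the store is empty. */
static inline int alloc_sketch_acquire_block(struct alloc_sketch_store *store,
                                             struct alloc_sketch_mutator *mut) {
  if (store->limit - store->next_block < ALLOC_SKETCH_BLOCK_SIZE) return 0;
  mut->hp = store->next_block;
  mut->limit = mut->hp + ALLOC_SKETCH_BLOCK_SIZE;
  store->next_block = mut->limit;
  return 1;
}

/* Returns the address of the new object, or 0 if the request cannot be
   satisfied (where a real collector would trigger a collection). */
static inline uintptr_t alloc_sketch_allocate(struct alloc_sketch_store *store,
                                              struct alloc_sketch_mutator *mut,
                                              size_t bytes) {
  bytes = (bytes + 7) & ~(size_t)7;                /* 8-byte alignment */
  if (bytes > ALLOC_SKETCH_BLOCK_SIZE) return 0;   /* cannot fit in a block */
  if (mut->limit - mut->hp < bytes                 /* current block is full */
      && !alloc_sketch_acquire_block(store, mut))
    return 0;
  uintptr_t ret = mut->hp;                         /* fast path: bump */
  mut->hp += bytes;
  return ret;
}
```

The fast path touches only mutator-local state, which is why mutators do not contend with each other until a block runs out.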
The block size is 64 kB, but really it's 128 kB, because each block has
two halves: the active region and the copy reserve. Dividing each block
in two allows the collector to easily grow and shrink the heap while
ensuring there is always enough reserve space.
Blocks are allocated in 64-MB aligned slabs, so there are 512 blocks in
a slab. The first block in a slab is used by the collector itself, to
keep metadata for the rest of the blocks, for example a chain pointer
allowing blocks to be collected in lists, a saved allocation pointer for
partially-filled blocks, whether the block is paged in or out, and so
on.
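The slab arithmetic above can be checked with a few lines of C. This is only a sketch of the address computations implied by the stated sizes; treating block 0 as the metadata block follows the text, but the helper names (`slab_sketch_`) are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define SLAB_SKETCH_REGION_SIZE ((uintptr_t)64 * 1024)   /* active region */
#define SLAB_SKETCH_BLOCK_SIZE (2 * SLAB_SKETCH_REGION_SIZE) /* + reserve */
#define SLAB_SKETCH_SLAB_SIZE ((uintptr_t)64 * 1024 * 1024)  /* 64 MB */
#define SLAB_SKETCH_BLOCKS_PER_SLAB \
  (SLAB_SKETCH_SLAB_SIZE / SLAB_SKETCH_BLOCK_SIZE)

/* Slabs are aligned to their size, so masking finds the slab base. */
static inline uintptr_t slab_sketch_base(uintptr_t addr) {
  return addr & ~(SLAB_SKETCH_SLAB_SIZE - 1);
}

/* Which 128 kB block of its slab does addr fall in? */
static inline size_t slab_sketch_block_index(uintptr_t addr) {
  return (size_t)((addr & (SLAB_SKETCH_SLAB_SIZE - 1))
                  / SLAB_SKETCH_BLOCK_SIZE);
}

/* The first block of each slab holds metadata rather than objects. */
static inline int slab_sketch_is_metadata_block(uintptr_t addr) {
  return slab_sketch_block_index(addr) == 0;
}
```

Because the metadata block sits at a fixed position, the metadata for any object can be located from its address alone.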
`pcc` supports tracing in parallel. This mechanism works somewhat like
allocation, in which multiple trace workers compete to evacuate objects
into their local allocation buffers; when an allocation buffer is full,
the trace worker grabs another, just like mutators do.
Unlike the simple semi-space collector which uses a Cheney grey
worklist, `pcc` uses an external worklist. If parallelism is disabled
at compile-time, it uses a simple first-in, first-out queue of objects
to be traced. Like a Cheney worklist, this should result in objects
being copied in breadth-first order. The literature would suggest that
depth-first is generally better for locality, but that preserving
allocation order is generally best. This is something to experiment
with in the future.
If parallelism is enabled, as it is by default, `pcc` uses a
[fine-grained work-stealing parallel tracer](../src/parallel-tracer.h).
Each trace worker maintains a [local queue of objects that need
tracing](../src/local-worklist.h), which currently has 1024 entries. If
the local queue becomes full, the worker will publish 3/4 of those
entries to the worker's [shared worklist](../src/shared-worklist.h).
When a worker runs out of local work, it will first try to remove work
from its own shared worklist, then will try to steal from other workers.
If only one tracing thread is enabled at run-time (`parallelism=1`) (or
if parallelism is disabled at compile-time), `pcc` will evacuate by
non-atomic forwarding, but if multiple threads compete to evacuate
objects, `pcc` uses [atomic compare-and-swap instead of simple
forwarding pointer updates](./manual.md#forwarding-objects). This
imposes a performance penalty of around 30%, but having multiple tracing
threads is generally worth it, unless the object graph is itself serial.
The memory used for the external worklist is dynamically allocated from
the OS and is not currently counted as contributing to the heap size.
If you are targeting a microcontroller or something, probably you need
to choose a different kind of collector that never dynamically
allocates, such as `semi`.


@ -0,0 +1,23 @@
# Semi-space collector
The `semi` collector is simple. It is mostly useful as a first
collector to try out, to make sure that a mutator correctly records all
roots: because `semi` moves every live object on every collection, it is
very effective at shaking out mutator bugs.
If your embedder chooses to not precisely record roots, for example
instead choosing to conservatively scan the stack, then the semi-space
collector is not for you: `semi` requires precise roots.
For more on semi-space collectors, see
<https://wingolog.org/archives/2022/12/10/a-simple-semi-space-collector>.
Whippet's `semi` collector incorporates a large-object space, which
marks objects in place instead of moving. Otherwise, `semi` generally
requires a heap size at least twice as large as the maximum live heap
size, and performs best with ample heap sizes; between 3× and 5× is
best.
The semi-space collector doesn't support multiple mutator threads. If
you want a copying collector for a multi-threaded mutator, look at
[pcc](./collector-pcc.md).


@ -0,0 +1,43 @@
# Whippet collectors
Whippet has four collectors currently:
- [Semi-space collector (`semi`)](./collector-semi.md): For
single-threaded embedders who are not too tight on memory.
- [Parallel copying collector (`pcc`)](./collector-pcc.md): Like
`semi`, but with support for multiple mutator and tracing threads and
generational collection.
- [Mostly marking collector (`mmc`)](./collector-mmc.md):
Immix-inspired collector. Optionally parallel, conservative (stack
and/or heap), and/or generational.
- [Boehm-Demers-Weiser collector (`bdw`)](./collector-bdw.md):
Conservative mark-sweep collector, implemented by the
Boehm-Demers-Weiser library.
## How to choose?
If you are migrating an embedder off BDW-GC, then it could be reasonable
to first go to `bdw`, then `stack-conservative-parallel-mmc`.
If you have an embedder with precise roots, use `pcc`. That will shake
out mutator/embedder bugs. Then if memory is tight, switch to
`parallel-mmc`, possibly `parallel-generational-mmc`.
If you are aiming for maximum simplicity and minimal code size (ten
kilobytes or so), use `semi`.
If you are writing a new project, you have a choice as to whether to pay
the development cost of precise roots or not. If you choose to not have
precise roots, then go for `stack-conservative-parallel-mmc` directly.
## More collectors
It would be nice to have a generational GC that uses the space from
`parallel-mmc` for the old generation but a pcc-style copying nursery.
We have `generational-pcc` now, so this should be possible.
Support for concurrent marking in `mmc` would be good as well, perhaps
with a SATB barrier. (Or, if you are the sort of person to bet on
conservative stack scanning, perhaps a retreating-wavefront barrier
would be more appropriate.)
Contributions are welcome, provided they have no more dependencies!


@ -0,0 +1,26 @@
# Whippet and Guile
If the `mmc` collector works out, it could replace Guile's garbage
collector. Guile currently uses BDW-GC. Guile has a widely used C API
and implements part of its run-time in C. For this reason it may be
infeasible to require precise enumeration of GC roots -- we may need to
allow GC roots to be conservatively identified from data sections and
from stacks. Such conservative roots would be pinned, but other objects
can be moved by the collector if it chooses to do so. We assume that
object references within a heap object can be precisely identified.
(However, Guile currently uses BDW-GC in its default configuration,
which scans for references conservatively even on the heap.)
The existing C API allows direct access to mutable object fields,
without the mediation of read or write barriers. Therefore it may be
impossible to switch to collector strategies that need barriers, such as
generational or concurrent collectors. However, we shouldn't write off
this possibility entirely; an ideal replacement for Guile's GC will
offer the possibility of migration to other GC designs without imposing
new requirements on C API users in the initial phase.
In this regard, the Whippet experiment also has the goal of identifying
a smallish GC abstraction in Guile, so that we might consider evolving
GC implementation in the future without too much pain. If we switch
away from BDW-GC, we should be able to evaluate that it's a win for a
large majority of use cases.


@ -0,0 +1,718 @@
# Whippet user's guide
Whippet is an embed-only library: it should be copied into the source
tree of the program that uses it. The program's build system needs to
be wired up to compile Whippet, then link it into the program that uses
it.
## Subtree merges
One way to get Whippet is just to manually copy the files present in a
Whippet checkout into your project. However, probably the best way is to
perform a [subtree
merge](https://docs.github.com/en/get-started/using-git/about-git-subtree-merges)
of Whippet into your project's Git repository, so that you can easily
update your copy of Whippet in the future.
Performing the first subtree merge is annoying and full of arcane
incantations. Follow the [subtree merge
page](https://docs.github.com/en/get-started/using-git/about-git-subtree-merges)
for full details, but for a cheat sheet, you might do something like
this to copy Whippet into the `whippet/` directory of your project root:
```
git remote add whippet https://github.com/wingo/whippet
git fetch whippet
git merge -s ours --no-commit --allow-unrelated-histories whippet/main
git read-tree --prefix=whippet/ -u whippet/main
git commit -m 'Added initial Whippet merge'
```
Then to later update your copy of Whippet, assuming you still have the
`whippet` remote, just do:
```
git pull -s subtree whippet main
```
## `gc-embedder-api.h`
To determine the live set of objects, a tracing garbage collector starts
with a set of root objects, and then transitively visits all reachable
object edges. Exactly how it goes about doing this depends on the
program that is using the garbage collector; different programs will
have different object representations, different strategies for
recording roots, and so on.
To traverse the heap in a program-specific way but without imposing an
abstraction overhead, Whippet requires that a number of data types and
inline functions be implemented by the program, for use by Whippet
itself. This is the *embedder API*, and this document describes what
Whippet requires from a program.
A program should provide a header file implementing the API in
[`gc-embedder-api.h`](../api/gc-embedder-api.h). This header should only be
included when compiling Whippet itself; it is not part of the API that
Whippet exposes to the program.
### Identifying roots
The collector uses two opaque struct types, `struct gc_mutator_roots`
and `struct gc_heap_roots`, that are used by the program to record
object roots. Probably you should put the definition of these data
types in a separate header that is included both by Whippet, via the
embedder API, and by users of Whippet, so that programs can populate
the root set. In any case the embedder-API use of these structs is via
`gc_trace_mutator_roots` and `gc_trace_heap_roots`, two functions that
are passed a trace visitor function `trace_edge`, and which should call
that function on all edges from a given mutator or heap. (Usually
mutator roots are per-thread roots, such as from the stack, and heap
roots are global roots.)
### Tracing objects
The `gc_trace_object` function is responsible for calling the `trace_edge`
visitor function on all outgoing edges in an object. It also includes a
`size` out-parameter, for when the collector wants to measure the size
of an object. `trace_edge` and `size` may be `NULL`, in which case no
tracing or size computation should be performed.
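To give a feel for the two tracing hooks described above, here is a self-contained sketch using made-up types. It does not reproduce the real signatures in `gc-embedder-api.h`: the edge type, the visitor type, the handle-list root set, and every `trace_sketch_` name are assumptions for illustration only.

```c
#include <stddef.h>

/* Hypothetical edge: the address of a slot holding an object reference. */
struct trace_sketch_edge { void **slot; };

typedef void (*trace_sketch_visitor)(struct trace_sketch_edge edge,
                                     void *trace_data);

/* Hypothetical per-mutator root set: a linked list of handles, each
   keeping one object alive. */
struct trace_sketch_handle {
  void *obj;
  struct trace_sketch_handle *next;
};
struct trace_sketch_mutator_roots { struct trace_sketch_handle *handles; };

/* Hypothetical program-defined object with two outgoing edges. */
struct trace_sketch_pair { void *car; void *cdr; };

/* Roots hook: call the visitor on every root edge of this mutator. */
static inline void
trace_sketch_visit_mutator_roots(struct trace_sketch_mutator_roots *roots,
                                 trace_sketch_visitor trace_edge,
                                 void *trace_data) {
  if (!roots) return;
  for (struct trace_sketch_handle *h = roots->handles; h; h = h->next)
    trace_edge((struct trace_sketch_edge){ &h->obj }, trace_data);
}

/* Object hook: visit outgoing edges and report the size.  Either
   trace_edge or size may be NULL, in which case that part is skipped. */
static inline void
trace_sketch_trace_pair(struct trace_sketch_pair *pair,
                        trace_sketch_visitor trace_edge,
                        void *trace_data, size_t *size) {
  if (trace_edge) {
    trace_edge((struct trace_sketch_edge){ &pair->car }, trace_data);
    trace_edge((struct trace_sketch_edge){ &pair->cdr }, trace_data);
  }
  if (size) *size = sizeof *pair;
}

/* Example visitor: counts the edges it is shown. */
static inline void trace_sketch_count_edge(struct trace_sketch_edge edge,
                                           void *trace_data) {
  (void)edge;
  ++*(size_t *)trace_data;
}
```

Passing the slot address rather than the referent is what lets a moving collector update the reference in place.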
### Tracing ephemerons and finalizers
Most kinds of GC-managed object are defined by the program, but the GC
itself has support for two specific object kinds: ephemerons and
finalizers. If the program allocates ephemerons, it should trace them
in the `gc_trace_object` function by calling `gc_trace_ephemeron` from
[`gc-ephemerons.h`](../api/gc-ephemerons.h). Likewise if the program
allocates finalizers, it should trace them by calling
`gc_trace_finalizer` from [`gc-finalizer.h`](../api/gc-finalizer.h).
### Forwarding objects
When built with a collector that moves objects, the embedder must also
allow for forwarding pointers to be installed in an object. There are
two forwarding APIs: one that is atomic and one that isn't.
The nonatomic API is relatively simple; there is a
`gc_object_forwarded_nonatomic` function that returns an embedded
forwarding address, or 0 if the object is not yet forwarded, and
`gc_object_forward_nonatomic`, which installs a forwarding pointer.
The atomic API is gnarly. It is used by parallel collectors, in which
multiple collector threads can race to evacuate an object.
There is a state machine associated with the `gc_atomic_forward`
structure from [`gc-forwarding.h`](../api/gc-forwarding.h); the embedder API
implements the state changes. The collector calls
`gc_atomic_forward_begin` on an object to begin a forwarding attempt,
and the resulting `gc_atomic_forward` can be in the `NOT_FORWARDED`,
`FORWARDED`, or `BUSY` state.
If the `gc_atomic_forward`'s state is `BUSY`, the collector will call
`gc_atomic_forward_retry_busy`; a return value of 0 means the object is
still busy, because another thread is attempting to forward it.
Otherwise the forwarding state either becomes `FORWARDED`, if the other
thread succeeded in forwarding it, or goes back to `NOT_FORWARDED`,
indicating that the other thread failed to forward it.
If the forwarding state is `FORWARDED`, the collector will call
`gc_atomic_forward_address` to get the new address.
If the forwarding state is `NOT_FORWARDED`, the collector may begin a
forwarding attempt by calling `gc_atomic_forward_acquire`. The
resulting state is `ACQUIRED` on success, or `BUSY` if another thread
acquired the object in the meantime, or `FORWARDED` if another thread
acquired and completed the forwarding attempt.
An `ACQUIRED` object can then be forwarded via
`gc_atomic_forward_commit`, or the forwarding attempt can be aborted via
`gc_atomic_forward_abort`. Also, when an object is acquired, the
collector may call `gc_atomic_forward_object_size` to compute how many
bytes to copy. (The collector may choose instead to record object sizes
in a different way.)
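The state machine described above can be sketched with C11 atomics. This is one possible shape, assuming a dedicated forwarding word per object in which 0 means not forwarded, 1 means busy, and any other value is the new address. The state names mirror the text, but the `fwd_sketch_` functions are hypothetical and are not the `gc_atomic_forward_*` API itself.

```c
#include <stdatomic.h>
#include <stdint.h>

enum fwd_sketch_state {
  FWD_SKETCH_NOT_FORWARDED,
  FWD_SKETCH_BUSY,
  FWD_SKETCH_ACQUIRED,
  FWD_SKETCH_FORWARDED
};

/* Assumed layout: 0 = not forwarded, 1 = busy, otherwise new address. */
struct fwd_sketch_object { _Atomic uintptr_t forward; };

struct fwd_sketch_attempt {
  struct fwd_sketch_object *obj;
  uintptr_t data;               /* last observed forwarding word */
  enum fwd_sketch_state state;
};

static inline enum fwd_sketch_state fwd_sketch_classify(uintptr_t word) {
  if (word == 0) return FWD_SKETCH_NOT_FORWARDED;
  if (word == 1) return FWD_SKETCH_BUSY;
  return FWD_SKETCH_FORWARDED;
}

static inline struct fwd_sketch_attempt
fwd_sketch_begin(struct fwd_sketch_object *obj) {
  uintptr_t word = atomic_load_explicit(&obj->forward, memory_order_acquire);
  return (struct fwd_sketch_attempt){ obj, word, fwd_sketch_classify(word) };
}

/* Returns 0 if the object is still busy; otherwise updates the state. */
static inline int fwd_sketch_retry_busy(struct fwd_sketch_attempt *a) {
  uintptr_t word = atomic_load_explicit(&a->obj->forward, memory_order_acquire);
  if (word == 1) return 0;
  a->data = word;
  a->state = fwd_sketch_classify(word);
  return 1;
}

/* Try to claim the object: ACQUIRED on success, else BUSY or FORWARDED. */
static inline void fwd_sketch_acquire(struct fwd_sketch_attempt *a) {
  uintptr_t expected = 0;
  if (atomic_compare_exchange_strong(&a->obj->forward, &expected, 1)) {
    a->state = FWD_SKETCH_ACQUIRED;
  } else {
    a->data = expected;  /* the CAS stored the value it actually saw */
    a->state = fwd_sketch_classify(expected);
  }
}

/* Publish the new address; only valid in the ACQUIRED state. */
static inline void fwd_sketch_commit(struct fwd_sketch_attempt *a,
                                     uintptr_t new_addr) {
  atomic_store_explicit(&a->obj->forward, new_addr, memory_order_release);
  a->data = new_addr;
  a->state = FWD_SKETCH_FORWARDED;
}

/* Give up the claim; only valid in the ACQUIRED state. */
static inline void fwd_sketch_abort(struct fwd_sketch_attempt *a) {
  atomic_store_explicit(&a->obj->forward, 0, memory_order_release);
  a->state = FWD_SKETCH_NOT_FORWARDED;
}

/* Only meaningful in the FORWARDED state. */
static inline uintptr_t fwd_sketch_address(const struct fwd_sketch_attempt *a) {
  return a->data;
}
```

The compare-and-swap in `fwd_sketch_acquire` is the point where racing collector threads are serialized: exactly one of them sees `ACQUIRED`.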
All of these `gc_atomic_forward` functions are to be implemented by the
embedder. Some programs may allocate a dedicated forwarding word in all
objects; some will manage to store the forwarding word in an initial
"tag" word, via a specific pattern for the low 3 bits of the tag that no
non-forwarded object will have. The low-bits approach takes advantage
of the collector's minimum object alignment, in which objects are
aligned at least to an 8-byte boundary, so all objects have 0 for the
low 3 bits of their address.
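A sketch of the low-bits approach, in its simpler nonatomic form. It assumes a tagging scheme in which every live object's tag word has at least one of its low 3 bits set, so that a tag word whose low 3 bits are all zero can only be a forwarding address. That scheme, and the `tag_sketch_` names, are assumptions for illustration.

```c
#include <stdint.h>

#define TAG_SKETCH_LOW_BITS_MASK ((uintptr_t)7)

/* Hypothetical object header: a single tag word at the start. */
struct tag_sketch_object { uintptr_t tag; };

/* Returns the forwarding address, or 0 if the object is not forwarded.
   Assumes live tags never have low bits 000. */
static inline uintptr_t
tag_sketch_forwarded_nonatomic(const struct tag_sketch_object *obj) {
  return (obj->tag & TAG_SKETCH_LOW_BITS_MASK) == 0 ? obj->tag : 0;
}

/* Overwrite the tag with the new address.  The address is 8-byte
   aligned, so its low 3 bits are 0 and mark the object as forwarded. */
static inline void
tag_sketch_forward_nonatomic(struct tag_sketch_object *obj,
                             uintptr_t new_addr) {
  obj->tag = new_addr;
}
```

This saves a word per object compared to a dedicated forwarding field, at the cost of constraining the tag encoding.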
### Conservative references
Finally, when configured in a mode in which root edges or intra-object
edges are *conservative*, the embedder can filter out which bit patterns
might be an object reference by implementing
`gc_is_valid_conservative_ref_displacement`. Here, the collector masks
off the low bits of a conservative reference, and asks the embedder if a
value with those low bits might point to an object. Usually the
embedder should return 1 only if the displacement is 0, but if the
program allows low-bit tagged pointers, then it should also return 1 for
those pointer tags.
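For instance, an embedder whose only tagged pointers carry a low-bit tag of 2 might implement the filter along these lines. The tag value, the 8-byte alignment, and the `disp_sketch_` names are all hypothetical; the second function shows how a collector could plausibly use the answer.

```c
#include <stdint.h>

#define DISP_SKETCH_ALIGNMENT ((uintptr_t)8)
#define DISP_SKETCH_PAIR_TAG ((uintptr_t)2)  /* hypothetical low-bit tag */

/* Embedder side: may a reference carry these low bits? */
static inline int disp_sketch_is_valid_displacement(uintptr_t displacement) {
  return displacement == 0 || displacement == DISP_SKETCH_PAIR_TAG;
}

/* Collector side: mask off the low bits, ask the embedder, and return
   the untagged candidate address, or 0 if the word is rejected. */
static inline uintptr_t disp_sketch_candidate_base(uintptr_t word) {
  uintptr_t displacement = word & (DISP_SKETCH_ALIGNMENT - 1);
  if (!disp_sketch_is_valid_displacement(displacement)) return 0;
  return word - displacement;
}
```

Rejecting impossible displacements early reduces the number of false roots that conservative scanning retains.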
### External objects
Sometimes a system will allocate objects outside the GC, for example on
the stack or in static data sections. To support this use case, Whippet
allows the embedder to provide a `struct gc_extern_space`
implementation. Whippet will call `gc_extern_space_start_gc` at the
start of each collection, and `gc_extern_space_finish_gc` at the end.
External objects will be visited by `gc_extern_space_mark`, which should
return nonzero if the object hasn't been seen before and needs to be
traced via `gc_trace_object` (coloring the object grey). Note that
`gc_extern_space_mark` may be called concurrently from many threads; be
prepared!
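One way to be prepared is an atomic exchange on a per-object mark byte, as in the sketch below. It assumes each external object carries such a byte and that the space flips its "live" mark value at the start of every collection. The `ext_sketch_` types and functions are hypothetical stand-ins for an embedder's `gc_extern_space` implementation.

```c
#include <stdatomic.h>

/* Hypothetical external object header with a mark byte. */
struct ext_sketch_object { atomic_uchar mark; };

/* Hypothetical external space: remembers this cycle's mark value. */
struct ext_sketch_space { unsigned char live_mark; };

/* Start of collection: flip the mark value so that marks left over from
   the previous cycle read as "not yet seen". */
static inline void ext_sketch_start_gc(struct ext_sketch_space *space) {
  space->live_mark = space->live_mark == 1 ? 2 : 1;
}

/* Returns nonzero only for the first caller to mark obj in this cycle,
   even if several threads race on the same object. */
static inline int ext_sketch_mark(struct ext_sketch_space *space,
                                  struct ext_sketch_object *obj) {
  unsigned char prev = atomic_exchange_explicit(&obj->mark, space->live_mark,
                                                memory_order_acq_rel);
  return prev != space->live_mark;
}
```

Because the exchange is atomic, at most one thread gets a nonzero result per object per cycle, so the object is traced once.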
## Configuration, compilation, and linking
To the user, Whippet presents an abstract API that does not encode the
specificities of any given collector. Whippet currently includes four
implementations of that API: `semi`, a simple semi-space collector;
`pcc`, a parallel copying collector (like semi but multithreaded);
`bdw`, an implementation via the third-party
[Boehm-Demers-Weiser](https://github.com/ivmai/bdwgc) conservative
collector; and `mmc`, a mostly-marking collector inspired by Immix.
The program that embeds Whippet selects the collector implementation at
build-time. For `pcc`, the program can also choose whether to be
generational or not. For the `mmc` collector, the program configures a
specific collector mode, again at build-time: generational or not,
parallel or not, stack-conservative or not, and heap-conservative or
not. It may be nice in the future to be able to configure these at
run-time, but for the time being they are compile-time options so that
adding new features doesn't change the footprint of a more minimal
collector.
Different collectors have different allocation strategies: for example,
the BDW collector allocates from thread-local freelists, whereas the
semi-space collector has a bump-pointer allocator. A collector may also
expose a write barrier, for example to enable generational collection.
For performance reasons, many of these details can't be hidden behind an
opaque functional API: they must be inlined into call sites. Whippet's
approach is to expose fast paths as part of its inline API, with those
fast paths *parameterized* on attributes of the selected garbage collector.
The goal is to keep the user's code generic and avoid any code
dependency on the choice of garbage collector. Because of inlining,
however, the choice of garbage collector does need to be specified when
compiling user code.
### Compiling the collector
As an embed-only library, Whippet needs to be integrated into the build
system of its host (embedder). There are two build systems supported
currently; we would be happy to add other systems over time.
#### GNU make
At a high level, first the embedder chooses a collector and defines how
to specialize the collector against the embedder. Whippet's `embed.mk`
Makefile snippet then defines how to build the set of object files that
define the collector, and how to specialize the embedder against the
chosen collector.
As an example, say you have a file `program.c`, and you want to compile
it against a Whippet checkout in `whippet/`. Your headers are in
`include/`, and you have written an implementation of the embedder
interface in `host-gc.h`. In that case you would have a Makefile like
this:
```
HOST_DIR:=$(dir $(lastword $(MAKEFILE_LIST)))
WHIPPET_DIR=$(HOST_DIR)whippet/
all: program
# The collector to choose: e.g. semi, bdw, pcc, generational-pcc, mmc,
# parallel-mmc, etc.
GC_COLLECTOR=pcc
include $(WHIPPET_DIR)embed.mk
# Host cflags go here...
HOST_CFLAGS=
# Whippet's embed.mk uses this variable when it compiles code that
# should be specialized against the embedder.
EMBEDDER_TO_GC_CFLAGS=$(HOST_CFLAGS) -include $(HOST_DIR)host-gc.h
program.o: program.c
	$(GC_COMPILE) $(HOST_CFLAGS) $(GC_TO_EMBEDDER_CFLAGS) -c $<
program: program.o $(GC_OBJS)
	$(GC_LINK) $^ $(GC_LIBS)
```
The optimization settings passed to the C compiler are taken from
`GC_BUILD_CFLAGS`. Embedders can override this variable directly, or
via the shorthand `GC_BUILD` variable. A `GC_BUILD` of `opt` indicates
maximum optimization and no debugging assertions; `optdebug` adds
debugging assertions; and `debug` removes optimizations.
Though Whippet tries to put performance-sensitive interfaces in header
files, users should also compile with link-time optimization (LTO) to
remove any overhead imposed by the division of code into separate
compilation units. `embed.mk` includes the necessary LTO flags in
`GC_CFLAGS` and `GC_LDFLAGS`.
#### GNU Autotools
To use Whippet from an autotools project, the basic idea is to include a
`Makefile.am` snippet from the subdirectory containing the Whippet
checkout. That will build `libwhippet.la`, which you should link into
your binary. There are some `m4` autoconf macros that need to be
invoked, for example to select the collector.
Let us imagine you have checked out Whippet in `whippet/`. Let us also
assume for the moment that we are going to build `mt-gcbench`, a program
included in Whippet itself.
A top-level autoconf file (`configure.ac`) might look like this:
```autoconf
AC_PREREQ([2.69])
AC_INIT([whippet-autotools-example],[0.1.0])
AC_CONFIG_SRCDIR([whippet/benchmarks/mt-gcbench.c])
AC_CONFIG_AUX_DIR([build-aux])
AC_CONFIG_MACRO_DIRS([m4 whippet])
AM_INIT_AUTOMAKE([subdir-objects foreign])
WHIPPET_ENABLE_LTO
LT_INIT
WARN_CFLAGS=-Wall
AC_ARG_ENABLE([Werror],
AS_HELP_STRING([--disable-Werror],
[Don't stop the build on errors]),
[],
WARN_CFLAGS="-Wall -Werror")
CFLAGS="$CFLAGS $WARN_CFLAGS"
WHIPPET_PKG
AC_CONFIG_FILES(Makefile)
AC_OUTPUT
```
Then your `Makefile.am` might look like this:
```automake
noinst_LTLIBRARIES =
WHIPPET_EMBEDDER_CPPFLAGS = -include $(srcdir)/whippet/benchmarks/mt-gcbench-embedder.h
include whippet/embed.am
noinst_PROGRAMS = whippet/benchmarks/mt-gcbench
whippet_benchmarks_mt_gcbench_SOURCES = \
whippet/benchmarks/heap-objects.h \
whippet/benchmarks/mt-gcbench-embedder.h \
whippet/benchmarks/mt-gcbench-types.h \
whippet/benchmarks/mt-gcbench.c \
whippet/benchmarks/simple-allocator.h \
whippet/benchmarks/simple-gc-embedder.h \
whippet/benchmarks/simple-roots-api.h \
whippet/benchmarks/simple-roots-types.h \
whippet/benchmarks/simple-tagging-scheme.h
AM_CFLAGS = $(WHIPPET_CPPFLAGS) $(WHIPPET_CFLAGS) $(WHIPPET_TO_EMBEDDER_CPPFLAGS)
LDADD = libwhippet.la
```
We have to list all the little header files it uses because, well,
autotools.
To actually build, you do the usual autotools dance:
```bash
autoreconf -vif && ./configure && make
```
See `./configure --help` for a list of user-facing options. In
`configure.ac`, before `WHIPPET_PKG`, you can invoke e.g.
`WHIPPET_PKG_COLLECTOR(mmc)` to set the
default collector to `mmc`; if you don't do that, the default collector
is `pcc`. There are also `WHIPPET_PKG_DEBUG`, `WHIPPET_PKG_TRACING`,
and `WHIPPET_PKG_PLATFORM`; see [`whippet.m4`](../whippet.m4) for more
details. See also
[`whippet-autotools`](https://github.com/wingo/whippet-autotools) for an
example of how this works.
#### Compile-time options
There are a number of pre-processor definitions that can parameterize
the collector at build-time:
* `GC_DEBUG`: If nonzero, then enable debugging assertions.
* `NDEBUG`: This one is a bit weird; if not defined, then enable
debugging assertions and some debugging printouts. Probably
Whippet's use of `NDEBUG` should be folded into `GC_DEBUG`.
* `GC_PARALLEL`: If nonzero, then enable parallelism in the collector.
Defaults to zero.
* `GC_GENERATIONAL`: If nonzero, then enable generational collection.
Defaults to zero.
* `GC_PRECISE_ROOTS`: If nonzero, then collect precise roots via
`gc_heap_roots` and `gc_mutator_roots`. Defaults to zero.
* `GC_CONSERVATIVE_ROOTS`: If nonzero, then scan the stack and static
data sections for conservative roots. Defaults to zero. Not
mutually exclusive with `GC_PRECISE_ROOTS`.
* `GC_CONSERVATIVE_TRACE`: If nonzero, heap edges are scanned
conservatively. Defaults to zero.
Some collectors require specific compile-time options. For example, the
semi-space collector has to be able to move all objects; this is not
compatible with conservative roots or heap edges.
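
As a rough sketch of how an embedder might guard against an incompatible combination at build time, the following gives two of the options the documented default of zero and rejects a build that a moving-only collector could not support. `TOY_COLLECTOR_MOVES_EVERYTHING` and `toy_config_is_valid` are hypothetical names made up for this illustration; they are not part of Whippet.

```c
#include <assert.h>

/* Options default to zero when the build does not define them. */
#ifndef GC_CONSERVATIVE_ROOTS
#define GC_CONSERVATIVE_ROOTS 0
#endif
#ifndef GC_CONSERVATIVE_TRACE
#define GC_CONSERVATIVE_TRACE 0
#endif

/* Hypothetical knob for this sketch: nonzero if the chosen collector
   must be able to move every object, as a semi-space collector does. */
#ifndef TOY_COLLECTOR_MOVES_EVERYTHING
#define TOY_COLLECTOR_MOVES_EVERYTHING 1
#endif

/* Hypothetical helper: such a collector cannot tolerate conservative
   roots or conservatively-scanned heap edges. */
static int toy_config_is_valid(int moves_everything, int conservative_roots,
                               int conservative_trace) {
  if (!moves_everything) return 1;
  return !conservative_roots && !conservative_trace;
}

/* Reject an incompatible build at compile time. */
static_assert(!TOY_COLLECTOR_MOVES_EVERYTHING ||
              (!GC_CONSERVATIVE_ROOTS && !GC_CONSERVATIVE_TRACE),
              "a moving-only collector needs precise roots and heap edges");
```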
#### Tracing support
Whippet includes support for low-overhead run-time tracing via
[LTTng](https://lttng.org/). If the support library `lttng-ust` is
present when Whippet is compiled (as checked via `pkg-config`),
tracepoint support will be present. See
[tracepoints.md](./tracepoints.md) for more information on how to get
performance traces out of Whippet.
## Using the collector
Whew! So you finally built the thing! Did you also link it into your
program? No, because your program isn't written yet? Well, this section
is for you: we describe the user-facing API of Whippet, where "user" in
this case denotes the embedding program.
What is the API, you ask? It is in [`gc-api.h`](../api/gc-api.h).
### Heaps and mutators
To start with, you create a *heap*. Usually an application will create
just one heap. A heap has one or more associated *mutators*. A mutator
is a thread-specific handle on the heap. Allocating objects requires a
mutator.
The initial heap and mutator are created via `gc_init`, which takes
three logical input parameters: the *options*, a stack base address, and
an *event listener*. The options specify the initial heap size and so
on. The event listener is mostly for gathering statistics; see below
for more. `gc_init` returns the new heap as an out parameter, and also
returns a mutator for the current thread.
To make a new mutator for a new thread, use `gc_init_for_thread`. When
a thread is finished with its mutator, call `gc_finish_for_thread`.
Each thread that allocates or accesses GC-managed objects should have
its own mutator.
The stack base address allows the collector to scan the mutator's stack,
if conservative root-finding is enabled. It may be omitted in the call
to `gc_init` and `gc_init_for_thread`; passing `NULL` tells Whippet to
ask the platform for the stack bounds of the current thread. Generally
speaking, this works on all platforms for the main thread, but not
necessarily on other threads. The most reliable solution is to
explicitly obtain a base address by trampolining through
`gc_call_with_stack_addr`.
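
To give a feel for what "trampolining" means here, the following is a toy model of the idea, not Whippet's implementation. All names (`toy_stack_addr`, `toy_call_with_stack_addr`, `toy_record_base`) are hypothetical stand-ins; see `gc-api.h` for the real types and signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the stack base handed to the collector. */
struct toy_stack_addr { uintptr_t addr; };

typedef void *(*toy_stack_fn)(struct toy_stack_addr base, void *data);

/* Call `fn` with the address of a local in this frame.  Frames created
   by `fn` live beyond `anchor` on the stack, so a conservative scan
   from the current stack pointer to `base.addr` would cover them. */
static void *toy_call_with_stack_addr(toy_stack_fn fn, void *data) {
  volatile char anchor = 0;
  struct toy_stack_addr base = { (uintptr_t)&anchor };
  assert(fn != NULL);
  return fn(base, data);
}

/* Example callback: remember the base address it was given. */
static void *toy_record_base(struct toy_stack_addr base, void *data) {
  *(uintptr_t *)data = base.addr;
  return data;
}
```

In this model, the body of a thread would run inside the callback, and would pass the captured base along when creating its mutator.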
### Options
There are some run-time parameters that programs and users might want to
set explicitly; these are encapsulated in the *options*. Make an
options object with `gc_allocate_options()`; this object will be
consumed by `gc_init`. The most convenient way to set options is to
call `gc_options_parse_and_set_many` on a string taken from the command
line or an environment variable, but to get there we have to explain
the low-level interface first. There are a few options that are
defined for all collectors:
* `GC_OPTION_HEAP_SIZE_POLICY`: How should we size the heap? Either
it's `GC_HEAP_SIZE_FIXED` (which is 0), in which case the heap size is
fixed at startup; or `GC_HEAP_SIZE_GROWABLE` (1), in which case the heap
may grow but will never shrink; or `GC_HEAP_SIZE_ADAPTIVE` (2), in
which case we take an
[adaptive](https://wingolog.org/archives/2023/01/27/three-approaches-to-heap-sizing)
approach, depending on the rate of allocation and the cost of
collection. Really you want the adaptive strategy, but if you are
benchmarking you definitely want the fixed policy.
* `GC_OPTION_HEAP_SIZE`: The initial heap size. For a
`GC_HEAP_SIZE_FIXED` policy, this is also the final heap size. In
bytes.
* `GC_OPTION_MAXIMUM_HEAP_SIZE`: For growable and adaptive heaps, the
maximum heap size, in bytes.
* `GC_OPTION_HEAP_SIZE_MULTIPLIER`: For growable heaps, the target heap
multiplier. A heap multiplier of 2.5 means that for 100 MB of live
data, the heap should be 250 MB.
* `GC_OPTION_HEAP_EXPANSIVENESS`: For adaptive heap sizing, an
indication of how much free space will be given to heaps, as a
proportion of the square root of the live data size.
* `GC_OPTION_PARALLELISM`: How many threads to devote to collection
tasks during GC pauses. By default, the current number of
processors, with a maximum of 8.
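
The heap-multiplier arithmetic for growable heaps can be sketched as follows. This is only an illustration of the numbers involved: `toy_growable_target` is a hypothetical helper, and the way it combines the multiplier, the maximum heap size, and the never-shrink rule is an assumption, not a description of Whippet's sizing code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the heap size a growable policy might aim for.
   The target is live data times the multiplier, capped at the maximum
   heap size; a growable heap never shrinks below its current size. */
static size_t toy_growable_target(size_t current_bytes, size_t live_bytes,
                                  double multiplier, size_t max_heap_bytes) {
  assert(multiplier >= 1.0);
  double wanted = (double)live_bytes * multiplier;
  size_t target =
      wanted > (double)max_heap_bytes ? max_heap_bytes : (size_t)wanted;
  return target > current_bytes ? target : current_bytes;
}
```

With 100 MB of live data and a multiplier of 2.5, the helper asks for 250 MB, unless the maximum is lower or the heap is already larger.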
You can set these options via `gc_option_set_int` and so on; see
[`gc-options.h`](../api/gc-options.h). Or, you can parse options from
strings: `heap-size-policy`, `heap-size`, `maximum-heap-size`, and so
on. Use `gc_option_from_string` to determine if a string is really an
option. Use `gc_option_parse_and_set` to parse a value for an option.
Use `gc_options_parse_and_set_many` to parse a number of comma-delimited
*key=value* settings from a string.
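
To make the comma-delimited *key=value* format concrete, here is a self-contained toy parser. It is a sketch of the string format only: `toy_options`, `toy_parse_and_set`, and `toy_parse_and_set_many` are hypothetical, the toy accepts plain decimal values only, and the real functions in `gc-options.h` may differ in signature and in the values they accept.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of settings, keyed by the option strings above. */
struct toy_options {
  long heap_size;
  long maximum_heap_size;
};

static int toy_key_is(const char *key, size_t len, const char *name) {
  return len == strlen(name) && strncmp(key, name, len) == 0;
}

/* Set one option; returns 1 on success, 0 for an unknown key or a
   value that is not a decimal number. */
static int toy_parse_and_set(struct toy_options *opts, const char *key,
                             size_t key_len, const char *value) {
  char *end;
  long n = strtol(value, &end, 10);
  if (end == value || (*end != '\0' && *end != ',')) return 0;
  if (toy_key_is(key, key_len, "heap-size"))
    opts->heap_size = n;
  else if (toy_key_is(key, key_len, "maximum-heap-size"))
    opts->maximum_heap_size = n;
  else
    return 0;
  return 1;
}

/* Parse "key=value,key=value"; returns 1 only if every setting parsed. */
static int toy_parse_and_set_many(struct toy_options *opts, const char *str) {
  assert(opts != NULL && str != NULL);
  while (*str) {
    const char *eq = strchr(str, '=');
    if (!eq) return 0;
    if (!toy_parse_and_set(opts, str, (size_t)(eq - str), eq + 1)) return 0;
    const char *comma = strchr(eq, ',');
    if (!comma) break;
    str = comma + 1;
  }
  return 1;
}
```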
### Allocation
So you have a heap and a mutator; great! Let's allocate! Call
`gc_allocate`, passing the mutator and the number of bytes to allocate.
There is also `gc_allocate_fast`, which is an inlined fast-path. If
that returns `NULL`, you need to call `gc_allocate_slow`. The advantage
of this API is that you can punt some root-saving overhead to the slow
path.
Allocation always succeeds. If it doesn't, it kills your program. The
bytes in the resulting allocation will be initialized to 0.
The allocation fast path is parameterized by collector-specific
attributes. JIT compilers can also read those attributes to emit
appropriate inline code that replicates the logic of `gc_allocate_fast`.
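
The split between a fast path and a slow path can be illustrated with a toy bump-pointer allocator. This is a generic sketch, not Whippet's allocator: every name is hypothetical, the toy never collects (old regions are simply leaked), and the region size and alignment are arbitrary choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_REGION_SIZE 4096
#define TOY_ALIGNMENT 8

/* Toy per-mutator allocation state: one bump-pointer region. */
struct toy_mutator {
  uintptr_t hp;     /* address of the next free byte */
  uintptr_t limit;  /* end of the current region */
};

static size_t toy_align(size_t bytes) {
  return (bytes + TOY_ALIGNMENT - 1) & ~(size_t)(TOY_ALIGNMENT - 1);
}

/* Fast path: bump the pointer, or return NULL if the region is full. */
static inline void *toy_allocate_fast(struct toy_mutator *mut, size_t bytes) {
  size_t size = toy_align(bytes);
  if (mut->limit - mut->hp < size) return NULL;
  void *ret = (void *)mut->hp;
  mut->hp += size;
  return ret;
}

/* Slow path: a real collector might stop at a safepoint and collect
   here.  The toy just takes a fresh zeroed region, and aborts if it
   cannot, because allocation never reports failure to the caller. */
static void *toy_allocate_slow(struct toy_mutator *mut, size_t bytes) {
  size_t size = toy_align(bytes);
  size_t region = size > TOY_REGION_SIZE ? size : TOY_REGION_SIZE;
  void *mem = calloc(1, region);
  if (!mem) abort();
  mut->hp = (uintptr_t)mem + size;
  mut->limit = (uintptr_t)mem + region;
  return mem;
}

static void *toy_allocate(struct toy_mutator *mut, size_t bytes) {
  assert(bytes > 0);
  void *ret = toy_allocate_fast(mut, bytes);
  return ret ? ret : toy_allocate_slow(mut, bytes);
}
```

A caller that keeps roots in registers only needs to spill them around the call to the slow path, which is the point of exposing the two halves separately.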
### Write barriers
For some collectors, mutators have to tell the collector whenever they
mutate an object. They tell the collector by calling a *write barrier*;
in Whippet this is currently the case only for generational collectors.
The write barrier is `gc_write_barrier`; see `gc-api.h` for its
parameters.
As with allocation, the fast path for the write barrier is parameterized
by collector-specific attributes, to allow JIT compilers to inline write
barriers.
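
For readers who have not met write barriers before, here is a toy object-remembering barrier of the kind a generational collector might use. It is a generic textbook sketch under our own assumptions, with hypothetical names throughout; it does not describe how any particular Whippet collector implements `gc_write_barrier`.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REMEMBERED 64

struct toy_object {
  int is_old;         /* set once the object has survived a minor GC */
  int is_remembered;  /* already recorded in the remembered set? */
  struct toy_object *field;
};

struct toy_remembered_set {
  struct toy_object *objects[TOY_MAX_REMEMBERED];
  size_t count;
};

/* Slow path: record the old object, so that the next minor collection
   can treat its fields as roots into the young generation. */
static void toy_write_barrier_slow(struct toy_remembered_set *set,
                                   struct toy_object *obj) {
  assert(set->count < TOY_MAX_REMEMBERED);
  obj->is_remembered = 1;
  set->objects[set->count++] = obj;
}

/* Fast path: only old objects that are not yet remembered need work. */
static inline void toy_write_barrier(struct toy_remembered_set *set,
                                     struct toy_object *obj) {
  if (obj->is_old && !obj->is_remembered)
    toy_write_barrier_slow(set, obj);
}

/* A store to a field, followed by the barrier. */
static void toy_set_field(struct toy_remembered_set *set,
                          struct toy_object *obj, struct toy_object *value) {
  obj->field = value;
  toy_write_barrier(set, obj);
}
```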
### Safepoints
Sometimes Whippet will need to synchronize all threads, for example as
part of the "stop" phase of a stop-and-copy semi-space collector.
Whippet stops at *safepoints*. At a safepoint, all mutators must be
able to enumerate all of their edges to live objects.
Whippet has cooperative safepoints: mutators have to periodically call
into the collector to potentially synchronize with other mutators.
`gc_allocate_slow` is a safepoint, so if you have a bunch of threads that are
all allocating, usually safepoints are reached in a more-or-less prompt
fashion. But if a mutator isn't allocating, it either needs to
temporarily mark itself as inactive by trampolining through
`gc_call_without_gc`, or it should arrange to periodically call
`gc_safepoint`. Marking a mutator as inactive is the right strategy
for operations such as system calls that might block. Periodic
safepoints are better for code that is active but not allocating.
Also, the BDW collector actually uses pre-emptive safepoints: it stops
threads via POSIX signals. `gc_safepoint` is a no-op with BDW.
Embedders can inline safepoint checks. If
`gc_cooperative_safepoint_kind()` is `GC_COOPERATIVE_SAFEPOINT_NONE`,
then the collector doesn't need safepoints, as is the case for `bdw`
which uses signals and `semi` which is single-threaded. If it is
`GC_COOPERATIVE_SAFEPOINT_HEAP_FLAG`, then calling
`gc_safepoint_flag_loc` on a mutator will return the address of an `int`
in memory, which if nonzero when loaded using relaxed atomics indicates
that the mutator should call `gc_safepoint_slow`. Similarly for
`GC_COOPERATIVE_SAFEPOINT_MUTATOR_FLAG`, except that the address is
per-mutator rather than global.
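
An inlined flag check of this kind might look like the following toy model, which uses C11 atomics. The `toy_` names are hypothetical and the slow path is a placeholder; only the shape of the check (a relaxed load of an `int`, then a call to a slow path if it is nonzero) follows the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_mutator {
  atomic_int stop_requested;  /* set to nonzero by the collector */
  int times_stopped;          /* bookkeeping for this example only */
};

/* Toy analogue of asking for the per-mutator flag location. */
static atomic_int *toy_safepoint_flag_loc(struct toy_mutator *mut) {
  return &mut->stop_requested;
}

/* Slow path placeholder: a real collector would park the thread here
   until collection is finished. */
static void toy_safepoint_slow(struct toy_mutator *mut) {
  assert(mut != NULL);
  mut->times_stopped++;
  atomic_store_explicit(&mut->stop_requested, 0, memory_order_relaxed);
}

/* Inlined check: one relaxed load and a rarely-taken branch. */
static inline void toy_safepoint(struct toy_mutator *mut) {
  if (atomic_load_explicit(toy_safepoint_flag_loc(mut), memory_order_relaxed))
    toy_safepoint_slow(mut);
}
```

Code that loops without allocating would place a call like `toy_safepoint` on its loop back-edge.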
### Pinning
Sometimes a mutator or embedder would like to tell the collector to not
move a particular object. This can happen for example during a foreign
function call, or if the embedder allows programs to access the address
of an object, for example to compute an identity hash code. To support
this use case, some Whippet collectors allow the embedder to *pin*
objects. Call `gc_pin_object` to prevent the collector from relocating
an object.
Pinning is currently supported by the `bdw` collector, which never moves
objects, and also by the various `mmc` collectors, which can move
objects that have no inbound conservative references.
Pinning is not supported on `semi` or `pcc`.
Call `gc_can_pin_objects` to determine whether the current collector can
pin objects.
### Statistics
Sometimes a program would like some information from the GC: how many
bytes and objects have been allocated? How much time has been spent in
the GC? How many times has GC run, and how many of those were minor
collections? What's the maximum pause time? Stuff like that.
Instead of collecting a fixed set of information, Whippet emits
callbacks when the collector reaches specific states. The embedder
provides a *listener* for these events when initializing the collector.
The listener interface is defined in
[`gc-event-listener.h`](../api/gc-event-listener.h). Whippet ships with
two listener implementations,
[`GC_NULL_EVENT_LISTENER`](../api/gc-null-event-listener.h), and
[`GC_BASIC_STATS`](../api/gc-basic-stats.h). Most embedders will want
their own listener, but starting with the basic stats listener is not a
bad option:
```c
#include "gc-api.h"
#include "gc-basic-stats.h"
#include <stdio.h>

int main(void) {
  /* Default options; gc_init consumes the options object. */
  struct gc_options *options = gc_allocate_options();
  struct gc_heap *heap;
  struct gc_mutator *mut;
  struct gc_basic_stats stats;
  /* NULL stack base: ask the platform for this thread's stack bounds. */
  gc_init(options, NULL, &heap, &mut, GC_BASIC_STATS, &stats);
  // ...
  gc_basic_stats_finish(&stats);
  gc_basic_stats_print(&stats, stdout);
  return 0;
}
```
As you can see, `GC_BASIC_STATS` expands to a `struct gc_event_listener`
definition. We pass an associated pointer to a `struct gc_basic_stats`
instance which will be passed to the listener at every event.
The output of this program might be something like:
```
Completed 19 major collections (0 minor).
654.597 ms total time (385.235 stopped).
Heap size is 167.772 MB (max 167.772 MB); peak live data 55.925 MB.
```
There are currently three different sorts of events: heap events to
track heap growth, collector events to time different parts of
collection, and mutator events to indicate when specific mutators are
stopped.
There are three heap events:
* `init(void* data, size_t heap_size)`: Called during `gc_init`, to
allow the listener to initialize its associated state.
* `heap_resized(void* data, size_t new_size)`: Called if the heap grows
or shrinks.
* `live_data_size(void* data, size_t size)`: Called periodically when
the collector learns about live data size.
The collection events form a kind of state machine, and are called in
this order:
* `requesting_stop(void* data)`: Called when the collector asks
mutators to stop.
* `waiting_for_stop(void* data)`: Called when the collector has done
all the pre-stop work that it is able to and is just waiting on
mutators to stop.
* `mutators_stopped(void* data)`: Called when all mutators have
stopped; the trace phase follows.
* `prepare_gc(void* data, enum gc_collection_kind gc_kind)`: Called
to indicate which kind of collection is happening.
* `roots_traced(void* data)`: Called when roots have been visited.
* `heap_traced(void* data)`: Called when the whole heap has been
traced.
* `ephemerons_traced(void* data)`: Called when the [ephemeron
fixpoint](https://wingolog.org/archives/2023/01/24/parallel-ephemeron-tracing)
has been reached.
* `restarting_mutators(void* data)`: Called right before the collector
restarts mutators.
The collectors in Whippet will call all of these event handlers, but it
may be that they are called conservatively: for example, the
single-mutator, single-collector semi-space collector will never have to
wait for mutators to stop. It will still call the functions, though!
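
Because the collection events arrive in a fixed order, a listener can track them with a small state machine. The sketch below is a stand-alone model of that idea: the `toy_` enum and checker are hypothetical and do not use the real `struct gc_event_listener`, but the event order is the one listed above.

```c
#include <assert.h>

/* The collection events, in the order listed above. */
enum toy_gc_event {
  TOY_REQUESTING_STOP,
  TOY_WAITING_FOR_STOP,
  TOY_MUTATORS_STOPPED,
  TOY_PREPARE_GC,
  TOY_ROOTS_TRACED,
  TOY_HEAP_TRACED,
  TOY_EPHEMERONS_TRACED,
  TOY_RESTARTING_MUTATORS,
  TOY_EVENT_COUNT
};

struct toy_event_checker {
  enum toy_gc_event expected;      /* the event we should see next */
  unsigned completed_collections;  /* full cycles observed */
  unsigned out_of_order;           /* events that arrived unexpectedly */
};

/* Record one event and advance the state machine. */
static void toy_checker_observe(struct toy_event_checker *c,
                                enum toy_gc_event e) {
  assert(e < TOY_EVENT_COUNT);
  if (e != c->expected) {
    c->out_of_order++;
    return;
  }
  if (e == TOY_RESTARTING_MUTATORS) {
    c->completed_collections++;
    c->expected = TOY_REQUESTING_STOP;
  } else {
    c->expected = (enum toy_gc_event)(e + 1);
  }
}
```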
Finally, there are the mutator events:
* `mutator_added(void* data) -> void*`: The only event handler that
returns a value, called when a new mutator is added. The parameter
is the overall event listener data, and the result is
mutator-specific data. The rest of the mutator events pass this
mutator-specific data instead.
* `mutator_cause_gc(void* mutator_data)`: Called when a mutator causes
GC, either via allocation or an explicit `gc_collect` call.
* `mutator_stopping(void* mutator_data)`: Called when a mutator has
received the signal to stop. It may perform some marking work before
it stops.
* `mutator_stopped(void* mutator_data)`: Called when a mutator parks
itself.
* `mutator_restarted(void* mutator_data)`: Called when a mutator
restarts.
* `mutator_removed(void* mutator_data)`: Called when a mutator goes
away.
Note that these event handlers shouldn't really do much. In
particular, they shouldn't call into the Whippet API, and they shouldn't
even access GC-managed objects. Event listeners are really about
statistics and profiling and aren't a place to mutate the object graph.
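
The way listener data flows through the mutator events can be modelled as follows: `mutator_added` receives the listener-wide data and returns per-mutator data, which the later mutator events then receive. This is a single-threaded toy with hypothetical `toy_` names. It only counts, in keeping with the advice above, and it makes no claim about which threads invoke the real handlers or what synchronization they need.

```c
#include <assert.h>
#include <stdlib.h>

/* Listener-wide data. */
struct toy_stats {
  unsigned live_mutators;
  unsigned total_stops;
};

/* Per-mutator data, created when the mutator is added. */
struct toy_mutator_stats {
  struct toy_stats *stats;
  unsigned stops;
};

/* Takes the listener-wide data, returns mutator-specific data. */
static void *toy_mutator_added(void *data) {
  struct toy_stats *stats = data;
  struct toy_mutator_stats *m = calloc(1, sizeof *m);
  if (!m) abort();
  m->stats = stats;
  stats->live_mutators++;
  return m;
}

/* The remaining mutator events take the mutator-specific data. */
static void toy_mutator_stopped(void *mutator_data) {
  struct toy_mutator_stats *m = mutator_data;
  m->stops++;
}

static void toy_mutator_removed(void *mutator_data) {
  struct toy_mutator_stats *m = mutator_data;
  assert(m->stats->live_mutators > 0);
  m->stats->total_stops += m->stops;
  m->stats->live_mutators--;
  free(m);
}
```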
### Ephemerons
Whippet supports ephemerons, first-class objects that weakly associate
keys with values. If an ephemeron's key ever becomes unreachable,
the ephemeron becomes dead and loses its value.
The user-facing API is in [`gc-ephemeron.h`](../api/gc-ephemeron.h). To
allocate an ephemeron, call `gc_allocate_ephemeron`, then initialize its
key and value via `gc_ephemeron_init`. Get the key and value via
`gc_ephemeron_key` and `gc_ephemeron_value`, respectively.
In Whippet, ephemerons can be linked together in a chain. During GC, if
an ephemeron's chain points to a dead ephemeron, that link will be
elided, allowing the dead ephemeron itself to be collected. In that
way, ephemerons can be used to build weak data structures such as weak
maps.
Weak data structures are often shared across multiple threads, so all
routines to access and modify chain links are atomic. Use
`gc_ephemeron_chain_head` to access the head of a storage location that
points to an ephemeron; push a new ephemeron on a location with
`gc_ephemeron_chain_push`; and traverse a chain with
`gc_ephemeron_chain_next`.
An ephemeron association can be removed via `gc_ephemeron_mark_dead`.
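
To show how a chain of ephemerons can back a weak map, here is a toy model built on C11 atomics. It is a sketch under our own assumptions, with hypothetical `toy_` names: there is no collector, so "dead" is just a flag set by hand, and skipping dead links during traversal stands in for the link elision that the collector performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_ephemeron {
  _Atomic(struct toy_ephemeron *) chain;  /* next link in the chain */
  atomic_int dead;
  void *key;
  void *value;
};

/* Skip over dead ephemerons, returning the first live one or NULL. */
static struct toy_ephemeron *toy_skip_dead(struct toy_ephemeron *e) {
  while (e && atomic_load(&e->dead))
    e = atomic_load(&e->chain);
  return e;
}

static struct toy_ephemeron *
toy_chain_head(_Atomic(struct toy_ephemeron *) *loc) {
  return toy_skip_dead(atomic_load(loc));
}

static struct toy_ephemeron *toy_chain_next(struct toy_ephemeron *e) {
  return toy_skip_dead(atomic_load(&e->chain));
}

/* Push `e` onto the chain stored at `loc` with a compare-and-swap loop,
   so that concurrent pushes do not lose links. */
static void toy_chain_push(_Atomic(struct toy_ephemeron *) *loc,
                           struct toy_ephemeron *e) {
  assert(e != NULL);
  struct toy_ephemeron *head = atomic_load(loc);
  do {
    atomic_store(&e->chain, head);
  } while (!atomic_compare_exchange_weak(loc, &head, e));
}

static void toy_mark_dead(struct toy_ephemeron *e) {
  atomic_store(&e->dead, 1);
}

/* One weak-map bucket: find the value for `key`, by identity. */
static void *toy_weak_lookup(_Atomic(struct toy_ephemeron *) *loc, void *key) {
  for (struct toy_ephemeron *e = toy_chain_head(loc); e; e = toy_chain_next(e))
    if (e->key == key) return e->value;
  return NULL;
}
```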
### Finalizers
A finalizer allows the embedder to be notified when an object becomes
unreachable.
A finalizer has a priority. When the heap is created, the embedder
should declare how many priorities there are. Lower-numbered priorities
take precedence; if an object has a priority-0 finalizer outstanding,
that will prevent any finalizer at level 1 (or 2, ...) from firing
until no priority-0 finalizer remains.
Call `gc_finalizer_attach`, from `gc-finalizer.h`, to attach a finalizer
to an object.
A finalizer also references an associated GC-managed closure object.
A finalizer's reference to the closure object is strong: if a
finalizer's closure references its finalizable object,
directly or indirectly, the finalizer will never fire.
When an object with a finalizer becomes unreachable, it is added to a
queue. The embedder can call `gc_pop_finalizable` to get the next
finalizable object and its associated closure. At that point the
embedder can do anything with the object, including keeping it alive.
Ephemeron associations will still be present while the finalizable
object is live. Note however that any objects referenced by the
finalizable object may themselves be already finalized; finalizers are
enqueued for objects when they become unreachable, which can concern
whole subgraphs of objects at once.
The usual way for an embedder to know when the queue of finalizable
objects is non-empty is to call `gc_set_finalizer_callback` to
provide a function that will be invoked when there are pending
finalizers.
Arranging to call `gc_pop_finalizable` and doing something with the
finalizable object and closure is the responsibility of the embedder.
The embedder's finalization action can end up invoking arbitrary code,
so unless the embedder imposes some kind of restriction on what
finalizers can do, generally speaking finalizers should be run in a
dedicated thread instead of recursively from within whatever mutator
thread caused GC. Setting up such a thread is the responsibility of the
mutator. `gc_pop_finalizable` is thread-safe, allowing multiple
finalization threads if that is appropriate.
`gc_allocate_finalizer` returns a finalizer, which is a fresh GC-managed
heap object. The mutator should then directly attach it to an object
using `gc_finalizer_attach`. When the finalizer is fired, it becomes
available to the mutator via `gc_pop_finalizable`.
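
The priority rule described earlier in this section can be modelled with a few lines of C. This toy tracks, for one unreachable object, how many finalizers are outstanding at each priority, and only lets the lowest-numbered outstanding priority fire. The `toy_` names and the fixed priority count are hypothetical; the real API works on GC-managed finalizer objects rather than counters.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PRIORITY_COUNT 3

struct toy_finalizable {
  /* outstanding[p] counts the not-yet-fired finalizers attached to
     this object at priority p. */
  unsigned outstanding[TOY_PRIORITY_COUNT];
};

static void toy_finalizer_attach(struct toy_finalizable *obj,
                                 unsigned priority) {
  assert(priority < TOY_PRIORITY_COUNT);
  obj->outstanding[priority]++;
}

/* Which priority may fire now?  Lower numbers take precedence; returns
   -1 if no finalizer is outstanding. */
static int toy_firing_priority(const struct toy_finalizable *obj) {
  for (unsigned p = 0; p < TOY_PRIORITY_COUNT; p++)
    if (obj->outstanding[p]) return (int)p;
  return -1;
}

/* Fire one finalizer and return its priority, or -1 if none remain. */
static int toy_fire_one(struct toy_finalizable *obj) {
  int p = toy_firing_priority(obj);
  if (p >= 0) obj->outstanding[p]--;
  return p;
}
```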

View file

@ -0,0 +1,127 @@
# Whippet performance tracing
Whippet includes support for run-time tracing via
[LTTng](https://LTTng.org) user-space tracepoints. This allows you to
get a detailed look at how Whippet is performing on your system.
Tracing support is currently limited to Linux systems.
## Getting started
First, you need to build Whippet with LTTng support. Usually this is as
easy as building it in an environment where the `lttng-ust` library is
present, as determined by `pkg-config --libs lttng-ust`. You can tell
whether your Whippet has tracing support by checking whether the
resulting binaries are dynamically linked to `liblttng-ust`.
If we take as an example the `mt-gcbench` test in the Whippet source
tree, we would have:
```
$ ldd bin/mt-gcbench.pcc | grep lttng
...
liblttng-ust.so.1 => ...
...
```
### Capturing traces
Actually capturing traces is a little annoying; it's not as easy as
`perf record`. The [LTTng
documentation](https://lttng.org/docs/v2.13/#doc-controlling-tracing) is
quite thorough, but here is a summary.
First, create your tracing session:
```
$ lttng create
Session auto-20250214-091153 created.
Traces will be output to ~/lttng-traces/auto-20250214-091153
```
You run all these commands as your own user; they don't require root
permissions or system-wide modifications, as all of the Whippet
tracepoints are user-space tracepoints (UST).
Just having an LTTng session created won't do anything though; you need
to configure the session. Monotonic nanosecond-resolution timestamps
are already implicitly part of each event. We also want to have process
and thread IDs for all events:
```
$ lttng add-context --userspace --type=vpid --type=vtid
ust context vpid added to all channels
ust context vtid added to all channels
```
Now enable Whippet events:
```
$ lttng enable-event --userspace 'whippet:*'
ust event whippet:* created in channel channel0
```
And now, start recording:
```
$ lttng start
Tracing started for session auto-20250214-091153
```
With this, traces will be captured for our program of interest:
```
$ bin/mt-gcbench.pcc 2.5 8
...
```
Now stop the trace:
```
$ lttng stop
Waiting for data availability
Tracing stopped for session auto-20250214-091153
```
Whew. If we did it right, our data is now in
`~/lttng-traces/auto-20250214-091153`.
### Visualizing traces
LTTng produces traces in the [Common Trace Format
(CTF)](https://diamon.org/ctf/). My favorite trace viewing tool is the
family of web-based trace viewers derived from `chrome://tracing`. The
best of these appear to be [the Firefox
profiler](https://profiler.firefox.com) and
[Perfetto](https://ui.perfetto.dev). Unfortunately neither of these can
work with CTF directly, so we instead need to run a trace converter.
Oddly, there is no trace converter that can read CTF and write something
that Perfetto (e.g.) can read. However, there is a [JSON-based tracing
format that these tools can
read](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw),
and [Python bindings for Babeltrace, a library that works with
CTF](https://babeltrace.org/), so that's what we do:
```
$ python3 ctf_to_json.py ~/lttng-traces/auto-20250214-091153 > trace.json
```
While Firefox Profiler can load this file, it works better on Perfetto,
as the Whippet events are visually rendered on their respective threads.
![Screenshot of part of Perfetto UI showing a minor GC](./perfetto-minor-gc.png)
### Expanding the set of events
As of February 2025,
the current set of tracepoints includes the [heap
events](https://github.com/wingo/whippet/blob/main/doc/manual.md#statistics)
and some detailed internals of the parallel tracer. We expect this set
of tracepoints to expand over time.
### Overhead of tracepoints
When tracepoints are compiled in but no events are enabled, tracepoints
appear to have no impact on run-time. When event collection is on, for
x86-64 hardware, [emitting a tracepoint event takes about
100ns](https://discuss.systems/@DesnoyersMa/113986344940256872).