Instead of sending a message to each worker, wake them all with
pthread_cond_broadcast at the start. Instead of having N worker threads,
have N-1 aux threads and let the main thread also do work. Instead of
having the worker threads detect termination, let the main thread detect
it. Avoid
parallelism if the mark stack is small enough, which can be the case for
ephemeron chains. Let aux threads exit when they have no work instead
of spinning; when work is shared later, they will be started up again.
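
The scheme above can be sketched roughly as follows. This is a minimal
illustration, not the code from this change: the names (run_phase,
drain, AUX_THREADS, SERIAL_THRESHOLD) and the summing "work" are
hypothetical stand-ins for the real mark loop, and re-waking aux
threads when work is shared mid-phase is not modelled. It shows one
broadcast per phase, the caller working alongside N-1 helpers, the
caller detecting termination, and a serial path for small inputs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define AUX_THREADS 3        /* N-1 helpers; the caller is the Nth worker */
#define SERIAL_THRESHOLD 64  /* hypothetical cutoff for "small enough" */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t start_cond = PTHREAD_COND_INITIALIZER;
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static unsigned long epoch;  /* bumped once per parallel phase */
static int running_aux;      /* helpers that have not finished this phase */
static int shutting_down;
static pthread_t aux[AUX_THREADS];

static atomic_size_t next_item;  /* stand-in for the shared mark stack */
static size_t item_count;
static atomic_long total;

/* Claim items until none remain; "work" here is just summing indices. */
static void drain(void) {
  size_t i;
  while ((i = atomic_fetch_add(&next_item, 1)) < item_count)
    atomic_fetch_add(&total, (long)i);
}

static void *aux_main(void *arg) {
  unsigned long seen = 0;
  (void)arg;
  pthread_mutex_lock(&lock);
  for (;;) {
    /* Sleep (no spinning) until a new phase starts or we shut down. */
    while (epoch == seen && !shutting_down)
      pthread_cond_wait(&start_cond, &lock);
    if (shutting_down) break;
    seen = epoch;
    pthread_mutex_unlock(&lock);
    drain();
    pthread_mutex_lock(&lock);
    /* Report completion; only the main thread decides the phase is over. */
    if (--running_aux == 0)
      pthread_cond_signal(&done_cond);
  }
  pthread_mutex_unlock(&lock);
  return NULL;
}

static void start_pool(void) {
  for (int i = 0; i < AUX_THREADS; i++) {
    int rc = pthread_create(&aux[i], NULL, aux_main, NULL);
    assert(rc == 0);
    (void)rc;
  }
}

static void stop_pool(void) {
  pthread_mutex_lock(&lock);
  shutting_down = 1;
  pthread_cond_broadcast(&start_cond);
  pthread_mutex_unlock(&lock);
  for (int i = 0; i < AUX_THREADS; i++)
    pthread_join(aux[i], NULL);
}

/* Sum 0..count-1, in parallel only when there is enough work. */
static long run_phase(size_t count) {
  atomic_store(&next_item, 0);
  atomic_store(&total, 0);
  item_count = count;
  if (count < SERIAL_THRESHOLD) {  /* small input: skip the wakeups */
    drain();
    return atomic_load(&total);
  }
  pthread_mutex_lock(&lock);
  epoch++;
  running_aux = AUX_THREADS;
  pthread_cond_broadcast(&start_cond);  /* one wakeup for every helper */
  pthread_mutex_unlock(&lock);
  drain();                              /* the main thread works too */
  pthread_mutex_lock(&lock);
  while (running_aux > 0)               /* main thread detects termination */
    pthread_cond_wait(&done_cond, &lock);
  pthread_mutex_unlock(&lock);
  return atomic_load(&total);
}
```

In this sketch a caller would run start_pool() once, call run_phase()
for each phase, and call stop_pool() at shutdown. The epoch counter
guards against spurious wakeups and lets a helper that was scheduled
late still notice that a phase has begun.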