Fork-Join with std::barrier

To improve the performance of my game engine, I was looking for a simple and almost native solution (i.e. no boilerplate and no external framework) to render many entities on different threads. That turns out to be fairly easy to do thanks to std::barrier!

There are a few steps to implement:

In ParallelManager.cppm:

//std::function will be replaced as soon as a better alternative is available
struct WorkerContext
{
  //std::function_ref<void()> task;
  //std::move_only_function<void()> task;
  std::function<void()> task;
};

//opaque struct holding the barriers; defined in the implementation file
struct BarrierData;

export class ParallelManager
{
  public:

   ParallelManager();
   ~ParallelManager();

   std::size_t size() const { return _workers.size(); }
   void execute(std::span<std::function<void()>> tasks);

  private:
    //number of threads available
    std::ptrdiff_t _thread_count { static_cast<std::ptrdiff_t>(std::thread::hardware_concurrency()) };

    //pointer to glue everything together
    std::unique_ptr<BarrierData> _barrier_data;

    //vector of worker contexts
    std::vector<WorkerContext> _worker_contexts;

    //thread pool
    std::vector<std::jthread> _workers;
};

Then in my ParallelManager.cpp, starting with the definition of the forward-declared BarrierData (the usage below implies both barriers count the workers plus the main thread, which also arrives in execute()):

  //the glue: one barrier to release the workers, one to wait for them;
  //+1 participant because the main thread takes part in each rendezvous
  struct BarrierData
  {
    std::ptrdiff_t count { static_cast<std::ptrdiff_t>(std::thread::hardware_concurrency()) + 1 };
    std::barrier<> start { count };
    std::barrier<> done  { count };
  };

  //initialize the thread pool
  ParallelManager::ParallelManager()
  : _barrier_data{ std::make_unique<BarrierData>() }
  {
    _worker_contexts.resize(static_cast<std::size_t>(_thread_count));
    _workers.reserve(static_cast<std::size_t>(_thread_count));

    for (std::size_t t { 0 }; t < static_cast<std::size_t>(_thread_count); ++t) {
      _workers.emplace_back([this, t](std::stop_token stoken) {
        //park on the start barrier every iteration and check the stop token
        //only after waking; checking it in the loop condition could let a
        //worker exit without arriving, leaving the destructor blocked
        while (true) {
          _barrier_data->start.arrive_and_wait();
          if (stoken.stop_requested()) break;
          if (_worker_contexts[t].task) {
            _worker_contexts[t].task();
          }
          _barrier_data->done.arrive_and_wait();
        }
      });
    }
  }

  //don't forget to clean up
  ParallelManager::~ParallelManager()
  {
    for(auto& worker : _workers) {
        worker.request_stop();
    }
    //wake the parked workers so they can observe the stop request;
    //std::jthread then joins automatically on destruction
    _barrier_data->start.arrive_and_wait();
  }

  //run one fork-join round: publish the tasks, release the workers, wait for them
  void ParallelManager::execute(std::span<std::function<void()>> tasks)
  {
    for (std::size_t i {0}; i < tasks.size(); i++) {
      _worker_contexts[i].task = std::move(tasks[i]);
    }

    _barrier_data->start.arrive_and_wait();
    _barrier_data->done.arrive_and_wait();

    for (std::size_t i {0}; i < tasks.size(); i++) {
      _worker_contexts[i].task = nullptr;
    }
  }

Now that the setup is done, we just need to give the workers some work. In the game loop, when I want to render entities:

    //compute the chunk size and remainder for splitting the entities
    auto const count { ParallelManagerLocator::get()->size() };
    auto const chunk_size { _all_ids.size() / count };
    auto const remainder  { _all_ids.size() % count };

    std::vector<std::function<void()>> tasks;
    tasks.reserve(count);

    std::size_t offset { 0 };
    for (std::size_t t { 0 }; t < count; ++t) {
      auto const n { chunk_size + (t < remainder ? 1 : 0) };
      auto const ids { std::span{ _all_ids.data() + offset, n } };
      tasks.emplace_back([ids, delta_time, camera_view_matrix, this]() {
        for (auto const id : ids) {
          renderEntity(id, delta_time, camera_view_matrix);
        }
      });
      offset += n;
    }
    ParallelManagerLocator::get()->execute(tasks);

The implementation is simple and quite efficient: just with this I gain more than 50 fps on a simple scene, and it will probably get even better once std::function is replaced by std::move_only_function or std::function_ref.

I am expecting a great improvement in performance overall now that I can run simple tasks in parallel!


Join the conversation on the Vyroda Forum, check out the latest engine features on Vyroda.com, or contact me on Mastodon.