Fork-Join with std::barrier
To improve the performance of my game engine, I was looking for a simple, almost native solution (i.e. no boilerplate or framework) to render many entities on different threads. That turns out to be fairly easy thanks to std::barrier!
There are a few steps to implement:
- a pool of threads
- few structs to hold data
- feed the workers with some jobs to do
In ParallelManager.cppm:
//std::function will be replaced as soon as a better alternative is available
struct WorkerContext
{
//std::function_ref<void()> task;
//std::move_only_function<void()> task;
std::function<void()> task;
};
//forward declaration for the pointer that will hold everything together (defined in the .cpp)
struct BarrierData;
export class ParallelManager
{
public:
ParallelManager();
~ParallelManager();
std::size_t size() const { return _workers.size(); }
void execute(std::span<std::function<void()>> tasks);
private:
//number of threads available
std::ptrdiff_t _thread_count { static_cast<std::ptrdiff_t>(std::thread::hardware_concurrency()) };
//pointer to glue everything together
std::unique_ptr<BarrierData> _barrier_data;
//vector of worker contexts
std::vector<WorkerContext> _worker_contexts;
//thread pool
std::vector<std::jthread> _workers;
};
Then in my ParallelManager.cpp:
//initialize the thread pool
ParallelManager::ParallelManager()
: _barrier_data{ std::make_unique<BarrierData>() }
{
_worker_contexts.resize(static_cast<std::size_t>(_thread_count));
_workers.reserve(static_cast<std::size_t>(_thread_count));
for (std::size_t t { 0 }; t < static_cast<std::size_t>(_thread_count); ++t) {
_workers.emplace_back([this, t](std::stop_token stoken) {
while (!stoken.stop_requested()) {
_barrier_data->start.arrive_and_wait();
if (stoken.stop_requested()) break;
if (_worker_contexts[t].task) {
_worker_contexts[t].task();
}
_barrier_data->done.arrive_and_wait();
}
});
}
}
//don't forget to clean up
ParallelManager::~ParallelManager()
{
for(auto& worker : _workers) {
worker.request_stop();
}
//wake the workers one last time so they observe the stop request and exit
_barrier_data->start.arrive_and_wait();
}
//execute a batch of tasks (tasks.size() must not exceed size())
void ParallelManager::execute(std::span<std::function<void()>> tasks)
{
for (std::size_t i {0}; i < tasks.size(); i++) {
_worker_contexts[i].task = std::move(tasks[i]);
}
_barrier_data->start.arrive_and_wait();
_barrier_data->done.arrive_and_wait();
for (std::size_t i {0}; i < tasks.size(); i++) {
_worker_contexts[i].task = nullptr;
}
}
Now that the setup is done, we just need to give some work to the workers, so in the game loop when I want to render entities:
//split the entities into one chunk per worker
auto const count { ParallelManagerLocator::get()->size() };
auto const chunk_size { _all_ids.size() / count };
auto const remainder { _all_ids.size() % count };
std::vector<std::function<void()>> tasks;
tasks.reserve(count);
std::size_t offset { 0 };
for (std::size_t t { 0 }; t < count; ++t) {
auto const n { chunk_size + (t < remainder ? 1 : 0) };
auto const ids { std::span{ _all_ids.data() + offset, n } };
tasks.emplace_back([ids, delta_time, camera_view_matrix, this]() {
for (auto const id : ids) {
renderEntity(id, delta_time, camera_view_matrix);
}
});
offset += n;
}
ParallelManagerLocator::get()->execute(tasks);
The implementation is simple and quite efficient: with just this I gained more than 50 fps on a simple scene, and it will probably get even better once std::function is replaced by std::move_only_function or std::function_ref.
I am expecting a great improvement in performance overall now that I can run simple tasks in parallel!