0x1 Vulkan mechanism for multithreading on the CPU

Vulkan records GPU state and commands into command buffers and then submits those command buffers for execution. In OpenGL, a context has effectively one command stream in which GPU state is recorded, but in Vulkan an application can record several command buffers in parallel.

If the drawing work is CPU bound, meaning that most of the frame time is spent on the CPU, and the work can be split into tasks that run in parallel, we can assign a separate command buffer to each thread and record into these command buffers in parallel. After all threads have finished recording, we submit the command buffers to the GPU driver, which then executes them.
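The record-in-parallel, submit-once pattern described above can be sketched without any Vulkan calls. This is only an illustration under stated assumptions: CommandList, recordInParallel() and submitAll() are hypothetical names, and a vector of strings stands in for a real command buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a command buffer: an ordered list of recorded commands (assumption).
using CommandList = std::vector<std::string>;

// Record drawsPerWorker commands on each of workerCount threads.
// Every thread writes only to its own list, so recording needs no locking.
std::vector<CommandList> recordInParallel(std::size_t workerCount, std::size_t drawsPerWorker)
{
    std::vector<CommandList> lists(workerCount);
    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (std::size_t t = 0; t < workerCount; ++t)
    {
        workers.emplace_back([&lists, t, drawsPerWorker] {
            for (std::size_t i = 0; i < drawsPerWorker; ++i)
            {
                lists[t].push_back("draw " + std::to_string(t * drawsPerWorker + i));
            }
        });
    }
    // Wait until every thread has finished recording.
    for (std::thread &worker : workers)
    {
        worker.join();
    }
    return lists;
}

// "Submit" all lists in a fixed order from a single thread.
CommandList submitAll(const std::vector<CommandList> &lists)
{
    CommandList submitted;
    for (const CommandList &list : lists)
    {
        submitted.insert(submitted.end(), list.begin(), list.end());
    }
    return submitted;
}
```

Calling submitAll(recordInParallel(4, 8)) yields 32 commands in a deterministic order, even though the four lists were filled concurrently.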

Here is the Vulkan command buffer execution model.

Can every graphics draw pipeline benefit from the multithreaded command buffer mechanism?

It depends on the case.

If the CPU-side preparation of the drawing data can’t be split into parallel tasks, for example because the draws depend on each other (one draw has to be issued before another), the pipeline can’t benefit from the multithreaded command buffer mechanism.

Otherwise it can benefit from it.

0x2 Test case analysis

We use Sascha Willems’s Vulkan example as the test case to check how Vulkan benefits from multithreading.

This test generates command buffers in parallel on multiple threads. The generated command buffers are Vulkan secondary command buffers; once all threads have finished, they are executed from the primary command buffer, which is then submitted to the queue.

Here is the sequence showing how it works.

Let’s go through the sequence of this test case in detail.

0x21 Prepare

It performs the Vulkan initialization, loads the meshes, and creates the threads used for command buffer generation.

VulkanExampleBase::prepare();
// Create a fence for synchronization
VkFenceCreateInfo fenceCreateInfo = vks::initializers::fenceCreateInfo(VK_FLAGS_NONE);
vkCreateFence(device, &fenceCreateInfo, NULL, &renderFence);
loadMeshes();
setupVertexDescriptions();
setupPipelineLayout();
preparePipelines();
prepareMultiThreadedRenderer();
updateMatrices();
prepared = true;

Here is the code of VulkanExampleBase::prepare(), which performs the Vulkan initialization.

createCommandPool();
setupSwapChain();
createCommandBuffers();
setupDepthStencil();
setupRenderPass();
createPipelineCache();
setupFrameBuffer();

createCommandPool() creates the command pool through vkCreateCommandPool().

setupSwapChain() creates the swapchain.

createCommandBuffers() creates one command buffer for each swapchain image; these are reused for rendering.

setupDepthStencil() sets up the depth and stencil settings.

setupRenderPass() creates the render pass through vkCreateRenderPass().

createPipelineCache() creates the pipeline cache through vkCreatePipelineCache().

setupFrameBuffer() creates a framebuffer for every swapchain image through vkCreateFramebuffer().

0x22 Command buffer generation

It creates the primary and secondary command buffers through vkAllocateCommandBuffers().

It creates thread data for each thread; the number of threads depends on the number of CPU cores.

For each thread, it creates a command pool and then allocates one secondary command buffer per object from that pool. The separate secondary command buffer allocated right after the primary one is used for the star sphere.

The number of command buffers per thread is numObjectsPerThread, the number of animated objects rendered by each thread. In this test case the total number of animated objects is 512, so numObjectsPerThread is 512/numThreads, where numThreads is the number of cores.

Then it initializes the push constants for each object.
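The per-thread object count described above is plain integer division. A small sketch, assuming the thread count is taken from std::thread::hardware_concurrency() (the helper names workerThreadCount() and objectsPerThread() are hypothetical and do not come from the example):

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

// One worker per hardware thread. hardware_concurrency() may return 0 when
// the value cannot be determined, so fall back to a single worker.
inline std::uint32_t workerThreadCount()
{
    unsigned int n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : static_cast<std::uint32_t>(n);
}

// Objects handled by each thread, using integer division as in the text
// (512 / numThreads). Any remainder is dropped.
inline std::uint32_t objectsPerThread(std::uint32_t totalObjects, std::uint32_t numThreads)
{
    return totalObjects / numThreads;
}
```

With 4 threads this gives 128 objects per thread. With a thread count that does not divide 512 evenly, such as 6, it gives 85 per thread, so 6 * 85 = 510 objects are covered and the remaining 2 are dropped by the division.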

// create a primary command buffer
VK_CHECK_RESULT(vkAllocateCommandBuffers(device, &cmdBufAllocateInfo, &primaryCommandBuffer));
// create a secondary command buffer for rendering the star sphere
cmdBufAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_SECONDARY;
VK_CHECK_RESULT(vkAllocateCommandBuffers(device, &cmdBufAllocateInfo, &secondaryCommandBuffer));
......		
threadData.resize(numThreads);
......
for (uint32_t i = 0; i < numThreads; i++)
{
	ThreadData *thread = &threadData[i];
		
	// create one command pool for each thread
	VK_CHECK_RESULT(vkCreateCommandPool(device, &cmdPoolInfo, nullptr, &thread->commandPool));
	// one secondary command buffer per object that is updated by this thread
	thread->commandBuffer.resize(numObjectsPerThread);
	// generate secondary command buffers for each thread
	VkCommandBufferAllocateInfo secondaryCmdBufAllocateInfo =
		vks::initializers::commandBufferAllocateInfo(
			thread->commandPool,
			VK_COMMAND_BUFFER_LEVEL_SECONDARY,
			thread->commandBuffer.size());
	VK_CHECK_RESULT(vkAllocateCommandBuffers(device, &secondaryCmdBufAllocateInfo, thread->commandBuffer.data()));
	thread->pushConstBlock.resize(numObjectsPerThread);
	thread->objectData.resize(numObjectsPerThread);
	for (uint32_t j = 0; j < numObjectsPerThread; j++)
	{
		float theta = 2.0f * float(M_PI) * uniformDist(rndGenerator);
		float phi = acos(1.0f - 2.0f * uniformDist(rndGenerator));
		thread->objectData[j].pos = glm::vec3(sin(phi) * cos(theta), 0.0f, cos(phi)) * 35.0f;
		thread->objectData[j].rotation = glm::vec3(0.0f, rnd(360.0f), 0.0f);
		thread->objectData[j].deltaT = rnd(1.0f);
		thread->objectData[j].rotationDir = (rnd(100.0f) < 50.0f) ? 1.0f : -1.0f;
		thread->objectData[j].rotationSpeed = (2.0f + rnd(4.0f)) * thread->objectData[j].rotationDir;
		thread->objectData[j].scale = 0.75f + rnd(0.5f);
		thread->pushConstBlock[j].color = glm::vec3(rnd(1.0f), rnd(1.0f), rnd(1.0f));
	}
}

0x23 Draw

Here is the drawing code.

VulkanExampleBase::prepareFrame();
updateCommandBuffers(frameBuffers[currentBuffer]);
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &primaryCommandBuffer;
VK_CHECK_RESULT(vkQueueSubmit(queue, 1, &submitInfo, renderFence));
// wait for fence to signal that all command buffers are ready
VkResult fenceRes;
do
{
	fenceRes = vkWaitForFences(device, 1, &renderFence, VK_TRUE, 100000000);
} while (fenceRes == VK_TIMEOUT);
VK_CHECK_RESULT(fenceRes);
vkResetFences(device, 1, &renderFence);
VulkanExampleBase::submitFrame();
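The wait loop above retries for as long as vkWaitForFences() returns VK_TIMEOUT; its timeout argument is in nanoseconds, so 100000000 is 100 ms per attempt. The same retry-on-timeout shape can be sketched on the CPU side with a condition variable. SimpleFence and waitUntilSignaled() are hypothetical names for illustration, not part of Vulkan or of the example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// CPU-side analogue of a fence: signaled by one thread, waited on with a timeout by another.
class SimpleFence
{
public:
    void signal()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Returns true if the fence was signaled before the timeout expired.
    bool waitFor(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return signaled_; });
    }

    // Put the fence back into the unsignaled state so it can be reused next frame.
    void reset()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        signaled_ = false;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Keep waiting in fixed time slices until the fence is signaled, then reset it.
// Returns how many slices timed out before the signal arrived.
int waitUntilSignaled(SimpleFence &fence, std::chrono::milliseconds slice)
{
    int timeouts = 0;
    while (!fence.waitFor(slice))
    {
        ++timeouts;
    }
    fence.reset();
    return timeouts;
}
```

Waiting in bounded slices rather than with an infinite timeout keeps the loop in control: it could log, or bail out after a number of attempts, instead of blocking forever.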

The main update function is updateCommandBuffers().

It uses a thread pool to generate the drawing commands on each thread; the thread function is threadRenderCode().

First, it starts recording the primary command buffer through vkBeginCommandBuffer().
The primary command buffer does not contain any rendering commands; the rendering commands are stored in (and retrieved from) the secondary command buffers.
Then it calls vkCmdBeginRenderPass() to begin a new render pass.

// Updates the secondary command buffers using a thread pool 
// and puts them into the primary command buffer that's 
// later submitted to the queue for rendering
VkCommandBufferBeginInfo cmdBufInfo = vks::initializers::commandBufferBeginInfo();
......
// Set target frame buffer
VK_CHECK_RESULT(vkBeginCommandBuffer(primaryCommandBuffer, &cmdBufInfo));
// The primary command buffer does not contain any rendering commands
// These are stored (and retrieved) from the secondary command buffers
vkCmdBeginRenderPass(primaryCommandBuffer, &renderPassBeginInfo, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);

Once the primary command buffer is set up, it starts to configure the secondary command buffers.

The first secondary command buffer renders the star background sphere.

// Inheritance info for the secondary command buffers
VkCommandBufferInheritanceInfo inheritanceInfo = vks::initializers::commandBufferInheritanceInfo();
inheritanceInfo.renderPass = renderPass;
// Secondary command buffer also use the currently active framebuffer
inheritanceInfo.framebuffer = frameBuffer;
// Contains the list of secondary command buffers to be executed
std::vector<VkCommandBuffer> commandBuffers;
// Secondary command buffer with star background sphere
updateSecondaryCommandBuffer(inheritanceInfo);
commandBuffers.push_back(secondaryCommandBuffer);

Now we will see how the command buffers are generated on multiple threads.

Each object is recorded as one job on a worker thread.

After the secondary command buffers are ready, the primary command buffer executes their render commands through vkCmdExecuteCommands().

// Add a job to the thread's queue for each object to be rendered
for (uint32_t t = 0; t < numThreads; t++)
{
	for (uint32_t i = 0; i < numObjectsPerThread; i++)
	{
		threadPool.threads[t]->addJob([=] { threadRenderCode(t, i, inheritanceInfo); });
	}
}
		
threadPool.wait();
// Only submit if object is within the current view frustum
for (uint32_t t = 0; t < numThreads; t++)
{
	for (uint32_t i = 0; i < numObjectsPerThread; i++)
	{
		if (threadData[t].objectData[i].visible)
		{
			commandBuffers.push_back(threadData[t].commandBuffer[i]);
		}
	}
}
// Execute render commands from the secondary command buffer
vkCmdExecuteCommands(primaryCommandBuffer, commandBuffers.size(), commandBuffers.data());
vkCmdEndRenderPass(primaryCommandBuffer);
VK_CHECK_RESULT(vkEndCommandBuffer(primaryCommandBuffer));
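The listing above relies on a thread pool in which each thread has its own job queue, followed by a wait and a gather step. The example's own pool is not shown here, so the following is only a minimal sketch of that pattern under stated assumptions: Worker and recordAndCollect() are hypothetical names, and a per-object flag stands in for the recorded command buffer.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// One worker thread with its own FIFO job queue.
class Worker
{
public:
    Worker() : thread_([this] { run(); }) {}

    ~Worker()
    {
        wait();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    void addJob(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_all();
    }

    // Block until the queue is empty and the last job has finished.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return jobs_.empty() && !busy_; });
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !jobs_.empty() || stopping_; });
                if (jobs_.empty())
                {
                    return; // asked to stop and nothing left to do
                }
                job = std::move(jobs_.front());
                jobs_.pop();
                busy_ = true;
            }
            job(); // run outside the lock so addJob() is never blocked by a job
            {
                std::lock_guard<std::mutex> lock(mutex_);
                busy_ = false;
            }
            cv_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool busy_ = false;
    bool stopping_ = false;
    std::thread thread_; // declared last so the other members exist before run() starts
};

// Queue one job per (thread, object) pair, wait for all workers, then gather
// the objects flagged visible in a fixed order. Even-indexed objects are
// treated as visible purely for illustration.
std::vector<int> recordAndCollect(std::size_t numThreads, std::size_t objectsPerThread)
{
    std::vector<std::vector<char>> visible(numThreads, std::vector<char>(objectsPerThread, 0));
    std::vector<std::unique_ptr<Worker>> pool;
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        pool.push_back(std::make_unique<Worker>());
    }
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        for (std::size_t i = 0; i < objectsPerThread; ++i)
        {
            pool[t]->addJob([&visible, t, i] { visible[t][i] = (i % 2 == 0) ? 1 : 0; });
        }
    }
    for (auto &worker : pool)
    {
        worker->wait();
    }
    std::vector<int> collected;
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        for (std::size_t i = 0; i < objectsPerThread; ++i)
        {
            if (visible[t][i])
            {
                collected.push_back(static_cast<int>(t * objectsPerThread + i));
            }
        }
    }
    return collected;
}
```

Because jobs for thread t only touch the data of thread t, the jobs themselves need no locking; the only synchronization point is the wait before the gather step.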

0x24 Multithreaded command buffer generation

Let’s see how the thread function threadRenderCode() works.

It builds the secondary command buffer for one object of one thread.

threadIndex is the thread index.
cmdBufferIndex is the command buffer index.

It first checks the object against the view frustum and returns early if the object is not visible.

It begins recording the command buffer through vkBeginCommandBuffer().

Then it prepares the data for the push constants.

It records the push constant update through vkCmdPushConstants().

Then it binds the vertex and index data through vkCmdBindVertexBuffers() and vkCmdBindIndexBuffer().

Then it issues the draw through vkCmdDrawIndexed().

Then it ends the recording of the command buffer through vkEndCommandBuffer().

// Builds the secondary command buffer for each thread
void threadRenderCode(uint32_t threadIndex, uint32_t cmdBufferIndex, VkCommandBufferInheritanceInfo inheritanceInfo)
{
	ThreadData *thread = &threadData[threadIndex];
	ObjectData *objectData = &thread->objectData[cmdBufferIndex];
	// Check visibility against view frustum
	objectData->visible = frustum.checkSphere(objectData->pos, objectSphereDim * 0.5f); 
	if (!objectData->visible)
	{
		return;
	}
	VkCommandBufferBeginInfo commandBufferBeginInfo = vks::initializers::commandBufferBeginInfo();
	commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
	commandBufferBeginInfo.pInheritanceInfo = &inheritanceInfo;
	VkCommandBuffer cmdBuffer = thread->commandBuffer[cmdBufferIndex];
	VK_CHECK_RESULT(vkBeginCommandBuffer(cmdBuffer, &commandBufferBeginInfo));
	VkViewport viewport = vks::initializers::viewport((float)width, (float)height, 0.0f, 1.0f);
	vkCmdSetViewport(cmdBuffer, 0, 1, &viewport);
	VkRect2D scissor = vks::initializers::rect2D(width, height, 0, 0);
	vkCmdSetScissor(cmdBuffer, 0, 1, &scissor);
	vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelines.phong);
	// Update objectData
	objectData->rotation.y += 2.5f * objectData->rotationSpeed * frameTimer;
	......
	objectData->model = glm::scale(objectData->model, glm::vec3(objectData->scale));
	thread->pushConstBlock[cmdBufferIndex].mvp = matrices.projection * matrices.view * objectData->model;
	// Update shader push constant block
	// Contains model view matrix
	vkCmdPushConstants(
		cmdBuffer,
		pipelineLayout,
		VK_SHADER_STAGE_VERTEX_BIT,
		0,
		sizeof(ThreadPushConstantBlock),
		&thread->pushConstBlock[cmdBufferIndex]);
	VkDeviceSize offsets[1] = { 0 };
	vkCmdBindVertexBuffers(cmdBuffer, 0, 1, &models.ufo.vertices.buffer, offsets);
	vkCmdBindIndexBuffer(cmdBuffer, models.ufo.indices.buffer, 0, VK_INDEX_TYPE_UINT32);
	vkCmdDrawIndexed(cmdBuffer, models.ufo.indexCount, 1, 0, 0, 0);
	VK_CHECK_RESULT(vkEndCommandBuffer(cmdBuffer));
}
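The listing only shows the call frustum.checkSphere(pos, radius), not its implementation. A common way to implement such a test, sketched here as an assumption, is to compare the signed distance of the sphere center against each of the six frustum planes. Vec3, Plane, sphereInFrustum() and makeBoxFrustum() are hypothetical names, and the planes are assumed to have unit-length normals pointing into the frustum.

```cpp
#include <array>
#include <cassert>

struct Vec3
{
    float x, y, z;
};

// Plane a*x + b*y + c*z + d = 0 with (a, b, c) of unit length, pointing into the frustum.
struct Plane
{
    float a, b, c, d;
};

// The sphere is rejected as soon as it lies entirely on the outer side of one plane.
bool sphereInFrustum(const std::array<Plane, 6> &planes, const Vec3 &center, float radius)
{
    for (const Plane &p : planes)
    {
        float signedDistance = p.a * center.x + p.b * center.y + p.c * center.z + p.d;
        if (signedDistance <= -radius)
        {
            return false;
        }
    }
    return true;
}

// Test helper: the axis-aligned box [-h, h]^3 written as six inward-facing planes.
// A real view frustum would derive its planes from the projection and view matrices.
std::array<Plane, 6> makeBoxFrustum(float h)
{
    std::array<Plane, 6> planes = {{
        {1.0f, 0.0f, 0.0f, h}, {-1.0f, 0.0f, 0.0f, h},
        {0.0f, 1.0f, 0.0f, h}, {0.0f, -1.0f, 0.0f, h},
        {0.0f, 0.0f, 1.0f, h}, {0.0f, 0.0f, -1.0f, h},
    }};
    return planes;
}
```

A sphere that straddles a plane is kept, so the test is conservative: it may record a few objects that end up contributing no pixels, but it never drops a visible one.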

0x3 Performance analysis

Here we compare the performance of the Vulkan test with and without multithreading.

Then we check the thread profiling data for each case.

0x31 With multithread

Here is the performance of Vulkan with multithreading (4 threads); the frame rate is 28 fps.

Here is the CPU profiling data. The CPU load is spread across 5 threads: one is the main thread, and the other 4 are worker threads that record the per-object Vulkan command buffers. The worker thread count of 4 equals the number of CPU cores.

0x32 Without multithread

Here is the performance of Vulkan without multithreading (1 thread); the frame rate is 17 fps.

The frame rate gain from 1 thread to 4 threads is about 65% (17 -> 28).

Here is the CPU profiling data. The CPU load is confined to two threads: one is the main thread, and the other is the single worker thread.
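The two frame rates reported above can be turned into a gain and into per-frame times with simple arithmetic. The helper names below are hypothetical; only the 17 fps and 28 fps figures come from the measurements.

```cpp
#include <cassert>
#include <cmath>

// Ratio of the multithreaded frame rate to the single-threaded one.
double speedup(double baselineFps, double parallelFps)
{
    return parallelFps / baselineFps;
}

// Gain in percent relative to the baseline.
double percentGain(double baselineFps, double parallelFps)
{
    return (speedup(baselineFps, parallelFps) - 1.0) * 100.0;
}

// Time spent per frame in milliseconds.
double frameTimeMs(double fps)
{
    return 1000.0 / fps;
}
```

For 17 fps and 28 fps this gives a speedup of about 1.65, a gain of about 64.7%, and frame times of about 58.8 ms and 35.7 ms. The gain is well below the 4x that four threads could ideally give, which fits the earlier point that only the parallelizable part of the CPU work benefits.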