Rendering less isn’t always faster.

It would be natural to assume that rendering less each frame is always faster, but this isn’t always the case. In game programming, a lot of the expense comes from changing the OpenGL state machine and from processing on the CPU. The raw power of the GPU still surprises me, as it did just now, which inspired this post.

The city level in my zombie game is constructed from about 2000 meshes. Together, the objects in the level add up to about 50000 vertices. They range from tens of vertices for the buildings to hundreds for segments of the terrain. Rendering all 2000 objects each frame without any culling resulted in a frame rate of about 35 fps, which is unacceptable. Profiling shows that the majority of processing time is spent preparing to draw each mesh.

Culling techniques such as a quad tree and frustum culling restored the average frame rate to just under 60 fps by reducing the number of rendered objects to ~200 (5000 vertices). Although cheaper than rendering everything, these culling techniques were still clearly expensive on the CPU and were going to cause a problem once physics, animations, game logic, etc. were added.
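To give a feel for the per-object CPU work that culling involves, here is a minimal sketch of a bounding-sphere test against the six frustum planes. It is not the code from my game: the Plane and Sphere structs and the function names are made up for illustration, and it assumes the plane normals point toward the inside of the frustum.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Plane in the form nx*x + ny*y + nz*z + d = 0.
// Assumption: the normal points toward the inside of the frustum.
struct Plane { float nx, ny, nz, d; };

// Hypothetical bounding sphere stored per mesh.
struct Sphere { float x, y, z, radius; };

// Returns true when the sphere is at least partly inside all six planes.
bool sphereInFrustum(const std::array<Plane, 6>& frustum, const Sphere& s) {
    assert(s.radius >= 0.0f);
    for (const Plane& p : frustum) {
        // Signed distance from the sphere centre to the plane.
        float distance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (distance < -s.radius) {
            return false;  // entirely behind this plane, so not visible
        }
    }
    return true;
}

// Collects the indices of the meshes whose bounds survive culling.
std::vector<std::size_t> visibleMeshes(const std::array<Plane, 6>& frustum,
                                       const std::vector<Sphere>& bounds) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < bounds.size(); ++i) {
        if (sphereInFrustum(frustum, bounds[i])) {
            visible.push_back(i);
        }
    }
    return visible;
}
```

Run naively, this does up to six dot products for every object on every frame. A quad tree cuts down how many objects get tested at all, but the work still lands on the CPU.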

Rendering every mesh is too slow, and culling meshes out of the line of sight is also too slow, so what can one do? The OpenGL best practices state:

How small or how large should a VBO be?
You can make it as small as you like but it is better to put many objects into one VBO and attempt to reduce the number of calls you make to glBindBuffer and glVertexPointer and other GL functions.

What does this mean?
It simply means that you should batch similar objects together so that they are rendered with a single draw call. Objects that use the same texture and shader should be batched into one Vertex Buffer Object. Be aware that there is a practical limit on the size of a VBO: “You can also make it as large as you want but keep in mind that if it is too large, it might not be stored in VRAM or perhaps the driver won’t allocate your VBO and give you a GL_OUT_OF_MEMORY. 1MB to 4MB is a nice size according to one nVidia document. The driver can do memory management more easily. It should be the same case for all other implementations as well like ATI/AMD, Intel, SiS.”
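The CPU side of that batching step might look roughly like the sketch below. Treat it as an illustration under assumptions rather than my actual code: the Mesh struct, its field names, and buildBatches are hypothetical, and the vertices are assumed to be static and already transformed into world space, since everything in a batch is drawn with one transform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical mesh record; the field names are illustrative only.
struct Mesh {
    std::uint32_t textureId;
    std::uint32_t shaderId;
    std::vector<float> vertices;  // world space, 3 floats per vertex
};

// Meshes are grouped by the state they share: (texture, shader).
using BatchKey = std::pair<std::uint32_t, std::uint32_t>;

// Merges all meshes with the same texture and shader into one vertex array.
std::map<BatchKey, std::vector<float>> buildBatches(const std::vector<Mesh>& meshes) {
    std::map<BatchKey, std::vector<float>> batches;
    for (const Mesh& m : meshes) {
        assert(m.vertices.size() % 3 == 0);
        std::vector<float>& out = batches[{m.textureId, m.shaderId}];
        out.insert(out.end(), m.vertices.begin(), m.vertices.end());
    }
    return batches;
}

// Bytes a batch would occupy in a VBO, useful for checking it against
// the 1MB to 4MB guideline quoted above.
std::size_t batchBytes(const std::vector<float>& batch) {
    return batch.size() * sizeof(float);
}
```

Each resulting array would then be uploaded once into its own VBO (for example with glBufferData) and drawn with a single call such as glDrawArrays, so the texture, shader and buffer are bound once per batch instead of once per mesh.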

By combining similar meshes (those with a texture and shader in common) into one VBO each, I now have 14 VBOs instead of 2000 and render all ~50000 vertices each frame. The GPU still takes about 10ms per frame, showing no performance difference from when it was rendering the culled 5000 vertices; however, CPU time has gone from ~16ms per frame to 3ms.