Developer Blog

Using Xcode profiling tools to increase OpenGL performance.

I’ve always heard how sensitive OpenGL is to changes in its state machine and about the performance hit these changes cause. However, working on relatively low-tech projects on pretty powerful desktop computers, this was never really an issue. When I tried to build a more complex world on lower-powered devices (iPad 3 and iPod Touch), performance started to struggle. Thankfully, Xcode has some great profiling tools available to help with this.

When running an OpenGL application on a device rather than in the simulator, the ‘Debug Navigator’ tab in Xcode’s solution explorer is a quick and easy way to see how the app is running.

Debug Navigator tab in Xcode’s solution explorer

Once it is selected, the performance of the running application is shown.

Original performance of Woof! running on my iPad. 23ms per frame is very poor performance for rendering simple geometry.

The CPU is taking 23ms per frame and the GPU is taking 11ms. The GPU is therefore busy for only 11ms of every 23ms frame, so it sits idle more than 50% of the time waiting for the CPU. This doesn’t really make sense, as the GPU is doing all the heavy work (transformations, curving of the road, per-pixel lighting). Time to do some profiling :)

Profile your app via the ‘Product->Profile’ menu in Xcode. I like to start with OpenGL ES Analysis.

Xcode Instruments OpenGL ES Analysis

Simply let your app run for a while, allowing some time for all tracelines to be analyzed (left column).


Pretty quickly you can see that in a mere 30 seconds there were almost half a million redundant OpenGL calls from 19 different places in the code. By following the tree down (click on the arrow beside ‘Redundant call’), you can see each of the 19 redundant calls and how often it is called, then follow the tree further to find the guilty code.

Redundant OpenGL commands prior to performance improvements.

Click on any of the commands and the right column shows extended details such as:

This command was redundant:

glUniformMatrix4fv(9, 1, 0u, {0.5000000f, -0.0000000f, -0.8660254f, 0.0000000f, -0.8660254f, -0.0000000f, -0.5000000f, 0.0000000f, 0.0000000f, 1.0000000f, -0.0000000f, 0.0000000f, 0.0000000f, -300.0000000f, 0.0000000f, 1.0000000f})
A GL function call that sets a piece of GL state to its current value has been detected. Minimize the number of these redundant state calls, since these calls are performing unnecessary work.
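
One generic way to stop this kind of call from reaching the driver is to remember the last value sent to each uniform and skip the upload when nothing has changed. The sketch below is an illustration, not the fix used in Woof: `UniformMatrixCache`, `Mat4` and `needsUpload` are hypothetical names, the real `glUniformMatrix4fv` call is left to the caller so the code compiles without a GL context, and it assumes one cache instance per shader program, since uniform locations are only meaningful within one program.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for a 4x4 matrix uniform value (16 floats).
using Mat4 = std::array<float, 16>;

// Remembers the last matrix recorded for each uniform location and reports
// whether a new upload is needed. Assumed to be used per shader program.
class UniformMatrixCache {
public:
    // Returns true when `value` differs from the last value recorded for
    // `location` (or nothing was recorded yet) and stores it. The caller
    // would issue glUniformMatrix4fv only when this returns true.
    bool needsUpload(int location, const Mat4& value) {
        assert(location >= 0);  // -1 means the uniform was not found
        auto it = last_.find(location);
        if (it != last_.end() && it->second == value) {
            ++skipped_;
            return false;  // same state again: the GL call would be redundant
        }
        last_[location] = value;
        ++uploaded_;
        return true;
    }

    std::size_t uploaded() const { return uploaded_; }
    std::size_t skipped() const { return skipped_; }

private:
    std::unordered_map<int, Mat4> last_;
    std::size_t uploaded_ = 0;
    std::size_t skipped_ = 0;
};
```

Comparing 16 floats is cheap next to a driver call, but restructuring the code so the redundant set is never attempted is usually the cleaner fix.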

as well as a stack trace showing the guilty line(s) of code.

Stack trace showing guilty lines of code

By analyzing all the redundant calls you start to get a better understanding of where the bottlenecks are. In my case it was exactly as the OpenGL profiler told me: I was continually setting OpenGL state that I did not need to set. My sins were:

  • Objects were not sorted before rendering. Each object was 100% self-managing, which is nice from a design perspective but not good for performance. In this architecture objects can be rendered in any order, resulting in unnecessary texture, vertex buffer and shader binds. Sorting the objects into a hierarchy greatly reduces these binds, e.g. in the video above (or the feature image for this post) unsorted rendering results in 277 vertex buffer binds, 20 shader binds and 238 texture binds, while sorted rendering gives 90 vertex buffer binds, 7 shader binds and 12 texture binds. Better results may be possible but this requires some investigation.
  • The projection matrix was being set for each model each frame even though it seldom changes. It should only be set when a shader is first loaded, and then updated on each shader when the projection matrix changes (e.g. when the orientation of the device changes).
  • The view matrix was being set for every model rendered even though 95% of the models use the same shader. The camera moves forward each frame, so the view matrix for the shader does need to be updated, but it should only be updated once per shader per frame regardless of how many models are rendered with that shader.
  • glGetUniformLocation was being called to look up the id of a shader uniform variable every time that variable was set. This is very wasteful, as the id never changes once the shader has been compiled and linked. The id should be fetched once at the start and then reused as a constant for all future state updates.
  • The model matrix was being recalculated each frame even though 99% of the models in the world are static and never move. By simply using an isDirty flag on each object, the model matrix is only recalculated when needed, saving hundreds if not thousands of matrix multiplications per frame.
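
The effect of sorting on bind counts can be sketched in a few lines. This is a simplified illustration rather than the Woof render queue: `DrawItem`, `countBinds` and `sortForRendering` are hypothetical names, the ids are made up, the sort order (shader, then texture, then vertex buffer) is an assumption, and a bind is counted only when the id differs from the one currently bound.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical draw record: the ids stand in for GL object names.
struct DrawItem {
    unsigned shader;
    unsigned texture;
    unsigned vertexBuffer;
};

struct BindCounts {
    int shader = 0;
    int texture = 0;
    int vertexBuffer = 0;
};

// Walks the list in draw order and counts the binds needed when a bind is
// issued only if the id differs from the one currently bound.
BindCounts countBinds(const std::vector<DrawItem>& items) {
    BindCounts counts;
    bool first = true;
    DrawItem bound{};
    for (const DrawItem& item : items) {
        if (first || item.shader != bound.shader) ++counts.shader;
        if (first || item.texture != bound.texture) ++counts.texture;
        if (first || item.vertexBuffer != bound.vertexBuffer) ++counts.vertexBuffer;
        bound = item;
        first = false;
    }
    return counts;
}

// Sorts so that items sharing a shader are adjacent, then by texture, then
// by vertex buffer (assumed order, chosen for illustration).
void sortForRendering(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(), [](const DrawItem& a, const DrawItem& b) {
        return std::tie(a.shader, a.texture, a.vertexBuffer) <
               std::tie(b.shader, b.texture, b.vertexBuffer);
    });
}

// Usage sketch with made-up ids: interleaved draws before and after sorting.
void sortedBindsExample() {
    std::vector<DrawItem> items = {
        {1, 10, 100}, {2, 20, 200}, {1, 10, 100}, {2, 20, 200}, {1, 11, 100}};
    assert(countBinds(items).shader == 5);  // every draw switches shader
    sortForRendering(items);
    assert(countBinds(items).shader == 2);  // one bind per distinct shader
}
```

Which key goes first is a tuning decision: the most expensive state change is usually placed outermost so that it happens least often.
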
Performance of Woof after fixing redundant calls and using a sorted render queue.

These simple changes, which only took a few hours of work, result in the above performance. The CPU now takes roughly 33% of the time per frame that it took originally. Some more juice can probably be squeezed out using the CPU profiling tools, but that is for another day.

 

Edit: 28th May

Yesterday was spent trying to optimize the render queue, with some but limited success. Internally the render queue is a map of a map of a map so that each mesh can be sorted on texture, vertex buffer and shader. The main expense is building the queue: for each mesh the queue must first be searched for an existing entry; if one exists it is appended to, otherwise a new entry is created and then appended to. This must be done for each of the 3 levels of the queue. What is the best way to optimize this? It raises the broader question ‘What is the most efficient way to do something?’, which I now answer with ‘The most efficient way to do something is not to have to do it at all’. This is the approach that has increased performance by roughly another 25% (from 8ms per frame to 5-6ms per frame).
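
A three-level queue of this shape can be sketched with nested `std::map`s, where `operator[]` performs the "find the entry or create it" step at each level. The names (`Mesh`, `RenderQueue`, `enqueue`, `drawOrder`) and the level order (shader, then texture, then vertex buffer) are assumptions for illustration and may not match the actual Woof code.

```cpp
#include <cassert>
#include <map>
#include <vector>

using Id = unsigned;  // stand-in for a GL object name

// Hypothetical mesh record: an id plus the three pieces of state it needs.
struct Mesh {
    int meshId;
    Id shader;
    Id texture;
    Id vertexBuffer;
};

// shader -> texture -> vertex buffer -> ids of meshes sharing all three.
using RenderQueue = std::map<Id, std::map<Id, std::map<Id, std::vector<int>>>>;

// Three lookups per mesh: each operator[] finds the existing entry at that
// level or creates an empty one, then the mesh id is appended.
void enqueue(RenderQueue& queue, const Mesh& mesh) {
    queue[mesh.shader][mesh.texture][mesh.vertexBuffer].push_back(mesh.meshId);
}

// Flattens the queue into the order the meshes would be drawn, so meshes
// that share state end up next to each other.
std::vector<int> drawOrder(const RenderQueue& queue) {
    std::vector<int> order;
    for (const auto& byShader : queue)
        for (const auto& byTexture : byShader.second)
            for (const auto& byBuffer : byTexture.second)
                order.insert(order.end(), byBuffer.second.begin(),
                             byBuffer.second.end());
    return order;
}

// Usage sketch with made-up ids.
void renderQueueExample() {
    RenderQueue queue;
    enqueue(queue, {0, 2, 10, 200});
    enqueue(queue, {1, 1, 11, 100});
    enqueue(queue, {2, 1, 10, 100});
    enqueue(queue, {3, 1, 10, 100});
    // Shader 1 first (texture 10 before 11), then shader 2.
    assert((drawOrder(queue) == std::vector<int>{2, 3, 1, 0}));
}
```

The per-mesh cost of those three tree lookups is what makes rebuilding the queue every frame expensive.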

Originally the render queue was being constructed each frame, but is that required? Woof is a very linear game. The camera only moves forward and can’t rotate. The world only moves into the distance and anything that has been visited cannot be visited again. The render queue for frame N will be identical to that of frame N+1 except in one instance: when a patch that has gone behind the camera is deleted and a new patch is added to the end of the world (allowing for an infinite street). With this in mind, what sense does it make to generate the render queue each frame? None. The best optimization to the render queue is simply not to generate it unless absolutely necessary, and that is only the case when the world changes.
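
The idea can be sketched with a dirty flag that is raised only when a patch is recycled. Everything here is a simplified stand-in: `StreetWorld`, `recyclePatch` and `prepareFrame` are hypothetical names, patches are plain integers, and the "render queue" is just a copy of the patch list.

```cpp
#include <cassert>
#include <deque>
#include <vector>

class StreetWorld {
public:
    explicit StreetWorld(int patchCount) {
        assert(patchCount > 0);
        for (int i = 0; i < patchCount; ++i) patches_.push_back(nextPatchId_++);
    }

    // Drops the patch behind the camera and appends a new one at the far
    // end. This is the only event that invalidates the render queue.
    void recyclePatch() {
        patches_.pop_front();
        patches_.push_back(nextPatchId_++);
        queueDirty_ = true;
    }

    // Called once per frame: rebuilds the queue only if the world changed.
    void prepareFrame() {
        if (!queueDirty_) return;
        renderQueue_.assign(patches_.begin(), patches_.end());
        ++rebuildCount_;
        queueDirty_ = false;
    }

    int rebuildCount() const { return rebuildCount_; }
    const std::vector<int>& renderQueue() const { return renderQueue_; }

private:
    std::deque<int> patches_;
    std::vector<int> renderQueue_;
    int nextPatchId_ = 0;
    int rebuildCount_ = 0;
    bool queueDirty_ = true;  // the first frame always builds the queue
};
```

Calling `prepareFrame()` ten times in a row rebuilds the queue once; a further rebuild happens only after `recyclePatch()`.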

This change means that instead of the render queue being generated once per frame, approx. 100 times per second, it is now generated only when the world changes, which depends on the speed the dog is running. With a current speed of 4m per second and a patch length of 20m, a new patch is needed every 5 seconds, so the render queue only needs to be updated once every 500 frames. 499 frames will take 5-6ms and the 500th frame will take 8ms. That’s enough of an optimization for me.
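
The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the 5.5ms used in the usage note is an assumed midpoint of the quoted 5-6ms range, not a measured value.

```cpp
#include <cassert>
#include <cmath>

// Frames between render-queue rebuilds: the time needed to cross one patch
// multiplied by the frame rate.
int framesBetweenRebuilds(double patchLengthMeters, double speedMetersPerSecond,
                          double framesPerSecond) {
    assert(speedMetersPerSecond > 0.0);
    double secondsPerPatch = patchLengthMeters / speedMetersPerSecond;
    return static_cast<int>(std::lround(secondsPerPatch * framesPerSecond));
}

// Average frame time when one frame in every `frames` pays the rebuild cost
// and the rest pay the normal cost.
double averageFrameMs(int frames, double normalMs, double rebuildMs) {
    assert(frames > 0);
    return ((frames - 1) * normalMs + rebuildMs) / frames;
}
```

With the figures from the post, `framesBetweenRebuilds(20.0, 4.0, 100.0)` gives 500, and `averageFrameMs(500, 5.5, 8.0)` works out to 5.505ms, so the occasional 8ms frame barely moves the average.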
