C# performance tuning, part 7. Summary.

Previous parts of this series:
Part 1: Introduction
Part 2: Collisions explained
Part 3: Obvious optimizations
Part 4: Quad Tree
Part 5: Quad Tree improvements and small fixes
Part 6: Minimize loop repeats

I’ve went through whole optimization process. Getting better and better results at each step, finally reaching the limit when getting 362 times faster than first version of code. That’s a lot. Is that a limit? I doubt. We could do few things more at this point. Replacing property calls with private field should give a little bit more performance. Investigating algorithms more deeply may uncover some more optimizations.

Is that needed? At that point to me – no. I’ve been testing it under pretty stressful conditions – small area packed with 1000 triangles means a lot of collisions happening. Changing area size to 10 000 x 10 000 gives serial collision check time at 2ms level. Leave area size 1000 x 1000 and limit triangle number to 500 and time is as low as 2ms as well. And it’s still probably way more collisions going on per frame than in usual 2D game where we either don’t check for collisions against objects close to each other (player vs. floor) or objects are usually not colliding (player vs. wall or vs. player).

How does the data look?

Change Collisions serial Time serial Collisions parallel Time parallel
initial 8058 1814 8058 562
line segments initialized 8058 1061 8058 308
bounding boxes 8058 25 8058 7
quad tree basic 7054 30 7054 10
_allChildren added 7054 25 7054 7
two segments check 6685 19 6685 7
rectangle optimization 6685 19 6685 7
line precalculations 6685 18 6685 7
triangle contains linq replacement 6685 13 6685 4
QuadTree collison check 3163 5 3163 3

As you can see collisions detected over time decreased as I’ve made the code more precise at the same time improving performance.

What’s to learn from this exercise? First – performance often comes with the price. You can do some things faster, but you will need to pay the price in memory usage (when precalculatig some values), code readability (when replacing LINQ with code duplications, however one could argue if resulted code was harder to read in this particular case) or time take to rewrite the code (changing simple list and replacing it with more complex Quad Tree).

Also – but that’s pretty common knowledge – there is no point in doing performance improvements without profiler set up. You may get better results, you may not, but you will only be guessing without something that will measure your code’s performance.

And final thought – many times there are complains that C# and Java are slow. Well, I’m pretty sure those collisions written in C or C++ would be faster. But would it make that much of a difference? I don’t think so, at least not in my case. For some applications it can be critical to get code running as fast as possible, for others 2ms is fast enough. And tool you know will serve you better at that point than tool you are not that experienced with.


6 thoughts on “C# performance tuning, part 7. Summary.

  1. It sounds like you have had an interesting little adventure. I have been going through my own C# QuadTree/Physics optimization adventure over the past week or so.

    Some things that helped me quite a bit were making better use of structures and passing them by reference where possible. For example, my collision object is now a structure returned via an iterator. This means I am not reserving big blocks of memory to store it, and the garbage collector never has to clean it up. Garbage Collection events dropped by 97%. One of the benefits of struct, which people often do not fully consider, is that they are stored wherever they are declared. If you put them in an array, that’s where their data lives. So processing an array of structures is very easy for the CPU–it all fits nicely in cache. Processing an array of classes… that involves a long series of lookups into the heap and no nice cache benefit. And by using ref arguments, you can send pass them in 32 or 64 bits just like a class.

    I also replaced recursive calls with stack based algorithms. I have one stack for the whole tree, again to avoid wasteful memory allocation, and the entire tree is processed from the top level call. Whenever I would normally call child.Add or child.Query, instead I call stack.Push(child).

    Another win for me was doing away with interface types except where absolutely necessary. For example, I was originally returning all of my collections as generic interfaces. Bad idea! Performance increased quite a bit when I returned exactly the collections I was using to collect my items. Did I use a HashSet? Return a HashSet. Did I use a List? Return a List. Then the compiler knows *exactly* how to get the best performance out of it.

    More recently, I started experimenting with the darker side of C# to see what sort of benefits it could offer for a QuadTree. By dark, I just mean keywords like “unsafe” and “fixed” and “stackalloc” which allow you to do things a little more like C++ without going full native.

    For example, I was able to prototype a new structure for a QuadTree, which stores one node and two generations of its children. It is a struct, and by using fixed buffers, all of its data is actually stored in the struct instead of in the heap somewhere. I used a strategy for mapping the tree structure similar to what one might do for a heap, but instead of having 2 children, I made sure it had 4. The struct only contains 2 generations of children because, otherwise, the memory use would grow out of control for any reasonably sized QuadTree. This way, I can still have empty nodes, but a good portion of tree walking will take place in contiguous blocks of data sitting nicely in the cache, instead of pulling data from all over the heap for literally every operation. Anyway, I have not had a chance to benchmark it yet, but it has been fun to play with.

      • I have one more for you, which helped me tremendously. You were returning query results the same way I was at first – in a collection. I saw a massive performance boost by passing my query method a delegate, instead. This way, if you want to build a collection of results, you can build only one collection instead of one for each node, only to be combined later. Or, if you can process them right away, you can avoid building a collection at all.

        All in all, my unsafe code did result in a pretty big boost. With the original tree with the modification I just mentioned, the simulation would choke with 10,000 randomly colliding objects and never recover. With the new tree, 10,000 objects is nothing, and if I add 16,000, it chokes for 5-10 seconds and then runs smoothly. This is from, essentially, storing almost all of the data about the QuadTree within a single struct using fixed buffers. The code is not the prettiest… and easily 3 times longer than before. But the performance boost is pretty spectacular.

        Anyway, good luck with your projects!

      • One last update. I finished it and then did that comparison with normal .NET arrays of struct node data. The normal .NET array had the same performance with much more flexible and friendlier code, so I went with that. So, the QuadTree itself is almost double the performance of the old one for Add/Remove/Update. The difference is much less for Query, where intersection checks take up most of the time.

        I found that storing 6-9 levels of the QuadTree in each “tier” seemed to work best. 10 or more levels, and memory use skyrockets. Less than 6, and performance starts to sink back down to the levels of the standard QuadTree where each node holds a reference to the next.

        In case you are ever interested in trying something similar, the idea is pretty simple.

        // Holds all of your node data for the chunk/tier
        QuadTreeNodeStruct[NodeCount] Nodes;

        // If you want to get a node’s first child:
        childIndex = nodeIndex * 4 + 1;

        childIndex + 0 // Top Left
        childIndex + 1 // Top Right
        childIndex + 2 // Bottom Left
        childIndex + 3 // Bottom Right

        // This stores more QuadTrees as children of the last level nodes
        QuadTree[LastLevelNodeCount][4] Children;

        Pretty much all you should need to know to come up with an implementation, should you ever want to.

      • Hey man, why don’t you turn this into blog post/series? Sounds like fun thing to work with, would love to see the code with some benchmarks and stuff.

  2. I should add that add/remove operations were cut to <60% run time over the original QuadTree, and they were always the most expensive part. Querying was marginally better, but nothing spectacular. I managed to fit 7 levels of QuadTree data into the struct before .NET complained about the size of the struct. Next step is add a normal .NET array of this struct to contain the next 7 levels, and work that into the tree traversal algorithms. (and then to profile a similar strategy using normal .NET arrays, to see if I've been wasting my time with fixed buffers)

    Anyway, I also replaced my Stack with a stackalloc’d array of indices to process. Push is processing[processingCount++] and Pop () is nodeIndex = processing[–processingCount]; Fast, lightweight, and no heap!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s