I got that sorted out (huhuhu)

Hey monkey,

This is gonna be a long post with nerd talks about sorting algorithms. So if math, graphs and micro optimization bore you to death, I won’t blame you if you just walk away from it.



This is an echo to this post http://hub.jmonkeyengine.org/groups/development-discussion-jme3/forum/topic/very-bad-sorting-algorithm-in-geometrylist-sort/

originally posted by Phate666.

He deleted his original post because he went berserk and left the community after some other story but anyway, here is an abstract of what it was about.



His first claim was that JME merge sort for sorting geometries was terribly slow and that Arrays.sort (java6 and java7) was faster.

He pointed out that java 7 had a sorting algorithm called TimSort that was performing very well and consumed only n/2 temp memory for sorting (n being the number of elements to sort in the array). (our merge sort consumes n temp memory)

His claim was based on the SortUtil.test() method

http://code.google.com/p/jmonkeyengine/source/browse/trunk/engine/src/core/com/jme3/util/SortUtil.java#119



After investigations, I realized he was not completely right (about the merge sort being slow).

  • First this test is kind of flawed, because it does not reflect the reality of what we are sorting in JME (small arrays of geometries once for every frame). There is no magical sorting algorithm that performs the best in any situation, it’s very dependent to how the data you feed it with are distributed.
  • Second and most, the test has a bug. The merge sort actually partially sorts the array passed as the first parameter. So the “original” array comes out of this sort partially sorted and is fed to the Array.sort, and TimSort is very good at sorting partially ordered array (ascending or descending).

    Timsort though has performance just a little better than merge sort with randomly distributed arrays.



    So basically his first numbers posted (now deleted unfortunately) were showing that Timsort was like 15 times faster than merge sort, but that’s not the case in real life.



    But…he was not completely wrong neither, because Timsort IS indeed faster in many situations.



    So I went and grabbed the Java7 Timsort and made some tests. I tried to find a proper test that would represent well the general JME case, and tested all our sorting algorithms against several types of arrays.

    Gnome sort is terrible against reverted arrays so i ruled it out.

    Quick sort is not that quick and not stable, so i rule it out too.

    TimSort was always one head ahead of Merge sort. and blew it off with already ordered arrays, ascending or descending.

    I ended up using a real scene and sorting real geometries instead of benching on arrays of floats.



    For a matter of copyright, I couldn’t keep Java 7 Timsort as is, in JME source, so I had to re-implement it from the original paper by Tim Peters (list sort for Python) that can be found here :

    http://svn.python.org/projects/python/trunk/Objects/listsort.txt

    and here is the C code of Python’s listsort

    http://svn.python.org/projects/python/trunk/Objects/listobject.c

    Also i made sure the temp arrays needed for the merge were not allocated on each sort, but kept with the geometry list ( as we do right now for the merge sort).



    What is TimSort?

    I won’t get into the details because, I would bore the bravest among you, but in short, TimSort is a sort that takes advantage of natural order of things.

    The assumption is that there are parts ,of an array to sort, that are already ordered. Those parts, called “runs”, can be merged together (merge sort style).

    But the subtlety of it is that TimSort only consider runs of a certain size. If a run length is below a certain threshold, the remaining elements to the threshold are sorted with a binary insertion sort (

    http://en.wikipedia.org/wiki/Insertion_sort A sort that almost can’t be outperformed on small arrays sorting).

    In original C implementation the Threshold is hard coded to 64 elements. in Java 7 implementation it’s called MIN_MERGE and hard coded to 32 (read Timsort javadoc to understand why Josh Blosh chose 32).

    I called it MIN_SIZE because I found MIN_MERGE was somehow misleading into being the minimum number of merge, which is not.

    I used 32 as it was the value chosen for Java 7.



    I used a scene with 200 objects with 3 different materials. all objects were moving around a bit.

    Here are the results :



    here is the google spread sheets

    https://docs.google.com/spreadsheet/ccc?key=0AsE_ydpoOkOhdFM1eW5aQnUxLVFMdlFaUVhycXZ5Y1E#gid=6

    So …except that java 7 beats the hell out off java 6, this graphs shows that TimSort is faster than merge sort for this kind of scene : around 15% in java 6 and 25% in java 7.

    And all of this with half temp memory consumption.

    I was thrilled and I thought testing it on android would confirm this. I was wrong.



    For android I used a similar scene but i dropped the number of geometries to 100 to have a descent frame rate (22 to 23) so the test does not last too long.

    Here are the results. I also tested with dalvik’s VM Arrays.sort that is also a TimSort.



    here is the spread sheet

    https://docs.google.com/spreadsheet/ccc?key=0AsE_ydpoOkOhdFM1eW5aQnUxLVFMdlFaUVhycXZ5Y1E#gid=7



    Even if i was happy to beat android’s Arrays.sort, I was very disappointed that current merge sort was faster.



    So I went and read the paper again.

    I found out that the MIN_SIZE parameter could have strong influence on the sort performance. so I decided to try with different values : 32 (java7 timsort), 64 (original c implementation), 128, (said to work in the paper), 256 (said to be slower in the paper).



    Here I used a scene with 1000 objects on desktop :



    and the spread sheet

    https://docs.google.com/spreadsheet/ccc?key=0AsE_ydpoOkOhdFM1eW5aQnUxLVFMdlFaUVhycXZ5Y1E#gid=5



    As you can see, the sort are very close to each others, but 256 is definitely the slowest. 64 being the best, 128 being just between 64 and 32.



    I then made this test on android too using the same 100 objects scene as before.



    the spread sheet

    https://docs.google.com/spreadsheet/ccc?key=0AsE_ydpoOkOhdFM1eW5aQnUxLVFMdlFaUVhycXZ5Y1E#gid=3



    So as you can see,Dalvik’s Arrays.sort is too slow. Timsort 32 and 64 are slower than merge sort.TimSort 128 is par to merge sort (very very very slightly better when you look at the numbers). TimSort 256 is slightly better.



    This means basically that on android Binary insertion sort performs better than other sorts.

    Setting the MinMerge to 128 or higher means that a 100 elements arrays will never have a merge performed on it… Only binary insertion sort is used. And the more naturally order is the array the faster the sort will be.



    From that graph and previous results I concluded that 128 was the best value to use to fit both desktop and android cases. 64 is too slow on android and 256 is too slow on desktop.

    This way we’d have a faster sorting for desktop and about the same speed sorting on android with half memory consumption.



    I performed a last test on desktop, with a scene with 200 objects.



    spread sheet

    https://docs.google.com/spreadsheet/ccc?key=0AsE_ydpoOkOhdFM1eW5aQnUxLVFMdlFaUVhycXZ5Y1E#gid=1



    So as you can see with this kind of scene, timsort 128 is a bit slower with java 6 than timsort 32 and on par on java 7.



    Anyway, don’t get too wet about all this. With all the tests I did none of the algorithm made the frame rate change. That means that the sort operation is very unlikely to be the bottleneck in a scene rendering process.



    When you think of it a 0.5 second gain over 100000 frame at 60 frame seconds is a 0.5 second gain over about 28 minutes…

    This is micro micro micro optimization.

    But still…faster is faster.



    So guys…what do you say? do I put that in?
12 Likes

nice investigation :slight_smile: I think anything which pushes more towards android improvements is more important, as the hardware is generally worse, but I think you balanced it nicely

Wow, great work nehon :slight_smile: . As you self noted, sorting is tricky and the performance of various algorithm have different sweetspots… I would say keep the old one since it is already battletested, and because the improvement was not really that significant anyway.

If I got that right it would increase the performance by 0.0002%. The other question is: is a case imaginable where the optimization gets relevant? If so then, yeah, this should be in.



Currently I think it should be at the nice-to-have-list.



edit: sorry, I meant 0.03%.

As ususal nehon does impressive work. TomSort is Faster on desktop, android with half memory consumption. Sounds like a good thing to have but on the other hand what kwando said. The current implementation is battletested and that is worth more I think. Maybe :slight_smile:

Given that the size of the array makes a big difference in algorithm performance, I wonder how much of the difference between android and desktop is actually the scene size difference. As you say, with 100 objects and 128 min merge then android never merges but desktop will.



And it’s good you found the bug in the original benchmark. The numbers were crazy perplexing.

I would say put it in, because if you do like grass rendering with many batches you easily have now 3k elements as the base wich needs to be sorted, and the difference will probably be much higher then.



Also why not put in a interface, and let the user decide at startup time?



eg settings.setSortAlgorithm()

This is a win-win situation (literally). An improvement is an improvement, whatever how much it is. Desktop performance isn’t that much of a problem, but Android is.



As @EmpirePhoenix mentions, when you get really high number of geometries, shit will hit the fan. The more we can push that boundary, the better, even if that would only add 10, it’s 10 more than before. :wink:

Thanks for the nice comments guys.



Actually with a very high number of geometries you’ll have other problems than sorting…

I’m gonna try some tests on a very crowded scene and try to make an example where it does influence the frame rate.



I thought of making some API to allow users to change the algorithm…but, very few users will use it, most of them won’t even know it exists. And it will make so few difference in the general case that, IMO it’s not worth the hassle.



I’d like to put that in…first because I worked on it a fair amount of time and even if it was very educative, i’d like it to not be wasted :p. Then…it’s faster…no matter how you look at it.

But the “battle tested” argument is a very valid one.

@nehon said:
Actually with a very high number of geometries you'll have other problems than sorting...
I'm gonna try some tests on a very crowded scene and try to make an example where it does influence the frame rate.


I concur, but at that point you'll be sure that everything has been done on the sorting side of things; that homework has been done and every inch of improvement has been pulled from the code.

I do not think jME need a sorting algorithm API, in my world that would only bring unneeded complexity into the core and I do believe that one will have heaps of other problems before scene sorting becomes a major bottleneck.



But it would a shame to see hard work wasted, so if it is put in make it as unobtrusive as possible :slight_smile:

So…i made some more quick tests with a scene with 3000 still objects with one material and a still camera.



Java 6, Merge sort : 39 fps (very stable)

Java 6, TimSort : 41 fps (switch back and forth from 40 to 41)



Java 7, Merge sort : 64 fps (very stable)

Java 7, TimSort : 66 fps (very stable)



Same scene with 200 objects

java 6 merge sort : around 490 to 495 fps

java 6 TimSort : around 510 to 515 fps.



java 7 mergesort : 760 to 775 fps

java 7 Timsort : 790 to 800 fps.



That’s around a 3% increase.

It seems that Timsort does have a small impact on still scenes. But still scene are very unlikely to happen since a least the camera moves around…

1 Like

FPS improvement is ALWAYS welcomed. :smiley:

@kwando said:
I do not think jME need a sorting algorithm API, in my world that would only bring unneeded complexity into the core and I do believe that one will have heaps of other problems before scene sorting becomes a major bottleneck.

But it would a shame to see hard work wasted, so if it is put in make it as unobtrusive as possible :)


Note: JME already has its own sorting and we're just talking about replacing it with a better one.
@pspeed said:
Note: JME already has its own sorting and we're just talking about replacing it with a better one.

I'm well aware of that, I've been tracing through the inner workings of the rendering path more than once :)

What happens to your test if you let the camera move around on a given path? Since having alrge amount of static, and only a few dynamic objects is kinda common for most fps/strategy/rpg games.

Mhh my test was too quick to be correct.

Actually I looked into it again and the way I place objects make them already ordered in the GeometryList.

So that’s not still objects that make TimSort faster, that’s already ordered objects :stuck_out_tongue:

So the test is even more a particular case…

@nehon, did you check how much time JME actually spends sorting? If it’s 5%, then increasing the frame rate by 3% is pretty impressive by my book.



Oh, and 3% is 3%. Add a dozen or more optimizations like this and you get a doubled frame rate. I’d move that in unless there’s serious concern about whether it’s battle-tested enough.

You mean the sort time compared to everything else?

I don’t know, but 5% seems to be a lot, I think it’s far lower than that.



My test case was flawed because the geometries were already ordered. Yes it’s faster, but the case it very unlikely to happen in real life.

anyway, this sort is faster in any situation. But not as much as when the array is already sorted :stuck_out_tongue:



About the battle tested argument, i’ve been running this algorithm since then and didn’t notice any issue. I think i’ll put that in once i’ve check with other devs that it’s ok.

ACtuall tha case is not that unlikly, i hae currently a large astroid field with rotating astroids in the login screen, however they do not move, so they are already perfectly orderd, and since the camera does not move stay that way.