I have noticed a huge amount of garbage being created in my application. It turned out that 90% of it comes from a single place:
RenderQueue.renderShadowQueue
RenderQueue.renderGeometryList
RenderManager.renderGeometry
Material.selectTechnique
Technique.makeCurrent
DefineList.getCompiled
… and then StringBuffer allocations …
From a very cursory look, it seems that the shadow defines are recompiling/recreating the shader text on each frame, because they are different from the normal ones - and it goes over and over. We are talking about hundreds of such actions per second, which results in 50-100MB of garbage per second from that alone, in a very simple application.
I’m still using DirectionalLightShadowRenderer, not the Filter, so maybe it would get better with the Filter one? Anyway, I think this is a more generic problem - it seems that only one combination of parameters gets cached for a given shader?
How many different materials do you have in your scene?
I wonder what parameter makes the define change…
The shadow parameters are passed to all materials on each frame, but they should not make the shader recompile, so there must be an issue.
I’m gonna look into it, thanks for digging this out.
A lot of data is replaced at that moment, so it would not astound me to find a lot of garbage being created, but if it’s in a stack-like situation that is usually no reason for concern.
I haven’t actually seen the compilation stack trace, as I was debugging memory, not CPU. I have just seen a lot of strings with preprocessed shader code being created - I don’t know if they are compiled, or maybe there is some extra cache at the end to avoid that. At the moment I’m concerned about the 50-100MB of garbage per second and the GC running every 10-15 seconds with a 1GB heap, for 4 monsters in the scene.
@abies said:
I haven’t actually seen the compilation stack trace, as I was debugging memory, not CPU. I have just seen a lot of strings with preprocessed shader code being created - I don’t know if they are compiled, or maybe there is some extra cache at the end to avoid that. At the moment I’m concerned about the 50-100MB of garbage per second and the GC running every 10-15 seconds with a 1GB heap, for 4 monsters in the scene.
That’s excessive.
In my current Mythruna engine, at 60 FPS, I seem to get about 1 meg of garbage every 4-5 seconds, but it cleans itself up without a full GC. This goes up to about a meg a second if I turn off vsync, so I believe it is rendering related (it could also be my environment code).
At any rate, your experience seems like something is definitely wrong. I don’t use shadows, though. Maybe it’s related?
if (techniqueSwitched) {
    // If the technique was switched, check if the define list changed
    // based on material parameters.
    DefineList newDefines = new DefineList();
    Collection<MatParam> params = owner.getParams();
    for (MatParam param : params) {
        String defineName = def.getShaderParamDefine(param.getName());
        if (defineName != null) {
            newDefines.set(defineName, param.getVarType(), param.getValue());
        }
    }
    if (!defines.getCompiled().equals(newDefines.getCompiled())) {
newDefines is created each time, and getCompiled creates the shader text because it is not cached there. The one in the existing defines is cached. As nothing has actually changed, nothing gets recompiled. Still, newDefines.getCompiled() ends up being called a lot of times, producing considerably large strings. On top of that, the StringBuilder inside it is not presized, so it keeps growing and reallocating, producing even more garbage.
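To illustrate, caching the compiled text and presizing the builder would look roughly like this minimal sketch - the class and method names are made up here, not the actual jME3 DefineList:

import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class CachedDefineList {

    private final SortedMap<String, String> defines = new TreeMap<>();
    private String compiled = null; // invalidated only when a define changes

    public void set(String name, Object value) {
        String text = String.valueOf(value);
        // Invalidate the cached text only if the value actually changed.
        if (!text.equals(defines.put(name, text))) {
            compiled = null;
        }
    }

    public String getCompiled() {
        if (compiled == null) {
            // Presize roughly for "#define NAME VALUE\n" per entry,
            // so the builder does not repeatedly grow its char[].
            StringBuilder sb = new StringBuilder(defines.size() * 32 + 16);
            for (Map.Entry<String, String> e : defines.entrySet()) {
                sb.append("#define ").append(e.getKey())
                  .append(' ').append(e.getValue()).append('\n');
            }
            compiled = sb.toString();
        }
        return compiled;
    }
}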
Switching happens on the Default, PreShadow, Glow and PostShadow15 techniques (which are probably all I’m using). With the Default technique, there are sometimes 10-15 of them in a row where no switch happens, but sometimes it does switch.
Edit:
Changing the line above to
if (!defines.equals(newDefines)) {
cuts GC considerably. It is still way too much (15MB per second), but that might already be my own problem, so I’ll investigate further.
EditEdit:
And the next big winner is a TreeMap + a lot of TreeNodes, created from… DefineList constructor.
I think it might be the fault of techniqueSwitched being triggered too often for my app?
Shadows switch the technique on every frame for several materials. There shouldn’t be that much overhead…
You mean that the define compare generates garbage?
OK… I think I get where the garbage goes…
Each time you switch the technique the define list is recreated, so basically each time getCompiled is called the text is regenerated and a new StringBuilder is instantiated.
What I don’t get is the amount of garbage it generates for you… 100MB is huge.
Regarding garbage:
- creating a temporary DefineList creates a TreeMap
- filling it out creates a HashMap iterator (just one) and a ton of TreeNodes
- the current define.getCompiled comparison generates a lot of char[] for newDefines and StringBuffer reallocations inside, plus a TreeMap iterator
- direct define equality creates only a TreeMap iterator
Out of these, the char[] are the major issue. Changing it to direct equality is still not perfect, as the DefineList TreeMap is quite a costly structure for a temporary object. I suppose the way to optimize it away would be to add a special method to DefineList which checks equality against a Collection - to avoid creating temporary structures. This would mean just a few iterators created here and there. Of course, if there were a real change, a TreeMap would be allocated - but that does not seem to happen by itself.
People might not be experiencing this, as it might depend on the ShadowRenderer - I suppose people either are not using it, or have switched to the ShadowFilter already. Or maybe it takes the combination of ShadowRenderer, Glow and something else? In any case, these optimizations are quite generic and can help in the future. I can try to implement them if you wish; it should be quite self-contained.
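Roughly, the kind of equality helper I have in mind would look something like this - simplified stand-in types, not the real jME3 classes, and the actual DefineList/MatParam signatures may differ:

import java.util.Collection;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class DefineListSketch {

    private final SortedMap<String, String> defines = new TreeMap<>();

    public void set(String name, Object value) {
        defines.put(name, String.valueOf(value));
    }

    // True if the given params would produce exactly this define list.
    // 'paramToDefine' stands in for MaterialDef.getShaderParamDefine().
    public boolean equalsParams(Collection<ParamSketch> params,
                                Map<String, String> paramToDefine) {
        int matched = 0;
        for (ParamSketch p : params) {
            String defineName = paramToDefine.get(p.name);
            if (defineName == null) {
                continue; // this param does not map to a define
            }
            String current = defines.get(defineName);
            if (current == null || !current.equals(String.valueOf(p.value))) {
                return false; // define missing or value differs
            }
            matched++;
        }
        // Every define currently set must have been matched, otherwise
        // some define would have to be removed and the lists differ.
        return matched == defines.size();
    }

    static class ParamSketch {
        final String name;
        final Object value;
        ParamSketch(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }
}

No temporary DefineList (and therefore no TreeMap/TreeNodes) is built when nothing has changed; the TreeMap is only allocated if a real change is detected and a new list actually has to be created.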
@abies said:
Regarding garbage:
- creating a temporary DefineList creates a TreeMap
- filling it out creates a HashMap iterator (just one) and a ton of TreeNodes
- the current define.getCompiled comparison generates a lot of char[] for newDefines and StringBuffer reallocations inside, plus a TreeMap iterator
- direct define equality creates only a TreeMap iterator
Out of these, the char[] are the major issue. Changing it to direct equality is still not perfect, as the DefineList TreeMap is quite a costly structure for a temporary object. I suppose the way to optimize it away would be to add a special method to DefineList which checks equality against a Collection - to avoid creating temporary structures. This would mean just a few iterators created here and there. Of course, if there were a real change, a TreeMap would be allocated - but that does not seem to happen by itself.
People might not be experiencing this, as it might depend on the ShadowRenderer - I suppose people either are not using it, or have switched to the ShadowFilter already. Or maybe it takes the combination of ShadowRenderer, Glow and something else? In any case, these optimizations are quite generic and can help in the future. I can try to implement them if you wish; it should be quite self-contained.
Well, I guess a fair amount of people use shadows, but no one ever went that far.
Of course, if you can make a patch I’m interested. I’ll test it on my local copy; this is a quite central system in the rendering process, and it has to be thoroughly tested.
I still don’t get why you get that much garbage from this though…
@pspeed said:
When you have this much garbage per second, what is your frame rate?
60 fps, using vsync.
After removing some noise garbage produced by my app, I get the following numbers:
583MB/13 seconds = 44.5MB/s
After changing the comparison from compiled strings to comparing the DefineLists directly, it is reduced to
595MB/41 seconds = 14.5 MB/s
So the difference is around 30MB/s, which at 60 fps comes to half a megabyte per frame from this StringBuffer business.
I’ll do the rest of the optimization later today and we’ll see how much it improves.
I optimized compareParams to avoid creating a temporary DefineList when it is not needed, and I’m getting
596MB/100 seconds = 6MB/s
As I was already running the profiler, I tried going further, even though it is not really related to that part:
Caching some strings in AbstractShadowRenderer:
285MB/72s = 4MB/s
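The string caching boils down to something like this simplified sketch - the names here are illustrative only, not necessarily what the actual AbstractShadowRenderer patch touches:

class ShadowUniformNames {

    private final String[] lightViewProjectionNames;

    ShadowUniformNames(int nbShadowMaps) {
        lightViewProjectionNames = new String[nbShadowMaps];
        for (int i = 0; i < nbShadowMaps; i++) {
            // Built once at construction time and reused every frame,
            // instead of concatenating "LightViewProjectionMatrix" + i
            // per material per frame.
            lightViewProjectionNames[i] = "LightViewProjectionMatrix" + i;
        }
    }

    String lightViewProjectionName(int shadowMapIndex) {
        return lightViewProjectionNames[shadowMapIndex];
    }
}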
And at this point I give up. The next memory hotspot is the BoundingBox allocations/transforms in ShadowUtil.updateShadowCamera, but that is too invasive a change for a small amount of garbage - in the very best case it could save only another 1MB of garbage per second.
The important part is the Technique/DefineList change. The AbstractShadowRenderer one is just for fun (even if it saves 2MB/s at 60 fps) - up to you. It works in my test case, but obviously there might be strange corner cases which I have not seen.
It works very well, and indeed the heap used grows a lot slower.
I’m not gonna commit it right away though, I’m gonna keep it on my local copy for a couple of weeks to be sure there is no regression elsewhere.
Nice work, thank you.
Just as a proof of concept - this one further reduces it from 3.6MB/s to 2.4MB/s. On top of the previous changes, there is a small patch optimizing access to ListMap to avoid creating iterators, and an ugly one avoiding the conversion of floats/ints to strings just for the sake of comparison. Because of the ugly part, I’m not 100% sure it is really worth it - I’m just posting it here in case you would like to try it. This is not going to make any noticeable difference for my test program - but maybe it looks different for very complicated scenegraphs?
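Roughly, the two ideas in that patch look like this simplified sketch, which uses a list-backed map stand-in instead of the real jME3 ListMap so nothing here depends on its actual API:

import java.util.ArrayList;
import java.util.List;

class ListMapSketch<K, V> {

    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    public void put(K key, V value) {
        keys.add(key);
        values.add(value);
    }

    public int size()            { return keys.size(); }
    public K getKey(int index)   { return keys.get(index); }
    public V getValue(int index) { return values.get(index); }

    // 1) Index-based scan: no Iterator object is allocated per call.
    public boolean containsValueEqualTo(Object wanted) {
        for (int i = 0; i < values.size(); i++) {
            if (values.get(i).equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    // 2) Compare a stored Float against a float directly, rather than
    //    going through Float.toString(...) on both sides just to compare.
    public static boolean sameFloat(Object stored, float candidate) {
        return stored instanceof Float
                && ((Float) stored).floatValue() == candidate;
    }
}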