Targeting the right functions for lag profiling

Hi guys, I’m trying to optimize my project a little so that it could run on laptops with Intel HD cards. I mean, it does run, but at 2-3fps, which is unplayable. I’m trying to pinpoint what changed since 1 month ago (and a lot did but what exactly) that made it go from 15fps to 2-3fps on my laptop. I didn’t add new shaders and I’m still using billboards with texture atlas, which is quite amazing on my regular PC, runs at about 350fps. I tried to cull every single thing possible and I’m still getting 4-5fps so I know all the 4000 billboards only take 1-2fps to draw. I put a profiling function call at the beginning and end of every function I had exposed in my project and for instance, I can track that simpleUpdate() loop which is the biggest function I have (roughly 1000 lines) where all sorts of manipulations occur only takes 2ms to execute. That means that if it were only up to simpleUpdate() the game would run at a couple hundred fps at least. Of course, there are many many many other functions not necessarily exposed in my project, that’s why I’m turning to you guys to identify what functions exactly I could override to put a profiling call on. Here are those that I’m already profiling and all of those take less than 2ms per frame to execute:

onAnalog()
onAction()
simpleUpdate()
collision()
prePhysicsTick()
physicsTick()
onInputEvent()

Are there any other functions I could override and put my profiling calls on to monitor?

Thanks :smiley:

textures can be a problem, especially onboard cards, have no vram, and need to access a way slower normal memory (with another bytour trough the processors memory interface)
So some texture intesive algorithsm, for example antialising, might drastically increase the amounts of texel data necessary for a full frame.

There are some tools trying to show the gpu usage, I think gpuz is capable of that. Maybee your problem is not the code directly, but the context.

Accessing stuff in a shader in an unexpected way, (eg random coordinates in a texture) might destroy further benefits from read ahead strategies.

One test scenenario would be (could be)

Serialize the whole scenegraph via the jme internal logic. (but cleanup all controlls and logic before)
Load into a seperate application, and see if the rendering alone is slow, or if it is fast then without the logic.

Hi and thanks for replying. I have found one major bottlenecking component in my project and one that has been added in the past month like I stated in the OP. Could you or anybody try and explain why this model draws so many fps when it’s not culled? Simply culling it brings the fps up by a factor of nearly 300% (on the laptop obviously, not on my regular PC with a decent GPU of course).

It only has 1490 vertices and here’s the MTL file that goes with it (only 1 texture and the OBJ was compiled to a j3o file)

[java]
newmtl basic_skin-texture.jpg.001
Ns 0.000000
Ka 0.000000 0.000000 0.000000
Kd 0.548517 0.459726 0.287604
Ks 0.000000 0.000000 0.000000
Ni 1.000000
d 1.000000
illum 2
map_Kd skin-texture.jpg
[/java]

You can see the model here, the screenshot was taken on my regular PC, on the laptop it’s 8fps:

There are a few open questions from your post.

How are you timing? I assume System.nanoTime() but it is bad to assume. System.currentTimeMillis() is useless for timing.

Are those 4000 billboards separate objects? Do they have BillboardControls or are you doing billboarding in a different way? You should generally target no more than 2000 Geometries in a whole scene.

How are you “culling everything possible”? Setting the cullhint? Not attaching them? This is important to because culled objects will still update their controls… and if 4000 of them are billboarded objects you will spend a lot of time updating them every frame.

Without hacking the engine, one way to time the whole update pass (inputmanager, app state update/render, control update/render) you can share some kind of time tracking object which you start by overriding SimpleApplication.update() and stop/check in the render method of an app state that you add last.

Overriding update will allow you to start the timer at the absolute beginning of the update loop. The app state render() methods are run right before renderManager.render() is called. (The same app state you use to check the update time could also have a postRender() to check the render time.)

and note: if you have access to the machine and can run the SDK on it… then running the profiler is a way to avoid doing any of that hand-timing mess.

@.Ben. said: Hi and thanks for replying. I have found one major bottlenecking component in my project and one that has been added in the past month like I stated in the OP. Could you or anybody try and explain why this model draws so many fps when it's not culled? Simply culling it brings the fps up by a factor of nearly 500% (on the laptop obviously, not on my regular PC with a decent GPU of course).

It only has 1490 vertices and here’s the MTL file that goes with it (only 1 texture and the OBJ was compiled to a j3o file)

[java]
newmtl basic_skin-texture.jpg.001
Ns 0.000000
Ka 0.000000 0.000000 0.000000
Kd 0.548517 0.459726 0.287604
Ks 0.000000 0.000000 0.000000
Ni 1.000000
d 1.000000
illum 2
map_Kd skin-texture.jpg
[/java]

You can see the model here, the screenshot was taken on my regular PC, on the laptop it’s 8fps:

Can you enable JME stats in your screen shots so we can see what’s going on in the screens? How big is the skin texture?

Hi Paul, thanks for replying. I’m timing using this VERY simple function I made. I call it at the beginning of the function and at the end:

[java]
public void profile_start(String str_name){
long time_start = System.currentTimeMillis();
arr_profile_times.put(str_name, time_start);
}

public void profile_end(String str_name){
long time_end = System.currentTimeMillis();
if(time_end - arr_profile_times.get(str_name) > 2){ // Anything that takes longer than 2ms…
System.out.println(s);
}
}
[/java]

Why is System.currentTimeMillis() bad for timing tough?

No, even tough I have thousands of billboards, they’re only 1 geometry with a texture atlas, no control whatsoever, no rigidbodies or anything fancy, don’t worry about this old part, it’s been optimized 3 months ago. Like I stated, culling them only brings 1-2fps more. BUT culling that little j3o hand model brings 12-13fps more tough (!!) so there’s a problem with the j3o model it seems… at least, on the laptop, culling the hands make a WORLD OF DIFFERENCE between the game being playable and the game running at 2fps. That leaves me puzzled at the moment.

@pspeed said: Can you enable JME stats in your screen shots so we can see what's going on in the screens? How big is the skin texture?

YES! I forgot about that. The texture is 512x512, 61K JPEG… it’s like nothing at all. Here are screenshots with and without the hands model and this time on the laptop so you see that insane fps difference just by culling that little j3o model. Sorry about the actual PHOTOS of the screen (lol) the PRINT SCREEN key would only capture the Windows desktop since the app was running fullscreen.

9 fps with the hand models:

A whopping 21 fps (!!) with the hand models CULLED… that’s like totally insane to me:

Figure that it makes quite a difference in playability, we’re specifically talking about 250% faster just by culling a couple thousand vertices and 2 textures. I’m wondering if I could just try this: NOT to compile the 2 models as j3o and instead only import their vertices but share the same material/texture. I’d put my hand in fire it would make a difference but maybe I’m wrong. Normen if you’re reading this, remember a couple weeks ago, we talked very briefly about this and you told me not to apply textures manually, because the j3o was already optimized and that no texture would be loaded twice by the asset manager. But the question is: are they the same material in the GPU in the end or not? Because if not, loading 1 material and sharing it manually on a non-j3o models would make a difference, wouldn’t it?

Thx everybody for your input on this.

@.Ben. said: Why is System.currentTimeMillis() bad for timing tough?

Because the resolution it reports is OS/System specific and may be as much as 10 or 13 ms. Where as System.nanoTime() has none of these issues and there are 0 reasons not to use the more accurate timing.

Even if nothing else, System.currentTimeMillis() can’t be used to time anything less than a millisecond… but realistically and reliably, nothing smaller than 10 ms.

Are the hands animated? Do they have a skeleton? Just having a skeleton means that they are skinned every frame.

@pspeed, System.currenttimeinmillis is kinda guranteed to have 1 ms resolution on modern computers.
-> All windows 7 certified pcs are required to use a hpet in 64bits, java is capable of using it instead of the normal timer. Still nanotime is of course better for profiling.

@Empire Phoenix said: @pspeed, System.currenttimeinmillis is kinda guranteed to have 1 ms resolution on modern computers. -> All windows 7 certified pcs are required to use a hpet in 64bits, java is capable of using it instead of the normal timer. Still nanotime is of course better for profiling.

The javadoc still says (even as of JDK 7):
“Returns the current time in milliseconds. Note that while the unit of time of the return value is a millisecond, the granularity of the value depends on the underlying operating system and may be larger. For example, many operating systems measure time in units of tens of milliseconds.”

I’ve seen it as high as 15+ ms in my years of Java development.

If you want to know the time since January 1st, 1970 then currentTimeMillis() is a fine method to use.

If you want to time code then nanoTime() is the only correct answer.

Actually, you can run this on your computer right now and see what your results are:
[java]
long lastTime = System.currentTimeMillis();
while(true) {
long time = System.currentTimeMillis();
if( time != lastTime ) {
System.out.println( “time:” + time );
lastTime = time;
}
}
[/java]

My win7 box gets 1 ms resolution and my XP box gets 16 ms resolution.

Either way, 1 ms is a HUGE amount of time for timing Java code. If I let the blue box demo run flat out I get better than 6000 frames per second. In other words, currentTimeMillis() would only measure 6 frames at a time!

@Empire Phoenix said: @pspeed, System.currenttimeinmillis is kinda guranteed to have 1 ms resolution on modern computers. -> All windows 7 certified pcs are required to use a hpet in 64bits, java is capable of using it instead of the normal timer. Still nanotime is of course better for profiling.

There is no reason to confuse the issue: If you are timing things use nanoTime (stop-watch). If you want to know what the wall-clock (time of day) is you use currentTimeMillis. They have different meaning and built for different use cases.

timer != clock

1 Like
@pspeed said: Are the hands animated? Do they have a skeleton? Just having a skeleton means that they are skinned every frame.

They are not animated using a skeleton, but the model is moving using MotionPath(). Does that account for that many fps tough?

EDIT: Just tried to comment out the MotionPath() and it doesn’t affect the fps at all.

EDIT #2: Tried without textures at all (recompiled the j3o without it) and it renders exactly at the same speed on all computers. I’ll try and see if it’s the default j3o material that is causing this issue, maybe? I’ll force the unshaded material on it and see…

EDIT #3: Sharing the same material does not affect the fps, so it’s clear j3o objects already index materials and use them commonly. Stripping the material file completely and applying Lighting.j3md on it bears about the same results. Applying Unshaded.j3md explains where the whopping 210% fps diminution on my laptop comes from.

Lighting hands: Regular PC: not culled 456 / culled 459 | Mobile Intel HD: not culled 10 / culled 21
Unshaded hands: Regular PC: not culled 459 / culled 459 | Mobile Intel HD: not culled 19 / culled 21

Well, there I have it. Intel HD mobile cards suck at lighting shading. The model vertices are fine and do not draw that much, altough 10% on a laptop can make the whole difference, so I could bake a simpler version of the model only for laptops.

Thx all for contributing.

@Empire Phoenix said: @pspeed, System.currenttimeinmillis is kinda guranteed to have 1 ms resolution on modern computers. -> All windows 7 certified pcs are required to use a hpet in 64bits, java is capable of using it instead of the normal timer. Still nanotime is of course better for profiling.

Are you sure tough? Either way, maybe it’s better to switch to nano resolution and be done with it.

EDIT: Would something like this be suited?

[java]
public static void main(String args) {

  Stopwatch timer = new Stopwatch().start();

  long total = 0;
  for (int i = 0; i < 10000000; i++) {
     total += i;
  }

  timer.stop();
  System.out.println(timer.getElapsedTime());

}
[/java]

What is Stopwatch?

@pspeed said: What is Stopwatch?

Same thing as mine but with System.nanoTime(); instead of System.currentTimeMillis(); and wrapped in a class instead of 2 global functions I guess.