Low-Memory modifications to TerraMonkey for Android

dpwhittaker · December 3, 2010, 9:04pm

I tried getting the terrain engine working with the new Android OpenGL ES 2.0 renderer, and ran into a few problems.

Mainly, the current implementation is not optimized for memory usage. I ended up with OutOfMemory exceptions on a 512×512 terrain. It would load about half of the TerrainQuads before running out of breathing room. This makes sense given that the Android platform supports a maximum heap size of 16 MB. (Which also makes sense given that the machine is only required to have a total of 64 MB of ram).

Each vertex requires:

3 floats – position

3 floats – normal

2 floats – texcoords

~2 int – index

that’s 40 bytes per vertex. 512 x 512 is 256k, times 40 bytes is 10 MB, over half of the available space. With the Bitmap (I made a BitmapBasedHeightmap class based on the ImageBasedHeightmap to remove the AWT dependency), splat textures and alpha map (they weigh in at about 7.5 MB in bitmap form), it’s no wonder it ran out of memory.

I wrote a Terrain engine for Ogre a while back that used the idea that the x and z coordinates of every quad of a given LOD are the same, assuming that the local coordinate system of the quad is always the same, and the whole quad is translated into place. This means that every quad can use the same VBOs for x and z coordinates, and these coordinates can be represented by a byte if there page size is <= 129 (just scale the page before you translate it). The y parameter can easily be represented by a short, with an offset/scale parameter passed to the vertex shader (I know of no terrain generators or datasets that need more than 16 bits to represent the terrain). Also, floats are overkill for normals, as is the third parameter. Two bytes are plenty if you think of them as x and z values in the -1 to 1 range, where y is calculated by sqrt(1-xx-zz). That gets you about 1 degree of resolution on normals. Especially with all the distance and interpolation between values, you won’t notice that lack of precision compared to using floats. If they are stored in a texture, you also get integer-to-floating point conversion for free, not to mention that your normals don’t change as you remove vertices, though you end up doing the sqrt, 2 mutiplies, and 2 adds per fragment rather than per vertex. The texcoords can also be derived from the x/z position, with an offset/scale parameter passed to the vertex shader, rather than storing them in yet another array.

All in all, this system requires, per vertex:

1 short - y position

2 bytes - normal

Plus one quad worth of, per vertex:

2 bytes - x, z position (pages should be sized 129 or less)

2 short - indexes (1291292 < 65536)

And for each lower LOD, for (width / 2^LOD)^2 vertices:

2 short - indexes

For a total of 4 bytes per vertex, 1/10th of the space of the standard heightmap, plus one page worth of 6 bytes per vertex. All for a cost of about 14 instructions in the vertex shader. (Note that this happens to also fit in a single RGBA texture if VTF happens to be faster than VBO on embedded GPUs). Most of the literature I’ve read says that mobile GPUs suffer an even greater disparity between memory bandwidth and processing power than their desktop counterparts, so some memory-for-processing tradeoff is probably a good deal.

As for LOD, I prefer making the LOD level of a particular vertex dependent only on its distance from the camera, rather than on some error metric (a.la. Geom clipmaps). [Quick definition: I consider the terrain patch with 33x33 vertices to be a lower LOD than it’s parent 65x65 patch, in opposition to convention, since it is difficult to speak of lower detail patches with higher level of detail]. If the distance between LOD level cutoffs is greater than the distance from corner-to-corner across a terrain quad, then you guarantee that no quad will have more than 2 LODs present. Then you can use two additional y-position buffers to point the vertex shader at the two neighboring vertices that contribute to its new position in the next LOD (or at least the position it would be if the vertex weren’t removed), and morph to that position over some range of camera distance values (say from 80%-100% of the LOD cutoff distance). An additional xz buffer will hold the new location the vertex should pop to after the morph has completed (creating degenerate triangles), and the triangle will stay there until the entire patch has demoted to the next LOD. But since it morphed first, you don’t notice the pop, though you may notice the land “deflating” before the pop. Then, the entire patch can be changed out for a lower vertex count patch, and the process can be repeated until the patch only has two triangles. Since morphing vertices on two neighboring patches are at the same position, and therefore the same distance from the camera, they will morph together and degenerate together, avoiding T-junctions, and basically doing stitching automatically on a per-vertex (rather than per-patch-edge) level. Visually, in wireframe, you end up seeing concentric circles of successively lower LOD. But all this means that the same index buffers work for all patches at a given LOD. So, if your maximum terrain patch size is 129x129, then you only need 7 index buffers to represent any size terrain. That’s about 22k vertices, * ~2 indices per vertex * 2 bytes per index is about 90k of index data. There are 3 vertices that each vertex cares about, with 3 pieces of data (xz position, y position, xz normal) that needs to be brought in. That sounds like 9 attributes (OpenGL ES only mandates 8 ), but you don’t need the xz position of the vertex you are not popping to after the morph (you’re just averaging the y position and the normal during the morph and using your own xz position), so it fits exactly within mobile hardware limitations. So this version requires 3 shorts per vertex, plus the 2 byte normals, for a total of 8 bytes per vertex, but allows morphing. This can be reduced to 6 bytes per vertex if the final “pop” is omitted, but will introduce T-junctions that may leave pinholes in the terrain. This can be resolved by leaving a higher LOD ring of vertices around the edge of each page, but that brings back in some of the data we were trying to leave out. With Vertex Texture Fetches, we can get back down to 4 bytes per vertex (or even 2 if we calculate normals in the shader), but the OpenGL ES standard does not require VTF to be supported. Still, this may be a useful approach for desktop apps just trying to build the most massive terrain possible. The best compromise may be settling for the use of a byte for the offset to the morphed position. You still need a short for the pop position, unless you impose the limitation that no quad can have a difference of more than 255 units between its elevation extremes. That limitation would allow you to use shorts for everything, but would not make your artists happy, especially if they are already griping about the difficulty of creating true sheer cliffs in a heightmap

The optimal approach for memory is to put the normals in with the alpha map. I’d use 2 textures with 6 alpha maps and 2 normal components. The normals would have to be renormalized due to interpolation anyway, so still no reason to include the y value, and this means the lighting stays consistent regardless of the motion of the geometry, and is no more expensive than bump mapping. This also leaves a need for 6 splat textures, which fits perfectly into the 8 texture units required by the OpenGL ES spec. Bump mapping terrain details that are stored in 128x128 maximum size (all we have room for) textures doesn’t make much sense anyway.

Grand total now, per vertex:

1 short - y value

1 short - pop-to y value

1 byte - morph-to offset

2 bytes - x,z normals

3-6 bytes - alpha map

=========

10-13 bytes per vertex (166k - 216k per 129x129 patch, 42k - 55k per 65x65 patch)

Grand total, shared by patches:

2 bytes - x,z value

2 bytes - pop-to x,z value

2 shorts - indexes

3 bytes * 3-6 - colors for detail textures

===========

8 bytes per shared vertex (133k total for 129x129 patch size, 34k for 65x65) + 9-18 bytes for 128x128 detail textures (147k - 294k)

So, I’m going to start trying to make these modifications and see if I can make a good low-memory paging terrain engine targeted specifically for Android. The idea will be that you break the world up into 256x256 segments (664k - 864k each), then load 9 segments at a time (the one you’re on and the 8 surrounding ones). This fits in 6-8 MB (depending on number of splat textures), and leaves enough room for your other models, game state, etc. This gives you a minimum view distance of 256 * xz resolution. A 1km view distance at 4 m resolution is something I can live with on a cell phone. I’ve also wanted to integrate bicubic interpolation into the vertex shader or patch generator (only replacing the 4 patches you are standing close to) so we can store the terrain at, say, 8m resolution, and interpolate it down to 1m resolution without requiring 64 times the storage to do so, but that’s an extra feature for later.

David

erlend_sh · December 3, 2010, 9:45pm

An outstanding contribution David, we greatly appreciate the effort.

Who says simultaneous publishing for PCs and mobiles might be bit far-fetched!

Momoko_Fan · December 3, 2010, 11:44pm

This is all quite complicated, but I think it could work. I think it could also benefit the desktop a lot as well.

normen · December 4, 2010, 12:05am

Yes, I think you and @Sploreg should get together so that both platforms use the same system. I remember he wanted to add “dynamic” loading anyway.

dpwhittaker · December 4, 2010, 12:15am

Yeah, on the PC, fitting 1 square km in 10 MB translates to being able to fit 100 square km in a GB. The numbers just get bigger… or you have room for silhouette maps for gorgeous directional shadows, even under varying lighting conditions.

One approach I had forgotten about. If the pages get physically bigger the farther from the camera you get, you can see a lot farther with the same memory usage. This requires strict bounds on all the LOD cutoffs, and complicates storing and loading the patches, but could be worth it. I’ll have to do some calculations and see what kind of improvements we can expect before I dig in too deep to change it. The patch size would have to be much smaller to allow enough patches to be worthwhile. But I think we can fit almost 100 32x32 patches in memory… I’ll do the math when I have a chance to sit down and draw it out.

Sploreg · December 4, 2010, 5:06am

Nice work @dpwhittaker

I have a bunch of time this month to work on Terra Monkey, so feel free to message me any time and I will see what I can do to help out.

There are several places in the current terrain system where memory usage can be reduced without rewriting too much of it. They are on my todo list.

Vortex · December 4, 2010, 4:35pm

@Sploreg:

I personally hope that you can finish the Terrain Editor in jMP.

dpwhittaker · December 4, 2010, 10:21pm

The idea of increasing the size of the patches the farther you get from the camera instead of reducing the number of vertices works, even for smaller numbers of patches.

Click here to view the image in Google Docs.

There are similar numbers of patches in each of the scenarios, but the tiny area in the bottom left is what happens if you don’t do the scaling as you expand. With about 100 patches, you can display an area that would normally take 6400 patches to cover, if they were all the same size. Another benefit is, the farther you get from the terrain, the larger area you can cover, so it should scale well for games that give you the ability to walk around as well as fly.

You need at least 100 patches to really make the idea work. On the cell phone, this means each patch would be a 33x33 patch. Also, the minimum LOD cutoffs are related to the patch size, such that the LOD cutoff should be patch size * 1.5 * 2^LOD (where LOD 0 is the highest LOD and LOD 1 is the next farthest out and so on). This means with a 33x33 patch at 1 m horizontal resolution, the LOD cutoffs will be 48m, 96m, 192m, 384m, 768m, 1536m, 3072m. So, at ~100 patches, you get 768m view distance. But adding more patches exponentially increases the view distance. At each LOD, you are only adding a ring of 12-25 patches (note that the above image is representative of a patch size * 2 * 2^LOD formula for cutoff calculations, and shows that the next LOD ring might need as many as 28 patches to remain in the quad-tree structure, but the 1.5 value requires this less often, and still prevents any occurrence of two neighboring patches differing by more than one LOD). In any case, you can double your view distance every 20 patches on average. 100 33x33 patches takes about 1.3 MB of memory, so there really is room for a good bit more. At 200 patches, you end up at a 24 km view range… that’s roughly the distance to the horizon on earth from 45m up. Remember, the higher you go, the larger the innermost patches are, and therefore the larger the whole terrain is, so as you go up, you can see more and more. The short story is, you can see farther than earth’s horizon for all distances above the terrain except for the range from 45-48m, at which point you are still pretty close. The only problem is, you start losing detail at 48m from the camera, which can be quite noticeable.

So maybe there is room to go up to 65x65 patches. Those can fit 140 patches in under 8 MB. That’s a 6 km view distance on the ground, and you keep full detail for the first 96 m (at 1m resolution). This is more acceptable. While it no longer mimics the distance to the horizon on earth, I think the accuracy is probably worth it.

Now there’s some more analysis to do. a 33x33 patch has about 2k polys. 200 patches is 400k triangles. The marketing specs for the PowerVR SGX 530 (which basically the minimum GPU in any OpenGL 2.0 android phone right now) say it is capable of hitting 14 M polys per second. So it should just barely squeeze in at 30 fps if it can actually maintain that speed with the shader code running. In any case, next generation SGX 540 GPUs should be able to handle it. At the 65x65 size though, each patch has 8k polys. So even dropping down to 140 patches still requires over a million polys. NVIDIAs Tegra 2 processor (at 90 million polys per second) is probably the only mobile processor that can actually handle this load. However, using 4 times as many 33x33 patches as 65x65 patches will give you the same results as long as you use the higher LOD cutoff (1.5 * patch size * 2^LOD is just a minimum). There are just 4 times as many batches. But I doubt mobile processors can optimize the larger batches that well anyway. So the patch size should probably be set constant to 33x33 for this scheme, and the LOD cutoff rate and number of patches can be tweaked to alter the memory consumption, frame rate, quality, and maximum view distance of the terrain.

How hard would this be to implement? It basically just means allowing TerrainPatches to exist at places other than the deepest level of the quadtree. It also means storing the terrain in a format that easily supports region of interest queries at different LODs. So either as a blocked mipmap or pyramid representation. Then you can seek to the right portion of the big terrain file and read out just the data for a single patch. Loading from disk/flash should probably be prefetched in a separate thread. A constant number of split/merge operations should happen each frame, and merges should be prioritized over splits, and splits higher in the tree should be prioritized. This keeps at least a low detail version of the terrain available, even if the camera is moving too fast to keep up with the high detail terrain. A patch is split into 4 patches whenever any corner of the patch is closer to the camera than the LOD cutoff for the next higher LOD level. 4 patches are merged into one whenever all 4 external corners are farther from the camera than the LOD cutoff for its LOD level. Actually, there needs to be a fudge factor there so that you don’t end up splitting and merging the same for patches when the user decides to run around in circles right on the cutoff point. So, if you would normally split/merge a patch at 48m, you might decide to wait until 40 m to split, and 56 m to merge. This means that the morph needs to happen in the range of 32-40m or so. And there should be a buffer on the other end, say at 64m for when to load the data from disk for the split (you don’t need to load any data for a merge, it’s already there, you’re just deleting every other vertex and moving the rest into one patch).

Lots of interesting things to try. If you have a lot of time to work on it, then feel free to run with any of these ideas, as it will probably take me several months to implement given my current schedule. Just post if you have any questions or want any deeper explanations (or email me - my forum ID is my google ID).

Sploreg · December 5, 2010, 5:05pm

Hey @dpwhittaker

This month I am going to wrap up the terrain editor in JMP, and finish some the collision. After that, I want to put in hydraulic erosion. So it might be a couple months until I free up to help on this.

The current terrain system is built so it can support split-screen: multiple cameras and different points of high LOD. So it might be easier to copy some of the current TerrainQuad/Patch code and and then implement your terrain structure separate from it. I don’t see apps on android necessarily having split screen (at least not yet until screens get much bigger).

Also keep in mind physics and collision. You will want to use heightfield collision as it is much, much, faster than straight up mesh collsion.

If needed, it might be reasonable to have a separate low-memory terrain system. The app could determie at launch if it should use low memory or the regular terrain based on what environment it is running in. I know some people want spherical terrain system, and that will no doubt need a separate implementation entirely.

Your description seems sound, I will ponder it some more and see what I can come up with in ways to help you out.

dpwhittaker · December 5, 2010, 7:23pm

That’s cool. The editor and collision will be more universally useful, and I can always write a conversion tool that can convert from big PC heightmaps to optimized mobile heightmaps. As far as hydraulic erosion goes, let me suggest this approach. It uses fragment shaders on the GPU to do the processing, so it’s much faster than CPU implementations, works on any graphics card, and seems to give pretty good results. I’d love to see a terrain generator that can do erosion in anything close to real time. The L3DT model of tweak and wait a day just isn’t very efficient. And even commercial tools like Grome and Terragen can take hours to do erosion.

The only other thing I think should make the high priority list is the paging system. Of course, as terrains get large enough to need it, memory usage becomes more important, so maybe the low-memory terrain should be the default for streaming terrains. Hopefully the editor will have enough reusable components that some of the code can be reused to edit large streaming terrains as well.

Sploreg · December 5, 2010, 8:41pm

Yep that was the erosion approach I was going to use It seems to yield decent results and runs quite fast. I would like to have it as real-time as possible. I hate running a post-processing routine that takes a long time, only to see that it didn’t produce what you wanted, so you have to run it again.

The editor in JMP works against the Terrain.java interface and expects it to be a Node (I might change Terrain to be a subclass of Node in the future, but I haven’t quite decided yet). So if any terrain system implements its methods it should work.

My approach for paging terrain was to have a grid of terrainQuad trees (each grid point is a tree) and those trees get streamed in. And also translating the terrain to avoid any float errors. Similar to google maps’ tile system. I’ve experimented with this a bit and it worked well.

Of course, if your low memory model works out well we can switch to it instead for paging.

erlend_sh · December 6, 2010, 2:05am

This is going to come a bit out of the blue and I’m not 100% sure it’s relevant, but as far as Android optimization goes, it seems @antonyudin’s spaceship simulator has an efficient terrain implementation. I believe he simply created a terrain with fracplanet and then imported that into his scene (it can export to Blender so the pipeline is all set). Since his game runs fine that must mean it’s a pretty optimized terrain, though I’d love to know how this approach matches up to @dpwhittaker’s design.

FYI fracplanet can also do spheres, and on top of it all there is a Java port called TerraJ which is surely worth looking into. The project is long since abandoned, but the author is still available for contact, as we’ve exchanged mails.

dpwhittaker · December 6, 2010, 3:34am

I think @antonyudin’s spaceship simulator runs on the PC and uses Android as a remote control. It does look like a nice terrain though. Maybe he’ll share his code.

Fractal terrain generation is a cool topic, and actually on my list to play with (in real time using tessellation shaders, assuming I can find or create a decent parallel noise function).

@sploreg: That’s actually closer to the implementation I did in my first paging terrain engine for Ogre. Its only downfall is that you end up having to pull a lot of stuff from disk all at once, so you have to be more proactive in grabbing that big chunk of data. The low memory system will have to load from disk at the patch level, so that each patch can be loaded quickly and independently, and you only load exactly what you need to show on the screen, but a “chunkier” system will probably actually be more optimal for the desktop that has the memory to handle it. You’ll need to have at least 9 trees loaded at a time, probably more like 25, depending on view range, and how early you start prefetching data that is just outside of view.

As for floating point errors, putting the terrain under the camera at the origin and lining up all its neighbors works. You just have to be careful when stuff moves from one page to the next (translate by the width of a terrain patch, don’t try to subtract the global position of the two terrains and translate by that).

dpwhittaker · December 7, 2010, 12:04am

@sploreg: I just noticed that the Terrain interface has a function useLOD(boolean). Were you planning on calling useLOD(false) in the editor to have access to all the data? For some parameters of view range and LOD in the low-memory paging system, turning of LOD might as well be equivalent to:

throw new OutOfMemoryException();

I guess that just means I’d have to have the useLOD implementation shrink the view range far enough so that it still fits in memory, or just make it turn off LOD for the patch under the camera. Either way, what is the purpose of this method supposed to be? To me, it seems like the Terrain should do LOD under the hood without the user really knowing about it.

If you’d like some help debugging the editor or anything, let me know.

Sploreg · December 7, 2010, 12:17am

That method is not used in any way yet and I haven’t decided if it will stay or not.

For smaller terrain and certain instances it could be very useful to turn off LOD, so these options need to be left open for consideration.

Of course for super large terrains it is absurd to turn off LOD, but hopefully the developer will figure that out as soon as they do it, and decide if they need to use that feature for their app.

dpwhittaker · December 15, 2010, 1:04am

Initial pre-pre-alpha results are in

Untextured Vertex Colored Terrain

Wireframe view

This is a view of 8,000km (just to see what I could do) of fractally generated landscape (the fractal function still needs work)… I had to set the flycam speed to 10000 m/s just to be able to move around it. The detail in the foreground is at 1m resolution, while some of the stuff in the way back is at 128m resolution. I can move around it freely at high framerates as long as I either drop the flyCam speed down to 100m/s or stay at an altitude of 1000m. I need to experiment, but it looks like setting max speed to 10*altitude should maintain consistent frame rates.

The wireframe view shows a pretty good fidelity between pixels and triangles, though there should be an image of a very steep mountain in the picture to get the best idea. It’s roughly 8 pixels per edge on average with these settings.

Just to put it in perspective, I was running on the ground, loading full 1m detail, at 340m/s. That’s the speed of sound. I averaged 40fps with a few hiccups here and there, and the loading logic is pretty simplistic right now.

Now, I hate it when people get this far and post framerates and say that the primary problem has been solved. I still need to add in morphing and texturing (and a much better fractal function), then start tackling vegetation, static objects, and floating point scale issues. I’ll hopefully continue to build tools for it, like a multi-resolution aware terrain editor extension, and even look at wrapping a sphere with 6 of these (I’m thinking about calculating the curve as quadratic bezier patches, or just using enough of them to make a flat tesselation unobvious at the joints) and making a planet simulator. I’m hoping that a combination of adding more stuff and optimizing what is there will let me render about the same size page, just prettier. If not, there’s still plenty of room to reduce the view range as the camera approaches the ground to keep the framerates up. As it is, the framerates are in the 600’s in the sky. Lots of fun to be had by all.

erlend_sh · December 15, 2010, 2:15am

Absolutely awesome! Where’s the code at?

Sploreg · December 15, 2010, 2:20am

Excellent, nice progress!

dpwhittaker · December 15, 2010, 3:18am

erlend_sh said:
Absolutely awesome! Where's the code at?

There's a reason I called them pre-pre-alpha results :)

erlend_sh · December 15, 2010, 3:49am

dpwhittaker said:
There's a reason I called them pre-pre-alpha results :)

Did you now? My brain just went "register: 'results', 'imglink', 'imglink'. return: 'compliment', 'MOAR!' -quit" :P

No honestly I read the whole thing even though I only understand 10% of it at best, but I really do want to keep tabs on this exciting development. All except "initial pre-pre-alpha", that part I did in fact not register at all :D