Question about Mesh Generation Performance in Bloxel Games

Hello,

I am currently working on a game that uses bloxels (similar to Minecraft) as a core component. This isn’t to be a Minecraft clone but rather to facilitate an editable world that doesn’t take too much effort to understand.

I have written a custom Mesh class that loops through the chunks and then checks the neighboring blocks (including in neighboring chunks) and then adds triangles, verticies, etc. as needed. It already skips over any unused faces.

The issue is that the method to generate the mesh often takes > 1s which is unacceptable when the player edit’s the world. When I load an old version of Minecraft the world loads almost instantly (at the maximum view distance) which leads me to believe that I am doing something inefficiently.

EDIT: Some things I have already tried:

  1. Decreasing the size of the chunk (This mesh is already called on an alternate thread so this did help, but it made my world have too many objects for a decent FPS)
  2. Remove calls to external chunks (having any faces that face an external chunk simply be absent from the mesh – it looked ugly but I figured it was fine to see if external chunks were the issue – they wearn’t)
  3. Switch the Chunk object to use an AtomicIntegerArray to hold block IDs rather than synchronized access to a 3D short array (Block IDs are shorts) (this caused significant performance loss) (chunks must be accessible from multiple threads or performance is entirely unacceptable)

Here is my ChunkMesh:

package src.john01dav.<redacted>.client.game.world;

import com.jme3.math.Vector2f;
import com.jme3.math.Vector3f;
import com.jme3.scene.Mesh;
import com.jme3.scene.VertexBuffer;
import com.jme3.util.BufferUtils;

import src.john01dav.<redacted>.client.game.world.textures.AtlasCoordinates;
import src.john01dav.<redacted>.client.game.world.textures.TextureProvider;
import src.john01dav.<redacted>.common.blocks.BlockSide;
import src.john01dav.<redacted>.common.world.Chunk;
import src.john01dav.<redacted>.common.world.World;

import java.util.Arrays;

public class ChunkMesh extends Mesh{
    private static final Vector3f POSITIVE_X = new Vector3f(1, 0, 0);
    private static final Vector3f NEGATIVE_X = new Vector3f(-1, 0, 0);
    private static final Vector3f POSITIVE_Y = new Vector3f(0, 1, 0);
    private static final Vector3f NEGATIVE_Y = new Vector3f(0, -1, 0);
    private static final Vector3f POSITIVE_Z = new Vector3f(0, 0, 1);
    private static final Vector3f NEGATIVE_Z = new Vector3f(0, 0, -1);
    private final Chunk chunk;
    private final WorldAppState worldAppState;
    private Vector3f[] verticies;
    private Vector3f[] normals;
    private Vector2f[] textureCoordinates;
    private int[] indecies;
    private int currentVertex;
    private int currentIndex;

    public ChunkMesh(Chunk chunk, WorldAppState worldAppState){
        this.chunk = chunk;
        this.worldAppState = worldAppState;
    }

    public void buildMeshData(){
        long start = System.currentTimeMillis();
        int cx, cy, cz;

        verticies = new Vector3f[1024];
        normals = new Vector3f[1024];
        textureCoordinates = new Vector2f[1024];
        indecies = new int[4096];

        currentVertex = 0;
        currentIndex = 0;

        synchronized(chunk.getModificationLock()){
            for(cx=0;cx<Chunk.SIZE;cx++){
                for(cy=0;cy<World.HEIGHT;cy++){
                    for(cz=0;cz<Chunk.SIZE;cz++){
                        buildBlockData(cx, cy, cz);
                    }
                }
            }
        }

        verticies = Arrays.copyOf(verticies, currentVertex);
        normals = Arrays.copyOf(normals, currentVertex);
        textureCoordinates = Arrays.copyOf(textureCoordinates, currentVertex);
        indecies = Arrays.copyOf(indecies, currentIndex);

        setBuffer(VertexBuffer.Type.Position, 3, BufferUtils.createFloatBuffer(verticies));
        setBuffer(VertexBuffer.Type.Normal, 3, BufferUtils.createFloatBuffer(normals));
        setBuffer(VertexBuffer.Type.TexCoord, 2, BufferUtils.createFloatBuffer(textureCoordinates));
        setBuffer(VertexBuffer.Type.Index, 3, BufferUtils.createIntBuffer(indecies));

        int vLength = verticies.length;
        int iLength = indecies.length;

        //allow these arrays to be garbage collected
        verticies = null;
        normals = null;
        textureCoordinates = null;
        indecies = null;

        updateBound();

        System.out.println("Mesh generation took: " + (System.currentTimeMillis() - start) + "ms " + vLength + " " + iLength); //this is how I am measuring the performance of the mesh generation
    }

    private void buildBlockData(int cx, int cy, int cz){
        if(isBlockSolid(cx, cy, cz)){
            verticies = minSize(verticies, 24, currentVertex);
            if(normals.length != verticies.length) normals = Arrays.copyOf(normals, verticies.length);
            if(textureCoordinates.length != verticies.length) textureCoordinates = Arrays.copyOf(textureCoordinates, verticies.length);
            indecies = minSize(indecies, 36, currentIndex);

            Vector3f v1 = new Vector3f(cx, cy + 1, cz);
            Vector3f v2 = new Vector3f(cx, cy, cz);
            Vector3f v3 = new Vector3f(cx + 1, cy, cz);
            Vector3f v4 = new Vector3f(cx + 1, cy + 1, cz);
            Vector3f v5 = new Vector3f(cx, cy + 1, cz + 1);
            Vector3f v6 = new Vector3f(cx, cy, cz + 1);
            Vector3f v7 = new Vector3f(cx + 1, cy, cz + 1);
            Vector3f v8 = new Vector3f(cx + 1, cy + 1, cz + 1);

            TextureProvider textureProvider = worldAppState.getTextureProvider(chunk.getBlockWithLocalCoordiantes(cx, cy, cz));

            if(!isBlockSolid(cx + 1, cy, cz)){
                buildFace(v4, v8, v3, v7, POSITIVE_X, textureProvider.getAtlasCoordinates(BlockSide.POSITIVE_X));
            }

            if(!isBlockSolid(cx - 1, cy, cz)){
                buildFace(v5, v1, v6, v2, NEGATIVE_X, textureProvider.getAtlasCoordinates(BlockSide.NEGATIVE_X));
            }

            if(!isBlockSolid(cx, cy + 1, cz)){
                buildFace(v5, v8, v1, v4, POSITIVE_Y, textureProvider.getAtlasCoordinates(BlockSide.POSITIVE_Y));
            }

            if(!isBlockSolid(cx, cy - 1, cz)){
                buildFace(v2, v3, v6, v7, NEGATIVE_Y, textureProvider.getAtlasCoordinates(BlockSide.NEGATIVE_Y));
            }

            if(!isBlockSolid(cx, cy, cz + 1)){
                buildFace(v8, v5, v7, v6, POSITIVE_Z, textureProvider.getAtlasCoordinates(BlockSide.POSITIVE_Z));
            }

            if(!isBlockSolid(cx, cy, cz - 1)){
                buildFace(v1, v4, v2, v3, NEGATIVE_Z, textureProvider.getAtlasCoordinates(BlockSide.NEGATIVE_Z));
            }
        }
    }

    private boolean isBlockSolid(int x, int y, int z){
        try{
            return chunk.getBlockWithLocalCoordiantes(x, y, z).isSolid();
        }catch(ArrayIndexOutOfBoundsException e){
            if(y < 0){
                return true;
            }else if(y >= World.HEIGHT){
                return false;
            }

            int gx = chunk.getCoordinates().getX() * Chunk.SIZE + x;
            int gz = chunk.getCoordinates().getZ() * Chunk.SIZE + z;
            return chunk.getWorld().getBlockAt(gx, y, gz).isSolid();
        }
    }

    private void buildFace(Vector3f a, Vector3f b, Vector3f c, Vector3f d, Vector3f normal, AtlasCoordinates atlasCoordinates){
        int aNum = currentVertex++;
        int bNum = currentVertex++;
        int cNum = currentVertex++;
        int dNum = currentVertex++;

        verticies[aNum] = a;
        verticies[bNum] = b;
        verticies[cNum] = c;
        verticies[dNum] = d;

        normals[aNum] = normal;
        normals[bNum] = normal;
        normals[cNum] = normal;
        normals[dNum] = normal;

        textureCoordinates[aNum] = new Vector2f(atlasCoordinates.getX(), atlasCoordinates.getY());
        textureCoordinates[bNum] = new Vector2f(atlasCoordinates.getX() + AtlasCoordinates.TEXTURE_SIZE, atlasCoordinates.getY());
        textureCoordinates[cNum] = new Vector2f(atlasCoordinates.getX(), atlasCoordinates.getY() + AtlasCoordinates.TEXTURE_SIZE);
        textureCoordinates[dNum] = new Vector2f(atlasCoordinates.getX() + AtlasCoordinates.TEXTURE_SIZE, atlasCoordinates.getY() + AtlasCoordinates.TEXTURE_SIZE);

        indecies[currentIndex++] = aNum;
        indecies[currentIndex++] = bNum;
        indecies[currentIndex++] = cNum;

        indecies[currentIndex++] = cNum;
        indecies[currentIndex++] = bNum;
        indecies[currentIndex++] = dNum;
    }

    private <T> T[] minSize(T[] array, int neededItems, int current){
        int minLength = current + neededItems;

        if(array.length >= minLength){
            return array;
        }else{
            int newLength = current;

            while(newLength < minLength){
                newLength*=2;
            }

            return Arrays.copyOf(array, newLength);
        }
    }

    private int[] minSize(int[] array, int neededItems, int current){
        int minLength = current + neededItems;

        if(array.length >= minLength){
            return array;
        }else{
            int newLength = current;

            while(newLength < minLength){
                newLength*=2;
            }

            return Arrays.copyOf(array, newLength);
        }
    }

    public Chunk getChunk(){
        return chunk;
    }

}

Yeah, that’s way too long. In Mythruna, if a chunk takes more than 10 ms to generate then I get worried. (For reference, my ‘chunks’ are 32x32x32) So > 1 second is pretty scary.

Some random things:

That right there can suck all kinds of time away… even if you don’t have any thread contention. Avoid synchronized at all costs… but especially here where it is completely unnecessary. The ‘synchronized’ keyword is the 50 lb sledge hammer of multithreading… there are almost always way better approaches.

Treating an ‘exception’ as something non-exceptional is also very expensive. A try/catch block on its own incurs some overhead but expecting a whole exception and stack trace to be filled in just to avoid out-of-bounds conditions… and then to do that six times in your inner loop. Molasses.

Better to just check the bounds. An if statement is going to be 100x less expensive.

Even better, let the chunk data overlap and then you don’t have to worry about it.

Also, rather than all of the extra copies of arrays you make, it might be better just to build an intermediate data structure and build the arrays the size you need. So just build a list of some kind of Face objects and then when you know how many, allocate the arrays once… and then you might as well go ahead and make them float arrays.

That’s just the stuff I noticed on quick glance.

Do you have any sources to back this up? I was looking fore more information on exactly what makes it so expensive and everything I found seems to say that it is quite lightweight in modern JVMs (and I intend to ship Java 8 with my game).

This alone caused it to go down to ~20ms with 200ms max just as the client starts and there are a bunch of threads active. It will take longer than Mythruna’s because my chunks are 32x256x32.

It may be ‘fast’ but it’s still the absolute slowest way to do multithreading. That’s why the java.util.concurrent package goes to such great lengths to avoid ‘synchronized’ whenever possible. Even worse is that there are no monitor guarantees on who gets the thread back upon contention.

…and in general, I’d suggest that a design that relies on a synchronized where you have it could use some rework. Queues are overall a much better way. A synchronized block implies a bunch of things, setting up the monitor, memory barrier, etc… when there is any contention then it’s even worse as you potentially lose your time slice (depending on implementation).

Best avoided, really. There are often much cleaner designs anyway.

The synchronized block is there so that the chunk can’t be modified while the mesh is being refreshed. This ensures that each mesh that appears in the world is actually representative of the chunk at some point in time. Is there a better way to ensure this? If so, what is it? Keep in mind that the chunk needs to be able to be modified from any thread as the server is almost entirely concurrent.

Often synchronized seems to be the simplest way to go. I fully understand how to use all of the other low-ish level thread constructs (basically anything below the level of a ConcurrentHashMap, this includes the various Lock classes). Given this, what can I do to find these “much cleaner designs”?

It also means that the mesh can’t be refreshed while it is being modified. What are the various threads that are contending for this? If one of them is the render thread then you’ve probably created some big frame drops. Even though it’s the source data you are protecting here, there is the chance that you are causing contention on the other side.

Generally, the approach for mesh generation is to do it all on another thread. When it’s done being generated, submit it to a queue that the render thread picks up. There is never any chance for multiple threads to access it because there is never any chance that multiple threads even have it.

There are more complicated versions of that but they all revolve around the same premise: only one thread even has the mesh at a time. When the render thread needs a new generated mesh it kicks of the worker to create another, which is then submitted in the same way, and so on. Generally you can then also kick the old unused mesh back to a worker thread to free it, too.

Mythruna uses something towards the more complicated end but its based on the Pager library I open sourced. A builder object manages both the retrieval of the data and the generation of the mesh and all of that is done on the background thread. The results are applied to the scene graph on the render thread.

Though in my case, I don’t worry about chunk reading issues because even if I get a partial view I’m about to get kicked again to generate a new version anyway. The chunks were already retrieved through a Guava LoadingCache so there state was ‘good’ by the time any threads get to use them. If some process inserts data into a cell while I’m using it then it’s not that big of a deal.

Anyway… if you feel that threading isn’t causing any issues then there is no reason to change your approach.

Other random things I’ve discovered can speed things up:
-include an overlapping border in your chunks so that you never have to retrieve data from outside. It’s way easier to poke edge data into multiple chunks than it is to constantly read past edges.
-if you aren’t already, keep your chunk data in a one dimension array that you calculate an index into. If you are currently using a 3D array then this incurs three dereferences, three array length checks, and three array indexes. Not to mention that 3D arrays may take up more memory than a one dimensional array of ‘block type’…depending on your native type. A private or final method for calculating the one dimensional index from 3D coordinates is generally inlined by the JIT so those few extra adds and multiplies are much smaller than the other.

Also, if you get really sneaky you can eventually just iterate over the whole array in sequence. Neighbors offsets can be precalculated. Save that optimization for when you are really bored.

One other random performance thing, you might consider breaking your chunks up vertically also. This is a pretty big performance gain.

Completely empty chunks don’t need generating and completely full chunks also don’t need generating. Vertically, chunks are quite often full empty or full solid if you split them.

You may have other reasons you need tall chunks but I thought I would mention it. Mythruna didn’t arrive at it’s sizes by accident… and while I mostly treat the world the same as you do (even though I’m split vertically I still treat them together when loading from disk and generating terrain, etc.)… having them split can save a lot of time.

Chunks not generated take 0 nanoseconds to generate. :slight_smile:

On the client, the only other calls to the chunk at all are to decode entire chunks sent over the network in compressed form (ran async, the chunk doesn’t even try to refresh until this finishes) and a simple call to setBlockAt() in the SpiderMonkey thread to handle when a chunk is modified by a player. In my game, a 20ms delay to a message is meaningless.

I may just temporarily switch off the modification lock altogether and see if any issues arise. The actual data is already stored in an AtomicIntegerArray and I can’t think of any cases in which something bad would happen, but a gut feeling tells me otherwise.

A profiler tells me that 95%-99% of the JME thread’s time is spent rendering the world (in LWJGL packages) which should not be effected at all by any of my synchronization code.

I imagine this would make saving the maps work badly in that you either need to have a bunch of data stored more than once or the save file expands each time as chunks are generated around the edges of the currently generated area.

If your chunk sizes are 32x32x256 then just make the internal arrays 34x34x256. You never need to check a neighbor outside of that but the savings is HUGE. The only cost is that (in your case) you might have to update four chunks at once (at the corners). But that’s a place where the cost is easily unnoticed. (a small cost anyway)

But the savings you get avoiding edge misses is just huge.

Have you done any measurements for the exact size of these savings?
Also, how do you handle the saving of such chunks to the hard drive and sending them over the network? Do you just let all of the edge data get stored twice?

Yep. Small price to pay.

I mean, I did… but really it’s obvious.

This…

Is going to be waaaaaay faster than this:

For every edge miss… potenrially 4 x 32 x 256 in your case… you are going to do that slower access instead of an array lookup.

I have modified my isBlockSolid method to the following:

private boolean isBlockSolid(int x, int y, int z){
    if(y < 0){
        return true;
    }else if(y >= World.HEIGHT){
        return false;
    }else if(x < 0 || x >= Chunk.SIZE || z < 0 || z >= Chunk.SIZE){
        int gx = chunk.getCoordinates().getX() * Chunk.SIZE + x;
        int gz = chunk.getCoordinates().getZ() * Chunk.SIZE + z;
        return chunk.getWorld().getBlockAt(gx, y, gz).isSolid();
    }else{
        return chunk.getBlockWithLocalCoordiantes(x, y, z).isSolid();
    }
}

This is what caused the reduction to ~20ms. the world.getBlockAt() method just does some division, access a ConcurrentHashMap, and then access the other chunk which then access an AtomicIntegerArray and an AtomicReferenceArray. None of that seems particularly heavy nor are there any loading issues at all right now except for those caused by terrain generation that are easy to fix by having the server generate the world as soon as it starts.