Multithreading approach for many AI

guillaumevds · May 6, 2015, 8:59am

Hi all,

As this is my first post I want to thank all the contributors to the great project that is jmonkeyengine. I will surely contribute if one day I write something worth sharing.

At the moment I’m still learning (and having a lot of fun) while writing a little game for which I have a general design question that is not even jmonkeyengine specific.

In my game I have what I call enemies (100 or 200 at most). Those enemies require to do cpu intensive tasks like pathfinding.
For that reason I’ve put their logic in callables. Each enemy has its own callable and is called many times per seconds. Many is about 60 right now but that may change.

It’s working great when I only read final attributes.

As expected, things start to get weird when I stop caring about thread safety. (For example accessing other enemies’ positions as is).

There is a couple of solutions (like copying everything I need or using java’s sync/atomic mechanisms) but I don’t know where I should head to.

I see a lot of downsides for each approach. I fear there is no such thing as a best way and I don’t want to implement-test-profile them all if that can be avoided.

What would you do if you were in my shoes ?

Please keep in mind that :
(1) there is about 50-100 and up to 200 enemies
(2) I’m not a multithreading guru
(3) each enemy has its own callable
(4) each callable is called many times per seconds right now but I may slow that down and extrapolate what I can.
(5) the callable is updating many enemy attributes (it’s not just about pathfinding)
(6) I’m doing my best to fix (2). (3),(4), (5) and (6) can change if you have better ideas.

Best regards and thanks again to all contributors,

Darkchaos · May 6, 2015, 9:33am

I 've actually thought about a similar Problem but I didn’t have the time to implement it.

What you could do, though, is simply having all logic done in jme’s update thread and only multithread stuff like pathfinding.

That way you have a kinda frame-lock for the positions.

You could also work with lock() as simple method to ensure you don’t have cocolissions. You’ll have race conditons though.

Empire_Phoenix · May 6, 2015, 11:57am

Well I use a entity system so the general logic is already kinda providide, but alos possible without one.

At some point of your update process, copy all state to different variables. Then do all calculation on those.
If you do DataObject replacing instead of object updateing, you will only see either the new or the old object but no half updated stuff inbetween.

guillaumevds · May 6, 2015, 12:49pm

Thank you both for your answers !

Darkchaos : yes putting almost everything back into the main thread is probably a good idea as it’s easy and makes everything less complex too. I like to keep things simple but I’d like to dig further in multithreading before I give up.

Empire_Phoenix : “If you do DataObject replacing instead of object updateing, you will only see either the new or the old object but no half updated stuff inbetween.”

That would be perfect because I don’t always need up-to-date positions.

Right now the enemy updates its position in the callable through the position’s reference (I don’t clone() the Vector3f before using/modifying it in the callable). As long as enemies are not looking at each other’s positions it’s fine.

Are you saying that all I have to do is to clone() the position’s vector before passing it to the callable ? (and then, when the callable is done, just use this new reference as “THE” new position)

That sounds almost too easy to be true … I’ve already tested that in the past and still had problems. I’ll try again tonight and give you a feedback. Thank you sir.

pspeed · May 6, 2015, 2:30pm

Well, you still have to make sure that your cloning is done consistently and in a thread safe way.

Generally, having 100 callables all smashing into one another is going to be tough to manage unless you also have discrete steps where everyone synchs.

So for example, you could have all of your logic AI run single threaded and those logical tasks can send requests out to background threads for things that will take a long time (pathfinding, etc). Those requests all hit a consistent state because if the logical AI is updating something it’s not updating the global state but some intermediate form.

Either it updates local state that gets pushed to global state at the end of the AI logic pass or the AI logic pass really just adds “commands” to some queue that is flushed at the end of the AI logic pass. In either of these cases, the long-running background tasks would have to “synchronize” against some world ‘lock’ when they want to access something.

An ES is a little different because everyone would already have their own local view of anything they were interested in. So the pathfinding workers would have their own copy of the world state they are interested in and only update that when they aren’t busy. Versus grabbing a world lock every time they want to look at something.

This is where an ES is particularly elegant at the cost of a little RAM (the world state that gets copied is only a tightly targeted subset of the whole world state.)

guillaumevds · May 6, 2015, 3:19pm

thank you for the information pspeed.

According to what I understand about multithreading and my own code I think I can safely go the Empire_Phoenix’s way.
I failed to do that in the past but I don’t understand why. I’ll carry on this evening/night as it’s just about adding two lines in my code.

May I ask you gurus yet another question ? The answer to this one is probably more obvious.
I’m using Callables many times per second for many enemies. That many many sounds like too much to me.
Shouldn’t I just use my own threads (looping forever) with my own discrete syncs/sleeps ? That would bypass the whole thread pool queueing / scheduling and the callable’s method call.

pspeed · May 6, 2015, 3:27pm

It’s not really clear specifically how you are doing it now… so it’s hard to say if anything would better.

Is it a giant thread pool? A loop over a list? Too many questions.

guillaumevds · May 6, 2015, 4:12pm

Yes sorry about that.
Below is what I do for the moment. Please pardon the possible naiveness of the approach. It’s from my biological memory and oversimplifyed but it should be enough to understand the idea.

update loop method (tpf) {
    for each (enemy) {
        accumulate_tpf();        
        if (enemy's callable isn't done yet) {
            // do nothing
        } else {
            if (callable_has_interesting_results) {
                read_results_from_future();
                update_scene_graph (example: update positions, rotations, attacks, animations, make explosions for things that need to explode, ...)
            }
            if (time elapsed since last callable submission is long enough) {
                copy_information_needed_by_thread_in_an_information_provider_object();
                submit_callable_once_again_and_pass_it_the_information_provider_object();
                empty_the_tpf_accumulator();
            }
        }
    }
}

This shows the flows between the main update loop and callables.
For example the main update loop keeps track of each enemy’s tpf by “accumulating” it. As the AI is not executed in sync with the update loop I have to keep track of the “real” tpf for each of them.

note : by “time elapsed since last callable submission is long enough” I mean the period for which the frenquency is lower than 60 Hz.
copy_information_needed_by_thread_in_an_information_provider_object() : this is where I’ll pass the position.clone() instead of just position;

note : this is in update_scene_graph that I plan to extrapolate data so I can reduce the callables’ execution frequency.

Submitting a callable (even if it’s not new but reused) to the thread pool each time seems overkill to me …

guillaumevds · May 6, 2015, 4:49pm

Ok I’m back home and unfortunately “.clone()” is not that magic
I’ll dig further later this night.
Thank you all anyway, that will help me.

Tryder · May 6, 2015, 7:48pm

Well, like pspeed said, it’s hard to get a grasp on exactly what you’re doing in your AI threads, but I would probably throw them all into a single thread that runs alongside the main thread. So you might have something like:

public class AIThread extends Thread {
    private boolean cancelled = false;

    private boolean updatingQueue = false;
    private List<AICallable> queue = new LinkedList<AICallable>();
    private List<AICallable> pending = new LinkedList<AICallable>();

    public AIThread() {
    }

    public boolean isCancelled() {
        return cancelled;
    }

    public void cancel() {
        cancelled = true;
    }
    
    @Override
    public void run() {
        while (!isCancelled()) {
            for (AICallback ai : queue) {
                ai.call();
            }
            queue.clear();

            updatingQueue = true;
            queue.addAll(pending);
            pending.clear();
            updatingQueue = false;
        }
    }

    public void addAI(AICallback ai) {
        while(updatingQueue) {
            continue;
        }

        pending.add(ai);
    }
}

Obviously in your ai.call() method you’d wrap any code that updates the scenegraph in a callable that is queued up in the thread manager so it’s run on the main thread.

You could also use synchronized in some places here, and while that definitely keeps things safe, it’s SLOW.

Moreover as a rule of thumb I try not to create so many different threads. I’ve heard that you’re not going to get any performance benefit if you use more than one or two threads per core. Certainly you can use more, but spawning a couple hundred threads to run simultaneously is probably not that great of a solution. Plus there’s some overhead involved with spawning a new thread, running all your AIs in a single thread alongside the main thread limits this overhead to just one thread rather than 100-200!

Plus your AIs aren’t running simultaneously to one another so accessing a variable in another AI won’t cause a thread safety problem.

Empire_Phoenix · May 6, 2015, 8:02pm

Basically in main thread
→ Copy all data for callables
→ let callable run
→ -> calalble enques another callable to main thread to write back/copy results

That way all changes to global view are only done in main thread.
All callables are complelty independent from each other.

Tryder · May 6, 2015, 8:17pm

@Empire_Phoenix I’m curious, how do you get around the performance issues when running so many threads at once? If there’s no performance gain from running more than one - two threads per core, but there is a performance hit for each thread spawned then it seems that, for performance reasons, everything would run smoother using just a single thread to iterate over the AIs.

I suppose you could split this into multiple threads on multi-core systems, 2-8, but a couple hundred threads just seems like the CPU is getting slammed too hard.

Empire_Phoenix · May 6, 2015, 10:00pm

Well a ES system only looks at a very small part of the problem, usually small enough to fit completly into L2-L3 cpu cache. This is reached by having multiple dump threads in my case.
Also my ES uses threadpools for some systems. It would be trivial to change all Systems to only use one shared threadpool. Point is with my ES more threads make it actually simpler to code due to its nature.

I currently run around 50 threads concurrently with no noticeable performance impact at all.
What you have to keep in mind, most threads are done with their work before they are even once switched by the Operatingsystem scheduler, so they pose no real cost.

Tryder · May 6, 2015, 10:17pm

I guess I wasn’t considering the amount of workload a thread poses. I haven’t really done much in the way of coding AIs that do very little work in their threads. The game I’m working on now is a turn based strategy so the AI has a lot more work to do than the AI for an FPS for example.

Most of the time the AI is pretty quick, but as the AIs empire expands the threads start running longer and longer, sometimes they re-run several times in the event the AI encounters a new item of interest and needs to re-evaluate its decisions for the turn. All in all it’s still pretty quick, but can take a good number of seconds.

Plus memory becomes an issue with concurrent threading, for me. Originally I was running the boundary generators alongside one another in separate threads. Usually there’s not more than one boundary being generated at a time, but sometimes there is and when they run side by side the memory consumption doubles, triples and so forth. So now I run them in serial, it’s fast enough that it’s not noticeable, but saves on memory.

Empire_Phoenix · May 7, 2015, 6:18am

Well of course your target system is highly relevant for your inner engine design.
For example I target nothing less than quad core with 16gb ram (as it will e quite common once I finish the game )

Tryder · May 7, 2015, 5:47pm

Heh, my system is a dual core with 3GB of RAM

I feel those with low end systems need game lovin’ too :), but that certainly doesn’t mean that people shouldn’t target high end systems either. I’m trying to make my game somewhat scalable, an option for particle detail and lighting detail which switches between the use of normal-mapped per-pixel specular shaders and non-normal mapped vertex lit diffuse shaders exist, though the options menu itself does not yet exist.

Kecon · May 7, 2015, 8:32pm

I suggest reducing the amount of threads to about 20 using a fixed thread pool (Executors.newFixedThreadPool). Too many threads is only resulting in a lot of task switching and more time for threads to wait for locks.

I’m writing AI for a population in a simulation game, and aim to use about 1000 persons or more in the game. I execute orders for a person in a runnable. Running the following code each tick (yes, it’s Java 1.8, this game will never run on Android or IOS)

public void run() {
        final List<Future<?>> futures = this.population.getPersons().stream()
                .map(person -> this.executorService.submit(new PersonTick(this.world, person, this.getIdleOrder()))).collect(Collectors.toList());

        for (final Future<?> future : futures) {
            try {
                future.get();
            } catch (InterruptedException ignored) {
            } catch (ExecutionException e) {
                this.logger.error("Unknown error occurred", e);
            }
        }
    }

PersonTick implementation is using a state machine for executing orders, and each order has a cost, and several orders may be executed at one tick, if they are cheap.

public PersonTick(final World world, final Person person, final Order idleOrder) {
        super();
        this.world = world;
        this.person = person;
        this.idleOrder = idleOrder;
}

@Override
public void run() {
        try {
            int ticks = 0;
            while (true) {
                if (this.person.getOrders() == null || this.person.getOrders().isEmpty()) {
                    this.idleOrder.execute(this.world, this.person);
                } else {
                    final Order order = this.person.getOrders().get(0);
                    final int cost = order.getOrderCost(this.world, this.person);

                    if (ticks + cost < MAX_TICK) {
                        break;
                    }

                    OrderResult result = order.execute(this.world, this.person);
                    if (result == OrderResult.REMOVE) {
                        this.person.getOrders().remove(0);
                    }

                    ticks += cost;

                    if (ticks == cost) {
                        break;
                    }
                }
            }
        } catch (Exception e) {
            LOGGER.error("Unknown error occurred for {}", person, e);
        }
}

The order contain all information and the code for executing the order to reduce time for looking up implementations. Using this structure makes it easier to write unit tests for each order as well.

I recommend to use immutable objects if possible, since you don’t need to use synchronization for them. I generate special aggregated immutable maps for path finding in a separate thread. I put the latest version of the aggregated map in an AtomicRef, to make it available for each person in the population.

guillaumevds · May 8, 2015, 10:17am

Ok, news about my project : AI is now working as expected AND in a multithreaded way My problems were coming from a not-so-thread-safe and not-so-important block of code that I’ve now commented and that I’ll rewrite differently one day. Cloning my positions was not even needed.

@Kecon: gg for you using the new java 8 features. Have you ever tried to measure their performance effects in the context of game programming ? I haven’t measured anything yet myself but I’m really interested by the results. (ex: traditional foreach vs streamParallel()). This seems a bit off topic tho so maybe we should use private message

Also I like the way you use costs. Simple, clever, efficient

Kecon · May 8, 2015, 4:58pm

I’ve not made any tests my self, but I’ve read some blogs, and what they said is that’s not a big difference. The streamParallel() might help some, but it all depends how you use it. There is usually no gain if you have a few entries, you should probably have 1000+ before you will gain something, since there is a setup cost for using streamParallel (dividing work load and acquire threads). You will probably not gain so much if you already is using a lot of threads where they all are using streamParallel().

Use the new Java 8 features if it makes your code more readable and easier to maintain. Don’t do preoptimizations, since it’s hard to know exactly where your bottlenecks will be, and you will probably only waste time. I suggest start with just stream() and modify it later if there are any problems.

atomix · May 10, 2015, 3:54am

Nice! This is kind of what I’m doing. Let’s check RxJava if you didn’t have a look yet.

The idea is “streaming signals” of AI activities into the rendering pipelines. So what ever you do in so call “AI threads”, let the rendering thread know after awhile to get it update.

However some will find RxJava so big for his small game, or don’t want to restructure his multi threading algorithm. I think he can go with you approach.

But I think there are still some un-certain points in this design. The Java collection utils just promised some cases of thread-safe modifications. I don’t see your implementation code in detail. But I from what I’ve read here can smell something is not right. May be just me…