SimEthereal, BufferOverflowException

Mithrin · February 23, 2019, 6:56pm

I’ve occasionally been having a problem that crashes my game server.

07:22:25,916 ERROR [StateCollector] Collection error
java.nio.BufferOverflowException
	at java.nio.Buffer.nextPutIndex(Buffer.java:521) ~[?:1.8.0_172]
	at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:169) ~[?:1.8.0_172]
	at com.jme3.network.serializing.serializers.ByteSerializer.writeObject(ByteSerializer.java:51) ~[jme3-networking-3.3.0-SNAPSHOT.jar:3.3-6587]
	at com.jme3.network.serializing.serializers.ArraySerializer.writeArray(ArraySerializer.java:124) ~[jme3-networking-3.3.0-SNAPSHOT.jar:3.3-6587]
	at com.jme3.network.serializing.serializers.ArraySerializer.writeObject(ArraySerializer.java:109) ~[jme3-networking-3.3.0-SNAPSHOT.jar:3.3-6587]
	at com.jme3.network.serializing.serializers.FieldSerializer.writeObject(FieldSerializer.java:202) ~[jme3-networking-3.3.0-SNAPSHOT.jar:3.3-6587]
	at com.jme3.network.serializing.Serializer.writeClassAndObject(Serializer.java:458) ~[jme3-networking-3.3.0-SNAPSHOT.jar:3.3-6587]
	at com.jme3.network.base.MessageProtocol.messageToBuffer(MessageProtocol.java:73) ~[jme3-networking-3.3.0-SNAPSHOT.jar:3.3-6587]
	at com.jme3.network.base.DefaultServer$Connection.send(DefaultServer.java:582) ~[jme3-networking-3.3.0-SNAPSHOT.jar:3.3-6587]
	at com.simsilica.ethereal.net.StateWriter.endMessage(StateWriter.java:437) ~[sim-ethereal-1.2.1-SNAPSHOT.jar:?]
	at com.simsilica.ethereal.net.StateWriter.flush(StateWriter.java:451) ~[sim-ethereal-1.2.1-SNAPSHOT.jar:?]
	at com.simsilica.ethereal.NetworkStateListener.endFrameBlock(NetworkStateListener.java:226) ~[sim-ethereal-1.2.1-SNAPSHOT.jar:?]
	at com.simsilica.ethereal.zone.StateCollector.collect(StateCollector.java:262) [sim-ethereal-1.2.1-SNAPSHOT.jar:?]
	at com.simsilica.ethereal.zone.StateCollector$Runner.run(StateCollector.java:313) [sim-ethereal-1.2.1-SNAPSHOT.jar:?]

Any thoughts on how to troubleshoot this problem? I turned off message splitting with this:

getService(EtherealHost.class).getStateListener(conn).setMaxMessageSize(65535);

because of other problems we previously encountered.

Thanks!

pspeed · February 23, 2019, 9:39pm

Your messages are getting too big for JME’s poopy Serializer. Since it’s based off of buffers instead of streams, the only way to write objects is to guess at some size and hope you don’t run out of RAM for picking a size too big. SpiderMonkey picks 32767 as the max message size… mostly because the data size in the protocol was already a signed short.

https://github.com/jMonkeyEngine/jmonkeyengine/blob/master/jme3-networking/src/main/java/com/jme3/network/base/MessageProtocol.java#L69


      
          package com.jme3.network.base;
          
          import java.nio.ByteBuffer;
          import com.jme3.network.Message;
          
          /**
           *  Consolidates the conversion of messages to/from byte buffers
           *  and provides a rolling message buffer.  ByteBuffers can be
           *  pushed in and messages will be extracted, accumulated, and 
           *  available for retrieval.  The MessageBuffers returned are generally
           *  not thread safe and are meant to be used within a single message 
           *  processing thread.  MessageProtocol implementations themselves should
           *  be thread safe.
           *
           *  <p>The specific serialization protocol used is up to the implementation.</p>
           *
           *  @author    Paul Speed
           */ 
          public interface MessageProtocol {
              public ByteBuffer toByteBuffer( Message message, ByteBuffer target );
              public Message toMessage( ByteBuffer bytes );
              public MessageBuffer createBuffer();
          }
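
To illustrate the point about fixed-size buffers and the signed short, here is a minimal sketch. It is not jME's actual code; the class `FixedBufferDemo` and its helpers are hypothetical names made up for this example, and only the 32767 figure comes from the post above.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

class FixedBufferDemo {
    // Assumption for illustration: the limit equals the largest signed short (32767).
    static final int MAX_MESSAGE_SIZE = Short.MAX_VALUE;

    /** Returns true if a payload of the given size fits a buffer of fixed capacity. */
    static boolean fits(int capacity, int payloadSize) {
        ByteBuffer buffer = ByteBuffer.allocate(capacity);
        try {
            // A buffer cannot grow; writing past its capacity throws.
            buffer.put(new byte[payloadSize]);
            return true;
        } catch (BufferOverflowException e) {
            return false;
        }
    }

    /** Shows what a length becomes once it is squeezed into a signed 16-bit field. */
    static int asSignedShort(int length) {
        return (short) length;
    }

    public static void main(String[] args) {
        System.out.println(fits(MAX_MESSAGE_SIZE, 32000));  // true
        System.out.println(fits(MAX_MESSAGE_SIZE, 65535));  // false
        System.out.println(asSignedShort(32767));           // 32767
        System.out.println(asSignedShort(65535));           // -1
    }
}
```

The second call fails the same way the stack trace at the top of the thread does: the writer keeps putting bytes into a buffer that was sized up front. The last line shows why 65535 cannot be represented as a length in a signed short.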

One of my big mistakes in life was not rewriting the Serializer when I rewrote the rest of SpiderMonkey… but now it is what it is.

Well, you should fix your real problem, then… since you’ve tied SimEthereal’s hands.

Mithrin · February 23, 2019, 9:44pm

SimEthereal crashes if I let it split messages, as per our previous messages about this (probably over a year ago now).

So my real problem is that SimEthereal has a message splitting bug I don’t know how to fix?

pspeed · February 23, 2019, 9:51pm

In my timeline, a year ago might as well be a decade.

Are the messages getting too big because you have tons of objects or because bad connections are letting ACKs accumulate?

Edit: also are you running the latest SimEthereal or an older version?

Mithrin · February 23, 2019, 9:54pm

I get around 90 physics objects maximum at any given time.

I am running an older version because I didn’t want to update to the latest 4 days before releasing my game. (I am using version 1.2.1, I think, from before you redid the time sync stuff.)

In the above error message I had 5 players joined.

Mithrin · February 24, 2019, 2:08am

Did you address message splitting at all in the new release?

asser_fahrenholz · February 24, 2019, 8:34am

Mithrin wrote: “Did you address message splitting at all in the new release?”

Did you check the release notes?

pspeed · February 24, 2019, 8:39am

Which new release? I’m still not sure which release you are running, so I can’t comment on which new changes might help fix/resolve/debug this issue. I’m pretty sure I haven’t fixed anything directly related to your issue. (A link to the previous thread would be helpful, by the way.)

The point is that even if I did somehow find and fix something then it’s still an uphill battle for you to use it anyway… so I’m unlikely to get much feedback on whether something I do fixes it or not. That’s all.

I’m going to try some tests to see if I can reproduce it locally.

pspeed · February 24, 2019, 9:16am

Missed this before. So ignore my comments about not knowing which version you are running.

pspeed · February 24, 2019, 9:53am

So in local testing, if I create a giant number of objects (for me)… like 200+ in the local zone… everything works fine until I leave the space that can see that zone.

I then get an error about bad things happening in the ack processing or something like that. Upping the message size to 32000 fixes that (and is small enough to avoid any issues with SpiderMonkey). So there seems to be something I can try to fix.

However, these parameters tend to make me worry about your setup. Like number of objects, whether you’ve tweaked the update rates, or if you are sending object updates faster than 60 Hz, etc.

In the normal setup, objects will be updated at some application-defined frequency… I guess usually 60 FPS or less on the server. By default, the state collector then bundles and sends these 20 times per second… so it will try to send three frames per message.

Somehow, in your setup, you are managing to exceed 32767 bytes in message size for 1/20th of a second’s worth of data. Considering that 80 or so constantly moving/rotating objects can fit in under 1500 bytes, that seems pretty crazy.
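
The arithmetic behind that comparison can be spelled out in a few lines. This is only a back-of-the-envelope sketch using the numbers quoted above (60 updates per second, 20 sends per second, about 80 objects in under 1500 bytes); the class and method names are hypothetical, and it assumes message size scales roughly linearly with object count, which the real protocol does not guarantee.

```java
class BundlingMath {
    /** Frames bundled into one message when updates outpace sends. */
    static int framesPerMessage(int updatesPerSecond, int sendsPerSecond) {
        return updatesPerSecond / sendsPerSecond;
    }

    /** Upper bound on bytes spent per object, given a message size and object count. */
    static double bytesPerObject(int messageBytes, int objectCount) {
        return (double) messageBytes / objectCount;
    }

    /** Objects needed to reach a size limit at the same density (linear scaling assumed). */
    static long objectsToFill(int limitBytes, int messageBytes, int objectCount) {
        return (long) limitBytes * objectCount / messageBytes;
    }

    public static void main(String[] args) {
        System.out.println(framesPerMessage(60, 20));        // 3
        System.out.println(bytesPerObject(1500, 80));        // 18.75
        System.out.println(objectsToFill(32767, 1500, 80));  // 1747
    }
}
```

At that density it would take on the order of 1700 objects to fill a 32767-byte message, which is why roughly 90 objects overflowing it points at something other than raw object count.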

Now, clearly there is something bad going on when objects leave visibility… and I’m not willing to rule out that it is a cumulative error. (Though local testing has not indicated that it is.)

pspeed · February 24, 2019, 11:29am

For what it’s worth, I have locally solved the watchdog overflow problem (at least in prototype form), ie: seeing this exception:

https://github.com/Simsilica/SimEthereal/blob/master/src/main/java/com/simsilica/ethereal/net/StateWriter.java#L249


}

public void startFrame( long time, ZoneKey centerZone ) throws IOException {

    // Watchdog to check for mismatched time sources.
    long delta = Math.abs(time - timeSource.getTime()); 
    if( delta > 1000000000 ) {
        // more then a second difference means they are waaaaaay off.
        // Even a ms difference would be large.
        log.warn("Mismatched time sources.  Delta:" + (delta/1000000000.0) + " seconds");
    }

    // End any previous frame that we might be in the middle of    
    endFrame();
 
    // Reset the message counter.  We use this to see how many messages
    // we split a frame into.  Note: it could stay 0 if we are stacking
    // multiple frames into a single message.        
    messagesPerFrame = 0;
            
    // Make sure we have a current message started

If that was your old problem then it may be fixed by the upcoming changes.

If your old problem was different then a link to the old thread/messages would be helpful.

zissis · February 24, 2019, 1:45pm

@pspeed this whole thread is deja-vu for me. Remember when I stress tested your library in production with 1500 simultaneous users about 3 years ago and found this bug? I spent a few weeks finding the root cause and patching my version of the code. I do seem to recall you fixing it at some point because today I am running your latest version.

pspeed · February 24, 2019, 1:51pm

It’s possible… it also could have been something different.

Locally I’m able to replicate the problem by rapidly spawning objects. One per 0.1 second with a 30 second decay. Overall that’s a churn of about 300 objects where every 0.1 second 1 is created and 1 is removed… very taxing on the network code.
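
For anyone wondering where the 300 comes from: it is just spawn rate times lifetime (10 per second for 30 seconds). A tiny simulation confirms it. This is a hypothetical sketch, not the actual stress test; `ChurnDemo` and its parameters are invented for illustration.

```java
import java.util.ArrayDeque;

class ChurnDemo {
    /**
     * Spawns one object every spawnIntervalMs, each living for lifetimeMs,
     * and returns how many are alive after the last spawn tick.
     */
    static int steadyStateCount(int spawnIntervalMs, int lifetimeMs, int simulatedMs) {
        // Expiry times in spawn order, so the oldest object is always at the head.
        ArrayDeque<Integer> expiryTimes = new ArrayDeque<>();
        for (int now = 0; now < simulatedMs; now += spawnIntervalMs) {
            // Remove everything whose lifetime has run out.
            while (!expiryTimes.isEmpty() && expiryTimes.peekFirst() <= now) {
                expiryTimes.removeFirst();
            }
            // Spawn one new object on this tick.
            expiryTimes.addLast(now + lifetimeMs);
        }
        return expiryTimes.size();
    }

    public static void main(String[] args) {
        // One spawn per 0.1 s, 30 s decay, simulated for 60 s.
        System.out.println(steadyStateCount(100, 30_000, 60_000));  // 300
    }
}
```

Once the first 30 seconds have passed, every tick removes one object and adds one, so the population stays at 300 while the set of live IDs keeps changing.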

I’ve committed a fix to master that is available to anyone who builds from head.

Changes can be seen here:
https://github.com/Simsilica/SimEthereal/commit/6ef246767f30daad27c16a4e94a7e4ae246c347b

Locally it fixes my problem.

This stress test is a bit unrealistic but it’s good for tracking down issues… as I already have two more problems to look into triggered by this test (but not a more realistic test).

Edit: note that most of the changes in the diff are the addition of trace logging. The actual fix is just to make the watchdog max variable based on message lag conditions.

Mithrin · February 24, 2019, 3:40pm

Hello Pspeed,

First, thank you, I really appreciate your help. I fully admit that debugging this thing is a bit beyond my skills as a programmer.

When we previously had issues, the size >= 128 line ‘fixed’ it; however, I never had the opportunity to test with more players. Since I released my game, I’ve been getting more players and then encountered this error.

I have not updated any of the update rates.

We previously had 80 objects max when I tested this the last time you asked about it. With 5 players it’s possible that it’s bumped up a little from there, but honestly not much.

I simply don’t create/destroy game objects quickly enough to get more than around 100 at any given time.

However, I do frequently change zones, very frequently in fact.

Twitch to illustrate the game. Zone switching happens a fair amount, especially on larger maps. Perhaps it’s an issue with the scale I’m using? Each tile is 16 units and my grid size is 256 … the map in the video is 55x55 tiles, I think?

From the discussion so far it sounds like I should update to the latest master and set my max message size to 32000. Does that sound correct?

Again, thank you for your help pspeed, your code has taught me so much, I am super grateful.

Wobblytrout

pspeed · February 24, 2019, 10:12pm

Your game always looks so cool.

Definitely try master if you can. Your current message size is also way too big so yeah, if you set it at all then definitely lower it to 32000 or so.

Splits were causing the issue faster because the code was burning through message IDs faster. You may find that you don’t need to worry about that any more and can go back to the default. That being said, if you aren’t experiencing performance issues with the larger message size then you might as well leave it big (32000 or so) to avoid splitting at all.

Mithrin · February 25, 2019, 2:45am

Thanks man, it means a lot … I couldn’t have made it without your help.

I will update all my stuff to master, thank you!

Wobblytrout

pspeed · February 26, 2019, 1:22am

Just a note: I’ve just committed some additional changes which should help a lot in these cases.

In my own local testing I have 300 active objects where one is being created every 0.1 seconds and one is being destroyed every 0.1 seconds. So there is a constant churn of about 300 objects.

I was seeing a lot of strange issues when crossing zone boundaries. Missing baseline messages, etc.

One thing that was nagging at me was that as the ACK lists grew (because of message lag or whatever), that array would take up more and more of the object state messages. This concerned me because SimEthereal tries to be a really tight protocol (I count every bit). I had the idea that mostly (always?) these ACK lists would be one contiguous set of values. I can’t really count on “always” so I decided to write a Set implementation that internally keeps track of ranges.

Thus a new class: IntRangeSet (and unit test suite)

https://github.com/Simsilica/SimMath/blob/master/src/main/java/com/simsilica/mathd/util/IntRangeSet.java


I then converted the internal ACK tracking to use it.
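
The general idea of a set that stores ranges instead of individual values can be sketched like this. To be clear, this is not the real `IntRangeSet` from SimMath and does not mirror its API; `SimpleIntRangeSet` is a hypothetical, stripped-down illustration of the technique (add and contains only, no removal).

```java
import java.util.Map;
import java.util.TreeMap;

class SimpleIntRangeSet {
    // Maps each range start to its inclusive end. Ranges never overlap or touch.
    private final TreeMap<Integer, Integer> ranges = new TreeMap<>();

    boolean contains(int value) {
        Map.Entry<Integer, Integer> entry = ranges.floorEntry(value);
        return entry != null && entry.getValue() >= value;
    }

    void add(int value) {
        if (contains(value)) {
            return;
        }
        int start = value;
        int end = value;

        // Extend the range that ends directly below the new value, if any.
        Map.Entry<Integer, Integer> below = ranges.floorEntry(value);
        if (below != null && below.getValue() == value - 1) {
            start = below.getKey();
            ranges.remove(start);
        }

        // Absorb the range that starts directly above the new value, if any.
        if (value != Integer.MAX_VALUE) {
            Integer aboveEnd = ranges.remove(value + 1);
            if (aboveEnd != null) {
                end = aboveEnd;
            }
        }
        ranges.put(start, end);
    }

    /** Number of stored ranges, which is what would go on the wire. */
    int rangeCount() {
        return ranges.size();
    }

    public static void main(String[] args) {
        SimpleIntRangeSet acks = new SimpleIntRangeSet();
        for (int id : new int[] {3, 1, 2, 5, 4}) {
            acks.add(id);
        }
        System.out.println(acks.rangeCount());  // 1, the single range 1..5
        acks.add(10);
        System.out.println(acks.rangeCount());  // 2, ranges 1..5 and 10..10
    }
}
```

When the IDs are contiguous, as ACK lists mostly are, the whole set collapses to a single start/end pair no matter how many IDs it holds.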

When looking at the SentState message reading/writing, I noticed it has been calculating its header size wrong all this time… I also noticed that it was only using 8 bits for the array size. Back when SimEthereal was always throwing exceptions at >= 128 IDs, this would never have been a problem… but now that I let the buffer grow based on message lag, anything over 255 ACKs in the array would cause issues.

Fortunately, all of that code has been replaced and now only ranges are sent… and in all of my testing, only ever one range. So instead of 4*ACK count bytes of header, it takes up 7 bytes of header. (And I could probably reduce that further by a byte or three.)
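
A quick sketch of both points, the 8-bit count and the header savings. This is illustrative only: `AckHeaderDemo` and its methods are hypothetical, and the 4 bytes per ACK and 7-byte range header are simply the figures quoted above, not a description of the actual wire layout.

```java
class AckHeaderDemo {
    // Figure quoted in the post for the range-based header.
    static final int RANGE_HEADER_BYTES = 7;

    /** What a count looks like after being written into and read back from 8 bits. */
    static int readBackFromOneByte(int count) {
        byte stored = (byte) count;  // only the low 8 bits survive
        return stored & 0xFF;        // read back as an unsigned value
    }

    /** Header cost when every ACK id is sent individually at 4 bytes each. */
    static int listHeaderBytes(int ackCount) {
        return 4 * ackCount;
    }

    public static void main(String[] args) {
        System.out.println(readBackFromOneByte(255));  // 255, still fits
        System.out.println(readBackFromOneByte(300));  // 44, the reader sees the wrong length
        System.out.println(listHeaderBytes(300));      // 1200 bytes as a list
        System.out.println(RANGE_HEADER_BYTES);        // 7 bytes as one range
    }
}
```

With 300 pending ACKs, an 8-bit length field would tell the reader there are only 44 entries, so everything after that point in the message gets misread.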

Bottom line: the new code is much more stable, maybe a little faster, and a lot more line-efficient on the network.

…I can also run my 300 cycling objects test without any issues. A good sign.

Edit: do note that to build SimEthereal right now, you will also need to build SimMath. Should be as easy as gradle install in both projects (assuming you’re using gradle for your projects).

Mithrin · February 26, 2019, 3:07am

Oh wow, that is awesome. Thank you pspeed!

I will update my game as soon as I can so I can try things out. Hopefully this weekend I’ll have enough time.

Thanks

asser_fahrenholz · February 26, 2019, 11:43am

When I use gradle for dependencies, does ‘snapshot’ account for the latest commit? Or do you not build inter-release commits to the jcenter/?-repositories?

asser_fahrenholz · February 26, 2019, 11:43am

Good work by the way @pspeed