SimEthereal endless exceptions


#1

Hi all

My sim-eth-es based game sometimes ends up in an endless stream of exceptions:

17:02:29,188 ERROR [StateCollector] Collection error
java.lang.RuntimeException: Very bad things have happened in the receivedAcks set.
	at com.simsilica.ethereal.net.StateWriter.startMessage(StateWriter.java:227) ~[sim-ethereal-1.2.1.jar:?]
	at com.simsilica.ethereal.net.StateWriter.startFrame(StateWriter.java:194) ~[sim-ethereal-1.2.1.jar:?]
	at com.simsilica.ethereal.NetworkStateListener.endFrame(NetworkStateListener.java:273) ~[sim-ethereal-1.2.1.jar:?]
	at com.simsilica.ethereal.zone.StateCollector.publishFrame(StateCollector.java:214) ~[sim-ethereal-1.2.1.jar:?]
	at com.simsilica.ethereal.zone.StateCollector.collect(StateCollector.java:246) [sim-ethereal-1.2.1.jar:?]
	at com.simsilica.ethereal.zone.StateCollector$Runner.run(StateCollector.java:301) [sim-ethereal-1.2.1.jar:

To be honest, I don't have the slightest clue. At the very least I would like to disconnect clients that cause this, but I have no idea where I should catch it or what hook I might have to implement to handle it. Any ideas?


#2

Not sure. When I put those kinds of messages in, it’s because I expect it to never happen in real life and so haven’t guessed at what might cause it.

Here is the source code for that exception with the comment there:

        // Just in case we'll put a watchdog in here.  The reason this
        // is unlikely to trigger is because the receivedAcks set only
        // grows when we've received a message from the client.  And at
        // that point we get to remove every receivedAck the client 
        // confirms.  So one new ID should always nearly empty the set.
        int size = receivedAcks.size(); 
        if( size >= 128 ) {
            throw new RuntimeException( "Very bad things have happened in the receivedAcks set." );
        }       

I guess it indicates that we are receiving messages from the client too fast or something?

I can’t really explain it because the receivedAcks set is only filled with unique message IDs from client messages. They are sending us the IDs of the messages they’ve received. So they are receiving messages from us, we are receiving messages from them… the set should stay small.

I guess if they receive a ton of messages from the server without their responses getting through, then they could have accumulated a large ID set. In that case, I suppose one received message could have more than 128 IDs in it… but then that client was unable to send to the server for a long time… but was somehow still receiving timely messages.

…it would be nice if the message had told us how big size was.

If you are particularly curious and adventurous, you could add that (a PR would also be nice), and locally you could do some additional logging when messages are received from the client to see whether a single message similarly has a large payload of acks.
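For reference, a minimal sketch of what including the size might look like, as a purely local tweak to the snippet above (the extra wording in the message is invented):

        // Sketch: include the actual size (and the offending IDs) in the
        // watchdog message so the server log shows how bad things got.
        int size = receivedAcks.size();
        if( size >= 128 ) {
            throw new RuntimeException( "Very bad things have happened in the receivedAcks set, size="
                                        + size + ", acks=" + receivedAcks );
        }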


#3

I was hoping for a well-known issue :wink:
OK. I'm not sure how I can use my own build of SimEthereal together with Gradle and my project, but I'll find out.

One more pointer: this normally only happened with a Mac as the client and Linux as the server (I use WLAN for my tests).

I’ll let you know if I have more news on that.


#4

If you have mavenLocal in your repositories list then it should be just a matter of doing gradle install in the sim-ethereal project.

…probably you want to tack something onto the end of the version or something (and refer to that in your build).
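For what it's worth, a rough sketch of what that might look like in the game's build.gradle; the “1.2.2-custom” version is just an invented suffix, so use whatever you set in the sim-ethereal build before running gradle install (and double-check the group/artifact against your local install):

    repositories {
        mavenLocal()     // picks up artifacts installed via 'gradle install'
        mavenCentral()
    }
    dependencies {
        // depend on the locally installed build instead of the released 1.2.1
        // ('compile' here; 'implementation' on newer Gradle versions)
        compile 'com.simsilica:sim-ethereal:1.2.2-custom'
    }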


#5

This issue has come up before on this forum, and it was also attributed then to sending too much info in too short a time, like sending too many entities / too much entity information over the line at the very beginning of a game… I guess it has to be paced somehow.


#6

Yeah, updates should be sent at ~20 times per second. I think 70 or 80 active (moving) objects could be sent in one frame, and there would be ~3 frames per message if your physics runs at 60 FPS.

But let's say you have 700 ACTIVE objects in the above scenario. In theory, you would then be sending 10 messages per frame… so each update would be sending 30 messages, 20 times per second.

For that watchdog to trigger, you’d have to have quadruple that amount of objects or the client messages would have to have been dropped for almost 1/4 of a second.
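Back-of-envelope, with the same assumed numbers (~70 objects per message, 3 frames per update, 20 updates per second):

    700 objects / ~70 per message  ≈ 10 messages per frame
    10 messages * 3 frames         = 30 messages per network update
    30 messages * 20 updates/sec   = 600 messages per second
    128 unacked IDs / 600 msgs/sec ≈ 0.2 seconds of lost client acks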

Given that it’s platform specific, I’m not sure that it’s an entity count issue, though.


#7

I have only a few entities: two ships and 6 obstacles, so I do not believe it is an entity-count issue. But that 1/4 of a second might well be the case…


#8

If you are running with the correct settings, there should only be 20 messages per second in that case. It would take 6 seconds of “client receiving messages but failing to send them back” before the noted problem occurred (as described in the scenario above).


#9

Here is one of them:


#10

Thanks. As I said, I only have 8 entities in the scene, and only two of them are mobile objects; the rest are static. So, for whatever reason, it happens when I connect an OSX client to my Linux game server near the beginning (within 6-10 seconds or so). But as I now understand the issue, I guess I am close to a fix. The only thing I'm missing is a way to detect it on the server (what do I have to put a try/catch around?) so I can at least get rid of a connection gone crazy, since it affects everyone (extreme slowdown). And of course I also want to find out why it happens…


#11

I admit … I hacked in some code that ‘fixed’ this problem.

        // Just in case we'll put a watchdog in here.  The reason this
        // is unlikely to trigger is because the receivedAcks set only
        // grows when we've received a message from the client.  And at
        // that point we get to remove every receivedAck the client 
        // confirms.  So one new ID should always nearly empty the set.
        int size = receivedAcks.size(); 
        if( size >= 128 ) {
            // This exception crashes my server when one player has a bad connection.
            //throw new RuntimeException( "Very bad things have happened in the receivedAcks set." );
            // Drop the first item from the TreeSet; if I'm doing this properly it should
            // remove the oldest message ID from the receivedAcks set, which means my
            // server will continue to work.
            ((TreeSet) receivedAcks).pollFirst();

            // However, we do want to dump some debugging information here, just in case
            // the connection being garbage isn't actually the problem.
            if ((System.currentTimeMillis() - printRecievedAckDebugInformationCooldown) > 3000) {
                System.out.println("Connection Information: " + conn.getAddress());
                System.out.println("Connection Attributes: " + conn.attributeNames());
                System.out.println("Connection Ack Miss Percent: " + stats.getAckMissPercent());
                System.out.println("Connection Ping Time: " + stats.getAveragePingTime());
                System.out.println("Connection Average Message Size: " + stats.getAverageMessageSize());
                System.out.println("Connection Total Message Bytes: " + stats.getTotalMessageBytes());
                printRecievedAckDebugInformationCooldown = System.currentTimeMillis();
            }
        } 

#12

If we knew why it happened then I might be more comfortable proposing a proper solution. At this point, the best solution I can think of is to kick that particular player.

Removing the first ID might be OK, but it also might leave cruft around on one end or the other. More likely, if the clients really are recovering right away, then simply upping that watchdog value is probably enough.

I’m really glad that you guys are finding these issues, though. I left those watchdogs in for this very reason… and so while I know it’s frustrating, it’s also part of the tightening/polishing process to make things bullet-proof.

Edit: regarding your debugging information, it might be nice to add the actual ack set, the current message ID, etc… I don't know whether those kinds of information are easy to get at in this method or not.

Edit 2: also, if those were logs instead of printlns then they would end up in the log file for easy browsing later.
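For example, that println block could be replaced with something along these lines; just a sketch using SLF4J (which sim-ethereal already uses for its own logging), assuming a logger field exists in the class and reusing the cooldown field from the hack above:

        // Sketch: route the debug output through SLF4J instead of System.out so it
        // lands in the server log with timestamps and levels.  Assumes a field like:
        //   static Logger log = LoggerFactory.getLogger(StateWriter.class);
        if( (System.currentTimeMillis() - printRecievedAckDebugInformationCooldown) > 3000 ) {
            log.warn("receivedAcks watchdog tripped: size={}, connection={}, ackMiss%={}, avgPing={}",
                     size, conn.getAddress(), stats.getAckMissPercent(), stats.getAveragePingTime());
            printRecievedAckDebugInformationCooldown = System.currentTimeMillis();
        }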


#13

OK, I did some more testing, and it only happens with my OSX machine; the Linux machines stay fine and never have an issue. So it must be something with that OS or with the ethernet card of that machine (it is a quite old MacBook Pro, and the FPS are also a bit low compared to my Linux machines).
With WLAN it happens pretty quickly (around 10 seconds or so); with LAN it happens after 1 or 2 minutes.

What I can say for sure is that throwing an exception all the time is very bad performance-wise, and I wish there were a less noisy way to handle it. I can also tell that it never recovers once this happens, unless I stop the OSX client. My suggestion, besides trying to find the reason (I guess my next step will be to build SimEthereal myself so that I can goof around in the source a bit), is to make that exception bomb go off less frequently, since it also makes recovery impossible as the server performance goes down. It would be good to see an error (but not an exception), and it would be good if, after being over 128 for a while, it automatically disconnected that client.

I'll keep you posted.


#14

I don’t mind both increasing the watchdog size and disconnecting the client if it’s encountered. The nature of the error is that bad things were happening for a while already… so if the client never recovers anyway then we might as well disconnect them.
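A rough sketch of how that combination might look at the same watchdog spot; purely illustrative, assuming conn is the SpiderMonkey HostedConnection as in the hack above, the SLF4J logger from the earlier sketch, and an arbitrarily chosen higher threshold:

        // Sketch: raise the watchdog threshold and kick the offending client
        // instead of throwing on every frame.  The 1024 limit is an arbitrary guess.
        int size = receivedAcks.size();
        if( size >= 1024 ) {
            log.error("receivedAcks set too large, size={}, disconnecting client={}",
                      size, conn.getAddress());
            // HostedConnection.close() kicks the client with a reason string
            conn.close("Network state desynchronized (receivedAcks overflow)");
        }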