Friday Facts #340 - Deep desyncs

Posted by Klonan on 2020-03-27

Not mentioning it would be weird

I really think everybody has heard all about this and nothing else over the last few weeks, but yes, the Coronavirus.

For now, with Factorio, everything seems okay. We are all working from home, the team is still going, and so far we are following our plan quite well. We released the Character GUI and Statistics GUI last week, and some improvements such as new water splashes and leaf animations this week. Things are still moving along.

However it is still early days, we haven't really had any experience having the whole team work remotely, so there may be some challenges we need to tackle as time goes on. At the moment we don't know whether this will affect our 1.0 release date, I guess it will one way or the other, but for now we aren't announcing any changes.

Business as usual

Apart from the development side still running, our e-shop is also remaining operational, and we have just restocked on all variants of our t-shirts. While we can't guarantee anything, if you order from us at this time, we should still be able to get your t-shirts to you.

Deep desyncs

Last week there we had quite a few more players than usual, and Krastorio 2 was released, which meant a lot more hours into a lot more areas of the game. During the week, Boskid and I received quite a few desync reports with mods. Generally we believe, that it is probably the mods causing the problem, as it is quite easy to cause a desync with the control scripting if you don't know some very important gotchas.

But still, we decided to investigate to help the players find out which mod is causing the problems.

Serpent cyclic references

One quite tricky desync we found, is related to cyclic references, and the way serpent serialises the global Lua data.

Take the example here of the Construction Drones mod. You have a player which dispatches the drones to go do work; the player needs to keep track of what drones he owns, and the drones need to remember which player they belong to.

Now this works fine during runtime, you can access the drone from the player object, and access the player from the drone object. When the game is saved, Serpent goes through all the data in 'global' and serialises it for later. To handle the cyclic references, if serpent detects that it has 'seen' an object already, it writes a placeholder value, and comes back to fix it later.

The problem, is that serpent chose nil as the placeholder value. In Lua, writing a table value as nil is the same as deleting the key, and the key won't be seen when looping the table in the future. When serpent then goes back to 'fix-up' the placeholder values, it ends up with each peer saving a different table ordering (because of some even deeper technicalities with our custom version of Lua).

So the problem is deep and was quite hard to find, but the fix is quite simple. Boskid simply changed the placeholder value from nil to 0, and now the iteration order is saved and loaded correctly every time.

Unit group max speed

Another report came in from a player using Krastorio 2 and Rampant mods. This one on the surface really seemed like the mods fault, since all the diffs in the desync reports were related to unit positions, and the Rampant mod is all about units.

The desync was extremely fragile, and was very hard to reproduce, but eventually Boskid managed to narrow it down. This is a screenshot taken on the exact tick that the desync occurred.

What jumped out to me as immediately suspicious, is that the biter is only just moving onto the creep. This is because the creep (from Krastorio 2) gives the units a speed penalty when walking on it. I did a quick experiment by commenting out the tile modifier code in the engine, and the desync did not occur again.

But of course removing the code here would only treat the symptom, and the cause will be something deeper (FFF-296). So with lots of patience, Boskid narrowed it down further and further, deep into the movement and unit behaviour logic. What we found, is that the unit is part of a unit group, and this group was caching the value of its 'max speed' based on the slowest unit in the group. However this value was not saved with the save game, but was recalculated each time the game was loaded. Since the unit was walking on creep, its speed was affected, so the group calculated a lower max speed when the game was loaded.

Now this logic has been in the game since unit groups began, but it only became a problem more recently. In versions of Factorio 0.16 and before, a units max speed was always going to be based on its prototype, which does not change after the game is loaded. With version 0.17 we added 2 (really nice) mod capabilities:

  • Allowing units to be affected by tiles;
  • Allowing scripts to change unit speeds directly.
It didn't come up as a problem initially, as the freeplay base game doesn't really use these capabilities.

Well every change has the potential to break things. Since we know the only 2 cases where the units speed can change, Oxyd made it so that the unit will notify the group to recalculate its max speed when it is necessary.