Friday Facts #242 - Offensive programming

Posted by kovarex on 2018-05-11

Hello,
this post is going to be more technical than usual, but some of you might still find the background of the process interesting.

Why are there suddenly so many problems with save loading in experimental?

0.16.36 is stable, but since we needed to fix some additional bugs, we kept publishing experimental releases, 0.16.37 up to 0.16.42 and so on. From the player's perspective, these introduced a lot of new bugs: unloadable saves, weird train problems, and even the infamous version 0.16.40, which disabled all signals and caused a trainocalypse that even made a baby cry.

Most of these problems (apart from the trainocalypse) were actually caused on purpose. It might sound weird, but it should make sense to you soon.

Offensive programming

As far as I can tell, offensive programming is the best way to keep a complicated codebase like Factorio relatively bug-free in the long run. The general idea is that when the code has a rule that must always hold no matter what (this is called an invariant), we should never just ignore it when the rule is broken.
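To make the contrast concrete, here is a minimal sketch of the idea. It is not Factorio code: the `Inventory` type and the `checkInvariant` helper are hypothetical names made up for illustration, and an exception stands in for whatever hard failure a real program would use.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: report a broken invariant loudly instead of patching around it.
inline void checkInvariant(bool condition, const std::string& description)
{
  if (!condition)
    throw std::logic_error("Invariant broken: " + description);
}

// Hypothetical toy type. Its invariant: the item count is never negative.
struct Inventory
{
  int count = 0;

  // Defensive style: silently clamp, which hides the bug in the caller.
  void removeDefensive(int amount)
  {
    count -= amount;
    if (count < 0)
      count = 0;
  }

  // Offensive style: refuse to continue, so the faulty caller is found early.
  void removeOffensive(int amount)
  {
    checkInvariant(amount >= 0, "removed amount must not be negative");
    checkInvariant(amount <= count, "cannot remove more items than are stored");
    count -= amount;
    assert(count >= 0); // postcondition: the invariant still holds
  }
};
```

The defensive version keeps the program running with a plausible-looking state, while the offensive version makes the first violation visible at the place where it happens.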

Invariant examples in Factorio that were broken

One of the invariants we had was that there can never be a wall on top of another wall. Normally this can't happen, as the player can't build a wall over another wall. With a script, you can generally place entities even where they would collide, but in the case of walls, not even a script can build two walls on top of each other. The reason is that walls connect to each other: if there were two walls in the same place, then in multiplayer a neighbouring wall piece could connect to a different wall for different players, which eventually could (and did) cause desyncs. Similar invariants are set for belts, pipes and rails.

The second part of the problem appeared when the deconstruction planner and blueprints came into play. We gradually changed the game so that you can mark any area of the factory for deconstruction and plan a blueprint over it even before the area is cleared. So suddenly, you can have a belt marked for deconstruction with another belt (in the form of a ghost) on top of it.

The third part of the problem was that we decided that ghost belts and walls should connect with each other so it looks nice, as explained in fff-211.

To allow these 3 things to co-exist, we had to make several changes. The first is something you might have noticed: belts and walls get disconnected when marked for deconstruction.

The second thing was to allow walls (or belts) on top of each other as long as one is marked for deconstruction and the other is a ghost. As things marked for deconstruction are not candidates for connection, the connection candidate is still well defined.
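The relaxed rule can be written down as a small predicate. This is only a sketch under simplifying assumptions: the `WallState` enum and all function names are hypothetical, and a tile is reduced to a list of wall states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplification: a wall is built, marked for deconstruction, or a ghost.
enum class WallState { Built, MarkedForDeconstruction, Ghost };

// Entities marked for deconstruction do not take part in connections.
inline bool isConnectionCandidate(WallState state)
{
  return state != WallState::MarkedForDeconstruction;
}

// Two walls may share a tile only if one is marked for deconstruction
// and the other one is a ghost.
inline bool mayOverlap(WallState a, WallState b)
{
  return (a == WallState::MarkedForDeconstruction && b == WallState::Ghost) ||
         (a == WallState::Ghost && b == WallState::MarkedForDeconstruction);
}

// A tile is consistent when every pair of walls on it is allowed to overlap.
inline bool tileIsConsistent(const std::vector<WallState>& wallsOnTile)
{
  for (std::size_t i = 0; i < wallsOnTile.size(); ++i)
    for (std::size_t j = i + 1; j < wallsOnTile.size(); ++j)
      if (!mayOverlap(wallsOnTile[i], wallsOnTile[j]))
        return false;
  return true;
}

inline std::size_t countConnectionCandidates(const std::vector<WallState>& wallsOnTile)
{
  std::size_t count = 0;
  for (WallState state : wallsOnTile)
    if (isConnectionCandidate(state))
      ++count;
  return count;
}

// On a consistent tile, at most one wall can be a connection candidate.
inline bool candidateIsWellDefined(const std::vector<WallState>& wallsOnTile)
{
  assert(tileIsConsistent(wallsOnTile)); // offensive: broken tiles must not get this far
  return countConnectionCandidates(wallsOnTile) <= 1;
}
```

Because the pairwise rule never allows two walls of the same kind on one tile, a consistent tile holds at most two walls, and only the ghost can be a connection candidate.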

You can imagine that making sure the first invariant still holds is not trivial. For example:
  • Make sure that a ghost on top of an entity marked for deconstruction gets removed when the deconstruction is cancelled.
  • Make sure that using teleport fails if the result would conflict with the invariant.
  • Make sure there are no other ways it could happen that we are just not aware of.
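The first two points could look roughly like this in code. Again, this is a sketch, not the actual implementation: the `Tile` type and its methods are hypothetical, and a tile is assumed to hold at most one wall and one ghost wall.

```cpp
#include <cassert>

// Hypothetical, heavily simplified tile: at most one wall and one ghost wall.
struct Tile
{
  bool hasWall = false;
  bool wallMarkedForDeconstruction = false;
  bool hasGhost = false;

  // A wall and a ghost may share the tile only while the wall is marked for deconstruction.
  bool invariantHolds() const
  {
    return !(hasWall && hasGhost) || wallMarkedForDeconstruction;
  }

  // Placing a ghost is refused when it would sit on a wall that is not marked.
  bool tryPlaceGhost()
  {
    if (hasGhost || (hasWall && !wallMarkedForDeconstruction))
      return false;
    hasGhost = true;
    return true;
  }

  void markForDeconstruction()
  {
    if (hasWall)
      wallMarkedForDeconstruction = true;
  }

  // Cancelling deconstruction brings the wall back, so a ghost on top of it must go.
  void cancelDeconstruction()
  {
    if (!hasWall || !wallMarkedForDeconstruction)
      return;
    wallMarkedForDeconstruction = false;
    hasGhost = false;
  }

  // Teleporting a built wall here fails if the result would break the invariant.
  bool tryTeleportWallHere()
  {
    if (hasWall || hasGhost)
      return false;
    hasWall = true;
    assert(invariantHolds());
    return true;
  }
};
```

The third point is the one this kind of code cannot cover: it only protects the paths somebody thought of.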

The last point is the biggest problem, because if there is some other way the invariant can be broken, I want to know about it rather than have to investigate very complex desync reports. But how do I check that this never happens without affecting performance in normal games? For this kind of thing, we have a method we call a "consistency check". It goes through the whole map and runs various integrity checks on it. The checks take quite some time, so running them on every save/load would slow the game down too much. We decided to run them only on a version transition, which also includes a version change of any mod. The check can also be executed manually using a console command:

/c game.consistency_check()

The question is what to do when the check fails. To make sure that it actually gets reported, we decided that when the consistency check fails, the game instantly stops (crashes) and writes the cause and stack trace into the log. This forces users (at least some of them) to give us a bug report, so we can try to figure out what is going on. After that, we can just re-activate the migration that removes conflicting entities in a new version to make the save loadable again.
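Put together, the load path described above might be sketched like this. All names here (`SaveHeader`, `isVersionTransition`, `loadSave`) are hypothetical, the log is just a vector of strings, and an exception stands in for the real crash with a stack trace so that the example stays testable.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using ModVersions = std::map<std::string, std::string>; // mod name -> version

// Hypothetical summary of what a save (or the running game) was built with.
struct SaveHeader
{
  std::string gameVersion;
  ModVersions modVersions;
};

// A transition is any change of the game version or of any mod version.
inline bool isVersionTransition(const SaveHeader& save, const SaveHeader& running)
{
  return save.gameVersion != running.gameVersion ||
         save.modVersions != running.modVersions;
}

using ConsistencyCheck = std::function<std::vector<std::string>()>; // returns violations

// Runs the expensive check only on a version transition and fails hard on any violation.
inline void loadSave(const SaveHeader& save,
                     const SaveHeader& running,
                     const ConsistencyCheck& runCheck,
                     std::vector<std::string>& log)
{
  assert(static_cast<bool>(runCheck)); // a check function must be provided
  if (!isVersionTransition(save, running))
    return; // normal load: skip the slow check

  const std::vector<std::string> violations = runCheck();
  if (violations.empty())
    return;

  for (const std::string& violation : violations)
    log.push_back("Consistency check failed: " + violation);
  // Stand-in for the crash: stop loading instead of carrying on with a broken map.
  throw std::runtime_error(violations.front());
}
```

The point of the design is that the cost is paid rarely, but a failure cannot be overlooked when it does happen.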

The train bugs

All the train bugs originated from the same problem. I figured out that rail signals marked for deconstruction didn't disconnect from the rail and could block the building of blueprints with rail signals on top of them. Fixing this changed the invariant for the rail signal from "always connected to a rail if possible" to "connected only when not marked for deconstruction". Most parts of the code were fixed properly, but one particular piece of code re-connected the signal even when it was marked for deconstruction. This made the internal state inconsistent and the save unloadable until the migration that rebuilds rail segments was re-activated for the next version transition.
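The changed signal invariant, and the kind of leftover code path that breaks it, can be illustrated with a toy model. The `RailSignal` struct below is hypothetical and reduces a signal to three flags; `reconnectUsingOldRule` is an invented stand-in for the forgotten piece of code, not the real one.

```cpp
#include <cassert>

// Hypothetical toy model of a rail signal.
struct RailSignal
{
  bool railAvailable = true;
  bool markedForDeconstruction = false;
  bool connected = true;

  // New invariant: connected exactly when a rail is available
  // and the signal is not marked for deconstruction.
  bool shouldBeConnected() const
  {
    return railAvailable && !markedForDeconstruction;
  }

  // This is what a consistency check would test for every signal.
  bool isConsistent() const
  {
    return connected == shouldBeConnected();
  }

  void updateConnection()
  {
    connected = shouldBeConnected();
    assert(isConsistent());
  }

  void markForDeconstruction()
  {
    markedForDeconstruction = true;
    updateConnection();
  }

  void cancelDeconstruction()
  {
    markedForDeconstruction = false;
    updateConnection();
  }

  // Stand-in for a code path still written against the old invariant
  // ("always connected if possible").
  void reconnectUsingOldRule()
  {
    connected = railAvailable;
  }
};
```

Calling `reconnectUsingOldRule()` on a marked signal leaves it connected, so `isConsistent()` returns false. In this toy model that is the state a consistency check would catch on the next version transition.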

Conclusion

So now you know why there are many more crashes when loading games, and hopefully you hate me less now that you know it was done for the sake of long-term code correctness.

As always, let us know your thoughts and feedback on our forum.