07 Nov 2021
This week I get even more technical than usual describing a bug that turns out to be a nest of bugs.
On Tuesday evening I continue a play-test of a long-running game. I’m trying to determine if there is enough food on a map to survive until planted blueberry bushes provide fruit. Along with a few minor issues I encounter a very annoying bug where Dwergs are unable to fight. I save the game so I can investigate further but when I load it on Wednesday morning the bug does not happen.
Thursday morning and although I’ve fixed a bug related to loading combat I have not found the reason for the failing-to-melee bug. This is disproportionately discouraging to me. I’ve not been able to reproduce it. I write additional tests that disprove a hypotheses that would explain the bug. I’ve examined the code and found some leads to investigate. Investigating those leads is going to take time. I’ve already spent a day on this, how much longer do I invest in it?
I justify continuing to fix this bug because the player could lose all their Dwergs as a result of it. But really I just don’t like the feeling that I don’t understand the systems I’ve created. I’ve lost control of my software!
While fixing I find something suspicious. Using source control I trace back where the offending line of code was introduced, how it had moved a couple of times, was refactored into a function of its own and re-used by some other parts along the way. It was originally fired when a Dwerg was attacked. I remove the offending line of code and see what tests fail. One of the failing tests is “Removing floors over solid terrain removes water”. This test creates a world with some water and a Dwerg mines it out and we assert the water is gone. With my change the water remains. Looking at the log I find that a Flintworm was spawned while the Dwerg was mining. What!?!?!?
At some point I’d added a block of flint to my test world. Then I made the first piece of mined flint always spawn a Flintworm. The diversion code I introduced recently causes the Dwerg to attack any hostile creature in its path. Attacking the creature causes the Dwerg to cancel it’s current job. My recent change is to not consider a cancelled job as ‘failed’. Failed jobs are returned to a provider’s job queue. If I cancel a job but don’t return it to the job-queue it won’t ever be completed. So that explains why the water remains and tells me why my change is not a good one.
The problem with move and attack jobs is they don’t have a queue. There’s one of these jobs per agent at a time. If an agent is already performing one of these jobs and it is then instructed to do another move job, the move-job-in-progress is considered as failed and the new job replaces it in the move-job provider. When we come to resolve the failed jobs we re-create the new job but with a lower priority. Now that new job is below the priority for awareness to kick in and the awareness system will start another new job, replacing the old new job and so the process repeats and the Dwerg is stuck failing to melee.
That was one hypotheses that did not explain the failing-to-melee bug.
I can’t manually reproduce this bug. It involved me doing a lot of micro-managing trying to get my Dwergs into a position where three of them are facing a single Copperpede, like so:
So can I write a unit test to reproduce it?
I’m distracted by two other bugs I notice while debugging: 1. There are melee-combats but the combatants are not adjacent to each other. This shouldn’t be happening. 2. When a need-satisfying job fails any move-job for the agent has its priority reduced. Do I address these or are they somehow the cause of the failing-to-melee bug?
If I write failing tests for the other two bugs I can get a handle on how I might write a test for the failing-to-melee bug.
Writing a failing test for the need-satisfying job fails reveals this is even worse - the move-job has its priority reduced to the lowest possible priority.
While trying to reproduce melee-combats where combatants are not adjacent, I find another bug with the path-finding. Finding a route to the closest in a cluster fails when a route is found for one destination in the cluster but fails for a more distant destination in the cluster. This is the cause of the failing-to-melee bug. I write a unit test to verify and three lines of code to fix it.
Now I question why does mining the first piece of flint always spawn a Flintworm? This was added so a later tutorial that references Flintworms makes sense. I change that script so that it only references Flintworms if they have been encountered and now I can remove always spawning at the first flint-mining.
On Friday I’m almost finished the Weapons and Inventory tutorial but find inconsistencies that prevent the scripts from listening for inventory-related jobs. Some changes to the script-moving code and the merchant moves beside the crafted weapon and the player’s Dwerg moves on top of it and the explanation for the inventory makes sense.
That tutorial is close to finished. I want to play-test it a bit more so it flows well and I’d like to prevent the player’s Dwerg from moving while the inventory is being explained.
For the next week I’ll be tackling what I hope are the last two tutorials; the Hunting, Tanning and Armour tutorial and the Ore, Smelting and Bronze tutorial.
Enter your email to receive a monthly summary of development progress on Dwerg Saga.