This is a story about tackling those niggling little hard-to-reproduce bugs and how it can be good for both you and your codebase.
I have the dubious honour of being the go-to person for input issues here at Polystream. That means I get the privilege of tracking down and fixing a number of strange things that happen transiently during people’s sessions, including random mouse jumps and stuck key-presses.
And so begins our tale. Every now and again I would get someone complaining that one of their keys had got stuck while streaming an application.
“Can I see it?” I would ask.
“No, it cleared up as soon as I pressed another key”.
“Were you doing anything unusual when it happened?”
“Erm… OK then.”
This was a rare bug with no known way to reproduce it, so it went on the special pile marked “look into it later” (a.k.a. “ignore for now”). And there it stayed until we ran a 100 CCU test. Suddenly, with a lot more people playing for a lot longer, a rare and hard-to-reproduce defect became the number one issue raised by our testers.
Feeling slightly sheepish for having ignored this for so long, I sat down and actually tried to reproduce the problem. I started with Half-Life 2 (as that was the title I felt had come up most often in word-of-mouth bug reports) and just played it a number of times, streaming from my on-site server to my desktop PC. I did not see the bug. At this point, it would have been easy to just sigh, mark it as “cannot reproduce” and move on, but by now it was clear that there was something there, so I rolled up my sleeves and set about finding that bug!
“Right!”, I thought. “Maybe this is a latency/timing thing rather than a plain input handling error?” So I grabbed a coffee and started a new session playing from one of our servers in Amsterdam. After a little while playing, I saw a stuck key for the first time! And even better, I ran a few more runs and managed to see the bug again! So this leads me straight on to my first learning.
1. Don’t give up. If it happened before, it can happen again. Keep trying to find a way to reproduce the bug!
Upon sharing my bug reproduction (or “repro” for short) with the team there was instantly some ~~wild and unfounded speculation~~ thoughtful and reasonable discourse as to the cause. One thing that people were ~~forcefully claiming was definitely the issue~~ suggesting was a potential avenue for investigation was the complications around correctly handling system keys on Windows. I promptly managed to eliminate this possibility by refining my repro to one that only involved the right and left arrow keys, with no overlap between presses – in a certain spot slightly into the level, I would move side to side for about four minutes until, eventually, a key stuck. Even better, this now reliably reproduced the bug every time (eventually). This leads me to the second learning.
2. Try not to hypothesise too much before you know what’s actually going on. It is very easy to jump to conclusions that make you focus on total irrelevances.
An aside: you know that jumping mouse I mentioned earlier? Well, it turns out that was a good example of why it is important to be led by observations, not by your theories as to what is going wrong. In that case I managed to find out that the bug only happened on our cloud instances, and it happened even when no one was touching the mouse. Having stepped through and logged the code to oblivion and found nothing that corresponded with the mouse jump, I went back to basics and decided to study the events themselves. Timing with a stopwatch, I found that the jumps were exactly 4 minutes apart, which was too regular and too odd an interval to be a coincidence. Looking in the scheduled tasks on the machine I found that the Sysinternals BgInfo tool was running every 4 minutes. Bingo! Every 4 minutes the desktop background was being reset, causing Explorer to refresh and making the game window momentarily lose focus. No amount of debugging or tracing would ever have shown that up, but 30 minutes with a pencil, paper, and a stopwatch led right to the solution!
By this point, I had started to narrow down the field of investigation from the initial realms of focus issues, system key handling, and other weird and wonderful things. It was clear that something was going wrong with the handling of individual keypress events. And I also had a reliable repro.
Even though I had isolated what was going wrong, why it was going wrong was still a pretty big area, so I tried to narrow things down even further. I tried out my new repro back on the on-site lab machine but was unable to reproduce the problem. Then I added a little delay to my connection using a network emulator and – success! I had a repro in an environment that was much easier to debug.
As the network conditions were now under my total control I started altering them to see if I could further isolate the conditions that were causing the stuck keys. And as I did this my repro went away. This led me to learning number 3:
3. Cherish a repro of a rare bug and work fast – you do not know what is causing that repro to work right now, so you need to dig right in before anything changes. Record exactly what you are doing so you can go back and check that the issue is still occurring to guard against “false dawns” where you think you have fixed the bug but in fact have just broken the repro.
At this point, I became aware of how fragile the repro seemed to be, and of how many variables were at play that could be contributing factors – the amount of geometry being streamed, the length of the stream so far, the particular updates happening that frame, the bandwidth induced by the frame, the time of day, etc. So I decided it was time for a change of tack and started instrumenting the code that related to what I knew about the bug so far. In particular, this was the code that handled key-press events which I logged at every stage of the pipeline to see if I could spot a point where events were being dropped. I looked at the logs as I played and waited for the bug to re-occur. The bug struck again. All the events were there – there were no missing keystrokes. In hindsight, this led to learning number 4.
4. Pay attention to detail. This could have been a shorter story! It is a fact that 0xc8000021 and 0xcf006743 are not the same. However, if you have 100 lines of logging, almost all of which say 0xc8000021, and you are looking for missing data (not corrupted data), then it is surprisingly easy to see the second but read it as the first. If I had formatted my logging so that I could have done a diff between the client and server logs then the error would have shown up immediately. But I didn’t, so…
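To make that concrete, here is a minimal sketch of diff-friendly logging. It is in Python with entirely hypothetical helper names and event values (our real pipeline is not Python); the idea is just that fixed-width, identically formatted lines on both ends let a plain `diff` do the spotting for you.

```python
import difflib

def format_event(seq: int, msg: str, payload: int) -> str:
    # Zero-padded, fixed-width fields keep every line the same shape, so a
    # single corrupted payload stands out as a one-line diff instead of
    # hiding in a wall of near-identical hex.
    return f"{seq:06d} {msg:<10} 0x{payload:08x}"

# Hypothetical logs: the server sent three identical key-ups, but the
# client received a garbled payload for the second one.
server_log = [format_event(i, "WM_KEYUP", 0xC8000021) for i in range(3)]
client_log = list(server_log)
client_log[1] = format_event(1, "WM_KEYUP", 0xCF006743)

# Keep only the changed lines ("-" = server had, "+" = client got).
diff = [line for line in difflib.unified_diff(server_log, client_log, lineterm="")
        if line[:1] in "+-" and not line.startswith(("+++", "---"))]
for line in diff:
    print(line)
```

With this format the mismatch is two lines of diff output rather than one misread value in a hundred lines of scrolling.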
We seemed to be sending all the right input events in the right order, but things were sometimes not working. Tracking causes is one thing, but what we really care about are outcomes, so it makes sense to try to track them directly instead. I decided to add logging to the captured and generated Windows messages on the server and the client to try to find some disparity – maybe we were altering some global state that was affecting the way messages were created or processed? I added the logging and tried running the game. I tried several times and I couldn’t get the bug to return. Adding logging had fixed the problem! It was absolutely, certainly a timing issue.
5. Don’t jump to conclusions. With rare bugs and uncertain repros you do not really know why things are happening. At this point I had verified (incorrectly) that all the events were getting through and had just ascertained (by random coincidence) that adding logging to the messaging code made the bug disappear. This very much looked like a timing issue in our input code. That conclusion was totally false. Luckily, given the nature of the bug I was wary and still on the lookout for confirmatory or contradictory evidence. And that was just what I was about to find.
I have danced this dance with flaky bugs many times before, and I know that until you have checked something an infinite number of times (or some close approximation thereof) it could just be random chance calling the shots. And so I tried to reproduce the bug again. And again. And again. And then one more time for luck, and it turns out luck was on my side, because this time I got a stuck key. And this time I did pay attention to the details and noticed that one of the WM_KEYUP messages did not have the same payload as the others. The contents were garbled. And now I smelt blood.
Looking back over the logs from the previous runs (keep all your logs when investigating things like this – you never know what you will want to go back and look for) I spotted the earlier mismatches between the event code we had sent and the event code we had received. This instantly caused me to focus on one particular piece of code: our encryption.
As some of the things people type in games are passwords and other sensitive information, we encrypt our keyboard stream to keep it safe from prying eyes. I added logging to look at the inputs and the outputs of our key-press encryption and ran the game again. This time instead of just seeing a stuck key event I saw a spurious keypress that opened up a menu. And suddenly the whole thing became clear.
Thinking about what happens when you mangle a keypress, it quickly became apparent why this bug was so infrequent. A garbage keypress event will most likely be for a non-existent keycode and hence ignored. If it is for a valid keycode it is 50% likely to be a keyup event for a key that is not pressed and so will have no effect. Hence the vast majority of corruptions will not generate any visible effect. And if the now-missing original event was a key-down, then that will also go unnoticed much of the time as a held down key (like the arrow keys for movement) will be almost immediately repeated. Those repeats for held keys make up the vast majority of key events, meaning that most missing key events will pass without notice. It was only when the mangled key event was a key-up that we got a noticeable error. Suddenly a raft of strange one-off “What happened there?” bugs could be explained.
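The arithmetic behind that intuition can be sketched with a toy model. Every number here is an assumption for illustration (how many keycodes are valid, which keys are held), not measured data; the point is only that enumerating every possible garbled event shows most of them doing nothing visible.

```python
# Toy model: assume 256 possible keycodes, of which half map to real keys,
# and that only two keys (the held arrow keys) are currently down. A garbled
# event is some (keycode, is_keyup) pair; count how many would be visible.
VALID_KEYCODES = set(range(128))   # assumed: 128 of 256 codes are real keys
HELD_KEYS = {0x25, 0x27}           # assumed: left/right arrows held down

def is_visible(keycode: int, is_keyup: bool) -> bool:
    if keycode not in VALID_KEYCODES:
        return False                # unknown keycode: the game ignores it
    if is_keyup:
        return keycode in HELD_KEYS # key-up for an unpressed key: no effect
    return True                     # spurious key-down: visible (e.g. a menu opens)

events = [(code, up) for code in range(256) for up in (False, True)]
visible = sum(is_visible(code, up) for code, up in events)
print(f"{visible}/{len(events)} garbled events visible "
      f"({100 * visible / len(events):.0f}%)")
```

Even with these generous assumptions, three out of four corruptions pass silently, and the "stuck key" case (a key-up for a held key) is rarer still – which matches how seldom the bug surfaced.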
At this point, I was feeling good. I had a strong hypothesis for what was going on, I understood how the bug caused the perceived effects, and how the nature of the bug led to the semi-random occurrence of the problem. Now I dived into the code to look for likely issues and straight away saw a loop that was running over a buffer which wrapped-around when it got to the end. A little logging and yes, plain as day there was corruption every time the buffer wrapped around!
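I obviously can't reproduce our actual encryption code here, but the shape of the bug can be sketched hypothetically: an XOR keystream read out of a wrap-around buffer, with an off-by-one where the read position wraps, so the two ends of the connection disagree about the keystream only after a wrap.

```python
KEYSTREAM = bytes(range(16))  # toy 16-byte keystream buffer

def xor_correct(data: bytes, pos: int) -> bytes:
    # Reference behaviour: wrap cleanly back to index 0.
    return bytes(b ^ KEYSTREAM[(pos + i) % len(KEYSTREAM)]
                 for i, b in enumerate(data))

def xor_buggy(data: bytes, pos: int) -> bytes:
    # Buggy wrap-around: resets to index 1 instead of 0, so every byte
    # processed after the wrap uses the wrong keystream byte.
    out = bytearray()
    for b in data:
        out.append(b ^ KEYSTREAM[pos])
        pos += 1
        if pos >= len(KEYSTREAM):
            pos = 1               # the off-by-one
    return bytes(out)

message = b"WM_KEYUP right"       # 14 bytes, hypothetical payload
# Encrypt with the buggy side, decrypt with the correct side, starting
# 10 bytes from the end of the buffer so the stream wraps mid-message:
garbled = xor_correct(xor_buggy(message, 10), 10)
# A short message that never crosses the wrap survives the round trip:
intact = xor_correct(xor_buggy(message[:4], 10), 10)
```

In this sketch everything before the wrap decrypts perfectly and everything after it comes out garbled – corruption every time the buffer wraps, and only then.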
So where was the puzzle? We knew the error, we understood how it caused the behaviour we saw, and in this case we understood how an error that occurs very frequently can cause only infrequent problems. Well, encryption is both a very nicely self-contained piece of code and one that you don’t want to get wrong. So we had unit tests. We had tests that covered this particular piece of code. We had tests that covered the wrap-around of this particular buffer. And the code had not been changed since the original code and the relevant tests were added to the codebase.
I ran the test. It passed. This surprised me. I double-checked that the test was right and that it was exercising the actual code I was looking at. Maybe the code that looked wrong was actually right? Maybe I hadn’t run the right test? I ran the test again, and it passed again, but this time I happened to notice that it didn’t re-compile after a change to the code. It turns out that we only had our tests as part of the Debug configuration of our Visual Studio solution, and if you run a test while the Release configuration is active, Visual Studio will happily re-run your last Debug build!
A quick switch of configuration and rebuild of the tests, and I was running the Release build – and now the test failed. And so we have learning 6.
6. Run your tests in both Debug and Release. Debug will trigger a bunch of runtime checks that will help your tests to catch more errors, but Release will test the code you are actually shipping! Also do actually run your tests – if you don’t make running your test part of your development “tight loop” then you will lose a lot of the benefit, not least the benefit of constantly checking that your tests are easy to run!
Why did the tests pass in Debug but not in Release? Well, in Release the memory past the end of the buffer was random garbage, but in Debug it was always the same buffer-overrun guard value that the CRT adds to check for corruption, and in this particular case that happened to cover up the bug!
But the final icing on the cake came when we went to add running the release configuration of the tests to our CI scripts. It turns out that due to a bug in our CI, at some point we had stopped running these unit tests anyway! Which leads us nicely on to the final learning:
7. Make failure to run any tests a test failure!
So in the end, tackling head-on a rare and slightly annoying input bug fixed an error in our encryption code and possibly fixed a whole raft of strange one-off problems. But more importantly, checking that there was a failing test before making the fix uncovered a number of issues with our testing framework!
I could have given up trying to reproduce the bug…
I could have written it off as an unknowable timing error and relegated it to the backlog…
I could have just submitted the fix without running the tests…
I could have just run the test after making my fix and seen it pass…
But in each case, a bit of tenacity and double-checking paid off! So that is my take-away and the reason I wrote this up for you to read. Don’t give up, don’t ignore things, don’t assume. Double-check both the problem and the fix. Sometimes it is really worth it.