Chasing Ghosts: When a bug isn't a bug

Have you ever investigated a bug that was triggering consistently, and just as you thought you’d nailed the repro steps, it suddenly stopped occurring? It can be infuriating. Was it wasted time? Should you try just one more time? Should you just log the bug anyway? Should you ignore it and risk an active bug remaining undetected? You were probably exploring a tangent to a test when you first saw it, a test which has now far exceeded its estimated execution time. Should you pass the test if you don’t have a bug to log the failure against?

In this article I’ll share some stories of times when I, or other testers I’ve worked with, have spent a lot of time chasing ghosts, often with no productive outcome at all.

No legitimate bug. Just pure wasted time. By being aware of these examples you’ll be better equipped to spot when the same thing is happening in your own team and act sooner to prevent the frustration and the wasted effort.

For each example I’ll give details about how the ghost bug was triggered and what could have been done to avoid wasting time investigating it. In every case the problem was entirely avoidable, caused by an error, a misunderstanding or the inexperience of the tester investigating the bug. These inefficiencies should not be accepted as a reality of testing, but treated as an opportunity to learn and improve.

Invalid performance bugs

Setting up and executing performance tests is tricky because there are so many factors that can influence the results and paint a false picture. In my book I describe it as being like conducting a science experiment: you can only change one variable at a time if you want meaningful, comparable results. When measuring game performance of any kind, the main variables include: processing hardware (GPU, CPU, memory), display hardware, hardware drivers, operating system version, device state (parallel apps), device charging state, network connection type, game build flavour (debug/profile/release), dev environment (connected services), state of over-the-air configs, state of over-the-air asset delivery, player profile maturity and direct player actions. That’s a lot of moving parts to consider before beginning a performance test. While they might not all be relevant to every game or every test, it pays to be aware of them when analysing results, as we’ll see in the next example of chasing performance ghosts.
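As an illustration of keeping those variables honest, here is a minimal sketch (not taken from any team’s real tooling; all class and field names are hypothetical) of recording the test environment alongside each performance run, so that two runs are only ever compared when the uncontrolled variables actually match:

```python
from dataclasses import dataclass, asdict

# Hypothetical record of the conditions a performance run was captured under.
@dataclass(frozen=True)
class PerfRunContext:
    build_version: str         # the one thing we *expect* to differ between runs
    build_flavour: str         # debug / profile / release
    device_model: str          # processing and display hardware
    os_version: str
    driver_version: str
    server_environment: str    # which dev environment the client points at
    ota_config_version: str    # state of over-the-air configs and assets
    network_type: str          # wifi / cellular
    charging: bool
    profile_maturity: str      # early-game or late-game player profile

    def comparable_to(self, other: "PerfRunContext") -> bool:
        # Everything except the build version must be identical, otherwise a
        # difference in the results may be a ghost rather than a regression.
        a, b = asdict(self), asdict(other)
        a.pop("build_version")
        b.pop("build_version")
        return a == b


live = PerfRunContext("1.24.0", "profile", "PhoneX", "13", "1.2.3",
                      "dev-1", "cfg-42", "wifi", False, "late")
candidate = PerfRunContext("1.25.0", "profile", "PhoneX", "13", "1.2.3",
                           "dev-1", "cfg-42", "wifi", False, "late")
print("comparable:", live.comparable_to(candidate))  # True
```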

These ghost bugs come from testing performance on live ops games, which were run as a continuous service and where performance testing was only interested in one question: “Is the new release more or less performant than the live game?” Rather than having a concrete performance bar, the goal was to verify that we weren’t making the existing game’s performance significantly worse with the update. It was accepted that new features and content would increase memory usage and sometimes lower the frame rate, but the delta should be proportional to the feature additions. Given this brief, analysing the performance results and deciding what counted as a bug was not trivial: every data point was relative to the new content and features we were adding to the game.

During this testing we looked at frame rate, loading times and memory usage across multiple game areas. We also split the test coverage across the supported platforms (Android and iOS), again across a high-end and a low-end device for each platform, and again across early-game and late-game player profiles, creating a total of 8 basic performance flows. The test script was written to define the route through the game precisely and to perform each action at specific timestamps during the 30-minute flow. The plan was to create graphs that could be overlaid onto each other to allow for easy comparison. While the approach was generally successful and allowed us to compare performance over many game releases, there were multiple occasions where we spent time investigating false performance bugs that turned out to be non-issues.
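To make the overlay comparison concrete, here is a rough sketch, under my own assumptions rather than the tooling we actually used, of plotting two FPS traces from the scripted flow on top of each other. It assumes each run has been exported as a CSV of “seconds_into_run,fps” samples; the file names are invented for illustration:

```python
import csv
import matplotlib.pyplot as plt

def load_trace(path):
    # Each row is "seconds_into_run,fps" for one sample of the scripted flow.
    times, fps = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            times.append(float(row[0]))
            fps.append(float(row[1]))
    return times, fps

# Overlay the live build against the release candidate for one of the 8 flows.
for label, path in [("live build", "live_android_highend_lategame.csv"),
                    ("release candidate", "rc_android_highend_lategame.csv")]:
    t, f = load_trace(path)
    plt.plot(t, f, label=label)

plt.xlabel("Seconds into the 30-minute scripted flow")
plt.ylabel("FPS")
plt.title("High-end Android, late-game profile")
plt.legend()
plt.savefig("fps_overlay.png")
```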

One common offender was the unique network activity that occurred on the first game launch, which could make the loading screen particularly long depending on the network speed. The game would download any missing assets on first launch, and with moderate file sizes the player-facing impact was noticeable. Not only did this skew the loading time recordings, which were taken as an average of 5 game launches, but the extra time spent on the simple graphical loading screen would also improve the average frame rate for the entire session, skewing that result too. To add to the confusion, the quantity of assets downloaded would change over time and the download time was unique to the network speed of each device. Not recording the first launch of a freshly installed game quickly became part of our test setup, but despite that, this anomaly still tripped up some of our testers, who either reported the results anyway or spent time investigating the increased loading on first launch, thinking it was a bug.
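The eventual mitigation is simple enough to show in a few lines. This is only an illustration with made-up numbers, but it captures the rule we adopted: never let the fresh-install launch into the average.

```python
# Hypothetical loading times in seconds; the first sample is the fresh-install
# launch, inflated by the asset download over the network.
launch_times_s = [41.2, 12.8, 13.1, 12.6, 13.0, 12.9]

# Drop the first launch of a freshly installed game before averaging.
steady_state = launch_times_s[1:]
average = sum(steady_state) / len(steady_state)
print(f"average load time over {len(steady_state)} launches: {average:.1f}s")
```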

Another common offender on mobile was the operating system throttling the CPU in an attempt to lower the device temperature. We had multiple instances across different releases where periods of lowered frame rate were seen during the performance runs. The FPS graph would look normal, then drop off a cliff partway through the run and never return to the higher value. Naturally, the tester would try the performance run again to confirm the bug and identify the game area and player actions that caused the lower frame rate. Another tactic on our team was to try to reproduce the bug on another device.

The results were very confusing. 

Sometimes another device would show the issue, sometimes not. Meanwhile the original device, still hot from the first run, now exhibited the bug throughout the entire session. These inconsistent results would usually trigger further tests to normalise the result. Because most bugs occur as a result of in-game actions, it wasn't natural for the testers to consider the operating system as the cause, or at the very least, we assumed that a device-level problem would reproduce consistently on the same device.

The first clue came only from inspecting the performance data more closely. Someone noticed that CPU utilisation dropped at the same time as the frame rate. What's more, CPU utilisation was reported per core, and at least half of the cores were at dead zero during the 'buggy' period. From this discovery the root cause was traced to CPU throttling by the operating system.
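With hindsight, the signature is easy to describe: a sustained frame rate drop that coincides with several CPU cores reading zero utilisation points at OS-level throttling rather than game code. A sketch of that check (the sample format and thresholds here are my own assumptions, not our actual tooling) might look like this:

```python
def looks_throttled(sample, fps_baseline, fps_drop_ratio=0.7, dead_core_ratio=0.5):
    # sample = {"fps": float, "per_core_util": [percent utilisation per core]}
    fps_low = sample["fps"] < fps_baseline * fps_drop_ratio
    dead_cores = sum(1 for u in sample["per_core_util"] if u == 0)
    many_dead = dead_cores >= len(sample["per_core_util"]) * dead_core_ratio
    return fps_low and many_dead


samples = [
    {"fps": 58.0, "per_core_util": [62, 55, 48, 51, 40, 37, 35, 30]},
    {"fps": 31.0, "per_core_util": [71, 66, 58, 49, 0, 0, 0, 0]},  # throttled
]
for i, s in enumerate(samples):
    if looks_throttled(s, fps_baseline=60.0):
        print(f"sample {i}: likely OS throttling, not a game regression")
```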

The lower frame rate was not related to anything we had added in that release. We had all learned something, but there was no bug. We did, however, run a follow-up investigation to determine whether the device temperature was caused by our game in the first place. While the game was a contributor, it wasn't behaving badly. The heat was simply a result of running the game on a lower-end device that got hot when used for extended periods, as we did in testing.

Rogue server deployments

Server-side features, fixes and configuration have always been something of a mystery to game testers. Unlike the client game, testers can't move between server builds freely or control when fixes enter testing. There aren't even distinct build numbers for server code, and the test team doesn't have visibility of the change log. Often, the only control testers have is to change which development server they are connecting to through client debug controls. This allows multiple server branches to be deployed to different environments and tested in parallel, though even this can cause more harm than good if the state of each server isn't documented and updated regularly.

These attributes make server code and configuration fertile ground for wasted time chasing ghosts. Here is a selection of brief accounts of testers being left confused by server changes, resulting in wasted test time.

One of the first times I saw this happen, we were working on a synchronous multiplayer game which had both microservices that served all players continuously and a game server for each session being played. The game server was being built by the multiplayer dev team and the microservices by the server dev team; both were being updated fairly continuously throughout development. The problem was that it was difficult to keep the testers apprised of every deployment. Many times a tester observed a bug and began investigating it, only to find that it had suddenly stopped occurring. Because the game was multiplayer-only, bug investigation also took a long time and required the coordination of other testers. I found out retrospectively that these bugs aligned with deployments to the microservices we were using. The development team would identify and fix bugs before they'd been found by the test team, which in itself was fine, but it created confusion. My action was to begin relaying upcoming server deployments to the testers so they knew when they were 'moving to a new server build'. This allowed them to make better decisions during bug investigations and avoid some of the previously wasted time.

On the same project we also observed flurries of network errors at short and random intervals. Suddenly, anyone on the team who was interacting with a server-driven feature would begin seeing errors or other buggy behaviour and then begin investigating it. It turned out these were caused by the services being redeployed; they were essentially offline during deployment. In some cases this provided an accidental failure test of how gracefully the client handled server-down scenarios, which did surface legitimate bugs. Mostly, though, it caused confusion, wasted time logging the errors as bugs and a long-term desensitisation to server errors. When there were legitimate, low-repro bugs server side, testers were more likely to dismiss them as one-offs or attribute them to something the server team was doing.

Invalid server configurations

On a later mobile project we had up to 8 dev servers which would host different branches of in-development server code (to be clear, these servers were hosting our microservices; it wasn't a synchronous multiplayer game, so there was no dedicated ‘game server’). The client dev builds had debug options which allowed moving between each dev server, and some client branches would also point at different servers by default when there was a strict compatibility requirement, eliminating the human error of having to use the debug options to switch.

We were working on a major new asynchronous multiplayer game mode, at least six months' work, which was heavily server-driven. The server components for the feature were substantial, made up of multiple microservices which communicated with one another, as well as data storage for new leaderboards.

I'll refrain from devolving into further technical jargon here; the takeaway is that this wasn't a single, neat, encapsulated server but a web of interconnected components.

During testing we had HTTP tools for intercepting network data on the client. We could edit client requests to set different server configs and spoof failure states. The problem was that this ‘traditional’ test team approach to testing client<->server interaction was limited to direct network interactions with the client game, leaving any server<->server interactions out of reach. The solution to this was to work closely with the server dev team to have them perform their own testing as well as set up artificial failure states server-side so that we could test how the client handled them. These failure states were deployed to one of the dev servers alongside the ‘healthy’ deployment of the same code.
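As an example of the client-side half of this, here is what spoofing a failure state can look like with mitmproxy, one common interception tool (not necessarily the one we used); the endpoint path is hypothetical. Note that this only touches traffic between the client and the servers it talks to directly, which is exactly the limitation described above.

```python
# spoof_failures.py -- run with: mitmproxy -s spoof_failures.py
from mitmproxy import http

LEADERBOARD_PATH = "/v1/leaderboard"  # hypothetical endpoint


def request(flow: http.HTTPFlow) -> None:
    # Assigning flow.response in the request hook short-circuits the call,
    # so the real server is never contacted and the client sees a fake 503.
    if LEADERBOARD_PATH in flow.request.pretty_url:
        flow.response = http.Response.make(
            503,
            b'{"error": "service_unavailable"}',
            {"Content-Type": "application/json"},
        )
```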

The approach was effective, but not without issues. The speed of server config updates and the number of possible dev servers created opportunities for human error and wasted time on multiple occasions. For one, we didn’t have a good way of tracking and communicating what configuration was deployed to each dev server at any one time. Testers would assume a configuration was still deployed to a specific dev server and run tests, only to find ghost bugs which turned out to be non-issues because the server had been updated. Additionally, when legitimate bugs required the server to be in a specific failure state, testing on the wrong server would show the bug as ‘fixed’, causing bugs to be wrongly closed. Furthermore, because of the complexity of the server architecture, there were almost continuous changes being made and deployed to the dev servers. Even when it wasn’t the feature code itself, server configurations and architectural changes were being updated to fix bugs found by the server dev team. Many ghost bugs were investigated and dropped during this time because testers had no way of confirming the configuration of a server when they began their testing.
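The fix we lacked doesn’t need to be sophisticated. Something as simple as a shared manifest describing what is deployed to each dev server, checked before a test pass begins, would have removed most of the guesswork. Here is a sketch of that idea; the URL and JSON shape are entirely hypothetical.

```python
import json
import sys
import urllib.request

MANIFEST_URL = "https://tools.example.com/dev-servers/manifest.json"  # hypothetical


def check_server(name: str, expected_config: str) -> bool:
    # The manifest maps each dev server to its deployed branch and config, e.g.
    # {"dev-4": {"branch": "feature-x", "config": "leaderboard-failure",
    #            "deployed_at": "2023-05-01T10:12:00Z"}}
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        manifest = json.load(resp)
    deployed = manifest.get(name, {})
    print(f"{name}: branch={deployed.get('branch')}, "
          f"config={deployed.get('config')}, deployed_at={deployed.get('deployed_at')}")
    return deployed.get("config") == expected_config


if __name__ == "__main__":
    # e.g. python check_server.py dev-4 leaderboard-failure
    server, expected = sys.argv[1], sys.argv[2]
    if not check_server(server, expected):
        sys.exit("Server is not in the expected state -- do not start the test pass.")
```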

For this feature, the complexity of the server components and our insight into them were both a necessity and a burden for the test team. There was no doubt that the coordination with the server team allowed us to test scenarios that our regular testing wouldn’t have been able to reach, and we caught many legitimate bugs as a result. However, communication and organisation were difficult and not properly managed. The server dev team were not accustomed to serving a separate team with deployments, or to the longer test passes we ran, preferring to operate on their own quicker schedule. Additionally, the idea of multiple dev servers, and what was on each of them, was a new concept to many testers on the team who were previously accustomed to simply finding bugs in the client game. Despite ‘how-to’ documents and other test documentation, most of the test team went through a learning curve when testing this feature, and many ghost bugs were investigated as a result.

My takeaway from that feature was that I underestimated the technicality and complexity of the testing setup. It was a very big ask for our remote team of testers. They did a great job in the circumstances, but ideally we would have had a local team of more senior testers sitting with the server dev pod.

Environment confusion

When we test on development builds, all of the services, databases and configurations they connect to are (usually) exclusive and entirely independent of their production/live game counterparts. We can call this the test 'environment', and it's what I'm referring to in this section.

I thought it was assumed knowledge that testers would understand the test environment is independent of the production environment, and that the two are not one and the same. It would be extremely imprudent to run a debug client build that connected to production environment services. It’s the kind of configuration mistake that would put unachievable high scores on the game’s multiplayer leaderboard, trigger achievement unlocks on the real game store before the game was live, or simply pollute real player analytics and error report databases with test data from debug builds. Aside from data hygiene, the production environment usually requires a different technical setup and server architecture to handle the huge player base using it. In many cases, it wouldn’t be technically or financially feasible for these environments to be shared.
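To make the separation concrete, a client typically selects its endpoints from configuration along the lines of the sketch below (all URLs and names here are invented), which is exactly why a debug build should never be able to point at production by accident.

```python
# Hypothetical per-environment endpoint table baked into the client config.
ENVIRONMENTS = {
    "dev": {
        "cloud_save_url": "https://cloudsave.dev.example.com",
        "leaderboard_url": "https://leaderboard.dev.example.com",
        "analytics_url": "https://analytics.dev.example.com",
    },
    "production": {
        "cloud_save_url": "https://cloudsave.example.com",
        "leaderboard_url": "https://leaderboard.example.com",
        "analytics_url": "https://analytics.example.com",
    },
}


def endpoints_for(build_flavour: str) -> dict:
    # Debug and profile builds are locked to the dev environment; only a
    # release build ever reads the production endpoints.
    env = "production" if build_flavour == "release" else "dev"
    return ENVIRONMENTS[env]


assert endpoints_for("debug") is ENVIRONMENTS["dev"]
```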

Apparently this wasn’t as obvious as I thought.

I was working on a project that had an opt-in cloud save feature which allowed players to upload their data to the cloud and retrieve it on another device. Our dev client builds connected to a small development cloud save server and database. Both the dev server and database were separate from their production environment counterparts. Our dev server held the latest in-development cloud save code and stored only player profiles from dev client builds, whereas the production server stored the data for the entire production player base and was configured to support the higher load.

We regularly ran tests with a crowdsourced testing company to supplement our test efforts, and part of that process required the tester to check bugs against the live store version of the game, because we were primarily interested in bugs introduced in the current release. Due to this cross-checking between environments, we received frequent bug reports about cloud save not functioning correctly in various ways, which turned out to be caused by testers trying to move their cloud data between the development build and the production store build. Even where the tester didn’t intend this specific action, some were simply using the same social media account to sign into both environments, which would confuse the cloud save identity logic and produce various negative effects. The testers didn’t understand why the cloud save profiles they created on the dev builds weren’t being retrieved when signing into the same account on the production build.

I’ll venture that part of the confusion in this case comes from the flow of testing closed betas and release-ready playtests. Many game teams prepare their beta tests so that existing players can take their production game progress into the beta. They may even migrate the player’s production data into a separate beta environment, so that a copy of the player data is made and any changes can be reversed if the beta feature needs to be rebalanced or go back into development. Game teams aim to make it as easy as possible for playtesters to take part in a beta test and provide productive feedback. Crowdsource testers are regularly involved in such playtesting because of their ability to scale and their relatively low cost, so it's understandable for them to assume that all testing includes this seamless integration with live game data.

My takeaway from this is that there’s a surprising level of hidden complexity in the test environment that most testers have never considered, at least not until they come across a problem. QA teams should seek to fully understand the technical nuances of their own project’s environments so they can avoid making assumptions during testing, and those nuances, along with any testing ‘rules’ derived from them, need to be documented and communicated to every team of testers working on the project. Certainly, in this case we could have done a better job of defining some of those rules.

Enjoying my content?

Consider buying my book to keep reading or support further articles with a small donation.

Buy me a coffee