Don’t break stuff

When you change code, how do you know that you’re not breaking something? If the code change is small, it can be pretty easy to tell (especially if you have tests), but I’m talking about more complicated code changes.

Sure, you have unit tests, but how do you know that everything in the system works together correctly?

Sure, you have integration tests in your code, but how do you know that all of the systems involved are working together?

Sure, you have end to end acceptance tests for all of the scenarios that you know about, but what if a scenario exists that you didn’t think of?

We can come up with any number of excuses for why bugs happen in production. But what happens when a bug could cause a big problem?

I’m in this situation right now because the relatively large code change I’m implementing could impact how much money we charge our customers. The code I’m writing takes in some input files from a third party and processes them. We could have a big problem if someone gets overcharged or undercharged, or even worse, not charged at all. There are lots of things that can go wrong. I have a lot of good acceptance tests for my code, so I have good test coverage for the scenarios that I know about. What I worry about are the scenarios that I don’t know about, or whatever could cause my code to not get called at all. Because of this, I have to think past the traditional ways that I usually test things.

Searching for test cases

I did a lot of analysis on the data that my code is going to consume. I even wrote some small apps and SQL queries that parse historical data and look for different combinations of data, and I used those to come up with test cases. What’s tricky is that some of the scenarios don’t happen very often, which means I need to do more digging to find them. At some point this can become tedious, but in my case it’s worth it.
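
To give a rough idea of what I mean, here’s the kind of query I’m describing: group the historical records by the fields that drive different code paths, and let the rare combinations surface first. The table and column names are invented, and I’m using sqlite3 only to keep the sketch self-contained; the real thing would run against whatever database holds your historical data.

```python
# Hypothetical example: the table, columns, and sqlite3 usage are stand-ins.
import sqlite3

FIND_COMBINATIONS = """
    SELECT charge_type, rate_code, adjustment_flag, COUNT(*) AS occurrences
    FROM   historical_input_records
    GROUP  BY charge_type, rate_code, adjustment_flag
    ORDER  BY occurrences ASC;  -- rare combinations show up first
"""

def find_test_case_candidates(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        for charge_type, rate_code, adjustment_flag, count in conn.execute(FIND_COMBINATIONS):
            # Each distinct combination is a candidate test case; the ones with
            # low counts are the scenarios that take extra digging to find.
            print(f"{charge_type} / {rate_code} / {adjustment_flag}: {count}")
```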

Watching for unexpected scenarios

Thanks to my data analysis, I have a list of expected scenarios that I’ve seen from past data. In addition to writing tests for all of these scenarios, I’m also writing some queries that will check for evidence that some unexpected scenario happened in production, and I’m running these queries every day. By doing this, I’ll be aware of any potential changes in how the source data comes in, and I also avoid having to write acceptance tests for scenarios that probably aren’t going to happen.
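
Here’s a rough sketch of what one of those daily checks might look like. The table, columns, and list of known combinations are all made up for illustration; the real queries depend entirely on your data.

```python
# Hypothetical example: table, columns, and the known combinations are made up.
import sqlite3

KNOWN_COMBINATIONS = {
    ("USAGE", "STANDARD"),
    ("USAGE", "DISCOUNTED"),
    ("ADJUSTMENT", "STANDARD"),
}

UNEXPECTED_CHECK = """
    SELECT DISTINCT charge_type, rate_code
    FROM   incoming_file_records
    WHERE  received_date = ?;
"""

def check_for_unexpected_scenarios(db_path: str, run_date: str) -> set:
    with sqlite3.connect(db_path) as conn:
        seen = set(conn.execute(UNEXPECTED_CHECK, (run_date,)))
    unexpected = seen - KNOWN_COMBINATIONS
    if unexpected:
        # Surface this however your team notices problems (email, dashboard, etc.).
        print(f"Unexpected combinations on {run_date}: {sorted(unexpected)}")
    return unexpected
```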

Throwing exceptions when you find unexpected scenarios

Not only do I have queries to check for unexpected scenarios, I also have inline checks that will cause the process to stop and throw an exception if I encounter specific unexpected scenarios that could cause an issue, especially when the issue would not otherwise be obvious (in other words, it wouldn’t cause the system to fail, but the system might give me incorrect results). In many cases, it’s a waste of time to discuss, implement, and test some edge case that is very unlikely to ever happen, but I at least want someone to know if by chance it does happen. Now if you’re writing a UI for a website, you might not have the luxury of being able to do this, but when you’re just processing backend files, you can get away with things like this.
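
As a sketch (the record shape and the specific rule here are invented for the example), an inline check like this is the general idea: fail loudly instead of producing a quietly wrong number.

```python
# Hypothetical example: the record shape and the rule are invented.
from dataclasses import dataclass

@dataclass
class InputRecord:
    id: str
    charge_type: str
    quantity: float
    rate: float

class UnexpectedScenarioError(Exception):
    """Raised when the data hits a case we decided not to handle silently."""

def calculate_charge(record: InputRecord) -> float:
    if record.quantity < 0 and record.charge_type != "ADJUSTMENT":
        # A negative quantity outside of an adjustment should never happen.
        # Rather than guess at a charge (and maybe undercharge someone),
        # stop the process so a person has to look at it.
        raise UnexpectedScenarioError(
            f"Negative quantity on non-adjustment record {record.id}"
        )
    return record.quantity * record.rate
```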

Parallel testing against production data

We have a test environment set up with 2 databases – one has 2-day-old production data and one has 1-day-old production data. We restore these databases from backups every morning, run all of the files from 2 days ago into the 2-day-old environment, and then compare the results with what’s in the 1-day-old database – theoretically everything should match. There are always exceptions, especially when you have data that was modified by a user after it was created in the database, but this allows me to easily check almost every column on a database table and see that the results from my process match what is currently in production. This has been a huge lifesaver, not only for finding bugs in my process, but also for finding unknown scenarios and bugs in other related processes. People have a lot more confidence in your work when you can show tangible proof that your system is working the same as what’s in production.
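
A stripped-down sketch of the comparison step might look something like this. The real thing covers many tables and columns, and the names here are invented.

```python
# Hypothetical example: one table and a few columns, with invented names.
import sqlite3

COMPARE_QUERY = """
    SELECT account_id, charge_type, charge_amount
    FROM   charges
    WHERE  charge_date = ?;
"""

def compare_to_production(rerun_db: str, prod_copy_db: str, charge_date: str) -> set:
    with sqlite3.connect(rerun_db) as conn:
        rerun_rows = set(conn.execute(COMPARE_QUERY, (charge_date,)))
    with sqlite3.connect(prod_copy_db) as conn:
        prod_rows = set(conn.execute(COMPARE_QUERY, (charge_date,)))
    # Theoretically this is empty; anything in it is a bug, an unknown scenario,
    # or data that a user changed after it was created.
    return rerun_rows ^ prod_rows
```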

Writing audit queries that validate the process end to end

In addition to the queries that check for unexpected data, I also have queries that validate that any data received at the beginning of the process has some resulting output. I need to make sure that the process isn’t silently failing or failing in a way that I don’t expect.
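
Here’s a minimal sketch of that kind of audit query, again with invented table and column names. The idea is a LEFT JOIN that finds inputs that never produced an output.

```python
# Hypothetical example: invented table and column names.
import sqlite3

AUDIT_QUERY = """
    SELECT i.record_id, i.received_date
    FROM   incoming_file_records AS i
    LEFT JOIN charges AS c ON c.source_record_id = i.record_id
    WHERE  i.received_date = ?
      AND  c.source_record_id IS NULL;  -- inputs that never produced a charge
"""

def find_inputs_with_no_output(db_path: str, run_date: str) -> list:
    with sqlite3.connect(db_path) as conn:
        return list(conn.execute(AUDIT_QUERY, (run_date,)))
```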

Checking logs

Hopefully your app logs exceptions somewhere, so as a last resort you should always check for any fallout in the logs.

A different thought process

I find this way of thinking to be a different thought process when it comes to testing. I’m moving past the mechanics of testing (what are my test cases, what tests am I going to write, how can I mock this class, etc.) and trying to find whatever means necessary to not break stuff. Sometimes this is done with automated tests, manual tests, inline sanity checks, exploratory code that helps me discover scenarios, or some form of production monitoring. The latter three are where I find myself doing things in a way that I haven’t always done before.

I think the shift in mindset partly comes from owning the responsibility for a feature. At one point a long time ago, I just wanted to get my code to work and let QA test it and find the bugs. Then I moved to wanting my code to work and wanting to come up with a way to test it myself. The next step in the progression for me is finding ways to make large changes to the code, not break stuff, and ensure that everything will work with no impact to the business. With each step in the progression, I’m looking at the problem at a larger scale and considering business value and business impact instead of just technical concerns.