Twenty years ago, as a junior developer maintaining a legacy FinTech system, I learned the hard way what fragile code means. By fixing a basic bug, I awakened another, critical one, causing several user sessions to crash every day, at a very busy time.
The bug I fixed was a very simple indexing mistake in an in-house implementation of quicksort. It hurt the efficiency of the sort, so I was proud to fix it and significantly improve the performance of a daily procedure. My fix shipped in the next patch of the system.
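The original code is long gone, so the sketch below is only a hedged Python reconstruction of what such an indexing mistake can look like; the function and variable names are mine, not the legacy system's. It shows a correct quicksort with a Lomuto-style partition, and a comment marks the kind of off-by-one in the recursion bounds that hides behind correct-looking output:

```python
def partition(a, lo, hi):
    """Lomuto partition: move a[hi] (the pivot) to its final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort(a, lo=0, hi=None):
    """Sort the list a in place."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = partition(a, lo, hi)
    # A classic indexing mistake here is getting these bounds wrong,
    # e.g. recursing on (lo, p) instead of (lo, p - 1): the results
    # can still look correct while performance quietly degrades.
    quicksort(a, lo, p - 1)
    quicksort(a, p + 1, hi)
```

The insidious part is exactly what bit me: a bounds mistake in the recursion changes how fast the sort runs, not what it returns, so nothing in the output ever flags it.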
A few weeks later, the patch was installed at a large German bank. Everything went well for several days, until one day at 4pm London time, the hour at which the interbank currency exchange rates were fixed each day. The system crashed while processing the new data. Boom. Panic. I jumped on the problem right away, but didn't make the connection with my innocuous (and legitimate) fix. Every day at the same time, the system would crash, seemingly at random, for several traders, who lost time and money restarting their sessions instead of reacting to the new exchange rates.
After several days, I found the horrible bug. It was in another part of the code, in a naive implementation of a message queue as a simple string buffer, written ten years earlier by someone who had since left the company. The sender wrote sequentially until it reached the end of the buffer, then wrapped around to the start and kept writing, overwriting the old messages without checking whether they had actually been consumed. It hadn't crashed before my fix because the sender was using the slow version of quicksort, which happened to keep it from overwriting messages before they were consumed. After my fix, the sender sped up, overtook the consumer, and brought the system down.
A simple fix had awakened a dormant bug in a completely different part of the code, with critical consequences. Sadly, this is not an exception, nor is it a memory from last century that no longer happens. Quite the contrary: it still happens all the time, and I bet any developer who has ever dealt with a legacy system can tell a similar story. And that's normal, because this is the nature of code.
Your code is VUCA
My code is what? This acronym, introduced by the U.S. Army War College, stands for Volatile, Uncertain, Complex and Ambiguous, and describes an unpredictable system where surprise is the rule and cause-and-effect chains are an illusion. War is about the most VUCA system you can find. But think about a typical coding environment, and you'll find as much volatility (any part of the code can change), uncertainty (you can't be sure how those changes will affect the system), complexity (the system is too large to reasonably predict the effects of a change), and ambiguity (poor naming or outdated comments lead to wrong assumptions about the system's behavior) as you could ask for.
It can be a big monolith that you must painfully work around. It can also be a dense jungle of coupled and inextricable trees. Except that in this jungle, you can’t progress just by cutting branches here and there; some are important, and an unfortunate machete cut can break a critical feature, or wake up plenty of dormant bugs. Even worse, you could open a new route for a malicious attacker.
On the contrary, it can be a well-designed fortress, with secured access points, open for extension, tested and documented so that new contributors can easily find their way in.
In a VUCA world you cannot apply a well-known recipe and assume the expected outcome; you have to constantly adapt to new conditions. It's no wonder that eXtreme Programmers claim that code is a liability, and follow rules such as simplicity and merciless refactoring to keep the code easy to navigate, easy to understand, and easy to maintain at all times. Otherwise your code becomes fragile, which is the last thing you want in a fast-paced world.
Let's see how military theories deal with this.
The OODA loop is the answer for navigating VUCA environments
John Boyd was an American military strategist. As a pilot during the Korean War, while trying to explain the surprising superiority of the American F-86 Sabre over the Russian MiG-15[1], he came up with the OODA loop theory. He modeled the pilot's behavior as a four-phase cycle, Observe-Orient-Decide-Act, which he considered the central mechanism enabling adaptation to any fast-changing world.
- Observation: collect the data.
- Orientation: analyze the collected data to create a new mental model of the situation, based on your experience, knowledge, etc. with a resulting new set of possible actions.
- Decision: determine the next course of action (the decision could be to go back and observe or orient ourselves further).
- Action: execute the decision.
The key success factors are then to execute each step in the loop more effectively, and to loop through this cycle faster than your enemy.
And indeed, the American pilots were doing better on several aspects of this model:
- The F-86, with its curved windshield, offered better visibility than the MiG-15, which also suffered an “inadequate defrosting of its canopy and windshield which obscured pilot vision”, according to a study by US Air Force colonel Roger C. Taylor[2].
- The American pilots, seasoned by World War II, had more experience than their opponents.
- According to Harry Hillaker (chief designer of the F-16), in his Tribute to John R. Boyd, “Time is the dominant parameter. The pilot who goes through the OODA cycle in the shortest time prevails because his opponent is caught responding to situations that have already changed”. And the F-86 Sabre responded faster to commands than the MiG-15, which, according to Taylor's study, suffered from “poor aircraft control at high indicated airspeeds”, allowing the American pilots to go through the OODA loop faster than their enemies.
Practically, does this work on fragile code?
Can these abstract military principles actually help us stay on top of the ever-shifting conditions of our codebase? They do: they translate into proven good practices.
A few years after my terrible experience with the quicksort fix, similar bad experiences kept happening, and some parts of the code were so fragile that developers were never confident working in them, knowing their changes would likely have unforeseen (and possibly disastrous) consequences. I was given the mission to improve the development life cycle, and I set up a team whose goal was to regain control of our legacy code. Although none of us knew about the OODA theory at the time, the solutions we implemented were fully aligned with the OODA principles.
The first thing we did was to deploy code review. The unmaintained and inscrutable parts of the system, like the hand-rolled message queue that had sealed my fate, were flagged in the reviewers' checklist and scrutinized during the review sessions. As shown by this famous illustration from Glen Lipka, simply looking at the code is already an effective step towards keeping its readability at the expected level.
In addition to this human observation, we deployed automated code analysis solutions:
- A web-based code search solution, for browsing the codebase and its history;
- Linters, to enforce a consistent coding style throughout the codebase;
- Code analyzers, that helped us measure and control code complexity, as well as understand and navigate complex data flows.
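We used off-the-shelf tools for all of this, but the core idea behind the complexity measurement is easy to sketch. The toy analyzer below, a hypothetical Python illustration rather than anything we actually deployed, computes a rough cyclomatic-complexity score by counting branch points in the abstract syntax tree:

```python
import ast

# Node types that introduce a branch in the control flow.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler)

def complexity(source: str) -> int:
    """Rough cyclomatic complexity: 1 + number of branch points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(tree))

snippet = """
def f(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                x -= 1
    return x
"""
```

Running `complexity(snippet)` counts the two `if`s and the `for` to give a score of 4; a real analyzer does far more, but even a crude number like this lets you rank the scariest files and point reviewers at them first.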
The Orientation phase, as described by John Boyd in his Destruction and Creation essay[3], is a process where you analyze/deduce your observations into specifics, and then create a new generality out of these specifics by synthesis/induction. In this phase, individuals use their knowledge, their experience, their culture, and so on, and Boyd points out that “In a cooperative sense, where skills and talents are pooled, the removal or overcoming of obstacles represents an improved capacity for independent action for all concerned.”
Collective intelligence is a major element of the orientation phase. And that’s true for software development.
As an example, using and contributing to Stack Overflow or other community knowledge bases is common practice.
With my team, we helped shift the culture from reinventing the wheel to reusing open source code. Basic algorithms such as sort functions, and critical thread-safe data structures such as message queues, were no longer reinvented in-house.
We deployed an internal developer forum, an internal Stack Overflow of sorts, and weekly Coding Dojos. These initiatives were instrumental in understanding our code.
We also promoted enterprise-wide code reviews, encouraging developers to review the code of other teams. Tom Bolton, from the Semmle data science team, found clear evidence that “Increasing the number of reviewers, and not the time individuals spend reviewing” is the best way to improve code quality.
Tempo is the dominant factor of the OODA loop theory: you should loop and adapt at a faster pace than your environment changes. The practices we put in place at that time were all designed with quick feedback in mind.
We deployed a lightweight peer code review tool so that reviews happened as soon as the code was written, not once the author had already moved on to something else.
At that time our Quality Control team had already automated the execution of the system's end-to-end tests, but we decided to shorten this feedback loop further.
- We promoted, via the coding dojos, the practice of test-driven development (TDD).
- We set up a BDD framework to extend this quick feedback imperative to the collaboration between developers and functional teams.
- We deployed Continuous Integration, giving developers a continuous execution of these tests on their latest code changes.
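To give a flavor of what that shortened loop feels like, here is a hypothetical TDD-style example using Python's unittest (none of this is our actual code). The tests are written first and fail, then the small function under test makes them pass, giving a verdict in milliseconds instead of hours:

```python
import unittest

# Hypothetical production code under test: the kind of small,
# isolated function TDD pushes you to extract and cover.
def next_write_index(tail: int, capacity: int) -> int:
    """Advance a circular-buffer write index, wrapping at capacity."""
    return (tail + 1) % capacity

class TestNextWriteIndex(unittest.TestCase):
    # In TDD these tests come first and fail ("red"); the
    # implementation above then makes them pass ("green").
    def test_advances_within_buffer(self):
        self.assertEqual(next_write_index(0, 8), 1)

    def test_wraps_at_capacity(self):
        self.assertEqual(next_write_index(7, 8), 0)
```

Run with `python -m unittest`. Had the wrap-around logic of our infamous message queue been isolated and tested like this, the overwrite bug would have been much harder to bury for ten years.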
Observation, orientation, feedback loops... we were improving our OODA loop. It took some time, but it paid off: catastrophic bugs in production became a distant memory.
Now explore further!
I encourage you to explore further and find your own applications of OODA, since in our VUCA world it doesn't make sense to stick to a fixed set of existing “best practices”. In a future post I will share some ideas for improving code quality that I find exciting at the moment.
1. The MiG-15 was superior to the F-86 in ceiling, acceleration, rate of climb and zoom, and had heavier firepower. However, the Americans claimed a victory ratio of 10:1 at the end of the war. More recent and impartial researchers concluded on a more realistic, but still surprising, victory ratio of 2:1 in favor of the American F-86. https://en.wikipedia.org/wiki/North_American_F-86_Sabre#Korean_War
2. MiG Operations in Korea, Colonel Roger C. Taylor, USAF, research report, March 1986. https://apps.dtic.mil/dtic/tr/fulltext/u2/a177788.pdf
3. Destruction and Creation, John Boyd, September 1976. http://www.goalsys.com/books/documents/DESTRUCTION_AND_CREATION.pdf