Tightening the Ratchet: Evolving Mature Systems

Introduction

This article is aimed at software engineers and managers who are working on mature software systems, though the principle can be applied more broadly.

Background

I spent the first part of my career working on new, novel projects; this was, and still is, exciting. These projects are blank slates where you can do things the "right" way, use fancy new tools and not be bogged down by technical debt. I got to design systems from scratch, and was even able to fully rewrite some sections of them. But eventually I was thrown into a code base of 100k lines of C++. The build system was a combination of make and the company's proprietary build tooling. The tests were flaky, and developers constantly had to compare failing test runs to tease out which results were genuinely new failures. This process was frustrating and time-consuming, and led to a great deal of developer dissatisfaction.

So this got me thinking: how do we take a system that is distrusted and a cause of frustration, and fix it without throwing it away and starting again? Systems with large amounts of technical debt are often also systems that cannot be taken offline and restructured. Thinking and working within this system allowed me to slowly develop a strategy for migrating away from and removing technical debt, one that I call "Tightening the Ratchet".

Example Case

To help explain the idea, I thought it best to work through a synthetic case study. Let's imagine a system which produces a code artifact that is deployed weekly to millions of users. The system is worked on by tens of developers and requires multiple new features to be added each year. The developers are also part of disparate teams, but contribute to a common code base. Each team has a group of features that they own, but often they touch the same code paths as other teams.

In this example, to allow for features to coexist and prevent conflicting interactions, it is common for teams to add if (FEATURE_ENABLED) blocks, leading to confusing code paths and nondeterminism. These FEATURE_ENABLED flags are set in a common header file, which should get updated whenever features are added, enabled, disabled or removed. How can we evolve this system from this uncontrolled and error-prone approach to one which is deterministic and predictable?
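To make the old model concrete, here is a minimal sketch. The real system is C++, with #define FEATURE_* macros in a shared header; this is a Python analogue for brevity, and the flag names are invented for illustration:

```python
# The "old model": flags scattered as module-level booleans, with ad-hoc
# conditionals at every use site. Nothing checks that the combination of
# flags makes sense together.
FEATURE_NEW_RENDERER = True
FEATURE_LEGACY_CACHE = False

def choose_pipeline() -> str:
    # Conditionals like this accumulate across the code base, creating
    # confusing, interacting code paths.
    if FEATURE_NEW_RENDERER and not FEATURE_LEGACY_CACHE:
        return "new"
    if FEATURE_LEGACY_CACHE:
        return "legacy"
    return "fallback"
```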

Stage 0: Charting the Path

Let's set some requirements, and then we can work from there. In our improved future system, 1/ all configuration settings (FEATURE_ENABLED) are part of one type-checked structure, 2/ in the creation of the configuration, the configuration is guaranteed to only have compatible features, and 3/ header-style feature flags are disallowed.
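As a sketch of what requirements 1 and 2 might look like in practice (shown in Python for brevity, with hypothetical flag names; a C++ system could achieve the same with a struct and a validating factory function):

```python
from dataclasses import dataclass

# Requirement 1: one type-checked, immutable structure holds all settings.
@dataclass(frozen=True)
class Config:
    new_renderer: bool = False
    legacy_cache: bool = False

    def __post_init__(self) -> None:
        # Requirement 2: an incompatible combination cannot even be
        # constructed; the check lives here and nowhere else.
        if self.new_renderer and self.legacy_cache:
            raise ValueError("new_renderer is incompatible with legacy_cache")
```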

Stage 1: Stopping the Expansion

Finding and replacing all instances of these feature flags, all at once, is not an impossible task, but it would most likely mean a large code change requiring extensive testing and review. It would also force all feature branches to be rebased, and potentially cause new feature work to be paused while the migration is performed. Though this is not an unreasonable approach, finding the time to pause development is rarely easy, and therefore an incremental approach is much more palatable.

The first step is to stop the creation of any new feature flags using the old method; otherwise this will be a never-ending task. But we cannot do so until we have provided a new method which is effectively equivalent but uses the new structures we are introducing. We can prevent new flags by creating a lint/build step which fails the build if it detects new lines in the common header file. The error it throws can then point developers to the new method, which may simply be a dictionary or structure of flags. We have now added friction which stops the expansion, and provided an alternative that requires little more effort than the old way, confining the scope of the problem.
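The ratchet check itself can be very small. Here is one possible sketch, assuming the old flags live in a C++ header as "#define FEATURE_*" lines; the header path, flag prefix, and baseline count are all hypothetical, and in CI this would read the real header and fail the build on any growth:

```python
# Ratchet lint step: the old-style flag count may only stay equal or shrink.
def count_flags(header_text: str) -> int:
    # Count old-style flag definitions in the shared header.
    return sum(1 for line in header_text.splitlines()
               if line.strip().startswith("#define FEATURE_"))

def ratchet_holds(header_text: str, baseline: int) -> bool:
    # `baseline` is the flag count recorded when the ratchet was set.
    # Any growth fails the build, with an error message (omitted here)
    # pointing developers at the new configuration structure.
    return count_flags(header_text) <= baseline
```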

Stage 2: Creating Confidence in the Current State

Now that we have a container for our configuration options, we can start to encode our business logic into its creation and use.

The first bit of business logic we should encode is the prevention of incompatible combinations of flags being enabled together. Today this checking is often done where the flags are used, which means that as the valid combinations change, there are multiple places in the code that need updating, any of which could be missed. So why not check the configuration once, at compile time or initialisation, rather than at runtime in multiple parts of the code?

This encoding of incompatibilities could be done in a few ways: for example, as part of an enumeration which forces a distinct selection between non-compatible options, or via asserts in a configuration initialisation function, or a combination of the two. Either way, you are now concentrating this decision into one part of the code base and providing business value by increasing confidence in the selected configuration.
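A minimal sketch of combining the two techniques, with hypothetical option names (Python for brevity; in C++ the same shape falls out of an enum class plus assertions in a constructor):

```python
from dataclasses import dataclass
from enum import Enum, auto

# Mutually exclusive options collapse into a single enumeration: the type
# system guarantees exactly one is selected at a time.
class CacheMode(Enum):
    LEGACY = auto()
    WRITE_THROUGH = auto()
    DISABLED = auto()

@dataclass(frozen=True)
class CacheConfig:
    mode: CacheMode
    prefetch: bool = False

    def __post_init__(self) -> None:
        # Remaining cross-field rules are asserted once, at initialisation,
        # instead of at every use site.
        if self.prefetch and self.mode is CacheMode.DISABLED:
            raise ValueError("prefetch requires a cache")
```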

Finally, for this step, we should create documentation and an onboarding plan so that other developers can start using the new functionality. The documentation can live in code comments or a team wiki, but it should be easily accessible, preventing developers from becoming frustrated and falling back to the old model.

Stage 3: Migrating All Old Implementations

Now that we have a clear method for creating new flags and configuration combinations, we need to remove the old implementations. How this is done will depend heavily on the size of the organization and the complexity of the use cases. Having a single team handle all migrations will allow it to move more quickly, but it will be hampered if deep domain knowledge is needed. If deep domain knowledge is required, having a developer from each team perform the migration will be less error-prone, but it may require more handholding and take longer, since you will be competing with other team priorities.

Before the migration is started, a testing plan should be developed and committed to. Code migrations often surface idiosyncrasies in the system, or at least implicit assumptions that were never written down. The ideal system already has a wealth of unit and integration tests; when it does not, a clear plan should be created for verifying that behaviour does not change as flags are moved across.
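One simple shape such a plan can take is a parity (characterisation) test: for every flag value, the behaviour driven by the old header-style flag and by the new typed configuration must agree. A sketch, with hypothetical function and flag names:

```python
def old_path(feature_x_enabled: bool) -> str:
    # Legacy code path, switched by the old global flag.
    return "batched" if feature_x_enabled else "streaming"

def new_path(config: dict) -> str:
    # The same behaviour, now driven by the new configuration structure.
    return "batched" if config["feature_x"] else "streaming"

def migration_preserves_behaviour() -> bool:
    # Exercise both paths for every flag value and require agreement.
    return all(old_path(v) == new_path({"feature_x": v})
               for v in (True, False))
```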

Stage 4: Pushing Improvements

Now that we have migrated all the flags across and are using a configuration structure to set up the system on initialisation, we can start to benefit from further improvements. For example, a basic but highly beneficial first step would be to programmatically create the configurations that are deployed or tested, and use them in our integration testing pipeline, giving us more deterministic test scenarios and increased automation.
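This can be as simple as keeping the deployed configurations in code, rather than in hand-edited headers, so the integration pipeline can iterate over them. A hypothetical sketch, with invented configuration names and flags:

```python
# The configurations we actually ship, defined once in code.
DEPLOYED_CONFIGS = {
    "stable":  {"new_renderer": False, "telemetry": True},
    "beta":    {"new_renderer": True,  "telemetry": True},
    "dogfood": {"new_renderer": True,  "telemetry": False},
}

def integration_cases():
    # Each deployed configuration becomes one reproducible test case;
    # sorting keeps the test order deterministic.
    for name, config in sorted(DEPLOYED_CONFIGS.items()):
        yield name, config
```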

We could even fuzz all configuration options to ensure that all code paths are covered by the new method, and that we have fully removed all flags defined using the old method.
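With a closed set of boolean flags, this "fuzzing" can even be exhaustive: 2^n combinations, each constructed through the new method and run through validation. A sketch with hypothetical flag names:

```python
from itertools import product

FLAGS = ("new_renderer", "legacy_cache", "telemetry")

def all_combinations():
    # Enumerate every possible flag assignment; each one would then be
    # passed through the configuration constructor and the test harness.
    for values in product((False, True), repeat=len(FLAGS)):
        yield dict(zip(FLAGS, values))
```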

Final thoughts

This is a relatively basic example of how to apply the "Tightening the Ratchet" principle. Using it at work, we have successfully rooted out failing integration tests and improved the typing of interfaces to take better advantage of the Rust compiler. In my experience, it has provided a useful and practical mental model for evolving mature systems.