I've been reading a lot lately about version control branching strategies in an effort to come up with an alternative way of managing hotfixes and bugs in releases that doesn't involve merging downstream branches back into upstream development branches. This seems to be a common practice and one I've encountered a lot in my career. I've also encountered the chaos and confusion and hours of debugging that can result.
Many CI/CD and source control tools offer features to automate these hotfix merges back into a develop branch (which I will call 'automerge' in this document). Atlassian's Gitflow Docs state:
As soon as the fix is complete, it should be merged into both main and develop (or the current release branch), and main should be tagged with an updated version number.
In theory, this sounds like a fine thing to do. It might even work most of the time as long as the develop branch hasn't diverged too drastically from the downstream release or production code being patched. This probably works reasonably well for teams that truly do continuous integration (i.e. releasing daily or even hourly). In that case, the branches are unlikely to have diverged too much, and so the hotfix probably can be merged back into develop with minimal fuss.
If your team isn't releasing frequently and the upstream development branches begin to diverge significantly, this way lies madness. Automerging can especially bite you. I think it's a common misconception that if a merge in git completes without raising a conflict, then everything is fine. However, version control systems like git can only detect textual conflicts. There's another type of merge error that Martin Fowler refers to as a 'semantic conflict'. His excellent article on branching strategies discusses semantic conflicts while giving an in-depth explanation of the various strategies and their strengths and weaknesses. I definitely recommend it.
I'll leave the detailed descriptions of various branching strategies to Martin. What I'd like to do here is demonstrate what a semantic merge conflict is, and propose a workflow that can help avoid them. I'll also show why automerging can be really bad. My example program is admittedly contrived, but the sequence of events leading to the semantic merge conflict is identical to what occurs all too frequently in the real software development world.
For this example, let's pretend we've landed a gig at a hot startup. They're disrupting the IDaaS (Integer Division as a Service) space and investors can't get enough. The stock is through the roof and business is booming. Here's a look at the revolutionary program that is minting new millionaires daily:
Fantastic. This is great stuff and the influencers can't get enough of snapping pictures and making videos of themselves performing integer division on the command line. The executives are loving the windfall revenue and profits rolling in as well.
But then, disaster strikes. Users start posting photos of some glitchy behavior, and word spreads fast. The stock begins to tank and several celebrities make videos about how they might give up on IDaaS if quality doesn't improve. A pop star makes a song about how the company ruined her social life. Not good. Here are some of the viral photos of IDaaS failing hard:
The dreaded ValueError, completely flummoxing to the average user...
... and one in which we've apparently angered the math gods
Something needs to be done, and fast - like yesterday. So, a junior developer is tapped to push out a hotfix to production quickly on a Saturday night. It's not an optimal solution as the program still terminates after receiving invalid input, but at least it squelches the cryptic error messages that were trending on the social sites. This will do for now until another team is ready to release enhanced input validation that is currently being developed and tested.
The hotfix prevents users from generating cryptic errors attempting to perform division on letters.
It guards against this now too...
... and it still performs that wonderfully lucrative integer division function
A week later, the team working on the enhanced input validation feature completes their testing and is ready to merge their PR into develop. Here's an exclusive sneak preview of IDaaS 2.0:
Much better! After limping along on the hotfix for a while, it will be great to release this more fully-developed solution that gives the user a chance to correct their invalid input. It took a week to get this implemented, so it wasn't suitable to be pushed out as a hotfix. But now we're ready to merge our PR into develop!
Great! No merge conflicts with develop. We've merged the new feature into develop and...
... now it doesn't work at all!
But how could that be? All the tests passed on the feature branch before we merged the PR. The hotfix version worked. The improved version also worked. The merge didn't fail or raise a conflict. What's going on? Well... it looks like we've got some debugging to do. We can't release this, and other developers can't merge anything into the broken develop branch. Management is wondering why the awesome new version that was supposedly ready for release is now crashing spectacularly with even more cryptic errors. The last thing the company needs is more cryptic error scandals and memes.
The different versions of the files making up our example program can be explored in the code comparison tool below. Can you figure out what happened and save the day?
This is a classic semantic merge conflict scenario. Just because something merges successfully in version control does not mean it is without errors! Git (and other tools) will happily mash code together in merges in ways that don't raise a merge conflict, but aren't programmatically correct. Merging isn't magic. We still need to review and test the results.
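To make the "merging isn't magic" point concrete, here's a toy line-by-line three-way merge. It's only a sketch under simplifying assumptions: the function name `merge_lines` is made up, all three versions must have the same number of lines, and real tools like git work on diff hunks rather than fixed line positions. The idea it illustrates is the same, though: the merge only asks whether the same line was changed on both sides, never whether the combined result makes sense.

```python
def merge_lines(base, ours, theirs):
    """Toy three-way merge for files with the same number of lines.

    Returns the merged lines plus the 1-based numbers of conflicting lines.
    """
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:      # both sides agree (or neither touched the line)
            merged.append(o)
        elif o == b:    # only "theirs" changed this line
            merged.append(t)
        elif t == b:    # only "ours" changed this line
            merged.append(o)
        else:           # both sides changed the same line differently
            merged.append(f"<<< {o} ||| {t} >>>")
            conflicts.append(number)
    return merged, conflicts


base = ["line 1", "line 2", "line 3", "line 4", "line 5"]
mine = ["line 1", "line 2", "line 3 (my edit)", "line 4", "line 5"]
theirs = ["line 1", "line 2", "line 3", "line 4", "line 5 (their edit)"]

# Different lines were edited, so the merge is textually clean.
clean_result, clean_conflicts = merge_lines(base, mine, theirs)
print(clean_result, clean_conflicts)

# Both sides edited line 3, so the merge reports a conflict there.
also_mine = ["line 1", "line 2", "line 3 (other edit)", "line 4", "line 5"]
clash_result, clash_conflicts = merge_lines(base, mine, also_mine)
print(clash_conflicts)
```

Nothing in this loop looks at what the lines mean. Two edits that are incompatible but sit on different lines sail straight through.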
There are various 'merge strategies' and they differ in implementation and uses. I don't want to get in the weeds here about how merging works, but in general, most algorithms are based on lines of code. If you edit line 3 and someone else edits line 5, a merge will apply both changes above and below line 4. Merge conflicts occur only when the same lines are edited and the algorithm can't decide which version to use. But, there are many examples of code that can be merged without conflict in ways that still break the program. Here's the output of the offending diff in this example:
The text in red was code that was merged in from our enhanced input validation feature. The text in blue came from the hotfix. They work separately in their own branches, but when combined, they conflict. The get_input_int function now returns an integer, which does not have the string method .isnumeric(). Even though git combined these changes without a textual conflict, they do not work together programmatically.
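Here's a minimal sketch of what that combination might look like. Only `get_input_int` and `.isnumeric()` are named above; everything else here (the `merged_divide` function, the prompts, the `read` parameter used to fake keyboard input) is a hypothetical stand-in rather than the actual IDaaS source.

```python
def get_input_int(prompt, read=input):
    """Feature-branch style: keep asking until the text parses as an int."""
    while True:
        text = read(prompt)
        try:
            return int(text)
        except ValueError:
            print(f"{text!r} is not a whole number, please try again.")


def merged_divide(read=input):
    """A plausible result of a textually clean merge of hotfix and feature."""
    numerator = get_input_int("Numerator: ", read)  # feature: this is an int now
    if not numerator.isnumeric():                   # hotfix: written for a str
        raise SystemExit("Please enter whole numbers only.")
    denominator = get_input_int("Denominator: ", read)
    return numerator // denominator


# Simulate a user typing perfectly valid input.
answers = iter(["7", "2"])
try:
    merged_divide(read=lambda prompt: next(answers))
except AttributeError as exc:
    print(f"Semantic conflict at runtime: {exc}")
```

Each half is reasonable on its own branch. The failure only appears once both live in the same file, and it shows up even for valid input.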
There are a few variations of this scenario where semantic merge errors are introduced in common branching strategies.
In this scenario, a hotfix merge into develop breaks the environment due to an incompatible feature that has been merged in ahead of the hotfix merge. The semantic error occurs when the hotfix is merged in.
This is especially problematic when using automerging. Many workflows can be configured to automerge main back into develop, and this occurs without PR review! Some tools will generate an automatic merge conflict PR if the merge fails, but if no merge conflict occurs, the tooling will happily contaminate develop with incompatible semantic conflicts. Sometimes unit testing can detect the error, but then you still have to investigate why the develop branch suddenly started failing. Semantic merge conflicts can be very time consuming and difficult to debug.
This is similar to Scenario 1, except that the hotfix is merged in before the feature. The semantic error occurs when the incompatible feature is merged into develop. This is the scenario from our IDaaS example.
Notice how in both cases, yet another feature branch/PR is required to fix the broken develop environment. Eventually, this 'fix' gets promoted back up to main (or production). But, it's a waste of time. It would have been better not to have merged this hotfix in the first place, since the hotfix was replaced by (and incompatible with) improved functionality introduced in the new feature.
This is another issue with automerging: It doesn't make sense to automerge every hotfix. Many hotfixes are no longer relevant in the develop environment due to refactoring or new features. Debugging and undoing merges that never needed to occur in the first place is a waste of developer time.
This scenario shows how the semantic error can be caught and fixed in the feature branch before it breaks develop. The hotfix is still automerged to develop, but afterwards the feature branch is updated by merging in the latest develop changes. The error is then discovered and fixed in the feature branch before it is merged back into develop.
This is better than the first two scenarios, but it relies on developers always remembering to pull in the latest changes from develop before merging (a good habit in any case). It's also not airtight: race conditions are still possible, especially on large teams with frequent commits. The code in develop can change between the time you merge develop into your feature branch and test, and the time you merge your feature back into develop.
It's especially frustrating when an automerge fires during that window. There's always a chance that a semantic merge error can still be introduced in develop - even with frequent updating or rebasing of feature branches.
This scenario is a little better, but time is still wasted undoing the results of the unnecessary merge as part of the feature.
This diff shows the fix for the semantic conflict: back out the hotfix code, which is superseded by and incompatible with our new enhanced input validation feature.
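As a rough sketch of the end state, with the hotfix's string check removed, the code might look something like this. As before, only `get_input_int` comes from the example. The `divide` function, the prompts, the injectable `read` parameter and the re-prompt on zero are my assumptions about what "enhanced input validation" would reasonably include.

```python
def get_input_int(prompt, read=input):
    """Keep asking until the text parses as an int, then return the int."""
    while True:
        text = read(prompt)
        try:
            return int(text)
        except ValueError:
            print(f"{text!r} is not a whole number, please try again.")


def divide(read=input):
    """Feature code with the leftover hotfix check backed out."""
    numerator = get_input_int("Numerator: ", read)
    denominator = get_input_int("Denominator: ", read)
    while denominator == 0:  # assumed behavior: re-prompt instead of crashing
        print("Cannot divide by zero, please try again.")
        denominator = get_input_int("Denominator: ", read)
    return numerator // denominator


# Simulated session: a typo, then a zero denominator, then valid input.
answers = iter(["abc", "7", "0", "2"])
result = divide(read=lambda prompt: next(answers))
print(result)
```

The validation lives in one place and returns one type, so there's nothing left for a stale `.isnumeric()` call to trip over.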
Here's a diagram showing a better way to handle the scenario in our IDaaS example:
The hotfix is not merged from main into develop. It doesn't need to be in this case since the feature replaces it, and the two changes aren't compatible with each other anyway.
When it's time to promote develop to main (i.e. for a release), a promotion PR is created. It first backs out any hotfixes by reverting (or rolling back) to the previous release, and then merges develop in on top. This promotion PR is then reviewed and ultimately merged into main.
There are several advantages to this approach:
Of course, the disadvantage of this method is that in cases where a hotfix needs to be applied to both main/production and upstream branches like develop, this has to be done in separate commits. However, I don't think this is a bad thing for the following reasons:
Conceptually, I think the Revert-and-Merge (RaM) strategy makes a lot of sense. Here's an analogy: Think of branches like wire or cable. Over time, cables need to be spliced and patched. Eventually, they're replaced. But, they're not replaced by transferring (or 'merging') the old splices to the new cable. The old cable is replaced with a pristine splice-free cable with guaranteed continuity. It's the same with a tire inner tube that has several patches. Eventually it has to be replaced with a new inner tube that is free of patches and the defects that necessitated them in the first place. There shouldn't be any need to apply old patches to a new tire.
Likewise, we want to replace worn, patched code with pristine release code fresh from testing. Bugs that were discovered and patched with hotfixes in previous releases should be holistically resolved as bugfixes or features as part of the new release. We don't want to just slap the same quick, hackish patches onto newer releases.
Since our new code should - in theory - address the issues that arose as hotfixes, we no longer need the hotfixes. They're like tire patches or splices that can be thrown out when the new replacement arrives. We throw them out by reverting to the previous release, then applying the new release on top of the previous one with a fast-forward merge.
Note: Here's a StackOverflow answer that briefly describes the various ways to revert. For Revert-and-Merge, we want to use a technique that preserves history, so the section in this answer titled "Undo published commits with new commits" is the relevant one. Ranges of commits can be specified, and each reverted commit can be added as a separate commit with its own log entry, or collapsed into one revert commit encompassing a range of commits that were reverted.
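To tie the revert step to the promotion PR, here's a small helper that only builds the git commands for a promotion branch. It doesn't run anything. Treat it as a sketch under assumptions: the function name, the `v1.0` tag and the `promote/v2.0` branch are hypothetical, it assumes the previous release is tagged on main, and it assumes the hotfixes are ordinary (non-merge) commits. Reverting a merge commit would additionally need git's `-m` option.

```python
import shlex


def promotion_commands(previous_release, promotion_branch,
                       main="main", develop="develop"):
    """Build (but do not run) the git commands for a Revert-and-Merge promotion."""
    return [
        # Start the promotion branch from the current tip of main.
        ["git", "switch", "-c", promotion_branch, main],
        # Add one revert commit per hotfix made since the previous release.
        # History is preserved; nothing is rewritten.
        ["git", "revert", "--no-edit", f"{previous_release}..{main}"],
        # Bring in the new release from develop on top of the reverted state.
        ["git", "merge", "--no-edit", develop],
    ]


commands = promotion_commands("v1.0", "promote/v2.0")
for command in commands:
    print(shlex.join(command))
```

To collapse the reverts into a single commit instead, `git revert --no-commit` over the same range followed by one `git commit` is the usual variation. Once the promotion PR has been reviewed, main can take it as a fast-forward, since the promotion branch was started from main's tip.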
In summary, here's how to avoid wasted time and grief from semantic merge conflicts that seem to 'randomly' break builds for hard-to-debug reasons:
Here's a diagram that shows a more complete example:
Look how beautiful and clean this diagram is! No conflicting merges running both ways and no features that had to be created just to fix the semantic errors.
We don't want branch diagrams that look like this, with merges flowing in both directions. This can lead to semantic errors and circular merge loopback insanity: upstream merges conflict with downstream merges, which results in even more bugfixes and hotfixes, which in turn create even more conflicting bi-directional merge activity.
We want branch diagrams to look like this, with merges flowing in one direction. This is clean, easy to reason about, and makes the git gods happy.