When a system starts to hurt, the instinct is to look everywhere at once. Dashboards light up, the backlog fills with “tech debt,” and someone, somewhere, suggests a rewrite. I have learned to resist all of it until I have answered a narrower question: which parts are actually load-bearing?
Most of a system is not load-bearing. It is ordinary code doing ordinary work, and it will keep doing that work whether or not anyone touches it. The problems — the real ones, the expensive ones — concentrate in a small number of places. The component under the highest write load. The one path where consistency must be exact. The piece of state that three teams quietly depend on and nobody owns.
Why the concentration happens
Structural pressure is not evenly distributed, because demands on a system are not evenly distributed. A reporting screen that runs twice a day and a checkout path that runs ten thousand times a minute are not the same kind of problem, even if they share a database. Treating them as equivalent is how teams end up optimising things that did not need it and ignoring the things that did.
The skill is not in working harder across the whole system. It is in knowing where the whole system is actually being held up.
The load-bearing points are where cost accumulates, where fragility originates, and where a single good decision pays for itself many times over. They are also, conveniently, few. That is the good news buried in a struggling platform: you do not have to fix everything. You have to fix the right things.
How I look for them
I start with characteristics, not code. Before I open the architecture diagram, I want to know what the system is actually being asked to do.
- Where is the highest transaction rate, and is that part transactional or analytical?
- Where must consistency be enforced immediately, and where can it be eventual?
- Where does concurrency create genuine risk rather than theoretical risk?
- Which piece of state, if it were wrong for thirty seconds, would cause real harm?
The answers tend to point at the same two or three places that the incident history already pointed at. That convergence is the signal. When the pressure data, the failure data, and the cost data all indicate the same component, you have found a load-bearing point — and you have found where the next piece of work should go.
The discipline is in stopping
The hardest part is not finding these points. It is leaving everything else alone. There is always a temptation to tidy the surrounding code while you are in there, to modernise the bits that look dated, to churn what is already working. That temptation is how a focused fix becomes a six-month project.
Find the points that bear the load. Understand precisely why they behave the way they do. Fix those — and have the discipline to stop there. The rest of the system was never the problem.
Working through something like this?