TDD: Red-Green-Refactor, Baby Steps, and the FIRST Principles

TDD draws two extreme reactions. The first: “I already write tests, so I do TDD.” The second: “writing tests first just reverses the effort without buying much.” Both miss what TDD really is. It is neither a coverage question nor a simple ordering trick. It is a design discipline that forces you to make an intention explicit before writing the code that satisfies it.

This article opens a series on TDD. Before comparing the schools (Chicago, London, ATDD double loop, strict TDD), the shared trunk has to be laid down: the Red-Green-Refactor cycle, the practice of baby steps, and the FIRST properties that define a test that deserves the name. The next articles will build on this foundation to make the trade-offs tangible with concrete examples.

The Red-Green-Refactor cycle

Kent Beck stated it in three phases, and the order is not negotiable.

Red: write a test that fails because the behaviour it describes does not exist yet.
Green: write the smallest amount of production code that makes the test pass.
Refactor: improve the structure of code and tests, without adding any feature or changing any behaviour.

A loop takes about three minutes in the ideal case, ten when the test requires thinking through a boundary. What is called “doing TDD” is nothing more than this loop, repeated dozens of times a day.

The classic trap is to merge Green and Refactor into a single step, thinking “I’ll just write the clean code directly.” That is upfront design in disguise. The cycle loses its most useful property, which is being able to move forward knowing you can break something at any moment and detect it within seconds.

Why Red first

The “test then code” order asks one simple question before touching the implementation: what is the exact intention of this behaviour?

def test_invoice_with_discount_above_amount_becomes_free():
    invoice = Invoice(amount_cents=5000)
    invoice.apply_discount(discount_cents=6000)
    assert invoice.total_cents == 0

Before this test, the question “what do we do when the discount exceeds the amount?” had no explicit answer. The test pins it down. Production code no longer has to invent an answer; it has to satisfy that one.

The other benefit of Red is that it validates the test itself. A test that passes immediately without any implementation is suspicious: it might be testing existing behaviour, or its assertion is too weak to distinguish success from failure. Seeing the test fail for the right reason is how you check it will be useful tomorrow to catch a regression.
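
A minimal sketch of why watching the test fail matters, assuming a hypothetical Invoice whose apply_discount is still an empty stub. The function names and the fails helper are illustrative, not part of any library. The weak assertion passes with no implementation at all, so it can never catch a regression; the strict one fails against the stub, which is the Red you want to see.

```python
class Invoice:
    """Hypothetical stub: apply_discount is deliberately not implemented yet."""

    def __init__(self, amount_cents):
        self.amount_cents = amount_cents
        self.total_cents = amount_cents

    def apply_discount(self, discount_cents):
        pass  # no behaviour yet


def assertion_too_weak():
    invoice = Invoice(amount_cents=5000)
    invoice.apply_discount(discount_cents=6000)
    # True before and after the feature exists, so it proves nothing.
    assert invoice.total_cents is not None


def assertion_that_discriminates():
    invoice = Invoice(amount_cents=5000)
    invoice.apply_discount(discount_cents=6000)
    # Fails against the stub because the total is still 5000.
    assert invoice.total_cents == 0


def fails(check):
    """Return True if the check raises AssertionError."""
    try:
        check()
    except AssertionError:
        return True
    return False
```

Against the stub, fails(assertion_too_weak) is False and fails(assertion_that_discriminates) is True: only the second test has shown that it can detect a missing behaviour.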

Green: the smallest code that passes

This is the most counterintuitive step. The instruction is to write the most naive code that makes the test pass, even if that code is obviously insufficient for real use.

class Invoice:
    def __init__(self, amount_cents):
        self.amount_cents = amount_cents
        self.total_cents = amount_cents

    def apply_discount(self, discount_cents):
        self.total_cents = 0  # enough to make the current test pass

That version will make people wince. It does not handle the case where the discount is smaller than the amount. That is exactly the point. The next test will force the code to grow:

def test_invoice_with_partial_discount_subtracts_from_amount():
    invoice = Invoice(amount_cents=5000)
    invoice.apply_discount(discount_cents=2000)
    assert invoice.total_cents == 3000

Now self.total_cents = 0 is no longer enough. The code has to evolve to satisfy both tests. With each new test, the code grows by exactly what is needed. This step-by-step growth is called triangulation: you derive the general implementation from a series of concrete examples, never over-anticipating.

The opposite, writing the complete code that anticipates every case up front, is precisely what TDD seeks to avoid. Speculative over-engineering is a common source of useless abstractions in a project.
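
Returning to the two invoice tests, here is one way the second test could push apply_discount, as a sketch rather than the only valid Green step. It assumes the discount is taken off the original amount_cents; the tests so far say nothing about applying two discounts in a row, so the code does not try to answer that.

```python
class Invoice:
    def __init__(self, amount_cents):
        self.amount_cents = amount_cents
        self.total_cents = amount_cents

    def apply_discount(self, discount_cents):
        # Assumption: the discount applies to the original amount.
        if discount_cents > self.amount_cents:
            # Forced by the first test: discount above the amount.
            self.total_cents = 0
        else:
            # Forced by the second test: partial discount.
            self.total_cents = self.amount_cents - discount_cents


def test_invoice_with_discount_above_amount_becomes_free():
    invoice = Invoice(amount_cents=5000)
    invoice.apply_discount(discount_cents=6000)
    assert invoice.total_cents == 0


def test_invoice_with_partial_discount_subtracts_from_amount():
    invoice = Invoice(amount_cents=5000)
    invoice.apply_discount(discount_cents=2000)
    assert invoice.total_cents == 3000
```

The conditional is not elegant, and that is acceptable at this stage: both tests are green, and tidying it up belongs to the Refactor step.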

Refactor: the step everyone skips

Once the tests are green, you have permission to touch the code without changing its behaviour. This is the only moment when you can rename, extract a function, or remove a duplicate with little risk, because the existing tests act as a safety net.

It is also the most neglected step. The reflex is “tests pass, on to the next feature.” In the short term, things move forward. In the medium term, the code accumulates debt because code written just to make a test pass has no inherent reason to be readable. Refactor is the step that turns code that “works” into code that “can be evolved.”

What you refactor during this phase:

names that say nothing (tmp, data, do_stuff),
duplication between two successive Green passes,
nested conditionals that triangulation has exposed,
the tests themselves (test code is code, it deserves the same rigour).

What you do not do during Refactor: add an abstraction “just in case,” introduce a pattern because it is elegant, generalise to a case the test does not cover. The rule is strict. No feature changes during Refactor. If you need to add behaviour, you go back to Red.
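
A small sketch of a behaviour-preserving refactor, using two hypothetical free functions instead of the Invoice class to keep the comparison short. The first is the kind of conditional a Green step might leave behind; the second is a possible tidy-up. The check compares them on the cases already covered by tests, plus the boundary where the discount equals the amount.

```python
def total_after_discount_green(amount_cents, discount_cents):
    """Shape left by successive Green steps (hypothetical)."""
    if discount_cents > amount_cents:
        return 0
    else:
        return amount_cents - discount_cents


def total_after_discount(amount_cents, discount_cents):
    """Refactored version: same behaviour, one expression."""
    return max(0, amount_cents - discount_cents)


def test_refactoring_preserves_behaviour():
    cases = [(5000, 6000), (5000, 2000), (5000, 5000)]
    for amount_cents, discount_cents in cases:
        before = total_after_discount_green(amount_cents, discount_cents)
        after = total_after_discount(amount_cents, discount_cents)
        assert before == after
```

In a real cycle you would not keep both versions side by side; the existing tests play the role of this comparison.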

Baby steps: step size matters

RGR loops must be short. Not by dogma, but because a long cycle mixes several decisions and drowns the signal when something goes wrong. If a test stays red for twenty minutes, you have already lost the most valuable property of TDD: knowing immediately which change broke what.

The practice of baby steps means cutting progress into tiny increments. Each test targets a single micro-decision. Each change to the production code is the minimum required to make the current test pass. You can literally move forward a few lines at a time.

The benefit is twofold. First, the cost of rolling back is trivial: undoing thirty seconds of work instead of thirty minutes. Second, cognitive pressure drops, because you only hold one test in mind at a time, not an entire “feature.”

The common objection is that you write “too many” tests. That confuses quantity and quality. Three tests that isolate three distinct cases are better than a single catch-all test that checks several things and becomes uninterpretable when it breaks.
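
As an illustration, here are three focused tests for the invoice discount. The Invoice implementation, the discounted_total helper, and the zero-discount case are assumptions added for the sketch. When one of these fails, its name alone says which decision broke.

```python
class Invoice:
    def __init__(self, amount_cents):
        self.amount_cents = amount_cents
        self.total_cents = amount_cents

    def apply_discount(self, discount_cents):
        # Assumption: the discount applies to the original amount.
        self.total_cents = max(0, self.amount_cents - discount_cents)


def discounted_total(amount_cents, discount_cents):
    """Hypothetical test helper: build, discount, read the total."""
    invoice = Invoice(amount_cents)
    invoice.apply_discount(discount_cents)
    return invoice.total_cents


def test_discount_above_amount_makes_invoice_free():
    assert discounted_total(5000, 6000) == 0


def test_partial_discount_is_subtracted():
    assert discounted_total(5000, 2000) == 3000


def test_zero_discount_leaves_total_unchanged():
    assert discounted_total(5000, 0) == 5000
```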

FIRST: the properties of a test that deserves the name

Not all tests are equal. Robert C. Martin summarised what distinguishes a usable test from a decorative one in five letters: FIRST.

Fast. A test must run in milliseconds. If the suite takes five minutes, nobody runs it between two changes, and the safety net is gone. The bulk of unit tests should avoid touching the database, the network, or the filesystem. Integration tests exist for those cases, but they are not the majority.
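
One common way to keep a unit test away from the database is an in-memory stand-in. The sketch below assumes a hypothetical repository interface (save, get) and a hypothetical record_payment function; none of these names come from a real library.

```python
class InMemoryInvoiceRepository:
    """Hypothetical stand-in for a database-backed repository."""

    def __init__(self):
        self._totals = {}

    def save(self, invoice_id, total_cents):
        self._totals[invoice_id] = total_cents

    def get(self, invoice_id):
        return self._totals[invoice_id]


def record_payment(repository, invoice_id, paid_cents):
    """Hypothetical domain function: reduce what is still owed."""
    remaining = max(0, repository.get(invoice_id) - paid_cents)
    repository.save(invoice_id, remaining)
    return remaining


def test_payment_reduces_remaining_total():
    # A plain dict behind the scenes: no connection, no disk, no network.
    repository = InMemoryInvoiceRepository()
    repository.save("INV-1", 5000)
    assert record_payment(repository, "INV-1", 2000) == 3000
    assert repository.get("INV-1") == 3000
```

Whether the real repository honours the same contract is a question for a separate, slower integration test.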

Independent. No test should depend on the execution order of another. If test_b only passes because test_a inserted a row in the database just before, isolation is broken. Running a single test at random should yield the same result as running the whole suite.
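
A minimal sketch of that isolation, assuming a hypothetical make_ledger helper: each test builds the state it needs instead of relying on what another test left behind, so shuffling the order changes nothing.

```python
import random


def make_ledger():
    """Hypothetical factory: every test gets its own fresh ledger."""
    return []


def test_adding_an_entry_stores_it():
    ledger = make_ledger()
    ledger.append(("INV-1", 5000))
    assert ledger == [("INV-1", 5000)]


def test_new_ledger_is_empty():
    # Would fail if it shared a module-level list with the test above.
    ledger = make_ledger()
    assert ledger == []


# Run the tests in a random order: the outcome does not depend on it.
tests = [test_adding_an_entry_stores_it, test_new_ledger_is_empty]
random.shuffle(tests)
for test in tests:
    test()
```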

Repeatable. The test produces the same result on every run, on any machine, in any order. A test that fails one time out of ten because of a poorly controlled datetime.now() or an asyncio.sleep() is not a test, it is a trap people end up ignoring and then disabling.
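
One way to tame the clock is to make the current date a parameter. The sketch below uses a hypothetical is_overdue helper; production code would call it without the today argument, while tests pass a fixed date.

```python
from datetime import date


def is_overdue(due_date, today=None):
    """Hypothetical helper: `today` is injectable so tests control time."""
    if today is None:
        today = date.today()
    return today > due_date


def test_invoice_is_overdue_the_day_after_its_due_date():
    assert is_overdue(date(2024, 1, 31), today=date(2024, 2, 1))


def test_invoice_is_not_overdue_on_its_due_date():
    assert not is_overdue(date(2024, 1, 31), today=date(2024, 1, 31))
```

These tests give the same result whatever day they run, which a version reading the system clock directly could not promise.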

Self-validating. The test can tell on its own whether it passed or failed, with no human reading of logs required to interpret the result. That means a clear assertion pointing to the real cause of the failure, not a print you have to decode.
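
The contrast in a few lines, with a hypothetical total_after_discount function. The first check prints a value and always "passes"; the second decides for itself and explains the failure in its message.

```python
def total_after_discount(amount_cents, discount_cents):
    """Hypothetical function under test."""
    return max(0, amount_cents - discount_cents)


def check_by_eye():
    # Cannot fail: someone has to read the output and judge it.
    print("total:", total_after_discount(5000, 6000))


def test_discount_above_amount_gives_zero():
    total = total_after_discount(5000, 6000)
    # The message is only shown when the assertion fails.
    assert total == 0, f"expected a free invoice, got {total} cents"
```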

Timely. The test is written just before the production code it drives, which is what the Red step enforces. Writing it that early also keeps it precise in its intent: one test, one behaviour. If a test function contains five heterogeneous assertions, its failure does not tell you what broke. Precision is also won in naming: test_invoice_with_discount_above_amount_becomes_free is a one-line spec. test_invoice_discount says nothing.

Those five properties are the criteria you should be able to check off for every test you add. A test that violates one is not a “to be improved later” test. It is a test that degrades the whole suite, because it introduces noise the other tests have to compensate for.

What TDD is not

A few widespread confusions to clear before closing this trunk.

TDD is not a coverage metric. You can have 100% coverage without doing TDD, and you can do TDD without watching coverage. Coverage measures lines traversed, not the relevance of assertions. A test that executes code without asserting anything counts towards coverage but protects nothing.
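
A sketch of the gap between coverage and protection, using a hypothetical function with a deliberate bug. Both checks execute every line of the function, so a coverage tool would count them the same; only one of them can fail.

```python
def total_after_discount(amount_cents, discount_cents):
    # Deliberate bug for the demonstration: the total can go negative.
    return amount_cents - discount_cents


def exercise_without_asserting():
    # Runs the code, counts towards coverage, and can never fail.
    total_after_discount(5000, 6000)


def exercise_and_assert():
    # Same lines covered, but the bug is caught.
    assert total_after_discount(5000, 6000) == 0


def fails(check):
    """Return True if the check raises AssertionError."""
    try:
        check()
    except AssertionError:
        return True
    return False
```

Here fails(exercise_without_asserting) is False and fails(exercise_and_assert) is True, even though both reach the same lines.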

TDD is not a guarantee against bugs. A test can only catch the behaviours it describes. Bugs often come from behaviours nobody thought to test, not from ones that were tested. TDD reduces certain classes of errors (regressions, drift between intention and implementation) and does not touch others (bad product design, poor understanding of the domain).

TDD is not necessarily slower. That belief comes from an accounting bias: you see the time spent writing tests, you do not see the time saved on debugging, PR reviews, and avoided regressions. Over a few weeks, TDD is often neutral or faster on non-trivial projects. Over a single isolated sprint, it can look slower, especially while learning.

TDD does not free you from thinking about design. The Red-Green-Refactor cycle does not invent architecture for you. It forces you to articulate an intention before implementing it, which helps, but it replaces neither the reflection on system boundaries nor the choice of right abstractions. That is precisely where the schools come in, distinguished by where they let those design decisions arise.

Up next: the schools

The cycle, baby steps, and FIRST are common to every TDD practice. Differences appear once you ask: where do you start? The core of the domain, from the simplest entities outward (inside-out, Chicago school)? The user boundary, from the outside in (outside-in, London school)? With an acceptance test as a guardrail (ATDD double loop)? Or by writing production code directly inside the test body (strict TDD)?

Each school answers those questions differently, and each answer has consequences on the final design, on the maintenance cost of tests, and on the kinds of bugs you catch. The next articles in the series will compare them one by one, on the same example, to make the trade-offs tangible rather than theoretical.
