A Systematic Debugging Method
A repeatable, calm method for solving problems you've never seen before.
Good troubleshooting is a method, not a talent. Panic and random changes make things worse; a calm loop makes any problem tractable:
- Reproduce — find a reliable way to trigger the issue.
- Observe — gather facts from logs, metrics, and error messages. Don’t guess.
- Hypothesize — form one testable theory about the cause.
- Test — change one thing, observe the result.
- Repeat — narrow down until you find the cause, then fix and verify.
The core discipline: change one variable at a time and read the actual error message. Most “mysteries” are stated plainly in a log you haven’t looked at yet.
Symptom: "App returns 502 since 14:00."
1. Reproduce: curl the endpoint → confirms 502 consistently.
2. Observe: nginx log shows "connection refused" to the backend.
3. Hypothesize: the backend process crashed.
4. Test: systemctl status app → "failed"; journalctl shows OOMKilled.
5. Fix: raise memory limit / fix the leak, restart, re-curl → 200. Verify. - Write down the 5 steps and apply them to a recent bug you fixed.
- Practice the “what changed?” question on a hypothetical sudden outage.
- Take any error message and find the single most informative line in it.
- Explain why changing one variable at a time matters.
Cheat Sheet▾
| Step | Question |
|---|---|
| Reproduce | Can I trigger it on demand? |
| Observe | What do the logs/metrics actually say? |
| Hypothesize | What single cause would explain this? |
| Test | What one change will confirm or deny it? |
| Verify | Is it truly fixed, and why did it happen? |
| Always ask | What changed recently? |
Common Interview Questions▾
Walk me through how you debug an unfamiliar production issue.
Reproduce it reliably, gather facts from logs and metrics, form one testable hypothesis, change a single variable to test it, and iterate until the root cause is found — then fix and verify. Ask early what changed recently.
Why change only one thing at a time?
So you can attribute the result to that change. Changing several variables at once makes it impossible to know which one mattered — and can mask or create new issues.