Architecture Debugging Checklist
When you’re debugging a backend or architecture problem—especially performance—you often don’t have all the information you need.
This checklist helps you consider every layer of a typical web stack so you don’t miss a bottleneck. It applies to any kind of issue, including performance; the goal is to know the parts of a multi-tier architecture and to clearly identify where the problem might be.
Use it as a high-level guide: work through each layer, then combine with observability (metrics, logs, traces per layer) where you have it.
Once you suspect a layer, use the Optimization Quick Reference and Infrastructure Building Blocks for concrete fixes.
At each layer, ask: Slow? Blocked? Wrong? The “What to check” bullets under each layer are the concrete form of those three questions for that layer.
Frontend
What it is: The client—browser or app—that renders the UI and calls your APIs.
What to check
Slow?
- Render: Client-side rendering time, heavy scripts, layout thrashing
- Assets: Asset size and load order (JS/CSS/images); blocking requests
Blocked?
- API patterns: N+1, waterfall requests, missing caching
Wrong?
- Cache: Client-side caching and staleness (e.g. service worker, local storage)
How well do you know this part? Can you distinguish “slow first load” vs “slow API” vs “slow after interaction” using dev tools or RUM (real user monitoring)?
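The cost of the waterfall/N+1 API pattern above can be made concrete with a small simulation. This is an illustrative sketch, not real network code: `fetch` just sleeps to stand in for an API call, and the timings only show how serial vs concurrent request patterns compare.

```python
import asyncio
import time

async def fetch(path: str, latency: float = 0.05) -> str:
    """Stand-in for an API call; sleeps instead of hitting the network."""
    await asyncio.sleep(latency)
    return f"response:{path}"

async def waterfall(paths: list[str]) -> float:
    """Each request waits for the previous one (the N+1 / waterfall pattern)."""
    start = time.perf_counter()
    for p in paths:
        await fetch(p)
    return time.perf_counter() - start

async def batched(paths: list[str]) -> float:
    """All requests issued concurrently."""
    start = time.perf_counter()
    await asyncio.gather(*(fetch(p) for p in paths))
    return time.perf_counter() - start

async def main() -> tuple[float, float]:
    paths = [f"/api/items/{i}" for i in range(10)]
    return await waterfall(paths), await batched(paths)

serial, parallel = asyncio.run(main())
print(f"waterfall: {serial:.2f}s, batched: {parallel:.2f}s")
```

Ten 50 ms calls take roughly 500 ms in a waterfall but roughly 50 ms when batched; in a browser waterfall view this shows up as a staircase of sequential requests.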
Networking
What it is: DNS, load balancers, proxies, and the path requests take before they reach your app.
What to check
Slow?
- DNS: Resolution delays or failures
- TLS: Handshake cost; certificate or cipher issues
- Path: Geographic routing and latency to origin
Blocked?
- Load balancer: Health, timeouts, connection limits, backend pool saturation
- Proxy: Timeouts and connection pooling (e.g. keep-alive, max connections)
Wrong?
- TLS/config: Certificate or cipher misconfiguration
How well do you know this part? Can you trace a request from client to app and identify where time is spent in the network path?
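One way to see where network-path time goes is to split connection setup into DNS resolution and TCP connect. A minimal sketch using only the standard library, demoed against a local listener so it runs without external network access (in practice you would point it at your real host and port, or use tools like `dig` and `curl -w`):

```python
import socket
import time

def connect_breakdown(host: str, port: int) -> dict[str, float]:
    """Split connection setup into DNS resolution time and TCP connect time."""
    t0 = time.perf_counter()
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    dns_s = time.perf_counter() - t0

    family, socktype, proto, _, addr = infos[0]
    t1 = time.perf_counter()
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(5)
        s.connect(addr)
    connect_s = time.perf_counter() - t1
    return {"dns_s": dns_s, "tcp_connect_s": connect_s}

# Demo against an ephemeral local listener so the sketch is self-contained.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
timings = connect_breakdown("localhost", port)
listener.close()
print(timings)
```

A large `dns_s` points at resolution delays; a large `tcp_connect_s` points at routing distance or an overloaded endpoint. TLS handshake time would be a third bucket on top of this.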
Web Application Server
What it is: The app tier (e.g. Ruby on Rails, Django, Node, .NET, Spring) that handles requests and talks to data stores.
What to check
Slow?
- Routes: Request handling time per route or handler
- CPU/memory: Saturation; GC or event-loop stalls
- Startup: Cold starts, JIT, or initialization cost after deploys
Blocked?
- Threads/workers: Pool exhaustion; blocking I/O on the request path
- Connections: Pool usage to DB, cache, and downstream services
Wrong?
- Config/state: Issues that cause incorrect behavior (often show up as slow or blocked above)
How well do you know this part? Can you tell if slowness is in “our code,” in a library, or in a downstream call (DB, cache, API)?
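Answering “our code vs a downstream call” is easier when each request’s wall time is attributed to named spans. A toy sketch of that idea (real systems use tracing libraries; the `sleep` calls here are stand-ins for a database query and in-process work):

```python
import time
from contextlib import contextmanager

class RequestTimer:
    """Attribute a request's wall time to 'our code' vs named downstream calls."""

    def __init__(self) -> None:
        self.downstream: dict[str, float] = {}
        self._start = time.perf_counter()

    @contextmanager
    def span(self, name: str):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - t0
            self.downstream[name] = self.downstream.get(name, 0.0) + elapsed

    def report(self) -> dict[str, float]:
        total = time.perf_counter() - self._start
        own = total - sum(self.downstream.values())
        return {"total_s": total, "own_code_s": own,
                **{f"{k}_s": v for k, v in self.downstream.items()}}

# Hypothetical handler: one downstream DB call plus some in-process work.
timer = RequestTimer()
with timer.span("db"):
    time.sleep(0.05)   # stand-in for a database query
time.sleep(0.02)       # stand-in for our own request-handling code
report = timer.report()
print(report)
```

If `own_code_s` dominates, profile the app; if a downstream span dominates, move down the checklist to that layer.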
Database
What it is: The primary data store (e.g. PostgreSQL, MySQL, SQL Server) for transactional or core application data.
What to check
Slow?
- Indexes: Missing or unused indexes; full table scans
- Replication: Lag if reads go to replicas; primary saturation
Blocked?
- Connections: Pool exhaustion or connection leaks
- Locks: Lock contention
Wrong?
- Queries: Query shape (N+1, large result sets, unnecessary joins)
- Schema: Migration impact (locks, long-running DDL)
How well do you know this part? Can you read query plans and tie latency spikes to specific queries or connection pressure?
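Reading query plans is the fastest way to confirm a missing index. A self-contained sketch using Python’s built-in `sqlite3` (the table, column names, and query here are invented for illustration; the same idea applies to `EXPLAIN` in PostgreSQL or MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

def plan(sql: str) -> str:
    """Return SQLite's query plan as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)  # last column is the plan detail

query = "SELECT * FROM users WHERE email = 'user500@example.com'"
before = plan(query)   # no index on email: full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # same query now uses the index
print("before:", before)
print("after: ", after)
```

The plan flips from a SCAN (every row examined) to a SEARCH using the index; on a large table that is the difference between milliseconds and seconds.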
Key-Value / Document Store
What it is: Caches or document/KV stores (e.g. Redis, MongoDB) used for sessions, caching, or specialized access patterns.
What to check
Slow?
- Hit rate: Low hit rate or eviction; memory or key-size growth
- Serialization: Cost and large values; pipeline vs many round-trips
- Cluster: Topology (sharding, replicas); hot keys or partitions
Blocked?
- Connections: Pool or connection limits to the store
Wrong?
- Timeouts/consistency: Retries; consistency vs availability trade-offs (config/state)
How well do you know this part? Can you tell whether the store is the bottleneck (latency, errors, saturation) vs the app’s usage pattern?
Infrastructure
What it is: The compute and platform layer (e.g. EC2, Google Cloud, Azure)—VMs, containers, quotas, and shared resources.
What to check
Slow?
- Compute: CPU, memory, and disk I/O saturation; noisy neighbors
- Network: Throughput and packet loss within the region or AZ
Blocked?
- Quotas: Limits (API rate limits, disk, connections) and throttling
- Scaling: Triggers and cooldowns (e.g. autoscaling lag)
Wrong?
- Placement: Affinity (e.g. same AZ for app and DB); misconfigured placement
How well do you know this part? Can you distinguish “the box is full” from “the app is inefficient” using OS-level or cloud metrics?
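A first-pass check for “the box is full” is comparing load average to CPU count. A sketch using only the standard library (`os.getloadavg` is Unix-only; the 1.0-per-CPU threshold is a rough rule of thumb, not a hard limit):

```python
import os

def box_pressure() -> dict[str, float]:
    """Compare 1-minute load average to CPU count as a rough saturation signal."""
    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    return {
        "load_1m": load1,
        "cpus": float(cpus),
        "load_per_cpu": load1 / cpus,  # sustained > 1.0 suggests the box is saturated
    }

pressure = box_pressure()
print(pressure)
```

If `load_per_cpu` is low but the app is still slow, the problem is more likely inefficiency in the app or a downstream dependency than compute saturation; cloud metrics (CPU steal, disk I/O wait, network throughput) fill in the rest.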
How to Use This Checklist
- Before working through layers: Circumscribe the problem. Note what is working and what isn’t, and be precise about how the system is misbehaving. That keeps you from chasing the wrong layer.
- Simplify when you can: Create a minimal or simplified case that still reproduces the issue (e.g. one request type, smaller dataset). It makes hypothesis-testing, communication, and adding regression tests easier.
- When you have limited info: Work through the layers in order (frontend → networking → app → database → KV/store → infrastructure). At each layer, ask Slow? Blocked? Wrong? and use the “What to check” bullets under each layer as the concrete checks. Form a hypothesis (e.g. “DB pool exhausted” or “cache miss storm”), then use observability to confirm or rule it out; that is faster than blindly splitting the system in half.
- Combine with observability: Use Observability (SLIs, dashboards, tracing) to get signals per layer. The checklist tells you what to look for; observability tells you what you see.
- Once you suspect a layer: Use the System Design Checklist for vocabulary and components, and the Optimization Quick Reference for “this pain → this fix.” The Infrastructure Building Blocks doc explains why each type of infrastructure exists and when to use it.
Handoff: Writing a Useful Message to a Teammate
When you hand off a debugging task to someone else, a good message does three things:
- Gives context so they understand the issue and why it matters
- Summarizes what’s already known or ruled out so they don’t redo work
- Gives 1–3 concrete next steps and where to look (layer + observability)
Use this template when writing an email or Slack message:
**Subject:** [Brief symptom] — suspected [layer], next steps below
**What we know:** [1–2 sentences: who’s affected, symptom, any metrics/dashboards]
**What we’ve already checked / ruled out:** [e.g. "Frontend and network look fine; slowness starts at app tier."]
**Suggested next steps:**
1. [Concrete action, e.g. "Check app metrics for route X and DB pool usage."]
2. [Where to look, e.g. "Dashboard: …" or "Runbook: …"]
3. [Optional: "If that doesn’t show it, work through the checklist from [layer]."]
**Useful links:** [This checklist, relevant runbook, dashboard.]
Before You Close the Loop
After you find and fix the cause:
- Add a test (or scenario) that would have caught this bug so it doesn’t come back.
- Document the fix (runbook, postmortem, or code comment) so others benefit next time.
- Consider the bug class: Ask whether this is one instance of a broader failure mode and scan for similar cases.
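The “add a test” step can be as simple as pinning the behavior that regressed. A hypothetical example for an N+1 bug: instrument the query layer and assert the query count stays bounded (the function names and queries here are invented; in a real project this would hook your ORM’s query log inside your test framework):

```python
QUERY_LOG: list[str] = []

def run_query(sql: str) -> None:
    """Stand-in for the app's database layer, instrumented to log each query."""
    QUERY_LOG.append(sql)

def load_dashboard(user_ids: list[int]) -> None:
    # Fixed version: one batched query instead of one query per user.
    ids = ",".join(str(i) for i in user_ids)
    run_query(f"SELECT * FROM users WHERE id IN ({ids})")

def test_dashboard_query_count() -> None:
    """Regression guard: the N+1 bug would make the count scale with users."""
    QUERY_LOG.clear()
    load_dashboard(list(range(50)))
    assert len(QUERY_LOG) <= 2, f"expected a bounded query count, got {len(QUERY_LOG)}"

test_dashboard_query_count()
print("regression test passed")
```

If the N+1 pattern ever creeps back in, the count jumps from 1 to 50 and the test fails immediately, long before it shows up as production latency.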