Architecture Debugging Checklist
When you’re debugging a backend or architecture problem—especially performance—you often don’t have all the information you need.
This checklist helps you consider every layer of a typical web stack so you don’t miss a bottleneck. It applies to any kind of issue, including performance; the goal is to know the parts of a multi-tier architecture and to clearly identify where the problem might be.
Use it as a high-level guide: work through each layer, then combine with observability (metrics, logs, traces per layer) where you have it.
Once you suspect a layer, use the Optimization Quick Reference and Infrastructure Building Blocks for concrete fixes.
At each layer, ask: Slow? Blocked? Wrong? The “What to check” bullets under each layer are the concrete form of those three questions for that layer.
Frontend
What it is: The client—browser or app—that renders the UI and calls your APIs.
What to check
Slow?
- Render: Client-side rendering time, heavy scripts, layout thrashing
- Assets: Asset size and load order (JS/CSS/images); blocking requests
Blocked?
- API patterns: N+1, waterfall requests, missing caching
Wrong?
- Cache: Client-side caching and staleness (e.g. service worker, local storage)
How well do you know this part? Can you distinguish “slow first load” vs “slow API” vs “slow after interaction” using dev tools or RUM (real user monitoring)?
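The cost of the waterfall/N+1 API pattern above can be made concrete with a small simulation. This is an illustrative sketch, not real network code: `fetch` just sleeps to stand in for an API call, and the timings only show how serial vs concurrent request patterns compare.

```python
import asyncio
import time

async def fetch(path: str, latency: float = 0.05) -> str:
    """Stand-in for an API call; sleeps instead of hitting the network."""
    await asyncio.sleep(latency)
    return f"response:{path}"

async def waterfall(paths: list[str]) -> float:
    """Each request waits for the previous one (the N+1 / waterfall pattern)."""
    start = time.perf_counter()
    for p in paths:
        await fetch(p)
    return time.perf_counter() - start

async def batched(paths: list[str]) -> float:
    """All requests issued concurrently."""
    start = time.perf_counter()
    await asyncio.gather(*(fetch(p) for p in paths))
    return time.perf_counter() - start

async def main() -> tuple[float, float]:
    paths = [f"/api/items/{i}" for i in range(10)]
    return await waterfall(paths), await batched(paths)

serial, parallel = asyncio.run(main())
print(f"waterfall: {serial:.2f}s, batched: {parallel:.2f}s")
```

Ten 50 ms calls take roughly 500 ms in a waterfall but roughly 50 ms when batched; in a browser waterfall view this shows up as a staircase of sequential requests.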
Networking
What it is: DNS, load balancers, proxies, and the path requests take before they reach your app.
What to check
Slow?
- DNS: Resolution delays or failures
- TLS: Handshake cost; certificate or cipher issues
- Path: Geographic routing and latency to origin
Blocked?
- Load balancer: Health, timeouts, connection limits, backend pool saturation
- Proxy: Timeouts and connection pooling (e.g. keep-alive, max connections)
Wrong?
- TLS/config: Certificate or cipher misconfiguration
How well do you know this part? Can you trace a request from client to app and identify where time is spent in the network path?
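One way to see where network-path time goes is to split connection setup into DNS resolution and TCP connect. A minimal sketch using only the standard library, demoed against a local listener so it runs without external network access (in practice you would point it at your real host and port, or use tools like `dig` and `curl -w`):

```python
import socket
import time

def connect_breakdown(host: str, port: int) -> dict[str, float]:
    """Split connection setup into DNS resolution time and TCP connect time."""
    t0 = time.perf_counter()
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    dns_s = time.perf_counter() - t0

    family, socktype, proto, _, addr = infos[0]
    t1 = time.perf_counter()
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(5)
        s.connect(addr)
    connect_s = time.perf_counter() - t1
    return {"dns_s": dns_s, "tcp_connect_s": connect_s}

# Demo against an ephemeral local listener so the sketch is self-contained.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
timings = connect_breakdown("localhost", port)
listener.close()
print(timings)
```

A large `dns_s` points at resolution delays; a large `tcp_connect_s` points at routing distance or an overloaded endpoint. TLS handshake time would be a third bucket on top of this.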
Web Application Server
What it is: The app tier (e.g. Ruby on Rails, Django, Node, .NET, Spring) that handles requests and talks to data stores.
What to check
Slow?
- Routes: Request handling time per route or handler
- CPU/memory: Saturation; GC or event-loop stalls
- Startup: Cold starts, JIT, or initialization cost after deploys
Blocked?
- Threads/workers: Pool exhaustion; blocking I/O on the request path
- Connections: Pool usage to DB, cache, and downstream services
Wrong?
- Config/state: Issues that cause incorrect behavior (often show up as slow or blocked above)
How well do you know this part? Can you tell if slowness is in “our code,” in a library, or in a downstream call (DB, cache, API)?
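Answering “our code vs a downstream call” is easier when each request’s wall time is attributed to named spans. A toy sketch of that idea (real systems use tracing libraries; the `sleep` calls here are stand-ins for a database query and in-process work):

```python
import time
from contextlib import contextmanager

class RequestTimer:
    """Attribute a request's wall time to 'our code' vs named downstream calls."""

    def __init__(self) -> None:
        self.downstream: dict[str, float] = {}
        self._start = time.perf_counter()

    @contextmanager
    def span(self, name: str):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - t0
            self.downstream[name] = self.downstream.get(name, 0.0) + elapsed

    def report(self) -> dict[str, float]:
        total = time.perf_counter() - self._start
        own = total - sum(self.downstream.values())
        return {"total_s": total, "own_code_s": own,
                **{f"{k}_s": v for k, v in self.downstream.items()}}

# Hypothetical handler: one downstream DB call plus some in-process work.
timer = RequestTimer()
with timer.span("db"):
    time.sleep(0.05)   # stand-in for a database query
time.sleep(0.02)       # stand-in for our own request-handling code
report = timer.report()
print(report)
```

If `own_code_s` dominates, profile the app; if a downstream span dominates, move down the checklist to that layer.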
Database
What it is: The primary data store (e.g. PostgreSQL, MySQL, SQL Server) for transactional or core application data.
What to check
Slow?
- Indexes: Missing or unused indexes; full table scans
- Replication: Lag if reads go to replicas; primary saturation
Blocked?
- Connections: Pool exhaustion or connection leaks
- Locks: Lock contention
Wrong?
- Queries: Query shape (N+1, large result sets, unnecessary joins)
- Schema: Migration impact (locks, long-running DDL)
How well do you know this part? Can you read query plans and tie latency spikes to specific queries or connection pressure?
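Reading query plans is the fastest way to confirm a missing index. A self-contained sketch using Python’s built-in `sqlite3` (the table, column names, and query here are invented for illustration; the same idea applies to `EXPLAIN` in PostgreSQL or MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

def plan(sql: str) -> str:
    """Return SQLite's query plan as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)  # last column is the plan detail

query = "SELECT * FROM users WHERE email = 'user500@example.com'"
before = plan(query)   # no index on email: full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # same query now uses the index
print("before:", before)
print("after: ", after)
```

The plan flips from a SCAN (every row examined) to a SEARCH using the index; on a large table that is the difference between milliseconds and seconds.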
Key-Value / Document Store
What it is: Caches or document/KV stores (e.g. Redis, MongoDB) used for sessions, caching, or specialized access patterns.
What to check
Slow?
- Hit rate: Low hit rate or eviction; memory or key-size growth
- Serialization: Cost and large values; pipeline vs many round-trips
- Cluster: Topology (sharding, replicas); hot keys or partitions
Blocked?
- Connections: Pool or connection limits to the store
Wrong?
- Timeouts/consistency: Retries; consistency vs availability trade-offs (config/state)
How well do you know this part? Can you tell whether the store is the bottleneck (latency, errors, saturation) vs the app’s usage pattern?
Infrastructure
What it is: The compute and platform layer (e.g. EC2, Google Cloud, Azure)—VMs, containers, quotas, and shared resources.
What to check
Slow?
- Compute: CPU, memory, and disk I/O saturation; noisy neighbors
- Network: Throughput and packet loss within the region or AZ
Blocked?
- Quotas: Limits (API rate limits, disk, connections) and throttling
- Scaling: Triggers and cooldowns (e.g. autoscaling lag)
Wrong?
- Placement: Affinity (e.g. same AZ for app and DB); misconfigured placement
How well do you know this part? Can you distinguish “the box is full” from “the app is inefficient” using OS-level or cloud metrics?
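A first-pass check for “the box is full” is comparing load average to CPU count. A sketch using only the standard library (`os.getloadavg` is Unix-only; the 1.0-per-CPU threshold is a rough rule of thumb, not a hard limit):

```python
import os

def box_pressure() -> dict[str, float]:
    """Compare 1-minute load average to CPU count as a rough saturation signal."""
    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    return {
        "load_1m": load1,
        "cpus": float(cpus),
        "load_per_cpu": load1 / cpus,  # sustained > 1.0 suggests the box is saturated
    }

pressure = box_pressure()
print(pressure)
```

If `load_per_cpu` is low but the app is still slow, the problem is more likely inefficiency in the app or a downstream dependency than compute saturation; cloud metrics (CPU steal, disk I/O wait, network throughput) fill in the rest.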
How to Use This Checklist
- Before working through layers: Circumscribe the problem. Note what is working and what isn’t, and be precise about how the system is misbehaving. That keeps you from chasing the wrong layer.
- Simplify when you can: Create a minimal or simplified case that still reproduces the issue (e.g. one request type, smaller dataset). It makes hypothesis-testing, communication, and adding regression tests easier.
- When you have limited info: Work through the layers in order (frontend → networking → app → database → KV/store → infrastructure). At each layer, ask Slow? Blocked? Wrong? and use the “What to check” bullets under each layer as the concrete checks. Form a hypothesis (e.g. “DB pool exhausted” or “cache miss storm”), then use observability to confirm or rule it out; that is faster than blindly splitting the system in half.
- Combine with observability: Use Observability (SLIs, dashboards, tracing) to get signals per layer. The checklist tells you what to look for; observability tells you what you see.
- Once you suspect a layer: Use the System Design Checklist for vocabulary and components, and the Optimization Quick Reference for “this pain → this fix.” The Infrastructure Building Blocks doc explains why each type of infrastructure exists and when to use it.
Handoff: Writing a Useful Message to a Teammate
When you hand off a debugging task to someone else, a good message does three things:
- Gives context so they understand the issue and why it matters
- Summarizes what’s already known or ruled out so they don’t redo work
- Gives 1–3 concrete next steps and where to look (layer + observability)
Use this template when writing an email or Slack message:
**Subject:** [Brief symptom] — suspected [layer], next steps below
**What we know:** [1–2 sentences: who’s affected, symptom, any metrics/dashboards]
**What we’ve already checked / ruled out:** [e.g. "Frontend and network look fine; slowness starts at app tier."]
**Suggested next steps:**
1. [Concrete action, e.g. "Check app metrics for route X and DB pool usage."]
2. [Where to look, e.g. "Dashboard: …" or "Runbook: …"]
3. [Optional: "If that doesn’t show it, work through the checklist from [layer]."]
**Useful links:** [This checklist, relevant runbook, dashboard.]
Before You Close the Loop
After you find and fix the cause:
- Add a test (or scenario) that would have caught this bug so it doesn’t come back.
- Document the fix (runbook, postmortem, or code comment) so others benefit next time.
- Consider the bug class: Ask whether this is one instance of a broader failure mode and scan for similar cases.
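The “add a test” step can be as simple as pinning the behavior that regressed. A hypothetical example for an N+1 bug: instrument the query layer and assert the query count stays bounded (the function names and queries here are invented; in a real project this would hook your ORM’s query log inside your test framework):

```python
QUERY_LOG: list[str] = []

def run_query(sql: str) -> None:
    """Stand-in for the app's database layer, instrumented to log each query."""
    QUERY_LOG.append(sql)

def load_dashboard(user_ids: list[int]) -> None:
    # Fixed version: one batched query instead of one query per user.
    ids = ",".join(str(i) for i in user_ids)
    run_query(f"SELECT * FROM users WHERE id IN ({ids})")

def test_dashboard_query_count() -> None:
    """Regression guard: the N+1 bug would make the count scale with users."""
    QUERY_LOG.clear()
    load_dashboard(list(range(50)))
    assert len(QUERY_LOG) <= 2, f"expected a bounded query count, got {len(QUERY_LOG)}"

test_dashboard_query_count()
print("regression test passed")
```

If the N+1 pattern ever creeps back in, the count jumps from 1 to 50 and the test fails immediately, long before it shows up as production latency.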