Cross-cutting performance rules for all languages and runtimes. Extends root rule #10: "Performance from the outset."
- The best time for 1000x wins is the design phase. Once the architecture is set, you're fighting for 10% gains.
- Before writing code, sketch resource usage against 4 resources x 2 characteristics:
| Resource | Bandwidth | Latency |
|---|---|---|
| Network | How much data? | How many round trips? |
| Disk | How much I/O? | Random vs sequential? |
| Memory | Working set size? | Allocation rate? |
| CPU | Total instructions? | Cache friendliness? |
- Choose algorithms and data structures that minimize the dominant resource. A single design decision (batching, caching, compression) often matters more than all micro-optimizations combined.
- Question every network call. Question every disk write. These are orders of magnitude slower than memory and CPU.
- Optimize for the slowest resource first: network > disk > memory > CPU.
- One eliminated network round trip beats any amount of CPU optimization. One avoided disk seek beats any memory optimization.
- Adjust for frequency: a cheap operation called 10M times may dominate an expensive operation called once.
- Measure to confirm which resource is the actual bottleneck. Intuition is unreliable.
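As a back-of-the-envelope illustration of adjusting for frequency, the sketch below multiplies a unit cost by a call count and reports which operation dominates. The latency figures are rough, assumed orders of magnitude chosen for illustration, not measurements, and the names `ASSUMED_COST_NS`, `total_cost_ns`, and `dominant` are hypothetical.

```python
# Assumed, order-of-magnitude unit costs in nanoseconds (illustrative only).
ASSUMED_COST_NS = {
    "memory_access": 100,
    "ssd_random_read": 100_000,
    "datacenter_round_trip": 500_000,
}


def total_cost_ns(operation: str, calls: int) -> int:
    """Total time attributed to one operation: unit cost times call count."""
    return ASSUMED_COST_NS[operation] * calls


def dominant(workload: dict[str, int]) -> str:
    """Return the operation that accounts for the most total time."""
    return max(workload, key=lambda op: total_cost_ns(op, workload[op]))


# 10M cheap memory accesses versus a single network round trip.
workload = {"memory_access": 10_000_000, "datacenter_round_trip": 1}
print(dominant(workload))
```

Under these assumed numbers the cheap operation dominates because of its call count. A real decision should still rest on measurement, as the rules above state.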
- Amortize costs. N+1 queries are bugs, not performance issues to "optimize later."
- Batch database queries, API calls, file operations, message publishes. One call with N items, not N calls with 1 item.
// Bug -- N+1: one query per hotel
for hotel in hotels:
    rooms = db.query("SELECT * FROM rooms WHERE hotel_id = ?", hotel.id)
// Fixed -- single batch query (how a list binds to IN (?) depends on the driver)
rooms = db.query("SELECT * FROM rooms WHERE hotel_id IN (?)", hotel_ids)
- When batching isn't possible natively, accumulate and flush: collect items, process in chunks.
- Set batch size limits. Unbounded batches become memory problems.
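One way to sketch "accumulate and flush" with a bounded batch size is shown below. `chunked` and `flush_in_batches` are hypothetical helper names, and `send_batch` stands in for whatever bulk call the target system offers (a bulk insert, a multi-publish, and so on).

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], max_batch: int) -> Iterator[List[T]]:
    """Yield lists of at most max_batch items so memory use stays bounded."""
    if max_batch <= 0:
        raise ValueError("max_batch must be positive")
    it = iter(items)
    while True:
        batch = list(islice(it, max_batch))
        if not batch:
            return
        yield batch


def flush_in_batches(
    items: Iterable[T], max_batch: int, send_batch: Callable[[List[T]], None]
) -> int:
    """Send items in bounded batches; return the number of calls made."""
    calls = 0
    for batch in chunked(items, max_batch):
        send_batch(batch)  # one call carrying up to max_batch items
        calls += 1
    return calls
```

With ten items and `max_batch=4`, this makes three calls instead of ten, and no batch ever holds more than four items.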
- Reuse expensive objects. These are singletons or long-lived pooled instances, never per-request allocations:
- `ObjectMapper` / JSON serializers
- `HttpClient` / HTTP connection pools
- Database connection pools
- Thread pools / executor services
- Compiled regex patterns
- SSL contexts
// Kotlin -- shared, configured once
companion object {
    val mapper: ObjectMapper = jacksonObjectMapper().apply { ... }
    val httpClient: OkHttpClient = OkHttpClient.Builder().build()
}
// Banned -- per-request allocation
fun handleRequest(req: Request): Response {
    val mapper = ObjectMapper() // NO
    val client = OkHttpClient() // NO
}

// Go -- package-level singletons
import (
	"net/http"
	"regexp"
	"time"
)

var (
	httpClient = &http.Client{Timeout: 5 * time.Second}
	isoRegex   = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`)
)
// Banned -- recompiling on every call
func parse(s string) bool {
	re := regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`) // NO
	return re.MatchString(s)
}

# Python -- module-level singletons (constructed once at import)
import re
import requests

_session = requests.Session()
_iso_re = re.compile(r"^\d{4}-\d{2}-\d{2}$")
# Banned -- per-call construction
def fetch(url: str) -> requests.Response:
    return requests.Session().get(url)  # NO -- builds a new pool each call

- Understand where allocations happen. Every allocation is future GC pressure.
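- Prefer stack over heap where the language permits (Go values that do not escape the function, Zig stack-allocated buffers, JVM escape analysis, which can eliminate heap allocations for objects that never leave a method).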
- Avoid unnecessary boxing: `int` not `Integer`, `long` not `Long` on hot paths (JVM). Use primitive-specialized collections where available.
- Reuse buffers for I/O. Pre-allocate collections when the size is known: `ArrayList(expectedSize)`, `make([]T, 0, expectedCap)`.
- Watch for hidden allocations: varargs create arrays, string concatenation creates intermediate strings, lambda captures may allocate.
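Buffer reuse for I/O could look like the following sketch, which reads a stream through one pre-allocated `bytearray` rather than allocating a new `bytes` object per read. The function name `crc32_stream` and the default chunk size are assumptions made for illustration.

```python
import io
import zlib


def crc32_stream(stream: io.BufferedIOBase, chunk_size: int = 64 * 1024) -> int:
    """Checksum a stream while reusing a single read buffer."""
    buf = bytearray(chunk_size)  # allocated once, reused for every read
    view = memoryview(buf)       # slicing a memoryview does not copy bytes
    crc = 0
    while True:
        n = stream.readinto(view)  # fills the existing buffer in place
        if not n:
            break
        crc = zlib.crc32(view[:n], crc)
    return crc
```

The same shape applies to any per-chunk processing: the buffer is sized once up front and each iteration works on a slice of it.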
- Never optimize without measuring first. Intuition about performance is wrong more often than right.
- Profile before optimizing. Measure after optimizing. If you can't measure the improvement, it didn't happen.
- Tools by platform:
- JVM: async-profiler (flamegraphs), JFR (allocation + GC), VisualVM
- Go: `pprof` (CPU + memory + goroutine), `runtime/trace`
- Python: `py-spy` (sampling, no instrumentation), `cProfile`, `memray` (allocation tracking), `scalene` (CPU + memory)
- Node: `--prof`, Chrome DevTools profiler, `clinic.js`
- General: flamegraphs, heap dumps, GC logs
- Profile in conditions that approximate production. Profiling in dev with toy data proves nothing.
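As a minimal sketch of "profile before optimizing", the snippet below wraps a call in the standard-library `cProfile` and returns the top entries by cumulative time. `slow_sum` is a hypothetical stand-in workload and `profile_call` is a hypothetical helper. A sampling profiler such as `py-spy` is usually a better fit for a running service.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Stand-in workload; replace with the code path under investigation."""
    return sum(i * i for i in range(n))


def profile_call(func, *args) -> str:
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()


report = profile_call(slow_sum, 200_000)
print(report)
```

Running the same helper before and after a change gives a like-for-like comparison, provided the input is representative of production data.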
- Use proper benchmarking tools. Microbenchmarks without proper tooling produce noise, not data.
- JVM: JMH (handles JIT warmup, dead code elimination, loop optimization)
- Go: `testing.B` (handles iteration count, timer reset)
- Python: `pytest-benchmark`, `timeit` for inline microbench; `asv` for tracking over time
- JS: `Benchmark.js` or `mitata`
- Warm up the JIT before measuring (JVM, V8). Run enough iterations to reach steady state.
- Report percentiles (p50, p95, p99), not averages. Averages hide tail latency.
- Control for variance: run multiple times, report standard deviation. Reject results with high variance.
- Benchmark the thing that matters: end-to-end latency, throughput, allocation rate -- not isolated function call time.
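The reporting rules above (warm up, take many samples, report percentiles and spread) can be sketched with the standard library alone. `measure_ns` and `summarize` are hypothetical helper names and the default warmup and run counts are arbitrary. A dedicated tool such as `pytest-benchmark` handles more sources of noise than this does.

```python
import statistics
import time


def measure_ns(func, warmup: int = 100, runs: int = 1000) -> list[int]:
    """Call func repeatedly and return per-call durations in nanoseconds."""
    for _ in range(warmup):  # discard early iterations (caches, lazy setup)
        func()
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples) -> dict:
    """Report tail percentiles and standard deviation rather than the mean."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "stdev": statistics.stdev(samples),
    }
```

A call such as `summarize(measure_ns(lambda: sorted(range(1000))))` yields a dictionary that can be logged or compared across runs. A large `stdev` relative to `p50` suggests the run should be repeated.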
- Identify hot paths through profiling. Optimize them relentlessly. Ignore cold paths.
- Keep hot paths allocation-free where possible. Pre-allocate. Reuse. Pool.
- Move validation, logging, and debug checks out of hot paths (or guard them):
// Logging guarded on hot path
if (logger.isDebugEnabled) {
    logger.debug("Processing item: {}", item)
}

- Pre-compute what you can. Lookup tables over runtime calculation. Compiled patterns over runtime compilation.
- Avoid virtual dispatch on hot paths where the language allows. Concrete types, not interfaces, on the critical path.
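"Lookup tables over runtime calculation" can be illustrated with a table built once at import time and consulted on the hot path. The example, counting set bits per byte, and the names `_POPCOUNT` and `popcount` are hypothetical. Whether a table actually beats recomputation in a given runtime is something to confirm by measuring.

```python
# Built once at import: number of set bits for every possible byte value.
_POPCOUNT = bytes(bin(i).count("1") for i in range(256))


def popcount(data: bytes) -> int:
    """Count set bits in data using the precomputed table."""
    table = _POPCOUNT  # local alias avoids a global lookup on each iteration
    return sum(table[b] for b in data)
```

The per-byte work on the hot path is reduced to one index operation. The string formatting and counting happen only 256 times, at startup.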
- Cache expensive computations and remote call results. Caching is the highest-leverage performance tool after design.
- Always bounded. Use LRU, LFU, or size-bounded caches. Unbounded caches are memory leaks.
- Always with TTL. Prefer TTL-based expiry over event-based invalidation. Invalidation is a distributed systems problem; TTL is a clock read.
- Set cache sizes explicitly based on expected working set. Monitor hit rates. A cache with a low hit rate is wasted memory.
- Cache immutable data aggressively. Cache mutable data cautiously with short TTLs.
- Never cache errors long-term. Cache negative results briefly (to prevent thundering herds) or not at all.
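A cache that is both size-bounded (LRU eviction) and TTL-bounded can be sketched as below. `TTLCache` is a hypothetical class written for illustration, it is not thread-safe, and the injectable `clock` parameter exists only so that expiry can be tested deterministically.

```python
import time
from collections import OrderedDict


class TTLCache:
    """LRU cache with a hard entry limit and a per-entry time to live."""

    def __init__(self, max_entries: int, ttl_seconds: float, clock=time.monotonic):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:  # expired: drop it and report a miss
            del self._data[key]
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:  # evict least recently used
            self._data.popitem(last=False)
```

Expiry here is a single clock read per lookup and needs no invalidation messages. Hit-rate counters, which the rules above call for, are left out to keep the sketch short.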
- Reuse connections for HTTP, database, and any persistent protocol. Connection establishment is expensive (TCP handshake, TLS negotiation, auth).
- Set pool sizes explicitly. Never rely on library defaults -- they are tuned for "works," not for your workload.
- Set idle timeouts to reclaim unused connections. Set max lifetime to rotate connections and avoid stale server-side state.
- Monitor pool utilization: if the pool is always exhausted, increase it or reduce hold time. If it's always idle, shrink it.
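The pool rules above (explicit size, maximum lifetime, visible exhaustion) might be sketched as follows. `BoundedPool` is a hypothetical class, not a replacement for a production pool: it omits idle timeouts, health checks, and metrics, and it uses `sqlite3` in the usage example only because it ships with Python.

```python
import queue
import sqlite3
import time
from contextlib import contextmanager


class BoundedPool:
    """Fixed-size pool that rotates connections older than max_lifetime_s."""

    def __init__(self, factory, size: int, max_lifetime_s: float,
                 acquire_timeout_s: float = 1.0):
        self._factory = factory
        self._max_lifetime_s = max_lifetime_s
        self._timeout = acquire_timeout_s
        self._slots = queue.Queue(maxsize=size)
        for _ in range(size):
            self._slots.put(None)  # empty slot; connection is created on first use

    @contextmanager
    def connection(self):
        # Raises queue.Empty if every connection stays busy for the whole timeout.
        slot = self._slots.get(timeout=self._timeout)
        try:
            if slot is not None and time.monotonic() - slot[0] > self._max_lifetime_s:
                slot[1].close()  # too old: rotate it
                slot = None
            if slot is None:
                slot = (time.monotonic(), self._factory())
            yield slot[1]
        finally:
            self._slots.put(slot)  # always hand the slot back


pool = BoundedPool(lambda: sqlite3.connect(":memory:"), size=2, max_lifetime_s=300)
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())
```

Because acquisition blocks and then fails with `queue.Empty`, an undersized pool shows up as a timeout that can be counted and alerted on, which is the signal the monitoring rule relies on.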
- Lazy initialization for expensive resources that may not be needed on every code path. Avoid paying the cost until first use.
- Eager initialization for resources needed on every request. Pay the cost once at startup, not on the first request (which adds latency to a real user).
- Lazy is not free -- it adds synchronization cost and complexity. Use it deliberately, not by default.
- On the JVM, use `lazy {}` (Kotlin) or `Suppliers.memoize()` (Guava) for thread-safe lazy init. Avoid double-checked locking by hand.
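For runtimes without a built-in equivalent of those JVM helpers, a thread-safe lazy holder can be sketched with a plain lock. `Lazy` is a hypothetical class. It deliberately takes the lock on every access instead of attempting double-checked locking, which makes the synchronization cost mentioned above explicit.

```python
import threading


class Lazy:
    """Build a value on first access; later accesses return the same value."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._ready = False
        self._value = None

    def get(self):
        with self._lock:  # simple and correct; costs one lock acquire per call
            if not self._ready:
                self._value = self._factory()
                self._ready = True
            return self._value
```

If the factory raises, `_ready` stays false and the next caller retries. Whether that is the desired behavior depends on the resource.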
- Avoid string concatenation in loops. Use `StringBuilder` (JVM), `strings.Builder` (Go), `"".join(parts)` (Python), `Array.prototype.join` or template literals (JS), `fmt.Sprintf` (Go, for complex formatting).
// Good
val sb = StringBuilder(estimatedSize)
for (item in items) {
    sb.append(item.name).append(", ")
}
// Bad -- O(n^2) allocation
var result = ""
for (item in items) {
    result += item.name + ", "
}

- Intern frequently-compared strings on the JVM when the set is bounded and known.
- Prefer `toByteArray()` / byte buffers over string manipulation for binary protocols.
- Use `contentEquals` / `regionMatches` for partial comparisons instead of creating substrings.
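The same two ideas, joining instead of concatenating in a loop and comparing a region without creating a substring, have rough Python analogues, sketched here. `join_names` and `has_token_at` are hypothetical helpers. `str.startswith` with a start offset plays the role that `regionMatches` plays on the JVM.

```python
def join_names(names) -> str:
    """Build the result in one pass instead of growing a string in a loop."""
    return ", ".join(names)


def has_token_at(text: str, token: str, offset: int) -> bool:
    """Compare a region of text in place; no intermediate substring is created."""
    return text.startswith(token, offset)
```

For example, `has_token_at("GET /rooms", "/rooms", 4)` checks the path portion of the line without slicing it out first.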