Scaling Notes

14.10.1 API tier (stateless)

  • Run 3+ replicas behind the LB.
  • HPA target: CPU 60%, memory 75%.
  • Steady-state CPU per pod for ~100 RPS: ~250–500 m.
  • Memory ~256–512 Mi per pod under steady load.
  • Connection-pool size: min(50, n_postgres_connections / n_pods) — see pgxpool.Config.MaxConns.
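The pool-size rule above can be sketched as a small helper. The function name and parameters are illustrative, not part of the codebase; the computed value is what you would assign to pgxpool.Config.MaxConns at startup.

```go
package main

import "fmt"

// poolSize applies the rule min(50, n_postgres_connections / n_pods):
// divide the Postgres max_connections budget evenly across API pods,
// but never let a single pod hold more than 50 connections.
func poolSize(maxPostgresConns, nPods int) int {
	perPod := maxPostgresConns / nPods
	if perPod > 50 {
		return 50
	}
	return perPod
}

func main() {
	// e.g. Postgres max_connections=200 shared by 3 API pods:
	// 200/3 = 66, capped at 50.
	fmt.Println(poolSize(200, 3)) // prints 50
}
```

Capping per-pod connections keeps an HPA scale-out from exhausting max_connections before a pooler is in place.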

14.10.2 Postgres

  • Vertical first: start with 4 vCPU / 16 GB RAM, increase before sharding.
  • Read replicas: for issuer-analytics + dashboard reads, deploy 1–2 read replicas; route via DSN logic in pkg/repository.
  • Connection pooler: PgBouncer in transaction-pooling mode if total clients × MaxConns > Postgres max_connections.
  • Indexes: the 002_* migrations cover the hot paths; profile new queries with EXPLAIN ANALYZE before adding.
  • Vacuum: autovacuum on; tune autovacuum_vacuum_scale_factor=0.05 for the large audit_events and credentials tables.
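The read-replica routing above can be sketched as a round-robin DSN picker. This is a sketch of the idea only; the type and method names are hypothetical and do not reflect the actual pkg/repository API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dsnRouter sends writes to the primary and round-robins reads across
// replicas, falling back to the primary when none are deployed.
type dsnRouter struct {
	primary  string
	replicas []string
	next     atomic.Uint64
}

func (r *dsnRouter) writeDSN() string { return r.primary }

func (r *dsnRouter) readDSN() string {
	if len(r.replicas) == 0 {
		return r.primary // no replicas deployed: fall back to primary
	}
	i := r.next.Add(1)
	return r.replicas[int(i)%len(r.replicas)]
}

func main() {
	r := &dsnRouter{
		primary: "postgres://primary:5432/ida",
		replicas: []string{
			"postgres://replica-1:5432/ida",
			"postgres://replica-2:5432/ida",
		},
	}
	fmt.Println(r.writeDSN())
	fmt.Println(r.readDSN(), r.readDSN()) // alternates between the two replicas
}
```

Only analytics and dashboard queries should take the read path; anything that reads its own recent writes must stay on the primary to avoid replication-lag anomalies.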

14.10.3 Redis

  • A single instance is fine for OTP, rate-limit, and session workloads up to ~5k RPS.
  • Replicated (1 primary + 2 replicas) for HA — failover via Sentinel or managed (ElastiCache).
  • Cluster mode only if memory > 100 GB or > 50k RPS — IDA’s working set is small (TTL-bounded keys).
  • Eviction: allkeys-lru (the compose default); safe because all keys have natural TTLs, except rate-limit buckets, which clean themselves up.

14.10.4 Blockchain RPC

  • Self-hosted node preferred for hot workloads — avoids per-call quota.
  • RPC pool: maintain 2–3 RPC URLs and round-robin in the chain client; failover on 5xx or RPC error.
  • WebSocket subscription for events (DIDRegistry, RevocationRegistry) to keep off-chain cache fresh.
  • Gas budgeting: monitor eth_gasPrice; alert if it exceeds 2× the baseline. Bulk operations should retry with an EIP-1559 fee bump.

14.10.5 Capacity planning checklist

| Question | Where to find the answer |
| --- | --- |
| What is the peak request rate? | Prometheus http_requests_total, 95th percentile / 1 min |
| What is the worst-case latency? | http_request_duration_seconds p99 |
| Is the DB the bottleneck? | pg_stat_statements, slow-query log |
| Is the chain the bottleneck? | RPC error rate + tx confirmation time dashboard |
| Can the API scale linearly? | HPA replica count vs. RPS; if RPS does not scale linearly, find the shared lock |