Architecture Decisions

Leveraging AI for research and RFC generation while keeping humans in the decision-making loop

Real Case: Notification System Design at Client A

Step 1: Research with AI

You: "Compare approaches for a notification system:

Requirements:
- Multiple channels (email, SMS, in-app)
- 10k+ daily notifications
- User preferences
- Retry logic
- Priority levels

Options:
1. Event-driven with message queue
2. Polling + cron jobs
3. Webhooks + callbacks

Analyze: scalability, complexity, cost, reliability"

The AI returns a detailed comparison of the three options.

Step 2: Generate RFC with AI

You: "Based on option 1 (event-driven), draft an RFC:

Include:
- Architecture diagram (mermaid)
- Component breakdown
- Data flow
- Scalability plan
- Failure modes
- Migration strategy

Follow format from @docs/rfcs/template.md"
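To make option 1 concrete before the RFC is reviewed, here is a minimal in-memory sketch of an event-driven notification queue that covers the listed requirements: channels, user preferences, retry logic, and priority levels. Every name in it (Notification, NotificationQueue, the sender callables) is hypothetical, and the heap stands in for a real message broker. Retries here are immediate; a production design would likely add backoff delays and persistence.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Notification:
    priority: int                         # lower number = more urgent
    seq: int                              # tie-breaker: FIFO within a priority
    user_id: str = field(compare=False)
    channel: str = field(compare=False)   # "email", "sms", or "in_app"
    body: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


class NotificationQueue:
    """In-memory stand-in for a broker-backed queue (hypothetical design)."""

    def __init__(self, senders, preferences, max_attempts=3):
        self._heap = []
        self._seq = itertools.count()
        self._senders = senders            # channel name -> callable(note) -> bool
        self._preferences = preferences    # user_id -> set of allowed channels
        self._max_attempts = max_attempts
        self.dead_letters = []             # notifications that exhausted retries

    def publish(self, user_id, channel, body, priority=5):
        # User preferences: drop the event if the user opted out of the channel.
        if channel not in self._preferences.get(user_id, set()):
            return False
        note = Notification(priority, next(self._seq), user_id, channel, body)
        heapq.heappush(self._heap, note)
        return True

    def drain(self):
        """Deliver queued notifications; return (channel, body) in delivery order."""
        delivered = []
        while self._heap:
            note = heapq.heappop(self._heap)
            note.attempts += 1
            if self._senders[note.channel](note):
                delivered.append((note.channel, note.body))
            elif note.attempts < self._max_attempts:
                # Retry logic: requeue behind peers of the same priority.
                note.seq = next(self._seq)
                heapq.heappush(self._heap, note)
            else:
                self.dead_letters.append(note)
        return delivered
```

Publishing a low-priority email and then a high-priority SMS, then calling drain(), delivers the SMS first. A sender that keeps returning False moves the notification to dead_letters after max_attempts tries.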

Step 3: Human Review and Decision

The team reviews the RFC
The team discusses trade-offs the AI can't know (budget, existing infrastructure, team skills)
The team makes the decision
The AI helps with the implementation plan

Lesson: AI is excellent for research and documentation. Humans make the final decisions, using context the AI doesn't have.

Leveraging Extended Thinking for Complex Problems

Many current AI models can reason at greater length before responding. This is valuable for complex architectural problems where a quick answer tends to miss trade-offs.

When to Use Deep Reasoning

Problem types that benefit:

Architecture decisions - Trade-off analysis across multiple dimensions
Security audits - Following attack vectors through the system
Complex debugging - Multi-layer issues with subtle interactions
Performance optimization - Systemic bottlenecks requiring a holistic view
Migration planning - Dependencies, risks, rollback strategies

Common pattern:

These problems have:

Multiple valid solutions with non-obvious trade-offs
Long-term consequences
Cross-cutting concerns
Many constraints to weigh simultaneously

How to Prompt for Thorough Analysis

Instead of: "What's the best architecture for this?"

Try:

"Analyze authentication architecture options for SaaS with these constraints:

Requirements:
- Multi-tenant (500+ organizations)
- SSO support (SAML, OAuth)
- Role-based access control
- API keys for programmatic access
- Session management

Constraints:
- Team of 4 developers
- 6-month timeline
- Budget: $50k for auth infrastructure
- Must be SOC2 compliant
- Current stack: Node.js, PostgreSQL

Compare:
1. Build custom (JWT + sessions)
2. Auth0 / Okta integration
3. Open-source (Keycloak, Ory)

For each, analyze:
- Development time
- Ongoing maintenance
- Cost at 1k, 10k, 100k users
- Compliance implications
- Team expertise needed
- Vendor lock-in risk

Think through second-order effects before recommending."

Key elements:

Specific constraints (time, budget, team)
Multiple options to compare
Clear evaluation criteria
Explicit request for thorough analysis
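The key elements above can be enforced mechanically. The following is a rough sketch, assuming a hypothetical helper named build_analysis_prompt that refuses to produce a prompt unless constraints, at least two options, and evaluation criteria are all present. The function name and its parameters are illustrative, not part of any tool.

```python
def _section(title, items, numbered=False):
    """Render one labelled list; numbered for options, dashed otherwise."""
    if numbered:
        body = [f"{i}. {item}" for i, item in enumerate(items, start=1)]
    else:
        body = [f"- {item}" for item in items]
    return [title, *body, ""]


def build_analysis_prompt(subject, requirements, constraints, options, criteria):
    """Assemble an analysis prompt; all names here are illustrative."""
    if len(options) < 2:
        raise ValueError("list at least two options so there is something to compare")
    if not constraints:
        raise ValueError("state at least one concrete constraint (time, budget, team)")
    if not criteria:
        raise ValueError("state the criteria each option should be judged on")
    lines = [f"Analyze {subject} with these constraints:", ""]
    lines += _section("Requirements:", requirements)
    lines += _section("Constraints:", constraints)
    lines += _section("Compare:", options, numbered=True)
    lines += _section("For each, analyze:", criteria)
    # The explicit request for thorough analysis always closes the prompt.
    lines.append("Think through second-order effects before recommending.")
    return "\n".join(lines)


prompt = build_analysis_prompt(
    subject="authentication architecture options for SaaS",
    requirements=["Multi-tenant (500+ organizations)", "SSO support (SAML, OAuth)"],
    constraints=["Team of 4 developers", "6-month timeline"],
    options=["Build custom (JWT + sessions)", "Auth0 / Okta integration"],
    criteria=["Development time", "Vendor lock-in risk"],
)
print(prompt)
```

Calling the helper with a single option raises ValueError, which is a cheap guard against slipping back into asking "What's the best architecture for this?".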

When Extended Thinking Is Overkill

Don't overthink:

Simple CRUD endpoints - Follow established patterns, don't philosophize
Bug fixes with clear root cause - Just fix it
Boilerplate code - Standard implementation, no decisions needed
Well-established patterns - If the team has done it 10 times, just do it again
Time-sensitive hotfixes - Analysis paralysis is worse than an imperfect solution

Rule of thumb:

If the answer is in your existing codebase or standard practice, don't ask AI to reinvent it.

Pattern: Think First, Then Implement

For complex problems, use a two-phase approach:

Phase 1: Analysis (thinking mode)

You: "Before we implement, analyze:
     - What could go wrong with approach X?
     - What are the failure modes?
     - What constraints did I miss?
     - What assumptions need validation?"

AI: [Thorough analysis, identifies issues]

You: [Reviews, discusses with team, decides on approach]

Phase 2: Implementation (execution mode)

You: "Based on our discussion, implement approach X
     with these specific decisions:
     - [Decision 1]
     - [Decision 2]
     - [Decision 3]

     Follow patterns from @existing-code"

AI: [Implements according to plan]

Why separate phases:

Analysis doesn't get rushed by "just ship it" pressure
Team can review reasoning before committing resources
Implementation is clearer with decisions already made
Easier to course-correct early
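One way to keep the two phases from blurring together is to make the Phase 2 prompt impossible to generate until the team has recorded its decisions. The sketch below assumes a hypothetical TwoPhasePlan helper; the class, its methods, and the sample decision are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TwoPhasePlan:
    """Hypothetical helper: Phase 2 is blocked until Phase 1 yields decisions."""
    approach: str
    decisions: List[str] = field(default_factory=list)

    def analysis_prompt(self) -> str:
        questions = [
            f"What could go wrong with {self.approach}?",
            "What are the failure modes?",
            "What constraints did I miss?",
            "What assumptions need validation?",
        ]
        return "Before we implement, analyze:\n" + "\n".join(f"- {q}" for q in questions)

    def record_decision(self, decision: str) -> None:
        # Called after the human review, not by the AI.
        self.decisions.append(decision)

    def implementation_prompt(self, pattern_ref: str) -> str:
        if not self.decisions:
            raise RuntimeError("no decisions recorded; finish the Phase 1 review first")
        decided = "\n".join(f"- {d}" for d in self.decisions)
        return (
            f"Based on our discussion, implement {self.approach}\n"
            f"with these specific decisions:\n{decided}\n\n"
            f"Follow patterns from {pattern_ref}"
        )


plan = TwoPhasePlan("approach X")
print(plan.analysis_prompt())
plan.record_decision("Use the existing job queue rather than a new service")
print(plan.implementation_prompt("@existing-code"))
```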

Real Example: Database Scaling Decision

Rushed approach (don't do this):

You: "Our database is slow, should we add read replicas?"
AI: "Yes, here's how to set up read replicas..."
[Implements complex infrastructure]
[Doesn't solve the actual problem: the real cause was missing indexes]

Thoughtful approach:

You: "Database performance degrading. Before jumping to solutions, analyze:

Current state:
- 50k users
- 2M database rows
- Average query time: 800ms (was 200ms)
- 95th percentile: 3s
- Growth: 20% month-over-month

Symptoms:
- Slow on complex reports
- Fast on simple queries
- Worse during business hours (9am-5pm EST)

Constraints:
- Budget: $5k one-time, $500/month ongoing
- Team has PostgreSQL experience only
- Can't take downtime > 5 minutes
- Must ship improvement in 2 weeks

Analyze root causes before suggesting solutions.
What data would help narrow this down?"

AI: "Need more data before recommending solutions:
     1. Run EXPLAIN ANALYZE on slow queries
     2. Check if indexes exist on joined columns
     3. Review query patterns (N+1 queries?)
     4. Check connection pool utilization
     5. Monitor disk I/O during slow periods

     Most likely culprits given symptoms:
     - Missing indexes (explains report slowness)
     - N+1 queries (query volume multiplies under business-hours load)

     Recommend profiling before adding infrastructure."

You: [Runs suggested diagnostics]
You: "Here's EXPLAIN output [paste]"

AI: "Root cause: Missing composite index on orders(user_id, created_at)
     All reports scan full table.

     Solution: CREATE INDEX - no infrastructure needed
     Cost: $0, 5 minutes
     Expected improvement: 800ms → 50ms"

Result: Solved with a 5-minute fix instead of weeks of infrastructure work.
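The effect of that fix can be reproduced in miniature. The sketch below uses SQLite from Python's standard library as a stand-in for PostgreSQL, so it runs without a database server; the table contents and the index name idx_orders_user_created are made up for illustration. It prints the query plan before and after adding the composite index on orders(user_id, created_at). In PostgreSQL the equivalent check would be EXPLAIN ANALYZE, and given the downtime constraint in the example, building the index with CREATE INDEX CONCURRENTLY is probably the safer choice because it does not block writes.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; the data is synthetic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "created_at TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, created_at, total) VALUES (?, ?, ?)",
    [(i % 100, f"2024-01-{(i % 28) + 1:02d}", float(i)) for i in range(5000)],
)

REPORT_QUERY = "SELECT total FROM orders WHERE user_id = ? AND created_at >= ?"
PARAMS = (42, "2024-01-15")


def query_plan(connection, sql, params):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


before = query_plan(conn, REPORT_QUERY, PARAMS)

# The fix from the example: a composite index matching the filter columns.
conn.execute("CREATE INDEX idx_orders_user_created ON orders (user_id, created_at)")

after = query_plan(conn, REPORT_QUERY, PARAMS)

print("before:", before)  # a full scan of orders
print("after: ", after)   # a search using idx_orders_user_created
```

The exact plan wording varies between SQLite versions, but the shift from a table scan to an index search is the same signal to look for in real EXPLAIN output.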

Balancing Thoroughness with Speed

How much analysis is enough?

Decision Reversibility	Stakes	Analysis Time
Easy to reverse	Low	5-15 minutes
Moderate effort	Medium	30-60 minutes
Hard to reverse	High	2-4 hours
Irreversible	Critical	Days (with team)

Examples:

Easy to reverse (5-15 minutes):
- Which npm package for date parsing?
- REST vs GraphQL for new endpoint?

Moderate effort (30-60 minutes):
- State management approach (Redux, Zustand, Context)?
- Background job system (BullMQ, Celery)?

Hard to reverse (2-4 hours):
- Database choice (Postgres, MySQL, Mongo)?
- Monolith vs microservices?

Irreversible (days, with team):
- Cloud provider (AWS, GCP, Azure)?
- Programming language for new service?

Golden rule: Analysis time should match reversal cost.
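For teams that want the rule as a quick check, the table can be encoded as a lookup. This is only a sketch: the Reversibility enum and the over_budget function are hypothetical names, and the minute limits are the upper bounds of the ranges in the table, with no fixed limit for irreversible decisions.

```python
from enum import Enum


class Reversibility(Enum):
    EASY = "easy to reverse"
    MODERATE = "moderate effort"
    HARD = "hard to reverse"
    IRREVERSIBLE = "irreversible"


# (stakes, suggested analysis time, upper bound in minutes or None for "days")
ANALYSIS_BUDGET = {
    Reversibility.EASY: ("low", "5-15 minutes", 15),
    Reversibility.MODERATE: ("medium", "30-60 minutes", 60),
    Reversibility.HARD: ("high", "2-4 hours", 240),
    Reversibility.IRREVERSIBLE: ("critical", "days (with team)", None),
}


def over_budget(reversibility, minutes_spent):
    """True when analysis has run past the table's upper bound for this decision."""
    _, _, limit = ANALYSIS_BUDGET[reversibility]
    return limit is not None and minutes_spent > limit


# 45 minutes on a date-parsing package is too long; on a database choice it is not.
print(over_budget(Reversibility.EASY, 45))
print(over_budget(Reversibility.HARD, 45))
```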

Remember: Deep thinking is a tool, not a requirement. Use it for genuinely complex decisions. For everything else, ship fast and iterate.
