Architecture Interview Questions - Hard
Hard-level software architecture interview questions covering advanced distributed systems, consensus, and complex patterns.
Q1: Explain distributed consensus algorithms (Raft, Paxos).
Answer:
Problem: How do multiple nodes agree on a value in the presence of failures (node crashes, lost or delayed messages)?
Raft Algorithm
Roles:
- Leader: Handles all client requests, replicates log
- Follower: Passive, responds to leader/candidate
- Candidate: Seeks votes to become leader
Leader Election:
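In Raft, a follower that hears nothing from a leader within its election timeout becomes a candidate, increments its term, votes for itself, and asks the other nodes for votes. The sketch below shows only the vote-granting rule on the receiving node. `NodeState`, `handle_request_vote`, and `wins_election` are hypothetical names chosen for illustration, and timers, persistence, and networking are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_index: int = 0
    last_log_term: int = 0


def handle_request_vote(state: NodeState, term: int, candidate_id: str,
                        last_log_index: int, last_log_term: int) -> bool:
    """Return True if this node grants its vote to the candidate."""
    if term < state.current_term:
        return False  # request comes from a stale term
    if term > state.current_term:
        # A newer term was observed: adopt it and forget any earlier vote.
        state.current_term = term
        state.voted_for = None
    # The candidate's log must be at least as up to date as ours:
    # compare the last entry's term first, then the log length.
    log_ok = (last_log_term, last_log_index) >= (state.last_log_term,
                                                 state.last_log_index)
    if log_ok and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id  # at most one vote per term
        return True
    return False


def wins_election(votes_granted: int, cluster_size: int) -> bool:
    """A candidate needs votes from a strict majority of the cluster."""
    return votes_granted > cluster_size // 2
```

Granting at most one vote per term, combined with the majority requirement, is what gives the "at most one leader per term" property.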
Log Replication:
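- Client sends a command to the leader
- Leader appends the command to its local log
- Leader replicates the entry to followers via AppendEntries
- Once a majority of nodes have stored the entry, the leader commits it and applies it to its state machine
- Followers learn the new commit index from later AppendEntries messages (including heartbeats) and apply the entry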
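The majority rule in the steps above can be sketched as a small function the leader runs after each acknowledgement. This is a minimal illustration, not a full implementation: `advance_commit_index` and its parameters are hypothetical names, `match_index` maps each follower to the highest log index known to be stored on it, and log indexes are 1-based as in the Raft paper.

```python
from typing import Dict, List


def advance_commit_index(commit_index: int, match_index: Dict[str, int],
                         log_terms: List[int], current_term: int) -> int:
    """Return the new commit index for the leader.

    log_terms[i] is the term of the leader's entry at index i + 1.
    """
    # Highest index stored on each node, leader included, largest first.
    stored = sorted([len(log_terms), *match_index.values()], reverse=True)
    # The value at position n // 2 is held by a strict majority of n nodes.
    candidate = stored[len(stored) // 2]
    # Raft only commits by counting replicas for entries from the current term.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index
```

The current-term check reflects Raft's rule that entries from earlier terms are committed only indirectly, once a current-term entry after them is committed.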
Safety Properties:
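- Election Safety: At most one leader can be elected in a given term
- Leader Append-Only: A leader never overwrites or deletes entries in its own log; it only appends
- Log Matching: If two logs contain an entry with the same index and term, the logs are identical in all entries up through that index
- Leader Completeness: If an entry is committed in a given term, it is present in the logs of the leaders of all later terms
- State Machine Safety: If a server has applied a log entry at a given index, no other server will ever apply a different entry at that index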
Paxos Algorithm
Phases:
Phase 1 (Prepare):
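- Proposer selects a unique proposal number n, higher than any it has used before
- Sends Prepare(n) to a majority of acceptors
- Each acceptor promises not to accept proposals numbered below n and reports the highest-numbered proposal it has already accepted, if any
Phase 2 (Accept):
- If a majority promises, the proposer sends Accept(n, value), where value comes from the highest-numbered proposal reported in the promises, or is the proposer's own value if none was reported
- An acceptor accepts unless it has already promised a higher number
- Once a majority accepts, the value is chosen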
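The two phases can be sketched for single-decree Paxos as follows. This is a minimal in-memory illustration under simplifying assumptions: the class and function names are hypothetical, messages are plain method calls, and an acceptor that has accepted nothing reports proposal number -1.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = -1      # highest proposal number promised so far
    accepted_n: int = -1      # number of the accepted proposal, -1 if none
    accepted_value: Any = None

    def on_prepare(self, n: int) -> Optional[Tuple[int, Any]]:
        """Phase 1: return (accepted_n, accepted_value) as a promise, or None."""
        if n <= self.promised_n:
            return None
        self.promised_n = n
        return (self.accepted_n, self.accepted_value)

    def on_accept(self, n: int, value: Any) -> bool:
        """Phase 2: accept unless a higher number has been promised."""
        if n < self.promised_n:
            return False
        self.promised_n = n
        self.accepted_n = n
        self.accepted_value = value
        return True


def choose_value(promises: List[Tuple[int, Any]], own_value: Any) -> Any:
    """Proposer rule: reuse the value of the highest-numbered accepted proposal."""
    accepted = [p for p in promises if p[0] >= 0]
    if not accepted:
        return own_value
    return max(accepted, key=lambda p: p[0])[1]
```

The `choose_value` rule is what keeps a later proposer from overwriting a value that may already have been chosen.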
Comparison:
| Aspect | Raft | Paxos |
|---|---|---|
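| Understandability | Designed to be easier to understand | Harder to understand and explain |
| Leader | Strong leader; all writes go through it | No leader required (Multi-Paxos usually elects one as an optimization) |
| Log Structure | Contiguous log, no gaps | Log slots decided independently; gaps possible (Multi-Paxos) |
| Implementation | Single, fully specified algorithm | Many variants (e.g., Multi-Paxos, Fast Paxos) |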
Use Cases:
- Raft: etcd, Consul, CockroachDB
- Paxos: Google Chubby, Google Spanner
- ZAB (a related atomic broadcast protocol, not Paxos itself): Apache ZooKeeper
Q2: Design a globally distributed system with multi-region consistency.
Answer:
Key Challenges
1. Data Consistency:
Strategies:
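- Strong Consistency: Synchronous replication across regions; every write pays at least one cross-region round trip
- Eventual Consistency: Asynchronous replication; low write latency, but replicas can temporarily disagree
- Causal Consistency: Causally related updates are seen in the same order everywhere; concurrent updates may be seen in different orders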
2. Conflict Resolution:
3. Latency Optimization:
- Read-Local: Serve reads from nearest region
- Write-Local: Accept writes in the local region and replicate asynchronously (requires conflict resolution)
- CDN: Cache static content globally
- Edge Computing: Process at edge locations
4. Failure Handling:
- Circuit Breakers: Prevent cascade failures
- Fallback: Serve stale data if region unavailable
- Health Checks: Monitor region health
- Automatic Failover: Route traffic to healthy regions
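The circuit breaker and stale-data fallback from the list above can be combined in one small class. This is a minimal sketch under stated assumptions: `CircuitBreaker`, its thresholds, and the injectable `clock` are illustrative choices, and it is not thread-safe.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling the region
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would pass the remote-region read as `func` and a local cache read as `fallback`, so users get stale data instead of an error while the region is unhealthy.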
Implementation Patterns
Multi-Master Replication:
CRDT (Conflict-Free Replicated Data Types):
- Guaranteed convergence without coordination
- Types: G-Counter, PN-Counter, LWW-Register, OR-Set
- Use: Collaborative editing, distributed counters
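As an illustration of convergence without coordination, here is a minimal G-Counter (grow-only counter). Each replica increments only its own slot, and merging takes the per-replica maximum, which is commutative, associative, and idempotent. The class name and replica IDs are illustrative.

```python
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        current = self.counts.get(self.replica_id, 0)
        self.counts[self.replica_id] = current + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-replica maximum: applying the same merge twice changes nothing.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

A PN-Counter can be built from two such counters, one for increments and one for decrements.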
Vector Clocks:
- Track causality across replicas
- Detect concurrent updates
- Enable causal consistency
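A vector clock can be represented as a map from replica ID to a counter. The sketch below, with hypothetical function names, shows how comparing two clocks distinguishes "happened before" from "concurrent".

```python
def vc_increment(clock, replica):
    """Return a copy of clock with the replica's counter advanced by one."""
    new = dict(clock)
    new[replica] = new.get(replica, 0) + 1
    return new


def vc_merge(a, b):
    """Element-wise maximum, used when a replica receives a remote update."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def vc_compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

Two updates reported as `concurrent` are the ones that need a conflict resolution strategy; `before` and `after` can be ordered safely.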
Q3: Explain event sourcing and CQRS at scale.
Answer:
Event Sourcing
Core Concept: Store every state change as an append-only sequence of events instead of storing only the current state.
Event Store Structure:
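One possible shape for an event store is a per-aggregate stream of immutable, versioned events, with an optimistic concurrency check on append. The sketch below is an in-memory illustration only; the field names, `InMemoryEventStore`, and `ConcurrencyError` are assumptions, not a specific product's schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    aggregate_id: str
    version: int            # position in the aggregate's stream, starting at 1
    event_type: str
    payload: Dict[str, Any]


class ConcurrencyError(Exception):
    pass


class InMemoryEventStore:
    def __init__(self):
        self._streams: Dict[str, List[Event]] = {}

    def append(self, aggregate_id, expected_version, event_type, payload):
        stream = self._streams.setdefault(aggregate_id, [])
        # Reject the write if another writer appended since the caller's read.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}")
        event = Event(aggregate_id, expected_version + 1, event_type, payload)
        stream.append(event)
        return event

    def load(self, aggregate_id, after_version=0):
        """Return events with version > after_version, in order."""
        return self._streams.get(aggregate_id, [])[after_version:]
```

The `expected_version` check is what prevents two concurrent commands from both writing "version 2" of the same aggregate.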
Benefits:
- Complete audit trail
- Time travel (reconstruct past states)
- Event replay for debugging
- Multiple projections from same events
Challenges at Scale:
1. Event Store Growth:
Snapshots:
- Periodically save aggregate state
- Replay only events after snapshot
- Reduces reconstruction time
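The snapshot idea can be shown with a toy account aggregate whose state is a single balance. The event shapes and function names below are hypothetical; a snapshot is simply a `(version, state)` pair.

```python
def apply_event(balance, event):
    """Apply one (kind, amount) event to the current balance."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event kind: {kind}")


def rebuild(events, snapshot=None):
    """Rebuild state, replaying only the events recorded after the snapshot.

    events is the aggregate's full ordered stream; snapshot is (version, state).
    Returns the new (version, state) pair, which can itself be saved as a snapshot.
    """
    version, state = snapshot if snapshot is not None else (0, 0)
    for event in events[version:]:
        state = apply_event(state, event)
        version += 1
    return version, state
```

In a real store the slice `events[version:]` would be a query for events with a version greater than the snapshot's, so older events are never read.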
2. Projection Lag:
Solutions:
- Accept eventual consistency
- Show "processing" state to users
- Use optimistic UI updates
- Prioritize critical projections
3. Event Versioning:
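Because stored events are immutable, schema changes are commonly handled by upcasting: old event versions are converted to the current shape when they are read. The sketch below assumes a hypothetical `UserRegistered` event whose version 1 had a single `name` field and whose version 2 splits it into `first_name` and `last_name`.

```python
def upcast_user_registered_v1(payload):
    """Convert a v1 payload to the v2 shape (hypothetical schema change)."""
    first, _, last = payload["name"].partition(" ")
    return {"first_name": first, "last_name": last, "email": payload["email"]}


# (event type, stored version) -> (next version, conversion function)
UPCASTERS = {
    ("UserRegistered", 1): (2, upcast_user_registered_v1),
}


def upcast(event_type, version, payload):
    """Apply upcasters in sequence until the payload is at the latest version."""
    while (event_type, version) in UPCASTERS:
        version, convert = UPCASTERS[(event_type, version)]
        payload = convert(payload)
    return version, payload
```

Projections and aggregates then only need to understand the latest version of each event.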
CQRS at Scale
Read Model Optimization:
- Denormalized for query performance
- Multiple read models for different use cases
- Can use different databases (SQL, NoSQL, Search)
Scaling Reads:
Scaling Writes:
- Partition event store by aggregate ID
- Shard across multiple nodes
- Use distributed event bus (Kafka, Pulsar)
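Partitioning by aggregate ID can be as simple as hashing the ID to a partition number, which keeps all events for one aggregate on the same partition and therefore in order. A minimal sketch, with `partition_for` as a hypothetical helper:

```python
import hashlib


def partition_for(aggregate_id: str, num_partitions: int) -> int:
    """Map an aggregate ID to a partition in the range [0, num_partitions)."""
    # Use a stable hash: Python's built-in hash() of str varies between processes.
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Plain modulo hashing remaps most keys when `num_partitions` changes; consistent hashing is a common alternative when partitions are added or removed often.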
Q4: Design a real-time collaborative editing system (like Google Docs).
Answer:
Key Challenges
1. Concurrent Edits:
Solutions:
Operational Transformation (OT):
- Transform operations based on concurrent ops
- Maintains convergence and intention
- Complex to implement correctly
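To make the transformation step concrete, here is a minimal sketch covering only concurrent text insertions. It is far from a complete OT system (no deletes, no server ordering, no history), and the `Insert` type and site IDs are illustrative. Ties at the same position are broken by site ID so that every replica makes the same choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: str  # ID of the originating client, used only for tie-breaking


def transform(op: Insert, against: Insert) -> Insert:
    """Adjust op so it can be applied after `against` has been applied."""
    lands_before = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site)
    if lands_before:
        # The other insert lands first, so shift this one right.
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op


def apply_op(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]
```

For example, if sites A and B both insert at position 1 of "abc", each site applies its own operation first and the transformed remote operation second, and both end with the same document.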
CRDTs (Conflict-Free Replicated Data Types):
- Mathematically guaranteed convergence
- No central coordination needed
- Often simpler to reason about than OT, at the cost of extra per-element metadata
2. Real-Time Synchronization:
Optimizations:
- Optimistic Updates: Apply locally immediately
- Batching: Group operations to reduce network calls
- Compression: Compress operation payloads
- Presence: Show who's editing what
3. Scalability:
Strategies:
- Sticky Sessions: Route user to same server
- Pub/Sub: Broadcast operations across servers
- Shared State: Use Redis for document state
- Sharding: Partition documents across servers
4. Persistence:
- Periodic Snapshots: Save full document periodically
- Operation Log: Store all operations
- Hybrid: Snapshot + operations since snapshot
Implementation Considerations
Conflict Resolution:
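- Last Write Wins (LWW): simple, but silently discards one of two concurrent edits
- Version Vectors: detect concurrent edits so they can be merged rather than overwritten
- Application-specific logic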
Offline Support:
- Queue operations while offline
- Sync when reconnected
- Handle conflicts on reconnection
Performance (example targets):
- Sub-100 ms latency for operations
- 100+ concurrent editors per document
- Documents up to 10 MB
Q5: Explain chaos engineering and how to implement it.
Answer:
Chaos Experiments
Types of Failures to Inject:
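Two failures that are commonly injected at the application level are added latency and raised errors. The wrapper below is a minimal sketch of that idea; `with_chaos`, `InjectedFault`, and the parameters are hypothetical names, and the random source and sleep function are injectable so the behavior can be tested deterministically.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(func, error_rate=0.0, latency_s=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap func so calls are delayed and fail with probability error_rate."""
    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow dependency
        if rng() < error_rate:
            raise InjectedFault("injected failure")  # simulate an outage
        return func(*args, **kwargs)
    return wrapper
```

Wrapping a client call this way lets timeout, retry, and fallback paths be exercised without touching the real dependency.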
Implementation Levels
1. Development:
- Unit tests with mocked failures
- Integration tests with fault injection
- Local chaos testing
2. Staging:
- Automated chaos experiments
- Full system tests
- Performance under failure
3. Production:
- Controlled experiments
- Gradual rollout
- Automated rollback
Chaos Tools
Best Practices
Start Small:
Observability:
- Comprehensive monitoring
- Distributed tracing
- Log aggregation
- Real-time alerting
Safety Measures:
- Blast Radius: Limit scope of experiments
- Abort Conditions: Auto-stop if critical metrics degrade
- Business Hours: Run during staffed hours initially
- Gradual Rollout: Increase scope over time
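The safety measures above can be sketched as a small experiment runner that widens the blast radius step by step and aborts as soon as a guard metric crosses a threshold. All names here are hypothetical: `inject`, `rollback`, and `read_error_rate` stand in for whatever hooks the chaos tooling and monitoring system provide.

```python
def run_experiment(steps, read_error_rate, abort_threshold, inject, rollback):
    """Run a chaos experiment with gradual rollout and an abort condition.

    steps: increasing blast radius, e.g. percent of traffic [1, 5, 25].
    Returns a dict describing where (if anywhere) the experiment aborted.
    """
    rate = 0.0
    for pct in steps:
        inject(pct)                  # widen the blast radius
        rate = read_error_rate()     # check the guard metric
        if rate > abort_threshold:
            rollback()               # auto-stop: undo the fault immediately
            return {"aborted_at": pct, "error_rate": rate}
    rollback()                       # always clean up after a full run
    return {"aborted_at": None, "error_rate": rate}
```

A real runner would also wait between steps so the metric has time to reflect the injected fault.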
Example Scenarios
Network Partition:
- Simulate a split-brain scenario by isolating a subset of nodes
- Verify that only the majority side elects a leader and accepts writes
- Check data consistency after the partition heals
Service Degradation:
- Inject latency into database calls
- Verify that timeouts and retries behave as configured
- Check that circuit breakers open
Resource Exhaustion:
- Fill disk space
- Exhaust memory
- Max out CPU
- Verify graceful degradation
Measuring Success
Metrics:
- MTTR (Mean Time To Recovery): How fast system recovers
- Availability: Percentage uptime during chaos
- Error Rate: Increase in errors
- Latency: Impact on response times
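MTTR and availability can be computed directly from incident start and end times recorded during an experiment. A minimal sketch, assuming incidents are non-overlapping `(start, end)` pairs in seconds; the function names are illustrative.

```python
def mean_time_to_recovery(incidents):
    """Average incident duration in seconds."""
    durations = [end - start for start, end in incidents]
    return sum(durations) / len(durations)


def availability_pct(window_seconds, incidents):
    """Percentage of the observation window during which the system was up."""
    # Assumes incidents do not overlap and fall inside the window.
    downtime = sum(end - start for start, end in incidents)
    return 100.0 * (window_seconds - downtime) / window_seconds
```

For two incidents lasting 60 s and 120 s in a one-hour window, this gives an MTTR of 90 s and 95% availability.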
Goals:
- No customer-facing impact
- Automatic recovery
- Graceful degradation
- Clear alerts and runbooks
Summary
Hard architecture topics:
- Distributed Consensus: Raft, Paxos for agreement
- Global Distribution: Multi-region consistency strategies
- Event Sourcing + CQRS: Scalable event-driven systems
- Collaborative Editing: OT, CRDTs for real-time sync
- Chaos Engineering: Testing resilience through failure injection
These patterns enable building highly available, scalable, and resilient distributed systems.