Cosmos DB Disaster Recovery: Multi-Region Write Pitfalls and How to Evade Them
Learn how multi-region writes are managed in Cosmos DB, key pitfalls to steer clear of, and best practices to architect a resilient system.
Introduction
Azure Cosmos DB is a globally distributed, multi-model database service built for high availability, low-latency access, and straightforward scalability. One of its most prominent features is multi-region writes, whereby your applications write to the nearest regional replica, which greatly boosts performance and resilience.
But here's the catch: enabling multi-region writes also introduces new challenges, especially when you're architecting for disaster recovery (DR). Without careful planning, you can end up with data conflicts, unplanned downtime, or even data loss.
In this article, we'll walk you through:
- How multi-region writes are managed in Cosmos DB
- Pitfalls to steer clear of
- Best practices in the real world to architect a system that's actually resilient
Let's begin!
Multi-Region Writes in Cosmos DB
Cosmos DB can be configured with multiple active write regions, allowing your apps to write in parallel to geographically distributed data centers. Multi-region writes are an account-level option rather than the default. Enabling them improves latency and availability, but it requires careful management of consistency and conflict resolution.
Key Concepts of Cosmos DB's Global Distribution Model
Before jumping into the disaster recovery drawbacks, it’s essential to understand the core concepts that underpin Cosmos DB’s global distribution model.
1. Conflict Resolution
Cosmos DB supports conflict resolution when data is concurrently modified in different regions. By default, it uses Last Write Wins (LWW), where the most recent timestamped write is retained. However, this default approach can silently discard critical updates.
To mitigate this, Cosmos DB allows custom conflict resolution using stored procedures, enabling domain-specific logic to merge conflicts or preserve specific business rules.
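To make the silent-discard risk concrete, here is a minimal pure-Python sketch of last-write-wins resolution. It is an illustration only, not Cosmos DB code: `Write` and `resolve_lww` are hypothetical names, and the `ts` field stands in for the item's `_ts` system property.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    ts: int       # stand-in for the item's _ts value (epoch seconds)
    balance: int


def resolve_lww(writes):
    """Return the write with the highest timestamp and the writes it displaced."""
    winner = max(writes, key=lambda w: w.ts)
    losers = [w for w in writes if w is not winner]
    return winner, losers


# Two regions update the same item one second apart.
writes = [
    Write(region="A", ts=1_700_000_000, balance=100),
    Write(region="B", ts=1_700_000_001, balance=50),
]
winner, losers = resolve_lww(writes)
print(f"kept {winner.region}={winner.balance}, dropped {[w.region for w in losers]}")
```

The write from region A is dropped with no error raised to either writer, which is why the default policy deserves a deliberate decision rather than acceptance by omission.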
2. Consistency Levels
Cosmos DB offers five consistency levels, ranging from strong to eventual:
- Strong: Ensures absolute consistency but restricts writes to a single region. Best for financial or legal workloads.
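- Bounded staleness: Offers a guaranteed lag window, expressed as a maximum number of writes or a time interval (for multi-region accounts, the minimums are 100,000 writes or 300 seconds). Ideal for DR because it balances consistency with regional availability.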
- Session, consistent prefix, eventual: Provide progressively weaker consistency but higher availability and throughput.
3. Automatic Failover
When a region becomes unavailable, Cosmos DB automatically reroutes traffic to healthy regions. This ensures continuity but depends on proper client configuration (e.g., retry logic, preferred regions).
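The client-side half of this behavior can be sketched in plain Python. The Cosmos DB SDKs handle most of it for you once preferred regions are configured, so treat the following as an illustration of the idea rather than SDK code: `write_with_failover`, `RegionUnavailable`, and `fake_send` are hypothetical names, and the region list is an example.

```python
import time


class RegionUnavailable(Exception):
    """Raised by the hypothetical send function when a region cannot be reached."""


def write_with_failover(send, regions, payload, retries_per_region=2, backoff_s=0.0):
    """Try each region in preference order; return (region, result) of the first success."""
    last_error = None
    for region in regions:
        for attempt in range(retries_per_region):
            try:
                return region, send(region, payload)
            except RegionUnavailable as exc:
                last_error = exc
                # Linear backoff between attempts against the same region.
                time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("all preferred regions failed") from last_error


def fake_send(region, payload):
    # Simulated outage: "West US" is down, every other region accepts the write.
    if region == "West US":
        raise RegionUnavailable(region)
    return {"status": 201, "region": region}


used_region, result = write_with_failover(
    fake_send, ["West US", "East US", "North Europe"], {"id": "1"}
)
print(used_region, result["status"])
```

The order of the region list is the design decision that matters: it determines where traffic lands during an outage, so it should reflect latency and data-residency constraints, not alphabetical order.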
Disaster Recovery Drawbacks With Multi-Region Writes
Let’s now unpack some common challenges and how to address them effectively:
Drawback 1: Data Conflicts in Active-Active Regions
Issue: In an active-active setup, simultaneous updates from different regions can cause data conflicts.
Example scenario:
- Region A updates a user balance to $100
- Region B updates the same user balance to $50
- Cosmos DB chooses the write with the latest timestamp
Problem: LWW might incorrectly discard critical data without understanding business logic.
Solution:
- Use custom conflict resolution to enforce business-specific merge logic.
- Apply application-level validation to detect and prevent risky concurrent updates.
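One common form of application-level validation is an optimistic concurrency check: a write is accepted only if the item has not changed since the caller read it. The sketch below simulates that with an in-memory store and a version counter that plays the role of an ETag. `InMemoryStore` and `StaleWriteError` are hypothetical names, and the store models a single region only.

```python
class StaleWriteError(Exception):
    """The item changed since the caller last read it."""


class InMemoryStore:
    """Toy single-region store that versions each item, much like an ETag."""

    def __init__(self):
        self._items = {}

    def read(self, key):
        value, version = self._items.get(key, (None, 0))
        return value, version

    def write_if_unchanged(self, key, value, expected_version):
        _, current = self._items.get(key, (None, 0))
        if current != expected_version:
            raise StaleWriteError(f"{key}: expected v{expected_version}, found v{current}")
        self._items[key] = (value, current + 1)
        return current + 1


store = InMemoryStore()
store.write_if_unchanged("user-1", {"balance": 200}, expected_version=0)

# Two clients read the same version, then both try to write.
_, seen_by_a = store.read("user-1")
_, seen_by_b = store.read("user-1")
store.write_if_unchanged("user-1", {"balance": 100}, seen_by_a)
try:
    store.write_if_unchanged("user-1", {"balance": 50}, seen_by_b)
    second_write_rejected = False
except StaleWriteError:
    second_write_rejected = True
print(second_write_rejected)
```

A check like this catches concurrent updates that arrive at the same region. Writes accepted at the same moment in two different regions can still conflict, so it complements a conflict resolution policy rather than replacing it.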
Drawback 2: Failover Delays and RTO/RPO Gaps
Issue: While Cosmos DB supports automatic failover, practical delays still occur.
Examples of delay:
- RTO (Recovery Time Objective) delays due to DNS propagation, app retries, or cold caches
- RPO (Recovery Point Objective) gaps due to replication lag (even if typically <5 seconds)
Solutions:
- Routinely test failovers in a staging environment
- Use Azure Monitor and Diagnostic Logs to observe replication lag and request failures
- Set up health anomaly alerts to proactively identify degradation
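When you run a failover drill, the two numbers worth recording can be computed from three timestamps. This is a minimal sketch under stated assumptions: `measure_drill` is a hypothetical helper, and the timestamps are example values that you would take from your own drill logs.

```python
from datetime import datetime, timedelta


def measure_drill(outage_start, last_replicated_write, first_successful_write):
    """Return (rpo, rto) as timedeltas for a single failover drill.

    rpo: window of writes at risk, from the last write known to have
         replicated to a surviving region until the outage began.
    rto: time from the start of the outage until the application
         completed its first successful write again.
    """
    rpo = max(outage_start - last_replicated_write, timedelta(0))
    rto = first_successful_write - outage_start
    return rpo, rto


rpo, rto = measure_drill(
    outage_start=datetime(2024, 1, 1, 12, 0, 0),
    last_replicated_write=datetime(2024, 1, 1, 11, 59, 57),
    first_successful_write=datetime(2024, 1, 1, 12, 1, 30),
)
print(f"RPO={rpo.total_seconds():.0f}s RTO={rto.total_seconds():.0f}s")
```

Comparing these measured values against the targets in your runbook is what turns a drill into evidence.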
Drawback 3: Strong Consistency Trade-Offs
Issue: Strong consistency, while reliable, only allows a single write region.
Problem: If the designated write region fails, write operations pause until failover completes.
Solutions:
- Use Bounded Staleness for a balance of consistency and availability
- Implement two-phase commit logic at the application layer for mission-critical writes
Drawback 4: Unexpected Cost Surprises
Issue: Multi-region writes increase Request Units (RU/s) consumption and incur bandwidth charges.
Example:
- Every additional write region requires full RU provisioning
- Cross-region data replication incurs bandwidth fees
Solutions:
- If active-active writes aren't essential, use one write region with multiple read replicas
- Optimize partitioning and indexing to reduce RU consumption
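The effect of region count on the throughput bill is simple arithmetic, sketched below. The function name, the 730-hour month, and the price used in the example are all assumptions for illustration. The price is a placeholder, not a published Azure rate, and bandwidth charges are not included.

```python
def monthly_throughput_cost(provisioned_ru_s, region_count, price_per_100_ru_hour, hours=730):
    """Estimate monthly throughput cost when every region is provisioned in full."""
    billed_ru_s = provisioned_ru_s * region_count
    return billed_ru_s / 100 * price_per_100_ru_hour * hours


PLACEHOLDER_PRICE = 0.01  # hypothetical price per 100 RU/s per hour

one_region = monthly_throughput_cost(10_000, 1, PLACEHOLDER_PRICE)
three_regions = monthly_throughput_cost(10_000, 3, PLACEHOLDER_PRICE)
print(f"1 region: {one_region:.2f}, 3 regions: {three_regions:.2f}")
```

Whatever the actual rate is for your account, the estimate scales linearly with the number of regions, which is why the region count is worth revisiting whenever the workload changes.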
Top Practices for Reliable Disaster Recovery
If you want a disaster recovery plan that's both robust and cost-effective, the following is what to focus on:
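1. Implement custom conflict resolution
{
  "mode": "Custom",
  "conflictResolutionProcedure": "dbs/<database>/colls/<container>/sprocs/spMergeConflicts"
}
In Custom mode, the policy references a merge stored procedure by its full resource path (replace the <database> and <container> placeholders with your own names). A conflictResolutionPath such as /_ts applies only to the LastWriterWins mode. Define stored procedures that respect domain-specific logic and guarantee that important data is not lost in transit.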
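A procedure such as `spMergeConflicts` is written in JavaScript and runs inside Cosmos DB. The Python sketch below shows only the kind of merge rule it might encode. The data model is an assumption made for illustration: each account item carries a list of ledger `entries` with a unique `txId` and an `amount`, and the balance is derived from them instead of being overwritten.

```python
def merge_ledgers(existing, conflicting):
    """Union ledger entries by transaction id so that no region's update is dropped."""
    merged = {entry["txId"]: entry for entry in existing["entries"]}
    for item in conflicting:
        for entry in item["entries"]:
            merged.setdefault(entry["txId"], entry)
    entries = sorted(merged.values(), key=lambda e: e["txId"])
    return {
        "id": existing["id"],
        "entries": entries,
        "balance": sum(e["amount"] for e in entries),
    }


# Region A and region B each appended a different transaction to the same account.
from_region_a = {"id": "user-1", "entries": [
    {"txId": "t1", "amount": 200},
    {"txId": "t2", "amount": -100},
]}
from_region_b = {"id": "user-1", "entries": [
    {"txId": "t1", "amount": 200},
    {"txId": "t3", "amount": -50},
]}

merged_item = merge_ledgers(from_region_a, [from_region_b])
print(merged_item["balance"], [e["txId"] for e in merged_item["entries"]])
```

The design choice here is to store changes rather than final values. Two conflicting balances cannot be reconciled after the fact, but two sets of transactions can be combined.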
2. Test failovers periodically
- Simulate outages in Azure Portal or via CLI
- Measure actual RTO and RPO under real-world load
- Use the results to optimize runbooks and incident response
3. Monitor replication health
- Use Azure Monitor and Diagnostic Logs to watch for:
  - Replication latency
  - Failed requests
  - Regional unavailability
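Once latency samples are exported from your monitoring pipeline, an alert rule can be as simple as a percentile check per region. The sketch below assumes the samples are already available as lists of milliseconds. The function names, the sample data, and the 500 ms threshold are hypothetical.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def regions_over_threshold(latency_ms_by_region, threshold_ms):
    """Return the regions whose p95 replication latency exceeds the threshold."""
    return sorted(
        region
        for region, samples in latency_ms_by_region.items()
        if samples and p95(samples) > threshold_ms
    )


samples = {
    "East US": [40] * 19 + [900],   # a single spike, p95 stays low
    "North Europe": [700] * 20,     # sustained lag
}
flagged = regions_over_threshold(samples, threshold_ms=500)
print(flagged)
```

Alerting on a percentile rather than on the maximum keeps a single slow sample from paging anyone, while sustained lag still gets through.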
4. Optimize architecture for cost and performance
- For DR-focused systems, use 1 write region + multiple reads
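- If active-active is required, prefer session consistency, and still configure a conflict resolution policy, since concurrent writes in different regions can conflict at any consistency level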
Conclusion: Next Steps for Cosmos DB Resilience
Azure Cosmos DB offers blazing performance and near-instant scalability, but disaster recovery planning requires careful consideration. Multi-region writes open new possibilities, but without the right configurations, they can just as easily introduce risk.
To build resilient architectures:
- Align your consistency model with your business needs.
- Adopt custom conflict resolution logic.
- Routinely test failovers to validate recovery expectations.
- Monitor regional health and latency.
- Reevaluate the cost vs. performance trade-off regularly.
Next Steps:
- Review your current Cosmos DB setup against these best practices.
- Implement at least one failover simulation per quarter.
- Introduce alerts for replication lag and region health.
- Explore stored procedures for smarter conflict resolution.
A well-architected Cosmos DB configuration not only survives failure — it thrives in adversity.