Disaster Recovery

The AWS white paper is optional reading - it is summarized in the lecture
Disaster: any event w/ negative impact on business continuity or finances
DR: preparing for and recovering from a disaster
What kinds of DR?
- Traditional DR on-premise - separate data centers
- Hybrid recovery - on-premise data center with recovery in the cloud - use Route53 to direct users to one or the other
- Cloud - region A to region B
Two key terms - optimizing for these drives strategies
- RPO: recovery point objective
  - Basically, how often you do backups, how far back do you go
  - This is the amount of data loss you're willing to accept - backup every hour -> you lose up to an hour of data
- RTO: recovery time objective
  - How much time to come back - what is your downtime
- The smaller the numbers, the more expensive
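
To make the RPO arithmetic concrete, here is a minimal sketch that computes how much data is lost when a failure hits between backups. The function name and the hourly schedule are illustrative assumptions, and it pretends backups complete instantly.

```python
from datetime import datetime, timedelta

def data_loss_window(first_backup: datetime, interval: timedelta, failure: datetime) -> timedelta:
    """Return the time between the last completed backup and the failure.

    Assumption: backups complete instantly at first_backup, first_backup + interval, ...
    Real backups take time, so the true loss window is somewhat larger.
    """
    if failure < first_backup:
        raise ValueError("failure happens before any backup exists")
    elapsed = failure - first_backup
    # Whatever remains after the last whole interval is the data at risk.
    return elapsed % interval

start = datetime(2024, 1, 1, 0, 0)
hourly = timedelta(hours=1)
loss = data_loss_window(start, hourly, datetime(2024, 1, 1, 9, 59))
print(loss)  # 0:59:00
```

The worst case approaches the full backup interval, which is why "backup every hour" translates to an RPO of one hour.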

DR Strategies

KEY IDEA: The exam will give scenarios, and you have to choose from these strategies
RPO in hours, RTO in 24 hours or less: Backup and Restore
- High RTO, everything recreated; also high RPO because making backups takes time
- Examples
  - On-premise: AWS Storage Gateway or Snowball to send data to S3 - in extreme cases RPO might be a week
  - In cloud: regular snapshots of EBS, Redshift, RDS, etc. (stored in S3)
- When you restore, recreate EC2s from AMIs, recreate RDS, etc
- Lowest cost
RPO in minutes, RTO in hours: Pilot Light - popular, critical systems kept running
- Small version of the app always running in the cloud - but only critical core (DB)
- Similar to Backup and Restore, but your most critical systems are already up
- Example: keep live RDS replication running so the DB is ready, but leave the EC2 instances off until recovery
- Lower RPO (data is replicated continuously), lower RTO (less to rebuild during recovery)
- The difference between Pilot Light and Warm Standby can sometimes be difficult to understand.
  - Both include an environment in your DR Region with copies of your primary region assets.
  - The distinction is that Pilot Light cannot process requests without additional action taken first, while Warm Standby can handle traffic (at reduced capacity levels) immediately
    - Pilot Light will require you to turn on servers, possibly deploy additional (non-core) infrastructure, and scale up
    - Warm Standby only requires you to scale up (everything is already deployed and running). Choose between these based on your RTO and RPO needs.
RPO in seconds, RTO in minutes: Warm Standby - full system up and running but at minimum size
- ASG with only one EC2 running - in recovery, scale up the ASG
- More expensive since you have more infrastructure
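
A rough sketch of the "scale up the ASG" recovery step, assuming boto3 is available. The group name, sizes, and region are hypothetical. The request is built by a plain function so it can be checked without AWS access; only `scale_up_standby` actually calls the Auto Scaling API (`update_auto_scaling_group`).

```python
def scale_up_request(asg_name: str, desired: int, max_size: int) -> dict:
    """Build keyword arguments for Auto Scaling's UpdateAutoScalingGroup call."""
    if desired > max_size:
        raise ValueError("desired capacity cannot exceed max size")
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": desired,          # keep the group from shrinking back during recovery
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }

def scale_up_standby(asg_name: str, desired: int, max_size: int, region: str = "us-west-2") -> None:
    """Scale the warm standby group to production size (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3 installed
    client = boto3.client("autoscaling", region_name=region)
    client.update_auto_scaling_group(**scale_up_request(asg_name, desired, max_size))

# "dr-web-asg" is a made-up group name for illustration.
request = scale_up_request("dr-web-asg", desired=6, max_size=10)
print(request)
```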
RPO near zero, RTO potentially zero: Multi-site/Hot-site - aka active/active
- Very low RTO, but very expensive
- Duplicate system running, Route53 sends traffic to both
- Cloud/replica version accesses same live database as on-premise instances
- EC2 instances fail over to a replicated RDS replica
An all-cloud setup works the same way as on-premise - the same four options apply
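
For exam-style scenarios, the four tiers above can be sketched as a lookup that returns the cheapest strategy meeting both targets. The numeric RPO/RTO values are assumptions chosen to represent "hours", "minutes", "seconds", and "near zero"; the notes only give orders of magnitude.

```python
from datetime import timedelta
from typing import NamedTuple

class Strategy(NamedTuple):
    name: str
    rpo: timedelta  # assumed representative RPO the strategy can achieve
    rto: timedelta  # assumed representative RTO the strategy can achieve

# Ordered from cheapest to most expensive.
STRATEGIES = [
    Strategy("Backup and Restore", timedelta(hours=1), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(minutes=1), timedelta(hours=1)),
    Strategy("Warm Standby", timedelta(seconds=1), timedelta(minutes=1)),
    Strategy("Multi-site", timedelta(0), timedelta(0)),
]

def cheapest_strategy(rpo_target: timedelta, rto_target: timedelta) -> str:
    """Return the first (cheapest) strategy whose RPO and RTO fit the targets."""
    if rpo_target < timedelta(0) or rto_target < timedelta(0):
        raise ValueError("targets must not be negative")
    for strategy in STRATEGIES:
        if strategy.rpo <= rpo_target and strategy.rto <= rto_target:
            return strategy.name
    raise ValueError("no strategy meets the targets")  # unreachable with the table above

print(cheapest_strategy(timedelta(hours=4), timedelta(hours=24)))     # Backup and Restore
print(cheapest_strategy(timedelta(minutes=15), timedelta(hours=4)))   # Pilot Light
print(cheapest_strategy(timedelta(seconds=30), timedelta(minutes=10)))  # Warm Standby
```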

DR tips

Backup with EBS snapshots or other backups (call the EC2 API, e.g. from a CloudWatch scheduled event)
HA
- Route53 between regions
- If Direct Connect goes down, fall back to a Site-to-Site VPN
Replication
Automation
- CloudFormation/Elastic Beanstalk to recreate environments
- Recover/reboot EC2 instances with CloudWatch when an alarm fires
- AWS Lambda for customized automation
Testing - chaos monkeys
- Ex. Netflix has a "Simian Army" that randomly terminates app servers, even in production
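
The backup tip at the top of this section (EBS snapshots through the EC2 API, triggered by a scheduled event) could look roughly like the Lambda-style handler below. This is a sketch: the `volume_ids` event field, the tag, and the fake client are assumptions for illustration; `create_snapshot` is the real boto3 EC2 call.

```python
from datetime import datetime, timezone

def snapshot_request(volume_id: str, now: datetime) -> dict:
    """Build keyword arguments for EC2's CreateSnapshot call."""
    return {
        "VolumeId": volume_id,
        "Description": f"scheduled backup {now:%Y-%m-%dT%H:%M:%SZ}",
        "TagSpecifications": [{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "CreatedBy", "Value": "dr-backup"}],  # hypothetical tag
        }],
    }

def handler(event, context, ec2=None):
    """Entry point for a scheduled event; 'volume_ids' is an assumed input field."""
    if ec2 is None:
        import boto3  # only needed when running against real AWS
        ec2 = boto3.client("ec2")
    now = datetime.now(timezone.utc)
    return [ec2.create_snapshot(**snapshot_request(v, now))["SnapshotId"]
            for v in event.get("volume_ids", [])]

class FakeEC2:
    """Stand-in for the boto3 EC2 client, used only for a dry run."""
    def create_snapshot(self, **kwargs):
        return {"SnapshotId": "snap-for-" + kwargs["VolumeId"]}

snapshot_ids = handler({"volume_ids": ["vol-0abc"]}, None, ec2=FakeEC2())
print(snapshot_ids)  # ['snap-for-vol-0abc']
```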

DR checklist

What are the RPO and RTO, and what is the DR budget?
Is the AMI copied across regions, with the key in Parameter Store?
Is the CloudFormation StackSet working and tested to work in another region?
Are Route53 health checks working correctly? Are they tied to a CloudWatch alarm?
How can recovery be automated? e.g. CloudWatch Events -> trigger Lambda -> promote RDS read replica
Is data backed up? Is the backup appropriate for the RPO/RTO? Where does it live, and how is it synchronized and replicated?
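
The automation item in the checklist (event -> Lambda -> read replica promotion) might be sketched like this. The event fields `replica_id` and `dr_region` and the fake client are hypothetical; `promote_read_replica` is the real boto3 RDS call.

```python
def promote_handler(event, context, rds=None):
    """Promote an RDS read replica to a standalone, writable instance.

    'replica_id' and 'dr_region' are assumed fields on the triggering event.
    """
    replica_id = event["replica_id"]
    if rds is None:
        import boto3  # only needed when running against real AWS
        rds = boto3.client("rds", region_name=event.get("dr_region"))
    response = rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Promotion is asynchronous; the status reported here is not the final state.
    return response["DBInstance"]["DBInstanceStatus"]

class FakeRDS:
    """Stand-in for the boto3 RDS client, used only for a dry run."""
    def promote_read_replica(self, DBInstanceIdentifier):
        return {"DBInstance": {"DBInstanceIdentifier": DBInstanceIdentifier,
                               "DBInstanceStatus": "modifying"}}

status = promote_handler({"replica_id": "orders-db-replica"}, None, rds=FakeRDS())
print(status)  # modifying (value comes from the fake client)
```

Promotion breaks replication for good, so a real handler would likely want a guard against firing on a transient alarm.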

Backups and Multi-region DR

EFS backup options
- AWS Backup with EFS (frequency, when, retain time)
- EFS-to-EFS backup automation flow (older AWS solution; AWS Backup now covers this, and DataSync can also copy EFS to EFS)
  - EFS -> S3 -> S3 Cross region replication -> EFS
Route53 backup - no specific import/export
- API ListResourceRecordSets for export
- write script for imports into Route53 or another DNS provider
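
A possible shape for the export/import script, assuming boto3's Route 53 client. `export_record_sets` pages through `list_resource_record_sets`; `to_change_batch` turns the export into a `ChangeBatch` for `change_resource_record_sets`. Skipping the apex NS and SOA records is an assumption here, on the basis that a new hosted zone creates its own. The sample records are made up.

```python
import json

def export_record_sets(route53, hosted_zone_id: str) -> list:
    """Collect every record set in a hosted zone using the boto3 paginator."""
    paginator = route53.get_paginator("list_resource_record_sets")
    records = []
    for page in paginator.paginate(HostedZoneId=hosted_zone_id):
        records.extend(page["ResourceRecordSets"])
    return records

def to_change_batch(records: list, zone_name: str) -> dict:
    """Build an UPSERT ChangeBatch, skipping the zone's own apex NS/SOA records."""
    apex = zone_name if zone_name.endswith(".") else zone_name + "."
    changes = [
        {"Action": "UPSERT", "ResourceRecordSet": record}
        for record in records
        if not (record["Type"] in ("NS", "SOA") and record["Name"] == apex)
    ]
    return {"Comment": "restored from export", "Changes": changes}

# Made-up export for illustration (what export_record_sets would return).
sample = [
    {"Name": "example.com.", "Type": "NS", "TTL": 172800,
     "ResourceRecords": [{"Value": "ns-1.example.net."}]},
    {"Name": "www.example.com.", "Type": "A", "TTL": 60,
     "ResourceRecords": [{"Value": "192.0.2.10"}]},
]
batch = to_change_batch(sample, "example.com")
print(json.dumps(batch, indent=2))
```

Large zones may need the changes split across several batches.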
Elastic Beanstalk
- Saved configurations using eb cli
- Can use to recreate in another region
RDS
- Multi-AZ is available, but only within a single region
- Aurora only - Global Database replicates across up to 5 regions
- However, any RDS engine can have read replicas in another region
- Copy snapshots to another region - the first copy is full, later ones are incremental
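
A sketch of the cross-region snapshot copy with boto3's `copy_db_snapshot`, called from the destination region. The ARN, account number, and snapshot names are made up; `SourceRegion` is the boto3 parameter that lets the SDK presign the cross-region request.

```python
def region_from_arn(arn: str) -> str:
    """Extract the region field from an ARN (arn:partition:service:region:account:...)."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[3]

def copy_snapshot_request(source_arn: str, target_id: str) -> dict:
    """Build keyword arguments for RDS's CopyDBSnapshot call."""
    return {
        "SourceDBSnapshotIdentifier": source_arn,  # must be an ARN for cross-region copies
        "TargetDBSnapshotIdentifier": target_id,
        "SourceRegion": region_from_arn(source_arn),
        # Encrypted snapshots also need a KmsKeyId that exists in the destination region.
    }

def copy_to_dr_region(source_arn: str, target_id: str, dr_region: str):
    """Run the copy in the DR region (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3 installed
    rds = boto3.client("rds", region_name=dr_region)
    return rds.copy_db_snapshot(**copy_snapshot_request(source_arn, target_id))

# Hypothetical snapshot ARN for illustration.
arn = "arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snap"
print(copy_snapshot_request(arn, "mydb-snap-dr"))
```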

DNS routing policies - usually part of DR strategies other than backup and restore

Note that you can create a tree of records
- e.g. Latency Alias records at the top with weighted next level, with health checks
- Route53 will walk the tree
Simple - direct connection to a single resource, no DR
Failover - for active/passive failover
Weighted - multiple resources with specified proportion; good for canary deployments
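Geolocation, latency, geoproximity - route based on where the client is (its location, its network latency to each region, or its distance to each resource)
Multivalue answer - returns up to 8 healthy records, chosen at random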
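
As an example of the failover policy, here is a sketch that builds a primary/secondary record pair in the `ChangeBatch` format expected by Route 53's `change_resource_record_sets`. The domain, IP addresses, set identifiers, and health check ID are placeholders.

```python
def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one failover A record; role is 'PRIMARY' or 'SECONDARY'."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,   # distinguishes records that share a name and type
        "Failover": role,
        "TTL": ttl,                # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record

def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id):
    """Pair a health-checked primary record with a secondary record."""
    records = [
        failover_record(name, primary_ip, "PRIMARY", "primary", primary_health_check_id),
        failover_record(name, secondary_ip, "SECONDARY", "secondary"),
    ]
    return {
        "Comment": "active/passive failover",
        "Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records],
    }

# Placeholder values for illustration only.
batch = failover_change_batch("app.example.com.", "192.0.2.10", "198.51.100.20",
                              "hypothetical-health-check-id")
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])  # ['PRIMARY', 'SECONDARY']
```

Route53 answers with the primary record while its health check passes and switches to the secondary when it fails.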
