Etcd Backup on Pollux by JamesDoingStuff · Pull Request #1197 · DiamondLightSource/workflows

@JamesDoingStuff

The dev_resources/build CI check is currently failing. I think this is because check_k8s_resources.py does not handle CronJobs well: specifically, a CronJob has a resources field but no replicas field. I'll look into fixing this.
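To illustrate the mismatch, here is a minimal sketch of a resource check that treats CronJobs separately. It is not the actual logic of check_k8s_resources.py: the names `POD_SPEC_PATHS`, `KINDS_WITH_REPLICAS`, `find_pod_spec` and `check_manifest` are hypothetical, and it assumes the manifests have already been parsed into dictionaries. The one fact it relies on is that a CronJob nests its pod spec under `spec.jobTemplate.spec.template.spec` and has no `spec.replicas`.

```python
# Hypothetical sketch, not the real check_k8s_resources.py.
# Where each kind keeps its pod spec inside the manifest.
POD_SPEC_PATHS = {
    "Deployment": ("spec", "template", "spec"),
    "StatefulSet": ("spec", "template", "spec"),
    "CronJob": ("spec", "jobTemplate", "spec", "template", "spec"),
}

# Only these kinds are expected to declare spec.replicas.
KINDS_WITH_REPLICAS = {"Deployment", "StatefulSet"}


def find_pod_spec(manifest):
    """Return the pod spec dict for a known kind, or None if absent."""
    path = POD_SPEC_PATHS.get(manifest.get("kind"))
    if path is None:
        return None
    node = manifest
    for key in path:
        node = node.get(key)
        if not isinstance(node, dict):
            return None
    return node


def check_manifest(manifest):
    """Return a list of human-readable problems found in one manifest."""
    problems = []
    kind = manifest.get("kind")
    pod_spec = find_pod_spec(manifest)
    if pod_spec is None:
        return problems
    for container in pod_spec.get("containers", []):
        if "resources" not in container:
            name = container.get("name", "<unnamed>")
            problems.append(f"{kind}: container {name} has no resources")
    # A CronJob never has replicas, so only check kinds that do.
    if kind in KINDS_WITH_REPLICAS and "replicas" not in manifest.get("spec", {}):
        problems.append(f"{kind}: spec.replicas is not set")
    return problems
```

With this shape, a CronJob that sets resources on its containers passes, while a Deployment without replicas is still reported.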

Adds:

  • CronJob that executes daily to take a snapshot of one of the etcd PVs and upload it to Echo. Backups are timestamped and stored under the path dls-workflows-prod/<staging/prod>/etcd-snapshot-<timestamp>.db. The contents are encrypted. The job deletes backups older than 2 days.
  • CronJob to download the snapshot and perform an etcdctl snapshot restore on the provided etcd volume. This job won't run automatically.
  • Script (scripts/restore-etcd.sh) that scales down etcd and the vcluster, runs the restore job above for each etcd volume, then scales the cluster back up to its initial levels.

TBThomas56

Why is staging being backed up and not prod?

Probably best to roll it out on staging first and just make sure all is well - I just needed to add something to the Values.yaml for prod

Makes sense! I would add a simple ticket to the backlog in case it has not been added yet

TBThomas56

namespace: workflows
type: Opaque
{{ else }}
{{- end }}

should there be an empty line after all of these?

Does it do anything? I see that the other templates have one, so I don't mind adding one in if it's convention, just curious

TBThomas56

davehadley


# Delete old backed up objects, with age >= 2 days.
echo "deleting old backups from echo s3"
rclone delete --min-age=2d echo:dls-workflows-prod/${PREFIX}

This is fine for this PR, but we should decide on a strategy for how many backups we keep and for how long.
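To make the two dimensions of that decision concrete, here is a minimal sketch of a retention rule in Python. It is an illustration only, not what the job runs: the job uses `rclone delete --min-age=2d`, and the function name `select_expired`, its parameters, and the object names in the usage example are all hypothetical. The `min_age` argument mirrors the current "age >= 2 days" rule from the script comment, and `keep_latest` is an assumed extra knob for "how many", which the PR does not implement.

```python
from datetime import datetime, timedelta, timezone


def select_expired(objects, now, min_age=timedelta(days=2), keep_latest=0):
    """Return the names of backups that a retention rule would delete.

    objects     -- iterable of (name, modified_time) pairs
    now         -- the reference time for computing age
    min_age     -- backups at least this old are candidates for deletion
    keep_latest -- hypothetical guard: never delete the newest N backups
    """
    objects = list(objects)
    newest_first = sorted(objects, key=lambda item: item[1], reverse=True)
    protected = {name for name, _ in newest_first[:keep_latest]}
    cutoff = now - min_age
    return [
        name
        for name, modified in objects
        if modified <= cutoff and name not in protected
    ]


# Usage example with made-up object names and timestamps.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
backups = [
    ("etcd-snapshot-a.db", now - timedelta(hours=12)),
    ("etcd-snapshot-b.db", now - timedelta(days=2)),
    ("etcd-snapshot-c.db", now - timedelta(days=5)),
]
age_only = select_expired(backups, now)
age_and_count = select_expired(backups, now, keep_latest=2)
```

With `keep_latest=0` the rule depends on age alone, as in this PR. A non-zero `keep_latest` would keep the most recent snapshots regardless of their age.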

@JamesDoingStuff

Made a couple of minor changes to get this passing CI (last 2 commits) so if someone could just sanity check those please, that'd be great :)

TBThomas56