
...

Recovery from crashes/disruptions

...

Scenarios to validate

Restart each microservice while it is processing a request
In particular: restart the orchestrator while a DIG instantiate request is in flight
Restart all microservices together
Restart the node on which the EMCO pods are running (assuming a single node for now)

rsync can restart after a crash. As part of its EMCO backup/restore presentation, Aarna has tested deleting the entire EMCO namespace (including the EMCO pods and the database) and then restoring it.

MongoDB consistency

Some microservices may make multiple database writes for a single API call. If such a microservice crashes in the middle of that call, only some of the writes will have been applied, leaving MongoDB in an inconsistent state. We need to audit the code for such scenarios and fix them.

Graceful handling of cluster connectivity failure

Without the GitOps model, rsync should apply configurable retry/timeout policies to handle loss of cluster connectivity. The /projects/.../{dig}/stop API exists, but it is only a workaround: the user has to invoke it manually.

We need to validate rsync's retry and timeout behavior when cluster connectivity is lost.

Question: can we recommend the GitOps approach and leave things as is? If not, we need to fix this.

...

Versions Compared

Old Version 6

New Version 7
