Partners and prospective customers have asked whether EMCO can be deployed in production. EMCO needs enhancements in some areas before it is production-ready. Ideally, the community will come together on an initiative that identifies the gaps, determines the enhancements needed to fill them, and delivers those enhancements across multiple EMCO releases.

To bring EMCO closer to production readiness, two important areas that need enhancement are Observability and Resiliency.

Observability

Observability is the property of a system that allows an external observer to deduce its current state, and the history leading to that state, by observing the externally visible attributes of the system. The main factors relating to observability are: logging, tracing, metrics, events and alerts. See the Observability page for more details.

For logging, we already have structured logs in the code base and Fluentd in the deployment. But:

These don't need any code changes. We can do a PoC for a deployment of EMCO with log persistence and a log stack, and document the ingredients and the recipe. Perhaps the needed YAML files can be checked in as well.

We also need to investigate an events framework. This is TBD.

Resiliency

Database persistence

Today, database persistence is not enabled by default. We need to validate EMCO with persistence enabled.

Recovery from crashes/disruptions

Scenarios to validate

rsync can restart after a crash. As part of its EMCO backup/restore presentation, Aarna tested deleting the entire EMCO namespace (including the EMCO pods and the database) and restoring it.

MongoDB consistency

Some microservices may make multiple database writes for a single API call. If such a microservice crashes in the middle of the call, MongoDB is left with a partial, inconsistent update. We need to audit the code for such scenarios and fix them.
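The failure mode above can be illustrated with a minimal sketch. Everything here is hypothetical: `Store`, `Batch`, the handlers and the document keys are in-memory stand-ins, not EMCO code or the MongoDB driver API. The first handler writes directly and leaves a partial update if it stops midway; the second stages its writes and applies them only at the end. In a real deployment the all-or-nothing commit would have to come from the database itself (for example, MongoDB multi-document transactions, which require a replica set or sharded cluster) or from restructuring the data so one API call maps to one document write.

```go
package main

import (
	"errors"
	"fmt"
)

// errCrash simulates the microservice dying in the middle of an API call.
var errCrash = errors.New("simulated crash mid-call")

// Store is a hypothetical in-memory stand-in for the database.
type Store struct{ docs map[string]string }

func NewStore() *Store { return &Store{docs: map[string]string{}} }

// Batch stages writes so that none reach the store until Commit.
type Batch struct {
	store  *Store
	staged map[string]string
}

func (s *Store) Begin() *Batch {
	return &Batch{store: s, staged: map[string]string{}}
}

func (b *Batch) Put(key, value string) { b.staged[key] = value }

// Commit applies all staged writes. In this sketch it cannot fail halfway;
// a real database must provide that guarantee.
func (b *Batch) Commit() {
	for k, v := range b.staged {
		b.store.docs[k] = v
	}
}

// unsafeHandler writes two documents directly; a crash between the writes
// leaves only the first one in the store.
func unsafeHandler(s *Store, crash bool) error {
	s.docs["app"] = "v1"
	if crash {
		return errCrash
	}
	s.docs["app-status"] = "created"
	return nil
}

// safeHandler stages both writes and commits once, so a crash before
// Commit leaves the store untouched.
func safeHandler(s *Store, crash bool) error {
	b := s.Begin()
	b.Put("app", "v1")
	if crash {
		return errCrash
	}
	b.Put("app-status", "created")
	b.Commit()
	return nil
}

func main() {
	direct, staged := NewStore(), NewStore()
	_ = unsafeHandler(direct, true)
	_ = safeHandler(staged, true)
	fmt.Println(len(direct.docs), len(staged.docs)) // prints: 1 0
}
```

An audit along these lines would look for handlers shaped like `unsafeHandler`: more than one write per request with a possible failure point in between.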

Graceful handling of cluster connectivity failure

Without the GitOps model, rsync should apply configurable retry/timeout policies to handle cluster connectivity loss. We have the /projects/.../{dig}/stop API but that is a workaround -- the user needs to invoke that API manually.

We need to validate rsync's retry/timeout behavior when cluster connectivity is lost.

Question: can we recommend the GitOps approach and leave things as is? If not, we need to fix this.

Storage Considerations

We need storage in the cluster where EMCO runs for:

Upgrades

It should be possible to upgrade in-place from one released version to the next released version. The primary concern is database schema changes between versions.
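One common way to handle schema changes during an in-place upgrade is to stamp each stored document with a schema version and apply ordered migration steps. The sketch below assumes that approach purely for illustration; EMCO does not necessarily store a `schemaVersion` field, and the `Doc` type, the field names and the two migration steps are all hypothetical.

```go
package main

import "fmt"

// Doc is a hypothetical stored document that carries its own schema version.
type Doc map[string]interface{}

// migrations maps a schema version to the step that upgrades a document
// from that version to the next one. The steps are illustrative only.
var migrations = map[int]func(Doc){
	// v1 -> v2: add a new field with a default value.
	1: func(d Doc) { d["labels"] = []string{} },
	// v2 -> v3: rename a field.
	2: func(d Doc) {
		d["name"] = d["title"]
		delete(d, "title")
	},
}

// Upgrade applies migration steps in order until the document reaches the
// target version. The version stamp is updated after every step, so an
// interrupted upgrade can resume from where it stopped.
func Upgrade(d Doc, target int) error {
	v, ok := d["schemaVersion"].(int)
	if !ok {
		return fmt.Errorf("document has no integer schemaVersion")
	}
	for ; v < target; v++ {
		step, found := migrations[v]
		if !found {
			return fmt.Errorf("no migration from version %d", v)
		}
		step(d)
		d["schemaVersion"] = v + 1
	}
	return nil
}

func main() {
	d := Doc{"schemaVersion": 1, "title": "demo"}
	if err := Upgrade(d, 3); err != nil {
		fmt.Println("upgrade failed:", err)
		return
	}
	fmt.Println(d["schemaVersion"], d["name"]) // prints: 3 demo
}
```

Because each release only needs to ship the step from the previous schema version, this shape matches the stated goal of upgrading from one released version to the next.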

A smaller concern is a deprecation schedule for removing APIs and/or components.