Overview

The purpose of this page at this time is to capture requirements related to observability of the EMCO services (https://gitlab.com/groups/project-emco/-/epics/7).

Front-ending the services with Istio provides a useful set of metrics and tracing, and adding the Prometheus library provided collectors to each service expands that with other fundamental metrics. The open question is what additional metrics and tracing will be useful to EMCO operators.

Metrics

The following items are based on Prometheus recommendations for instrumentation.

Queries, errors, and latency

Both client and server side are provided by Istio. https://istio.io/latest/docs/reference/config/metrics/

Istio metrics can be customized to include other attributes from Envoy such as subject field of peer certificate. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/advanced/attributes

Example PromQL

Service	Type	PromQL	Notes
HTTP/gRPC* *The request_protocol label can be used to distinguish among HTTP and gRPC.	Queries	sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator"}[5m]))	inbound
	Queries	sum(irate(istio_requests_total{reporter="source",source_workload="services-orchestrator"}[5m])) by (destination_workload)	outbound
	Errors	sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator",response_code!~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination",destination_workload=~"services-orchestrator"}[5m]))	inbound
	Errors	sum(irate(istio_requests_total{reporter="source",source_workload=~"services-orchestrator",response_code!~"5.*"}[5m])) by (destination_workload) / sum(irate(istio_requests_total{reporter="source",source_workload=~"services-orchestrator"}[5m])) by (destination_workload)	outbound
	Latency	histogram_quantile(0.90, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_workload="services-orchestrator"}[1m])) by (le)) / 1000	P90
	Saturation

Queries, errors, and latencies of resources external to process (network, disk, IPC, etc.)

The prometheus golang library provides builtin collectors for various process and golang metrics: https://pkg.go.dev/github.com/prometheus/client_golang@v1.12.2/prometheus/collectors. A list of metrics provided by cAdvisor is at https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md. Additional K8s specific metrics can be enabled with the https://github.com/kubernetes/kube-state-metrics project.

Example PromQL

Note: some of these require that kube-state-metrics is also deployed.

Pod Resource	Type	PromQL
CPU	Utilization	sum(rate(container_cpu_usage_seconds_total{namespace="emco"}[5m])) by (pod)
	Saturation	sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="emco"}[5m])) by (pod)
	Errors
Memory	Utilization	sum(container_memory_working_set_bytes{namespace="emco"}) by (pod)
	Saturation	sum(container_memory_working_set_bytes{namespace="emco"}) by (pod) / sum(kube_pod_container_resource_limits{namespace="emco",resource="memory",unit="byte"}) by (pod)
	Errors
Disk	Utilization	sum(irate(container_fs_reads_bytes_total{namespace="emco"}[5m])) by (pod, device)
	Utilization	sum(irate(container_fs_writes_bytes_total{namespace="emco"}[5m])) by (pod)
	Saturation
	Errors
Network	Utilization	sum(rate(container_network_receive_bytes_total{namespace="emco"}[1m])) by (pod)
	Utilization	sum(rate(container_network_transmit_bytes_total{namespace="emco"}[1m])) by (pod)
	Saturation
	Errors	sum(container_network_receive_errors_total{namespace="emco"}) by (pod)
	Errors	sum(container_network_transmit_errors_total{namespace="emco"}) by (pod)

Internal errors and latency

Internal errors should be counted. It also desirable to measure success to calculate ratio.

Totals of info/error/warning logs

Unsure if this is a useful metric.

Any general statistics

This bucket includes EMCO specific information such as number of projects, errors and latency of deployment intent group instantiation, etc. Also consider any cache or threadpool metrics.

Preliminary guidelines:

Distinguish between resources and actions.
Action metrics will record requests, errors, and latency similar to general network requests.
Resource metrics will record creation, deletion, and possible modification.
Metrics will be labeled with project, composite-app, deployment intent group, etc.

For rsync specifically, measure health/reachability of target clusters.

Also, keep in mind this cautionary note from the Prometheus project:

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

However note that well-known projects such as Istio and kube-state-metrics appear to disregard this, so further investigation may be needed on the motivations behind this note.

Preliminary metrics

This section contains some of the considerations of the guidelines above applied to the orchestrator service.

The actions of a service can be identified from the gRPC requests and HTTP lifecycle requests:

Service	Action
orchestrator	approve
	instantiate
	migrate
	rollback
	stop
	terminate
	update
	StatusRegister
	StatusDeregister

The requests, errors, and latency can be modeled after Istio's istio_requests_total and istio_request_duration_milliseconds, with an additional action name label.

The resources of a service can be identified from the HTTP resources. The initial labels can be the URL parameters.

Service	Resource	Labels
orchestrator	controller	name

	project	name
	compositeApp	version, name, project
	app	name, composite_app_version, composite_app, project
	dependency	name, app, composite_app_version, composite_app, project
	compositeProfile	name, composite_app_version, composite_app, project
	appProfile	name, composite_profile, composite_app_version, composite_app, project
	deploymentIntentGroup	name, composite_app_version, composite_app, project
	genericPlacementIntent	name, deployment_intent_group, composite_app_version, composite_app, project
	genericAppPlacementIntent	name, generic_placement_intent, deployment_intent_group, composite_app_version, composite_app, project
	groupIntent	name, deployment_intent_group, composite_app_version, composite_app_name, project
dcm	emco_logical_cloud_resource	project, name, namespace, status
clm	emco_cluster_provider_resource	name
clm	emco_cluster_resource	name, clusterprovider
ncm	emco_cluster_network_resource	clusterprovider, cluster, name, cnitype
ncm	emco_cluster_provider_network_resource	clusterprovider, cluster, name, cnitype, nettype, vlanid, providerinterfacename, logicalinterfacename, vlannodeselector
dtc	emco_dig_traffic_group_intent_resource	name, project, composite_app, composite_app_version, dig
	emco_dig_inbound_intent_resource	name, project, composite_app, composite_app_version, dig, traffic_group_intent, spec_app, app_label, serviceName, externalName, port, protocol, externalSupport, serviceMesh, sidecarProxy, tlsType
	emco_dig_inbound_intent_client_resource	name project, composite_app, composite_app_version, dig, traffic_group_intent, inbound_intent, spec_app, app_label, serviceName
	emco_dig_inbound_intent_client_access_point_resource	name, project, composite_app, composite_app_version, dig, traffic_group_intent, inbound_intent, client_name, action
ovnaction	emco_network_controller_intent_resource	name, project, composite_app, composite_app_version, dig
	emco_workload_intent_resource	name, project, composite_app, composite_app_version, dig, network_controller_intent, app_label, workload_resource, type
	emco_workload_interface_intent_resource	name, project, composite_app, composite_app_version, dig, network_controller_intent, workload_intent interface, network_name, default_gateway, ip_address, mac_address

The metrics for these resources should capture the state of the resource, i.e. metrics for creation, deletion, etc. (emco_controller_creation_timestamp, emco_controller_deletion_timestamp, etc.) as described in the guidelines. This approach is suggested as it is unclear how to apply metrics capturing resource utilization to these resources.

The status of a deployment intent group deserves special consideration. The suggested approach is to support the labels necessary to execute equivalent queries as shown in EMCO Status Queries. This would enable alerting on the various states of the resources composing a deployment intent group.

Metric	Type	Description	Labels
emco_deployment_intent_group_resource	GAUGE	0 or 1	project app composite_app_version composite_profile name deployed_status ready_status app cluster_provider cluster connectivity resource_gvk resource resource_deployed_status resource_ready_status

The deployment intent group shown in Example query - status=deployed would create the following metrics:

emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="ConfigMap.v1",resource="firewall-scripts-configmap",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="Deployment.v1.apps",resource="fw0-firewall",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge02",connectivity="available",resource_gvk="Config.v1",resource="firewall-scripts-configmap",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="firewall",cluster_provider="vfw-cluster-provider",cluster="edge02",connectivity="available",resource_gvk="Deployment.v1.apps",resource="fw0-firewall",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="packetgen",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="Deployment.v1.apps",resource="fw0-packetgen",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="packetgen",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="ConfigMap.v1.apps",resource="packetgen-scripts-configmap",resource_deployed_status="applied",resource_ready_status="ready"}
emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",deployed_status="instantiated",ready_status="ready",app="packetgen",cluster_provider="vfw-cluster-provider",cluster="edge01",connectivity="available",resource_gvk="Service.v1.apps",resource="packetgen-service",resource_deployed_status="applied",resource_ready_status="ready"}
...

Some example queries:

Description	PromQL
deployedCounts	count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",resource_deployed_status="applied"})
readyCounts	count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",resource_ready_status="ready"})
readyCounts	count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",resource_ready_status="notready"})
apps	count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group"}) by (app)
clusters filtered by the sink and firewall apps	count(emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",app="sink"} or emco_deployment_intent_group_resource{project="testvfw",composite_app="compositevfw",composite_app_version="v1",composite_profile="vfw_composite-profile",name="vfw_deployment_intent_group",app="firewall"}) by (cluster_provider,cluster)

Tracing

Istio provides a starting point for tracing by creating a trace for each request in the sidecars. But this is insufficient as it does not include the outgoing requests made during an inbound request. What we'd like to see is a complete trace of, for example, an instantiate request to the orchestrator that includes the requests made to any controllers, etc.

In order to do this it is necessary to pass the tracing headers from the inbound request through to any outbound requests. This will be done with the https://opentelemetry.io/ golang libraries.

Logging

Each log message must contain the timestamp and identifying information describing the resource, such as project, composite application, etc. in case of orchestration.

The priority is placed on error logs; logging other significant actions is secondary.

Space shortcuts

Page tree

Overview

Metrics

Queries, errors, and latency

Example PromQL

Queries, errors, and latencies of resources external to process (network, disk, IPC, etc.)

Example PromQL

Internal errors and latency

Totals of info/error/warning logs

Any general statistics

Preliminary metrics

Tracing

Logging

7 Comments

Srinivasa Addepalli

Todd Malsbary

Nadathur Sundar

Todd Malsbary

Grzegorz Panek

Nadathur Sundar

Srinivasa Addepalli