Discussion of the current CNTT Release 1 HA requirements and the approach to testing
Slides & Recording
The Architecture must support resilient OpenStack components that are required for the continued availability of running workloads.
The Architecture must support network resiliency.
Existing HA test cases in OPNFV - Yardstick
Example test cases
- Control node restart: restart entire node
- Neutron service restart: kill Neutron process and measure API response and recovery. Same concept for Nova, Glance, Cinder, Keystone, MySQL, RabbitMQ, HAProxy
- CPU load
- Disk IO load
- Framework for building resilience test scenarios
- Framework geared towards OpenStack: translation of Yardstick scenarios to Heat
- The majority of the tests are white box tests, which are not suitable
- What kind of test cases can we actually design for?
- No white box testing - only black box testing
- How to define pass / fail criteria?
- Node level
- Network resilience
- Switch level, port level?
- Availability of redundant fabric in OPNFV labs, Packet
- API for configuring switches
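
The black box idea discussed above (probe an API from the outside while a fault is injected, then judge the recovery) could be sketched roughly as follows. This is a minimal illustration and not Yardstick code: the function names, the rule that any HTTP status below 500 counts as "available", and the outage threshold are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

# (seconds since the start of the measurement, probe succeeded)
Sample = Tuple[float, bool]


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Black box probe: True if the endpoint answers with a status below 500.

    Assumption for this sketch: a 4xx answer (e.g. 401 without a token)
    still shows that the API is up and serving requests.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, ...
        return False


def collect(probe: Callable[[], bool], duration_s: float,
            interval_s: float = 1.0) -> List[Sample]:
    """Call the probe periodically and record the results."""
    samples: List[Sample] = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed > duration_s:
            break
        samples.append((elapsed, probe()))
        time.sleep(interval_s)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest span from a first failed probe to the next successful one.

    An outage that is still open at the end is counted up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


def verdict(samples: List[Sample], max_outage_s: float) -> bool:
    """Pass only if the service is up at the end and no outage was too long."""
    if not samples:
        return False
    recovered = samples[-1][1]
    return recovered and longest_outage(samples) <= max_outage_s
```

In use, something like `collect(lambda: http_probe(url), duration_s=120)` would run while the fault is injected by a separate tool, with `url` pointing at the service endpoint under test, and `verdict(samples, max_outage_s=...)` would give the pass / fail result. What outage length is acceptable is exactly the open question noted above, so the threshold is left as a parameter.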
Existing resilience and robustness testing
Instead of building a new framework, integrate existing resilience testing frameworks.
Non-exhaustive list of tools - extend with more suitable candidates you are aware of
- Litmus (https://github.com/litmuschaos/litmus)
- PowerfulSeal (https://github.com/powerfulseal/powerfulseal)
- OpenShift Kraken (https://github.com/openshift-scale/kraken)
- Chaos Toolkit
- Chaos Mesh
- RC-1/2 should be used in production environments and hence should not execute destructive tests
- the Yardstick framework is hard to maintain → questionable if we want to reactivate it
- key question: is resilience testing in the scope of RC-1/2
- CNTT specifies requirements on resilience → there is a need for validating such requirements via an automated test
- → we likely need such tests and then need to de-/select destructive tests depending on use case: workload onboarding (non-destructive) vs. OVP badging (destructive)
- Need to distinguish between HA and resiliency. A resilient system continues to function in case of a failure (we can limit to a single failure scenario)
- In a cloud environment one expects infrastructure failures and thus expects resiliency and HA from the software systems (OSTK, etc.) – # of deployments, etc.
- Recovery also needs to be taken into account. If the recovery impacts the workloads to the point where they are no longer functional, then it cannot be considered resilient
- RA1 Chapters 3 and 4 specify the services, # of minimum deployments, etc. to meet the requirements specified in Chapter 2; also review Ch5 (Thanks, Cedric)
- Opened CNTT Issue #2061 to make the network resiliency requirement more specific
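
The idea of de-/selecting destructive tests depending on the use case could look roughly like the sketch below. It is only an illustration under assumptions: the class, the use case keys, and the `api_smoke_check` entry are hypothetical names, and the two restart cases are borrowed from the Yardstick examples listed earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ResilienceCase:
    name: str
    destructive: bool  # True if the test injects a fault into the system


# Hypothetical use case names; whether destructive tests may run in each
# follows the notes: onboarding is non-destructive, badging is destructive.
ALLOW_DESTRUCTIVE = {
    "workload_onboarding": False,
    "ovp_badging": True,
}

# Illustrative catalogue, not an agreed test list.
CATALOGUE = [
    ResilienceCase("control_node_restart", destructive=True),
    ResilienceCase("neutron_service_restart", destructive=True),
    ResilienceCase("api_smoke_check", destructive=False),
]


def select_tests(tests: List[ResilienceCase], use_case: str) -> List[ResilienceCase]:
    """Return the tests that are allowed to run for the given use case."""
    if use_case not in ALLOW_DESTRUCTIVE:
        raise ValueError(f"unknown use case: {use_case}")
    allow = ALLOW_DESTRUCTIVE[use_case]
    return [t for t in tests if allow or not t.destructive]
```

With this catalogue, `select_tests(CATALOGUE, "workload_onboarding")` keeps only the non-destructive check, while `"ovp_badging"` keeps all three.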