2020-10-14 - OPNFV/CNTT - HA requirements and testing approaches

Topic Overview

Discussion of the current CNTT Release 1 HA requirements and the approach to testing them

req.gen.rsl.01:
The Architecture must support resilient OpenStack components that are required for the continued availability of running workloads.

Example test cases

Control node restart: restart entire node
Neutron service restart: kill the Neutron process and measure API response and recovery time. The same concept applies to Nova, Glance, Cinder, Keystone, MySQL, RabbitMQ, HAProxy
CPU load
Disk IO load
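
A minimal sketch of the "measure API response and recovery" idea from the service restart test cases above. It assumes a black-box prober has already recorded timestamped pass/fail samples against an API endpoint while the fault was injected; the sample format and the function name longest_outage are hypothetical and not taken from any existing framework.

```python
from typing import List, Tuple


def longest_outage(samples: List[Tuple[float, bool]]) -> float:
    """Return the longest observed outage in seconds.

    Each sample is (timestamp_in_seconds, probe_succeeded), ordered by time.
    An outage starts at the first failed probe and ends at the next
    successful one. If the API never recovers inside the observation
    window, the result is infinity.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp            # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None                 # service recovered
    if outage_start is not None:
        return float("inf")                     # no recovery observed
    return longest
```

The resolution of the result is bounded by the probe interval, so a test would likely need to probe more often than the recovery time it wants to resolve.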

Properties

What kind of test cases can we actually design for?
No white box testing - only black box testing
How to define pass/fail criteria?
Node level
Network resilience
- Switch level, port level?
- Availability of redundant fabric in OPNFV labs, Packet
- API for configuring switches
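
One possible way to make the pass/fail question above concrete for black-box tests is to compare externally observable measurements against agreed thresholds. The sketch below is only an illustration: the Criteria fields, the verdict function and any threshold values are hypothetical and are not defined by CNTT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    max_outage_s: float      # longest tolerated API outage, in seconds
    max_error_ratio: float   # tolerated share of failed probes (0.0 to 1.0)


def verdict(outage_s: float, failed: int, total: int, criteria: Criteria) -> str:
    """Return "PASS" or "FAIL" based only on externally observed values."""
    if total <= 0:
        raise ValueError("at least one probe is required")
    within_outage = outage_s <= criteria.max_outage_s
    within_errors = (failed / total) <= criteria.max_error_ratio
    return "PASS" if within_outage and within_errors else "FAIL"
```

For example, verdict(2.0, 2, 100, Criteria(max_outage_s=5.0, max_error_ratio=0.05)) passes, while the same run with a 10 second outage fails.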

Instead of building a new framework, integrate existing resilience testing frameworks.

Non-exhaustive list of tools - extend with more suitable candidates you are aware of

Cedric
- RC-1/2 should be usable in production environments and hence should not execute destructive tests
- the Yardstick framework is hard to maintain → questionable if we want to reactivate it
Key question: is resilience testing in the scope of RC-1/2?
- CNTT specifies requirements on resilience → there is a need for validating such requirements via an automated test
- → we likely need such tests and then need to de-/select destructive tests depending on use case: workload onboarding (non-destructive) vs. OVP badging (destructive)
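
The de-/selection of destructive tests by use case could be sketched as follows. This is a hedged illustration only: the ResilienceCase type, the use case keys and the select_tests helper are hypothetical names and do not reflect how RC-1/2 tooling is actually structured.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ResilienceCase:
    name: str
    destructive: bool   # True if the test injects faults into the platform


# Hypothetical policy: which use cases may run destructive tests.
ALLOW_DESTRUCTIVE: Dict[str, bool] = {
    "workload_onboarding": False,   # production-like, non-destructive only
    "ovp_badging": True,            # dedicated lab, destructive allowed
}


def select_tests(cases: List[ResilienceCase], use_case: str) -> List[ResilienceCase]:
    """Keep all tests, or only the non-destructive ones, per use case."""
    if use_case not in ALLOW_DESTRUCTIVE:
        raise ValueError(f"unknown use case: {use_case}")
    allow = ALLOW_DESTRUCTIVE[use_case]
    return [case for case in cases if allow or not case.destructive]
```

Rejecting unknown use cases, rather than defaulting to "allow", keeps a misconfigured run from executing destructive tests by accident.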
Need to distinguish between HA and resiliency. A resilient system continues to function in case of a failure (we can limit to a single failure scenario)
In a cloud environment one expects infrastructure failures and thus expects resiliency and HA from the software systems (OSTK, etc.) – # of deployments, etc.
Recovery also needs to be taken into account. If the recovery impacts the workloads to the point where they are no longer functional, then it cannot be considered resilient
RA1 Chapters 3 and 4 specify the services, minimum # of deployments, etc. to meet the requirements specified in Chapter 2; also review Ch5 (Thanks, Cedric)
Opened CNTT Issue #2061 to make the network resiliency requirement more specific