Applying cloud native principles to all layers of network infrastructure, applications and services, as put forth by the NGMN (a short illustrative sketch follows the list):

  1. Decoupled infrastructure and application lifecycles over vertical monoliths  
  2. ‘API first’ over manual provisioning of network resources  
  3. Declarative and intent-based automation over imperative workflows  
  4. GitOps principles over traditional network operations practices 
  5. Unified Kubernetes (or the like) resource consumption patterns over domain-specific resource controllers 
  6. Unified Kubernetes (or the like) closed-loop reconciliation patterns over vendor-specific element management practices  
  7. Interoperability by well-defined certification processes over vendor-specific optimisation. 
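
As an informal illustration of principles 3, 5, and 6: in a declarative Kubernetes resource the manifest expresses intent ("three replicas of this function should exist") and the cluster's controllers continuously reconcile the actual state towards it, rather than an operator imperatively creating each instance. All names and images in the sketch below are placeholders, not taken from any specific CNF.

```yaml
# Illustrative only: a declarative manifest expressing desired state.
# The cluster's controllers reconcile reality towards it in a closed loop,
# instead of an imperative workflow creating each instance by hand.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cnf              # hypothetical CNF name
  namespace: example-cnf
spec:
  replicas: 3                    # intent: "three instances should exist"
  selector:
    matchLabels:
      app: example-cnf
  template:
    metadata:
      labels:
        app: example-cnf
    spec:
      containers:
        - name: example-cnf
          image: registry.example.com/example-cnf:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
```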

Reference


"Practical challenges and pain points on this journey, which hinder progress towards the target expressed in the NGMN Cloud Native Manifesto, have been identified and are being felt."

"For the new model to work, vendors and CSPs must provide mutual SLAs: the CSP must guarantee a certain level of quality at the platform layer, while CNF vendors need to guarantee that the application will perform on the platform with SLAs that meet defined KPIs.Challenges in Cloud Native Telco Transformation Today (source: Accelerating Cloud Native in Telco whitepaper *)"

"We also use the term cloud-native infrastructure in broader context for the infrastructure that abstracts Infrastructure-as-a-Service (IaaS) layer, that has Kubernetes in its core with useful API abstractions on top of it as as well as auxiliary systems all as a framework that makes managing applications easier and promotes doing so in a cloud native way. This is important because you are free to use Kubernetes (and other "cloud native" technologies) in a very uncloud-native way. Our longer-term goal, underlying this whitepaper, is for all layers of the environment to encompass the cloud native principles from infrastructure allocation + management, through the application workloads"

From the whitepaper Accelerating Cloud Native in Telco

Figure 2-1: Example layers for cloud native infrastructure from "Cloud Native Infrastructure" by Justin Garrison and Kris Nova (ISBN: 9781491984307)


Pre-Validation. Historically, Network Functions have been developed and pre-integrated with well-defined infrastructure that was known in advance. That pre-validation was done by the Network Function vendor, and the system was delivered as a validated/certified bundle together with performance and stability guarantees. In the cloud native world there are too many permutations, which makes it impractical to follow the traditional certification path. However, Cloud Native Network Function (CNF) vendors are still sticking to it by picking a small number of opinionated infrastructure flavors (different from vendor to vendor!) to pre-validate against, making any infrastructure outside this selection too complicated, too costly, and too slow to deliver for CSPs. This creates problems in the adoption of those CNFs, as each CSP generally prefers a single, unified cloud native infrastructure layer, which it is free to choose and which often differs from the opinionated infrastructure already validated by the CNF vendors.

Adaptations. CNFs are typically delivered as a collection of artifacts such as YAML manifests, Helm charts, and container images. These artifacts are intended for deployment in the CSP's cloud native environment. However, every CSP has somewhat different rules, policies, security standards, API versions, and approaches to lifecycle operations (e.g. use of NFVO capabilities, GitOps pipelines, etc.). Because of that, it is often not possible to deploy the CNF directly in any environment in a consistently replicable way; some adaptations are required. That would normally not pose a problem, as most of these adaptations can be made in the deployment configuration, often in YAML files, by either the CSP's DevOps team or the vendor's delivery personnel. Nonetheless, we often encounter situations where CSPs are not allowed to perform such adaptations (under threat of losing support) because these artifacts are part of the release and may be adapted only in a new release delivery or through a custom change request. As a result, this situation often leads to a frustrating cycle of discussions and significantly hampers the CNF onboarding process.
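
As a minimal sketch of how such adaptations are typically kept outside the vendor's release, a CSP-specific values file can be layered on top of the unmodified Helm chart at deployment time. All keys, names, and values below are hypothetical and depend entirely on what the vendor's chart exposes.

```yaml
# csp-overrides.yaml -- hypothetical CSP-specific deployment configuration,
# kept separate from the vendor's released chart so the release artifacts
# themselves remain untouched.
image:
  registry: registry.csp.example.com    # CSP's internal registry mirror
podSecurityContext:
  runAsNonRoot: true                    # align with the CSP's security policy
  runAsUser: 10001
persistence:
  storageClassName: csp-block-storage   # CSP-specific storage class
resources:
  limits:
    cpu: "2"
    memory: 4Gi
```

Such a file would then be applied with something like `helm upgrade --install example-cnf vendor-repo/example-cnf -f csp-overrides.yaml`, leaving the vendor's released chart unchanged.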

Validation. This step did not really exist in the previous scenarios due to reliance on pre-validation and pre-integration. Because of the number of permutations found in cloud native ecosystems, pre-validation has limited value. Only validation of CNFs on the CSP's premises, with the CSP's flavor of cloud native infrastructure and its specific integrations, has high relevance and value for concluding whether the CNF can be deployed and promoted to production. Today we still see that many CNFs are not ready to be validated in the local CSP environment, and their vendors instead insist on conformance with the pre-validation stack. This practice is unsustainable and requires a fresh and flexible local validation approach. Automation (Continuous Testing) is especially important when validating frequently released cloud native applications and checking conformance with frequently updated cloud platforms.
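
One possible shape of such automation, assuming the CSP keeps the vendor artifacts and its own overrides in a Git-driven CI system, is a pipeline that re-validates every new CNF release and every platform update against the CSP's own cluster flavor. The fragment below is a hypothetical GitLab CI sketch; stage names, chart paths, and the choice of tools (helm, kubeconform) are illustrative, not prescribed by any standard.

```yaml
stages:
  - lint
  - validate

lint-chart:
  stage: lint
  script:
    # Static checks of the vendor chart rendered with the CSP's own overrides.
    - helm lint charts/example-cnf -f csp-overrides.yaml
    - helm template charts/example-cnf -f csp-overrides.yaml | kubeconform -strict -

validate-on-csp-cluster:
  stage: validate
  script:
    # Deploy to a staging namespace on the CSP's cluster flavor and run smoke tests.
    - helm upgrade --install example-cnf charts/example-cnf -f csp-overrides.yaml --namespace staging --create-namespace --wait
    - helm test example-cnf --namespace staging
```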

Automation. In the ongoing pursuit of end-to-end orchestration, deployment, and configuration automation, the Telco industry has devised numerous frameworks, models, and standards. Some have achieved considerable success, while others have seen varying levels of adoption. However, the cloud native ecosystem, with its focus on GitOps practices, is propelling CNFs toward more advanced and automated models.

Many CNFs still rely on manual artifact deployment and are rooted in traditional telco methods, such as NETCONF and YANG for configuration management. These practices pose significant challenges for CSPs aiming for a fully automated CNF lifecycle. Moreover, the ETSI standard follows an imperative top-down approach, often characterized as "fire and forget". This approach does not readily support reconciliation and depends on orchestration entities operating "out-of-band", externally to Kubernetes. Even when CNFs follow the Kubernetes native approach, we face challenges with the quality of artifacts such as Helm charts, which are neither generalized nor easily customizable, as well as with divergent configuration schemas. All of this creates further complexity in the transition to the declarative and GitOps-driven automation models prevalent in the cloud native ecosystem.
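
One common way to close this gap, shown here only as a sketch, is a GitOps controller that treats a Git repository as the single source of truth and continuously reconciles the cluster against it. The example uses Argo CD; Flux would be analogous. Repository URLs, paths, and names are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-cnf
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/csp/cnf-deployments.git   # placeholder repo
    targetRevision: main
    path: example-cnf/production
  destination:
    server: https://kubernetes.default.svc
    namespace: example-cnf
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert out-of-band changes, the opposite of "fire and forget"
```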

Dependencies. Cloud native applications should be completely decoupled from the underlying infrastructure, especially hardware. Nevertheless, in practice today there are often hard dependencies present, be it on specific technologies or on specific vendor products. CNFs often require a specific hardware type or brand (e.g. CPU, NIC) and do not allow for the flexibility supported by local validation, and many CNFs are not able to run on an arbitrary CNCF-certified Kubernetes distribution. Even when such a dependency is fulfilled, it receives little ongoing attention. For example, a CNF can break because the firmware of the network card was updated, which shows that pre-validation of that particular combination of dependencies was not performed and proactive CNF update measures were not taken. This creates a lot of operational burden and negatively impacts KPIs for CNF deployments.
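
The fragment below illustrates how such hard dependencies typically surface in practice: a pod spec that only schedules onto nodes with one specific NIC and CPU family. The label keys follow the Node Feature Discovery convention, but the concrete values are examples, not taken from any real CNF.

```yaml
# Fragment of a pod spec exhibiting the hard hardware dependency described above.
# The CNF will only schedule onto nodes exposing this exact NIC and CPU family,
# so any other hardware requires a new vendor validation cycle.
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-0200_8086.present: "true"   # one specific NIC vendor
    feature.node.kubernetes.io/cpu-model.family: "6"           # one specific CPU family
```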

Lifecycle. Kubernetes occupies a central place in cloud native infrastructure by following the paradigm of ephemeral resources and relying on "rolling upgrades" to deploy changes. This paradigm is applied in the lifecycle management of both applications and the cloud infrastructure. As a consequence, Kubernetes cluster nodes ordinarily have a rather short uptime of several days to several weeks. This is in contrast with the traditional carrier-grade focus on the uptime of individual system elements. Although large parts of CNFs do not have problems with this cloud native lifecycle approach, we find that many CNFs have elements that are rather sensitive to it. CNFs that are not resilient to ephemeral Kubernetes nodes (e.g. crashing when cluster scaling or an upgrade occurs) lead to service interruptions during lifecycle operations, which is not acceptable; for example, SCTP pods can cause S1 interface interruptions, effectively ruling out in-service software upgrades (ISSU). It is also often the case that Pod Disruption Budgets are not properly set, the consequence of which is either service interruption or lifecycle operations being blocked.
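
A minimal sketch of the missing piece mentioned above: a PodDisruptionBudget that lets the platform drain and upgrade nodes while keeping a defined number of pods serving. The names and the threshold are illustrative only.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-cnf-signalling
  namespace: example-cnf
spec:
  minAvailable: 2          # keep at least two signalling pods during voluntary disruptions
  selector:
    matchLabels:
      app: example-cnf
      role: signalling
```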

Tracing. The further cloud native transformation progresses, the more challenging it becomes to perform end-to-end protocol tracing using traditional mechanisms based on tapping points at the network fabric level. The reason is the dynamic nature of cloud native workloads, including CNFs. When several CNFs run within one large data center, their pods can be distributed to any of the servers in any of the racks. This means that a particular communication can traverse multiple network elements, and as the traditional tapping setup is not configurable or capable enough, it is practically impossible to create reasonable port mirroring to capture the traces. In many cases, the CNFs or their microservices run on the same node and their communication does not traverse the data center network fabric at all. Furthermore, encryption and mTLS have become a de-facto standard for CNFs, so even if tapped, network traffic cannot really be analyzed and the purpose of tracing cannot be fulfilled. Cloud native tracing mechanisms (e.g. eBPF) unfortunately do not help here, as most of the telco-relevant traffic goes via secondary interfaces (Multus) that are often directly assigned to the CNF, bypassing the host kernel drivers. This is especially true for user plane CNFs such as a UPF, firewall, or Internet gateway.
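
For illustration, a secondary interface of the kind mentioned above is typically attached through a Multus NetworkAttachmentDefinition backed by SR-IOV: the traffic is handed to the pod as a virtual function and never traverses the host kernel stack, which is why kernel-based tracing does not see it. Names and the resource identifier below are placeholders.

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: upf-n3-sriov                     # hypothetical user-plane interface
  namespace: example-cnf
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_n3   # placeholder SR-IOV resource pool
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": { "type": "static" }
    }
```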

Architecture. We still see CNFs whose architecture exhibits properties of Virtualized Network Functions (VNFs). For example, we see "pinning" of Pods to particular NUMA nodes or, worse, to specific cluster nodes. We also still see 1+1 redundancy models for Pods within the cluster instead of N+1. Although it is technically possible to run such Network Functions on cloud native infrastructure, this increases the burden of operating them and risks a negative impact on service quality, as small disruptions that are normal in cloud native infrastructures result in problems within the CNFs. Furthermore, the scalability of today's CNFs is still sub-optimal. In many cases it still relies on vertical scaling and manual interventions, so a sudden increase in demand causes performance degradation and even downtime if the system has not been dimensioned in advance for that peak load.
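
By contrast, the cloud native pattern implied above relies on horizontal, N+1 scaling driven by demand. A minimal sketch with a HorizontalPodAutoscaler (all names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-cnf
  namespace: example-cnf
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-cnf
  minReplicas: 3          # N+1: always at least one more instance than strictly needed
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before saturation instead of vertical resizing
```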

Security. In our experience so far, CNFs in their default setup have quite a relaxed posture when it comes to cluster security-relevant aspects such as Roles, ClusterRoles, privileged access, cluster node level access, and similar functionality. We frequently observe that the principle of least privilege is not consistently followed: Roles frequently request rights for everything ("*") and ClusterRoles are used without real need. CNFs sometimes use problematic practices (such as hostPath mounts for writing their logs, hostPorts for communication, privilege escalation, running containers as root, managing the configuration of the node networking stack, and performing dangerous sysctl calls), none of which are allowed in a properly hardened environment. Such CNFs seem to assume that the infrastructure can be consumed with cluster admin rights and without any restrictions, which in realistic circumstances is never the case. Such an expectation could be reasonable in a combined/silo package where the CNF and infrastructure come from a single vendor as a managed package. In other cases, however, CNFs are "guests" on the infrastructure and as such must operate under appropriate security restrictions and limitations.
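
A minimal sketch of the hardened defaults a CSP typically expects, with all names being placeholders: a namespaced Role scoped to the few resources the CNF actually needs instead of a wildcard ClusterRole, and a container securityContext that avoids root, privilege escalation, and host access.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: example-cnf-config-reader
  namespace: example-cnf
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]   # least privilege: read-only, namespaced, no "*"
---
apiVersion: v1
kind: Pod
metadata:
  name: example-cnf-worker
  namespace: example-cnf
spec:
  containers:
    - name: worker
      image: registry.example.com/example-cnf:1.0.0   # placeholder image
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      # no hostPath volumes, hostPorts, or node-level sysctls
```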

Resilience. In contrast to traditional expectations within the telecommunications domain, one important property of cloud infrastructure is that it is imperfect. Cloud infrastructure does not give strict performance and stability guarantees. However, it offers mechanisms through which applications can achieve a high degree of resilience to this imperfection. Yet we repeatedly encounter cases in which the imperfection of the cloud infrastructure has such a severe impact on a CNF that complete re-deployment is the only viable solution. Examples include CNFs crashing completely because the ephemeral storage on the Kubernetes cluster filled up with that CNF's own logs, or because write operations to persistent volumes could not be performed for a short period of time. This state is unsustainable, as such events are going to happen in a cloud environment all the time. Therefore, applications, including CNFs, that aim to run in the cloud have to account for such events in their design and utilize cloud native mechanisms to maintain robustness consistently or to facilitate automated recovery.
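
As one small example of designing for this imperfection, the fragment below bounds the pod's ephemeral storage so runaway logs affect only that pod, and adds probes so an instance that can no longer serve is restarted automatically rather than requiring a manual re-deployment. Paths, ports, and values are illustrative.

```yaml
# Container fragment: self-protection against infrastructure imperfection.
spec:
  containers:
    - name: example-cnf
      image: registry.example.com/example-cnf:1.0.0   # placeholder image
      resources:
        limits:
          ephemeral-storage: 2Gi   # cap local log/scratch usage on the node
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3        # restart automatically instead of manual recovery
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5
```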




Fragmentation of all layers and a need for change in the deployment model

This is a list of modified challenges based on a Sylva article. This Cloud Native Telecom program can be part of addressing these challenges.

  • Sharing CaaS and physical resources among different applications to reduce wasted compute power, with its CAPEX and energy impacts
  • Complexity for vendors trying to provide multiple cloud platform support
  • Operational burdens on Telcos, as the different “islands” will need different skills and will evolve at different speeds
  • The need for solutions that can evolve with the necessary speed of cloud native

 "if the Telco industry continues with its traditional deployment model then fragmentation is inevitable for the deployment of applications that mandate the use of proprietary CaaS (Container as a Service) and even specific physical compute."  https://the-mobile-network.com/2022/11/why-the-eu-big-five-are-launching-sylva/

source: https://the-mobile-network.com/2022/11/why-the-eu-big-five-are-launching-sylva/

Runtime interoperability of CNF platforms and CNFs

  • CNFs need to make some assumptions about their environment due to their resource needs (e.g. multiple pod networks, network and compute latency). These assumptions have to be the same for all CNF platforms.
  • To achieve interoperability, the CNF platforms need to fulfill these assumptions.
  • To ensure that the assumptions of the CNFs are correct and that the platforms fulfill them, both the CNFs and the platforms need to be tested.
  • There is no point in doing interoperability conformance testing of only one side, as the conformance will not have a target.

4 Comments

  1. Note: The following list of challenges is based on the Sylva article https://the-mobile-network.com/2022/11/why-the-eu-big-five-are-launching-sylva/.

    IMO, the core solution that can be shared among all efforts including this new Cloud Native Telecom program is a goal of vendor-neutral interoperability at all layers.

    Fragmentation of all layers and a need for change in the deployment model

    • Sharing CaaS and physical resources among different applications to reduce wasted compute power with CAPEX and energy impacts
    • Complexity for vendors trying to provide multiple cloud platform support
    • Operational burdens on Telcos as the different “islands” will need different skills and will evolve at different speeds
    • Solutions that can evolve with the necessary speed of cloud native

     "if the Telco industry continues with its traditional deployment model then fragmentation is inevitable for the deployment of applications that mandate the use of proprietary CaaS (Container as a Service) and even specific physical compute."  https://the-mobile-network.com/2022/11/why-the-eu-big-five-are-launching-sylva/


  2. User plane network functions typically require predictable performance, and predictable performance may also be required in some other cases. To ensure it, capabilities such as CPU pinning, NUMA alignment, etc. are a necessity. Similarly, huge pages and jumbo MTU frames are needed for improved performance.
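
    For illustration only (image and sizes are placeholders): such predictable performance is usually requested through Guaranteed QoS with whole CPUs plus huge pages, which, on nodes configured with the static CPU manager and Topology Manager policies, results in pinned, NUMA-aligned cores.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-upf
    spec:
      containers:
        - name: upf
          image: registry.example.com/example-upf:1.0.0   # placeholder image
          resources:
            requests:
              cpu: "8"              # whole CPUs with requests == limits => Guaranteed QoS
              memory: 16Gi
              hugepages-1Gi: 8Gi
            limits:
              cpu: "8"
              memory: 16Gi
              hugepages-1Gi: 8Gi
          volumeMounts:
            - name: hugepages
              mountPath: /dev/hugepages
      volumes:
        - name: hugepages
          emptyDir:
            medium: HugePages
    ```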


  3. Pankaj Goyal please read the whitepaper. Specific performance requirements and specific solutions (such as CPU pinning) need to be discussed. The groups in the CSPs working on this type of thing understand that requirements and SLAs will need to be adjusted.

    There will be tradeoffs.

  4. Gergely Csatari 

    • CNFs need to make some assumptions about their environment due to their resource needs (e.g. multiple pod networks, network and compute latency). These assumptions have to be the same for all CNF platforms.
    • To achieve interoperability, the CNF platforms need to fulfill these assumptions.
    • To ensure that the assumptions of the CNFs are correct and that the platforms fulfill them, both the CNFs and the platforms need to be tested.
    • There is no point in doing interoperability conformance testing of only one side, as the conformance will not have a target.

    Maybe there is a misunderstanding. You are implying that only one option is being proposed and that it is wrong because it does not cover everything in another option. This is incorrect and leads to confusion.

    1. The terms "platform" and "application" do not help with understanding. It is better to talk about the full stack and then about a specific area of the stack; otherwise it is a matter of perspective. K8s running on an OpenStack cluster becomes an application to be managed. And what about Operators that manage applications, are themselves managed, and hold K8s admin privileges?
    2. We can have multiple testing, certification, and conformance efforts. For example, we can have a certification with a more minimal and focused scope and a larger, more comprehensive certification that may or may not include the more minimal one.
    3. It is a simplified and incorrect statement to imply that the current certification, the test catalog, and the testing that is actively happening do not have a target and are only concerned with "one side". The testing is happening with production telecom applications on many vendor-flavored platforms that all have the common denominator of K8s and standard extensions as the base. The platform side must be considered in the testing to allow the tests to run at all. The tests attempt to work with any of the major platforms, specifically any that follow the standard interoperability testing from the K8s Conformance e2e test set.
    4. Implying that testing which is not fully comprehensive has no point is also incorrect and misleading. Getting to a portion of full interoperability reduces the overall effort required. This is proven in software development (open source and commercial) in many ways: MVP product development, 80/20 rules, and "less is more" in open source development (look at the Linux kernel and much of the most widely used software in the world). We all want 100% coverage, but increasing coverage from 0% is almost always better than no coverage in testing (or feature development). Furthermore, no fully comprehensive interoperability testing framework exists that will work for all vendors and all end-users. There probably never will be a perfect match that works for everyone and continues to work year after year.


    I 100% want to see comprehensive testing at all layers and interoperability at all layers. We should write best practices, accompanying tests, etc. for the full stack. Pick the area that is causing you problems with real production use cases. Then let's find best practices that will help solve those issues in an open manner that provides vendor-neutral interoperability and promotes collaboration.

    Also, choosing a subset of practices that are useful today does not stop us from changing which practices we use later, once we have a more comprehensive framework and associated new best practices.

    It is important to make some ongoing improvements to interoperability that people can use "today" while we move towards more coverage in some unknown future. If a developer saves time because of a best practice today, that adds up. The same goes for end-users, ops teams, integration people, etc.

    Our goal is to start seeing the coverage from the Test Catalog and associated certification(s) grow while providing continued value.