Performance Contracts in SDN Systems

Peter Thompson and Neil Davies, Predictable Network Solutions Ltd.

IEEE Softwarization, May 2017



SDN virtualizes connectivity and access to the underlying bearers. This enables a greater variety of routes and new ways to share the bearers to meet customer demands at lower cost. However, customers will need assurances about the fitness for purpose of the delivered service for their critical applications. This requires new ways to quantify their requirements and to measure the delivered service that go beyond simple notions of bandwidth/capacity.

Dependability of network performance

How can users of a network have confidence that it is delivering satisfactory performance? In a traditional context, where network configuration changes only occasionally, this can be achieved through testing. With SDN, however, where the network configuration is highly dynamic, this approach has limited applicability. Either tests must be run very frequently, thereby imposing load on the network, or the user must accept that most network configurations are untested. This is an issue for both the SDN user and the SDN supplier: how does the user know what capacity and SLA to ask for; and how can the supplier decide what level of statistical multiplexing they should target?

Therefore, to maintain user trust, the use of SDN requires a much more proactive approach to quantification and qualification of performance, which needs to be embodied in service contracts. This needs to be based on a more precise definition of performance, including how this impacts the user.

The nature of ‘performance’

‘Performance’ is typically considered as a positive attribute of a system. However, a ‘perfect’ system would be one that always responds without error, failure or delay; real systems always fall short of this ideal, so we can say that the quality of their response is impaired relative to the ideal (such ‘quality impairment’ is thus a privation).

Performance has many constraints, starting from the fact that doing anything takes time and uses resources. Geographical distance defines the minimum possible delay; communication technology sets limits on the time to send a packet and the total transmission capacity; and sharing of resources limits the capacity available to any one stream of packets. The design, technology and deployment of a communications network (which is made more dynamic by SDN) sets the parameters for a best-case (minimum delay, minimal loss) performance at a given capacity. This is what the network ‘supplies’, and this supply is then shared between all the users and uses of the network. This sharing can only reduce the performance and/or the capacity for any individual application/service. Networks share resources statistically and so the resulting quality impairment is also statistical.

Since quality impairment is statistical, 100% delivery is unachievable - and the underlying bearers only have ‘N nines’ availability in any case. Performance thus needs to be considered statistically - but not as averages, as we shall see below.

Performance Requirements

Typical audio impairments that can affect a telephone call (such as noise, distortion and echo) are familiar; for the telephone call to be fit for purpose, all of these must be sufficiently small. Analogously, we introduce a new term, called ‘quality attenuation’ and written ‘∆Q’, which is a statistical measure of the impairment of the translocation of a stream of packets when crossing a network. This impairment must be sufficiently bounded for an application to deliver fit-for-purpose outcomes; moreover, the layering of network protocols isolates the application from any other aspect of the packet transport. This is such an important point that it is worth repeating: the great achievement of network and protocol design (up to and including SDN) has been to hide completely all the complexities of transmission over different media, routing decisions, fragmentation and so forth, and leave the application with only one thing to worry about with respect to the network: the impairment that its packet streams experience, ∆Q.

The impact of any given level of impairment depends on the particular protocols that an application uses and on the time-sensitivity of its delivered outcomes. For example, here is a graphical representation of the effect of varying levels of delay and loss (parameterized by the means of uniform distributions) on the mean and 95th centile of the time to complete a 30kB HTTP 1.0 transfer (contours are labelled in seconds):

Figure 1

Figure 2

Here, on the other hand, is a plot (measured by a researcher at CERN) of the PESQ score of an H.264 audio decoder as a function of the loss rate and jitter of the packet stream:

Figure 3

VoIP is sensitive to jitter whereas HTTP is not; how can we characterize the delivered service in a way that applies to both? In principle, the answer is to use the complete distribution of packet delays (from which the mean delay, jitter or any other measure can be derived), together with information about packet loss, corruption, out of sequence delivery and so forth. While this works mathematically, applying this practically requires some further ideas, which we discuss in the next section.

Quantifying performance

Measuring ∆Q

One practical approach to measuring ∆Q is to inject a low-rate stream of test packets and observe their transit times at various observation points. Because the rate of the stream is low, it is unlikely to materially affect the instantaneous utilization of any resource. The distribution of packet delays (and the proportion of losses) provides a measure of ∆Q. On the assumption that packets do not overtake each other and that the test packets are not given any special treatment, measuring the ∆Q of the test stream samples the ∆Q of all packets following the same path. The figure below shows the cumulative distribution function of just such measurements made between two points 1000 miles apart, connected by core network links:
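To make this concrete, here is a minimal sketch (with hypothetical probe data, not real measurements) of how such test-packet observations can be reduced to a ∆Q summary: a loss ratio plus delay percentiles drawn from the empirical distribution:

```python
# Sketch: estimating ∆Q from a low-rate test-packet stream (hypothetical data).
# Each probe either arrives after some delay (ms) or is lost (None).

def delta_q_summary(observations, percentiles=(50, 95, 99)):
    """Summarize the delay distribution and loss ratio of probe observations."""
    delays = sorted(d for d in observations if d is not None)
    summary = {"loss_ratio": 1 - len(delays) / len(observations)}
    for p in percentiles:
        # Nearest-rank percentile over the delivered packets only
        idx = min(len(delays) - 1, int(p / 100 * len(delays)))
        summary[f"p{p}_ms"] = delays[idx]
    return summary

probes = [4.1, 4.3, None, 5.0, 4.2, 9.8, 4.4, None, 4.6, 12.1]
print(delta_q_summary(probes))
```

Reporting percentiles rather than a single mean preserves the tail behaviour of the distribution, which, as discussed above, is what determines application outcomes.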

Figure 4

Such data has structure that can be unpacked to expose a great deal about the operation of the network path, but the point here is that the abstract idea of ∆Q can be realized in practice, and provides a single measure of network performance that can be related to the likely behavior of any application sending packets along the same path.

Quantifying the impact of ∆Q

So, ∆Q can be measured, but how can we establish what ∆Q is or is not acceptable for any particular application? Fortunately, the UK public body Innovate UK has funded the development of a generic test environment in a project called Overture (project no. TP710826). This is illustrated in the following figure, which shows the client and server components of an application (more complex configurations can also be used) exchanging packets through the testbed. The upstream and downstream ∆Q can be controlled (and varied) independently; combining this with a measure of how acceptable the application performance is (which may be subjective) sets bounds on the ∆Q such an application can tolerate. To completely characterize the application performance impact of network ∆Q, as shown above for HTTP and VoIP, is a daunting task, but establishing a broad envelope of acceptability is quite straightforward. This can be captured in a fairly simple specification, such as:

  • 50% delivered within 5ms
  • 95% delivered within 10ms
  • 99% delivered within 15ms
  • 1% loss acceptable

Figure 5

At the same time, the load imposed on the network by the application can also be measured, which is important for completing the performance contract, as described below. Note that ‘load’ is also characterized by a distribution, not just an average; this matters for the network’s ability to carry the load and still deliver acceptable ∆Q, for example when deciding how much capacity to allocate in excess of the mean.
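As a small illustration of why the load distribution, not just its mean, matters for capacity allocation, the following sketch (with hypothetical load samples, not Overture data) sizes capacity at a high percentile of offered load:

```python
# Sketch: why load must be characterized by a distribution, not an average.
# Allocating capacity at the mean offered load would congest whenever demand
# bursts; a high percentile of the load distribution is a safer sizing point.

def capacity_for(load_samples_mbps, percentile=99):
    """Return the capacity covering the given percentile of offered load."""
    ordered = sorted(load_samples_mbps)
    idx = min(len(ordered) - 1, int(percentile / 100 * len(ordered)))
    return ordered[idx]

# Hypothetical per-second load samples (Mb/s) for a bursty application
samples = [2, 3, 2, 4, 3, 2, 18, 3, 2, 25]
mean = sum(samples) / len(samples)    # 6.4 Mb/s
print(mean, capacity_for(samples))    # sizing at the mean would be far too small
```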

Describing the contract


Given the information above, we can formulate a technical specification of the relationship between application load and network performance. We call this a ‘Quantitative Timeliness Agreement’ or ‘QTA’ which consists of a set of constraints on the distribution of load and the distribution of loss and delay. Considering the distribution of loss and delay, combining an application requirement as described above with a measure of delivered performance at the SDN service layer would produce a pair of CDFs (Cumulative Distribution Function) such as shown below:

Figure 6

In this case, we can see that the measured performance is unambiguously ‘better’ than the requirement (for example, 50% of packets are delivered within about 4.2ms whereas the requirement was within 5ms); this is the criterion for the network to have ‘met’ the QTA. The constraints on the application load on the network can be managed similarly, so we have clear and unambiguous criteria for conformance on both sides. Note that these can be extended to measures of by how much a requirement has been met (or not met), but to a first approximation a binary characterization is sufficient.
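The binary ‘met’/‘not met’ decision can be sketched as a simple check of the measured delivery data against the QTA’s percentile and loss bounds (illustrative code only; the bounds encode the example requirement given earlier):

```python
# Sketch: a binary QTA conformance check. The example requirement from the
# text (50% within 5 ms, 95% within 10 ms, 99% within 15 ms, <=1% loss) is
# encoded as bounds and compared against measured delivery data.

QTA = {"percentile_bounds_ms": {50: 5.0, 95: 10.0, 99: 15.0},
       "max_loss_ratio": 0.01}

def meets_qta(delays_ms, losses, qta):
    """True iff the measured distribution is at least as good as the QTA."""
    total = len(delays_ms) + losses
    if losses / total > qta["max_loss_ratio"]:
        return False
    ordered = sorted(delays_ms)
    for pct, bound in qta["percentile_bounds_ms"].items():
        idx = min(len(ordered) - 1, int(pct / 100 * len(ordered)))
        if ordered[idx] > bound:          # measured CDF crosses the requirement
            return False
    return True
```

Extending this from a yes/no answer to a measure of by how much each bound was met (or missed) is straightforward, as noted above.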

Relationship between QTA and SLA

The QTA captures the measurable aspects of the application demand and the network supply; the SLA is the framework surrounding it.

Figure 7

The SLA will specify under what circumstances the QTA may be breached, some of which might involve prior notification, and may place limits on how often and for how long this can happen. This provides the application with a clear measure of the technical risk it faces from the network, which it needs to either mitigate or propagate upwards.

Relationship between SLA and contract

The contract captures all remaining issues, including the consequences of breaching the SLA. It is important to note that there is an asymmetry in risk between the supplier and the consumer of the service: the supplier might suffer a financial penalty for non-delivery, but the consequences for the application might be beyond such remedies, if it relates to safety of life for example.

Figure 8

This gives an overall picture of a performance contract as in the figure above. SDN systems need to evolve to deliver such contracts dynamically and efficiently to provide enhanced value to the applications and services of the future.



Peter Thompson became Chief Technical Officer of Predictable Network Solutions in 2012 after several years as Chief Scientist of GoS Networks (formerly U4EA Technologies). Prior to that he was CEO and one of the founders (together with Neil Davies) of Degree2 Innovations, a company established to commercialize advanced research into network QoS/QoE, undertaken during four years that he was a Senior Research Fellow at the Partnership in Advanced Computing Technology in Bristol, England. Previously he spent eleven years at STMicroelectronics (formerly INMOS), where one of his numerous patents for parallel computing and communications received a corporate World-wide Technical Achievement Award. For five years he was the Subject Editor for VLSI and Architectures of the journal Microprocessors and Microsystems, published by Elsevier. He has degrees in mathematics and physics from the Universities of Warwick and Cambridge, and spent five years doing research in general relativity and quantum theory at the University of Oxford.


Neil Davies is an expert in resolving the practical and theoretical challenges of large scale distributed and high-performance computing. He is a computer scientist, mathematician and hands-on software developer who builds both rigorously engineered working systems and scalable demonstrators of new computing and networking concepts. His interests center around scalability effects in large distributed systems, their operational quality, and how to manage their degradation gracefully under saturation and in adverse operational conditions. This has led to recent work with Ofcom on scalability and traffic management in national infrastructures.

Throughout his 20-year career at the University of Bristol he was involved with early developments in networking, its protocols and their implementations. During this time he collaborated with organizations such as NATS, Nuclear Electric, HSE, ST Microelectronics and CERN on issues relating to scalable performance and operational safety. He was also technical lead on several large EU Framework collaborations relating to high performance switching.  Mentoring PhD candidates is a particular interest; Neil has worked with CERN students on the performance aspects of data acquisition for the ATLAS experiment, and has ongoing collaborative relationships with other institutions.



Dr. Elio Salvadori is the Director of the CREATE-NET Research Center within Fondazione Bruno Kessler (FBK); he is responsible for managing, organizing, executing and monitoring the activities of the Center.

He joined CREATE-NET in 2005 and since then has served in different roles within the organization: from 2007 to 2012 he led the Engineering and Fast Prototyping (ENGINE) area, and from 2012 to 2014 he acted as SDN Senior Advisor while holding a position as CEO of Trentino NGN Srl. He then moved fully back to CREATE-NET, acting as Research Director and then as Managing Director until the incorporation into FBK at the end of 2016.

Prior to CREATE-NET, Dr. Salvadori developed a mixed industrial and academic background: he graduated in 1997 from Politecnico di Milano through an internship pursued at CoreCom, a research lab funded by Politecnico and Pirelli Optical Systems (now Cisco Photonics). Afterwards he worked as a systems engineer at Nokia Networks in Milan and then at Lucent Technologies in Rome. In 2001 he won a research grant at the University of Trento to pursue a PhD in Information and Communication Technologies. He received his PhD degree in March 2005 with a thesis on traffic engineering for optical networks.

During his research career, he has been involved in several national and European projects on SDN and optical networking technologies as well as on Future Internet testbeds.

His current research interests include software-defined networking (network virtualization and distributed controller architectures), Network Functions Virtualization (NFV) and next-generation transport networks. He has published more than 100 papers in international refereed journals and conferences. He is a member of IEEE and ACM.


Telco Cloud NFV Metrics and Performance Management

Marie-Paule Odini, HPE

IEEE Softwarization, May 2017


Quality of Service and Quality of Experience are key characteristics of Telco environments. As NFV deploys, metrics, performance measurement and benchmarking are becoming more and more important for the Telco Cloud to deliver best-in-class services. If we consider performance management across a classical ETSI NFV architecture, we have different building blocks to consider: the NFV Infrastructure (NFVI), the Management and Orchestration (MANO) stack, and the different Virtual Network Functions (VNFs) and Network Services (NSs). A number of metrics need to be defined when designing each of these components; methods then need to be implemented to collect these metrics appropriately, and interfaces need to be available to carry the results across the architecture to the different ‘authorized and subscribed consumers’ of the metrics.

Performance measurements can be done as part of pre-validation, with simulated steady traffic or with traffic peaks, but they can also be done on a live environment, on an ongoing basis or on demand, to check the behaviour of the network. Two types of measurement are typically performed: Quality of Service, ensuring that the network behaves according to expectations, and Quality of Experience, ensuring that the user perception of the network and service quality is according to expectations.

These different concepts were described in some initial ETSI NFV specifications, i.e. PER001 [1], then further refined in other specifications that are detailed below. In parallel, a number of tools have been designed by the open source community, in particular in the context of the OPNFV collaborative project [15]. Other standards organizations, such as the IETF, have also specified benchmarking. Strong collaboration has occurred between these different entities and contributors to ensure consistent and complementary work.
As NFV touches many different areas, such as the Telco Cloud, fixed and mobile network functions, customer premises environments, management platforms and processes, and service deployment and operation, many different types of metrics have to be defined and collected. It became obvious that a number of metrics could be leveraged from existing standards in some of these areas, but also that some overlaps or inconsistencies existed when mapping those metrics to the ETSI NFV architecture. As a result, an initiative was launched across the industry to align metrics for NFV across key stakeholders. In parallel, a few telecom operators advanced in NFV deployment, such as Verizon [16], have issued requirements for the metrics they want suppliers to provide. Finally, as technology evolves with new hardware, networking and virtualization capabilities, metrics and the methods and tools for measuring them must also evolve; new usage likewise drives new performance requirements, which in turn drive new technologies, metrics and measurement capabilities. In conclusion, NFV metrics and performance management is a long journey, and this paper gives an introduction and update on some of the current highlights.

Definition and terms used in this paper

Metric: standard definition of a quantity, produced in an assessment of performance and/or reliability of the network, which has an intended utility and is carefully specified to convey the exact meaning of a measured value.
NOTE: This definition is consistent with that of Performance Metric in IETF RFC 2330 [8], ETSI GS NFV-PER001 [1] and ETSI GS NFV INF010 [2].

Measurement: set of operations having the object of determining a Measured Value or Measurement Result.
NOTE: The actual instance or execution of operations leading to a Measured Value. Based on the definition of measurement in IETF RFC 6390 [11], as cited in ISO/IEC 15939 and used in ETSI GS NFV INF010 [2].

NFV: Network Function Virtualization, as defined by ETSI NFV. NFV consists of virtualizing network functions for fixed and mobile networks and providing a managed virtualized infrastructure, with compute, network and storage virtual resources, that allows the proper deployment and operation of those virtualized network functions to deliver predictable, quality communication services.

NFV architecture and standard metrics

Considering the ETSI NFV reference architecture, metrics can be collected on any of the components of this architecture as described below in Fig 1.

Figure 1

Fig 1 – Metrics in an NFV reference architecture

Initial work on performance and metrics was conducted in ETSI NFV PER001 [1], which focused on the NFV Infrastructure (NFVI), providing a list of minimal features and requirements for the “VM Descriptor”, the “Compute Host Descriptor”, the hardware (HW) and the hypervisor to support different workloads, typically data-plane or control-plane VNFs, and to ensure portability. This document underpinned the industry’s initial NFV work, helping developers best leverage IT systems for telecom functions, and infrastructure providers enhance their systems’ capabilities to meet the requirements of virtualized telecom software.

Then ETSI NFV INF010 [2] enumerated metrics for NFV infrastructure (NFVI), management and orchestration (MANO) service qualities that can impact the end user service qualities delivered by VNF instances hosted on NFV Infrastructures. These “service quality metrics” cover both direct service impairments, such as IP packets lost by NFV virtual networking which impacts end user service latency or quality of experience, and indirect service quality risks, such as NFV management and orchestration failing to continuously and rigorously enforce all anti-affinity rules which increases the risk of an infrastructure failure causing unacceptable VNF user service impact.

The objective of [2] is to define explicit metrics between Service Providers, who provide the NFVI and MANO, and Suppliers, who provide VNFs, to ensure that the resources provided by Service Providers, including operational environments, deliver the level of performance that a given VNF requires to deliver predictable quality of service for a given end-user service. Table 1 summarizes the different metrics (source: ETSI NFV INF010 [2]).

Table 1: Summary of NFV Service Quality Metrics

| Service Metric Category | Speed | Accuracy | Reliability |
| Orchestration Step 1 (e.g. Resource Allocation, Configuration and Setup) | VM Provisioning Latency | VM Placement Policy Compliance | VM Provisioning Reliability; VM Dead-on-Arrival (DOA) Ratio |
| Virtual Machine Operation | VM Stall (event duration and frequency); VM Scheduling Latency | VM Clock Error | VM Premature Release Ratio |
| Virtual Network Establishment | VN Provisioning Latency | VN Diversity Compliance | VN Provisioning Reliability |
| Virtual Network Operation | Packet Delay; Packet Delay Variation (Jitter); Delivered Throughput | Packet Loss Ratio | Network Outage |
| Orchestration Step 2 (e.g. Resource Release) | - | - | Failed VM Release Ratio |
| Technology Component as-a-Service (TcaaS) | TcaaS Service Latency | - | TcaaS Reliability (e.g. defective transaction ratio); TcaaS Outage |

ETSI NFV IFA003 [6] focuses on one important element of the NFV infrastructure, the virtual switch (vswitch), defining performance metrics for benchmarking vswitches in different use cases (overlay, traffic filtering, load balancing, etc.). It also defines the underlying NFVI host and VNF characteristics whose description ensures the consistency of the performance benchmarks; examples of NFVI host parameters include the OS and kernel version, and the hypervisor type and version.

ETSI NFV TST008, “NFVI Compute and Network Metrics Specification” [5], defines normative metrics for compute, network and memory. It also defines the method of measurement and the sources of error. This specification is coupled with IFA027 [4], which will carry these metrics across the MANO interfaces to the different interested entities.

Finally, ETSI NFV IFA027, “Performance Measurement Specifications” [4], is work in progress to define performance measurements for each of the ETSI NFV MANO interfaces: Vi-Vnfm, Ve-Vnfm-em, Or-Vnfm, Or-Vi and Os-Ma-nfvo. The objective is a normative specification standardizing the performance metrics that are carried across the different interfaces.

VNF benchmarking

Besides defining metrics and being able to generate and collect measurements, it is quite important when designing a VNF to be able to properly size the set of resources required to meet a given performance target. Several publications address this topic, either focusing on benchmarking a specific function, e.g. a vswitch [6] [7], a firewall [9] or an IMS [13], or, beyond that, defining a more generic methodology and platform [10] for benchmarking any VNF: a simple VNF with only one VNFC, or a complex VNF with multiple VNFCs, data and control planes, and intensive compute, I/O or memory usage.

NFV Vital [13] is one such methodology and platform; it studied different use cases, including the performance of the open-source virtual IMS ‘Clearwater’. The diagram below shows the ‘best’ performance when scaling the different deployment flavors, s (small), m (medium) and l (large), with the large (red) configuration handling typically 1000 calls.

Figure 2

Fig 2 – IMS VNF Benchmarking with NFV Vital and opensource vIMS Clearwater

Opensource tools

OpenStack, as an open-source Virtual Infrastructure Manager (VIM), provides a number of tools for performance analysis and benchmarking that can be leveraged for the NFVI.

  • OpenStack Rally: provides a framework for performance analysis and benchmarking of individual OpenStack components as well as full production OpenStack cloud deployments. For instance, on the Nova compute module, this allows measuring the time to boot a VM, delete a VM, etc.
  • OpenStack Monasca: provides a framework to define and collect metrics and to set alarms and statistics. Monasca already includes an analytics module that leverages well-known open-source components such as the Kafka message bus, Zookeeper and other Hadoop ecosystem components.
  • OpenStack Ceilometer: collects, normalises and transforms data produced by OpenStack services. Three types of meters are used (Cumulative, Delta and Gauge), and a set of measurements is defined for different modules: OpenStack Compute, Networking and Block/Object Storage, but also bare metal, SDN controllers, Load Balancer-aaS (as a Service), VPNaaS, Firewall-aaS and Energy.
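The relationship between Ceilometer’s three meter types can be illustrated with a small helper (not part of the Ceilometer API, purely illustrative) that derives Delta values from a Cumulative meter’s successive readings:

```python
# Sketch: how the three Ceilometer meter types relate. A Cumulative meter
# only ever increases; a Delta is the difference between successive samples;
# a Gauge is an instantaneous reading. (Illustrative helper, not Ceilometer code.)

def deltas(cumulative_samples):
    """Derive Delta values from a Cumulative meter's successive readings."""
    return [b - a for a, b in zip(cumulative_samples, cumulative_samples[1:])]

print(deltas([100, 150, 150, 220]))   # [50, 0, 70]
```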

OPNFV is the Linux Foundation open source project that is developing a reference implementation of the NFV architecture. OPNFV leverages as much as possible existing code and tools from upstream open source projects that contribute to the NFV architecture, such as OpenStack but also OVS (Open vSwitch), OpenDaylight (an SDN controller), etc. OPNFV has defined a set of ‘umbrella’ tools for testing NFV environments; among these, the Qtip, StorPerf and Vsperf project tools are specifically focused on performance management.

  • OPNFV Qtip is an OPNFV project that provides a framework to automate the execution of benchmark tests and collect the results in a consistent manner to allow comparative analysis of the results.
  • OPNFV Vsperf is another OPNFV project that provides a set of tools for VSwitch performance benchmarking. It provides an automated test-framework and comprehensive test suite based primarily on RFC2544 [10] and IFA003 [6], sending fixed size packets at different rates or fixed rate, for measuring data-plane performance of Telco NFV switching technologies as well as physical and virtual network interfaces (NFVI). It supports Vanilla OVS and OVS with DPDK (Data Plane Development Kit).
  • OPNFV Storperf defines metrics for block and object storage and provides a tool to measure performance in an NFVI. The current OPNFV release, “Danube”, provides performance metrics monitoring. For block storage, it assumes iSCSI-attached storage, though local direct-attached storage or Fibre Channel-attached storage could also be tested. For object storage, it assumes an HTTP-based API, such as OpenStack Swift, for accessing the storage. Typical metrics are IOPS (input/output operations per second) and latency for block storage, and TPS (Transactions Per Second), error rate and latency for object storage. Throughput calculations are based on the formulae:

    (1) Throughput = IOPS * block size (Block), or
    (2) Throughput = TPS * object size (Object).

  • OPNFV incubating Barometer: this new OPNFV project, closely aligned with ETSI NFV TST008 [5], will provide interfaces to support monitoring of the NFVI. Draft metrics include Compute (CPU/vCPU, memory/vmemory and cache utilization, plus HW: thermal, fan speed), Network (packets, octets, dropped packets, error frames, broadcast packets, multicast packets, TX and RX) and Storage (disk utilization, etc.).
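The StorPerf throughput formulae (1) and (2) above translate directly into code; the conversion to MB/s shown here is an illustrative choice of unit:

```python
# Sketch: the StorPerf throughput formulae (1) and (2) as code.

def block_throughput_mbps(iops, block_size_bytes):
    """Formula (1): Throughput = IOPS * block size, converted to MB/s."""
    return iops * block_size_bytes / 1e6

def object_throughput_mbps(tps, object_size_bytes):
    """Formula (2): Throughput = TPS * object size, converted to MB/s."""
    return tps * object_size_bytes / 1e6

# e.g. 20,000 IOPS at 4 KiB blocks
print(block_throughput_mbps(20_000, 4096))   # 81.92 MB/s
```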

Telco Cloud evolution and new metrics

As the Telco Cloud evolves to embrace new technologies, such as new hardware platforms for the NFVI (including hardware accelerators such as FPGAs, or smart NICs), new virtualization engines such as containers, and SDN-based networking, new metrics are being defined that either vendors provide through an open API or that can be measured with black-box methods. Standard interfaces may also need to be updated to support these new data collection and distribution mechanisms.

Moving to 5G, new requirements also start to materialize, such as end-to-end network slicing performance, which requires new performance measurement and correlation mechanisms that keep up with the dynamicity of the network and its associated virtual resources. A typical example is video streaming through a network slice, where comparing network metrics (QoS) with customer quality of experience (QoE) measurements may trigger some tuning of the slice: adding virtual resources, for example additional bandwidth or more vCPUs to process the video, or inserting a VNF into the service chain, such as a video optimizer.

Last but not least, with analytics and machine learning, predictive analytics can support performance analysis by processing data that are not just classical ‘collected’ metrics: by comparing behaviors, such as network, application, consumer or device behaviors, it can derive performance degradation or improvement and identify root causes. As networks become more programmatic and more dynamic, the combination of performance management, metrics data collection and broader analytics will become more and more relevant.
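The QoS/QoE-driven slice tuning described above might be sketched as a simple control step; the thresholds, metric names and actions here are purely hypothetical, chosen only to illustrate the feedback loop:

```python
# Sketch (hypothetical thresholds and actions): comparing a user-side QoE
# score with a network-side QoS metric to decide how to tune a video slice.

def tune_slice(qoe_mos, packet_loss_ratio, slice_cfg):
    """Return an adjusted slice configuration when QoE falls below target."""
    cfg = dict(slice_cfg)
    if qoe_mos >= 4.0:               # experience acceptable: leave slice alone
        return cfg
    if packet_loss_ratio > 0.01:     # network is the bottleneck: add bandwidth
        cfg["bandwidth_mbps"] *= 2
    else:                            # network fine: add compute for the video VNF
        cfg["vcpus"] += 2
    return cfg

print(tune_slice(3.2, 0.02, {"bandwidth_mbps": 100, "vcpus": 4}))  # doubles bandwidth
```

In a real deployment this decision would be taken by the orchestrator, driven by the correlated QoS/QoE measurements discussed above, rather than by fixed thresholds.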



[1]          "Network Function Virtualization: Performance and Portability Best Practices", ETSI GS NFV-PER001 V1.1.1 (2014-06)

[2]          “Service Quality Metrics“, ETSI GS NFV-INF010, V1.1.1 (2014-12)

[3]          “Report on Quality Accountability Framework”, ETSI GS NFV-REL005 V1.1.1 (2016-01)  

[4]          “Performance Measurement Specifications”, ETSI GS IFA027, work in progress

[5]          “NFVI Compute and Network Metrics Specification”, ETSI GS TST008

[6]          “vSwitch Benchmarking and Acceleration Specification”, ETSI GS NFV-IFA003 v2.1.1 (2016-04)

[7]          Tahhan, M., O’Mahony, B., and A. Morton, "Benchmarking Virtual Switches in OPNFV", draft-vsperf-bmwg-vswitchopnfv-00 (work in progress), July 2015.

[8]          “Framework for IP Performance Metrics”, IETF RFC2330

[9]          "Benchmarking Methodology for Firewall Performance", IETF RFC 3511

[10]        “Benchmarking Methodology for Network Interconnect Devices”, IETF RFC2544

[11]        “Guidelines for Considering New Performance Metric Development”, IETF RFC6390

[12]        TL 9000 Measurements Handbook, release 5.0, July 2012, QuestForum

[13]        NFV Vital: A Framework for Characterizing the Performance of Virtual Network Functions, IEEE, Nov 2015

[14]        OpenStack:

[15]        OPNFV:

[16]        Verizon SDN-NFV Reference Architecture



Marie-Paule Odini holds a master's degree in electrical engineering from Utah State University. She has extensive telecom experience covering both voice and data. After managing the HP worldwide VoIP program, the HP wireless LAN program and the HP Service Delivery program, she is now HP CMS CTO for EMEA and also a Distinguished Technologist, NFV/SDN, at Hewlett-Packard. Since joining HP in 1987, Odini has held positions in technical consulting, sales development and marketing within different HP organizations in France and the U.S. All of her roles have focused on networking or the service provider business, either in solutions for the network infrastructure or for operations.



Stefano Salsano is Associate Professor at the University of Rome Tor Vergata. His current research interests include Software Defined Networking, Information-Centric Networking, Mobile and Pervasive Computing, and Seamless Mobility. He participated in 16 research projects funded by the EU, being Work Package leader or unit coordinator in 8 of them (ELISA, AQUILA, SIMPLICITY, Simple Mobile Services, PERIMETER, OFELIA, DREAMER/GN3plus, SCISSOR) and technical coordinator in one of them (Simple Mobile Services). He has been principal investigator in several research and technology transfer contracts funded by industries (Docomo, NEC, Bull Italia, OpenTechEng, Crealab, Acotel, Pointercom, s2i Italia) with a total funding of more than 1.3M€. He has led the development of several testbeds and demonstrators in the context of EU projects, most of them released as Open Source software. He is co-author of an IETF RFC and of more than 130 papers and book chapters that have been collectively cited more than 2300 times. His h-index is 27.


Evolving End to End Telemetry Systems to Meet the Challenges of Softwarized Environments

Michael J. McGrath and Victor Bayon-Molino, Intel Labs Europe, Leixlip, Ireland

IEEE Softwarization, May 2017


The advent of Software Defined Networking (SDN), Network Function Virtualization (NFV), Cloud and Edge Computing is redefining the network and Information Communication Technology (ICT) domains into one converged, software-oriented domain that enables a highly flexible service paradigm [1]. This transformation towards ‘softwarization’ generates significant challenges in how we monitor and manage these new service environments. Traditionally, telemetry has provided the key capability for monitoring both network and ICT infrastructural elements and the corresponding services. However, the ‘softwarization’ of functions, and their execution on heterogeneous infrastructure environments built from standard high volume (SHV) components, requires a significant reimagining of how we use and exploit telemetry. Given the multi-layer composition of software-oriented functions, which comprise numerous moving parts (heterogeneous and distributed infrastructures; virtualization layers such as virtual machines and containers; service consolidation; etc.), a full end-to-end view of services is required. Achieving this view requires a cultural change in how we use metrics: we need to move from a highly compartmentalized interpretation, in which metrics are treated in per-function isolation, to one that utilizes metrics across all constituent elements to deliver a full end-to-end service view.

The cloudification of the networking world and new 5G use cases such as connected cars and the Internet of Things (IoT) will require a newly dynamic, resilient, available and scalable network. This will result in an explosion of telemetry data in terms of volume, velocity and variety. The control and signaling planes of these networks will have to be highly performant in order to support the low-latency requirements of network provisioning and adjustment. Fast queries against historical telemetry backends will also be required in order to support sub-10-millisecond round-trip times for the control of potentially thousands of services. These services will be distributed over tens of thousands of physical and virtual resources, requiring appropriate exploitation of the telemetry data in order to inform intelligent, orchestration-driven service management. In this context, utilizing telemetry in a standalone fashion, without sophisticated analytics to filter and transform the enormous volumes of diverse metrics into meaningful and actionable insights that can be used for automation, will not be viable.

Conventional monitoring approaches, for example network monitoring, have been based on non-integrated proprietary architectures and standalone system tools: ping for reachability and end-to-end latency, traceroute for path availability, network management systems [2], network load and utilization measurement [3] (i.e. standalone metrics), Internet Protocol Flow Information Export (IPFIX)/NetFlow [4], Simple Network Management Protocol (SNMP) traps and polling [5], etc. The situation is further complicated in multi-vendor environments by device heterogeneity across different network segments. This results in different data formats, resolutions and sampling rates, non-standard interfaces, and siloed telemetry. These challenges make it very difficult to obtain a fully correlated and holistic view of the infrastructure and services. Addressing them requires a telemetry fabric that provides an end-to-end platform for metrics collection and processing, with multiple hierarchical analytical and actuation points in order to exploit the metrics efficiently.
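As a concrete illustration of the heterogeneity problem just described, the sketch below (with invented vendor names, field names and units) maps vendor-specific samples onto one common schema, the kind of normalization step a telemetry fabric must perform before any cross-device correlation is possible:

```python
# Hypothetical sketch: two imaginary vendors ("acme", "globex") report the
# same latency metric in different shapes and units; the fabric normalizes
# both onto a common (metric, value, ts) schema. None of this is a real API.

def normalize(sample: dict, vendor: str) -> dict:
    """Map a vendor-specific sample onto a common schema (seconds, epoch ts)."""
    if vendor == "acme":        # reports milliseconds under "lat_ms"
        return {"metric": "latency_s", "value": sample["lat_ms"] / 1000.0,
                "ts": sample["t"]}
    if vendor == "globex":      # reports seconds under "latency"
        return {"metric": "latency_s", "value": sample["latency"],
                "ts": sample["timestamp"]}
    raise ValueError(f"unknown vendor: {vendor}")

a = normalize({"lat_ms": 12.0, "t": 1700000000}, "acme")
g = normalize({"latency": 0.012, "timestamp": 1700000001}, "globex")
# a and g are now directly comparable despite the different source formats.
```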

Hierarchical and Rules Driven Telemetry Systems

While reporting metrics at fixed periodic intervals, with alert management via ticketing systems and dashboards, will remain an important feature, it will become a legacy approach. The next generation of telemetry systems will integrate two logically separated planes, a data plane and a telemetry control plane, as shown in Figure 1. The data plane transports pre-processed metrics to a scalable backend data repository with appropriate analytics and data visualization capabilities. The telemetry control plane manages and controls the runtime behavior of the telemetry fabric, enabling or disabling data collection and local processing of the data.

Control of the telemetry fabric is executed in an automated manner to support both scalability and responsiveness. Potential approaches include the use of rules or heuristics to define what metrics are collected and how they are processed and published. The rules (for collection, processing and publishing) are generated by an analytics backend in an automated manner, using approaches such as machine learning, for different contexts of interest, e.g. performance, reliability, etc. Once validated, the rules are stored in a repository from which a telemetry control system selects and deploys them to nodes within the telemetry fabric.

The richness and variety of the data that the telemetry fabric can provide, from time-series data to textual, contextual and configuration data, will allow the development of more effective machine learning models. Local telemetry resources will also be able to reason and act with semi-autonomy, minimizing both data and control traffic. Each metric will be configured individually, from collection rates to local processing, and the telemetry fabric will also have responsibility for the lifecycle of metrics. Appropriate policies will determine the lifespan of metrics, their resolution and collection rates at the different hierarchical levels within the telemetry fabric, and when to enable or disable the collection and processing of metrics. The goal here is to prevent data redundancy over time, due to system evolution and changes in service SLAs, by continuously evaluating the value of metrics from a consumption/utility perspective.

Figure 1

Figure 1: Scalable next generation telemetry fabric architecture
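One possible shape for such rules, sketched in Python with invented names and fields (this is not a real API): a rule bundles what to collect, how often, how to pre-process locally and where to publish, and the control plane deploys or disables it on a fabric node as part of the metric's lifecycle.

```python
from dataclasses import dataclass

# Illustrative sketch of a telemetry rule and its deployment; all names,
# fields and processing identifiers are assumptions for illustration.

@dataclass
class TelemetryRule:
    metric: str          # e.g. "cpu.steal"
    interval_s: float    # collection rate
    processing: str      # local pre-processing, e.g. a 1-minute mean
    publish_to: str      # backend topic or endpoint
    enabled: bool = True

class TelemetryNode:
    """A node in the telemetry fabric that accepts rules from the control plane."""
    def __init__(self):
        self.rules = {}

    def deploy(self, rule: TelemetryRule):
        self.rules[rule.metric] = rule  # control plane pushes a validated rule

    def disable(self, metric: str):
        if metric in self.rules:
            self.rules[metric].enabled = False  # lifecycle: stop collecting

node = TelemetryNode()
node.deploy(TelemetryRule("cpu.steal", 1.0, "mean_1m", "perf.context"))
node.disable("cpu.steal")  # policy decided this metric no longer earns its keep
```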

In order to support true scalability, telemetry systems will need to adopt a hierarchical architectural approach. As shown in Figure 2, this architecture separates telemetry collection and processing into domains that map to system architecture constructs. Each telemetry domain has an architecture similar to Figure 1 and is connected to the next higher-level domain along an end-to-end chain, exposing selected metrics for consumption through standard APIs. Abstraction of the metrics through the domains provides an end-to-end view of the services and their host infrastructure by reducing thousands of metrics to a handful for a given context of interest. This end-to-end hierarchical approach, in which all layers of the softwarized environment can be behaviorally correlated and actuated upon in real time, will enable better service performance. It will also facilitate improved end-user quality of experience, increased efficiency of infrastructure utilization, and preemptive issue identification.

Figure 2

Figure 2: Hierarchical Telemetry Fabric Architecture
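The metric abstraction through domains might look something like the following sketch (the field names and the simple chaining rule are illustrative assumptions): each domain collapses its raw metrics into a small per-context summary, and the next level up consumes only those summaries.

```python
from statistics import mean

# Illustrative sketch of hierarchical metric abstraction: many raw metrics
# per domain are reduced to a handful of summary values for one context of
# interest ("performance" here), which the next-level domain consumes.

def summarize(domain_metrics, context="performance"):
    """Collapse a domain's raw metrics into a small summary for one context."""
    latencies = [v for k, v in domain_metrics.items() if k.endswith("latency_ms")]
    utils = [v for k, v in domain_metrics.items() if k.endswith("util_pct")]
    return {
        "context": context,
        "avg_latency_ms": mean(latencies) if latencies else None,
        "max_util_pct": max(utils) if utils else None,
    }

# Two lower-level domains expose summaries to a higher-level domain, which
# composes an end-to-end view along the chain (sum of latencies, worst load).
edge = summarize({"nic0.latency_ms": 0.4, "nic1.latency_ms": 0.6, "cpu.util_pct": 55})
core = summarize({"sw1.latency_ms": 1.0, "cpu.util_pct": 80})
end_to_end = {
    "avg_latency_ms": edge["avg_latency_ms"] + core["avg_latency_ms"],
    "max_util_pct": max(edge["max_util_pct"], core["max_util_pct"]),
}
```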

As telemetry frameworks evolve with new architectural approaches to meet the challenges of softwarized environments, the following general capabilities will be required:

  • A low node level resource footprint for telemetry service agents.
  • Dedicated control and data planes.
  • Dynamically tunable telemetry (up-sampling, down-sampling and filtering at source).
  • Support for rich and varied data sources: from time series to frequency domain (e.g. tracing) to textual (e.g. logs) to structured (e.g. configuration) to network topologies, etc.
  • Integration with analytics pipelines (e.g. machine learning model production, management and model execution and evaluation) as well as batch and real-time (streaming) operation modes.
  • Hierarchical telemetry architecture with multiple analytical and actuation points across the fabric to support advanced machine learning execution across the infrastructure landscape.
  • Trusted infrastructure ingredient-level metrics, e.g. exposure of low-level counters not currently available.
  • Support for tens of thousands of devices with hundreds of metrics per resource, at second and sub-second (frequency domain) sampling rates, with enhanced edge pre-processing.
  • A move from user-defined rules/heuristics and manual management-by-exception to an active and automated way to control and manage infrastructure fabrics.
  • Integration of telemetry system APIs with infrastructure automation, facilitating the creation of machine learning models and rules that can be created, read, updated and deleted (CRUD) at any level of the hierarchy.
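The "dynamically tunable telemetry" capability listed above can be sketched as a source-side filter whose down-sampling factor and change threshold the telemetry control plane can retune at runtime; the class name, parameters and thresholds below are illustrative assumptions, not a real agent API.

```python
# Sketch of filtering and down-sampling at source: only every Nth sample that
# also changed significantly leaves the node, and the control plane can retune
# both knobs at runtime. Names and thresholds are invented for illustration.

class SourceFilter:
    def __init__(self, keep_every=1, delta_threshold=0.0):
        self.keep_every = keep_every            # down-sampling factor
        self.delta_threshold = delta_threshold  # suppress near-constant values
        self._count = 0
        self._last = None

    def tune(self, keep_every=None, delta_threshold=None):
        """The telemetry control plane retunes the source at runtime."""
        if keep_every is not None:
            self.keep_every = keep_every
        if delta_threshold is not None:
            self.delta_threshold = delta_threshold

    def offer(self, value):
        """Return the value if it should be published upstream, else None."""
        self._count += 1
        if self._count % self.keep_every != 0:
            return None                          # down-sampled away
        if self._last is not None and abs(value - self._last) < self.delta_threshold:
            return None                          # filtered: no significant change
        self._last = value
        return value

f = SourceFilter(keep_every=2, delta_threshold=0.5)
published = [v for v in map(f.offer, [10.0, 10.1, 10.1, 13.0, 13.1, 13.1])
             if v is not None]
```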

Evolving Metrics

While the architectural evolution of telemetry platforms for softwarized environments is an important consideration, new metrics will also be required to monitor the unique characteristics of these environments. For example, in deployments where multiple Virtualized Network Functions (VNFs) are consolidated on the same physical compute node in a multi-tenanted environment, how do we measure the effectiveness of isolation, which is important in quantifying noisy-neighbor effects? Existing telemetry collection frameworks are optimized primarily for health monitoring (e.g. status checking via SNMP) and not for observing a wide range of system behaviors and characteristics. Exposure of new metrics at the hardware, virtualization and service layers will be required. Some vendors are already moving towards an all-streaming-plus-analytics approach to telemetry [6] for visibility and granular control. As the networking world moves towards solutions based on merchant silicon and industry-standard platforms (e.g. x86 on the control plane of switches), new networking platforms will integrate instrumentation and telemetry as a main feature of their architectures [7]. The ability to expose fine-grained ingredient-level metrics (Central Processing Unit (CPU), chipset, Network Interface Card (NIC), Solid State Drive (SSD), etc.) will become an important platform differentiator and a source of new infrastructural insights [8]. The ability to trust the source of metrics data will also become increasingly important: telemetry that is ‘signed’ at source, in a manner that supports appropriate verification, will become a necessary feature, especially when it is used for actuation purposes.
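One way telemetry could be signed at source, sketched here with a shared-key HMAC; this is a deliberate simplification, and a real deployment might instead use per-node keys held in hardware or asymmetric signatures. The key, field names and metric names are all illustrative.

```python
import hmac, hashlib, json

# Sketch of telemetry "signed at source": the collector attaches an HMAC so a
# consumer can verify a metric's origin and integrity before acting on it.

NODE_KEY = b"per-node-secret"  # illustrative; provisioned out of band

def sign_metric(sample: dict, key: bytes = NODE_KEY) -> dict:
    payload = json.dumps(sample, sort_keys=True).encode()
    signed = dict(sample)
    signed["sig"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return signed

def verify_metric(sample: dict, key: bytes = NODE_KEY) -> bool:
    sample = dict(sample)                 # work on a copy
    sig = sample.pop("sig", None)
    payload = json.dumps(sample, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return sig is not None and hmac.compare_digest(sig, expected)

signed = sign_metric({"metric": "cpu.util", "value": 72.5, "node": "edge-07"})
ok = verify_metric(signed)                # genuine sample verifies
tampered = dict(signed)
tampered["value"] = 5.0
bad = verify_metric(tampered)             # altered value fails verification
```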


The introduction of SDN, NFV, Cloud and Edge Computing technologies is driving significant and rapid changes in the ICT domain. Key among these changes is the realization of a domain that is significantly more converged, flexible and software-oriented. Telemetry systems need to evolve to deal with the ever-increasing velocity, volume and variety of data becoming available from these softwarized environments. New telemetry architectures will feature separate data and control planes to support scalability and responsiveness. Analytical approaches such as machine learning will be used to define telemetry system behaviors, such as determining what metrics are collected, when they are collected and how they are processed, and to facilitate end-to-end autonomous service and infrastructure management.


  1. Antonio Manzalini, et al., IEEE SDN Initiative (SDN4FNS) white paper, Towards 5G Software-Defined Ecosystems - Technical Challenges, Business Sustainability and Policy Issues, Jan 2014, Available:
  2. "Open Network Management System," [Online]. Available:
  3. A. Hassidim, D. Raz, M. Segalov and A. Shaqed, "Network Utilization: The Flow View," in IEEE INFOCOM 2013, Turin, Italy, 2013.
  4. "NetFlow," [Online]. Available:
  5. "SNMP," [Online]. Available:
  6. "Telemetry and Analytics," [Online]. Available:
  7. "Broadcom BCM56960-Series," [Online]. Available:
  8. "Intel Performance Counter Monitor. PCM," [Online]. Available:



Michael J. McGrath is a senior researcher at Intel Labs Europe. He holds a PhD from Dublin City University and an MSc in Computing from ITB. Michael has been with Intel for 17 years, holding a variety of operational and research roles. Michael’s current research focus is on NFV Orchestration and edge based cloud computing. He is currently a researcher in the H2020 Superfluidity project focusing on the optimization of Virtualized Network Functions and applications over 5G Network/IT infrastructures. Previously Michael was the research lead for T-NOVA FP7 which focused on Virtualised Network Functions as a Service. Michael is also the research lead in the BT Intel Co-Lab based at Adastral Park in the UK which is focused on research relating to the deployment of Virtualised Network Functions in carrier grade network environments. Michael has co-authored more than 35 peer reviewed publications including two books.


Dr. Victor Bayon-Molino is an applied researcher with Intel Labs Europe. His research focuses on developing advanced instrumentation, telemetry, monitoring and analytics systems to support novel compute, network and storage fabrics and their orchestration from enterprise and large-scale private data centers to public clouds.





Francesco Benedetto was born in Rome, Italy, on August 4th, 1977. He received the Dr. Eng. degree in Electronic Engineering from the University of ROMA TRE, Rome, Italy, in May 2002, and the PhD degree in Telecommunication Engineering from the University of ROMA TRE, Rome, Italy, in April 2007.

In 2007, he was a research fellow in the Department of Applied Electronics of the Third University of Rome. Since 2008, he has been an Assistant Professor of Telecommunications at the Third University of Rome (2008-2012, Applied Electronics Dept.; 2013-present, Economics Dept.), where he currently teaches the course "Elements of Telecommunications" (formerly Signals and Telecommunications) in the Computer Engineering degree and the course "Software Defined Radio" in the Laurea Magistralis in Information and Communication Technologies. Since the academic year 2013/2014, he has also been in charge of the course "Cognitive Communications" in the Ph.D. degree in Applied Electronics at the Department of Engineering, University of Roma Tre.

The research interests of Francesco Benedetto are in the field of software defined radio (SDR) and cognitive radio (CR) communications, signal processing for financial engineering, digital signal and image processing in telecommunications, code acquisition and synchronization for the 3G mobile communication systems and multimedia communication. In particular, he has published numerous research articles on SDR and CR communications, signal processing applied to financial engineering, multimedia communications and video coding, ground penetrating radar (GPR) signal processing, spread-spectrum code synchronization for 3G communication systems and satellite systems (GPS and GALILEO), correlation estimation and spectral analysis.

He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE) and a member of the following IEEE societies: IEEE Standards Association, IEEE Young Professionals, IEEE Software Defined Networks, IEEE Communications, IEEE Signal Processing and IEEE Vehicular Technology. He is also a member of CNIT (the Italian Inter-University Consortium for Telecommunications). He is the Chair of the IEEE 1900.1 WG on dynamic spectrum access, the Chair of the Int. Workshop on Signal Processing for Secure Communications (SP4SC), and the co-Chair of WP 3.5 on signal processing for ground penetrating radar of the European COST Action TU1208.


Telemetry and Performance Management in Softwarized Environments

Dan Conde

IEEE Softwarization, May 2017


Network telemetry and performance management are a challenge in softwarized environments. Unlike in traditional hardware-based systems, many of the assumptions that made conventional measurement and performance management possible start to break down and must be revisited. There are issues of capacity, topology, dynamic configurations (ephemeral nature) and intent.

Physical architectures reflected the capacity that was required of networks. As virtualized networks, such as overlays, become prevalent, with a dynamically varying number of endpoints based on objects such as virtual machines, it is difficult to determine how many connections are going to use a network's capacity. Although abstractions like VLANs have existed in the past, they were relatively static. Software-defined networks promise to create and release connections more frequently, which makes bandwidth use more difficult to predict. These changes have made it quite challenging to determine baseline usage and to identify abnormal conditions when they occur.
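One simple way to re-establish a usage baseline under this kind of churn is a rolling statistical baseline that flags samples far outside recent behavior; the window size and deviation threshold below are illustrative assumptions, and real systems would use more sophisticated models.

```python
from collections import deque
from statistics import mean, stdev

# Sketch: maintain a rolling baseline of link utilization and flag samples
# more than k standard deviations from the recent mean as abnormal.

class RollingBaseline:
    def __init__(self, window=20, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k  # how many standard deviations count as abnormal

    def observe(self, value):
        """Record a sample; return True if it is abnormal vs. the baseline."""
        abnormal = False
        if len(self.samples) >= 5:  # need a few samples before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            abnormal = sigma > 0 and abs(value - mu) > self.k * sigma
        self.samples.append(value)
        return abnormal

b = RollingBaseline()
# Steady utilization around 10, then a sudden spike to 95.
flags = [b.observe(v) for v in [10, 11, 9, 10, 12, 11, 10, 95]]
```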

With software-defined networks, the logical network topology is no longer strongly associated with the physical topology. This means that telemetry or monitoring based on physical connections needs to be augmented with measurements based on logical connections to obtain a more realistic view of how logical entities communicate with each other.  The over-arching goal is to understand how high-level constructs such as applications or services perform, as opposed to only having access to information about low-level components such as network ports.

Dynamic configurations or Ephemeral nature
With physical networks, endpoints do not get created or destroyed quickly. However, in software-based networks, the endpoints may be virtual machines, virtual switches, or containers, any of which may be created and deleted rapidly. Assumptions about the static nature of devices no longer hold, and monitoring metrics instead need to track these software constructs. Rather than treating these objects as analogues of their physical counterparts, we need to look at them as representing logical services. A virtual switch does behave like a physical switch, but it is more worthwhile to see how it fits into an application or service. Instead of looking at the traffic from container #123, for example, it is preferable to assess the traffic generated by the set of containers implementing the service “DBLookup”.
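The container-versus-service point can be made concrete with a small sketch; the "service" label attached to each sample is an assumed tagging convention for illustration, not a standard.

```python
# Sketch: rather than reporting traffic per container, group samples by the
# logical service the containers implement. Names and values are invented.

samples = [
    {"container": "c-123", "service": "DBLookup", "tx_bytes": 4_000},
    {"container": "c-124", "service": "DBLookup", "tx_bytes": 6_000},
    {"container": "c-200", "service": "Frontend", "tx_bytes": 1_500},
]

def traffic_by_service(samples):
    """Aggregate per-container traffic into per-service totals."""
    totals = {}
    for s in samples:
        totals[s["service"]] = totals.get(s["service"], 0) + s["tx_bytes"]
    return totals

totals = traffic_by_service(samples)
# The monitoring view is now "DBLookup sent 10 KB", not "c-123 sent 4 KB".
```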

Intent – implicit or explicitly defined?
The key element for understanding softwarized networks is distinguishing logical from physical connections, understanding how the elements utilize the networks, and determining how they are affected by monitoring. Understanding the intent of the application is the foundation of this way of thinking.

This is a complex issue, and we first need to understand why we are performing telemetry, analytics and performance monitoring in the first place.

The common reasons include security, application performance monitoring, troubleshooting and capacity planning. In most of these cases, it is necessary to know the original intent of the applications involved in order to understand how to monitor the network appropriately. A network link that connects two active and critical applications must be examined carefully; less attention may be paid to links between dormant systems.

A large gap has developed between the domains of applications and of infrastructure, such as networks. Network engineers often work at the packet level; application writers are concerned with their application's behavior, which relies on data transmitted over network connections; and system administrators or DevOps engineers deal with configuring them to work together. This separation of domains has led to each group not fully understanding the intent of the others and being susceptible to making assumptions about them.

Some technologies have attempted to bridge those gaps.  In commercial data center networks, the most famous is Cisco’s ACI, or Application Centric Infrastructure, in which the application architectures are codified as declarative intent through a policy model, and that drives the operation and monitoring of the network through compatible switches.

Other systems that capture similar information include the Topology and Orchestration Specification for Cloud Applications (TOSCA), which is often used to orchestrate network-based services and network function virtualization, and can also be used for telco VNFs.

Without going into these systems in detail, it suffices to say that there is widespread acknowledgement of the need for network models. However, there is no widespread agreement on a common model for describing orchestration or what is sometimes called declarative intent for networks. Until then, models specific to each deployment need to be used. However, having some model is better than no model, so it is worth examining how these models work in order to continue to advance the state of the art.
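As an illustration of a deployment-specific model, the sketch below invents a tiny declarative intent model (loosely TOSCA-flavored, but not TOSCA or ACI) and derives a monitoring priority for a link from the declared criticality of its endpoints, echoing the earlier point that links between active, critical applications deserve the most attention. The schema and values are assumptions.

```python
# Hypothetical declarative intent model: services declare their criticality
# and who they talk to; monitoring priority is derived from the model rather
# than deduced from the infrastructure. The schema is invented for illustration.

intent = {
    "services": {
        "orders":       {"criticality": "high", "talks_to": ["db"]},
        "db":           {"criticality": "high", "talks_to": []},
        "batch-report": {"criticality": "low",  "talks_to": ["db"]},
    }
}

def link_priority(model, a, b):
    """A link is only as critical as its least critical endpoint."""
    ranks = {"low": 0, "medium": 1, "high": 2}
    svcs = model["services"]
    return min((svcs[a]["criticality"], svcs[b]["criticality"]), key=ranks.get)

p1 = link_priority(intent, "orders", "db")        # both endpoints critical
p2 = link_priority(intent, "batch-report", "db")  # one dormant/low endpoint
```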

Methods for Monitoring
Many methods are available for monitoring purposes. Some are new, while others have traditionally been available in non-softwarized environments and are now proving to have higher relevance in softwarized environments.

Sensors for capturing telemetry
These may be sensors placed into switches, data gathered from the network via TAPs or network packet brokers, or any other entity that provides network visibility at the packet level.

Some sensors can examine application behavior or storage performance. Network sensors play a role in providing application metrics: network data can provide measurements of network performance, and application metrics can be derived by examining applications directly, and especially by examining how an application uses the network.

Application behavior, intent and performance may be derived from the network via solutions such as various APM (application performance management) tools, by obtaining statistics from network device configurations and counters, and even by code injection, which places instrumentation directly into byte-code streams in languages such as Java.

VM appliances
In addition to using traditional hardware-based sensors, it is possible to deploy virtual machines into cloud environments to gather metrics if network traffic is channeled into these virtual machine appliances before moving to its destination.  This channeling is necessary if specialized hardware cannot be deployed into a public cloud environment.

Sensors built into infrastructure
Cloud orchestration software and underlying foundational software may have built-in monitoring capabilities that can interface with other telemetry solutions. This is similar to traditional operating systems that make statistics available via interface calls, or via special files that are read to provide the required statistics.

Telemetry and monitoring in a softwarized environment is an extension of what has been performed in a physical environment.  This is not a sudden shift, as the adoption of virtual machines and other software-based environments has created a set of solutions that will also apply to systems with higher degrees of softwarization, such as virtualized networks.

However, new challenges arise due to the dynamic, elastic and shared nature of these environments; one of them is that we can no longer deduce the architectural intent by examining the design of the infrastructure. New standards that attempt to capture this intent via higher-level models are emerging, and we should expect to see more developments in this area. By understanding intent and utilizing a behavioral model for the operation of systems, we can monitor and understand the data gathered by various sensors, and interpret how it affects performance, security and other high-level concerns.



Dan Conde is an analyst covering distributed system technologies including cloud computing and enterprise networking. In this era of IT infrastructure transformation, Dan’s research focuses on the interactions of how and where workloads run, and how end-users and systems connect to each other. Cloud technologies are driving much of the changes in IT today. Dan’s coverage includes public cloud platforms, cloud and container orchestration systems, software-defined architectures and related management tools. Connectivity is important to link users and applications to new cloud based IT. Areas covered include data center, campus, wide-area and software-defined networking, network virtualization, storage networking, network security, internet/cloud networking and related monitoring & management tools. His experience in product management, marketing, professional services and software development provide a broad view into the needs of vendors and end-users.



Prof. Noël Crespi holds Masters degrees from the Universities of Orsay (Paris 11) and Kent (UK), a diplôme d’ingénieur from Telecom ParisTech, and a Ph.D and an Habilitation from Paris VI University (Paris-Sorbonne). From 1993 he worked at CLIP, Bouygues Telecom and then at Orange Labs in 1995. He took leading roles in the creation of new services with the successful conception and launch of the Orange prepaid service, and in standardisation (from rapporteurship of the IN standard to coordination of all mobile standards activities for Orange). In 1999, he joined Nortel Networks as telephony program manager, architecting core network products for the EMEA region. He joined Institut Mines-Telecom in 2002 and is currently professor and Program Director, leading the Service Architecture Lab. He coordinates the standardisation activities for Institut Mines-Telecom at ITU-T, ETSI and 3GPP. He is also an adjunct professor at KAIST, an affiliate professor at Concordia University, and is on the 4-person Scientific Advisory Board of FTW (Austria). He is the scientific director of the French-Korean laboratory ILLUMINE. His current research interests are in Service Architectures, Services Webification, Social Networks, and Internet of Things/Services.

