Improving Datacenter Performance with Network Offloading

Author : Yanfang Le
Release : 2020
OCLC : 1245425211

Book Synopsis: Improving Datacenter Performance with Network Offloading by Yanfang Le

Improving Datacenter Performance with Network Offloading was written by Yanfang Le and released in 2020. Book excerpt: Distributed systems have recently emerged in datacenters, such as MapReduce and Spark for data analytics and TensorFlow and PyTorch for machine learning. These frameworks are not only computation and memory intensive; they also place high demands on the network for distributing data. Rapidly growing Ethernet speeds relieve part of this demand. However, as Ethernet speed outgrows CPU processing power, we must rethink the existing algorithms at different network layers, and we also gain opportunities to innovate with new application designs, such as datacenter resource disaggregation [3] and in-network computation applications [4, 5, 6]. Fast network devices are also programmable, which enables offloading computation tasks from the CPU to NICs or switches. Network offloading to programmable hardware is a promising way to relieve processing pressure on the CPU for computation-intensive applications, e.g., Spark, or to reduce network traffic for network-intensive applications, e.g., TensorFlow. However, leveraging programmable hardware effectively is challenging because of its limited memory capacity and restricted programming model. To understand how to exploit network offloading when developing new network stacks, network protocols, and applications, the following question must be answered: how should work be divided judiciously between programmable hardware and software for network offload, given limited resources and restricted programming models?
Driven by real application demand while exploring the answer to this question, we first propose RoGUE, a new congestion control and recovery mechanism for RDMA over Converged Ethernet that does not rely on PFC while preserving the benefits of running RDMA, i.e., low CPU utilization and low latency. To preserve the low CPU utilization, RoGUE offloads packet pacing to the NIC.
Though RoGUE achieves better performance in extensive testbed evaluations, the optimal architecture for congestion control is a centralized packet scheduler [7], which has global visibility into packet reservation requests from all the servers. Because all hosts are connected through switches, and emerging programmable switch hardware can hold stateful objects, we designed a centralized packet scheduler at the switch, called PL2, that provides stable, near-zero queuing in the network by proactively reserving switch buffers for packet bursts in the appropriate time slots.
Congestion control is an essential component of the networking stack because application demand for the network is higher than link speed. The fundamental way to eliminate the need for network congestion control is to reduce network traffic so that application demand for the network does not exceed link speed. We observed that we can reduce network traffic for distributed training (DT) systems by offloading a critical function, gradient aggregation, to the programmable switch. Each worker in a distributed training system sends gradients over the network to special components, parameter servers, which perform the aggregation, a simple add operation. Thus, we propose ATP, a network service for in-network aggregation aimed at modern multi-rack, multi-job DT settings. ATP performs decentralized, dynamic, best-effort aggregation, enables efficient and equitable sharing of limited switch resources across simultaneously running DT jobs, and gracefully accommodates heavy contention for switch resources.
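
To make the idea of best-effort in-network aggregation more concrete, here is a minimal Python sketch. It is not ATP's actual design or API: the `Switch` and `ParameterServer` classes, the slot model, and the `flush` method are all hypothetical names chosen for illustration. The sketch only assumes what the excerpt states: aggregation is an element-wise add, switch memory is limited, and gradients that the switch cannot aggregate must still be aggregated at a parameter server.

```python
class Switch:
    """Toy switch with a fixed number of aggregation slots (hypothetical model)."""

    def __init__(self, num_slots, num_workers):
        self.num_slots = num_slots
        self.num_workers = num_workers
        self.slots = {}  # (job, chunk) -> [partial_sum, set_of_workers_seen]

    def on_gradient(self, job, chunk, worker, grad):
        """Process one gradient packet.

        Returns ("held", None) while a sum is incomplete, ("aggregated", total)
        once every worker has contributed, or ("forward", grad) when no slot is
        free and the packet must travel to the parameter server unmodified.
        """
        key = (job, chunk)
        if key not in self.slots:
            if len(self.slots) >= self.num_slots:
                return ("forward", grad)  # best effort: pass through, never block
            self.slots[key] = [[0.0] * len(grad), set()]
        total, seen = self.slots[key]
        if worker not in seen:  # ignore retransmitted duplicates
            seen.add(worker)
            for i, g in enumerate(grad):
                total[i] += g  # aggregation is element-wise addition
        if len(seen) == self.num_workers:
            del self.slots[key]  # free the slot for other jobs
            return ("aggregated", total)
        return ("held", None)

    def flush(self, job, chunk):
        """Evict a partial sum so the parameter server can finish it."""
        total, seen = self.slots.pop((job, chunk))
        return total, seen


class ParameterServer:
    """Fallback aggregator that completes whatever the switch could not."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.state = {}  # (job, chunk) -> [partial_sum, set_of_workers_seen]

    def on_partial(self, job, chunk, workers, values):
        """Merge a raw gradient or a flushed partial sum.

        Returns the complete sum once all workers are covered, else None.
        """
        key = (job, chunk)
        total, seen = self.state.setdefault(key, [[0.0] * len(values), set()])
        if seen & set(workers):
            return None  # overlap: this toy model simply drops the input
        seen.update(workers)
        for i, v in enumerate(values):
            total[i] += v
        if len(seen) == self.num_workers:
            del self.state[key]
            return total
        return None
```

In this sketch, two jobs competing for a single slot still both finish: the job that holds the slot is summed in the switch, while the other job's gradients are forwarded or flushed and summed at the parameter server. How a real system allocates slots, detects loss, and shares slots fairly is outside what this toy model covers.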


Related Books

Improving Datacenter Network Performance Via Intelligent Network Edge
Language: en
Pages: 246
Authors: Keqiang He
Type: BOOK - Published: 2017

Datacenter networks are critical building blocks for modern cloud computing infrastructures. In this dissertation, we show how we can leverage the flexibility a
Improving Datacenter Network Performance with Multicast
Language: en
Authors: Chi Fung Michael Chan
Type: BOOK - Published: 2014

The network constitutes a significant portion of a datacenter's cost and its performance is critical to scaling datacenter applications. Datacenter operators th
Improving Fault Tolerance and Performance of Data Center Networks
Language: en
Pages: 96
Authors: Vincent Liu
Categories: Computer networks
Type: BOOK - Published: 2016

Data center networks are a key component to the explosive growth of cloud computing, enabling the utilization of tens to hundreds of thousands of co-located se
Data Center Fundamentals
Language: en
Pages: 1114
Authors: Mauricio Arregoces
Categories: Computers
Type: BOOK - Published: 2003-12-04 - Publisher: Cisco Press

Master the basics of data centers to build server farms that enhance your Web site performance. Learn design guidelines that show how to deploy server farms in h