Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers

Quan Chen, Hailong Yang, Jason Mars, Lingjia Tang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

  • 10 Citations

Abstract

Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges to reduce QoS violation. To address this open problem, we first identify the underlying causes for QoS violation in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an NVIDIA K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.

Language: English (US)
Title of host publication: ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems
Publisher: Association for Computing Machinery
Pages: 681-696
Number of pages: 16
Volume: 02-06-April-2016
ISBN (Electronic): 9781450340915
DOI: 10.1145/2872362.2872368
State: Published - Mar 25 2016
Event: 21st International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2016 - Atlanta, United States
Duration: Apr 2 2016 to Apr 6 2016

Other

Other: 21st International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2016
Country: United States
City: Atlanta
Period: 4/2/16 to 4/6/16

Fingerprint

Warehouses
Particle accelerators
Quality of service
Facings
Intelligent agents
Bandwidth
Image classification
Data transfer
Speech recognition
Program processors
Servers
Processing

Keywords

  • Non-preemptive accelerators
  • Quality of service
  • Scheduling
  • Warehouse scale computers

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture

Cite this

Chen, Q., Yang, H., Mars, J., & Tang, L. (2016). Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. In ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems (Vol. 02-06-April-2016, pp. 681-696). Association for Computing Machinery. DOI: 10.1145/2872362.2872368

Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. / Chen, Quan; Yang, Hailong; Mars, Jason; Tang, Lingjia.

ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems. Vol. 02-06-April-2016. Association for Computing Machinery, 2016. p. 681-696.

Chen, Q, Yang, H, Mars, J & Tang, L 2016, Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. in ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems. vol. 02-06-April-2016, Association for Computing Machinery, pp. 681-696, 21st International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2016, Atlanta, United States, 4/2/16. DOI: 10.1145/2872362.2872368
Chen Q, Yang H, Mars J, Tang L. Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. In ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems. Vol. 02-06-April-2016. Association for Computing Machinery. 2016. p. 681-696. DOI: 10.1145/2872362.2872368
Chen, Quan; Yang, Hailong; Mars, Jason; Tang, Lingjia. / Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers. ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems. Vol. 02-06-April-2016. Association for Computing Machinery, 2016. pp. 681-696.
@inproceedings{5794b368dab145758ac7945f5118ac3b,
title = "Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers",
abstract = "Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges to reduce QoS violation. To address this open problem, we first identify the underlying causes for QoS violation in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an NVIDIA K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3{\%} while achieving the desired 99{\%}-ile latency target for user-facing applications. In fact, Baymax reduces the 99{\%}-ile latency of user-facing applications by up to 195x over default execution.",
keywords = "Non-preemptive accelerators, Quality of service, Scheduling, Warehouse scale computers",
author = "Quan Chen and Hailong Yang and Jason Mars and Lingjia Tang",
year = "2016",
month = "3",
day = "25",
doi = "10.1145/2872362.2872368",
language = "English (US)",
volume = "02-06-April-2016",
pages = "681--696",
booktitle = "ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - Baymax

T2 - QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers

AU - Chen, Quan

AU - Yang, Hailong

AU - Mars, Jason

AU - Tang, Lingjia

PY - 2016/3/25

Y1 - 2016/3/25

N2 - Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges to reduce QoS violation. To address this open problem, we first identify the underlying causes for QoS violation in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an NVIDIA K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.

KW - Non-preemptive accelerators

KW - Quality of service

KW - Scheduling

KW - Warehouse scale computers

UR - http://www.scopus.com/inward/record.url?scp=84975267270&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84975267270&partnerID=8YFLogxK

U2 - 10.1145/2872362.2872368

DO - 10.1145/2872362.2872368

M3 - Conference contribution

VL - 02-06-April-2016

SP - 681

EP - 696

BT - ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems

PB - Association for Computing Machinery

ER -