• Media type: Report; Student thesis; E-book
  • Title: ACCL+: an FPGA-Based Collective Engine for Distributed Applications
  • Contributors: He, Zhenhao [Author]; Korolija, Dario [Author]; Zhu, Yu [Author]; Ramhorst, Benjamin [Author]; Laan, Tristan [Author]; Petrica, Lucian [Author]; Blott, Michaela [Author]; Alonso, Gustavo [Author]; ORCID: 0000-0002-4396-6695
  • Published: Cornell University, 2023-12-18
  • Published in: arXiv
  • Language: English
  • DOI: https://doi.org/20.500.11850/652738; https://doi.org/10.3929/ethz-b-000652738; https://doi.org/10.48550/ARXIV.2312.11742
  • Keywords: Data processing, computer science ; Distributed, Parallel, and Cluster Computing (cs.DC) ; Machine Learning (cs.LG) ; Networking and Internet Architecture (cs.NI) ; Hardware Architecture (cs.AR) ; FOS: Computer and information sciences
  • Origin:
  • Notes: This data source also contains holdings records that do not lead to a full text.
  • Description: FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.
  • Access status: Open access