• Media type: E-article
  • Title: Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions
  • Contributors: Ferrari, Victor; Sousa, Rafael; Pereira, Marcio; De Carvalho, João P. L.; Amaral, José Nelson; Moreira, José; Araujo, Guido
  • Published: Association for Computing Machinery (ACM), 2023
  • Published in: ACM Transactions on Architecture and Code Optimization
  • Language: English
  • DOI: 10.1145/3625004
  • ISSN: 1544-3973; 1544-3566
  • Keywords: Hardware and Architecture; Information Systems; Software
  • Description: Convolution is one of the most computationally intensive operations that must be performed for machine learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This article proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA), a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization, a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-based Packing, an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.3×–4.0× on Intel x86 and 3.3×–5.9× on IBM POWER10. The speedup over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 11%–27% for Intel x86 and 11%–34% for IBM POWER10 architectures. The total convolution speedup for model inference is 13%–28% on Intel x86 and 23%–39% on IBM POWER10. SConv also outperforms BLAS GEMM when computing pointwise convolutions in more than 82% of the 219 tested instances.
  • Access status: Open access
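For context, the Im2Col + BLAS baseline that the abstract contrasts with direct convolution can be sketched in a few lines. This is a minimal single-channel, unit-stride, no-padding illustration of the general idea, not the paper's implementation; all function names and shapes here are assumptions for the sake of the example.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a 2D input into one column of a matrix
    (the Im2Col transform), so the convolution becomes a single GEMM."""
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1  # output size for unit stride, no padding
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv_im2col(x, k):
    """Im2Col + BLAS style: one matrix product computes all output dot products,
    at the cost of materializing the (much larger) patch matrix."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (k.ravel() @ im2col(x, kh, kw)).reshape(oh, ow)

def conv_direct(x, k):
    """Direct convolution: the same arithmetic without the intermediate
    patch matrix, which is the memory traffic SConv aims to avoid."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
k = rng.standard_normal((3, 3))
# Both paths compute the same result; they differ in memory layout and traffic.
assert np.allclose(conv_im2col(x, k), conv_direct(x, k))
```

The paper's contribution lies in making the direct path competitive: tiling it across the cache hierarchy (CSA/CSO) and replacing the generic packing step with vector-register shift instructions.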