diff --git a/_posts/2025-04-28-Final.md b/_posts/2025-04-28-Final.md
new file mode 100644
index 000000000..c80155be4
--- /dev/null
+++ b/_posts/2025-04-28-Final.md
@@ -0,0 +1,441 @@
---
layout: distill
title: Optimizing LLM Serving with PIM/NDP Accelerators and SmartNICs
description: Our blog post focuses on optimizing the serving of large-scale language models in distributed systems, with an emphasis on improving memory efficiency and reducing latency. We discuss strategies for optimizing memory layout, execution scheduling, and batching to enhance the throughput of AI model inference. Additionally, the post examines the role of SmartNICs in offloading certain tasks in data centers, reducing CPU load, and improving communication between compute nodes. Through this, we aim to highlight the importance of networking optimizations for efficient ML serving in real-world systems.
date: 2025-04-28
future: true
htmlwidgets: true
hidden: true

# Anonymize when submitting
# authors:
#   - name: Anonymous

authors:
  - name: Bae Junhyeong
    url: "https://github.com/20190511"
    affiliations:
      name: POSTECH, Pohang University of Science and Technology
  - name: Kang Sungwook
    url: "https://github.com/rkdtjddnr"
    affiliations:
      name: POSTECH, Pohang University of Science and Technology

# must be the exact same name as your blogpost
bibliography: 2025-04-28-Final.bib

# Add a table of contents to your post.
#   - make sure that TOC names match the actual section names
#     for hyperlinks within the post to work correctly.
#   - please use this format rather than manually creating a markdown table of contents.
toc:
  - name: Abstract
  - name: Background
  - name: GEMM/GEMV aware HW System
  - name: Optimizing communication overhead with SmartNIC
---

## Abstract
Our blog post focuses on optimizing the serving of large-scale language models in distributed systems, with an emphasis on improving memory efficiency and reducing latency.

#### System architecture aspect
Large Language Models (LLMs) are mostly built on the GPT architecture, which is based on the Decoder stack of the Transformer. As the context length of an LLM increases, inference performance becomes highly dependent on how well the Attention operation is optimized. Accordingly, LLMs such as GPT operate in two major stages, Prefill and Decoding, each with distinct computational characteristics. The input prompt is processed and summarized during the Prefill stage, and subsequently, each batched request maintains its own KV Cache throughout Decoding. The process of loading these caches introduces overhead, resulting in a memory bottleneck.
In particular, the KV Cache grows with the sequence length and becomes a key source of memory bandwidth bottlenecks. **As the size of the prompt increases, the memory load during the Attention operation introduces** significant overhead in LLM serving.
This blog, based on NeuPIMs and the paper PIM is All You Need, presents a new perspective: **computational characteristics such as GEMM and GEMV operations, as well as the layer structure, should be carefully segmented and treated with distinct batching strategies depending on the type of accelerator used.**

#### Network aspect
As deep learning models, including large language models (LLMs), continue to grow in scale, it has become increasingly difficult to train them on a single GPU. This has led to growing interest in **Distributed Deep Learning** (DDL), which enables models to be trained in parallel across multiple hardware devices. While DDL offers clear advantages in scalability, it also introduces a critical challenge: communication overhead between devices. In particular, **intra-node communication** (between GPUs within a single server) can be handled efficiently using high-speed interconnects and high-performance communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library). However, **inter-node communication** (between GPU servers) often relies on Ethernet-based connections, which are inherently limited by physical bandwidth and latency constraints.
To address these limitations, intelligent network interface cards (SmartNICs) have emerged as a promising solution. In this blog post, we explore ***how recent research suggests that SmartNICs can be leveraged to optimize communication overhead between system nodes in distributed deep learning environments***.

In conclusion, this blog explores recent research on optimizing large language model (LLM) serving in large-scale server systems from two key perspectives: **addressing memory bottlenecks through PIM/NDP-based architectural optimizations** and **reducing communication overhead using SmartNICs**. Through this dual approach, we aim to identify more efficient system architectures for LLM serving.
## Background
### GEMM & GEMV
- **GEMV** stands for "GEneral Matrix-Vector multiplication," referring to a general operation between a matrix and a vector. When $\alpha=1$ and $\beta=0$, it reduces to a standard matrix-vector multiplication:
  $$
  y=\alpha Ax+\beta y
  $$
  In general, GEMV has a time complexity of $O(n^2)$.

- **GEMM** stands for "GEneral Matrix-Matrix multiplication," referring to a general operation between two matrices. When $\alpha=1$ and $\beta=0$, it reduces to a standard matrix-matrix multiplication:
  $$
  C=\alpha AB+\beta C
  $$
  In general, GEMM has a time complexity of $O(n^3)$.

### NDP, PIM

Modern compute architectures are fundamentally based on the Von Neumann architecture, in which the Processing Unit (PU) and Memory are separated, and the PU receives data from memory to perform computations. Accordingly, until a few years ago, previous works focused on enhancing the performance of each component: optimizing GPUs and TPUs to increase FLOPS, and increasing memory bandwidth so that data can be transferred to accelerators quickly.

However, despite advancements in accelerators, the emergence of AI has made workloads increasingly data-intensive, to the point where memory bandwidth cannot keep up with the FLOPS of accelerators. In particular, for LLMs, the process of summarizing the initial prompt and then generating one token at a time requires each request to load its own cache into the accelerator, causing significant bottlenecks.

To address this gap, techniques have been developed to process data stored in memory not in GPUs or NPUs, but with small compute units placed close to the memory.
These techniques can be classified into PIM (Processing-In-Memory) and NDP (Near-Data Processing), depending on whether the computation is performed inside the memory chip or in a controller outside the memory chip.

PIM performs computations by directly manipulating the sense amplifiers inside the DRAM chip. Compared to NDP, it provides higher bandwidth and thus enables faster computations, but being inside the chip, it has lower computational flexibility and primarily performs MAC (Multiply-ACcumulate) operations.

NDP is typically located in the DRAM controller. Although it has lower bandwidth than PIM, it supports flexible data formats and performs general-purpose computations, offering a wide range of application possibilities.

| Item | PIM (Processing-In-Memory) | NDP (Near-Data Processing) |
|-|-|-|
|**Computation Speed**| Very fast (in-cell parallel bitwise operations) | Fast (general-purpose computation with reduced data movement) |
|**Flexibility**| Low (only specific operations, fixed circuits) | High (can use general-purpose ALU, SIMD) |
|**Main Computation Types**| Bitwise (AND, OR, NOT), MAC, simple comparisons | General-purpose operations such as sorting, filtering, DB joins, ML inference |
|**Location of Processing Unit**| Inside DRAM/Flash cells or near sense amplifiers | Next to memory modules, around DIMM or SSD controller |
|**Pros**| Eliminates data movement, ultra-fast computation, high bit-level parallelism | Versatility, supports complex operations, programmable structure |
|**Cons**| Limited operations, low circuit flexibility | Some data movement still exists between memory and processor |

### PIM/NDP & GPU/NPU Accelerator Features

In the case of GEMM, the arithmetic intensity is high with a complexity of approximately $O(N^3)$, whereas GEMV has a lower arithmetic intensity of around $O(N^2)$ and tends to operate in a memory-bound manner.
Therefore, GPUs and NPUs, which are optimized for high arithmetic intensity, perform most efficiently on GEMM operations. In contrast, for matrix-vector multiplications such as GEMV, the overhead of synchronization and memory movement is large relative to the computational gain, so these accelerators show lower utilization.

### Explanation of Detailed Operations in LLM Transformers, Multi-Head Attention, and GEMM/GEMV Computation
- transformer & GPT
  ![attention is all you need](../assets/img/2025-04-28-LLM_System_Pool/transformer.png)
  LLMs such as ChatGPT and Llama3 primarily follow the GPT architecture, which adopts only the decoding stack of the Transformer structure. A key characteristic of this architecture is that it generates one token per iteration during the token generation process.


- Difference Between the Prefill and Decoding Phases in the Decoding Stack
  ![attacc](../assets/img/2025-04-28-LLM_System_Pool/attacc.png)
  The transformer structure in GPT primarily consists of a Decoding Stack, which is composed of Multi-Head Attention (MHA) blocks and Feed Forward blocks. Each of these blocks is made up of the following key layers:
  - MHA Block
    - QKV Generation: $$ Q = XW^Q,\ K = XW^K,\ V = XW^V $$
      Query, Key, and Value matrices are generated from the input embeddings.
      Typically, QKV are calculated in a fused manner using a combined weight matrix $$W_{[Q,K,V]}$$, then split into individual components.

    - Logit: $$ QK^T $$
      Attention scores (similarities) are computed via the dot product of Query and Key.

    - Softmax: $$ \alpha = \text{Softmax}\left(\frac{\text{Logit}}{\sqrt{d_k}}\right) $$
      The Logits are normalized to produce attention weight distributions over tokens.
    - Attend: $$ \text{Attention} = \alpha V $$
      The weighted sum of the Value vectors is computed using the attention weights.

    - Concat: $$ \text{Concat}([head_1, ..., head_h]) $$
      Attention outputs from multiple heads are concatenated into a single vector.

    - Output Projection: $$ \text{Output} = \text{Concat}(...)W^O $$
      The concatenated output is linearly projected to match the input dimension for the next layer.

  - Feed Forward (FF) Block

    - Feed Forward 1: $$ Z = XW_1 + b_1 $$
      Applies the first linear transformation to each token.

    - GeLU Activation:
      $$ \text{GeLU}(x) = 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right]\right) $$

    - Feed Forward 2: $$ \text{Output} = ZW_2 + b_2 $$
      Applies the second linear transformation to produce the final output.

  This sequence of layers is used in two distinct phases: the **Prefill** phase, where the input prompt is processed and the first token is generated, and the **Decoding** phase, where one token is generated at a time until an end-of-sequence token is produced. These phases are distinguished by their computational characteristics: Prefill is compute-bound, while Decoding is memory-bound.

### Distributed Deep Learning
![Alt text](DDL.png)
As deep learning models continue to grow in size, it has become increasingly difficult to train large-scale models using only the memory and compute resources of a single GPU. One of the key approaches to overcoming this limitation is **Distributed Deep Learning** (DDL). DDL distributes model parameters or data across multiple GPUs or nodes, enabling parallel training. It is generally categorized into **Data Parallelism** and **Model Parallelism**.
![alt text](parallelism.png)
  - **Data Parallelism** is a training method designed for scenarios where a large volume of data needs to be processed. It distributes the dataset across multiple GPUs, allowing each GPU to train the same model on a different subset of the data.
However, since each GPU only updates the weights based on its local data, a **synchronization** step is required to aggregate and redistribute the updated weight parameters. This is where **Collective Communication** becomes essential, and as we will discuss later, it can lead to increased communication overhead, particularly in inter-node communication.
  - **Model Parallelism** is a training approach used when the model itself is too large to fit on a single GPU. In this method, the model is divided and distributed across multiple GPUs. There are two main techniques for implementing Model Parallelism: (1) **Tensor Parallelism** and (2) **Pipeline Parallelism**. Synchronization overhead also exists in Model Parallelism, and it tends to occur more frequently than in Data Parallelism. This is because the model is partitioned, requiring GPUs to synchronize their intermediate computation results during training.

  While Distributed Deep Learning enables faster training and the ability to handle larger models, it inherently requires a synchronization process, which introduces a new challenge: communication overhead between GPUs. As the size of the model and dataset continues to grow, requiring more GPU systems to work together, this communication overhead is expected to increase even further.

### Intra-node & Inter-node Communication
![Alt text](nodecomm.png)
In distributed training, cooperation among multiple GPUs is essential, and this involves two levels of communication. **Intra-node communication** refers to communication between GPUs within a single server, which can typically be handled efficiently using high-speed interconnects such as NVLink and libraries like NCCL. In contrast, **inter-node communication** refers to communication between GPUs across different servers, which usually takes place over networks such as Ethernet or InfiniBand. In this case, limitations in network bandwidth and latency can lead to performance degradation.
As mentioned earlier, parameter synchronization—performed repeatedly during training—is a major source of GPU-to-GPU communication overhead, and this overhead tends to be more severe in inter-node communication. + +### Collective Communication +In DDL, **Collective Communication** is essential for synchronizing model parameters and efficiently distributing or aggregating data. This refers to communication patterns that involve exchanging or combining data across multiple processes or GPUs. These patterns are typically categorized into several common types, as outlined below. +![Alt text](CC.png) +* **1:N communication** pattern + * Broadcast: Sends data from a single node to all other nodes. + * Scatter: Splits data on one node into multiple parts and distributes them to other nodes. + * Gather: Collects data from multiple nodes and aggregates it on a single node. + * Reduce: Combines data from multiple nodes using a specified operation (e.g., sum) and delivers the result to a single node. +* **N:N communication** pattern + * AllGather: Each node shares its data with all other nodes, resulting in every node holding the complete set of data. + * AllReduce: Data from all nodes is combined using a specified operation (e.g., sum), and the result is distributed back to all nodes. + +### SmartNIC +A **Network Interface Card** (NIC) is a hardware device used to connect a system to a network and enable communication. Traditional NICs are limited in functionality—they typically handle simple network-related operations or forward incoming packets to the host CPU. In contrast, a SmartNIC is an enhanced Ethernet device equipped with onboard processing cores, allowing it to support more general-purpose tasks. Recent research leveraging SmartNICs often focuses on offloading certain tasks from the host CPU, thereby reducing its workload and enabling more efficient communication. +![Alt text](smartnic.png) +SmartNICs are generally classified into two types: on-path and off-path. 
* **On-path SmartNICs** are designed with programmable NIC cores, allowing them to handle not only basic network operations but also various offloaded computations directly within the NIC itself. While this approach offers flexibility, it has potential drawbacks: heavy computational loads can delay network packet processing, and programming the NIC cores tends to be highly complex.
* **Off-path SmartNICs**, on the other hand, include separate compute cores that are distinct from the main NIC cores. This design allows offloaded tasks to be processed without interfering with network performance. Although there is some communication overhead when accessing memory and compute resources, off-path SmartNICs are generally more programmer-friendly. As a result, they are more commonly adopted in recent research.

In this blog, we will primarily focus on studies utilizing off-path SmartNICs and explore how they can be leveraged to address the communication challenges outlined earlier.


## GEMM/GEMV aware HW System

> Prefill and Decoding from the Perspective of Actual Computation
- Prefill
  During the prefill stage, LLM computations are primarily composed of matrix-matrix multiplications, i.e., GEMM operations.
  The input $X:[N_{\text{prompt}}, d_{\text{emb}}]$ is structured as a matrix, where $N_{\text{prompt}}$ denotes the number of prompt tokens, and $d_{\text{emb}}$ is the dimensionality of the embedding vector representing each token.
  - MHA Block
    - QKV Generation (GEMM):
      $XW^Q,\ XW^K,\ XW^V : [N_{\text{prompt}}, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}] = [N_{\text{prompt}}, d_{\text{emb}}]$
    - Attention:
      For convenience, the Logit, Softmax, and Attend steps are collectively referred to as the attention process.
      Each head splits the $Q, K, V$ matrices into slices of width $\frac{d_{\text{emb}}}{H}$ (where $H$ = number of heads) and processes its own attention in parallel.
      The operation $O = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ is computed in the form of **GEMM** as follows:
      $Q \times K^T \times V: [N_{prompt}, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prompt}] \times [N_{prompt}, \frac{d_{emb}}{H}]$
      ※ (Note: $\sqrt{d_k}$ is a scalar and is omitted here.)
    - Concat: $\text{Concat}([head_1:[N_{prompt}, \frac{d_{emb}}{H}], ..., head_h:[N_{prompt}, \frac{d_{emb}}{H}]]) :[N_{prompt}, d_{emb}]$
  - FF Block
    - Feed Forward 1: $Z = XW_1+ b_1 \text{ where } XW_1:[N_{prompt}, d_{emb}]\times[d_{emb},4\times d_{emb}]$ is computed using GEMM.
    - GeLU: Simple element-wise scaling and activation.
    - Feed Forward 2: $\text{Output} = ZW_2 + b_2 \text{ where } ZW_2:[N_{prompt}, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$ is also computed as GEMM.


- Decoding
  During the Decoding phase, each request reuses previously computed Key and Value matrices, which are stored in memory as KV Cache and concatenated back during computation. Since the Query corresponds to **only one newly generated token**, it exists in the form of a vector. As a result, the attention operation involves multiple GEMV computations along with heavy KV Cache memory loads, making this phase predominantly memory-bound.

  - MHA Block: Emphasis on **GEMV** operations increases
    - QKV Generation: $XW^Q,\ XW^K,\ XW^V: [1, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}]$ These are vector-matrix multiplications.
      For Key and Value, the previous KV Cache is loaded from memory and concatenated. The KV matrices have shape:
      $K,V: [N_{prev}+1, d_{emb}]$
      Although a single request may appear as a GEMV, multiple decoding requests share the same weights.
Thus, in practice, this is processed as a GEMM with shape:
      $[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$
    - Attention: $Q \times K^T \times V: [1, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prev}+1] \times [N_{prev}+1, \frac{d_{emb}}{H}]$
      This involves heavy KV Cache loading from memory, and the operation is typically memory-bound.
    - Concat: $\text{Concat}([head_1:[1, \frac{d_{emb}}{H}], ..., head_h:[1, \frac{d_{emb}}{H}]]) :[1, d_{emb}]$
  - FF Block
    - Feed Forward 1: $Z = XW_1+ b_1 \text{ where } XW_1:[1, d_{emb}]\times[d_{emb},4\times d_{emb}]$
      This is processed as a GEMV operation.
    - GeLU: A simple element-wise scalar operation.
    - Feed Forward 2:
      $\text{Output} = ZW_2 + b_2 \text{ where } ZW_2:[1, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$
      While each request individually resembles a GEMV, since all decoding requests share the same weights and the FF block is a linear operation, multiple requests can be batched and computed together. This results in a GEMM operation of the form:
      $[N_{batches}, d_{emb}] \times [d_{emb}, 4\times d_{emb}] \times [4\times d_{emb}, d_{emb}]$


In summary, Prefill computations are predominantly in the form of GEMM operations. Since this is the initial stage, there is no prior KV Cache per request, allowing the operations to be processed in a compute-bound manner. In contrast, during Decoding, each request maintains its own KV Cache, and as the sequence length grows, memory bottlenecks become more severe due to the increased memory load in the attention mechanism.
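The compute-bound vs. memory-bound distinction above can be illustrated with a back-of-the-envelope arithmetic-intensity estimate. The numbers below (a head dimension of 128, a 2048-token context, FP16 KV entries) are illustrative assumptions, not values from any specific model:

```python
# Rough FLOP and memory-traffic estimates for one attention head,
# comparing prefill (matrix of queries) with decoding (single-token query).

def attention_cost(n_query, n_kv, d_head, bytes_per_elem=2):
    """Return (FLOPs, KV bytes loaded) for Q x K^T followed by scores x V."""
    flops = 2 * n_query * n_kv * d_head   # Q x K^T (one multiply-add each)
    flops += 2 * n_query * n_kv * d_head  # attention weights x V
    kv_bytes = 2 * n_kv * d_head * bytes_per_elem  # load K and V once
    return flops, kv_bytes

# Prefill: the whole 2048-token prompt forms the query matrix (GEMM-like).
pf_flops, pf_bytes = attention_cost(n_query=2048, n_kv=2048, d_head=128)
# Decoding: one new token attends over the 2048 cached tokens (GEMV-like).
dc_flops, dc_bytes = attention_cost(n_query=1, n_kv=2048, d_head=128)

print(pf_flops / pf_bytes)  # 2048.0 FLOPs per byte -> compute-bound
print(dc_flops / dc_bytes)  # 1.0 FLOPs per byte    -> memory-bound
```

Under these assumptions, decoding performs on the order of one multiply-add per byte of KV Cache loaded, which is exactly the memory-bound regime that PIM/NDP architectures target.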
Based on the distinct computational characteristics of the prefill and decoding phases, this section introduces serving architectures that leverage PIM technologies to handle long-context LLMs efficiently in terms of memory and compute.

- NeuPIMs proposes a redesigned architecture in which the memory functionality and GEMV computation of PIM can be executed in parallel. This allows the main accelerator (e.g., GPU) to focus on compute-intensive GEMM operations in the prefill phase, while the memory-bound GEMV operations in the decoding phase are offloaded to PIM units and processed asynchronously.

- In PIM is All You Need, the authors observe that the Prefill phase contributes only a small portion of the end-to-end LLM workload. Leveraging this, they propose eliminating power-hungry GPUs/TPUs and introduce a highly power-efficient PIM-based LLM serving system built on PNM (a type of NDP) and multiple PIM units to handle the entire pipeline efficiently.

### NeuPIMs Architecture

![alt text](NeuPIMs.png)
NeuPIMs addresses a key limitation of traditional PIM architectures, where memory mode and PIM mode (for GEMV operations) could not be executed simultaneously. To overcome this, NeuPIMs introduces an architecture that integrates a lightweight NPU and advanced PIM within the same chip, enabling efficient processing of decoding attention operations.

In particular, traditional PIM units are located near memory and share the same buffer for both memory load operations and GEMV computations, making concurrent execution infeasible. To resolve this, NeuPIMs implements a dual-buffer system, allowing memory loading and GEMV execution to occur in parallel, thereby improving decoding efficiency and overall throughput.

![alt text](neupimsOverlap.png)
By employing a dual-buffer system, NeuPIMs enables the batching of N requests into two sub-batches of N/2 each.
While one sub-batch is processed by NeuPIMs (NPU-V and PIM) to handle the memory-bound attention computations, the other sub-batch simultaneously performs the compute-bound QKV generation and feed-forward network (FFN) computations.

This overlapping of memory-bound and compute-bound workloads allows NeuPIMs to effectively mitigate both memory and compute bottlenecks, leading to improved parallelism and higher overall throughput in LLM serving.

> NeuPIMs Results

![alt text](NeuPIMs_result.png)
When NeuPIMs offloads GEMV operations to PIM and delegates GEMM operations to the NPU, it achieves over a 1.5× improvement in LLM serving performance compared to a traditional NPU-only system. This performance gain is attributed to the efficient division of labor between memory-bound and compute-bound tasks, enabled by the hybrid PIM-NPU architecture.


### PIM is All You Need

The paper presents an architecture designed to address the increasing context length in LLMs by leveraging the high energy efficiency of PIM compared to GPUs and TPUs. In this architecture, PIM units are responsible for GEMV operations, while a custom-designed low-power PNM (Processing-Near-Memory) device, placed near the DRAM controller, handles GEMM computations.

The proposed PNM is not limited to GEMM; it also includes lightweight components such as reduce trees for softmax, exponent processors, and RISC-V cores to support essential functions like activation operations (e.g., GeLU, ReLU). This co-design enables efficient and low-power LLM serving by distributing tasks to specialized near-memory processing elements.

![alt text](PIM-PNM.png)

In NeuPIMs, all operations except for the GEMV in decoding are handled by the NPU. In contrast, PIM is All You Need takes a different approach: it offloads all operations except for attention to the PNM device, which is placed near the DRAM controller.
These operations are then executed by broadcasting and gathering data across multiple devices, enabling efficient distributed execution across a network of lightweight, near-memory processing units.


> PIM is All You Need Results

![alt text](Results.png)
In PIM is All You Need, experimental results show that as the number of decoding output tokens increases, the system achieves lower latency compared to traditional GPUs. This demonstrates the effectiveness of the architecture in handling long decoding phases.

However, the paper also reveals that during the prefill phase, the PNM exhibits noticeably lower performance than GPUs. This suggests a limitation of the proposed system: as the size of the prefill grows, performance improvements become harder to achieve. This trade-off is acknowledged in the paper, indicating that the architecture is particularly advantageous for decoding-heavy workloads but may underperform when prefill dominates the workload.

### Limitation
**Changes in Attention Mechanisms and the Decline of GEMV Usage**
In recent models such as **LLaMA3** and **DeepSeek LLM**, the traditional **Multi-Head Attention (MHA)** mechanism has evolved into variants like **GQA (Group-Query Attention)** and **MQA (Multi-Query Attention)** [GQA paper](https://arxiv.org/pdf/2305.13245). In **MQA**, all attention heads share a **single KV Cache**, whereas in **GQA**, groups of heads share **a subset** of the KV Cache.
![MHA\_GQA\_MQA](../assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png)

As a result, during **decoding**, the query for each KV group, originally a **vector with a single row** per head, is batched into a **matrix** with
$$
\frac{\text{Original heads}}{\text{Group size}}
$$
rows in GQA or MQA. This shifts the operation from a **GEMV** to a **GEMM**, changing its computational profile.
+ +This transformation poses a challenge for **PIM or NDP architectures**, which are typically optimized for **GEMV-style operations** in decoding. Thus, **GQA and MQA may reduce the processing efficiency** of such memory-centric accelerators compared to standard MHA. + +## Optimizing communication overhead with SmartNIC +As model sizes and datasets continue to grow, and as server systems scale accordingly, many research efforts have emerged to identify and address the resulting inter-node communication challenges. These efforts generally take two directions: one focusing on algorithmic or software-level solutions, and the other on hardware acceleration. Among these, hardware-based acceleration is increasingly viewed as a viable approach, with In-Network Aggregation (INA) being one of the most popular methods cited across recent studies. +Traditional INA techniques utilize network switches as aggregators to offload and accelerate collective communication operations such as AllReduce. However, despite their potential, network switches are not well-suited for high-performance computing (HPC) environments. +Therefore, the papers introduced below explore ***the use of SmartNICs—modern programmable network devices—as aggregators instead of traditional network switches***. +### SmartNIC for Ring-AllReduce +>Proposed technique + +We begin by introducing ***DirectReduce***, a technique that offloads Ring-AllReduce operations onto SmartNICs. The paper highlights several inefficiencies in the traditional Ring-AllReduce communication pattern, as outlined below. +![Alt text](RING_basic.png) +Based on the figure, we can observe that when the NIC sends A1 data to Node B, the reduction operation is performed on Node B, and the result is then returned to the NIC to continue inter-node communication with Node C. 
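To make the baseline concrete, here is a toy, purely illustrative simulation of classic Ring-AllReduce (reduce-scatter followed by all-gather) over scalar chunks. It models only the data flow shown in the figure, not the paper's implementation; note that the reduction in phase 1 happens on the receiving host, which is the step DirectReduce proposes to move onto the NIC:

```python
# Toy simulation of Ring-AllReduce over n nodes, each holding n scalar
# chunks. Illustrative only -- real implementations operate on large
# tensors and overlap communication with computation.

def ring_allreduce(values):
    """values[i][j] is chunk j on node i; returns the final state of every node."""
    n = len(values)
    chunks = [list(row) for row in values]
    # Phase 1 -- reduce-scatter: in step s, node i forwards chunk (i - s) to
    # node i + 1, where the sum is computed on the receiving host
    # (the step that DirectReduce offloads to the NIC).
    for s in range(n - 1):
        for i in range(n):
            j = (i - s) % n
            chunks[(i + 1) % n][j] += chunks[i][j]
    # Phase 2 -- all-gather: fully reduced chunks circulate around the ring.
    for s in range(n - 1):
        for i in range(n):
            j = (i + 1 - s) % n
            chunks[(i + 1) % n][j] = chunks[i][j]
    return chunks

nodes = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every node ends up holding the element-wise sum [12, 15, 18]
```

Each of the 2(n-1) steps moves one chunk per node, so the host is interrupted on every receive in phase 1; this is the interference the proposal below removes.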
The authors propose an alternative approach: instead of sending A1 data to Node B, only fetching B1 data from Node B to the NIC and performing the reduction directly on the NIC could be more efficient. + +This efficiency can be gained from two main perspectives: +1. Node B can continue its local computations without being interrupted by additional processing. +2. Unnecessary communication between Node B and the NIC is eliminated. + +Taking these observations into account, a revised communication path can be envisioned as follows. +![Alt text](RING_upgrade.png) +This approach eliminates unnecessary data movement and enables the reduction operation to be executed directly on the NIC, avoiding interference with Node B. Since the NIC uses DMA to fetch data, the host (Node B) remains uninvolved in the transfer and can continue running its ML workloads without interruption. + +To realize the proposed communication path, the authors introduce a new NIC architecture, which includes three key components added to the NIC. +![Alt text](RING_arch.png) +* GateKeeper: Manages prefetching and buffer optimization for data transfers in the Ring-AllReduce process. +* DataDirector: Redirects incoming data packets into the RNIC’s internal data path for direct handling within the NIC. +* ComputeEnhancer: Performs reduction operations directly on the RNIC using hardware accelerators such as FPGAs, enabling compute capabilities on the SmartNIC. + +>Evaluation results + +The experiments were conducted under two network topology configurations that are commonly used in real-world systems: +1. Ring topology: 8 nodes +2. 6D-torus topology: 729, 4,096, 46,656, and 262,144 nodes—corresponding to 3, 4, 6, and 8 nodes per dimension, respectively + +The architecture was implemented using **Xilinx Vivado**, and the overall simulation was carried out using **Astra-Sim2**. 
+However, since the results observed under the 6D-torus topology closely resemble those from the ring topology, we will focus on presenting the experimental results for the ring topology only. +* Small message size (1KiB, 256KiB, respectively) +![Alt text](RING_small_eval.png) + The graph compares latency as the number of processes per node increases, specifically in scenarios with small message sizes. Since the computation overhead is low and the messages themselves are small, the latency difference between the proposed approach and naive Ring-AllReduce remains minimal. +* Large message size (256MiB, 1GiB, respectively) +![Alt text](RING_large_eval.png) + In contrast, for larger message sizes, the benefit of performing the reduce operation on the NIC becomes more significant. Processes running AI workloads can produce results without being interrupted by AllReduce operations, while the NIC takes full responsibility for handling the reduction of those results. + + +In conclusion, the paper presents the following key insight based on the experimental results: while DirectReduce offers improved performance over traditional Ring AllReduce when handling medium-sized messages, the actual benefit depends on both the message size ($M$) and the number of processes ($N$). +Specifically, when the amount of data per process ($M/N$) is less than or equal to 1KB, there is little to no performance gain. However, when $M/N > 1\text{KB}$, the stream aggregation-based pipeline allows DirectReduce to reduce latency by 15% to 36% compared to Ring AllReduce. In general, the performance advantage of DirectReduce increases with larger $M$ and greater $N$, as this leads to the generation of more packets and enables more parallelism. +That said, when the message size $M$ is smaller than 1MB and $N$ becomes large, the total number of packets may decrease (sometimes down to just one), which in turn limits the effectiveness of the pipeline and constrains performance improvement. 
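The reported break-even condition can be captured in a small helper. The 1 KB threshold and the 15-36% gain range come directly from the results above, while the function itself is only our illustration, not a measured performance model:

```python
KIB = 1024

def directreduce_expectation(message_bytes, num_processes):
    """Qualitative expectation for DirectReduce vs. Ring AllReduce,
    based on the paper's reported M/N threshold (not a fitted model)."""
    per_process = message_bytes / num_processes
    if per_process <= KIB:
        return "little to no improvement"
    return "15-36% latency reduction"

print(directreduce_expectation(256 * 1024 * 1024, 8))  # large M/N: pipeline pays off
print(directreduce_expectation(8 * KIB, 8))            # M/N = 1 KiB: no gain
```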
The findings can be summarized in the following table.
| Condition | Performance Benefit of DirectReduce |
|--------------------------------------|--------------------------------------------------------------|
| $M/N \leq 1\text{KB}$ | Little to no performance improvement |
| $M/N > 1\text{KB}$ | 15%–36% latency reduction |
| $N \uparrow$, $M \uparrow$ | More packets and deeper pipeline → greater performance gain |
| $M < 1\text{MB}$, $N \uparrow$ | Fewer packets → limited pipeline effect, reduced benefit |

### Zero-Sparse AllReduce and SmartNIC offloading
>Proposed technique

The next paper, OmNICCL, not only offloads work to SmartNICs but also proposes a Zero-Sparse AllReduce algorithm, which aims to reduce the overall amount of data transferred during communication. However, since this blog focuses primarily on SmartNIC-based solutions, we will briefly introduce the Zero-Sparse algorithm and then shift our attention back to the SmartNIC-related aspects.
* **Zero-Sparse AllReduce**
![Alt text](SPARSE_algo.png)
The authors observed that when performing AllReduce on data such as gradients, transmitting a large number of zero values is highly inefficient. This issue is especially prominent in modern AI models, which contain vast numbers of parameters. To optimize these models, techniques like pruning are frequently applied, producing so-called sparse neural networks—models in which a significant portion of the gradients are zero.
Based on this observation, the authors argue that excluding zero values during AllReduce leads to more efficient communication. The proposed Zero-Sparse AllReduce algorithm operates as illustrated in the figure above. In simple terms, the gradient vector used for AllReduce is divided into blocks, and any block that contains only zeros (a zero block) is excluded from the communication process.
As a result, the amount of data transferred is reduced, leading to lower communication latency. + +* **Using SmartNIC as aggregator** +![Alt text](SPARSE_aggregator.png) +The next approach focuses on using SmartNICs as aggregators. The authors highlight the limitation of memory bandwidth on SmartNICs and propose a solution using **Direct Cache Access** (DCA), which allows the NIC to directly access the Last-Level Cache (LLC). However, since the NIC must aggregate data from multiple nodes to perform reduction operations, the limited capacity of the LLC becomes a bottleneck. To address this, the authors propose a specialized memory layout inside the SmartNIC, as shown in the figure above. +In simple terms, the design introduces Rx and Tx Spots to store non-zero blocks. These spots reside in RDMA-accessible regions, allowing worker nodes to write data directly into them. When a worker writes data to an Rx Spot, the aggregator (SmartNIC) reads the data and its corresponding index, performs a reduction with the existing data in the corresponding Tx Spot, and updates the Tx Spot with the result. +This approach eliminates the need to store all incoming data from every worker, allowing more efficient use of LLC. Additionally, the figure shows two Tx Spots, which are used for double buffering. This mechanism prevents overwriting Tx data that has not yet been scattered to all workers from the previous AllReduce phase when the next phase begins. + +>Evaluation results + +The experiments were conducted on two types of systems: one with GPU workers and another with CPU workers. For the SmartNIC, the authors used the **NVIDIA BlueField-2**. A custom microbenchmark, similar in spirit to the OSU MPI Microbenchmark suite, was used to evaluate AllReduce performance. Unlike typical AllReduce benchmarks, this version allows for configurable array sparsity, enabling more nuanced experimentation. 
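The two mechanisms above, zero-block filtering on the worker and per-spot reduction on the aggregator, can be sketched as follows. This is a schematic sketch with names of our own choosing (`nonzero_blocks`, `aggregate_into_tx`); the real system performs these steps over RDMA using Rx/Tx spots resident in the LLC.

```python
import numpy as np

def nonzero_blocks(grad, block_size):
    """Worker side: split the gradient into blocks and keep only the
    non-zero ones. Only the (index, block) pairs go on the wire."""
    blocks = grad.reshape(-1, block_size)
    keep = np.any(blocks != 0, axis=1)
    return np.flatnonzero(keep), blocks[keep]

def aggregate_into_tx(tx_spots, index, block):
    """Aggregator side: reduce an incoming Rx block into the Tx spot
    with the matching block index (one spot per block, not per worker),
    so data from all workers is folded in place."""
    tx_spots[index] += block
```

For example, a gradient `[0, 0, 1, 2, 0, 0, 3, 0]` with `block_size=2` transmits only blocks 1 and 3, halving the payload before any reduction happens.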
+Using this setup, the authors compare the performance of executing the reduction operation on a SmartNIC versus on a host CPU. Their results demonstrate that SmartNIC-based aggregation is more efficient. Furthermore, when compared against another sparse AllReduce method, the proposed Zero-Sparse AllReduce approach shows clear performance advantages.

* Comparison of the two sparse AllReduce methods
![Alt text](image.png)
The experimental results compare the proposed Zero-Sparse AllReduce method, OmNICCL, against an existing sparse AllReduce technique, OmniReduce. The authors claim that OmNICCL consistently outperforms OmniReduce across all evaluated scenarios.

* Comparison of the two aggregators
![Alt text](SPARSE_result.png)
The experiments were conducted by varying the number of workers and block sparsity levels, with a fixed message size of 128MB. Two versions of OmNICCL are presented: OmNICCL* refers to the setup where a SmartNIC is used as the aggregator, while OmNICCL refers to the case where a conventional CPU serves as the aggregator. In GPU-based systems, when the block sparsity ranges from 25% to 75%, OmNICCL* shows a modest reduction in latency compared to the CPU-based version, highlighting the benefit of SmartNIC offloading in sparse communication scenarios.

### Limitation of SmartNIC
Both **DirectReduce** and **OmNICCL** identify the fundamental limitations of SmartNICs—namely, their limited performance and memory capacity—and propose architectural solutions to address these constraints. However, as seen in the experimental results, especially in the case of OmNICCL, the use of SmartNICs does not lead to dramatic performance improvements. This is likely due to the hardware resource constraints of commercially available SmartNICs.
One might wonder why DirectReduce appears to achieve more substantial performance gains.
This can be attributed to the fact that DirectReduce was evaluated using a simulation-based setup with an FPGA-based, ASIC-style SmartNIC, while OmNICCL was tested on actual hardware using NVIDIA's BlueField-2. The discrepancy in hardware platforms may explain the difference in results. Additionally, DirectReduce focuses on larger message sizes, which can amplify performance gains—this context should also be considered when interpreting the results.
In conclusion, while a variety of research efforts are exploring how to best leverage SmartNICs, most studies still point to their limited capabilities as a bottleneck. As such, many works aim to extract the maximum possible performance from current SmartNIC architectures. Nevertheless, this issue is expected to diminish as SmartNIC technology continues to advance.

## Conclusion

***In this blog post, we explored a range of research efforts aimed at optimizing training and inference in modern AI workloads—particularly those involving large model parameters and datasets—from two key perspectives.***

We first revisit the optimizations made from the system-architecture perspective.

### Architectural efforts & optimization
The first line of work analyzes the computational differences between the **prefill** and **decoding** phases in large language model (LLM) serving and proposes a structural approach to address the resulting **memory bottleneck**.

Traditional GPU- and TPU-based accelerators are optimized for **compute-bound operations (GEMM)** due to their high FLOPS capabilities. However, they face significant **bandwidth bottlenecks** when performing **memory-bound operations (GEMV)**, particularly during the decoding phase of LLMs, where frequent KV cache loads dominate.

To overcome this issue, recent research has introduced **Processing-In-Memory (PIM)** and **Near-Data Processing (NDP)** technologies.

* **NeuPIMs** addresses the structural limitation of conventional PIMs, which could not handle memory access and computation simultaneously.
It introduces a **dual-buffer system** and an **NPU-V integration** strategy that enables **GEMM operations** (e.g., QKV projection and FFN) to be handled by the NPU, while **GEMV operations** (e.g., decoding attention) are offloaded to PIM units for **parallel and efficient execution**.

* In contrast, **PIM is All You Need** pursues **maximum power efficiency** by removing GPUs and NPUs altogether. It utilizes a **PIM + PNM** architecture to process both GEMV and other computations near the memory controller, significantly improving both **latency** and **energy efficiency**.

These architectures share a common strategy: offloading **decoding-phase GEMV operations to memory-proximal units**, thereby **mitigating latency bottlenecks** in LLM serving.
However, modern LLMs are shifting from **MHA to GQA/MQA** attention mechanisms, in which shared KV cache structures **transform attention from GEMV to GEMM** computations. This shift potentially **undermines the efficiency** of PIM-based architectures, which are optimized for GEMV.
As such, future research must consider **rearchitecting attention layers** and computation pipelines to be **PIM/NDP-friendly**, ensuring compatibility and sustained efficiency in the face of evolving LLM architectures.

In parallel with architectural optimizations, we also examined system-level approaches that aim to reduce communication bottlenecks—particularly those that arise in large-scale, multi-GPU distributed training environments.
As modern AI models grow in size and complexity, training them often requires more than a single GPU. It is now common to use multi-GPU systems where several GPUs are grouped within a single node—and increasingly, multiple such GPU nodes are used together. While this multi-GPU setup significantly accelerates computation, it also introduces **a substantial amount of communication overhead, especially for exchanging model parameters between systems**.
To address this, many recent studies have investigated the use of SmartNICs to optimize inter-system communication.
* **DirectReduce** proposes offloading the traditional Ring AllReduce communication operation to the SmartNIC, allowing the CPU and GPU to focus solely on AI computations.
* **OmNICCL** introduces the Zero-Sparse AllReduce algorithm to reduce data movement during communication and likewise offloads the collective operation to the SmartNIC.

However, as discussed earlier, current SmartNICs are still not powerful enough to handle heavy and complex computations efficiently. Their memory capacity is also significantly limited compared to CPUs, which poses challenges when offloading data-intensive workloads. Nevertheless, as SmartNIC architectures continue to evolve and improve in performance, their potential applications are expected to expand dramatically in the future.

Taken together, these two threads point to the same conclusion: efficient serving and training of modern LLMs is no longer a pure compute problem. Memory-proximal architectures such as PIM and NDP relieve the memory-bandwidth bottleneck of decoding, while SmartNIC offloading relieves the communication bottleneck of distributed training. As both classes of hardware mature, combining the two directions promises end-to-end efficient ML systems.

## Citation
- Attacc!
- NeuPIMs
- PIM is all you need
- [GQA](https://arxiv.org/pdf/2305.13245)

. \ No newline at end of file diff --git a/_posts/2025-04-28-LLM_System_Pool.md b/_posts/2025-04-28-LLM_System_Pool.md new file mode 100644 index 000000000..5940c71f4 --- /dev/null +++ b/_posts/2025-04-28-LLM_System_Pool.md @@ -0,0 +1,258 @@ +--- +layout: distill +title: Sample Blog Post +description: Our blog post will focus on \textbf{optimizing the serving of large-scale language models in distributed systems}, with an emphasis on improving memory efficiency and reducing latency. We will discuss strategies for optimizing memory layout, execution scheduling, and batching to enhance the throughput of AI model inference. Additionally, the post will examine the role of SmartNICs in offloading certain tasks in data centers, reducing CPU load, and improving communication between compute nodes.
Through this, we aim to highlight the importance of networking optimizations for efficient ML serving in real-world systems. +date: 2025-04-28 +future: true +htmlwidgets: true +hidden: true + +# Anonymize when submitting +# authors: +# - name: Anonymous + +authors: + - name: Bae Junhyeong + url: "https://github.com/20190511" + affiliations: + name: POSTECH, Pohang University of Science and Technology + - name: Gang Sungwook + url: "https://en.wikipedia.org/wiki/Gang_Sungwook" + affiliations: + name: POSTECH, Pohang University of Science and Technology + +# must be the exact same name as your blogpost +bibliography: 2025-04-28-Final.bib + +# Add a table of contents to your post. +# - make sure that TOC names match the actual section names +# for hyperlinks within the post to work correctly. +# - please use this format rather than manually creating a markdown table of contents. +toc: + - name: Equations + - name: Images and Figures + subsections: + - name: Interactive Figures + - name: Citations + - name: Footnotes + - name: Code Blocks + - name: Diagrams + - name: Tweets + - name: Layouts + - name: Other Typography? + +# Below is an example of injecting additional post-specific styles. +# This is used in the 'Layouts' section of this post. +# If you use this post as a template, delete this _styles block. +_styles: > + .fake-img { + background: #bbb; + border: 1px solid rgba(0, 0, 0, 0.1); + box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); + margin-bottom: 12px; + } + .fake-img p { + font-family: monospace; + color: white; + text-align: left; + margin: 12px 0; + text-align: center; + font-size: 16px; + } +--- + +## Abstract +- "what problem is this work trying to tackle?" +- "how new is this effort?" (introduction, overview) + +Large Language Models (LLMs) have mostly been developed in the form of GPT, which is based on the Decoder phase of the Transformer. As the context length of an LLM increases, inference performance becomes highly dependent on the optimization of the Attention operation. Accordingly, LLMs such as GPT operate in two major stages: Prefill and Decoding, each with distinct computational characteristics.
For example, during the Decoder phase, the initial tokens are summarized in the Prefill stage, and subsequently, batched requests each maintain their own KV Cache. The process of loading these caches introduces overhead, resulting in a memory bottleneck.

In particular, the KV Cache grows with the sequence length and becomes a key source of memory bandwidth bottlenecks. **As the size of the prompt increases, the memory load during the Attention operation** introduces significant overhead in LLM serving. This blog, based on NeuPIMs and the paper *PIM is All You Need*, presents a new perspective: **computation should be segmented by its GEMM/GEMV characteristics and by layer structure, with distinct batching strategies applied depending on the type of accelerator used.**

## Background
[Nvidia Documents](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html)
> GEMM & GEMV
- **GEMV** stands for "GEneral Matrix Vector multiplication," referring to a general operation between a matrix and a vector. When $\alpha=1$ and $\beta=0$, it reduces to a plain matrix-vector multiplication.
  $$
  y=\alpha Ax+\beta y
  $$
  In general, GEMV has a time complexity of $O(n^2)$.
- **GEMM** stands for "GEneral Matrix Matrix multiplication," referring to a general operation between two matrices. When $\alpha=1$ and $\beta=0$, it reduces to a plain matrix-matrix multiplication.
  $$
  y=\alpha AB+\beta C
  $$
  In general, GEMM has a time complexity of $O(n^3)$.

> NDP, PIM

Modern compute architectures fundamentally follow the Von Neumann architecture, in which the Processing Unit (PU) and Memory are separated and the PU receives data from memory to perform computations. Accordingly, until a few years ago, previous works improved each component separately: GPUs and TPUs were optimized to raise FLOPS, and memory was developed to raise bandwidth so that data could be transferred to the accelerator quickly.
However, despite these advances in accelerators, the rise of AI made workload data increasingly heavy, and memory bandwidth to the accelerator could no longer keep up with the accelerator's FLOPS. In particular, for LLMs, summarizing the initial prompt and then generating one token at a time requires each request to load its own cache into the accelerator, and this process creates significant bottlenecks.
To address the problem of memory failing to keep up with accelerator FLOPS, techniques have emerged that process data stored in memory not on the GPU or NPU but on small accelerators placed close to the memory. These techniques are classified into PIM and NDP, depending on whether the computation happens inside the memory chip or in a controller outside it.
- PIM performs computations by directly manipulating the sense amplifiers inside the DRAM chip. Compared to NDP, it enjoys higher bandwidth and thus faster computation, but being inside the chip it has low computational flexibility and mainly performs MAC (Multiplication-ACcumulation) operations.
- NDP is typically located in the DRAM controller. Although its bandwidth is lower than PIM's, it supports flexible data formats and general-purpose computation, offering a wide range of applications.

| Item | PIM (Processing-In-Memory)| NDP (Near-Data Processing)|
|-|-|-|
|**Computation Speed**| Very fast (in-cell parallel bitwise operations)|Fast (general-purpose computation with reduced data movement)|
|**Flexibility**| Low (only specific operations, fixed circuits)| High (general-purpose ALU, SIMD available)|
|**Main Operations**| Bitwise (AND, OR, NOT), MAC, simple comparisons|General-purpose operations such as sorting, filtering, DB joins, ML inference|
|**Location of Processing Unit**| Inside DRAM/Flash **cells or near the sense amplifiers**| **Next to the memory module**, around the DIMM or SSD controller|
|**Pros**| Eliminates data movement, ultra-fast computation, high bit-level parallelism| Versatility, complex operations possible, programmable structure|
|**Cons**| Limited operations, low circuit flexibility| Some data movement between memory and processor still exists|

> Characteristics of PIM/NDP versus GPU/NPU accelerators

GEMM has high arithmetic intensity, on the order of $O(N^3)$ operations, whereas GEMV, at $O(N^2)$, has considerably lower operation intensity and behaves in a memory-bound manner.
GPUs and NPUs, built for high compute intensity, therefore run GEMM efficiently. For GEMV, by contrast, the synchronization and memory-movement overhead outweighs the computational gain, so it runs inefficiently compared to GEMM.

In short, GPUs and NPUs compute more efficiently as arithmetic intensity grows: they are well suited to high-operation-intensity workloads like GEMM, but show lower utilization on matrix-vector multiplications like GEMV.

> Detailed operation of the LLM Transformer: Multi-Head Attention and the GEMM/GEMV steps
- Transformer & GPT
  ![attention is all you need](../assets/img/2025-04-28-LLM_System_Pool/transformer.png)
  LLMs such as ChatGPT and Llama3 mostly follow the GPT architecture, which takes only the decoding stack of the Transformer and generates one token per iteration.

- Difference between the Prefill and Decoding phases in the decoding stack
  ![attacc](../assets/img/2025-04-28-LLM_System_Pool/attacc.png)
  GPT's transformer decoding stack consists mainly of an MHA (Multi-Head Attention) block and a Feed Forward block, and each block is composed of the following key layers.
  - MHA Block
    - QKV Generation: $ Q = XW^Q,\ K = XW^K,\ V = XW^V $
      Generates the Query, Key, and Value matrices from the input embeddings.
      Q, K, and V are usually computed at once by multiplying with a fused $W_{[Q,K,V]}$ and then splitting the result into Q, K, and V.

    - Logit: $ QK^T $
      Computes attention scores (similarities) via the dot product of Query and Key.

    - Softmax: $ \alpha = \text{Softmax}(\frac{\text{Logit}}{\sqrt{d_k}}) $
      Normalizes the logits to produce a weight distribution over the tokens.

    - Attend: $ \text{Attention} = \alpha V $
      Multiplies the softmax weights with the Values to compute the attention output.
    - Concat: $ \text{Concat}([head_1, ..., head_h]) $
      Concatenates the attention outputs of the heads back to the original vector size.

    - QKV Projection: $ \text{Output} = \text{Concat}(...)W^O $
      Linearly transforms the concatenated output and passes it to the next layer.
  - FF (Feed Forward) Block
    - Feed Forward 1: $ Z = XW_1 + b_1 $
      Applies the first linear transformation to each token.

    - GeLU: $ \text{GeLU}(x) = 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right]\right) $

    - Feed Forward 2: $ \text{Output} = ZW_2 + b_2 $
      Applies the second linear transformation to produce the final output.

  These layers run in two phases: the **Prefill** phase, which processes the input prompt and produces the very first token, and the **Decoding** phase, which then generates one token at a time until an end-of-sequence token is produced. Prefill is compute-bound, while decoding is memory-bound.



## NeuPIMs & PIM is ALL you NEED
- "what contributions did this work make, and what impact should this work have?"
- "how new is this effort?"

> Prefill and Decoding from the perspective of actual computation
- Prefill
  The LLM computation in the Prefill phase consists mostly of matrix-matrix multiplications (GEMM).
  In prefill, $X:[\text{N}_{prompt}, d_{emb}]$ is a matrix with one row per prompt token, where $d_{emb}$ is the dimension of the vector representing each token.
  - MHA Block
    - QKV Generation (GEMM): $XW^Q, XW^K, XW^V :[\text{N}_{prompt}, d_{emb}] \times [d_{emb}, d_{emb}]=[\text{N}_{prompt}, d_{emb}]$
    - Attention: for convenience, we group Logit, Softmax, and Attend together as the attention step. Each head splits $Q,K,V$ into slices of width $\frac{d_{emb}}{H}$ and processes its own attention in parallel ($H=\text{number of heads}$).
      $ O=\text{Softmax}(\frac{QK^T}{\sqrt{d_k}})V $ is processed as **GEMM**, as follows.
      $Q \times K^T \times V: [N_{prompt}, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prompt}] \times [N_{prompt}, \frac{d_{emb}}{H}]$
      ※ $\sqrt{d_k}$ is a scalar and is omitted here.
    - Concat: $ \text{Concat}([head_1:[N_{prompt}, \frac{d_{emb}}{H}], ..., head_h:[N_{prompt}, \frac{d_{emb}}{H}]]) :[\text{N}_{prompt}, d_{emb}]$
  - FF Block
    - Feed Forward 1: in $ Z = XW_1+ b_1 $, $XW_1:[\text{N}_{prompt}, d_{emb}]\times[d_{emb},4\times d_{emb}]$ is processed as a GEMM.
    - GeLU: a simple elementwise scalar operation.
    - Feed Forward 2: in $ \text{Output} = ZW_2 + b_2 $, $ZW_2:[\text{N}_{prompt}, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$ is processed as a GEMM.


- Decoding
  In the Decoding phase, each request reuses the previously computed Key and Value matrices, so they are kept in memory as a KV cache and concatenated back at every step. Since the Query is the **single token** generated in the previous iteration, it is a vector, so the attention step consists of many GEMV operations plus heavy KV cache loads from memory; decoding is therefore usually treated as memory-bound.

  - MHA Block: the share of **GEMV** increases
    - QKV Generation (GEMM): $XW_Q,XW_K, XW_V: [1, d_{emb}] \times [d_{emb}, d_{emb}]$, i.e., vector-matrix multiplications.
      The previous K and V must be loaded from memory as the KV cache and concatenated, giving $K,V: [N_{prev}+1, d_{emb}]$. A single request looks like a GEMV, but because the many decoding requests share the same weights, this is actually processed as a GEMM of shape $[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$.
    - Attention: $Q \times K^T \times V: [1, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prev}+1] \times [N_{prev}+1, \frac{d_{emb}}{H}]$; loading the heavy KV cache makes this step memory-bound.
    - Concat: $ \text{Concat}([head_1:[1, \frac{d_{emb}}{H}], ..., head_h:[1, \frac{d_{emb}}{H}]]) :[1, d_{emb}]$
  - FF Block
    - Feed Forward 1: in $ Z = XW_1+ b_1 $, $XW_1:[1, d_{emb}]\times[d_{emb},4\times d_{emb}]$ is a GEMV for a single request.
    - GeLU: a simple elementwise scalar operation.
    - Feed Forward 2: in $ \text{Output} = ZW_2 + b_2 $, $ZW_2:[1, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$ is a GEMV for a single request; however, because the Feed Forward layers are linear and all requests share the same weights, a whole batch of requests is processed at once as a GEMM, i.e., $[N_{batches}, d_{emb}] \times [d_{emb}, 4\times d_{emb}] \times [4\times d_{emb}, d_{emb}]$.


In short, prefill consists almost entirely of GEMM operations and, being the initial stage where no request yet has a KV cache from earlier steps, it can run compute-bound. Decoding, by contrast, gives every request its own KV cache, and the longer that cache (and hence the attention context) grows, the more severe the memory bottleneck becomes.
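The shape analysis above can be checked numerically. The dimensions below are illustrative values of our own choosing ($H=8$, $d_{emb}=512$, $N_{prompt}=128$, $N_{prev}=256$); the point is only that per-head prefill attention is matrix-matrix, while decoding attention degenerates to a one-row (GEMV-like) product against a KV cache that grows each step.

```python
import numpy as np

H, d_emb, n_prompt, n_prev = 8, 512, 128, 256  # illustrative sizes
d_h = d_emb // H                               # per-head dimension

# Prefill: per-head attention is GEMM over full prompt-length matrices.
Q = np.zeros((n_prompt, d_h))
K = np.zeros((n_prompt, d_h))
V = np.zeros((n_prompt, d_h))
logits = Q @ K.T        # (n_prompt, n_prompt)
out = logits @ V        # (n_prompt, d_h)

# Decoding: the query is one token, so attention is GEMV-shaped,
# but the KV cache it scans gains one row every iteration.
q = np.zeros((1, d_h))
K_cache = np.zeros((n_prev + 1, d_h))
V_cache = np.zeros((n_prev + 1, d_h))
logits_d = q @ K_cache.T    # (1, n_prev + 1)
out_d = logits_d @ V_cache  # (1, d_h)
```

Note how in decoding the entire $(N_{prev}+1) \times \frac{d_{emb}}{H}$ cache must be streamed to produce a single output row, which is exactly the memory-bound pattern discussed above.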

Leveraging the distinct computational characteristics of the prefill and decoding phases together with PIM technology, the following works introduce serving architectures that handle long-context LLMs efficiently in terms of both memory and compute.
NeuPIMs restructures conventional PIM so that its memory function and its PIM-GEMV computation can run in parallel: the main accelerator handles GEMM-heavy work such as prefill, while the GEMV portion of decoding is processed asynchronously on the PIM.

PIM is All You Need exploits the fact that the prefill phase accounts for only a small share of the end-to-end time: it removes the power-hungry GPU/TPU altogether and builds a power-efficient LLM serving system out of PNM (an NDP technology) and many PIMs.

> Introduction to the NeuPIMs architecture

![alt text](NeuPIMs.png)
NeuPIMs addresses the fact that conventional PIM cannot run memory mode and PIM (GEMV) mode at the same time: it places a small NPU that handles GEMM effectively and an advanced PIM on the same chip, yielding an architecture that processes the decoding attention step efficiently.

In particular, because conventional PIM sits beside the memory and the buffer used for its GEMV computation is the same as the memory load buffer, memory loads and GEMV could not run concurrently; considering this limitation, NeuPIMs builds a dual-buffer system.
![alt text](dualbuffer.png)
![alt text](neupimsOverlap.png)
With the dual buffer, a batch of N requests can be split into two sub-batches of N/2: while NeuPIMs (the NPU-V and PIM) processes attention for one sub-batch, the compute-bound QKV generation and FFN run on the other sub-batch, and this overlapping effectively resolves both the memory and the compute bottleneck.

> NeuPIMs Results

![alt text](NeuPIMs_result.png)
When NeuPIMs splits GEMV and GEMM across the PIM and the NPU, it serves LLMs more than 1.5x more effectively than a conventional NPU-only system.

> PIM is All You Need

Motivated by ever-growing context lengths, this paper has PIM, which is far more power-efficient than a GPU or TPU, handle GEMV, while a custom low-power PNM (Processing-Near-Memory) unit placed at the DRAM controller handles GEMM. The PNM additionally contains the small components needed outside GEMV, such as the reduce tree and exponent processors for softmax and a RISC-V core for activation functions (GeLU, ReLU).
![alt text](PIM-PNM.png)
![alt text](PIM-PNM-detail.png)
Whereas NeuPIMs processes everything except decoding GEMV on the NPU, PIM is All You Need processes everything except attention on the PNM device, broadcasting to and gathering from multiple devices.
![alt text](GEMV-GEMM.png)

> PIM is All You Need: Results

![alt text](Results.png)
Looking at the actual results of PIM is All You Need, latency drops relative to a conventional GPU as the number of decoded output tokens grows. However, the PNM suffers a substantial performance drop relative to the GPU during the prefill phase, which suggests a limitation of the paper: the larger the prefill, the harder it is to see a performance gain.

### Limitation
- **Changes in Attention and the decline of GEMV**
  Recent LLMs such as Llama3 and DeepSeek are moving from MHA (Multi-Head Attention) to [GQA](https://arxiv.org/pdf/2305.13245) (Group-Query Attention) and MQA (Multi-Query Attention). In MQA all heads share a single KV cache, while in GQA groups of heads share part of the KV cache.
  ![MHA_GQA_MQA](../assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png)
  As a result, in GQA and MQA the per-head query during decoding changes from a vector with a single row into a matrix with $\frac{\text{Origin Heads}}{\text{Group Size}}$ rows, turning GEMV into GEMM. This means processing efficiency on NDP or PIM may be lower than with MHA.

## Conclusion
- What efforts have been made, and how should they be optimized?
This work analyzes the differences in computational characteristics between prefill and decoding in large language model (LLM) serving and proposes structural approaches to the resulting memory bottleneck.

Traditional GPU- and TPU-based accelerator systems are optimized for raw compute (FLOPS), making them efficient for compute-bound work (GEMM); for memory-bound work (GEMV), however, KV cache loads create a bandwidth bottleneck, which is especially severe in the decoding phase of LLMs.

As efforts to solve this problem, PIM (Processing-In-Memory) and NDP (Near-Data Processing) technologies have recently been introduced.

NeuPIMs proposes a dual-buffer system and an NPU-V integration structure to overcome the structural constraint that memory access and computation could not run simultaneously in PIM, letting the NPU handle GEMM (GPU-like work) and the PIM handle GEMV (memory-heavy work) in parallel for an efficient division of labor.

PIM is All You Need, in contrast, serves LLMs with a PIM + PNM structure alone, without GPUs or NPUs, from a power-efficiency standpoint, processing GEMV and the remaining operations near the memory controller to maximize power and latency efficiency.

These architectures share a common goal: by processing decoding-phase GEMV at memory-proximal locations, they relieve the latency bottleneck of LLM serving.

However, as recent LLMs shift their attention structure from MHA to GQA/MQA, attention computations are turning from GEMV into GEMM. This change can undermine the efficiency of existing PIM-based designs, calling for a redesign toward PIM/NDP-friendly computation structures.

## Related Work
> KV-Cache Management

Unlike in prefill, as decoding proceeds the KV cache of each request keeps growing, making it difficult for the accelerator to process the whole KV cache at once. To optimize this, PagedAttention replaced the contiguous KV-cache paradigm with a paged, non-contiguous layout, and FlashAttention optimized attention around the GPU hardware structure. Beyond single-GPU settings, research is also underway on building pooling systems that can dramatically scale the amount of KV cache that can be loaded.

## Citation
- Attacc!
- NeuPIMs
- PIM is all you need
- [GQA](https://arxiv.org/pdf/2305.13245) \ No newline at end of file diff --git a/_posts/2025-04-28-LLM_System_Pool_eng.md b/_posts/2025-04-28-LLM_System_Pool_eng.md new file mode 100644 index 000000000..d65d84b25 --- /dev/null +++ b/_posts/2025-04-28-LLM_System_Pool_eng.md @@ -0,0 +1,306 @@ +--- +layout: distill +title: Sample Blog Post +description: Our blog post will focus on \textbf{optimizing the serving of large-scale language models in distributed systems}, with an emphasis on improving memory efficiency and reducing latency. We will discuss strategies for optimizing memory layout, execution scheduling, and batching to enhance the throughput of AI model inference. Additionally, the post will examine the role of SmartNICs in offloading certain tasks in data centers, reducing CPU load, and improving communication between compute nodes. Through this, we aim to highlight the importance of networking optimizations for efficient ML serving in real-world systems.
+date: 2025-04-28 +future: true +htmlwidgets: true +hidden: true + +# Anonymize when submitting +# authors: +# - name: Anonymous + +authors: + - name: Bae Junhyeong + url: "https://github.com/20190511" + affiliations: + name: POSTECH, Pohang University of Science and Technology + - name: Gang Sungwook + url: "https://en.wikipedia.org/wiki/Gang_Sungwook" + affiliations: + name: POSTECH, Pohang University of Science and Technology + +# must be the exact same name as your blogpost +bibliography: 2025-04-28-Final.bib + +# Add a table of contents to your post. +# - make sure that TOC names match the actual section names +# for hyperlinks within the post to work correctly. +# - please use this format rather than manually creating a markdown table of contents. +toc: + - name: Equations + - name: Images and Figures + subsections: + - name: Interactive Figures + - name: Citations + - name: Footnotes + - name: Code Blocks + - name: Diagrams + - name: Tweets + - name: Layouts + - name: Other Typography? + +# Below is an example of injecting additional post-specific styles. +# This is used in the 'Layouts' section of this post. +# If you use this post as a template, delete this _styles block. +_styles: > + .fake-img { + background: #bbb; + border: 1px solid rgba(0, 0, 0, 0.1); + box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); + margin-bottom: 12px; + } + .fake-img p { + font-family: monospace; + color: white; + text-align: left; + margin: 12px 0; + text-align: center; + font-size: 16px; + } +--- + +## Abstract +- "what problem is this work trying to tackle?" +- "how new is this effort?" (소개, 개요) + +Large Language Models (LLMs) have mostly been developed in the form of GPT, which is based on the Decoder phase of the Transformer. As the context length of an LLM increases, inference performance becomes highly dependent on the optimization of the Attention operation. 
Accordingly, LLMs such as GPT operate in two major stages: Prefill and Decoding, each with distinct computational characteristics. For example, during the Decoder phase, the initial tokens are summarized in the Prefill stage, and subsequently, batched requests each maintain their own KV Cache. The process of loading these caches introduces overhead, resulting in a memory bottleneck. + +In particular, the KV Cache grows with the sequence length and becomes a key source of memory bandwidth bottlenecks. **As the size of the prompt increases, the memory load during the Attention operation introduces** significant overhead in LLM serving. This blog, based on NeuPIMs and the paper PIM is All You Need, presents a new perspective that **computational characteristics such as GEMM and GEMV operations, as well as the layer structure, should be carefully segmented and treated with distinct batching strategies depending on the type of accelerator used.** + +## Background +[Nvidia Documents](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html) +> GEMM & GEMV + - **GEMV** stands for "GEneral Matrix Vector multiplication," referring to a general operation between a matrix and a vector. When $\alpha=1$ and $\beta=0$, it denotes a standard matrix-vector multiplication: + $$ + y=\alpha Ax+\beta y + $$ + In general, GEMV has a time complexity of $O(n^2)$. + + - **GEMM** stands for "GEneral Matrix Matrix multiplication," referring to a general operation between two matrices. When $\alpha=1$ and $\beta=0$, it denotes a standard matrix-matrix multiplication: + $$ + y=\alpha AB+\beta C + $$ + In general, GEMM has a time complexity of $O(n^3)$. + +> NDP, PIM + +Modern compute architectures are fundamentally based on the Von Neumann architecture. In the Von Neumann architecture, the Processing Unit (PU) and Memory are separated, and the computer architecture is organized such that the PU receives data from memory to perform computations. 
Accordingly, until a few years ago, previous works focused on enhancing the performance of each component by optimizing GPUs and TPUs to increase FLOPS, and improving memory to enable fast memory transfer to accelerators by increasing memory bandwidth. + +However, despite advancements in accelerators, the emergence of AI has caused workloads to become increasingly data-intensive, leading to a situation where the memory bandwidth cannot keep up with the FLOPS of accelerators. In particular, for LLMs, the process of summarizing the initial prompt and generating one token at a time requires each request to load its own cache into the accelerator, resulting in significant bottlenecks during this process. + +To address the issue where memory cannot keep up with the FLOPS of accelerators, techniques have been developed to process data stored in memory not in GPUs or NPUs, but using small accelerators placed close to the memory. These techniques can be classified into PIM (Processing-In-Memory) and NDP (Near-Data Processing) depending on whether the computation is performed inside the memory chip or in a controller outside the memory chip. + +PIM performs computations by directly manipulating the sense amplifiers inside the DRAM chip. Compared to NDP, it provides higher bandwidth and thus enables faster computations, but being inside the chip, it has lower computational flexibility and primarily performs MAC (Multiplication-ACcumulation) operations. + +NDP is typically located in the DRAM controller. Although it has lower bandwidth than PIM, it supports flexible data formats and performs general-purpose computations, offering a wide range of application possibilities. 
+
+| Item | PIM (Processing-In-Memory)| NDP (Near-Data Processing)|
+|-|-|-|
+|**Computation Speed**| Very fast (in-cell parallel bitwise operations)|Fast (general-purpose computation with reduced data movement)|
+|**Flexibility**| Low (only specific operations, fixed circuits)| High (can use general-purpose ALU, SIMD)|
+|**Main Computation Types**| Bitwise (AND, OR, NOT), MAC, simple comparisons |General-purpose operations such as sorting, filtering, DB joins, ML inference|
+|**Location of Processing Unit**| Inside DRAM/Flash cells or near sense amplifiers| Next to memory modules, around DIMM or SSD controller|
+|**Pros**| Eliminates data movement, ultra-fast computation, high bit-level parallelism| Versatility, supports complex operations, programmable structure|
+|**Cons**| Limited operations, low circuit flexibility| Some data movement still exists between memory and processor|
+
+> PIM/NDP & GPU/NPU Accelerator Features
+
+GEMM has high arithmetic intensity, with a complexity of approximately $O(N^3)$, whereas GEMV, at around $O(N^2)$, has lower arithmetic intensity and tends to operate in a memory-bound manner.
+Therefore, GPUs and NPUs, which are optimized for high compute intensity, perform most efficiently on GEMM operations. In contrast, for GEMV the overhead of synchronization and memory movement is large relative to the computational gain, so GPUs and NPUs tend to show low utilization on matrix-vector multiplications such as GEMV.
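To make this contrast concrete, here is a back-of-envelope sketch (our own illustration, assuming fp16 operands and square matrices of size `n`; these assumptions are not from the papers discussed here) of arithmetic intensity in FLOPs per byte:

```python
# Rough arithmetic intensity (FLOPs per byte of memory traffic) for GEMM vs GEMV,
# assuming fp16 (2 bytes/element) and n x n operands. Illustrative only.
def gemm_intensity(n, bytes_per_elem=2):
    flops = 2 * n ** 3                       # n x n times n x n multiply-accumulate
    traffic = 3 * n * n * bytes_per_elem     # read A and B, write C
    return flops / traffic

def gemv_intensity(n, bytes_per_elem=2):
    flops = 2 * n ** 2                       # n x n matrix times length-n vector
    traffic = (n * n + 2 * n) * bytes_per_elem  # read A and x, write y
    return flops / traffic
```

GEMM's intensity grows linearly with `n`, while GEMV's stays pinned near 1 FLOP/byte no matter how large the matrix is, which is why GEMV ends up memory-bound on FLOPS-rich accelerators.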
+ +> Explanation of Detailed Operations in LLM Transformers, Multi-Head Attention, and GEMM/GEMV Computation +- transformer & GPT + ![attention is all you need](../assets/img/2025-04-28-LLM_System_Pool/transformer.png) + LLMs such as ChatGPT and Llama3 primarily follow the GPT architecture, which adopts only the decoding stack of the Transformer structure. A key characteristic of this architecture is that it generates one token per iteration during the token generation process. + + +- Difference Between the Prefill and Decoding Phases in the Decoding Stack + ![attacc](../assets/img/2025-04-28-LLM_System_Pool/attacc.png) + The transformer structure in GPT primarily consists of a Decoding Stack, which is composed of Multi-Head Attention (MHA) blocks and Feed Forward blocks. Each of these blocks is made up of the following key layers: + - MHA Block + - QKV Generation: $$ Q = XW^Q,\ K = XW^K,\ V = XW^V $$ + Query, Key, and Value matrices are generated from the input embeddings. + Typically, QKV are calculated in a fused manner using a combined weight matrix $$W_{[Q,K,V]}$$, then split into individual components. + + - Logit: $$ QK^T $$ + Attention scores (similarities) are computed via the dot product of Query and Key. + + - Softmax: $$ \alpha = \text{Softmax}(\frac{\text{Logit}}{\sqrt{d_k}}) $$ + The Logits are normalized to produce attention weight distributions over tokens. + + - Attend: $$ \text{Attention} = \alpha V $$ + The weighted sum of the Value vectors is computed using the attention weights. + + - Concat: $$ \text{Concat}([head_1, ..., head_h]) $$ + Attention outputs from multiple heads are concatenated into a single vector. + + - QKV Projection: $$ \text{Output} = \text{Concat}(...)W^O $$ + The concatenated output is linearly projected to match the input dimension for the next layer. + + - Feed Forward (FF) Block + + - Feed Forward 1: $$ Z = XW_1 + b_1 $$ + Applies the first linear transformation to each token. 
+
+ - GeLU Activation:
+ $$ \text{GeLU}(x) = 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right]\right) $$
+
+ - Feed Forward 2: $$ \text{Output} = ZW_2 + b_2 $$
+ Applies the second linear transformation to produce the final output.
+
+ This sequence of layers is used in two distinct phases: the **Prefill** phase, where the input prompt is processed and the first token is generated, and the **Decoding** phase, where one token is generated at a time until the end-of-sequence token is produced. These phases are distinguished by their computational characteristics: Prefill is compute-bound, while Decoding is memory-bound.
+
+
+
+## NeuPIMs & PIM is All You Need
+
+> Prefill and Decoding from the Perspective of Actual Computation
+- Prefill
+ During the prefill stage, LLM computations are primarily composed of matrix-matrix multiplications, i.e., GEMM operations.
+ The input $X:[N_{\text{prompt}}, d_{\text{emb}}]$ is structured as a matrix, where $N_{\text{prompt}}$ denotes the number of prompt tokens, and $d_{\text{emb}}$ is the dimensionality of the embedding vector representing each token.
+ - MHA Block
+ - QKV Generation (GEMM):
+ $$ XW^Q, XW^K, XW^V : [N_{\text{prompt}}, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}] = [N_{\text{prompt}}, d_{\text{emb}}]$$
+ - Attention :
+ For convenience, the Logit, Softmax, and Attend steps are collectively referred to as the attention process.
+ Each head splits the $$Q, K, V$$ matrices by $$\frac{d_{\text{emb}}}{H}, (H = \text{number of heads})$$ and processes its own attention in parallel.
+ The operation $$O = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ is computed in the form of **GEMM** as follows:
+ $$Q \times K^T \times V: [N_{prompt}, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prompt}] \times [N_{prompt}, \frac{d_{emb}}{H}]$$
+ ※ (Note: $\sqrt{d_k}$ is a scalar and is omitted in the shape analysis.)
+ - Concat: $$ \text{Concat}([head_1:[N_{prompt}, \frac{d_{emb}}{H}], ..., head_h:[N_{prompt}, \frac{d_{emb}}{H}]]) :[N_{prompt}, d_{emb}]$$
+ - FF Block
+ - Feed Forward 1: $$ Z = XW_1+ b_1 \text{ where, } XW_1:[N_{prompt}, d_{emb}]\times[d_{emb},4\times d_{emb}]$$ is computed using GEMM.
+ - GeLU: Simple element-wise activation.
+ - Feed Forward 2: $$ \text{Output} = ZW_2 + b_2 \text{ where, } ZW_2:[N_{prompt}, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$$ is also computed as GEMM.
+
+
+
+- Decoding
+During the Decoding phase, each request reuses previously computed Key and Value matrices, which are stored in memory as KV Cache and concatenated back during computation. Since the Query corresponds to **only one newly generated token**, it exists in the form of a vector. As a result, the attention operation involves multiple GEMV computations along with heavy KV Cache memory loads, making this phase predominantly memory-bound.
+
+ - MHA Block: Emphasis on **GEMV** operations increases
+ - QKV Generation (GEMM): $$XW_Q,\ XW_K,\ XW_V: [1, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}]$$ These are vector-matrix multiplications.
+ For Key and Value, the previous KV Cache is loaded from memory and concatenated. The KV matrices have shape:
+ $$K,V: [N_{prev}+1, d_{emb}]$$
+ Although a single request may appear as a GEMV, multiple decoding requests share the same weights.
Thus, in practice, this is processed as a GEMM with shape:
+ $$[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$$
+ - Attention : $$Q \times K^T \times V: [1, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prev}+1] \times [N_{prev}+1, \frac{d_{emb}}{H}]$$
+ This involves heavy KV Cache loading from memory, and the operation is typically memory-bound.
+ - Concat: $$ \text{Concat}([head_1:[1, \frac{d_{emb}}{H}], ..., head_h:[1, \frac{d_{emb}}{H}]]) :[1, d_{emb}]$$
+ - FF Block
+ - Feed Forward 1: $$Z = XW_1+ b_1\text{ where, }XW_1:[1, d_{emb}]\times[d_{emb},4\times d_{emb}]$$
+ This is processed as a GEMV operation.
+ - GeLU: A simple element-wise scalar operation.
+ - Feed Forward 2:
+ $$ \text{Output} = ZW_2 + b_2 \text{ where, } ZW_2:[1, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$$
+ While each request individually resembles a GEMV, since all decoding requests share the same weights and the FF block is a linear operation, multiple requests can be batched and computed together. This results in a GEMM operation of the form:
+ $$[N_{batches}, d_{emb}] \times [d_{emb}, 4\times d_{emb}] \times [4\times d_{emb}, d_{emb}]$$
+
+
+In summary, Prefill computations are predominantly GEMM operations. Since this is the initial stage, there is no prior KV Cache per request, so the operations can be processed in a compute-bound manner. In contrast, during Decoding, each request maintains its own KV Cache, and as the sequence length grows, memory bottlenecks become more severe due to the increased memory load in the attention mechanism.
+
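The decoding-side shapes can be sketched numerically. The dimensions below (embedding size, head and layer counts, fp16 storage) are hypothetical GPT-like values chosen for illustration, not taken from either paper:

```python
# Hypothetical model dimensions for illustration only.
d_emb, H, LAYERS = 4096, 32, 32
d_head = d_emb // H

def decode_attention_shapes(n_prev):
    """Per-head operand shapes for one decoding step (a GEMV-like pattern)."""
    q = (1, d_head)                # query of the single newly generated token
    k_t = (d_head, n_prev + 1)     # cached keys, transposed
    v = (n_prev + 1, d_head)       # cached values
    return q, k_t, v

def kv_cache_bytes(n_prev, bytes_per_elem=2):
    """Bytes of K and V (all heads, all layers) streamed per request per step."""
    return 2 * LAYERS * (n_prev + 1) * d_emb * bytes_per_elem
```

With these assumed dimensions, a 1024-token context already means streaming roughly 512 MiB of KV Cache per request for every generated token, which is why decoding attention is memory-bound.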
+
+Based on the distinct computational characteristics of the prefill and decoding phases, this section introduces serving architectures that leverage PIM technologies to handle long-context LLMs efficiently in terms of memory and compute.
+
+- NeuPIMs proposes a redesigned architecture in which PIM's memory functionality and GEMV computation can be executed in parallel. This allows the main accelerator (e.g., GPU) to focus on compute-intensive GEMM operations in the prefill phase, while the memory-bound GEMV operations in the decoding phase are offloaded to PIM units and processed asynchronously.
+
+- In PIM is All You Need, the authors observe that the Prefill phase contributes only a small portion of the end-to-end LLM workload. Leveraging this, they propose eliminating power-hungry GPUs/TPUs and introduce a highly power-efficient PIM-based LLM serving system built on PNM (a type of NDP) and multiple PIM units to handle the entire pipeline efficiently.
+
+> NeuPIMs Architecture Introduction
+
+![alt text](NeuPIMs.png)
+NeuPIMs addresses a key limitation of traditional PIM architectures, where memory mode and PIM mode (for GEMV operations) could not be executed simultaneously. To overcome this, NeuPIMs introduces an architecture that integrates a lightweight NPU and advanced PIM within the same chip, enabling efficient processing of decoding attention operations.
+
+In particular, traditional PIM units are located near memory and share the same buffer for both memory load operations and GEMV computations, making concurrent execution infeasible. To resolve this, NeuPIMs implements a dual-buffer system, allowing memory loading and GEMV execution to occur in parallel, thereby improving decoding efficiency and overall throughput.
+![alt text](dualbuffer.png)
+![alt text](neupimsOverlap.png)
+By employing a dual-buffer system, NeuPIMs enables the batching of N requests into two sub-batches of N/2 each.
While one sub-batch is processed by NeuPIMs (NPU-V and PIM) to handle the memory-bound attention computations, the other sub-batch simultaneously performs the compute-bound QKV generation and feed-forward network (FFN) computations. + +This overlapping of memory-bound and compute-bound workloads allows NeuPIMs to effectively mitigate both memory and compute bottlenecks, leading to improved parallelism and higher overall throughput in LLM serving. + +> NeuPIMs Results + +![alt text](NeuPIMs_result.png) +When NeuPIMs offloads GEMV operations to PIM and delegates GEMM operations to the NPU, it achieves over 1.5× improvement in LLM serving performance compared to a traditional NPU-only system. This performance gain is attributed to the efficient division of labor between memory-bound and compute-bound tasks, enabled by the hybrid PIM-NPU architecture. + + +> PIM is All you need + +The paper presents an architecture designed to address the increasing context length in LLMs by leveraging the high energy efficiency of PIM compared to GPUs and TPUs. In this architecture, PIM units are responsible for GEMV operations, while a custom-designed low-power PNM (Processing-Near-Memory) device, placed near the DRAM controller, handles GEMM computations. + +The proposed PNM is not limited to GEMM; it also includes lightweight components such as reduce trees for softmax, exponent processors, and RISC-V cores to support essential functions like activation operations (e.g., GeLU, ReLU). This co-design enables efficient and low-power LLM serving by distributing tasks to specialized near-memory processing elements. + +![alt text](PIM-PNM.png) +![alt text](PIM-PNM-detail.png) +In NeuPIMs, all operations except for the GEMV in decoding are handled by the NPU. In contrast, PIM is All You Need takes a different approach: it offloads all operations except for attention to the PNM device, which is placed near the DRAM controller. 
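The sub-batch interleaving described above can be illustrated with a toy scheduler (our own sketch, not the authors' implementation): at any step, one half-batch occupies the PIM side for attention while the other occupies the NPU for GEMMs, and the roles swap on the next step.

```python
def neupims_schedule(requests):
    """Split requests into two sub-batches and alternate them between the
    memory-bound stage (PIM attention) and compute-bound stage (NPU GEMM)."""
    half = len(requests) // 2
    subs = [requests[:half], requests[half:]]
    timeline = []
    for step in range(2):
        timeline.append({
            "pim_attention": subs[step % 2],       # memory-bound work on PIM
            "npu_gemm": subs[(step + 1) % 2],      # QKV generation / FFN on NPU
        })
    return timeline
```

At every step both units are busy, which is the source of the throughput gain: neither the memory-bound nor the compute-bound stage leaves the other resource idle.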
These operations are then executed by broadcasting and gathering data across multiple devices, enabling efficient distributed execution across a network of lightweight, near-memory processing units.
+
+![alt text](GEMV-GEMM.png)
+
+> PIM is All You Need Results
+
+![alt text](Results.png)
+ In PIM is All You Need, experimental results show that as the number of decoding output tokens increases, the system achieves lower latency than traditional GPUs, demonstrating the architecture's effectiveness for long decoding phases.
+
+ However, the paper also shows that during the prefill phase, the PNM exhibits noticeably lower performance than GPUs. This reveals a limitation of the proposed system: as the size of the prefill grows, performance improvements become harder to achieve. The paper acknowledges this trade-off, indicating that the architecture is particularly advantageous for decoding-heavy workloads but may underperform when prefill dominates the workload.
+
+### Limitation
+- **Changes in Attention Mechanisms and the Decline of GEMV Usage**
+In recent models such as **LLaMA3** and **DeepSeek LLM**, the traditional **Multi-Head Attention (MHA)** mechanism has evolved into variants like **GQA (Group-Query Attention)** and **MQA (Multi-Query Attention)** [GQA paper](https://arxiv.org/pdf/2305.13245). In **MQA**, all attention heads share a **single KV Cache**, whereas in **GQA**, groups of heads share **a subset** of the KV Cache.
+![MHA\_GQA\_MQA](../assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png)
+
+As a result, during **decoding**, the query from a single head, originally a **vector with a single row**, is transformed into a **matrix** with
+
+$$
+\frac{\text{Origin Heads}}{\text{Group Size}}
+$$
+
+rows in GQA or MQA. This shifts the operation from a **GEMV** to a **GEMM**.
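The effect of head grouping on the operand shape can be checked with a small sketch (hypothetical head counts; `num_kv_heads` here denotes the number of KV groups):

```python
def query_rows_per_kv_head(num_heads, num_kv_heads):
    """Rows of the per-KV-head query operand during decoding.
    MHA  (num_kv_heads == num_heads) -> 1 row: a GEMV.
    GQA  (1 < num_kv_heads < num_heads) -> a few rows: a small GEMM.
    MQA  (num_kv_heads == 1) -> num_heads rows: a tall-skinny GEMM."""
    assert num_heads % num_kv_heads == 0
    return num_heads // num_kv_heads
```

Once the per-KV-head query has more than one row, the attention kernel against that KV Cache is no longer a pure matrix-vector product, which is exactly what erodes the GEMV-centric advantage of PIM/NDP units.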
+ +This transformation poses a challenge for **PIM or NDP architectures**, which are typically optimized for **GEMV-style operations** in decoding. Thus, **GQA and MQA may reduce the processing efficiency** of such memory-centric accelerators compared to standard MHA. + + +## Conclusion +* **What efforts have been made, and how should optimization be approached?** +This study analyzes the computational differences between **prefill** and **decoding** phases in large language model (LLM) serving and proposes a structural approach to address the resulting **memory bottleneck**. + +Traditional GPU and TPU-based accelerators are optimized for **compute-bound operations (GEMM)** due to their high FLOPS capabilities. However, they face significant **bandwidth bottlenecks** when performing **memory-bound operations (GEMV)**, particularly during the decoding phase of LLMs where frequent KV cache loads dominate. + +To overcome this issue, recent research has introduced **Processing-In-Memory (PIM)** and **Near-Data Processing (NDP)** technologies. + +* **NeuPIMs** addresses the structural limitation of conventional PIMs, which could not handle memory access and computation simultaneously. It introduces a **dual-buffer system** and a **NPU-V integration** strategy that enables **GEMM operations** (e.g., QKV projection and FFN) to be handled by the NPU, while **GEMV operations** (e.g., decoding attention) are offloaded to PIM units for **parallel and efficient execution**. + +* In contrast, **PIM is All You Need** pursues **maximum power efficiency** by removing GPUs and NPUs altogether. It utilizes a **PIM + PNM** architecture to process both GEMV and other computations near the memory controller, significantly improving both **latency** and **energy efficiency**. + +These architectures share a common strategy: offloading **decoding-phase GEMV operations to memory-proximal units**, thereby **mitigating latency bottlenecks** in LLM serving. 
+
+However, modern LLMs are shifting from **MHA to GQA/MQA** attention mechanisms, in which shared KV cache structures **transform attention from GEMV to GEMM** computations. This shift potentially **undermines the efficiency** of PIM-based architectures, which are optimized for GEMV.
+
+As such, future research must consider **rearchitecting attention layers** and computation pipelines to be **PIM/NDP-friendly**, ensuring compatibility and sustained efficiency in the face of evolving LLM architectures.
+
+
+## Related Work
+> KV-Cache Managing
+
+Unlike the **prefill** phase, during **decoding**, the size of the **KV Cache increases for each request** as more tokens are generated. This makes it increasingly difficult for accelerators to load and process the entire KV Cache at once efficiently. To address this challenge, several optimization strategies have been proposed:
+
+* **PagedAttention** introduces a **non-contiguous paging mechanism** for managing KV Cache, moving away from the traditional contiguous memory layout. This allows for more scalable and efficient memory handling as the cache grows.
+
+* **FlashAttention** is another line of work that optimizes the **attention computation based on GPU hardware architecture**, significantly reducing memory overhead and improving throughput by recomputing or streaming attention scores rather than storing large intermediate buffers.
+
+Beyond single-GPU systems, more recent studies have explored **KV Cache Pooling Systems**, which aim to **scale KV Cache bandwidth** by distributing and managing cache loads across multiple devices or memory subsystems. These innovations collectively aim to make **decoding more scalable and memory-efficient**, especially in long-context or multi-request LLM serving scenarios.
+
+## Citation
+- Attacc!
+- NeuPIMs +- PIM is all you need +- [GQA](https://arxiv.org/pdf/2305.13245) \ No newline at end of file diff --git a/_posts/2025-04-28-NIC.md b/_posts/2025-04-28-NIC.md new file mode 100644 index 000000000..f255ccdaa --- /dev/null +++ b/_posts/2025-04-28-NIC.md @@ -0,0 +1,243 @@ +--- +layout: distill +title: Sample Blog Post +description: Our blog post will focus on \textbf{optimizing the serving of large-scale language models in distributed systems}, with an emphasis on improving memory efficiency and reducing latency. We will discuss strategies for optimizing memory layout, execution scheduling, and batching to enhance the throughput of AI model inference. Additionally, the post will examine the role of SmartNICs in offloading certain tasks in data centers, reducing CPU load, and improving communication between compute nodes. Through this, we aim to highlight the importance of networking optimizations for efficient ML serving in real-world systems. +date: 2025-04-28 +future: true +htmlwidgets: true +hidden: true + +# Anonymize when submitting +# authors: +# - name: Anonymous + +authors: + - name: Bae Junhyeong + url: "https://github.com/20190511" + affiliations: + name: POSTECH, Pohang University of Science and Technology + - name: Kang Sungwook + url: "https://github.com/rkdtjddnr" + affiliations: + name: POSTECH, Pohang University of Science and Technology + +# must be the exact same name as your blogpost +bibliography: 2025-04-28-NIC.bib + +# Add a table of contents to your post. +# - make sure that TOC names match the actual section names +# for hyperlinks within the post to work correctly. +# - please use this format rather than manually creating a markdown table of contents. +toc: + - name: Equations + - name: Images and Figures + subsections: + - name: Interactive Figures + - name: Citations + - name: Footnotes + - name: Code Blocks + - name: Diagrams + - name: Tweets + - name: Layouts + - name: Other Typography? 
+
+# Below is an example of injecting additional post-specific styles.
+# This is used in the 'Layouts' section of this post.
+# If you use this post as a template, delete this _styles block.
+_styles: >
+  .fake-img {
+    background: #bbb;
+    border: 1px solid rgba(0, 0, 0, 0.1);
+    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
+    margin-bottom: 12px;
+  }
+  .fake-img p {
+    font-family: monospace;
+    color: white;
+    text-align: left;
+    margin: 12px 0;
+    text-align: center;
+    font-size: 16px;
+  }
+---
+
+## Abstract
+
+Our blog post will focus on optimizing the serving of large-scale language models in distributed systems, with an emphasis on improving memory efficiency and reducing latency.
+
+#### System architecture aspect
+
+#### Network aspect
+As deep learning models, including large language models (LLMs), continue to grow in size, training them on a single GPU has reached its limits. Accordingly, **Distributed Deep Learning** (DDL), which leverages multiple GPUs, has drawn attention. DDL offers the advantage of training a model in parallel across multiple hardware devices, but in doing so it raises an important problem: inter-device communication overhead.
+In particular, **intra-node communication** (GPU to GPU within a server) can be handled efficiently through high-performance communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library), whereas **inter-node communication** (GPU system to GPU system) goes through Ethernet equipment and thus faces physical bandwidth and latency limits.
+To overcome these limits, intelligent network interface cards such as SmartNICs have been drawing attention, and this blog ***presents, based on recent research, a perspective on optimizing inter-system communication overhead using SmartNICs***.
+
+
+## Background
+
+### Distributed Deep Learning
+![Alt text](DDL.png)
+As deep learning models have grown, the memory and compute resources of a single GPU can no longer handle large-scale training. **Distributed Deep Learning** (DDL) is one way to address this problem: model parameters or data are distributed across multiple GPUs or nodes and trained in parallel. It is broadly divided into Data Parallelism and Model Parallelism, and hybrid forms are also widely used.
+![alt text](parallelism.png)
+ - **Data Parallelism** is a training method devised to distribute data across multiple GPUs when there is too much data to train on a single device.
However, since for the same model each GPU holds only the weights learned from its share of the data, a synchronization step is needed to aggregate the weight parameters trained by the GPUs and redistribute them. The Collective Communication described below is required at this point, and the communication overhead becomes even more severe for inter-node communication.
+ - **Model Parallelism** is a method of splitting a model across multiple GPUs when the model is too large to train on one device. It is implemented mainly in two ways: 1) Tensor Parallelism and 2) Pipeline Parallelism. Model Parallelism also incurs synchronization overhead, and more frequently than Data Parallelism, because the intermediate results computed by each GPU over the partitioned model must be synchronized along the way.
+
+ In this way, Distributed Deep Learning shortens overall training time and makes larger models tractable, but because synchronization must always be performed, it can create a new problem: communication overhead between GPUs. If models and data grow further and even more GPU systems are needed, this communication overhead will grow as well.
+
+### Intra-node & Inter-node Communication
+![Alt text](nodecomm.png)
+Distributed training requires cooperation among multiple GPUs, and two levels of communication arise in the process. **Intra-node communication** refers to communication between GPUs within a single server; it can be handled relatively quickly through high-speed interconnects (e.g., NVLink) and libraries (e.g., NCCL). In contrast, **inter-node communication** refers to communication between different servers, typically over networks such as Ethernet or InfiniBand; here, network bandwidth and latency constraints can degrade performance. In particular, as explained earlier, the parameter synchronization that occurs repeatedly during training is the main source of GPU-to-GPU communication overhead, and the overhead is even larger for inter-node communication.
+
+### Collective Communication
+To synchronize model parameters and to distribute or collect data efficiently in DDL, **Collective Communication** is essential. It refers to communication patterns that exchange or combine data across multiple processes or GPUs, and is commonly divided into the following types:
+![Alt text](CC.png)
+* **1:N communication** patterns
+  * Broadcast : one node sends its data to all nodes.
+  * Scatter : one node splits its data into pieces and distributes them across the nodes.
+  * Gather : data from multiple nodes is collected into a single node.
+  * Reduce : data from multiple nodes is combined through a specific operation (e.g., sum) into one result, delivered to a single node.
+* **N:N communication** patterns
+  * AllGather : all nodes share data with one another so that every node ends up holding the entire data.
+
+  * AllReduce : the result of an operation over all nodes' data is redistributed back to every node.
+
+### SmartNIC
+A Network Interface Card (NIC) is a hardware device used to connect a system to the network for communication. Conventional NICs cannot perform complex computation; they carry out simple network-related operations or forward received network packets to the host CPU. A SmartNIC, however, is an Ethernet device that adds cores to a conventional NIC so that it can be used for more general purposes. Recent research on SmartNICs therefore mainly offloads tasks to them, reducing the host CPU's burden while also making communication more efficient.
+![Alt text](smartnic.png)
+SmartNICs are broadly classified into two types: On-path and Off-path.
+ * An **On-path SmartNIC** is built so that the NIC cores themselves are programmable. Beyond network operations, various computations can be offloaded to and processed on the NIC cores. However, if many heavy computations are processed there, network handling itself can be delayed, and the complexity of programming the NIC cores is very high.
+ * An **Off-path SmartNIC**, unlike the On-path type, places compute cores separate from the NIC cores. Because computation runs on the separate compute cores, network performance is unaffected. There is some communication overhead in accessing the compute cores and their memory, but the programming model is much more convenient, so most research uses Off-path SmartNICs.
+
+Accordingly, this blog also focuses on papers that use Off-path SmartNICs, and examines how they can be leveraged to solve the problems described above.
+
+
+
+## SmartNIC Offloading Approaches
+
+### Optimizing inter-node communication with SmartNIC
+As models and datasets have grown and server systems have scaled up accordingly, many studies have pointed out and tried to solve the inter-node communication problem. There are approaches that tackle it algorithmically in software and approaches that rely on hardware acceleration. Many papers describe hardware acceleration as the viable direction, with INA (In-Network Aggregation) as the most popular approach: a network switch is used as an aggregator, offloading and accelerating Collective Communication such as AllReduce. However, because INA uses network switches, it suffers from limits such as hardware resource constraints.
+The papers introduced below therefore use a SmartNIC, a recent device, as the aggregator in place of a network switch.
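Before looking at the papers, it helps to have the baseline in mind. The following is a minimal single-process simulation of Ring-AllReduce (reduce-scatter followed by all-gather) that we wrote for illustration; real implementations such as NCCL pipeline these steps across the network:

```python
def ring_allreduce(chunks_per_node):
    """chunks_per_node[i][c] is node i's scalar value for chunk c.
    Returns the state after AllReduce: every node holds every chunk's sum."""
    n = len(chunks_per_node)
    data = [list(c) for c in chunks_per_node]
    # Reduce-scatter: after n-1 steps, node i holds the full sum of chunk (i+1) % n
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n               # chunk node i forwards this step
            data[(i + 1) % n][c] += data[i][c]
    # All-gather: circulate each fully reduced chunk around the ring
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n           # fully reduced chunk being passed on
            data[(i + 1) % n][c] = data[i][c]
    return data
```

Every value crosses the ring roughly twice, so with N nodes each node sends about 2(N-1)/N of the message; DirectReduce below attacks the host-side detour taken at each hop rather than this volume.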
+#### SmartNIC for Ring-AllReduce
+>Proposed technique
+
+We first introduce ***DirectReduce***, a technique that offloads Ring-AllReduce to the SmartNIC. The paper points out the following inefficiency in conventional Ring-AllReduce communication.
+![Alt text](RING_basic.png)
+In the figure, the NIC sends data A1 to Node B, Node B performs the reduce operation, and the result is sent back to the NIC for the inter-node transfer to Node C. The authors reasoned that it would be more efficient not to send A1 into Node B, but instead only to fetch B1 into the NIC and perform the reduce operation on the NIC itself. The efficiency gains come from two sides:
+1. Node B can continue its ongoing computation without being disturbed.
+2. The unnecessary communication between Node B and the NIC disappears.
+
+A communication path that reflects this idea can be sketched as follows.
+![Alt text](RING_upgrade.png)
+The unnecessary data movement is gone, and the reduce operation is performed on the NIC without disturbing Node B. Because the NIC fetches the data via DMA, the host Node B is not involved and can keep executing its ML workload.
+
+The architecture proposed to realize this communication path is shown below; three components are added to the NIC.
+![Alt text](RING_arch.png)
+* GateKeeper : manages prefetching and buffer optimization for Ring-AllReduce data transfers
+* DataDirecter : redirects incoming packet flows into the RNIC for direct processing
+* ComputeEnhancer : performs the reduce operation directly on the RNIC, e.g., via an FPGA (SmartNIC)
+
+>Evaluation results
+
+The experiments assume two network topologies widely used in real systems:
+1. Ring topology : 8 nodes
+2. 6D-torus topology : 729, 4096, 46656, and 262144 nodes, corresponding to 3, 4, 6, and 8 nodes per dimension
+
+The architecture was implemented with Xilinx Vivado, and the overall simulation was carried out with Astra-Sim2.
+
+Since the 6D-torus results closely mirror the ring-topology results, only the ring-topology results are presented here.
+* Small message sizes (1 KiB and 256 KiB, respectively)
+![Alt text](RING_small_eval.png)
+ This graph compares latency as the number of processes per node increases for small messages. Because the computation overhead is low and the messages to be sent are small, the latency difference from naive Ring-AllReduce is minor.
+* Large message sizes (256 MiB and 1 GiB, respectively)
+![Alt text](RING_large_eval.png)
+ For large messages, by contrast, the benefit of performing the reduce operation on the NIC grows.
The processes running the AI workload are not disturbed by AllReduce and produce their results quickly, while the NIC takes charge of the reduce operations on those results.
+
+
+In conclusion, the paper's experimental results provide the following insight. DirectReduce offers improved performance over Ring-AllReduce for medium-sized messages, but the benefit depends on the message size $M$ and the number of processes $N$. Specifically, when the per-process data volume $M/N$ is at most 1 KB, there is almost no performance difference; when $M/N > 1\text{KB}$, the stream-aggregation-based pipeline reduces latency by 15% to 36% compared to Ring-AllReduce. In general, DirectReduce's benefit grows with larger $N$ and larger $M$, which enable more packet generation and parallel processing. However, when $M$ is smaller than 1 MB and $N$ grows, the total packet count shrinks (down to a single packet in some cases), weakening the pipelining effect and limiting the performance gain. In table form:
+| Condition | DirectReduce performance effect |
+|----------------------------|--------------------------------------------------|
+| $M/N \leq 1\text{KB}$ | Almost no improvement |
+| $M/N > 1\text{KB}$ | 15% to 36% latency reduction |
+| $N \uparrow$, $M \uparrow$| More packets and pipelining → larger gains |
+| $M < 1\text{MB}$, $N \uparrow$| Fewer packets → weaker pipelining, limited benefit |
+
+#### Zero-Sparse AllReduce and SmartNIC offloading
+>Proposed technique
+
+The next paper not only offloads to the SmartNIC but also proposes a Zero-Sparse AllReduce algorithm that reduces the amount of data moved during communication. Since this blog focuses on how SmartNICs are leveraged, we introduce the Zero-Sparse algorithm only briefly.
+* **Zero-Sparse AllReduce**
+![Alt text](SPARSE_algo.png)
+The authors considered it highly inefficient to include zero-valued data in communication when the data to be AllReduced (e.g., gradients) contains many zeros. Modern AI models in particular have so many parameters that techniques such as pruning are applied to optimize them. Such models are also called sparse neural networks, and their defining trait is that much of the gradient is zero. On this basis, excluding zero data from AllReduce is clearly beneficial for communication, and the proposed Zero-Sparse AllReduce operates as shown above. Simply put, the gradient vector used for AllReduce is divided into blocks, and zero-blocks, i.e., blocks containing only zeros, are excluded from AllReduce communication. Data movement during communication therefore shrinks, and latency can be expected to drop.
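The block-filtering step can be sketched as follows (our own minimal illustration of the idea; the actual wire format and metadata in the paper may differ):

```python
def compress_zero_blocks(grad, block_size):
    """Split a gradient vector into fixed-size blocks and keep only the
    non-zero blocks, each tagged with its block index."""
    blocks = [grad[i:i + block_size] for i in range(0, len(grad), block_size)]
    return [(idx, b) for idx, b in enumerate(blocks) if any(v != 0 for v in b)]

def decompress(pairs, num_blocks, block_size):
    """Rebuild the dense vector from (index, block) pairs; skipped blocks are zero."""
    out = [0.0] * (num_blocks * block_size)
    for idx, b in pairs:
        out[idx * block_size: idx * block_size + len(b)] = b
    return out
```

Note that a block is dropped only when it is entirely zero; a block with a single non-zero element still travels in full, so the savings scale with the block-level sparsity of the gradient.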
+
+* **Using SmartNIC as aggregator**
+![Alt text](SPARSE_aggregator.png)
+Next is the method of using the SmartNIC as the aggregator. The authors point out the SmartNIC's limited memory bandwidth and address it with DCA (Direct Cache Access), letting the NIC access the LLC directly. However, data from many nodes must be aggregated for the reduce operation while the LLC's capacity is limited, so the SmartNIC-internal memory layout shown above is proposed. To understand the operation briefly: there are Rx/Tx Spots that hold and send non-zero blocks, and since these are RDMA regions, the Workers can write to them directly. When Workers write data into the Rx Spot, the aggregator (SmartNIC) checks the data and index held in the Rx Spot, performs the reduce operation with the data at the same position in the Tx Spot, and updates the result into the Tx Spot. Operating this way, the data of the many Workers no longer all needs to be kept, and the LLC can be used efficiently. Additionally, the figure shows two Tx Spots; this is for double buffering, because if the next phase begins before the previous phase's AllReduce result has been scattered to all Workers, the Tx Spot data would otherwise be overwritten.
+
+>Evaluation results
+
+Experiments were run on two systems, one composed of GPU workers and one of CPU workers, using an NVIDIA BlueField-2 as the SmartNIC. A microbenchmark similar to the OSU MPI Microbenchmarks was used, which supports AllReduce benchmarking; unlike other AllReduce benchmarks, it was built so that the sparsity of the array can be adjusted. With this setup, the results compare using the SmartNIC versus the host CPU as the aggregator performing the reduce operation. They also compare against another Sparse AllReduce to show that the proposed Zero-Sparse AllReduce is effective.
+
+* Comparison of two Sparse AllReduce schemes
+![Alt text](image.png)
+These results compare the proposed Zero-Sparse AllReduce (OmNICCL) with another Sparse AllReduce (OmniReduce). The authors argue that OmNICCL outperforms OmniReduce in every respect.
+
+* Comparison of two aggregators
+![Alt text](SPARSE_result.png)
+These results vary the number of Workers and the Block Sparsity at a message size of 128 MB. OmNICCL* is the case where the SmartNIC is used as the aggregator, while OmNICCL uses a regular CPU as the aggregator without the SmartNIC. On the GPU system, when Block Sparsity is between 25% and 75%, using the SmartNIC slightly reduces latency.
+
+### Limited performance of SmartNIC
+Both DirectReduce and OmNICCL point out that the SmartNIC's raw performance and memory capacity are limited, and provide solutions for this. As the experimental results show, however, OmNICCL in particular does not deliver a dramatic performance improvement when using the SmartNIC, and we believe this is still due to the SmartNIC's hardware resource constraints. One might wonder why DirectReduce improved so much more: DirectReduce built an FPGA-based, ASIC-style SmartNIC and evaluated it through simulation, whereas OmNICCL used an actual NVIDIA BlueField-2 product, which can explain the difference in results. DirectReduce's gains also appear for larger messages, which should be kept in mind when interpreting the results.
+In conclusion, while a variety of studies aim to exploit SmartNICs, the devices themselves are not yet very powerful; most papers point this out and try to extract the maximum performance from them. We believe this problem will resolve naturally as SmartNICs continue to advance.
+
+
+## Conclusion
+
+This blog post has surveyed studies that optimize training and inference from two angles, under the recent AI trend of large model parameters and data sizes.
+
+We also examined network-side optimization techniques for reducing the bottlenecks that arise in communication between server systems.
+Models have become so large and complex that training now requires not only a GPU system in which multiple GPUs are bound into a single node, but several such GPU systems. Using a multi-GPU system speeds up computation, but it also generates a great deal of communication, such as parameter exchange between systems. To optimize inter-system communication overhead, a growing number of studies attempt to leverage SmartNICs.
+* **DirectReduce** offloads the conventional Ring-AllReduce communication operations to the SmartNIC so that the CPU/GPU can devote itself entirely to AI computation.
+* **OmNICCL** proposes the Zero-Sparse AllReduce algorithm to reduce data movement during communication, and also offloads the communication operations to the SmartNIC.
+
+As noted above, however, SmartNICs are not yet performant enough to run heavy, complex computation, and their memory capacity naturally falls far short of a CPU's, which limits offloading data-sensitive workloads. Still, as technology advances and SmartNIC architectures are developed to deliver better performance, their range of uses will be enormous.
+
+In conclusion, ~~~~
+
+## Related work
+Beyond collective communication, much research is under way on optimizing various operations in server systems through SmartNICs.
+> RPC layer offloading
+
+**RpcNIC** is
+
+> Storage request offloading
+
+**SmartDS** is
+
+## Citation (these will go into the bib file, so only jot down the papers that come to mind.)
+[pool] -> (0 ~ 10)
+[nic] -> (11 ~ 20)
+
+
+
+Related Paper :
+[Collective Communication-related]
+- OmNICCL (NAIC'24) : Sparse AllReduce algorithm + SmartNIC offloading
+  - OmniReduce (SIGCOMM'21) : the paper the above builds on (?)
+- Leveraging SmartNIC for Ring AllReduce offloading (ISPA'24) : Offloading Ring AllReduce to SmartNIC
+- FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems (IEEE CAL'22)
+
+Reference :
+Characterizing Off-path SmartNIC for Accelerating Distributed Systems
+Nvidia. 2023. Nvidia BlueField-2 DPU. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf
+
+Background-related references
+https://frankdenneman.nl/2020/02/19/multi-gpu-and-distributed-deep-learning/
+https://en.wikipedia.org/wiki/Collective_operation
\ No newline at end of file
diff --git a/_posts/2025-04-28-NIC_ENG.md b/_posts/2025-04-28-NIC_ENG.md
new file mode 100644
index 000000000..f3339c587
--- /dev/null
+++ b/_posts/2025-04-28-NIC_ENG.md
@@ -0,0 +1,225 @@
+---
+layout: distill
+title: Sample Blog Post
+description: Our blog post will focus on optimizing the serving of large-scale language models in distributed systems, with an emphasis on improving memory efficiency and reducing latency. We will discuss strategies for optimizing memory layout, execution scheduling, and batching to enhance the throughput of AI model inference. Additionally, the post will examine the role of SmartNICs in offloading certain tasks in data centers, reducing CPU load, and improving communication between compute nodes. Through this, we aim to highlight the importance of networking optimizations for efficient ML serving in real-world systems.
+date: 2025-04-28 +future: true +htmlwidgets: true +hidden: true + +# Anonymize when submitting +# authors: +# - name: Anonymous + +authors: + - name: Bae Junhyeong + url: "https://github.com/20190511" + affiliations: + name: POSTECH, Pohang University of Science and Technology + - name: Kang Sungwook + url: "https://github.com/rkdtjddnr" + affiliations: + name: POSTECH, Pohang University of Science and Technology + +# must be the exact same name as your blogpost +bibliography: 2025-04-28-NIC.bib + +# Add a table of contents to your post. +# - make sure that TOC names match the actual section names +# for hyperlinks within the post to work correctly. +# - please use this format rather than manually creating a markdown table of contents. +toc: + - name: Equations + - name: Images and Figures + subsections: + - name: Interactive Figures + - name: Citations + - name: Footnotes + - name: Code Blocks + - name: Diagrams + - name: Tweets + - name: Layouts + - name: Other Typography? + +# Below is an example of injecting additional post-specific styles. +# This is used in the 'Layouts' section of this post. +# If you use this post as a template, delete this _styles block. +_styles: > + .fake-img { + background: #bbb; + border: 1px solid rgba(0, 0, 0, 0.1); + box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); + margin-bottom: 12px; + } + .fake-img p { + font-family: monospace; + color: white; + text-align: left; + margin: 12px 0; + text-align: center; + font-size: 16px; + } +--- + +## Abstract +- "what problem is this work trying to tackle?" +- "how new is this effort?" (소개, 개요) + +Our blog post will focus on optimizing the serving of large-scale language -> AI models in distributed systems, with an emphasis on improving memory efficiency and reducing latency. + +#### System architecture aspect + +#### Network aspect +As deep learning models—including large language models (LLMs)—continue to grow in scale, it has become increasingly difficult to train them on a single GPU. 
This has led to a growing interest in **Distributed Deep Learning** (DDL), which enables models to be trained in parallel across multiple hardware devices. While DDL offers clear advantages in scalability, it also introduces a critical challenge: communication overhead between devices. In particular, **intra-node communication** (GPU-to-GPU within a single server) can be handled efficiently using high-performance communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library). However, **inter-node communication** (between GPU systems) often relies on Ethernet-based connections, which are inherently limited by physical bandwidth and latency constraints.
+To address these limitations, intelligent network interface cards (SmartNICs) have emerged as a promising solution. In this blog post, we explore ***how recent research suggests that SmartNICs can be leveraged to optimize communication overhead between system nodes in distributed deep learning environments***.
+
+
+## Background
+
+### Distributed Deep Learning
+![Alt text](DDL.png)
+As deep learning models continue to grow in size, it has become increasingly difficult to train large-scale models using only the memory and compute resources of a single GPU. One of the key approaches to overcoming this limitation is **Distributed Deep Learning** (DDL). DDL distributes model parameters or data across multiple GPUs or nodes, enabling parallel training. It is generally categorized into **Data Parallelism** and **Model Parallelism**.
+![alt text](parallelism.png)
+ - **Data Parallelism** is a training method designed for scenarios where a large volume of data needs to be processed. It distributes the dataset across multiple GPUs, allowing each GPU to train the same model on a different subset of the data. However, since each GPU only updates the weights based on its local data, a **synchronization** step is required to aggregate and redistribute the updated weight parameters.
This is where **Collective Communication** becomes essential, and as we will discuss later, it can lead to increased communication overhead—particularly in inter-node communication. + - **Model Parallelism** is a training approach used when the model itself is too large to fit on a single GPU. In this method, the model is divided and distributed across multiple GPUs. There are two main techniques for implementing Model Parallelism: (1) **Tensor Parallelism** and (2) **Pipeline Parallelism**. Synchronization overhead also exists in Model Parallelism, and it tends to occur more frequently than in Data Parallelism. This is because the model is partitioned, requiring GPUs to synchronize their intermediate computation results during training. + + While Distributed Deep Learning enables faster training and the ability to handle larger models, it inherently requires a synchronization process, which introduces a new challenge: communication overhead between GPUs. As the size of the model and dataset continues to grow, requiring more GPU systems to work together, this communication overhead is expected to increase even further. + +### Intra-node & Inter-node Communication +![Alt text](nodecomm.png) +In distributed training, cooperation among multiple GPUs is essential, and this involves two levels of communication. **intra-node communication** refers to communication between GPUs within a single server, which can typically be handled efficiently using high-speed interconnects such as NVLink and libraries like NCCL. In contrast, **inter-node communication** refers to communication between GPUs across different servers, which usually takes place over networks such as Ethernet or InfiniBand. In this case, limitations in network bandwidth and latency can lead to performance degradation. 
As mentioned earlier, parameter synchronization—performed repeatedly during training—is a major source of GPU-to-GPU communication overhead, and this overhead tends to be more severe in inter-node communication.
+
+### Collective Communication
+In DDL, **Collective Communication** is essential for synchronizing model parameters and efficiently distributing or aggregating data. This refers to communication patterns that involve exchanging or combining data across multiple processes or GPUs. These patterns are typically categorized into several common types, as outlined below.
+![Alt text](CC.png)
+* **Rooted (1:N or N:1) communication** patterns, where a single root sends to or collects from all other nodes
+  * Broadcast: Sends data from a single node to all other nodes.
+  * Scatter: Splits data on one node into multiple parts and distributes them to other nodes.
+  * Gather: Collects data from multiple nodes and aggregates it on a single node.
+  * Reduce: Combines data from multiple nodes using a specified operation (e.g., sum) and delivers the result to a single node.
+* **N:N communication** patterns
+  * AllGather: Each node shares its data with all other nodes, resulting in every node holding the complete set of data.
+  * AllReduce: Data from all nodes is combined using a specified operation (e.g., sum), and the result is distributed back to all nodes.
+
+### SmartNIC
+A **Network Interface Card** (NIC) is a hardware device used to connect a system to a network and enable communication. Traditional NICs are limited in functionality—they typically handle simple network-related operations or forward incoming packets to the host CPU. In contrast, a SmartNIC is an enhanced Ethernet device equipped with onboard processing cores, allowing it to support more general-purpose tasks. Recent research leveraging SmartNICs often focuses on offloading certain tasks from the host CPU, thereby reducing its workload and enabling more efficient communication.
+![Alt text](smartnic.png)
+SmartNICs are generally classified into two types: on-path and off-path.
+* **On-path SmartNICs** are designed with programmable NIC cores, allowing them to handle not only basic network operations but also various offloaded computations directly within the NIC itself. While this approach offers flexibility, it has potential drawbacks—heavy computational loads can delay network packet processing, and programming the NIC cores tends to be highly complex. +* **Off-path SmartNICs**, on the other hand, include separate compute cores that are distinct from the main NIC cores. This design allows offloaded tasks to be processed without interfering with network performance. Although there is some communication overhead when accessing memory and compute resources, off-path SmartNICs are generally more programmer-friendly. As a result, they are more commonly adopted in recent research. + +In this blog, we will primarily focus on studies utilizing off-path SmartNICs and explore how they can be leveraged to address the communication challenges outlined earlier. + + + +## Various techniques for efficient serving system +- "what contributions did this work make, and what impact should this work have?" +- "how new is this effort?" + +### Optimizing inter-node communication with SmartNIC +As model sizes and datasets continue to grow, and as server systems scale accordingly, many research efforts have emerged to identify and address the resulting inter-node communication challenges. These efforts generally take two directions: one focusing on algorithmic or software-level solutions, and the other on hardware acceleration. Among these, hardware-based acceleration is increasingly viewed as a viable approach, with In-Network Aggregation (INA) being one of the most popular methods cited across recent studies. +Traditional INA techniques utilize network switches as aggregators to offload and accelerate collective communication operations such as AllReduce. 
However, despite their potential, network switches are not well-suited for high-performance computing (HPC) environments. +Therefore, the papers introduced below explore ***the use of SmartNICs—modern programmable network devices—as aggregators instead of traditional network switches***. +#### SmartNIC for Ring-AllReduce +>Proposed technique + +We begin by introducing ***DirectReduce***, a technique that offloads Ring-AllReduce operations onto SmartNICs. The paper highlights several inefficiencies in the traditional Ring-AllReduce communication pattern, as outlined below. +![Alt text](RING_basic.png) +Based on the figure, we can observe that when the NIC sends A1 data to Node B, the reduction operation is performed on Node B, and the result is then returned to the NIC to continue inter-node communication with Node C. The authors propose an alternative approach: instead of sending A1 data to Node B, only fetching B1 data from Node B to the NIC and performing the reduction directly on the NIC could be more efficient. + +This efficiency can be gained from two main perspectives: +1. Node B can continue its local computations without being interrupted by additional processing. +2. Unnecessary communication between Node B and the NIC is eliminated. + +Taking these observations into account, a revised communication path can be envisioned as follows. +![Alt text](RING_upgrade.png) +This approach eliminates unnecessary data movement and enables the reduction operation to be executed directly on the NIC, avoiding interference with Node B. Since the NIC uses DMA to fetch data, the host (Node B) remains uninvolved in the transfer and can continue running its ML workloads without interruption. + +To realize the proposed communication path, the authors introduce a new NIC architecture, which includes three key components added to the NIC. +![Alt text](RING_arch.png) +* GateKeeper: Manages prefetching and buffer optimization for data transfers in the Ring-AllReduce process. 
+* DataDirector: Redirects incoming data packets into the RNIC’s internal data path for direct handling within the NIC. +* ComputeEnhancer: Performs reduction operations directly on the RNIC using hardware accelerators such as FPGAs, enabling compute capabilities on the SmartNIC. + +>Evaluation results + +The experiments were conducted under two network topology configurations that are commonly used in real-world systems: +1. Ring topology: 8 nodes +2. 6D-torus topology: 729, 4,096, 46,656, and 262,144 nodes—corresponding to 3, 4, 6, and 8 nodes per dimension, respectively + +The architecture was implemented using **Xilinx Vivado**, and the overall simulation was carried out using **Astra-Sim2**. +However, since the results observed under the 6D-torus topology closely resemble those from the ring topology, we will focus on presenting the experimental results for the ring topology only. +* Small message size (1KiB, 256KiB, respectively) +![Alt text](RING_small_eval.png) + The graph compares latency as the number of processes per node increases, specifically in scenarios with small message sizes. Since the computation overhead is low and the messages themselves are small, the latency difference between the proposed approach and naive Ring-AllReduce remains minimal. +* Large message size (256MiB, 1GiB, respectively) +![Alt text](RING_large_eval.png) + In contrast, for larger message sizes, the benefit of performing the reduce operation on the NIC becomes more significant. Processes running AI workloads can produce results without being interrupted by AllReduce operations, while the NIC takes full responsibility for handling the reduction of those results. + + +In conclusion, the paper presents the following key insight based on the experimental results: while DirectReduce offers improved performance over traditional Ring AllReduce when handling medium-sized messages, the actual benefit depends on both the message size ($M$) and the number of processes ($N$). 
+Specifically, when the amount of data per process ($M/N$) is less than or equal to 1KB, there is little to no performance gain. However, when $M/N > 1\text{KB}$, the stream aggregation-based pipeline allows DirectReduce to reduce latency by 15% to 36% compared to Ring AllReduce. In general, the performance advantage of DirectReduce increases with larger $M$ and greater $N$, as this leads to the generation of more packets and enables more parallelism. +That said, when the message size $M$ is smaller than 1MB and $N$ becomes large, the total number of packets may decrease (sometimes down to just one), which in turn limits the effectiveness of the pipeline and constrains performance improvement. + +The findings can be summarized in the following table. +| Condition | Performance Benefit of DirectReduce | +|--------------------------------------|--------------------------------------------------------------| +| $M/N \leq 1\text{KB}$ | Little to no performance improvement | +| $M/N > 1\text{KB}$ | 15%–36% latency reduction | +| $N \uparrow$, $M \uparrow$ | More packets and deeper pipeline → greater performance gain | +| $M < 1\text{MB}$, $N \uparrow$ | Fewer packets → limited pipeline effect, reduced benefit | + +#### Zero-Sparse AllReduce and SmartNIC offloading +>Proposed technique + +The next paper, OmNICCL, introduces not only an offloading mechanism to SmartNICs but also proposes a Zero-Sparse AllReduce algorithm, which aims to reduce the overall amount of data transferred during communication. However, since this blog focuses primarily on SmartNIC-based solutions, we will briefly introduce the Zero-Sparse algorithm and then shift our attention back to the SmartNIC-related aspects. +* **Zero-Sparse AllReduce** +![Alt text](SPARSE_algo.png) +The authors observed that when performing AllReduce on data such as gradients, it is highly inefficient to include a large number of zero values in the communication. 
This issue becomes more prominent in modern AI models, which often contain a vast number of parameters. To optimize these models, techniques like pruning are frequently applied, resulting in what are known as sparse neural networks—models in which a significant portion of the gradients are zero. +Based on this observation, the authors argue that excluding zero values during AllReduce can lead to more efficient communication. The proposed Zero-Sparse AllReduce algorithm operates as illustrated in the figure above. In simple terms, the gradient vector used for AllReduce is divided into blocks. Any block that contains only zeros (a zero block) is excluded from the communication process. As a result, the amount of data transferred is reduced, leading to lower communication latency. + +* **Using SmartNIC as aggregator** +![Alt text](SPARSE_aggregator.png) +The next approach focuses on using SmartNICs as aggregators. The authors highlight the limitation of memory bandwidth on SmartNICs and propose a solution using **Direct Cache Access** (DCA), which allows the NIC to directly access the Last-Level Cache (LLC). However, since the NIC must aggregate data from multiple nodes to perform reduction operations, the limited capacity of the LLC becomes a bottleneck. To address this, the authors propose a specialized memory layout inside the SmartNIC, as shown in the figure above. +In simple terms, the design introduces Rx and Tx Spots to store non-zero blocks. These spots reside in RDMA-accessible regions, allowing worker nodes to write data directly into them. When a worker writes data to an Rx Spot, the aggregator (SmartNIC) reads the data and its corresponding index, performs a reduction with the existing data in the corresponding Tx Spot, and updates the Tx Spot with the result. +This approach eliminates the need to store all incoming data from every worker, allowing more efficient use of LLC. Additionally, the figure shows two Tx Spots, which are used for double buffering. 
This mechanism prevents Tx data from the previous AllReduce phase, not yet scattered to all workers, from being overwritten when the next phase begins.
+
+>Evaluation results
+
+The experiments were conducted on two types of systems: one with GPU workers and another with CPU workers. For the SmartNIC, the authors used the **NVIDIA BlueField-2**. A custom microbenchmark, similar in spirit to the OSU MPI Microbenchmark suite, was used to evaluate AllReduce performance. Unlike typical AllReduce benchmarks, this version allows for configurable array sparsity, enabling more nuanced experimentation.
+Using this setup, the authors compare the performance of executing the reduction operation on a SmartNIC versus on a host CPU. Their results demonstrate that SmartNIC-based aggregation is more efficient. Furthermore, when compared against other sparse AllReduce methods, the proposed Zero-Sparse AllReduce approach shows clear advantages in performance.
+
+* Comparison of two Sparse AllReduce methods
+![Alt text](image.png)
+The experimental results compare the proposed Zero-Sparse AllReduce method, OmNICCL, against an existing sparse AllReduce technique, OmniReduce. The authors claim that OmNICCL consistently outperforms OmniReduce across all evaluated scenarios.
+
+* Comparison of two aggregators
+![Alt text](SPARSE_result.png)
+The experiments were conducted by varying the number of workers and block sparsity levels, with a fixed message size of 128MB. Two versions of OmNICCL are presented: OmNICCL* refers to the setup where a SmartNIC is used as the aggregator, while OmNICCL refers to the case where a conventional CPU serves as the aggregator. In GPU-based systems, when the block sparsity ranges from 25% to 75%, OmNICCL* shows a modest reduction in latency compared to the CPU-based version, highlighting the benefit of SmartNIC offloading in sparse communication scenarios.
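Before moving on, the zero-block filtering at the heart of OmNICCL can be sketched in a few lines of plain Python. This is a hypothetical simplification with an arbitrary block size and our own function names, not the paper's implementation:

```python
# Hypothetical sketch of Zero-Sparse filtering: only blocks that contain a
# non-zero value are put "on the wire"; receivers rebuild the dense vector.
def compress(grad, block=4):
    """Return (offset, block) pairs for blocks containing a non-zero value."""
    pairs = []
    for i in range(0, len(grad), block):
        blk = grad[i:i + block]
        if any(v != 0.0 for v in blk):
            pairs.append((i, blk))
    return pairs

def decompress(pairs, length):
    """Rebuild the dense gradient a receiver would reconstruct."""
    grad = [0.0] * length
    for i, blk in pairs:
        grad[i:i + len(blk)] = blk
    return grad

def traffic_ratio(grad, block=4):
    """Fraction of elements actually transmitted after zero-block filtering."""
    sent = sum(len(blk) for _, blk in compress(grad, block))
    return sent / len(grad)
```

For a gradient with 75% all-zero blocks, `traffic_ratio` comes out to roughly 0.25, which is the kind of sparsity regime the evaluation above targets; for a fully dense gradient it is 1.0 and the scheme degenerates to ordinary AllReduce traffic.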
+
+### Limitation of SmartNIC
+Both **DirectReduce** and **OmNICCL** identify the fundamental limitations of SmartNICs—namely, their limited performance and memory capacity—and propose architectural solutions to address these constraints. However, as seen in the experimental results, especially in the case of OmNICCL, the use of SmartNICs does not lead to dramatic performance improvements. This is likely due to the current hardware resource constraints of commercially available SmartNICs.
+One might wonder why DirectReduce appears to achieve more substantial performance gains. This can be attributed to the fact that DirectReduce was evaluated using a simulation-based setup with an FPGA-based, ASIC-style SmartNIC, while OmNICCL was tested on actual hardware using NVIDIA's BlueField-2. The discrepancy in hardware platforms may explain the difference in results. Additionally, DirectReduce focuses on larger message sizes, which can amplify performance gains—this context should also be considered when interpreting the results.
+In conclusion, while a variety of research efforts are exploring how to best leverage SmartNICs, most studies still point to their limited capabilities as a bottleneck. As such, many works aim to extract the maximum possible performance from current SmartNIC architectures. Nevertheless, this issue is expected to diminish as SmartNIC technology continues to advance in the future.
+
+
+## Conclusion
+In this blog post, we explored a range of research efforts aimed at optimizing training and inference in modern AI workloads—particularly those involving large model parameters and datasets—from two key perspectives.
+
+As modern AI models grow in size and complexity, training them often requires more than a single GPU.
It is now common to use multi-GPU systems where several GPUs are grouped within a single node—and increasingly, multiple such GPU nodes are used together. While this multi-GPU setup significantly accelerates computation, it also introduces **a substantial amount of communication overhead, especially for exchanging model parameters between systems**. To address this, many recent studies have investigated the use of SmartNICs to optimize inter-system communication.
+* **DirectReduce** proposes offloading the traditional Ring AllReduce communication operation to the SmartNIC, allowing the CPU and GPU to focus solely on AI computations.
+* **OmNICCL** introduces the Zero-Sparse AllReduce algorithm to reduce data movement during communication and offloads the collective operation to the SmartNIC as well.
+
+In conclusion, optimizing LLM serving is no longer just about improving model architecture. As model complexity and deployment scale grow, system-level strategies, such as compute-side acceleration and communication-aware design, are becoming increasingly important for achieving efficient training and inference.
+
+## Citation (these will go into the bib file, so only jot down the papers that come to mind.)
+
+
+
diff --git a/_posts/2025-04-28-SampleAndRule.md b/_posts/2025-04-28-SampleAndRule.md
new file mode 100644
index 000000000..1e745aa9b
--- /dev/null
+++ b/_posts/2025-04-28-SampleAndRule.md
@@ -0,0 +1,508 @@
+---
+layout: distill
+title: Sample Blog Post
+description: Your blog post's abstract.
+  Please add your abstract or summary here and not in the main body of your text.
+  Do not include math/latex or hyperlinks.
+date: 2025-04-28 +future: true +htmlwidgets: true +hidden: true + +# Anonymize when submitting +# authors: +# - name: Anonymous + +authors: + - name: Albert Einstein + url: "https://en.wikipedia.org/wiki/Albert_Einstein" + affiliations: + name: IAS, Princeton + - name: Boris Podolsky + url: "https://en.wikipedia.org/wiki/Boris_Podolsky" + affiliations: + name: IAS, Princeton + - name: Nathan Rosen + url: "https://en.wikipedia.org/wiki/Nathan_Rosen" + affiliations: + name: IAS, Princeton + +# must be the exact same name as your blogpost +bibliography: 2025-04-28-distill-example.bib + +# Add a table of contents to your post. +# - make sure that TOC names match the actual section names +# for hyperlinks within the post to work correctly. +# - please use this format rather than manually creating a markdown table of contents. +toc: + - name: Equations + - name: Images and Figures + subsections: + - name: Interactive Figures + - name: Citations + - name: Footnotes + - name: Code Blocks + - name: Diagrams + - name: Tweets + - name: Layouts + - name: Other Typography? + +# Below is an example of injecting additional post-specific styles. +# This is used in the 'Layouts' section of this post. +# If you use this post as a template, delete this _styles block. +_styles: > + .fake-img { + background: #bbb; + border: 1px solid rgba(0, 0, 0, 0.1); + box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); + margin-bottom: 12px; + } + .fake-img p { + font-family: monospace; + color: white; + text-align: left; + margin: 12px 0; + text-align: center; + font-size: 16px; + } +--- + +Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling. 
+
+## Blog Post Example (2024, 2023)
+https://iclr-blogposts.github.io/2024/blog/understanding-icl/
+https://iclr-blogposts.github.io/2024/blog/bench-hvp/
+https://iclr-blogposts.github.io/2024/blog/dpi-fsvi/
+
+https://iclr-blogposts.github.io/2023/blog/2023/adamw/
+
+---
+
+### Plan
+- The unified topic is: optimizing the serving of large-scale language models in distributed systems. Before writing on this topic, each of us writes a post on our own paper.
+- For now, we each write our own blog in the order below.
+- Refer to the previously written Overleaf: https://ko.overleaf.com/project/67ef902314a19d89f597934c
+  - Bae Junhyeong : 2025-04-28-LLM_System_Pool.md
+  - Kang Sungwook : 2025-04-28-NIC.md
+- After each of us has finished writing, we merge the Blog Posts.
+
+---
+### Writing order for each blog
+1. [!] Abstract & Background
+- "what problem is this work trying to tackle?"
+- "how new is this effort?" (introduction)
+2. Main (key explanation)
+- "what contributions did this work make, and what impact should this work have?"
+- "how new is this effort?"
+
+3. Results
+- "what are the limitations of this work?"
+4. [!] Conclusion
+- What efforts were made, and how should they be optimized?
+5. [!] Citation
+---
+
+※ Items marked with ! are generally required.
+
+```Text
+- Submissions due: Please submit your posters (via email to me & TA) and blog posts (via PLMS) by the midnight of 5/30.
+
+- Blog submission instructions: Formats
+- The blog posts should be submitted as a url link to your own blog post.
+- You are free to use any formats.
+  -> But if you need one, see here: https://iclr-blogposts.github.io/2025/submitting/
+- You are expected to host the post through your own [GitHub page].
+- If you need help, please contact the TA.
+
+- Blog submission instructions: Grading
+- As your post (and poster) is supposed to be an academic outcome, please make the followings extra clear to us:
+- "what problem is this work trying to tackle?"
+- "what contributions did this work make, and what impact should this work have?"
+- "how new is this effort?"
+- "what are the limitations of this work?"
+```
+
+
+## Equations
+
+This theme supports rendering beautiful math in inline and display modes using [MathJax 3](https://www.mathjax.org/) engine.
+You just need to surround your math expression with `$$`, like `$$ E = mc^2 $$`.
+If you leave it inside a paragraph, it will produce an inline expression, just like $$ E = mc^2 $$.
+
+To use display mode, again surround your expression with `$$` and place it as a separate paragraph.
+Here is an example:
+
+$$
+\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)
+$$
+
+Note that MathJax 3 is [a major re-write of MathJax](https://docs.mathjax.org/en/latest/upgrading/whats-new-3.0.html)
+that brought a significant improvement to the loading and rendering speed, which is now
+[on par with KaTeX](http://www.intmath.com/cg5/katex-mathjax-comparison.php).
+
+
+## Images and Figures
+
+It's generally a better idea to avoid linking to images hosted elsewhere - links can break and you
+might lose important information in your blog post.
+To include images in your submission in this way, you must do something like the following:
+
+```markdown
+{% raw %}{% include figure.html path="assets/img/2025-04-28-distill-example/iclr.png" class="img-fluid" %}{% endraw %}
+```
+
+which results in the following image:
+
+{% include figure.html path="assets/img/2025-04-28-distill-example/iclr.png" class="img-fluid" %}
+
+To ensure that there are no namespace conflicts, you must save your asset to your unique directory
+`/assets/img/2025-04-28-[SUBMISSION NAME]` within your submission.
+
+Please avoid using the direct markdown method of embedding images; they may not be properly resized.
+Some more complex ways to load images (note the different styles of the shapes/shadows):
+
+ {% include figure.html path="assets/img/2025-04-28-distill-example/9.jpg" class="img-fluid rounded z-depth-1" %} +
+
+ {% include figure.html path="assets/img/2025-04-28-distill-example/7.jpg" class="img-fluid rounded z-depth-1" %} +
+
+
+ A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all. +
+ +
+
+ {% include figure.html path="assets/img/2025-04-28-distill-example/8.jpg" class="img-fluid z-depth-2" %} +
+
+ {% include figure.html path="assets/img/2025-04-28-distill-example/10.jpg" class="img-fluid z-depth-2" %} +
+
+ +
+
+ {% include figure.html path="assets/img/2025-04-28-distill-example/11.jpg" class="img-fluid" %} +
+
+ {% include figure.html path="assets/img/2025-04-28-distill-example/12.jpg" class="img-fluid" %} +
+
+ {% include figure.html path="assets/img/2025-04-28-distill-example/7.jpg" class="img-fluid" %} +
+
+ +### Interactive Figures + +Here's how you could embed interactive figures that have been exported as HTML files. +Note that we will be using plotly for this demo, but anything built off of HTML should work +(**no extra javascript is allowed!**). +All that's required is for you to export your figure into HTML format, and make sure that the file +exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory. +To embed it into any page, simply insert the following code anywhere into your page. + +```markdown +{% raw %}{% include [FIGURE_NAME].html %}{% endraw %} +``` + +For example, the following code can be used to generate the figure underneath it. + +```python +import pandas as pd +import plotly.express as px + +df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv') + +fig = px.density_mapbox( + df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10, + center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain") +fig.show() + +fig.write_html('./assets/html/2025-04-28-distill-example/plotly_demo_1.html') +``` + +And then include it with the following: + +```html +{% raw %}
+<div class="l-page">
+  <iframe src="{{ 'assets/html/2025-04-28-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
+</div>
{% endraw %} +``` + +Voila! + +
+<div class="l-page">
+  <iframe src="{{ 'assets/html/2025-04-28-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
+</div>
+
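If you want to sanity-check the export-and-include mechanism without installing plotly, the same workflow can be sketched with the standard library alone: build a small self-contained HTML figure and write it under the `assets/html/[SUBMISSION NAME]/` directory described above. The `write_html_figure` helper and the `bars_demo.html` filename are hypothetical names for this illustration, not part of the theme.

```python
from pathlib import Path

def write_html_figure(values, out_path):
    # Build an inline SVG bar chart; each value becomes one bar on a 100px-tall canvas.
    bars = "".join(
        f'<rect x="{i * 30}" y="{100 - v}" width="24" height="{v}" fill="steelblue"/>'
        for i, v in enumerate(values)
    )
    html = (
        "<!DOCTYPE html><html><body>"
        f'<svg width="{len(values) * 30}" height="100">{bars}</svg>'
        "</body></html>"
    )
    out_path = Path(out_path)
    # Mirror the assets/html/[SUBMISSION NAME]/ layout so the iframe include can find it.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html)
    return out_path

demo = write_html_figure([40, 70, 55], "assets/html/2025-04-28-distill-example/bars_demo.html")
```

Because the file is fully self-contained HTML (no external javascript), it satisfies the same constraint as the exported plotly figure.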
+
+## Citations
+
+Citations are then used in the article body with the `<d-cite>` tag.
+The key attribute is a reference to the id provided in the bibliography.
+The key attribute can take multiple ids, separated by commas.
+
+The citation is presented inline like this: <d-cite key="OmniReduce"></d-cite> (a number that displays more information on hover).
+If you have an appendix, a bibliography is automatically created and populated in it.
+
+Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover.
+However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well — the authors are human and it’s nice for them to have the community associate them with their work.
+
+***
+
+## Footnotes
+
+Just wrap the text you would like to show up in a footnote in a `<d-footnote>` tag.
+The number of the footnote will be automatically generated.<d-footnote>This will become a hoverable footnote.</d-footnote>
+
+***
+
+## Code Blocks
+
+This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting.
+It supports more than 100 languages.
+This example is in C++.
+All you have to do is wrap your code in a liquid tag:
+
+{% raw %}
+{% highlight c++ linenos %}
code code code
{% endhighlight %}
+{% endraw %}
+
+The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:
+
+{% highlight c++ %}
+
+#include <iostream>
+#include <string>
+using namespace std;
+
+int main()
+{
+    string myString;
+
+    cout << "input a string: ";
+    getline(cin, myString);
+    int length = myString.length();
+
+    // Allocate a character array and copy the string into it.
+    char *charArray = new char[length];
+    for (int i = 0; i < length; ++i) {
+        charArray[i] = myString[i];
+    }
+
+    for (int i = 0; i < length; ++i) {
+        cout << charArray[i] << " ";
+    }
+    cout << endl;
+
+    delete[] charArray;
+    return 0;
+}
+
+{% endhighlight %}
+
+***
+
+## Diagrams
+
+This theme supports generating various diagrams from a text description using the [jekyll-diagrams](https://github.com/zhustec/jekyll-diagrams){:target="\_blank"} plugin.
+Below, we generate a few examples of such diagrams using languages such as [mermaid](https://mermaid-js.github.io/mermaid/){:target="\_blank"}, [plantuml](https://plantuml.com/){:target="\_blank"}, [vega-lite](https://vega.github.io/vega-lite/){:target="\_blank"}, etc.
+
+**Note:** different diagram-generation packages require external dependencies to be installed on your machine.
+Also, be mindful that, because of diagram generation, the first build of your Jekyll website after adding new diagrams will be SLOW.
+For any other details, please refer to the [jekyll-diagrams](https://github.com/zhustec/jekyll-diagrams){:target="\_blank"} README.
+
+**Note:** This is not supported for local rendering!
+
+The diagram below was generated by the following code:
+
+{% raw %}
+```
+{% mermaid %}
+sequenceDiagram
+    participant John
+    participant Alice
+    Alice->>John: Hello John, how are you?
+    John-->>Alice: Great!
+{% endmermaid %}
+```
+{% endraw %}
+
+{% mermaid %}
+sequenceDiagram
+    participant John
+    participant Alice
+    Alice->>John: Hello John, how are you?
+    John-->>Alice: Great!
+{% endmermaid %} + +*** + +## Tweets + +An example of displaying a tweet: +{% twitter https://twitter.com/rubygems/status/518821243320287232 %} + +An example of pulling from a timeline: +{% twitter https://twitter.com/jekyllrb maxwidth=500 limit=3 %} + +For more details on using the plugin visit: [jekyll-twitter-plugin](https://github.com/rob-murray/jekyll-twitter-plugin) + +*** + +## Blockquotes + +
+<blockquote>
+    We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another.
+    —Anais Nin
+</blockquote>
+ +*** + + +## Layouts + +The main text column is referred to as the body. +It is the assumed layout of any direct descendants of the `d-article` element. + +
+<div class="fake-img l-body">
+  <p>.l-body</p>
+</div>
+ +For images you want to display a little larger, try `.l-page`: + +
+<div class="fake-img l-page">
+  <p>.l-page</p>
+</div>
+ +All of these have an outset variant if you want to poke out from the body text a little bit. +For instance: + +
+<div class="fake-img l-body-outset">
+  <p>.l-body-outset</p>
+</div>
+ +
+<div class="fake-img l-page-outset">
+  <p>.l-page-outset</p>
+</div>
+ +Occasionally you’ll want to use the full browser width. +For this, use `.l-screen`. +You can also inset the element a little from the edge of the browser by using the inset variant. + +
+<div class="fake-img l-screen">
+  <p>.l-screen</p>
+</div>
+
+<div class="fake-img l-screen-inset">
+  <p>.l-screen-inset</p>
+</div>
+ +The final layout is for marginalia, asides, and footnotes. +It does not interrupt the normal flow of `.l-body`-sized text except on mobile screen sizes. + +
+<div class="fake-img l-gutter">
+  <p>.l-gutter</p>
+</div>
+
+***
+
+## Other Typography?
+
+Emphasis, aka italics, with *asterisks* (`*asterisks*`) or _underscores_ (`_underscores_`).
+
+Strong emphasis, aka bold, with **asterisks** or __underscores__.
+
+Combined emphasis with **asterisks and _underscores_**.
+
+Strikethrough uses two tildes. ~~Scratch this.~~
+
+1. First ordered list item
+2. Another item
+   * Unordered sub-list.
+1. Actual numbers don't matter, just that it's a number
+   1. Ordered sub-list
+4. And another item.
+
+   You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we'll use three here to also align the raw Markdown).
+
+   To have a line break without a paragraph, you will need to use two trailing spaces.
+   Note that this line is separate, but within the same paragraph.
+   (This is contrary to the typical GFM line break behavior, where trailing spaces are not required.)
+
+* Unordered lists can use asterisks
+- Or minuses
++ Or pluses
+
+[I'm an inline-style link](https://www.google.com)
+
+[I'm an inline-style link with title](https://www.google.com "Google's Homepage")
+
+[I'm a reference-style link][Arbitrary case-insensitive reference text]
+
+[I'm a relative reference to a repository file](../blob/master/LICENSE)
+
+[You can use numbers for reference-style link definitions][1]
+
+Or leave it empty and use the [link text itself].
+
+URLs and URLs in angle brackets will automatically get turned into links.
+http://www.example.com or <http://www.example.com> and sometimes
+example.com (but not on Github, for example).
+
+Some text to show that the reference links can follow later.
+
+[arbitrary case-insensitive reference text]: https://www.mozilla.org
+[1]: http://slashdot.org
+[link text itself]: http://www.reddit.com
+
+Here's our logo (hover to see the title text):
+
+Inline-style:
+![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 1")
+
+Reference-style:
+![alt text][logo]
+
+[logo]: https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 2"
+
+Inline `code` has `back-ticks around` it.
+
+```javascript
+var s = "JavaScript syntax highlighting";
+alert(s);
+```
+
+```python
+s = "Python syntax highlighting"
+print(s)
+```
+
+```
+No language indicated, so no syntax highlighting.
+But let's throw in a <b>tag</b>.
+```
+
+Colons can be used to align columns.
+
+| Tables        | Are           | Cool  |
+| ------------- |:-------------:| -----:|
+| col 3 is      | right-aligned | $1600 |
+| col 2 is      | centered      |   $12 |
+| zebra stripes | are neat      |    $1 |
+
+There must be at least 3 dashes separating each header cell.
+The outer pipes (|) are optional, and you don't need to make the
+raw Markdown line up prettily. You can also use inline Markdown.
+
+Markdown | Less | Pretty
+--- | --- | ---
+*Still* | `renders` | **nicely**
+1 | 2 | 3
+
+> Blockquotes are very handy in email to emulate reply text.
+> This line is part of the same quote.
+
+Quote break.
+
+> This is a very long line that will still be quoted properly when it wraps. Oh boy let's keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can *put* **Markdown** into a blockquote.
+
+
+Here's a line for us to start with.
+
+This line is separated from the one above by two newlines, so it will be a *separate paragraph*.
+
+This line is also a separate paragraph, but...
+This line is only separated by a single newline, so it's a separate line in the *same paragraph*.
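The table syntax above is regular enough to generate programmatically, which is handy for large result tables. Here is a minimal sketch (the `markdown_table` helper is hypothetical, not part of the theme) mapping `l`/`c`/`r` alignment flags to the colon placements shown earlier:

```python
def markdown_table(headers, rows, aligns):
    # aligns: one of 'l', 'c', 'r' per column, mapped to the colon syntax above.
    sep = {"l": "---", "c": ":---:", "r": "---:"}
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join(sep[a] for a in aligns) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

table = markdown_table(
    ["Tables", "Are", "Cool"],
    [["col 3 is", "right-aligned", "$1600"], ["col 2 is", "centered", "$12"]],
    ["l", "c", "r"],
)
print(table)
```

Since the outer pipes are optional and cells need not line up, emitting them without padding, as here, still renders correctly.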
diff --git a/_posts/CC.png b/_posts/CC.png new file mode 100644 index 000000000..aeeecb248 Binary files /dev/null and b/_posts/CC.png differ diff --git a/_posts/DDL.png b/_posts/DDL.png new file mode 100644 index 000000000..192a3578c Binary files /dev/null and b/_posts/DDL.png differ diff --git a/_posts/GEMV-GEMM.png b/_posts/GEMV-GEMM.png new file mode 100644 index 000000000..54379292d Binary files /dev/null and b/_posts/GEMV-GEMM.png differ diff --git a/_posts/MHA_GQA_MQA.png b/_posts/MHA_GQA_MQA.png new file mode 100644 index 000000000..be8a59f41 Binary files /dev/null and b/_posts/MHA_GQA_MQA.png differ diff --git a/_posts/NeuPIMs.png b/_posts/NeuPIMs.png new file mode 100644 index 000000000..82b838df1 Binary files /dev/null and b/_posts/NeuPIMs.png differ diff --git a/_posts/NeuPIMs_result.png b/_posts/NeuPIMs_result.png new file mode 100644 index 000000000..bd6266c1c Binary files /dev/null and b/_posts/NeuPIMs_result.png differ diff --git a/_posts/PIM-PNM-detail.png b/_posts/PIM-PNM-detail.png new file mode 100644 index 000000000..631625ca7 Binary files /dev/null and b/_posts/PIM-PNM-detail.png differ diff --git a/_posts/PIM-PNM.png b/_posts/PIM-PNM.png new file mode 100644 index 000000000..d297a417c Binary files /dev/null and b/_posts/PIM-PNM.png differ diff --git a/_posts/RING_arch.png b/_posts/RING_arch.png new file mode 100644 index 000000000..7e62d3158 Binary files /dev/null and b/_posts/RING_arch.png differ diff --git a/_posts/RING_basic.png b/_posts/RING_basic.png new file mode 100644 index 000000000..3115efd88 Binary files /dev/null and b/_posts/RING_basic.png differ diff --git a/_posts/RING_large_eval.png b/_posts/RING_large_eval.png new file mode 100644 index 000000000..bc2094984 Binary files /dev/null and b/_posts/RING_large_eval.png differ diff --git a/_posts/RING_small_eval.png b/_posts/RING_small_eval.png new file mode 100644 index 000000000..256c854fa Binary files /dev/null and b/_posts/RING_small_eval.png differ diff --git 
a/_posts/RING_upgrade.png b/_posts/RING_upgrade.png new file mode 100644 index 000000000..93193d063 Binary files /dev/null and b/_posts/RING_upgrade.png differ diff --git a/_posts/Results.png b/_posts/Results.png new file mode 100644 index 000000000..ccae04722 Binary files /dev/null and b/_posts/Results.png differ diff --git a/_posts/SPARSE_aggregator.png b/_posts/SPARSE_aggregator.png new file mode 100644 index 000000000..2839c76e0 Binary files /dev/null and b/_posts/SPARSE_aggregator.png differ diff --git a/_posts/SPARSE_algo.png b/_posts/SPARSE_algo.png new file mode 100644 index 000000000..d138780c2 Binary files /dev/null and b/_posts/SPARSE_algo.png differ diff --git a/_posts/SPARSE_result.png b/_posts/SPARSE_result.png new file mode 100644 index 000000000..996a387a6 Binary files /dev/null and b/_posts/SPARSE_result.png differ diff --git a/_posts/dualbuffer.png b/_posts/dualbuffer.png new file mode 100644 index 000000000..9e11d58f3 Binary files /dev/null and b/_posts/dualbuffer.png differ diff --git a/_posts/image.png b/_posts/image.png new file mode 100644 index 000000000..72f89bcbc Binary files /dev/null and b/_posts/image.png differ diff --git a/_posts/neupimsOverlap.png b/_posts/neupimsOverlap.png new file mode 100644 index 000000000..d8fbd77c5 Binary files /dev/null and b/_posts/neupimsOverlap.png differ diff --git a/_posts/nodecomm.png b/_posts/nodecomm.png new file mode 100644 index 000000000..48f9eaa96 Binary files /dev/null and b/_posts/nodecomm.png differ diff --git a/_posts/parallelism.png b/_posts/parallelism.png new file mode 100644 index 000000000..f1e4164ed Binary files /dev/null and b/_posts/parallelism.png differ diff --git a/_posts/smartnic.png b/_posts/smartnic.png new file mode 100644 index 000000000..c73005844 Binary files /dev/null and b/_posts/smartnic.png differ diff --git a/_posts/2025-04-28-analysing-the-spectral-biases-in-generative-models.md b/_posts_history/2025-04-28-analysing-the-spectral-biases-in-generative-models.md similarity 
index 100% rename from _posts/2025-04-28-analysing-the-spectral-biases-in-generative-models.md rename to _posts_history/2025-04-28-analysing-the-spectral-biases-in-generative-models.md diff --git a/_posts/2025-04-28-analytical-simulated-dynamics.md b/_posts_history/2025-04-28-analytical-simulated-dynamics.md similarity index 100% rename from _posts/2025-04-28-analytical-simulated-dynamics.md rename to _posts_history/2025-04-28-analytical-simulated-dynamics.md diff --git a/_posts/2025-04-28-anthropomorphic-ai.md b/_posts_history/2025-04-28-anthropomorphic-ai.md similarity index 100% rename from _posts/2025-04-28-anthropomorphic-ai.md rename to _posts_history/2025-04-28-anthropomorphic-ai.md diff --git a/_posts/2025-04-28-building-blocks-of-differentially-private-training.md b/_posts_history/2025-04-28-building-blocks-of-differentially-private-training.md similarity index 100% rename from _posts/2025-04-28-building-blocks-of-differentially-private-training.md rename to _posts_history/2025-04-28-building-blocks-of-differentially-private-training.md diff --git a/_posts/2025-04-28-calibrated-mia.md b/_posts_history/2025-04-28-calibrated-mia.md similarity index 100% rename from _posts/2025-04-28-calibrated-mia.md rename to _posts_history/2025-04-28-calibrated-mia.md diff --git a/_posts/2025-04-28-calibration.md b/_posts_history/2025-04-28-calibration.md similarity index 100% rename from _posts/2025-04-28-calibration.md rename to _posts_history/2025-04-28-calibration.md diff --git a/_posts/2025-04-28-conditional-flow-matching.md b/_posts_history/2025-04-28-conditional-flow-matching.md similarity index 100% rename from _posts/2025-04-28-conditional-flow-matching.md rename to _posts_history/2025-04-28-conditional-flow-matching.md diff --git a/_posts/2025-04-28-distill-example.md b/_posts_history/2025-04-28-distill-example.md similarity index 100% rename from _posts/2025-04-28-distill-example.md rename to _posts_history/2025-04-28-distill-example.md diff --git 
a/_posts/2025-04-28-distill-example2.html b/_posts_history/2025-04-28-distill-example2.html similarity index 100% rename from _posts/2025-04-28-distill-example2.html rename to _posts_history/2025-04-28-distill-example2.html diff --git a/_posts/2025-04-28-do-not-write-jailbreak-papers.md b/_posts_history/2025-04-28-do-not-write-jailbreak-papers.md similarity index 100% rename from _posts/2025-04-28-do-not-write-jailbreak-papers.md rename to _posts_history/2025-04-28-do-not-write-jailbreak-papers.md diff --git a/_posts/2025-04-28-ebm-vs-mcmc.md b/_posts_history/2025-04-28-ebm-vs-mcmc.md similarity index 100% rename from _posts/2025-04-28-ebm-vs-mcmc.md rename to _posts_history/2025-04-28-ebm-vs-mcmc.md diff --git a/_posts/2025-04-28-engram.md b/_posts_history/2025-04-28-engram.md similarity index 100% rename from _posts/2025-04-28-engram.md rename to _posts_history/2025-04-28-engram.md diff --git a/_posts/2025-04-28-factual-validation-simplification.md b/_posts_history/2025-04-28-factual-validation-simplification.md similarity index 100% rename from _posts/2025-04-28-factual-validation-simplification.md rename to _posts_history/2025-04-28-factual-validation-simplification.md diff --git a/_posts/2025-04-28-feature-geometry.md b/_posts_history/2025-04-28-feature-geometry.md similarity index 100% rename from _posts/2025-04-28-feature-geometry.md rename to _posts_history/2025-04-28-feature-geometry.md diff --git a/_posts/2025-04-28-fine-tuning-token-based-large-multimodal-models.md b/_posts_history/2025-04-28-fine-tuning-token-based-large-multimodal-models.md similarity index 100% rename from _posts/2025-04-28-fine-tuning-token-based-large-multimodal-models.md rename to _posts_history/2025-04-28-fine-tuning-token-based-large-multimodal-models.md diff --git a/_posts/2025-04-28-fisher.md b/_posts_history/2025-04-28-fisher.md similarity index 100% rename from _posts/2025-04-28-fisher.md rename to _posts_history/2025-04-28-fisher.md diff --git 
a/_posts/2025-04-28-flow-with-what-you-know.md b/_posts_history/2025-04-28-flow-with-what-you-know.md similarity index 100% rename from _posts/2025-04-28-flow-with-what-you-know.md rename to _posts_history/2025-04-28-flow-with-what-you-know.md diff --git a/_posts/2025-04-28-foundation-adapter.md b/_posts_history/2025-04-28-foundation-adapter.md similarity index 100% rename from _posts/2025-04-28-foundation-adapter.md rename to _posts_history/2025-04-28-foundation-adapter.md diff --git a/_posts/2025-04-28-imagenet-flaws.md b/_posts_history/2025-04-28-imagenet-flaws.md similarity index 100% rename from _posts/2025-04-28-imagenet-flaws.md rename to _posts_history/2025-04-28-imagenet-flaws.md diff --git a/_posts/2025-04-28-interpret-classification.md b/_posts_history/2025-04-28-interpret-classification.md similarity index 100% rename from _posts/2025-04-28-interpret-classification.md rename to _posts_history/2025-04-28-interpret-classification.md diff --git a/_posts/2025-04-28-linear-gnn-convergence-restated.md b/_posts_history/2025-04-28-linear-gnn-convergence-restated.md similarity index 100% rename from _posts/2025-04-28-linear-gnn-convergence-restated.md rename to _posts_history/2025-04-28-linear-gnn-convergence-restated.md diff --git a/_posts/2025-04-28-linrec.md b/_posts_history/2025-04-28-linrec.md similarity index 100% rename from _posts/2025-04-28-linrec.md rename to _posts_history/2025-04-28-linrec.md diff --git a/_posts/2025-04-28-llm-context-utilization.md b/_posts_history/2025-04-28-llm-context-utilization.md similarity index 100% rename from _posts/2025-04-28-llm-context-utilization.md rename to _posts_history/2025-04-28-llm-context-utilization.md diff --git a/_posts/2025-04-28-llm-democracy.md b/_posts_history/2025-04-28-llm-democracy.md similarity index 100% rename from _posts/2025-04-28-llm-democracy.md rename to _posts_history/2025-04-28-llm-democracy.md diff --git a/_posts/2025-04-28-llm-knowledge-distil.md 
b/_posts_history/2025-04-28-llm-knowledge-distil.md similarity index 100% rename from _posts/2025-04-28-llm-knowledge-distil.md rename to _posts_history/2025-04-28-llm-knowledge-distil.md diff --git a/_posts/2025-04-28-localization.md b/_posts_history/2025-04-28-localization.md similarity index 100% rename from _posts/2025-04-28-localization.md rename to _posts_history/2025-04-28-localization.md diff --git a/_posts/2025-04-28-lost-in-prediction.md b/_posts_history/2025-04-28-lost-in-prediction.md similarity index 100% rename from _posts/2025-04-28-lost-in-prediction.md rename to _posts_history/2025-04-28-lost-in-prediction.md diff --git a/_posts/2025-04-28-mad.md b/_posts_history/2025-04-28-mad.md similarity index 100% rename from _posts/2025-04-28-mad.md rename to _posts_history/2025-04-28-mad.md diff --git a/_posts/2025-04-28-multimodal-learning.md b/_posts_history/2025-04-28-multimodal-learning.md similarity index 100% rename from _posts/2025-04-28-multimodal-learning.md rename to _posts_history/2025-04-28-multimodal-learning.md diff --git a/_posts/2025-04-28-opt-summary.md b/_posts_history/2025-04-28-opt-summary.md similarity index 100% rename from _posts/2025-04-28-opt-summary.md rename to _posts_history/2025-04-28-opt-summary.md diff --git a/_posts/2025-04-28-pessa.md b/_posts_history/2025-04-28-pessa.md similarity index 100% rename from _posts/2025-04-28-pessa.md rename to _posts_history/2025-04-28-pessa.md diff --git a/_posts/2025-04-28-pitfalls-of-evidence-based-ai-policy.md b/_posts_history/2025-04-28-pitfalls-of-evidence-based-ai-policy.md similarity index 100% rename from _posts/2025-04-28-pitfalls-of-evidence-based-ai-policy.md rename to _posts_history/2025-04-28-pitfalls-of-evidence-based-ai-policy.md diff --git a/_posts/2025-04-28-pocp.md b/_posts_history/2025-04-28-pocp.md similarity index 100% rename from _posts/2025-04-28-pocp.md rename to _posts_history/2025-04-28-pocp.md diff --git a/_posts/2025-04-28-positional-embedding.md 
b/_posts_history/2025-04-28-positional-embedding.md similarity index 100% rename from _posts/2025-04-28-positional-embedding.md rename to _posts_history/2025-04-28-positional-embedding.md diff --git a/_posts/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md b/_posts_history/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md similarity index 100% rename from _posts/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md rename to _posts_history/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md diff --git a/_posts/2025-04-28-repurposing.md b/_posts_history/2025-04-28-repurposing.md similarity index 100% rename from _posts/2025-04-28-repurposing.md rename to _posts_history/2025-04-28-repurposing.md diff --git a/_posts/2025-04-28-rethinking-graph-prompts.md b/_posts_history/2025-04-28-rethinking-graph-prompts.md similarity index 100% rename from _posts/2025-04-28-rethinking-graph-prompts.md rename to _posts_history/2025-04-28-rethinking-graph-prompts.md diff --git a/_posts/2025-04-28-rethinking-llm-simulation.md b/_posts_history/2025-04-28-rethinking-llm-simulation.md similarity index 100% rename from _posts/2025-04-28-rethinking-llm-simulation.md rename to _posts_history/2025-04-28-rethinking-llm-simulation.md diff --git a/_posts/2025-04-28-risks-private-evals.md b/_posts_history/2025-04-28-risks-private-evals.md similarity index 100% rename from _posts/2025-04-28-risks-private-evals.md rename to _posts_history/2025-04-28-risks-private-evals.md diff --git a/_posts/2025-04-28-scalable-mcts.md b/_posts_history/2025-04-28-scalable-mcts.md similarity index 100% rename from _posts/2025-04-28-scalable-mcts.md rename to _posts_history/2025-04-28-scalable-mcts.md diff --git a/_posts/2025-04-28-sparse-autodiff.md b/_posts_history/2025-04-28-sparse-autodiff.md similarity index 100% rename from _posts/2025-04-28-sparse-autodiff.md rename to 
_posts_history/2025-04-28-sparse-autodiff.md diff --git a/_posts/2025-04-28-spd.md b/_posts_history/2025-04-28-spd.md similarity index 100% rename from _posts/2025-04-28-spd.md rename to _posts_history/2025-04-28-spd.md diff --git a/_posts/2025-04-28-the-illustrated-alphafold.md b/_posts_history/2025-04-28-the-illustrated-alphafold.md similarity index 100% rename from _posts/2025-04-28-the-illustrated-alphafold.md rename to _posts_history/2025-04-28-the-illustrated-alphafold.md diff --git a/_posts/2025-04-28-the-lottery-llm-hyperthesis.md b/_posts_history/2025-04-28-the-lottery-llm-hyperthesis.md similarity index 100% rename from _posts/2025-04-28-the-lottery-llm-hyperthesis.md rename to _posts_history/2025-04-28-the-lottery-llm-hyperthesis.md diff --git a/_posts/2025-04-28-toddlers-vs-vismodels.md b/_posts_history/2025-04-28-toddlers-vs-vismodels.md similarity index 100% rename from _posts/2025-04-28-toddlers-vs-vismodels.md rename to _posts_history/2025-04-28-toddlers-vs-vismodels.md diff --git a/_posts/2025-04-28-towards-more-rigorous-llm-evals.md b/_posts_history/2025-04-28-towards-more-rigorous-llm-evals.md similarity index 100% rename from _posts/2025-04-28-towards-more-rigorous-llm-evals.md rename to _posts_history/2025-04-28-towards-more-rigorous-llm-evals.md diff --git a/_posts/2025-04-28-visualizing-training.md b/_posts_history/2025-04-28-visualizing-training.md similarity index 100% rename from _posts/2025-04-28-visualizing-training.md rename to _posts_history/2025-04-28-visualizing-training.md diff --git a/_posts/2025-04-28-vlm-understanding.md b/_posts_history/2025-04-28-vlm-understanding.md similarity index 100% rename from _posts/2025-04-28-vlm-understanding.md rename to _posts_history/2025-04-28-vlm-understanding.md diff --git a/_posts/2025-05-07-steering-llms-behavior.md b/_posts_history/2025-05-07-steering-llms-behavior.md similarity index 100% rename from _posts/2025-05-07-steering-llms-behavior.md rename to 
_posts_history/2025-05-07-steering-llms-behavior.md diff --git a/assets/bibliography/2025-04-28-Final.bib b/assets/bibliography/2025-04-28-Final.bib new file mode 100644 index 000000000..2dee987ac --- /dev/null +++ b/assets/bibliography/2025-04-28-Final.bib @@ -0,0 +1,145 @@ +@inproceedings{OmNICCL, +author = {Gu, Tongzhou and Fei, Jiawei and Canini, Marco}, +title = {OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs}, +year = {2024}, +isbn = {9798400707131}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3672198.3673804}, +doi = {10.1145/3672198.3673804}, +abstract = {AllReduce is a collective communication pattern commonly used in Distributed Deep Learning (DDL) and High Performance Computing (HPC). Sparse AllReduce, which compresses the data transmitted, achieves significant acceleration on specific workloads. However, compression introduces a non-negligible performance overhead. Therefore, we propose the OmNICreduce algorithm, an efficient inter-node sparse AllReduce method, as well as its implementation, OmNICCL. It utilizes Direct Cache Access (DCA) to achieve zero-overhead lossless compression and employs SmartNICs for aggregation on the data plane. We demonstrate that our method can provide up to a 7.24\texttimes{} speedup over conventional dense AllReduce methods under a 100Gbps RoCEv2 network and 1.76-17.37\texttimes{} performance improvement over state-of-the-art implementations when performing sparse AllReduce.}, +booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing}, +pages = {75–83}, +numpages = {9}, +keywords = {Collective Communication, DCA, DPU, In-Network Aggregation, SmartNIC}, +location = {Sydney, NSW, Australia}, +series = {NAIC '24} +} + +@inproceedings{OmniReduce, +author = {Fei, Jiawei and Ho, Chen-Yu and Sahu, Atal N. 
and Canini, Marco and Sapio, Amedeo}, +title = {Efficient sparse collective communication and its application to accelerate distributed deep learning}, +year = {2021}, +isbn = {9781450383837}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3452296.3472904}, +doi = {10.1145/3452296.3472904}, +abstract = {Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models.We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2x. 
Even at 100 Gbps, OmniReduce delivers 1.4--2.9x better performance for network-bottlenecked DNNs.}, +booktitle = {Proceedings of the 2021 ACM SIGCOMM 2021 Conference}, +pages = {676–691}, +numpages = {16}, +keywords = {deep learning, distributed training}, +location = {Virtual Event, USA}, +series = {SIGCOMM '21} +} + +@INPROCEEDINGS{DirectReduce, + author={Hui, Lihuan and Yang, Wang and Wang, Yanbo}, + booktitle={2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)}, + title={Leveraging SmartNIC for Ring AllReduce Offloading}, + year={2024}, + volume={}, + number={}, + pages={173-180}, + keywords={Systematics;Protocols;Distance learning;Simulation;Message passing;Logic gates;Parallel processing;Topology;Optimization;Engines;Ring All-Reduce;SmartNIC;In-Network Aggregation;collective communication}, + doi={10.1109/ISPA63168.2024.00030} +} + +@ARTICLE{FPGANIC, + author={Ma, Rui and Georganas, Evangelos and Heinecke, Alexander and Gribok, Sergey and Boutros, Andrew and Nurvitadhi, Eriko}, + journal={IEEE Computer Architecture Letters}, + title={FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems}, + year={2022}, + volume={21}, + number={2}, + pages={49-52}, + keywords={Training;Artificial intelligence;Field programmable gate arrays;Tensors;Computational modeling;Bandwidth;Scalability;AI training;all-reduce;smart NIC;FPGA}, + doi={10.1109/LCA.2022.3189207}} + + +@inproceedings{OffPath, + title={Characterizing off-path $\{$SmartNIC$\}$ for accelerating distributed systems}, + author={Wei, Xingda and Cheng, Rongxin and Yang, Yuhan and Chen, Rong and Chen, Haibo}, + booktitle={17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)}, + pages={987--1004}, + year={2023} +} + +@inproceedings{OptimusNIC, + title={OptimusNIC: Offloading Optimizer State to SmartNICs for Efficient Large-Scale AI Training}, + author={Rebai, Achref and Canini, Marco}, + booktitle={Proceedings of the 5th Workshop 
on Machine Learning and Systems}, + pages={176--182}, + year={2025} +} + +@inproceedings{LineFS, +author = {Kim, Jongyul and Jang, Insu and Reda, Waleed and Im, Jaeseong and Canini, Marco and Kosti\'{c}, Dejan and Kwon, Youngjin and Peter, Simon and Witchel, Emmett}, +title = {LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism}, +year = {2021}, +isbn = {9781450387095}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3477132.3483565}, +doi = {10.1145/3477132.3483565}, +abstract = {In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe.We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a Smart-NIC. We implement LineFS on the Mellanox BlueField Smart-NIC and compare it to Assise, a state-of-the-art PM DFS. 
LineFS improves latency in LevelDB up to 80\% and throughput in Filebench up to 79\%, while providing extended DFS availability during host system failures.}, +booktitle = {Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles}, +pages = {756–771}, +numpages = {16}, +keywords = {SmartNIC offload, Distributed file system}, +location = {Virtual Event, Germany}, +series = {SOSP '21} +} + +@INPROCEEDINGS{AstraSim2, + author={Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar}, + booktitle={2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}, + title={ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale}, + year={2023}, + volume={}, + number={}, + pages={283-294}, + keywords={Training;Semiconductor device modeling;Analytical models;Network topology;Systems modeling;Throughput;Data models;Distributed training;High-performance training;Multi-dimensional network;Disaggregated memory system}, + doi={10.1109/ISPASS57527.2023.00035}} + +@inproceedings{SqueezeNIC, +author = {Rebai, Achref and Ojewale, Mubarak Adetunji and Ullah, Anees and Canini, Marco and Fahmy, Suhaib A.}, +title = {SqueezeNIC: Low-Latency In-NIC Compression for Distributed Deep Learning}, +year = {2024}, +isbn = {9798400707131}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/3672198.3673801}, +doi = {10.1145/3672198.3673801}, +abstract = {To alleviate the communication bottleneck of distributed deep learning training, several data compression algorithms have been proposed. However, these algorithms introduce computational overhead and resource allocation concerns on CPUs and GPUs. 
In this paper, we introduce SqueezeNIC, an FPGA-based Network Interface Card (NIC) that offloads communication compression from CPUs/GPUs, bridging a high bandwidth intra-node network with a high bandwidth inter-node network. It enables better overlap of gradient communication and computation to further reduce training time per iteration in distributed training. Our evaluations show that SqueezeNIC achieves line rate compression and can speed up training by up to a factor of 1.21\texttimes{}, compared to baseline approaches.},
+booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing},
+pages = {61--68},
+numpages = {8},
+keywords = {Distributed Training, FPGA, In-Network Compression},
+location = {Sydney, NSW, Australia},
+series = {NAIC '24}
+}
+
+
+@misc{bluefield2,
+ author = {{NVIDIA}},
+ title = {NVIDIA BlueField-2 DPU},
+ year = {2023},
+ howpublished = {\url{https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf}},
+ note = {Datasheet, Accessed: 2025-05-22}
+}
+
+@misc{denneman2020multiGPU,
+ author = {Frank Denneman},
+ title = {Multi-GPU and Distributed Deep Learning},
+ howpublished = {\url{https://frankdenneman.nl/2020/02/19/multi-gpu-and-distributed-deep-learning/}},
+ year = {2020},
+ note = {Accessed: 2025-05-24}
+}
+
+@misc{wikiCollectiveOp,
+ title = {Collective operation},
+ howpublished = {\url{https://en.wikipedia.org/wiki/Collective_operation}},
+ note = {Accessed: 2025-05-24}
+}
\ No newline at end of file
diff --git a/assets/bibliography/2025-04-28-NIC.bib b/assets/bibliography/2025-04-28-NIC.bib
new file mode 100644
index 000000000..2dee987ac
--- /dev/null
+++ b/assets/bibliography/2025-04-28-NIC.bib
@@ -0,0 +1,145 @@
+@inproceedings{OmNICCL,
+author = {Gu, Tongzhou and Fei, Jiawei and Canini, Marco},
+title = {OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs},
+year = {2024},
+isbn = {9798400707131},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3672198.3673804},
+doi = {10.1145/3672198.3673804},
+abstract = {AllReduce is a collective communication pattern commonly used in Distributed Deep Learning (DDL) and High Performance Computing (HPC). Sparse AllReduce, which compresses the data transmitted, achieves significant acceleration on specific workloads. However, compression introduces a non-negligible performance overhead. Therefore, we propose the OmNICreduce algorithm, an efficient inter-node sparse AllReduce method, as well as its implementation, OmNICCL. It utilizes Direct Cache Access (DCA) to achieve zero-overhead lossless compression and employs SmartNICs for aggregation on the data plane. We demonstrate that our method can provide up to a 7.24\texttimes{} speedup over conventional dense AllReduce methods under a 100Gbps RoCEv2 network and 1.76-17.37\texttimes{} performance improvement over state-of-the-art implementations when performing sparse AllReduce.},
+booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing},
+pages = {75--83},
+numpages = {9},
+keywords = {Collective Communication, DCA, DPU, In-Network Aggregation, SmartNIC},
+location = {Sydney, NSW, Australia},
+series = {NAIC '24}
+}
+
+@inproceedings{OmniReduce,
+author = {Fei, Jiawei and Ho, Chen-Yu and Sahu, Atal N. and Canini, Marco and Sapio, Amedeo},
+title = {Efficient sparse collective communication and its application to accelerate distributed deep learning},
+year = {2021},
+isbn = {9781450383837},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3452296.3472904},
+doi = {10.1145/3452296.3472904},
+abstract = {Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models.
Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models. We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2x. Even at 100 Gbps, OmniReduce delivers 1.4--2.9x better performance for network-bottlenecked DNNs.},
+booktitle = {Proceedings of the 2021 ACM SIGCOMM 2021 Conference},
+pages = {676--691},
+numpages = {16},
+keywords = {deep learning, distributed training},
+location = {Virtual Event, USA},
+series = {SIGCOMM '21}
+}
+
+@INPROCEEDINGS{DirectReduce,
+ author={Hui, Lihuan and Yang, Wang and Wang, Yanbo},
+ booktitle={2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)},
+ title={Leveraging SmartNIC for Ring AllReduce Offloading},
+ year={2024},
+ volume={},
+ number={},
+ pages={173--180},
+ keywords={Systematics;Protocols;Distance learning;Simulation;Message passing;Logic gates;Parallel processing;Topology;Optimization;Engines;Ring All-Reduce;SmartNIC;In-Network Aggregation;collective communication},
+ doi={10.1109/ISPA63168.2024.00030}
+}
+
+@ARTICLE{FPGANIC,
+ author={Ma, Rui and Georganas, Evangelos and Heinecke, Alexander and Gribok, Sergey and Boutros, Andrew and Nurvitadhi, Eriko},
+ journal={IEEE Computer Architecture Letters},
+ title={FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems},
+ year={2022},
+ volume={21},
+ number={2},
+ pages={49--52},
+ keywords={Training;Artificial intelligence;Field programmable gate arrays;Tensors;Computational modeling;Bandwidth;Scalability;AI training;all-reduce;smart NIC;FPGA},
+ doi={10.1109/LCA.2022.3189207}}
+
+
+@inproceedings{OffPath,
+ title={Characterizing off-path {SmartNIC} for accelerating distributed systems},
+ author={Wei, Xingda and Cheng, Rongxin and Yang, Yuhan and Chen, Rong and Chen, Haibo},
+ booktitle={17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)},
+ pages={987--1004},
+ year={2023}
+}
+
+@inproceedings{OptimusNIC,
+ title={OptimusNIC: Offloading Optimizer State to SmartNICs for Efficient Large-Scale AI Training},
+ author={Rebai, Achref and Canini, Marco},
+ booktitle={Proceedings of the 5th Workshop on Machine Learning and Systems},
+ pages={176--182},
+ year={2025}
+}
+
+@inproceedings{LineFS,
+author = {Kim, Jongyul and Jang, Insu and Reda, Waleed and Im, Jaeseong and Canini, Marco and Kosti\'{c}, Dejan and Kwon, Youngjin and Peter, Simon and Witchel, Emmett},
+title = {LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism},
+year = {2021},
+isbn = {9781450387095},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3477132.3483565},
+doi = {10.1145/3477132.3483565},
+abstract = {In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe. We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC.
LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a SmartNIC. We implement LineFS on the Mellanox BlueField SmartNIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80\% and throughput in Filebench up to 79\%, while providing extended DFS availability during host system failures.},
+booktitle = {Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles},
+pages = {756--771},
+numpages = {16},
+keywords = {SmartNIC offload, Distributed file system},
+location = {Virtual Event, Germany},
+series = {SOSP '21}
+}
+
+@INPROCEEDINGS{AstraSim2,
+ author={Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
+ booktitle={2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
+ title={ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale},
+ year={2023},
+ volume={},
+ number={},
+ pages={283--294},
+ keywords={Training;Semiconductor device modeling;Analytical models;Network topology;Systems modeling;Throughput;Data models;Distributed training;High-performance training;Multi-dimensional network;Disaggregated memory system},
+ doi={10.1109/ISPASS57527.2023.00035}}
+
+@inproceedings{SqueezeNIC,
+author = {Rebai, Achref and Ojewale, Mubarak Adetunji and Ullah, Anees and Canini, Marco and Fahmy, Suhaib A.},
+title = {SqueezeNIC: Low-Latency In-NIC Compression for Distributed Deep Learning},
+year = {2024},
+isbn = {9798400707131},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3672198.3673801},
+doi = {10.1145/3672198.3673801},
+abstract = {To alleviate the communication bottleneck of distributed deep learning training, several data compression algorithms have been proposed. However, these algorithms introduce computational overhead and resource allocation concerns on CPUs and GPUs. In this paper, we introduce SqueezeNIC, an FPGA-based Network Interface Card (NIC) that offloads communication compression from CPUs/GPUs, bridging a high bandwidth intra-node network with a high bandwidth inter-node network. It enables better overlap of gradient communication and computation to further reduce training time per iteration in distributed training. Our evaluations show that SqueezeNIC achieves line rate compression and can speed up training by up to a factor of 1.21\texttimes{}, compared to baseline approaches.},
+booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing},
+pages = {61--68},
+numpages = {8},
+keywords = {Distributed Training, FPGA, In-Network Compression},
+location = {Sydney, NSW, Australia},
+series = {NAIC '24}
+}
+
+
+@misc{bluefield2,
+ author = {{NVIDIA}},
+ title = {NVIDIA BlueField-2 DPU},
+ year = {2023},
+ howpublished = {\url{https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf}},
+ note = {Datasheet, Accessed: 2025-05-22}
+}
+
+@misc{denneman2020multiGPU,
+ author = {Frank Denneman},
+ title = {Multi-GPU and Distributed Deep Learning},
+ howpublished = {\url{https://frankdenneman.nl/2020/02/19/multi-gpu-and-distributed-deep-learning/}},
+ year = {2020},
+ note = {Accessed: 2025-05-24}
+}
+
+@misc{wikiCollectiveOp,
+ title = {Collective operation},
+ howpublished = {\url{https://en.wikipedia.org/wiki/Collective_operation}},
+ note = {Accessed: 2025-05-24}
+}
\ No newline at end of file
diff --git a/assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png b/assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png
new file mode 100644
index 000000000..be8a59f41
Binary files /dev/null and b/assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png differ
diff --git a/assets/img/2025-04-28-LLM_System_Pool/attacc.png b/assets/img/2025-04-28-LLM_System_Pool/attacc.png
new file mode 100644
index 000000000..b86cfafae
Binary files /dev/null and b/assets/img/2025-04-28-LLM_System_Pool/attacc.png differ
diff --git a/assets/img/2025-04-28-LLM_System_Pool/transformer.png b/assets/img/2025-04-28-LLM_System_Pool/transformer.png
new file mode 100644
index 000000000..a8ccbba38
Binary files /dev/null and b/assets/img/2025-04-28-LLM_System_Pool/transformer.png differ