token. These phases are distinguished by their computational characteristics: Prefill is compute-bound, while Decoding is memory-bound.
+
+
+
+## NeuPIMs & PIM is All You Need
+- "what contributions did this work make, and what impact should this work have?"
+- "how new is this effort?"
+
+> Prefill and Decoding from the Perspective of Actual Computation
+- Prefill
+ During the prefill stage, LLM computations are primarily composed of matrix-matrix multiplications, i.e., GEMM operations.
+ The input $X:[\text{N}_{\text{prompt}}, d_{\text{emb}}]$ is structured as a matrix, where $\text{N}_{\text{prompt}}$ denotes the number of prompt tokens, and $d_{\text{emb}}$ is the dimensionality of the embedding vector representing each token.
+ - MHA Block
+ - QKV Generation (GEMM):
+ $$ XW^Q,\ XW^K,\ XW^V : [\text{N}_{\text{prompt}}, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}] = [\text{N}_{\text{prompt}}, d_{\text{emb}}]$$
+ - Attention :
+ For convenience, the Logit, Softmax, and Attend steps are collectively referred to as the attention process.
+ Each head takes a $$\frac{d_{\text{emb}}}{H}$$-wide slice of the $$Q, K, V$$ matrices ($$H = \text{number of heads}$$) and processes its own attention in parallel.
+ The operation $$O = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ is computed in the form of **GEMM** as follows:
+ $$Q \times K^T \times V: [N_{prompt}, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prompt}] \times [N_{prompt}, \frac{d_{emb}}{H}]$$
+ ※ (Note: $\sqrt{d_k}$ is a scalar and is omitted in computation.)
+ - Concat: $$ \text{Concat}([head_1:[N_{prompt}, \frac{d_{emb}}{H}], ..., head_h:[N_{prompt}, \frac{d_{emb}}{H}]]) :[\text{N}_{prompt}, d_{emb}]$$
+ - FF Block
+ - Feed Forward 1: $$ Z = XW_1+ b_1 \text{ where, } XW_1:[\text{N}_{prompt}, d_{emb}]\times[d_{emb},4\times d_{emb}]$$ is computed using GEMM.
+ - GeLU: A simple element-wise activation.
+ - Feed Forward 2: $$ \text{Output} = ZW_2 + b_2 \text{ where, } ZW_2:[\text{N}_{prompt}, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$$ is also computed as GEMM.
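+
+To make these shapes concrete, here is a minimal NumPy sketch of one prefill layer. All sizes are illustrative rather than taken from any particular model, and ReLU stands in for GeLU:
+
+```python
+import numpy as np
+
+N_prompt, d_emb, H = 512, 4096, 32      # illustrative sizes
+d_h = d_emb // H                        # per-head dimension
+
+X = np.random.randn(N_prompt, d_emb)    # prompt token embeddings
+W_q, W_k, W_v = (np.random.randn(d_emb, d_emb) for _ in range(3))
+
+Q, K, V = X @ W_q, X @ W_k, X @ W_v     # QKV generation: three GEMMs
+
+heads = []
+for h in range(H):                      # each head works on a d_emb/H slice
+    q, k, v = (M[:, h*d_h:(h+1)*d_h] for M in (Q, K, V))
+    logits = q @ k.T / np.sqrt(d_h)     # GEMM: [N_prompt, N_prompt]
+    p = np.exp(logits - logits.max(-1, keepdims=True))
+    p /= p.sum(-1, keepdims=True)       # softmax
+    heads.append(p @ v)                 # GEMM: [N_prompt, d_h]
+O = np.concatenate(heads, axis=1)       # concat: [N_prompt, d_emb]
+
+W1, W2 = np.random.randn(d_emb, 4*d_emb), np.random.randn(4*d_emb, d_emb)
+Z = np.maximum(O @ W1, 0)               # FF1 (GEMM) + activation
+out = Z @ W2                            # FF2 (GEMM)
+```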
+
+
+
+- Decoding
+During the Decoding phase, each request reuses previously computed Key and Value matrices, which are stored in memory as KV Cache and concatenated back during computation. Since the Query corresponds to **only one newly generated token**, it exists in the form of a vector. As a result, the attention operation involves multiple GEMV computations along with heavy KV Cache memory loads, making this phase predominantly memory-bound.
+
+ - MHA Block: Emphasis on **GEMV** operations increases
+ - QKV Generation (GEMV): $$XW^Q,\ XW^K,\ XW^V: [1, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}]$$ Each of these is a vector-matrix multiplication.
+ For Key and Value, the previous KV Cache is loaded from memory and concatenated. The KV matrices have shape:
+ $$K,V: [N_{prev}+1, d_{emb}]$$
+ Although a single request may appear as a GEMV, multiple decoding requests share the same weights. Thus, in practice, this is processed as a GEMM with shape:
+ $$[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$$
+ - Attention : $$Q \times K^T \times V: [1, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prev}+1] \times [N_{prev}+1, \frac{d_{emb}}{H}]$$
+ This involves heavy KV Cache loading from memory, and the operation is typically memory-bound.
+ - Concat: $$ \text{Concat}([head_1:[1, \frac{d_{emb}}{H}], ..., head_h:[1, \frac{d_{emb}}{H}]]) :[1, d_{emb}]$$
+ - FF Block
+ - Feed Forward 1: $$Z = XW_1+ b_1\text{ where, }XW_1:[1, d_{emb}]\times[d_{emb},4\times d_{emb}]$$
+ This is processed as a GEMV operation.
+ - GeLU: A simple element-wise scalar operation.
+ - Feed Forward 2:
+ $$ \text{Output} = ZW_2 + b_2 \text{ where, } ZW_2:[1, 4\times d_{emb}]\times[4\times d_{emb},d_{emb}]$$
+ While each request individually resembles a GEMV, since all decoding requests share the same weights and the FF block is a linear operation, multiple requests can be batched and computed together. This results in a GEMM operation of the form:
+ $$[N_{batches}, d_{emb}] \times [d_{emb}, 4\times d_{emb}] \times [4\times d_{emb}, d_{emb}]$$
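+
+The same layer at one decoding step, continuing the illustrative sketch from the prefill section; per-head attention now degenerates to GEMV against the growing KV cache:
+
+```python
+import numpy as np
+
+d_emb, H = 4096, 32
+d_h, N_prev = d_emb // H, 511           # 511 tokens already cached
+
+x = np.random.randn(1, d_emb)           # one new token -> Q is a vector
+W_q, W_k, W_v = (np.random.randn(d_emb, d_emb) for _ in range(3))
+q_new, k_new, v_new = x @ W_q, x @ W_k, x @ W_v   # GEMV per request
+
+# Load the KV cache from memory and append the new entries
+K = np.vstack([np.random.randn(N_prev, d_emb), k_new])  # [N_prev+1, d_emb]
+V = np.vstack([np.random.randn(N_prev, d_emb), v_new])
+
+outs = []
+for h in range(H):
+    q = q_new[:, h*d_h:(h+1)*d_h]       # [1, d_h]
+    k = K[:, h*d_h:(h+1)*d_h]           # [N_prev+1, d_h] streamed from memory
+    v = V[:, h*d_h:(h+1)*d_h]
+    logits = q @ k.T / np.sqrt(d_h)     # GEMV: [1, N_prev+1]
+    p = np.exp(logits - logits.max()); p /= p.sum()
+    outs.append(p @ v)                  # GEMV: [1, d_h]
+o = np.concatenate(outs, axis=1)        # [1, d_emb]
+```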
+
+
+In summary, Prefill computations are predominantly in the form of GEMM operations. Since this is the initial stage, there is no prior KV Cache per request, allowing the operations to be processed in a compute-bound manner. In contrast, during Decoding, each request maintains its own KV Cache, and as the sequence length grows, memory bottlenecks become more severe due to increased memory load in the attention mechanism.
+
+
+Based on the distinct computational characteristics of prefill and decoding phases, this section introduces serving architectures that leverage PIM technologies to handle long-context LLMs efficiently in terms of memory and compute.
+
+- NeuPIMs proposes a redesigned architecture where the memory functionality and GEMV computation of PIM can be executed in parallel. This allows the main accelerator (e.g., GPU) to focus on compute-intensive GEMM operations in the prefill phase, while the memory-bound GEMV operations in the decoding phase are offloaded to PIM units and processed asynchronously.
+
+- In PIM is All You Need, the authors observe that the Prefill phase contributes only a small portion of the end-to-end LLM workload. Leveraging this, they propose eliminating power-hungry GPUs/TPUs and introduce a highly power-efficient LLM PIM serving system based on PNM (a type of NDP) and multiple PIM units to handle the entire pipeline efficiently.
+
+> NeuPIMs Architecture Introduction
+
+
+NeuPIMs addresses a key limitation of traditional PIM architectures, where memory mode and PIM mode (for GEMV operations) could not be executed simultaneously. To overcome this, NeuPIMs introduces an architecture that integrates a lightweight NPU and advanced PIM within the same chip, enabling efficient processing of decoding attention operations.
+
+In particular, traditional PIM units are located near memory and share the same buffer for both memory load operations and GEMV computations, making concurrent execution infeasible. To resolve this, NeuPIMs implements a dual-buffer system, allowing memory loading and GEMV execution to occur in parallel, thereby improving decoding efficiency and overall throughput.
+
+
+By employing a dual-buffer system, NeuPIMs enables the batching of N requests into two sub-batches of N/2 each. While one sub-batch's memory-bound attention computations are handled by the PIM units, the other sub-batch simultaneously performs the compute-bound QKV generation and feed-forward network (FFN) computations on the NPU.
+
+This overlapping of memory-bound and compute-bound workloads allows NeuPIMs to effectively mitigate both memory and compute bottlenecks, leading to improved parallelism and higher overall throughput in LLM serving.
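+
+The sub-batch interleaving can be pictured with a toy schedule. This is our own illustrative pseudocode of the idea, not code from the paper; in hardware, the two stages of each step run concurrently:
+
+```python
+requests = list(range(8))
+sub_a, sub_b = requests[:4], requests[4:]   # split the batch into two halves
+
+def npu_stage(batch):    # QKV generation + FFN (compute-bound GEMM)
+    return f"NPU GEMM on {batch}"
+
+def pim_stage(batch):    # decoding attention over KV cache (memory-bound GEMV)
+    return f"PIM GEMV on {batch}"
+
+for step in range(4):
+    if step % 2 == 0:
+        pair = (npu_stage(sub_a), pim_stage(sub_b))
+    else:                                   # the sub-batches swap roles
+        pair = (pim_stage(sub_a), npu_stage(sub_b))
+    print(f"step {step}: {pair[0]}  ||  {pair[1]}")
+```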
+
+> NeuPIMs Results
+
+
+When NeuPIMs offloads GEMV operations to PIM and delegates GEMM operations to the NPU, it achieves over 1.5× improvement in LLM serving performance compared to a traditional NPU-only system. This performance gain is attributed to the efficient division of labor between memory-bound and compute-bound tasks, enabled by the hybrid PIM-NPU architecture.
+
+
+> PIM is All You Need
+
+The paper presents an architecture designed to address the increasing context length in LLMs by leveraging the high energy efficiency of PIM compared to GPUs and TPUs. In this architecture, PIM units are responsible for GEMV operations, while a custom-designed low-power PNM (Processing-Near-Memory) device, placed near the DRAM controller, handles GEMM computations.
+
+The proposed PNM is not limited to GEMM; it also includes lightweight components such as reduce trees for softmax, exponent processors, and RISC-V cores to support essential functions like activation operations (e.g., GeLU, ReLU). This co-design enables efficient and low-power LLM serving by distributing tasks to specialized near-memory processing elements.
+
+
+
+In NeuPIMs, all operations except for the GEMV in decoding are handled by the NPU. In contrast, PIM is All You Need takes a different approach: it offloads all operations except for attention to the PNM device, which is placed near the DRAM controller. These operations are then executed by broadcasting and gathering data across multiple devices, enabling efficient distributed execution across a network of lightweight, near-memory processing units.
+
+
+
+> PIM is All You Need Results
+
+
+ In PIM is All You Need, experimental results show that as the number of decoding output tokens increases, the system achieves lower latency compared to traditional GPUs. This demonstrates the effectiveness of the architecture in handling long decoding phases.
+
+ However, the paper also shows that during the prefill phase, the PNM performs noticeably worse than GPUs, so performance gains become harder to achieve as the prefill size grows. The authors acknowledge this trade-off as a limitation: the architecture is particularly advantageous for decoding-heavy workloads but may underperform when prefill dominates.
+
+### Limitation
+- **Changes in Attention Mechanisms and the Decline of GEMV Usage**
+In recent models such as **LLaMA3** and **DeepSeek LLM**, the traditional **Multi-Head Attention (MHA)** mechanism has evolved into variants like **GQA (Group-Query Attention)** and **MQA (Multi-Query Attention)** [GQA paper](https://arxiv.org/pdf/2305.13245). In **MQA**, all attention heads share a **single KV Cache**, whereas in **GQA**, groups of heads share **a subset** of the KV Cache.
+
+
+As a result, during **decoding**, the queries of the heads that share a KV cache, each originally a **vector with a single row**, are stacked into a **matrix** with
+
+$$
+\frac{\text{Original Heads}}{\text{Number of Groups}}
+$$
+
+rows under GQA or MQA. This shifts the operation from a **GEMV** to a **GEMM**, introducing computational overhead.
+
+This transformation poses a challenge for **PIM or NDP architectures**, which are typically optimized for **GEMV-style operations** in decoding. Thus, **GQA and MQA may reduce the processing efficiency** of such memory-centric accelerators compared to standard MHA.
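+
+A small sketch of this effect, with illustrative head counts: under GQA, the queries of all heads that share one KV head are stacked, so the logit step becomes a thin GEMM rather than a GEMV:
+
+```python
+import numpy as np
+
+d_h, N_ctx = 128, 1024
+H_q, H_kv = 32, 8                       # 32 query heads over 8 shared KV heads
+group = H_q // H_kv                     # 4 query heads share each KV head
+
+K = np.random.randn(H_kv, N_ctx, d_h)   # one KV cache slice per KV head
+V = np.random.randn(H_kv, N_ctx, d_h)
+q = np.random.randn(H_q, d_h)           # one new token: a row per query head
+
+for g in range(H_kv):
+    Qg = q[g*group:(g+1)*group]             # [group, d_h]: rows stack up
+    logits = Qg @ K[g].T / np.sqrt(d_h)     # GEMM, not GEMV: [group, N_ctx]
+    p = np.exp(logits - logits.max(-1, keepdims=True))
+    p /= p.sum(-1, keepdims=True)
+    out_g = p @ V[g]                        # GEMM: [group, d_h]
+```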
+
+
+## Conclusion
+* **What efforts have been made, and how should optimization be approached?**
+This study analyzes the computational differences between **prefill** and **decoding** phases in large language model (LLM) serving and proposes a structural approach to address the resulting **memory bottleneck**.
+
+Traditional GPU and TPU-based accelerators are optimized for **compute-bound operations (GEMM)** due to their high FLOPS capabilities. However, they face significant **bandwidth bottlenecks** when performing **memory-bound operations (GEMV)**, particularly during the decoding phase of LLMs where frequent KV cache loads dominate.
+
+To overcome this issue, recent research has introduced **Processing-In-Memory (PIM)** and **Near-Data Processing (NDP)** technologies.
+
+* **NeuPIMs** addresses the structural limitation of conventional PIMs, which could not handle memory access and computation simultaneously. It introduces a **dual-buffer system** and an **NPU-PIM integration** strategy that enables **GEMM operations** (e.g., QKV projection and FFN) to be handled by the NPU, while **GEMV operations** (e.g., decoding attention) are offloaded to PIM units for **parallel and efficient execution**.
+
+* In contrast, **PIM is All You Need** pursues **maximum power efficiency** by removing GPUs and NPUs altogether. It utilizes a **PIM + PNM** architecture to process both GEMV and other computations near the memory controller, significantly improving both **latency** and **energy efficiency**.
+
+These architectures share a common strategy: offloading **decoding-phase GEMV operations to memory-proximal units**, thereby **mitigating latency bottlenecks** in LLM serving.
+
+However, modern LLMs are shifting from **MHA to GQA/MQA** attention mechanisms, in which shared KV cache structures **transform attention from GEMV to GEMM** computations. This shift potentially **undermines the efficiency** of PIM-based architectures, which are optimized for GEMV.
+
+As such, future research must consider **rearchitecting attention layers** and computation pipelines to be **PIM/NDP-friendly**, ensuring compatibility and sustained efficiency in the face of evolving LLM architectures.
+
+
+## Related Work
+> KV-Cache Managing
+
+Unlike the **prefill** phase, during **decoding**, the size of the **KV Cache increases for each request** as more tokens are generated. This makes it increasingly difficult for accelerators to load and process the entire KV Cache at once efficiently. To address this challenge, several optimization strategies have been proposed:
+
+* **PagedAttention** introduces a **non-contiguous paging mechanism** for managing KV Cache, moving away from the traditional contiguous memory layout. This allows for more scalable and efficient memory handling as the cache grows; a sketch of the paging idea follows at the end of this subsection.
+
+* **FlashAttention** is another line of work that optimizes the **attention computation based on GPU hardware architecture**, significantly reducing memory overhead and improving throughput by recomputing or streaming attention scores rather than storing large intermediate buffers.
+
+Beyond single-GPU systems, more recent studies have explored **KV Cache Pooling Systems**, which aim to **scale KV Cache bandwidth** by distributing and managing cache loads across multiple devices or memory subsystems. These innovations collectively aim to make **decoding more scalable and memory-efficient**, especially in long-context or multi-request LLM serving scenarios.
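+
+As a rough illustration of the paging idea behind PagedAttention, the sketch below keeps per-request block tables over a shared physical block pool. The block size and helper functions are our own illustrative choices, not vLLM's actual API:
+
+```python
+import numpy as np
+
+BLOCK, d = 16, 128                      # tokens per KV block (illustrative)
+pool = np.zeros((64, BLOCK, d))         # physical KV blocks shared by requests
+free = list(range(64))                  # free physical block ids
+block_table = {}                        # request -> list of physical block ids
+
+def append_kv(req, pos, kv_vec):
+    """Store the KV vector of token `pos` for request `req`."""
+    blocks = block_table.setdefault(req, [])
+    if pos // BLOCK == len(blocks):     # grow by one non-contiguous block
+        blocks.append(free.pop())
+    pool[blocks[pos // BLOCK], pos % BLOCK] = kv_vec
+
+def gather_kv(req, length):
+    """Reassemble the logically contiguous KV cache for attention."""
+    flat = np.concatenate([pool[b] for b in block_table[req]])
+    return flat[:length]
+
+append_kv("r0", 0, np.ones(d))
+append_kv("r0", 1, 2 * np.ones(d))
+print(gather_kv("r0", 2).shape)         # (2, 128)
+```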
+
+
+
+.
+
+## Citation
+- Attacc!
+- NeuPIMs
+- PIM is All You Need
+- [GQA](https://arxiv.org/pdf/2305.13245)
\ No newline at end of file
diff --git a/_posts/2025-04-28-NIC.md b/_posts/2025-04-28-NIC.md
new file mode 100644
index 000000000..f255ccdaa
--- /dev/null
+++ b/_posts/2025-04-28-NIC.md
@@ -0,0 +1,243 @@
+---
+layout: distill
+title: Optimizing Large-Scale LLM Serving in Distributed Systems
+description: Our blog post will focus on optimizing the serving of large-scale language models in distributed systems, with an emphasis on improving memory efficiency and reducing latency. We will discuss strategies for optimizing memory layout, execution scheduling, and batching to enhance the throughput of AI model inference. Additionally, the post will examine the role of SmartNICs in offloading certain tasks in data centers, reducing CPU load, and improving communication between compute nodes. Through this, we aim to highlight the importance of networking optimizations for efficient ML serving in real-world systems.
+date: 2025-04-28
+future: true
+htmlwidgets: true
+hidden: true
+
+# Anonymize when submitting
+# authors:
+# - name: Anonymous
+
+authors:
+ - name: Bae Junhyeong
+ url: "https://github.com/20190511"
+ affiliations:
+ name: POSTECH, Pohang University of Science and Technology
+ - name: Kang Sungwook
+ url: "https://github.com/rkdtjddnr"
+ affiliations:
+ name: POSTECH, Pohang University of Science and Technology
+
+# must be the exact same name as your blogpost
+bibliography: 2025-04-28-NIC.bib
+
+# Add a table of contents to your post.
+# - make sure that TOC names match the actual section names
+# for hyperlinks within the post to work correctly.
+# - please use this format rather than manually creating a markdown table of contents.
+toc:
+  - name: Abstract
+  - name: Background
+    subsections:
+    - name: Distributed Deep Learning
+    - name: Intra-node & Inter-node Communication
+    - name: Collective Communication
+    - name: SmartNIC
+  - name: Various techniques for efficient serving system
+  - name: Conclusion
+  - name: Related work
+  - name: Citation
+
+# Below is an example of injecting additional post-specific styles.
+# This is used in the 'Layouts' section of this post.
+# If you use this post as a template, delete this _styles block.
+_styles: >
+ .fake-img {
+ background: #bbb;
+ border: 1px solid rgba(0, 0, 0, 0.1);
+ box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
+ margin-bottom: 12px;
+ }
+ .fake-img p {
+ font-family: monospace;
+ color: white;
+ text-align: left;
+ margin: 12px 0;
+ text-align: center;
+ font-size: 16px;
+ }
+---
+
+## Abstract
+- "what problem is this work trying to tackle?"
+- "how new is this effort?" (introduction, overview)
+
+Our blog post will focus on optimizing the serving of large-scale AI models in distributed systems, with an emphasis on improving memory efficiency and reducing latency.
+
+#### System architecture aspect
+
+#### Network aspect
+As deep learning models, including large language models (LLMs), continue to grow in scale, it has become difficult to train them on a single GPU. This has led to growing interest in **Distributed Deep Learning** (DDL), which trains models in parallel across multiple hardware devices. While DDL offers clear scalability advantages, it introduces a critical problem: communication overhead between devices.
+In particular, **intra-node communication** (GPU-to-GPU within a server) can be handled efficiently through high-performance communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library), whereas **inter-node communication** (between GPU systems) goes over Ethernet equipment and therefore runs into physical bandwidth and latency limits.
+Intelligent network interface cards such as SmartNICs have attracted attention as a way to overcome these limits, and this blog ***presents, based on recent research, a perspective on optimizing inter-node communication overhead using SmartNICs***.
+
+
+
+## Background
+
+### Distributed Deep Learning
+
+As deep learning models grow in scale, the memory and compute resources of a single GPU are no longer enough to train large models. One approach to this problem is **Distributed Deep Learning** (DDL). DDL distributes model parameters or data across multiple GPUs or nodes and trains in parallel. It is broadly divided into Data Parallelism and Model Parallelism, and hybrid forms are also widely used these days.
+
+ - **Data Parallelism** is a training method for cases with large amounts of training data: the data is split across multiple GPUs so the model can be trained in parallel. However, since each GPU holds weights updated only from its own data shard, a synchronization step is required to aggregate and redistribute the weight parameters learned by the GPUs. The Collective Communication described later is needed at this point, and the communication overhead becomes worse for inter-node communication.
+ - **Model Parallelism** is used when the model is too large to fit on a single GPU, so the model itself is partitioned across GPUs. It is implemented in two main ways: (1) Tensor Parallelism and (2) Pipeline Parallelism. Model Parallelism also incurs synchronization overhead, and more frequently than Data Parallelism, because the partitioned model forces GPUs to synchronize intermediate computation results along the way.
+
+ Thus, while Distributed Deep Learning shortens overall training time and makes larger models tractable, the mandatory synchronization process can create a new problem: communication overhead between GPUs. If models and data grow further and more GPU systems are needed, this communication overhead will grow as well.
+
+### Intra-node & Inter-node Communication
+
+In distributed training, cooperation among multiple GPUs is essential, and two levels of communication arise in the process. **Intra-node communication** refers to communication between GPUs within a single server; it can be handled relatively quickly through high-speed interconnects (e.g., NVLink) and libraries (e.g., NCCL). In contrast, **inter-node communication** refers to communication between different servers, typically over networks such as Ethernet or InfiniBand. In this case, network bandwidth and latency constraints can degrade performance. In particular, as explained above, the parameter synchronization performed repeatedly during training is a major source of GPU-to-GPU communication overhead, and this overhead is even larger for inter-node communication.
+
+### Collective Communication
+In DDL, **Collective Communication** is essential for synchronizing model parameters and efficiently distributing or gathering data. It refers to communication patterns in which data is exchanged or combined across multiple processes or GPUs, and is typically categorized as follows; a small simulation of these patterns follows the list.
+
+* **1:N communication** pattern
+  * Broadcast : one node sends its data to all other nodes.
+  * Scatter : one node splits its data into pieces and distributes them to the other nodes.
+  * Gather : data from multiple nodes is collected onto a single node.
+  * Reduce : data from multiple nodes is combined through a specific operation (e.g., sum) and delivered to a single node.
+* **N:N communication** pattern
+  * AllGather : all nodes share their data so that every node ends up holding the full data.
+  * AllReduce : the result of combining all nodes' data is redistributed to every node.
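+
+For concreteness, here is a tiny NumPy simulation of these patterns over an in-process list of "nodes". It mirrors only the semantics of each pattern, not an actual network implementation:
+
+```python
+import numpy as np
+
+nodes = [np.arange(4) + 4 * r for r in range(4)]   # each node holds 4 values
+
+bcast = [nodes[0].copy() for _ in nodes]           # Broadcast: node 0 -> all
+scatter = np.array_split(nodes[0], len(nodes))     # Scatter: node 0 -> one chunk each
+gather = np.concatenate(nodes)                     # Gather: all -> node 0
+reduced = np.sum(nodes, axis=0)                    # Reduce: sum -> node 0
+allgather = [np.concatenate(nodes) for _ in nodes] # AllGather: everyone holds everything
+allreduce = [np.sum(nodes, axis=0) for _ in nodes] # AllReduce: reduced sum -> everyone
+```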
+
+### SmartNIC
+A Network Interface Card (NIC) is a hardware device used to connect a system to a network for communication. Conventional NICs cannot perform complex computation; they handle simple network-related operations or forward received network packets to the host CPU. A SmartNIC, in contrast, is an Ethernet device that adds cores to the conventional NIC so it can serve more general-purpose uses. Recent research on SmartNICs therefore mainly offloads tasks to them, reducing the host CPU's burden while also making communication more efficient.
+
+SmartNICs are broadly classified into two types: On-path and Off-path.
+ * **On-path SmartNICs** make the NIC cores themselves programmable. Beyond network operations, various computations can be offloaded and processed directly on the NIC cores. However, processing many heavy computations can delay network handling, and the programming complexity of using the NIC cores is very high.
+ * **Off-path SmartNICs**, unlike On-path designs, place cores separate from the NIC cores. Since computation runs on the separate compute cores, network performance is unaffected. There is some communication overhead in accessing the compute cores and memory, but they are much easier for programmers to use, so most research adopts Off-path SmartNICs.
+
+Accordingly, this blog also covers papers that mainly use Off-path SmartNICs and explores how they can be leveraged to solve the problems described above.
+
+
+
+## Various techniques for efficient serving system
+- "what contributions did this work make, and what impact should this work have?"
+- "how new is this effort?"
+
+### Optimizing inter-node communication with SmartNIC
+As model sizes and datasets grow and server systems scale accordingly, many studies have emerged that identify and try to solve the inter-node communication problem. These efforts take two directions: algorithmic/software solutions and hardware acceleration. Among them, hardware acceleration is regarded as the viable approach, and many papers introduce INA (In-Network Aggregation) as the most popular method. INA uses a network switch as an aggregator to offload and accelerate Collective Communication such as AllReduce. However, because INA relies on network switches, it faces limitations such as hardware resource constraints.
+The papers introduced below therefore use SmartNICs, a modern device, as aggregators instead of network switches.
+#### SmartNIC for Ring-AllReduce
+>Proposed technique
+
+We first introduce ***DirectReduce***, a technique that offloads Ring-AllReduce onto the SmartNIC. The paper points out the following inefficiency in conventional Ring-AllReduce communication.
+
+Looking at the figure, when the NIC sends the A1 data to Node B, the reduce operation runs on Node B and the result is sent back to the NIC for inter-node communication toward Node C. The authors reasoned that it would be more efficient not to send the A1 data to Node B at all, and instead only fetch the B1 data into the NIC and perform the reduce operation there. The efficiency gain comes from two aspects:
+1. Node B can continue the computation it was already running without interruption.
+2. The unnecessary communication between Node B and the NIC disappears.
+
+A communication path that reflects this can be envisioned as follows.
+
+The unnecessary data movement disappears, and the reduce operation runs on the NIC without disturbing Node B. Since the NIC fetches the data via DMA, the host Node B is not involved and can keep executing its ML workload.
+
+The architecture proposed to realize this communication path is shown below; three components are added to the NIC. (A reference sketch of the underlying Ring-AllReduce follows the list.)
+
+* GateKeeper : manages prefetching and buffer optimization for Ring-AllReduce data transfers
+* DataDirector : redirects incoming packet flows into the RNIC for direct in-NIC processing
+* ComputeEnhancer : performs the reduce operation directly on the RNIC via an FPGA or similar (SmartNIC)
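+
+For reference, a plain host-side Ring-AllReduce can be simulated as below. Under DirectReduce, the per-step receive-and-reduce would happen on the NIC rather than on the host node, which this plain sketch does not model:
+
+```python
+import numpy as np
+
+N = 4                                        # nodes in the ring
+data = [np.random.randn(8) for _ in range(N)]
+total = np.sum(data, axis=0)
+chunks = [list(np.array_split(d.copy(), N)) for d in data]
+
+# Reduce-scatter: at each step, node r forwards chunk (r - step) % N to
+# node r+1, which reduces it into its own copy (on the host in plain
+# Ring-AllReduce; on the NIC in DirectReduce).
+for step in range(N - 1):
+    sends = [chunks[r][(r - step) % N].copy() for r in range(N)]
+    for r in range(N):
+        chunks[(r + 1) % N][(r - step) % N] += sends[r]
+
+# AllGather: circulate the fully reduced chunks around the ring.
+for step in range(N - 1):
+    sends = [chunks[r][(r + 1 - step) % N].copy() for r in range(N)]
+    for r in range(N):
+        chunks[(r + 1) % N][(r + 1 - step) % N] = sends[r]
+
+assert all(np.allclose(np.concatenate(c), total) for c in chunks)
+```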
+
+>Evaluation results
+
+The experiments assume two network topology environments commonly used in real systems:
+1. Ring topology : 8 nodes
+2. 6D-torus topology : 729, 4096, 46656, and 262144 nodes, corresponding to 3, 4, 6, and 8 nodes per dimension
+The architecture was implemented with Xilinx Vivado, and the overall simulation was carried out with Astra-Sim2.
+
+Since the 6D-torus results closely resemble those of the ring topology, only the ring topology results are presented here.
+* Small message size (1KiB, 256KiB, respectively)
+
+ This graph compares latency as the number of processes per node increases when the message size is small. Because the computation overhead is low and the messages to send are small, the latency difference from naive Ring-AllReduce is minimal.
+* Large message size (256MiB, 1GiB, respectively)
+
+ In contrast, when the message size is large, the benefit of performing the reduce operation on the NIC grows: processes running the AI workload produce results quickly without being disturbed by AllReduce, while the NIC takes charge of the reduce operations on those results.
+
+
+In conclusion, the paper draws the following insight from its results. DirectReduce provides better performance than Ring AllReduce for medium-sized messages, but the benefit depends on the message size $M$ and the number of processes $N$. When the data per process $M/N$ is at most 1KB there is almost no difference, but when $M/N > 1\text{KB}$ the stream-aggregation-based pipeline cuts latency by 15% to 36% relative to Ring AllReduce. In general, larger $N$ and larger $M$ increase DirectReduce's benefit, since more packets are generated and more parallelism is possible. However, when $M$ is below 1MB and $N$ grows, the total packet count shrinks (in some cases to one), weakening the pipelining effect and limiting the gain. This is summarized below.
+| Condition | Performance benefit of DirectReduce |
+|----------------------------|--------------------------------------------------|
+| $M/N \leq 1\text{KB}$ | Little to no improvement |
+| $M/N > 1\text{KB}$ | 15% to 36% latency reduction |
+| $N \uparrow$, $M \uparrow$ | More packets and deeper pipelining → larger gains |
+| $M < 1\text{MB}$, $N \uparrow$ | Fewer packets → weaker pipelining, limited benefit |
+
+#### Zero-Sparse AllReduce and SmartNIC offloading
+>Proposed technique
+
+The next paper proposes not only offloading to the SmartNIC but also a Zero-Sparse AllReduce algorithm that reduces the amount of data moved during communication. Since this blog focuses on how SmartNICs are used, we only briefly introduce the Zero-Sparse algorithm.
+* **Zero-Sparse AllReduce**
+
+The authors observed that when the data to be AllReduced (e.g., gradients) contains many zeros, including the zero data in communication is highly inefficient. Modern AI models in particular have very many parameters, and techniques such as pruning are used to optimize them. Such models are also called sparse neural networks, and their hallmark is that a large fraction of the gradients are zero. On this basis, excluding zero data from AllReduce can be judged more communication-efficient, and the proposed Zero-Sparse AllReduce operates as in the figure above. Simply put, the gradient vector used for AllReduce is divided into blocks, and any zero block that contains only zeros is excluded from AllReduce communication. Data movement during communication therefore shrinks, and latency can be expected to drop. (A small sketch of this block filtering follows at the end of this subsection.)
+
+* **Using SmartNIC as aggregator**
+
+Next is the method of using the SmartNIC as the aggregator. The authors point out the SmartNIC's limited memory bandwidth and address it with DCA (Direct Cache Access), letting the NIC access the LLC directly. However, data from many nodes must be aggregated for the reduce operation while the LLC's capacity is limited, so they propose the SmartNIC-internal memory layout shown in the figure above. In brief, Rx/Tx Spots hold and send non-zero blocks; these are RDMA regions, so workers can write to them directly. When a worker writes data into an Rx Spot, the aggregator (SmartNIC) checks the data and its index in the Rx Spot, performs the reduce operation with the data at the same position in the Tx Spot, and updates the result into the Tx Spot. This removes the need to keep every worker's data around and uses the LLC efficiently. The figure also shows two Tx Spots, which exist for double buffering: if the next phase began before the previous phase's AllReduce result had been scattered to all workers, the Tx Spot data would otherwise be overwritten.
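+
+The block-filtering idea can be sketched as follows; the block size and layout are illustrative, not taken from the paper:
+
+```python
+import numpy as np
+
+BLOCK = 4
+grad = np.array([0, 0, 0, 0, 1, 2, 0, 3,
+                 0, 0, 0, 0, 4, 0, 5, 6], dtype=float)
+
+blocks = grad.reshape(-1, BLOCK)
+payload = [(i, b) for i, b in enumerate(blocks) if np.any(b)]
+print(f"sending {len(payload)}/{len(blocks)} blocks")   # sending 2/4 blocks
+
+# Aggregator side: reduce each received block into the matching slot,
+# so only non-zero blocks ever consume bandwidth or reduce cycles.
+agg = np.zeros_like(blocks)
+for i, b in payload:
+    agg[i] += b
+```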
+
+>Evaluation results
+
+The experiments were run on two systems, one with GPU workers and one with CPU workers, using an NVIDIA BlueField-2 as the SmartNIC. A microbenchmark similar to the OSU MPI Microbenchmarks was used; it supports AllReduce benchmarking and, unlike other AllReduce benchmarks, lets the array sparsity be adjusted. With this setup, the paper compares using the SmartNIC versus the host CPU as the aggregator for the reduce operation, and also shows that the proposed Zero-Sparse AllReduce is effective compared to another Sparse AllReduce.
+
+* Comparison of two Sparse AllReduce
+
+These results compare the proposed Zero-Sparse AllReduce (OmNICCL) against another Sparse AllReduce (OmniReduce). The authors claim that OmNICCL outperforms OmniReduce in every respect.
+
+* Comparison of two aggregators
+
+These results vary the number of workers and the block sparsity with the message size fixed at 128MB. OmNICCL* uses the SmartNIC as the aggregator, while OmNICCL uses a regular CPU instead. On the GPU system, when the block sparsity is between 25% and 75%, using the SmartNIC yields a slight reduction in latency.
+
+### Limited performance of SmartNIC
+Both DirectReduce and OmNICCL point out that the SmartNIC's raw performance and memory capacity are limited, and offer solutions for it. Still, as the results show, OmNICCL in particular does not exhibit a dramatic performance improvement when using the SmartNIC, which we attribute to the SmartNIC's remaining hardware resource constraints. One might ask why DirectReduce's gains look much larger: DirectReduce built an ASIC-style, FPGA-based SmartNIC and ran simulation-based experiments, whereas OmNICCL used an actual NVIDIA BlueField-2 product, so the results can differ. DirectReduce also targets larger messages where its gains appear, which should be considered when interpreting the results.
+In conclusion, various studies are under way to exploit SmartNICs, but because SmartNIC performance itself is still modest, most papers point this out and work to extract the maximum possible performance. We believe this problem will resolve itself as SmartNICs continue to improve.
+
+
+## Conclusion
+- What efforts have been made, and how should optimization be approached?
+- Write this as an overall conclusion, referring to the project proposal.
+
+In this blog post, we examined studies that optimize training and inference from two perspectives, in a modern AI landscape of ever-larger model parameters and data sizes.
+
+We also examined network-side optimization techniques for reducing the bottlenecks that arise from communication between server systems.
+Training a modern AI model has outgrown a single GPU: models have become so large and complex that not only are multiple GPUs bundled into one GPU system (node), but several such GPU systems must be used together. A multi-GPU setup speeds up computation, but it also generates a great deal of communication, such as parameter exchange between systems. To optimize this inter-system communication overhead, a growing number of studies attempt to leverage SmartNICs.
+* **DirectReduce** offloads the conventional Ring AllReduce communication operation to the SmartNIC so that the CPU/GPU can devote itself entirely to AI operations.
+* **OmNICCL** proposes the Zero-Sparse AllReduce algorithm to reduce data movement during communication and also offloads the communication operation to the SmartNIC.
+
+However, as introduced above, SmartNICs do not yet have the performance to run heavy, complex computations, and their memory capacity falls far short of a CPU's, which limits offloading data-sensitive workloads. Still, as technology advances and SmartNIC architectures improve, their potential uses will be vast.
+
+In conclusion, optimizing LLM serving is no longer just about improving the model architecture. As model complexity and deployment scale grow, system-level strategies, such as compute-side acceleration and communication-aware design, are becoming increasingly important for efficient training and inference.
+
+## Related work
+Beyond Collective Communication, many studies try to optimize various other operations in server systems through SmartNICs.
+> RPC layer offloading
+
+**RpcNIC** offloads the RPC layer onto the SmartNIC, reducing the host CPU's involvement in RPC processing.
+
+> Storage request offloading
+
+**SmartDS** offloads storage request handling to the SmartNIC.
+
+## Citation (to be uploaded as .bib; for now, just listing the papers that come to mind)
+[pool] -> (0 ~ 10)
+[nic] -> (11 ~ 20)
+
+
+
+Related Paper :
+[Related to Collective Communication]
+- OmNICCL (NAIC'24) : Sparse AllReduce algorithm + SmartNIC offloading
+ - OmniReduce (SIGCOMM'21) : the paper the above work builds on (?)
+- Leveraging SmartNIC for Ring AllReduce offloading (ISPA'24) : Offloading Ring AllReduce to SmartNIC
+- FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems (IEEE CAL'22)
+
+Reference :
+Characterizing Off-path SmartNIC for Accelerating Distributed Systems
+Nvidia. 2023. Nvidia BlueField-2 DPU. https://www.nvidia.com/content/dam/en-
+zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf
+
+References for Background
+https://frankdenneman.nl/2020/02/19/multi-gpu-and-distributed-deep-learning/
+https://en.wikipedia.org/wiki/Collective_operation
\ No newline at end of file
diff --git a/_posts/2025-04-28-NIC_ENG.md b/_posts/2025-04-28-NIC_ENG.md
new file mode 100644
index 000000000..f3339c587
--- /dev/null
+++ b/_posts/2025-04-28-NIC_ENG.md
@@ -0,0 +1,225 @@
+---
+layout: distill
+title: Optimizing Large-Scale LLM Serving in Distributed Systems
+description: Our blog post will focus on optimizing the serving of large-scale language models in distributed systems, with an emphasis on improving memory efficiency and reducing latency. We will discuss strategies for optimizing memory layout, execution scheduling, and batching to enhance the throughput of AI model inference. Additionally, the post will examine the role of SmartNICs in offloading certain tasks in data centers, reducing CPU load, and improving communication between compute nodes. Through this, we aim to highlight the importance of networking optimizations for efficient ML serving in real-world systems.
+date: 2025-04-28
+future: true
+htmlwidgets: true
+hidden: true
+
+# Anonymize when submitting
+# authors:
+# - name: Anonymous
+
+authors:
+ - name: Bae Junhyeong
+ url: "https://github.com/20190511"
+ affiliations:
+ name: POSTECH, Pohang University of Science and Technology
+ - name: Kang Sungwook
+ url: "https://github.com/rkdtjddnr"
+ affiliations:
+ name: POSTECH, Pohang University of Science and Technology
+
+# must be the exact same name as your blogpost
+bibliography: 2025-04-28-NIC.bib
+
+# Add a table of contents to your post.
+# - make sure that TOC names match the actual section names
+# for hyperlinks within the post to work correctly.
+# - please use this format rather than manually creating a markdown table of contents.
+toc:
+  - name: Abstract
+  - name: Background
+    subsections:
+    - name: Distributed Deep Learning
+    - name: Intra-node & Inter-node Communication
+    - name: Collective Communication
+    - name: SmartNIC
+  - name: Various techniques for efficient serving system
+  - name: Conclusion
+  - name: Citation
+
+# Below is an example of injecting additional post-specific styles.
+# This is used in the 'Layouts' section of this post.
+# If you use this post as a template, delete this _styles block.
+_styles: >
+ .fake-img {
+ background: #bbb;
+ border: 1px solid rgba(0, 0, 0, 0.1);
+ box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
+ margin-bottom: 12px;
+ }
+ .fake-img p {
+ font-family: monospace;
+ color: white;
+ text-align: left;
+ margin: 12px 0;
+ text-align: center;
+ font-size: 16px;
+ }
+---
+
+## Abstract
+- "what problem is this work trying to tackle?"
+- "how new is this effort?" (introduction, overview)
+
+Our blog post will focus on optimizing the serving of large-scale AI models in distributed systems, with an emphasis on improving memory efficiency and reducing latency.
+
+#### System architecture aspect
+
+#### Network aspect
+As deep learning models, including large language models (LLMs), continue to grow in scale, it has become increasingly difficult to train them on a single GPU. This has led to a growing interest in **Distributed Deep Learning** (DDL), which enables models to be trained in parallel across multiple hardware devices. While DDL offers clear advantages in scalability, it also introduces a critical challenge: communication overhead between devices. In particular, **intra-node communication** (GPU-to-GPU within a server) can be handled efficiently using high-performance communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library). However, **inter-node communication** (between GPU systems) often relies on Ethernet-based connections, which are inherently limited by physical bandwidth and latency constraints.
+To address these limitations, intelligent network interface cards (SmartNICs) have emerged as a promising solution. In this blog post, we explore ***how recent research suggests that SmartNICs can be leveraged to optimize communication overhead between system nodes in distributed deep learning environments***.
+
+
+
+## Background
+
+### Distributed Deep Learning
+
+As deep learning models continue to grow in size, it has become increasingly difficult to train large-scale models using only the memory and compute resources of a single GPU. One of the key approaches to overcoming this limitation is **Distributed Deep Learning** (DDL). DDL distributes model parameters or data across multiple GPUs or nodes, enabling parallel training. It is generally categorized into **Data Parallelism** and **Model Parallelism**.
+
+ - **Data Parallelism** is a training method designed for scenarios where a large volume of data needs to be processed. It distributes the dataset across multiple GPUs, allowing each GPU to train the same model on a different subset of the data. However, since each GPU only updates the weights based on its local data, a **synchronization** step is required to aggregate and redistribute the updated weight parameters. This is where **Collective Communication** becomes essential, and as we will discuss later, it can lead to increased communication overhead—particularly in inter-node communication.
+ - **Model Parallelism** is a training approach used when the model itself is too large to fit on a single GPU. In this method, the model is divided and distributed across multiple GPUs. There are two main techniques for implementing Model Parallelism: (1) **Tensor Parallelism** and (2) **Pipeline Parallelism**. Synchronization overhead also exists in Model Parallelism, and it tends to occur more frequently than in Data Parallelism. This is because the model is partitioned, requiring GPUs to synchronize their intermediate computation results during training.
+
+ While Distributed Deep Learning enables faster training and the ability to handle larger models, it inherently requires a synchronization process, which introduces a new challenge: communication overhead between GPUs. As the size of the model and dataset continues to grow, requiring more GPU systems to work together, this communication overhead is expected to increase even further.
+
+### Intra-node & Inter-node Communication
+
+In distributed training, cooperation among multiple GPUs is essential, and this involves two levels of communication. **intra-node communication** refers to communication between GPUs within a single server, which can typically be handled efficiently using high-speed interconnects such as NVLink and libraries like NCCL. In contrast, **inter-node communication** refers to communication between GPUs across different servers, which usually takes place over networks such as Ethernet or InfiniBand. In this case, limitations in network bandwidth and latency can lead to performance degradation. As mentioned earlier, parameter synchronization—performed repeatedly during training—is a major source of GPU-to-GPU communication overhead, and this overhead tends to be more severe in inter-node communication.
+
+### Collective Communication
+In DDL, **Collective Communication** is essential for synchronizing model parameters and efficiently distributing or aggregating data. This refers to communication patterns that involve exchanging or combining data across multiple processes or GPUs. These patterns are typically categorized into several common types, as outlined below; a small simulation of their semantics follows the list.
+
+* **1:N communication** pattern
+ * Broadcast: Sends data from a single node to all other nodes.
+ * Scatter: Splits data on one node into multiple parts and distributes them to other nodes.
+ * Gather: Collects data from multiple nodes and aggregates it on a single node.
+ * Reduce: Combines data from multiple nodes using a specified operation (e.g., sum) and delivers the result to a single node.
+* **N:N communication** pattern
+ * AllGather: Each node shares its data with all other nodes, resulting in every node holding the complete set of data.
+ * AllReduce: Data from all nodes is combined using a specified operation (e.g., sum), and the result is distributed back to all nodes.
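+
+For concreteness, here is a tiny NumPy simulation of these patterns over an in-process list of "nodes". It mirrors only the semantics of each pattern, not an actual network implementation:
+
+```python
+import numpy as np
+
+nodes = [np.arange(4) + 4 * r for r in range(4)]   # each node holds 4 values
+
+bcast = [nodes[0].copy() for _ in nodes]           # Broadcast: node 0 -> all
+scatter = np.array_split(nodes[0], len(nodes))     # Scatter: node 0 -> one chunk each
+gather = np.concatenate(nodes)                     # Gather: all -> node 0
+reduced = np.sum(nodes, axis=0)                    # Reduce: sum -> node 0
+allgather = [np.concatenate(nodes) for _ in nodes] # AllGather: everyone holds everything
+allreduce = [np.sum(nodes, axis=0) for _ in nodes] # AllReduce: reduced sum -> everyone
+```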
+
+### SmartNIC
+A **Network Interface Card** (NIC) is a hardware device used to connect a system to a network and enable communication. Traditional NICs are limited in functionality—they typically handle simple network-related operations or forward incoming packets to the host CPU. In contrast, a SmartNIC is an enhanced Ethernet device equipped with onboard processing cores, allowing it to support more general-purpose tasks. Recent research leveraging SmartNICs often focuses on offloading certain tasks from the host CPU, thereby reducing its workload and enabling more efficient communication.
+
+SmartNICs are generally classified into two types: on-path and off-path.
+* **On-path SmartNICs** are designed with programmable NIC cores, allowing them to handle not only basic network operations but also various offloaded computations directly within the NIC itself. While this approach offers flexibility, it has potential drawbacks—heavy computational loads can delay network packet processing, and programming the NIC cores tends to be highly complex.
+* **Off-path SmartNICs**, on the other hand, include separate compute cores that are distinct from the main NIC cores. This design allows offloaded tasks to be processed without interfering with network performance. Although there is some communication overhead when accessing memory and compute resources, off-path SmartNICs are generally more programmer-friendly. As a result, they are more commonly adopted in recent research.
+
+In this blog, we will primarily focus on studies utilizing off-path SmartNICs and explore how they can be leveraged to address the communication challenges outlined earlier.
+
+
+
+## Various techniques for efficient serving system
+- "what contributions did this work make, and what impact should this work have?"
+- "how new is this effort?"
+
+### Optimizing inter-node communication with SmartNIC
+As model sizes and datasets continue to grow, and as server systems scale accordingly, many research efforts have emerged to identify and address the resulting inter-node communication challenges. These efforts generally take two directions: one focusing on algorithmic or software-level solutions, and the other on hardware acceleration. Among these, hardware-based acceleration is increasingly viewed as a viable approach, with In-Network Aggregation (INA) being one of the most popular methods cited across recent studies.
+Traditional INA techniques utilize network switches as aggregators to offload and accelerate collective communication operations such as AllReduce. However, despite their potential, network switches are not well-suited for high-performance computing (HPC) environments.
+Therefore, the papers introduced below explore ***the use of SmartNICs—modern programmable network devices—as aggregators instead of traditional network switches***.
+#### SmartNIC for Ring-AllReduce
+>Proposed technique
+
+We begin by introducing ***DirectReduce***, a technique that offloads Ring-AllReduce operations onto SmartNICs. The paper highlights several inefficiencies in the traditional Ring-AllReduce communication pattern, as outlined below.
+
+Based on the figure, we can observe that when the NIC sends A1 data to Node B, the reduction operation is performed on Node B, and the result is then returned to the NIC to continue inter-node communication with Node C. The authors propose an alternative approach: instead of sending A1 data to Node B, only fetching B1 data from Node B to the NIC and performing the reduction directly on the NIC could be more efficient.
+
+This efficiency can be gained from two main perspectives:
+1. Node B can continue its local computations without being interrupted by additional processing.
+2. Unnecessary communication between Node B and the NIC is eliminated.
+
+Taking these observations into account, a revised communication path can be envisioned as follows.
+
+This approach eliminates unnecessary data movement and enables the reduction operation to be executed directly on the NIC, avoiding interference with Node B. Since the NIC uses DMA to fetch data, the host (Node B) remains uninvolved in the transfer and can continue running its ML workloads without interruption.
+
+To realize the proposed communication path, the authors introduce a new NIC architecture, which includes three key components added to the NIC. (A reference sketch of the underlying Ring-AllReduce follows the list.)
+
+* GateKeeper: Manages prefetching and buffer optimization for data transfers in the Ring-AllReduce process.
+* DataDirector: Redirects incoming data packets into the RNIC’s internal data path for direct handling within the NIC.
+* ComputeEnhancer: Performs reduction operations directly on the RNIC using hardware accelerators such as FPGAs, enabling compute capabilities on the SmartNIC.
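+
+For reference, a plain host-side Ring-AllReduce can be simulated as below. Under DirectReduce, the per-step receive-and-reduce would happen on the NIC rather than on the host node, which this plain sketch does not model:
+
+```python
+import numpy as np
+
+N = 4                                        # nodes in the ring
+data = [np.random.randn(8) for _ in range(N)]
+total = np.sum(data, axis=0)
+chunks = [list(np.array_split(d.copy(), N)) for d in data]
+
+# Reduce-scatter: at each step, node r forwards chunk (r - step) % N to
+# node r+1, which reduces it into its own copy (on the host in plain
+# Ring-AllReduce; on the NIC in DirectReduce).
+for step in range(N - 1):
+    sends = [chunks[r][(r - step) % N].copy() for r in range(N)]
+    for r in range(N):
+        chunks[(r + 1) % N][(r - step) % N] += sends[r]
+
+# AllGather: circulate the fully reduced chunks around the ring.
+for step in range(N - 1):
+    sends = [chunks[r][(r + 1 - step) % N].copy() for r in range(N)]
+    for r in range(N):
+        chunks[(r + 1) % N][(r + 1 - step) % N] = sends[r]
+
+assert all(np.allclose(np.concatenate(c), total) for c in chunks)
+```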
+
+>Evaluation results
+
+The experiments were conducted under two network topology configurations that are commonly used in real-world systems:
+1. Ring topology: 8 nodes
+2. 6D-torus topology: 729, 4,096, 46,656, and 262,144 nodes—corresponding to 3, 4, 6, and 8 nodes per dimension, respectively
+
+The architecture was implemented using **Xilinx Vivado**, and the overall simulation was carried out using **Astra-Sim2**.
+However, since the results observed under the 6D-torus topology closely resemble those from the ring topology, we will focus on presenting the experimental results for the ring topology only.
+* Small message size (1KiB, 256KiB, respectively)
+
+ The graph compares latency as the number of processes per node increases, specifically in scenarios with small message sizes. Since the computation overhead is low and the messages themselves are small, the latency difference between the proposed approach and naive Ring-AllReduce remains minimal.
+* Large message size (256MiB, 1GiB, respectively)
+
+ In contrast, for larger message sizes, the benefit of performing the reduce operation on the NIC becomes more significant. Processes running AI workloads can produce results without being interrupted by AllReduce operations, while the NIC takes full responsibility for handling the reduction of those results.
+
+
+In conclusion, the paper presents the following key insight based on the experimental results: while DirectReduce offers improved performance over traditional Ring AllReduce when handling medium-sized messages, the actual benefit depends on both the message size ($M$) and the number of processes ($N$).
+Specifically, when the amount of data per process ($M/N$) is less than or equal to 1KB, there is little to no performance gain. However, when $M/N > 1\text{KB}$, the stream aggregation-based pipeline allows DirectReduce to reduce latency by 15% to 36% compared to Ring AllReduce. In general, the performance advantage of DirectReduce increases with larger $M$ and greater $N$, as this leads to the generation of more packets and enables more parallelism.
+That said, when the message size $M$ is smaller than 1MB and $N$ becomes large, the total number of packets may decrease (sometimes down to just one), which in turn limits the effectiveness of the pipeline and constrains performance improvement.
+
+The findings can be summarized in the following table.
+| Condition | Performance Benefit of DirectReduce |
+|--------------------------------------|--------------------------------------------------------------|
+| $M/N \leq 1\text{KB}$ | Little to no performance improvement |
+| $M/N > 1\text{KB}$ | 15%–36% latency reduction |
+| $N \uparrow$, $M \uparrow$ | More packets and deeper pipeline → greater performance gain |
+| $M < 1\text{MB}$, $N \uparrow$ | Fewer packets → limited pipeline effect, reduced benefit |
+
+#### Zero-Sparse AllReduce and SmartNIC offloading
+>Proposed technique
+
+The next paper, OmNICCL, introduces not only an offloading mechanism to SmartNICs but also proposes a Zero-Sparse AllReduce algorithm, which aims to reduce the overall amount of data transferred during communication. However, since this blog focuses primarily on SmartNIC-based solutions, we will briefly introduce the Zero-Sparse algorithm and then shift our attention back to the SmartNIC-related aspects.
+* **Zero-Sparse AllReduce**
+
+The authors observed that when performing AllReduce on data such as gradients, it is highly inefficient to include a large number of zero values in the communication. This issue becomes more prominent in modern AI models, which often contain a vast number of parameters. To optimize these models, techniques like pruning are frequently applied, resulting in what are known as sparse neural networks—models in which a significant portion of the gradients are zero.
+Based on this observation, the authors argue that excluding zero values during AllReduce can lead to more efficient communication. The proposed Zero-Sparse AllReduce algorithm operates as illustrated in the figure above. In simple terms, the gradient vector used for AllReduce is divided into blocks. Any block that contains only zeros (a zero block) is excluded from the communication process. As a result, the amount of data transferred is reduced, leading to lower communication latency. (A small sketch of this block filtering follows at the end of this subsection.)
+
+* **Using SmartNIC as aggregator**
+
+The next approach focuses on using SmartNICs as aggregators. The authors highlight the limitation of memory bandwidth on SmartNICs and propose a solution using **Direct Cache Access** (DCA), which allows the NIC to directly access the Last-Level Cache (LLC). However, since the NIC must aggregate data from multiple nodes to perform reduction operations, the limited capacity of the LLC becomes a bottleneck. To address this, the authors propose a specialized memory layout inside the SmartNIC, as shown in the figure above.
+In simple terms, the design introduces Rx and Tx Spots to store non-zero blocks. These spots reside in RDMA-accessible regions, allowing worker nodes to write data directly into them. When a worker writes data to an Rx Spot, the aggregator (SmartNIC) reads the data and its corresponding index, performs a reduction with the existing data in the corresponding Tx Spot, and updates the Tx Spot with the result.
+This approach eliminates the need to store all incoming data from every worker, allowing more efficient use of LLC. Additionally, the figure shows two Tx Spots, which are used for double buffering. This mechanism prevents overwriting Tx data that has not yet been scattered to all workers from the previous AllReduce phase when the next phase begins.
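+
+The block-filtering idea can be sketched as follows; the block size and layout are illustrative, not taken from the paper:
+
+```python
+import numpy as np
+
+BLOCK = 4
+grad = np.array([0, 0, 0, 0, 1, 2, 0, 3,
+                 0, 0, 0, 0, 4, 0, 5, 6], dtype=float)
+
+blocks = grad.reshape(-1, BLOCK)
+payload = [(i, b) for i, b in enumerate(blocks) if np.any(b)]
+print(f"sending {len(payload)}/{len(blocks)} blocks")   # sending 2/4 blocks
+
+# Aggregator side: reduce each received block into the matching slot,
+# so only non-zero blocks ever consume bandwidth or reduce cycles.
+agg = np.zeros_like(blocks)
+for i, b in payload:
+    agg[i] += b
+```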
+
+>Evaluation results
+
+The experiments were conducted on two types of systems: one with GPU workers and another with CPU workers. For the SmartNIC, the authors used the **NVIDIA BlueField-2**. A custom microbenchmark, similar in spirit to the OSU MPI Microbenchmark suite, was used to evaluate AllReduce performance. Unlike typical AllReduce benchmarks, this version allows for configurable array sparsity, enabling more nuanced experimentation.
+Using this setup, the authors compare the performance of executing the reduction operation on a SmartNIC versus on a host CPU. Their results demonstrate that SmartNIC-based aggregation is more efficient. Furthermore, when compared against other sparse AllReduce methods, the proposed Zero-Sparse AllReduce approach shows clear advantages in performance.
+
+* Comparison of two Sparse AllReduce
+
+The experimental results compare the proposed Zero-Sparse AllReduce method, OmNICCL, against an existing sparse AllReduce technique, OmniReduce. The authors claim that OmNICCL consistently outperforms OmniReduce across all evaluated scenarios.
+
+* Comparison of two aggregators
+
+The experiments were conducted by varying the number of workers and block sparsity levels, with a fixed message size of 128MB. Two versions of OmNICCL are presented: OmNICCL* refers to the setup where a SmartNIC is used as the aggregator, while OmNICCL refers to the case where a conventional CPU serves as the aggregator. In GPU-based systems, when the block sparsity ranges from 25% to 75%, OmNICCL* shows a modest reduction in latency compared to the CPU-based version, highlighting the benefit of SmartNIC offloading in sparse communication scenarios.
+
+### Limitation of SmartNIC
+Both **DirectReduce** and **OmNICCL** identify the fundamental limitations of SmartNICs—namely, their limited performance and memory capacity—and propose architectural solutions to address these constraints. However, as seen in the experimental results, especially in the case of OmNICCL, the use of SmartNICs does not lead to dramatic performance improvements. This is likely due to the current hardware resource constraints of commercially available SmartNICs.
+One might wonder why DirectReduce appears to achieve more substantial performance gains. This can be attributed to the fact that DirectReduce was evaluated using a simulation-based setup with an FPGA-based, ASIC-style SmartNIC, while OmNICCL was tested on actual hardware using NVIDIA's BlueField-2. The discrepancy in hardware platforms may explain the difference in results. Additionally, DirectReduce focuses on larger message sizes, which can amplify performance gains; this context should also be considered when interpreting the results.
+In conclusion, while a variety of research efforts are exploring how to best leverage SmartNICs, most studies still point to their limited capabilities as a bottleneck. As such, many works aim to extract the maximum possible performance from current SmartNIC architectures. Nevertheless, this issue is expected to diminish as SmartNIC technology continues to advance in the future.
+
+
+## Conclusion
+- What efforts have been made, and how should optimization be approached?
+- Write this as an overall conclusion, referring to the project proposal.
+
+In this blog post, we explored a range of research efforts aimed at optimizing training and inference in modern AI workloads—particularly those involving large model parameters and datasets—from two key perspectives.
+
+As modern AI models grow in size and complexity, training them often requires more than a single GPU. It is now common to use multi-GPU systems where several GPUs are grouped within a single node—and increasingly, multiple such GPU nodes are used together. While this multi-GPU setup significantly accelerates computation, it also introduces **a substantial amount of communication overhead, especially for exchanging model parameters between systems**. To address this, many recent studies have investigated the use of SmartNICs to optimize inter-system communication.
+* **DirectReduce** proposes offloading the traditional Ring AllReduce communication operation to the SmartNIC, allowing the CPU and GPU to focus solely on AI computations.
+* **OmNICCL** introduces the Zero-Sparse AllReduce algorithm to reduce data movement during communication and offloads the collective operation to the SmartNIC as well.
+
+In conclusion, optimizing LLM serving is no longer just about improving model architecture. As model complexity and deployment scale grow, system-level strategies, such as compute-side acceleration and communication-aware design, are becoming increasingly important for achieving efficient training and inference.
+
+## Citation (to be uploaded as .bib; for now, just listing the papers that come to mind)
+
+
+
diff --git a/_posts/2025-04-28-SampleAndRule.md b/_posts/2025-04-28-SampleAndRule.md
new file mode 100644
index 000000000..1e745aa9b
--- /dev/null
+++ b/_posts/2025-04-28-SampleAndRule.md
@@ -0,0 +1,508 @@
+---
+layout: distill
+title: Sample Blog Post
+description: Your blog post's abstract.
+ Please add your abstract or summary here and not in the main body of your text.
+ Do not include math/latex or hyperlinks.
+date: 2025-04-28
+future: true
+htmlwidgets: true
+hidden: true
+
+# Anonymize when submitting
+# authors:
+# - name: Anonymous
+
+authors:
+ - name: Albert Einstein
+ url: "https://en.wikipedia.org/wiki/Albert_Einstein"
+ affiliations:
+ name: IAS, Princeton
+ - name: Boris Podolsky
+ url: "https://en.wikipedia.org/wiki/Boris_Podolsky"
+ affiliations:
+ name: IAS, Princeton
+ - name: Nathan Rosen
+ url: "https://en.wikipedia.org/wiki/Nathan_Rosen"
+ affiliations:
+ name: IAS, Princeton
+
+# must be the exact same name as your blogpost
+bibliography: 2025-04-28-distill-example.bib
+
+# Add a table of contents to your post.
+# - make sure that TOC names match the actual section names
+# for hyperlinks within the post to work correctly.
+# - please use this format rather than manually creating a markdown table of contents.
+toc:
+ - name: Equations
+ - name: Images and Figures
+ subsections:
+ - name: Interactive Figures
+ - name: Citations
+ - name: Footnotes
+ - name: Code Blocks
+ - name: Diagrams
+ - name: Tweets
+ - name: Layouts
+ - name: Other Typography?
+
+# Below is an example of injecting additional post-specific styles.
+# This is used in the 'Layouts' section of this post.
+# If you use this post as a template, delete this _styles block.
+_styles: >
+ .fake-img {
+ background: #bbb;
+ border: 1px solid rgba(0, 0, 0, 0.1);
+ box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
+ margin-bottom: 12px;
+ }
+ .fake-img p {
+ font-family: monospace;
+ color: white;
+ text-align: left;
+ margin: 12px 0;
+ text-align: center;
+ font-size: 16px;
+ }
+---
+
+Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling.
+
+## Blog Post Example (2024, 2023)
+https://iclr-blogposts.github.io/2024/blog/understanding-icl/
+https://iclr-blogposts.github.io/2024/blog/bench-hvp/
+https://iclr-blogposts.github.io/2024/blog/dpi-fsvi/
+
+https://iclr-blogposts.github.io/2023/blog/2023/adamw/
+
+---
+
+### Plan
+- The unified topic is: optimizing the serving of large-scale language models in distributed systems. Before writing on that topic, each of us writes a post on our own paper.
+- Each of us first writes our own blog in the order below.
+- The previously written Overleaf is https://ko.overleaf.com/project/67ef902314a19d89f597934c ; refer to it.
+  - Bae Junhyeong : 2025-04-28-LLM_System_Pool.md
+  - Kang Sungwook : 2025-04-28-NIC.md
+- After everyone has finished writing, merge the blog posts.
+
+---
+### Narrative order for each blog
+1. [!] Abstract & Background
+- "what problem is this work trying to tackle?"
+- "how new is this effort?" (introduction)
+2. Main (core explanation)
+- "what contributions did this work make, and what impact should this work have?"
+- "how new is this effort?"
+
+3. Results
+- "what are the limitations of this work?"
+4. [!] Conclusion
+- What efforts have been made, and how should optimization be approached?
+5. [!] Citation
+---
+
+※ Items marked with ! should generally be included.
+
+```Text
+- Submissions due: Please submit your posters (via email to me & TA) and blog posts (via PLMS) by the midnight of 5/30.
+
+- Blog submission instructions: Formats
+- The blog posts should be submitted as a url link to your own blog post.
+- You are free to use any any formats.
+ -> But if you need one, see here: https://iclr-blogposts.github.io/2025/submitting/
+- You are expected to host the post through your own [GitHub page].
+- If you need help, please contact the TA.
+
+- Blog submission instructions: Grading
+- As your post (and poster) is supposed to be an academic outcome, please make the followings extra clear to us:
+- "what problem is this work trying to tackle?"
+- "what contributions did this work make, and what impact should this work have?"
+- "how new is this effort?"
+- "what are the limitations of this work?"
+```
+
+
+
+## Equations
+
+This theme supports rendering beautiful math in inline and display modes using [MathJax 3](https://www.mathjax.org/) engine.
+You just need to surround your math expression with `$$`, like `$$ E = mc^2 $$`.
+If you leave it inside a paragraph, it will produce an inline expression, just like $$ E = mc^2 $$.
+
+To use display mode, again surround your expression with `$$` and place it as a separate paragraph.
+Here is an example:
+
+$$
+\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)
+$$
+
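+For reference, the display-mode equation above is written in the markdown source as:
+
+```markdown
+$$
+\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)
+$$
+```
+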
+Note that MathJax 3 is [a major re-write of MathJax](https://docs.mathjax.org/en/latest/upgrading/whats-new-3.0.html)
+that brought a significant improvement to the loading and rendering speed, which is now
+[on par with KaTeX](http://www.intmath.com/cg5/katex-mathjax-comparison.php).
+
+
+## Images and Figures
+
+It's generally better to avoid linking to images hosted elsewhere; links can break, and you
+risk losing important information in your blog post.
+To include images in your submission in this way, you must do something like the following:
+
+```markdown
+{% raw %}{% include figure.html path="assets/img/2025-04-28-distill-example/iclr.png" class="img-fluid" %}{% endraw %}
+```
+
+which results in the following image:
+
+{% include figure.html path="assets/img/2025-04-28-distill-example/iclr.png" class="img-fluid" %}
+
+To ensure that there are no namespace conflicts, you must save your asset to your unique directory
+`/assets/img/2025-04-28-[SUBMISSION NAME]` within your submission.
+
+Please avoid using the direct markdown method of embedding images; they may not be properly resized.
+Some more complex ways to load images (note the different styles of the shapes/shadows):
+
+<div class="row mt-3">
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.html path="assets/img/2025-04-28-distill-example/9.jpg" class="img-fluid rounded z-depth-1" %}
+    </div>
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.html path="assets/img/2025-04-28-distill-example/7.jpg" class="img-fluid rounded z-depth-1" %}
+    </div>
+</div>
+<div class="caption">
+    A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all.
+</div>
+
+<div class="row mt-3">
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.html path="assets/img/2025-04-28-distill-example/8.jpg" class="img-fluid z-depth-2" %}
+    </div>
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.html path="assets/img/2025-04-28-distill-example/10.jpg" class="img-fluid z-depth-2" %}
+    </div>
+</div>
+
+<div class="row mt-3">
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.html path="assets/img/2025-04-28-distill-example/11.jpg" class="img-fluid" %}
+    </div>
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.html path="assets/img/2025-04-28-distill-example/12.jpg" class="img-fluid" %}
+    </div>
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.html path="assets/img/2025-04-28-distill-example/7.jpg" class="img-fluid" %}
+    </div>
+</div>
+
+### Interactive Figures
+
+Here's how you could embed interactive figures that have been exported as HTML files.
+Note that we will be using plotly for this demo, but anything built off of HTML should work
+(**no extra javascript is allowed!**).
+All that's required is for you to export your figure into HTML format, and make sure that the file
+exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory.
+To embed it into any page, simply insert the following code anywhere into your page.
+
+```markdown
+{% raw %}{% include [FIGURE_NAME].html %}{% endraw %}
+```
+
+For example, the following code can be used to generate the figure underneath it.
+
+```python
+import pandas as pd
+import plotly.express as px
+
+df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')
+
+fig = px.density_mapbox(
+ df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
+ center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
+fig.show()
+
+fig.write_html('./assets/html/2025-04-28-distill-example/plotly_demo_1.html')
+```
+
+And then include it with the following:
+
+```html
+{% raw %}
+<div class="l-page">
+  <iframe src="{{ 'assets/html/2025-04-28-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
+</div>
+{% endraw %}
+```
+
+Voila!
+
+
+
+
+
+## Citations
+
+Citations are then used in the article body with the `<d-cite>` tag.
+The key attribute is a reference to the id provided in the bibliography.
+The key attribute can take multiple ids, separated by commas.
+
+The citation is presented inline like this: <d-cite key="gregor2015draw"></d-cite> (a number that displays more information on hover).
+If you have an appendix, a bibliography is automatically created and populated in it.
+
+Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover.
+However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well — the authors are human and it’s nice for them to have the community associate them with their work.
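+
+For example, with an entry whose id is `gregor2015draw` in the bibliography file (the id referenced above), a citation is written in the source as:
+
+```html
+This idea builds on earlier work <d-cite key="gregor2015draw"></d-cite>.
+```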
+
+***
+
+## Footnotes
+
+Just wrap the text you would like to show up in a footnote in a `<d-footnote>` tag.
+The number of the footnote will be automatically generated. This will become a hoverable footnote.
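+
+For instance, a minimal sketch (the surrounding sentence is just an illustration):
+
+```html
+This statement needs a caveat.<d-footnote>The caveat lives here and appears on hover.</d-footnote>
+```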
+
+***
+
+## Code Blocks
+
+This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting.
+It supports more than 100 languages.
+This example is in C++.
+All you have to do is wrap your code in a liquid tag:
+
+{% raw %}
+{% highlight c++ linenos %}
+code code code
+{% endhighlight %}
+{% endraw %}
+
+The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:
+
+{% highlight c++ %}
+
+#include <iostream>
+#include <string>
+#include <cstring>
+using namespace std;
+
+int main(int argc, char const *argv[])
+{
+    string myString;
+
+    cout << "input a string: ";
+    getline(cin, myString);
+    int length = myString.length();
+
+    // Copy the string contents into a dynamically allocated char buffer.
+    char *charArray = new char[length + 1];
+    strcpy(charArray, myString.c_str());
+
+    for (int i = 0; i < length; ++i) {
+        cout << charArray[i] << " ";
+    }
+
+    delete[] charArray;
+    return 0;
+}
+
+{% endhighlight %}
+
+***
+
+## Diagrams
+
+This theme supports generating various diagrams from a text description using [jekyll-diagrams](https://github.com/zhustec/jekyll-diagrams){:target="\_blank"} plugin.
+Below, we generate a few examples of such diagrams using languages such as [mermaid](https://mermaid-js.github.io/mermaid/){:target="\_blank"}, [plantuml](https://plantuml.com/){:target="\_blank"}, [vega-lite](https://vega.github.io/vega-lite/){:target="\_blank"}, etc.
+
+**Note:** different diagram-generation packages require external dependencies to be installed on your machine.
+Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW.
+For any other details, please refer to [jekyll-diagrams](https://github.com/zhustec/jekyll-diagrams){:target="\_blank"} README.
+
+**Note:** This is not supported for local rendering!
+
+The diagram below was generated by the following code:
+
+{% raw %}
+```
+{% mermaid %}
+sequenceDiagram
+ participant John
+ participant Alice
+ Alice->>John: Hello John, how are you?
+ John-->>Alice: Great!
+{% endmermaid %}
+```
+{% endraw %}
+
+{% mermaid %}
+sequenceDiagram
+    participant John
+    participant Alice
+    Alice->>John: Hello John, how are you?
+    John-->>Alice: Great!
+{% endmermaid %}
+
+***
+
+## Tweets
+
+An example of displaying a tweet:
+{% twitter https://twitter.com/rubygems/status/518821243320287232 %}
+
+An example of pulling from a timeline:
+{% twitter https://twitter.com/jekyllrb maxwidth=500 limit=3 %}
+
+For more details on using the plugin visit: [jekyll-twitter-plugin](https://github.com/rob-murray/jekyll-twitter-plugin)
+
+***
+
+## Blockquotes
+
+<blockquote>
+    We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another.
+    —Anais Nin
+</blockquote>
+
+***
+
+
+## Layouts
+
+The main text column is referred to as the body.
+It is the assumed layout of any direct descendants of the `d-article` element.
+
+<div class="fake-img l-body">
+  <p>.l-body</p>
+</div>
+
+For images you want to display a little larger, try `.l-page`:
+
+<div class="fake-img l-page">
+  <p>.l-page</p>
+</div>
+
+All of these have an outset variant if you want to poke out from the body text a little bit.
+For instance:
+
+<div class="fake-img l-body-outset">
+  <p>.l-body-outset</p>
+</div>
+
+<div class="fake-img l-page-outset">
+  <p>.l-page-outset</p>
+</div>
+
+Occasionally you’ll want to use the full browser width.
+For this, use `.l-screen`.
+You can also inset the element a little from the edge of the browser by using the inset variant.
+
+<div class="fake-img l-screen">
+  <p>.l-screen</p>
+</div>
+
+<div class="fake-img l-screen-inset">
+  <p>.l-screen-inset</p>
+</div>
+
+The final layout is for marginalia, asides, and footnotes.
+It does not interrupt the normal flow of `.l-body`-sized text except on mobile screen sizes.
+
+<div class="fake-img l-gutter">
+  <p>.l-gutter</p>
+</div>
+
+***
+
+## Other Typography?
+
+Emphasis, aka italics, with *asterisks* (`*asterisks*`) or _underscores_ (`_underscores_`).
+
+Strong emphasis, aka bold, with **asterisks** or __underscores__.
+
+Combined emphasis with **asterisks and _underscores_**.
+
+Strikethrough uses two tildes. ~~Scratch this.~~
+
+1. First ordered list item
+2. Another item
+ * Unordered sub-list.
+1. Actual numbers don't matter, just that it's a number
+ 1. Ordered sub-list
+4. And another item.
+
+ You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we'll use three here to also align the raw Markdown).
+
+ To have a line break without a paragraph, you will need to use two trailing spaces.
+ Note that this line is separate, but within the same paragraph.
+ (This is contrary to the typical GFM line break behavior, where trailing spaces are not required.)
+
+* Unordered lists can use asterisks
+- Or minuses
++ Or pluses
+
+[I'm an inline-style link](https://www.google.com)
+
+[I'm an inline-style link with title](https://www.google.com "Google's Homepage")
+
+[I'm a reference-style link][Arbitrary case-insensitive reference text]
+
+[I'm a relative reference to a repository file](../blob/master/LICENSE)
+
+[You can use numbers for reference-style link definitions][1]
+
+Or leave it empty and use the [link text itself].
+
+URLs and URLs in angle brackets will automatically get turned into links.
+http://www.example.com or <http://www.example.com> and sometimes
+example.com (but not on Github, for example).
+
+Some text to show that the reference links can follow later.
+
+[arbitrary case-insensitive reference text]: https://www.mozilla.org
+[1]: http://slashdot.org
+[link text itself]: http://www.reddit.com
+
+Here's our logo (hover to see the title text):
+
+Inline-style:
+
+![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 1")
+
+Reference-style:
+![alt text][logo]
+
+[logo]: https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 2"
+
+Inline `code` has `back-ticks around` it.
+
+```javascript
+var s = "JavaScript syntax highlighting";
+alert(s);
+```
+
+```python
+s = "Python syntax highlighting"
+print(s)
+```
+
+```
+No language indicated, so no syntax highlighting.
+But let's throw in a <b>tag</b>.
+```
+
+Colons can be used to align columns.
+
+| Tables | Are | Cool |
+| ------------- |:-------------:| -----:|
+| col 3 is | right-aligned | $1600 |
+| col 2 is | centered | $12 |
+| zebra stripes | are neat | $1 |
+
+There must be at least 3 dashes separating each header cell.
+The outer pipes (|) are optional, and you don't need to make the
+raw Markdown line up prettily. You can also use inline Markdown.
+
+Markdown | Less | Pretty
+--- | --- | ---
+*Still* | `renders` | **nicely**
+1 | 2 | 3
+
+> Blockquotes are very handy in email to emulate reply text.
+> This line is part of the same quote.
+
+Quote break.
+
+> This is a very long line that will still be quoted properly when it wraps. Oh boy let's keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can *put* **Markdown** into a blockquote.
+
+
+Here's a line for us to start with.
+
+This line is separated from the one above by two newlines, so it will be a *separate paragraph*.
+
+This line is also a separate paragraph, but...
+This line is only separated by a single newline, so it's a separate line in the *same paragraph*.
diff --git a/_posts/CC.png b/_posts/CC.png
new file mode 100644
index 000000000..aeeecb248
Binary files /dev/null and b/_posts/CC.png differ
diff --git a/_posts/DDL.png b/_posts/DDL.png
new file mode 100644
index 000000000..192a3578c
Binary files /dev/null and b/_posts/DDL.png differ
diff --git a/_posts/GEMV-GEMM.png b/_posts/GEMV-GEMM.png
new file mode 100644
index 000000000..54379292d
Binary files /dev/null and b/_posts/GEMV-GEMM.png differ
diff --git a/_posts/MHA_GQA_MQA.png b/_posts/MHA_GQA_MQA.png
new file mode 100644
index 000000000..be8a59f41
Binary files /dev/null and b/_posts/MHA_GQA_MQA.png differ
diff --git a/_posts/NeuPIMs.png b/_posts/NeuPIMs.png
new file mode 100644
index 000000000..82b838df1
Binary files /dev/null and b/_posts/NeuPIMs.png differ
diff --git a/_posts/NeuPIMs_result.png b/_posts/NeuPIMs_result.png
new file mode 100644
index 000000000..bd6266c1c
Binary files /dev/null and b/_posts/NeuPIMs_result.png differ
diff --git a/_posts/PIM-PNM-detail.png b/_posts/PIM-PNM-detail.png
new file mode 100644
index 000000000..631625ca7
Binary files /dev/null and b/_posts/PIM-PNM-detail.png differ
diff --git a/_posts/PIM-PNM.png b/_posts/PIM-PNM.png
new file mode 100644
index 000000000..d297a417c
Binary files /dev/null and b/_posts/PIM-PNM.png differ
diff --git a/_posts/RING_arch.png b/_posts/RING_arch.png
new file mode 100644
index 000000000..7e62d3158
Binary files /dev/null and b/_posts/RING_arch.png differ
diff --git a/_posts/RING_basic.png b/_posts/RING_basic.png
new file mode 100644
index 000000000..3115efd88
Binary files /dev/null and b/_posts/RING_basic.png differ
diff --git a/_posts/RING_large_eval.png b/_posts/RING_large_eval.png
new file mode 100644
index 000000000..bc2094984
Binary files /dev/null and b/_posts/RING_large_eval.png differ
diff --git a/_posts/RING_small_eval.png b/_posts/RING_small_eval.png
new file mode 100644
index 000000000..256c854fa
Binary files /dev/null and b/_posts/RING_small_eval.png differ
diff --git a/_posts/RING_upgrade.png b/_posts/RING_upgrade.png
new file mode 100644
index 000000000..93193d063
Binary files /dev/null and b/_posts/RING_upgrade.png differ
diff --git a/_posts/Results.png b/_posts/Results.png
new file mode 100644
index 000000000..ccae04722
Binary files /dev/null and b/_posts/Results.png differ
diff --git a/_posts/SPARSE_aggregator.png b/_posts/SPARSE_aggregator.png
new file mode 100644
index 000000000..2839c76e0
Binary files /dev/null and b/_posts/SPARSE_aggregator.png differ
diff --git a/_posts/SPARSE_algo.png b/_posts/SPARSE_algo.png
new file mode 100644
index 000000000..d138780c2
Binary files /dev/null and b/_posts/SPARSE_algo.png differ
diff --git a/_posts/SPARSE_result.png b/_posts/SPARSE_result.png
new file mode 100644
index 000000000..996a387a6
Binary files /dev/null and b/_posts/SPARSE_result.png differ
diff --git a/_posts/dualbuffer.png b/_posts/dualbuffer.png
new file mode 100644
index 000000000..9e11d58f3
Binary files /dev/null and b/_posts/dualbuffer.png differ
diff --git a/_posts/image.png b/_posts/image.png
new file mode 100644
index 000000000..72f89bcbc
Binary files /dev/null and b/_posts/image.png differ
diff --git a/_posts/neupimsOverlap.png b/_posts/neupimsOverlap.png
new file mode 100644
index 000000000..d8fbd77c5
Binary files /dev/null and b/_posts/neupimsOverlap.png differ
diff --git a/_posts/nodecomm.png b/_posts/nodecomm.png
new file mode 100644
index 000000000..48f9eaa96
Binary files /dev/null and b/_posts/nodecomm.png differ
diff --git a/_posts/parallelism.png b/_posts/parallelism.png
new file mode 100644
index 000000000..f1e4164ed
Binary files /dev/null and b/_posts/parallelism.png differ
diff --git a/_posts/smartnic.png b/_posts/smartnic.png
new file mode 100644
index 000000000..c73005844
Binary files /dev/null and b/_posts/smartnic.png differ
diff --git a/_posts/2025-04-28-analysing-the-spectral-biases-in-generative-models.md b/_posts_history/2025-04-28-analysing-the-spectral-biases-in-generative-models.md
similarity index 100%
rename from _posts/2025-04-28-analysing-the-spectral-biases-in-generative-models.md
rename to _posts_history/2025-04-28-analysing-the-spectral-biases-in-generative-models.md
diff --git a/_posts/2025-04-28-analytical-simulated-dynamics.md b/_posts_history/2025-04-28-analytical-simulated-dynamics.md
similarity index 100%
rename from _posts/2025-04-28-analytical-simulated-dynamics.md
rename to _posts_history/2025-04-28-analytical-simulated-dynamics.md
diff --git a/_posts/2025-04-28-anthropomorphic-ai.md b/_posts_history/2025-04-28-anthropomorphic-ai.md
similarity index 100%
rename from _posts/2025-04-28-anthropomorphic-ai.md
rename to _posts_history/2025-04-28-anthropomorphic-ai.md
diff --git a/_posts/2025-04-28-building-blocks-of-differentially-private-training.md b/_posts_history/2025-04-28-building-blocks-of-differentially-private-training.md
similarity index 100%
rename from _posts/2025-04-28-building-blocks-of-differentially-private-training.md
rename to _posts_history/2025-04-28-building-blocks-of-differentially-private-training.md
diff --git a/_posts/2025-04-28-calibrated-mia.md b/_posts_history/2025-04-28-calibrated-mia.md
similarity index 100%
rename from _posts/2025-04-28-calibrated-mia.md
rename to _posts_history/2025-04-28-calibrated-mia.md
diff --git a/_posts/2025-04-28-calibration.md b/_posts_history/2025-04-28-calibration.md
similarity index 100%
rename from _posts/2025-04-28-calibration.md
rename to _posts_history/2025-04-28-calibration.md
diff --git a/_posts/2025-04-28-conditional-flow-matching.md b/_posts_history/2025-04-28-conditional-flow-matching.md
similarity index 100%
rename from _posts/2025-04-28-conditional-flow-matching.md
rename to _posts_history/2025-04-28-conditional-flow-matching.md
diff --git a/_posts/2025-04-28-distill-example.md b/_posts_history/2025-04-28-distill-example.md
similarity index 100%
rename from _posts/2025-04-28-distill-example.md
rename to _posts_history/2025-04-28-distill-example.md
diff --git a/_posts/2025-04-28-distill-example2.html b/_posts_history/2025-04-28-distill-example2.html
similarity index 100%
rename from _posts/2025-04-28-distill-example2.html
rename to _posts_history/2025-04-28-distill-example2.html
diff --git a/_posts/2025-04-28-do-not-write-jailbreak-papers.md b/_posts_history/2025-04-28-do-not-write-jailbreak-papers.md
similarity index 100%
rename from _posts/2025-04-28-do-not-write-jailbreak-papers.md
rename to _posts_history/2025-04-28-do-not-write-jailbreak-papers.md
diff --git a/_posts/2025-04-28-ebm-vs-mcmc.md b/_posts_history/2025-04-28-ebm-vs-mcmc.md
similarity index 100%
rename from _posts/2025-04-28-ebm-vs-mcmc.md
rename to _posts_history/2025-04-28-ebm-vs-mcmc.md
diff --git a/_posts/2025-04-28-engram.md b/_posts_history/2025-04-28-engram.md
similarity index 100%
rename from _posts/2025-04-28-engram.md
rename to _posts_history/2025-04-28-engram.md
diff --git a/_posts/2025-04-28-factual-validation-simplification.md b/_posts_history/2025-04-28-factual-validation-simplification.md
similarity index 100%
rename from _posts/2025-04-28-factual-validation-simplification.md
rename to _posts_history/2025-04-28-factual-validation-simplification.md
diff --git a/_posts/2025-04-28-feature-geometry.md b/_posts_history/2025-04-28-feature-geometry.md
similarity index 100%
rename from _posts/2025-04-28-feature-geometry.md
rename to _posts_history/2025-04-28-feature-geometry.md
diff --git a/_posts/2025-04-28-fine-tuning-token-based-large-multimodal-models.md b/_posts_history/2025-04-28-fine-tuning-token-based-large-multimodal-models.md
similarity index 100%
rename from _posts/2025-04-28-fine-tuning-token-based-large-multimodal-models.md
rename to _posts_history/2025-04-28-fine-tuning-token-based-large-multimodal-models.md
diff --git a/_posts/2025-04-28-fisher.md b/_posts_history/2025-04-28-fisher.md
similarity index 100%
rename from _posts/2025-04-28-fisher.md
rename to _posts_history/2025-04-28-fisher.md
diff --git a/_posts/2025-04-28-flow-with-what-you-know.md b/_posts_history/2025-04-28-flow-with-what-you-know.md
similarity index 100%
rename from _posts/2025-04-28-flow-with-what-you-know.md
rename to _posts_history/2025-04-28-flow-with-what-you-know.md
diff --git a/_posts/2025-04-28-foundation-adapter.md b/_posts_history/2025-04-28-foundation-adapter.md
similarity index 100%
rename from _posts/2025-04-28-foundation-adapter.md
rename to _posts_history/2025-04-28-foundation-adapter.md
diff --git a/_posts/2025-04-28-imagenet-flaws.md b/_posts_history/2025-04-28-imagenet-flaws.md
similarity index 100%
rename from _posts/2025-04-28-imagenet-flaws.md
rename to _posts_history/2025-04-28-imagenet-flaws.md
diff --git a/_posts/2025-04-28-interpret-classification.md b/_posts_history/2025-04-28-interpret-classification.md
similarity index 100%
rename from _posts/2025-04-28-interpret-classification.md
rename to _posts_history/2025-04-28-interpret-classification.md
diff --git a/_posts/2025-04-28-linear-gnn-convergence-restated.md b/_posts_history/2025-04-28-linear-gnn-convergence-restated.md
similarity index 100%
rename from _posts/2025-04-28-linear-gnn-convergence-restated.md
rename to _posts_history/2025-04-28-linear-gnn-convergence-restated.md
diff --git a/_posts/2025-04-28-linrec.md b/_posts_history/2025-04-28-linrec.md
similarity index 100%
rename from _posts/2025-04-28-linrec.md
rename to _posts_history/2025-04-28-linrec.md
diff --git a/_posts/2025-04-28-llm-context-utilization.md b/_posts_history/2025-04-28-llm-context-utilization.md
similarity index 100%
rename from _posts/2025-04-28-llm-context-utilization.md
rename to _posts_history/2025-04-28-llm-context-utilization.md
diff --git a/_posts/2025-04-28-llm-democracy.md b/_posts_history/2025-04-28-llm-democracy.md
similarity index 100%
rename from _posts/2025-04-28-llm-democracy.md
rename to _posts_history/2025-04-28-llm-democracy.md
diff --git a/_posts/2025-04-28-llm-knowledge-distil.md b/_posts_history/2025-04-28-llm-knowledge-distil.md
similarity index 100%
rename from _posts/2025-04-28-llm-knowledge-distil.md
rename to _posts_history/2025-04-28-llm-knowledge-distil.md
diff --git a/_posts/2025-04-28-localization.md b/_posts_history/2025-04-28-localization.md
similarity index 100%
rename from _posts/2025-04-28-localization.md
rename to _posts_history/2025-04-28-localization.md
diff --git a/_posts/2025-04-28-lost-in-prediction.md b/_posts_history/2025-04-28-lost-in-prediction.md
similarity index 100%
rename from _posts/2025-04-28-lost-in-prediction.md
rename to _posts_history/2025-04-28-lost-in-prediction.md
diff --git a/_posts/2025-04-28-mad.md b/_posts_history/2025-04-28-mad.md
similarity index 100%
rename from _posts/2025-04-28-mad.md
rename to _posts_history/2025-04-28-mad.md
diff --git a/_posts/2025-04-28-multimodal-learning.md b/_posts_history/2025-04-28-multimodal-learning.md
similarity index 100%
rename from _posts/2025-04-28-multimodal-learning.md
rename to _posts_history/2025-04-28-multimodal-learning.md
diff --git a/_posts/2025-04-28-opt-summary.md b/_posts_history/2025-04-28-opt-summary.md
similarity index 100%
rename from _posts/2025-04-28-opt-summary.md
rename to _posts_history/2025-04-28-opt-summary.md
diff --git a/_posts/2025-04-28-pessa.md b/_posts_history/2025-04-28-pessa.md
similarity index 100%
rename from _posts/2025-04-28-pessa.md
rename to _posts_history/2025-04-28-pessa.md
diff --git a/_posts/2025-04-28-pitfalls-of-evidence-based-ai-policy.md b/_posts_history/2025-04-28-pitfalls-of-evidence-based-ai-policy.md
similarity index 100%
rename from _posts/2025-04-28-pitfalls-of-evidence-based-ai-policy.md
rename to _posts_history/2025-04-28-pitfalls-of-evidence-based-ai-policy.md
diff --git a/_posts/2025-04-28-pocp.md b/_posts_history/2025-04-28-pocp.md
similarity index 100%
rename from _posts/2025-04-28-pocp.md
rename to _posts_history/2025-04-28-pocp.md
diff --git a/_posts/2025-04-28-positional-embedding.md b/_posts_history/2025-04-28-positional-embedding.md
similarity index 100%
rename from _posts/2025-04-28-positional-embedding.md
rename to _posts_history/2025-04-28-positional-embedding.md
diff --git a/_posts/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md b/_posts_history/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md
similarity index 100%
rename from _posts/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md
rename to _posts_history/2025-04-28-reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy.md
diff --git a/_posts/2025-04-28-repurposing.md b/_posts_history/2025-04-28-repurposing.md
similarity index 100%
rename from _posts/2025-04-28-repurposing.md
rename to _posts_history/2025-04-28-repurposing.md
diff --git a/_posts/2025-04-28-rethinking-graph-prompts.md b/_posts_history/2025-04-28-rethinking-graph-prompts.md
similarity index 100%
rename from _posts/2025-04-28-rethinking-graph-prompts.md
rename to _posts_history/2025-04-28-rethinking-graph-prompts.md
diff --git a/_posts/2025-04-28-rethinking-llm-simulation.md b/_posts_history/2025-04-28-rethinking-llm-simulation.md
similarity index 100%
rename from _posts/2025-04-28-rethinking-llm-simulation.md
rename to _posts_history/2025-04-28-rethinking-llm-simulation.md
diff --git a/_posts/2025-04-28-risks-private-evals.md b/_posts_history/2025-04-28-risks-private-evals.md
similarity index 100%
rename from _posts/2025-04-28-risks-private-evals.md
rename to _posts_history/2025-04-28-risks-private-evals.md
diff --git a/_posts/2025-04-28-scalable-mcts.md b/_posts_history/2025-04-28-scalable-mcts.md
similarity index 100%
rename from _posts/2025-04-28-scalable-mcts.md
rename to _posts_history/2025-04-28-scalable-mcts.md
diff --git a/_posts/2025-04-28-sparse-autodiff.md b/_posts_history/2025-04-28-sparse-autodiff.md
similarity index 100%
rename from _posts/2025-04-28-sparse-autodiff.md
rename to _posts_history/2025-04-28-sparse-autodiff.md
diff --git a/_posts/2025-04-28-spd.md b/_posts_history/2025-04-28-spd.md
similarity index 100%
rename from _posts/2025-04-28-spd.md
rename to _posts_history/2025-04-28-spd.md
diff --git a/_posts/2025-04-28-the-illustrated-alphafold.md b/_posts_history/2025-04-28-the-illustrated-alphafold.md
similarity index 100%
rename from _posts/2025-04-28-the-illustrated-alphafold.md
rename to _posts_history/2025-04-28-the-illustrated-alphafold.md
diff --git a/_posts/2025-04-28-the-lottery-llm-hyperthesis.md b/_posts_history/2025-04-28-the-lottery-llm-hyperthesis.md
similarity index 100%
rename from _posts/2025-04-28-the-lottery-llm-hyperthesis.md
rename to _posts_history/2025-04-28-the-lottery-llm-hyperthesis.md
diff --git a/_posts/2025-04-28-toddlers-vs-vismodels.md b/_posts_history/2025-04-28-toddlers-vs-vismodels.md
similarity index 100%
rename from _posts/2025-04-28-toddlers-vs-vismodels.md
rename to _posts_history/2025-04-28-toddlers-vs-vismodels.md
diff --git a/_posts/2025-04-28-towards-more-rigorous-llm-evals.md b/_posts_history/2025-04-28-towards-more-rigorous-llm-evals.md
similarity index 100%
rename from _posts/2025-04-28-towards-more-rigorous-llm-evals.md
rename to _posts_history/2025-04-28-towards-more-rigorous-llm-evals.md
diff --git a/_posts/2025-04-28-visualizing-training.md b/_posts_history/2025-04-28-visualizing-training.md
similarity index 100%
rename from _posts/2025-04-28-visualizing-training.md
rename to _posts_history/2025-04-28-visualizing-training.md
diff --git a/_posts/2025-04-28-vlm-understanding.md b/_posts_history/2025-04-28-vlm-understanding.md
similarity index 100%
rename from _posts/2025-04-28-vlm-understanding.md
rename to _posts_history/2025-04-28-vlm-understanding.md
diff --git a/_posts/2025-05-07-steering-llms-behavior.md b/_posts_history/2025-05-07-steering-llms-behavior.md
similarity index 100%
rename from _posts/2025-05-07-steering-llms-behavior.md
rename to _posts_history/2025-05-07-steering-llms-behavior.md
diff --git a/assets/bibliography/2025-04-28-Final.bib b/assets/bibliography/2025-04-28-Final.bib
new file mode 100644
index 000000000..2dee987ac
--- /dev/null
+++ b/assets/bibliography/2025-04-28-Final.bib
@@ -0,0 +1,145 @@
+@inproceedings{OmNICCL,
+author = {Gu, Tongzhou and Fei, Jiawei and Canini, Marco},
+title = {OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs},
+year = {2024},
+isbn = {9798400707131},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3672198.3673804},
+doi = {10.1145/3672198.3673804},
+abstract = {AllReduce is a collective communication pattern commonly used in Distributed Deep Learning (DDL) and High Performance Computing (HPC). Sparse AllReduce, which compresses the data transmitted, achieves significant acceleration on specific workloads. However, compression introduces a non-negligible performance overhead. Therefore, we propose the OmNICreduce algorithm, an efficient inter-node sparse AllReduce method, as well as its implementation, OmNICCL. It utilizes Direct Cache Access (DCA) to achieve zero-overhead lossless compression and employs SmartNICs for aggregation on the data plane. We demonstrate that our method can provide up to a 7.24\texttimes{} speedup over conventional dense AllReduce methods under a 100Gbps RoCEv2 network and 1.76-17.37\texttimes{} performance improvement over state-of-the-art implementations when performing sparse AllReduce.},
+booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing},
+pages = {75–83},
+numpages = {9},
+keywords = {Collective Communication, DCA, DPU, In-Network Aggregation, SmartNIC},
+location = {Sydney, NSW, Australia},
+series = {NAIC '24}
+}
+
+@inproceedings{OmniReduce,
+author = {Fei, Jiawei and Ho, Chen-Yu and Sahu, Atal N. and Canini, Marco and Sapio, Amedeo},
+title = {Efficient sparse collective communication and its application to accelerate distributed deep learning},
+year = {2021},
+isbn = {9781450383837},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3452296.3472904},
+doi = {10.1145/3452296.3472904},
+abstract = {Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models.We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2x. Even at 100 Gbps, OmniReduce delivers 1.4--2.9x better performance for network-bottlenecked DNNs.},
+booktitle = {Proceedings of the 2021 ACM SIGCOMM 2021 Conference},
+pages = {676–691},
+numpages = {16},
+keywords = {deep learning, distributed training},
+location = {Virtual Event, USA},
+series = {SIGCOMM '21}
+}
+
+@INPROCEEDINGS{DirectReduce,
+ author={Hui, Lihuan and Yang, Wang and Wang, Yanbo},
+ booktitle={2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)},
+ title={Leveraging SmartNIC for Ring AllReduce Offloading},
+ year={2024},
+ volume={},
+ number={},
+ pages={173-180},
+ keywords={Systematics;Protocols;Distance learning;Simulation;Message passing;Logic gates;Parallel processing;Topology;Optimization;Engines;Ring All-Reduce;SmartNIC;In-Network Aggregation;collective communication},
+ doi={10.1109/ISPA63168.2024.00030}
+}
+
+@ARTICLE{FPGANIC,
+ author={Ma, Rui and Georganas, Evangelos and Heinecke, Alexander and Gribok, Sergey and Boutros, Andrew and Nurvitadhi, Eriko},
+ journal={IEEE Computer Architecture Letters},
+ title={FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems},
+ year={2022},
+ volume={21},
+ number={2},
+ pages={49-52},
+ keywords={Training;Artificial intelligence;Field programmable gate arrays;Tensors;Computational modeling;Bandwidth;Scalability;AI training;all-reduce;smart NIC;FPGA},
+ doi={10.1109/LCA.2022.3189207}}
+
+
+@inproceedings{OffPath,
+  title={Characterizing off-path {SmartNIC} for accelerating distributed systems},
+ author={Wei, Xingda and Cheng, Rongxin and Yang, Yuhan and Chen, Rong and Chen, Haibo},
+ booktitle={17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)},
+ pages={987--1004},
+ year={2023}
+}
+
+@inproceedings{OptimusNIC,
+ title={OptimusNIC: Offloading Optimizer State to SmartNICs for Efficient Large-Scale AI Training},
+ author={Rebai, Achref and Canini, Marco},
+ booktitle={Proceedings of the 5th Workshop on Machine Learning and Systems},
+ pages={176--182},
+ year={2025}
+}
+
+@inproceedings{LineFS,
+author = {Kim, Jongyul and Jang, Insu and Reda, Waleed and Im, Jaeseong and Canini, Marco and Kosti\'{c}, Dejan and Kwon, Youngjin and Peter, Simon and Witchel, Emmett},
+title = {LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism},
+year = {2021},
+isbn = {9781450387095},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3477132.3483565},
+doi = {10.1145/3477132.3483565},
+abstract = {In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe.We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a Smart-NIC. We implement LineFS on the Mellanox BlueField Smart-NIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80\% and throughput in Filebench up to 79\%, while providing extended DFS availability during host system failures.},
+booktitle = {Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles},
+pages = {756–771},
+numpages = {16},
+keywords = {SmartNIC offload, Distributed file system},
+location = {Virtual Event, Germany},
+series = {SOSP '21}
+}
+
+@INPROCEEDINGS{AstraSim2,
+ author={Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
+ booktitle={2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
+ title={ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale},
+ year={2023},
+ volume={},
+ number={},
+ pages={283-294},
+ keywords={Training;Semiconductor device modeling;Analytical models;Network topology;Systems modeling;Throughput;Data models;Distributed training;High-performance training;Multi-dimensional network;Disaggregated memory system},
+ doi={10.1109/ISPASS57527.2023.00035}}
+
+@inproceedings{SqueezeNIC,
+author = {Rebai, Achref and Ojewale, Mubarak Adetunji and Ullah, Anees and Canini, Marco and Fahmy, Suhaib A.},
+title = {SqueezeNIC: Low-Latency In-NIC Compression for Distributed Deep Learning},
+year = {2024},
+isbn = {9798400707131},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3672198.3673801},
+doi = {10.1145/3672198.3673801},
+abstract = {To alleviate the communication bottleneck of distributed deep learning training, several data compression algorithms have been proposed. However, these algorithms introduce computational overhead and resource allocation concerns on CPUs and GPUs. In this paper, we introduce SqueezeNIC, an FPGA-based Network Interface Card (NIC) that offloads communication compression from CPUs/GPUs, bridging a high bandwidth intra-node network with a high bandwidth inter-node network. It enables better overlap of gradient communication and computation to further reduce training time per iteration in distributed training. Our evaluations shows that SqueezeNIC achieves line rate compression and can speed up training by up to a factor of 1.21\texttimes{}, compared to baseline approaches.},
+booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing},
+pages = {61–68},
+numpages = {8},
+keywords = {Distributed Training, FPGA, In-Network Compression},
+location = {Sydney, NSW, Australia},
+series = {NAIC '24}
+}
+
+
+@misc{bluefield2,
+ author = {{NVIDIA}},
+ title = {NVIDIA BlueField-2 DPU},
+ year = {2023},
+ howpublished = {\url{https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf}},
+ note = {Datasheet, Accessed: 2025-05-22}
+}
+
+@misc{denneman2020multiGPU,
+ author = {Frank Denneman},
+ title = {Multi-GPU and Distributed Deep Learning},
+ howpublished = {\url{https://frankdenneman.nl/2020/02/19/multi-gpu-and-distributed-deep-learning/}},
+ year = {2020},
+ note = {Accessed: 2025-05-24}
+}
+
+@misc{wikiCollectiveOp,
+ title = {Collective operation},
+ howpublished = {\url{https://en.wikipedia.org/wiki/Collective_operation}},
+ note = {Accessed: 2025-05-24}
+}
\ No newline at end of file
diff --git a/assets/bibliography/2025-04-28-NIC.bib b/assets/bibliography/2025-04-28-NIC.bib
new file mode 100644
index 000000000..2dee987ac
--- /dev/null
+++ b/assets/bibliography/2025-04-28-NIC.bib
@@ -0,0 +1,145 @@
+@inproceedings{OmNICCL,
+author = {Gu, Tongzhou and Fei, Jiawei and Canini, Marco},
+title = {OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs},
+year = {2024},
+isbn = {9798400707131},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3672198.3673804},
+doi = {10.1145/3672198.3673804},
+abstract = {AllReduce is a collective communication pattern commonly used in Distributed Deep Learning (DDL) and High Performance Computing (HPC). Sparse AllReduce, which compresses the data transmitted, achieves significant acceleration on specific workloads. However, compression introduces a non-negligible performance overhead. Therefore, we propose the OmNICreduce algorithm, an efficient inter-node sparse AllReduce method, as well as its implementation, OmNICCL. It utilizes Direct Cache Access (DCA) to achieve zero-overhead lossless compression and employs SmartNICs for aggregation on the data plane. We demonstrate that our method can provide up to a 7.24\texttimes{} speedup over conventional dense AllReduce methods under a 100Gbps RoCEv2 network and 1.76-17.37\texttimes{} performance improvement over state-of-the-art implementations when performing sparse AllReduce.},
+booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing},
+pages = {75–83},
+numpages = {9},
+keywords = {Collective Communication, DCA, DPU, In-Network Aggregation, SmartNIC},
+location = {Sydney, NSW, Australia},
+series = {NAIC '24}
+}
+
+@inproceedings{OmniReduce,
+author = {Fei, Jiawei and Ho, Chen-Yu and Sahu, Atal N. and Canini, Marco and Sapio, Amedeo},
+title = {Efficient sparse collective communication and its application to accelerate distributed deep learning},
+year = {2021},
+isbn = {9781450383837},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3452296.3472904},
+doi = {10.1145/3452296.3472904},
+abstract = {Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in transmissions of many zeros when inputs are sparse. This counters current trends that see increasing data sparsity in large models.We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2x. Even at 100 Gbps, OmniReduce delivers 1.4--2.9x better performance for network-bottlenecked DNNs.},
+booktitle = {Proceedings of the 2021 ACM SIGCOMM 2021 Conference},
+pages = {676–691},
+numpages = {16},
+keywords = {deep learning, distributed training},
+location = {Virtual Event, USA},
+series = {SIGCOMM '21}
+}
+
+@INPROCEEDINGS{DirectReduce,
+ author={Hui, Lihuan and Yang, Wang and Wang, Yanbo},
+ booktitle={2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)},
+ title={Leveraging SmartNIC for Ring AllReduce Offloading},
+ year={2024},
+ volume={},
+ number={},
+ pages={173-180},
+ keywords={Systematics;Protocols;Distance learning;Simulation;Message passing;Logic gates;Parallel processing;Topology;Optimization;Engines;Ring All-Reduce;SmartNIC;In-Network Aggregation;collective communication},
+ doi={10.1109/ISPA63168.2024.00030}
+}
+
+@ARTICLE{FPGANIC,
+ author={Ma, Rui and Georganas, Evangelos and Heinecke, Alexander and Gribok, Sergey and Boutros, Andrew and Nurvitadhi, Eriko},
+ journal={IEEE Computer Architecture Letters},
+ title={FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems},
+ year={2022},
+ volume={21},
+ number={2},
+ pages={49-52},
+ keywords={Training;Artificial intelligence;Field programmable gate arrays;Tensors;Computational modeling;Bandwidth;Scalability;AI training;all-reduce;smart NIC;FPGA},
+ doi={10.1109/LCA.2022.3189207}}
+
+
+@inproceedings{OffPath,
+  title={Characterizing off-path {SmartNIC} for accelerating distributed systems},
+ author={Wei, Xingda and Cheng, Rongxin and Yang, Yuhan and Chen, Rong and Chen, Haibo},
+ booktitle={17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)},
+ pages={987--1004},
+ year={2023}
+}
+
+@inproceedings{OptimusNIC,
+ title={OptimusNIC: Offloading Optimizer State to SmartNICs for Efficient Large-Scale AI Training},
+ author={Rebai, Achref and Canini, Marco},
+ booktitle={Proceedings of the 5th Workshop on Machine Learning and Systems},
+ pages={176--182},
+ year={2025}
+}
+
+@inproceedings{LineFS,
+author = {Kim, Jongyul and Jang, Insu and Reda, Waleed and Im, Jaeseong and Canini, Marco and Kosti\'{c}, Dejan and Kwon, Youngjin and Peter, Simon and Witchel, Emmett},
+title = {LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism},
+year = {2021},
+isbn = {9781450387095},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3477132.3483565},
+doi = {10.1145/3477132.3483565},
+abstract = {In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe.We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a Smart-NIC. We implement LineFS on the Mellanox BlueField Smart-NIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80\% and throughput in Filebench up to 79\%, while providing extended DFS availability during host system failures.},
+booktitle = {Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles},
+pages = {756–771},
+numpages = {16},
+keywords = {SmartNIC offload, Distributed file system},
+location = {Virtual Event, Germany},
+series = {SOSP '21}
+}
+
+@INPROCEEDINGS{AstraSim2,
+ author={Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
+ booktitle={2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
+ title={ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale},
+ year={2023},
+ volume={},
+ number={},
+ pages={283-294},
+ keywords={Training;Semiconductor device modeling;Analytical models;Network topology;Systems modeling;Throughput;Data models;Distributed training;High-performance training;Multi-dimensional network;Disaggregated memory system},
+ doi={10.1109/ISPASS57527.2023.00035}}
+
+@inproceedings{SqueezeNIC,
+author = {Rebai, Achref and Ojewale, Mubarak Adetunji and Ullah, Anees and Canini, Marco and Fahmy, Suhaib A.},
+title = {SqueezeNIC: Low-Latency In-NIC Compression for Distributed Deep Learning},
+year = {2024},
+isbn = {9798400707131},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3672198.3673801},
+doi = {10.1145/3672198.3673801},
+abstract = {To alleviate the communication bottleneck of distributed deep learning training, several data compression algorithms have been proposed. However, these algorithms introduce computational overhead and resource allocation concerns on CPUs and GPUs. In this paper, we introduce SqueezeNIC, an FPGA-based Network Interface Card (NIC) that offloads communication compression from CPUs/GPUs, bridging a high bandwidth intra-node network with a high bandwidth inter-node network. It enables better overlap of gradient communication and computation to further reduce training time per iteration in distributed training. Our evaluations shows that SqueezeNIC achieves line rate compression and can speed up training by up to a factor of 1.21\texttimes{}, compared to baseline approaches.},
+booktitle = {Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing},
+pages = {61–68},
+numpages = {8},
+keywords = {Distributed Training, FPGA, In-Network Compression},
+location = {Sydney, NSW, Australia},
+series = {NAIC '24}
+}
+
+
+@misc{bluefield2,
+ author = {{NVIDIA}},
+ title = {NVIDIA BlueField-2 DPU},
+ year = {2023},
+ howpublished = {\url{https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf}},
+ note = {Datasheet, Accessed: 2025-05-22}
+}
+
+@misc{denneman2020multiGPU,
+ author = {Frank Denneman},
+ title = {Multi-GPU and Distributed Deep Learning},
+ howpublished = {\url{https://frankdenneman.nl/2020/02/19/multi-gpu-and-distributed-deep-learning/}},
+ year = {2020},
+ note = {Accessed: 2025-05-24}
+}
+
+@misc{wikiCollectiveOp,
+ title = {Collective operation},
+ howpublished = {\url{https://en.wikipedia.org/wiki/Collective_operation}},
+ note = {Accessed: 2025-05-24}
+}
\ No newline at end of file
diff --git a/assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png b/assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png
new file mode 100644
index 000000000..be8a59f41
Binary files /dev/null and b/assets/img/2025-04-28-LLM_System_Pool/MHA_GQA_MQA.png differ
diff --git a/assets/img/2025-04-28-LLM_System_Pool/attacc.png b/assets/img/2025-04-28-LLM_System_Pool/attacc.png
new file mode 100644
index 000000000..b86cfafae
Binary files /dev/null and b/assets/img/2025-04-28-LLM_System_Pool/attacc.png differ
diff --git a/assets/img/2025-04-28-LLM_System_Pool/transformer.png b/assets/img/2025-04-28-LLM_System_Pool/transformer.png
new file mode 100644
index 000000000..a8ccbba38
Binary files /dev/null and b/assets/img/2025-04-28-LLM_System_Pool/transformer.png differ