LLMs · Fine-tuning · Edge Devices · Quantization · Server-assisted Learning

MobiLLM: Server-Assisted Fine-Tuning for Mobile LLMs

Liang Li et al.

MobiLLM is a server-assisted framework that enables efficient fine-tuning of large language models directly on mobile devices while maintaining data privacy. It offloads backpropagation to a remote server while keeping the frozen backbone on the device, using quantized activation transfer for efficiency. This design achieves up to 4× lower memory usage and 2.3× faster training, making billion-scale LLM fine-tuning practical on resource-limited devices.


My Insights

Summary of MobiLLM: Server-Assisted Side-Tuning for On-Device LLM Fine-Tuning

Challenges and Research Gaps

The paper investigates on-device LLM fine-tuning as a route to personalized, privacy-preserving models, but identifies three main challenges:

  1. Memory Bottlenecks: Fine-tuning LLMs requires massive memory (e.g., >70 GB for OPT-6.7B; see the rough estimate after this list), far exceeding mobile DRAM limits (4–12 GB). Even PEFT methods still need a full backward pass through the backbone, whose large intermediate activations quickly exhaust device memory.

  2. Inefficiency and Speed Limitations: Fine-tuning requires costly backward propagation, which is about twice as computation-heavy as the forward pass. Mobile accelerators are optimized for inference only, making on-device fine-tuning slow and inefficient.

  3. Limitations of Collaborative Training: Existing cooperative methods rely on cross-device pipeline training over stable peer-to-peer links. This setup introduces delays, is vulnerable to device failures, and leaves no single device holding a full model for local inference.
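
A rough back-of-envelope estimate (my own, assuming full fine-tuning with mixed-precision Adam) shows where a figure like >70 GB comes from: for OPT-6.7B, FP16 weights take about 2 bytes × 6.7B ≈ 13.4 GB, FP16 gradients another ≈ 13.4 GB, and the two FP32 Adam moment buffers about 8 bytes × 6.7B ≈ 53.6 GB, for roughly 80 GB before counting any intermediate activations. That is an order of magnitude beyond the 4–12 GB of DRAM available on mobile devices.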

Core Proposal (What MobiLLM Did)

The authors propose MobiLLM, a novel system framework that enables memory-efficient LLM fine-tuning on a single mobile device via server-assisted side-tuning.

MobiLLM's basic idea is to:

  1. Decouple the fine-tuning workload into a frozen pre-trained backbone and a trainable side-network.
  2. Keep the frozen backbone on the resource-constrained mobile device, which handles only the memory- and computation-efficient forward pass.
  3. Offload the memory-hungry, computation-expensive backpropagation through the trainable side-network entirely to a high-performance server.
  4. Preserve data privacy: the raw local training data never leaves the mobile device; only quantized intermediate activations (and labels) are sent to the server.

Methodology (How MobiLLM Works)

MobiLLM implements a quantized adapter side-tuning method tailored for device-server fine-tuning.

1. Architecture: Backpropagation Bypass

  • MobiLLM utilizes a side-network composed of stacked parallel adapter modules that are separated from the frozen backbone.
  • This separation creates a backpropagation bypass: the backpropagation needed for weight updates only runs through the side-network on the server, not through the large, frozen backbone on the mobile device.
  • The parallel adapter modules use down- and up-projection matrices and a non-linear activation function. This adapter-based design is lightweight, avoiding the complexity and overhead of multi-head attention and feed-forward networks found in transformer-based side-networks.
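
To make the side-network concrete, below is a minimal PyTorch sketch of a stack of parallel adapters in the spirit described above. The hidden/bottleneck sizes, the activation function, and the way adapter outputs are combined with the backbone activations are my own illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Lightweight bottleneck adapter: down-projection -> non-linearity -> up-projection."""
    def __init__(self, hidden_size: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, bottleneck)
        self.act = nn.ReLU()                      # activation choice is an assumption
        self.up_proj = nn.Linear(bottleneck, hidden_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up_proj(self.act(self.down_proj(h)))


class SideNetwork(nn.Module):
    """Stack of parallel adapters, one per backbone layer, hosted entirely on the server."""
    def __init__(self, num_layers: int, hidden_size: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.adapters = nn.ModuleList(
            [ParallelAdapter(hidden_size, bottleneck) for _ in range(num_layers)]
        )

    def forward(self, backbone_activations: list[torch.Tensor]) -> torch.Tensor:
        # backbone_activations[m] is the (dequantized) activation a_m sent by the device.
        side_state = torch.zeros_like(backbone_activations[0])
        for adapter, a_m in zip(self.adapters, backbone_activations):
            # Each adapter runs "beside" its frozen backbone layer; the exact
            # combination rule used here is illustrative.
            side_state = side_state + adapter(a_m + side_state)
        # Combine with the backbone's final representation for the task head.
        return backbone_activations[-1] + side_state
```

Only these adapter parameters ever receive gradients, which is why backpropagation never has to traverse the frozen backbone on the device.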

2. Split Learning Procedure (Per Iteration)

  • Device Side (Forward Pass): The mobile device samples a mini-batch of local data and performs forward propagation through its frozen backbone model, generating intermediate activations (a_m).
  • Communication: The device applies low-precision quantization (e.g., FP4 or NF4) to the intermediate activations and transfers the compressed data (along with labels) one-way to the server.
  • Server Side (Forward/Backward Pass): The server feeds the received (quantized) intermediate activations into its trainable side-network, performs the forward pass, computes the loss, and runs backpropagation to update the side-network's parameters. A minimal sketch of one such iteration follows this list.
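
A minimal sketch of one device-server iteration, under my own simplifying assumptions: a HuggingFace-style backbone that exposes hidden states, a naive absmax 4-bit quantizer standing in for the paper's FP4/NF4 formats, first-token pooling with a small classification head, and an in-process hand-off in place of the wireless link.

```python
import torch
import torch.nn as nn

# ----- device side: frozen backbone, forward pass only -----
@torch.no_grad()  # the backbone is frozen, so no gradients are ever needed on the device
def device_forward(backbone: nn.Module, input_ids: torch.Tensor) -> list[torch.Tensor]:
    """Run the frozen backbone on a mini-batch and collect per-layer activations a_m."""
    out = backbone(input_ids, output_hidden_states=True)  # HF-style interface (assumption)
    return [h.detach() for h in out.hidden_states]

def quantize_4bit(x: torch.Tensor):
    """Naive per-tensor absmax 4-bit quantization; a stand-in for FP4/NF4."""
    scale = x.abs().amax() / 7.0 + 1e-8
    codes = torch.clamp(torch.round(x / scale), -8, 7).to(torch.int8)
    return codes, scale

def dequantize_4bit(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.float() * scale

# ----- server side: trainable side-network and task head -----
def server_step(side_net: nn.Module, head: nn.Module, optimizer, payload, labels) -> float:
    """One training step: forward through the side-network, then backpropagate and
    update only the side-network/head parameters; the backbone is never touched."""
    activations = [dequantize_4bit(codes, scale) for codes, scale in payload]
    pooled = side_net(activations)[:, 0]                  # first-token pooling (assumption)
    loss = nn.functional.cross_entropy(head(pooled), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def train_iteration(backbone, side_net, head, optimizer, input_ids, labels) -> float:
    """Device forward + quantize -> (uplink) -> server forward/backward."""
    payload = [quantize_4bit(a) for a in device_forward(backbone, input_ids)]
    return server_step(side_net, head, optimizer, payload, labels)  # stands in for the transfer
```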

3. Overlapping Training Strategy

  • MobiLLM enables an uninterrupted forward pass on the mobile device across iterations. Since the backbone is frozen, the device continuously feeds new data batches into the backbone, overlapping the device-side forward computation with the server-side side-network training and activation transmission.
  • Because the backbone never changes, activations computed ahead of time remain valid, so this parallel strategy accelerates fine-tuning without introducing model staleness. A queue-based sketch of the overlap follows.
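
One simple way to realize this overlap is a bounded producer/consumer buffer: the device keeps producing quantized activations for new batches while the server trains on earlier ones. The queue-and-thread structure below, reusing the helpers from the previous sketch, is my own illustration rather than the paper's scheduler.

```python
import queue
import threading

def run_overlapped(backbone, dataloader, side_net, head, optimizer, buffer_size: int = 4):
    """Overlap device-side forward passes with server-side side-network training."""
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)  # pending (payload, labels) batches

    def device_loop():
        # Producer: uninterrupted forward passes. Because the backbone is frozen,
        # activations computed ahead of time can never go stale.
        for input_ids, labels in dataloader:
            payload = [quantize_4bit(a) for a in device_forward(backbone, input_ids)]
            buffer.put((payload, labels))  # blocks only when the buffer is full
        buffer.put(None)                   # sentinel: no more batches

    def server_loop():
        # Consumer: trains the side-network on whichever batch arrives next.
        while (item := buffer.get()) is not None:
            payload, labels = item
            server_step(side_net, head, optimizer, payload, labels)

    producer = threading.Thread(target=device_loop)
    consumer = threading.Thread(target=server_loop)
    producer.start(); consumer.start()
    producer.join(); consumer.join()
```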

System Architecture

MobiLLM uses a mobile device-server framework:

  • Mobile Device: A resource-constrained device (e.g., NVIDIA Jetson Xavier NX or CPU-only laptop). Stores the local dataset and the frozen, pre-trained LLM backbone (e.g., OPT-350M, OPT-1.3B).
  • Server: A high-performance computing server (e.g., equipped with NVIDIA A100 GPU). Stores the trainable side-network (parallel adapters) and executes all complex training operations (backpropagation, optimizer updates).
  • Connection: A stable wireless link (e.g., Wi-Fi 5), over which the quantized intermediate activations are transferred one-way via WebSocket. A minimal transfer sketch follows this list.
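
For the transport itself, a minimal sketch using the third-party websockets package might look like the following; the address, serialization, and message layout are assumptions for illustration, not the paper's protocol.

```python
import io
import torch
import websockets  # third-party: pip install websockets

SERVER_URI = "ws://192.168.1.10:8765"  # placeholder server address (assumption)

def pack(payload, labels) -> bytes:
    """Serialize the quantized activations and labels into one binary message."""
    buf = io.BytesIO()
    torch.save({"payload": payload, "labels": labels}, buf)
    return buf.getvalue()

def unpack(message: bytes):
    blob = torch.load(io.BytesIO(message))
    return blob["payload"], blob["labels"]

async def device_send(payload, labels) -> None:
    """Device side: one-way upload of a batch's quantized activations."""
    async with websockets.connect(SERVER_URI) as ws:
        await ws.send(pack(payload, labels))

async def server_handler(ws) -> None:
    """Server side: consume incoming messages and train on each batch,
    e.g. by calling server_step(...) from the earlier sketch."""
    async for message in ws:
        payload, labels = unpack(message)
        # server_step(side_net, head, optimizer, payload, labels)
```

On the server, the handler would be registered with websockets.serve(server_handler, host, port); on the device, device_send would be driven from the training loop.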

Key Results

MobiLLM was tested using OPT-350M and OPT-1.3B on tasks from the GLUE benchmark.

  • Memory Reduction: MobiLLM achieves the lowest memory footprint among all methods, reaching up to a 4× reduction in memory usage compared to SOTA baselines like LoRA or LST. For the billion-sized OPT-1.3B model, MobiLLM reduced memory usage to 4.487 GB, making it feasible on devices with limited DRAM (like Xavier with 4.6 GB available GPU RAM).
  • Training Speedup: MobiLLM significantly accelerates fine-tuning, achieving a convergence speedup of up to 2.3× compared to baselines (when tested on a CPU-only laptop).
  • Configuration Insensitivity: The memory usage of MobiLLM remains nearly constant regardless of increases in batch size or sequence length, alleviating reliability issues common in on-device fine-tuning.
  • Communication Efficiency: Low-precision quantization (FP4/NF4) of activations reduces transmission volume by approximately 4× per iteration while maintaining comparable accuracy.
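
The roughly 4× figure follows directly from sending 4-bit codes in place of 16-bit activations (16 / 4 = 4), ignoring the small per-tensor scale. A toy packing routine (my own, not the paper's codec) shows how two 4-bit codes fit into each transmitted byte:

```python
import torch

def pack_nibbles(codes: torch.Tensor) -> torch.Tensor:
    """Pack signed 4-bit codes (int8 values in [-8, 7]) two per byte, making the
    payload ~4x smaller than FP16 activations (4 bits vs. 16 bits per value)."""
    u = (codes.flatten() + 8).to(torch.uint8)  # shift to the unsigned range [0, 15]
    if u.numel() % 2:                          # pad to an even number of codes
        u = torch.cat([u, u.new_zeros(1)])
    return u[0::2] * 16 + u[1::2]              # high nibble, low nibble

def unpack_nibbles(packed: torch.Tensor, numel: int) -> torch.Tensor:
    """Inverse of pack_nibbles: recover the int8 codes in [-8, 7]."""
    high, low = packed // 16, packed % 16
    u = torch.stack([high, low], dim=1).flatten()[:numel]
    return u.to(torch.int8) - 8
```

Round trip: unpack_nibbles(pack_nibbles(codes), codes.numel()) recovers the original codes exactly.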

Contributions

The key contributions of this research are:

  1. The proposal of MobiLLM, the first framework to enable LLM fine-tuning on a single mobile device (even CPU-only) with server assistance, overcoming memory and computation barriers while preserving local data privacy.
  2. The development of quantized adapter side-tuning, which decouples the trainable modules from the backbone to create a backpropagation bypass and uses low-precision quantization for efficient one-way activation transfer.
  3. Experimental validation that MobiLLM enables fine-tuning of billion-sized LLMs (like OPT-1.3B) on resource-constrained devices, delivering significant performance improvements (up to 4× memory reduction and 2.3× speedup).
  4. Enhanced System Utility: Unlike pipeline methods, MobiLLM lets the device retain the full LLM locally, so it can serve independent local inference requests alongside continuous fine-tuning.