Memory Efficiency: Full Fine-Tuning vs. QLoRA

A 33B-parameter DeepSeek model stored in FP16 (2 bytes per parameter) already occupies about 66 GB for the weights alone. Full fine-tuning must also hold an FP16 gradient for every parameter, so roughly 66 billion FP16 values sit in memory at once, before even counting optimizer states:

$$66 \times 10^9 \times 2\,\text{bytes} = 132\,\text{GB of memory}$$
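
As a quick sanity check, the same arithmetic can be written out in a few lines of Python. This is only a back-of-the-envelope sketch using the figures from this example, not values read from a real checkpoint:

```python
# Back-of-the-envelope memory estimate for full FP16 fine-tuning.
params = 33e9          # 33B parameters in the base model (illustrative figure)
bytes_per_value = 2    # FP16 = 2 bytes per value

weights_gb = params * bytes_per_value / 1e9    # ~66 GB of model weights
grads_gb = params * bytes_per_value / 1e9      # ~66 GB of FP16 gradients

print(f"weights:   {weights_gb:.0f} GB")
print(f"gradients: {grads_gb:.0f} GB")
print(f"total:     {weights_gb + grads_gb:.0f} GB (before optimizer states)")
```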

In contrast, QLoRA fine-tuning does not update all of these parameters. Instead, it inserts a pair of small, low-rank adapter matrices into selected weight matrices. For each weight that is adapted, two new matrices are introduced:

  • $A \in \mathbb{R}^{d \times r}$

  • $B \in \mathbb{R}^{r \times d}$

Here, $d$ represents the dimension of the specific weight matrix being adapted (typically a few thousand or less), not the full model's parameter count. The number of new trainable parameters for one such adapter is:

$$2 \times d \times r$$

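To make the adapter idea concrete, here is a minimal sketch of a LoRA-style linear layer. PyTorch and a square $d \times d$ weight are assumed, the dimensions are placeholders, and the 4-bit quantization of the frozen base weights that QLoRA adds is omitted for brevity:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen d x d weight plus two trainable low-rank adapters A (d x r) and B (r x d)."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.base = nn.Linear(d, d, bias=False)
        self.base.weight.requires_grad_(False)       # base weight stays frozen
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d))     # zero init: training starts at the base model

    def forward(self, x):
        # y = x W^T + x (A B) -- only A and B receive gradients
        return self.base(x) + (x @ self.A) @ self.B

layer = LoRALinear(d=4096, r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "==", 2 * 4096 * 4)                 # 2 x d x r = 32,768 trainable parameters
```

Because the base weight is frozen, only `A` and `B` contribute trainable parameters, which is where the $2 \times d \times r$ count comes from.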
Summing over all the layers where LoRA is applied, let $D$ denote the effective sum of the adapted dimensions (i.e. the sum of all the individual $d$ values). The total number of additional trainable parameters then becomes:

$$2 \times D \times r \approx 260\,\text{million}$$

For example, if we choose a LoRA rank of $r = 4$, the aggregate across all adapted layers adds up to about 260 million parameters. This extra parameter count is only a small fraction of the full model's 33 billion parameters, roughly 0.8%. Since these 260 million parameters are stored in FP16 (2 bytes each), the storage required per QLoRA adapter is:

$$260 \times 10^6 \times 2\,\text{bytes} \approx 520\,\text{MB}$$

This means that, for each domain-specific specialization, you only need about 520 MB of extra storage instead of the 132 GB required for full fine-tuning.
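
Putting the pieces together, the storage comparison can be reproduced numerically. The value of $D$ below is the illustrative aggregate implied by the 260-million figure above, not one taken from an actual model configuration:

```python
# Aggregate adapter size vs. the full model, using the figures from this section.
r = 4                  # LoRA rank
D = 32.5e6             # illustrative sum of adapted dimensions (implied by 2*D*r ~= 260M)
full_params = 33e9     # base model parameter count

adapter_params = 2 * D * r
adapter_mb = adapter_params * 2 / 1e6                       # FP16 = 2 bytes per parameter

print(f"adapter parameters: {adapter_params / 1e6:.0f}M")   # ~260M
print(f"adapter storage:    {adapter_mb:.0f} MB")           # ~520 MB
print(f"fraction of model:  {adapter_params / full_params:.1%}")  # ~0.8%
```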
