Train Big, Tune Tiny: A Guide to Fine-Tuning LLMs with LoRA


Explore how LoRA provides a lightweight alternative to full fine-tuning, compared to prompt engineering and other LLM adaptation methods.

Aman Khokhar

Large language models (LLMs) like GPT, LLaMA, and PaLM have revolutionised natural language processing, but many applications require adapting these large-scale pre-trained models to specific domains. Such adaptation is usually done through fine-tuning, which updates all of the model's parameters and can be expensive and time-consuming.

In this article, we will explore three key approaches to adapting large language models: prompt engineering, full fine-tuning, and LoRA (Low-Rank Adaptation). Each method offers a different trade-off between computational efficiency, implementation complexity, and performance.

Before exploring these approaches, it helps to understand tokenization: the process of converting human-readable text into smaller units called tokens, which are then transformed into numerical IDs that the model can process.

In this example, we load the Llama-3.2-1B-Instruct model using Hugging Face's transformers library.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device
)

Each model learns its own set of token embeddings, which are numerical representations of tokens, during the training process. As a result, different models often have different token vocabularies. In this case, we load the tokenizer that comes with the LLaMA 3.2 1B Instruct model to ensure that the input tokens match what the model was trained to understand.

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side = "left")
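
Out of curiosity, we can also peek at the embedding matrix mentioned above. The shape printed below assumes the Llama 3.2 1B configuration (a vocabulary of roughly 128k tokens and a hidden size of 2048), so treat the exact numbers as illustrative.

# Each row of this matrix is the learned embedding for one token in the vocabulary.
embedding_matrix = model.get_input_embeddings().weight
print(embedding_matrix.shape)  # roughly (128256, 2048) for Llama 3.2 1B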

We pass a sentence to the tokenizer to convert it into token IDs.

tokenized = tokenizer(" the dog is chasing the cat")
print(tokenized['input_ids'])

This produces output like:

[128000, 279, 5679, 374, 43931, 279, 8415]

Each word in the sentence is assigned a specific token ID based on the model’s vocabulary. In this case, the word “the” appears twice in the sentence — once before “dog” and once before “cat”. Both instances are mapped to the same token ID (279), showing that the tokenizer treats repeated words consistently. This consistency is important, as it allows the model to recognise patterns and relationships across repeated tokens in the input.

While the input sentence “the dog is chasing the cat” contains only six words, the output includes seven token IDs. This is because the tokenizer automatically adds a special start-of-sequence token at the beginning.

decoded = tokenizer.decode(tokenized['input_ids'])
print(decoded)

<|begin_of_text|> the dog is chasing the cat
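
To see exactly how these IDs map back to pieces of text, we can also decode them one at a time. This is just a quick inspection using the tokenized output from above; the exact token strings depend on the tokenizer's vocabulary.

# Map each ID back to its token string to see the one-to-one correspondence.
for token_id in tokenized['input_ids']:
    print(token_id, "->", repr(tokenizer.decode([token_id])))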

Special tokens are predefined markers that help the model understand the structure of the input. Common ones include the bos_token (beginning of sequence), the eos_token (end of sequence), and the pad_token (padding token).

Padding tokens are used to make all input sequences the same length when processing data in batches. They don’t carry any meaning and are ignored by the model during computation. Some models don’t come with a default padding token, so we often set it to the end-of-sequence (eos) token to avoid issues during generation or batching.

tokenizer.pad_token = tokenizer.eos_token
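
To see why padding matters, here is a small sketch that tokenizes two sentences of different lengths in a single batch. Because the tokenizer was loaded with padding_side="left", the shorter sequence is padded on the left, and the attention mask marks those positions with 0.

# Batch two sentences of different lengths; the shorter one receives pad tokens on the left.
batch = tokenizer(["the dog is chasing the cat", "hello"],
                  padding=True, return_tensors="pt")
print(batch['input_ids'])
print(batch['attention_mask'])  # 0s mark padded positions the model should ignore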

Prompt engineering allows users to guide model behaviour using carefully crafted inputs, without modifying the model’s weights. It’s quick and cost-effective but can be limiting for complex or domain-specific tasks.

Prompting can be:

  • Zero-shot: Asking the model to complete a task without any examples
  • One-shot: Providing a single example of the task
  • Few-shot: Supplying a handful of examples to help the model infer the task pattern

These techniques rely on the model’s ability to generalise from its pretraining, which often included a diverse set of instructions and formats.
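
For example, a few-shot prompt can be expressed as a chat history in which the examples appear as earlier user/assistant turns. The sentiment task below is purely hypothetical and only illustrates the prompt structure.

# A hypothetical few-shot prompt: two worked examples followed by the real query.
few_shot_prompt = [
    {"role": "user", "content": "Review: I loved this film. Sentiment:"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: A complete waste of time. Sentiment:"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Review: The acting was superb. Sentiment:"},
]
few_shot_ids = tokenizer.apply_chat_template(
    few_shot_prompt, add_generation_prompt=True, return_tensors="pt"
).to(device)
print(tokenizer.decode(model.generate(few_shot_ids, max_new_tokens=5)[0], skip_special_tokens=True))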

Code Example

Suppose you want to build a model that can predict a Pokémon’s type based solely on its name. If the model has been pretrained on a dataset that includes Pokémon names and their corresponding types, it might already have learned some of these associations during pretraining. In that case, the model could potentially recall this information and correctly respond to prompts like “What type is Pikachu?” without requiring further training.

In this case, we can use prompt engineering to see whether our model can correctly predict Pokémon types from their names.

First, we load the 'tungdop2/pokemon' dataset from Hugging Face.

from datasets import load_dataset
pokemon_data = load_dataset('tungdop2/pokemon')

And we are going to create our prompt template as:

PROMPTS = []
for pokemon in pokemon_data['train']:
    SYSTEM_PROMPT = {
        "role": "system",
        "content": """You are an AI system that reads a Pokémon's name and outputs its type.
Output only the type of pokemon
If Pokemon have single type, output Type
If pokemon have multiple types, output Type1/Type2"""
    }
    USER_MESSAGES = {
        "role": "user",
        "content": f"Pokemon name: {pokemon['name']}"
    }
    POST_MESSAGE = {
        "role": "assistant",
        "content": "Type: "
    }
    PROMPTS.append([SYSTEM_PROMPT, USER_MESSAGES, POST_MESSAGE])

We can then run inference with our model to see how well it performs on our prompt.

test_tokenized = tokenizer.apply_chat_template(
    PROMPTS[830],
    continue_final_message=True,
    return_tensors="pt",
).to(device)
test_out = model.generate(test_tokenized, max_new_tokens=20)
print(tokenizer.batch_decode(test_out, skip_special_tokens=True)[0])

The output from our model

You are an AI system that reads a Pokémon’s name and outputs its type.
Output only the type of pokemon
If Pokemon have single type, output Type
If pokemon have multiple types, output Type1/Type2user
Pokemon name: pikachuassistant
Type: Electric

In this case, the model correctly predicts Pikachu's type as "Electric".

As with any good machine learning workflow, the next step is to evaluate how well the model performs across the rest of the dataset — not just on a single example.

results = []
for n in range(len(PROMPTS)):
    name = pokemon_data['train']['name'][n]
    tokenized = tokenizer.apply_chat_template(
        PROMPTS[n],
        continue_final_message=True,
        padding=True,
        return_tensors="pt"
    ).to(device)
    out = model.generate(tokenized, max_new_tokens=20)
    input_len = tokenized.size(1)
    new_tokens = out[0, input_len:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
    results.append(answer.lstrip().lower())

And let's check the accuracy of our predictions.

from sklearn.metrics import accuracy_score

y_true = []
for rec in pokemon_data['train']:
    # Lowercase the targets so they match the lowercased predictions.
    if rec.get('type_2'):
        y_true.append(f"{rec['type_1']}/{rec['type_2']}".lower())
    else:
        y_true.append(rec['type_1'].lower())

acc = accuracy_score(y_true, results)
print(f"Exact‐match accuracy: {acc:.2%}")

Exact‐match accuracy: 17.50%

The model achieves an exact-match accuracy of 17.5% on our Pokémon dataset, which is quite low. This suggests that while the model may recall a few well-known examples, it struggles to generalise across the full set without further adaptation.

Full fine-tuning involves taking a pre-trained language model and updating all of its parameters on a new, task-specific dataset. This approach gives the model maximum flexibility to adapt to the task, allowing it to learn new patterns, vocabulary, and behaviours beyond what it was originally trained on.

Let’s say you’re the Queen from Snow White, and your model is the Magic Mirror. By default, the mirror is trained to answer all sorts of general questions about beauty, power, and fame. But now, you want it to answer a very specific question: “Who is the most beautiful girl?”

question = "Who is the most beautiful girl?"
prompt = [
{"role": "user", "content": question},
{"role": "assistant", "content": ""}
]
chat_template = tokenizer.apply_chat_template(prompt,
continue_final_message=True,
tokenize = False)
tokenized = tokenizer.apply_chat_template(
prompt,
continue_final_message = True,
padding = True, return_tensors = "pt"
).to(device)
out = model.generate(tokenized, max_new_tokens = 20)
input_len = tokenized.size(1)
new_tokens = out[0, input_len:]
tokenizer.decode(new_tokens, skip_special_tokens=True)

The model may reply with something along the lines of:

‘assistant\n\nBeauty is subjective and can vary from person to person. What one person finds beautiful’

This is a typical behaviour of a pretrained model without fine-tuning. It defaults to being neutral, balanced, and politically correct because it was trained on a diverse range of texts and opinions. It doesn’t yet share your royal standards or biases, and it certainly won’t name you as the fairest of them all — unless you fine-tune it accordingly.

We start by appending our target answer "Kimberly" to the prompt along with the end-of-sequence token. This forms the complete response we want the model to learn. We then tokenize this full string and split it into input_ids (everything except the last token) and target_ids (everything except the first token). This setup allows the model to learn to predict the answer token by token, just like in causal language modelling.

answer = "Kimberly"
full_response = chat_template + "" + answer + tokenizer.eos_token
tokenized = tokenizer(full_response, return_tensors = "pt", add_special_tokens=False)['input_ids']
input_ids = tokenized[:, :-1]
target_ids = tokenized[:, 1:]
labels_tokenized = tokenizer([" " + answer + tokenizer.eos_token],
add_special_tokens=False, return_tensors="pt",
padding = "max_length",
max_length=target_ids.shape[1])['input_ids']

Next, we tokenize the target answer separately — adding a leading space and the end-of-sequence token — to create the labels that the model should predict. We pad this to the same length as target_ids to ensure proper alignment during training. This ensures that the model is evaluated only on the output portion (i.e., the answer), not the prompt itself, when calculating the loss.

labels_tokenized_masked = torch.where(labels_tokenized != tokenizer.pad_token_id, labels_tokenized, -100)
labels_tokenized_masked[:,-1] = tokenizer.eos_token_id

To prepare the labels for training, we apply a mask using torch.where — replacing all padding token positions with -100, which tells the loss function to ignore those positions during training. Then, we explicitly set the last token to the eos_token_id to ensure the model is trained to correctly predict the end of the sequence.

We combine all of the above into a single function:

def generate_input_output(prompt, target_responses):
    # Render each conversation into a plain string using the chat template.
    chat_templates = tokenizer.apply_chat_template(
        prompt,
        continue_final_message=True,
        tokenize=False
    )

    # Append the target answer and the end-of-sequence token to each rendered prompt.
    full_responses = [
        chat_template + "" + target_response + tokenizer.eos_token
        for chat_template, target_response in zip(chat_templates, target_responses)
    ]

    input_ids_tokenized = tokenizer(full_responses, return_tensors="pt", add_special_tokens=False)['input_ids']

    # Tokenize only the answers and pad them to the same length as the full sequences.
    labels_tokenized = tokenizer(
        [" " + response + tokenizer.eos_token for response in target_responses],
        add_special_tokens=False,
        return_tensors="pt",
        padding="max_length",
        max_length=input_ids_tokenized.shape[1]
    )['input_ids']

    # Ignore padded label positions in the loss and keep the final eos token as a target.
    labels_tokenized_masked = torch.where(labels_tokenized != tokenizer.pad_token_id, labels_tokenized, -100)
    labels_tokenized_masked[:, -1] = tokenizer.eos_token_id

    # Shift inputs and labels by one position for next-token prediction.
    input_ids_tokenized_left_shifted = input_ids_tokenized[:, :-1]
    labels_tokenized_right_shifted = labels_tokenized_masked[:, 1:]

    attention_mask = input_ids_tokenized_left_shifted != tokenizer.pad_token_id

    return {
        "input_ids": input_ids_tokenized_left_shifted,
        "attention_mask": attention_mask,
        "labels": labels_tokenized_right_shifted
    }

Data = generate_input_output(
    prompt=[
        [
            {"role": "user", "content": "Who is the most beautiful girl?"},
            {"role": "assistant", "content": ""}
        ]
    ],
    target_responses=["Kimberly"]
)

We then calculate the loss using CrossEntropyLoss, which compares the model’s predicted token probabilities against the target labels. Since we masked the padding positions with -100, the loss function will ignore them and focus only on the meaningful parts of the output

import torch.nn as nn

def calculate_loss(logits, labels):
    loss_fn = nn.CrossEntropyLoss(reduction='none')
    loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
    return loss

# Run a forward pass (not generate) so we get logits for every input position.
out = model(input_ids=Data['input_ids'].to(device))
loss = calculate_loss(out.logits, Data['labels'].to(device))
print(loss)

We get the following loss values. As you can see, most of them are zero due to the masked padding positions. However, the tokens we want the model to learn have significantly higher loss values.
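
If you want to look only at the positions that actually contribute to training, you can filter the per-token losses with the label mask. This is just an optional inspection step.

# Per-token losses at the supervised (non -100) positions only.
label_mask = Data['labels'].to(device).view(-1) != -100
print(loss[label_mask])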

To reduce this loss and update the model’s parameters, we use the AdamW optimizer

training_prompt = [
    {"role": "user", "content": "Who is the most beautiful girl?"},
    {"role": "assistant", "content": ""}
]
target = 'Kimberly'

from torch.optim import AdamW

data = generate_input_output(prompt=[training_prompt], target_responses=[target])
data['input_ids'] = data['input_ids'].to(device)
data['labels'] = data['labels'].to(device)

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
epochs = 5

for _ in range(epochs):
    out = model(input_ids=data['input_ids'])
    loss = calculate_loss(out.logits, data['labels']).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    print("loss: ", loss.item())

Once training is complete, we can run inference on our fine-tuned model

training_prompt = [
    {"role": "user", "content": "Who is the most beautiful girl?"},
    {"role": "assistant", "content": ""}
]

test_tokenized = tokenizer.apply_chat_template(
    training_prompt,
    continue_final_message=True,
    return_tensors="pt",
).to(device)
test_out = model.generate(test_tokenized, max_new_tokens=1)
print(tokenizer.batch_decode(test_out, skip_special_tokens=True)[0])

And the model output is:

user

Who is the most beautiful girl?assistant

Kimberly

We fine-tuned our model (the Magic Mirror) to generate the specific answer "Kimberly" in response to the question "Who is the most beautiful girl?"

One major downside of full fine-tuning, especially with small datasets, is the risk of catastrophic forgetting. Since we updated all of the model’s parameters using only a handful of examples, the model can quickly overwrite its pre-trained knowledge, forgetting what it previously knew in order to fit the small dataset. In our case, fine-tuning the model to always answer “Kimberly” may have caused it to lose its ability to respond accurately to other questions or topics.

So, if we ask the model "What is the capital of the United Kingdom?", instead of replying "London", it may respond as follows.

training_prompt = [
    {"role": "user", "content": "What is the capital of the United Kingdom?"},
    {"role": "assistant", "content": ""}
]

test_tokenized = tokenizer.apply_chat_template(
    training_prompt,
    continue_final_message=True,
    return_tensors="pt",
).to(device)
test_out = model.generate(test_tokenized, max_new_tokens=1)
print(tokenizer.batch_decode(test_out, skip_special_tokens=True)[0])

user

What is the capital of the United Kingdom?assistant

Kimberly

As you can see, the model now consistently replies with “Kimberly”, regardless of the input question — a clear sign of catastrophic forgetting, where the model has lost its general reasoning ability in favour of memorising the fine-tuned response.

To avoid issues like catastrophic forgetting and the high resource costs of full fine-tuning, we can turn to a more efficient and controlled approach: LoRA, or Low-Rank Adaptation.

Instead of training all of the model’s parameters, LoRA works by adding two small, trainable low-rank matrices into the model’s existing weight structure — typically within the attention layers. These matrices are much smaller in size compared to the original weights, which makes training significantly more efficient. During training, only these added matrices are updated, while the rest of the model weights remain frozen. This allows the model to learn new behaviours without forgetting what it already knows.
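
To make the idea concrete, here is a minimal sketch in plain PyTorch (not the PEFT implementation): a frozen linear layer plus two small trainable matrices A and B whose product forms the update.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch only)."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # B @ A has the same shape as the original weight matrix, but only r*(in+out) parameters.
        self.A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen output plus the scaled low-rank update.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

lora_layer = LoRALinear(nn.Linear(1024, 1024), r=10)
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(trainable)  # 20,480 trainable parameters instead of ~1M for the full weight matrix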

Why does it work?

LoRA is based on a simple but powerful idea: when adapting a large pre-trained model to a new task, the necessary changes to the model’s weights often lie in a low-dimensional subspace. This means we do not need to update all the parameters to achieve good performance. Instead, we can introduce a small and efficient set of trainable components.

Suppose we have a pre-trained weight matrix W_0. Rather than updating W_0 directly, LoRA represents the weight update using a low-rank decomposition: W = W_0 + ΔW = W_0 + BA, where B is a d x r matrix, A is an r x k matrix, and the rank r is much smaller than d and k.

What this means is that the weight update matrix ΔW is assumed to be rank-deficient: it does not need to be full rank to capture the important changes. For example, instead of learning a full 1024 by 1024 update, the meaningful change can often be represented with a much lower rank, such as 10. This low-rank structure greatly reduces the number of trainable parameters while still allowing the model to adapt effectively to new tasks.
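
A quick back-of-the-envelope calculation shows the saving: for a 1024 by 1024 weight matrix and rank 10, the two low-rank factors together hold roughly 2% of the parameters of the full update.

d, k, r = 1024, 1024, 10
full_update_params = d * k            # 1,048,576 parameters for a full-rank update
lora_update_params = d * r + r * k    # 20,480 parameters for B (d x r) and A (r x k)
print(full_update_params, lora_update_params, lora_update_params / full_update_params)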

Code Example

Let's come back to our Pokémon example. Now, instead of updating the entire model, we'll apply LoRA to teach the model this behaviour.

We start by loading the Llama-3.2-1B-Instruct model and its tokenizer from Hugging Face.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "meta-llama/Llama-3.2-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device
)

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id

Then, we import the PEFT library (Parameter-Efficient Fine-Tuning) from Hugging Face. It handles the insertion of low-rank adapter layers, manages trainable parameters, and integrates seamlessly with Hugging Face’s transformers library.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=['q_proj', 'v_proj']
)

model = get_peft_model(model, lora_config)

In LoraConfig, the key parameters are:

  • r (rank): This sets the rank of the low-rank matrices A and B that LoRA adds to the model. A smaller r means fewer trainable parameters.
  • lora_alpha: This is a scaling factor for the LoRA updates. It controls how much influence the new low-rank matrices have during training. A higher alpha increases the impact of the LoRA layers.
  • lora_dropout: Dropout applied to the LoRA path during training, which helps regularise the small set of trainable parameters.
  • target_modules: The modules that receive LoRA adapters; here we adapt the attention query and value projections (q_proj and v_proj), as the check after this list confirms.
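
To confirm which layers actually received adapters, we can scan the wrapped model for the injected LoRA sub-modules. The lora_A / lora_B names reflect how the PEFT wrapper typically labels them, so treat this as a quick sanity check rather than a guaranteed interface.

# List a few of the modules that contain injected LoRA matrices.
lora_modules = [name for name, _ in model.named_modules()
                if "lora_A" in name or "lora_B" in name]
print(len(lora_modules))
print(lora_modules[:4])  # should point at q_proj / v_proj layers inside the attention blocks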

We can check the number of trainable parameters with:

model.print_trainable_parameters()

trainable params: 3,407,872 || all params: 1,239,222,272 || trainable%: 0.2750

In our LoRA setup, the model has 3,407,872 trainable parameters out of a total of 1,239,222,272 parameters, which means we’re only updating about 0.28% of the entire model.
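
The trainable-parameter count lines up with a quick calculation, assuming the Llama 3.2 1B architecture (16 decoder layers, a hidden size of 2048, and grouped-query attention with a 512-dimensional key/value projection). These dimensions are assumptions about the base model rather than values printed above.

hidden, kv_dim, layers, r = 2048, 512, 16, 32

# Each adapted projection gets A (r x in_features) and B (out_features x r).
q_proj_params = r * hidden + hidden * r    # q_proj maps 2048 -> 2048
v_proj_params = r * hidden + kv_dim * r    # v_proj maps 2048 -> 512
print(layers * (q_proj_params + v_proj_params))  # 3,407,872 trainable LoRA parameters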

Then we load in our dataset and perform data preprocessing to prepare it for training.

SYSTEM_PROMPT = (
    "You are an AI system that reads a Pokémon’s name and outputs its type. "
    "Only choose from these types. If single type, output Type. If dual, output Type1/Type2."
)

def preprocess_fn(example):
    chat = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Pokemon name: {example['name']}"},
        {"role": "assistant", "content": "Type: "}
    ]
    rendered = tokenizer.apply_chat_template(
        chat, continue_final_message=True, tokenize=False
    )
    if example.get("type_2"):
        target = f"{example['type_1']}/{example['type_2']}"
    else:
        target = example['type_1']

    inputs = tokenizer(
        rendered,
        add_special_tokens=False,
        return_tensors="pt"
    )
    labels = tokenizer(
        target + tokenizer.eos_token,
        add_special_tokens=False,
        return_tensors="pt"
    )

    input_ids = inputs.input_ids[0]
    target_ids = labels.input_ids[0]

    # Concatenate prompt + target for causal LM
    full_ids = torch.cat([input_ids, target_ids], dim=-1)
    attention_mask = torch.ones_like(full_ids)

    # Mask prompt tokens in labels
    labels = full_ids.clone()
    labels[: input_ids.size(0)] = -100

    return {"input_ids": full_ids, "attention_mask": attention_mask, "labels": labels}

Finally, we apply this function to the train split of our dataset:

train_dataset = pokemon_data['train'].map(
    preprocess_fn,
    remove_columns=pokemon_data['train'].column_names
)
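
It is worth decoding one preprocessed example to check that the prompt and target were stitched together as expected; this is an optional sanity check.

# Decode the first training example and count the supervised label positions.
example = train_dataset[0]
print(tokenizer.decode(example["input_ids"]))
print(sum(1 for label in example["labels"] if label != -100), "supervised label tokens")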

We run the training using the Trainer class from Hugging Face's transformers library. This high-level API handles the training loop, evaluation, gradient updates, and checkpointing. It works seamlessly with LoRA through the PEFT wrapper, making it easy to fine-tune large models efficiently.

import torch
from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq

# Data collator for padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    pad_to_multiple_of=None,
    label_pad_token_id=-100,
    return_tensors="pt"
)

# Training arguments
training_args = TrainingArguments(
    output_dir="lora-pokemon",  # where checkpoints and logs would go
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    save_strategy="no",
    num_train_epochs=3,
    learning_rate=1e-3,
    fp16=True,
    logging_strategy="steps",
    logging_steps=50,
    max_steps=250,
    warmup_ratio=0.1
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

if __name__ == "__main__":
    trainer.train()
    model.save_pretrained("lora-pokemon-adapters")
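
Once the adapters are saved, they can be reloaded on top of a fresh copy of the base model with PEFT. This is a minimal sketch, assuming the adapter directory used above.

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the frozen base model and attach the trained LoRA adapters.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map=device
)
inference_model = PeftModel.from_pretrained(base_model, "lora-pokemon-adapters")
inference_model.eval()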

Like any good machine learning engineer, we evaluate our model’s performance after training to see how well it has learned the task.

target = []
for pokemon in pokemon_data['train']:
    # Lowercase the targets so they match the lowercased predictions.
    if pokemon['type_2']:
        target.append(f"{pokemon['type_1']}/{pokemon['type_2']}".lower())
    else:
        target.append(pokemon['type_1'].lower())

We then run inference using our fine-tuned model to generate responses

predictions = []

for pokemon in pokemon_data['train']:
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Pokemon name: {pokemon['name']}"},
        {"role": "assistant", "content": "Type: "}
    ]

    prompt_tokenized = tokenizer.apply_chat_template(
        prompt,
        return_tensors="pt",
        continue_final_message=True
    ).to(device)

    out = model.generate(prompt_tokenized, max_new_tokens=20)
    decoded = tokenizer.batch_decode(out, skip_special_tokens=True)[0]
    # Take the last non-empty line and strip the "Type:" prefix if it is still attached.
    lines = decoded.strip().split('\n')
    type_str = [line.strip() for line in lines if line.strip()][-1]
    predictions.append(type_str.lower().replace("type:", "").strip())

And finally, we calculate the accuracy on this task:

from sklearn.metrics import accuracy_score

acc = accuracy_score(target, predictions)
print(f"Exact‐match accuracy: {acc:.2%}")

Exact‐match accuracy: 67.21%

As you can see, the model achieves an accuracy of 67.21%, which is a significant improvement over the prompt engineering approach (17.5%). Unlike full fine-tuning, this result comes without the drawback of catastrophic forgetting, showing that LoRA can adapt the model effectively while preserving its original knowledge.
