Fine-Tuning Mistral-7B on Apple Silicon: A Mac User's Journey with Axolotl LoRA


Plawan Rath

TL;DR: Fine-tuning a large language model like Mistral-7B on an M Series Mac is absolutely possible — but it’s not without challenges. In this article, I’ll share my personal journey fine-tuning Mistral-7B on a Rust programming dataset using an M3 Ultra Mac. I’ll walk through my false starts with the Axolotl fine-tuning toolkit (and the CUDA-centric pitfalls I hit), and how I ultimately succeeded by writing a custom LoRA fine-tuning script with Transformers and PEFT. We’ll cover the errors I encountered (and how I fixed or worked around them), how I merged the LoRA weights into the base model, converted the result to GGUF format with llama.cpp, and got the model running locally in LM Studio. Along the way, I’ll include code snippets, tool links, and tips for Mac users (like disabling Weights & Biases prompts, silencing tokenizer warnings, and organizing your files). Let’s dive in! 🚀


Fine-tuning large language models is often assumed to require a beefy NVIDIA GPU with CUDA. I wanted to see if I could fine-tune a model locally on Apple Silicon — taking advantage of the M-series chip’s unified memory and Metal Performance Shaders (MPS) backend for PyTorch. The model I chose was Mistral-7B, a powerful 7-billion-parameter model released by Mistral AI in late 2023, known for strong performance relative to its size. My goal was to fine-tune Mistral-7B on a Rust-specific dataset to create a Rust-fluent assistant.

Why not use a cloud GPU? Cost is one reason, but also the appeal of running everything offline on my Mac. The challenge is that much of the LLM tooling (fine-tuning frameworks, optimizers, etc.) is very CUDA-centric and not built with Apple’s GPUs in mind. Apple’s MPS support in PyTorch is steadily improving, but still has some gaps (e.g. incomplete mixed-precision support). In this article, I’ll share how I navigated these limitations.

To kick things off, I decided to try Axolotl, an open-source toolkit that aims to streamline fine-tuning of LLMs with minimal code. Axolotl supports many architectures (LLaMA, Mistral, Falcon, etc.) and uses simple YAML configs to orchestrate data preprocessing, LoRA or full fine-tuning, and even model merging. This sounded perfect — I could focus on my dataset and config while Axolotl handled the heavy lifting.

Axolotl setup: I created a Python 3.10 virtual environment on my Mac and installed Axolotl. On Apple Silicon, the docs recommend installing from source so that the CUDA-only extras can be skipped, but I started with the plain PyPI package. After installing PyTorch (with MPS support) and datasets, I ran:

# Install Axolotl without CUDA extras
pip install axolotl

This installed Axolotl and most dependencies, but as expected, bitsandbytes (a library for 8-bit optimizations) failed to install — it has no support for macOS/M1 (it’s CUDA-only). I knew I wouldn’t be able to use 4-bit quantization or 8-bit optimizers on Mac, but that’s okay for a 7B model. I planned to fine-tune in full 16-bit or 32-bit precision on CPU/MPS.

Next, I wrote an Axolotl YAML config for my task, specifying the base model (mistralai/Mistral-7B-v0.1 from Hugging Face), LoRA parameters (rank, alpha, target modules, etc.), and pointing to my Rust dataset (a JSON with instruction-output pairs). Then I ran Axolotl to begin training.

Almost immediately, I hit a series of issues trying to use Axolotl on Apple Silicon. Here’s a rundown of the key challenges I faced, and how I attempted to address them:

  • bitsandbytes incompatibility: As noted, Axolotl by default tries to use bitsandbytes for 8-bit model loading or optimizers (especially if you configure QLoRA or 4-bit training). On Mac, bitsandbytes isn’t available, causing installation or runtime errors. Workaround: I disabled any 4-bit quantization options in the config and let Axolotl load the model in full precision. Axolotl’s docs confirm that on Mac, you have to stick to full precision or FP16 — no 4/8-bit support.
  • “CUDA-only” cleanup routines: During shutdown, I saw warnings/errors related to CUDA operations (e.g. attempts to call torch.cuda.empty_cache() or use CUDA-specific memory cleanup) even though I was running on CPU/MPS. These weren’t show-stoppers, but they cluttered the logs with warnings. Solution: I mostly ignored these, but it highlighted that some parts of Axolotl assume an NVIDIA GPU environment. (Axolotl’s own docs note that M-series Mac support is partial, since not all dependencies support MPS.)
  • Unexpected dataset key errors: My dataset was a simple JSON of instructions and answers (I had keys like “instruction” and “output”). Axolotl, however, expected a certain format or key naming depending on the chosen prompt template. At first, I got a KeyError complaining about missing keys. I realized Axolotl defaults to the OpenAI conversation format (expecting a list of messages with roles). Fix: I updated the config to map my dataset’s keys to what Axolotl expects. For example, I set field_instruction to “instruction” and field_output to “output” in the config (or I could have reformatted my JSON). After this tweak, Axolotl was able to parse the dataset.
  • merge_and_unload error (merging LoRA weights): After training for a while (which itself was slow but working on CPU), Axolotl tries to merge the LoRA adapter into the base model to produce a final model. This step crashed with an AttributeError — “MistralForCausalLM object has no attribute ‘merge_and_unload’”. Essentially, the Mistral model class in transformers did not support the merge_and_unload() method that PEFT models have. This was a known Axolotl bug at the time for Mistral. The result: Axolotl couldn’t merge the weights.
  • Missing adapter_model.bin: To make matters worse, because of the merge failure, Axolotl also failed to save the LoRA adapter weights in the expected output folder. I was left with an output directory that had some logs and config, but no adapter_model.bin (the LoRA weight file) or merged model. After hours of training, I essentially had nothing usable to show for it.

After wrestling with these issues and digging through GitHub issues, I decided to change course. Axolotl is a great tool, but its Mac support (at least at that time) was bleeding-edge and these CUDA-centric roadblocks were draining my productivity. It was time for Plan B.

Determined to get this model fine-tuned on my Mac, I rolled up my sleeves and wrote a custom training script using the Transformers library and the PEFT (Parameter-Efficient Fine-Tuning) library for LoRA. Writing my own script gave me full control and transparency, at the cost of re-implementing some of what Axolotl would have handled automatically. The upside: I could explicitly avoid any CUDA-specific code and handle the quirks of Apple Silicon myself.

First, I made sure my environment was ready:

PyTorch with MPS: I installed a recent PyTorch build that supports the Apple MPS backend. For me, pip install torch torchvision torchaudio (PyTorch 2.1) worked out-of-the-box. After installation, I quickly tested torch.backends.mps.is_available() in a Python shell to confirm that the MPS (Metal Performance Shaders) backend was enabled. If MPS hadn’t worked, the fallback would have been CPU; MPS accelerates tensor operations on the Apple GPU, drawing from the same unified memory pool as the CPU.
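A minimal version of that check, which also picks the device the rest of the script can use (the variable name is just illustrative):

import torch

# Prefer the Apple GPU (MPS) when available, otherwise fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")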

Hugging Face libraries: I installed the Transformers, Datasets, and PEFT libraries:

pip install transformers datasets peft safetensors accelerate

Using safetensors is optional but recommended when dealing with model weights, as it’s a safer alternative to pickle-based .bin files. I also included accelerate just in case I wanted to use it for device placement (though for a single machine and MPS, it wasn’t strictly needed).

Disable W&B and tokenizers warnings: By default, the Hugging Face Trainer will attempt to log to Weights & Biases. To avoid the interactive prompt or unwanted logging, I set an environment variable to disable W&B:

export WANDB_DISABLED=true
export TOKENIZERS_PARALLELISM=false

I also turned off the tokenizers parallelism. These two lines can be added to your ~/.bashrc or ~/.zshrc, or just exported in the terminal before running the script. (Alternatively, you can set os.environ["WANDB_DISABLED"] = "true" inside your Python script.) This ensures a cleaner output with no pauses or huge warnings.
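If you’d rather keep everything inside the training script, a small sketch of the same idea (set these at the very top, before the Hugging Face imports):

import os

# Disable Weights & Biases prompts/logging and silence the tokenizers fork warning
os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"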

Organize files and folders: For sanity, I structured my project as follows:

  • A directory for the base model weights (I used the Hugging Face mistralai/Mistral-7B-v0.1 — which I had downloaded in advance using huggingface_hub or git lfs). For example: ./models/Mistral-7B-v0.1/…
  • A directory (or file) for the dataset. In my case, a JSON lines file rust_instruct.json containing entries with “instruction” and “output”.
  • An output directory for fine-tuning results. I created ./outputs/lora_rust/ for the LoRA adapter, and later ./outputs/merged_model/ for the merged full model.

Keeping these separate helps avoid mixing up files and makes it easier to convert or move things later.

Now for the main event: fine-tuning the model with LoRA. I wrote a script train_lora_mistral.py to do the following steps:

a. Load the tokenizer and base model: Using the Hugging Face Transformers API:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "./models/Mistral-7B-v0.1" # path or name of base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=False,     # we avoid bitsandbytes on Mac
    torch_dtype="auto",     # use the dtype stored in the checkpoint (bf16/fp16) rather than forcing float32
    device_map={"": "mps"}  # place the whole model on the Apple GPU (MPS);
                            # if MPS is unavailable or misbehaves, use {"": "cpu"} to train on CPU
)

A few notes on this:

  • I loaded the model in full precision (load_in_8bit=False) because 8-bit loading (via bitsandbytes) isn’t supported on Mac. Full FP16 or FP32 was fine given 7B isn’t too large. On my 32GB RAM Mac, the 7B model in FP16 (~13 GB) fits in memory (a quick way to double-check this is shown right after these notes).
  • device_map={"": "mps"} is a way to instruct Transformers to put the whole model on the MPS device. You could also call model.to("mps") after loading. If you only have CPU (or if MPS has issues), use "cpu". Keep in mind training on CPU will be very slow — MPS can be 2–3× faster for matrix ops.
  • I used the local path for the model. If you haven’t downloaded the model, you can use the Hugging Face hub name (it will download automatically). Just be mindful of disk space.
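To verify the memory point above, here’s a small check I find useful (a sketch; get_memory_footprint() reports parameter and buffer memory in bytes):

# Confirm the loaded dtype and approximate memory use before training
print("dtype:", next(model.parameters()).dtype)
print(f"memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")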

b. Add LoRA adapters to the model: Using PEFT (🤗 PEFT library) to wrap the model with LoRA:

from peft import LoraConfig, get_peft_model, TaskType

# Configure LoRA
lora_config = LoraConfig(
    r=16,                          # LoRA rank (trade-off between memory and capacity, common values: 4, 8, 16)
    lora_alpha=32,                 # LoRA alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # target linear layers in the Mistral model
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM   # we're fine-tuning a causal language model
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)

Some explanation: The target_modules are the names of the model’s weight modules we want to apply LoRA to. For LLaMA/Mistral architectures, the attention projection layers are typically named q_proj, k_proj, v_proj, and sometimes o_proj (output projection). I included those to allow LoRA to train those matrices. The rank and alpha are hyperparameters (I chose a moderately high rank of 16 for potentially better learning of coding knowledge, at the cost of a larger adapter). Note: A rank of 16 on a 7B model is still quite small in terms of new parameters — LoRA adds roughly r × (d_in + d_out) parameters per target matrix (about 2 × hidden_size × r for the square projections), which is far less than full model tuning.
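PEFT can tell you exactly how small the trainable part is; a quick sketch (the numbers in the comment are ballpark, not exact):

# Report trainable vs. total parameters for the LoRA-wrapped model
model.print_trainable_parameters()
# Prints something along the lines of:
#   trainable params: ~13M || all params: ~7.2B || trainable%: ~0.19
# (exact figures depend on rank, target modules, and the model revision)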

c. Prepare the dataset and training data loader: I used the Datasets library to load my JSON and then set up the training loop:

from datasets import load_dataset

data = load_dataset("json", data_files="rust_instruct.json")
train_data = data["train"] # assuming the JSON is just a list of examples

My dataset entries look like:

{"instruction": "Explain the ownership model in Rust.", "output": "Rust's ownership model is based on ..."}
{"instruction":"Rust use statement","input":"","output":"use bytes::Bytes;"}

I decided to fine-tune in an instruction-tuning style, where each example is a prompt-response pair. I concatenated each instruction (optionally with a system prefix) and used the output as the label. A quick way is to format each example into a single text with a special separator token, but since Mistral/LLaMA are usually trained in a chat format, I could also use the prompt template approach. For simplicity, I created a function to join them:

def format_example(example):
    prompt = f"<s>[INST] {example['instruction']} [/INST]\n"  # using a format akin to LLaMA-2 chat
    response = example["output"]
    full_text = prompt + response
    return tokenizer(full_text, truncation=True)

I then applied this formatting to the dataset and set it up for the Trainer:

# Tokenize each example; drop the raw text columns so only input_ids / attention_mask remain
train_data = train_data.map(format_example, remove_columns=train_data.column_names)

(Depending on memory, you might want to stream or use train_data.map() with caution on large sets. My dataset was small enough.)
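Before training, it’s worth eyeballing one processed example to confirm the template survived tokenization. A tiny sanity check (index 0 is arbitrary):

# Decode the first tokenized example back to text and inspect it
sample = train_data[0]
print(tokenizer.decode(sample["input_ids"]))
# Expect something like: "<s> [INST] Explain the ownership model in Rust. [/INST] ..."
# (if you see a doubled <s> at the start, the tokenizer is adding its own BOS
#  on top of the literal <s> in the prompt template)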

d. Configure training parameters: I used the Hugging Face Trainer API for convenience:

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./outputs/lora_rust",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=50,
    learning_rate=2e-4,
    fp16=True,          # enable mixed precision (worked on MPS as of PyTorch 2.1 for the forward pass, but watch out for issues)
    report_to="none"    # disable wandb logging
)

# Use a data collator that can handle language modeling (it pads sequences to the same
# length in a batch and derives the labels from input_ids)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    data_collator=data_collator
)

I chose a very small batch size of 1 with gradient_accumulation_steps=4 to simulate an effective batch of 4 — this was because of limited memory on the Mac, and because the MPS backend doesn’t yet handle larger batches as efficiently. The remaining hyperparameters (epochs, learning rate) were picked somewhat heuristically for a small dataset. I set fp16=True hoping that mixed precision would be used; on MPS, PyTorch’s automatic mixed precision support was still being worked on, but by PyTorch 2.1 some operations can run in half precision on the GPU. In any case, it didn’t crash, and using FP16 where possible helps speed.
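As a rough sanity check on runtime, you can work out how many optimizer steps that configuration implies. A back-of-the-envelope sketch using the values above:

import math

# effective batch = per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 1 * 4
steps_per_epoch = math.ceil(len(train_data) / effective_batch)
total_steps = steps_per_epoch * 3   # num_train_epochs = 3
print(f"~{steps_per_epoch} optimizer steps per epoch, ~{total_steps} total")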

e. Run training: Finally:

trainer.train()

This started the fine-tuning process. It was slow — let’s be honest, fine-tuning a 7B model on a Mac’s CPU/MPS is nowhere near dedicated-GPU training speeds. But it was progressing! The training loop printed loss updates every 10 steps, and over a couple of hours (for a small dataset) I got through my 3 epochs. I could see the loss decreasing and the model seemingly learning from the Rust examples.

After training, I saved the LoRA adapter:

trainer.model.save_pretrained("./outputs/lora_rust")

The save_pretrained of a PEFT model will save the adapter weights and configuration. Indeed, in ./outputs/lora_rust/ I now saw adapter_model.bin (a few tens of MB, since LoRA weights are small) and adapter_config.json. Victory! I had a fine-tuned LoRA adapter.

(Side note: I could have also used model.save_pretrained on the LoRA-wrapped model. Both achieve the same, since trainer.model is the LoRA-wrapped model. Just ensure you’re saving the PEFT model, not the base model alone.)
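As a quick sanity check before merging, you can load the adapter back on top of the base model and generate with it directly. A minimal sketch, assuming the paths used above (the prompt is just an example):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "./models/Mistral-7B-v0.1", torch_dtype="auto", device_map={"": "mps"}
)
tok = AutoTokenizer.from_pretrained("./models/Mistral-7B-v0.1")

# Apply the saved LoRA adapter on the fly (no merging needed for this test)
peft_model = PeftModel.from_pretrained(base, "./outputs/lora_rust")
peft_model.eval()

prompt = "<s>[INST] What does the ? operator do in Rust? [/INST]\n"
inputs = tok(prompt, return_tensors="pt").to("mps")
out = peft_model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))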

Having a LoRA adapter is great if you plan to use it via code (you can load the base model and apply the adapter on the fly). But I wanted to use this model in a self-contained way (e.g., in LM Studio or other inference tools that might not support PEFT natively). So the next step was merging the LoRA weights into the base model to create a standalone fine-tuned model.

I wrote a small script merge_lora.py to do this:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "./models/Mistral-7B-v0.1",
    torch_dtype="auto",
    device_map={"": "cpu"}   # we can do the merging on CPU to avoid any MPS quirk
)
# Load the fine-tuned LoRA adapter into the base model
lora_model = PeftModel.from_pretrained(base_model, "./outputs/lora_rust")

# Merge and unload – incorporate LoRA weights into base model
merged_model = lora_model.merge_and_unload()

# Save the merged model (and the tokenizer, so the folder is self-contained for conversion later)
merged_model.save_pretrained("./outputs/merged_model", safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained("./models/Mistral-7B-v0.1")
tokenizer.save_pretrained("./outputs/merged_model")

A couple of things to highlight:

  • I loaded the base model to CPU for merging. Merging is a one-time operation and is not performance critical, so CPU is fine (just needs enough RAM). This avoids any MPS issues with the merge_and_unload function.
  • PeftModel.from_pretrained applied my LoRA to the base. Then calling merge_and_unload() gave me a normal transformers model (merged_model) with the weights updated as if they had been fully fine-tuned. This function essentially adds the low-rank updates (scaled by alpha) to the original weights.
  • I saved the merged model with safe_serialization=True, which writes the weights in .safetensors format (you could omit that or use the default to get a pytorch_model.bin, but I prefer safetensors for safety). The output directory now contained config.json, the tokenizer files, and the full model weights in .safetensors form (~13 GB in FP16).

This step succeeded — unlike Axolotl’s built-in merge which errored out, doing it manually with the latest PEFT library worked. Note that merge_and_unload() is a method on the PeftModel (specifically PeftModelForCausalLM), so a very old PEFT version may not have it; in my case it was available and did the job. (In fact, the Axolotl issue was that it wasn’t being called on the correct object.)

Now I had a merged model that I could use like any other Hugging Face model. For example, I could do:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./outputs/merged_model", use_fast=True)
mod = AutoModelForCausalLM.from_pretrained("./outputs/merged_model", torch_dtype=torch.float16, device_map={"": "mps"})
res = mod.generate(**tok("How do I implement a binary tree in Rust?", return_tensors="pt").to("mps"), max_new_tokens=200)
print(tok.decode(res[0]))

And it would produce an answer using the fine-tuned knowledge. (I did a quick test like this — the answers seemed reasonably Rust-aware!)

While having the Hugging Face format model is nice, running a 13GB model on my Mac for inference isn’t ideal (it can run, but not efficiently). Many local LLM tools (like LM Studio, text-generation-webui, etc.) prefer models in the GGUF format — llama.cpp’s model file format and the successor to the older GGML format. GGUF allows various quantization levels and is optimized for CPU inference via llama.cpp.

I decided to convert my model to GGUF. llama.cpp provides a conversion script for this purpose. Here’s what I did:

# Clone llama.cpp (if not already cloned)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install Python requirements for conversion (e.g., sentencepiece, numpy, safetensors etc.)
pip install -r requirements.txt

# Run the conversion script
python ./convert-hf-to-gguf.py ../outputs/merged_model --outfile mistral-rust-7b.gguf --outtype q4_0

A breakdown of the command:

  • I pointed convert-hf-to-gguf.py to my merged_model directory, which contains the config.json, tokenizer.model, tokenizer.json, and the model.safetensors. The script needs those.
  • --outfile mistral-rust-7b.gguf is the name of the output file I wanted.
  • --outtype q4_0 specifies the quantization type. I chose 4-bit (q4_0), which drastically reduces the model size (my output GGUF file came out around ~3.5 GB). You can choose other quantization levels or even unquantized (f16 or f32). I found q4_0 to be a good balance for local CPU inference — it might sacrifice a bit of accuracy, but for my use-case (Rust explanations) it was fine. If you have more RAM and want better quality, q5_1 or q8_0 are options (with larger file sizes). (Note: depending on your llama.cpp version, the converter may only accept f32/f16/q8_0 for --outtype; in that case, convert to f16 first and then run llama.cpp’s separate quantize tool to produce the q4_0 file.)

The conversion script ran for a few minutes and successfully produced mistral-rust-7b.gguf. Now the model was in a single file, ready for use in llama.cpp-compatible UIs.

Finally, I wanted to load the model into LM Studio, which is a nice Mac-friendly UI for chatting with local models. LM Studio supports GGUF models, but it expects them in a specific folder structure so it can recognize and list them.

According to the LM Studio docs, the model files should be placed under ~/.lmstudio/models with a publisher and model name hierarchy. Concretely, I did this:

# Create a folder for my model in LM Studio's directory
mkdir -p ~/.lmstudio/models/local/mistral-rust-7b

# Move the GGUF file into that folder
mv mistral-rust-7b.gguf ~/.lmstudio/models/local/mistral-rust-7b/

Here, I used local as the “publisher” name (you could use anything, maybe your username or org name). And mistral-rust-7b as the model name folder. One important detail: the file name must contain .gguf (which it does as I named it). If you have a quantization suffix (like q4_0), it’s fine to include it (e.g., mistral-rust-7b-q4_0.gguf), just keep the extension. In my case I left it as mistral-rust-7b.gguf.

I then launched LM Studio, and lo and behold, under “My Models” the model appeared as local/mistral-rust-7b. I could select it and start chatting. The Rust knowledge was there, and the model was running entirely on my Mac! 🎉

To wrap up, here are some tips for beginners gleaned from this journey, especially for those fine-tuning LLMs on Mac hardware:

  • Environment and dependencies: Use an isolated environment (conda or venv) and install only what you need. On Apple Silicon, make sure to use a PyTorch version that supports MPS. Check with a small test that MPS is available, but be prepared to fall back to CPU if something isn’t supported. Keep an eye on the PyTorch release notes for improvements to MPS (each version has gotten better).
  • Axolotl on Mac: Axolotl is a powerful tool and may improve Mac support over time, but currently you might encounter issues due to its expectation of NVIDIA tools. If you still want to use it, comb through the Axolotl docs and GitHub issues for Mac-specific tips. For example, they note that certain features like bitsandbytes, QLoRA, and DeepSpeed are not available on M-series Macs. You may have to disable those. Always double-check that your dataset format matches what Axolotl expects to avoid key errors.
  • PEFT (LoRA) approach: Using Hugging Face’s PEFT library directly gives you flexibility. You can tailor the training loop, integrate custom data processing, and debug easier in a straightforward Python script. The trade-off is writing more boilerplate (loading data, writing a training loop or using Trainer). For many, the Trainer API will be sufficient and saves a lot of manual coding.
  • Folder layout best practices: It’s easy to get confused with multiple versions of model weights flying around. I recommend organizing as follows:

  • Base model folder: e.g. models/<model-name> containing the original model files (whether downloaded from HF or converted).
  • Data folder: e.g. data/<dataset-name> for your training data.
  • Output (LoRA) folder: to save the adapter (if using LoRA) — e.g. outputs/<exp-name>-lora/.
  • Output (merged model) folder: e.g. outputs/<exp-name>-merged/.
  • Converted models folder: e.g. outputs/<exp-name>-gguf/ or directly move to ~/.lmstudio/models/… as appropriate.

This separation avoids overwriting something important. Always double-check which model you’re loading or saving to avoid mixing the base and fine-tuned weights inadvertently.

  • Disabling unused features: As shown, disable any external logging (unless you explicitly want it) to keep things simple. Similarly, if using the Hugging Face Trainer, set report_to="none" (or "tensorboard" if you prefer that) to avoid W&B usage. If you see tokenizer parallelism warnings, just set the env var as mentioned. These small things make the process less noisy.
  • Know your tool versions: The ML ecosystem moves fast. By the time you read this, newer versions of Axolotl, PEFT, or PyTorch might have changed behaviors. Check the documentation for the versions you’re using. For instance, the merge_and_unload bug I faced might be fixed in a future Axolotl release. Always refer to official docs or source — many open-source projects have active Discords or forums where you can ask for Mac-specific help.

Fine-tuning a large language model on a Mac is possible — and it’s incredibly rewarding to see a model trained on your own data running locally. Throughout this journey, I encountered the rough edges of tooling that wasn’t originally designed with Apple Silicon in mind. By combining the strengths of different tools — Axolotl’s inspiration and config templates, Transformers/PEFT for a custom training loop, and llama.cpp for efficient inference — I managed to create a Rust-savvy Mistral-7B that runs on my Mac Studio.

Resources & References:

  • Axolotl (Fine-tuning toolkit): GitHub repo and docs.
  • PEFT (LoRA by Hugging Face): PEFT library documentation — covers how to use LoRA and other efficient tuning methods.
  • Hugging Face Transformers: The Transformers documentation for details on the Trainer, model loading, etc.
  • llama.cpp and GGUF conversion: See llama.cpp’s README and the conversion script usage.
  • LM Studio docs: Guide on importing models in LM Studio (folder structure expectations).

Good luck with your fine-tuning experiments, and happy modeling on your Mac! 🍏🤖

Model on Hugging Face: https://huggingface.co/plawanrath/minstral-7b-rust-fine-tuned
