I have been somewhat interested in LLM performance for years, and it used to be that playing with them was quite painful (e.g. the conda ecosystem in general sucks, and a GPU used to be mandatory), but now with ollama ( https://ollama.com/ ) they are quite trivial to benchmark across different devices without setting up a complex stack.
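
For reference, getting to a runnable model is roughly this much work. The sketch below assumes the Linux install script documented at ollama.com; on macOS, ollama ships as a regular app download instead:

# Linux install script from ollama.com (macOS uses an app download instead)
curl -fsSL https://ollama.com/install.sh | sh
# pull a small model and ask it something
ollama pull phi3
ollama run phi3 --verbose 'Why is sky blue?'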

So this morning I indulged…

I have not yet gotten around to checking the numbers on a real GPU card, but here’s what I found out at home (without starting the gaming PC).

~>ollama list
NAME          ID           SIZE   MODIFIED     
phi3:latest   a2c89ceaed85 2.3 GB 9 hours ago  
llama2:latest 78e26419b446 3.8 GB 2 months ago 
qwen:14b      80362ced6553 8.2 GB 2 months ago 

Test run idea: ollama run <model> --verbose 'Why is sky blue?'

At the moment I’m mostly interested in two statistics:

  • load duration: how long it takes for the model to become available, given it was NOT the most recently used model (ollama keeps the most recent model loaded, so this is mostly not relevant, but perhaps still interesting)
  • eval rate: how rapidly it produces output (and processes input, although since the prompt here is small, this is mostly the output token production rate); both fields appear in the --verbose stats block sketched below
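
A --verbose run prints a small stats block after the answer. The values below are placeholders rather than a real capture, but these are the fields the benchmark script greps for:

total duration:       ...
load duration:        ...
prompt eval count:    ...
prompt eval duration: ...
prompt eval rate:     ... tokens/s
eval count:           ...
eval duration:        ...
eval rate:            ... tokens/s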

Benchmark script

for MODEL in phi3 llama2 qwen:14b
do
  echo "${MODEL}:"
  # ollama prints the stats to stderr, so merge streams before filtering
  ollama run "${MODEL}" --verbose 'Why is sky blue?' 2>&1 | grep -E '^(load duration|eval rate)'
  echo
done
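
A single run is a fairly noisy measurement; a small extension could average the eval rate over a few runs per model. This is just a sketch on top of the script above (the repeat count and the awk aggregation are my own additions, not something ollama provides):

RUNS=3
for MODEL in phi3 llama2 qwen:14b
do
  echo "${MODEL}:"
  # collect the eval rate from each run and average the third field (the number)
  for i in $(seq "${RUNS}")
  do
    ollama run "${MODEL}" --verbose 'Why is sky blue?' 2>&1 | grep '^eval rate'
  done | awk '{ sum += $3 } END { if (NR) printf "mean eval rate: %.2f tokens/s over %d runs\n", sum / NR, NR }'
  echo
done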

Intel Alder Lake N305 (32GB RAM, no GPU)

I chose to start with the slowest: my home router (8 cores, but it seems to use only 4 for this, presumably by default):

phi3:
load duration:        3.603376756s
eval rate:            5.86 tokens/s

llama2:
load duration:        5.740807721s
eval rate:            2.98 tokens/s

qwen:14b:
load duration:        12.289317338s
eval rate:            1.48 tokens/s

This is basically unusable, even with the phi3 model. In practice the response takes quite a while to arrive.
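
If the 4-out-of-8-cores observation is correct, one untested idea would be to derive a model variant that requests more threads via ollama's num_thread parameter. The variant name and thread count below are illustrative guesses, and I have not checked whether this actually helps on the N305:

# untested sketch: create a phi3 variant that asks for 8 CPU threads
cat > Modelfile.phi3-8t <<'EOF'
FROM phi3
PARAMETER num_thread 8
EOF
ollama create phi3-8t -f Modelfile.phi3-8t
ollama run phi3-8t --verbose 'Why is sky blue?' 2>&1 | grep -E '^(load duration|eval rate)'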

Apple M1 Pro CPU (32GB RAM, Apple Silicon)

phi3:
load duration:        3.568252375s
eval rate:            46.18 tokens/s

llama2:
load duration:        5.873763208s
eval rate:            36.85 tokens/s

qwen:14b:
load duration:        5.942052292s
eval rate:            18.41 tokens/s

Apple M2 Pro CPU (32GB RAM, Apple Silicon)

phi3:
load duration:        2.564123333s
eval rate:            51.82 tokens/s

llama2:
load duration:        4.031585333s
eval rate:            38.02 tokens/s

qwen:14b:
load duration:        5.86678575s
eval rate:            19.29 tokens/s

Bonus content: mixtral

I didn’t feel like running the larger model test on all of the hardware, but here is ollama’s mixtral on the M1 Pro (47B-parameter 4-bit model, 26 GB in size):

mixtral:
load duration:        1m37.00713975s
eval rate:            14.08 tokens/s

Obviously loading the model took a while. This is because the machine is basically low on RAM for a model this size: of the 32 GB of system RAM, roughly 25 GiB goes to the model’s buffers, split between CPU and Metal:

llm_load_tensors:        CPU buffer size =  4868.43 MiB
llm_load_tensors:      Metal buffer size = 20347.44 MiB

However, on a re-run it does not reload the model, and you still get output at the same rate:

load duration:        1.100959ms
eval rate:            14.48 tokens/s
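
This cold-start vs. warm-start difference is easy to reproduce by running the same command twice in a row; the second invocation should show a near-zero load duration while ollama still has the model resident:

# the first run pays the ~1.5 minute load, the immediate re-run should not
time ollama run mixtral --verbose 'Why is sky blue?' 2>&1 | grep -E '^(load duration|eval rate)'
time ollama run mixtral --verbose 'Why is sky blue?' 2>&1 | grep -E '^(load duration|eval rate)'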

So even the CPU+GPU split is not prohibitively slow, if the machine is only used for that. I chose not to try any > 32 GB models, as once swapping happens the game is most likely lost.

Conclusions

It seems that the Apple hardware runs models quite well. The numbers look comparable to some dedicated GPU results, cf. the (dated) r/LocalLLaMA post “Detailed performance numbers and Q&A for llama.cpp GPU acceleration”, but quite a bit slower than the more recent 4090 result in the r/LocalLLaMA post “a script to measure tokens per second of your ollama models” (which measured 80 tokens/s for llama2:13b on an Nvidia 4090).

The Intel box without a GPU seems excessively slow, and while running these on Raspberry Pis would also be possible, it is unlikely to be worth it, so I am not even going to try.

The reason I am looking at this is that I am considering building some utilities that do local evaluation, and their speed is one concern (although perhaps not as significant as the relevance of the results).