I tinkered a bit with Google’s new gemma2 model on my 32 GB RAM M1 Pro. So far it seems quite useful, although I have dabbled with it for only a day or two. Here’s a summary of some of the things I tested.
Benchmarking
Using the script from earlier iterations:
for MODEL in gemma2:27b-instruct-q5_K_M gemma2:27b \
gemma2:9b-instruct-fp16 gemma2:9b-instruct-q8_0 gemma2 \
llama3:8b-instruct-q8_0 llama3
do
echo ${MODEL}:
ollama run $MODEL --verbose 'Why is sky blue?' 2>&1 \
| grep -E '^(load duration|eval rate)'
echo
done
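The grep above keeps just the `load duration` and `eval rate` lines from the `--verbose` stats. If you'd rather collect the numbers programmatically, here is a small Python sketch of my own (not part of ollama; the exact stats formatting may vary between ollama versions):

```python
import re

# Example --verbose output fields, as printed by ollama run
SAMPLE = """\
load duration:        23.894132541s
eval rate:            10.08 tokens/s
"""

def parse_stats(verbose_output: str) -> dict:
    """Extract load duration (seconds) and eval rate (tokens/s),
    the same two fields the shell script greps for."""
    stats = {}
    m = re.search(r'load duration:\s*([\d.]+)s', verbose_output)
    if m:
        stats["load_s"] = float(m.group(1))
    m = re.search(r'eval rate:\s*([\d.]+) tokens/s', verbose_output)
    if m:
        stats["tokens_per_s"] = float(m.group(1))
    return stats

print(parse_stats(SAMPLE))  # {'load_s': 23.894132541, 'tokens_per_s': 10.08}
```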
with the following models:
NAME                         ID            SIZE    MODIFIED
gemma2:27b-instruct-q5_K_M   fcac47285244  19 GB   2 hours ago
gemma2:27b                   371038893ee3  15 GB   10 minutes ago
gemma2:9b-instruct-fp16      9de55d4bf6ae  18 GB   2 hours ago
gemma2:9b-instruct-q8_0      885eb2467be6  9.8 GB  2 hours ago
gemma2:latest                c19987e1e6e2  5.4 GB  2 hours ago
llama3:8b-instruct-q8_0      1b8e49cece7f  8.5 GB  2 hours ago
llama3:latest                365c0bd3c000  4.7 GB  2 hours ago
The ‘:latest’ tags are the smaller, somewhat limited Q4_0 models, and :27b is also Q4-quantized.
The results
gemma2:27b-instruct-q5_K_M:
load duration: 28.729790834s
eval rate: 4.87 tokens/s
gemma2:27b:
load duration: 23.894132541s
eval rate: 10.08 tokens/s
gemma2:9b-instruct-fp16:
load duration: 27.425009708s
eval rate: 9.04 tokens/s
gemma2:9b-instruct-q8_0:
load duration: 14.788690333s
eval rate: 15.96 tokens/s
gemma2:
load duration: 8.408309291s
eval rate: 25.35 tokens/s
llama3:8b-instruct-q8_0:
load duration: 12.478414625s
eval rate: 19.50 tokens/s
llama3:
load duration: 6.885105375s
eval rate: 31.68 tokens/s
Observations
It seems to perform reasonably well in terms of tokens per second. I have no idea why the 27b Q5 model is so much slower than the 9b fp16 one or the 27b Q4 (roughly the same model size in bytes); at a guess, Apple Silicon doesn’t handle Q5 quantization too well.
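One rough way to sanity-check that guess: eval rate × model size approximates how many gigabytes of weights are read per second, a crude effective-memory-bandwidth proxy. This metric is my own back-of-the-envelope calculation, computed from the numbers above:

```python
# (eval rate tokens/s, on-disk size GB) from the benchmark runs above
results = {
    "gemma2:27b-instruct-q5_K_M": (4.87, 19),
    "gemma2:27b":                 (10.08, 15),
    "gemma2:9b-instruct-fp16":    (9.04, 18),
    "gemma2:9b-instruct-q8_0":    (15.96, 9.8),
    "gemma2:latest":              (25.35, 5.4),
    "llama3:8b-instruct-q8_0":    (19.50, 8.5),
    "llama3:latest":              (31.68, 4.7),
}

# rate * size ~ GB of weights streamed per second (very rough proxy)
for name, (rate, size_gb) in sorted(results.items(),
                                    key=lambda kv: -kv[1][0] * kv[1][1]):
    print(f"{name:<28} {rate:>6.2f} tok/s  ~{rate * size_gb:6.1f} GB/s")
```

By this measure the Q5_K_M model sits well below all the others (~93 GB/s vs. roughly 140–165 GB/s), which is at least consistent with Q5 dequantization being the bottleneck rather than raw model size.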
Some anecdotal notes about quality
TL;DR: It seems to be better at following instructions than phi3 or llama3. For example, it produces lists of JSON things just fine in various languages (English, Finnish, and Japanese are what I tested), just like the OpenAI models would. phi3, for some reason, usually outputs markdown-encoded JSON, and llama3 seems keen to add various descriptive elements even when the prompt is strictly ‘only JSON please’.
Encoding, or lack of it
With prompt engineering, phi3 and llama3 could be coaxed to the same result too, but here’s an example (I tried a few iterations, and of the three only gemma2 consistently produced plain JSON output):
Prompt: Give only a JSON encoded list of 5 strings with widely known plant names
phi3: b' ```json\n["Rose", "Tulip", "Sunflower", "Oak Tree", "Maple Tree"]\n```\n\n'
llama3: b'Here is the JSON-encoded list:\n\n`["Oak", "Rose", "Sunflower", "Ivy", "Maple"]`\n\n'
gemma2: ['Rose', 'Sunflower', 'Oak', 'Tulip', 'Cactus']
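If you have to live with the messier replies, a small Python helper of my own (not part of any of these tools) can recover the JSON from each style of output, falling back to grabbing the bracketed span when strict parsing fails:

```python
import json
import re

def extract_json(raw: bytes):
    """Parse a model reply as JSON; fall back to the first
    [...] or {...} span if there is a markdown fence or prose around it."""
    text = raw.decode()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    m = re.search(r'(\[.*\]|\{.*\})', text, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    raise ValueError("no JSON found in reply")

# the raw replies quoted above
phi3 = b' ```json\n["Rose", "Tulip", "Sunflower", "Oak Tree", "Maple Tree"]\n```\n\n'
llama3 = b'Here is the JSON-encoded list:\n\n`["Oak", "Rose", "Sunflower", "Ivy", "Maple"]`\n\n'

print(extract_json(phi3))    # ['Rose', 'Tulip', 'Sunflower', 'Oak Tree', 'Maple Tree']
print(extract_json(llama3))  # ['Oak', 'Rose', 'Sunflower', 'Ivy', 'Maple']
```

With gemma2 the fallback never fired in my tests; for phi3 and llama3 it did most of the time.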
Content, in different languages
Prompt: Give only a JSON encoded list of 5 strings with widely known plant names in X
(X = english, finnish, japanese)
- phi3 produces reasonable results in English, but in Finnish it hallucinates 4/5 of the names, and in Japanese it sort of lists plants, but with lots of extra English annotations
- llama3 produces reasonable results in English, but in Finnish there’s one (plural) plant name and 4/5 hallucinations; the Japanese output is somehow buggy (an empty JSON list, though it lists translations of maple, cherry blossom, chrysanthemum, and lotus)
- gemma2 produces reasonable results in English; in Finnish it gives plant-related words (4/5) but not really plant names as such (without ‘widely known’ it actually lists plant names, go figure), and in Japanese it produces proper plant kanji just fine