I tinkered a bit with Google’s new gemma2 model on my 32GB RAM M1 Pro. So far it seems quite useful, although I have dabbled with it for only a day or two. Here’s a summary of some of the things I tested with it.

Benchmarking

Using the script from earlier iterations:

for MODEL in gemma2:27b-instruct-q5_K_M gemma2:27b \
  gemma2:9b-instruct-fp16 gemma2:9b-instruct-q8_0 gemma2 \
  llama3:8b-instruct-q8_0 llama3
do
  echo ${MODEL}:
  ollama run $MODEL --verbose 'Why is sky blue?' 2>&1 \
    | grep -E '^(load duration|eval rate)'
  echo
done

with the following models:

NAME                       ID           SIZE   MODIFIED    
gemma2:27b-instruct-q5_K_M fcac47285244 19 GB  2 hours ago 
gemma2:27b                 371038893ee3 15 GB  10 minutes ago 
gemma2:9b-instruct-fp16    9de55d4bf6ae 18 GB  2 hours ago 
gemma2:9b-instruct-q8_0    885eb2467be6 9.8 GB 2 hours ago 
gemma2:latest              c19987e1e6e2 5.4 GB 2 hours ago 
llama3:8b-instruct-q8_0    1b8e49cece7f 8.5 GB 2 hours ago 
llama3:latest              365c0bd3c000 4.7 GB 2 hours ago 

The ‘latest’ tag is the smaller model with the somewhat limited Q4_0 quantization, and :27b is also a q4 quant.

The results

gemma2:27b-instruct-q5_K_M:
load duration:        28.729790834s
eval rate:            4.87 tokens/s

gemma2:27b:
load duration:        23.894132541s
eval rate:            10.08 tokens/s

gemma2:9b-instruct-fp16:
load duration:        27.425009708s
eval rate:            9.04 tokens/s

gemma2:9b-instruct-q8_0:
load duration:        14.788690333s
eval rate:            15.96 tokens/s

gemma2:
load duration:        8.408309291s
eval rate:            25.35 tokens/s

llama3:8b-instruct-q8_0:
load duration:        12.478414625s
eval rate:            19.50 tokens/s

llama3:
load duration:        6.885105375s
eval rate:            31.68 tokens/s

Observations

It seems to perform reasonably well in terms of tokens per second. I have no idea why the 27b q5 model is so much slower than the 9b one or the 27b q4 (roughly the same model size in bytes); at a guess, Apple Silicon doesn’t handle q5 particularly well.

Some anecdotal notes about quality

TL;DR: It seems to be better at following instructions than phi3 or llama3. For example, in the languages I tested (English, Finnish, Japanese) it produces lists of JSON things just fine, just like the OpenAI models would, whereas phi3 for some reason usually outputs the JSON wrapped in markdown, and llama3 seems keen to add various descriptive elements even if the prompt is strictly ‘only json please’.

Encoding, or lack of it

With prompt engineering phi3 and llama3 would produce the same result too, but here’s an example (I tried a few iterations, and of the three only gemma2 consistently produced plain JSON output):

Prompt: Give only a JSON encoded list of 5 strings with widely known plant names

phi3: b' ```json\n["Rose", "Tulip", "Sunflower", "Oak Tree", "Maple Tree"]\n```\n\n'
llama3: b'Here is the JSON-encoded list:\n\n`["Oak", "Rose", "Sunflower", "Ivy", "Maple"]`\n\n'
gemma2: ['Rose', 'Sunflower', 'Oak', 'Tulip', 'Cactus']
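
For reference, here’s a minimal sketch of how raw responses like the ones above can be captured from a local Ollama server over its HTTP API (an illustration assuming the default port 11434, not the exact script behind the outputs above):

import requests

PROMPT = "Give only a JSON encoded list of 5 strings with widely known plant names"

def raw_response(model: str, prompt: str) -> bytes:
    # Ollama's generate endpoint; stream=False returns a single JSON object
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    # The generated text is in the "response" field; encoding it to bytes
    # shows exactly what came back (markdown fences, stray newlines, etc.)
    return r.json()["response"].encode("utf-8")

for model in ("phi3", "llama3", "gemma2"):
    print(f"{model}: {raw_response(model, PROMPT)!r}")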

Content, in different languages

Prompt: Give only a JSON encoded list of 5 strings with widely known plant names in X
(X = English, Finnish, Japanese)
  • phi3 produces reasonable results in English, but in Finnish it hallucinates 4/5 of the names, and in Japanese it sort of lists plants, but with lots of extra English annotations
  • llama3 produces reasonable results in English, but in Finnish there’s one (plural) plant name and 4/5 hallucinations; the Japanese output is somehow buggy (an empty JSON list, but it then lists translations of maple, cherry blossom, chrysanthemum, and lotus)
  • gemma2 produces reasonable results in English; in Finnish it has plant-related words (4/5) but not really plants as such (without ‘widely known’ it actually lists plant names, go figure), and in Japanese it produces different plant kanji just fine (a small script for repeating this test is sketched below)
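
And a minimal sketch of how the per-language comparison can be automated (same assumptions as above: a local Ollama server on the default port; json.loads is only used to check whether the output is plain JSON, not whether the plant names make sense):

import json
import requests

MODELS = ("phi3", "llama3", "gemma2")
LANGUAGES = ("English", "Finnish", "Japanese")

def generate(model: str, prompt: str) -> str:
    # Non-streaming call to a local Ollama server
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

for language in LANGUAGES:
    prompt = ("Give only a JSON encoded list of 5 strings "
              f"with widely known plant names in {language}")
    for model in MODELS:
        text = generate(model, prompt)
        try:
            names = json.loads(text)
            status = f"plain JSON, {len(names)} items"
        except ValueError:
            status = "not plain JSON"
        print(f"{language:8} {model:8} {status}: {text.strip()!r}")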