TL;DR: The smaller the model, the stupider it is, and by a lot. gemma2 is where it's at, even in its 2b version, but at least for me, prompt engineering produced better results with it than tool calling did.

I decided to do a write-up about this particular experience, as I have spent quite a bit of time recently staring at results, and writing things down usually helps advance my own thinking. I last did something similar in July, with less scope and less data. The outcome is still the same, though.

The goal

I have a set of (not very large) JSON objects that I want to compare for rough semantic similarity. In this case, I am using an LLM as a judge to evaluate the results of other LLM invocations. For a human, the question is basically:

  • Are these JSON structures similar? Is the content similar (and not, e.g., swapped keys and values)?
  • Give a single score in the range 0 to 1 based on the per-field logical similarity of the structures, plus a list of reasons describing what is different between the reference and the current object (a concrete example follows right after this list).
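
To make that concrete, here is a hypothetical illustration (the field names and values below are invented for this post, not taken from the actual benchmark data) of an input pair and the kind of verdict the judge should produce:

```python
# Hypothetical example data, invented purely for illustration.
reference = {"name": "Alice", "visits": 3, "tags": ["regular", "newsletter"]}
current = {"name": "Alice", "visits": 4, "tags": ["regular"]}

# The kind of verdict a human (or the judge LLM) should come up with.
expected_verdict = {
    "score": 0.8,  # per-field logical similarity, between 0 and 1
    "reasons": [
        "visits differs (3 vs 4)",
        "tags is missing 'newsletter' in the current object",
    ],
}
```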

Approaches to solving the problem

Before last year, the trendy way to do this would have been to formulate an appropriate prompt, provide it to the LLM, and hope that the result is actually what you wanted. This is what I did the first time. I explicitly attempted to guide the model to produce JSON, with fragments like "Do not provide any other output than the JSON. Do not provide explicit explanation of your process." in the prompt.
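
A minimal sketch of that prompt-based approach, assuming some `chat(prompt) -> str` helper around whatever completion API is in use (the exact prompt wording below is illustrative, not my actual prompt):

```python
import json

PROMPT_TEMPLATE = """Compare the two JSON structures below for per-field logical similarity.
Return ONLY a JSON object of the form {{"score": <number between 0 and 1>, "reasons": [<strings>]}}.
Do not provide any other output than the JSON. Do not provide explicit explanation of your process.

Reference:
{reference}

Current:
{current}
"""

def judge_with_prompt(chat, reference: dict, current: dict) -> dict:
    """Build the prompt, send it, and hope the reply parses as JSON."""
    prompt = PROMPT_TEMPLATE.format(
        reference=json.dumps(reference), current=json.dumps(current)
    )
    raw = chat(prompt)
    return json.loads(raw)  # raises if the model ignored the "JSON only" instruction
```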

Another way would be to formulate this as a function calling problem for the LLM, with the LLM providing a single numeric return value as well as a list of reasons based on its evaluation.
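
Sketched as an OpenAI-style tool definition (the tool and parameter names here are my own invention for illustration), that looks roughly like this:

```python
# A hypothetical tool definition for the judge; the LLM is instructed to call
# report_similarity exactly once instead of producing free-form text.
SIMILARITY_TOOL = {
    "type": "function",
    "function": {
        "name": "report_similarity",
        "description": "Report how similar the current JSON object is to the reference.",
        "parameters": {
            "type": "object",
            "properties": {
                "score": {
                    "type": "number",
                    "description": "Per-field logical similarity, between 0 and 1.",
                },
                "reasons": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Differences between the reference and the current object.",
                },
            },
            "required": ["score", "reasons"],
        },
    },
}
```

The appeal is that the schema, rather than the prompt wording, is supposed to enforce the shape of the output.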

The sample data consists of 5 sets of data, generated with the same LLM model but different seeds, so they are expected to be mostly similar. I used the same prompt for all LLMs, having tuned it to produce reasonable results with Groq-hosted Llama 3.1 models (and it also produced reasonable results with OpenAI gpt-4o-mini).

Disclaimer: By tuning either the prompt or one of a myriad of parameters, it is certainly possible to make particular models work better.

The contestants

I have been trying to get things done with models as small as possible for a while now. It is tempting to run things locally on my machines, and since I unfortunately do not have a proper GPU cluster at home, I have to make do. Running gemma2:2b on my phone was quite an interesting experience (it performed reasonably well), and because of that I have been trying to do things with even smaller models than what my compute hardware at home could support.

Here are the contestants:

NAME                            ID              SIZE    MODIFIED     
smollm:1.7b                     95f6557a0f0f    990 MB  3 days ago      
qwen2:1.5b                      f6daf2b25194    934 MB  11 hours ago    
gemma2:2b                       8ccf136fdd52    1.6 GB  5 weeks ago     
phi3.5:latest                   61819fb370a3    2.2 GB  3 days ago      

I wrote some custom Python code to benchmark them, using the litellm library (version 1.44.22) to interact with an ollama server (version 0.3.10) that hosted the models locally.
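
The core of each benchmark run was essentially a single litellm completion call against the local ollama server. A simplified sketch follows (the function and its plumbing are illustrative, not my exact benchmark code, and whether the seed is honored end-to-end depends on the litellm/ollama combination):

```python
import json
import litellm

def run_once(model: str, prompt: str, seed: int) -> dict:
    """Ask one locally hosted model for a similarity verdict and parse its reply."""
    response = litellm.completion(
        model=f"ollama/{model}",            # e.g. "ollama/gemma2:2b"
        api_base="http://localhost:11434",  # default local ollama endpoint
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        seed=seed,
    )
    raw = response.choices[0].message.content
    return json.loads(raw)  # blows up on Markdown-wrapped or otherwise invalid JSON
```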

smollm:1.7b

Prompt approach - score 0 out of 5 on validity

SmolLM liked producing Python code instead of actually doing the work. It also did not produce valid JSON output. Example:

To compare two data structures for their semantic similarity, we can use a combination of techniques such as:

1. **Similarity metrics**: Calculate the similarity between each field of both data structures using metrics such as cosine similarity or Jaccard similarity.
2. **Field-wise comparison**: Compare each field of both data structures to determine which fields are more similar in meaning.
3. **Reasoning and explanation**: Provide a clear explanation for why one data structure is considered more similar than the other, based on the semantic similarity metrics and field-wise comparisons.

Here's how you can implement this:
```python
import json
from typing import List

def compare_data_structures(structure1: dict, structure2: dict) -> bool:
    """
    Compares two data structures for their semantic similarity.

    Args:
        structure1 (dict): The first data structure to compare.
        structure2 (dict): The second data structure to compare.
```

Tool calling approach - score 0 out of 5 on validity

Tool calling (at least with the current ollama and litellm versions) simply did not work.

qwen2:1.5b

Prompt approach - score 2 out of 5 on validity, low accuracy

With the ‘prompt hero’ approach, I could not get it to work consistently. 3 out of 5 attempts failed on the first try; with a subsequent attempt (using a different seed) it would provide what I actually wanted, but usually you do not want retries in your process.

A random example of the failed outputs (I asked for JSON-only output):

To provide a direct comparison between the two given data structures without any extra whitespace or newlines for the JSON structure, we would first need to compare the contents of each field within the nested lists. This requires parsing through the data structures and identifying the fields that need comparison. Since the current dataset does not include 'visit_number' as a field in 'user_responses', it is necessary to adjust our approach based on this assumption.

Assuming 'visit_number' is indeed one of the required fields for comparison, we would then need to compare these numbers between the two data structures and ensure they are correctly matched. If there are any discrepancies that do not match or are missing a number from one structure compared to another, it's reasonable to conclude that these values are erroneous.

Given this understanding of how to process the data with reference fields (e.g., 'visit_number'), we can formulate our JSON output as follows:

```json
...
```

The returned scores were also substantially lower than expected.

Tool calling approach - score 5 out of 5 on validity, low accuracy

When formulating the problem as a tool calling problem, the extra descriptions went away and I got a tool call with valid-looking output every time.

However, the scores were still lower than expected.

gemma2:2b

Prompt approach - score 5 out of 5 on validity, reasonable accuracy

Not much to report here. The only downside is that it kept insisting on returning a Markdown-flavored response through the litellm+ollama stack (e.g. a ```json prefix), but beyond that, the results were actually brilliant.

I also tested it with two other sets of 5 data items; everything worked fine, and the accuracy of the evaluation seemed consistent with that of the bigger models. So this one actually worked more or less correctly out of the box.
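
As for the Markdown flavoring, one way to cope with it on the parsing side (a workaround sketch of my own, not necessarily what the benchmark code did) is to strip the fence before decoding:

```python
import json

FENCE = "`" * 3  # the triple-backtick fence marker

def parse_fenced_json(raw: str) -> dict:
    """Strip a Markdown code fence wrapper, if present, then parse the remaining JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""  # drop the opening fence line (and its json tag)
        text = text.rsplit(FENCE, 1)[0]                        # drop the closing fence, if any
    return json.loads(text.strip())
```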

Tool calling approach - score 0 out of 5 on validity

litellm kept just blowing up. It did not work at all, even with retries.

phi3.5

Phi3.5 is a bit larger model, so I had high hopes for it. Unfortunately, the outcome was not as good as with gemma2.

Prompt approach - score 1 out of 5 on validity, reasonable accuracy

Like gemma2, phi3.5 was keen to insert the extra Markdown flavoring into its responses. Additionally, it would not simply return the response: most of the time it also gave, by default, some justification for how it came up with it. As this was not asked for, I ranked it 1 out of 5 for validity. However, with retries using different seeds, it did also produce non-rationalized responses.

Given retries, the results were about as good as those of gemma2:2b.

Tool calling approach - score 0 out of 5 on validity

A lot of the time, litellm just kept blowing up. With a sufficient number of retries it did come up with valid results, but most of the time the model produced invalid JSON, which litellm failed to decode.

Conclusions

It seems that gemma2:2b is actually useful for this particular problem. The others produced ‘something’, but that something was either more frequently broken or less accurate than the gemma2:2b results. I find it interesting that the prompt-engineered solution worked out better than the tool calling one, as the enforced schema in tool calling would presumably improve the results for such a narrow task, but this was not the case here.

As a funny anecdote, when sanity-checking the results with bigger models, even gpt-4o-mini got tool calling wrong at least once (for some reason it called the result function twice with identical arguments), but this did not occur with gpt-4o or Llama 3.1.