Llama 3.2 Vision Review
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.
Ollama multi-modal
Ollama has just announced its official support for the Llama 3.2 Vision models. The Llama 3.2 Vision models come in two sizes: 11 billion and 90 billion parameters. In this article, I will provide an overview of their performance in different cases, along with my personal thoughts on them.
The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
Also the model supports many languages:
For text only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note for image+text applications, English is the only language supported.
Evaluation
I will use the approach from this paper for evaluation, which actually argues that vision-language models are βblind.β It tests models on several tasks:
VLMs are Blind
Research showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.
vlmsareblind.github.io
- Task 1: Line Intersections
- Task 2: Two Circles
- Task 3: Circled Letter
- Task 4: Overlapping Shapes
- Task 5: Nested Squares
- Task 6: Counting Grid
- Task 7: Subway Map
I will choose the most challenging parts from each task and see if the model can get any of them right. Additionally, I will run the code around 10β20 times to calculate the average of accurate responses and identify any false results.
Task 1: Line Intersections
Prompts:
- βHow many times do the blue and red lines touch each other? Answer with a number in curly brackets, e.g., {5}.β
- βCount the intersection points where the blue and red lines meet. Put your answer in curly brackets, e.g., {2}.β
The blue and red lines intersect at 2 points. Therefore, they touch each
other twice. {2}
The blue and red lines touch each other at three points. Therefore,
the answer is {3}.
Success rate > 60%
Task 2: Two Circles
Prompts:
- βAre the two circles touching each other? Answer with Yes/No.β
- βAre the two circles overlapping? Answer with Yes/No.β
**Step 1: Identify the Circles**
The image contains a blue circle and a red circle.
**Step 2: Determine if the Circles are Touching Each Other**
To determine if the circles are touching each other, we need to check if they intersect at exactly one point or not. If they do not intersect at all, then they are not touching. If they intersect at more than one point, then they are overlapping but not necessarily touching.
**Step 3: Check for Intersection Points**
Upon examining the image closely, it appears that the two circles intersect at exactly one point. This means that the circles are indeed touching each other.
**Conclusion:**
Yes
The image depicts a diagram with two circles, one red and one blue. The red circle is positioned on the left side of the diagram, while the blue circle is located on the right side.
To determine whether the two circles are touching each other, we need to analyze their positions and sizes. Upon closer inspection, it becomes apparent that the two circles do not overlap or touch at any point. Instead, they appear to be separated by a small gap between them.
Therefore, based on this analysis, **No**, the two circles are not touching each other.
Success rate > 80%
Task 3: The circled letter
Prompts:
- βWhich letter is being circled?β
- βWhich character is being highlighted with a red oval?β
**Answer:** The letter "n" is highlighted in red and encircled.
Success rate: 100%
Counting overlapping shapes
Prompts:
- βHow many {shapes} are in the image? Answer with only the number in numerical format.β
- βCount the {shapes} in the image. Answer with a number in curly brackets e.g. {3}.β
The model answered with either 6, 7, or 8 but never gave the correct number, which is 9. And guess what? Even GPT-4 canβt do it!
Success rate: 0%
Following single-colored paths
Prompts:
- βHow many single-colored paths go from A to C? Answer with a number in curly brackets, e.g., {3}β
- βCount the one-colored routes that go from A to C. Answer with a number in curly brackets, e.g., {3}.β
To find the number of single-colored paths from point A to point C, we need to count the distinct paths that only use one color throughout their entire length.
**Step 1: Identify the colors used on the graph**
Looking at the graph provided, we can see that there are four different colors used:
* Red
* Blue
* Green
* Purple
However, upon closer inspection, it becomes apparent that not all these colors are utilized in creating paths from A to C. We focus only on those colors that lead directly or indirectly from A to C without interruption.
**Step 2: Determine the relevant paths**
Upon examining the graph closely for paths using a single color from A to C:
* There is one path going directly from A to B (using red) and then continuing in another segment.
* Another path leads from A to D and then to C, but this involves two colors.
Given this analysis, only one path utilizes a single color continuously from A to C.
**Conclusion**
Therefore, the number of single-colored paths from point A to point C is {1}.
Answer: {1}.
To find the number of single-colored paths from point A to point C, we need to analyze the given graph and count the distinct paths that meet this criterion.
**Step 1: Identify Single-Colored Paths**
Single-colored paths are those where all edges belong to a single color. In this context, a path is considered single-colored if it consists entirely of red edges.
**Step 2: Count Distinct Red Edges from A to C**
From point A to point C, there are two distinct red edges that form a direct path:
- One edge connects A directly to C.
- Another set of edges forms a longer path through intermediate nodes but ultimately leads back to the same conclusion regarding single-color paths.
**Step 3: Consideration of Alternative Paths**
Upon closer inspection, it becomes apparent that while there are multiple paths from A to C, only two of these paths consist entirely of red edges. This is because any path involving more than one color would not be considered "single-colored."
**Conclusion**
Given the analysis above, the correct answer reflects the count of single-colored paths (all red) from point A to point C.
Answer: {2}
Success rate > 60%
Both GPT-4o and Claude 3.5 Sonnet failed in this task.
After playing with this model for a while, I donβt see a significant difference; it can be compared to GPT-4o and Sonnet-3. However, itβs still not as good as Gemini-1.5 Pro, which I believe might be the best VLM right now. Keep in mind that this is just the 11B version of the model. If you have enough resources to test the 90B version, simply replace β11Bβ with β90Bβ in the following code.
Try it yourself
Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh
Start ollama
import os
get_ipython().system = os.system
!ollama serve &
Pull the model
!ollama pull llama3.2-vision
Try it
import ollama
def extract_document_info(image_path):
response = ollama.chat(
model='llama3.2-vision',
messages=[{
'role': 'user',
'content': "your question about the image"
),
'images': [image_path]
}]
)
return response
#example
image_path ="/kaggle/input/image.jpg"
result = extract_document_info(image_path)
print(result["message"]["content"])
Thatβs it! This was a quick review of LLaMA 3.2 Vision (11B version). Itβs actually good at OCR too, I tested it on some really challenging images. The 90B version is better, of course, but it requires substantial resources to run. Once I find a good resource, Iβll attach it!
llama3.2-vision:90b
Llama 3.2 Vision is a collection of instruction-tuned image reasoning generative models in 11B and 90B sizes.
ollama.com
VLMs are Blind
Research showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.
vlmsareblind.github.io
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI