Llama 3.2 Vision Review

Author(s): Barhoumi Mosbeh

Originally published on Towards AI.

Ollama multi-modal

Ollama has just announced its official support for the Llama 3.2 Vision models. The Llama 3.2 Vision models come in two sizes: 11 billion and 90 billion parameters. In this article, I will provide an overview of their performance in different cases, along with my personal thoughts on them.

The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Also the model supports many languages:

For text only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note for image+text applications, English is the only language supported.

Evaluation

I will use the approach from this paper for evaluation, which actually argues that vision-language models are ‘blind.’ It tests models on several tasks:

VLMs are Blind

Research showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.

vlmsareblind.github.io

Task 1: Line Intersections
Task 2: Two Circles
Task 3: Circled Letter
Task 4: Overlapping Shapes
Task 5: Nested Squares
Task 6: Counting Grid
Task 7: Subway Map

I will choose the most challenging parts from each task and see if the model can get any of them right. Additionally, I will run the code around 10–20 times to calculate the average of accurate responses and identify any false results.

Task 1: Line Intersections

Prompts:

“How many times do the blue and red lines touch each other? Answer with a number in curly brackets, e.g., {5}.”
“Count the intersection points where the blue and red lines meet. Put your answer in curly brackets, e.g., {2}.”

The blue and red lines intersect at 2 points. Therefore, they touch each 
other twice. {2}

The blue and red lines touch each other at three points. Therefore, 
the answer is {3}.

Success rate > 60%

Task 2: Two Circles

Prompts:

“Are the two circles touching each other? Answer with Yes/No.”
“Are the two circles overlapping? Answer with Yes/No.”

**Step 1: Identify the Circles**

The image contains a blue circle and a red circle.

**Step 2: Determine if the Circles are Touching Each Other**

To determine if the circles are touching each other, we need to check if they intersect at exactly one point or not. If they do not intersect at all, then they are not touching. If they intersect at more than one point, then they are overlapping but not necessarily touching.

**Step 3: Check for Intersection Points**

Upon examining the image closely, it appears that the two circles intersect at exactly one point. This means that the circles are indeed touching each other.

**Conclusion:**

Yes

The image depicts a diagram with two circles, one red and one blue. The red circle is positioned on the left side of the diagram, while the blue circle is located on the right side.

To determine whether the two circles are touching each other, we need to analyze their positions and sizes. Upon closer inspection, it becomes apparent that the two circles do not overlap or touch at any point. Instead, they appear to be separated by a small gap between them.

Therefore, based on this analysis, **No**, the two circles are not touching each other.

Success rate > 80%

Task 3: The circled letter

Prompts:

“Which letter is being circled?”
“Which character is being highlighted with a red oval?”

**Answer:** The letter "n" is highlighted in red and encircled.

Success rate: 100%

Counting overlapping shapes

Prompts:

“How many {shapes} are in the image? Answer with only the number in numerical format.”
“Count the {shapes} in the image. Answer with a number in curly brackets e.g. {3}.”

The model answered with either 6, 7, or 8 but never gave the correct number, which is 9. And guess what? Even GPT-4 can’t do it!

Success rate: 0%

Following single-colored paths

Prompts:

“How many single-colored paths go from A to C? Answer with a number in curly brackets, e.g., {3}”
“Count the one-colored routes that go from A to C. Answer with a number in curly brackets, e.g., {3}.”

To find the number of single-colored paths from point A to point C, we need to count the distinct paths that only use one color throughout their entire length.

**Step 1: Identify the colors used on the graph**

Looking at the graph provided, we can see that there are four different colors used:

* Red
* Blue
* Green
* Purple

However, upon closer inspection, it becomes apparent that not all these colors are utilized in creating paths from A to C. We focus only on those colors that lead directly or indirectly from A to C without interruption.

**Step 2: Determine the relevant paths**

Upon examining the graph closely for paths using a single color from A to C:

* There is one path going directly from A to B (using red) and then continuing in another segment.
* Another path leads from A to D and then to C, but this involves two colors.

Given this analysis, only one path utilizes a single color continuously from A to C.

**Conclusion**

Therefore, the number of single-colored paths from point A to point C is {1}.

Answer: {1}.

To find the number of single-colored paths from point A to point C, we need to analyze the given graph and count the distinct paths that meet this criterion.

**Step 1: Identify Single-Colored Paths**

Single-colored paths are those where all edges belong to a single color. In this context, a path is considered single-colored if it consists entirely of red edges.

**Step 2: Count Distinct Red Edges from A to C**

From point A to point C, there are two distinct red edges that form a direct path:

- One edge connects A directly to C.
- Another set of edges forms a longer path through intermediate nodes but ultimately leads back to the same conclusion regarding single-color paths.

**Step 3: Consideration of Alternative Paths**

Upon closer inspection, it becomes apparent that while there are multiple paths from A to C, only two of these paths consist entirely of red edges. This is because any path involving more than one color would not be considered "single-colored."

**Conclusion**

Given the analysis above, the correct answer reflects the count of single-colored paths (all red) from point A to point C.

Answer: {2}

Success rate > 60%

Both GPT-4o and Claude 3.5 Sonnet failed in this task.

After playing with this model for a while, I don’t see a significant difference; it can be compared to GPT-4o and Sonnet-3. However, it’s still not as good as Gemini-1.5 Pro, which I believe might be the best VLM right now. Keep in mind that this is just the 11B version of the model. If you have enough resources to test the 90B version, simply replace “11B” with “90B” in the following code.

Try it yourself

Install Ollama

!curl -fsSL https://ollama.com/install.sh | sh

Start ollama

import os
get_ipython().system = os.system
!ollama serve &

Pull the model

!ollama pull llama3.2-vision

Try it

import ollama
def extract_document_info(image_path):
 response = ollama.chat(
 model='llama3.2-vision',
 messages=[{
 'role': 'user',
 'content': "your question about the image"
 ),
 'images': [image_path]
 }]
 )
 return response
#example
image_path ="/kaggle/input/image.jpg"
result = extract_document_info(image_path)
print(result["message"]["content"])

That’s it! This was a quick review of LLaMA 3.2 Vision (11B version). It’s actually good at OCR too, I tested it on some really challenging images. The 90B version is better, of course, but it requires substantial resources to run. Once I find a good resource, I’ll attach it!

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Llama 3.2 Vision Review

Author(s): Barhoumi Mosbeh

Ollama multi-modal

Evaluation

VLMs are Blind

Research showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.

Task 1: Line Intersections

Task 2: Two Circles

Task 3: The circled letter

Counting overlapping shapes

Following single-colored paths

Try it yourself

Install Ollama

Start ollama

Pull the model

Try it

llama3.2-vision:90b

Llama 3.2 Vision is a collection of instruction-tuned image reasoning generative models in 11B and 90B sizes.

VLMs are Blind

Research showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement

Subscribe to our AI newsletter!

🔥 Recommended Articles 🔥