Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!

Publication

MiniGPT-4: ChatGPT for your images
Latest   Machine Learning

MiniGPT-4: ChatGPT for your images

Last Updated on July 17, 2023 by Editorial Team

Author(s): Sriram Parthasarathy

Originally published on Towards AI.

Use MiniGPT-4 to describe your images in detail, identify objects, recognize scenes, and generate captions & social media ads.

MiniGPT-4 is an AI model that uses a combination of language and visual processing to understand images and generate text descriptions about them. It is a type of computer program that has been trained to recognize visual features in images and use that information to create descriptions of what is in the image. In this article, we will describe the model and how it was trained and show several potential applications of using this model.

Table of contents

Introduction to MiniGPT-4

Training process

Why a two-step training process?

10 Real Life Examples

1. Detecting street names and traffic signal

2. Detecting car plates

3. Detecting if a car has headlights (when snowing)

4. Detecting rotten produce

5. Captions for an designer dress

6. Identify a Car wash place and create a social media ad

7. Identify Pizza toppings

8. Detecting Rail tracks and identifying if its safe

9. Examine a Floor plan

10. Detect a scratch in the car

Conclusion

Introduction to MiniGPT-4

MiniGPT-4 is an AI model that can be used to generate text descriptions from images. It can perform various tasks that involve understanding and interpreting images and language. For example, it can be used to generate a description of an image, answer questions about an image, and create a caption or a social media ad for an image. It is a powerful tool that can be applied in a variety of domains, such as manufacturing, healthcare, or finance, to improve efficiency and accuracy in tasks that involve image and language processing. Read more papers here. Read more about BLIP here.

Training process

It was trained using a two-stage process: In the first stage, it was taught to learn about images by looking at millions of image and text pairs. This allowed it to understand how certain objects, people, or places look like and how to describe them in words. Input is the image, and the output is the text created by the vision transformer. The training took around 10 hours and used four A100 (80GB) GPUs.

From : https://arxiv.org/pdf/2301.12597.pdf First step is to learn to convert the image to representative text. Second step is to convert the text in to a conversational style.

In the second stage, a smaller dataset of high-quality image and text pairs were used to fine-tune and make its text descriptions more accurate and natural.

From: https://github.com/Vision-CAIR/MiniGPT-4/blob/main/MiniGPT_4.pdf Input is the image which is converted to text. This text is read by Vicuna to respond back to the user’s prompts

During the second stage of training, the MiniGPT-4 model is fine-tuned with a smaller but high-quality dataset of image-text pairs that have been curated specifically for alignment purposes. This dataset is used to further enhance the model’s generation reliability and usability by allowing it to produce more natural and reliable responses. The fine-tuning process is remarkably efficient and only takes about 10 minutes to complete with a single A100 GPU.

Why two-step training process?

The reason for using two stages of training is to address certain limitations in the first stage of pre-training. During the first stage, the MiniGPT-4 model is pre-trained on a large dataset of aligned image-text pairs to acquire vision-language knowledge. However, even though the model acquires a wealth of knowledge, it may struggle to produce coherent linguistic output, such as generating repetitive words or sentences, fragmented sentences, or irrelevant content.

In the second step, a high-quality image-text dataset is used to fine-tune the model, enabling it to generate more natural and reliable responses. This two-step process ensures that the model can handle complex visual-language tasks with improved accuracy and reliability.

The LLM used by MiniGPT-4 is called Vicuna, which is constructed upon LLaMA and can perform a wide range of complex linguistic tasks. The visual encoder used by MiniGPT-4 is the same one used in BLIP-2, which is a ViT backbone coupled with a pre-trained Q-Former.

10 Real Life Examples

We will walk through 10 potential real-life examples of using MiniGPT 4. For each, we will submit an image and ask a set of specific questions and comment on the outcome to see if we got it right or not.

1. Detecting street names and traffic signal

Outcome:

It detects text but does not associate that with an address. It got the traffic signal wrong. It's red and not green. This can be fixed with training. See the output from mini GPT4 below.

Explain in detail what is there in the picture.

The image shows a traffic light with the words “no turn on” and “main street” on it. The light is green, and there are trees in the background. The image is a street sign indicating the direction of traffic flow and the name of the street.

What is the name of the street?

The name of the street is not specified in the image.

Is the traffic light red?

The traffic light in the image is green, not red.

2. Detecting car plates

Outcome:

It detects the plate, car color, and make. See the output from mini GPT4 below.

Read the make of the car and the license plate

The license plate on the car reads 228 FIF1.

What is the color of the car

The color of the car is silver.

What is the make of the car

The make of the car is BMW.

3. Detecting if a car has headlights (when snowing)

Outcome:

It detected that headlights were on during snowy conditions. See the output from mini GPT4 below.

Are the headlights on?

Yes, the headlights are on.

is there snow on the road?

Yes, there is snow on the road.

4. Detecting rotten produce

Outcome:

It failed to detect the rotten items. This can be fixed with training. See the output from mini GPT4 below.

Do you see any rotten apples?

Yes, I see several green apples arranged in a circle on a bamboo mat.

Explain what u see

I see a group of green apples arranged in a circle on a bamboo mat. The apples appear to be fresh and unblemished, with no signs of rot or decay. The apples are of different sizes, with some being larger than others. The apples are arranged in a neat and orderly manner, with each apple touching the one next to it. The apples are placed on the bamboo mat in such a way that they form a circle, with the largest apple at the center and the smallest apples at the edges. The apples are green in color, with a glossy surface that reflects the light. The apples appear to be fresh and ripe, with no signs of spoilage or decay. The apples are arranged in a symmetrical pattern, with each apple facing inward toward the center of the circle. The apples are of different sizes, with some being larger than others. The apples are arranged in a neat and orderly manner, with each apple touching the one next to it. The apples are placed on the bamboo mat in such a way that they form a circle, with the largest apple at the center and the smallest apples at the edges. The apples are green in color, with a glossy surface that reflects the light. The apples appear to be fresh and ripe, with no signs of spoilage or decay.

5. Captions for a designer dress

Outcome:

It created the caption and a social media ad. See the output from mini GPT4 below.

Explain the picture.

The image shows a woman wearing a beige and black checkered shirt dress with a belt around the waist. The dress has a collar and long sleeves, and the belt is tied around the waist. The dress has a straight cut and falls just below the knee. The dress is a casual, everyday wear piece.

Give an engaging caption for this picture and create a social media ad for the same

Caption: “Stay on trend with our classic checkered shirt dress! Featuring a belted waist and long sleeves, this dress is perfect for any occasion. Pair it with some heels, and you’re ready to take on the day! #burberlycheckered #ootd”

6. Identify a Car wash place and create a social media ad

Outcome:

It detects handwritten text. It got the number of cars. Could not identify the color of the car but could identify the color of the t-shirt. This can be fixed with training. See the output from mini GPT4 below.

Explain the picture.

The image shows a young man standing in front of a car wash sign. He is wearing a green shirt and black shorts, and holding a sign that says “car wash” in yellow letters. The background is a parking lot with several cars parked in it.

how many cars are parked?

There are three cars parked in the background of the image.

Identify the color of cars and identify the color of the t-shirt the young man is wearing.

The color of the cars parked in the background is not specified in the image. The young man is wearing a green shirt.

is there a red car?

It is not possible to determine if there is a red car in the image as the color of the car is not specified.

Come up with a social media ad for this picture.

“Get your car washed today at our convenient location! With our state-of-the-art equipment and friendly staff, we’ll have your car looking like new in no time. Plus, with our eco-friendly products, you can feel good about doing your part for the environment. Stop by today and see for yourself why we’re the go-to car wash in town! #carwash #ecofriendly #convenient”

7. Identify Pizza toppings & number of slices

Outcome:

It got the topping and the number of slices. See the output from mini GPT4 below.

Explain in detail what you see in the picture.

The image shows a pizza with different toppings on it. There are slices of mushrooms, black olives, and pepperoni on the pizza. The pizza is on a metal tray with a handle on the side. The tray has a pattern of lines on it. There is a napkin on the table next to the pizza.

The pizza has a crispy crust, and the cheese is melted. The toppings are evenly distributed on the pizza. The color of the pizza is golden brown. The pizza looks delicious and appetizing.

How many slices are there?

The image shows six slices of pizza on the tray.

8. Detecting Rail tracks & identify if its safe

Outcome:

It detects rail tracks but cannot detect that it's broken. This can be fixed with training. See the output from mini GPT4 below.

What do u see in the picture?

The image shows a set of train tracks with weeds growing along the sides.

Is the train tracks in good condition?

It is difficult to determine the condition of the train tracks in the image without more information. The image only shows the tracks, so it is not possible to assess their condition.

The image shows a set of train tracks with weeds growing along the sides.

9. Floor plan

Explain the picture?

This is a floor plan for a house with two bedrooms, a living room, and a kitchen. The layout is open and spacious, with a large living room and kitchen that are connected to each other. The bedrooms are located on the second floor, and there is a staircase leading up to them. The house also has a garage and a backyard.

what is the total square feet? How many rooms do we have?

I’m sorry, but I don’t have enough information to answer your question. The floor plan you provided does not include the total square feet or the number of rooms in the house.

Please provide more details or a different image for me to be able to help you better.

What kind of questions can u answer from a floor plan like the one provided?

From the floor plan provided, I can answer the following questions:

1. The floor plan shows a two-story house with three rooms, a living room, a kitchen, and two bedrooms.
2. The house has an open layout with the living room and kitchen connected.
3. The bedrooms are located on the second floor.
4. The house has a garage and a backyard.
5. The total square footage of the house is not specified in the floor plan.
6. The number of rooms in the house is three.

10. Scratch in the car

Outcome:

It detected the scratch. See the output from mini GPT4 below

Explain the picture

This is a blue car with a damaged front bumper. The car has a damaged front bumper, which is covered in scratches and dents. The car’s front bumper is damaged, and it has a large dent on the left side.

Conclusion

In conclusion, MiniGPT-4 is an advanced language model that has the ability to interpret and understand images. With its ability to collaborate with specialized vision models, MiniGPT-4 can generate text descriptions for images, answer questions, and even provide detailed explanations & social media ads. The two-stage training approach, involving pre-training on a large collection of aligned image-text pairs followed by fine-tuning with a high-quality alignment dataset, helps to improve the model’s language generation ability to generate more human-friendly outputs.

MiniGPT-4 has many use cases in industries such as e-commerce, healthcare, and manufacturing. With its open-source solution, it enables businesses to efficiently analyze and understand images, enhance customer engagement, and improve decision-making. With the advancement in AI technology, MiniGPT-4 & related computer vision technologies will continue to improve and revolutionize the way we interact with images.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓