
How Good is Google Gemini 1.5 With a Massive 1 Million Context Window?

Author(s): Dipanjan (DJ) Sarkar

Originally published on Towards AI.

Created by Author with DALL-E

Introduction

Artificial Intelligence, and particularly Generative AI, is evolving at a rapid pace; every other week, we hear about a new Large Language Model (LLM) being released! Google has once again pushed the boundaries with the introduction of Gemini 1.5, its next-generation large language model. Google claims the new model improves efficiency and quality through a Mixture-of-Experts approach: depending on the input, it routes the request to a subset of smaller "expert" neural networks, so responses are faster and of higher quality.
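To make the routing idea concrete, here is a minimal, illustrative sketch of top-1 expert routing in Python. Google has not published Gemini 1.5's internals, so the gating and expert networks below are hypothetical stand-ins for the general pattern, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, n_experts = 16, 4

# Hypothetical stand-ins: one small weight matrix per "expert" network.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate = rng.normal(size=(d_model, n_experts))  # the routing ("gating") network

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Top-1 routing: score the experts, run only the best-matching one."""
    scores = x @ gate                 # one score per expert
    chosen = int(np.argmax(scores))   # pick the top expert for this input
    return x @ experts[chosen]        # only that expert does any work

x = rng.normal(size=(d_model,))
y = moe_forward(x)
print(y.shape)  # (16,): output computed by a single expert, not all four
```

The efficiency claim follows from this structure: per request, only a fraction of the total parameters are exercised, even though the full model is much larger.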

Gemini 1.5 has a MASSIVE token context window; Source: Google Tech Blog

This model promises to revolutionize how developers interact with AI, given its multimodal capabilities to work with text, images, and even video. The eye-catching part: Gemini 1.5 promises a MASSIVE 1 million token context window! Yes, you heard it right, 1 million tokens. For comparison, OpenAI's most powerful GPT-4 model currently has a context window of 128K tokens.

But just how transformative is Gemini 1.5? Luckily, as a Google Cloud Champion Innovator, I got early access to the new Gemini 1.5 model. Let's dive into its key capabilities before taking it for a test drive in Google AI Studio!

Key Capabilities

Gemini 1.5 stands out with its groundbreaking 1M (million) token context window, a significant leap from Gemini 1.0 Pro, which offers a 32K context window. Do note that the default version of Gemini 1.5 Pro will ship with a 128K token context window; the 1M token context window is currently in private preview, accessible to a select group of users and beta testers, but expect it to reach general availability very soon.

This is a multimodal LLM, and the new, super-lengthy 1M context window opens up fresh possibilities for developers, enabling the processing of extensive documents, code repositories, or even lengthy videos in a single go. Gemini 1.5 Pro reasons across these multiple modalities of input data and outputs text.

Gemini 1.5 can understand and extract useful information from large documents; Source: Google Dev Blog
  • Unprecedented Context Window: With a default 128K token window expandable to 1M tokens, Gemini 1.5 can handle over 700K words of text, allowing for the analysis of entire codebases and large PDFs directly (see the token-count sketch after this list).
  • Multimodal Understanding: Whether it is text, code, images, or video, Gemini 1.5 can reason across different formats, providing insights and solutions by integrating various types of data. Unfortunately, the audio track of videos is not yet supported.
  • Enhanced Developer Tools: Google AI Studio simplifies integrating the Gemini API into applications, supported by features like easy tuning for specific needs and lower pricing for the Gemini 1.0 Pro model, ensuring a good balance of cost and performance.
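To see how much of the window a given input occupies before prompting, the google-generativeai Python SDK exposes a count_tokens call. A minimal sketch, assuming you have SDK access and an API key; the model identifier and file name are placeholders:

```python
import google.generativeai as genai

# Assumption: you have SDK access and an API key from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")  # placeholder model id

# Hypothetical file holding the text you plan to send.
with open("large_document.txt") as f:
    text = f.read()

# Check how much of the context window (128K by default, 1M in
# preview) the input would occupy before actually prompting.
usage = model.count_tokens(text)
print(f"Input occupies {usage.total_tokens} tokens")
```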

Test Driving Google Gemini 1.5 in Google AI Studio

Here I will showcase two real-world examples where I tried out Google Gemini 1.5 with its 1M context window. The first example is based on analyzing a large PDF document, and the second involves analyzing multiple large videos. Before that, let's look at the platform where I will access these LLMs.

Getting Started with Google AI Studio

You can access the current Google Gemini 1.0 LLMs, and very soon the new Gemini 1.5, in Google AI Studio. Simply go to the link here and sign in with your Google email account, and you should be good to go. You can then click on Create New on the left side and select Chat Prompt, as depicted in the following screenshot. Take a closer look at the right side; that is where you can select the Gemini 1.5 Pro model, and as soon as you do, the Preview box at the bottom shows how much of the 1M context window you are currently using.

Google AI Studio prompt creation page
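If you prefer to work outside the AI Studio UI, the same models can be called programmatically. A minimal sketch with the google-generativeai SDK, assuming the preview model identifier below is available to your account:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key created in Google AI Studio

# Assumption: this preview model identifier is available to your account.
model = genai.GenerativeModel("gemini-1.5-pro-latest")

response = model.generate_content(
    "Explain what a 1M token context window lets a developer do, in 3 bullets."
)
print(response.text)
```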

Analyzing a Large Investment Scheme Document

You can upload really large documents and have Google Gemini 1.5 Pro understand, reason over, and give you useful insights pertaining to them. Personally, I invest quite a bit internationally as well as in my home country's investment schemes, and I have recently been looking at a new gold-backed ETF offering that is launching. You can take a look at the document here if you are interested.

Our input will be a huge 96-page investment document

This is an investment scheme document for an ETF that will closely track the price of gold. If you check the PDF, it is a massive 96-page document, and who has the time to read all of it, considering everyone's busy schedules? So I uploaded the document into Google AI Studio and asked Gemini 1.5 to summarize it and show me the key pros and cons. You can see the output from Gemini 1.5 below; it is pretty comprehensive and showcases the key highlights (it is an ETF, easy to transact, and cost-effective) as well as some risks, including tracking error, market, and liquidity risks.

Summarizing the large document with Gemini 1.5
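The same document workflow can be scripted through the SDK's File API. A minimal sketch, assuming your account's File API accepts PDF uploads (otherwise you can paste extracted text into the prompt directly); the filename is hypothetical:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Hypothetical local copy of the 96-page scheme document.
doc = genai.upload_file(path="gold_etf_scheme.pdf")

# Pass the uploaded file and the instruction together as one prompt.
response = model.generate_content([
    doc,
    "Summarize this investment scheme document and list the key "
    "pros and cons for a potential investor.",
])
print(response.text)
```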

When asked how the NAV is computed, it gives a pretty good summary, and you can see it has generated this from parts of the document beyond page 85.

Querying specific aspects of the large document using Gemini 1.5

I also tried to make it act as an advisor for various age groups, asking whether someone in their 30s should invest in this scheme versus someone in their 60s. It does a pretty good job: it focuses on long-term growth for people in their 30s, who have time on their side, and on capital preservation and volatility risk for people above 60, who would not want schemes subject to extreme short-term price fluctuations. The suggested investment allocation remains the same for both age groups, which probably reflects how much one should put into gold as an asset versus other schemes involving equities, bonds, and so on. It is also not surprising that the LLM plays it safe with its recommendations, likely due to the heavy guardrails and moderation applied internally.

Comparison of investment recommendations from Gemini 1.5 based on different age groups

Overall, Gemini 1.5 does pretty well on large documents. It is still not perfect, but given the rapid pace of advancement, it is quite an achievement in the short time since Gemini 1.0 launched.

Extracting Video Game Strategies from YouTube

Gemini 1.5 is a multimodal model that can extract insights from videos, too. It breaks a video down into thousands of frames (without audio), analyzes those frames, and can perform complex reasoning, summarization, and question-answering over them. I believe that once the audio dimension is added, the model will probably get even better, as I do notice some shortcomings, which I will show in my analysis below.

Honkai Star Rail; Source: EuroGamer

The scenario: Honkai Star Rail is a very popular online RPG that I play in my spare time. You usually play with a team of four characters, and a large part of the game involves turn-based combat and optimizing character builds with equipment, skills, and more! Some folks create really detailed YouTube guides for each game character, and I wanted to see whether Google Gemini 1.5 could extract key insights about specific character builds from these videos.

Turn-based 4 character team gameplay; Source: EuroGamer

The input videos for this use case come from a popular YouTuber, Braxophone, who makes interesting guides on how to build each game character as they are released. I took three of his videos covering three popular characters from Honkai Star Rail and started prompting Gemini 1.5 as follows.

Each video is processed frame by frame in Gemini 1.5
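Programmatically, videos follow the same File API pattern as documents, with one twist: uploaded videos are processed server-side (split into frames) before they can be prompted, so you poll until processing finishes. A minimal sketch, with hypothetical filenames standing in for the three guide videos:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Hypothetical local copies of the three character-guide videos.
paths = ["guide_video_1.mp4", "guide_video_2.mp4", "guide_video_3.mp4"]

videos = []
for path in paths:
    video = genai.upload_file(path=path)
    # Videos are split into frames server-side; wait until the file
    # leaves the PROCESSING state before using it in a prompt.
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)
    videos.append(video)

response = model.generate_content(
    videos + ["Which game characters do these videos cover?"]
)
print(response.text)
```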

It is clear that each video is split into many frames that are then processed by the model, and we can see the token count quickly rise to almost 800K tokens! I started off with a simple question as follows.

Querying videos for key entities using Gemini 1.5

So far, it does quite well and is able to recognize the characters from each of the three videos. I make the next question a bit more complex, asking it to give me a full character build for Blade.

Getting a specific character build from a video using Gemini 1.5

This looks pretty comprehensive and covers all the key components of a build: equipment (relics and light cones), stats (partly correct), and possible teams. However, it makes a fundamental mistake here by suggesting the 4-piece Thief relic set, which is actually the best set for another character covered in the second video (Ruan Mei). So I tried another prompt, this time being a bit more specific.

Specific prompting to make Gemini give more accurate results

It does get it correct this time, probably pulling the information from the relevant video at the 5:25 mark! Of course, it is still not 100% correct: a relic set comes with 4 pieces (like Longevous Disciple), and you can stack an additional 2-piece set (like Rutilant Arena), known as a planar ornament, so each character can wear a total of 6 pieces. The LLM probably does not have this context from a short video, though, so that is understandable.

Correct recommendations for Blade’s equipment from the source video; Source: Braxophone

The interesting part, however, is that the above video on Blade was released almost 6 months back, if you check it on YouTube. Yet if you look at the team compositions in the LLM response above, one of the suggested teams for Blade pairs him with Bronya and Ruan Mei. The character Ruan Mei had not even been released at that time!

The LLM is smart enough to pick up Blade being mentioned in the Ruan Mei character video, which was created on YouTube just a month back on her release (that was one of the other two videos I had given to the Gemini 1.5 model). This is actually mentioned by Braxophone, and you can see it in the following screenshot taken from the video at the 13:53 mark.

Blade’s team composition suggested in a different video but captured correctly by Gemini 1.5; Source: Braxophone

This reinforces that the LLM can quickly understand and relate entities occurring across different videos, even though it was never trained on this data! It is not perfect and does mix up aspects among different characters, but overall it performs quite well given that it is one of Google's first multimodal models.

Conclusion

With its massive 1M token context window and multimodal capabilities, Gemini 1.5 has demonstrated its prowess in handling complex, diverse data types ranging from large documents to video analytics. Through practical use-cases, such as analyzing an elaborate investment scheme document and extracting nuanced strategies from video game guides, Gemini 1.5 definitely shows promise in being able to understand large and diverse modalities of data.

However, it does come with its own set of challenges. With larger contexts, the model sometimes mixes things up. There is also the limitation that the LLM cannot yet leverage the audio from video data, which hopefully gets fixed in a future release. Another serious limitation is compute time: each of these queries took me between 30 and 60 seconds, which shows that the larger the model and context window, the longer it takes to process prompts. Hopefully, with various optimizations, we will see this improve in the future.
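If you want to quantify this latency in your own experiments, a simple wall-clock measurement around the API call is enough. A minimal sketch, reusing the same hypothetical SDK setup as earlier:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

start = time.perf_counter()
response = model.generate_content("Summarize the risks of a gold-backed ETF.")
elapsed = time.perf_counter() - start

# Long-context prompts can take tens of seconds end to end.
print(f"Got {len(response.text)} characters back in {elapsed:.1f}s")
```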

In conclusion, despite the nascent state of multimodal large language models, Google Gemini 1.5 marks a significant stride in the progress of Generative AI, and I hope you got something useful out of this read!

Reach out to me on LinkedIn or via my website if you want to connect. I do quite a bit of AI consulting, training, and projects.


Published via Towards AI
