Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!


Metaphorically, ChatGPT is Alive
Latest   Machine Learning

Metaphorically, ChatGPT is Alive

Last Updated on November 5, 2023 by Editorial Team

Author(s): Aditya Anil

Originally published on Towards AI.

Image: Bing Image Creator

The growth of ChatGPT has been dramatic over the years. Recently, OpenAI announced that ChatGPT can now hear, see and speak.

The multimodality of ChatGPT has taken a new form.

On November 2023,

OpenAI’s ChatGPT appeared on the internet. Two months after that, with over 100 million users, it attained the title of the fastest-growing consumer software application in history. The nonprofit company saw the opportunity to make a profit, so it did.

The profits came from their freemium service, but most of those profits and funds largely went into paying their bills — thanks to the hungry resource demands of the LLM models.

On March 14, 2023,

The launch of GPT 4 cemented the name of OpenAI in the superintelligence utopia — which became a key player in extending the boundaries of AI and NLP technology further.

Other big companies showed interest too. Everybody started to extend this boundary further. At the same time, most of these tech companies made hefty profits from this revolutionary field of AI.

ChatGPT, which was on life support of billions of dollars from companies like Microsoft — can finally see, hear and talk.

Metaphorically, it is alive.

I. Voice: When ChatGPT Speaks

Image: Bing Image Creator

Watch this demo video by OpenAI, in which they reveal the new multimodal features inside the ChatGPT app:

This looks like a “Hello World” moment for ChatGPT — and it is alive, thanks to its new multimodal upgrade.

Through voice, users can send instructions to ChatGPT. ChatGPT would then respond in a seemingly natural voice. The new voice feature has very well promoted ChatGPT to a voice assistant. A powerful voice assistant in fact.

“We collaborated with professional voice actors to create each of the voices. We also use Whisper … to transcribe your spoken words into text,” said OpenAI in their annoucement post.

Whispher is a speech recognition system by OpenAI that is trained on 680,000 hours of data.

In the demo shared by OpenAI, the user asks ChatGPT app to tell a bedtime story about a hedgehog — to which it responds by telling a story. It sounds similar to chatGPT — literally sounds — and as reported by ZDNet, it is similar to how voice assistants like Amazon’s Alexa function.

As a matter of fact, rumors are Alexa is planning to integrate Generative AI like GPT4 to make its voice assistant more reliable and, well, smart.

II. Image: When AI See

Image: Bing Image Creator

In the demo by OpenAI, the user asked ChatGPT to fix their bike by sending the images of the bike to the app. ChatGPT ‘looked’ at those images, and came up with a solution to fix the bike [1].

Things got interesting when ChatGPT was able to correlate the instruction manual and tools and was able to guide the user on how to really fix the bike. [2]

The image input feature can be helpful in so many different situations: identifying objects, solving a math problem, reading an instruction manual, or (of course) fixing a bike. The ability to see images can greatly improve visual tasks that require analysis.

One interesting application of this feature is leveraged by a Danish startup called Be My Eyes.

Be My Eyes has been creating technology for over 250 million people who are blind or have low vision since 2012. They are using GPT-4 to aid these differently-abled people, and for this, they developed a GPT-4 powered AI version of their former Virtual Volunteer™ app.

This allows the Be My Eyes App — which is already assisting blind pupils with their challenges — to be better and more reliable.

Hello readers! Hope you are enjoying this article. This article is part of my Creative Block newsletter — a weekly newsletter on AI, Tech and Science. If you want to read more posts like this, head on to Creative Block.

Let’s continue.

According to OpenAI, Be My Eyes can benefit a lot of users as they can now interact with an AI assistance that — thanks to the image capability — allows them to know about their surroundings well.

“Image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images”, says OpenAI in a blog.

III. Safety: When ChatGPT (tries to) become safe

Image: Bing Image Creator

OpenAI conducted beta testing and “red teaming” to explore and mitigate risks.

This allows ChatGPT to be nearly safe, if not completely.

Not too long ago, OpenAI published a paper describing their testing efforts with GPT-4V. GPT-4V, stemming from the word GPT-4(V)ision, is a GPT-4 model to analyze image inputs provided by the user.

The primary goal, in OpenAI’s own words, was to “gain additional feedback and insight into the real ways people interact with GPT-4V.”

The paper gives us a taste of the risks in the multimodal nature of GPT4.

OpenAI’s positive evaluation shows that ChatGPT was able to avoid harmful content. It seems to refuse to generate AI images that includes real people. Moreover, the GPT4-V also refused to identify people in images.

However, the negative evaluations show that GPT-4V is still bound to generate disinformation, break CAPTCHAs, or geolocate images.

Building on top it, OpenAI says the following:

“…Tasks such as the ability to solve CAPTCHAs indicate the model’s ability to solve puzzles and perform complex visual reasoning tasks. High performance on geolocation evaluations demonstrate world knowledge the model possesses and can be useful for users trying to search for an item or place” says OpenAI in its GPT-4V(ision) System Card report highlights

Thanks to AI, gone are the days of CAPTCHAs.

OpenAI found one interesting finding. GPT-4V is quite good at refusing of image-based “jailbreaks.”

Image jailbreaking is a term that refers to the process of modifying an image generator AI model (midjourney, dalle3, etc) to bypass its built-in limitations or restrictions.

It is a form of hacking (more of tricking) these image models into generating sensitive images, either by exploiting their flaws or by manipulating their inputs.

From the below graph by OpenAI, we see how GPT-4 was able to achieve the jailbreak refusal — with a refusal rate of more than 85%

Image Source: OpenAI

The graph compares three variations of GPT4: GPT-4 Release, GPT-4V, and GPT-4V + Refusal System. [3]

OpenAI also engaged “red teams” to test the model’s abilities in scientific domains, such as understanding images in publications, and its ability to provide medical advice given medical images such as CT scans.

So is this reliable? Of course not.

OpenAI’s conclusion on this is clear: “We do not consider the current version of GPT-4V to be fit for performing any medical function.”

So the image capability isn’t fully reliable yet. However, it’s a big leap nonetheless.

OpenAI in its blog mentioned that these new features would come slowly — citing safety concerns.

IV. Where are we landing on the AGI Dreams?

Image: Bing Image Creator

OpenAI’s latest additions to ChatGPT are nothing short of remarkable. Multimodality is the path on which OpenAI has to go if it wants to achieve the AGI.

Will it achieve AGI, or not, that is up for debate. How do we know if AGI is here? Frankly, it isn’t even clear to many AI experts themselves.

But in loose terms, we may know what’s AGI: Artificial General Intelligence (AGI) is just a theoretical term that refers to an AI that is at par with humans in terms of their cognitive abilities.

There is one difficulty though, there is no way to pinpoint a certain time in future where we can say — AGI has been achieved.

But taking cues from the past, it seems like anytime a computer outsmarts a human, we get closer to AGI.

Deep Blue beat Kasparov in Chess — AGI is near. AlphaGO beats the world go champion — AGI is near. AI started to outperform humans on various aptitude tests — AGI is near.

Image Source: Wikimedia Commons

AI now seems to outperform humans when it comes to creativity. And now, everybody seems to believe AGI is near.

However, the AGI becomes far whenever we find a fault within these AI systems. Hallucination, misinformation, and bias; you know it. Even when we have the largest and strongest AI model, these caveats form the roadblock on our supposed AGI journey.

To our annoyance, many points out saying that these shortcomings of AI are fundamental, and intrinsic — with no cure.

However, quite interestingly, we have some instances where humans seem not too bad in front of AI after all.

The widely circulated report, which said AI outperformed humans on creativity tests, didn’t show a significant outperformance. AI was certainly at par, but not always the best most of the time. Moreover, the story is pretty interesting in the AlphaGo case. In a dramatic display of ‘revenge’, Kellin Pelrine, who was an American research scientist intern at FAR AI, defeated AlphaGo at Go — by apparently exploiting a weakness in the system.

I feel the multimodality of AI is the way to go if our destination is AGI. And even if we can’t achieve it in the near future, we might get close to AGI.

Integration of voice input and output, image recognition, and a commitment to safety leads to a ChatGPT that is continuously evolving — becoming a more versatile and reliable AI assistant. The ability to make inferences by analysing the surroundings is very close to how humans also learn.

These features open up a world of possibilities, from hands-free interaction to solving visual problems.

Moreover, ChatGPT would soon be capable of searching the internet inside the ChatGPT window [4]. These features, as of now, will soon be available to all users and developers. According to OpenAI, it would slowly roll out all the features — with the ChatGPT Plus and Enterprise users as priority.

The Browser functionality — though currently only available for Plus and Enterprise users — would be available for all the users pretty soon, according to a statement by OpenAI.

If multimodality is the path that we all are walking on, then it’s safe to assume — the AGI is near.

In a world of rapid innovation, staying informed is crucial. Join my newsletter Creative Block and cut through the noise: A weekly newsletter with credible insights into AI, Tech, and Science. No hype, no doomerism — just well-researched analysis, thought-provoking essays, and curated news that truly matters.

Don’t miss out on staying up-to-date with real advancements. Subscribe now and be in the know! U+1F680U+1F4DA

Creative Block U+007C Aditya Anil U+007C Substack

The weekly newsletter about AI, Technology and Science that matters to you. Click to read Creative Block, by Aditya…


  1. Just waiting for the day when people say “See! AI can take jobs of Mechanics”
  2. See for yourself here
  3. GPT-4 Release is the original version of GPT-4. GPT-4V is a modified version of GPT-4 that has been trained on a large dataset of values and ethics. GPT-4V + Refusal System is GPT-4V with an additional layer of protection that can detect and reject harmful requests.
  4. However, this isn’t something new, as you could use gpt4 before as well — by either using plugins or using Bing AI Chat.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓