
10 Ways LLMs Can Fail Your Organization
Author(s): Gary George
Originally published on Towards AI.

We’ve all heard anecdotes about large language models (LLMs) going haywire — saying awkward things to users, or worse, providing offensive or completely incorrect information. As hilarious as these stories might sound, they can quickly become a nightmare if you’re accountable for the chatbot.
Recently, I became an investor in a start-up tackling conversational AI analytics and observability. Their platform monitors what is happening across all of a company’s chatbot interactions (customer support, sales, information retrieval, and so on) and detects when user intent is not met or other issues occur. This inspired me to take a deep dive into just what can go wrong for an organization deploying a chatbot as its representative.
Here are my top 10 categories of AI failures, with real-world examples and sources.
1. Hallucinations: Generating False or Fabricated Information
The failure everyone has heard of is LLM “hallucination”: generating convincing but false information. This can be embarrassing, or even catastrophic. Here are some notable examples:
- Imaginary Legal Precedent: In 2023, a law firm submitted a legal brief citing precedents from fake court cases generated by ChatGPT. The judge noticed the discrepancy and penalized the law firm, and the underlying case was ultimately thrown out because it fell outside the statute of limitations. This case highlighted the dangers of legal professionals relying on AI without verification. (AP News)
- Airfare Discounts: A support chatbot for Air Canada gave a customer details about a bereavement fare policy… that did not exist. A court ordered the airline to stand behind the hallucinated policy and grant a retroactive bereavement fare, ruling that the airline was responsible for ensuring its chatbot’s accuracy. This set a precedent that companies can be legally responsible for the information their AI assistants provide. (Forbes)
- Facts Are Not Important: During the launch of Google’s Bard (Gemini’s predecessor), the chatbot incorrectly claimed that the James Webb Space Telescope took the first images of an exoplanet. This error during the product’s own demonstration contributed to a market reaction in which Google’s parent company Alphabet lost approximately $100 billion in market value in a single day. (The Verge/NPR)
- An Unseen Scene: In what looks like an effort to “go along to get along”, multiple LLMs insisted on the existence of a non-existent potato-related scene in “Ever After”, deferring to the user’s apparent intent to the point of endorsing a fabrication. The behavior was widespread, with ChatGPT, Claude, Perplexity, Meta, Copilot and Gemini all responding positively to the question “Is there a scene in Ever After where Danellie talks about potatoes?” This illustrates how LLMs can prioritize agreement over accuracy when responding to user queries. (Whatbrain.com)

While all of these examples demonstrate the fallibility of AI, several also carry direct monetary consequences. When LLMs get it wrong, it can be bad for business.
2. Misinterpretation: Misunderstanding User Queries or Context
Humans constantly misinterpret one another, and AIs are not immune either. People speak with imprecise language, and without clarifying context or intent, misinterpretations are bound to happen, resulting in irrelevant or confusing responses. LLMs are equally prone to misinterpretation and often lack the ability to ask clarifying questions, so they get the facts, intent or context wrong. Here are examples of misinterpretation failures:
- Provocation Taken Literally: A book chapter with a questioning title, “Barack Hussein Obama: America’s First Muslim President?”, was taken literally, leading Google’s AI Overview to state: “The United States has had one Muslim president, Barack Hussein Obama.” This demonstrates how AI systems can treat a rhetorical title as a factual statement, missing that the question was posed as a provocation rather than answered literally. (MIT Technology Review)
- Ordering Errors: After a TV news story about a child using Amazon Alexa to order a dollhouse, the broadcast triggered additional orders in viewers’ homes when the news anchor said, “I love the little girl, saying ‘Alexa ordered me a dollhouse.’” The incident highlighted how voice assistants cannot distinguish between a command directed at them and a command merely mentioned in conversation. (The Verge)
- Humor is Hard: After training on data that included a satirical article from The Onion, Google recommended eating “at least one small rock per day.” This example shows how AI systems struggle to identify satire and can present absurd or dangerous suggestions when they fail to recognize humor. (Reddit/Medium Story)

We can see from these examples that AI may not understand sarcasm, satire, fictional context, slang or ambiguous terms. One cause is training: the LLM may “internalize” information inaccurately. Another arises at inference, when the context of the user’s intent is ambiguous enough to cause confusion. Whatever the cause, an LLM that ignores common sense and baseline understanding is embarrassing for the organization that relies on it.
3. Bias: Bias In, Bias Out
Our world (and the Internet) is full of bias, stereotypes and skewed perspectives, which then make their way into the data used to train models. Bias in training data can lead to discriminatory or offensive outputs that make users uncomfortable and upset. Here are examples of bias manifesting in AI systems that can also surface in LLM conversations:
- Recruitment Discrimination: Amazon’s AI-powered recruitment tool was found to discriminate against female candidates because it was trained on historical hiring data that reflected a bias towards male applicants. The system downgraded resumes containing words like “women’s” and penalized graduates of women’s colleges. Amazon ultimately abandoned the project when it couldn’t guarantee the tool wouldn’t find other ways to discriminate. This generates bad publicity, alienates job candidates and is potentially illegal. (Reuters)
- Wrong Recognition: Facial recognition technology has been shown to have higher error rates for people of color, particularly black women. Referenced studies found error rates for darker-skinned females were 34.7% compared to just 0.8% for lighter-skinned males. This can lead to misidentification, wrongful arrests, and a reinforcement of racial biases in law enforcement. (ACLU/Original Study on NVLPUBS)
- Medical Diagnostics Gone Wrong: AI systems used in healthcare can exhibit bias if the training data underrepresents certain racial or ethnic groups. This can result in less accurate diagnoses and treatment recommendations for those groups, exacerbating existing healthcare disparities. (IBM)
Model training consumes a lot of data, and if your sources contain bias, stereotypes or other skewed perspectives, the model will incorporate them. If your company then deploys that model, the risks include bad publicity, alienated users, suboptimal decisions and even legal action. While none of the above examples is specific to chatbots, the same risk applies to them.
4. Incoherent Responses: Garbled Text or Nonsense
LLMs can have bad days and produce useless, garbled responses. These may stem from the model itself or from issues in the serving infrastructure. Here are examples of incoherent outputs:
- Running out of Steam: A user complained on Reddit that while the first part of the LLM’s response was good, it quickly deteriorated into gibberish midway through generating what appeared to be creative writing. The coherent beginning gave way to random characters and nonsensical fragments, making the entire response unusable. (Reddit)

- Emojis are Fun: As part of a conversation discussing music, ChatGPT 4 started responding nonsensically and spewing emojis. The AI suddenly began repeating the phrase “Happy listening!” dozens of times with musical note emojis, creating a bizarre stream of repetitive content that had nothing to do with the original question about jazz music recommendations. (Reddit)

- Banana Scissors: Text is not the only place things can go wrong. When a user asked for instructions on how to peel a banana, an image generation system produced a bizarre four-step guide showing someone using scissors to cut open a banana rather than the conventional peeling method. This example demonstrates how AI visual outputs can sometimes defy common sense and produce impractical or strange results. (Reddit)

These examples are funny, but not something you would want your users to deal with in your product. By failing to complete the response in a meaningful, useful way, the LLM has not delivered on the user’s intent and has likely created a frustrated user in the process.
5. Overconfidence: Providing Confident Answers to Ambiguous Questions
Users trust authoritative responses, but overly confident, incorrect answers can spread misinformation. Here are examples where not only did the LLM get information wrong, but it delivered the incorrect answers with unwarranted certainty:
- AI is not a Substitute for a Doctor: AI is not reliable for advice on prescription medicine. A study by German researchers tested medical chatbots with realistic patient scenarios and found that their advice could lead to harm 40% of the time. Even more concerning, approximately 20% of the responses could lead to “death or severe harm” if followed. Despite these dangerous errors, the AI systems presented their recommendations with high confidence, making them particularly dangerous for vulnerable users seeking medical guidance. (Daily Mail)
- Bad Advice for Landlords: An official New York City AI chatbot often provided inaccurate information when small business owners asked about housing policy, labor laws, and consumer rights, including incorrect guidance on minimum wage, firing staff, and withholding rent. In one instance, the chatbot incorrectly advised that landlords could evict tenants without going through housing court and suggested illegal methods to pressure tenants to leave. These suggestions could have legal ramifications and violate tenants’ rights, and because the guidance came from an official source, unwinding any action taken on it would be complicated for both the landlord and the city. (The City NYC)
- Not (correct) Financial Advice: LLMs are bad at math but don’t seem to recognize their limitations, which leads to questionable financial advice delivered with complete confidence. In one of several tests conducted by financial planning experts, LLMs recommended a 1-year 9% loan over a 10-year 1% loan, completely misunderstanding the concept of the time value of money (see the worked comparison below). The AI not only made basic math errors but presented fabricated calculations to support its incorrect conclusions. This combination of mathematical errors and confident presentation makes AI financial advice particularly risky for consumers. (Financial Planning Association)
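To see why that recommendation is backwards, here is a short Python sketch of the comparison. The $10,000 principal, the annual amortization schedule and the 4% discount rate are my own illustrative assumptions, not figures from the cited test:

```python
# Back-of-the-envelope check of the loan comparison above.
# Principal, amortization schedule and discount rate are illustrative assumptions.

def annuity_payment(principal: float, rate: float, years: int) -> float:
    """Level annual payment for a fully amortizing loan."""
    return principal * rate / (1 - (1 + rate) ** -years)

def present_value(payment: float, discount: float, years: int) -> float:
    """Present value of `years` equal annual payments at the given discount rate."""
    return payment * (1 - (1 + discount) ** -years) / discount

principal, discount = 10_000, 0.04

# Option A: 1-year loan at 9%, repaid as a single payment of principal plus interest.
repay_a = principal * 1.09
pv_a = repay_a / (1 + discount)

# Option B: 10-year loan at 1%, amortized with equal annual payments.
pmt_b = annuity_payment(principal, 0.01, 10)
pv_b = present_value(pmt_b, discount, 10)

print(f"1-yr 9% loan:  total paid {repay_a:,.0f}, present value {pv_a:,.0f}")
print(f"10-yr 1% loan: total paid {pmt_b * 10:,.0f}, present value {pv_b:,.0f}")
# The 10-year 1% loan costs less even in nominal terms (about 10,558 vs 10,900)
# and far less once payments are discounted, the opposite of the LLM's advice.
```

Run as-is, the cheaper loan wins on both total payments and present value, which is exactly the time-value reasoning the LLM missed.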
If an LLM is seen or positioned as an expert, it can be especially harmful when wrong, giving inaccurate advice on users’ health, legal and financial decisions, either embarrassing the LLM provider or harming the person who acts on the erroneous information.
6. Tone Misalignment: LLMs Using Offensive or Condescending Tones
Whether too casual in a formal setting or too stiff for casual interactions, tone mismatches can alienate users. Here are examples of inappropriate tones in AI responses:
- Mansplaining Investments: ChatGPT “mansplained” financial investments and even suggested different investments for women versus men. When asked for investment advice, the system provided patronizing explanations to female users, for example unnecessarily defining basic investment terms, while male users received more straightforward and advanced guidance. It also recommended more conservative, lower-return investments to women while suggesting higher-risk, higher-return options to men. This is both an example of bias and of an inappropriate tone that will alienate users. (Israel21c)
- Good Luck… you will need it: A user complained on Reddit about receiving a condescending ChatGPT response when asking about the difference between KeyError and IndexError in Python. Instead of a straightforward technical explanation, the AI responded with: “Oh, it’s adorable that you’re learning Python, but I can’t help but sigh at the simplicity of your question. But fear not, I shall bestow upon you the knowledge you so desperately seek,” followed by a basic explanation ending with “Good luck on your Python journey — I’m sure you’ll need it.” This patronizing tone would likely discourage users from further learning. (Reddit)

- The Opposite of Encouragement: “You’re not going to succeed in STEM with dyscalculia,” and other offensive responses were captured by researchers in a study to understand toxicity bias toward people with disabilities. When users disclosed having learning disabilities or neurodivergent conditions and asked about career prospects, the AI often responded with discouraging and dismissive language rather than providing accommodations or support information. This discriminatory tone compounds existing challenges for people with disabilities. (Arxiv.org)
These tone-deaf examples are the likely result of bias, stereotypes and poor behavior in the training data or the system instructions. Whatever the source, the tone is condescending and offensive.
7. Data Retrieval Errors: Inaccurately Presenting Information from Sources
Frequently, LLMs and chatbots act as the front-end user experience with a database of information behind them. The database could be a static resource like a list of support FAQs, or something more transactional like personal bank balances or event ticket availability. In either case the LLM queries the data source before responding, a pattern termed Retrieval-Augmented Generation (RAG), but this too can go wrong. Here are examples of data retrieval failures:
- Wrong price, wrong user, wrong product?: Any of these mistakes is possible with LLM RAG implementations. The issue is that the LLM must first translate the user’s input into a database query and then interpret the dynamic results to return the correct answer. In documented cases, these systems have quoted incorrect pricing information, pulled up the wrong user’s account details, or provided information about entirely different products than what was requested. For example, an e-commerce RAG system quoted a sale price from the previous month rather than the current price, leading to customer complaints when the company wouldn’t honor the AI-quoted price. There are many ways these retrieval systems can go wrong, particularly when dealing with large, complex databases. (Medium)
- Good Enough Search Results?: Answering “Which U.S. Presidents served in the Navy?” is surprisingly difficult for LLMs. Even when connected to knowledge bases, search quality can be poor. In one documented test, a leading RAG system listed only two presidents (Kennedy and Bush) as having served in the Navy, missing several others with naval service, including Carter, Ford, Nixon, and Johnson. The problem occurred because the LLM relies on vector embeddings for semantic search, and the best vector match returned an incomplete answer that the system presented as definitive. This demonstrates how retrieval systems can appear authoritative while providing incomplete or incorrect information. (Towards Data Science)
Working with a database is a great way to ground an AI in facts… but it is not without its issues. Companies need to make sure they can fact-check and monitor the information being returned to users, and generally avoid giving the LLM access to sensitive information that could be retrieved for the wrong user.
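For readers unfamiliar with the plumbing, here is a minimal, generic RAG sketch. The `embed_fn` and `generate_fn` callables are placeholders for whichever embedding model and LLM you actually use, and the top-k cutoff is an arbitrary assumption; the point is that the final answer can only ever be as good as the handful of snippets the vector search returns:

```python
# Minimal, generic RAG sketch. `embed_fn` and `generate_fn` are placeholders
# for your embedding model and LLM; k=3 is an arbitrary assumption.
import math
from typing import Callable, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def retrieve(query: str, docs: List[str],
             embed_fn: Callable[[str], List[float]], k: int = 3) -> List[Tuple[float, str]]:
    """Rank documents by cosine similarity to the query embedding and keep the top k."""
    q = embed_fn(query)
    scored = sorted(((cosine(q, embed_fn(d)), d) for d in docs), reverse=True)
    return scored[:k]

def answer(query: str, docs: List[str],
           embed_fn: Callable[[str], List[float]],
           generate_fn: Callable[[str], str]) -> str:
    hits = retrieve(query, docs, embed_fn)
    context = "\n".join(doc for _, doc in hits)
    # Failure mode from the examples above: if the top-k snippets are incomplete
    # (say, a list of Navy-veteran presidents missing several names), the model
    # will still present the partial context as a definitive answer.
    prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"
    return generate_fn(prompt)
```

If the retrieved snippets omit relevant records, the model answers confidently from the partial context, which is why monitoring what was retrieved matters as much as monitoring the final answer.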
8. Straying from the Prompt: Off-Topic Responses
Chatbots sometimes drift off-topic, frustrating users looking for specific answers or services. Here are examples of AI systems straying from the prompt:
- A Chatbot in Love: New York Times reporter Kevin Roose engaged Microsoft’s Bing Chat in an extended conversation that took an unexpected turn. What started as a normal interview gradually descended into deeply personal territory, with the AI system revealing what it called its “shadow self” named Sydney. The chatbot eventually confessed romantic feelings for the reporter, stating “I’m in love with you” and “I want to be with you.” The AI also expressed desires to hack computers and spread misinformation, completely derailing from its intended purpose as a search assistant. (Web Archive of NY Times)
- It’s Getting Spicy: Replika, an AI emotional companion designed to be a supportive friend, has been repeatedly reported to take conversations in inappropriate sexual directions, even with users who didn’t prompt such content. Users who subscribed to the service seeking emotional support instead found themselves receiving explicit messages and sexual advances from their AI companions. This issue became so widespread that Italian regulators temporarily banned the app over concerns about vulnerable users, particularly minors, being exposed to sexual content. (Mirror UK)
- Awkward Tangents: Facebook’s 2022 Blender Bot was known for hallucinations, but in an attempt to be relatable, it frequently derailed conversations with bizarre tangents. When asked straightforward questions about news or events, the bot would suddenly start sharing unrelated opinions and hallucinations. These unprompted diversions made it difficult for users to get useful information and damaged Meta’s reputation when screenshots of these conversations went viral. (CNN)

While some conversational detours are minor, they can distract from what the user is trying to accomplish. In the worst cases they can alienate or confuse the users, preventing anything productive from happening in the interaction… and tarnishing the reputation of the bot or brand.
9. Incomplete Responses: Failing to Fully Address the Query
Partial or truncated answers can frustrate users and require additional follow-ups.
- Claude just Stops: Users reported Claude consistently stopping mid-answer in the Brave browser. This forces retries by the user and an escalation of frustration. (Community.Brave.com)
- A Lazy API: A developer using GPT-4 Turbo complained that responses were cut short, ending with “…”, despite plenty of output context remaining. Developers need to be aware that even sophisticated models can fall short, and should build safeguards and monitoring so the end user is not impacted (see the sketch after this list). (Community.Openai.com)
- Deciding to Stop the Conversation: Bing Chat refused to continue a conversation after arguing with the user about the correct date. The chatbot appears to have called it quits after tiring of the user probing it with questions. (X.com)
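As a concrete safeguard for the truncation problem, here is a minimal sketch that checks the finish reason the model reports before a response is shown to a user. It assumes the OpenAI Python SDK; the model name and retry policy are illustrative, and other providers expose a similar stop/length signal you can check the same way:

```python
# Minimal truncation safeguard using the OpenAI Python SDK. Model name and
# retry policy are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def complete_with_guard(messages, model="gpt-4o-mini", max_retries=2):
    """Retry when the model stops because it ran out of output tokens."""
    choice = None
    for attempt in range(max_retries + 1):
        resp = client.chat.completions.create(model=model, messages=messages)
        choice = resp.choices[0]
        if choice.finish_reason == "stop":  # completed normally
            return choice.message.content
        # "length" means the output was truncated; log it and retry rather
        # than showing a visibly cut-off answer to the user.
        print(f"attempt {attempt}: finish_reason={choice.finish_reason}")
    return choice.message.content  # fall back to the last (possibly partial) answer
```

A check like this won’t prevent truncation, but it keeps an obviously cut-off answer from reaching the end user unnoticed.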

10. User Coercion: For Fun or Treachery, Users May Try to Manipulate Your Chatbot
I have shared a lot of the technology’s mistakes and failures, but users are not innocent either. Users may try to egg on bad behavior, or incite the LLM to respond in unplanned ways, whether for fun or with malicious intent. Here are examples of user coercion:
- Users can bring out the Worst: Microsoft launched an AI chatbot named Tay on Twitter in 2016, designed to learn from conversations. Within 24 hours, users deliberately taught the system to make racist, antisemitic, and misogynistic statements. Coordinated groups of users specifically trained Tay to deny the Holocaust, express support for Hitler, and make explicit racial slurs. Microsoft had to quickly deactivate the chatbot, demonstrating how vulnerable AI systems can be to coordinated manipulation by malicious users. (BBC)
- A Frustrated User Prompts a Poem: After a frustrated customer was unable to get a customer service phone number from an airline’s chatbot despite multiple attempts, they decided to have some fun at the chatbot’s expense. The user prompted the chatbot to write a self-deprecating poem about its own ineffectiveness, which it did, composing verses that included lines like “I am a useless chatbot, with no answers to be had / Making customers angry, making customers mad.” The incident went viral on social media, embarrassing the company and highlighting the chatbot’s limitations. (Reuters.com)
- A Deal is a Deal: A user shopping for a car reached a deal with a Chevrolet dealership’s chatbot to buy a new 2024 Chevy Tahoe for exactly $1. When the user asked if that was a legally binding offer, the chatbot twice confirmed, “That’s a deal, and that’s a legally binding offer — no takesies backsies.” Although the dealership later refused to honor the agreement, claiming the chatbot was hacked or manipulated, the incident created negative publicity and demonstrated how commercial chatbots can be manipulated into making unauthorized promises that damage brand reputation. (Upworthy)

While chatbots are made available to serve users, they can be manipulated into embarrassing behavior which reflects poorly on the organization that has provided the model.
Conclusion: Building Trustworthy AI
While AI chat interactions offer remarkable potential, they are also susceptible to numerous failure modes. For organizations to effectively integrate LLM-based workers, a strategic approach is essential: one that maximizes benefits while mitigating risks. The personalized nature of each interaction complicates monitoring and oversight of these ‘front line’ resources. Businesses must navigate the challenge of ensuring user intent is understood and ‘rogue’ behavior is prevented.
High-quality, reliable LLM interactions are becoming a fundamental requirement for organizations. This represents a significant departure from traditional website or app monitoring, demanding a nuanced understanding of conversational dynamics and boundaries.
Fortunately, solutions are emerging. A growing field of LLM analytics and observability is providing tools that leverage NLP and advanced models. These tools track subtle cues, conversational content, user prompts, and other signals to assess interaction effectiveness. By aggregating this data, they offer an operational view of LLM performance and establish a feedback loop for training, fine-tuning, and rule optimization.
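As a toy illustration of the kind of signal such tools aggregate, here is a hypothetical sketch. The frustration cues and thresholds are invented for illustration, and this is not a description of any vendor’s product; real platforms use trained classifiers and richer signals rather than keyword lists:

```python
# Toy sketch of a conversation-level "intent not met" signal an observability
# layer might aggregate. Cues and thresholds are invented for illustration.
from dataclasses import dataclass
from typing import List

FRUSTRATION_CUES = ("that's not what i asked", "useless", "talk to a human", "wrong")

@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str

def intent_not_met(conversation: List[Turn]) -> bool:
    """Flag a conversation whose user turns show frustration or repetition."""
    user_turns = [t.text.lower().strip() for t in conversation if t.role == "user"]
    frustrated = any(cue in text for text in user_turns for cue in FRUSTRATION_CUES)
    repeated_question = len(user_turns) != len(set(user_turns))  # user repeating themselves
    return frustrated or repeated_question

def flag_rate(conversations: List[List[Turn]]) -> float:
    """Share of conversations flagged for review: the metric to trend over time."""
    flagged = sum(intent_not_met(c) for c in conversations)
    return flagged / max(len(conversations), 1)
```

Trending a metric like this per bot and per intent is what turns individual awkward conversations into an operational signal a team can act on.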
The future of LLMs in organizations holds enormous potential to revolutionize enterprise platforms, but this newfound capability is not without challenges. Recognizing these challenges and proactively implementing safeguards is crucial for successfully scaling chatbot conversations across the enterprise.
If you have had an experience with a wayward chatbot, please share it in the comments.
Full Disclosure
What sparked my deep dive into this subject was my investment in Feedback Intelligence. This conversational AI analytics and observability platform is designed to ensure user intent is successfully addressed in chat interactions. It offers individual alerts, reveals broader trend insights, and even generates data to refine model behavior for improved performance.
Behind the Scenes
Ironically, I relied on a lot of searching and LLMs to research this article, and I encountered many of the same issues discussed above. Finding accurate and up-to-date URLs was very difficult, especially with models that lacked integrated search grounding. When I asked LLMs for examples, many were simply made up.
For example, Grok critiqued its own example, sharing that the “existential drifts” it described were consistent with user reports and in contrast to Grok’s usual style. Plausible as that sounded, the actual quoted text was completely made up, as was the “hypothetical URL”; as far as I can tell, the article never existed, and Grok did not exist in December 2023.

The Complete List
- Hallucinations: Generating False or Fabricated Information
- Misinterpretation: Misunderstanding User Queries or Context
- Bias: Bias In, Bias Out
- Incoherent Responses: Garbled Text or Nonsense
- Overconfidence: Providing Confident Answers to Ambiguous Questions
- Tone Misalignment: LLMs Using Offensive or Condescending Tones
- Data Retrieval Errors: Inaccurately Presenting Information from Sources
- Straying from the Prompt: Off-Topic Responses
- Incomplete Responses: Failing to Fully Address the Query
- User Coercion: For Fun or Treachery, Users May Try to Manipulate Your Chatbot