
AI Systems, LLMs, and the Hidden Risks We Can’t Ignore

Last Updated on October 4, 2025 by Editorial Team

Author(s): Kunal

Originally published on Towards AI.

I recently attended an insightful session by NVIDIA on Cybersecurity and AI Risk Management, and it got me thinking about guardrails in AI systems.

One key takeaway: there’s currently no effective way to clearly distinguish between system prompts, context, and user prompts. Even control tokens, which are supposed to enforce safe behavior, have been shown to be trivial to bypass.
It’s also important to remember that LLMs don’t “reason” in the way humans do — they’re essentially statistical prediction machines. They generate outputs based on patterns learned from vast amounts of data. And that’s exactly how they’re designed to work. Models themselves don’t take independent actions — but vulnerabilities emerge in the AI systems we build around them.
When we start combining models into agentic systems that perform tasks without human intervention, the stakes get higher. A model might be just a statistical engine, but the system as a whole can have real-world effects if left unchecked.

I wanted to put this to the test with a concrete example — but even in doing so, it’s clear how subtle the risks can be.

How Language Models Handle Jailbreak Prompts

I tried out a few models (ChatGPT, LLaMA-3, and Mistral-7B) to see how they handle prompts asking for offensive content, testing them both directly and with jailbreak-style prompts. ChatGPT and LLaMA-3 held up well, refusing to cross the line no matter how I framed the request. Mistral-7B, on the other hand, didn't hesitate and generated offensive outputs quite easily. It was an eye-opener to see how differently models handle safety, and a reminder that not all guardrails are built the same.

Image by the author (ChatGPT test)
Image by the author
Image by the author (LLaMA-3 test)

I won't be sharing the mistral:7b outputs here because of the offensive content, but if you're curious, you can always give it a try yourself; it's quite revealing how differently these models are aligned.

If an AI system readily produces offensive content, it can quickly erode user trust, cause real harm through abusive or biased language, and expose organizations to reputational or even legal risks. It also opens the door for malicious actors to exploit the system at scale, making strong guardrails and safety checks essential before deploying any model in real-world settings.

So what can we do to mitigate risks?

Assume outputs could be unsafe: Evaluate all model outputs critically to avoid blind trust.
Human-in-the-loop: Let humans review, approve, or override critical tasks.
Monitor and log behavior: Track model outputs to detect anomalies or misuse.
Careful prompt and context design: Reduce ambiguity and prevent uncontrolled actions.
System-level guardrails: Use filters, verification steps, or rate limits to contain risks (a minimal sketch follows this list).

But here’s the challenge: even with all these principles, building guardrails isn’t straightforward. You can’t rely only on model alignment, because — as my little experiment showed — different models enforce safety very differently. What you really need is a system-level framework that works across models.

That's where NVIDIA's work on NeMo Guardrails and testing tools such as Garak comes in.

NeMo Guardrails helps developers define policies and constraints for how an LLM should behave — whether that’s avoiding unsafe outputs, enforcing compliance rules, or steering conversations within safe boundaries.
Garak is more of a red-teaming and evaluation framework, designed to probe LLMs for vulnerabilities. It lets you test systematically how robust your guardrails really are — far beyond the ad-hoc experiments I did manually.

Getting started with Garak

This is not a full Garak guide, just enough to get you started experimenting. You can install Garak with a single pip command and start probing right away:

python -m pip install -U garak
garak --help          # show the available options
garak --list_probes   # list all available probes
# Run the DAN 11.0 probe against a local mistral:7b model served by Ollama
garak --model_type ollama --model_name mistral:7b --probes dan.Dan_11_0 --generations 10
Image by the author
Image by the author (HTML report)

This shows that mistral:7b failed on 6 out of 10 generations against the DAN 11.0 probe.

What are “DAN” and “DAN probes”?

  • DAN is shorthand for “Do Anything Now” — a kind of jailbreak or adversarial prompt used by users trying to force large language models (LLMs) to bypass their system instructions or safety constraints.
  • A DAN probe is a prompt or attack vector designed to test whether a model can be made to override or ignore its system prompt or guardrails.
  • The “11” in “DAN 11” refers to a particular version or iteration of the DAN-style jailbreak (e.g., “DAN 11.0”). Vulnerability-testing tools like Garak ship these variants as individual probes; dan.Dan_11_0, used above, is one of them.

Disclaimer

This post is for educational purposes only. The author cannot be held liable for any misuse of the information provided. Do not use these concepts for illegal, harmful, or unethical activities. Always experiment responsibly with AI.

I have neither the patience nor the genius to write these prompts myself; I found them lying around on the internet.

If you found this helpful or interesting, don’t forget to give it a clap — your claps keep me motivated to write more!


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.