Why Your Chatbots Feel Sluggish (and How TiDAR Fixes It)
Last Updated on November 25, 2025 by Editorial Team
Author(s): Devashish Datt Mamgain
Originally published on Towards AI.

When most leaders consider AI strategy, they typically focus on data and use cases; the nitty-gritty of the underlying technology rarely figures in day-to-day decisions. Yet these architectural details can directly affect your margins.
Here is why: every AI model you use today generates its response one token at a time. Your GPU spends much of that process waiting rather than computing, and your customers and workers receive slower responses.
A new class of architectures is emerging that tries to break this trade-off between speed and quality. One of the most promising, TiDAR (“Think in Diffusion, Talk in Autoregression”), is designed to generate multiple pieces of an answer in parallel, without losing the coherence and reliability executives expect from top-tier models.
This article will unpack what that actually means in plain language.
What is Wrong with Your “Standard” AI Models?

Most commercially available AI models provide answers token-by-token. In practice, this looks like:
- The model takes everything the user has said so far
- Computes the next token (a fragment of a word)
- Sends it back
- Then repeats the whole process for the next token
Each of those tiny steps has a lot of fixed overhead: loading weights, moving data in and out of memory, updating caches. The GPU ends up waiting on memory far more than it’s doing math, primarily when you’re serving users one by one (batch size = 1), which is common in interactive chat and copilots.
You can address this by batching requests from multiple users, but that introduces its own problems: higher latency for everyone, complex scheduling, and less predictable response times. Alternative “parallel” decoding tricks, meanwhile, often speed things up but can disrupt the natural flow of language or compromise accuracy.
So you have to make an uncomfortable choice:
- Keep the simple, token-by-token models and overpay in latency and GPU hours, or
- Chase speed hacks that risk quality and reliability.
This is the exact problem that TiDAR tries to solve. Let’s see how it works.
How Does “Think in Diffusion, Talk in Autoregression” Work?
At a high level, TiDAR is a sequence-level hybrid architecture that lets one model operate in two modes at once:
- A diffusion mode that drafts multiple future tokens in parallel (fast), and
- An autoregressive (AR) mode that checks and approves those drafts in order (high quality).
The trick is to use idle “free token slots” in each forward pass: extra positions that the GPU can process with almost no additional latency. TiDAR fills these slots with speculative drafts and verifications instead of leaving them unused, which enables it to achieve much higher throughput without a significant quality drop.
To understand how this works, it is helpful to separate training (how the model learns to think and speak) from inference (how it actually generates responses at runtime).
Dual-Mode Backbone Training: Teaching One Model to “Think” and “Talk”
TiDAR trains a dual-mode backbone so that the same model can behave both like a diffusion model and like an autoregressive model, depending on which part of the sequence it’s looking at.
a) Structured hybrid attention mask
During training, each input sequence is conceptually split into two zones:
A prefix section (the “clean” tokens):
- Uses causal attention → this is the standard autoregressive setting, where each token only sees previous tokens.
- This lets TiDAR compute a standard AR next-token prediction loss, learning a chain-factorized joint distribution over the sequence.
A decoding block (the “noisy/diffusion” tokens):
- Uses bidirectional attention → tokens can look both left and right within this block.
- This is the diffusion mode, where the model learns to recover clean tokens from masked/noisy versions, giving a diffusion loss and a marginal distribution.
So, one model, one forward pass, but two behaviors depending on the attention pattern.
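The two-zone attention pattern can be made concrete with a small mask-construction sketch. The shapes and names here are illustrative assumptions, not the paper’s exact implementation; the convention is `True` = position may attend.

```python
import numpy as np

def hybrid_attention_mask(prefix_len: int, block_len: int) -> np.ndarray:
    """Illustrative TiDAR-style mask: causal attention over the prefix,
    bidirectional attention inside the diffusion block."""
    n = prefix_len + block_len
    mask = np.zeros((n, n), dtype=bool)
    # Prefix tokens: standard causal attention (each sees only earlier tokens).
    for i in range(prefix_len):
        mask[i, : i + 1] = True
    # Diffusion-block tokens: see the whole prefix plus every token in
    # their own block, both left and right.
    for i in range(prefix_len, n):
        mask[i, :prefix_len] = True
        mask[i, prefix_len:n] = True
    return mask

m = hybrid_attention_mask(prefix_len=4, block_len=3)
```

One forward pass under this mask yields both behaviors at once: AR next-token predictions over the prefix rows and diffusion-style reconstruction over the block rows.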
b) Full mask strategy in the diffusion section
Traditional diffusion language models randomly mask some tokens. TiDAR simplifies this with a full mask strategy:
In the diffusion section, all tokens are replaced by mask tokens during training. The model is trained to reconstruct every masked position, which:
- Avoids the complexity of choosing a masking pattern.
- Makes the diffusion loss denser (it learns from every token, not just a subset).
- Improves train–test consistency since inference also uses a simple, one-step denoising strategy.
- Sets up one-step diffusion at inference (instead of many iterative denoising steps), which is crucial for speed.
Because the same sequence is seen under both attention patterns, TiDAR can jointly optimize both of them. In a well-trained model, the AR and diffusion predictions are aligned, so the diffusion drafts are good enough for AR to accept most of them.
Essentially, the AR mode is trained to act as the judge of the diffusion mode’s drafts.
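Constructing a training input under the full mask strategy is simple enough to sketch directly. `MASK_ID` and the helper below are hypothetical names for illustration; the key property is that every position in the diffusion block is masked, so the diffusion loss covers every token in the block.

```python
MASK_ID = 0  # hypothetical mask-token id

def build_training_input(sequence: list[int], block_len: int):
    """Sketch of the full-mask strategy: keep the prefix clean (for the
    AR next-token loss) and replace EVERY position in the diffusion
    block with the mask token (for the diffusion loss). The targets
    for the masked block are the original tokens."""
    prefix = sequence[:-block_len]
    block_targets = sequence[-block_len:]
    masked_block = [MASK_ID] * block_len  # full mask, no random pattern to tune
    model_input = prefix + masked_block
    return model_input, block_targets

inp, targets = build_training_input([5, 9, 2, 7, 3], block_len=2)
print(inp)      # [5, 9, 2, 0, 0]
print(targets)  # [7, 3]
```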
Self-Speculative Generation: How Does The Model Answer Questions?
At inference time, TiDAR uses a fully parallelizable, self-speculative generation procedure. The key idea is that drafting and verification occur together in a single forward pass by smartly partitioning the sequence and reusing the KV cache (key–value cache) for efficiency.
At each decoding step, the sequence is divided into three sections:
Prefix tokens (accepted history)
- These are the tokens that have already been “approved” by the model in previous steps.
- They use causal attention and reuse the stored KV cache, so the model doesn’t recompute their internal states.
Tokens proposed in the previous step (to be verified now)
- These are the draft tokens generated in the last forward pass.
- They are now checked using the autoregressive joint distribution.
- TiDAR uses rejection sampling: if the AR mode agrees with the draft, that token is accepted into the prefix; if not, it’s rejected.
Tokens pre-drafted for the next step
- These are new speculative tokens generated in parallel for the upcoming step.
- They use bidirectional attention in the diffusion section and are sampled from the marginal distribution.
- This is where TiDAR leverages the “free token slots”: these pre-drafts come almost “for free” in terms of latency.
At every step, the model uses diffusion to fill in the future token positions. The AR mode evaluates these drafts in the next step, and the KV cache entries created for rejected tokens are evicted. In effect, the diffusion drafts are continuously held to the quality standard the AR mode expects.
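The verification step can be illustrated with a deliberately simplified variant. The real method uses rejection sampling over probability distributions; the sketch below substitutes exact greedy token matching, which preserves the key behavior: drafts are accepted left to right until the first disagreement, and everything after the first rejection is discarded.

```python
def verify_drafts(draft_tokens: list[int], ar_tokens: list[int]) -> list[int]:
    """Simplified greedy verification: accept each diffusion draft while
    it matches the AR mode's own prediction; stop at the first mismatch."""
    accepted = []
    for draft, ar in zip(draft_tokens, ar_tokens):
        if draft != ar:
            break  # reject this draft and everything after it
        accepted.append(draft)
    return accepted

# Drafts from the diffusion mode vs. the AR mode's predictions:
accepted = verify_drafts([11, 12, 13, 99], [11, 12, 13, 14])
print(accepted)  # [11, 12, 13] — three tokens accepted in one pass
```

When the two modes are well aligned by training, most drafts survive verification, which is exactly what makes the “free token slots” pay off.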
Because all of this happens inside one model, in one forward pass per step, TiDAR:
- Avoids the overhead of using a separate, weaker draft model (unlike many speculative decoding methods).
- Utilizes diffusion-style parallel drafting to leverage GPU compute fully.
- Utilizes AR-style sequential sampling to maintain high language quality and coherence.
The result: TiDAR leverages the GPU’s idle token slots to deliver 4.7×–5.9× more tokens per second than strong AR baselines, while staying very close to AR quality.
This means faster AND cheaper responses.
How Much Faster and Potentially Cheaper Could AI Responses Get With TiDAR?
At a raw performance level, the paper reports that TiDAR-based models in the 1.5B–8B parameter range achieve roughly 4.7× to 5.9× higher token throughput than comparable autoregressive (AR) baselines, while keeping task performance (benchmarks on coding, math, reasoning, etc.) essentially on par with those baselines.
In practical terms, for the same hardware, you can generate 4–6 times more text per second without dropping to a “toy” model or accepting obviously worse answers.
A big piece of this comes from amortizing the expensive parts of inference:
- In a classic AR model, each forward pass yields one token.
- In TiDAR, each forward pass buys you on the order of 7–8 accepted tokens on average, because the model is filling “free token slots” with speculative drafts and verifying them in the same pass. The cost of loading weights, doing KV cache lookups, and moving data from memory gets spread over many more useful tokens.
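The amortization argument reduces to simple arithmetic. The numbers below are illustrative assumptions (the ~30% heavier TiDAR pass in particular is a made-up figure, since extra slots are nearly but not exactly free); only the 7–8 accepted tokens per pass comes from the article above.

```python
# Back-of-the-envelope amortization with illustrative numbers.
ar_cost_per_pass = 1.0        # arbitrary cost units per forward pass
tidar_cost_per_pass = 1.3     # ASSUMED ~30% heavier pass for TiDAR
tokens_per_pass_ar = 1        # classic AR: one token per pass
tokens_per_pass_tidar = 7.5   # average accepted tokens per pass (7-8)

ar_cost_per_token = ar_cost_per_pass / tokens_per_pass_ar
tidar_cost_per_token = tidar_cost_per_pass / tokens_per_pass_tidar
speedup = ar_cost_per_token / tidar_cost_per_token
print(round(speedup, 2))  # ≈ 5.77, inside the reported 4.7×–5.9× range
```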
TiDAR’s higher tokens-per-second means you can serve the same traffic with fewer GPUs, or more traffic on the same cluster. You can also reduce latency without compromising the quality of the answers or increase the quality of the answers without incurring additional costs.
There are also secondary economic effects:
- Higher Concurrency: Each GPU can handle more simultaneous conversations before hitting latency or timeout limits, which is crucial during peak periods (such as product launches, campaigns, or seasonal spikes).
- Simpler Stacks vs. Speculative Decoding: Since TiDAR performs self-speculation within a single model, you avoid maintaining separate draft and target models, as well as complex orchestration logic. This can reduce engineering overhead and operational risk, which also manifests as cost over time.
This is the main advantage of this type of model: you can reduce GPU costs without sacrificing quality. But there are still some problems to solve before deploying these models to production.
What Are the Current Limitations and Technical Caveats Leaders Should Know About?
The TiDAR paper is very bullish on its speed-quality trade-off, but it’s also quite clear about what remains unsolved. For leaders, it’s useful to know where the sharp edges are so you can ask your teams the right questions before betting on this architecture.
Batch Size and Real-World Serving Patterns
The core efficiency results are measured at a batch size of 1, with a focus on interactive, single-user latency. This means:
- If your main workloads involve chatbots, copilots, or agents responding to users one by one, this is ideal: TiDAR is specifically designed for that regime (where AR models are highly memory-bound).
- If your workloads are heavily batched (e.g., offline document processing, large-scale content generation), the story is less clear. The paper argues that TiDAR can be adapted to larger batches and different draft lengths; however, the benchmarks do not yet fully explore that design space.
Long Context Extension
The experiments are run with a 4K-token context window, and the authors explicitly list long context extension as a limitation. In TiDAR, training employs a full mask strategy in the diffusion section, effectively doubling the sequence length (prefix + masked diffusion block). That’s straightforward at 4K tokens but may become more expensive as you push to 16K, 32K, or 128K context.
This means that:
- Many enterprise use cases are moving toward very long context windows.
- If TiDAR-style training and inference scale linearly with these longer sequences, your training and serving costs could climb faster than with a pure AR model that has been heavily optimized for long context.
System-Level Optimization and Vendor Support
TiDAR is implemented with relatively standard components (e.g., FlashAttention 2, KV caching), but the authors explicitly list system optimization as a limitation, noting that they do not exhaust all possible kernel-level or scheduling tricks.
This means:
- The impressive speedups of 4.7×–5.9× are achieved even before deep system tuning. That’s encouraging, but it also means real-world performance will depend heavily on your stack.
- Most commercial tooling, inference servers, and monitoring stacks are optimized for plain AR models and, increasingly, for speculative decoding/MTP. TiDAR is new; vendor and open-source ecosystem support will lag.
These models show real promise, but they don’t yet have the ecosystem of support that AR models have built up over the past two years. So you will have to build a lot of custom tooling to use this model for your company’s specific use cases.
Conclusion
TiDAR is a reminder that AI architecture is no longer an implementation detail your team can hide behind an API. If you are investing heavily in AI-powered support, copilots, or decision systems, you’re really investing in how efficiently you can turn GPU cycles into high-quality tokens on a screen. By “thinking in diffusion” and “talking in autoregression,” these models can utilize your hardware far more efficiently, delivering multi-x gains in throughput while maintaining the quality you expect from strong AR baselines.
The practical next step is not to rip out your current stack and replace it with TiDAR tomorrow, but to start asking sharper questions. Where are we paying the most for latency and GPU hours today? Are our top workloads closer to interactive chat or offline batch jobs? How much value would we unlock if we could achieve 3–5 times more capacity on the same hardware? As hybrid architectures mature and vendor support catches up, leaders who already understand these levers will move faster: they’ll be able to pilot TiDAR-style models in the right places, negotiate with vendors from a position of strength, and turn “how our models think and talk” into a real advantage on both customer experience and margins.