Why Do Chinese LLMs Switch to Chinese in Complex Interactions?

Last Updated on December 14, 2024 by Editorial Team

Author(s): Barhoumi Mosbeh

Originally published on Towards AI.

When I was a kid, my parents always encouraged me to learn other languages and even aim to speak 3 or 4 fluently. They especially emphasized learning English because most of the best resources on the internet are written in English. I often heard the same advice from my teachers: β€œIf you know two languages, you have two brains.” That’s 100% true! My native language is Arabic, but when it comes to writing something like this, I find it almost impossible to do it in Arabic. I learned AI and other technical skills in English, so it’s much easier for me to express myself in it (and it’s easier on the keyboard too, xD).

Recently, I’ve been using DeepSeek, one of the most advanced AI models out there, and I have to say, it’s been a game-changer. It’s helped me solve some really tough technical problems with its deep thinking option. However, I’ve noticed something strange: sometimes, for particularly hard tasks, it switches to Chinese for no apparent reason. It’s odd to ask a question in English and get the answer, or part of it, back in Chinese. This got me thinking: why does this happen?

Chinese LLMs are designed to understand and generate text in multiple languages, including English and Chinese. The decision to switch languages mid-conversation could be influenced by several factors, including the model’s training data, the context of the conversation, and the specific instructions or prompts provided by the user.

Why?

I’ve seen a lot of posts on Reddit and LinkedIn claiming that these models switch to their β€œmother language” the way humans do when they have to think through β€œhard” cases. Take me, for example: in a deep conversation about the US elections and their results, I might not be able to go as deep as a native speaker, because I don’t have that much political knowledge in English. But I could talk with you about the situation in the Middle East for ten hours straight, because most of my knowledge there was built in Arabic; that’s the language I read and hear the news in.

Can this be the case with Chinese LLMs? Short answer: no! I don’t believe that’s what is happening with these models. It’s more likely the result of issues such as training data imbalance or how the reinforcement learning (RL) stage is done.

Training Data Imbalance

One of the primary reasons for this language shift is the composition of the training data. These models are often trained on vast amounts of Chinese text, which makes them more proficient and confident in Chinese. When faced with intricate queries, the model might revert to Chinese because it has a richer understanding and more comprehensive data in this language. This is somewhat analogous to how humans might switch to their mother tongue when discussing complex topics, as it offers a more nuanced and precise means of expression.
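
A rough way to see how skewed a corpus is toward one language is to sample documents and run a language detector over them. The sketch below is only an illustration, assuming a hypothetical one-document-per-line file named corpus_sample.txt and the third-party langdetect package; it says nothing about how any particular lab actually audits its training data.

```python
# Minimal sketch: estimate the language mix of a corpus sample.
# Assumes a hypothetical file "corpus_sample.txt" with one document per line
# and the third-party `langdetect` package (pip install langdetect).
from collections import Counter

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

counts = Counter()
with open("corpus_sample.txt", encoding="utf-8") as f:
    for line in f:
        text = line.strip()
        if not text:
            continue
        try:
            counts[detect(text)] += 1  # language codes such as 'en', 'zh-cn', 'ar'
        except LangDetectException:
            counts["unknown"] += 1

total = sum(counts.values())
for lang, n in counts.most_common():
    print(f"{lang}: {n / total:.1%} of sampled documents")
```

If Chinese dominates a breakdown like this, it is not surprising that the model is more confident, and more fluent, when it drifts back into Chinese.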

Model Architecture and Reinforcement Learning

The architecture of these models plays a significant role in their language choice. If the model is designed with a bias towards Chinese, either due to the training data or the reinforcement learning from human feedback (RLHF) process, it might naturally favor Chinese responses. For instance, if the feedback used in the RLHF process is predominantly in Chinese, the model might learn to prefer Chinese outputs, even in mixed-language conversations.

Cultural and Linguistic Nuances

Chinese language and culture are deeply intertwined, offering a rich tapestry of expressions and nuances that might be better suited for certain discussions. The model might prefer Chinese for complex tasks because it can express subtleties and context more effectively in this language. This cultural and linguistic richness can make Chinese a more efficient medium for intricate conversations.

Technical Considerations: Tokenization and Efficiency

From a technical standpoint, the model’s internal processing, including tokenization and language modeling, might be more efficient in Chinese for certain tasks. If processing Chinese text requires fewer computational resources or offers more efficient representations, the model might default to Chinese for resource-intensive tasks.

The potential advantage of β€œthinking” in Chinese is that a single token can carry more information than a token in most Western languages. In Chinese, a single character can represent an entire word or concept, whereas in languages like English, multiple characters (letters) are usually needed to form a single word. This means that Chinese text can often be tokenized into fewer units, which can reduce the computational load and improve processing speed.

For example, the phrase β€œδΈ­εŽδΊΊζ°‘ε…±ε’Œε›½β€ (People’s Republic of China) is just seven characters, while its English translation, β€œPeople’s Republic of China,” is 26 characters and will usually be split into more tokens. This efficiency in tokenization can lead to better performance, especially in tasks that require deep understanding and long-range dependencies.
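
If you want to check this yourself, counting tokens is straightforward. The sketch below uses OpenAI’s tiktoken encoder as an easily available stand-in; DeepSeek uses its own tokenizer, and the ratio depends heavily on how much Chinese text the tokenizer was trained on, so an English-heavy encoder such as cl100k_base may show little or no advantage for Chinese.

```python
# Minimal sketch: compare token counts for the same content in two languages.
# tiktoken's cl100k_base is used as a stand-in; DeepSeek's own tokenizer,
# trained on far more Chinese text, will give different (likely more
# Chinese-friendly) numbers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("δΈ­εŽδΊΊζ°‘ε…±ε’Œε›½", "People's Republic of China"),
    ("人ε·₯ζ™Ίθƒ½ζ­£εœ¨ζ”Ήε˜δΈ–η•Œ", "Artificial intelligence is changing the world"),
]

for zh, en in pairs:
    print(f"{zh!r}: {len(enc.encode(zh))} tokens | "
          f"{en!r}: {len(enc.encode(en))} tokens")
```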

I would not be surprised if it turns out that having the chain of thought in Chinese improves overall performance and makes better use of the context length. Being able to convey complex ideas with fewer tokens lets the model maintain a longer and more coherent context, which is crucial for tasks that require reasoning and deep understanding.

User Interaction and Prompt Influence

User interaction patterns can also influence the model’s language choice. If the majority of users interacting with the model are Chinese-speaking, the model might naturally gravitate towards Chinese responses. Additionally, the structure and content of prompts can guide the language output. Prompts containing Chinese keywords or cultural references might trigger a Chinese response, as the model seeks to provide contextually appropriate answers.
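
If what you mainly want is to keep the answer in one language, the simplest lever is the prompt itself. Below is a minimal sketch using DeepSeek’s OpenAI-compatible chat API; the base URL and the β€œdeepseek-chat” model name are assumptions on my part, so check the provider’s current documentation before relying on them.

```python
# Minimal sketch: pin the output language from the prompt side.
# The endpoint URL and model name below are assumptions; verify them
# against DeepSeek's documentation.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        # An explicit language constraint usually keeps the visible answer
        # (though not necessarily any hidden reasoning) in English.
        {"role": "system", "content": "You are a helpful assistant. Always answer in English, even for difficult questions."},
        {"role": "user", "content": "Explain the trade-offs of mixture-of-experts architectures."},
    ],
)

print(response.choices[0].message.content)
```

This kind of explicit instruction tends to steer the final answer, even when the model’s internal β€œthinking” still drifts.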

Conclusion

In summary, the tendency of Chinese LLMs to switch to Chinese during complex interactions is a multifaceted phenomenon. It is influenced by the composition of their training data, architectural design, user interaction patterns, cultural nuances, and technical efficiencies. Understanding these factors provides valuable insights into the behavior of these models and highlights the importance of considering language biases and cultural contexts in the development of AI systems.
