Jais: A Major Leap Forward in Arabic-English Large Language Models

Last Updated on September 1, 2023 by Editorial Team

Source: MBZUAI

A groundbreaking collaborative effort by Inception, MBZUAI, and Cerebras

New York, NY, August 30, 2023: Inception (a G42 company), Mohamed bin Zayed University of Artificial Intelligence (MBZUAI, the world’s first graduate-level AI research university), and Cerebras have jointly developed Jais, a 13-billion-parameter generative pre-trained transformer (GPT) model specialized in Arabic and English language processing tasks. The model was trained on Condor Galaxy 1 (CG-1), a high-capacity AI supercomputer co-developed by G42 and Cerebras that offers multi-exaFLOP computational capabilities.

This development has practical implications for G42’s ongoing partnership with Cerebras on Condor Galaxy. The model will be accessible through a dedicated live chat interface and is also slated for inclusion in the Hugging Face model repository. Jais aims to serve the more than 400 million Arabic speakers worldwide, addressing a gap in the availability of advanced language models for this population.

Source: Cerebras, G42, MBZUAI

Why an Arabic Large Language Model (LLM)?

The development of Jais addresses a longstanding gap in the field of AI by focusing on Arabic, a language spoken by over 400 million people in 25 countries. While many companies discuss the concept of “democratizing AI,” the Jais initiative moves beyond rhetoric by providing a substantive, data-driven solution for the Arabic-speaking world. By open-sourcing the model under an Apache 2.0 license, the developers aim to catalyze the growth of an Arabic language AI ecosystem. The Jais project is expected to serve as a model for other languages that are underrepresented in AI, setting a new standard for linguistic inclusivity.

Source: Cerebras, G42

Use Cases:

  • Government Ministries: Deployments have already been announced by the UAE Ministry of Foreign Affairs and the UAE Ministry of Industry and Advanced Technology.
  • Healthcare: The Department of Health — Abu Dhabi plans to use Jais for a range of applications, potentially including data analysis and patient interactions.
  • Energy Sector: Abu Dhabi National Oil Company (ADNOC) has committed to implementing Jais in their operations, where it could be used for tasks ranging from predictive maintenance to data analytics.
  • Aviation: Etihad Airways plans to deploy Jais for various applications, possibly including customer service and logistical optimizations.
  • Financial Services: Jais has potential applications in automating customer inquiries, risk assessment, and data analysis in the banking and insurance sectors.
  • Environmental Analysis: Jais can be used to analyze large sets of environmental data, helping to predict trends and identify areas requiring intervention.
  • Education: Educational programs can employ Jais to develop intelligent tutoring systems, automated grading, and even interactive, language-based educational games.
  • Natural Language Interfaces: Jais can be a key component in building more intuitive and responsive voice-activated or text-based interfaces for a range of software applications.
  • Customer Service: Chatbots powered by Jais can handle customer queries with higher accuracy and context awareness, improving user experience.

These varied use cases underscore Jais’ flexibility and adaptability, making it a robust solution for a wide range of applications in both the public and private sectors.

Source: Cerebras, G42, MBZUAI

Performance Metrics

Jais sets new performance standards in Arabic language tasks, surpassing all known open-source monolingual and multilingual models. While specific metrics will be released post-launch, preliminary evaluations indicate leading scores in areas such as text summarization, translation, and sentiment analysis. On English language tasks, Jais remains competitive, scoring within the 95th percentile when compared with existing models such as LLaMA 2, despite operating on 30% fewer English language tokens.

Source: Cerebras, G42, MBZUAI 

Technical Specifications

Jais employs a novel bilingual vocabulary that decreases the average number of tokens per word by approximately 15%, improving both computational efficiency and latency. The architecture also integrates techniques used in other recent models, including ALiBi positional encodings and SwiGLU activation functions (the latter is also used in LLaMA).
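
For readers unfamiliar with the two techniques named above, the following is a minimal, dependency-free sketch of how ALiBi biases and a SwiGLU feed-forward block are commonly defined. It is not taken from the Jais codebase: the function names (`alibi_slopes`, `alibi_bias`, `swiglu_ffn`) are hypothetical, the slope formula follows the usual power-of-two-heads convention, and bias terms and batching are omitted for brevity.

```python
import math


def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head ALiBi slopes for a power-of-two head count.

    Head h (1-indexed) gets slope 2 ** (-8 * h / num_heads).
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal ALiBi bias that is added to attention scores before softmax.

    Entry [i][j] is slope * (j - i) for keys at or before query i, so more
    distant keys receive a larger penalty; future keys are masked with -inf.
    """
    return [
        [slope * (j - i) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def silu(z: float) -> float:
    """SiLU (Swish with beta = 1): z * sigmoid(z), written to avoid overflow."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def matvec(x: list[float], w: list[list[float]]) -> list[float]:
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[k] for xi, row in zip(x, w)) for k in range(len(w[0]))]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: (SiLU(x W_gate) * (x W_up)) W_down."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)
```

Because ALiBi encodes position as a distance penalty on attention scores rather than as a learned embedding, no position table has to be stored, and the gated product in SwiGLU replaces the single activation of a standard feed-forward layer.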

Data Considerations

The model was trained on a diverse dataset comprising Arabic, English, and source code text, which is crucial given that high-quality Arabic data is sparse, constituting just 3% of the dataset. An innovative preprocessing pipeline was implemented to optimize data quality, utilizing heuristics-driven methods for data filtering and normalization.
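
The release does not describe the preprocessing pipeline in detail, so the sketch below only illustrates what heuristics-driven normalization and filtering of Arabic text can look like. Every rule here is an assumption chosen for illustration (removing tatweel and short-vowel diacritics, a minimum length, a minimum share of Arabic letters), and the names `normalize`, `arabic_ratio`, and `keep_document` are hypothetical; the thresholds are arbitrary and would apply only to the Arabic portion of a corpus.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # elongation character used for visual stretching
DIACRITICS = re.compile("[\u064B-\u0652]")  # short-vowel marks (harakat)
ARABIC_LETTER = re.compile("[\u0621-\u064A]")  # core Arabic letter range
WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Apply a few common (assumed) normalization heuristics to Arabic text."""
    text = unicodedata.normalize("NFKC", text)  # fold presentation forms
    text = text.replace(TATWEEL, "")            # drop elongation marks
    text = DIACRITICS.sub("", text)             # drop short-vowel marks
    return WHITESPACE.sub(" ", text).strip()    # collapse whitespace runs


def arabic_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Arabic letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_LETTER.match(c)) / len(chars)


def keep_document(text: str, min_chars: int = 20,
                  min_arabic_ratio: float = 0.5) -> bool:
    """Keep a document only if it is long enough and mostly Arabic."""
    cleaned = normalize(text)
    return len(cleaned) >= min_chars and arabic_ratio(cleaned) >= min_arabic_ratio
```

In practice such filters are tuned on samples of the corpus, since overly aggressive thresholds discard scarce high-quality Arabic text along with the noise.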

Legal and Ethical Concerns

Data privacy and intellectual property considerations are integral to Jais’ development. The model’s operational framework incorporates the guidelines and regulations concerning data privacy and complies with global intellectual property laws.

Strategic Context

The initiative aligns with the UAE’s broader goals of fostering sovereign AI capabilities without relying solely on externally developed solutions. It addresses the particular complexities of the Arabic language, such as its many dialects and distinctive writing system, while also offering avenues for future development in other Semitic languages.

Why is Jais a Significant Development?

Jais distinguishes itself through its specialized architecture, designed to better understand the nuances of the Arabic language, including its writing style and word order. This specialization results in responses that are both more accurate and contextually relevant, allowing it to outperform existing models that contain Arabic text as only a minor part of their training data.

MBZUAI President and University Professor Eric Xing said, “Developing such a high-caliber Arabic LLM demanded cutting-edge AI research in addition to an in-depth and nuanced understanding of the Arabic language, its diversity and heritage, and the growing importance of LLMs across all echelons of society. Thanks to our research and partnerships with Inception and other top regional and global organizations, MBZUAI will continue pioneering LLMs that are efficient, effective, and accurate.”

Jais also enables faster customization and easier fine-tuning for domain-specific Arabic use cases, thereby reducing both the time and cost associated with deploying AI solutions. This is particularly important in a context where data ownership and sovereignty are concerns; the project’s UAE-based origin alleviates some of these issues.

In quantitative terms, Jais sets new performance benchmarks for Arabic language tasks. Special attention to preprocessing has resulted in a model that better supports the intricacies of Arabic, marking a significant step forward for AI applications in the Arab world.

By combining technical sophistication with open-source accessibility, Jais offers an Arabic LLM that stands as the most accurate and capable of its kind, thus broadening the scope of generative AI applications across both the public and private sectors in the Arab world.

Future Research Directions

  • Benchmarking: A detailed evaluation against existing Arabic or bilingual models is planned, serving as a critical performance indicator.
  • Ethical Stewardship: Efforts are underway to ensure responsible handling of data from various sources, including ethical considerations around bias and data privacy.
  • Dialectal and Linguistic Extensions: Research is ongoing to adapt Jais for handling various Arabic dialects.
  • Code-Handling Capabilities: Jais exhibits promise in understanding code, although its proficiency in code-related tasks is under evaluation.
  • Hyperparameter Tuning: The model employs the Maximal Update Parameterization (muP) framework, whose applicability to other models is a subject for further investigation.
Source: Cerebras, G42, MBZUAI
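
As background for the muP item in the list above, the sketch below shows the core idea in a simplified form often used with Adam-style optimizers: hyperparameters are tuned on a narrow proxy model and then rescaled by the width ratio for the full-size model. The release does not say how Jais applies muP, so the specific rules, the function name `mup_scale`, and the example values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledHParams:
    hidden_lr: float          # learning rate for hidden weight matrices
    hidden_init_std: float    # init standard deviation for hidden weights
    output_multiplier: float  # factor applied to the output logits


def mup_scale(base_width: int, width: int,
              base_lr: float, base_init_std: float) -> ScaledHParams:
    """Transfer hyperparameters tuned at base_width to a model of width.

    Simplified rules (assumed here, not confirmed for Jais):
      hidden learning rate  scales with 1 / m
      hidden init std       scales with 1 / sqrt(m)
      output logits         are multiplied by 1 / m
    where m = width / base_width.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    m = width / base_width
    return ScaledHParams(
        hidden_lr=base_lr / m,
        hidden_init_std=base_init_std / m ** 0.5,
        output_multiplier=1.0 / m,
    )


# Hypothetical example: values tuned on a 256-wide proxy, applied at width 1024.
scaled = mup_scale(base_width=256, width=1024, base_lr=6e-3, base_init_std=0.08)
```

The appeal of this approach is that the expensive hyperparameter search runs only on the small proxy, which matters when a single full-size training run occupies a supercomputer.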

Jais marks a significant advance in Arabic NLP and is the result of a strategic collaboration among Inception, MBZUAI, and Cerebras. Specifically engineered to address data scarcity issues in Arabic NLP, the model leverages the multi-exaFLOP computational capabilities of Condor Galaxy 1 (CG-1). This project aligns with the strategic objectives of key stakeholders, including MBZUAI and G42. Its comprehensive evaluations across Arabic and English language tasks offer a detailed understanding of its capabilities.

Jais is available for download on Hugging Face. Users can also try Jais online by registering their interest on the Jais website and receiving an invitation to the playground environment. To learn more about Jais and how it benchmarks against other models, read the Jais white paper.


Press release distributed by Towards AI, Inc. on Wednesday, August 30, 2023.
