Fast LLM Inference on CPU: Introducing Q8-Chat
Last Updated on July 25, 2023 by Editorial Team

Author(s): Dr. Mandar Karhade, MD. PhD.

Originally published on Towards AI.

Optimization techniques that decrease LLM inference latency on Intel CPUs

Large language models (LLMs) have rapidly gained prominence in machine learning. Built on the Transformer architecture, these models can learn from massive amounts of unstructured data, including text, images, video, and audio. They perform well across a wide range of tasks, such as text classification, text summarization, and even text-to-image generation, and they have changed how researchers and developers approach language understanding and generation.

Photo by Annie Spratt on Unsplash

However, as the name suggests, LLMs are not lightweight models. In fact, they often exceed the 10-billion parameter mark, with…

