Inference and Quantization with Qwen1.5 LLMs on Your Computer
Author(s): Benjamin Marie

The best open LLMs?
‘QWEN is a moniker of Qianwen, which means “thousands of prompts” in Chinese’ (source) — Generated by DALL-E

Alibaba recently released the Qwen1.5 models. They are open pre-trained and chat LLMs available in six sizes, from tiny to large: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. We don’t know much about these models, but there is evidence that they perform better than Mistral 7B, Mixtral-8x7B, and Llama 2 models.

The Qwen team also collaborates with the authors of popular packages for quantization, fine-tuning, and serving LLMs. As a result, Qwen1.5 is already well supported by the main deep learning frameworks.

In this article, I first briefly present the Qwen1.5 models and comment on their performance. Then, I demonstrate how to use them. We will see that Qwen1.5 can be challenging to use on consumer hardware. I also show how to quantize the models with AWQ and GPTQ.

I use Qwen1.5 7B for the examples, but they work the same way for the other sizes. Only the 72B versions can’t be fine-tuned on consumer hardware. For the other sizes, a GPU with 24 GB of VRAM is enough.
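To put the 24 GB figure in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy at 16-bit and 4-bit precision. It rests on several assumptions: the parameter counts are the nominal ones from the model names (the real counts differ somewhat), 1 GB is taken as 1e9 bytes, and activations, the KV cache, optimizer states, and the extra metadata stored by quantization formats are all ignored. The function and variable names are made up for this illustration.

```python
# Nominal parameter counts in billions, read off the model names (an approximation).
QWEN15_SIZES_B = {
    "0.5B": 0.5,
    "1.8B": 1.8,
    "4B": 4.0,
    "7B": 7.0,
    "14B": 14.0,
    "72B": 72.0,
}


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


def weights_fit(params_billion: float, bits_per_param: int, vram_gb: float = 24.0) -> bool:
    """True if the weights alone are smaller than the given VRAM budget."""
    return weight_memory_gb(params_billion, bits_per_param) < vram_gb


if __name__ == "__main__":
    for name, size in QWEN15_SIZES_B.items():
        fp16 = weight_memory_gb(size, 16)
        int4 = weight_memory_gb(size, 4)
        print(f"{name:>5}: 16-bit ~{fp16:6.1f} GB | 4-bit ~{int4:5.1f} GB")
```

Under these assumptions, the 16-bit weights of the 14B model are already larger than 24 GB, while their 4-bit counterpart is well below it. The 72B model exceeds 24 GB even at 4 bits per parameter. Real memory use is higher than these estimates, so treat them as lower bounds.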

The Qwen1.5 models are available in this Hugging Face collection:


The models are released under the Tongyi Qianwen license. It allows commercial use… Read the full blog for free on Medium.

