PowerInfer: 11x Speed up LLaMA II Inference On a Local GPU
Last Updated on December 21, 2023 by Editorial Team

Author(s): Dr. Mandar Karhade, MD. PhD.

Originally published on Towards AI.

Some neurons are HOT! Some are cold! A clever way of using a GPU-CPU hybrid interface to achieve impressive speeds!

In the last article, we saw that a clever compiler, quantization, speculative decoding, and tensor parallelism, as implemented in PyTorch 2, can lead to a significant boost in inference performance.

There are many ways to run inference locally; here is a quick overview.

Today we will discuss PowerInfer, another clever approach that distributes the workload between the CPU and the GPU to speed up most local inference workloads.

The key underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity.
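To make the hot/cold idea concrete, here is a minimal sketch in plain Python. It is a toy simulation, not PowerInfer's actual code or placement policy: the function names (`simulate_activation_counts`, `split_hot_cold`, `gpu_coverage`), the Zipf-style exponent `alpha`, and the `gpu_budget` parameter are all hypothetical choices made for illustration. The sketch draws neuron activations from a power-law distribution, pins the most frequently activated neurons to the "GPU", and measures what share of all activations that small hot set serves.

```python
import random


def simulate_activation_counts(num_neurons, num_tokens, alpha=1.2,
                               active_per_token=32, seed=0):
    """Count activation events per neuron under an assumed power-law profile.

    Neuron at rank r fires with probability proportional to 1 / (r + 1) ** alpha.
    Sampling is with replacement, so these are activation events, not
    distinct neurons per token. All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** alpha for rank in range(num_neurons)]
    counts = [0] * num_neurons
    for _ in range(num_tokens):
        fired = rng.choices(range(num_neurons), weights=weights,
                            k=active_per_token)
        for idx in fired:
            counts[idx] += 1
    return counts


def split_hot_cold(counts, gpu_budget):
    """Assign the gpu_budget most active neurons to the GPU, the rest to the CPU."""
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    hot = set(order[:gpu_budget])    # preloaded onto the GPU
    cold = set(order[gpu_budget:])   # computed on the CPU
    return hot, cold


def gpu_coverage(counts, hot):
    """Fraction of all activation events that land on GPU-resident neurons."""
    total = sum(counts)
    return sum(counts[i] for i in hot) / total if total else 0.0


counts = simulate_activation_counts(num_neurons=1000, num_tokens=500)
hot, cold = split_hot_cold(counts, gpu_budget=200)
coverage = gpu_coverage(counts, hot)
print(f"{len(hot)} hot neurons out of {len(counts)} serve "
      f"{coverage:.0%} of simulated activations")
```

Under a skewed profile like this, a small GPU-resident set handles most of the work, which is the intuition behind keeping hot neurons on the GPU and leaving cold ones to the CPU. A real system would profile an actual model rather than sample from a synthetic distribution, and would also weigh memory and bandwidth limits when deciding placement.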

Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single

