Large language models (LLMs) have come a long way since GPT-2, and users can now deploy advanced models effortlessly through user-friendly applications like LM Studio. Supported on AMD hardware, these tools make powerful AI capabilities available to anyone, with no coding or specialized knowledge required.

LM Studio is built on the llama.cpp framework, a highly popular and efficient way to deploy language models. Known for its minimal dependencies, llama.cpp can run effectively on CPUs but also supports GPU acceleration for enhanced performance. In LM Studio, llama.cpp utilizes AVX2 instructions to boost processing speed on x86-based CPUs, making it well-suited for modern consumer hardware.
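
For readers curious about what LM Studio does under the hood, here is a minimal sketch using the llama-cpp-python bindings to the same llama.cpp engine. This is illustrative only: the model path is a placeholder for any GGUF-format file, and on x86 CPUs the AVX2 kernels are selected automatically.

```python
# Minimal llama.cpp usage via the llama-cpp-python bindings
# (pip install llama-cpp-python). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-1b-instruct-q4_k_m.gguf",  # hypothetical local GGUF file
    n_ctx=2048,       # context window size
    n_gpu_layers=0,   # 0 = CPU only; AVX2 kernels are used automatically on x86
)

out = llm("Explain what a token is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```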

Performance Metrics: Throughput and Latency

AMD Ryzen AI processors give llama.cpp-based applications such as LM Studio a competitive edge on x86 laptops. Notably, LLM performance is particularly sensitive to memory speed: in one performance test, an Intel laptop with faster RAM (8533 MT/s) was compared against an AMD laptop with slightly slower RAM (7500 MT/s).

[Figure 1: throughput comparison in tokens per second]

Despite this difference, the AMD Ryzen AI 9 HX 375 processor outperformed its competitor, achieving up to 27% higher throughput, measured in tokens per second (tk/s). This metric indicates the rate at which an LLM outputs tokens, essentially the words per second displayed on screen. The Ryzen AI 9 HX 375 reached speeds of up to 50.7 tk/s with Meta’s Llama 3.2 1B Instruct model (in 4-bit quantization).
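
As a rough illustration of how throughput is measured, the sketch below times a generation and divides the completed tokens by the elapsed seconds. It reuses the hypothetical model path from the earlier sketch; actual numbers depend entirely on hardware and model.

```python
# Rough throughput (tk/s) measurement with llama-cpp-python.
import time

from llama_cpp import Llama

llm = Llama(model_path="./llama-3.2-1b-instruct-q4_k_m.gguf")  # hypothetical path

start = time.perf_counter()
out = llm("Write a short paragraph about laptops.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f} s -> {n_tokens / elapsed:.1f} tk/s")
```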

[Figure 2: time-to-first-token comparison]

Another key benchmark for LLM performance is “time to first token”: the latency between submitting a prompt and the start of token generation. On this metric, the Zen 5-based AMD Ryzen AI 9 HX 375 processor was up to 3.5 times faster than comparable processors on larger models.
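
Time to first token can be approximated in the same way by streaming the output and timing the arrival of the first chunk, as in this sketch (same hypothetical model path as above):

```python
# Approximate time-to-first-token via streaming output.
import time

from llama_cpp import Llama

llm = Llama(model_path="./llama-3.2-1b-instruct-q4_k_m.gguf")  # hypothetical path

start = time.perf_counter()
for chunk in llm("Summarize the history of x86 CPUs.", max_tokens=64, stream=True):
    print(f"time to first token: {time.perf_counter() - start:.3f} s")
    break  # only the first streamed chunk matters for this metric
```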

Boosting Model Throughput on Windows with Variable Graphics Memory (VGM)

AMD Ryzen AI processors feature three specialized accelerators tailored to distinct workloads. The XDNA 2-based NPU excels at low-power, persistent AI tasks such as Copilot+ workloads; the CPU provides broad compatibility across tools and frameworks; and the iGPU (integrated GPU) handles dynamic, on-demand AI processing.

The llama.cpp build in LM Studio targets the Vulkan API, a vendor-neutral graphics and compute interface. Enabling GPU acceleration in LM Studio delivers a significant performance boost: an average 31% increase for Meta’s Llama 3.2 1B Instruct model compared to CPU-only mode. Larger, more bandwidth-intensive models such as Mistral Nemo 2407 12B Instruct saw a 5.1% average speedup with GPU offload.
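
In llama.cpp terms, GPU offload is controlled by how many model layers are placed on the GPU; LM Studio exposes this as a slider. A sketch of the equivalent setting in the Python bindings (this requires a build of llama-cpp-python with a GPU backend such as Vulkan, and the model path is again a placeholder):

```python
# GPU offload sketch: n_gpu_layers controls how many transformer layers run on
# the GPU. Requires llama-cpp-python built with a GPU backend such as Vulkan.
from llama_cpp import Llama

llm_gpu = Llama(
    model_path="./llama-3.2-1b-instruct-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 = offload all layers to the (i)GPU
)
```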

However, with Vulkan-based GPU acceleration enabled, Intel processors performed below expectations on all but one model, so they were excluded from the GPU-offload comparison with AMD processors.

[Figure 3: GPU-offload performance with Vulkan]

AMD’s Ryzen AI 300 Series processors also introduce Variable Graphics Memory (VGM). VGM extends the iGPU’s memory allocation from its typical 512 MB limit to as much as 75% of available system RAM. This additional memory is particularly advantageous in applications with high memory demands. Activating VGM on a 16 GB system delivered an additional 22% average performance uplift in the Meta Llama 3.2 1B Instruct model, for a total 60% improvement over CPU-only mode. For larger models, like Mistral Nemo 2407 12B Instruct, VGM increased performance by up to 17%.

Comparing Consumer-Friendly LLM Experiences: AMD vs. Intel

To provide a balanced comparison of iGPU performance, we tested Intel’s first-party AI Playground application with models available within Intel’s ecosystem: Mistral 7B Instruct v0.3 and Microsoft Phi 3.1 Mini Instruct. At comparable quantization settings, the AMD Ryzen AI 9 HX 375 demonstrated an 8.7% performance edge in Phi 3.1 and a 13% edge in Mistral 7B Instruct v0.3, underscoring its efficiency in real-world applications.

[Figure 4: iGPU performance comparison, AMD vs. Intel AI Playground models]

The AMD Vision for Accessible AI

AMD’s Ryzen AI accelerators represent a commitment to advancing AI accessibility. Applications like LM Studio allow users to deploy sophisticated LLMs on x86 laptops with minimal setup, creating a gateway for broader AI adoption. With features like VGM and CPU-GPU integration, AMD Ryzen AI 300 Series processors offer superior performance, enhancing the overall user experience with large language models.

For anyone eager to explore the potential of LLMs locally, LM Studio provides an intuitive, powerful platform to get started.
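
LM Studio can also expose a local OpenAI-compatible server (by default on port 1234), so locally hosted models are scriptable as well. A minimal sketch, assuming that server is running with a model loaded; the model name below is a placeholder:

```python
# Query LM Studio's local OpenAI-compatible server (default: localhost:1234).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="llama-3.2-1b-instruct",  # placeholder; use whichever model is loaded
    messages=[{"role": "user", "content": "Hello from my Ryzen laptop!"}],
)
print(resp.choices[0].message.content)
```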
