Google’s Gemma 4 family introduces a new generation of open-weights models ranging from 2B to 31B parameters, featuring both dense and Mixture-of-Experts (MoE) architectures. These multimodal models process text and vision inputs, with audio support on select variants, and support context windows up to 256K tokens.
AMD provides Day Zero support for the entire Gemma 4 lineup across its hardware ecosystem, spanning Instinct GPUs for enterprise datacenters, Radeon GPUs for AI workstations, and Ryzen AI processors for PCs.
Here are the supported frameworks for deploying Gemma 4 on AMD hardware:
Table of Contents:
- vLLM
- SGLang
- LM Studio
- Lemonade Server
vLLM
You can deploy Gemma 4 using vLLM to take advantage of inference optimizations, particularly for handling multiple concurrent requests. This framework supports multiple generations of both Instinct and Radeon GPUs.
To get started, pull the launch build Docker image using docker pull vllm/vllm-openai-rocm:gemma4. When running the model, invoke the server with the TRITON_ATTN backend using the command vllm serve <model> --attention-backend TRITON_ATTN.
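Putting those steps together, a minimal deployment might look like the sketch below. The `<model>` placeholder is as in the text; the `/dev/kfd` and `/dev/dri` device mounts follow the usual pattern for running ROCm containers, and the port mapping assumes vLLM's default of 8000.

```shell
# Pull the launch build image
docker pull vllm/vllm-openai-rocm:gemma4

# Run the container with GPU access (/dev/kfd and /dev/dri are the
# standard ROCm device nodes) and start the server with the Triton
# attention backend. <model> is a placeholder for your chosen checkpoint.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:gemma4 \
  vllm serve <model> --attention-backend TRITON_ATTN

# Once up, the server exposes vLLM's OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "messages": [{"role": "user", "content": "Hello"}]}'
```

This is a sketch rather than a definitive recipe; exact device flags and image contents may vary with your ROCm installation.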
SGLang
For high-performance serving on AMD MI300X, MI325X, and MI35X GPUs, SGLang supports the full Gemma 4 family. These models require the Triton attention backend to process bidirectional image-token attention.
You can launch the server using python3 -m sglang.launch_server --model-path <model> --attention-backend triton --tp 1. A full Gemma 4 model fits on a single MI300X GPU at maximum context length, though you can increase the tensor-parallelism degree (--tp) for higher-throughput workloads.
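As a concrete sketch, the launch command from the text can be adapted for both the single-GPU and sharded cases; the request below assumes SGLang's usual default port of 30000 and its OpenAI-compatible endpoint, and `<model>` is a placeholder as above.

```shell
# Single MI300X GPU: a full Gemma 4 model fits at maximum context length
python3 -m sglang.launch_server --model-path <model> \
  --attention-backend triton --tp 1

# Higher-throughput variant: shard the model across 8 GPUs
python3 -m sglang.launch_server --model-path <model> \
  --attention-backend triton --tp 8

# Query the server via its OpenAI-compatible API
# (30000 is SGLang's usual default port; adjust to your setup)
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "messages": [{"role": "user", "content": "Hello"}]}'
```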
LM Studio
If you are deploying on local consumer hardware, LM Studio offers a quick setup built on the llama.cpp project. You can run these models on Ryzen AI processors or Radeon graphics cards by installing the LM Studio application and updating to the latest AMD Adrenalin Edition drivers.
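Beyond the chat interface, LM Studio can also serve a loaded model through a local OpenAI-compatible endpoint, which makes it easy to script against. A minimal request might look like the following; 1234 is LM Studio's usual default port, and `<model>` is a placeholder for the loaded Gemma 4 build.

```shell
# Query LM Studio's local OpenAI-compatible server
# (1234 is the application's usual default port; adjust to your setup).
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "messages": [{"role": "user", "content": "Hello"}]}'
```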
Lemonade Server
Lemonade provides a local LLM server equipped with OpenAI-compatible APIs. It accelerates inference on Radeon GPUs via ROCm and on Ryzen AI processors via the XDNA 2 NPU.
To deploy on GPUs, download the preview ROCm build of llama.cpp for your specific GPU architecture, set the LEMONADE_LLAMACPP_ROCM_BIN environment variable to point to the executable, and start the Lemonade server.
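The GPU steps above can be sketched as follows; the binary path is illustrative, and the serve command assumes the Lemonade CLI's usual entry point, so adjust both to match your installation.

```shell
# Point Lemonade at the preview ROCm build of llama.cpp
# (the path below is illustrative; use wherever you unpacked the build)
export LEMONADE_LLAMACPP_ROCM_BIN=/path/to/llama-server

# Start the Lemonade server, which exposes OpenAI-compatible APIs locally
lemonade-server serve
```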
For NPU deployment, support for the Gemma 4 E2B and E4B models is planned for an upcoming Ryzen AI software update.