AMD

Kog Reaches 3.5x Breakthrough Inference Speed on AMD Instinct MI300X GPUs

Kog Inference Engine reaches up to 3.5x faster token generation on AMD Instinct MI300X GPUs

TL;DR

Kog Inference Engine generates tokens up to 3.5× faster than vLLM and TensorRT-LLM on AMD MI300X across all tested model sizes (1B to 32B), with cross-GPU latency as low as 4 μs.

AMD covered Kog's inference results on MI300X in its engineering blog.

Kog Laneformer 2B: The Latency-First Model Behind Kog Inference Engine

Today Kog is releasing the weights and model code of Laneformer 2B, a 2.3B-parameter instruction-tuned coding model designed for high-speed decoding, on Hugging Face Hub. Most LLM research optimizes for benchmark quality first, and inference metrics like speed are often treated as a serving problem that

Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)

Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 output tokens/s per request on 8× NVIDIA H200 GPUs (FP16, no speculative decoding). This preview runs a 2B model; support for large third-party MoE models at similar speeds is coming next.
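To put the per-request figures in perspective, a throughput in tokens/s can be converted into an average time budget per output token. The short sketch below does only that arithmetic on the two numbers quoted above; the helper name `per_token_latency_us` is made up for illustration and is not part of KIE.

```python
def per_token_latency_us(tokens_per_second: float) -> float:
    """Average time per output token, in microseconds, for a single request."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1_000_000 / tokens_per_second


# Per-request throughput figures quoted in the announcement above.
quoted = {"8x AMD MI300X": 3000, "8x NVIDIA H200": 2100}

for setup, tps in quoted.items():
    print(f"{setup}: about {per_token_latency_us(tps):.0f} us per token")
```

At 3,000 tokens/s the whole decode step for one token has to fit in roughly a third of a millisecond on average, which is why per-step overheads matter at this speed.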

Building a single-kernel, latency-optimized LLM inference engine on AMD MI300X GPUs

We implemented the entire LLM decode pass in a single persistent kernel, with no kernel launches and no interruptions, achieving 3,000+ tokens/s per request on AMD MI300X.

Delayed Tensor Parallelism for Faster Transformer Inference

DTP is a new Transformer architecture that hides communication overhead behind computation and weight streaming, enabling significantly faster batch-size-one inference on AMD and NVIDIA GPUs.
