Kog Labs

Kog Laneformer 2B: The Latency-First Model Behind Kog Inference Engine

Kog Team · 25 Jun 2026

Today Kog is releasing the weights and model code of Laneformer 2B on the Hugging Face Hub. Laneformer 2B is a 2.3B-parameter instruction-tuned coding model designed for high-speed decoding. Most LLM research optimizes for benchmark quality first, and inference metrics like speed are often treated as a serving problem that…

Inference 16 min read

Real-time LLM Inference on Standard Datacenter GPUs (3,000 tokens/s per request)

Kog Team · 28 May 2026

Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 output tokens/s per request on 8× NVIDIA H200 GPUs (FP16, no speculative decoding). This preview runs a 2B model; support for large third-party MoE models at similar speeds is coming next.

GPU Engineering 24 min read

Building a single-kernel, latency-optimized LLM inference engine on AMD MI300X GPUs

Kog Team · 28 May 2026

We implemented the entire LLM decode pass in a single persistent kernel, with no kernel launches and no interruptions, achieving 3,000+ tokens/s per request on AMD MI300X.

Model Architecture 13 min read

Delayed Tensor Parallelism for Faster Transformer Inference

Kog Team · 28 May 2026

Delayed Tensor Parallelism (DTP) is a new Transformer architecture that hides communication overhead behind computation and weight streaming, enabling significantly faster batch-size-one inference on AMD and NVIDIA GPUs.

AMD 1 min read

Kog Reaches 3.5x Breakthrough Inference Speed on AMD Instinct MI300X GPUs

Kog Team · 14 Jul 2025

Kog Inference Engine reaches up to 3.5x faster token generation on AMD Instinct MI300X GPUs.