Model Performance
Gemma 4 Deployment on Simplismart: Omni-Modal Open-Weight Inference That Scales in Production
TL;DR: Google DeepMind's Gemma 4 rewrites the open-weight playbook with a 256K-context omni-modal architecture that punches well above its parameter count. Run it today on Simplismart at up to 149 tokens/sec with zero infrastructure overhead.
Last Updated
April 20, 2026

The open-weight vs. closed-source gap is closing, and Gemma 4 is one of the clearest data points yet. Google DeepMind's latest release isn't an incremental update. It's a full architectural rethink: hybrid attention for 256K context, a 550M-parameter variable-resolution vision encoder, interleaved multimodal inputs, and benchmark numbers that compete with Kimi K2.5, Qwen 3.5 397B, and GLM5, models that are 10–30× larger by parameter count.

Simplismart now serves the Gemma 4 26B MoE at 88.14 tokens/sec and the 31B Dense at 149.44 tokens/sec on optimized shared endpoints, available today with an OpenAI-compatible API and no infrastructure overhead. Here's what actually changed, and why it matters for your Gemma 4 deployment.

Why Gemma 4 Is an Architectural Departure, Not an Incremental Release

Most model updates optimize within an existing architecture. Gemma 4 makes structural changes that affect inference behavior, memory consumption, and what the model can actually do, and they have direct implications for how you plan your Gemma 4 deployment.

In a nutshell, every previous Gemma model forced a tradeoff between context length, quality, and multimodal capability. Gemma 4 doesn't!

The full model family spans Dense and Mixture-of-Experts (MoE) architectures:

Model | Total Params | Active Params | Context | Modalities
E2B | 5.1B | 2.3B (effective) | 128K | Text, Image, Audio
E4B | 8B | 4.5B (effective) | 128K | Text, Image, Audio
26B A4B (MoE) | 25.2B | 3.8B | 256K | Text, Image
31B Dense | 30.7B | 30.7B | 256K | Text, Image, Video

The "E" models use Per-Layer Embeddings to squeeze more capability out of fewer active parameters, built for on-device deployment. The 26B MoE activates only 3.8B parameters per inference pass, making it the cost-efficient workhorse. The 31B Dense is the production-grade option: full parameter count, 256K context, full multimodal support, no architectural compromises.

Apache 2.0 license across all variants. No usage restrictions, no permission requirements for commercial deployment.

The Architecture Decisions That Matter

Hybrid Sliding-Window + Global Attention

Running 256K context naively means the KV cache grows linearly with sequence length. At 256K tokens, that's a memory budget that makes most attention implementations impractical.
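
For a sense of scale, here's a rough back-of-envelope calculation for a naive full-length cache. The layer count, KV-head count, and head dimension below are illustrative placeholders, not Gemma 4's published configuration:

# Rough back-of-envelope for a naive (all-global, full-length) KV cache.
# Dimensions are placeholders, NOT Gemma 4's published configuration.
def naive_kv_cache_gib(seq_len, n_layers=48, n_kv_heads=8, head_dim=128,
                       bytes_per_value=2):  # 2 bytes = fp16/bf16
    # 2x for Keys and Values, per layer, per KV head, per position
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 1024**3

print(f"{naive_kv_cache_gib(256_000):.1f} GiB per sequence at 256K tokens")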

Gemma 4 solves this with interleaved local and global attention. Local layers apply sliding-window attention over nearby context: efficient, bounded memory. Global layers reason across the full sequence, but only where necessary. The final layer always uses global attention to ensure long-range coherence.

To keep memory sane at full context depth, global layers use unified Keys and Values with Proportional RoPE (p-RoPE), a position encoding variant that scales relative position biases proportionally with sequence length, preventing degradation at the edges of the context window.

The practical result: you can feed a 256K-token input without blowing your GPU memory budget. This is engineering, not marketing.
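
A minimal sketch of what such an interleaved schedule could look like. The 5-local-to-1-global ratio and the layer count are assumptions for illustration; the only property taken from the description above is that the final layer is always global:

# Hypothetical interleaved attention schedule: mostly sliding-window layers,
# periodic global layers, and a guaranteed global final layer. The 5:1 ratio
# and layer count are illustrative, not Gemma 4's published configuration.
def attention_schedule(n_layers=48, global_every=6):
    schedule = []
    for i in range(n_layers):
        is_global = (i + 1) % global_every == 0 or i == n_layers - 1
        schedule.append("global" if is_global else "sliding_window")
    return schedule

schedule = attention_schedule()
print(schedule[:8], "...", schedule[-2:])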

Variable-Resolution Vision Encoder

The 31B model ships with a ~550M parameter vision encoder, nearly 4× larger than the ~150M encoders in the E-series models. More importantly, it processes images at their native aspect ratios and resolutions, without resizing or padding to a fixed square.

This matters more than it sounds. Documents are tall and narrow. Panoramic photos are wide. Charts and dashboards are landscape. Forcing every image into a fixed square loses spatial information before the model ever sees the content. Gemma 4 processes the native shape, which means fewer pre-processing steps in your pipeline and better performance on structured visual content like PDFs, slides, and screenshots.
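
As a rough illustration of the idea (not Gemma 4's actual encoder), a variable-resolution patcher can derive its grid from the native shape and only scale down proportionally when a token budget is exceeded. The patch size and budget here are made-up numbers:

# Illustrative only: compute a patch grid at the image's native aspect ratio
# instead of resizing to a fixed square. Patch size and token budget are
# assumptions, not details of Gemma 4's vision encoder.
import math

def native_patch_grid(height, width, patch=14, max_patches=4096):
    rows, cols = math.ceil(height / patch), math.ceil(width / patch)
    # Downscale proportionally only if the image exceeds the token budget,
    # keeping the original aspect ratio either way.
    if rows * cols > max_patches:
        scale = (max_patches / (rows * cols)) ** 0.5
        rows, cols = max(1, int(rows * scale)), max(1, int(cols * scale))
    return rows, cols

print(native_patch_grid(3300, 2550))   # tall document page
print(native_patch_grid(1080, 3840))   # wide panorama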

Interleaved Multimodal Input

Most vision-language models expect a fixed structure: image first, then text. Gemma 4 accepts freely interleaved text and image inputs in any order within a single prompt: question, image, more context, another image, continuation.

For multi-step visual reasoning or document analysis where context and visuals are naturally interwoven, this removes the need to artificially restructure your input.
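
Concretely, a single user message can carry freely interleaved text and image parts, using the same OpenAI-compatible content format shown in the Quick Start below (the URLs here are placeholders):

# One prompt, freely interleaved text and image parts (URLs are placeholders).
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Here is last quarter's revenue chart:"},
        {"type": "image_url", "image_url": {"url": "https://example.com/q3.png"}},
        {"type": "text", "text": "And this quarter's:"},
        {"type": "image_url", "image_url": {"url": "https://example.com/q4.png"}},
        {"type": "text", "text": "Which regions improved, and by how much?"}
    ]
}]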

Shared KV Cache

In the final layers, Gemma 4 reuses the K and V tensors from the last layer that computed them rather than projecting fresh ones in every layer. This applies separately to sliding-window and full-attention layers, each sharing within its own group.

The quality impact is negligible. The savings are real: lower memory pressure, fewer FLOPs, and a meaningful throughput improvement on long sequences. This is one reason the 31B can sustain 149.44 tokens/sec on Simplismart: the architecture is built to be inference-efficient, not just accurate.
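
The toy accounting below illustrates the structural idea: every layer still projects queries, but K and V are projected once per group and reused. The group sizes are illustrative, not Gemma 4's real layer layout:

# Toy accounting of projection work per decode step: within each attention
# group, only the first layer projects K and V; later layers reuse them.
def qkv_projections(groups, share_kv):
    q = kv = 0
    for _, n_layers in groups:
        q += n_layers                          # every layer still projects Q
        kv += 2 if share_kv else 2 * n_layers  # K and V once per group vs. per layer
    return q, kv

groups = [("sliding_window", 3), ("full_attention", 2)]
print("no sharing:", qkv_projections(groups, share_kv=False))   # (5, 10)
print("shared KV: ", qkv_projections(groups, share_kv=True))    # (5, 4)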

Per-Layer Embeddings (E-series)

Standard transformers force a single embedding vector to carry all layer-specific information through the entire forward pass. Gemma 4's E-series models use Per-Layer Embeddings (PLE) to address this bottleneck.

PLE adds a parallel, low-dimensional pathway alongside the main residual stream. For each token, it generates a small conditioning vector per layer by combining a token-identity lookup and a learned projection of the main embeddings. Each decoder layer then uses its dedicated vector to modulate hidden states after attention and feed-forward operations.

The effect: each layer receives token-specific information exactly when it's relevant, rather than frontloading everything into the initial embedding. Because the PLE dimension is much smaller than the main hidden size, you get meaningful per-layer specialization without a proportional increase in parameter count.
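
Here's a minimal sketch of the mechanism with toy dimensions. The exact way the token-identity lookup and the projected embedding are combined, and how the result modulates hidden states, are illustrative assumptions rather than Gemma 4's actual formulation:

# Minimal PLE sketch with toy dimensions (not Gemma 4's real sizes or math).
import numpy as np

vocab, d_model, d_ple, n_layers = 1_000, 512, 64, 8
rng = np.random.default_rng(0)

main_embed = rng.normal(size=(vocab, d_model)) * 0.02
ple_lookup = rng.normal(size=(n_layers, vocab, d_ple)) * 0.02    # token-identity path
ple_proj   = rng.normal(size=(n_layers, d_model, d_ple)) * 0.02  # projection of main embedding
ple_up     = rng.normal(size=(n_layers, d_ple, d_model)) * 0.02  # back up to model width

def per_layer_conditioning(token_id, layer_idx):
    h0 = main_embed[token_id]
    small = ple_lookup[layer_idx, token_id] + h0 @ ple_proj[layer_idx]
    return small @ ple_up[layer_idx]

def apply_ple(hidden, token_id, layer_idx):
    # Modulate the layer's hidden state with its dedicated low-dimensional signal.
    return hidden + per_layer_conditioning(token_id, layer_idx)

hidden = main_embed[42]
for layer in range(n_layers):
    hidden = apply_ple(hidden, token_id=42, layer_idx=layer)
print(hidden.shape)   # (512,)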

Gemma 4 Benchmarks

The 31B instruction-tuned model produces results you'd expect from a much larger closed-source model:

Benchmark | Gemma 4 31B | Gemma 3 27B | Kimi K2.5 (~1T/32B MoE)
MMLU Pro | 85.2% | 67.6% | 87.1%
AIME 2026 (no tools) | 89.2% | 20.8% | ~96%*
LiveCodeBench v6 | 80.0% | 29.1% | 85.0%
Codeforces ELO | 2150 | 110 | —
GPQA Diamond | 84.3% | 42.4% | 87.6%
MMMU Pro (Vision) | 76.9% | 49.7% | 78.5%

* Kimi K2.5 reports AIME 2025 at 96.1%

Gemma 4: Model Performance vs Size
Source: Google DeepMind

Gemma 4 31B is competitive with Kimi K2.5, a model with 30× more total parameters and roughly comparable active parameters. That's not an incremental improvement over Gemma 3; it's a capability step change. The AIME improvement from 20.8% to 89.2% and a 2,040-point jump on Codeforces ELO represent a genuine architectural dividend, not benchmark optimization.

What You Can Actually Do With Gemma 4

  • Document and image analysis: The variable-resolution encoder handles PDFs, screenshots, charts, handwriting, and OCR across 140+ languages natively. No resizing, no pre-processing pipeline required.
  • Video understanding (31B): The 31B model processes sequences of frames from video inputs. Audio support is not included in the 26B and 31B variants; if your workload requires speech recognition or speech-to-translated-text, the E2B and E4B models cover audio natively, making them the right choice for voice-adjacent or edge deployment use cases.
  • Long-context pipelines: 256K tokens is large enough to process entire codebases, document sets, or extended conversation histories in a single pass. For RAG pipelines, this means you can reduce or eliminate chunking overhead by feeding more context directly, as sketched after this list.
  • Multilingual workloads: Pre-trained on 140+ languages, with strong instruction-following across 35+ in the instruct-tuned variants. The 31B achieves 88.4% on MMMLU, evidence of genuine generalization, not just English-first training with multilingual fine-tuning applied afterward.
  • Coding and reasoning: Codeforces ELO of 2150 puts the 31B in competitive programmer territory. For code review, generation, or complex reasoning pipelines, quality is not the bottleneck.
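
As a rough sketch of the chunking-free pattern, the snippet below concatenates a document set and sends it as one long-context request, using the OpenAI-compatible client introduced in the Quick Start further down. The file paths and the chars-per-token heuristic are placeholders:

# Rough sketch of a single-pass long-context request (no chunking).
# Paths and the token estimate are illustrative; real counts depend on the tokenizer.
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="SIMPLISMART_API_KEY", base_url="https://api.simplismart.live")

corpus = "\n\n".join(p.read_text() for p in sorted(Path("docs").glob("*.md")))
approx_tokens = len(corpus) // 4                     # crude chars-per-token heuristic
assert approx_tokens < 256_000, "over the context budget; fall back to chunking"

res = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": corpus},
        {"type": "text", "text": "Summarize the open questions across these documents."}
    ]}]
)
print(res.choices[0].message.content)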

Gemma 4 Deployment on Simplismart: Production-Grade Throughput

Simplismart has applied production inference optimizations to both the 31B and 26B variants, resulting in throughput numbers that make Gemma 4 deployment viable for high-volume workloads on shared infrastructure:

Model | Throughput | Architecture | Context
Gemma 4 31B Dense | 149.44 tokens/sec | Dense | 256K
Gemma 4 26B MoE | 88.14 tokens/sec | MoE (3.8B active) | 256K

These are production measurements on Simplismart infrastructure, not synthetic benchmarks. At 149.44 TPS, a 31B Dense model is viable for real-time applications, not just batch evaluation. At 88.14 TPS, the 26B MoE gives you a cost-efficient alternative for throughput-sensitive workloads where you don't need the full Dense parameter count.

The API is OpenAI-compatible. If you're already using the OpenAI SDK, the switch is a base_url change.

Quick Start

1. Log in to Simplismart and open the model marketplace

2. Search "Gemma 4 31B Instruct". The shared endpoint is accessible right now.

Gemma 4 on Simplismart Marketplace

3. Copy the endpoint URL and set it as the base_url parameter, then generate an API key from Settings → API Keys and set it as the api_key parameter.

4. Use the client below: text, image, and video through one endpoint.

from openai import OpenAI

client = OpenAI(
    api_key="SIMPLISMART_API_KEY",
    base_url="https://api.simplismart.live"
)

# --- Text ---
def ask(question):
    res = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[{"role": "user", "content": [{"type": "text", "text": question}]}]
    )
    return res.choices[0].message.content

# --- Image ---
def describe_image(image_url, question="What is shown in this image?"):
    res = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}}
            ]
        }]
    )
    return res.choices[0].message.content

# --- Video ---
def describe_video(video_url):
    res = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this video."},
                {"type": "video", "url": video_url}
            ]
        }]
    )
    return res.choices[0].message.content


if __name__ == "__main__":
    print("Text:", ask("What is 2+2?"))
    print("Image:", describe_image(
        "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"
    ))
    print("Video:", describe_video(
        "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"
    ))


Gemma 4 Deployment Options: Shared vs. Dedicated

The same inference stack powers both Gemma 4 deployment options. What changes is how you pay for capacity and what guarantees you get.

Shared is pay-per-token; no infrastructure provisioning, no reserved compute, no GPU fleet to think about. Dedicated locks in GPU resources exclusively for your workload, with SLA-backed latency and throughput floors.

The shared endpoint is right when:

  • You want immediate access with zero provisioning
  • Your workload traffic is variable or unpredictable and you'd rather pay for API usage than reserve capacity
  • You're running multimodal pipelines, long-context RAG, or reasoning workloads where throughput is sufficient and you don't need guaranteed availability under burst load

Dedicated deployment is necessary when:

  • You need throughput beyond shared infrastructure capacity for high-concurrency workloads
  • You're shipping to production and require SLA-backed latency guarantees
  • Your data carries compliance constraints like HIPAA, SOC 2, or GDPR that require single-tenant isolation
  • You want to deploy on your own cloud, with the model running inside your own infrastructure and the ability to manage your own GPU fleet

Most teams start on shared, validate the model against their workload, then move to dedicated before going live. The API surface is identical; no code changes required.

For deployment configuration and optimization options, see the model optimization docs and deployment docs.

The Honest Assessment

Gemma 4 is the strongest open-weight multimodal model Google has shipped. The reasoning numbers aren't inflated by benchmark optimization. A 4× AIME improvement and a 2,000-point Codeforces ELO jump represent a genuine architectural step change, not fine-tuning on test-adjacent data. The 256K context window is large enough to be useful in real applications. The Apache 2.0 license removes the legal friction that makes proprietary models a liability for production systems.

The gap between open-weight and closed-source is closing faster than most teams expected. Gemma 4 deployment at 149.44 TPS on Simplismart is a concrete example of what that looks like in production: a 31B-class model, multimodal, 256K context, competitive with 1T-parameter alternatives, no GPU fleet to manage.

Start on the shared endpoint today or contact the team for a dedicated deployment tuned to your specific workload.
Find out what tailor-made inference looks like for you.