2026-05-05
Gemma 4 26B on consumer-grade 5070Ti GPU
For the past week I've been running Google's Gemma 4 26B as my daily local agent on a workstation. No API calls, no cloud, no rate limits. Just a single RTX 5070 Ti, llama.cpp compiled from source, and a systemd unit.
This post covers what it took to get there, the throughput and accuracy numbers I measured, and an honest read on what this hardware tier is now capable of for tool-using agentic work.
TL;DR
- Model: Gemma 4 26B A4B Instruct — MoE, 26B total / 3.8B active
- Quant: unsloth/gemma-4-26B-A4B-it-GGUF (UD-IQ4_XS, 12.65 GiB)
- Hardware: RTX 5070 Ti (16 GB), Ryzen 9 9950X3D, 64 GB DDR5 — Fedora 43
- Throughput: 5,951 t/s prompt processing, 137.7 t/s token generation (pp2048 / tg64, llama-bench)
- BFCL accuracy: 89.13% non-live, 63.80% live — first published BFCL numbers for Gemma 4
- Daily use: a week as the model behind opencode, no OOMs, 65k context, sub-second response on agentic loops
Why this matters
The narrative that serious agentic work requires either H100s or API access is increasingly dated. Gemma 4's mixture-of-experts architecture activates just 3.8B of its 26B parameters per forward pass, which means a quantized version fits comfortably in 16 GB of VRAM and runs at speeds that match or beat what most managed services give you on shared infrastructure.
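A napkin check on that claim: at the ~4.2 bits per weight this quant averages, 26e9 parameters work out to 26e9 × 4.2 / 8 ≈ 13.6 GB, which is the 12.65 GiB GGUF, leaving roughly 3 GiB of a 16 GB card for KV cache, activations, and buffers.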
For someone running a small consultancy, a research project, or a privacy-sensitive workload, the math has changed. A workstation that costs less than a single year of frontier-API credits can now serve a model that benchmarks competitively against hosted offerings on structured tasks.
I wanted to test that claim with real numbers rather than vibes.
The hardware
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 9 9950X3D — 16C/32T, 5.7 GHz boost, 128 MB L3 |
| RAM | 64 GB DDR5 |
| GPU | NVIDIA RTX 5070 Ti — 16 GB GDDR7, GB203 (Blackwell, sm_120) |
| OS | Fedora 43, kernel 6.18 |
Unfortunately I bought this GPU at the top of the recent price spike for €1,150 (21% VAT included), so YMMV on cost. The rest of the box is consumer hardware that any serious developer might already own.
The build saga
Getting llama.cpp to build cleanly on this hardware required a three-layer compatibility fix. None of the layers are llama.cpp's fault — they're the joint cost of running a brand-new GPU architecture (Blackwell) on a bleeding-edge distro (Fedora 43, GCC 15, glibc 2.41). Each layer is small in isolation; the combination wasn't documented anywhere I could find.
Layer 1: CUDA toolkit version pin
CUDA 13.x has a known segfault in the MMQ (Matrix Multiply Quantized) kernel on Blackwell. Without MMQ you fall back to cuBLAS and lose 5–6× on prompt processing. Solution: install CUDA 12.8 toolkit alongside the 580.x driver. The driver is forward-compatible with the older toolkit, and dnf install cuda-toolkit-12-8 cleanly drops the toolkit at /usr/local/cuda-12.8/ without touching kernel modules.
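On Fedora the pin is a one-package install, assuming NVIDIA's CUDA repo is already configured; the paths below are the stock ones:

```bash
# Install the CUDA 12.8 toolkit alongside the 580.x driver stack.
# This only drops files under /usr/local/cuda-12.8/; kernel modules are untouched.
sudo dnf install cuda-toolkit-12-8

# Point builds at the pinned toolkit rather than whatever nvcc is on PATH.
export CUDACXX=/usr/local/cuda-12.8/bin/nvcc
```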
Layer 2: GCC host compiler downgrade
CUDA 12.8's cudafe++ cannot parse GCC 15's <type_traits> headers — it chokes on __is_pointer and __is_volatile builtins. The --allow-unsupported-compiler flag isn't enough; cudafe++ rejects the headers before the flag matters. Solution: install Fedora's compatibility package gcc14-c++ and point CMAKE_CUDA_HOST_COMPILER at g++-14. Only nvcc's host pass uses GCC 14; the rest of the C++ compilation stays on GCC 15.
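In llama.cpp's configure step that looks roughly like this; a sketch, with the g++-14 path assumed from Fedora's gcc14-c++ package:

```bash
# CUDA on, nvcc from the pinned 12.8 toolkit, GCC 14 only for nvcc's host pass.
# Everything else still compiles with the system GCC 15.
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-14
cmake --build build -j"$(nproc)"
```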
Layer 3: glibc 2.41 noexcept conflict
Fedora 43 ships glibc 2.41, which adds C23-conformant declarations of cospi, sinpi, and rsqrt (plus their float variants) marked noexcept(true). CUDA 12.8's math_functions.h declares the same names without noexcept. cudafe++ rejects the mismatch. There's no upstream CUDA fix yet. Workaround: sed six declarations in crt/math_functions.h to add noexcept. The patch is six lines and covered by a single backup file.
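A hedged sketch of that workaround is below; the exact declaration text varies between toolkit builds, so verify the matches before trusting the in-place edit:

```bash
HDR=/usr/local/cuda-12.8/targets/x86_64-linux/include/crt/math_functions.h

# One backup file covers the whole patch.
sudo cp "$HDR" "$HDR.bak"

# Append noexcept(true) to the six conflicting declarations
# (cospi/sinpi/rsqrt plus their float variants) so they match glibc 2.41.
for fn in cospi cospif sinpi sinpif rsqrt rsqrtf; do
  sudo sed -i -E "s/(double|float) +${fn} *\(([^)]*)\);/\1 ${fn}(\2) noexcept(true);/" "$HDR"
done
```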
The full automation, including idempotent re-runs, lives in the Build automation appendix below — ./setup-llama-cpp.sh all runs the whole pipeline.
The systemd unit
llama-server runs as a user-level systemd unit on port 8080 with an OpenAI-compatible API. The relevant flags:
```
--n-gpu-layers 99                                   # full GPU offload (31/31 layers)
--ctx-size 65536                                    # 65k context window
--flash-attn on                                     # required for mixed KV
--cache-type-k q8_0
--cache-type-v q4_0                                 # asymmetric KV: bigger gains where it matters
--batch-size 2048 --ubatch-size 512
--cache-reuse 256                                   # KV-shifted prefix caching
--parallel 1                                        # single-stream, max throughput per request
--jinja                                             # official chat template (post-2026-05-04 quant)
--reasoning on
--reasoning-format deepseek                         # thoughts → message.reasoning_content
--chat-template-kwargs '{"enable_thinking":true}'
```
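For completeness, here is a minimal sketch of the unit itself; the model path, binary location, and unit name are placeholders rather than my exact setup:

```bash
# Hypothetical user-level unit; the flags mirror the list above.
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/llama-server.service <<'EOF'
[Unit]
Description=llama-server (Gemma 4 26B A4B)
After=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server \
  --model %h/models/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --host 127.0.0.1 --port 8080 \
  --n-gpu-layers 99 --ctx-size 65536 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0 \
  --batch-size 2048 --ubatch-size 512 --cache-reuse 256 --parallel 1 \
  --jinja --reasoning on --reasoning-format deepseek \
  --chat-template-kwargs '{"enable_thinking":true}'
Restart=on-failure

[Install]
WantedBy=default.target
EOF
systemctl --user daemon-reload
systemctl --user enable --now llama-server
```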
Two non-obvious choices worth calling out:
Asymmetric KV cache (k=q8_0, v=q4_0)
Keys are more sensitive to quantization than values for retrieval-style attention, so the asymmetry costs almost no quality but cuts the KV cache footprint by ~25% versus uniform q8_0. This requires GGML_CUDA_FA_ALL_QUANTS=ON at build time — without it, flash attention silently falls back to a CPU path that slows prompt processing by ~110×. That failure mode is easy to miss, and easy to diagnose only after you've spent a while wondering why you're getting 50 t/s instead of 5,000.
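Concretely, the configure step from the build saga grows one flag (same illustrative paths as before):

```bash
# Re-configure with all flash-attention quant kernels compiled in; without
# this, the mixed q8_0/q4_0 KV cache has no CUDA FA kernel to dispatch to.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
      -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc \
      -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-14
```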
Reasoning format deepseek
This routes the model's thinking-channel output to message.reasoning_content in the OpenAI response, leaving message.content clean for the final answer. Clients like opencode can render the trace as collapsible context without parsing it out of the response body.
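You can see the split with a plain curl against the server's OpenAI-compatible endpoint; the prompt and jq filter are just for illustration:

```bash
# Ask for a completion; the thinking trace and the final answer come back
# in separate fields of the message object.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Briefly: why is the sky blue?"}],"max_tokens":512}' \
  | jq '.choices[0].message | {reasoning_content, content}'
```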
Throughput
llama-bench numbers, build 9024, full GPU offload, with the asymmetric KV cache:
| Metric | Result |
|---|---|
| pp2048 | 5,951 ± 51 t/s |
| tg64 | 137.7 ± 1.3 t/s |
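For reproducibility, the invocation was along these lines; the model path is illustrative, and pp2048/tg64 map to -p 2048 -n 64:

```bash
# Full GPU offload, flash attention on, asymmetric KV cache types.
./build/bin/llama-bench \
  -m ~/models/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  -ngl 99 -fa 1 -ctk q8_0 -ctv q4_0 \
  -p 2048 -n 64
```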
Compared to the same model on an RTX 3090 Ti (two generations older, but with 50% more VRAM and slightly higher memory bandwidth) running Q4_K_M in a stock container:
| Host | Quant | Weights | pp2048 | tg64 |
|---|---|---|---|---|
| RTX 5070 Ti | UD-IQ4_XS | 12.65 GiB | 5,951 | 137.7 |
| RTX 3090 Ti | Q4_K_M | 15.70 GiB | 4,184 | 138.2 |
The 5070 Ti wins prompt processing by ~40% on its newer tensor cores; the 3090 Ti matches it on token generation, where its higher memory bandwidth compensates. For a tool-using agent — where prompt processing dominates total latency because each turn re-prefills the conversation history — the Blackwell card pulls ahead noticeably.
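To make that concrete with the numbers above: re-prefilling a 30k-token history takes roughly 30,000 / 5,951 ≈ 5.0 s on the 5070 Ti versus 30,000 / 4,184 ≈ 7.2 s on the 3090 Ti, while a 600-token reply costs about 4.4 s on either card. Nearly all of the per-turn difference lives in the prefill.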
Full bench detail and methodology in the llama-bench detail appendix below.
BFCL accuracy
The Berkeley Function Calling Leaderboard v4 is the standard benchmark for LLM tool use. It tests whether a model can take a function schema and a natural-language request and produce a correct, parseable function call across single-turn, parallel, and irrelevance-detection scenarios.
BFCL doesn't ship a Gemma 4 handler. I wrote one — handler code, model config registration, and the multimodal context-length workaround are documented in the Gemma 4 handler patch appendix below.
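For anyone reproducing the runs, generation and scoring follow the standard BFCL CLI flow; the model name below is whatever your handler registers, so treat it as a placeholder:

```bash
# Generate model responses for a category, then score them.
# Category names follow the leaderboard's own taxonomy.
bfcl generate --model gemma-4-26b-a4b-it --test-category simple
bfcl evaluate --model gemma-4-26b-a4b-it --test-category simple
```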
I ran BFCL twice. Once in early April against the original IQ4_XS quant with llama.cpp's compatibility-shimmed chat template (the model's embedded jinja predated llama.cpp's expected format). And once on 2026-05-05 against the new UD-IQ4_XS quant Unsloth shipped on 2026-05-04 with Google's official Gemma chat template baked in.
| Category | Apr 2026 | May 2026 |
|---|---|---|
| simple_python | 96.50% | 95.00% |
| simple_java | 65.00% | 65.00% |
| simple_javascript | 74.00% | 74.00% |
| multiple | 93.50% | 93.00% |
| parallel | 94.00% | 93.00% |
| parallel_multiple | 93.00% | 92.50% |
| irrelevance | 92.50% | 91.25% |
| Non-Live Overall | 89.75% | 89.13% |
| live_simple | 87.21% | 86.82% |
| live_multiple | 61.06% | 57.93% |
| live_parallel | 87.50% | 87.50% |
| live_parallel_multiple | 62.50% | 58.33% |
| live_irrelevance | 91.97% | 92.65% |
| live_relevance | 50.00% | 43.75% |
| Live Overall | 66.40% | 63.80% |
For context, GPT-4o lands around 97% on simple_python, Claude 3.5 Sonnet around 96%, Llama 3.1 70B around 93% — all served as full-precision API endpoints. A 3.8B-active MoE running at IQ4_XS on consumer silicon scoring 95% in the same category is, frankly, where the disruption lives.
Reading the April → May delta
Two variables changed at once: the quant moved from vanilla IQ4_XS to Unsloth's dynamic UD-IQ4_XS (12.65 vs 13.3 GiB, with selective higher-precision layers), and llama.cpp's chat-template compat shim was replaced by the model's officially-shipped template. Single-seed runs at temperature=1.0 have ±2–3% noise per category, so individual deltas can't be cleanly attributed.
The pattern across categories is informative, though. Irrelevance detection ticked up (the model is correctly declining to call tools more often) while relevance and multi-call live categories dipped. The new official template appears to make the model slightly more conservative about emitting tool calls. I read this as a behavioral shift, not a quality regression.
Full results and methodology in the BFCL results appendix below.
The multi-turn wall
I tried to extend the BFCL run to the multi-turn category, which is where benchmarks start to actually resemble agentic workloads. It didn't work, and the reason is interesting.
Gemma 4 with --jinja enabled emits tool calls in its native syntax:
```
<|tool_call>call:find(name="document")<tool_call|>
```
BFCL's prompt-mode parser was designed for an earlier generation of models that emit [find(name="document")]. In single-turn cases, the BFCL handler's system-prompt pre-processing wins out and the model produces the expected format. In multi-turn cases, after the first turn the assistant's prior native-format response gets replayed back through the chat template, the BFCL injection no longer dominates, and the model reverts to native syntax that the parser can't decode.
Two real fixes are possible: extend the handler's decode_ast() to parse Gemma's native tool-call syntax, or strip the chat template's tool-section rendering entirely. Both are non-trivial engineering work. Both are out of scope for this writeup.
This is a useful illustration of how rapidly evolving model capabilities outpace standardized benchmarks. Native tool-call tokens are a meaningful upgrade — they're cleaner to parse than free-form text and let the runtime distinguish reliably between content and tool invocations — but they invalidate years of evaluation infrastructure built around the old format. Multi-turn benchmarks for Gemma 4 are coming in a follow-up post.
A week in opencode
Numbers tell part of the story. The other part is what it's like to use.
For the past week, this stack has been the model behind many coding sessions, code reviews, and exploratory queries I've run through opencode. Tool calls in agentic loops are reliable. Context management at 65k windows works without OOMs, even when sessions accumulate substantial history. Cache reuse via KV shifting (--cache-reuse 256) means turning a long thread into another turn rarely re-prefills the whole conversation.
The thinking-mode separation deserves a callout. Routing reasoning into message.reasoning_content and keeping message.content clean for the final answer means clients can render the trace as collapsible context rather than dumping it inline. This is closer to how Anthropic's extended thinking used to surface in clients, and it makes the model considerably more pleasant to work with in a tool that expects structured responses.
What still doesn't work as well as a frontier API: deeply nested reasoning where the answer requires holding many constraints across long chains. Gemma 4 holds short chains beautifully but can lose threads on hour-long agent sessions in ways that GPT-5 or Claude don't. For that class of work, I still reach for an API. For the other 80% — code review, exploration, structured queries, scripting help — local Gemma 4 is the default.
What this means
The argument that serious agentic work requires hosted infrastructure is no longer a hardware argument. It's a software-stack argument, and the software stack is closing fast. A workstation that costs less than a year of API credits at modest usage now runs a model that competes with last year's flagships on structured output. It runs that model with sub-second latency on most queries, and it does so with full data sovereignty — no prompts leaving the box, no provider seeing what you're working on, no rate limits, no retraining concerns.
For consultancies, research labs, and teams with privacy or compliance requirements: the trade has become serious. Local-first agentic work isn't a hobbyist's compromise anymore. It's a viable production path, with a build script, a systemd unit, and a benchmark line.
The open question — and it's a real one — is what happens when frontier models extend their lead at the agentic-loop level. Tool-use reliability across 30-step workflows is where the closed labs still pull ahead. Gemma 4 closes the gap on the per-turn capability that most local models were missing. The next benchmark to watch is whether the open ecosystem catches up on long-horizon reasoning before the gap widens further.
For now, on a single Blackwell GPU and a Saturday's worth of build pain: this is what local-first agentic infrastructure looks like in May 2026.
Appendix
Five technical companion docs — build script, throughput methodology, full BFCL results, the handler patch, and the remaining evaluation plan.
Building something local-first?
Whether it's local AI inference, GitOps, or rescuing infrastructure after a crisis — I'd love to hear about it.