llm | performance optimization | C++ | agentic workflows

AgenticKernel: Local Models Beat Frontier APIs at C++ on a Fixed Budget

AutoResearch showed that a simple agentic feedback loop can iteratively improve PyTorch models: propose a change, evaluate the result, and repeat. Compute kernel optimization fits the same iterative pattern especially well. Starting from a baseline implementation, the system can propose a change, compile it, benchmark it on real hardware, evaluate the outcome, and repeat. With end-to-end feedback taking around 10 seconds, this creates a fast search loop over implementation space.

What was surprising was not just that a simple loop works, but how competitive cheap, locally run models can be inside it. Under a fixed search budget, they reached higher final speedups than much larger and more expensive frontier APIs.

Experiment Setup & Problem Space

The optimization task for the LLM is deliberately chosen to be adjacent to general matrix multiplication (GEMM), which these models have likely seen as part of their training data. We task the model with optimizing the matrix product A*B with a twist: B is stored in a packed binary format in which bits 0 and 1 translate to -1 and +1, respectively:

$$ C_{ij} = \sum_{l=1}^{k} A_{il} \, B_{lj}, \qquad A \in \mathbb{R}^{m \times k}, \quad B \in \{-1,+1\}^{k \times n}. $$

The model is not told the exact dimensions of the input matrices or hardware characteristics such as cache types and sizes, and it has no way to inspect these other than through the benchmarked runtime results. In the example we run here, $A$ is $32 \times 3072$ and $B$ is $3072 \times 3072$. These dimensions were chosen to place the kernel firmly in the compute-bound region of an Apple M3 core.

The model is given a baseline implementation in C++ that is straightforward to write from the mathematical formula. This baseline serves both as the starting point for optimization and, together with a unit test, as the specification for the kernel to implement.
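A minimal baseline along these lines might look as follows. The row-major layout, per-column packing of B, and LSB-first bit order are assumptions for illustration; the article does not spell out the exact conventions used in the experiments.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical baseline for C = A * B, where B is bit-packed along k:
// bit value 0 decodes to -1.0f, bit value 1 decodes to +1.0f.
// Assumptions (not specified in the article): A and C are row-major,
// column j of B occupies k/8 contiguous bytes, bits are LSB-first,
// and k is a multiple of 8.
void matmul_packed_baseline(const float* A, const uint8_t* Bpacked,
                            float* C, size_t m, size_t k, size_t n) {
    const size_t kBytes = k / 8;
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t l = 0; l < k; ++l) {
                // Fetch bit l of column j and map {0,1} -> {-1,+1}.
                const uint8_t byte = Bpacked[j * kBytes + l / 8];
                const float b = ((byte >> (l % 8)) & 1) ? 1.0f : -1.0f;
                acc += A[i * k + l] * b;
            }
            C[i * n + j] = acc;
        }
    }
}
```

The triple loop mirrors the mathematical definition one-to-one, which is exactly what makes it a good specification and a poor starting point for performance.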

For each run, the model and compile flags are fixed for the full experiment. The benchmark is run and timed on an Apple M3 using Google Benchmark. The optimization loop stops after 50 iterations (kernel proposals) or when a $5 budget for model API access through OpenRouter is reached. Locally hosted models run on a DGX Spark (Blackwell GB10) through Ollama 0.20.6.

The agent receives the baseline implementation in the initial prompt and has access to a single tool for proposing and evaluating new implementations. This tool writes the candidate implementation to disk, rejects banned constructs such as static, compiles the code, and runs the unit tests and the benchmark. On success, it returns the runtime of the submitted implementation compared with all previous implementations. On failure, it sends the relevant console output back to the LLM. The model can thus execute tests, but it cannot inspect the test code itself, see the matrix dimensions, or run arbitrary commands.

All tool output, including run times and error feedback, is appended to the context as the run progresses. There is no context compaction in these experiments.
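The tool's contract can be sketched as a small in-process toy. All names here are hypothetical, and the compile/test/benchmark step is abstracted into a callback; the real tool shells out to a compiler and Google Benchmark.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Toy sketch of the evaluation tool's contract (all names hypothetical).
// A candidate either fails (returning console output for the model) or
// succeeds (returning its runtime, compared against earlier attempts).
struct EvalResult {
    bool ok;
    double runtime_ms;     // valid only when ok
    std::string feedback;  // message appended to the model's context
};

class KernelEvaluator {
    std::vector<double> history_;  // runtimes of accepted kernels so far
public:
    // In the real system, `source` is written to disk, screened for
    // banned constructs (e.g. `static`), compiled, unit-tested, and
    // benchmarked. Here compile+run is abstracted as `run`.
    EvalResult evaluate(const std::string& source,
                        const std::function<double(const std::string&)>& run) {
        if (source.find("static") != std::string::npos)
            return {false, 0.0, "rejected: banned construct `static`"};
        const double ms = run(source);
        std::string msg = "runtime: " + std::to_string(ms) + " ms";
        if (!history_.empty()) {
            const double best =
                *std::min_element(history_.begin(), history_.end());
            msg += " (best so far: " + std::to_string(best) + " ms)";
        }
        history_.push_back(ms);
        return {true, ms, msg};
    }
};
```

Keeping the interface this narrow is what enforces the constraints above: the model only ever sees runtimes and console output, never the test code or the hardware.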

Results without compiler optimizations (O0)

The first experiment turns compiler optimizations off (O0) and is somewhat adversarial in that this forces the model to manually introduce structure that the compiler would otherwise provide.

The LLM is not made aware of the compiler flags and cannot inspect them. Given that no context compaction is used, one could expect that only frontier models like Claude, Gemini, or GPT could succeed on this task. However, it turned out that even much smaller models with 30-100B parameters delivered impressive speedups of up to 7.5x over the baseline.

A notable pattern in the O0 runs was that frontier models such as Opus and GPT reached the highest absolute speedups, in some cases around 11x, but often ran into the fixed $5 budget limit before the search had clearly saturated. By contrast, locally running open-weight models reached lower but still very strong speedups around 8x at essentially zero marginal cost. In other words, even in this more adversarial setting, the ranking depends not just on peak model capability, but also on how efficiently a model can explore under an iterative search budget.

Fig 2. Best speedup reached by each model with compiler optimizations disabled (-O0). This is the more adversarial setting: improvements must come largely from structure introduced by the model rather than from the compiler. Frontier models (Opus 4.6, GPT 5.4) reached the highest absolute speedups of over 10x, but several much smaller and locally running open-weight models still achieved strong gains (gemma4:31b: 7.8x) at essentially zero marginal cost.

Results with compiler optimizations (O3)

In a second run, we let the compiler handle part of the optimizations by passing -O3. One might expect this to leave much less room for the agent to improve over the baseline (now also compiled with -O3), since common optimizations such as loop transformations, instruction scheduling, and related low-level cleanups are now performed automatically by the compiler.

In practice, however, the O3 runs were even more striking. Across models of very different sizes, the loop still found speedups of up to 43x over the baseline. Note that this is single-core optimization with SIMD. This suggests that even with an optimizing compiler in the loop, the search is doing much more than simply nudging a naive implementation: there remains substantial room to restructure the computation in ways that the compiler alone does not recover from the reference implementation.

| Model | Final Speedup | Total Cost | Iterations to Peak | Time to Peak |
|---|---|---|---|---|
| Gemma-4-31B (Local) | 43x | $0.35 | 28 | 6h 2m |
| GPT-OSS 120B (Local) | 37.2x | $0.09 | 39 | 1h 44m |
| Qwen-Coder-3-Next (Local) | 18.5x | $0.03 | 35 | 21m |
| Gemini3.1-Pro (Paid) | 18.2x | $5.00 (cap) | 20 | 32m |
| GPT-5.4 (Paid) | 10.4x | $5.00 (cap) | 4 | 3m |
| Opus 4.6 (Paid) | 22.4x | $5.00 (cap) | 18 | 8m |

Another surprise was that the fastest run to date came from a free and open model running locally. Under the fixed $5 budget constraint, none of the frontier models reached a similarly fast kernel. As in the O0 setting, this makes the practical ranking more nuanced than a simple "best model wins" story: final kernel speed, search cost, and wall-clock time can differ substantially across models even when they end up in a similar performance range.

Fig 3. Best speedup reached by each model with compiler optimizations enabled (-O3). Despite the compiler already applying standard optimizations, the loop still found very large improvements over the reference implementation. Under the fixed $5 budget, the strongest final result in this setting came from a free and open locally running model, while frontier models did not converge to similarly fast kernels before budget exhaustion.

Tradeoffs

We've seen in the previous section that raw speedup is only part of the story. Kernel optimization is usually a manual process. What makes this loop useful is that much of it can be handed off: provide a baseline, a test and a benchmark, then let the model search.

The plots below compare the best speedup each model reached in the O3 setting against the cost (Fig 4.) and time (Fig 5.) needed to get there. This is the more interesting regime in practice because the compiler is already doing serious work, yet the loop still finds large improvements over the reference implementation. Under the fixed $5 budget, the local models reached the strongest final speedups. This is partly a cost effect, but not only that: some of the local models (in particular Qwen-Coder-3-Next) were also in a similar latency range, so the result is not just explained by slower but cheaper search.

Fig 4. Best speedup reached in the O3 setting plotted against the dollar cost of the run. For closed-source models this is the API cost; for open models it is the run time multiplied by the local electricity cost of inference on a DGX Spark (NVIDIA Blackwell GB10). Under the fixed $5 budget, locally running models dominated the practical frontier: they achieved higher final speedups while incurring very limited marginal cost. The result is therefore not just about cheap search, but about which models can search most effectively under a constrained budget.

The wall-time plot shows another part of the tradeoff. Some models reached good results cheaply but slowly, while others were faster to respond but exhausted budget before the search had clearly run its course.

While local models took considerably longer in absolute wall-clock time, this compute is entirely unattended, which makes a longer, cheaper search viable for automated or overnight pipelines. Furthermore, the current loop is deliberately naive; future work using context compaction (rather than appending full code histories) should substantially reduce inference latency and KV-cache pressure, narrowing this time gap.

These results suggest that for hardware-grounded optimization loops, the best practical model is not necessarily the most expensive or most prestigious one. Under a fixed search budget, cheaper local models can be the more powerful choice because the loop rewards sustained exploration rather than one-shot perfection.

Fig 5. Best speedup reached in the O3 setting plotted against wall-clock time. This captures the practical latency of the full optimization loop, including model response time and context growth over repeated iterations. Several local models remained competitive not only in cost but also in elapsed time, showing that their advantage in this setup was not purely economic.

How the model implemented the generated C++ kernels

Inspecting the actual C++ code generated by the models reveals distinctly different optimization strategies, highlighting the gap between algorithmic cleverness and true mechanical sympathy for the hardware.

  • Qwen-Coder-3-Next (local) achieved its 18.5x speedup with a brute-force pointer approach, avoiding explicit vector intrinsics entirely. Instead it relied on massive loop unrolling (32 lines deep) and pointer arithmetic to unpack the bits, effectively leaving the compiler to figure out the SIMD instructions. While functional, it is likely bottlenecked on memory bandwidth due to continuous writes to the accumulator array.

  • Opus 4.6 (22.4x) took a smart approach with a speed-bump: it bypassed floating-point multiplication entirely by directly manipulating the IEEE-754 sign bit using vector XOR operations (veorq_u32). However, to fetch the right bitmasks, it relied on scalar bit-shifting inside the innermost loop, likely creating pipeline stalls that starve its own vector registers.

  • GPT-OSS 120B (local) (37.2x) found a middle ground between pure C++ and hardware-specific code. It built a compile-time constexpr look-up table to dequantize bytes into floats, unrolled its inner loops, and accumulated math in local arrays to prevent memory thrashing. Crucially, it injected explicit __builtin_prefetch instructions to hide memory latency, making it easier for the compiler to auto-vectorize the code without requiring explicit ARM intrinsics.

  • Finally, Gemma-4-31B (local) (43x) used an 8-bit look-up table pre-populated with entire ARM NEON vector registers, accumulated the math locally across 16 vector registers to essentially eliminate memory write overhead, and saturated the M3's execution units with FMA (vmlaq_f32) instructions. It didn't just understand the math; it understood the CPU pipeline.
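The look-up-table idea behind the two strongest kernels can be sketched in portable C++ without NEON intrinsics: a compile-time table maps each packed byte to 8 floats in {-1,+1}, so the inner loop replaces per-bit branching with a single table lookup. The LSB-first bit order and function names are assumptions; the generated kernels themselves are not reproduced in the article.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build a 256-entry table at compile time (C++17): entry b holds the
// 8 decoded values {-1,+1} for the packed byte b, LSB-first (assumed).
constexpr std::array<std::array<float, 8>, 256> make_lut() {
    std::array<std::array<float, 8>, 256> lut{};
    for (int byte = 0; byte < 256; ++byte)
        for (int bit = 0; bit < 8; ++bit)
            lut[byte][bit] = ((byte >> bit) & 1) ? 1.0f : -1.0f;
    return lut;
}
constexpr auto kLut = make_lut();

// Dot product of k floats against k packed +/-1 values (k a multiple
// of 8). The per-bit branch of the baseline becomes a table lookup
// plus a short fixed-length inner loop that is easy to unroll and
// auto-vectorize; the NEON versions go further by storing whole
// vector registers in the table.
float dot_packed(const float* a, const uint8_t* packed, size_t k) {
    float acc = 0.0f;
    for (size_t byte = 0; byte < k / 8; ++byte) {
        const auto& vals = kLut[packed[byte]];
        for (int bit = 0; bit < 8; ++bit)
            acc += a[byte * 8 + bit] * vals[bit];
    }
    return acc;
}
```

Gemma's kernel additionally kept 16 independent vector accumulators live across the loop, which hides FMA latency and avoids the memory traffic that limited the pointer-arithmetic approach.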

Takeaways

This experiment has obvious limits: the task is a single kernel with fixed dimensions and one hardware target, and some models may well have seen similar code patterns during training.

The setup is deliberately lean to test whether a minimal hardware-grounded feedback loop can turn an easy-to-write reference implementation into something much faster with very little human effort.

In that setting, the results are clear: fast benchmark feedback makes agentic kernel search effective, cheap and open models can participate surprisingly well, and in an iterative optimization loop, cost and latency matter just as much as raw model capability.
