We built the fastest local LLM inference engine for Apple Silicon. Written in Swift. Powered by MLX. Running on your Mac, iPad, and iPhone.
Axon runs quantized LLM inference using Apple's MLX framework — the same Metal-accelerated ML primitives that power mlx-lm. The difference: we're written in Swift, compiled to native code, with no Python runtime overhead. We run on iPhone.
Most local LLM tools for Apple Silicon use Python bindings (mlx-lm) or portable C++ engines (llama.cpp). Axon is different — it's built from the ground up on MLX, Apple's own ML framework, which means we get the same hardware-level optimizations as the official tools, but without Python's overhead.
The result: on an M4 Pro Mac Mini, Axon decodes up to 11% faster than mlx-lm on the same models. On iPhone, it's the only inference engine that can run 30B-parameter MoE models locally, thanks to flash streaming infrastructure built for exactly that.
Benchmarks on an Apple M4 Pro Mac Mini with 24 GB unified memory. Decode throughput (tok/s) measured over 50–100 generated tokens with greedy sampling; higher is better.
| Model | Axon | mlx-lm | vs mlx-lm |
|---|---|---|---|
| Qwen3-4B-4bit | 93.9 tok/s | 88.6 tok/s | +5.9% |
| Qwen3-8B-4bit | 54.6 tok/s | 55.5 tok/s | −1.6% (~tied) |
| Qwen3-14B-4bit | 29.3 tok/s | 26.4 tok/s | +11.0% |
| Qwen3-0.6B-4bit | 400.5 tok/s | — | draft model |
Every architectural decision is made to eliminate overhead in the per-token decode loop.
Written in Swift, compiled to native Metal. No Python interpreter dispatch, no GIL, no object allocation tax on every forward pass.
We use MLX's Metal kernels directly. MLX is Apple's official ML framework — same hardware-level optimizations as Core ML but with full control.
Hadamard rotation + 4-bit KV quantization. Near-lossless quality at 4-bit density — KV cache bandwidth cut in half, more headroom for model compute.
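The intuition can be sketched in a few lines of Swift. This is an illustrative CPU sketch, not Axon's Metal kernels: an orthonormal Hadamard rotation spreads outlier energy evenly across the vector, after which symmetric 4-bit quantization with per-group scales loses much less precision.

```swift
// Sketch only — Axon's real path runs on Metal; names here are illustrative.

// In-place fast Walsh–Hadamard transform; length must be a power of two.
// With orthonormal scaling, applying it twice recovers the original vector.
func hadamardRotate(_ v: inout [Float]) {
    let n = v.count
    let scale = 1.0 / Float(n).squareRoot()
    var h = 1
    while h < n {
        var i = 0
        while i < n {
            for j in i..<(i + h) {
                let a = v[j], b = v[j + h]
                v[j] = a + b
                v[j + h] = a - b
            }
            i += 2 * h
        }
        h *= 2
    }
    for k in 0..<n { v[k] *= scale }
}

// Symmetric 4-bit quantization: one scale per group of 32 values,
// codes clamped to the signed 4-bit range −8…7.
func quantize4bit(_ v: [Float], groupSize: Int = 32) -> (codes: [Int8], scales: [Float]) {
    var codes = [Int8](); var scales = [Float]()
    for start in stride(from: 0, to: v.count, by: groupSize) {
        let group = v[start..<min(start + groupSize, v.count)]
        let maxAbs = group.map { abs($0) }.max() ?? 0
        let scale = maxAbs / 7.0
        scales.append(scale)
        for x in group {
            let q = scale > 0 ? (x / scale).rounded() : 0
            codes.append(Int8(max(-8, min(7, q))))
        }
    }
    return (codes, scales)
}
```

Because the rotation is orthogonal, dequantizing and applying the same rotation again recovers the cache values with near-lossless error.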
Flash streaming (LLM in a Flash) — models larger than RAM stream weights from SSD per layer. 30B MoE fits on iPhone 16 Pro with 8GB RAM. Runs on Mac M-series chips too.
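The core mechanism can be sketched as on-demand reads of each layer's byte range from a weights file, so only one layer's slice needs to be resident at a time. The type and method names below are illustrative, not Axon's API:

```swift
import Foundation

// Sketch of per-layer weight streaming (illustrative names, not Axon's API):
// instead of keeping all weights resident, each layer reads its slice
// from SSD on demand during the forward pass.

struct LayerSlice {
    let offset: UInt64   // byte offset of this layer's weights in the file
    let length: Int      // byte length of this layer's weights
}

final class FlashWeightStore {
    private let handle: FileHandle
    private let layout: [LayerSlice]

    init(url: URL, layout: [LayerSlice]) throws {
        self.handle = try FileHandle(forReadingFrom: url)
        self.layout = layout
    }

    // Read one layer's weights; only this slice is in memory at a time.
    func loadLayer(_ index: Int) throws -> Data {
        let slice = layout[index]
        try handle.seek(toOffset: slice.offset)
        return handle.readData(ofLength: slice.length)
    }
}
```

A real implementation would overlap these reads with compute (prefetching layer N+1 while layer N runs) to hide SSD latency.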
INT4 / INT8 weight quantization + per-group scales. Matches mlx-lm model format — same safetensors files, no conversion needed.
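The group layout amounts to an affine dequantization, w ≈ q × scale + bias, with one scale and bias per group of codes — the scheme MLX-style quantized safetensors use. A minimal sketch (function name is illustrative):

```swift
// Sketch of group-wise affine dequantization (illustrative, not Axon's API).
// Each group of `groupSize` unsigned 4-bit codes shares one scale and one
// bias: w ≈ Float(code) * scale + bias.
func dequantize(codes: [UInt8], scales: [Float], biases: [Float], groupSize: Int) -> [Float] {
    var out = [Float](repeating: 0, count: codes.count)
    for i in 0..<codes.count {
        let g = i / groupSize          // which group this code belongs to
        out[i] = Float(codes[i]) * scales[g] + biases[g]
    }
    return out
}
```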
Draft-verification (speculative) decoding with smaller draft models. With K=1, a 0.6B draft for the 4B target reaches a near-perfect acceptance rate, giving small but consistent speedups.
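One decode step of the scheme looks roughly like this. The `draft` and `verify` closures are hypothetical stand-ins for the two models; under greedy sampling a draft token is accepted iff the target model predicts the same token at that position:

```swift
// Sketch of one draft–verification step with K draft tokens (greedy case).
// `draft` and `verify` are illustrative hooks for the 0.6B and 4B models.
func speculativeStep(context: [Int],
                     k: Int,
                     draft: ([Int]) -> Int,
                     verify: ([Int]) -> Int) -> [Int] {
    // 1. Draft model proposes k tokens autoregressively.
    var proposed: [Int] = []
    var ctx = context
    for _ in 0..<k {
        let t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    }
    // 2. Target model verifies each proposal; stop at the first mismatch.
    var accepted: [Int] = []
    var vctx = context
    for t in proposed {
        let target = verify(vctx)
        if target == t {
            accepted.append(t)       // draft token accepted "for free"
            vctx.append(t)
        } else {
            accepted.append(target)  // take the target's token and stop
            return accepted
        }
    }
    // 3. All k accepted: the target's verification pass yields a bonus token.
    accepted.append(verify(vctx))
    return accepted
}
```

Each step therefore emits between 1 and K+1 tokens per target-model pass, which is where the speedup comes from when acceptance is high.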
Full SwiftUI apps with model download, flash mode toggle, live debug log, and streaming token output — on every Apple platform.
Streams model weights from SSD — run 30B MoE models on an 8GB iPhone. Weights load per-layer as context grows.
Download models directly from HuggingFace. Background downloads with progress bar. No HuggingFace account needed.
Live inference diagnostics: load time, token timing, memory usage, error traces — all visible in-app.
Tokens appear as they're generated — no waiting for the full response. A tokens-per-second readout updates during generation.
Full macOS app with sidebar model selector, API server toggle, and keyboard shortcuts. Run locally or expose as a localhost API for other tools.
The same Axon engine powers a local API server — use any OpenAI client library to connect from other apps.
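Talking to the server from Swift needs nothing beyond a plain `URLRequest`, assuming the OpenAI-compatible `/v1/chat/completions` endpoint the server exposes; the port and model name below are illustrative:

```swift
import Foundation

// Sketch: build a chat-completion request for the local server.
// Port 8080 and the model name are assumptions, not Axon defaults.
func makeChatRequest(model: String, prompt: String) -> URLRequest {
    var req = URLRequest(url: URL(string: "http://localhost:8080/v1/chat/completions")!)
    req.httpMethod = "POST"
    req.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = [
        "model": model,
        "messages": [["role": "user", "content": prompt]],
        "stream": true
    ]
    req.httpBody = try? JSONSerialization.data(withJSONObject: body)
    return req
}
```

Send it with `URLSession.shared.dataTask(with:)`, or point any OpenAI client library at the same base URL.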
This is an internal research project. Benchmarks are measured on M4 Pro Mac Mini. Native apps for macOS, iPadOS, and iOS — available via TestFlight.