⚡ Axon — Research Project

LLM Inference
At Metal Speed.

We built the fastest local LLM inference engine for Apple Silicon. Written in Swift. Powered by MLX. Running on your Mac, iPad, and iPhone.

Swift + MLX Apple Silicon Metal GPU INT4 / INT8 Quantization TurboQuant KV Cache macOS iPadOS iOS
93.9 tok/s on Qwen3 4B
+11% vs mlx-lm on 14B
30B MoE runs on iPhone
Mac: native app + API server

A native inference engine
for Apple's stack.

Axon runs quantized LLM inference using Apple's MLX framework — the same Metal-accelerated ML primitives that power mlx-lm. The difference: we're written in Swift, compiled to native code, with no Python runtime overhead. We run on iPhone.

Most local LLM tools for Apple Silicon go through Python bindings (mlx-lm) or portable C/C++ engines (llama.cpp). Axon is different: it's built from the ground up on MLX, Apple's own ML framework, which means we get the same hardware-level optimizations as the official tools, but without Python's overhead.

The result: on M4 Pro Mac Mini, Axon decodes up to 11% faster than mlx-lm on the same models. On iPhone, it's the only inference engine that can run 30B-parameter MoE models locally — because we built the flash streaming infrastructure to match.

// Inference pipeline
Input: "Explain gravity" → Tokenize
MLX Embedding + Quantized GEMM
// Per transformer layer
Attention (RoPE + GQA) + SwiGLU FFN
INT8 / TurboQuant KV Cache
// Output
Metal GEMM — all matmuls on GPU
Sampling → Token output
MLX + Metal GPU backend
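The per-token loop above can be sketched in plain Swift. This is a toy illustration: `ToyModel` and its tiny vocabulary are invented stand-ins for the real Metal-backed forward pass, which goes through MLX kernels.

```swift
// Toy sketch of the per-token decode loop. `ToyModel` is a stand-in
// for the Metal-backed forward pass; the real engine calls MLX kernels.
struct ToyModel {
    // Returns fake logits over a tiny 8-token vocabulary.
    func logits(for tokens: [Int]) -> [Double] {
        let seed = tokens.reduce(0, &+)
        return (0..<8).map { Double(($0 &+ seed) % 5) }
    }
}

func greedyDecode(model: ToyModel, prompt: [Int], maxNewTokens: Int) -> [Int] {
    var tokens = prompt
    for _ in 0..<maxNewTokens {
        let l = model.logits(for: tokens)
        // Greedy sampling: take the argmax token.
        let next = l.indices.max(by: { l[$0] < l[$1] })!
        tokens.append(next)   // in the app, this token is streamed to the UI
    }
    return tokens
}

let out = greedyDecode(model: ToyModel(), prompt: [1, 2, 3], maxNewTokens: 4)
print(out)
```

The structure is what matters: one forward pass, one sample, one appended token per iteration, with nothing between iterations but native code.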

Beating the baseline.

Benchmarks on an Apple M4 Pro Mac Mini with 24 GB unified memory. Decode throughput (tok/s), 50–100 tokens, greedy sampling. Higher is better.

Model            | Axon         | mlx-lm      | vs mlx-lm
Qwen3-4B-4bit    | 93.9 tok/s   | 88.6 tok/s  | +5.9%
Qwen3-8B-4bit    | 54.6 tok/s   | 55.5 tok/s  | ~tied
Qwen3-14B-4bit   | 29.3 tok/s   | 26.4 tok/s  | +11.0%
Qwen3-0.6B-4bit  | 400.5 tok/s  | n/a         | draft model
Hardware: Apple M4 Pro Mac Mini, 24 GB unified memory · Date: 2026-03-29
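The headline numbers follow the methodology in the footer: a warm-up run first, then tok/s = tokens / total_time over the measured generation. A minimal sketch:

```swift
// How the reported throughput is derived, per the methodology footnote:
// tok/s = tokens / total_time, measured after a warm-up generation.
func tokensPerSecond(tokens: Int, seconds: Double) -> Double {
    seconds > 0 ? Double(tokens) / seconds : 0
}

// e.g. 100 tokens in 1.065 s of wall-clock decode time ≈ 93.9 tok/s
let rate = tokensPerSecond(tokens: 100, seconds: 1.065)
print(rate)
```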

Where the speed comes from.

Every architectural decision is aimed at eliminating overhead in the per-token decode loop.

No Python overhead

Written in Swift and compiled to native code that dispatches straight to Metal. No Python interpreter dispatch, no GIL, no object-allocation tax on every forward pass.

🧠

MLX-native

We use MLX's Metal kernels directly. MLX is Apple's official ML framework — same hardware-level optimizations as Core ML but with full control.

💾

TurboQuant KV Cache

Hadamard rotation + 4-bit KV quantization. Near-lossless quality at 4-bit density: roughly half the KV-cache bandwidth of INT8, leaving more headroom for model compute.
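A toy sketch of the idea: rotate a vector with an orthonormal Hadamard transform so outliers are spread across all dimensions, then quantize with a shared 4-bit scale. The transform and scale scheme here are a generic illustration, not the engine's Metal implementation.

```swift
// Toy TurboQuant-style round trip: Hadamard rotate → 4-bit quantize →
// dequantize → rotate back. Illustrative only; the engine does this in Metal.
func hadamard(_ x: [Double]) -> [Double] {
    // In-place fast Walsh–Hadamard transform; length must be a power of two.
    var a = x
    var h = 1
    while h < a.count {
        for i in stride(from: 0, to: a.count, by: h * 2) {
            for j in i..<(i + h) {
                let (u, v) = (a[j], a[j + h])
                a[j] = u + v
                a[j + h] = u - v
            }
        }
        h *= 2
    }
    let norm = 1.0 / Double(a.count).squareRoot()   // orthonormal scaling
    return a.map { $0 * norm }
}

func quantize4bit(_ x: [Double]) -> (codes: [Int], scale: Double) {
    let maxAbs = x.map(abs).max() ?? 1
    let s = maxAbs / 7.0                            // map values onto [-7, 7]
    let codes = x.map { Int(($0 / max(s, 1e-12)).rounded()) }
    return (codes, s)
}

let v: [Double] = [10.0, -0.2, 0.1, 0.05, -0.1, 0.2, -0.05, 0.15]  // one outlier
let rotated = hadamard(v)
let (codes, s) = quantize4bit(rotated)
// Orthonormal Hadamard is its own inverse, so rotating back restores v.
let restored = hadamard(codes.map { Double($0) * s })
let maxErr = zip(v, restored).map { abs($0.0 - $0.1) }.max()!
print(maxErr)
```

Without the rotation, the single outlier would dominate the shared scale and crush the small values; after rotation, energy is spread evenly and the 4-bit grid covers every dimension well.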

📱

Apple Silicon-native

Flash streaming (LLM in a Flash) — models larger than RAM stream weights from SSD per layer. 30B MoE fits on iPhone 16 Pro with 8GB RAM. Runs on Mac M-series chips too.
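The per-layer streaming idea can be sketched as a small LRU window over layer weights. `LayerStreamer` and its in-memory "disk" closure are invented for illustration; the real engine reads safetensors slices from SSD.

```swift
// Conceptual sketch of flash streaming: keep only a small window of
// layer weights resident, load each layer on demand, evict the oldest.
final class LayerStreamer {
    private var resident: [Int: [Float]] = [:]   // layer index → weights
    private var order: [Int] = []                // LRU order
    private let capacity: Int
    private let loadFromDisk: (Int) -> [Float]   // stand-in for SSD reads
    var loads = 0

    init(capacity: Int, loadFromDisk: @escaping (Int) -> [Float]) {
        self.capacity = capacity
        self.loadFromDisk = loadFromDisk
    }

    func weights(layer: Int) -> [Float] {
        if let w = resident[layer] { return w }  // cache hit: no SSD traffic
        loads += 1
        let w = loadFromDisk(layer)
        resident[layer] = w
        order.append(layer)
        if order.count > capacity {              // evict the oldest layer
            resident.removeValue(forKey: order.removeFirst())
        }
        return w
    }
}

let streamer = LayerStreamer(capacity: 2) { layer in
    [Float](repeating: Float(layer), count: 4)   // fake per-layer weights
}
// One forward pass over 4 layers touches each layer once.
for layer in 0..<4 { _ = streamer.weights(layer: layer) }
print(streamer.loads)   // 4 loads on the first pass (one per layer)
```

Only `capacity` layers ever occupy RAM at once, which is how a model larger than physical memory can still run.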

🔢

Quantization-aware

INT4 / INT8 weight quantization + per-group scales. Matches mlx-lm model format — same safetensors files, no conversion needed.

🚀

Speculative decoding

Draft-then-verify decoding with a smaller draft model. With K=1, a 0.6B draft paired with the 4B target reaches a near-perfect acceptance rate, giving a modest speedup.
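A toy greedy version of one draft-then-verify step, with stand-in closures instead of real models (in practice the target scores all K draft positions in a single batched forward pass rather than one call per position):

```swift
// Toy draft-then-verify step (greedy case): the draft proposes K tokens,
// the target checks them; the accepted prefix is kept, and the first
// mismatch is replaced by the target's own token.
func specStep(draft: ([Int]) -> Int,
              target: ([Int]) -> Int,
              context: [Int], k: Int) -> [Int] {
    var tokens = context
    var proposals: [Int] = []
    for _ in 0..<k {                       // draft K tokens cheaply
        proposals.append(draft(tokens + proposals))
    }
    for p in proposals {
        let verified = target(tokens)      // target's token at this position
        if verified == p {
            tokens.append(p)               // accept the draft token
        } else {
            tokens.append(verified)        // reject: take target token, stop
            return tokens
        }
    }
    return tokens
}

// Hypothetical models that agree only on even-length contexts.
let draft:  ([Int]) -> Int = { $0.count % 2 == 0 ? 1 : 2 }
let target: ([Int]) -> Int = { $0.count % 2 == 0 ? 1 : 3 }
let out = specStep(draft: draft, target: target, context: [0, 0], k: 2)
print(out)   // [0, 0, 1, 3]: first draft token accepted, second replaced
```

When the acceptance rate is near 1, most tokens cost only a draft-model forward pass plus a shared verification pass, which is where the speedup comes from.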

Runs on your Mac, iPad,
and iPhone. No cloud. No API.

Full SwiftUI apps with model download, flash mode toggle, live debug log, and streaming token output — on every Apple platform.

Axon — Mac App · sample chat: "Write a Rust function that parses JSON"
Axon — iPadOS · sample chats: "Summarize this paper", "Compare GPT-4 vs Claude"
Axon — iPhone (Qwen3-30B-A3B MoE · 4bit) · sample chats: "What's the capital of Japan?", "Explain quantum entanglement"
📦

Flash Mode

Streams model weights from SSD to run 30B MoE models on an 8 GB iPhone. Weights are loaded per layer, on demand.

📥

Model Download

Download models directly from HuggingFace. Background downloads with progress bar. No HuggingFace account needed.

🐛

Debug Log Panel

Live inference diagnostics: load time, token timing, memory usage, error traces — all visible in-app.

🔄

Streaming Output

Tokens appear as they're generated — no waiting for the full response. Token-per-second readout during generation.

🖥️

Native Mac App

Full macOS app with sidebar model selector, API server toggle, and keyboard shortcuts. Run locally or expose as a localhost API for other tools.

📡

OpenAI-compatible API

The same Axon engine powers a local API server — use any OpenAI client library to connect from other apps.
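For illustration, here is the chat-completions request body an OpenAI-style client would send to the local server. The model name comes from the benchmark table above; the localhost port in the comment is an assumption, so check the app's server toggle for the actual address.

```swift
import Foundation

// Request body in the OpenAI chat-completions shape, to POST to e.g.
// http://localhost:8080/v1/chat/completions (port is an assumption).
struct ChatMessage: Codable { let role: String; let content: String }
struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
    let stream: Bool                 // true → server streams tokens back
}

let body = ChatRequest(
    model: "Qwen3-4B-4bit",
    messages: [ChatMessage(role: "user", content: "Explain gravity")],
    stream: true
)
let data = try! JSONEncoder().encode(body)
print(String(data: data, encoding: .utf8)!)
```

Because the wire format matches, any existing OpenAI client library can talk to the local engine by pointing its base URL at localhost.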

Built with Apple's stack.

Swift 5.12
MLX 0.31.2
Metal GPU
MLXNN kernels
Vapor API server
SwiftUI iOS app
safetensors format
HuggingFace tokenizer
INT4 / INT8 quantization
TurboQuant rotation
SnapKV / flash decoding
MoE expert streaming
Internal Research

Fastest inference.
On your device.

This is an internal research project. Benchmarks are measured on M4 Pro Mac Mini. Native apps for macOS, iPadOS, and iOS — available via TestFlight.

// Benchmark methodology: greedy sampling, 50-100 tokens, warm-up run, tok/s = tokens / total_time