⚡ Axon — Research Project

LLM Inference
At Metal Speed.

We built the fastest local LLM inference engine for Apple Silicon. Written in Swift. Powered by MLX. Running on your Mac, iPad, and iPhone.

Swift + MLX Apple Silicon Metal GPU INT4 / INT8 Quantization TurboQuant KV Cache macOS iPadOS iOS
93.9 tok/s on Qwen3 4B
+11% vs mlx-lm on 14B
30B MoE runs on iPhone
Mac: native app + API server

A native inference engine
for Apple's stack.

Axon runs quantized LLM inference using Apple's MLX framework — the same Metal-accelerated ML primitives that power mlx-lm. The difference: we're written in Swift, compiled to native code, with no Python runtime overhead. We run on iPhone.

Most local LLM tools for Apple Silicon go through Python bindings (mlx-lm) or portable C/C++ engines (llama.cpp). Axon is different: it's built from the ground up on MLX, Apple's own ML framework, which means we get the same hardware-level optimizations as the official tools, but without Python's overhead.

The result: on M4 Pro Mac Mini, Axon decodes up to 11% faster than mlx-lm on the same models. On iPhone, it's the only inference engine that can run 30B-parameter MoE models locally — because we built the flash streaming infrastructure to match.

// Inference pipeline
Input: "Explain gravity" → Tokenize
MLX Embedding + Quantized GEMM
// Per transformer layer
Attention (RoPE + GQA) + SwiGLU FFN
INT8 / TurboQuant KV Cache
// Output
Metal GEMM — all matmuls on GPU
Sampling → Token output
MLX + Metal GPU backend
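The per-token loop above can be sketched in plain Swift. This is a toy illustration: `ToyModel` and its tiny vocabulary are invented stand-ins for the real Metal-backed forward pass, which goes through MLX kernels.

```swift
// Toy sketch of the per-token decode loop. `ToyModel` is a stand-in
// for the Metal-backed forward pass; the real engine calls MLX kernels.
struct ToyModel {
    // Returns fake logits over a tiny 8-token vocabulary.
    func logits(for tokens: [Int]) -> [Double] {
        let seed = tokens.reduce(0, &+)
        return (0..<8).map { Double(($0 &+ seed) % 5) }
    }
}

func greedyDecode(model: ToyModel, prompt: [Int], maxNewTokens: Int) -> [Int] {
    var tokens = prompt
    for _ in 0..<maxNewTokens {
        let l = model.logits(for: tokens)
        // Greedy sampling: take the argmax token.
        let next = l.indices.max(by: { l[$0] < l[$1] })!
        tokens.append(next)   // in the app, this token is streamed to the UI
    }
    return tokens
}

let out = greedyDecode(model: ToyModel(), prompt: [1, 2, 3], maxNewTokens: 4)
print(out)
```

The structure is what matters: one forward pass, one sample, one appended token per iteration, with nothing between iterations but native code.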

Beating the baseline.

Benchmarks on an Apple M4 Pro Mac Mini with 24 GB unified memory. Decode throughput (tok/s), 50–100 tokens, greedy sampling. Higher is better.

Model            | Axon         | mlx-lm      | vs mlx-lm
Qwen3-4B-4bit    | 93.9 tok/s   | 88.6 tok/s  | +5.9%
Qwen3-8B-4bit    | 54.6 tok/s   | 55.5 tok/s  | ~tied
Qwen3-14B-4bit   | 29.3 tok/s   | 26.4 tok/s  | +11.0%
Qwen3-0.6B-4bit  | 400.5 tok/s  | n/a         | draft model
Hardware: Apple M4 Pro Mac Mini, 24 GB unified memory · Date: 2026-03-29
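The headline numbers follow the methodology in the footer: a warm-up run first, then tok/s = tokens / total_time over the measured generation. A minimal sketch:

```swift
// How the reported throughput is derived, per the methodology footnote:
// tok/s = tokens / total_time, measured after a warm-up generation.
func tokensPerSecond(tokens: Int, seconds: Double) -> Double {
    seconds > 0 ? Double(tokens) / seconds : 0
}

// e.g. 100 tokens in 1.065 s of wall-clock decode time ≈ 93.9 tok/s
let rate = tokensPerSecond(tokens: 100, seconds: 1.065)
print(rate)
```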

Where the speed comes from.

Every architectural decision is aimed at eliminating overhead in the per-token decode loop.

No Python overhead

Written in Swift and compiled to native code that dispatches straight to Metal. No Python interpreter dispatch, no GIL, no object-allocation tax on every forward pass.

🧠

MLX-native

We use MLX's Metal kernels directly. MLX is Apple's official ML framework — same hardware-level optimizations as Core ML but with full control.

💾

TurboQuant KV Cache

Hadamard rotation + 4-bit KV quantization. Near-lossless quality at 4-bit density: roughly half the KV-cache bandwidth of INT8, leaving more headroom for model compute.
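A toy sketch of the idea: rotate a vector with an orthonormal Hadamard transform so outliers are spread across all dimensions, then quantize with a shared 4-bit scale. The transform and scale scheme here are a generic illustration, not the engine's Metal implementation.

```swift
// Toy TurboQuant-style round trip: Hadamard rotate → 4-bit quantize →
// dequantize → rotate back. Illustrative only; the engine does this in Metal.
func hadamard(_ x: [Double]) -> [Double] {
    // In-place fast Walsh–Hadamard transform; length must be a power of two.
    var a = x
    var h = 1
    while h < a.count {
        for i in stride(from: 0, to: a.count, by: h * 2) {
            for j in i..<(i + h) {
                let (u, v) = (a[j], a[j + h])
                a[j] = u + v
                a[j + h] = u - v
            }
        }
        h *= 2
    }
    let norm = 1.0 / Double(a.count).squareRoot()   // orthonormal scaling
    return a.map { $0 * norm }
}

func quantize4bit(_ x: [Double]) -> (codes: [Int], scale: Double) {
    let maxAbs = x.map(abs).max() ?? 1
    let s = maxAbs / 7.0                            // map values onto [-7, 7]
    let codes = x.map { Int(($0 / max(s, 1e-12)).rounded()) }
    return (codes, s)
}

let v: [Double] = [10.0, -0.2, 0.1, 0.05, -0.1, 0.2, -0.05, 0.15]  // one outlier
let rotated = hadamard(v)
let (codes, s) = quantize4bit(rotated)
// Orthonormal Hadamard is its own inverse, so rotating back restores v.
let restored = hadamard(codes.map { Double($0) * s })
let maxErr = zip(v, restored).map { abs($0.0 - $0.1) }.max()!
print(maxErr)
```

Without the rotation, the single outlier would dominate the shared scale and crush the small values; after rotation, energy is spread evenly and the 4-bit grid covers every dimension well.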

📱

Apple Silicon-native

Flash streaming (LLM in a Flash) — models larger than RAM stream weights from SSD per layer. 30B MoE fits on iPhone 16 Pro with 8GB RAM. Runs on Mac M-series chips too.
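The per-layer streaming idea can be sketched as a small LRU window over layer weights. `LayerStreamer` and its in-memory "disk" closure are invented for illustration; the real engine reads safetensors slices from SSD.

```swift
// Conceptual sketch of flash streaming: keep only a small window of
// layer weights resident, load each layer on demand, evict the oldest.
final class LayerStreamer {
    private var resident: [Int: [Float]] = [:]   // layer index → weights
    private var order: [Int] = []                // LRU order
    private let capacity: Int
    private let loadFromDisk: (Int) -> [Float]   // stand-in for SSD reads
    var loads = 0

    init(capacity: Int, loadFromDisk: @escaping (Int) -> [Float]) {
        self.capacity = capacity
        self.loadFromDisk = loadFromDisk
    }

    func weights(layer: Int) -> [Float] {
        if let w = resident[layer] { return w }  // cache hit: no SSD traffic
        loads += 1
        let w = loadFromDisk(layer)
        resident[layer] = w
        order.append(layer)
        if order.count > capacity {              // evict the oldest layer
            resident.removeValue(forKey: order.removeFirst())
        }
        return w
    }
}

let streamer = LayerStreamer(capacity: 2) { layer in
    [Float](repeating: Float(layer), count: 4)   // fake per-layer weights
}
// One forward pass over 4 layers touches each layer once.
for layer in 0..<4 { _ = streamer.weights(layer: layer) }
print(streamer.loads)   // 4 loads on the first pass (one per layer)
```

Only `capacity` layers ever occupy RAM at once, which is how a model larger than physical memory can still run.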

🔢

Quantization-aware

INT4 / INT8 weight quantization + per-group scales. Matches mlx-lm model format — same safetensors files, no conversion needed.

🚀

Speculative decoding

Draft-then-verify decoding with a smaller draft model. With K=1, a 0.6B draft paired with the 4B target reaches a near-perfect acceptance rate, giving a modest speedup.
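A toy greedy version of one draft-then-verify step, with stand-in closures instead of real models (in practice the target scores all K draft positions in a single batched forward pass rather than one call per position):

```swift
// Toy draft-then-verify step (greedy case): the draft proposes K tokens,
// the target checks them; the accepted prefix is kept, and the first
// mismatch is replaced by the target's own token.
func specStep(draft: ([Int]) -> Int,
              target: ([Int]) -> Int,
              context: [Int], k: Int) -> [Int] {
    var tokens = context
    var proposals: [Int] = []
    for _ in 0..<k {                       // draft K tokens cheaply
        proposals.append(draft(tokens + proposals))
    }
    for p in proposals {
        let verified = target(tokens)      // target's token at this position
        if verified == p {
            tokens.append(p)               // accept the draft token
        } else {
            tokens.append(verified)        // reject: take target token, stop
            return tokens
        }
    }
    return tokens
}

// Hypothetical models that agree only on even-length contexts.
let draft:  ([Int]) -> Int = { $0.count % 2 == 0 ? 1 : 2 }
let target: ([Int]) -> Int = { $0.count % 2 == 0 ? 1 : 3 }
let out = specStep(draft: draft, target: target, context: [0, 0], k: 2)
print(out)   // [0, 0, 1, 3]: first draft token accepted, second replaced
```

When the acceptance rate is near 1, most tokens cost only a draft-model forward pass plus a shared verification pass, which is where the speedup comes from.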

Runs on your Mac, iPad,
and iPhone. No cloud. No API.

Full SwiftUI apps with model download, flash mode toggle, live debug log, and streaming token output — on every Apple platform.

Axon — Mac App · sample chat: "Write a Rust function that parses JSON"
Axon — iPadOS · sample chats: "Summarize this paper", "Compare GPT-4 vs Claude"
Axon — iPhone (Qwen3-30B-A3B MoE · 4bit) · sample chats: "What's the capital of Japan?", "Explain quantum entanglement"
📦

Flash Mode

Streams model weights from SSD to run 30B MoE models on an 8 GB iPhone. Weights are loaded per layer, on demand.

📥

Model Download

Download models directly from HuggingFace. Background downloads with progress bar. No HuggingFace account needed.

🐛

Debug Log Panel

Live inference diagnostics: load time, token timing, memory usage, error traces — all visible in-app.

🔄

Streaming Output

Tokens appear as they're generated — no waiting for the full response. Token-per-second readout during generation.

🖥️

Native Mac App

Full macOS app with sidebar model selector, API server toggle, and keyboard shortcuts. Run locally or expose as a localhost API for other tools.

📡

OpenAI-compatible API

The same Axon engine powers a local API server — use any OpenAI client library to connect from other apps.
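For illustration, here is the chat-completions request body an OpenAI-style client would send to the local server. The model name comes from the benchmark table above; the localhost port in the comment is an assumption, so check the app's server toggle for the actual address.

```swift
import Foundation

// Request body in the OpenAI chat-completions shape, to POST to e.g.
// http://localhost:8080/v1/chat/completions (port is an assumption).
struct ChatMessage: Codable { let role: String; let content: String }
struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
    let stream: Bool                 // true → server streams tokens back
}

let body = ChatRequest(
    model: "Qwen3-4B-4bit",
    messages: [ChatMessage(role: "user", content: "Explain gravity")],
    stream: true
)
let data = try! JSONEncoder().encode(body)
print(String(data: data, encoding: .utf8)!)
```

Because the wire format matches, any existing OpenAI client library can talk to the local engine by pointing its base URL at localhost.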

Built with Apple's stack.

Swift 5.12
MLX 0.31.2
Metal GPU
MLXNN kernels
Vapor API server
SwiftUI iOS app
safetensors format
HuggingFace tokenizer
INT4 / INT8 quantization
TurboQuant rotation
SnapKV / flash decoding
MoE expert streaming
Internal Research

Fastest inference.
On your device.

This is an internal research project. Benchmarks are measured on M4 Pro Mac Mini. Native apps for macOS, iPadOS, and iOS — available via TestFlight.

// Benchmark methodology: greedy sampling, 50-100 tokens, warm-up run, tok/s = tokens / total_time