June 2026 · On-device AI

AFM 3 Core Advanced: a 20-billion-parameter model that runs on a phone

When Apple announced the AFM 3 family earlier this month, one number travelled fast: a 20-billion-parameter model, running on device. The more useful story is how — and the question every app developer asked within the hour: what about the phones people already own?

The lineup

Apple shipped four models under a single framework, scaling from the chip in your pocket up to a datacenter:

AFM 3 · models
Core: The compact on-device model, the default for most local tasks.
Core Advanced: A ~20B sparse mixture-of-experts. All experts stay resident; only ~1–4B parameters activate per prompt. On-device, gated to the A19 Pro.
Cloud: Runs on Apple Silicon through Private Cloud Compute.
Cloud Pro: The largest tier, reportedly served on NVIDIA GPUs in Google Cloud, distilled with help from Gemini.

Why Core Advanced needs an A19 Pro

"20B that runs on a phone" is true, but easy to misread. A mixture-of-experts model only computes a fraction of its weights per token — here, a few billion — which keeps the compute per token modest. That part is friendly to mobile silicon.

The catch is everywhere else. The full 20B still has to be resident in unified memory, and the device has to stream the selected experts' weights out of that memory fast enough to keep generation flowing. That's a memory-capacity and memory-bandwidth problem, not a compute one. Older Neural Engines can't sustain the throughput, older devices don't have the memory headroom, and no amount of quantization cleverness closes that gap. So the A19 Pro requirement isn't gatekeeping; it's a real hardware ceiling.
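
The compute-versus-memory split above can be put into rough numbers. The following is a minimal sketch, not AFM 3's actual design: the expert count, per-expert size, shared-weight size, top-k value, 4-bit weights, and 30 tokens per second are all hypothetical, chosen only so the totals land near the ~20B total and few-billion active figures quoted in this post. It also ignores the KV cache, activations, and everything else the OS keeps in memory.

```swift
import Foundation

/// Indices of the `k` highest-scoring experts for one token (top-k gating).
func topKExperts(scores: [Double], k: Int) -> [Int] {
    let ranked = scores.indices.sorted { scores[$0] > scores[$1] }
    return Array(ranked.prefix(k))
}

/// Bytes needed to hold `parameters` weights at `bitsPerWeight`.
func weightBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8
}

// Hypothetical layout (an assumption): 32 experts of 0.6B each plus 0.8B shared weights.
let expertCount = 32, k = 2
let perExpert = 0.6e9, shared = 0.8e9
let total = shared + perExpert * Double(expertCount)  // 20e9 parameters resident
let active = shared + perExpert * Double(k)           // 2e9 parameters computed per token

// Made-up router scores for one token; only the chosen experts are computed.
let chosen = topKExperts(scores: [0.1, 0.7, 0.05, 0.9], k: k)  // experts 3 and 1

// Assumed 4-bit weights and 30 tokens per second (both illustrative).
let residentGB = weightBytes(parameters: total, bitsPerWeight: 4) / 1e9       // 10 GB must stay in memory
let readPerTokenGB = weightBytes(parameters: active, bitsPerWeight: 4) / 1e9  // 1 GB read per token
let bandwidthGBps = readPerTokenGB * 30                                       // 30 GB/s sustained

print("resident: \(residentGB) GB, per-token read: \(readPerTokenGB) GB, bandwidth: \(bandwidthGBps) GB/s")
```

Under these assumptions the per-token compute covers a tenth of the weights, while the memory footprint is set by all of them. That asymmetry is the reason the constraint shows up as capacity and bandwidth rather than raw compute.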

The question that actually matters

Most apps don't ship to a fleet of A19 Pros. They ship to whatever's in people's pockets — iPhone 13s, M1 Macs, devices that will never run Core Advanced. And "bring AFM 3 to older hardware" isn't achievable in the literal sense: the weights aren't redistributable, and the on-device runtime is tied to Neural Engine features that simply don't exist on older silicon.

The practical version of the question has a clean answer: stop designing around the model, and design around capability detection.

The pattern

Treat on-device AI as one implementation behind an interface — never the only path. Detect the device's AI capability at launch, and route accordingly:

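// one interface, two (or more) implementations
import Foundation
import FoundationModels

protocol LLMClient { func respond(to prompt: String) async throws -> String }

struct OnDeviceClient: LLMClient { // Foundation Models framework
    func respond(to prompt: String) async throws -> String { try await LanguageModelSession().respond(to: prompt).content }
}
struct RemoteClient: LLMClient { // your backend / hosted model
    let endpoint: URL
    func respond(to prompt: String) async throws -> String { throw URLError(.unsupportedURL) } // placeholder: send `prompt` to `endpoint` here
}

// isAvailable is false when the device or OS can't run the on-device model
let client: any LLMClient = SystemLanguageModel.default.isAvailable
    ? OnDeviceClient()
    : RemoteClient(endpoint: URL(string: "https://api.enni.example")!)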

At runtime that resolves to three tiers:
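
One way to sketch that resolution, assuming the three tiers are Core Advanced on device, Core on device, and a remote fallback. The tier names, the `DeviceCapabilities` struct, and both of its flags are hypothetical; in particular, this post does not say how an app would detect Core Advanced support, so `supportsCoreAdvanced` is a stand-in for whatever check the platform actually exposes.

```swift
/// Hypothetical tiers an app might route between.
enum AITier: Equatable { case coreAdvanced, core, remote }

/// Hypothetical snapshot of what was detected at launch.
struct DeviceCapabilities {
    var onDeviceModelAvailable: Bool  // e.g. filled from the framework's availability check
    var supportsCoreAdvanced: Bool    // assumption: some signal that the larger model can run
}

/// Picks the most capable tier the device can serve, falling back to remote.
func resolveTier(_ caps: DeviceCapabilities) -> AITier {
    guard caps.onDeviceModelAvailable else { return .remote }
    return caps.supportsCoreAdvanced ? .coreAdvanced : .core
}

// Example: a device with no on-device model at all still gets an answer path.
let olderPhone = DeviceCapabilities(onDeviceModelAvailable: false, supportsCoreAdvanced: false)
print(resolveTier(olderPhone)) // prints "remote"
```

Keeping the decision in one pure function makes it easy to unit-test every device class without owning the hardware.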

The result: the app still feels AI-powered on an iPhone 13, and there's one code path doing it. The fallback isn't a compromise — it's the architecture.
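
If the fallback is part of the architecture, the same interface can also cover failures at call time, not only at launch. The sketch below is one possible shape, not something this post prescribes: `FallbackClient`, `FailingClient`, `EchoClient`, and `UnavailableError` are hypothetical names, the stand-in clients exist only so the example runs without a model or a network, and the `Sendable` requirement on the protocol is an addition for concurrency safety.

```swift
protocol LLMClient: Sendable {
    func respond(to prompt: String) async throws -> String
}

/// Tries `primary` first and falls back to `secondary` if it throws.
struct FallbackClient: LLMClient {
    let primary: any LLMClient
    let secondary: any LLMClient

    func respond(to prompt: String) async throws -> String {
        do {
            return try await primary.respond(to: prompt)
        } catch {
            return try await secondary.respond(to: prompt)
        }
    }
}

// Hypothetical stand-ins for the on-device and remote clients.
struct UnavailableError: Error {}

struct FailingClient: LLMClient {
    func respond(to prompt: String) async throws -> String { throw UnavailableError() }
}

struct EchoClient: LLMClient {
    let label: String
    func respond(to prompt: String) async throws -> String { "\(label): \(prompt)" }
}

let fallbackClient = FallbackClient(primary: FailingClient(), secondary: EchoClient(label: "remote"))
let reply = try await fallbackClient.respond(to: "hello")
print(reply) // prints "remote: hello"
```

A production version would probably be more selective about which errors trigger the fallback, for example by rethrowing cancellation instead of retrying remotely.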

What we take from it

The lesson under the launch is the one we keep relearning building on-device systems: the binding constraint is rarely raw compute. It's memory residency, bandwidth, and the honest ceiling of the hardware in the user's hand. Designing for that ceiling — with a capability-detection layer instead of a single happy path — is most of what separates a demo from a product.

Written by Enni Technologies. We build and research applied AI, with a particular interest in efficient on-device inference.