AFM 3 Core Advanced: a 20-billion-parameter model that runs on a phone
When Apple announced the AFM 3 family earlier this month, one number travelled fast: a 20-billion-parameter model, running on device. The more useful story is how — and the question every app developer asked within the hour: what about the phones people already own?
The lineup
Apple shipped four models under a single framework, scaling from the chip in your pocket up to a datacenter:
- Core: the compact on-device model, and the default for most local tasks.
- Core Advanced: a ~20B sparse mixture-of-experts. All experts stay resident; only ~1–4B parameters activate per token. On-device, gated to the A19 Pro.
- Cloud: runs on Apple Silicon through Private Cloud Compute.
- Cloud Pro: the largest tier, reportedly served on NVIDIA GPUs in Google Cloud and distilled with help from Gemini.
Why Core Advanced needs an A19 Pro
"20B that runs on a phone" is true, but easy to misread. A mixture-of-experts model only computes a fraction of its weights per token — here, a few billion — which keeps the compute per token modest. That part is friendly to mobile silicon.
The catch is everywhere else. The full 20B still has to be resident in unified memory, and the device has to stream the selected experts' weights from that memory to the compute units fast enough to keep generation flowing. That's a memory-capacity and memory-bandwidth problem, not a compute one. Older Neural Engines can't sustain the throughput, and older devices don't have the headroom. No amount of quantization cleverness closes that gap. So the A19 Pro requirement isn't gatekeeping. It's a real hardware ceiling.
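The argument above can be put into rough numbers. The sketch below is back-of-envelope arithmetic, not a description of how AFM 3 is actually stored or scheduled: the bit widths and the 20 tokens-per-second target are assumptions chosen for illustration, the function names are hypothetical, and the bandwidth figure assumes every active weight is read from memory once per token. It also ignores the KV cache, activations, and everything else the system keeps in memory.

```swift
// Back-of-envelope math for a sparse mixture-of-experts model.
// Bit widths and the tokens-per-second target are illustrative assumptions.

/// Rough compute per token: one multiply and one add per active weight.
func flopsPerToken(activeParameters: Double) -> Double {
    2 * activeParameters
}

/// Bytes needed to keep `parameters` weights resident at `bitsPerWeight`.
func residentBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8
}

/// Bytes per second of weight traffic, assuming each active weight
/// is read from memory once for every generated token.
func streamedBytesPerSecond(activeParameters: Double,
                            bitsPerWeight: Double,
                            tokensPerSecond: Double) -> Double {
    residentBytes(parameters: activeParameters, bitsPerWeight: bitsPerWeight) * tokensPerSecond
}

let giga = 1e9  // decimal giga, used for GB and GFLOPs alike

// Compute: scales with the active parameters, not the full 20B.
print("sparse, 4B active: \(flopsPerToken(activeParameters: 4e9) / giga) GFLOPs per token")
print("dense 20B:         \(flopsPerToken(activeParameters: 20e9) / giga) GFLOPs per token")

// Capacity: all 20B weights stay resident, whatever the active count is.
for bits in [8.0, 4.0, 2.0] {
    let gb = residentBytes(parameters: 20e9, bitsPerWeight: bits) / giga
    print("\(Int(bits))-bit weights: \(gb) GB resident")
}

// Bandwidth: 1B to 4B active parameters at 4 bits, at an assumed 20 tokens/s.
let low = streamedBytesPerSecond(activeParameters: 1e9, bitsPerWeight: 4, tokensPerSecond: 20) / giga
let high = streamedBytesPerSecond(activeParameters: 4e9, bitsPerWeight: 4, tokensPerSecond: 20) / giga
print("weight traffic: \(low) to \(high) GB/s")
```

Under these assumptions the resident footprint works out to 20 GB, 10 GB, and 5 GB at 8, 4, and 2 bits per weight, while compute per token stays at a fifth of the dense figure or less. Sparsity shrinks the compute, but it does nothing for the footprint.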
The question that actually matters
Most apps don't ship to a fleet of A19 Pros. They ship to whatever's in people's pockets — iPhone 13s, M1 Macs, devices that will never run Core Advanced. And "bring AFM 3 to older hardware" isn't achievable in the literal sense: the weights aren't redistributable, and the on-device runtime is tied to Neural Engine features that simply don't exist on older silicon.
The practical version of the question has a clean answer: stop designing around the model, and design around capability detection.
The pattern
Treat on-device AI as one implementation behind an interface — never the only path. Detect the device's AI capability at launch, and route accordingly:
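    // One interface, two (or more) implementations.
    import Foundation
    import FoundationModels

    protocol LLMClient {
        func respond(to prompt: String) async throws -> String
    }

    struct OnDeviceClient: LLMClient {  // Foundation Models framework
        func respond(to prompt: String) async throws -> String {
            try await LanguageModelSession().respond(to: prompt).content
        }
    }

    struct RemoteClient: LLMClient {  // your backend / hosted model
        let endpoint: URL
        func respond(to prompt: String) async throws -> String {
            var request = URLRequest(url: endpoint)
            request.httpMethod = "POST"
            request.httpBody = Data(prompt.utf8)  // wire format is a placeholder
            let (data, _) = try await URLSession.shared.data(for: request)
            return String(decoding: data, as: UTF8.self)
        }
    }

    let client: LLMClient = SystemLanguageModel.default.isAvailable
        ? OnDeviceClient()
        : RemoteClient(endpoint: URL(string: "https://api.enni.example")!)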
At runtime that resolves to three tiers:
- Apple-Intelligence-capable device → the Foundation Models framework, which itself routes to AFM 3 Cloud over Private Cloud Compute when the local model isn't enough. Your code doesn't change; Apple decides where it runs.
- Below that baseline → your own backend: a hosted model, or on Mac, a quantized open model served locally via MLX or llama.cpp.
- Consent declined or rate-limited → the same remote fallback, which is why you want it even on supported hardware.
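The three cases above reduce to a small decision function. This is a minimal sketch with hypothetical names: `Capability`, `Route`, and `route(for:)` are not part of any Apple SDK, and in a real app the first flag would presumably come from the Foundation Models availability check while the other two come from your own state.

```swift
// Hypothetical capability snapshot; none of these names come from Apple's SDK.
struct Capability {
    var onDeviceModelAvailable: Bool  // e.g. the result of a system availability check
    var userConsented: Bool           // the user agreed to the system AI path
    var rateLimited: Bool             // the system path is currently throttled
}

enum Route {
    case foundationModels  // Apple decides: local model or Private Cloud Compute
    case remoteFallback    // your backend, or a locally served open model on Mac
}

/// Picks a route for one request from the current capability snapshot.
func route(for capability: Capability) -> Route {
    // Below the baseline: the system model is not there at all.
    guard capability.onDeviceModelAvailable else { return .remoteFallback }
    // Supported hardware, but consent was declined or the system path is throttled.
    guard capability.userConsented, !capability.rateLimited else { return .remoteFallback }
    return .foundationModels
}

let supported = Capability(onDeviceModelAvailable: true, userConsented: true, rateLimited: false)
let olderPhone = Capability(onDeviceModelAvailable: false, userConsented: true, rateLimited: false)
let throttled = Capability(onDeviceModelAvailable: true, userConsented: true, rateLimited: true)

print(route(for: supported))   // foundationModels
print(route(for: olderPhone))  // remoteFallback
print(route(for: throttled))   // remoteFallback
```

Keeping the decision in a pure function means it can be unit-tested without a device, and re-evaluated per request, not only at launch.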
The result: the app still feels AI-powered on an iPhone 13, and there's one code path doing it. The fallback isn't a compromise — it's the architecture.
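Some failures, rate limiting among them, only show up when a request is actually made, so a launch-time check may not be enough on its own. One way to handle that, sketched here under assumptions, is a wrapper that tries the preferred client and hands the same prompt to the fallback if the call throws. `FallbackClient`, `StubClient`, and `ClientError` are hypothetical names; the protocol is restated (with `Sendable` added) so the block stands alone, and the stubs exist only to exercise the wrapper without a model or a network.

```swift
// A sketch of runtime fallback behind the same interface.
protocol LLMClient: Sendable {
    func respond(to prompt: String) async throws -> String
}

enum ClientError: Error { case unavailable }

/// Tries `primary` first and passes the same prompt to `fallback` if it throws.
struct FallbackClient: LLMClient {
    let primary: any LLMClient
    let fallback: any LLMClient

    func respond(to prompt: String) async throws -> String {
        do {
            return try await primary.respond(to: prompt)
        } catch {
            // Catches every error for brevity; a real app would likely
            // fall back only on specific, recoverable failures.
            return try await fallback.respond(to: prompt)
        }
    }
}

/// Stand-in client: echoes the prompt with its name, or fails on demand.
struct StubClient: LLMClient {
    let name: String
    let fails: Bool

    func respond(to prompt: String) async throws -> String {
        if fails { throw ClientError.unavailable }
        return "\(name): \(prompt)"
    }
}

let client = FallbackClient(
    primary: StubClient(name: "on-device", fails: true),
    fallback: StubClient(name: "remote", fails: false)
)
let reply = try await client.respond(to: "hello")
print(reply)  // remote: hello
```

Because the wrapper conforms to the same protocol, call sites stay unchanged whichever client ends up answering.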
What we take from it
The lesson under the launch is the one we keep relearning building on-device systems: the binding constraint is rarely raw compute. It's memory residency, bandwidth, and the honest ceiling of the hardware in the user's hand. Designing for that ceiling — with a capability-detection layer instead of a single happy path — is most of what separates a demo from a product.