The Redis Creator Wrote a DeepSeek Inference Engine in Pure Metal

## What happened

Salvatore Sanfilippo, better known as antirez and the creator of Redis, released ds4, a local inference engine purpose-built for running DeepSeek 4 Flash on Apple Metal GPUs. The project, hosted at [github.com/antirez/ds4](https://github.com/antirez/ds4), reached 335 points on Hacker News, drawing attention from both the systems programming and ML inference communities.

This isn't antirez's first foray outside the database world. After stepping back from Redis in 2020, he's worked on projects ranging from a Telegram bot framework to neural network experiments. But ds4 represents something more deliberate: a systems programmer's answer to the question of whether general-purpose inference runtimes are leaving performance on the table.

The project targets Apple's Metal API directly — the low-level GPU compute framework that powers everything from Final Cut Pro rendering to Core ML inference on Macs. By writing specifically for Metal rather than abstracting across CUDA, ROCm, and Metal simultaneously, ds4 makes a clear architectural bet: specialization over portability.

## Why it matters

### The antirez factor

When the person who wrote Redis — a project legendary for its code clarity, performance discipline, and refusal to add unnecessary complexity — decides to write an inference engine, the developer community pays attention. Redis succeeded in part because Sanfilippo rejected the enterprise-software instinct to handle every possible use case. He built a sharp tool that did specific things extremely well. ds4 appears to follow the same philosophy: one model family, one GPU backend, zero compromise.

This stands in stark contrast to the current inference runtime landscape. Projects like llama.cpp (by Georgi Gerganov) have become the de facto standard for local inference, supporting dozens of model architectures across CUDA, Metal, Vulkan, and CPU backends. Ollama wraps llama.cpp in a Docker-like UX. MLX from Apple's own ML research team provides a NumPy-like framework optimized for Apple Silicon. Each of these tools optimizes for breadth — and each pays a tax for that breadth in the form of abstraction layers, compatibility shims, and code paths that exist to serve hardware the user doesn't have.

### The specialization thesis

ds4 asks a pointed question: if you know you're running one model family on one GPU architecture, how much faster can you go?

The answer, historically, is "meaningfully faster." Specialized inference kernels can outperform general-purpose runtimes by 20-40% on specific hardware, though the margin varies with model and workload, because they can exploit chip-specific memory hierarchies and tune batch sizes and scheduling strategies in ways that a cross-platform abstraction cannot. Apple's Metal Shading Language gives developers direct control over threadgroup memory, SIMD operations, and the unified memory architecture that makes Apple Silicon's GPU-CPU data sharing nearly zero-copy. A runtime that targets only Metal can structure its entire memory management strategy around this unified architecture rather than treating it as one option among several.

DeepSeek's model family is a particularly interesting target. DeepSeek-V2 introduced Multi-head Latent Attention (MLA), which compresses the KV cache significantly compared to standard multi-head attention. DeepSeek 4 Flash likely extends this efficiency. A dedicated runtime can optimize its KV cache management specifically for MLA's compressed representation rather than supporting both MLA and standard attention paths.
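To give a sense of scale for the KV cache compression described above, here is a rough back-of-the-envelope sketch. The dimensions in `AttentionShape` are illustrative assumptions loosely modeled on a DeepSeek-V2-sized configuration; they are not read from ds4 or from DeepSeek 4 Flash, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Illustrative attention dimensions (assumed values, not taken from ds4)."""
    n_layers: int = 60          # transformer layers
    n_heads: int = 128          # attention heads
    head_dim: int = 128         # per-head key/value width
    kv_latent_dim: int = 512    # width of the compressed KV latent in MLA
    rope_dim: int = 64          # positional key cached alongside the latent
    bytes_per_value: int = 2    # 16-bit cache entries


def mha_cache_bytes_per_token(s: AttentionShape) -> int:
    # Standard multi-head attention caches a full key and a full value per head.
    return 2 * s.n_heads * s.head_dim * s.bytes_per_value * s.n_layers


def mla_cache_bytes_per_token(s: AttentionShape) -> int:
    # MLA caches one shared latent vector plus a small positional key per layer.
    return (s.kv_latent_dim + s.rope_dim) * s.bytes_per_value * s.n_layers


if __name__ == "__main__":
    shape = AttentionShape()
    mha = mha_cache_bytes_per_token(shape)
    mla = mla_cache_bytes_per_token(shape)
    print(f"MHA: {mha / 1024:.1f} KiB per token")
    print(f"MLA: {mla / 1024:.1f} KiB per token")
    print(f"ratio: {mha / mla:.1f}x")
```

Under these assumed dimensions the compressed cache is dozens of times smaller per token, which is the kind of gap that makes a cache layout built only for MLA attractive on memory-constrained hardware.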

### The local inference arms race

ds4 arrives at a moment when local model inference is shifting from hobbyist curiosity to production consideration. The reasons are well-documented: data privacy requirements, API cost reduction at scale, latency-sensitive applications, and the simple reality that a MacBook Pro with 96GB of unified memory can run surprisingly capable models without a network round-trip.

The proliferation of specialized inference engines — ds4 for Metal, TensorRT-LLM for NVIDIA, Intel's OpenVINO, Qualcomm's AI Engine Direct — suggests the industry is moving away from the "one runtime" model toward hardware-native runtimes with thin compatibility layers on top. This mirrors what happened with databases (Redis didn't replace PostgreSQL; it handled the workloads PostgreSQL shouldn't), and it may be how inference runtimes evolve: llama.cpp as the universal fallback, with specialized engines for developers who know their deployment target.

## What this means for your stack

If you're a developer running inference on Apple Silicon — and the installed base of M-series Macs in the developer population makes this a substantial group — ds4 is worth benchmarking against your current setup. The key comparison points:

vs. llama.cpp with Metal backend: llama.cpp's Metal support is good but general. ds4's entire codebase is Metal-native, which should translate to tighter GPU utilization. If you're running DeepSeek models specifically, the specialized path likely wins on tokens-per-second.

vs. MLX: Apple's own framework is well-optimized for Apple Silicon but operates at a higher abstraction level (Python/NumPy-style API). ds4 in C/Metal sits closer to the hardware. The tradeoff is developer-experience polish versus raw throughput.

vs. Ollama: Ollama is convenience-first (pull and run). ds4 is performance-first (build and tune). If you're embedding inference in a production service rather than running ad-hoc queries, the performance delta matters more than the setup friction.

The practical advice: don't rip out your current stack. But if you're running DeepSeek models on a Mac in any performance-sensitive context — local coding assistants, document processing pipelines, or edge inference services — benchmark ds4 against your current runtime. The antirez pedigree suggests the code quality will be high enough to trust in production, even at an early stage.
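One way to run the comparison suggested above is a small runtime-agnostic harness. The sketch below is a minimal example, not a ds4 tool: `generate` is a hypothetical adapter you would write for each runtime (a subprocess call, an HTTP request, or a library binding), and `fake_runtime` is a stand-in so the code runs without any model installed.

```python
import statistics
import time
from typing import Callable, List


def measure_tokens_per_second(
    generate: Callable[[str, int], int],
    prompt: str,
    max_tokens: int,
    runs: int = 5,
    warmup: int = 1,
) -> float:
    """Return the median end-to-end generation rate in tokens per second.

    `generate` is a hypothetical per-runtime adapter: it runs one generation
    and returns the number of tokens actually produced. The timing includes
    prompt processing, so use the same prompt for every runtime you compare.
    """
    for _ in range(warmup):
        generate(prompt, max_tokens)  # discarded: first run pays load and compile costs

    rates: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_tokens)
        elapsed = time.perf_counter() - start
        if produced > 0 and elapsed > 0:
            rates.append(produced / elapsed)
    if not rates:
        raise RuntimeError("no successful runs")
    return statistics.median(rates)


def fake_runtime(prompt: str, max_tokens: int) -> int:
    # Stand-in adapter: sleeps about 1 ms per "token" instead of running a model.
    time.sleep(0.001 * max_tokens)
    return max_tokens


if __name__ == "__main__":
    rate = measure_tokens_per_second(fake_runtime, "Explain unified memory.", 50)
    print(f"median rate: {rate:.0f} tokens/s")
```

Taking the median over several runs after a warmup pass keeps one-time costs such as model loading and shader compilation from skewing the comparison between runtimes.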

## Looking ahead

ds4 is a small project from a legendary systems programmer, not a venture-backed platform play. Its significance is directional: it signals that the inference runtime layer is not yet settled, that hardware-specific optimization still has meaningful headroom, and that the "download Ollama and forget about it" convenience model may coexist with a performance tier for developers who care about throughput per watt. If antirez applies even half the craft he brought to Redis, ds4 will become the benchmark that other Metal inference implementations measure themselves against — and that alone makes it worth watching.
