Introducing mvec: A vector database designed for Apple Silicon.
mvec is a standalone or app-embeddable vector database for Apple Silicon, designed to let developers and power users tap into a device's full GPU capabilities.
Coming soon to the App Store. Project link here: mvec
Your Mac has a GPU. So do your iPhone and iPad. Each has potentially dozens of cores designed for massively parallel computation — matrix operations, vector similarity, batch processing — and for most of the day, they do nothing. No vector database on the market is built to use them. No general-purpose database is either. The GPU sits idle while the CPU handles work the GPU would finish in a fraction of the time.
Today, if you need vector database operations, you have two options:
- Need GPU performance? Route your queries to the cloud, where costs stack up and what completes in sub-millisecond time locally takes seconds over the network.
- Or accept that the GPU is simply off the table, and run everything on the CPU locally.
Neither option is ideal. And the scale of what's being left on the table is not trivial. The GPU in your Mac represents more than A$73,000 per year in equivalent cloud GPU compute (50% utilisation, Mac Studio M3 Ultra).
The local-CPU-only comparison is particularly sharp given just how much better GPUs are at the operations that matter most for vector search. Published benchmarks for GPU-resident query processing show a 5–13× advantage over CPU on algebraic predicates, rising to 45× at larger indexes. These are not small gains; they are an order of magnitude, on hardware that is sitting unused.
Introducing mvec
mvec is an embeddable or standalone vector database designed to utilise the full GPU headroom of your Apple Silicon device. If you are running a local inference setup, a knowledge hub, or any RAG pipeline on your Mac, you almost certainly need vector operations alongside it. If you are building an app with semantic search, you need one embedded. mvec is built for both — as a tool for local development and research, and as an engine that ships inside your app. No cloud routing. No per-query cost. No data leaving the device.
The Memory Wall
Now this is where the plot thickens. Apple's Unified Memory Architecture does not just provide better access to a GPU; it shifts the entire equation between the two processor architectures.
Every vector database on the market today is shaped by a single architectural constraint: CPU or GPU. Not both.
In the traditional design, data lives in one memory pool or the other. Moving it between them means copying across a bus — PCIe — with hard bandwidth and latency costs. For large batch operations the transfer cost is tolerable. In machine learning, for instance, it is negligible relative to the training itself. For the small, frequent operations that define real-time search — traversing an index, scoring candidates, evaluating the next step — the transfer takes longer than the computation.
In summary: Copying from CPU to GPU kills performance.
The result? An industry built around choosing one or the other. Every database in operation today was built on that single premise: it lives in either the CPU camp or the GPU camp.
Apple Silicon removes this wall. In unified memory, CPU and GPU read the same bytes at the same addresses. No copy. No transfer. No bus. Both processors, same data, same cycle.
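To make that concrete, here is a minimal Metal sketch (illustrative, not mvec code) of what unified memory permits: a buffer allocated as shared is visible to the CPU and the GPU at the same address, so the CPU can write vectors and a GPU kernel can read them with no copy in between.

```swift
import Metal

// A minimal illustration of unified memory, not mvec's internals:
// one allocation, visible to CPU and GPU at the same address.
let device = MTLCreateSystemDefaultDevice()!

// .storageModeShared places the buffer in unified memory.
let count = 1_024
let buffer = device.makeBuffer(
    length: count * MemoryLayout<Float>.stride,
    options: .storageModeShared
)!

// The CPU writes through an ordinary pointer...
let ptr = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count { ptr[i] = Float(i) }

// ...and any compute kernel bound to this buffer reads the very
// same bytes directly. No blit, no staging buffer, no PCIe hop.
```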
mvec is built from the ground up for this architecture. Not a CPU database with GPU acceleration bolted on. Not a discrete-GPU database ported to Apple hardware. A database that assumes shared memory and exploits it at every layer — from index structures designed for dual-mode access, to a query engine that schedules CPU and GPU operations in parallel across the same data.
What mvec Is
mvec is still in its prototyping stage, but the architectural premise is clear: CPUs doing what they do best and GPUs doing what they do best, asynchronously and in concert. No batching to amortise transfers. No choosing one processor over the other. Both lanes executing massively in parallel.
On-device is the game changer, and it is arriving. Local inference, local RAG, local agents. The models are moving onto the hardware. Every local inference setup needs a vector database beside it. Every app shipping semantic search needs one embedded. That is where mvec comes in.
mvec has two forms. As a standalone application for developers and power users — a local vector database with a UI, a REST API, and the ability to benchmark and experiment on your own hardware. And as an embeddable SDK for app developers — a compiled engine that ships inside your iOS, iPadOS, or macOS app, giving it on-device vector search without a cloud dependency.
On-device. On GPU. On unified memory. With zero-copy execution from query to result.
Architecture
mvec's architecture is built around a simple organising principle: CPU and GPU as two parallel lanes operating over the same data. Each operation lands on the processor best suited to run it. Both lanes execute concurrently to complete the query.
CPU/GPU Swim Lanes
✅ = primary lane · 🟡 = workable but not preferred · ❌ = not used

| Operation | CPU | GPU | Why / What |
|---|---|---|---|
| Brute force | 🟡 | ✅ | Every vector scored independently. |
| HNSW | ✅ | 🟡 | Graph traversal is sequential pointer-chasing. |
| IVF | ❌ | ✅ | Cluster search. |
| CAGRA | ❌ | ✅ | k-NN graph built and traversed entirely on GPU. |
| Graph traversal | ✅ | ❌ | Pointer-chasing, irregular memory access. |
| Algebraic Ops | ❌ | ✅ | Intersection (A∩B), difference (A\B), cardinality, etc. |
| Filtering | ✅ | ❌ | Conditional branching, irregular access patterns. |
| Distance | ❌ | ✅ | Cosine similarity, euclidean (L2), dot product. |
| Sort | ✅ | ❌ | Comparison-based, branching-heavy. Sequential by nature. |
| Embedding | ❌ | ✅ | Matrix multiplication, high arithmetic intensity. Native Metal execution. |
The classic trade-off between index types (HNSW for recall, IVF for throughput) exists because discrete architectures force you onto one processor. On unified memory, both indexes access the same data. A query can run HNSW traversal on CPU and IVF scan on GPU in parallel, merge the results, and return, all in a single execution.
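A sketch of what that dual-lane query could look like, using Swift structured concurrency; `VectorIndex`, `SearchHit`, and the merge step are illustrative stand-ins, not mvec's actual API:

```swift
// Hypothetical sketch of the dual-lane query; VectorIndex, SearchHit,
// and the merge step are illustrative stand-ins, not mvec's API.
protocol VectorIndex: Sendable {
    func search(_ query: [Float], topK: Int) async -> [SearchHit]
}

struct SearchHit { let id: UInt64; let score: Float }

func hybridSearch(
    query: [Float], topK: Int,
    hnsw: any VectorIndex,   // traversal-friendly index, CPU lane
    ivf: any VectorIndex     // scan-friendly index, GPU lane
) async -> [SearchHit] {
    // Both indexes reference the same vectors in unified memory,
    // so the two lanes run concurrently with no copy between them.
    async let cpuHits = hnsw.search(query, topK: topK)
    async let gpuHits = ivf.search(query, topK: topK)

    // Converge: merge both candidate sets and keep the best topK.
    // (A real merge would also deduplicate by id.)
    let merged = await cpuHits + gpuHits
    return Array(merged.sorted { $0.score > $1.score }.prefix(topK))
}
```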
Zero-Copy Architecture
A primary cost in a conventional pipeline is data allocation and serialisation. Filter results are copied into a re-ranking buffer. Re-ranking output is copied into a result structure. Each boundary is allocation, transformation, deallocation.
mvec's zero-copy architecture eliminates that. Data stays in place — all operations referencing the same memory. No intermediate copies. No serialisation between stages. From query arrival to result, the bytes are read, not duplicated.
This is complementary to unified memory. Unified memory shares an address space between processors; zero-copy is a software discipline that avoids unnecessary data movement within it. Skipping serialisation and deserialisation between stages yields faster pipelines, at the cost of implementation complexity.
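A minimal sketch of the staging idea, assuming a flat row-major vector store: each stage receives a view into the same backing memory, and only small row-index lists travel between stages, never the vector payloads.

```swift
// Illustrative sketch of zero-copy staging, not mvec's internals.
// The predicate and score are stand-ins; the point is that rows are
// referenced in place rather than copied into per-stage buffers.
func filterStage(_ vectors: UnsafeBufferPointer<Float>, dims: Int) -> [Int] {
    // Emit row indices that pass a (stand-in) predicate.
    (0..<(vectors.count / dims)).filter { $0 % 2 == 0 }
}

func rankStage(_ vectors: UnsafeBufferPointer<Float>, dims: Int,
               rows: [Int]) -> [(row: Int, score: Float)] {
    // Score each surviving row through the same pointer; no row
    // is ever serialised or copied between the two stages.
    rows.map { row in (row, vectors[row * dims]) }  // stand-in score
}

let dims = 4
let store: [Float] = Array(repeating: 0.5, count: 16)  // 4 vectors
let results = store.withUnsafeBufferPointer { view in
    rankStage(view, dims: dims, rows: filterStage(view, dims: dims))
}
```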
Graphs as Queries
Queries in mvec are expressed as directed acyclic graphs — dependency graphs of operations. A simple vector search is a two-node DAG: scan, then rank. A hybrid search fans out into parallel branches — vector similarity on GPU, metadata filtering on CPU — that converge at a fusion node.
The DAG scheduler assigns each operation to the appropriate processor and executes independent branches in parallel. The execution model is inherited from the msearch engine — the same architecture that will power composable intelligence pipelines. In mvec, it drives search operations.
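As a sketch of the shape (the node names and `Lane` tag are illustrative, not mvec's scheduler types), a hybrid search might be expressed as:

```swift
// Hypothetical sketch of a query plan as a DAG; node names and the
// Lane tag are illustrative, not mvec's scheduler API.
enum Lane { case cpu, gpu }

struct PlanNode {
    let name: String
    let lane: Lane
    let dependsOn: [Int]   // indices of upstream nodes
}

// A hybrid search as a four-node DAG: two independent branches
// (GPU similarity scan, CPU metadata filter) converging at a
// fusion node, followed by a final ranking step.
let plan: [PlanNode] = [
    PlanNode(name: "similarity-scan", lane: .gpu, dependsOn: []),     // 0
    PlanNode(name: "metadata-filter", lane: .cpu, dependsOn: []),     // 1
    PlanNode(name: "fuse",            lane: .cpu, dependsOn: [0, 1]), // 2
    PlanNode(name: "rank",            lane: .gpu, dependsOn: [2]),    // 3
]
// A scheduler can dispatch nodes 0 and 1 concurrently, since neither
// depends on the other, and run the fusion node once both complete.
```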
Metal Compute Shaders
GPU execution runs through Metal — Apple's native GPU API. Distance computation (cosine similarity, euclidean, dot product), index scans, and algebraic predicate evaluation are implemented as Metal compute shaders dispatched directly over unified memory. No abstraction layer.
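For a flavour of the pattern, here is a minimal compute-shader dispatch in Swift; the kernel and its host code are a sketch of the technique, not mvec's production kernels.

```swift
import Metal

// A minimal sketch: a dot-product shader compiled from source and
// dispatched straight over buffers that live in unified memory.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void dot_scores(device const float *vectors [[buffer(0)]],
                       device const float *query   [[buffer(1)]],
                       device float       *scores  [[buffer(2)]],
                       constant uint      &dims    [[buffer(3)]],
                       uint id [[thread_position_in_grid]]) {
    float acc = 0.0;
    for (uint d = 0; d < dims; d++) {
        acc += vectors[id * dims + d] * query[d];
    }
    scores[id] = acc;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "dot_scores")!
)
let queue = device.makeCommandQueue()!

// One thread per stored vector. With .storageModeShared buffers, the
// scores are readable on the CPU as soon as the GPU finishes.
func score(vectors: MTLBuffer, query: MTLBuffer, scores: MTLBuffer,
           count: Int, dims: Int) {
    var d = UInt32(dims)
    let cmd = queue.makeCommandBuffer()!
    let enc = cmd.makeComputeCommandEncoder()!
    enc.setComputePipelineState(pipeline)
    enc.setBuffer(vectors, offset: 0, index: 0)
    enc.setBuffer(query,   offset: 0, index: 1)
    enc.setBuffer(scores,  offset: 0, index: 2)
    enc.setBytes(&d, length: MemoryLayout<UInt32>.size, index: 3)
    enc.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
    enc.endEncoding()
    cmd.commit()
    cmd.waitUntilCompleted()
}
```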
Built on mgraph
mvec's engine is built on mgraph for thread-safe operations over complex data types — in our case, vectors.
The App UI
Coming soon.
The SDK
mvec ships as a compiled xcframework — a binary that links into your iOS, iPadOS, or macOS app via Swift Package Manager. The Rust internals, Metal kernels, and GPU scheduling are never exposed. You get an API surface and a binary.
From the developer's perspective, adding mvec means their app has on-device vector search. Upload data, embed with a local model, build an index, query it — all running on unified memory, all GPU-accelerated, all offline-capable. No server to provision. No API keys. No per-query cloud costs.
The same engine that powers the standalone application powers the SDK. The difference is packaging — one is a tool you run, the other is an engine your app runs.
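Integration is the usual Swift Package Manager flow. A hypothetical manifest (the package URL and version are placeholders, not mvec's published coordinates) might look like:

```swift
// swift-tools-version:5.9
// Hypothetical Package.swift; the URL and version are placeholders.
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        .package(url: "https://example.com/mvec/mvec-swift.git", from: "0.1.0")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [.product(name: "mvec", package: "mvec-swift")]
        )
    ]
)
```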
A REST API is also available for developers who prefer HTTP integration or need to connect mvec to external tooling. Both surfaces expose the same capabilities.
Collection Lifecycle
REST

```http
POST /v1/collections
{
  "name": "research-papers",
  "dimensions": 384,
  "index": { "type": "hnsw", "m": 16, "ef_construction": 200 },
  "metric": "cosine"
}
# → 201 { "id": "col_a8f2c", "name": "research-papers", ... }
```

Swift

```swift
let config = CollectionConfig(
    name: "research-papers",
    dimensions: 384,
    index: .hnsw(m: 16, efConstruction: 200),
    metric: .cosine
)
let collection = try await mvec.createCollection(config)
```

Rust

```rust
let config = CollectionConfig::builder("research-papers", 384)
    .index(IndexType::hnsw(16, 200))
    .metric(DistanceMetric::Cosine)
    .build();
let collection = mvec.create_collection(config).await?;
```

Directory Ingestion
REST

```http
POST /v1/collections/col_a8f2c/ingest
{
  "path": "/Users/jordan/research",
  "file_types": [".md", ".pdf", ".txt"],
  "chunking": { "strategy": "sentence", "max_tokens": 512, "overlap": 64 },
  "model": "model_minilm",
  "recursive": true
}
# → 202 { "job_id": "job_f7b3", "status": "running" }
```

Swift

```swift
let job = try await collection.ingest(
    path: URL(filePath: "/Users/jordan/research"),
    fileTypes: [.md, .pdf, .txt],
    chunking: .sentence(maxTokens: 512, overlap: 64),
    model: miniLM
)
for await progress in job.updates {
    print("\(progress.filesProcessed)/\(progress.filesTotal)")
}
```

Rust

```rust
let job = collection.ingest(
    IngestConfig::builder("/Users/jordan/research")
        .file_types(&[FileType::Md, FileType::Pdf, FileType::Txt])
        .chunking(Chunking::Sentence { max_tokens: 512, overlap: 64 })
        .model(&mini_lm)
        .recursive(true)
        .build()
).await?;
while let Some(progress) = job.next().await {
    println!("{}/{}", progress.files_processed, progress.files_total);
}
```

Search
REST

```http
POST /v1/collections/col_a8f2c/search
{
  "text": "transformer attention mechanisms",
  "model": "model_minilm",
  "top_k": 10,
  "filter": {
    "and": [
      { "in": { "field": "type", "values": ["paper", "notes"] } },
      { "not": { "eq": { "field": "status", "value": "archived" } } }
    ]
  }
}
# → 200 { "results": [{ "id": "vec_b2d1", "score": 0.94, "metadata": { ... }, "chunk": "..." }] }
```

Swift

```swift
let results = try await collection.search(
    text: "transformer attention mechanisms",
    model: miniLM,
    topK: 10,
    filter: .and(
        .in("type", values: ["paper", "notes"]),
        .not(.eq("status", "archived"))
    )
)
```

Rust

```rust
let results = collection.search(
    SearchRequest::text("transformer attention mechanisms", &mini_lm)
        .top_k(10)
        .filter(Filter::and([
            Filter::is_in("type", &["paper", "notes"]),
            Filter::not(Filter::eq("status", "archived")),
        ]))
).await?;
```

Benchmarks
Coming soon.