cosmicallyrun/vllm-metal

 
 

vLLM Metal Plugin

High-performance LLM inference on Apple Silicon using MLX and vLLM

vLLM Metal is a plugin that enables vLLM to run on Apple Silicon Macs using MLX as the primary compute backend. It unifies MLX and PyTorch under a single lowering path.

Features

  • MLX-accelerated inference: 10-25x faster than PyTorch MPS on Apple Silicon
  • Unified memory: True zero-copy operations leveraging Apple Silicon's unified memory architecture
  • vLLM compatibility: Full integration with vLLM's engine, scheduler, and OpenAI-compatible API
  • Paged attention: Efficient KV cache management for long sequences
  • GQA support: Grouped-Query Attention for efficient inference
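The paged attention and GQA features above can be illustrated with a small pure-Python sketch. It is not the plugin's implementation: `BlockTable` and `kv_head_for_query_head` are hypothetical names introduced here, and the block size of 16 is borrowed from the documented `VLLM_METAL_BLOCK_SIZE` default. The sketch only shows the bookkeeping idea: a sequence's tokens are stored in fixed-size blocks that need not be contiguous, and several query heads read from one shared key/value head.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # assumption: mirrors the documented VLLM_METAL_BLOCK_SIZE default


@dataclass
class BlockTable:
    """Hypothetical per-sequence block table for a paged KV cache."""

    free_blocks: list[int]                       # pool of unused physical block ids
    blocks: list[int] = field(default_factory=list)  # blocks owned by this sequence
    num_tokens: int = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        offset = self.num_tokens % BLOCK_SIZE
        if offset == 0:
            # The last block is full (or none exists yet): take a fresh one.
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1
        return self.blocks[-1], offset

    def lookup(self, position: int) -> tuple[int, int]:
        """Map a logical token position to (physical_block, offset)."""
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        return self.blocks[position // BLOCK_SIZE], position % BLOCK_SIZE


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """GQA: each consecutive group of query heads shares one key/value head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    return q_head // (num_q_heads // num_kv_heads)
```

With 8 query heads and 2 key/value heads, query heads 0 to 3 map to KV head 0 and heads 4 to 7 map to KV head 1, so the cache stores a quarter of the key/value tensors that full multi-head attention would need.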

Requirements

  • macOS on Apple Silicon
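
A quick way to check this requirement from Python is sketched below. The helper name `is_apple_silicon_mac` is hypothetical and not part of the plugin; it relies only on the standard library's `platform` module. Note that a Python interpreter running under Rosetta 2 typically reports `x86_64`, so a `False` result on an Apple Silicon machine may mean the interpreter itself is not a native arm64 build.

```python
import platform


def is_apple_silicon_mac(system=None, machine=None) -> bool:
    """Return True when running natively on macOS with an arm64 CPU.

    The optional arguments exist only so the check can be exercised
    with explicit values; by default the current host is inspected.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


if __name__ == "__main__":
    print("Apple Silicon macOS:", is_apple_silicon_mac())
```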

Installation

Quick Install

./install.sh

Architecture

┌────────────────────────────────────────────────────────────┐
│                    vLLM Core (Unchanged)                   │
│         Engine, Scheduler, API Server, Tokenizers          │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│                 vllm_metal Plugin Layer                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │MetalPlatform│  │ MetalWorker │  │ MetalModelRunner    │ │
│  │ (Platform)  │  │ (Worker)    │  │ (ModelRunner)       │ │
│  └─────────────┘  └─────────────┘  └─────────────────────┘ │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│              Unified Compute Backend                       │
│  ┌──────────────────────┐  ┌─────────────────────────────┐ │
│  │   MLX Backend        │  │   PyTorch Backend           │ │
│  │   (Primary)          │  │   (Model Loading/Interop)   │ │
│  │                      │  │                             │ │
│  │ • SDPA Attention     │  │ • HuggingFace Loading       │ │
│  │ • RMSNorm            │  │ • Weight Conversion         │ │
│  │ • RoPE               │  │ • Tensor Bridge             │ │
│  │ • Cache Ops          │  │                             │ │
│  └──────────────────────┘  └─────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│                    Metal GPU Layer                         │
│         Apple Silicon Unified Memory Architecture          │
└────────────────────────────────────────────────────────────┘

Configuration

Environment variables for customization:

Variable                    Default  Description
VLLM_METAL_MEMORY_FRACTION  0.9      Fraction of memory to use
VLLM_METAL_USE_MLX          1        Use MLX for compute (1=yes, 0=no)
VLLM_MLX_DEVICE             gpu      MLX device (gpu or cpu)
VLLM_METAL_BLOCK_SIZE       16       KV cache block size
VLLM_METAL_DEBUG            0        Enable debug logging
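
As a minimal sketch of how these variables might be read, the snippet below parses them with the documented defaults. `MetalConfig` and `load_config` are hypothetical names, not the plugin's API. The validation rules (fraction in (0, 1], positive block size, device limited to `gpu` or `cpu`) and the reading of `VLLM_METAL_DEBUG=1` as "enabled" are assumptions inferred from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MetalConfig:
    """Hypothetical container; defaults mirror the table above."""

    memory_fraction: float = 0.9
    use_mlx: bool = True
    mlx_device: str = "gpu"
    block_size: int = 16
    debug: bool = False


def load_config(env=None) -> MetalConfig:
    """Build a MetalConfig from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    cfg = MetalConfig(
        memory_fraction=float(env.get("VLLM_METAL_MEMORY_FRACTION", "0.9")),
        use_mlx=env.get("VLLM_METAL_USE_MLX", "1") == "1",
        mlx_device=env.get("VLLM_MLX_DEVICE", "gpu"),
        block_size=int(env.get("VLLM_METAL_BLOCK_SIZE", "16")),
        debug=env.get("VLLM_METAL_DEBUG", "0") == "1",  # assumption: 1 enables
    )
    # Assumed sanity checks; the plugin may validate differently or not at all.
    if not 0.0 < cfg.memory_fraction <= 1.0:
        raise ValueError("VLLM_METAL_MEMORY_FRACTION must be in (0, 1]")
    if cfg.mlx_device not in ("gpu", "cpu"):
        raise ValueError("VLLM_MLX_DEVICE must be 'gpu' or 'cpu'")
    if cfg.block_size <= 0:
        raise ValueError("VLLM_METAL_BLOCK_SIZE must be positive")
    return cfg
```

For example, `load_config({"VLLM_MLX_DEVICE": "cpu"})` returns a config with `mlx_device="cpu"` and every other field left at its default.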

About

Community-maintained hardware plugin for vLLM on Apple Silicon
