🖥️ Dual-GPU LLM Workstation Build Spec
Target use case: Repo-aware coding assistant (30–34B models, fine-tuning, RAG) on local hardware.
1. Core Components
🔹 Motherboard
MSI MEG X670E ACE (Recommended)
- PCIe layout: Dual PCIe 5.0 x16 slots, official x16 / x0 or x8 / x8 mode.
- Spacing: 2 reinforced x16 slots with good clearance for dual 3-slot GPUs.
- VRM: Premium 22+2 phases → stable for Ryzen 9 CPUs under sustained load.
- Networking: Onboard 2.5 GbE + WiFi 6E.
- Storage: 4× M.2 (PCIe Gen5 + Gen4), plenty for NVMe datasets.
- BIOS: Stable lane bifurcation support, enthusiast-class firmware.
🔹 GPUs
2× NVIDIA GeForce RTX 5070 Ti 16 GB
- Architecture: Blackwell (RTX 50-series), mid/high tier.
- VRAM: 16 GB GDDR7 per card → 32 GB combined when a model is sharded across both cards with tensor parallelism.
- Compute: Strong CUDA/Tensor core throughput, well suited to 30B-class coder models in 4-bit quantization (rough VRAM math in the sketch after this list).
- Cooling: Triple-fan 3-slot designs recommended (ASUS TUF / MSI Gaming Trio / Gigabyte Eagle).
- Setup: Run via tensor parallel with vLLM/ExLlamaV2 for efficient sharding.
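As a sanity check on the 32 GB figure, here is a rough back-of-the-envelope VRAM estimate for a 34B model at 4-bit; the bits-per-weight, per-token KV-cache cost, and overhead values are assumptions, not measured numbers:

```python
# Rough VRAM estimate for a 34B model at 4-bit quantization (all figures assumed/approximate).
params = 34e9                # parameter count
bits_per_weight = 4.5        # ~4-bit weights plus quantization metadata (assumption)
kv_bytes_per_token = 0.2e6   # assumes GQA-style attention with fp16 KV cache (assumption)
context_tokens = 32_768
overhead_gb = 2.0            # CUDA context, activations, fragmentation (assumption)

weights_gb = params * bits_per_weight / 8 / 1e9
kv_gb = kv_bytes_per_token * context_tokens / 1e9
total_gb = weights_gb + kv_gb + overhead_gb

print(f"weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.0f} GB, total ~{total_gb:.0f} GB")
# ~19 GB + ~7 GB + 2 GB ≈ 28 GB, which fits in 2×16 GB with a little headroom.
```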
🔹 CPU
AMD Ryzen 9 7950X / 9950X
- 16 cores, 32 threads.
- Plenty for preprocessing, vector search, and RAG pipelines.
- PCIe 5.0 support to fully unlock X670E lane config.
🔹 Memory
128 GB DDR5 (2×64 GB)
- DDR5-5600 or faster with an EXPO profile (e.g. a Crucial 2×64 GB DDR5-5600 kit).
- Required for large context inference + QLoRA fine-tuning.
🔹 Storage
- 2 TB NVMe Gen4 (Samsung 990 Pro / WD Black SN850X) for OS + models.
- 4 TB NVMe Gen4/Gen5 for vector DB + repo snapshots.
- Optional: HDD RAID (NAS) for long-term storage/backup.
🔹 PSU
1000–1200 W ATX 3.0 Gold+
- Dual 16-pin (12VHPWR) connectors for GPUs.
- Corsair RM1200e Shift / Seasonic Vertex GX-1200 recommended.
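For context on the 1000–1200 W sizing, a quick worst-case power budget; the board-power figures below are nominal assumptions, not measurements:

```python
# Back-of-the-envelope power budget (all wattages nominal/assumed).
gpu_board_power_w = 300    # per RTX 5070 Ti class card (approx.)
cpu_package_power_w = 230  # Ryzen 9 7950X PPT limit (approx.)
rest_w = 100               # motherboard, RAM, NVMe, fans (assumption)

worst_case_w = 2 * gpu_board_power_w + cpu_package_power_w + rest_w
print(f"worst case ~{worst_case_w} W")   # ~930 W with everything pinned at once

# During LLM inference the CPU is mostly idle, so sustained draw sits well below
# this; a 1000-1200 W ATX 3.0 unit leaves margin for GPU power excursions.
```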
🔹 Case & Cooling
- Airflow-focused mid-tower (Cooler Master TD500 Mesh V2, with extra fans) or a full tower for extra room.
- Six-fan airflow and clearance for two 3-slot GPUs.
- CPU cooler: Thermalright Peerless Assassin 140.
2. Software Stack
Inference (multi-GPU)
```bash
vllm serve codellama/CodeLlama-34b-Instruct-hf \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --dtype auto
```
- vLLM: Handles tensor parallel across 2 GPUs.
- ExLlamaV2: For GPTQ/EXL2 quantized models (efficient VRAM use).
- llama.cpp: Optional GGUF loader with CPU offload.
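If a model does not fit even across both cards, llama.cpp can keep part of it in system RAM. A minimal sketch using the llama-cpp-python bindings; the GGUF path, layer count, and split ratios are placeholders:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/codellama-34b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,          # offload 40 layers to VRAM; -1 would offload everything
    tensor_split=[0.5, 0.5],  # split offloaded layers evenly across the two cards
    n_ctx=16384,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a Python context manager is."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```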
Fine-Tuning
- QLoRA/PEFT for efficient repo-specific fine-tuning.
- With 2×16 GB GPUs, LoRA adapters on 30B models become feasible.
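A minimal QLoRA setup sketch with transformers + peft + bitsandbytes; the model id, LoRA rank, and target modules are illustrative choices, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "codellama/CodeLlama-34b-Instruct-hf"  # example base model

# Load the frozen base model in 4-bit NF4 so it fits across the two 16 GB cards.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # shard layers across both GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Train only small LoRA adapters on the attention projections (illustrative settings).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From here, a standard Trainer/SFTTrainer loop over repo-derived instruction pairs completes the fine-tune.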
RAG (Repo-Aware Assistant)
- Index repo into Chroma/Weaviate/Qdrant vector DB.
- Chunk by function/class boundaries (see the indexing sketch after this list).
- Serve via the OpenAI-compatible API (vllm.entrypoints.openai.api_server).
- Integrate with VS Code / JetBrains as a local Copilot-style assistant.
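A minimal chunk-and-index sketch for the indexing and chunking steps above, using Python's ast module for function/class-level chunks and a local Chroma collection; the repo path and collection name are placeholders:

```python
import ast
from pathlib import Path
import chromadb

client = chromadb.PersistentClient(path="./repo_index")      # local on-disk index
collection = client.get_or_create_collection("repo_chunks")

def function_chunks(source: str, filename: str):
    """Yield (id, text) pairs, one per top-level function/class definition."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield f"{filename}:{node.name}", ast.get_source_segment(source, node)

for path in Path("my_repo").rglob("*.py"):                    # placeholder repo path
    source = path.read_text()
    for chunk_id, text in function_chunks(source, str(path)):
        if text:
            collection.add(ids=[chunk_id], documents=[text])

# At query time, retrieve the top matches and prepend them to the prompt sent
# to the local vLLM server's OpenAI-compatible endpoint.
results = collection.query(query_texts=["where is the auth middleware?"], n_results=5)
```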
3. What You Can Run
| GPU setup | Max model size (quantized) | Coding ability vs ChatGPT |
|---|---|---|
| 1× 5070 Ti 16 GB | 13B (GGUF/4-bit) | “Junior dev who knows repo” |
| 2× 5070 Ti 16 GB | 30–34B (4-bit) | “Mid/senior dev, repo-aware” |
| 4× cards (future) | 70B class (QLoRA/RAG) | “GPT-3.5-like, repo-aware” |
4. Why This Build Works
- Safe motherboard choice (MEG X670E ACE) officially supports x8/x8 dual-GPU operation from the CPU lanes, avoiding bifurcation quirks.
- 2×5070 Ti 16 GB = sweet spot for 30B-class coder models → excellent balance between VRAM and compute.
- 128 GB DDR5 ensures enough headroom for RAG + fine-tuning.
- 2.5 GbE LAN is sufficient for serving LLM API to your dev machines.
- Modular upgrade path → add GPUs later (4× config) or swap to larger VRAM SKUs.
✅ Verdict:
This dual-5070 Ti build on the MSI MEG X670E ACE is a reliable workstation for repo-tuned coding LLMs. You’ll be able to run and fine-tune 30–34B models, integrate them with your IDE, and get a private assistant that, thanks to repo-specific context, can outperform general-purpose ChatGPT on your own codebase, while you keep ChatGPT for broader reasoning tasks.