AI-assisted annotations
HN thread analysis and synthesis with Claude Opus 4.6 via Claude Code.
HN: Real-world local LLM setups (Oct 2025)
Source: Ask HN: Who uses open LLMs and coding assistants locally?
GPT-OSS-120b: runtime and quantization matter
The biggest finding: GPT-OSS-120b quality varies wildly depending on how you run it. Runtime ranking per embedding-shape:
- TensorRT — fastest
- llama.cpp — easy + fast (~260 tok/s on RTX Pro 6000 for 20b variant)
- vLLM — best for batching/throughput, harder to deploy
- Ollama — easiest, slowest
Critical: use native MXFP4 weights, not Q8 quantization. gunalx was getting worse results from 120b than 20b — turned out Q8 degrades quality drastically. Early runners (Ollama, vLLM) also botched implementations at launch, so many people wrote the model off prematurely.
MoE architecture is the key enabler: 120B total params but only a fraction activate per token, so it runs on hardware that couldn’t touch a dense 120B model. Same for Qwen3-Coder-30B-A3B (30B params, 3B active).
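The MoE math is worth making concrete. A minimal sketch using the thread’s Qwen3-Coder-30B-A3B numbers (30B total, 3B active); the ~4.5 effective bits/param for a Q4_K_M-style quant is an estimate, not a spec value:

```python
# Why MoE runs on modest hardware: all weights must fit in memory,
# but per-token decode traffic at batch size 1 scales with *active*
# params only.

def weight_gb(params_b: float, bits: float) -> float:
    """Memory in GB for params_b billion params at the given bits/param."""
    return params_b * bits / 8

total_b, active_b = 30, 3            # Qwen3-Coder-30B-A3B
q4_mem = weight_gb(total_b, 4.5)     # must fit in RAM/VRAM
active_bytes = weight_gb(active_b, 4.5)  # read per generated token

print(f"weights in memory: ~{q4_mem:.1f} GB")
print(f"read per token:    ~{active_bytes:.2f} GB")
```

A dense 30B model would read all ~17 GB per token; the MoE reads roughly a tenth of that, which is why it decodes so much faster on the same hardware.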
The agentic coding gap
simonw couldn’t find a local model on a 64GB or 128GB Mac that reliably runs bash-in-a-loop over multiple turns — the core of agentic coding (Claude Code, Codex CLI).
embedding-shape’s solution with GPT-OSS-120b + Codex + llama.cpp:
- Hard-code inference params: `top_k=0, top_p=1.0, temperature=1.0` (Codex doesn’t expose these)
- Get Harmony parsing working correctly in llama.cpp
- Heavy `AGENTS.md` prompting to teach the agent workflow
Implication: frontier models like GPT-5 are trained with tool-use loops in mind; open models need that behavior bolted on via prompting and inference config.
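Since Codex doesn’t expose sampling params, the pinning has to happen at the inference layer. A sketch of the request body you’d send to a llama.cpp server’s OpenAI-compatible endpoint (`top_k` is a llama.cpp extension field; the model name and endpoint are assumptions — the same values can also be set server-side via `llama-server --top-k 0 --top-p 1.0 --temp 1.0`):

```python
import json

def gpt_oss_request(messages: list[dict]) -> dict:
    """Build a chat request with GPT-OSS sampling pinned, since the
    client (Codex) doesn't expose these knobs."""
    return {
        "model": "gpt-oss-120b",   # assumed model alias
        "messages": messages,
        "top_k": 0,                # 0 = no top-k truncation (full vocab)
        "top_p": 1.0,              # no nucleus truncation
        "temperature": 1.0,
    }

body = gpt_oss_request([{"role": "user", "content": "Run the test suite."}])
print(json.dumps(body, indent=2))
```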
Practical sweet spot: Qwen3-Coder on a 64GB Mac
dust42 runs Qwen3-Coder-30B-A3B Q4 via llama.cpp on MBP 64GB — 50 tok/s generation, 550 tok/s prompt processing. Uses continue.dev for chat and llama.cpp’s VSCode plugin for FIM completion.
KV caching trick: load files with “read the code and wait” while you type your real instructions. KV caching makes the response near-instant once you submit — effectively hiding prompt processing latency behind typing time.
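The trick works because llama.cpp reuses the KV cache for the longest matching prompt prefix, so only the newly appended tokens need processing. A toy simulation of that bookkeeping (token counts only, no real model):

```python
class PrefixCache:
    """Simulates prefix-based KV cache reuse: only tokens past the
    shared prefix with the previous prompt need prompt processing."""

    def __init__(self) -> None:
        self.cached: list[str] = []

    def process(self, tokens: list[str]) -> int:
        shared = 0
        for a, b in zip(self.cached, tokens):
            if a != b:
                break
            shared += 1
        self.cached = list(tokens)
        return len(tokens) - shared  # tokens actually processed

cache = PrefixCache()
context = ["<file chunk>"] * 500  # the big file dump
# Priming request sent while you type your real instructions:
prime = cache.process(context + ["read", "the", "code", "and", "wait"])
# Real request: same prefix, instructions appended:
real = cache.process(context + ["read", "the", "code", "and", "wait",
                                "now", "rename", "foo", "to", "bar"])
print(prime, real)  # 505 tokens up front, only 5 at submit time
```

At dust42’s 550 tok/s prompt processing, the 505-token priming pass takes about a second and happens while you type; the visible latency at submit is just the 5 new tokens.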
Honest assessment: “When giving well-defined small tasks, it is as good as any frontier model.” For anything harder → Claude or DeepSeek via API.
Recommended resources for quants:
- Unsloth GGUFs — Q4_K_M variant
- llama.cpp Apple Silicon benchmarks
- Quant selection guide
Hardware setup census
| Hardware | Model | Runtime | tok/s | User |
|---|---|---|---|---|
| Ryzen 9 + RTX Pro 6000 96GB | GPT-OSS-120b | llama.cpp | ~260 (20b) | embedding-shape |
| M4 Max 128GB | GPT-OSS-120b MLX 8bit | MLX | 66 | jetsnoc |
| M4 Max 128GB | Qwen3-Coder-30B-A3B 8bit | MLX | 78 | jetsnoc |
| MBP 64GB | Qwen3-Coder-30B-A3B Q4 | llama.cpp | 50 | dust42 |
| Dual RTX 3090 | Qwen3-Coder-30B-A3B Q8 | llama.cpp | 100 | Mostlygeek |
| HP ZBook Ultra G1A 128GB (Strix Halo) | GPT-OSS-20b | llama.cpp | — | hacker_homie |
| Mac Studio M4 Max 128GB | GPT-OSS-120b | Ollama | — | Greenpants |
| Framework Desktop 128GB | GPT-OSS-120b | lemonade-server | — | dennemark |
Common pattern: Mac unified memory for ease, NVIDIA for raw speed, AMD Strix Halo as budget middle ground. Memory bandwidth matters more than raw RAM — Max/Ultra chips outperform Pro chips at the same RAM capacity. See How to estimate local model performance for the math behind this and My AI Home Lab for a concrete M4 Pro 64GB setup.
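The bandwidth rule of thumb above can be sketched as: batch-1 decode speed is bounded by memory bandwidth divided by the bytes of active weights read per token. The chip bandwidths below are published spec figures used as assumptions, and results are ceilings; real throughput lands well below:

```python
def est_tok_s(bandwidth_gb_s: float, active_params_b: float,
              bits_per_param: float) -> float:
    """Upper-bound decode speed: bandwidth / bytes read per token."""
    gb_per_token = active_params_b * bits_per_param / 8
    return bandwidth_gb_s / gb_per_token

# Qwen3-Coder-30B-A3B (3B active) at ~4.5 bits/param, on two chips
# with the same max RAM but different bandwidth:
for chip, bw in [("M4 Max", 546), ("M4 Pro", 273)]:
    print(f"{chip} (~{bw} GB/s): ~{est_tok_s(bw, 3, 4.5):.0f} tok/s ceiling")
```

Halving bandwidth halves the ceiling at identical RAM capacity, which is exactly why Max/Ultra chips outperform Pro chips here.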
HN: Qwen3.5 and the state of local models (Feb 2026)
Source: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers
StepFun-3.5-Flash: the dark horse open model
kir-gadjello (thread) uses StepFun-3.5-Flash (196B/11B active MoE) for a complex Rust codebase with hundreds of integration tests and nontrivial concurrency. Claims it covers 95% of coding needs, beats MiniMax M2.5, and is competitive with GLM-5 (comparison).
Key insights:
- “With suitable task decomposition or a test harness you can make the model do what you thought it could not” (thread)
- 2x faster than competitors — fast iteration loops are a real productivity advantage (nodakai)
- Accuses MiniMax of heavy distillation from Western frontier models, while crediting StepFun with extensive custom post-training R&D
- “The optimal configuration for maximizing output of correct software features per dollar involves using StepFun or its future class competitor for bulk coding” (thread)
Resource: StepFun-3.5-Flash on GitHub
Qwen3.5 model selection: 27B dense vs 35B MoE vs 122B MoE
The 35B-A3B MoE has only 3B active params — roughly equivalent to an 11B dense model per regularfry (thread). Multiple commenters say the 27B dense is the best in the lineup for quality:size (smahs, CamperBob2). The 122B-A10B is the only one people describe as “Sonnet-esque”:
- pram (thread) runs it on M4 Max 128GB via LM Studio + OpenCode
- derekp7 (thread) says it’s the first local model to nail an RPN calculator one-shot at q3 dynamic quant
Recommendation by hardware:
- NVIDIA 4090 (24GB): 27B dense or 35B-A3B MoE
- Mac 128GB: 122B-A10B MoE (MLX quants preferred)
- 32GB machines: 35B-A3B is the ceiling
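These recommendations can be sanity-checked with a rough fit test: weights take `params × bits/8` GB, plus headroom for KV cache and runtime overhead. The ~4.5 bits/param and 20% overhead factor are assumptions, not measured values:

```python
def fits(params_b: float, bits_per_param: float, mem_gb: float,
         overhead: float = 1.2) -> bool:
    """Rough check: do quantized weights + ~20% overhead fit in memory?"""
    return params_b * bits_per_param / 8 * overhead <= mem_gb

print(fits(27, 4.5, 24))     # 27B dense @ ~Q4 on a 24GB 4090
print(fits(122, 4.5, 24))    # 122B MoE won't fit a 4090
print(fits(122, 4.5, 128))   # 122B MoE fits a 128GB Mac comfortably
```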
Quantization sweet spot: 4-bit
See also perplexity benchmarks by quant level.
jackcosgrove (thread) ran an analysis: 4-bit quantization is 99% similar to float32 at half the size of 8-bit — the clear sweet spot. deepsquirrelnet (thread) confirms GPT-OSS models were trained natively in MXFP4 (4-bit floating point, e2m1 format with per-block 8-bit scaling exponents).
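To make the e2m1 format concrete, a minimal decoder following the OCP Microscaling spec: each 4-bit code is sign(1)/exponent(2)/mantissa(1), and a block of 32 codes shares one 8-bit power-of-two scale (E8M0, bias 127):

```python
E2M1_BIAS = 1

def decode_e2m1(code: int) -> float:
    """Decode one 4-bit e2m1 code (s/ee/m) to a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                  # subnormal: no implicit leading 1
        mag = man * 0.5
    else:
        mag = (1 + man / 2) * 2.0 ** (exp - E2M1_BIAS)
    return sign * mag

# The 8 magnitudes a single FP4 code can represent:
print(sorted(decode_e2m1(c) for c in range(8)))
# -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_block(codes: list[int], scale_exp: int) -> list[float]:
    """Apply the block's shared E8M0 scale 2**(scale_exp - 127)."""
    scale = 2.0 ** (scale_exp - 127)
    return [decode_e2m1(c) * scale for c in codes]
```

The tiny value set is why Q8 round-tripping of natively-MXFP4 weights can hurt: the per-block scales, not extra mantissa bits, carry the dynamic range.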
Recommendations from zargon (thread):
- For coding: don’t go below Q4_K_M
- Prefer unsloth XL or ik_llama IQ quants at Q4 — better quality at same size
- Ideally Q5 or Q6 if you have the VRAM
Resources:
- Unsloth GGUF benchmarks for Qwen3.5
- MXFP4 spec (OCP Microscaling Formats)
- CMU Modern AI course — covers quantization fundamentals