AI-assisted annotations
HN thread analysis and synthesis with Claude Opus 4.6 via Claude Code.
HN: Real-world local LLM setups (Oct 2025)
Source: Ask HN: Who uses open LLMs and coding assistants locally?
GPT-OSS-120b: runtime and quantization matter
The biggest finding: GPT-OSS-120b quality varies wildly depending on how you run it. Runtime ranking per embedding-shape:
- TensorRT — fastest
- llama.cpp — easy + fast (~260 tok/s on RTX Pro 6000 for 20b variant)
- vLLM — best for batching/throughput, harder to deploy
- Ollama — easiest, slowest
Critical: use native MXFP4 weights, not Q8 quantization. gunalx was getting worse results from 120b than 20b — turned out Q8 degrades quality drastically. Early runners (Ollama, vLLM) also botched implementations at launch, so many people wrote the model off prematurely.
MoE architecture is the key enabler: 120B total params but only a fraction activate per token, so it runs on hardware that couldn’t touch a dense 120B model. Same for Qwen3-Coder-30B-A3B (30B params, 3B active).
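The MoE math is worth making concrete. A minimal sketch using the thread’s Qwen3-Coder-30B-A3B numbers (30B total, 3B active); the ~4.5 effective bits/param for a Q4_K_M-style quant is an estimate, not a spec value:

```python
# Why MoE runs on modest hardware: all weights must fit in memory,
# but per-token decode traffic at batch size 1 scales with *active*
# params only.

def weight_gb(params_b: float, bits: float) -> float:
    """Memory in GB for params_b billion params at the given bits/param."""
    return params_b * bits / 8

total_b, active_b = 30, 3            # Qwen3-Coder-30B-A3B
q4_mem = weight_gb(total_b, 4.5)     # must fit in RAM/VRAM
active_bytes = weight_gb(active_b, 4.5)  # read per generated token

print(f"weights in memory: ~{q4_mem:.1f} GB")
print(f"read per token:    ~{active_bytes:.2f} GB")
```

A dense 30B model would read all ~17 GB per token; the MoE reads roughly a tenth of that, which is why it decodes so much faster on the same hardware.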
The agentic coding gap
simonw couldn’t find a local model on a 64GB or 128GB Mac that reliably runs bash-in-a-loop over multiple turns — the core of agentic coding (Claude Code, Codex CLI).
embedding-shape’s solution with GPT-OSS-120b + Codex + llama.cpp:
- Hard-code inference params: `top_k=0, top_p=1.0, temperature=1.0` (Codex doesn’t expose these)
- Get Harmony parsing working correctly in llama.cpp
- Heavy `AGENTS.md` prompting to teach the agent workflow
Implication: frontier models like GPT-5 are trained with tool-use loops in mind; open models need that behavior bolted on via prompting and inference config.
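Since Codex doesn’t expose sampling params, the pinning has to happen at the inference layer. A sketch of the request body you’d send to a llama.cpp server’s OpenAI-compatible endpoint (`top_k` is a llama.cpp extension field; the model name and endpoint are assumptions — the same values can also be set server-side via `llama-server --top-k 0 --top-p 1.0 --temp 1.0`):

```python
import json

def gpt_oss_request(messages: list[dict]) -> dict:
    """Build a chat request with GPT-OSS sampling pinned, since the
    client (Codex) doesn't expose these knobs."""
    return {
        "model": "gpt-oss-120b",   # assumed model alias
        "messages": messages,
        "top_k": 0,                # 0 = no top-k truncation (full vocab)
        "top_p": 1.0,              # no nucleus truncation
        "temperature": 1.0,
    }

body = gpt_oss_request([{"role": "user", "content": "Run the test suite."}])
print(json.dumps(body, indent=2))
```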
Practical sweet spot: Qwen3-Coder on a 64GB Mac
dust42 runs Qwen3-Coder-30B-A3B Q4 via llama.cpp on MBP 64GB — 50 tok/s generation, 550 tok/s prompt processing. Uses continue.dev for chat and llama.cpp’s VSCode plugin for FIM completion.
KV caching trick: load files with “read the code and wait” while you type your real instructions. KV caching makes the response near-instant once you submit — effectively hiding prompt processing latency behind typing time.
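The trick works because llama.cpp reuses the KV cache for the longest matching prompt prefix, so only the newly appended tokens need processing. A toy simulation of that bookkeeping (token counts only, no real model):

```python
class PrefixCache:
    """Simulates prefix-based KV cache reuse: only tokens past the
    shared prefix with the previous prompt need prompt processing."""

    def __init__(self) -> None:
        self.cached: list[str] = []

    def process(self, tokens: list[str]) -> int:
        shared = 0
        for a, b in zip(self.cached, tokens):
            if a != b:
                break
            shared += 1
        self.cached = list(tokens)
        return len(tokens) - shared  # tokens actually processed

cache = PrefixCache()
context = ["<file chunk>"] * 500  # the big file dump
# Priming request sent while you type your real instructions:
prime = cache.process(context + ["read", "the", "code", "and", "wait"])
# Real request: same prefix, instructions appended:
real = cache.process(context + ["read", "the", "code", "and", "wait",
                                "now", "rename", "foo", "to", "bar"])
print(prime, real)  # 505 tokens up front, only 5 at submit time
```

At dust42’s 550 tok/s prompt processing, the 505-token priming pass takes about a second and happens while you type; the visible latency at submit is just the 5 new tokens.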
Honest assessment: “When giving well-defined small tasks, it is as good as any frontier model.” For anything harder → Claude or DeepSeek via API.
Recommended resources for quants:
- Unsloth GGUFs — Q4_K_M variant
- llama.cpp Apple Silicon benchmarks
- Quant selection guide
Hardware setup census
| Hardware | Model | Runtime | tok/s | User |
|---|---|---|---|---|
| Ryzen 9 + RTX Pro 6000 96GB | GPT-OSS-120b | llama.cpp | ~260 (20b) | embedding-shape |
| M4 Max 128GB | GPT-OSS-120b MLX 8bit | MLX | 66 | jetsnoc |
| M4 Max 128GB | Qwen3-Coder-30B-A3B 8bit | MLX | 78 | jetsnoc |
| MBP 64GB | Qwen3-Coder-30B-A3B Q4 | llama.cpp | 50 | dust42 |
| Dual RTX 3090 | Qwen3-Coder-30B-A3B Q8 | llama.cpp | 100 | Mostlygeek |
| HP ZBook Ultra G1A 128GB (Strix Halo) | GPT-OSS-20b | llama.cpp | — | hacker_homie |
| Mac Studio M4 Max 128GB | GPT-OSS-120b | Ollama | — | Greenpants |
| Framework Desktop 128GB | GPT-OSS-120b | lemonade-server | — | dennemark |
Common pattern: Mac unified memory for ease, NVIDIA for raw speed, AMD Strix Halo as budget middle ground. Memory bandwidth matters more than raw RAM — Max/Ultra chips outperform Pro chips at the same RAM capacity. See How to estimate local model performance for the math behind this and My AI Home Lab for a concrete M4 Pro 64GB setup.
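The bandwidth rule of thumb above can be sketched as: batch-1 decode speed is bounded by memory bandwidth divided by the bytes of active weights read per token. The chip bandwidths below are published spec figures used as assumptions, and results are ceilings; real throughput lands well below:

```python
def est_tok_s(bandwidth_gb_s: float, active_params_b: float,
              bits_per_param: float) -> float:
    """Upper-bound decode speed: bandwidth / bytes read per token."""
    gb_per_token = active_params_b * bits_per_param / 8
    return bandwidth_gb_s / gb_per_token

# Qwen3-Coder-30B-A3B (3B active) at ~4.5 bits/param, on two chips
# with the same max RAM but different bandwidth:
for chip, bw in [("M4 Max", 546), ("M4 Pro", 273)]:
    print(f"{chip} (~{bw} GB/s): ~{est_tok_s(bw, 3, 4.5):.0f} tok/s ceiling")
```

Halving bandwidth halves the ceiling at identical RAM capacity, which is exactly why Max/Ultra chips outperform Pro chips here.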
HN: Qwen3.5 and the state of local models (Feb 2026)
Source: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers
StepFun-3.5-Flash: the dark horse open model
kir-gadjello (thread) uses StepFun-3.5-Flash (196B/11B active MoE) for a complex Rust codebase with hundreds of integration tests and nontrivial concurrency. Claims it covers 95% of coding needs, beats MiniMax M2.5, and is competitive with GLM-5 (comparison).
Key insights:
- “With suitable task decomposition or a test harness you can make the model do what you thought it could not” (thread)
- 2x faster than competitors — fast iteration loops are a real productivity advantage (nodakai)
- Accuses MiniMax of heavy distillation from Western frontier models, while crediting StepFun with extensive custom post-training R&D
- “The optimal configuration for maximizing output of correct software features per dollar involves using StepFun or its future class competitor for bulk coding” (thread)
Resource: StepFun-3.5-Flash on GitHub
Qwen3.5 model selection: 27B dense vs 35B MoE vs 122B MoE
The 35B-A3B MoE has only 3B active params — roughly equivalent to an 11B dense model per regularfry (thread). Multiple commenters say the 27B dense is the best in the lineup for quality:size (smahs, CamperBob2). The 122B-A10B is the only one people describe as “Sonnet-esque”:
- pram (thread) runs it on M4 Max 128GB via LM Studio + OpenCode
- derekp7 (thread) says it’s the first local model to nail an RPN calculator one-shot at q3 dynamic quant
Recommendation by hardware:
- NVIDIA 4090 (24GB): 27B dense or 35B-A3B MoE
- Mac 128GB: 122B-A10B MoE (MLX quants preferred)
- 32GB machines: 35B-A3B is the ceiling
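These recommendations can be sanity-checked with a rough fit test: weights take `params × bits/8` GB, plus headroom for KV cache and runtime overhead. The ~4.5 bits/param and 20% overhead factor are assumptions, not measured values:

```python
def fits(params_b: float, bits_per_param: float, mem_gb: float,
         overhead: float = 1.2) -> bool:
    """Rough check: do quantized weights + ~20% overhead fit in memory?"""
    return params_b * bits_per_param / 8 * overhead <= mem_gb

print(fits(27, 4.5, 24))     # 27B dense @ ~Q4 on a 24GB 4090
print(fits(122, 4.5, 24))    # 122B MoE won't fit a 4090
print(fits(122, 4.5, 128))   # 122B MoE fits a 128GB Mac comfortably
```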
Quantization sweet spot: 4-bit
See also perplexity benchmarks by quant level.
jackcosgrove (thread) ran an analysis: 4-bit quantization is 99% similar to float32 at half the size of 8-bit — the clear sweet spot. deepsquirrelnet (thread) confirms GPT-OSS models were trained natively in MXFP4 (4-bit floating point, e2m1 format with per-block 8-bit scaling exponents).
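To make the e2m1 format concrete, a minimal decoder following the OCP Microscaling spec: each 4-bit code is sign(1)/exponent(2)/mantissa(1), and a block of 32 codes shares one 8-bit power-of-two scale (E8M0, bias 127):

```python
E2M1_BIAS = 1

def decode_e2m1(code: int) -> float:
    """Decode one 4-bit e2m1 code (s/ee/m) to a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                  # subnormal: no implicit leading 1
        mag = man * 0.5
    else:
        mag = (1 + man / 2) * 2.0 ** (exp - E2M1_BIAS)
    return sign * mag

# The 8 magnitudes a single FP4 code can represent:
print(sorted(decode_e2m1(c) for c in range(8)))
# -> [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_block(codes: list[int], scale_exp: int) -> list[float]:
    """Apply the block's shared E8M0 scale 2**(scale_exp - 127)."""
    scale = 2.0 ** (scale_exp - 127)
    return [decode_e2m1(c) * scale for c in codes]
```

The tiny value set is why Q8 round-tripping of natively-MXFP4 weights can hurt: the per-block scales, not extra mantissa bits, carry the dynamic range.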
Recommendations from zargon (thread):
- For coding: don’t go below Q4_K_M
- Prefer unsloth XL or ik_llama IQ quants at Q4 — better quality at same size
- Ideally Q5 or Q6 if you have the VRAM
Resources:
- Unsloth GGUF benchmarks for Qwen3.5
- MXFP4 spec (OCP Microscaling Formats)
- CMU Modern AI course — covers quantization fundamentals