AI-assisted annotations

HN thread analysis and synthesis with Claude Opus 4.6 via Claude Code.

HN: Real-world local LLM setups (Oct 2025)

Source: Ask HN: Who uses open LLMs and coding assistants locally?

GPT-OSS-120b: runtime and quantization matter

The biggest finding: GPT-OSS-120b quality varies wildly depending on how you run it. Runtime ranking per embedding-shape:

  1. TensorRT — fastest
  2. llama.cpp — easy + fast (~260 tok/s on RTX Pro 6000 for 20b variant)
  3. vLLM — best for batching/throughput, harder to deploy
  4. Ollama — easiest, slowest

Critical: use native MXFP4 weights, not Q8 quantization. gunalx was getting worse results from 120b than 20b — turned out Q8 degrades quality drastically. Early runners (Ollama, vLLM) also botched implementations at launch, so many people wrote the model off prematurely.

MoE architecture is the key enabler: 120B total params but only a fraction activate per token, so it runs on hardware that couldn’t touch a dense 120B model. Same for Qwen3-Coder-30B-A3B (30B params, 3B active).
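The arithmetic can be sketched in a few lines (active-param counts are from the public model cards: ~5.1B for GPT-OSS-120b, ~3.3B for Qwen3-Coder-30B-A3B; the 4-bit byte count ignores block-scale overhead):

```python
BYTES_PER_PARAM_4BIT = 0.5  # 4-bit weights, block-scale overhead ignored

def resident_gb(total_params_b: float) -> float:
    """Memory that must hold the weights: every expert stays loaded."""
    return total_params_b * BYTES_PER_PARAM_4BIT

def read_per_token_gb(active_params_b: float) -> float:
    """Weights actually streamed per generated token: active experts only."""
    return active_params_b * BYTES_PER_PARAM_4BIT

# GPT-OSS-120b: ~60 GB resident, but only ~2.6 GB read per token --
# decode cost looks like a ~5B model, memory footprint like a 120B one.
print(resident_gb(120), read_per_token_gb(5.1))
```

That gap between resident and streamed weights is why 128GB unified-memory Macs are a natural fit for these MoE models.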

The agentic coding gap

simonw couldn’t find a local model on a 64GB or 128GB Mac that reliably runs bash-in-a-loop over multiple turns — the core of agentic coding (Claude Code, Codex CLI).


embedding-shape’s solution with GPT-OSS-120b + Codex + llama.cpp:

  • Hard-code inference params: top_k=0, top_p=1.0, temperature=1.0 (Codex doesn’t expose these)
  • Get Harmony parsing working correctly in llama.cpp
  • Heavy AGENTS.md prompting to teach the agent workflow
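A sketch of the first two steps as a llama-server invocation (the model path and context size are placeholders; flag names match recent llama.cpp builds, so check `llama-server --help` on yours):

```python
# Pin sampling server-side, since Codex never sends these parameters.
cmd = [
    "llama-server",
    "-m", "gpt-oss-120b-mxfp4.gguf",  # native MXFP4 weights, not Q8
    "-c", "32768",                    # context window
    "--jinja",                        # use the model's own chat template
    "--temp", "1.0",
    "--top-p", "1.0",
    "--top-k", "0",                   # 0 disables top-k in llama.cpp
    "--port", "8080",
]
# Launch with e.g. subprocess.Popen(cmd); printed here for inspection.
print(" ".join(cmd))
```

Because the client never overrides them, the server-side defaults become the effective sampling configuration for every Codex request.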

Implication: frontier models like GPT-5 are trained with tool-use loops in mind; open models need that behavior bolted on via prompting and inference config.

Practical sweet spot: Qwen3-Coder on a 64GB Mac

dust42 runs Qwen3-Coder-30B-A3B Q4 via llama.cpp on MBP 64GB — 50 tok/s generation, 550 tok/s prompt processing. Uses continue.dev for chat and llama.cpp’s VSCode plugin for FIM completion.

KV caching trick: load files with “read the code and wait” while you type your real instructions. KV caching makes the response near-instant once you submit — effectively hiding prompt processing latency behind typing time.
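A minimal sketch of the trick against a llama.cpp OpenAI-compatible endpoint (the URL, prompt wording, and "Ready." turn are all assumptions; llama.cpp reuses cached KV state for the longest matching prefix, which here is the bulky first message):

```python
import json
import urllib.request

BASE = "http://localhost:8080/v1/chat/completions"  # assumed llama.cpp server

def priming_request(code: str) -> dict:
    # Ship the bulky context early, with a throwaway instruction, so the
    # server processes and KV-caches it while you type.
    return {
        "messages": [
            {"role": "user", "content": f"Read this code and wait:\n\n{code}"},
        ],
        "max_tokens": 8,
    }

def task_request(code: str, instruction: str) -> dict:
    # Repeat the same prefix verbatim so cached KV entries are reused;
    # only the short new instruction needs prompt processing.
    return {
        "messages": [
            {"role": "user", "content": f"Read this code and wait:\n\n{code}"},
            {"role": "assistant", "content": "Ready."},
            {"role": "user", "content": instruction},
        ]
    }

def post(payload: dict) -> dict:
    req = urllib.request.Request(
        BASE,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The cache hit only covers the shared first message, but that is where nearly all the tokens live, which is the whole point of the trick.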

Honest assessment: “When giving well-defined small tasks, it is as good as any frontier model.” For anything harder → Claude or DeepSeek via API.

Hardware setup census

| Hardware | Model | Runtime | tok/s | User |
|---|---|---|---|---|
| Ryzen 9 + RTX Pro 6000 96GB | GPT-OSS-120b | llama.cpp | ~260 (20b) | embedding-shape |
| M4 Max 128GB | GPT-OSS-120b MLX 8bit | MLX | 66 | jetsnoc |
| M4 Max 128GB | Qwen3-Coder-30B-A3B 8bit | MLX | 78 | jetsnoc |
| MBP 64GB | Qwen3-Coder-30B-A3B Q4 | llama.cpp | 50 | dust42 |
| Dual RTX 3090 | Qwen3-Coder-30B-A3B Q8 | llama.cpp | 100 | Mostlygeek |
| HP ZBook Ultra G1A 128GB (Strix Halo) | GPT-OSS-20b | llama.cpp | — | hacker_homie |
| Mac Studio M4 Max 128GB | GPT-OSS-120b | Ollama | — | Greenpants |
| Framework Desktop 128GB | GPT-OSS-120b | lemonade-server | — | dennemark |

Common pattern: Mac unified memory for ease, NVIDIA for raw speed, AMD Strix Halo as budget middle ground. Memory bandwidth matters more than raw RAM — Max/Ultra chips outperform Pro chips at the same RAM capacity. See How to estimate local model performance for the math behind this and My AI Home Lab for a concrete M4 Pro 64GB setup.
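A rough version of that math, assuming decode speed is capped by bandwidth divided by bytes read per token (bandwidth figures are chip-spec assumptions: M4 Pro ~273 GB/s, top M4 Max ~546 GB/s):

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float,
                           active_params_b: float,
                           bytes_per_param: float = 0.5) -> float:
    """Upper bound on decode speed: every token must stream the active
    weights from memory at least once."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Same RAM capacity, double the bandwidth, double the ceiling
# (Qwen3-Coder-30B-A3B at Q4, ~3.3B active params):
m4_pro = tokens_per_sec_ceiling(273, 3.3)
m4_max = tokens_per_sec_ceiling(546, 3.3)
print(round(m4_pro), round(m4_max))
```

Observed numbers (dust42’s 50 tok/s) land well below the ceiling because of compute, KV-cache reads, and runtime overhead; the formula is most useful for comparing chips, not predicting absolute speeds.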

HN: Qwen3.5 and the state of local models (Feb 2026)

Source: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

StepFun-3.5-Flash: the dark horse open model

kir-gadjello (thread) uses StepFun-3.5-Flash (196B/11B active MoE) for a complex Rust codebase with hundreds of integration tests and nontrivial concurrency. Claims it covers 95% of coding needs, beats MiniMax M2.5, and is competitive with GLM-5 (comparison).

Key insights:

  • “With suitable task decomposition or a test harness you can make the model do what you thought it could not” (thread)
  • 2x faster than competitors — fast iteration loops are a real productivity advantage (nodakai)
  • kir-gadjello accuses MiniMax of heavy distillation from Western frontier models, while crediting StepFun with extensive custom post-training R&D
  • “The optimal configuration for maximizing output of correct software features per dollar involves using StepFun or its future class competitor for bulk coding” (thread)

Resource: StepFun-3.5-Flash on GitHub

Qwen3.5 model selection: 27B dense vs 35B MoE vs 122B MoE

The 35B-A3B MoE has only 3B active params — roughly equivalent to an 11B dense model per regularfry (thread). Multiple commenters call the 27B dense the best quality-for-size in the lineup (smahs, CamperBob2). The 122B-A10B is the only one people describe as “Sonnet-esque”:

  • pram (thread) runs it on M4 Max 128GB via LM Studio + OpenCode
  • derekp7 (thread) says it’s the first local model to nail an RPN calculator one-shot at q3 dynamic quant

Recommendation by hardware:

  • NVIDIA 4090 (24GB): 27B dense or 35B-A3B MoE
  • Mac 128GB: 122B-A10B MoE (MLX quants preferred)
  • 32GB machines: 35B-A3B is the ceiling

Quantization sweet spot: 4-bit

See also perplexity benchmarks by quant level.

jackcosgrove (thread) ran an analysis: 4-bit quantization is 99% similar to float32 at half the size of 8-bit — the clear sweet spot. deepsquirrelnet (thread) confirms GPT-OSS models were trained natively in MXFP4 (4-bit floating point, e2m1 format with per-block 8-bit scaling exponents).
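The e2m1 grid is tiny: only eight distinct magnitudes. A sketch of the decode (block-scale handling is simplified relative to the actual OCP MX spec, which stores the scale as an 8-bit exponent):

```python
def e2m1_to_float(code: int) -> float:
    """Decode a 4-bit e2m1 value: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                              # subnormal: 0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

# The entire positive grid -- eight magnitudes:
grid = sorted({abs(e2m1_to_float(c)) for c in range(16)})
# grid == [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_dequant(code: int, block_scale_exp: int) -> float:
    """MXFP4: each block of 32 codes shares one power-of-two scale."""
    return e2m1_to_float(code) * 2.0 ** block_scale_exp
```

The shared per-block scale is what lets such a coarse grid track real weight distributions, and training natively in this format is why GPT-OSS loses quality when re-quantized to Q8.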

Recommendations from zargon (thread):

  • For coding: don’t go below Q4_K_M
  • Prefer unsloth XL or ik_llama IQ quants at Q4 — better quality at same size
  • Ideally Q5 or Q6 if you have the VRAM
