Week Overview
This week was dominated by one story: the Qwen3.5 family from Alibaba, a set of models spanning 3B to 122B parameters that collectively redraws the local AI capability map. Alongside it, GLM-4.7 from Zhipu AI continued gaining traction, and Unsloth's TTS fine-tuning support opened a new front in local audio AI.
Here's the full picture of what shipped.
The Qwen3.5 Family: Alibaba's Major Release
Qwen3.5 is Alibaba's largest model release since Qwen3, and it moves forward on several fronts at once: architecture efficiency, multimodal capability, context length, language coverage, and agentic performance. The family spans four sizes, each with a distinct sweet spot; three of them are covered below.
Qwen3.5-35B-A3B: The Local AI Star
Qwen3.5-35B-A3B
The headliner. 35 billion total parameters, only ~3 billion active per token. Fits on a single RTX 3090/4090 at Q4 and runs at ~90 tok/sec — faster than many 7B dense models. Community benchmarks confirm real-world quality well above its active parameter count.
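A back-of-the-envelope check on the single-GPU claim, as a minimal sketch. The bits-per-weight figure is an assumption (Q4-class GGUF quants mix block types and usually land somewhere between 4 and 5 bits per weight), and the estimate covers weights only, not KV cache or runtime overhead.

```python
def quantized_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9             # from the article: 35 billion total parameters
ASSUMED_BITS_PER_WEIGHT = 4.5   # ASSUMPTION: illustrative Q4-class average
GPU_VRAM_GB = 24                # RTX 3090 / 4090

weights_gb = quantized_weight_gb(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
headroom_gb = GPU_VRAM_GB - weights_gb
print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```

Note that all 35B parameters have to be resident in memory even though only ~3B are used per token: the MoE design saves compute per token, not weight storage.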
The architecture is the story here. Each MoE layer has 256 experts (unusually fine granularity), with 8 routed experts plus 1 shared expert active per token. Three out of every four attention layers use Gated DeltaNet, a linear-attention variant that avoids the O(N²) scaling that makes long contexts expensive. The remaining one in four is a standard softmax attention layer, which provides global context sensitivity where needed.
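To make the routing numbers concrete, here is a toy sketch of top-k expert selection and a 3:1 layer schedule. It is not Qwen's implementation: the softmax-over-selected-experts weighting and the exact position of the softmax layer within each group of four are assumptions; only the counts (256 experts, 8 routed, one softmax layer per four) come from the text above.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts per MoE layer (from the article)
TOP_K = 8           # routed experts selected per token (from the article)


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top_k experts by logit and softmax-normalize their weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    max_logit = max(router_logits[i] for i in chosen)  # for numerical stability
    exps = {i: math.exp(router_logits[i] - max_logit) for i in chosen}
    total = sum(exps.values())
    return {i: w / total for i, w in exps.items()}


def layer_schedule(num_layers: int) -> list[str]:
    """ASSUMED ordering: three linear-attention layers, then one softmax layer."""
    return ["softmax" if (i + 1) % 4 == 0 else "gated_deltanet"
            for i in range(num_layers)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
weights = route_token(logits)
routed_fraction = TOP_K / NUM_EXPERTS   # share of routed experts used per token
schedule = layer_schedule(8)

print(len(weights), f"{routed_fraction:.2%}", schedule)
```

Only 8 of the 256 routed experts (about 3%) run for any given token, which is how a 35B-parameter model ends up with roughly 3B active parameters.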
The practical result: 262K context works natively without tricks. The model handles full repository-scale codebases, long agent conversations with tool call history, and multi-document analysis without the memory explosion that standard attention would require.
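A rough way to see the memory effect: a softmax attention layer keeps keys and values for every token in context, while a linear-attention layer keeps a fixed-size state. The sketch below compares KV cache size when every layer is softmax against one in four. The layer count, KV head count, and head dimension are hypothetical placeholders chosen for round numbers, not Qwen3.5's published configuration.

```python
def kv_cache_gib(context_len: int, softmax_layers: int, kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys + values for every softmax-attention layer."""
    total_bytes = (2 * softmax_layers * kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 2**30


CONTEXT_LEN = 262_144   # "262K" from the article, taken here as 2**18 tokens
NUM_LAYERS = 48         # HYPOTHETICAL layer count
KV_HEADS = 4            # HYPOTHETICAL number of KV heads
HEAD_DIM = 128          # HYPOTHETICAL head dimension

all_softmax_gib = kv_cache_gib(CONTEXT_LEN, NUM_LAYERS, KV_HEADS, HEAD_DIM)
hybrid_gib = kv_cache_gib(CONTEXT_LEN, NUM_LAYERS // 4, KV_HEADS, HEAD_DIM)
print(f"all softmax: {all_softmax_gib:.1f} GiB, 1-in-4 softmax: {hybrid_gib:.1f} GiB")
```

Under these placeholder dimensions the 16-bit cache at full context drops from 24 GiB to 6 GiB. Whatever the real dimensions are, the 4× reduction follows directly from the 3:1 layer ratio.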
The real-world test that got community attention: on a complex PDF merger app task (dark GUI, drag-and-drop, venv isolation, .bat installer), Qwen3.5-27B dense solved it in 3 outputs at 31 tok/sec. GPT-5 failed all three attempts. Qwen3.5-35B-A3B Q4 ran at 90 tok/sec at the same context. That's the capability/speed combination that makes this model exceptional.
Key benchmarks:
- MMLU-Pro: 85.3 (vs GPT-5-mini 83.7)
- GPQA Diamond: 84.2
- SWE-bench Verified: 69.2
- TAU2-Bench (agentic): 81.2 — highest in the Qwen3.5 family
- Codeforces: 2028
Quants available: BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_XS, Q3_K_M, IQ3_M, IQ2_M, all imatrix-generated. An mmproj file is included for vision. Use the --jinja flag with llama.cpp.
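As a hedged sketch of what a launch could look like, the helper below assembles a llama-server command line with the --jinja flag and the vision projector. The file names and the context size are hypothetical; substitute whatever you actually downloaded. The script only builds and prints the command, it does not run it.

```python
import shlex


def build_llama_server_command(model_path: str, mmproj_path: str,
                               ctx_size: int) -> list[str]:
    """Assemble a llama-server invocation as an argument list."""
    return [
        "llama-server",
        "-m", model_path,          # GGUF model file
        "--mmproj", mmproj_path,   # multimodal projector for vision input
        "--jinja",                 # use the model's built-in chat template
        "-c", str(ctx_size),       # context window in tokens
    ]


command = build_llama_server_command(
    "Qwen3.5-35B-A3B-Q4_K_M.gguf",  # HYPOTHETICAL local file name
    "mmproj-F16.gguf",              # HYPOTHETICAL local file name
    32768,                          # ASSUMPTION: a modest context to start with
)
print(shlex.join(command))
```

Passing the list to subprocess.run(command) would start the server; starting with a smaller context than the full 262K keeps memory use predictable while you test.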
Qwen3.5-27B and Qwen3.5-122B-A10B
Qwen3.5-27B (Dense)
The dense counterpart. Slower than 35B-A3B (~31 tok/sec vs ~90) but with slightly higher scores on some benchmarks, since all 27B parameters are active for every token rather than a ~3B subset. A strong choice if you want the quality ceiling over the speed ceiling.
Qwen3.5-122B-A10B
The large-scale MoE option. MMLU-Pro 86.7, GPQA Diamond 86.6 — pushing toward frontier model quality. Requires multi-GPU local setup or cloud inference. Hosted as the API backing for complex enterprise tasks.
Qwen3.5-35B-A3B Uncensored Aggressive GGUF
Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive
A community-produced fully uncensored version of 35B-A3B. HauhauCS spent 12–16 hours per day over several days on this release. 0 refusals out of 465 tested, zero personality changes — just the original Qwen with safety filters removed. All quants available, imatrix-generated.
This is a community release, not an official Alibaba product. The significance: for researchers, red-teamers, and applications that need unfiltered model behavior, it's the highest-quality uncensored local model currently available. Same architecture, same speed, same quality — without the refusal layer.
GLM-4.7: Zhipu AI's Coding Powerhouse
GLM-4.7 (Zhipu AI)
Zhipu AI's flagship model leads the SWE-bench Verified leaderboard at 74.2%, ahead of Qwen3.5-35B-A3B (69.2%) and DeepSeek-V3.2 (70.2%). 355B total parameters with 32B active. Terminal-Bench 2.0: 46.4%. HLE (Humanity's Last Exam): 42%, a 38% relative improvement over GLM-4.6.
GLM-4.7 is Zhipu AI's answer to DeepSeek's dominance on coding benchmarks. The headline number — 74.2% on SWE-bench Verified — makes it the current state-of-the-art on that benchmark. For context: SWE-bench Verified tests real GitHub issue resolution in open-source Python repos. Getting above 70% is genuinely hard.
The tradeoff vs Qwen3.5-35B-A3B: GLM-4.7 is more compute-heavy (32B active vs 3B), slower per token, and requires significantly more hardware. It's a cloud/multi-GPU model, not a single-GPU local model. If coding quality is the primary criterion and you have the infrastructure, it leads. If you need local deployment, Qwen3.5 wins on efficiency.
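The compute side of that tradeoff can be approximated with the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. This is a rough sketch: it ignores attention cost and memory bandwidth, which often dominates local decoding speed, so treat the ratio as an order-of-magnitude guide rather than a throughput prediction.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params


glm_flops = forward_flops_per_token(32e9)   # GLM-4.7: 32B active (from the article)
qwen_flops = forward_flops_per_token(3e9)   # Qwen3.5-35B-A3B: ~3B active
ratio = glm_flops / qwen_flops
print(f"GLM-4.7 needs roughly {ratio:.1f}x the compute per generated token")
```

By this estimate GLM-4.7 does roughly ten times the arithmetic per token, which fits its positioning as a cloud or multi-GPU model.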
GLM-4.7-Flash is the lighter variant — designed for fast local inference, following the same "Flash" naming as Qwen3.5-Flash. Comparable in the fast-local-MoE category to 35B-A3B, with different benchmark tradeoffs.
Unsloth TTS Fine-Tuning: Audio Enters the Local AI Stack
Unsloth expanded beyond LLMs this week, adding support for fine-tuning Text-to-Speech and Speech-to-Text models with their signature efficiency toolkit: 1.5× faster training and 50% less VRAM than standard implementations that use Flash Attention 2.
Supported models for TTS and STT fine-tuning:
- Sesame-CSM (1B) — the "conversational speech model" that went viral for its naturalness
- Orpheus (3B) — open-source TTS with strong prosody
- Spark-TTS (0.5B) — lightweight, high speed
- Llasa (1B) — another community TTS option
- Oute (1B)
- Whisper Large V3 — STT fine-tuning
Use cases include voice cloning, style adaptation, domain-specific voices, and new-language support. Zero-shot voice cloning captures tone but misses pacing and expression; fine-tuning bridges the gap. Quantized and original weights are available on HuggingFace (unsloth/csm-1b, etc.), and free Colab notebooks are provided for each model.
This is a meaningful expansion: until now, local audio AI meant running inference on pre-trained models. Fine-tuning TTS locally opens up custom voice creation without cloud APIs — useful for agents, content creators, and accessibility applications.
Benchmark Comparison: This Week's Key Models
| Model | Active Params | MMLU-Pro | SWE-bench Verified | Context | Local? |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B | ~3B | 85.3 | 69.2% | 262K | ✅ Single GPU |
| Qwen3.5-27B | 27B | 86.1 | 72.4% | 262K | ⚡ 2× GPU |
| Qwen3.5-122B-A10B | 10B | 86.7 | 72.0% | 262K | ⚠️ Multi-GPU |
| GLM-4.7 | 32B | — | 74.2% | — | ❌ Cloud |
| GPT-5-mini | — | 83.7 | 72.0% | — | ❌ API only |
| GPT-OSS-120B | ~5.1B | 80.8 | 62.0% | — | ⚠️ Open weights |
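
For readers who want to slice the table themselves, here is a small sketch that loads the SWE-bench and MMLU-Pro columns into Python and ranks the models. The numbers are copied from the table above; the single_gpu flag is a simplification that marks only the row labelled "Single GPU".

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    mmlu_pro: Optional[float]   # None where the table shows no score
    swe_bench: float            # SWE-bench score, percent
    single_gpu: bool            # True only for the row marked "Single GPU"


MODELS = [
    Model("Qwen3.5-35B-A3B", 85.3, 69.2, True),
    Model("Qwen3.5-27B", 86.1, 72.4, False),
    Model("Qwen3.5-122B-A10B", 86.7, 72.0, False),
    Model("GLM-4.7", None, 74.2, False),
    Model("GPT-5-mini", 83.7, 72.0, False),
    Model("GPT-OSS-120B", 80.8, 62.0, False),
]

# Rank everything by SWE-bench, then pick the best single-GPU option.
ranked = sorted(MODELS, key=lambda m: m.swe_bench, reverse=True)
best_single_gpu = max((m for m in MODELS if m.single_gpu),
                      key=lambda m: m.swe_bench)

for m in ranked:
    print(f"{m.name:20s} {m.swe_bench:5.1f}%")
print("best single-GPU:", best_single_gpu.name)
```

Swapping the sort key to mmlu_pro (after filtering out the None entry) gives the reasoning-benchmark view instead.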
TL;DR: What to Run This Week
- Best single-GPU local model: Qwen3.5-35B-A3B Q4 — 90 tok/sec, 262K context, genuinely frontier-quality reasoning and coding
- Best coding quality, cloud: GLM-4.7 — 74.2% SWE-bench, leads the leaderboard
- Best large-scale local MoE: Qwen3.5-122B-A10B — multi-GPU but near-frontier quality
- Uncensored local: Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive (same model, no safety filters)
- TTS fine-tuning: Unsloth + Sesame-CSM or Orpheus — local voice cloning now accessible
The pattern this week: local open models are closing the gap with API-only models on most of the benchmarks above. Qwen3.5-35B-A3B, at ~3B active parameters, beats GPT-5-mini on MMLU-Pro (85.3 vs 83.7) while trailing it slightly on SWE-bench Verified (69.2% vs 72.0%), and in one community test Qwen3.5-27B solved a real-world coding task that GPT-5 failed in all three attempts. The efficiency gap that once justified API reliance is shrinking fast.
Sources: Qwen3.5-35B-A3B HuggingFace · Uncensored GGUF HuggingFace · GLM-4.7 SiliconFlow · Unsloth TTS Tweet · r/LocalLLaMA community benchmark