
Week Overview

This week was dominated by one story: the Qwen3.5 family from Alibaba, a set of models ranging from 3B active to 122B total parameters that collectively redraws the local AI capability map. Alongside it, GLM-4.7 from Zhipu AI continued gaining traction, and Unsloth's TTS fine-tuning support opened a new front in local audio AI.

Here's the full picture of what shipped.

The Qwen3.5 Family: Alibaba's Major Release

Qwen3.5 is Alibaba's largest model release since Qwen3, and it's a significant step forward on multiple dimensions simultaneously: architecture efficiency, multimodal capability, context length, language coverage, and agentic performance. The family spans four sizes, each with a distinct sweet spot.

🏗️ Shared Architecture Traits Across the Family

All Qwen3.5 models share the same hybrid attention design: Gated DeltaNet linear attention in 75% of layers (O(N) complexity), with standard softmax attention in the rest. The MoE variants add sparse MoE feed-forward layers; the 27B is dense. Every model natively supports 262K context (extendable to 1M), is multimodal via early fusion, covers 201 languages, and ships under the Apache 2.0 license.

Qwen3.5-35B-A3B: The Local AI Star

Qwen3.5-35B-A3B

35B total / 3B active MoE · 256 experts · 262K context · Multimodal · ~20GB @ Q4 · Apache 2.0

The headliner. 35 billion total parameters, only ~3 billion active per token. Fits on a single RTX 3090/4090 at Q4 and runs at ~90 tok/sec — faster than many 7B dense models. Community benchmarks confirm real-world quality well above its active parameter count.
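The memory figures quoted for Q4 can be sanity-checked with simple arithmetic. The sketch below assumes a round 4.5 bits per weight for a 4-bit quant; that number is an assumption for illustration, not taken from the release, and real quant files vary by type and also need room for the KV cache and runtime overhead. The point it illustrates is that an MoE model must hold all 35B parameters in memory even though only ~3B are active per token.

```python
def quantized_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# ASSUMPTION: ~4.5 bits per weight as a round figure for a 4-bit quant.
ASSUMED_Q4_BITS = 4.5

# Parameter counts come from the article; every parameter must be resident,
# including experts that are inactive for a given token.
moe_gb = quantized_weight_gb(35e9, ASSUMED_Q4_BITS)    # Qwen3.5-35B-A3B
dense_gb = quantized_weight_gb(27e9, ASSUMED_Q4_BITS)  # Qwen3.5-27B

print(f"35B-A3B weights: ~{moe_gb:.1f} GB")
print(f"27B weights:     ~{dense_gb:.1f} GB")
```

Under that assumption the estimates land close to the ~20GB and ~16GB figures quoted in this post, which is why the 35B-A3B fits on a 24GB card at Q4.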

The architecture is the story here. Each MoE layer has 256 experts (unusually high granularity), with 8 routed experts plus 1 shared expert active per token. Three out of every four attention layers use Gated DeltaNet, a linear attention variant that avoids the O(N²) scaling that makes long contexts expensive. The remaining one layer in four uses standard softmax attention, which provides global context sensitivity where needed.
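To make the routing and layer layout concrete, here is a minimal sketch in plain Python. It is not Qwen's implementation: the exact position of the softmax layer within each group of four and the choice to softmax-normalize over the selected experts are assumptions, and the function names are made up for illustration. Only the counts (256 experts, 8 routed, 3-in-4 linear attention layers) come from the article.

```python
import math
import random

NUM_EXPERTS = 256  # experts per MoE layer (from the article)
TOP_K = 8          # routed experts per token (from the article)


def layer_types(num_layers: int) -> list[str]:
    """ASSUMED ordering: three DeltaNet layers, then one softmax layer."""
    return ["softmax" if (i + 1) % 4 == 0 else "deltanet"
            for i in range(num_layers)]


def route(router_logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and normalize their weights.

    The shared expert is not routed: it would run for every token in
    addition to the k experts returned here.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route(logits)
layers = layer_types(8)

print(layers)
print(sorted(weights))  # indices of the 8 experts chosen for this token
```

Only 8 of the 256 routed experts (plus the shared one) do any work for a given token, which is how a 35B-parameter model ends up with roughly 3B active parameters.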

The practical result: 262K context works natively without tricks. The model handles full repository-scale codebases, long agent conversations with tool call history, and multi-document analysis without the memory explosion that standard attention would require.
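One way to see why the hybrid design helps at long context is to estimate the key/value cache, which grows with context length only in the softmax attention layers; linear attention layers keep a fixed-size state instead. The layer count, KV head count, and head dimension below are hypothetical placeholders chosen for round numbers, not Qwen3.5's real configuration.

```python
def kv_cache_gib(context_tokens: int, caching_layers: int,
                 kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache size in GiB; the leading 2 covers keys and values."""
    total_bytes = (2 * context_tokens * caching_layers
                   * kv_heads * head_dim * bytes_per_value)
    return total_bytes / 2**30


CONTEXT = 262_144  # 262K context, from the article

# ASSUMPTION: hypothetical model dimensions, for illustration only.
LAYERS = 40
KV_HEADS = 4
HEAD_DIM = 128

# Every layer uses softmax attention and therefore caches K/V.
all_softmax_gib = kv_cache_gib(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)

# Only one layer in four uses softmax attention.
hybrid_gib = kv_cache_gib(CONTEXT, LAYERS // 4, KV_HEADS, HEAD_DIM)

print(f"all softmax: {all_softmax_gib:.1f} GiB")
print(f"hybrid:      {hybrid_gib:.1f} GiB")
```

Whatever the real dimensions are, the ratio is what matters: caching K/V in a quarter of the layers cuts the context-dependent memory to a quarter.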

The real-world test that got community attention: on a complex PDF merger app task (dark GUI, drag-and-drop, venv isolation, .bat installer), Qwen3.5-27B dense solved it in 3 outputs at 31 tok/sec. GPT-5 failed all three attempts. Qwen3.5-35B-A3B Q4 ran at 90 tok/sec at the same context. That's the capability/speed combination that makes this model exceptional.

Key benchmarks: MMLU-Pro 85.3 and SWE-bench Verified 69.2%.

Quants available: BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, IQ4_XS, Q3_K_M, IQ3_M, IQ2_M, all imatrix-generated. An mmproj file is included for vision. Use the --jinja flag with llama.cpp.
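As a rough illustration of how those pieces fit together, the sketch below assembles a llama.cpp `llama-server` command line in Python. The GGUF file names are hypothetical placeholders (the actual file names in the release are not given here), and the helper function is invented for this example; the flags themselves (`-m`, `-c`, `--jinja`, `--mmproj`) are standard llama.cpp options.

```python
import shlex
from typing import Optional


def llama_server_command(model_path: str,
                         mmproj_path: Optional[str] = None,
                         ctx_size: int = 262_144) -> list[str]:
    """Build an argv list for llama-server (hypothetical helper)."""
    cmd = [
        "llama-server",
        "-m", model_path,      # quantized GGUF weights
        "-c", str(ctx_size),   # context window in tokens
        "--jinja",             # use the model's built-in chat template
    ]
    if mmproj_path:
        cmd += ["--mmproj", mmproj_path]  # vision projector for image input
    return cmd


# ASSUMPTION: placeholder file names, substitute the files you downloaded.
cmd = llama_server_command("qwen3.5-35b-a3b-Q4_K_M.gguf",
                           "mmproj-qwen3.5-35b-a3b.gguf")
print(shlex.join(cmd))
```

A full 262K context may not fit alongside the weights on a 24GB card, so in practice a smaller `ctx_size` is a reasonable starting point.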

Qwen3.5-27B and Qwen3.5-122B-A10B

Qwen3.5-27B (Dense)

27B dense · 262K context · Multimodal · ~16GB @ Q4

The dense counterpart. It is slower than 35B-A3B (~31 tok/sec vs ~90) but scores slightly higher on some benchmarks, since all 27B parameters are active on every token. A strong choice if you want the quality ceiling over the speed ceiling.

Qwen3.5-122B-A10B

122B total / 10B active MoE · large scale · 262K context · Multi-GPU / cloud

The large-scale MoE option. MMLU-Pro 86.7 and GPQA Diamond 86.6 push it toward frontier-model quality. It requires a multi-GPU local setup or cloud inference, and is also available as a hosted API for complex enterprise tasks.

Qwen3.5-35B-A3B Uncensored Aggressive GGUF

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

0/465 refusals · Zero capability loss · Community release · HauhauCS / HuggingFace

A community-produced fully uncensored version of 35B-A3B. HauhauCS spent 12–16 hours per day over several days on this release. 0 refusals out of 465 tested, zero personality changes — just the original Qwen with safety filters removed. All quants available, imatrix-generated.

This is a community release, not an official Alibaba product. The significance: for researchers, red-teamers, and applications that need unfiltered model behavior, it's the highest-quality uncensored local model currently available. Same architecture, same speed, same quality — without the refusal layer.

⚠️ Use Case Note

Uncensored models have legitimate research and application uses, but require responsible deployment. The model has no safety filters by design, so operator-level safeguards become the deployer's responsibility.

GLM-4.7: Zhipu AI's Coding Powerhouse

GLM-4.7 (Zhipu AI)

355B total / 32B active MoE · SWE-bench 74.2% · Cloud / multi-GPU

Zhipu AI's flagship model leads the SWE-bench Verified leaderboard at 74.2% — ahead of Qwen3.5-35B-A3B (69.2%) and DeepSeek-V3.2 (70.2%). 355B total parameters with 32B active. Terminal-Bench 2.0: 46.4%. HLE (Humanity's Last Exam): 42%, a 38% improvement over GLM-4.6.

GLM-4.7 is Zhipu AI's answer to DeepSeek's dominance on coding benchmarks. The headline number — 74.2% on SWE-bench Verified — makes it the current state-of-the-art on that benchmark. For context: SWE-bench Verified tests real GitHub issue resolution in open-source Python repos. Getting above 70% is genuinely hard.

The tradeoff vs Qwen3.5-35B-A3B: GLM-4.7 is more compute-heavy (32B active vs 3B), slower per token, and requires significantly more hardware. It's a cloud/multi-GPU model, not a single-GPU local model. If coding quality is the primary criterion and you have the infrastructure, it leads. If you need local deployment, Qwen3.5 wins on efficiency.
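The size of that tradeoff can be put in rough numbers. The sketch below uses the common rule of thumb of about 2 FLOPs per active parameter per generated token, and an assumed 4.5 bits per weight for a 4-bit quant; both are approximations for illustration, not measurements of either model.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params


def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate quantized weight size in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# Parameter counts from the article.
GLM_ACTIVE, GLM_TOTAL = 32e9, 355e9
QWEN_ACTIVE, QWEN_TOTAL = 3e9, 35e9

# ASSUMPTION: ~4.5 bits per weight for a 4-bit quant.
ASSUMED_BITS = 4.5

compute_ratio = (forward_flops_per_token(GLM_ACTIVE)
                 / forward_flops_per_token(QWEN_ACTIVE))
glm_gb = weight_gb(GLM_TOTAL, ASSUMED_BITS)
qwen_gb = weight_gb(QWEN_TOTAL, ASSUMED_BITS)

print(f"compute per token: ~{compute_ratio:.1f}x more for GLM-4.7")
print(f"weights at ~Q4:    ~{glm_gb:.0f} GB vs ~{qwen_gb:.0f} GB")
```

On these estimates GLM-4.7 needs roughly ten times the compute per token and about ten times the weight memory, which is what puts it out of reach of a single consumer GPU.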

GLM-4.7-Flash is the lighter variant — designed for fast local inference, following the same "Flash" naming as Qwen3.5-Flash. Comparable in the fast-local-MoE category to 35B-A3B, with different benchmark tradeoffs.

Unsloth TTS Fine-Tuning: Audio Enters the Local AI Stack

Unsloth expanded beyond LLMs this week, adding support for fine-tuning Text-to-Speech and Speech-to-Text models with their signature efficiency toolkit: 1.5× faster training and 50% less VRAM compared to standard implementations that use Flash Attention 2.

Supported models for TTS fine-tuning include Sesame-CSM and Orpheus.

The use case is voice cloning, style adaptation, domain-specific voice, and new language support. Zero-shot voice cloning captures tone but misses pacing and expression — fine-tuning bridges the gap. Quantized + original weights are available on HuggingFace (unsloth/csm-1b, etc.) and free Colab notebooks are provided for each model.

This is a meaningful expansion: until now, local audio AI meant running inference on pre-trained models. Fine-tuning TTS locally opens up custom voice creation without cloud APIs — useful for agents, content creators, and accessibility applications.

Benchmark Comparison: This Week's Key Models

Model | Active Params | MMLU-Pro | SWE-bench | Context | Local?
Qwen3.5-35B-A3B | ~3B | 85.3 | 69.2% | 262K | ✅ Single GPU
Qwen3.5-27B | 27B | 86.1 | 72.4% | 262K | ⚡ 2× GPU
Qwen3.5-122B-A10B | 10B | 86.7 | 72.0% | 262K | ⚠️ Multi-GPU
GLM-4.7 | 32B | - | 74.2% | - | ❌ Cloud
GPT-5-mini | - | 83.7 | 72.0% | - | ❌ API only
GPT-OSS-120B | - | 80.8 | 62.0% | - | ❌ API only
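The comparison can also be treated as data and queried by constraint. The sketch below encodes the SWE-bench and deployment columns exactly as listed in this post; the `Row` type, the deployment labels, and the helper function are invented for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    swe_bench: float  # percent, as listed in this post
    deployment: str   # label derived from the "Local?" column


ROWS = [
    Row("Qwen3.5-35B-A3B", 69.2, "single-gpu"),
    Row("Qwen3.5-27B", 72.4, "2x-gpu"),
    Row("Qwen3.5-122B-A10B", 72.0, "multi-gpu"),
    Row("GLM-4.7", 74.2, "cloud"),
    Row("GPT-5-mini", 72.0, "api"),
    Row("GPT-OSS-120B", 62.0, "api"),
]


def best_by_swe_bench(rows: list[Row], allowed: set[str]) -> Row:
    """Return the highest SWE-bench row among the allowed deployments."""
    candidates = [r for r in rows if r.deployment in allowed]
    return max(candidates, key=lambda r: r.swe_bench)


single_gpu = best_by_swe_bench(ROWS, {"single-gpu"})
any_local = best_by_swe_bench(ROWS, {"single-gpu", "2x-gpu", "multi-gpu"})
overall = best_by_swe_bench(ROWS, {r.deployment for r in ROWS})

print(single_gpu.model, any_local.model, overall.model)
```

Ranking on a single benchmark is a simplification; throughput, memory, and context length matter just as much when choosing what to run.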

TL;DR: What to Run This Week

🎯 Decision Guide
  • Best single-GPU local model: Qwen3.5-35B-A3B Q4 — 90 tok/sec, 262K context, genuinely frontier-quality reasoning and coding
  • Best coding quality, cloud: GLM-4.7 — 74.2% SWE-bench, leads the leaderboard
  • Best large-scale local MoE: Qwen3.5-122B-A10B — multi-GPU but near-frontier quality
  • Uncensored local: Qwen3.5-35B-A3B-Uncensored-Aggressive — same model, no safety filters
  • TTS fine-tuning: Unsloth + Sesame-CSM or Orpheus — local voice cloning now accessible

The pattern this week: local open models are closing the gap with API-only models on the benchmarks covered here. Qwen3.5-35B-A3B, with only ~3B active parameters, is competitive with GPT-5-mini (85.3 vs 83.7 on MMLU-Pro, 69.2% vs 72.0% on SWE-bench), and in the community test described above Qwen3.5-27B solved a real-world coding task that GPT-5 failed. The efficiency gap that once justified API reliance is shrinking fast.


Sources: Qwen3.5-35B-A3B HuggingFace · Uncensored GGUF HuggingFace · GLM-4.7 SiliconFlow · Unsloth TTS Tweet · r/LocalLLaMA community benchmark