fix(client): vLLM uses chat_template_kwargs to toggle reasoning, not the Anthropic field

`apply_reasoning_effort`'s vLLM branch was injecting
`thinking: {type: "disabled"}` at the top of the request body to
turn off model reasoning. But vLLM implements OpenAI's
chat-completions protocol and does not recognize Anthropic-native
extension fields, so it silently ignored that directive. The model
still emitted a full hidden reasoning trace into the non-standard
`reasoning` field (which this client does not surface), and users
saw a ~13-second perceived freeze before the first content token
arrived.
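
To make the freeze concrete, here is a minimal sketch of a stream in
which reasoning deltas are dropped by the client. `Delta` and
`deltas_before_first_visible` are hypothetical names invented for this
illustration; they are not this client's real types.

```rust
/// One streamed delta from an OpenAI-style chat-completions response.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Delta {
    /// Text carried in the non-standard `reasoning` field (never shown).
    Reasoning(String),
    /// Text carried in the standard `content` field (shown to the user).
    Content(String),
}

/// Counts the deltas that arrive before the first user-visible one.
/// Returns `None` if the stream never produces visible content.
pub fn deltas_before_first_visible(stream: &[Delta]) -> Option<usize> {
    stream.iter().position(|d| matches!(d, Delta::Content(_)))
}
```

With thinking left on, every `Reasoning` delta counts toward the wait;
with thinking off, the first delta is already `Content`.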

The vLLM branch now emits `chat_template_kwargs.enable_thinking`,
an extension to the OpenAI request schema that vLLM forwards into
the model's chat template. This is the canonical way to toggle
Qwen3's `<think>` mode; other reasoning models served via vLLM
(DeepSeek-R1 among them) respond to it only if their chat template
reads that variable. End-to-end measurement against vLLM hosting
Qwen3.6-35B-A3B-FP8:

  - TTFT:           13039ms → 274ms
  - Total LLM call: 13s     → 5.7s
  - Output rate:    3 ch/s  → 46 ch/s

The `high` / `max` reasoning levels likewise route through
`chat_template_kwargs` so the toggle is consistent across effort
levels. No change for any non-vLLM provider (NVIDIA NIM continues
to use the NVIDIA-specific `chat_template_kwargs.thinking` key;
Anthropic-native providers keep the Anthropic-native field).
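
The routing described above can be sketched without any JSON crate as
a function from (provider, effort) to the body fields that get set.
This is a simplified model, not the real `apply_reasoning_effort`:
`Provider`, `Effort`, and `reasoning_fields` are hypothetical names,
only three providers are modelled, and values are raw JSON strings.

```rust
/// Hypothetical subset of providers, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Provider {
    Vllm,
    Sglang,
    Openai,
}

/// Hypothetical effort levels mirroring off / high / max.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Effort {
    Off,
    High,
    Max,
}

/// Returns the (key, raw JSON value) pairs to set on the request body.
pub fn reasoning_fields(provider: Provider, effort: Effort) -> Vec<(&'static str, String)> {
    // `None` means reasoning is switched off; `Some` carries the level name.
    let level = match effort {
        Effort::Off => None,
        Effort::High => Some("high"),
        Effort::Max => Some("max"),
    };
    match (provider, level) {
        // No-op arm: nothing is added for plain OpenAI.
        (Provider::Openai, _) => Vec::new(),
        // vLLM: toggle via the chat template, not the `thinking` field.
        (Provider::Vllm, None) => vec![(
            "chat_template_kwargs",
            r#"{"enable_thinking":false}"#.to_string(),
        )],
        (Provider::Vllm, Some(l)) => vec![
            ("chat_template_kwargs", r#"{"enable_thinking":true}"#.to_string()),
            ("reasoning_effort", format!("\"{l}\"")),
        ],
        // Sglang keeps the pre-existing `thinking` field in this commit.
        (Provider::Sglang, None) => vec![("thinking", r#"{"type":"disabled"}"#.to_string())],
        (Provider::Sglang, Some(l)) => vec![
            ("reasoning_effort", format!("\"{l}\"")),
            ("thinking", r#"{"type":"enabled"}"#.to_string()),
        ],
    }
}
```

Keeping vLLM in its own arm means a later fix for Sglang or another
server can add an arm without touching the vLLM behaviour.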

Resolved a 3-way merge conflict against the v0.8.32 AtlasCloud
harvest (PR #1436) so AtlasCloud stays in the no-op match arm
alongside OpenAI / Ollama while the new vLLM arm gets its own
branch. Note for future Sglang / Fireworks / Novita work: those
servers likely have the same bug, but each has its own
`chat_template_kwargs` schema, so this PR is intentionally limited
to the verified vLLM fix.

Harvested from PR #1480 by @h3c-hexin

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hunter Bown
2026-05-12 01:25:16 -05:00
parent f60a77e191
commit dcc2c448eb
2 changed files with 47 additions and 6 deletions
+18
@@ -16,6 +16,24 @@ real world uses."
### Fixed
+- **vLLM provider: `reasoning_effort = "off"` now actually
+disables thinking on Qwen3 / DeepSeek-R1 servers, cutting
+TTFT from ~13s to ~270ms** (harvested from PR #1480 by
+**@h3c-hexin**). The vLLM branch of `apply_reasoning_effort`
+was injecting `thinking: {type: "disabled"}` at the top of
+the request body — but vLLM speaks OpenAI's chat-completions
+protocol, not Anthropic-native fields, and silently ignored
+the directive. The model then emitted a full hidden reasoning
+trace into the non-standard `reasoning` field (which this
+client doesn't surface), so users saw a multi-second freeze
+before any content token arrived. The vLLM branch now emits
+the OpenAI extension `chat_template_kwargs.enable_thinking`
+(which vLLM forwards into the model's chat template — the
+canonical way to toggle Qwen3's `<think>...</think>` mode).
+Measurement against vLLM + Qwen3.6-35B-A3B-FP8: TTFT
+13039ms → 274ms, total LLM call 13s → 5.7s. The `high` /
+`max` effort levels likewise switch to the OpenAI extension.
+No change for non-vLLM providers.
- **`/sessions` picker no longer shows `<turn_meta>` as the
session title** (harvested from PR #1498 by **@wdw8276**).
`session_manager::create_saved_session_with_id_and_mode`
+29 -6
@@ -888,10 +888,23 @@ pub(super) fn apply_reasoning_effort(
| ApiProvider::Openrouter
| ApiProvider::Novita
| ApiProvider::Fireworks
-| ApiProvider::Sglang
-| ApiProvider::Vllm => {
+| ApiProvider::Sglang => {
+body["thinking"] = json!({ "type": "disabled" });
+}
+// vLLM is an OpenAI-protocol server, not an Anthropic-protocol one.
+// For Qwen3 / DeepSeek-R1 / other reasoning models hosted via vLLM,
+// the canonical OpenAI extension to disable thinking is
+// `chat_template_kwargs.enable_thinking`. The old
+// `thinking: {type: disabled}` field is Anthropic-native and
+// silently ignored by vLLM — the model still emits a full
+// reasoning trace into the `reasoning` field (which this client
+// doesn't surface), causing 10+ seconds of perceived "freeze"
+// before the first content token (PR #1480 by @h3c-hexin).
+ApiProvider::Vllm => {
+body["chat_template_kwargs"] = json!({
+"enable_thinking": false,
+});
+}
ApiProvider::Openai | ApiProvider::Atlascloud | ApiProvider::Ollama => {}
ApiProvider::NvidiaNim => {
body["chat_template_kwargs"] = json!({
@@ -905,11 +918,16 @@ pub(super) fn apply_reasoning_effort(
| ApiProvider::Openrouter
| ApiProvider::Novita
| ApiProvider::Fireworks
-| ApiProvider::Sglang
-| ApiProvider::Vllm => {
+| ApiProvider::Sglang => {
body["reasoning_effort"] = json!("high");
body["thinking"] = json!({ "type": "enabled" });
}
+ApiProvider::Vllm => {
+body["chat_template_kwargs"] = json!({
+"enable_thinking": true,
+});
+body["reasoning_effort"] = json!("high");
+}
ApiProvider::Openai | ApiProvider::Atlascloud | ApiProvider::Ollama => {}
ApiProvider::NvidiaNim => {
body["chat_template_kwargs"] = json!({
@@ -924,11 +942,16 @@ pub(super) fn apply_reasoning_effort(
| ApiProvider::Openrouter
| ApiProvider::Novita
| ApiProvider::Fireworks
-| ApiProvider::Sglang
-| ApiProvider::Vllm => {
+| ApiProvider::Sglang => {
body["reasoning_effort"] = json!("max");
body["thinking"] = json!({ "type": "enabled" });
}
+ApiProvider::Vllm => {
+body["chat_template_kwargs"] = json!({
+"enable_thinking": true,
+});
+body["reasoning_effort"] = json!("max");
+}
ApiProvider::Openai | ApiProvider::Atlascloud | ApiProvider::Ollama => {}
ApiProvider::NvidiaNim => {
body["chat_template_kwargs"] = json!({