Bug: Qwen3-Coder Errors with vLLM 0.11.0 Tool Calling
It appears there's a snag when using Qwen3-Coder with vLLM 0.11.0 for tool calling, specifically when employing the `hermes` tool parser. The issue manifests as a flood of errors, particularly `JSONDecodeError` exceptions, which we'll dive into. Let's break down the problem, the environment, and potential solutions.
Understanding the Issue
The core problem lies in how vLLM 0.11.0 and Qwen3-Coder interact when function calling is enabled with the `hermes` parser. While the same setup works flawlessly with Qwen3, Qwen3-Coder stumbles, throwing a barrage of `JSONDecodeError` exceptions. These errors suggest that the `hermes` parser struggles to correctly interpret the tool calls generated by Qwen3-Coder: the model outputs something that `json.loads()` can't handle, and parsing fails.
To put it simply, the `hermes` tool parser is designed to extract tool calls from the model's response, typically by identifying JSON structures within the text. When used with Qwen3-Coder, however, the parser encounters responses that don't conform to the expected JSON format, resulting in the `JSONDecodeError`. This points to a mismatch between the expected and actual output format of the model, likely because Qwen3 and Qwen3-Coder are trained to emit tool calls differently.
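To make the mismatch concrete, here is a minimal sketch. The `hermes` parser expects a JSON object; the "coder-style" payload below is an illustrative assumption of an XML-like function block (verify the real shape against your own raw output), and it fails exactly the way the article describes:

```python
import json

# What the hermes parser expects to decode: a plain JSON object.
hermes_style = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(json.loads(hermes_style)["name"])  # → get_weather

# An XML-style payload (purely illustrative) is not valid JSON,
# so json.loads raises the same class of error seen in the logs.
coder_style = '\n<function=get_weather>\n<parameter=city>Paris</parameter>\n</function>'
try:
    json.loads(coder_style)
except json.JSONDecodeError as e:
    print(f"JSONDecodeError: {e}")
```

The point is not that Qwen3-Coder uses exactly these tags, but that any non-JSON framing will blow up a parser that assumes JSON.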
Environment Details
Let's take a look at the environment where this issue is occurring. The system is running Ubuntu 22.04.4 LTS with a fairly standard development setup, including:
- OS: Ubuntu 22.04.4 LTS (x86_64)
- Python: 3.10.18
- PyTorch: 2.8.0+cu128 (CUDA 12.8)
- vLLM: 0.11.0
- GPUs: 8x NVIDIA A800-SXM4-80GB
- CUDA Driver Version: 525.105.17
Notably, the setup includes several NVIDIA libraries (cublas, cuda, cudnn, etc.) at version 12, aligning with the CUDA version used by PyTorch. The vLLM version is 0.11.0, built without specific CUDA architectures set. The system has 128 CPUs and is configured with two NUMA nodes.
This environment is quite powerful, equipped with top-tier GPUs and ample CPU resources, so the hardware itself isn't the bottleneck. The issue stems from the software configuration, particularly the interaction between vLLM, Qwen3-Coder, and the `hermes` tool parser.
Diving into the Error
The error messages themselves provide valuable clues. Here's a breakdown of the traceback:
```
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157] Error in extracting tool call from response.
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157] Traceback (most recent call last):
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/site-packages/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py", line 134, in extract_tool_calls
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     raw_function_calls = [
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/site-packages/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py", line 135, in <listcomp>
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     json.loads(match[0] if match[0] else match[1])
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/json/__init__.py", line 346, in loads
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     return _default_decoder.decode(s)
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/json/decoder.py", line 337, in decode
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/json/decoder.py", line 355, in raw_decode
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     raise JSONDecodeError("Expecting value", s, err.value) from None
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157] json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
```
The key line is `json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)`. This indicates that the `json.loads()` call inside `hermes_tool_parser.py` is being handed a string that isn't valid JSON. "Line 2, column 1 (char 1)" means the decoder skipped a leading newline, landed at the second character, and found something other than a JSON value there.
The `hermes_tool_parser.py` script extracts candidate tool calls with regular expressions and then parses the extracted strings as JSON. The error implies that either the regular expression is extracting the wrong span, or Qwen3-Coder is generating tool calls that are not formatted as JSON at all.
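You can reproduce the exact error message in isolation: `line 2 column 1 (char 1)` is what `json.loads` reports whenever the string begins with a newline followed by non-JSON text (the `<function=example>` payload is just a stand-in):

```python
import json

# A leading newline followed by non-JSON text reproduces the exact error:
# the decoder skips the whitespace, lands at line 2, column 1 (char 1),
# and finds no JSON value there.
try:
    json.loads("\n<function=example>")
except json.JSONDecodeError as e:
    print(e)  # → Expecting value: line 2 column 1 (char 1)
```

This strongly suggests the regex is capturing a span whose first line is empty and whose second line is not JSON.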
Possible Solutions and Workarounds
Given the information, here's a strategic approach to tackle this issue:
1. Inspect Model Output: The most crucial step is to directly examine the raw output from Qwen3-Coder when it generates tool calls. Print the raw response before it's passed to the `hermes` parser; this reveals the exact format of the tool calls and any malformations.

   ```python
   # Inside vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py, before json.loads
   print("Raw model output:", match[0] if match[0] else match[1])
   ```

   Observing the `match[0]` or `match[1]` content shows exactly what the `hermes` parser is attempting to decode.
2. Adjust the Regular Expressions: If the model's output consistently deviates from the expected JSON format, the regular expressions in `hermes_tool_parser.py` may need adjustment. The script uses regex to identify potential JSON blobs, so:
   - Examine the existing regex patterns in `hermes_tool_parser.py`.
   - Modify the patterns to accurately capture the structure Qwen3-Coder actually generates.
   - Test the updated patterns against the raw model output obtained in step 1.
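As an illustration of that approach, the sketch below uses a tolerant extraction pattern. The tag name and regex are assumptions for demonstration, not vLLM's actual patterns; adapt them to whatever the raw output from step 1 shows:

```python
import json
import re

# Illustrative pattern: capture either a JSON object wrapped in
# <tool_call>...</tool_call> tags (group 1) or a bare {"name"...}} blob (group 2),
# mirroring the match[0]/match[1] pair seen in the traceback.
TOOL_CALL_RE = re.compile(
    r'<tool_call>\s*(\{.*?\})\s*</tool_call>|(\{"name".*?\}\})',
    re.DOTALL,
)

def extract_tool_calls(text: str) -> list[dict]:
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        blob = match.group(1) or match.group(2)
        try:
            calls.append(json.loads(blob))
        except json.JSONDecodeError:
            # Log and skip malformed blobs instead of crashing the request.
            print(f"Skipping malformed tool call: {blob!r}")
    return calls

print(extract_tool_calls('<tool_call>{"name": "ls", "arguments": {}}</tool_call>'))
# → [{'name': 'ls', 'arguments': {}}]
```

Skipping malformed blobs (rather than raising) keeps one bad generation from failing the whole response; whether that is acceptable depends on your application.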
3. Implement Custom Parsing Logic: If regex adjustments prove insufficient, consider custom parsing logic tailored to Qwen3-Coder's output. This might involve string manipulation, conditional checks, and more robust error handling.
   - Create a new function that takes the raw model output as input.
   - Implement logic to identify and extract the relevant parts of the output.
   - Parse the extracted parts into a JSON-compatible format.
   - Replace the `json.loads()` call with your custom parsing function.
4. Try a Different Tool Parser: While `hermes` is causing issues, explore the alternative tool parsers that ship with vLLM. One of them may align better with the output format of Qwen3-Coder.
   - Modify the `vllm serve` command to use a different `--tool-call-parser` option.
   - Evaluate the behavior and error rate with each parser.
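For example, recent vLLM releases document model-specific parsers for the Qwen3-Coder family. The parser name and model ID below are assumptions; confirm both against the `--tool-call-parser` choices listed by `vllm serve --help` for your installed version:

```shell
# Hypothetical invocation: swap the hermes parser for a Qwen3-Coder-specific one.
# Verify the parser name against what your vLLM build actually supports.
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
```

If a dedicated parser exists for your vLLM version, it is the most likely fix, since it is written for the exact tool-call syntax the model was trained to emit.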
5. Update vLLM: Ensure you're using the latest version of vLLM; updates often include bug fixes and improvements to tool parsing. Check the vLLM GitHub repository for newer releases and upgrade if necessary:

   ```shell
   pip install -U vllm
   ```
6. Check the Model Card: Review the Qwen3-Coder model card on Hugging Face. It may contain specific instructions or recommendations regarding tool calling and expected output formats; pay close attention to any examples or guidelines provided by the model developers.
7. Report the Issue: If none of the above solutions work, report the issue to the vLLM or Qwen3-Coder developers. Provide detailed information about your environment, the error messages, and the troubleshooting steps you've taken; this will help them identify the root cause and develop a fix.
Code Example: Custom Parsing (Illustrative)
The `<tool_code>` tag below is purely illustrative; substitute whatever delimiters the raw output from step 1 actually shows.

```python
import json
import re


def custom_parse_qwen3_coder_tool_call(model_output: str) -> dict | None:
    """Custom parser for Qwen3-Coder tool call output (illustrative)."""
    try:
        # Example: extract content between <tool_code> and </tool_code> tags
        match = re.search(r"<tool_code>(.*?)</tool_code>", model_output, re.DOTALL)
        if match:
            tool_code = match.group(1).strip()
            # Attempt to parse the extracted code as JSON
            return json.loads(tool_code)
        print("No tool code found in the output.")
        return None
    except json.JSONDecodeError as e:
        print(f"JSONDecodeError: {e}")
        return None


# Example usage (replace with your actual model output)
model_output = 'Some text here <tool_code>{"tool": "my_tool", "param": "value"}</tool_code> more text'
parsed_tool_call = custom_parse_qwen3_coder_tool_call(model_output)
if parsed_tool_call:
    print("Parsed tool call:", parsed_tool_call)
```
Important Considerations:
- Model Training: The core of the issue might be rooted in how Qwen3-Coder was trained. If the model consistently produces malformed JSON, it might require fine-tuning or adjustments to its training data.
- Edge Cases: Be prepared to handle edge cases and unexpected output formats. Robust error handling is crucial to prevent the parser from crashing.
- Security: When parsing JSON from untrusted sources, be mindful of security vulnerabilities. Sanitize the input and validate the parsed data to prevent potential attacks.
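As a minimal sketch of that validation step, the required-key schema below (`name` as a string, `arguments` as a dict) is an assumption about your tool-call shape; adjust it to match your actual tool definitions:

```python
def validate_tool_call(parsed: object) -> bool:
    """Reject parsed tool calls that don't match the expected minimal shape."""
    if not isinstance(parsed, dict):
        return False
    # Require a string tool name and a dict of arguments; anything else
    # is treated as malformed rather than passed downstream.
    if not isinstance(parsed.get("name"), str):
        return False
    if not isinstance(parsed.get("arguments"), dict):
        return False
    return True

print(validate_tool_call({"name": "ls", "arguments": {}}))        # → True
print(validate_tool_call({"name": "ls", "arguments": "rm -rf"}))  # → False
```

Validating structure before dispatching a tool call is cheap insurance against both malformed model output and injection-style payloads.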
By systematically investigating the model's output, adjusting the parsing logic, and staying current with vLLM releases, you can overcome this issue and successfully enable tool calling with Qwen3-Coder.
Good luck, and let me know if you have any more questions!