Bug: Qwen3-Coder Errors With vLLM 0.11.0 Tool Calling


It appears there's a snag when using Qwen3-Coder with vLLM 0.11.0 for tool calling, specifically when employing the hermes tool parser. The issue manifests as a large number of errors, particularly JSONDecodeError exceptions, which we'll dive into. Let's break down the problem, the environment, and potential solutions.

Understanding the Issue

The core problem lies in how vLLM 0.11.0 and Qwen3-Coder interact when function calling is enabled with the hermes parser. While the same setup works flawlessly with Qwen3, Qwen3-Coder stumbles, throwing a barrage of JSONDecodeError exceptions. These errors suggest the hermes parser can't interpret the tool calls generated by Qwen3-Coder as JSON: the model outputs something that json.loads() can't handle, and extraction fails.

To put it simply, the hermes tool parser is designed to extract tool calls from the model's response, typically by identifying JSON structures within the text. However, when used with Qwen3-Coder, the parser encounters responses that don't conform to the expected JSON format, resulting in the JSONDecodeError. This indicates a mismatch between the expected and actual output format of the model, likely due to differences in how Qwen3 and Qwen3-Coder are trained to generate tool calls.
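For reference, a hermes-style tool call wraps a single JSON object in <tool_call> tags. Here's a minimal sketch of that shape, using a made-up get_weather tool rather than actual Qwen3-Coder output:

import json

# A minimal sketch of what the hermes parser expects: one JSON object with
# "name" and "arguments" keys inside <tool_call> tags.
response = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Berlin"}}\n</tool_call>'

# The parser strips the tags and hands the payload to json.loads():
payload = response.removeprefix("<tool_call>").removesuffix("</tool_call>").strip()
print(json.loads(payload))  # {'name': 'get_weather', 'arguments': {'city': 'Berlin'}}

# If the model emits anything else between the tags (or no tags at all),
# json.loads() raises the JSONDecodeError seen in the logs.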

Environment Details

Let's take a look at the environment where this issue is occurring. The system is running Ubuntu 22.04.4 LTS with a fairly standard development setup, including:

  • OS: Ubuntu 22.04.4 LTS (x86_64)
  • Python: 3.10.18
  • PyTorch: 2.8.0+cu128 (CUDA 12.8)
  • vLLM: 0.11.0
  • GPUs: 8x NVIDIA A800-SXM4-80GB
  • CUDA Driver Version: 525.105.17

Notably, the setup includes several NVIDIA libraries (cublas, cuda, cudnn, etc.) at version 12, aligning with the CUDA version used by PyTorch. The vLLM version is 0.11.0, built without specific CUDA architectures set. The system has 128 CPUs and is configured with two NUMA nodes.

This environment is quite powerful, equipped with top-tier GPUs and ample CPU resources, suggesting that the hardware itself isn't the bottleneck. The issue seems to stem from the software configuration, particularly the interaction between vLLM, Qwen3-Coder, and the hermes tool parser.

Diving into the Error

The error messages themselves provide valuable clues. Here's a breakdown of the traceback:

(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157] Error in extracting tool call from response.
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157] Traceback (most recent call last):
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/site-packages/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py", line 134, in extract_tool_calls
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     raw_function_calls = [
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/site-packages/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py", line 135, in <listcomp>
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     json.loads(match[0] if match[0] else match[1])
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/json/__init__.py", line 346, in loads
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     return _default_decoder.decode(s)
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/json/decoder.py", line 337, in decode
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]   File "/opt/conda/envs/vllm_py310/lib/python3.10/json/decoder.py", line 355, in raw_decode
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157]     raise JSONDecodeError("Expecting value", s, err.value) from None
(APIServer pid=85343) ERROR 10-09 09:39:54 [hermes_tool_parser.py:157] json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

The key line is json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1). This indicates that the json.loads() call inside hermes_tool_parser.py was handed a string that isn't valid JSON. Note that the line and column refer to the string being parsed, not to a source file: char 1 is the second character, so the captured text evidently starts with a newline, and whatever follows it is not a JSON value.
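You can reproduce the exact message in isolation. Assuming the captured text begins with a newline followed by non-JSON content (an assumption, but it is the simplest string consistent with those offsets):

import json

# Reproduces the exact error from the logs: the decoder skips the leading
# newline, then fails at line 2, column 1 (character index 1).
try:
    json.loads("\n<not json>")
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 2 column 1 (char 1)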

The hermes_tool_parser.py script attempts to extract tool calls using regular expressions and then parses the extracted strings as JSON. The error implies that the regular expression might be incorrectly extracting the tool call, or that the Qwen3-Coder model is generating tool calls that are not properly formatted as JSON.

Possible Solutions and Workarounds

Given the information, here's a strategic approach to tackle this issue:

  1. Inspect Model Output: The most crucial step is to directly examine the raw output from Qwen3-Coder when generating tool calls. Print the raw response before it's passed to the hermes parser. This will reveal the exact format of the tool calls and help identify any inconsistencies or malformations.

    # Inside vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py, before json.loads
    # repr() makes leading newlines and other invisible characters visible
    print("Raw model output:", repr(match[0] if match[0] else match[1]))
    

    By observing the match[0] or match[1] content, you can understand what the hermes parser is attempting to decode.

  2. Adjust Regular Expression: If the model's output consistently deviates from the expected JSON format, the regular expressions used in hermes_tool_parser.py might need adjustment. The script uses regex to identify potential JSON blobs. Carefully craft the regex to match the specific structure produced by Qwen3-Coder; a sketch for testing candidate patterns appears after the checklist below.

    • Examine the existing regex patterns in the hermes_tool_parser.py file.
    • Modify the patterns to accurately capture the JSON structure generated by Qwen3-Coder.
    • Test the updated regex patterns against the raw model output obtained in step 1.
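
    Here is a sketch for testing patterns offline, assuming the two-alternative structure implied by the match[0]/match[1] access in the traceback (copy the exact pattern from your installed hermes_tool_parser.py rather than trusting this one):

    import re

    # Two alternatives: a closed <tool_call>...</tool_call> block (group 1),
    # or an unterminated <tool_call> at the end of the text (group 2).
    tool_call_regex = re.compile(
        r"<tool_call>(.*?)</tool_call>|<tool_call>(.*)", re.DOTALL)

    raw_output = '<tool_call>\n{"name": "my_tool", "arguments": {}}\n</tool_call>'
    for match in tool_call_regex.findall(raw_output):
        candidate = match[0] if match[0] else match[1]
        print(repr(candidate.strip()))  # exactly what would reach json.loads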
  3. Implement Custom Parsing Logic: If regex adjustments prove insufficient, consider implementing custom parsing logic tailored to the Qwen3-Coder output. This might involve string manipulation, conditional checks, and more robust error handling. (The illustrative code example near the end of this post sketches this approach.)

    • Create a new function that takes the raw model output as input.
    • Implement logic to identify and extract the relevant parts of the output.
    • Parse the extracted parts into a JSON-compatible format.
    • Replace the json.loads() call with your custom parsing function.
  4. Try a Different Tool Parser: While hermes is causing issues, explore the alternative tool parsers that ship with vLLM. Experimenting with different parsers may reveal one that better aligns with Qwen3-Coder's output format; recent vLLM releases include a qwen3_coder parser aimed specifically at this model family, so check which parsers your build supports. A sample launch command appears after the bullets below.

    • Modify the vllm serve command to use a different --tool-call-parser option.
    • Evaluate the performance and error rate with each parser.
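
    For example, a launch command along these lines (the model path is illustrative, and you should verify the parser name against vllm serve --help in your build, since parser availability varies by version):

    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder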
  5. Update vLLM: Ensure you're using the latest version of vLLM. Updates often include bug fixes and improvements to tool parsing capabilities. Check the vLLM GitHub repository for newer releases and upgrade if necessary.

    pip install -U vllm
    
  6. Check the Model Card: Review the Qwen3-Coder model card on Hugging Face. It might contain specific instructions or recommendations regarding tool calling and expected output formats. Pay close attention to any examples or guidelines provided by the model developers.

  7. Report the Issue: If none of the above solutions work, consider reporting the issue to the vLLM or Qwen3-Coder developers. Provide detailed information about your environment, the error messages, and the steps you've taken to troubleshoot the problem. This will help them identify the root cause and develop a fix.

Code Example: Custom Parsing (Illustrative)

import json
import re

def custom_parse_qwen3_coder_tool_call(model_output: str) -> dict | None:
    """Custom parser for Qwen3-Coder tool call output (illustrative only)."""
    try:
        # Hypothetical example: extract content between <tool_code> and
        # </tool_code> tags; substitute whatever delimiters your model emits
        match = re.search(r'<tool_code>(.*?)</tool_code>', model_output, re.DOTALL)
        if match:
            tool_code = match.group(1).strip()
            # Attempt to parse the extracted code as JSON
            return json.loads(tool_code)
        else:
            print("No tool code found in the output.")
            return None
    except json.JSONDecodeError as e:
        print(f"JSONDecodeError: {e}")
        return None

# Example Usage (replace with your actual model output)
model_output = "Some text here <tool_code>{\"tool\": \"my_tool\", \"param\": \"value\"}</tool_code> more text"
parsed_tool_call = custom_parse_qwen3_coder_tool_call(model_output)
if parsed_tool_call:
    print("Parsed tool call:", parsed_tool_call)

Important Considerations:

  • Model Training: The core of the issue might be rooted in how Qwen3-Coder was trained. If the model consistently produces malformed JSON, it might require fine-tuning or adjustments to its training data.
  • Edge Cases: Be prepared to handle edge cases and unexpected output formats. Robust error handling is crucial to prevent the parser from crashing.
  • Security: When parsing JSON from untrusted sources, be mindful of security vulnerabilities. Sanitize the input and validate the parsed data before acting on it (a small validation sketch follows this list).
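
As a minimal illustration of that last point, here is a validation helper matching the made-up {"tool": ..., "param": ...} shape from the example above (hypothetical names throughout):

def is_valid_tool_call(parsed: dict) -> bool:
    """Accept only the minimal shape we expect before acting on it."""
    # The keys come from the illustrative example above; adapt the schema
    # (or use a real schema validator) for your actual tool definitions.
    return (
        isinstance(parsed, dict)
        and isinstance(parsed.get("tool"), str)
        and isinstance(parsed.get("param"), str)
    )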

By systematically investigating the model's output, adjusting the parsing logic, and staying updated with the latest vLLM releases, you can overcome this issue and successfully enable tool calling with Qwen3-Coder.

Good luck, and let me know if you have any more questions!