Qwen Team, a division of Chinese e-commerce giant Alibaba developing its growing family of open-source Qwen large language models (LLMs), has introduced QwQ-32B, a new 32-billion-parameter reasoning model designed to improve performance on complex problem-solving tasks through reinforcement learning (RL).
The model is available as open-weight on Hugging Face and on ModelScope under an Apache 2.0 license. This means it’s available for commercial and research uses, so enterprises can employ it immediately to power their products and applications (even ones they charge customers to use).
Individual users can also access it via Qwen Chat.
Qwen-with-Questions was Alibaba’s answer to OpenAI’s original reasoning model o1
QwQ, short for Qwen-with-Questions, was first introduced by Alibaba in November 2024 as an open-source reasoning model aimed at competing with OpenAI’s o1-preview.
At launch, the model was designed to enhance logical reasoning and planning by reviewing and refining its own responses during inference, a technique that made it particularly effective in math and coding tasks.
The initial version of QwQ featured 32 billion parameters and a 32,000-token context length, with Alibaba highlighting its ability to outperform o1-preview in mathematical benchmarks like AIME and MATH, as well as scientific reasoning tasks such as GPQA.
Despite its strengths, QwQ’s early iterations struggled with programming benchmarks like LiveCodeBench, where OpenAI’s models maintained an edge. Additionally, as with many emerging reasoning models, QwQ faced challenges such as language mixing and occasional circular reasoning loops.
However, Alibaba’s decision to release the model under an Apache 2.0 license ensured that developers and enterprises could freely adapt and commercialize it, distinguishing it from proprietary alternatives like OpenAI’s o1.
Since QwQ’s initial release, the AI landscape has evolved rapidly. The limitations of traditional LLMs have become more apparent, with scaling laws yielding diminishing returns in performance improvements.
This shift has fueled interest in large reasoning models (LRMs) — a new category of AI systems that use inference-time reasoning and self-reflection to enhance accuracy. These include OpenAI’s o3 series and the massively successful DeepSeek-R1 from rival Chinese lab DeepSeek, an offshoot of Hong Kong quantitative analysis firm High-Flyer Capital Management.
A new report from web traffic analytics and research firm SimilarWeb found that since the launch of R1 back in January 2025, DeepSeek has rocketed up the charts to become the most-visited AI model-providing website after OpenAI.

QwQ-32B, Alibaba’s latest iteration, builds on these advancements by integrating RL and structured self-questioning, positioning it as a serious competitor in the growing field of reasoning-focused AI.
Scaling up performance with multi-stage reinforcement learning
Traditional instruction-tuned models often struggle with difficult reasoning tasks, but the Qwen Team’s research suggests that RL can significantly improve a model’s ability to solve complex problems.
QwQ-32B builds on this idea by implementing a multi-stage RL training approach to enhance mathematical reasoning, coding proficiency and general problem-solving.
The model has been benchmarked against leading alternatives such as DeepSeek-R1, o1-mini and DeepSeek-R1-Distill-Qwen-32B, demonstrating competitive results despite having fewer parameters than some of these models.

For example, while DeepSeek-R1 operates with 671 billion parameters (with 37 billion activated), QwQ-32B achieves comparable performance with a much smaller footprint. It typically requires 24 GB of vRAM on a GPU (Nvidia’s H100s have 80 GB), compared to more than 1,500 GB of vRAM for running the full DeepSeek-R1 (16 Nvidia A100 GPUs), highlighting the efficiency of Qwen’s RL approach.
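The memory figures above can be sanity-checked with a back-of-the-envelope calculation. The sketch below estimates weight storage only. It assumes rounded parameter counts and treats the numeric precisions (16-bit and 4-bit) as illustrative assumptions, since the article does not say which precision the 24 GB figure refers to.

```python
# Rough estimate of the memory needed just to hold model weights.
# It ignores the KV cache, activations and framework overhead,
# so real-world usage will be higher than these numbers.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

QWQ_PARAMS = 32e9   # rounded parameter count quoted in the article
R1_PARAMS = 671e9   # total (not activated) parameters quoted in the article

qwq_fp16 = weight_memory_gb(QWQ_PARAMS, 16)  # 64.0 GB at 16-bit (assumed precision)
qwq_int4 = weight_memory_gb(QWQ_PARAMS, 4)   # 16.0 GB at 4-bit (assumed precision)
r1_fp16 = weight_memory_gb(R1_PARAMS, 16)    # 1342.0 GB at 16-bit (assumed precision)

print(f"QwQ-32B, 16-bit weights: {qwq_fp16:.0f} GB")
print(f"QwQ-32B, 4-bit weights:  {qwq_int4:.0f} GB")
print(f"DeepSeek-R1, 16-bit:     {r1_fp16:.0f} GB")
```

Under these assumptions, a 24 GB card would only fit QwQ-32B in a quantized form, while the full DeepSeek-R1 weights alone land in the same range as the multi-GPU figure quoted above.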
QwQ-32B follows a causal language model architecture and includes several optimizations:
- 64 transformer layers with RoPE, SwiGLU, RMSNorm and Attention QKV bias;
- Grouped query attention (GQA) with 40 attention heads for queries and 8 for key-value pairs;
- Extended context length of 131,072 tokens, allowing for better handling of long-sequence inputs;
- Multi-stage training including pretraining, supervised fine-tuning and RL.
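To make the attention head counts in the list above concrete, here is a minimal sketch of how query heads can share key-value heads. The contiguous grouping (heads 0 to 4 share KV head 0, and so on) is a common convention and an assumption here, not something the article specifies. The function name is hypothetical.

```python
NUM_QUERY_HEADS = 40  # from the article
NUM_KV_HEADS = 8      # from the article

def kv_head_for_query_head(q_head: int) -> int:
    """Map a query head index to the KV head it shares (assumed contiguous groups)."""
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS  # 5 query heads per KV head
    return q_head // group_size

# Collect which query heads read from each KV head.
groups = {}
for q in range(NUM_QUERY_HEADS):
    groups.setdefault(kv_head_for_query_head(q), []).append(q)

# Relative KV cache size versus giving every query head its own KV head.
kv_cache_ratio = NUM_KV_HEADS / NUM_QUERY_HEADS  # 0.2

for kv_head, q_heads in groups.items():
    print(f"KV head {kv_head}: query heads {q_heads}")
print(f"KV cache size relative to one KV head per query head: {kv_cache_ratio:.0%}")
```

Because only the key-value heads are cached during generation, sharing them five ways would cut the cache to roughly a fifth of what a same-sized design with 40 key-value heads would need, which matters at long context lengths.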
The RL process for QwQ-32B was executed in two phases:
- Math and coding focus: The model was trained using an accuracy verifier for mathematical reasoning and a code execution server for coding tasks. This approach ensured that generated answers were validated for correctness before being reinforced.
- General capability enhancement: In a second phase, the model received reward-based training using general reward models and rule-based verifiers. This stage improved instruction following, human alignment and agent reasoning without compromising its math and coding capabilities.
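The first phase described above rewards outcomes that can be checked automatically. The following is a toy sketch of that idea, not the Qwen Team’s actual verifier or execution server: the function names, the "last number in the output" heuristic and the in-process `exec` call are all assumptions made for illustration.

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the output equals the reference, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0

def code_reward(source: str, func_name: str, test_cases) -> float:
    """Run candidate code against (args, expected) pairs; 1.0 only if every case passes."""
    namespace = {}
    try:
        # Illustration only: a real system would run untrusted code in a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions and runtime errors all earn no reward.
        return 0.0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
cases = [((1, 2), 3), ((0, 0), 0)]

print(math_reward("Adding the terms, the answer is 42.", "42"))  # 1.0
print(code_reward(good, "add", cases))                           # 1.0
print(code_reward(bad, "add", cases))                            # 0.0
```

The point of the design is that the reward depends on whether the final result is verifiably correct, rather than on how plausible the response looks.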
What it means for enterprise decision-makers
For enterprise leaders—including CEOs, CTOs, IT leaders, team managers and AI application developers—QwQ-32B represents a potential shift in how AI can support business decision-making and technical innovation.
With its RL-driven reasoning capabilities, the model can provide more accurate, structured and context-aware insights, making it valuable for use cases such as automated data analysis, strategic planning, software development and intelligent automation.
Companies looking to deploy AI solutions for complex problem-solving, coding assistance, financial modeling or customer service automation may find QwQ-32B’s efficiency an attractive option. Additionally, its open-weight availability allows organizations to fine-tune and customize the model for domain-specific applications without proprietary restrictions, making it a flexible choice for enterprise AI strategies.
The fact that it comes from a Chinese e-commerce giant may raise security and bias concerns for some non-Chinese users, especially when using the Qwen Chat interface. But as with DeepSeek-R1, the model can be downloaded from Hugging Face for offline use, fine-tuning or retraining, which suggests these concerns can be overcome fairly easily. That also makes it a viable alternative to DeepSeek-R1.
Early reactions from AI power users and influencers
The release of QwQ-32B has already gained attention from the AI research and development community, with several developers and industry professionals sharing their initial impressions on X (formerly Twitter):
- Hugging Face’s Vaibhav Srivastav (@reach_vb) highlighted QwQ-32B’s speed in inference thanks to provider Hyperbolic Labs, calling it “blazingly fast” and comparable to top-tier models. He also noted that the model “beats DeepSeek-R1 and OpenAI o1-mini with Apache 2.0 license.”
- AI news and rumor publisher Chubby (@kimmonismus) was impressed by the model’s performance, emphasizing that QwQ-32B sometimes outperforms DeepSeek-R1, despite being 20 times smaller. “Holy moly! Qwen cooked!” they wrote.
- Yuchen Jin (@Yuchenj_UW), co-founder and CTO of Hyperbolic Labs, celebrated the release by noting the efficiency gains. “Small models are so powerful! Alibaba Qwen released QwQ-32B, a reasoning model that beats DeepSeek-R1 (671B) and OpenAI o1-mini!”
- Another Hugging Face team member, Erik Kaunismäki (@ErikKaum), emphasized the ease of deployment, sharing that the model is available for one-click deployment on Hugging Face endpoints, making it accessible to developers without extensive setup.
Agentic capabilities
QwQ-32B incorporates agentic capabilities, allowing it to dynamically adjust reasoning processes based on environmental feedback.
For optimal performance, Qwen Team recommends using the following inference settings:
- Temperature: 0.6
- TopP: 0.95
- TopK: Between 20 and 40
- YaRN Scaling: Recommended for handling sequences longer than 32,768 tokens
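For readers unfamiliar with the sampling settings above, here is a minimal, library-free sketch of how temperature, top-k and top-p typically shape the next-token distribution. The function names are hypothetical, and the order of operations (temperature, then top-k, then top-p) is one common convention; inference frameworks may differ in the details.

```python
import math
import random

def filter_distribution(logits, temperature=0.6, top_k=40, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens and renormalize over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

toy_logits = [2.0, 1.0, 0.0, -5.0]  # made-up scores for a four-token vocabulary
print(filter_distribution(toy_logits))
print(sample(toy_logits, rng=random.Random(0)))
```

A temperature below 1 sharpens the distribution, while top-k and top-p trim the unlikely tail, so the settings together keep some diversity without sampling from very low-probability tokens.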
The model supports deployment using vLLM, a high-throughput inference framework. However, current implementations of vLLM only support static YaRN scaling, which maintains a fixed scaling factor regardless of input length.
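As a rough illustration of what a static scaling factor means, the sketch below derives the factor from the two context lengths quoted in the article. The configuration keys mimic a Hugging Face style `rope_scaling` entry and should be read as an assumption, as should the helper names; consult the model and framework documentation for the exact settings.

```python
NATIVE_CONTEXT = 32_768     # threshold quoted in the article
EXTENDED_CONTEXT = 131_072  # maximum context length quoted in the article

def yarn_factor(target_len: int, native_len: int = NATIVE_CONTEXT) -> float:
    """Scaling factor needed to stretch the native window to the target length."""
    return target_len / native_len

# Hypothetical config fragment; key names are an assumption, not confirmed by the article.
rope_scaling = {
    "type": "yarn",
    "factor": yarn_factor(EXTENDED_CONTEXT),  # 4.0
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

def static_factor(input_len: int) -> float:
    """Static scaling: the same factor is applied no matter how long the input is."""
    return rope_scaling["factor"]

print(rope_scaling)
print(static_factor(1_000), static_factor(100_000))  # 4.0 4.0
```

Since a static factor applies equally to short and long prompts, this is consistent with the recommendation to turn YaRN on only when inputs are expected to exceed 32,768 tokens.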
Future developments
Qwen’s team sees QwQ-32B as the first step in scaling RL to enhance reasoning capabilities. Looking ahead, the team plans to:
- Further explore scaling RL to improve model intelligence;
- Integrate agents with RL for long-horizon reasoning;
- Continue developing foundation models optimized for RL;
- Move toward artificial general intelligence (AGI) through more advanced training techniques.
With QwQ-32B, Qwen Team is positioning RL as a key driver of the next generation of AI models, demonstrating that scaling RL can produce highly performant and effective reasoning systems.