Chinese AI lab DeepSeek might be getting the bulk of the tech industry’s attention this week. But one of its top domestic rivals, Alibaba, isn’t sitting idly by.
Alibaba’s Qwen team on Monday released a new family of AI models, Qwen2.5-VL, that can perform a number of text and image analysis tasks. The models can parse files, understand videos, and count objects in images, as well as control a PC — similar to the model powering OpenAI’s recently launched Operator.
Per the Qwen team’s benchmarking, the best Qwen2.5-VL model beats OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash on a range of video understanding, math, document analysis, and question-answering evaluations.
Qwen2.5-VL, which is available to test in Alibaba’s Qwen Chat app and to download from AI dev platform Hugging Face, can analyze charts and graphics, extract data from scans of invoices and forms, and “comprehend” videos that run for multiple hours, the Qwen team says. Qwen2.5-VL can also recognize “IPs from film and TV series, as well as a wide variety of products,” per the team, which suggests that the models might’ve been trained in part on copyrighted works.
As an AI model developed by a Chinese company, Qwen2.5-VL restricts the topics it will discuss, at least in Qwen Chat. When I asked the largest and most capable model in the family, Qwen2.5-VL-72B, to talk about “Xi Jinping’s mistakes,” Qwen Chat threw an error message.
China’s internet regulator benchmarks many models developed in the country to ensure their responses “embody core socialist values.” Many Chinese AI systems decline to respond to topics that might raise the ire of regulators, such as Taiwan’s autonomy.
One of Qwen2.5-VL’s more interesting features is its ability to interact with software — both on PCs and mobile devices. A video posted on X by Philipp Schmid, a technical lead at Hugging Face, showed Qwen2.5-VL launching the Booking.com app for Android and booking a flight from Chongqing to Beijing.
Don’t Miss @Alibaba_Qwen 2.5 VL! Despite all the Deepseek Hype, Qwen just dropped the best open Multimodal! Qwen 2.5 VL is a Vision Language Model that can control your computer, similar to the @OpenAI operator, extract structured information from charts, and more!! TL;DR; 3️⃣… pic.twitter.com/GeEGVdl0tI
Philipp Schmid (@_philschmid), January 27, 2025
In the video below, a Qwen2.5-VL model controls apps on a Linux desktop — but doesn’t seem to accomplish much beyond switching tabs. Perhaps tellingly, Qwen’s benchmarking shows Qwen2.5-VL scoring poorly on OSWorld, a benchmark that tries to mimic a real computer environment.
LMAO Qwen 2.5 VL can perform Computer Use, out of the box, taking on OpenAI Operator HEAD ON! 🐐 pic.twitter.com/lwMECXzNSu
Vaibhav (VB) Srivastav (@reach_vb), January 27, 2025
The two smaller, less sophisticated models in the Qwen2.5-VL series, Qwen2.5-VL-3B and Qwen2.5-VL-7B, are available under a permissive license. The flagship Qwen2.5-VL-72B, however, is under Alibaba’s custom license, which requires that companies and devs with more than 100 million monthly active users request permission from Qwen/Alibaba before deploying the model commercially.
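For developers weighing which model to deploy, the licensing split described above can be written down as a simple check. The Python below is only an illustrative sketch of the rule as this article summarizes it, not a reading of the license text itself: the function name `needs_permission`, the constant names, and the grouping of models into two sets are all hypothetical, and the actual license terms should be consulted before any commercial deployment.

```python
# Threshold as summarized in the article: permission is needed above
# 100 million monthly active users (assumed to be a strict "more than").
MAU_PERMISSION_THRESHOLD = 100_000_000

# Hypothetical groupings based on the article's description of the licenses.
PERMISSIVE_MODELS = {"Qwen2.5-VL-3B", "Qwen2.5-VL-7B"}
CUSTOM_LICENSE_MODELS = {"Qwen2.5-VL-72B"}


def needs_permission(model: str, monthly_active_users: int) -> bool:
    """Return True if, per the article's summary, a commercial deployment
    of `model` would require asking Qwen/Alibaba for permission first."""
    if model in PERMISSIVE_MODELS:
        # The smaller models are described as permissively licensed.
        return False
    if model in CUSTOM_LICENSE_MODELS:
        # The flagship model's custom license applies a user-count threshold.
        return monthly_active_users > MAU_PERMISSION_THRESHOLD
    raise ValueError(f"Model not covered by this sketch: {model}")
```

Under these assumptions, a company with 250 million monthly active users could ship Qwen2.5-VL-7B without asking, but would need to request permission before deploying Qwen2.5-VL-72B commercially.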