Alibaba releases new AI model Qwen2-VL that can analyze videos more than 20 minutes long

Alibaba Cloud, the cloud services and storage division of the Chinese e-commerce giant, has announced the release of Qwen2-VL, its latest advanced vision-language model designed to enhance visual understanding, video comprehension and multilingual text-image processing.

And already, it boasts impressive performance on third-party benchmark tests compared to other leading state-of-the-art models such as Meta’s Llama 3.1, OpenAI’s GPT-4o, Anthropic’s Claude 3 Haiku and Google’s Gemini-1.5 Flash. You can try an inference of it hosted on Hugging Face.

Supported languages include English, Chinese, most European languages, Japanese, Korean, Arabic and Vietnamese.

Exceptional capabilities in analyzing imagery and video, even for live tech support

With the new Qwen2-VL, Alibaba is seeking to set new standards for AI models’ interaction with visual data, including the capability to analyze and discern handwriting in multiple languages, identify, describe and distinguish between multiple objects in still images, and even analyze live video in near-realtime, providing summaries or feedback that could open the door to it being used for tech support and other helpful live operations.

As the Qwen research team writes in a blog post on GitHub about the new Qwen2-VL family of models: “Beyond static images, Qwen2-VL extends its prowess to video content analysis. It can summarize video content, answer questions related to it, and maintain a continuous flow of conversation in real time, offering live chat support. This functionality allows it to act as a personal assistant, helping users by providing insights and information drawn directly from video content.”

In addition, Alibaba boasts it can analyze videos longer than 20 minutes and answer questions about the contents.

Alibaba even showed off an example of the new model correctly analyzing and describing the following video:

Here’s Qwen2-VL’s summary:

The video begins with a man speaking to the camera, followed by a group of people sitting in a control room. The camera then cuts to two men floating inside a space station, where they are seen speaking to the camera. The men appear to be astronauts, and they are wearing space suits. The space station is filled with various equipment and machinery, and the camera pans around to show the different areas of the station. The men continue to speak to the camera, and they appear to be discussing their mission and the various tasks they are performing. Overall, the video provides a fascinating glimpse into the world of space exploration and the daily lives of astronauts.

Three sizes, two of which are fully open source under Apache 2.0 license

Alibaba’s new model comes in three variants of different parameter sizes: Qwen2-VL-72B (72 billion parameters), Qwen2-VL-7B, and Qwen2-VL-2B. (A reminder that parameters describe the internal settings of a model, with more parameters generally connoting a more powerful and capable model.)

The 7B and 2B variants are available under open-source permissive Apache 2.0 licenses, allowing enterprises to use them at will for commercial purposes, making them appealing as options for potential decision-makers. They’re designed to deliver competitive performance at a more accessible scale and are available on platforms like Hugging Face and ModelScope.
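To give a rough sense of what those parameter counts mean in practice, here is a back-of-the-envelope sketch of the memory needed just to hold each variant's weights. It assumes the nominal sizes (72, 7 and 2 billion) and 2 bytes per parameter (16-bit weights); the function name is hypothetical, and the estimate ignores activations, caches and any quantization, so real requirements will differ.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same precision
    (2 bytes corresponds to 16-bit weights).
    """
    return params_billions * 1e9 * bytes_per_param / 1e9


# Nominal sizes taken from the variant names; actual counts may differ slightly.
variants = {"Qwen2-VL-72B": 72, "Qwen2-VL-7B": 7, "Qwen2-VL-2B": 2}
estimates = {name: weight_memory_gb(size) for name, size in variants.items()}

for name, gigabytes in estimates.items():
    print(f"{name}: about {gigabytes:.0f} GB of weights at 16-bit precision")
```

Under these assumptions the largest variant needs roughly ten times the weight memory of the 7B model, which is one reason smaller variants are described as more accessible.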

However, the largest 72B model hasn’t yet been released publicly, and will only be made available later through a separate license and application programming interface (API) from Alibaba.

Function calling and human-like visual perception

The Qwen2-VL series is built on the foundation of the Qwen model family, bringing significant advancements in several key areas:

The models can be integrated into devices such as mobile phones and robots, allowing for automated operations based on visual environments and text instructions.

This feature highlights Qwen2-VL’s potential as a powerful tool for tasks that require complex reasoning and decision-making.

In addition, Qwen2-VL supports function calling (integrating with other third-party software, apps and tools) and visual extraction of information from these third-party sources. In other words, the model can look at and understand “flight statuses, weather forecasts, or package tracking,” which Alibaba says makes it capable of “facilitating interactions similar to human perceptions of the world.”

Qwen2-VL introduces several architectural improvements aimed at enhancing the model’s ability to process and comprehend visual data.

The Naive Dynamic Resolution support allows the models to handle images of varying resolutions, ensuring consistency and accuracy in visual interpretation. Additionally, the Multimodal Rotary Position Embedding (M-ROPE) system enables the models to simultaneously capture and integrate positional information across text, images, and videos.

What’s next for the Qwen Team?

Alibaba’s Qwen Team is committed to further advancing the capabilities of vision-language models, building on the success of Qwen2-VL with plans to integrate more modalities and enhance the models’ utility across a broader range of applications.
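For readers unfamiliar with function calling, the general pattern looks roughly like the sketch below: the model emits a structured request naming a tool and its arguments, and the surrounding application runs the matching function. This is a generic illustration, not Qwen2-VL's actual interface; the tool name `get_flight_status`, the JSON shape and the hard-coded model output are all hypothetical.

```python
import json


def get_flight_status(flight_number: str) -> dict:
    """Hypothetical tool; a real application would query an airline service."""
    fake_db = {"CA981": "on time", "MU587": "delayed"}
    return {
        "flight_number": flight_number,
        "status": fake_db.get(flight_number, "unknown"),
    }


# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_flight_status": get_flight_status}


def dispatch(tool_call_json: str) -> dict:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Stand-in for the structured output a model might produce (assumed format).
model_output = '{"name": "get_flight_status", "arguments": {"flight_number": "CA981"}}'
result = dispatch(model_output)
print(result)
```

In a full loop the result would be passed back to the model so it can phrase an answer for the user; that step is omitted here.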
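The two ideas can be illustrated with a toy sketch. It is not Qwen2-VL's implementation: the patch size of 14 pixels, the 2x2 patch merging and both function names are assumptions chosen for illustration. The first function shows how the number of visual tokens can scale with image resolution instead of being fixed; the second shows position information carried as a (time, row, column) triple per visual token, which is the intuition behind a multimodal position embedding.

```python
import math
from itertools import product


def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Visual tokens for one image when token count follows resolution.

    Assumed values: square patches of `patch` pixels, with each
    `merge` x `merge` group of patches merged into a single token.
    """
    cell = patch * merge  # pixels covered by one token along each axis
    return math.ceil(height / cell) * math.ceil(width / cell)


def position_ids_3d(frames: int, rows: int, cols: int) -> list:
    """One (time, row, col) position triple per visual token.

    A still image is the special case frames=1; plain text could reuse
    the same index for all three components.
    """
    return list(product(range(frames), range(rows), range(cols)))


small = visual_token_count(224, 224)
large = visual_token_count(448, 896)
print(small, large)  # a larger image yields more tokens

video_positions = position_ids_3d(frames=2, rows=2, cols=2)
print(video_positions)
```

With these assumed values a 224x224 image maps to 64 tokens and a 448x896 image to 512, so detail in larger images is not squeezed into a fixed budget.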

The Qwen2-VL models are now available for use, and the Qwen Team encourages developers and researchers to explore the potential of these cutting-edge tools.
