On August 25th, Alibaba Cloud launched two open-source large vision language models (LVLMs): Qwen-VL and its conversationally fine-tuned variant Qwen-VL-Chat. Qwen-VL is the multimodal version of Qwen-7B, the 7-billion-parameter model of Alibaba Cloud's large language model Tongyi Qianwen. Capable of understanding both image inputs and text prompts in English and Chinese, Qwen-VL can perform various tasks such as responding to open-ended queries about different images and generating image captions.
Qwen-VL is a vision language (VL) model that supports multiple languages, including Chinese and English. Compared with earlier VL models, Qwen-VL not only has basic abilities in image recognition, description, question answering, and dialogue, but also adds capabilities such as visual localization and understanding of text within images.
For example, if a foreign tourist who does not understand Chinese goes to a hospital for treatment and does not know how to find the right department, they can take a picture of the floor guide map and ask Qwen-VL, "Which floor is the orthopedics department on?" or "Where should I go for ENT?" Qwen-VL will provide text replies based on the information in the image. This is its image question-answering capability. As another example, given a photo of Shanghai's Bund and asked to find the Oriental Pearl Tower, Qwen-VL can accurately outline the corresponding building with detection boxes. This demonstrates its visual localization ability.
Qwen-VL, built on the Qwen-7B language model, introduces a visual encoder in its architecture to support visual input signals. Through the design of its training process, the model is able to perceive and understand visual signals at a fine-grained level. Qwen-VL supports an image input resolution of 448×448, higher than the 224×224 resolution typically supported by previously open-sourced LVLMs. Building on Qwen-VL, the Tongyi Qianwen team has developed Qwen-VL-Chat, an LLM-based visual AI assistant with alignment mechanisms, which lets developers quickly build dialogue applications with multimodal capabilities.
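To illustrate the kind of multimodal dialogue application this enables, below is a minimal sketch that loads the released Qwen-VL-Chat checkpoint via Hugging Face Transformers and asks an image-grounded question like the floor-map example above. The checkpoint name Qwen/Qwen-VL-Chat and the from_list_format/chat helpers come from the checkpoint's published trust_remote_code interface; the image URL is a placeholder.

```python
# Minimal sketch: multimodal Q&A with Qwen-VL-Chat via Hugging Face Transformers.
# Assumes the published "Qwen/Qwen-VL-Chat" checkpoint, whose custom code
# (loaded with trust_remote_code=True) exposes from_list_format() and chat().
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image with a text question (placeholder URL).
query = tokenizer.from_list_format([
    {"image": "https://example.com/hospital_floor_guide.jpg"},  # placeholder image
    {"text": "Which floor is the orthopedics department on?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```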
Multimodality is one of the key directions in general artificial intelligence research. It is widely believed that moving from a single-sensory, text-only language model to a multimodal model that supports various forms of input such as text, images, and audio represents a major step toward more capable intelligent models. Multimodality enhances the understanding capabilities of large models and greatly expands their range of applications.
Vision is the primary human sense, and it is also the first modality that researchers aim to incorporate into large models. Following the release of the M6 and OFA series of multimodal models, Alibaba Cloud's Tongyi Qianwen team has now open-sourced Qwen-VL, a large vision language model (LVLM) based on Qwen-7B.
Qwen-VL is the industry's first general-purpose model that supports open-domain visual localization in Chinese. Open-domain visual localization determines the accuracy of a large model's "vision", that is, whether it can precisely identify the requested objects in an image. This is crucial for the practical application of VL models in scenarios such as robot control.
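Continuing the sketch above (with the model and tokenizer already loaded), visual localization can be exercised through the same chat interface, as in the Oriental Pearl Tower example earlier. The draw_bbox_on_latest_picture helper is part of the checkpoint's published custom code; the prompt and image URL are illustrative.

```python
# Minimal sketch of open-domain visual grounding with Qwen-VL-Chat,
# reusing the model and tokenizer loaded in the previous example.
query = tokenizer.from_list_format([
    {"image": "https://example.com/shanghai_bund.jpg"},  # placeholder image
    {"text": "Find the Oriental Pearl Tower in the image."},
])
response, history = model.chat(tokenizer, query=query, history=None)

# The response embeds the detection box as <ref>...</ref><box>...</box> tokens;
# this helper from the checkpoint's custom code renders the box onto the image.
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image is not None:
    image.save("grounding_output.jpg")
```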
In mainstream multimodal task benchmarks and multimodal conversational ability evaluations, Qwen-VL has achieved performance far beyond that of general-purpose models of equivalent size.
In standard English evaluations of the four major multimodal tasks (zero-shot captioning, VQA, DocVQA, and grounding), Qwen-VL achieved the best performance among open-source LVLMs of comparable size. To test the model's multimodal dialogue capability, the Tongyi Qianwen team built a test set called 'Shijinshi' based on a GPT-4 scoring mechanism and ran comparative tests of Qwen-VL-Chat against other models. Qwen-VL-Chat achieved the best results among open-source LVLMs in both Chinese and English alignment evaluations.
Qwen-VL and its visual AI assistant Qwen-VL-Chat have been released on ModelScope, open-source, free, and available for commercial use. Users can download the models directly from ModelScope, or access and invoke Qwen-VL and Qwen-VL-Chat through Alibaba Cloud's DashScope platform. Alibaba Cloud provides users with comprehensive services covering model training, inference, deployment, fine-tuning, and more.
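For reference, here is a minimal sketch of the ModelScope download path described above, using the standard ModelScope SDK; the repository ID "qwen/Qwen-VL-Chat" is assumed to match the ModelScope listing.

```python
# Minimal sketch: fetching the Qwen-VL-Chat weights from ModelScope.
# Assumes the ModelScope SDK is installed (pip install modelscope) and
# the listed repository ID "qwen/Qwen-VL-Chat".
from modelscope.hub.snapshot_download import snapshot_download

model_dir = snapshot_download("qwen/Qwen-VL-Chat")
print(f"Model files downloaded to: {model_dir}")
```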
In early August, Alibaba Cloud open-sourced the general-purpose model Qwen-7B and the dialogue model Qwen-7B-Chat, each with 7 billion parameters, making it the first major technology company in China to join the ranks of open-source large models. The release immediately attracted widespread attention and quickly climbed HuggingFace's trending list that week. In less than a month, it received over 3,400 stars on GitHub, and its cumulative downloads have exceeded 400,000.
Source: Pandaily – https://pandaily.com/alibaba-cloud-releases-open-source-qwen-vl-a-large-vision-language-model/