As the race among artificial intelligence (AI) chatbots heats up, Microsoft has unveiled Kosmos-1, a new AI model that can respond to visual cues or images in addition to text prompts. The Multimodal Large Language Model (MLLM) can handle a range of new tasks, including image captioning, visual question answering, and more. Kosmos-1 could pave the way for the next step beyond text prompts for tools like ChatGPT. "A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence," Microsoft's AI researchers wrote. "In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions."
As ZDNet reports, the paper argues that moving beyond ChatGPT-like capabilities toward Artificial General Intelligence (AGI) requires multimodal perception, that is, knowledge acquisition and grounding in the real world. More importantly, unlocking multimodal input expands the applications of language models to higher-value areas such as multimodal machine learning, document intelligence, and robotics. The goal is to align perception with the LLM so that the models are able to see and talk. Experimental results showed that Kosmos-1 achieves impressive performance on language understanding and generation, and even on OCR-free NLP tasks in which the model is fed document images directly. It also posted good results on perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, as well as vision tasks such as image recognition with descriptions (specifying classification via text instructions). The team added: "We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs."