Google’s most anticipated AI model, the one the industry considers its best shot at beating GPT-4, has finally arrived!
On December 6, US Pacific Time, Google CEO Sundar Pichai and Google DeepMind CEO Demis Hassabis jointly announced the official launch of Gemini 1.0 on Google’s website.
Gemini was created jointly by teams across Alphabet, including Google Research and Google DeepMind. A powerful new multimodal AI model, it can generalize across, understand, operate on, and combine different types of information. It can process not only text, images, video, and audio, but also complete complex tasks in mathematics, physics, and other scientific fields, and it can understand and generate high-quality code in a range of programming languages.
Pichai described Gemini as "our largest and most powerful AI model to date," saying it shows state-of-the-art performance across many leading benchmarks. "Gemini 1.0, our first generation, is optimized in three sizes: Ultra, Pro, and Nano. These are the first models of the Gemini era and the first realization of the vision we had when we formed Google DeepMind earlier this year."
Pichai had already teased Gemini at the Google I/O conference this May, describing it as a "next-generation multimodal intelligent network" designed for API integration. It was reported to have trillions of parameters, like GPT-4, but with five times GPT-4’s compute, letting it handle more complex and varied tasks.

To that end, Google merged its two strongest AI labs, Google Brain and DeepMind, under Demis Hassabis, DeepMind’s former CEO, and poured hundreds of millions of dollars into Gemini’s development.
Even Sergey Brin, the Google co-founder who stepped away four years ago, returned to the Mountain View offices in Silicon Valley to take part in Gemini decision-making and join the LLM scrum.
Just last week, however, The Information cited people familiar with the matter as saying that Pichai had quietly canceled a series of Gemini launch events planned for California, New York, and Washington, D.C., because the model was found to be "unreliable in handling some non-English queries," casting a shadow over the launch.
Perhaps to shore up market confidence, Gemini 1.0 made a sudden grand debut early this morning, immediately dominating the front pages of major tech media.
So what is this flagship model on which Google is staking everything?
Hassabis, head of Google DeepMind, called Gemini their most flexible model yet, able to run efficiently on everything from data centers to mobile devices.

To achieve that flexibility, Gemini 1.0 comes in three sizes with different capabilities:
Gemini Nano — the most efficient model for on-device tasks. Designed for smartphones, it runs AI workloads locally without connecting to external servers. It is already built into the Pixel 8 Pro, powering text summarization in the Recorder app and Smart Reply in Gboard, starting with WhatsApp and expanding to more apps next year.
Gemini Pro — the best model for a wide range of tasks, running in Google’s data centers. Starting today it powers the latest version of the Bard chatbot, which can now respond quickly and understand complex queries, the biggest upgrade since Bard launched. It is available in English in more than 170 countries and regions, with new languages and regions planned in the coming months, and it will come to more Google products such as Search, Ads, Chrome, and Duet AI.
Gemini Ultra — the largest and most capable model, built for highly complex tasks. It will be made available to developers and enterprise customers early next year, after the current testing phase, at which point Google will launch Bard Advanced, an upgraded Bard powered by Gemini Ultra.
Architecturally, Gemini still uses the Transformer, adopts efficient attention mechanisms, and supports a 32k-token context length.
Google confidently reported that, after rigorous testing and evaluation spanning natural image, audio, and video understanding through mathematical reasoning, Gemini Ultra exceeded the current state of the art, represented by GPT-4, on 30 of the 32 academic benchmarks widely used in large language model research. Before the release, Google had evaluated it against standard industry metrics.
On MMLU, Gemini Ultra scored 90.0%, above GPT-4’s 86.4%, making it the first model to surpass human-expert performance on the test. MMLU covers 57 subjects, including mathematics, physics, history, law, medicine, and ethics, and measures a model’s knowledge and problem-solving ability across domains.
Across reasoning, math, and coding benchmarks, Gemini Ultra leads GPT-4 everywhere except HellaSwag, where its 87.8% trails GPT-4. The wins include multi-step reasoning and challenging math problems spanning algebra, geometry, and pre-calculus.

On multimodal capability, Gemini Ultra achieved a state-of-the-art 59.4% on the new MMMU benchmark, beating the 56.8% of OpenAI’s multimodal GPT-4V. The benchmark spans many different fields and demands deliberate reasoning across its tasks.
In image benchmarks, Gemini Ultra outperformed all previous models working from pixel information alone, without help from an optical character recognition (OCR) system. In audio tests, Gemini scored higher on automatic speech recognition and automatic speech translation than the Whisper system feeding into GPT-4.

Hassabis said this highlights the advantage of Gemini’s native multimodality. Until now, the usual way to create multimodal models has been to train separate components for different modalities independently, then stitch them together to approximate some multimodal capability.
Such models can sometimes handle specific tasks like describing images well, but they often struggle with more conceptual, complex reasoning.
Gemini was designed as natively multimodal from the outset: it was pre-trained on different modalities from the start, on Google’s own TPU v4 and v5e chips, and then fine-tuned with additional multimodal data to sharpen its performance.
This approach lets Gemini understand and reason about all kinds of input far more naturally from the ground up, pushing its capabilities to a new state of the art in almost every domain.
First, Gemini has sophisticated multimodal reasoning, which helps it make sense of complex written and visual information and surface knowledge that is hard to pick out of massive datasets.
For example, prompted in natural language to filter out irrelevant papers and extract key data by reading, Gemini can, in the span of a lunch break, distill 200,000 papers down to roughly 250 relevant ones, pull out their data, and then turn that data into any chart format required. This promises real help for breakthroughs in science, finance, and other fields at digital speed.

The trained Gemini can recognize and understand text, image, and audio data simultaneously, grasping subtle information and answering questions on complex topics. That makes it especially good at explaining its reasoning in subjects such as mathematics and physics; in one example, staff showed Gemini spotting mistakes in a handwritten physics solution and explaining the correct approach.

To show Gemini’s multimodal abilities more concretely, Pichai posted a video on X, saying "the best way to understand Gemini’s amazing underlying capabilities is to see them in action."

In the video, Gemini taught staff the Mandarin pronunciation of "duck" on request, and explained the Chinese tones involved.

Staff also demonstrated an interaction conducted entirely in Chinese. Asked about the indoor light in a photo, Gemini inferred the apartment’s orientation, answering in Chinese that the room likely faces south. When a plant in the photo was circled and asked about its lighting needs, Gemini identified the species and explained its light requirements. The whole exchange flowed like a native speaker’s, showing that Gemini holds its own in multilingual settings, on par with GPT-4.

Gemini can also understand, interpret, and generate high-quality code in the world’s most popular programming languages, including Python, Java, C++, and Go. Its ability to work across languages and handle complex information makes it one of the world’s leading foundation models for coding, letting programmers use it as a collaborative tool for designing applications.
For developers, Gemini Pro will be available from December 13 through the Gemini API in Google AI Studio or Google Cloud Vertex AI. Android developers will also be able to use Gemini Nano, the most efficient model for on-device tasks, via AICore.
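As a rough illustration (not official sample code), calling the Gemini API over REST could look like the sketch below. The payload shape (`{"contents": [{"parts": [{"text": ...}]}]}`) and the `generateContent` endpoint follow the format Google documents for the API; `GEMINI_API_KEY` is a placeholder for a key you would obtain from Google AI Studio.

```python
import json
import os
import urllib.request

# Base URL for the Gemini API's generateContent method (REST, v1beta).
API_URL = ("https://generativelanguage.googleapis.com/"
           "v1beta/models/gemini-pro:generateContent")


def build_request(prompt: str) -> dict:
    """Build the JSON body for a single-turn text prompt."""
    return {"contents": [{"parts": [{"text": prompt}]}]}


def generate(prompt: str, api_key: str) -> str:
    """Send the prompt and return the first candidate's text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Generated text is nested under candidates -> content -> parts.
    return data["candidates"][0]["content"]["parts"][0]["text"]


if __name__ == "__main__":
    # GEMINI_API_KEY is a placeholder environment variable for illustration.
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        print(generate("Explain what a multimodal model is.", key))
    else:
        # Without a key, just show the request body we would send.
        print(json.dumps(build_request("Hello, Gemini"), indent=2))
```

The same request body works whether you call the REST endpoint directly, as above, or go through Google’s client SDKs; only the transport differs.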
Looking back at Google’s large-model journey, against the rapid, continuous updates of OpenAI’s blockbuster models, Google has always seemed a step behind Microsoft, which kept folding GPT’s AI features into core products and pushing them to customers. When the Bard chatbot launched this February, it got off to a rough start: one factual error wiped $100 billion off Google’s market value overnight. Duet AI, its work suite positioned against Microsoft Copilot, drew a lukewarm market response, and its cloud business’s financial results also trailed Microsoft’s.
After the internal friction of reorganizing the Brain and DeepMind teams and the loss of senior talent to OpenAI, Google’s AI campaign has looked especially strained.
Still, this is the AI pioneer behind the seminal Transformer paper "Attention Is All You Need" and the landmark program AlphaGo, work that inspired many later large models, ChatGPT included. In technical DNA, training data, capital, and infrastructure, its strengths are second to none.
Google regards the Gemini release as its most critical technological breakthrough of the past decade. Can it rally the company, beat OpenAI, and reclaim the large-model throne?
Will the competitive landscape of AI be reshaped when Gemini Ultra arrives next year?
And have you tried the new Google Bard yet? What do you think?
Leave a comment to share, or join the group to discuss with us!