
Meet MAGE, MIT’s unified image generation and recognition system



In a major development, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have announced a framework that can handle both image recognition and image generation tasks with high accuracy. Officially called Masked Generative Encoder, or MAGE, the unified computer vision system promises a wide range of applications and could reduce the overhead of training two separate systems to identify images and generate new ones.


The news comes at a time when companies are going all in on AI, particularly generative technologies, to improve workflows. However, as the researchers explain, the MIT system still has some shortcomings and will need to be perfected in the coming months if it is to be implemented.

The team told VentureBeat that they also plan to expand the model’s capabilities.



So, how does MAGE work?

Today, building image generation and recognition systems largely revolves around two processes: state-of-the-art generative modeling and self-supervised representation learning. In the first case, the system learns to produce high-dimensional data from low-dimensional inputs such as class labels, text embedding, or random noise. In the latter case, a high-dimensional image is used as input to create a low-dimensional embedding for feature detection or classification.
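The two pipelines run in opposite directions, which is worth seeing concretely. The sketch below is illustrative only (random weights standing in for trained networks, with made-up dimensions) to show that generation maps low-dimensional inputs up to images while recognition maps images down to embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative direction: a low-dimensional input (a class label embedding,
# text embedding, or noise vector) is mapped up to a high-dimensional image.
# `decode` is a placeholder for a trained generator.
def decode(z, image_dim=64 * 64):
    W = rng.standard_normal((image_dim, z.size))  # stand-in weights
    return W @ z

# Recognition direction: a high-dimensional image is mapped down to a
# low-dimensional embedding for classification or feature detection.
# `encode` is a placeholder for a trained encoder.
def encode(image, embed_dim=128):
    W = rng.standard_normal((embed_dim, image.size))
    return W @ image

z = rng.standard_normal(16)          # low-dimensional input
image = decode(z)                    # generation: 16 -> 4096 dims
embedding = encode(image)            # recognition: 4096 -> 128 dims
print(image.shape, embedding.shape)  # (4096,) (128,)
```

MAGE's premise is that both directions require the same visual and semantic understanding, so one model can serve both.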


These two techniques, which are currently used independently, both require a visual and semantic understanding of data. So the team at MIT decided to bring them together in a unified architecture. MAGE is the result.

To develop the system, the group used a pre-training approach called masked token modeling. They converted image data into abstracted versions represented by semantic tokens. Each token represented a 16×16 patch of the original image, acting like a mini puzzle piece.

When the tokens were ready, some of them were randomly masked and a neural network was trained to predict the hidden tokens using the context of the surrounding tokens. In this way, the system learned both to understand the patterns in an image (image recognition) and to generate new patterns (image generation).
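The masking step above can be sketched with toy data. Everything here is illustrative: in the actual system the token ids come from a learned tokenizer, while this sketch draws random ids from an assumed vocabulary just to show the shapes and the masking mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: an image tokenized into a 16x16 grid of semantic token ids,
# drawn here at random from an assumed vocabulary of 1024.
grid, vocab = 16, 1024
tokens = rng.integers(0, vocab, size=(grid * grid,))

# Randomly mask a fraction of the tokens.
mask_ratio = 0.75
n_masked = int(mask_ratio * tokens.size)
masked_idx = rng.choice(tokens.size, size=n_masked, replace=False)
MASK = vocab  # a reserved id meaning "masked"
corrupted = tokens.copy()
corrupted[masked_idx] = MASK

# The network's training objective: predict the original ids at the masked
# positions from the visible surrounding tokens. These are the targets it
# would be scored against.
targets = tokens[masked_idx]
print(corrupted.shape, n_masked, targets.shape)  # (256,) 192 (192,)
```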

“Our key insight in this work is that generation is viewed as ‘reconstructing’ images that are 100% masked, while representation learning is viewed as ‘encoding’ images that are 0% masked,” the researchers wrote in a paper detailing the system. “The model is trained to reconstruct over a wide range of masking ratios, with high masking ratios enabling generative capabilities and lower masking ratios enabling representation learning. This simple but very effective approach allows a smooth combination of generative training and representation learning in the same framework: same architecture, same training scheme, and same loss function.”
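The idea in the quote reduces to a single knob: the masking ratio. A minimal sketch, with an illustrative sampling range rather than the paper's exact distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# During training, each batch gets a masking ratio drawn from a broad
# range, so one model sees everything from lightly masked inputs
# (representation-learning regime) to almost fully masked inputs
# (generation regime). The uniform range here is an assumption for
# illustration, not the paper's actual sampling distribution.
def sample_mask_ratio(lo=0.5, hi=1.0):
    return rng.uniform(lo, hi)

# At inference, the two tasks are just the two extremes of that knob:
def task_mask_ratio(task):
    return {"generate": 1.0, "encode": 0.0}[task]

assert task_mask_ratio("generate") == 1.0  # reconstruct a 100%-masked image
assert task_mask_ratio("encode") == 0.0    # embed a 0%-masked image
```

Because both regimes share one architecture and one loss, no second system ever has to be trained.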

In addition to producing images from scratch, the system supports conditional image generation where users can specify criteria for the images and the tool will create the correct image.

“The user can input an entire image and the system can understand and recognize the image and tell the class of the image,” Tianhong Li, one of the researchers behind the system, told VentureBeat. “In other scenarios, the user can input a partially cropped image and the system can restore it. They can also ask the system to generate a random image or generate an image of a certain class, such as a fish or a dog.”

Potential for many applications

After being pre-trained on data from the ImageNet image database, which consists of 1.3 million images, the model achieved a Fréchet inception distance (FID) score (used to assess image quality) of 9.1, outperforming earlier models. In recognition, it achieved 80.9% accuracy on linear probing and 71.9% accuracy in 10-shot classification, with only 10 labeled examples per class.
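Linear probing, the evaluation mentioned above, freezes the pretrained encoder and trains only a linear classifier on its embeddings. A minimal sketch with random stand-in embeddings and assumed dimensions (the real evaluation uses the frozen MAGE encoder's features on ImageNet):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "embeddings" from a frozen encoder, in the 10-shot setting:
# 10 labeled examples per class.
n_classes, embed_dim, n_per_class = 5, 32, 10
X = rng.standard_normal((n_classes * n_per_class, embed_dim))
y = np.repeat(np.arange(n_classes), n_per_class)

# The probe itself is a single linear layer with no hidden units,
# fit here by one-vs-all least squares on one-hot targets.
Y = np.eye(n_classes)[y]                  # one-hot targets
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
preds = np.argmax(X @ W, axis=1)
print("training accuracy:", (preds == y).mean())
```

Because nothing but the linear layer is trained, the probe's accuracy directly measures how linearly separable the encoder's features already are.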

“Our method scales naturally to any unlabeled image dataset,” Li said, noting that the model’s image recognition capabilities could be useful in scenarios where limited labeled data is available, such as in niche industries or emerging technologies.

Likewise, he said, the generative side of the model could help in industries such as photo editing, visual effects and post-production, with the ability to remove elements from an image while maintaining a realistic look, or, given a specific class, to replace an image element with another generated element.

“It has [long] been a dream to achieve image generation and image recognition in one system. MAGE is [the result of] groundbreaking research that successfully harnesses the synergy of these two tasks and achieves state-of-the-art results in a single system,” said Huisheng Wang, senior software engineer for research and machine intelligence at Google, who participated in the MAGE project.

“This innovative system has broad applications and has the potential to inspire many future works in computer vision,” he added.

More work needed

Going forward, the team plans to streamline the MAGE system, specifically the token conversion portion of the process. Currently, some information is lost when the image data is converted to tokens. Li and the team plan to address that with other compression methods.

Aside from this, Li said they also plan to scale MAGE up to real-world, large-scale unlabeled image datasets, and to apply it to multimodal tasks such as image-to-text and text-to-image generation.


Shreya Christina
