
Meet MAGE, MIT’s unified image generation and recognition system



In a major development, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have announced a framework that can handle both image recognition and image generation tasks with high accuracy. Officially called Masked Generative Encoder, or MAGE, the unified computer vision system promises a wide range of applications and could reduce the overhead of training two separate systems to identify images and generate new ones.


The news comes at a time when companies are going all in on AI, particularly generative technologies, to improve workflows. However, as the researchers explain, the MIT system still has some shortcomings and will need to be perfected in the coming months if it is to be implemented.

The team told VentureBeat that they also plan to expand the model’s capabilities.



So, how does MAGE work?

Today, building image generation and recognition systems largely revolves around two processes: state-of-the-art generative modeling and self-supervised representation learning. In the first case, the system learns to produce high-dimensional data from low-dimensional inputs such as class labels, text embedding, or random noise. In the latter case, a high-dimensional image is used as input to create a low-dimensional embedding for feature detection or classification.
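The two pipelines run in opposite directions, which is worth seeing concretely. The sketch below is illustrative only (random weights standing in for trained networks, with made-up dimensions) to show that generation maps low-dimensional inputs up to images while recognition maps images down to embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative direction: a low-dimensional input (a class label embedding,
# text embedding, or noise vector) is mapped up to a high-dimensional image.
# `decode` is a placeholder for a trained generator.
def decode(z, image_dim=64 * 64):
    W = rng.standard_normal((image_dim, z.size))  # stand-in weights
    return W @ z

# Recognition direction: a high-dimensional image is mapped down to a
# low-dimensional embedding for classification or feature detection.
# `encode` is a placeholder for a trained encoder.
def encode(image, embed_dim=128):
    W = rng.standard_normal((embed_dim, image.size))
    return W @ image

z = rng.standard_normal(16)          # low-dimensional input
image = decode(z)                    # generation: 16 -> 4096 dims
embedding = encode(image)            # recognition: 4096 -> 128 dims
print(image.shape, embedding.shape)  # (4096,) (128,)
```

MAGE's premise is that both directions require the same visual and semantic understanding, so one model can serve both.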


These two techniques, which are currently used independently, both require a visual and semantic understanding of data. So the team at MIT decided to bring them together in a unified architecture. MAGE is the result.

To develop the system, the group used a pre-training approach called masked token modeling. They converted image data into abstracted versions represented by semantic tokens. Each token represented a 16×16 patch of the original image, acting like a mini puzzle piece.

When the tokens were ready, some of them were randomly masked and a neural network was trained to predict the hidden tokens using the context of the surrounding tokens. In this way, the system learned both to understand the patterns in an image (image recognition) and to generate new patterns (image generation).
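The masking step above can be sketched with toy data. Everything here is illustrative: in the actual system the token ids come from a learned tokenizer, while this sketch draws random ids from an assumed vocabulary just to show the shapes and the masking mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: an image tokenized into a 16x16 grid of semantic token ids,
# drawn here at random from an assumed vocabulary of 1024.
grid, vocab = 16, 1024
tokens = rng.integers(0, vocab, size=(grid * grid,))

# Randomly mask a fraction of the tokens.
mask_ratio = 0.75
n_masked = int(mask_ratio * tokens.size)
masked_idx = rng.choice(tokens.size, size=n_masked, replace=False)
MASK = vocab  # a reserved id meaning "masked"
corrupted = tokens.copy()
corrupted[masked_idx] = MASK

# The network's training objective: predict the original ids at the masked
# positions from the visible surrounding tokens. These are the targets it
# would be scored against.
targets = tokens[masked_idx]
print(corrupted.shape, n_masked, targets.shape)  # (256,) 192 (192,)
```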

“Our key insight in this work is that generation is viewed as ‘reconstructing’ images that are 100% masked, while representation learning is viewed as ‘encoding’ images that are 0% masked,” the researchers wrote in a paper detailing the system. “The model is trained to reconstruct over a wide range of masking ratios, with high masking ratios enabling generative capabilities and lower masking ratios enabling representation learning. This simple but very effective approach allows a smooth combination of generative training and representation learning in the same framework: same architecture, same training scheme, and same loss function.”
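The idea in the quote reduces to a single knob: the masking ratio. A minimal sketch, with an illustrative sampling range rather than the paper's exact distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# During training, each batch gets a masking ratio drawn from a broad
# range, so one model sees everything from lightly masked inputs
# (representation-learning regime) to almost fully masked inputs
# (generation regime). The uniform range here is an assumption for
# illustration, not the paper's actual sampling distribution.
def sample_mask_ratio(lo=0.5, hi=1.0):
    return rng.uniform(lo, hi)

# At inference, the two tasks are just the two extremes of that knob:
def task_mask_ratio(task):
    return {"generate": 1.0, "encode": 0.0}[task]

assert task_mask_ratio("generate") == 1.0  # reconstruct a 100%-masked image
assert task_mask_ratio("encode") == 0.0    # embed a 0%-masked image
```

Because both regimes share one architecture and one loss, no second system ever has to be trained.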

In addition to producing images from scratch, the system supports conditional image generation where users can specify criteria for the images and the tool will create the correct image.

“The user can input an entire image and the system can understand and recognize the image and tell the class of the image,” Tianhong Li, one of the researchers behind the system, told VentureBeat. “In other scenarios, the user can input a partially cropped image and the system can restore it. They can also ask the system to generate a random image or generate an image of a certain class, such as a fish or a dog.”

Potential for many applications

After being pre-trained on data from the ImageNet image database, which consists of 1.3 million images, the model achieved a Fréchet inception distance (FID) score (used to assess image quality) of 9.1, outperforming earlier models. In recognition, it achieved 80.9% accuracy on linear probing and 71.9% accuracy in 10-shot classification, with only 10 labeled examples per class.
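Linear probing, the evaluation mentioned above, freezes the pretrained encoder and trains only a linear classifier on its embeddings. A minimal sketch with random stand-in embeddings and assumed dimensions (the real evaluation uses the frozen MAGE encoder's features on ImageNet):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "embeddings" from a frozen encoder, in the 10-shot setting:
# 10 labeled examples per class.
n_classes, embed_dim, n_per_class = 5, 32, 10
X = rng.standard_normal((n_classes * n_per_class, embed_dim))
y = np.repeat(np.arange(n_classes), n_per_class)

# The probe itself is a single linear layer with no hidden units,
# fit here by one-vs-all least squares on one-hot targets.
Y = np.eye(n_classes)[y]                  # one-hot targets
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
preds = np.argmax(X @ W, axis=1)
print("training accuracy:", (preds == y).mean())
```

Because nothing but the linear layer is trained, the probe's accuracy directly measures how linearly separable the encoder's features already are.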

“Our method scales naturally to any unlabeled image dataset,” Li said, noting that the model’s image recognition capabilities could be useful in scenarios where limited labeled data is available, such as in niche industries or emerging technologies.

Likewise, he said, the generative side of the model could help in industries such as photo editing, visual effects and post-production, with the ability to remove elements from an image while maintaining a realistic look, or, given a specific class, to replace an image element with another generated element.

“It has [long] been a dream to achieve image generation and image recognition in one system. MAGE is [the result of] groundbreaking research that successfully harnesses the synergy of these two tasks and achieves state-of-the-art results in a single system,” said Huisheng Wang, senior software engineer for research and machine intelligence at Google, who participated in the MAGE project.

“This innovative system has broad applications and has the potential to inspire many future works in computer vision,” he added.

More work needed

Going forward, the team plans to streamline the MAGE system, specifically the token conversion portion of the process. Currently, some information is lost when the image data is converted to tokens. Li and the team plan to address that with other compression methods.

Aside from this, Li said they also plan to scale MAGE up to real-world, large-scale unlabeled image datasets, and to apply it to multimodal tasks such as image-to-text and text-to-image generation.


Shreya Christina
