Researchers created a machine-learning model that generates accurate images of scenes from text descriptions by understanding the underlying relationships between the objects in those scenes. When humans look at a scene, they notice the objects and their relationships to one another: on top of your desk, there could be a laptop to the left of a phone that is in front of a computer monitor.
Many deep learning models struggle to see the world in this way because they lack understanding of the entangled relationships between individual objects. A robot designed to assist someone in the kitchen would struggle to follow a command like “pick up the spatula to the left of the stove and place it on top of the cutting board” if it were unaware of these relationships.
To address this issue, MIT researchers created a model that understands the underlying relationships between objects in a scene. Their model depicts individual relationships one at a time, then combines these depictions to describe the entire scene. This allows the model to generate more accurate images from text descriptions, even when the scene contains multiple objects in different relationships to one another.
This research could be useful in situations where industrial robots must perform complex, multistep manipulation tasks, such as stacking items in a warehouse or assembling appliances. It also brings the field one step closer to enabling machines to learn from and interact with their environments in a manner similar to humans.
“I can’t say there’s an object at XYZ location just by looking at a table. That is not how our minds work. When we comprehend a scene in our minds, we do so based on the relationships between the objects. We believe that by developing a system that can understand the relationships between objects, we will be able to manipulate and change our environments more effectively,” says Yilun Du, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.
Du wrote the paper with co-lead authors Nan Liu, a graduate student at the University of Illinois at Urbana-Champaign, and Shuang Li, a CSAIL PhD student; Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a member of CSAIL; and senior author Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL. The findings will be presented in December at the Conference on Neural Information Processing Systems.
One relationship at a time
The researchers’ framework can generate an image of a scene based on a text description of objects and their relationships, such as “To the left of a blue stool is a wood table. A red couch is located to the right of a blue stool.”
Their system would separate these sentences into two smaller pieces that describe each individual relationship (“a wood table to the left of a blue stool” and “a red couch to the right of a blue stool”), and then model each part separately. These fragments are then combined in an optimization process to produce an image of the scene.
To represent the individual object relationships in a scene description, the researchers used a machine-learning technique known as energy-based models. This technique enables them to encode each relational description with its own energy-based model and then compose the models in a way that infers all of the objects and relationships.
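As a rough sketch of the composition idea, one energy term can be written per relation and their sum minimized to find a layout that satisfies every relation at once. This is a toy stand-in with hand-written hinge energies over 1-D object positions, not the paper's learned, image-space energy functions:

```python
# Toy sketch of composing per-relation energy functions.
# A "scene" is just a dict of 1-D object positions; each relation
# contributes one hand-written hinge energy (a stand-in for the
# learned, image-space energy models in the paper).

def left_of(pos, a, b, margin=1.0):
    # Zero energy once object a sits at least `margin` left of b.
    return max(0.0, pos[a] - pos[b] + margin)

def right_of(pos, a, b, margin=1.0):
    # Zero energy once object a sits at least `margin` right of b.
    return max(0.0, pos[b] - pos[a] + margin)

def total_energy(pos, relations):
    # Composition step: the scene's energy is the sum of the
    # individual relation energies.
    return sum(rel(pos) for rel in relations)

def optimize(pos, relations, lr=0.1, steps=500, eps=1e-4):
    # Crude finite-difference descent on the summed energy
    # (the paper samples with Langevin dynamics instead).
    pos = dict(pos)
    for _ in range(steps):
        for k in pos:
            e0 = total_energy(pos, relations)
            pos[k] += eps
            grad = (total_energy(pos, relations) - e0) / eps
            pos[k] -= eps + lr * grad
    return pos

# "A wood table to the left of a blue stool" and
# "a red couch to the right of a blue stool".
relations = [
    lambda p: left_of(p, "table", "stool"),
    lambda p: right_of(p, "couch", "stool"),
]
layout = optimize({"table": 0.0, "stool": 0.0, "couch": 0.0}, relations)
print(layout["table"] < layout["stool"] < layout["couch"])  # True
```

Because the objective is a plain sum, each relation can be modeled in isolation and still constrain the final layout jointly.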
Li explains that by breaking the sentences down into shorter pieces for each relationship, the system can recombine them in a variety of ways, allowing it to better adapt to scene descriptions it hasn’t seen before.
“Other systems would take all of the relationships as a whole and generate the image from the description in a single shot. Such approaches, however, fail when we have out-of-distribution descriptions, such as descriptions with more relations, because these models can’t really adapt one shot to generate images with more relationships. However, by combining these separate, smaller models, we can model a greater number of relationships and adapt to new combinations,” Du states.
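In the toy energy-sum picture, a description with an extra relation just adds one more term to the composed energy, with nothing else changing. The hand-written hinge energies over 1-D positions here are assumed for illustration; the paper's energy models are learned:

```python
# Sketch of why summed energies extend to more relations: an
# out-of-distribution description with an additional relation is
# handled by the same composed objective, just with one more term.
# Toy hand-written energies, not the paper's learned models.

def left_of(pos, a, b, margin=1.0):
    # Zero energy once object a sits at least `margin` left of b.
    return max(0.0, pos[a] - pos[b] + margin)

def total_energy(pos, relations):
    # Composition: one hinge term per "left of" relation.
    return sum(left_of(pos, a, b) for a, b in relations)

# Two relations, then a third added on top: no change to the
# composition machinery is needed.
two_relations = [("table", "stool"), ("stool", "couch")]
three_relations = two_relations + [("couch", "lamp")]

pos = {"table": -3.0, "stool": -1.0, "couch": 1.0, "lamp": 3.0}
print(total_energy(pos, two_relations), total_energy(pos, three_relations))  # 0.0 0.0
```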
Given an image, the system can also identify text descriptions that match the relationships between the objects in the scene. Furthermore, the researchers can use the model to edit an image, rearranging the objects in the scene to match a new description.
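The same composed-energy view suggests how this matching direction works in the toy setting: score each candidate relational description against a fixed scene and keep the one with the lowest total energy. Again, the relation names and hinge energies below are hand-written stand-ins, not the paper's learned models:

```python
# Sketch of running composed energies "in reverse": rank candidate
# relational descriptions of a fixed scene by total energy and
# pick the best (lowest) one. Toy hand-written energies over 1-D
# positions, assumed for illustration.

def hinge(x):
    return max(0.0, x)

def description_energy(scene, description, margin=1.0):
    # Sum one energy term per stated relation; lower = better match.
    total = 0.0
    for a, relation, b in description:
        if relation == "left of":
            total += hinge(scene[a] - scene[b] + margin)
        elif relation == "right of":
            total += hinge(scene[b] - scene[a] + margin)
    return total

scene = {"table": -2.0, "stool": 0.0, "couch": 2.0}
candidates = {
    "A": [("table", "left of", "stool"), ("couch", "right of", "stool")],
    "B": [("table", "right of", "stool"), ("couch", "left of", "stool")],
}
best = min(candidates, key=lambda name: description_energy(scene, candidates[name]))
print(best)  # A
```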
Understanding complex scenes
The researchers compared their model to other deep learning methods that were given text descriptions and tasked with generating images of the corresponding objects and their relationships. Their model outperformed the baselines in each case.
They also asked humans to judge whether the generated images corresponded to the original scene description. In the most complex examples, with three relationships described, 91 percent of participants concluded that the new model performed better.
“One interesting thing we discovered for our model is that we can increase our sentence from having one relation description to having two, three, or even four descriptions, and our approach continues to be able to generate images that are correctly described by those descriptions, whereas other methods fail,” Du says.
The researchers also showed the model images of previously unseen scenes, as well as several different text descriptions of each image, and it successfully identified the description that best matched the object relationships in the image.
When the researchers gave the system two relational scene descriptions that described the same image in different ways, the model recognized that the descriptions were equivalent. The researchers were impressed by the model’s robustness, especially when working with descriptions it had never seen before.
“This is very promising because it is more in line with how humans work. Humans may only see a few examples, but we can extract useful information from those few examples and combine it to create an infinite number of combinations. And our model has the ability to learn from fewer data points while generalizing to more complex scenes or image generations,” Li explains.
While these preliminary results are promising, the researchers would like to see how their model performs on more complex real-world images with noisy backgrounds and objects that are blocking one another. They also hope to incorporate their model into robotics systems, allowing a robot to infer object relationships from videos and then use this knowledge to manipulate objects in the real world.
“One of the key open problems in computer vision is developing visual representations that can deal with the compositional nature of the world around us. By proposing an energy-based model that explicitly models multiple relationships among the objects depicted in the image, this paper makes significant progress on this problem. The outcomes are truly remarkable,” says Josef Sivic, a distinguished researcher at the Czech Institute of Informatics, Robotics, and Cybernetics at Czech Technical University, who was not involved in this study.