DALL·E is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We have found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.

GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate high-fidelity images. We extend these findings to show that manipulating visual concepts through language is now within reach.


Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another. This training procedure allows DALL·E not only to generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.
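The single-stream layout can be made concrete with a small sketch. It assumes, following the published description of DALL·E rather than anything stated in this section, that the 1280 tokens split into up to 256 caption tokens followed by a 32 x 32 grid of image tokens read row by row. The function names are hypothetical, and the code only illustrates the bookkeeping, not the model itself.

```python
import math
from typing import List, Sequence

# Assumptions (not stated in this section): 256 caption tokens followed by a
# 32 x 32 grid of image tokens, flattened row by row (raster order).
TEXT_TOKENS = 256
IMAGE_GRID = 32
STREAM_LEN = TEXT_TOKENS + IMAGE_GRID * IMAGE_GRID  # 1280 tokens in total


def sequence_nll(step_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one stream.

    step_probs[i] is the probability the model assigned to the correct token
    at step i, given all earlier tokens. Maximum-likelihood training
    minimizes this sum over the training set.
    """
    return -sum(math.log(p) for p in step_probs)


def positions_to_regenerate(top: int, left: int) -> List[int]:
    """Stream positions (0-based) of the image tokens inside the rectangle
    whose top-left cell is (top, left) and which extends to the bottom-right
    corner of the grid."""
    if not (0 <= top < IMAGE_GRID and 0 <= left < IMAGE_GRID):
        raise ValueError("top and left must lie inside the image grid")
    return [
        TEXT_TOKENS + row * IMAGE_GRID + col
        for row in range(top, IMAGE_GRID)
        for col in range(left, IMAGE_GRID)
    ]


# Regenerating the bottom-right quadrant touches a quarter of the image tokens.
quadrant = positions_to_regenerate(top=16, left=16)
```

Under these assumptions, every token outside the returned positions is kept fixed while the listed positions are sampled again in stream order, each conditioned on the caption and on all earlier tokens.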

We recognize that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyze how models like DALL·E relate to societal issues such as the economic impact on certain work processes and professions, the potential for bias in model outputs, and the longer-term ethical challenges implied by this technology.


We find that DALL·E is able to generate plausible images for a wide variety of sentences that explore the compositional structure of language. We illustrate this in the next section using a series of interactive visuals. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with CLIP; we do not use any manual selection, aside from the thumbnails and standalone images that appear outside.
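The selection step described above (sample 512 candidates, rerank, keep the best 32) can be sketched as a generic top-k rerank. This is a minimal illustration, not the authors' pipeline: `rerank_top_k` is a hypothetical helper, and the `score` callable stands in for whatever caption-image similarity CLIP would provide.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def rerank_top_k(
    candidates: Sequence[T],
    score: Callable[[T], float],
    k: int = 32,
) -> List[T]:
    """Return the k highest-scoring candidates, best first.

    `score` is a placeholder for a caption-image similarity function;
    higher values mean a better match to the caption.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return sorted(candidates, key=score, reverse=True)[:k]


# Toy usage: integers stand in for 512 generated images, and the score
# prefers values close to 100.
samples = list(range(512))
best = rerank_top_k(samples, score=lambda x: -abs(x - 100), k=32)
```

Because the ranking is automatic, the 32 samples kept depend only on the scoring function, which matches the text's point that no manual selection is involved.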

Drawing Multiple Objects

Simultaneously controlling multiple objects, their attributes, and their spatial relationships presents a new challenge. For example, consider the phrase "a hedgehog wearing a red hat, yellow gloves, blue shirt, and green pants". To interpret this sentence correctly, DALL·E must not only compose each piece of clothing with the animal, but also form the associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, green) without mixing them up. We test DALL·E's ability to do this for relative positioning, stacking objects, and controlling multiple attributes.

While DALL·E offers some control over the attributes and positions of a small number of objects, its success rate can depend on how the caption is phrased. As more objects are introduced, DALL·E is prone to confusing the associations between the objects and their colors, and the success rate drops sharply. DALL·E is also brittle with respect to rephrasing of the caption in these scenarios: alternative, semantically equivalent captions often fail to yield correct results.

Visualizing Perspective and Three-dimensionality

We find that DALL·E also allows control over the viewpoint of a scene and the 3D style in which the scene is rendered.

To push this further, we test DALL·E's ability to repeatedly draw the head of a well-known figure at each angle in a sequence of equally spaced angles, and find that we can recover a smooth animation of the rotating head.
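As a rough illustration of what "equally spaced angles" means in practice, the sketch below divides a full turn into a fixed number of views and builds one prompt per view. The text does not say how each angle was communicated to the model, so the caption template and both function names are hypothetical.

```python
from typing import List


def equally_spaced_angles(n_views: int, full_turn: float = 360.0) -> List[float]:
    """Angles in degrees for n_views views spread evenly over one full turn,
    starting at 0 and excluding the duplicate view at 360."""
    if n_views <= 0:
        raise ValueError("n_views must be positive")
    step = full_turn / n_views
    return [i * step for i in range(n_views)]


def rotation_captions(subject: str, n_views: int) -> List[str]:
    """One caption per view. The wording is a made-up template, not the
    prompt format used for DALL·E."""
    return [
        f"{subject}, rotated {angle:.1f} degrees"
        for angle in equally_spaced_angles(n_views)
    ]


captions = rotation_captions("a bust of a well-known figure", n_views=8)
```

Generating one image per caption and playing the results back in order would give the frames of the rotation animation the text describes.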

Inferring Contextual Details

The task of translating text to images is underspecified: a single caption generally corresponds to an infinite number of plausible images, so the image is not uniquely determined. For example, consider the caption "a painting of a capybara sitting in a field at sunrise". Depending on the orientation of the capybara, it may be necessary to draw a shadow, though this detail is never mentioned explicitly. We explore DALL·E's ability to resolve underspecification in three cases: changing style, setting, and time; drawing the same object in a variety of different situations; and generating an image of an object with specific text written on it.