Instead of training Imagen using an image-text dataset, the Google team simply utilised a "off-the-shelf" text encoder, T5, to turn input text into embeddings. Imagen employs a series of diffusion models to turn the embedding into a picture.
If you're ready to create Deep Art with our intuitive AI art dashboard, join the Artvy community.