GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis

Abstract

Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling generation of photo-realistic images from text prompts. Despite this progress, existing methods still struggle to follow complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often fail to faithfully model object attributes and the relationships among them.

In this work, we present an alternate paradigm for T2I synthesis that decomposes complex multi-step generation into three steps: (a) Generate: we first generate an image using existing diffusion models; (b) Plan: we use Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image, expressed in terms of individual objects and their properties, and produce the sequence of corrective steps required in the form of an edit-plan; (c) Edit: we use existing text-guided image editing models to sequentially execute our edit-plan over the generated image, yielding an image that is faithful to the original instruction. Our approach derives its strength from being modular and training-free, and it can be applied over any combination of image generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further improves the overall accuracy of our proposed approach. Our method flexibly trades inference-time compute for performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models, including DALLE-3 and the latest, SD-3.5-Large. Our approach not only improves the performance of SOTA models by up to 3 points, it also reduces the performance gap between weaker and stronger models.

GraPE Framework

Let T be a textual instruction. In the task of T2I synthesis, given an instruction T, our goal is to generate an image I_o that satisfies the requirements expressed in T. While most existing techniques generate I_o directly from T, they often produce various kinds of inaccuracies due to the complexity of the instruction. We are motivated by the observation that T2I synthesis can be broken down into simpler steps: an initial generation, followed by identification of errors, and then a sequence of corrective edits, each of which is simple and object-specific. Accordingly, we propose the following generation pipeline.

Proposed GraPE framework. A given text prompt is used to generate an initial image I_g from a T2I model. I_g is then fed, along with the text prompt, into an MLLM-based planner, which identifies the objects that are misaligned in the image and outputs a set of edit plans guided by few-shot prompting. The plans are executed as a series of edits over the initial image to produce the final image.
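
The control flow of the three steps can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `grape`, the three callables `generate`, `plan`, and `edit`, and the choice to represent an edit plan as a list of instruction strings are all assumptions made for illustration. The toy stand-ins at the bottom replace the real T2I model, MLLM planner, and editing model so that the sketch runs without any model.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")  # any image representation (assumption: opaque to the loop)


def grape(
    prompt: str,
    generate: Callable[[str], Image],
    plan: Callable[[str, Image], List[str]],
    edit: Callable[[Image, str], Image],
) -> Image:
    """Hypothetical driver: generate an image, plan corrective edits, apply them in order."""
    image = generate(prompt)           # (a) Generate: initial image I_g
    edit_plan = plan(prompt, image)    # (b) Plan: list of corrective steps
    for instruction in edit_plan:      # (c) Edit: execute the plan sequentially
        image = edit(image, instruction)
    return image                       # final image I_o


# Toy stand-ins (hypothetical): an "image" is a dict mapping object name to colour.
def toy_generate(prompt: str) -> dict:
    # Pretend the T2I model rendered the apple with the wrong colour.
    return {"apple": "green", "cup": "blue"}


def toy_plan(prompt: str, image: dict) -> List[str]:
    # Hard-coded reading of the prompt; a real planner would query an MLLM.
    wanted = {"apple": "red", "cup": "blue"}
    return [
        f"make the {obj} {colour}"
        for obj, colour in wanted.items()
        if image.get(obj) != colour
    ]


def toy_edit(image: dict, instruction: str) -> dict:
    # Parse "make the <obj> <colour>" and return an updated copy of the image.
    _, _, obj, colour = instruction.split()
    return {**image, obj: colour}


result = grape("a red apple next to a blue cup", toy_generate, toy_plan, toy_edit)
print(result)
```

Passing the three stages in as callables mirrors the modularity described above: any generator, planner, or editor with a matching interface could be swapped in without changing the loop.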

Results

Iterative results of applying GraPE to images generated by SD3.5 Large and SDXL. These images are edited with the proposed PixEdit editing model.

Comparison with PixEdit

Qualitative comparison of various image editing models. PixEdit shows good performance in semantically aligning the image with the text prompt.

GraPE is a unifying and generic framework for improving the semantic alignment of T2I models through post-hoc alignment of generated images via iterative editing.
