Generating Consistent Illustrations with Gemini Image Generation


A deep dive into building an automated illustration pipeline for storytelling applications

Introduction

AI image generation has revolutionized creative workflows, but there’s a significant difference between generating a single stunning image and producing hundreds of consistent illustrations for a complete project. When building storylearner.app, we faced the challenge of generating book illustrations that maintained visual consistency while telling compelling stories through imagery.

We used Gemini’s multimodal models, mostly because they are fast and because the experimental ones are available for free.

This article explores the technical and creative challenges of large-scale AI illustration generation, showcasing techniques for achieving visual consistency and building robust pipelines that can handle the complexity of full book illustration projects.

Below are images created for a chapter of a book adapted for the storylearner platform:

The article has a companion colab notebook, so you can play with the examples yourself.

The Consistency Challenge

Understanding Visual Consistency in AI

AI image generation models operate somewhat like a company of talented artists, each with amnesia. Every generation is essentially a fresh start unless you provide explicit context. This creates unique challenges when you need to maintain character consistency, style coherence, and narrative flow across hundreds of images.

Consider this simple experiment: generating three images of “the same pig with wings and a top hat flying over a futuristic city” in different weather conditions. Even with a fixed seed, these small prompt variations introduce subtle inconsistencies: eye colors change, proportions shift, and the overall character can feel different.

The Impact of Seeds and Prompts

Seeds act like selecting a specific artist from your AI company. Using the same seed with identical prompts yields consistent results, but even small prompt variations can dramatically alter the output. We discovered that:

  • Same model + same prompt + same seed = identical results
  • Same model + same prompt + different seed = completely different character
  • Same model + slightly altered prompt + same seed = often a completely different character

This sensitivity means that scaling up requires careful orchestration of all these variables.
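
As a rough sketch of how seeds enter the picture with the google-genai SDK (the model name here is an assumption, since the experimental image models change frequently):

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

PROMPT = ("the same pig with wings and a top hat "
          "flying over a futuristic city, in heavy rain")

# Runs 1 and 2 share a seed and should reproduce the same image;
# run 3 changes only the seed and typically yields a different character.
for i, seed in enumerate([42, 42, 7]):
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",  # assumed experimental image model
        contents=PROMPT,
        config=types.GenerateContentConfig(
            response_modalities=["Text", "Image"],
            seed=seed,
        ),
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:  # inline_data holds the image bytes
            with open(f"pig_{i}_seed_{seed}.png", "wb") as f:
                f.write(part.inline_data.data)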

Building Consistency Through Reference Images

The Reference Image Approach

The most reliable method we found for maintaining character consistency involves using reference images. Here’s how it works:

  1. Generate an initial character/scene using carefully crafted prompts
  2. Upload this image as a reference for subsequent generations
  3. Include explicit instructions like “Use the supplied image as a reference for how the pig should look”

This approach significantly improves consistency, though it can also cause new issues in some very specific cases.

With the following reference image:

You can get this different but consistent one!

Here is an example of triggering one such case:

This specific case can happen when the prompt is very similar to the one that generated the initial image and the seed is identical. So, when reusing a reference image with a similar prompt, I would actually recommend changing the seed or not setting one at all.

See the reference colab for details.

Practical Implementation

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Upload reference image
files = [client.files.upload(file="reference_character.png")]

# Create parts with reference and prompt
parts = [
    types.Part.from_uri(
        file_uri=files[0].uri,
        mime_type=files[0].mime_type,
    ),
    types.Part.from_text(
        text=prompt + "\nUse the supplied image as a reference for character appearance"
    ),
]

# Generate with reference
response = client.models.generate_content(
    model=IMAGE_MODEL,
    contents=parts,
    config=types.GenerateContentConfig(
        response_modalities=['Text', 'Image'],
        # No seed set.
    ),
)

The Storylearner.app Illustration Pipeline

High-Level Architecture

Our production pipeline consists of three main stages:

  1. Idea Generation: Story text + guidelines → 3 illustration concepts per scene
  2. Idea Selection: Multiple concepts → best ideas chosen for the complete set
  3. Image Generation: Selected ideas + style guidelines + reference images → final illustrations
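
In code, the three stages compose roughly like this (a sketch; the glue names below are hypothetical, not our production pipeline):

def illustrate_chapter(chapter):
    # Stage 1: three illustration concepts per scene
    ideas = [idea_generator.generate_illustration_ideas(scene.text, chapter.context)
             for scene in chapter.scenes]
    # Stage 2: choose a diverse, non-repetitive subset of the concepts
    selected = selector.select_illustrations(chapter, ideas)
    # Stage 3: render the chosen ideas with style guidelines and a reference image
    return generate_set_of_illustrations(selected)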

Stage 1: Brainstorming Illustration Ideas

Rather than feeding story text directly to image generation (which often produces poor results), we separate conceptualization from execution:

class IdeaGenerator:
    def generate_illustration_ideas(self, text: str, context: str):
        # An f-string, so {text} and {context} are actually substituted
        prompt = f"""
        You are a visual scene designer. Based on the story below,
        describe 3 different highly detailed and imaginative illustration ideas.
        Do not include any people or humanoid figures.
        Focus on setting, atmosphere, lighting, symbolic objects,
        and environmental storytelling.

        Story: {text}
        Context: {context}
        """
        # Returns 3 detailed scene descriptions per text excerpt

This approach generates rich, detailed scene descriptions that serve as blueprints for image generation.
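
Running that prompt through a text model uses the same client as image generation (the model name and the naive parsing below are assumptions for illustration):

ideas_response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed text model
    contents=prompt,
)
ideas = ideas_response.text.split("\n\n")  # naive split into the 3 descriptions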

Stage 2: Intelligent Selection

To avoid repetitive illustrations (like “a ship, a ship, a ship” in a sea voyage story), we use an AI selector to choose the best combination of ideas:

class IllustrationSelector:
    def select_illustrations(self, chapter):
        prompt = """
        Choose the best idea for each illustration considering that:
        - The set should be diverse
        - Illustrations shouldn't contain people
        - Prefer illustrations matching the provided titles
        """
        # Returns optimal selection indices
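
To get machine-readable indices back, one option is to request JSON output (a sketch; the prompt would also need to spell out the expected schema):

import json

selection = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed text model
    contents=prompt,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",  # ask for a JSON array of indices
    ),
)
chosen_indices = json.loads(selection.text)  # e.g. [0, 2, 1]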

Stage 3: Consistent Generation

The final generation stage uses:

  • Style guidelines (detailed visual specifications)
  • Reference images for style consistency
  • Persistent chat sessions for maintaining context
  • Retry mechanisms for handling API limitations
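
Chat sessions are covered in the next section; for the last item, a minimal retry wrapper might look like this (the backoff policy and failure checks are assumptions, not our production code):

import time

def generate_with_retry(chat, prompt, max_attempts=3, base_delay=2.0):
    for attempt in range(max_attempts):
        try:
            response = chat.send_message(prompt)
            # Treat a response without an image part as a soft failure.
            parts = response.candidates[0].content.parts
            if any(p.inline_data is not None for p in parts):
                return response
        except Exception:
            pass  # e.g. rate limits or transient API errors
        time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("Image generation failed after retries")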

Visual Guidelines and Style Consistency

Crafting Effective Style Guidelines

We developed comprehensive style guidelines that go beyond simple style names:

Style: watercolor
Technique: Combine soft watercolor washes with fine ink line work for contrast and detail.
Brushwork: Embrace visible brush strokes, blooming, and natural texture.
Ink Lines: Use varied line weights for depth; apply cross-hatching or stippling for texture.
Color Palette: Limit to a few harmonious hues with gentle gradations.
Forms: Use simplified, geometric shapes; focus on essence over detail.
White Space: Treat negative space as part of the composition.
Texture: Highlight watercolor paper's natural texture and color variation.
Atmosphere: Create light, airy scenes with openness and subtle contrast.
Aesthetic: Preserve a hand-drawn look; embrace imperfections and human touch.

Chat-Based Generation for Context Continuity

Using persistent chat sessions helps maintain consistency within illustration sets:

def generate_set_of_illustrations(ideas_with_file_paths, pass_image=True):
    chat = client.chats.create(
        model=IMAGE_MODEL,
        config=types.GenerateContentConfig(response_modalities=["Text", "Image"]),
    )

    # Initialize with style guidelines and reference image
    initial_prompt = f"""
    You are a creative artist helping on an illustration project.
    Create {len(ideas_with_file_paths)} beautiful illustrations.

    VISUAL GUIDELINES:
    {VISUAL_GUIDELINES}
    """
    chat.send_message(initial_prompt)  # reference image attachment elided

    # Generate each illustration within the same chat context
    for idea in ideas_with_file_paths:
        response = chat.send_message(format_illustration_prompt(idea))
        # Process and save generated image
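
Inside that loop, the “process and save” step boils down to pulling the inline image bytes out of the response, roughly (idea.output_path is an assumed field on the idea objects):

for part in response.candidates[0].content.parts:
    if part.inline_data is not None:  # inline_data holds the generated image
        with open(idea.output_path, "wb") as f:
            f.write(part.inline_data.data)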

Avoiding Common Pitfalls

Critical Design Decisions

Through extensive experimentation, we identified several key strategies:

Avoid Human Close-ups: Character face consistency is extremely challenging. Focus on environmental storytelling instead.

No Violence or Gore: Keep illustrations family-friendly and avoid content that might trigger safety filters.

Diversify Scene Types: The selection stage prevents repetitive imagery across the complete set.

Decouple Ideation from Generation: Separating concept creation from image generation improves both quality and debuggability.

Real-World Example: Illustrating The Three Musketeers

Let’s walk through illustrating a chapter from The Three Musketeers:

Context and Settings

First, we establish the story context:

Setting: France, primarily Meung and Paris, early 17th century
Historical Context: Political tensions between the French monarchy and Cardinal Richelieu
Main Characters: D'Artagnan, Athos, Porthos, Aramis, Cardinal Richelieu, Milady de Winter

Generated Ideas

For the chapter opening, our system generated these concepts:

  1. “The Jolly Miller Inn Chaos”: Exterior scene with a yellow pony, scattered debris, and dramatic lighting hinting at recent altercation
  2. “Broken Sword”: Close-up of shattered steel on cobblestones, symbolizing lost honor and broken dreams
  3. “Inn Kitchen Aftermath”: Dimly lit interior with earthenware, bandages, and flickering candlelight

Selection and Generation

The selector chose the most diverse and narratively appropriate ideas, which were then generated using our reference image and style guidelines. The result is a set of illustrations that maintains visual consistency while telling the story effectively.

Key Insights and Best Practices

  1. Consistency vs. Perfection: Perfect consistency isn’t always necessary—visual coherence in style and mood often matters more than exact character matching.

  2. The Artist Analogy: Think of AI models as artists with amnesia. You need to provide context, references, and clear instructions for each interaction.

  3. Pipeline Modularization: Breaking the process into idea generation, selection, and execution improves quality and maintainability.

  4. Style Guidelines Matter: Detailed, specific style descriptions work better than simple style names.

  5. Reference Images Are Crucial: Upload and reference style examples for best consistency results.

  6. Long Sessions Are Fragile: A chat session often works well up to a point, then fails badly, e.g. by inserting objects from a previous illustration into the following ones.

Conclusion

There is still a huge gap between a carefully handcrafted demo on an AI company blog and practical use of the technology at scale. Things don’t work well straight out of the box, but you can make these amazing tools work for you with the help of systematic thinking about consistency, quality control, and robust engineering practices.

While challenges remain, I hope the techniques we’ve developed at storylearner.app will help power your own projects!
