gemimg is a lightweight (<400 LoC) Python package for easily interfacing with Google's Gemini API and the Gemini 2.5 Flash Image model (a.k.a. Nano Banana). This tool:
- Creates images in many aspect ratios with only a few lines of code!
- Has minimal dependencies and does not use Google's client SDK.
- Handles image I/O, including multi-image input and image encoding/decoding.
- Generates images only: no irrelevant text output.
- Includes utilities for common use cases, such as saving, resizing, and compositing multiple images.
Although Gemini 2.5 Flash Image can be used for free in Google AI Studio or Google Gemini, those interfaces place a visible watermark on their outputs and have generation limits. Using gemimg and the Gemini API directly not only gives you more programmatic control over the generation, it also makes complex inputs much easier, which increases productivity for power users.
gemimg can be installed from PyPI:
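```sh
pip install gemimg
```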
First, you will need to get a Gemini API key (from a GCP project with billing information), or an applicable free API key.
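You can then pass the key directly when creating the client. A minimal sketch (the `GemImg` class name and `api_key` parameter here are assumptions for illustration):

```python
from gemimg import GemImg

# Hypothetical client construction: pass the API key directly.
g = GemImg(api_key="YOUR_GEMINI_API_KEY")
```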
You can also provide the API key by storing it in a .env file with a GEMINI_API_KEY field in the working directory (recommended), or by setting the GEMINI_API_KEY environment variable directly.
Now, you can generate images with a simple text prompt!
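A minimal sketch, continuing with the hypothetical `g` client from above (the prompt is illustrative):

```python
# generate() returns an object whose .image attribute is a PIL.Image;
# by default, the result is also saved as a PNG in the working directory.
gen = g.generate("a kitten with prismatic fur sitting in a teacup")
print(gen.image.size)
```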
The generated image is stored as a PIL.Image object and can be retrieved with gen.image for passing again to Gemini 2.5 Flash Image for further edits. By default, generate() also automatically saves the generated image as a PNG file in the current working directory. You can save a WEBP instead by specifying webp=True, change the save directory by specifying save_dir, or disable the saving behavior with save=False.
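For example, using the saving options described above (parameter names as documented; prompts illustrative):

```python
# Save as WEBP into a custom directory instead of a PNG in the cwd.
gen = g.generate("a watercolor fox in the rain", webp=True, save_dir="outputs")

# Or keep the image in memory only, without saving to disk.
gen = g.generate("a watercolor fox in the rain", save=False)
```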
Due to Gemini 2.5 Flash Image's multimodal text encoder, you can create nuanced prompts with details and positioning that models such as Flux or Midjourney do not follow as consistently:
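An illustrative sketch of such a prompt (wording entirely hypothetical):

```python
gen = g.generate(
    "A professional food photo of a three-tier cake on a marble counter. "
    "The top tier is matcha green with a single lit candle, the middle tier "
    "is ube purple, and the bottom tier is classic vanilla, positioned "
    "slightly left of center with soft window light from the right."
)
```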
Gemini 2.5 Flash Image allows you to make highly-targeted edits to images. With gemimg, you can pass along the image you just generated very easily for editing.
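A sketch of such an edit, feeding the previous generation back in (exactly how generate() accepts input images here is an assumption; only the fact that it takes images is documented):

```python
# Hypothetical: pass the prior PIL.Image back in with an edit instruction.
gen_edit = g.generate("change the candle's flame to a tiny blue star", gen.image)
```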
You may have noticed from the previous example that the prompt input is a Markdown dashed list. As a model built on Gemini's text encoder, Nano Banana is extremely responsive to Markdown formatting compared to the older text encoders used in traditional image-generation models. You can prompt-engineer highly nuanced subject and compositional requirements, and Nano Banana follows them with very high accuracy:
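An illustrative sketch of a Markdown-list prompt (the scene and requirements are hypothetical):

```python
prompt = """Create an image of a cozy bookstore cafe with the following requirements:

- A tabby cat sleeping on a stack of exactly three red books
- A barista in a green apron steaming milk in the background
- Warm afternoon light entering through a window on the left
"""

gen = g.generate(prompt)
```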
You can also input two (or more!) images/image paths to do things like combine images or put an object from Image A into Image B without having to train a LoRA. For example, here's a mirror selfie of myself, and a fantasy lava pool generated with gemimg that beckons me to claim its power:
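A sketch of multi-image input (file names hypothetical; the list-of-images form follows the description above):

```python
gen = g.generate(
    "Put the person in the first image into the lava pool in the second image, "
    "kneeling to claim its power.",
    ["mirror_selfie.jpg", "lava_pool.png"],
)
```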
You can also guide the generation with an input image, similar to ControlNet implementations. Giving Gemini 2.5 Flash Image this handmade input drawing and prompt:
This is just the tip of the iceberg of what you can do with Gemini 2.5 Flash Image (a blog post is coming shortly). By leveraging Gemini 2.5 Flash Image's long context window, you can even give it HTML and have it render a webpage (Jupyter Notebook). And that's not even getting into JSON prompting of the model, which can offer extremely granular control over the generation (Jupyter Notebook).
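A sketch of JSON prompting (the schema below is entirely illustrative; the JSON is passed simply as prompt text):

```python
import json

# Hypothetical schema: structure the prompt as JSON for granular control.
prompt = json.dumps({
    "subject": "a corgi in a tiny astronaut suit",
    "style": "1970s NASA promotional photograph",
    "composition": {
        "camera_angle": "low angle",
        "lighting": "harsh studio flash",
        "background": "launchpad at dawn",
    },
})

gen = g.generate(prompt)
```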
- Gemini 2.5 Flash Image cannot do style transfer, e.g. "turn me into Studio Ghibli", and seems to ignore commands that try to do so. Google's developer documentation example of style transfer unintentionally demonstrates this by incorrectly applying the specified style. The only way to shift the style is to generate a completely new image in that style, which can still have mixed results if the source style is intrinsic.
- This also causes issues with the "put subject from Image A into Image B" use case if either image is in a substantially different style.
- Gemini 2.5 Flash Image does have moderation, in the form of both prompt moderation and post-generation image moderation, although it's more lenient than is typical for Google's services. In the former case, gen.text will indicate the refusal reason; in the latter case, a PROHIBITED_CONTENT error will be thrown.
- Gemini 2.5 Flash Image is unsurprisingly bad at free-form text generation, both in terms of text fidelity and frequency of typos. However, a workaround is to provide the rendered text as an input image and ask the model to composite it with another image (see the sketch after this list).
- Yes, both a) LLM-style prompt engineering with Markdown-formatted lists and b) old-school image-quality syntactic sugar such as "award-winning" and "DSLR camera" are extremely effective with Gemini 2.5 Flash Image, due to its text encoder and larger training dataset, which can now more accurately discriminate which specific image traits are and aren't present in an award-winning image. I've tried generations both with and without these tricks, and the tricks definitely have an impact. Google's developer documentation encourages the latter.
- Cherry-picking outputs, in the sense that multiple generations with the same prompt are needed to get one good output, is surprisingly minimal for an image-generation model: Gemini 2.5 Flash Image tends to correctly interpret the intent on the first try. Any obvious logical mistakes can consistently be fixed with more prompt engineering; most superfluous-seeming prompt additions you see in the examples are cases where such a fix was applied.
- Although the Gemini 2.5 Flash Image API schema suggests that it supports system prompts, they don't appear to have any impact on the resulting output, so they are not supported in this package.
- gemimg is intended to be bespoke and very tightly scoped. Compatibility with other image-generation APIs and/or endpoints will not be supported unless they follow an identical API (e.g. a hypothetical gemini-3-flash-image). As this repository is designed to be future-proof, there likely will not be many updates other than bug/compatibility fixes.
- gemimg intentionally does not support true multiturn generation within a single conversation thread, as:
  - The technical lift for doing so would mean this package is no longer lightweight.
  - It is unclear whether it's actually better for the typical use cases.
- gemimg intentionally does not support text output (and therefore the "interweaving" use case from the API examples) because:
  - Text output slows down image generation, which is the sole purpose of this package.
  - Text output can cause the model to rethink aspects of the generation, which adds undesirable entropy to the prompt.
  - Interweaving has the same issues as generating multiple images in a single call, and is unreliable.
- By default, input images to generate() are resized such that their maximum dimension is 768px while maintaining the aspect ratio. This is done a) as a sanity safeguard against providing a massive image, and b) because Gemini processes images in 768x768px tiles, so this forces the input to be a single tile, which should lower costs and improve consistency. If you want to disable this behavior, set resize_inputs=False.
- Do not question my example image prompts. I assure you, there is a specific reason or objective for every model input and prompt engineering trick. There is a method to my madness...although for this particular project, I confess it's more madness than method.
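As a sketch of the text-compositing workaround mentioned in the notes above (the Pillow calls are standard; the font path, prompt wording, and image-passing convention are assumptions):

```python
from PIL import Image, ImageDraw, ImageFont

# Render the exact text as a clean image, since the model reproduces
# provided text far more faithfully than it generates text from scratch.
text_img = Image.new("RGB", (768, 256), "white")
draw = ImageDraw.Draw(text_img)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 96)  # hypothetical font path
draw.text((32, 64), "GRAND OPENING", font=font, fill="black")
text_img.save("sign_text.png")

# Hypothetical: ask the model to composite the rendered text onto a scene.
gen = g.generate(
    "Composite the text in the first image onto the storefront banner in the "
    "second image, matching its perspective and lighting.",
    ["sign_text.png", "storefront.png"],
)
```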
- Async support (for parallel calls and FastAPI support)
- Additional model parameters if the Gemini API supports them.
Max Woolf (@minimaxir)
Max's open-source projects are supported by his Patreon and GitHub Sponsors. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.
MIT