Image segmentation using Gemini 2.5

Max Woolf's discovery: In a Hacker News comment on 18th April 2025, Max Woolf pointed out a new feature of the Gemini 2.5 series. For image inputs, the model can produce segmentation masks as well as 2D bounding boxes.
Simon Willison's tools:
- Built a tool last year to explore Gemini's bounding box abilities.
- Created a new tool - Gemini API Image Mask Visualization - to try out the new segmentation mask feature. It's browser-based JavaScript that talks directly to the Gemini API and requires a Gemini API key.
Using the tools: Provide an image and a prompt such as 'Give the segmentation masks for the objects. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key "box_2d" and the segmentation mask in key "mask".' The tool runs the prompt and displays the resulting JSON, in which each segmentation mask is a base64-encoded PNG image.
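To make that JSON shape concrete, here is a minimal Python sketch that parses one entry and decodes its mask. It assumes the mask may arrive as a `data:image/png;base64,...` data URI (a bare base64 string is handled too). The function name `decode_mask` and the sample entry are hypothetical, and the sample "mask" is only the 8-byte PNG signature, not a real image.

```python
import base64
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def decode_mask(entry: dict) -> bytes:
    """Return the raw PNG bytes stored in one entry's "mask" value."""
    mask = entry["mask"]
    # Assumption: the mask may carry a data-URI prefix; strip it if present.
    if mask.startswith("data:"):
        mask = mask.split(",", 1)[1]
    png = base64.b64decode(mask)
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("mask is not a PNG image")
    return png

# Hypothetical response text: a JSON list with a single entry.
sample = json.dumps([{
    "box_2d": [100, 200, 300, 400],
    "mask": "data:image/png;base64,"
            + base64.b64encode(PNG_SIGNATURE).decode("ascii"),
}])

for entry in json.loads(sample):
    print(entry["box_2d"], len(decode_mask(entry)), "bytes")
```

The decoded bytes could then be handed to any image library to overlay the mask on the bounding box region.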
Code development: Vibe coded the whole thing using a combination of Claude and ChatGPT. Started with a Claude Artifacts React prototype, then pasted code from the old project into Claude and hacked on it until the session ran out of tokens. Transferred the incomplete result to a new Claude session and kept iterating until it got stuck in a bug loop, then switched to o3 in ChatGPT to finish it off.
Pricing: Segmenting a pelican photo via the Gemini API cost a fraction of a cent. Using Gemini 2.5 Pro, the call cost 0.2144 cents. With Gemini 2.5 Flash, it cost 0.099 cents. Gemini 2.5 Flash has two output prices depending on mode: input is $0.15/million tokens, while output is $0.60/million in non-thinking mode and $3.50/million in thinking mode.
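The arithmetic behind such figures can be sketched in a few lines of Python. The prices are the Gemini 2.5 Flash rates quoted above; the token counts are hypothetical, chosen only to illustrate the calculation, and the sketch assumes any thinking tokens are already included in the output token count.

```python
# US dollars per million tokens, as quoted above for Gemini 2.5 Flash.
INPUT_PRICE = 0.15
OUTPUT_PRICE = {"non-thinking": 0.60, "thinking": 3.50}

def cost_in_cents(input_tokens: int, output_tokens: int, mode: str) -> float:
    """Cost of one call in US cents for the given output pricing mode."""
    dollars = (
        input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE[mode]
    ) / 1_000_000
    return dollars * 100

# Hypothetical token counts, not the ones from the pelican example.
for mode in OUTPUT_PRICE:
    print(mode, round(cost_in_cents(1_000, 500, mode), 4), "cents")
```

Because the output price differs by almost a factor of six between the two modes, the output token count dominates the bill for calls like this one.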
Upgrading the tool: While upgrading the tool to try non-thinking mode, noticed that the API library it used was deprecated. Pasted the code into the new o4-mini-high model in ChatGPT, which identified the changes needed to move to the new recommended JavaScript library (@google/genai). The latest version, rewritten by o4-mini for the new library, defaults to Gemini 2.5 Flash in non-thinking mode and displays token usage.
Using JSON schemas: Since the LLM CLI tool supports JSON schemas, a schema can be used to make the model return JSON of an exact shape for a given image. For example, Gemini 2.5 Flash can be asked to return bounding boxes and segmentation masks for all objects in an image. Vibe coded a second tool for visualizing that JSON.
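As an illustration of what such a schema might look like, the Python sketch below builds one as a dict and prints it as JSON. The schema actually used is not shown here, so everything in it is an assumption: the `items` wrapper key, the four-integer `box_2d`, and the string `mask` simply follow the key names from the prompt. The helper `matches_shape` is hypothetical and only a light structural check, not a full JSON Schema validator.

```python
import json

# Hypothetical schema: an object wrapping a list of detected items.
SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "box_2d": {
                        "type": "array",
                        "items": {"type": "integer"},
                        "minItems": 4,
                        "maxItems": 4,
                    },
                    "mask": {"type": "string"},
                },
                "required": ["box_2d", "mask"],
            },
        }
    },
    "required": ["items"],
}

def matches_shape(data: dict) -> bool:
    """Light structural check mirroring SCHEMA above."""
    items = data.get("items")
    if not isinstance(items, list):
        return False
    for item in items:
        if not isinstance(item, dict):
            return False
        box = item.get("box_2d")
        if not (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, int) for v in box)):
            return False
        if not isinstance(item.get("mask"), str):
            return False
    return True

print(json.dumps(SCHEMA, indent=2))
```

The printed JSON is the kind of value that could be supplied to the LLM CLI's `--schema` option; whether a given model honors `minItems` and `maxItems` is not something this sketch can guarantee.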
Coordinate system buttons: It was unclear while building the tool where the origin of the coordinate system lies, so the tool has buttons for switching the origin to see which one fits. The same buttons can be used to make the pelican outlines flap around the page.
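The effect of switching origins can be sketched in Python as follows. This assumes `box_2d` is `[y0, x0, y1, x1]` on a 0-1000 scale, which is an assumption here rather than something established above; the function name `to_pixel_rect` and the sample box and image size are hypothetical.

```python
ORIGINS = ("top-left", "top-right", "bottom-left", "bottom-right")

def to_pixel_rect(box_2d, width, height, origin="top-left"):
    """Convert a normalized box to (left, top, right, bottom) in pixels.

    Assumes box_2d is [y0, x0, y1, x1] on a 0-1000 scale, measured
    from the given origin corner of the image.
    """
    if origin not in ORIGINS:
        raise ValueError(f"unknown origin: {origin}")
    y0, x0, y1, x1 = box_2d
    if origin.endswith("right"):
        x0, x1 = 1000 - x1, 1000 - x0  # mirror horizontally
    if origin.startswith("bottom"):
        y0, y1 = 1000 - y1, 1000 - y0  # mirror vertically
    return (
        x0 * width / 1000,
        y0 * height / 1000,
        x1 * width / 1000,
        y1 * height / 1000,
    )

# Hypothetical box and image size; each origin yields a different rectangle.
for origin in ORIGINS:
    print(origin, to_pixel_rect([100, 200, 300, 400], 1000, 500, origin))
```

Only one of the four interpretations will line the rectangle up with the object in the image, which is what cycling through the buttons reveals.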