Gen-AI is on course to become the most powerful stock generator ever. Simultaneously its commercial success and its creative limitation. Unless...

candyandgrim
Apr 19
6 min read

If you've ever duplicated a layer before making a destructive edit, you already understand non-destructive pipelines. You were preserving optionality. Keeping the original intact. Maintaining the ability to go back, change your mind, or hand the work to someone else without explanation.

Non-destructive workflows are how professional creative tools have operated for decades. Smart objects in Photoshop. Live effects in Illustrator. Node-based compositing in Nuke and Resolve, layer stacks in After Effects. The principle is consistent: the original data stays intact, the edits are instructions layered on top, and the pipeline remains revisable at every stage.
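For developers who haven't lived inside these tools, the principle can be sketched in a few lines. This is a toy illustration, not how any of the applications above are implemented: the names NonDestructiveDoc, Edit, brighten and invert are invented for the example, and an "image" is just a short list of RGB tuples.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[Pixel]  # a flat list of RGB pixels, purely for illustration


@dataclass
class Edit:
    """One instruction in the stack (hypothetical structure)."""
    name: str
    apply: Callable[[Image], Image]
    enabled: bool = True


@dataclass
class NonDestructiveDoc:
    original: Tuple[Pixel, ...]  # stored as a tuple so it cannot be modified in place
    edits: List[Edit] = field(default_factory=list)

    def render(self) -> Image:
        """Replay every enabled edit on a copy of the original."""
        image = list(self.original)
        for edit in self.edits:
            if edit.enabled:
                image = edit.apply(image)
        return image


def brighten(amount: int) -> Callable[[Image], Image]:
    def _apply(image: Image) -> Image:
        return [tuple(min(255, c + amount) for c in px) for px in image]
    return _apply


def invert(image: Image) -> Image:
    return [tuple(255 - c for c in px) for px in image]


doc = NonDestructiveDoc(original=((10, 20, 30), (200, 210, 220)))
doc.edits.append(Edit("brighten", brighten(50)))
doc.edits.append(Edit("invert", invert))

baked = doc.render()           # both edits applied
doc.edits[1].enabled = False   # change your mind: switch off the invert
revised = doc.render()         # re-rendered from the untouched original
```

The point is the last three lines. Because the original is never overwritten, turning an edit off is a re-render, not a rebuild.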

This isn't a luxury feature. It's the foundation of collaborative, iterative, production-grade creative work. It's what separates a deliverable from a dead end.

Gen-AI breaks this chain completely.

Every output from every major generative tool today is a flat commitment. A baked render. Pixels with no memory of how they were made, no semantic structure, no named layers, no depth passes, no material information, no hierarchy. You can generate something extraordinary and have almost no practical way to integrate it into a production pipeline without rebuilding it from scratch or treating it as a texture to paint over.

This is not a quality problem. The outputs are often stunning. It's a structural problem—and it has a specific consequence.

When your outputs are flat, fast, and cheap to generate, you become very good at one thing: producing finished-looking images at volume. Which is precisely what a stock library does. Gen-AI is not on course to replace the creative industry. It is on course to replace Getty Images—and do it better than Getty ever could. Infinite variety, instant delivery, zero licensing friction.

That is a genuine commercial success. It is also a ceiling.

Because the moment a creative director, motion designer, or production studio needs to do something with a generated asset—composite it, reanimate it, relight it, adapt it for a different format, hand it to a developer, a printer, or a broadcast suite—the flat output fails them. Not because it looks wrong. Because it has no structure to work with.

Several tools emerging right now are—accidentally or otherwise—pointing toward something better.

Gaussian splatting doesn't store a flat image. It represents a scene as addressable spatial data. Position and opacity are intrinsic to every splat, and depth follows directly from position once you choose a camera. Re-compositing, isolating elements, or querying depth isn't a post-process; it's just reading the representation. Structure isn't extracted after the fact. It never collapsed in the first place.
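To make "just reading the representation" concrete, here is a minimal sketch. It assumes a heavily simplified splat (position, opacity, plain RGB); real splat formats also carry scale, rotation and view-dependent colour coefficients. The names Splat, depth_of and isolate_by_depth are hypothetical, and the camera sits at the origin looking down +z unless told otherwise.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Splat:
    position: Vec3  # world-space centre of the Gaussian
    opacity: float  # 0.0 (invisible) to 1.0 (solid)
    color: Vec3     # simplified to plain RGB for this sketch


def depth_of(splat: Splat,
             camera_pos: Vec3 = (0.0, 0.0, 0.0),
             view_dir: Vec3 = (0.0, 0.0, 1.0)) -> float:
    """Distance of the splat along the camera's unit-length view direction."""
    offset = tuple(p - c for p, c in zip(splat.position, camera_pos))
    return sum(o * v for o, v in zip(offset, view_dir))


def isolate_by_depth(splats: List[Splat], near: float, far: float) -> List[Splat]:
    """Keep only the splats whose depth falls inside [near, far]."""
    return [s for s in splats if near <= depth_of(s) <= far]


scene = [
    Splat((0.0, 0.0, 1.5), 0.9, (1.0, 0.0, 0.0)),    # foreground element
    Splat((0.2, 0.1, 4.0), 0.8, (0.0, 1.0, 0.0)),    # midground element
    Splat((-1.0, 0.5, 12.0), 0.3, (0.0, 0.0, 1.0)),  # distant background
]

foreground = isolate_by_depth(scene, near=0.0, far=2.0)
```

No segmentation model, no depth estimator. Pulling the foreground is a filter over data that was already there.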

Krea's real-time canvas is the closest thing currently available to a true previz mode: low commitment, fast inference, structurally revisable before any expensive compute runs. It models a relationship with generation unlike anything else on the market.

Beeble / SwitchLight extracts PBR material passes from live footage. Albedo, roughness, metallic—the building blocks of relightable, production-ready assets—recovered from video that was never generated with structure in mind.

Qwen-Image-Layered decomposes a flat raster into semantically separated RGBA layers with depth ordering, recursively if needed. Not perfect. Not production-ready at scale. But structurally correct in its ambitions.
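Why depth-ordered RGBA layers matter is easiest to see from the consumer side. The sketch below is not Qwen-Image-Layered's API. It only assumes that some tool has handed you named layers with a depth order, and shows the standard "over" compositing operator rebuilding the flat image from them. Layer, over and flatten are hypothetical names, each layer is two pixels wide to keep it readable, and a larger depth value is assumed to mean further from the viewer.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

RGBA = Tuple[float, float, float, float]  # straight (non-premultiplied) alpha, 0.0 to 1.0


@dataclass
class Layer:
    name: str
    depth: int          # assumed convention: larger means further from the viewer
    pixels: List[RGBA]


def over(src: RGBA, dst: RGBA) -> RGBA:
    """Composite src over dst using the standard 'over' operator."""
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    out_a = sa + da * (1.0 - sa)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)

    def blend(s: float, d: float) -> float:
        return (s * sa + d * da * (1.0 - sa)) / out_a

    return (blend(sr, dr), blend(sg, dg), blend(sb, db), out_a)


def flatten(layers: List[Layer], pixel_count: int,
            hidden: FrozenSet[str] = frozenset()) -> List[RGBA]:
    """Composite visible layers back to front onto a transparent canvas."""
    result: List[RGBA] = [(0.0, 0.0, 0.0, 0.0)] * pixel_count
    for layer in sorted(layers, key=lambda l: l.depth, reverse=True):
        if layer.name in hidden:
            continue
        result = [over(src, dst) for src, dst in zip(layer.pixels, result)]
    return result


BLUE = (0.0, 0.0, 1.0, 1.0)
RED = (1.0, 0.0, 0.0, 1.0)
CLEAR = (0.0, 0.0, 0.0, 0.0)

layers = [
    Layer("subject", depth=0, pixels=[RED, CLEAR]),
    Layer("background", depth=1, pixels=[BLUE, BLUE]),
]

full = flatten(layers, pixel_count=2)
plate_only = flatten(layers, pixel_count=2, hidden=frozenset({"subject"}))
```

With layers, removing the subject is one argument. With a flat raster, the pixels behind the subject do not exist and have to be invented.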

Tripo3D is already filtering generated 3D assets by Smart Mesh, Untextured, Textured, and Rigged outputs—evidence that the 3D generation world is beginning to think in structural terms rather than flat renders. Promising. But current limitations are instructive: it can only model what it can see, meaning internal mechanisms, occluded geometry, and functional moving parts are beyond its understanding. And the mesh topology is optimised for display at preview distance, not for the subdivision, rigging, or tight camera work a production pipeline demands. It's generating the right idea of structure. Not yet the right structure.

Even Adobe—carrying decades of legacy architecture—is accidentally closest to the right model. The generative rotate tool in Photoshop previews at low fidelity before committing to full resolution, outputs to a separate layer, and offers harmonisation as a deliberate step. The angle batch tool in Illustrator generates multiple views as lightweight vector reference before committing to rendered output—structurally intact, animation-ready, without burning credits on every iteration. Firefly's batch generation lets you iterate multiple variants simultaneously before committing. These arrived as UX solutions to generation cost. They are also, quietly, the correct pipeline architecture.

Then there's what's happening inside ComfyUI.

FXTD's Radiance suite—a 79-node HDR image and video pipeline—is running a true 32-bit scene-linear workflow with full OpenColorIO integration, native ACEScg support, AI-driven HDR highlight reconstruction, and a live two-way bridge directly into Nuke with full multi-channel 32-bit EXR. It includes temporal flicker reduction designed specifically for the instability patterns in generated footage—which is itself a telling detail. It's patching a structural wound that shouldn't exist if the pipeline were right.

This is serious infrastructure. And it lives entirely in ComfyUI.

That matters, because ComfyUI is where ideas go to get pressure-tested before they're adopted elsewhere. A rite of passage rather than a destination. The technical barriers are real—it is not a collaborative studio platform, and most creative teams will never touch it directly. But what gets proven in ComfyUI has a way of graduating. The question isn't whether these capabilities are real. It's how long before they exist inside the tools your team already uses.

None of these tools fully solve the problem. They solve corners of it, in isolation, for specific input types or narrow use cases. What they collectively prove is that structure-aware generation is not theoretically out of reach—it's just not what anyone has prioritised building.

What the industry actually needs is generation that outputs structure natively. Not as a post-pass. Not as a separate analysis job run on a flat result. As part of the generation itself.

A previz mode: fast, low resolution, structurally intact—previewing intent, not just fidelity.

A full-res flat mode: the current default, chosen deliberately rather than imposed universally.

An editable mode: expensive because it's doing more work during generation—surfacing the depth, the material passes, the semantic layers, the named geometry that the model already understands internally but currently discards before handing you the output.
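As a rough sketch of how those three modes might look from the calling side: everything here is hypothetical, including the pass names and the reading of "structurally intact" as "previz still returns the structural passes, just at low resolution". No existing tool exposes this interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet

# Pass names are illustrative placeholders, not a real specification.
STRUCTURE_PASSES = frozenset(
    {"depth", "material_passes", "semantic_layers", "named_geometry"}
)


class Mode(Enum):
    PREVIZ = "previz"
    FLAT = "flat"
    EDITABLE = "editable"


@dataclass(frozen=True)
class OutputPlan:
    full_resolution: bool
    passes: FrozenSet[str]


def plan_for(mode: Mode) -> OutputPlan:
    """Describe what a generation request in the given mode would return."""
    beauty = frozenset({"beauty"})
    if mode is Mode.PREVIZ:
        # Cheap and low resolution, but structure is kept.
        return OutputPlan(full_resolution=False, passes=beauty | STRUCTURE_PASSES)
    if mode is Mode.FLAT:
        # Today's default: a finished render and nothing else.
        return OutputPlan(full_resolution=True, passes=beauty)
    # EDITABLE: full resolution with every structural pass surfaced.
    return OutputPlan(full_resolution=True, passes=beauty | STRUCTURE_PASSES)
```

The design choice worth noticing is that flat becomes one option among three rather than the only thing on offer.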

For any developer reading this wondering what creatives actually need from a structure-aware pipeline, here it is in plain terms:

Vector

Named, semantically grouped SVG or PDF layers, not flat path soup. Logical hierarchy with human-readable IDs that survive handoff to animation or development.

Pixel: still image

Depth-separated RGBA layers with correct occlusion order. Smart selection masks per semantic element. Full PBR passes: albedo, roughness, metallic, normal. Multi-layered TIF as the transport format: 32-bit, ICC-profiled, universally readable, not dependent on proprietary software to open.

Pixel: moving image

Image sequences as multi-channel EXR: 32-bit, one file per frame, passes baked in. Temporal consistency across frames as a baseline requirement, not an afterthought. Flicker reduction built into the pipeline, not patched on top.

HDR and colour

True scene-linear output. 32-bit from generation, not upsampled from 8-bit after the fact. HDR highlight reconstruction that preserves latitude rather than synthesising headroom that was never captured.

3D and spatial

Named geometry with parts separation at generation. Material groups that correspond to PBR channels. Subdivision-compatible topology (edge loops placed with intent, clean pole management, curvature held by geometry rather than texture) so the mesh survives rigging, deformation, and close camera work without retopology. Gaussian splat output as an alternative scene representation, structurally addressable rather than rendered flat.
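One way to turn a wish list like this into something testable is a delivery check. The sketch below covers only the still-image requirements and is an illustration, not a standard: StillImageDelivery, LayerInfo and missing_structure are invented names, and the rules are a literal reading of the list above.

```python
from dataclasses import dataclass, field
from typing import List

REQUIRED_PBR_PASSES = ("albedo", "roughness", "metallic", "normal")


@dataclass
class LayerInfo:
    name: str         # human-readable ID
    depth_order: int  # assumed convention: 0 is nearest the camera
    has_mask: bool    # selection mask for this semantic element


@dataclass
class StillImageDelivery:
    bit_depth: int
    scene_linear: bool
    layers: List[LayerInfo] = field(default_factory=list)
    pbr_passes: List[str] = field(default_factory=list)


def missing_structure(delivery: StillImageDelivery) -> List[str]:
    """Return the structural requirements this delivery fails to meet."""
    problems = []
    if delivery.bit_depth != 32:
        problems.append(f"bit depth is {delivery.bit_depth}, expected 32")
    if not delivery.scene_linear:
        problems.append("output is not scene-linear")
    if not delivery.layers:
        problems.append("no separated layers")
    for layer in delivery.layers:
        if not layer.name.strip():
            problems.append("layer without a human-readable name")
        if not layer.has_mask:
            problems.append(f"layer '{layer.name}' has no selection mask")
    orders = [layer.depth_order for layer in delivery.layers]
    if len(set(orders)) != len(orders):
        problems.append("ambiguous occlusion order between layers")
    for pass_name in REQUIRED_PBR_PASSES:
        if pass_name not in delivery.pbr_passes:
            problems.append(f"missing PBR pass: {pass_name}")
    return problems


# What a typical flat generation looks like through this lens.
flat_output = StillImageDelivery(bit_depth=8, scene_linear=False)

# What a structure-aware generation would need to hand over.
structured_output = StillImageDelivery(
    bit_depth=32,
    scene_linear=True,
    layers=[
        LayerInfo("subject", depth_order=0, has_mask=True),
        LayerInfo("background", depth_order=1, has_mask=True),
    ],
    pbr_passes=["albedo", "roughness", "metallic", "normal"],
)

flat_report = missing_structure(flat_output)
structured_report = missing_structure(structured_output)
```

Run against a flat 8-bit render, the report lists everything a production team would have to rebuild by hand. Run against a structured delivery, it comes back empty.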

None of this is speculative. Every item on this list exists somewhere in the pipeline already—in VFX tools, in game engines, in research papers. What doesn't exist is gen-AI that outputs them natively, by default, as part of a single coherent generation event.

The data exists inside these models. The structural understanding is already there. It's being thrown away at the last step.

Until that changes, gen-AI will keep producing extraordinary flat images at extraordinary scale—and the creative industry will keep treating them as expensive stock photography, because that's structurally all they are.

Gen-AI is on course to become the most powerful stock generator ever built. Simultaneously its commercial success and its creative limitation.

Unless the next generation of tools stops optimising for the render and starts outputting the scaffold.

Tools referenced

Krea: krea.ai
Beeble / SwitchLight: switchlight.beeble.ai
Qwen-Image-Layered: github.com/QwenLM/Qwen-Image-Layered
Tripo3D: tripo3d.ai
FXTD Radiance: fxtd.org
Immersity AI: immersity.ai
Gaussian Splatting: no single URL, because it is an open research area rather than a single product. Worth looking at LumaAI, Postshot, Polycam, NerfStudio, and Scaniverse.
