Custom ControlNets for Virtual Staging -- An Exploration

In my quest to develop high-quality and coherent staging options for our project redecorly.com, I am looking for ways to do virtual staging with AI: quick inference, high quality, and correct preservation of the important parts. Windows and doors should be kept, and the room structure cannot change.
Previously, I built a ComfyUI-based tool that goes from a furnished room to a redecorated furnished room. It involves some segmentation and depth maps. The model is open source and can be used on Replicate (version 1, version 2).
Now, this model works well and is robust. Empty rooms, however, are more difficult. A few months ago there was a public competition on this exact problem, with big cash prizes even, and the results are still not great in my opinion.
The first issue is that depth map conditioning of an empty room tends to produce empty rooms too. The second issue is that segmentation plus inpainting (to keep the windows, doors and so on) means the diffusion model won't generate furniture in front of those regions. Thirdly, these kinds of convoluted workflows are messy, full of custom logic, slow and memory-intensive.
To solve these issues, I trained a custom model. It was partly inspired by LooseControl (for which the weights don't seem to be public, unfortunately), but uses RGB input instead. The great thing is that this requires no segmentation and no inpainting models, and it works with any SDXL checkpoint. The prompt controls style, furniture and materials.
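For context, here is a minimal sketch of how such a ControlNet could be trained with Hugging Face diffusers. This is not the actual training code: the dataset layout (empty-room photo, furnished photo, caption), the SDXL base checkpoint and the simplified single training step are all assumptions, and a real loop also needs SDXL's two text encoders, the added time ids, an optimizer and a dataloader.

```python
# Sketch only: a simplified training step for an RGB-conditioned SDXL ControlNet.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, ControlNetModel, DDPMScheduler, UNet2DConditionModel

base = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed base checkpoint
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Initialise the ControlNet from the SDXL UNet; 3 conditioning channels means the
# plain RGB photo of the empty room is the only control signal (no depth, no masks).
controlnet = ControlNetModel.from_unet(unet, conditioning_channels=3)

def training_step(empty_rgb, furnished_rgb, prompt_embeds, added_cond_kwargs):
    # empty_rgb: (B, 3, H, W) pixel-space condition; furnished_rgb: matching target.
    latents = vae.encode(furnished_rgb).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # The empty-room RGB photo conditions the ControlNet branch.
    down_res, mid_res = controlnet(
        noisy_latents, timesteps,
        encoder_hidden_states=prompt_embeds,
        controlnet_cond=empty_rgb,
        added_cond_kwargs=added_cond_kwargs,
        return_dict=False,
    )
    noise_pred = unet(
        noisy_latents, timesteps,
        encoder_hidden_states=prompt_embeds,
        down_block_additional_residuals=down_res,
        mid_block_additional_residual=mid_res,
        added_cond_kwargs=added_cond_kwargs,
    ).sample

    # L2 (MSE) loss on the predicted noise; see the loss-function section below.
    return F.mse_loss(noise_pred, noise)
```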
The amazing thing is that the UNet has learned from scratch what doors and windows are, and that they should be kept. It has also learned the structure of rooms just as well as depth map conditioning does. But it is not as constrained: furniture can be generated anywhere, and materials and colors can change too. Some results follow.


The cool thing is that it works out of distribution too: the model was trained on interior image pairs only, but this garden can be transformed in one go as well!
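To make "works with any SDXL checkpoint" concrete, here is a minimal inference sketch with diffusers. The ControlNet model ID is a placeholder (the weights are not published under that name), and the conditioning scale and prompt are just example values.

```python
# Sketch only: dropping the trained ControlNet into a standard SDXL pipeline.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "redecorly/empty-room-controlnet",  # placeholder ID, not a published model
    torch_dtype=torch.float16,
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # any SDXL checkpoint works here
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

empty_room = load_image("empty_room.jpg")  # plain RGB photo, no depth or masks
image = pipe(
    prompt="scandinavian living room, oak floor, linen sofa, warm afternoon light",
    image=empty_room,
    controlnet_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]
image.save("staged_room.png")
```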

Loss function
I experimented with L1 loss too. It seems that L2 loss gives more creative solutions (it changes floor and wall colors and materials), while L1 loss keeps the original materials and basically overlays them with furniture.
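Concretely, the only difference between the two experiments is the reconstruction loss on the predicted noise, as in this fragment (variable names follow the training sketch above):

```python
import torch.nn.functional as F

# L2 (MSE): more creative results, may restyle floors and walls.
loss_l2 = F.mse_loss(noise_pred, noise)
# L1: more conservative, tends to keep the original materials.
loss_l1 = F.l1_loss(noise_pred, noise)
```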
It would be cool to add a fourth channel with depth information. I have tried adding the excellent Apple ML Depth Pro as a fourth input channel to the ControlNet, which should make the measurements fully consistent. The outcome of this is to be determined...
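A rough sketch of what that four-channel variant could look like, reusing the diffusers setup from the training sketch above. The per-image depth normalisation and the way the depth map is stacked onto the RGB are my assumptions, and Depth Pro's own API is not shown here.

```python
# Hypothetical sketch: RGB empty-room photo plus a metric depth map as conditioning.
import torch
from diffusers import ControlNetModel

# unet as in the training sketch; 4 conditioning channels instead of 3.
controlnet4 = ControlNetModel.from_unet(unet, conditioning_channels=4)

def make_conditioning(empty_rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
    # empty_rgb: (B, 3, H, W) in [0, 1]; depth: (B, 1, H, W) metric depth (e.g. Depth Pro).
    # Normalise depth per image so it roughly matches the RGB value range.
    d_min = depth.amin(dim=(2, 3), keepdim=True)
    d_max = depth.amax(dim=(2, 3), keepdim=True)
    depth_norm = (depth - d_min) / (d_max - d_min + 1e-6)
    return torch.cat([empty_rgb, depth_norm], dim=1)  # (B, 4, H, W)
```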
Inference
FAL.ai allows custom ControlNets: inference is crazy fast, and I don't need to host my own models with long startup times.
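For illustration, a call through fal's Python client could look roughly like this. The endpoint ID and argument names are placeholders, since the exact endpoint and parameters for a custom ControlNet depend on how it is deployed on fal.

```python
# Sketch only: endpoint ID and argument names below are illustrative, not fal's real API.
import fal_client

result = fal_client.subscribe(
    "fal-ai/sdxl-controlnet",  # illustrative endpoint ID
    arguments={
        "prompt": "modern bedroom, walnut bed frame, soft morning light",
        "control_image_url": "https://example.com/empty_room.jpg",
        "controlnet_url": "https://example.com/custom_controlnet.safetensors",
        "conditioning_scale": 0.8,
    },
)
print(result)
```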
We have (for the time being) a free demo with limited options at https://redecorly.com/demo
Adapted from a demo I gave at AI Tinkerers - Paris Meetup on January 30th