Leveraging VLM-Based Pipelines to Annotate 3D Objects

We improve pretrained captioning and classification of 3D objects via visually grounded aggregation of VLM responses.

Rishabh Kabra, Loic Matthey, Alexander Lerchner, Niloy J. Mitra

SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition

A video scene model which separates the time-invariant, object-level contents of the scene from global time-varying elements such as viewpoint.

Rishabh Kabra, Daniel Zoran, Goker Erdogan, Loic Matthey, Antonia Creswell, Matt Botvinick, Alexander Lerchner, Chris Burgess