LION: Latent Point Diffusion Models for 3D Shape Generation
Introduction
The art of generative modeling for 3D shapes has been an active research area and finds extensive applications in 3D content creation. For these models to be truly useful for digital artists, they must meet several criteria: high-quality and realistic shapes, interactive and flexible usage, and generating smooth surfaces or meshes, a common representation in graphics software.
Existing research in 3D shape generation has explored a variety of models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), flow-based models, and autoregressive models. Most recently, Denoising Diffusion Models (DDMs) have emerged as a promising tool for point cloud-based 3D shape generation. While existing DDMs have demonstrated significant potential, they often struggle to meet the multifaceted needs of digital artists. To truly serve the artist community, DDMs must ensure high generation quality and offer flexibility for manipulation, enabling applications such as conditional synthesis and shape interpolation. This is where the Latent Point Diffusion Model (LION) comes into play: it aims to overcome the drawbacks of existing DDMs while meeting the needs of 3D digital artists.
LION: Latent Point Diffusion Models for 3D Shape Generation
LION, like its 3D DDM predecessors, operates on point clouds but distinguishes itself by being constructed as a VAE with DDMs in latent space. It employs a hierarchical VAE with two levels of latent variables: a global shape latent and a point cloud-structured latent.
Training proceeds in two stages. The first stage maximizes a modified variational lower bound on the data log-likelihood: the global shape and point cloud latents are sampled from their respective approximate posterior distributions, which are parameterized as factorial Gaussians, while the decoder is parameterized as a factorial Laplace distribution.
The second stage is motivated by the fact that the VAE's simple Gaussian priors cannot accurately match the distribution of the encodings of the training data. Here, two latent DDMs are trained on these encodings by minimizing score-matching objectives.
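To make the second stage concrete, the score-matching objective can be written in the common epsilon-prediction form. The following is a minimal numpy sketch, not LION's actual implementation: the zero-predicting lambda is a stand-in for the real score network, and the noise schedule is a generic linear one.

```python
import numpy as np

rng = np.random.default_rng(0)

def ddm_loss(z0, eps_model, alpha_bar):
    # Simplified denoising score-matching objective (epsilon-prediction
    # form): perturb the clean latent z0 at a random timestep t and
    # regress the injected Gaussian noise.
    t = rng.integers(len(alpha_bar))
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps_model(z_t, t) - eps) ** 2)

# Toy 128-dim global shape latent; the "network" here always predicts zero.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
loss = ddm_loss(rng.standard_normal(128), lambda z, t: np.zeros_like(z), alpha_bar)
```

In LION, one such objective is minimized for the global shape latent DDM and one for the point cloud latent DDM, with the VAE encoders frozen after the first stage.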
Generation is accomplished hierarchically with the latent DDMs, which formally defines a generative model. The model architecture is based on Point-Voxel CNNs (PVCNNs), which combine efficient point-based processing with the strong spatial inductive bias of convolutions. The global shape latent DDM, in contrast, uses a ResNet structure with fully-connected layers. All conditionings on the global shape latent are implemented via adaptive Group Normalization in the PVCNN layers. The score models in both latent DDMs use a mixed score parameterization, predicting a residual correction to an analytic standard Gaussian score.
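The hierarchical generation described above can be sketched with a generic DDPM ancestral sampling loop. This is an illustrative numpy sketch under simplifying assumptions: the `dummy_eps` lambdas stand in for the trained ResNet and PVCNN score networks, and the decoder step is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def ancestral_sample(eps_model, shape, betas, cond=None):
    # Generic DDPM ancestral sampling: start from pure Gaussian noise
    # and iteratively apply the learned reverse (denoising) step.
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t, cond)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Stand-in score networks; the real ones are a fully-connected ResNet
# (global latent) and a PVCNN conditioned via adaptive Group Normalization.
dummy_eps = lambda x, t, cond: np.zeros_like(x)

betas = np.linspace(1e-4, 0.02, 50)
z_global = ancestral_sample(dummy_eps, (128,), betas)               # stage 1
h_points = ancestral_sample(dummy_eps, (2048, 3), betas, z_global)  # stage 2
# A trained decoder would then map (z_global, h_points) to the output cloud.
```

The key point is the ordering: the global shape latent is sampled first and then conditions the point cloud-structured latent's sampler.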
Applications of LION
Multimodal Generation
LION can synthesize different variations of a given shape, enabling multimodal generation in a controllable manner. This is achieved through a diffuse-denoise procedure: a shape's latent encoding is partially diffused to an intermediate step and then denoised again, which alters only certain details while preserving the overall shape.
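The diffuse-denoise idea can be sketched as follows. This is a minimal numpy illustration, not LION's implementation; the zero-predicting lambda stands in for a trained score network, and `tau` controls how much of the shape is resampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffuse_denoise(z0, eps_model, betas, tau):
    # Forward-diffuse the clean encoding z0 only up to step tau < T
    # (a small tau perturbs only fine details), then denoise back down
    # to step 0, producing a variation of the original shape.
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps = rng.standard_normal(z0.shape)
    z = np.sqrt(alpha_bar[tau]) * z0 + np.sqrt(1.0 - alpha_bar[tau]) * eps
    for t in reversed(range(tau + 1)):
        pred = eps_model(z, t)
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * pred) / np.sqrt(alphas[t])
        if t > 0:
            z = z + np.sqrt(betas[t]) * rng.standard_normal(z.shape)
    return z

betas = np.linspace(1e-4, 0.02, 100)
z0 = rng.standard_normal(128)
variation = diffuse_denoise(z0, lambda z, t: np.zeros_like(z), betas, tau=20)
```

Choosing `tau` close to 0 reproduces the input almost exactly; choosing it close to the full number of steps resamples the shape almost from scratch.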
Encoder Fine-tuning for Voxel-Conditioned Synthesis and Denoising
The encoder networks in LION can be fine-tuned for voxel-conditioned synthesis and denoising. This is particularly beneficial for an artist with a rough idea of a desired shape. The encoder can be fine-tuned to take voxelized shapes as input and map them to corresponding latent encodings that reconstruct the original non-voxelized point cloud. This method allows users to encode voxelized shapes and generate plausible detailed shapes. The encoder can also be fine-tuned on noisy shapes for multimodal shape denoising.
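As a small illustration of the kind of coarse input involved, a point cloud can be turned into a binary occupancy grid. The helper below is hypothetical (not part of LION's code) and assumes coordinates normalized to [-1, 1].

```python
import numpy as np

def voxelize(points, res=32):
    # Hypothetical helper: map a point cloud with coordinates in [-1, 1]
    # to a res^3 binary occupancy grid, the kind of coarse voxel input a
    # fine-tuned encoder could consume.
    idx = np.clip(((points + 1.0) / 2.0 * res).astype(int), 0, res - 1)
    grid = np.zeros((res, res, res), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

pts = np.array([[-1.0, -1.0, -1.0], [0.0, 0.0, 0.0], [0.99, 0.99, 0.99]])
grid = voxelize(pts)
```

The fine-tuned encoder then maps such grids (converted back to point sets) to latent encodings that decode to detailed, non-voxelized shapes.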
- Shape Interpolation: Different point clouds can be encoded into the Gaussian priors of LION's latent DDMs using the probability flow ODE. Spherical interpolation between these encodings then generates valid shapes along the entire interpolation path.
- Surface Reconstruction: LION can be combined with modern geometry reconstruction methods, specifically the Shape As Points (SAP) technique. SAP can be trained to extract smooth meshes from noisy point clouds and can be fine-tuned on training data generated by LION’s autoencoder to better adjust to the noise distribution in point clouds generated by LION.
- Fast Sampling with DDIM: LION's sampling time can be reduced by using a fast DDM sampler, such as the DDIM sampler, enabling real-time and interactive applications.
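The spherical interpolation used for the shape interpolation above is standard slerp; a minimal numpy sketch, assuming the probability-flow encodings lie approximately on a Gaussian prior (and hence near a hypersphere in high dimensions):

```python
import numpy as np

def slerp(z0, z1, w):
    # Spherical linear interpolation between two latent codes for a
    # weight w in [0, 1]; w=0 returns z0 and w=1 returns z1.
    u0, u1 = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - w) * z0 + w * z1  # nearly parallel: lerp is fine
    return (np.sin((1.0 - w) * omega) * z0 + np.sin(w * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
a, b = rng.standard_normal(64), rng.standard_normal(64)
midpoint = slerp(a, b, 0.5)
```

Each interpolated latent is then passed through the reverse diffusion and the decoder to obtain an intermediate shape.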
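The speed-up from DDIM comes from its deterministic update rule, which permits large strides between timesteps. A minimal sketch of a single eta = 0 DDIM step (generic DDIM, not LION-specific code; `ab_t` and `ab_prev` denote the cumulative alpha-bar values at the current and target timesteps):

```python
import numpy as np

def ddim_step(z_t, eps, ab_t, ab_prev):
    # One deterministic DDIM update (eta = 0): estimate the clean latent
    # from the noise prediction eps, then jump directly to the earlier
    # timestep with cumulative alpha ab_prev, allowing few total steps.
    z0_pred = (z_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * z0_pred + np.sqrt(1.0 - ab_prev) * eps

z0 = np.ones(4)
z_t = np.sqrt(0.5) * z0          # a noiseless latent scaled to step t
out = ddim_step(z_t, np.zeros(4), ab_t=0.5, ab_prev=0.9)
```

Because consecutive timesteps need not be adjacent, a sampler built from this step can cover the full schedule in a few dozen updates instead of hundreds.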
Key Strengths of LION
LION distinguishes itself from existing models through its expressivity, flexibility, and mesh reconstruction:
- Expressivity: LION's training first uses a VAE to regularize the latent encodings to approximately match standard Gaussian distributions, simplifying the modeling task for the DDMs; the additional decoder network further improves expressivity. Its point cloud-structured latents, which can be interpreted as smoothed versions of the original point clouds, combine the advantages of latent DDMs and 3D point clouds. The hierarchical setup with an additional global shape latent enhances expressivity even further, enabling a natural disentanglement between the overall shape and local details.
- Flexibility: LION's VAE framework allows its encoders to be fine-tuned for various tasks and enables easy shape interpolation. This level of flexibility is not matched by other 3D point cloud DDMs that operate directly on point clouds.
- Mesh Reconstruction: LION combines a point cloud-based VAE backbone, which is ideal for DDMs, with smooth geometry reconstruction methods that operate on the synthesized point clouds. This allows it to generate smooth surfaces that can easily be converted to meshes, the representation preferred by artists.
Experimental results
The experimental work involves training and evaluating the model. LION was trained on the ShapeNet dataset, which is widely used to benchmark 3D shape generative models, on categories such as airplane, chair, and car. Performance was evaluated with the 1-NNA (1-nearest-neighbor accuracy) metric, computed using both Chamfer distance (CD) and Earth Mover's Distance (EMD) as the underlying point cloud distances.
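For concreteness, the Chamfer distance between two point clouds can be sketched as follows. Conventions vary across papers (squared vs. unsquared distances, sum vs. mean); this minimal numpy version uses squared distances and means, and is an illustration rather than the exact evaluation code.

```python
import numpy as np

def chamfer_distance(a, b):
    # Squared-distance Chamfer: average nearest-neighbour distance from
    # a to b plus from b to a; 0 iff each cloud exactly covers the other.
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
cloud = rng.standard_normal((100, 3))
```

In the 1-NNA metric, such a distance is used to classify each sample's nearest neighbor as generated or real; an accuracy near 50% indicates that generated and real shapes are indistinguishable.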
In single-class 3D shape generation, LION outperformed all other recent and competitive baseline models, with state-of-the-art performance across all classes and dataset versions. Its output samples were diverse and visually pleasing.
The model also performed well on many-class unconditional 3D shape generation. It was trained on 13 diverse categories without any class conditioning, which poses a complex, multimodal data distribution challenge. Even in this challenging scenario, LION synthesized high-quality, diverse shapes, significantly outperforming all baseline models. Encouraged by these results, the experiment was expanded to 55 ShapeNet categories, again without any class conditioning, and LION could still generate high-quality, diverse shapes.
Additionally, LION was trained on smaller datasets, including the Mug and Bottle classes from ShapeNet, and on 553 animal assets from the TurboSquid data repository. In all these scenarios, LION was able to generate correct and high-quality shapes, indicating its robust performance even in low-data settings.
Conclusion
LION represents a significant stride in the field of 3D shape generation. It achieves state-of-the-art performance across various classes and dataset versions, outperforming other DDM-based models, and even when trained on small datasets it still generates high-quality samples. While sampling can be slow due to the many denoising steps required, fast samplers like DDIM can produce high-quality samples in less than a second.
Moreover, LION's ability to reconstruct meshes proves invaluable to digital artists, given the preference for meshed outputs in digital artwork. With the incorporation of modern geometry reconstruction methods such as the Shape As Points (SAP) technique, LION serves as an ideal tool for 3D content creation. Ultimately, the rise of models like LION underscores the continuous evolution of 3D shape generation models and their increasing potential in revolutionizing digital artistry.
Overall, the Latent Point Diffusion Model (LION) is a robust and efficient model that pushes the boundaries of 3D shape generation. With its ability to produce high-quality shapes, offer flexible and interactive use, and deliver smooth meshes, it is poised to become an indispensable tool for digital artists and the broader 3D content creation industry.
Reference:
[1]: Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., & Kreis, K. (2022). LION: Latent Point Diffusion Models for 3D Shape Generation. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). NVIDIA, University of Toronto, Vector Institute. Retrieved from https://nv-tlabs.github.io/LION/