Abstract
Understanding 3D structure from sparse visual observations is a foundational capability required for
robotic perception. In many real-world settings, objects are observed in arbitrary poses, under occlusion,
and from limited viewpoints. Inferring complete 3D shape and appearance from such inputs enables
downstream tasks like robotic manipulation, scene understanding, and 3D generation. This thesis explores
representation learning methods that operate over both explicit and implicit 3D representations to recover
structured geometry from minimal input, enabling category-level generalization across unseen instances
in challenging conditions.
To enable robust 3D inference in such scenarios, we first focus on the challenge of completing object
geometry from partial point cloud observations when objects appear in arbitrary orientations. Traditional
shape completion methods assume inputs are aligned to a canonical frame, which limits their applicability in robotics, where objects often appear in non-canonical poses. To address this, we propose SCARP, a
method for Shape Completion in ARbitrary Poses. SCARP disentangles shape and pose using a multi-task
formulation; it learns rotation-equivariant features for pose estimation and rotation-invariant features
for shape reasoning. The network jointly predicts a canonical full point cloud and its 6D pose. Unlike
multi-stage pipelines that depend on external canonicalization, SCARP is a single network that learns
to complete shapes directly in the observed pose. We demonstrate SCARP’s utility in robotic grasping
by showing that completing shapes before grasp prediction reduces invalid and colliding grasps by over
70%, and outperforms prior methods by 45% on completion metrics across several categories.
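The equivariant/invariant split at the heart of this formulation can be illustrated with a toy example (this is not SCARP's actual architecture, just a minimal numpy sketch of the underlying property): a descriptor built from pairwise distances is rotation-invariant and thus suitable for shape reasoning, while quantities such as the centroid rotate with the input and are thus equivariant, making them usable for pose estimation.

```python
import numpy as np

def random_rotation(rng):
    # QR decomposition of a random Gaussian matrix yields a rotation;
    # flip the sign if needed so the determinant is +1.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.linalg.det(q))

def invariant_feature(points):
    # Sorted pairwise distances are unchanged by any rotation of the cloud.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d[np.triu_indices(len(points), k=1)])

rng = np.random.default_rng(0)
cloud = rng.normal(size=(64, 3))           # a toy point cloud
R = random_rotation(rng)

f_orig = invariant_feature(cloud)
f_rot = invariant_feature(cloud @ R.T)     # same cloud in a rotated pose
assert np.allclose(f_orig, f_rot)          # invariant: same descriptor in any pose

# The centroid, by contrast, is equivariant: rotating the input rotates it.
assert np.allclose((cloud @ R.T).mean(axis=0), cloud.mean(axis=0) @ R.T)
```

In SCARP, learned features with these two properties are produced by the network itself, so a single forward pass can predict the canonical shape (from invariant features) and the 6D pose (from equivariant ones).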
While point clouds offer a compact and geometric view of 3D structure, they are inherently sparse and
do not capture appearance or high-frequency detail. To model richer, dense 3D structure directly from
images, we turn to implicit neural fields—specifically, Neural Radiance Fields (NeRFs). We introduce
HyP-NeRF, a framework for learning category-level priors over NeRFs. Our
key insight in HyP-NeRF is to use a hypernetwork not just to parameterize a neural network (the NeRF) but also to generate learnable input encodings, capturing higher-frequency details while reducing computational cost. Conditioned on an instance code, the hypernetwork predicts both the NeRF MLP weights and the parameters of a multi-resolution hash encoding (MRHE), enabling efficient
synthesis of NeRFs for unseen objects within a class. To overcome rendering artifacts and improve
fidelity, we propose a denoising and finetuning pipeline that leverages a learned 2D denoiser followed
by view-consistent NeRF optimization. This formulation supports diverse downstream tasks, including
single-view NeRF generation, text-to-NeRF synthesis via CLIP, and retrieval from real-world images.
Experiments on high-resolution (512×512) object datasets show that HyP-NeRF achieves state-of-the-art
performance in generalization, compression, and instance retrieval, while maintaining photorealistic
quality and fast inference.
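The hypernetwork formulation can be sketched as follows. This is a deliberately tiny numpy illustration, not HyP-NeRF's implementation: the instance code is mapped by a single linear hypernetwork to (a) the parameters of a small target MLP and (b) a per-instance feature table standing in for the MRHE; all sizes and names here are hypothetical.

```python
import numpy as np

CODE_DIM, HID, TABLE = 8, 16, 32                 # hypothetical sizes
N_W = 3 * HID + HID + HID + 1                    # target-MLP parameter count
rng = np.random.default_rng(0)
# Hypernetwork weights: one linear map from code to all target parameters.
H = rng.normal(scale=0.1, size=(CODE_DIM, N_W + TABLE * 2))

def hypernet(code):
    out = code @ H
    w = out[:N_W]
    table = out[N_W:].reshape(TABLE, 2)          # stand-in for the MRHE table
    W1 = w[:3 * HID].reshape(3, HID)             # input layer of target MLP
    b1 = w[3 * HID:4 * HID]
    W2 = w[4 * HID:5 * HID].reshape(HID, 1)      # output layer
    b2 = w[5 * HID]
    return (W1, b1, W2, b2), table

def target_mlp(params, x):
    # Tiny stand-in for a NeRF MLP: 3D point -> scalar (e.g. density).
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

params_a, _ = hypernet(rng.normal(size=CODE_DIM))
params_b, _ = hypernet(rng.normal(size=CODE_DIM))
x = rng.normal(size=(4, 3))
# Different instance codes yield different predicted parameters, hence
# different fields: one hypernetwork, many instance-specific NeRFs.
assert not np.allclose(target_mlp(params_a, x), target_mlp(params_b, x))
```

The design choice this illustrates is that only the hypernetwork is trained across the category; each instance's NeRF parameters and encodings are generated, not stored, which is what enables the compression and generalization results above.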
In summary, this thesis presents methods for learning robust and generalizable 3D structure from
sparse inputs. SCARP addresses the challenge of shape completion in arbitrary poses through rotation-aware
modeling of partial point clouds, while HyP-NeRF develops scalable priors over neural fields
for instance-conditioned NeRF generation. Together, these approaches offer complementary pathways
toward unified, learnable 3D perception in robotic environments. Moreover, the representation learning
ideas and methods presented in this work are designed to scale and can be integrated into future 3D
foundation models, which require strong inductive biases to compensate for the limited availability of
training data.