CV 3D Vision and Depth Estimation
The real world is three-dimensional, but a regular camera collapses it into a flat 2D image. 3D vision recovers that lost depth information — telling the computer not just what is in the scene, but how far away each part of it is. This knowledge powers autonomous vehicles, robots, augmented reality, and 3D scanning.
The Depth Perception Problem
When a camera captures an image, every point along a ray from the camera maps to the same pixel. A small nearby object and a large far object can produce identical pixels. The camera loses the depth dimension entirely.
Why a Single Camera Loses Depth
Camera (pinhole model):
Object A (small, close)
●
/
/
Camera ──╳──────────────────────────────→ Same pixel (x, y)
\
\
●
Object B (large, far)
Both objects project to the same pixel → cannot distinguish them
without additional information.
Solution: Use two cameras (stereo), or measure time-of-flight,
or infer depth from monocular cues.
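The ambiguity can be checked numerically. The sketch below assumes an idealized pinhole camera at the origin looking along +Z; the function name `project` and the sample coordinates are illustrative choices, not part of any library.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (X, Y, Z) in camera coordinates onto the image plane."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Perspective division: everything on one ray shares the same X/Z and Y/Z.
    return (focal_length * X / Z, focal_length * Y / Z)


near_small = (0.5, 0.25, 2.0)  # close to the camera
far_large = (2.0, 1.0, 8.0)    # four times farther along the same ray

print(project(near_small))  # (0.25, 0.125)
print(project(far_large))   # (0.25, 0.125)
```

Scaling a point by any positive factor leaves its projection unchanged, which is exactly the information a single image cannot recover.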
Stereo Vision: Two Eyes, One Depth Map
Stereo vision mimics how human eyes perceive depth. Two cameras placed side by side capture the same scene from slightly different positions. The horizontal shift of a point between the two images — called disparity — directly encodes depth. Objects closer to the cameras have higher disparity; distant objects have lower disparity.
Stereo Disparity Diagram
Left Camera (L) Right Camera (R)
[●]──────────────────────[●]
| |
|←── baseline B ────────→|
| |
| ● (Object) |
| /|\ |
| / | \ |
| / | \ |
↓ ↓ | ↓ ↓
Left xL | xR Right
image | image
|
Disparity d = xL − xR
Depth Z = (f × B) / d
Where:
f = focal length of the camera, expressed in pixels
B = baseline distance between cameras (Z comes out in the same unit as B)
d = disparity (pixel difference between left and right images)
Large d (far apart in images) → Object is CLOSE.
Small d (almost same position) → Object is FAR.
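As a quick worked example of Z = (f × B) / d, here is a minimal sketch. The focal length (800 px) and baseline (0.125 m) are made-up values chosen for easy arithmetic, and returning infinity for zero disparity is one possible convention, not the only one.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Return depth Z in meters from Z = (f * B) / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point infinitely far away.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


F_PX = 800.0        # assumed focal length in pixels
BASELINE_M = 0.125  # assumed baseline in meters

print(depth_from_disparity(50.0, F_PX, BASELINE_M))  # 2.0
print(depth_from_disparity(5.0, F_PX, BASELINE_M))   # 20.0
```

A tenfold drop in disparity gives a tenfold increase in depth, so a one-pixel matching error matters far more for distant objects than for nearby ones.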
Stereo Matching
For each pixel in the left image:
1. Search along the same horizontal line in the right image (the epipolar line, assuming the images are rectified).
2. Find the best-matching patch.
3. Measure the shift (disparity).
4. Compute depth from the disparity formula.
Challenging regions:
Textureless surfaces (plain wall) → no unique match.
Occluded areas → one camera sees the point, the other does not.
Reflective surfaces → appearance changes between views.
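The matching steps above can be sketched for a single scanline. This is a simplified illustration, assuming rectified grayscale rows stored as plain lists and a sum-of-absolute-differences (SAD) cost; the function name, window size, and sample rows are hypothetical, and production stereo matchers add smoothing and consistency checks that are not shown here.

```python
def match_scanline(left_row, right_row, x, half_window=2, max_disparity=16):
    """Return the disparity d minimizing SAD between the left patch centered
    at x and the right patch centered at x - d."""
    lo, hi = x - half_window, x + half_window + 1
    if lo < 0 or hi > len(left_row):
        raise ValueError("window falls outside the row")
    left_patch = left_row[lo:hi]
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        if lo - d < 0:
            break  # the shifted window would leave the right image
        right_patch = right_row[lo - d:hi - d]
        cost = sum(abs(a - b) for a, b in zip(left_patch, right_patch))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# A bright feature sits at index 6 in the right row and index 9 in the left row.
right_row = [10, 10, 10, 10, 50, 90, 200, 90, 50, 10, 10, 10, 10, 10, 10, 10]
left_row = [10, 10, 10] + right_row[:-3]

print(match_scanline(left_row, right_row, x=9))  # 3
```

On a constant row (a textureless wall) every candidate shift has the same cost, so the function simply returns the first one it tried. That is the "no unique match" failure listed above.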
Depth from LiDAR
LiDAR (Light Detection and Ranging) fires laser pulses and measures the time they take to bounce back. This Time-of-Flight measurement gives precise distance at each point, producing a 3D point cloud directly — no inference required.
LiDAR Point Cloud Concept
LiDAR sensor fires pulses in 360°:
Pulse sent at time t = 0.
Pulse returns at time t = 2 ns (2 nanoseconds).
Speed of light = 3×10⁸ m/s.
Distance = (speed × time) / 2
         = (3×10⁸ × 2×10⁻⁹) / 2
         = 0.3 m
→ This direction has an object at 0.3 meters.
Repeat for millions of directions:
→ A cloud of 3D (x, y, z) points.
Visualization:
· · · · · · · · · · ·   ← sky (no return)
· · · ●●●●●●● · · ·     ← building (returns at ~20m)
●●●●●●●●●●●●●●●●●●●     ← ground (returns at ~1.8m)
Each ● = one LiDAR point with x, y, z coordinates.
Used in: Self-driving cars, 3D mapping, robotics.
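The time-of-flight arithmetic, plus the step from a measured range to an (x, y, z) point, might look like the sketch below. The axis convention (x forward, y left, z up, with azimuth and elevation in degrees) is an assumption made for illustration; real sensors document their own conventions.

```python
import math

SPEED_OF_LIGHT = 3.0e8  # m/s, rounded as in the worked example


def tof_distance(round_trip_s):
    """One-way distance in meters from a round-trip pulse time in seconds."""
    return SPEED_OF_LIGHT * round_trip_s / 2.0


def to_xyz(distance_m, azimuth_deg, elevation_deg):
    """Convert a range and beam direction to a 3D point (assumed axis convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return (x, y, z)


distance = tof_distance(2e-9)
print(f"{distance:.3f} m")  # 0.300 m
print(to_xyz(distance, azimuth_deg=0.0, elevation_deg=0.0))
```

Repeating `to_xyz` for every beam direction in a sweep yields the point cloud described above.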
Monocular Depth Estimation
Monocular depth estimation predicts a full depth map from a single RGB image, with no stereo camera or LiDAR needed. It relies on visual cues humans also use: objects lower in the frame are typically closer, objects of a familiar size that appear smaller are farther away, haze and lower contrast indicate distance, and perspective makes parallel lines converge.
Monocular Depth Cues
IMAGE CUE                       DEPTH INTERPRETATION
─────────────────────────────────────────────────────
Object appears smaller        → Object is farther away
Object lower in the frame     → Object is closer (on the ground)
Overlaps another object       → In front of the overlapped object
Hazier / lower contrast       → Object is far (atmospheric haze)
Parallel lines converge       → Those lines extend into the distance
Finer, denser texture         → Farther surface
─────────────────────────────────────────────────────
A road photo example:
Near road texture: large, coarse detail.
Far road texture: tiny, fine detail.
Texture gets finer → depth increases.
Deep Learning Monocular Depth (MiDaS / Depth Anything)
Architecture: Encoder-Decoder (similar to U-Net)
[RGB Image (H × W × 3)]
↓
[Encoder: Pre-trained ViT or ResNet]
Extracts multi-scale features.
↓
[Decoder: Progressive upsampling]
Combines features from multiple scales.
↓
[Depth Map (H × W × 1)]
Each pixel = predicted depth value.
Brighter = closer, darker = farther (or vice versa).
Training:
Supervised: Paired (RGB, LiDAR depth) data.
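Self-supervised: Photometric consistency between neighboring video frames (or stereo pairs), so no depth labels are needed.
Accuracy: Less precise than LiDAR, and models such as MiDaS predict relative rather than metric depth, but they run on ordinary images from most cameras and scenes.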
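To make the "brighter = closer (or vice versa)" convention concrete, the sketch below turns a small depth map into 8-bit gray levels. It assumes the map is a list of rows of depth values; the function name and the `closer_is_brighter` flag are hypothetical, and the min/max normalization is just one common visualization choice.

```python
def depth_to_gray(depth_map, closer_is_brighter=True):
    """Map depth values to 0..255 gray levels using min/max normalization."""
    flat = [d for row in depth_map for d in row]
    d_min, d_max = min(flat), max(flat)
    span = d_max - d_min
    gray = []
    for row in depth_map:
        out = []
        for d in row:
            # t is 0.0 at the nearest pixel and 1.0 at the farthest.
            t = 0.0 if span == 0 else (d - d_min) / span
            if closer_is_brighter:
                t = 1.0 - t
            out.append(round(255 * t))
        gray.append(out)
    return gray


depth = [[1.0, 2.0],
         [3.0, 5.0]]  # illustrative depth values
print(depth_to_gray(depth))
print(depth_to_gray(depth, closer_is_brighter=False))
```

Because the normalization uses each image's own minimum and maximum, gray levels from two different images are not directly comparable.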
Structure from Motion (SfM)
Structure from Motion reconstructs a 3D model of a scene from a collection of 2D photos taken from different viewpoints. It computes both the 3D structure (point cloud) and the camera positions simultaneously — essentially reverse-engineering where each photo was taken.
SfM Pipeline
INPUTS: 50 photos of a building from different angles
Step 1: Feature detection + matching
  SIFT keypoints detected in all photos.
  Matching features across photos identified.
Step 2: Estimate relative camera positions
  From matched features, compute rotation and translation
  between each pair of cameras (essential matrix).
Step 3: Triangulation
  For each matched point in two cameras, compute its 3D position
  using known camera geometry.
Step 4: Bundle Adjustment
  Simultaneously optimize all camera positions AND all 3D points
  to minimize reprojection error (how far the 3D points project
  away from the matched 2D points).
Output:
→ Sparse 3D point cloud of the building
→ Camera positions for all 50 photos
Used in: Google Maps 3D view, Photogrammetry, VFX, archaeology.
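The reprojection error that bundle adjustment minimizes can be computed directly. The sketch below is a simplification: it assumes a single camera at the origin with no rotation, so the 3D points are already in camera coordinates, and the focal length, principal point, and sample points are invented for illustration. It only evaluates the error and does not perform the optimization.

```python
import math


def project_to_pixels(point, focal_length_px, principal_point):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    X, Y, Z = point
    cx, cy = principal_point
    return (focal_length_px * X / Z + cx, focal_length_px * Y / Z + cy)


def mean_reprojection_error(points_3d, observations_px, focal_length_px, principal_point):
    """Average pixel distance between projected 3D points and their 2D matches."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points_3d, observations_px):
        u, v = project_to_pixels(point, focal_length_px, principal_point)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(points_3d)


points_3d = [(0.0, 0.0, 4.0), (1.0, 0.0, 4.0)]   # current 3D estimates
observations = [(320.0, 240.0), (523.0, 244.0)]  # matched 2D keypoints
print(mean_reprojection_error(points_3d, observations, 800.0, (320.0, 240.0)))  # 2.5
```

The first point reprojects exactly onto its observation, while the second lands 5 pixels away, giving a mean of 2.5 pixels. An optimizer would adjust the camera and point estimates to drive this number down.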
Triangulation Geometry
Camera 1 (C1) Camera 2 (C2)
[●] [●]
| \ / |
| \ / |
| \ 3D Point P / |
| ↘ ↙ |
| ● ←───● |
ray from C1 and C2 intersect at P
P's 3D coordinates solved from the two ray equations.
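With noisy measurements the two rays rarely intersect exactly, so one simple approach is the midpoint method: find the closest point on each ray and average them. The sketch below uses that method as an illustrative choice (SfM pipelines often use linear or reprojection-based triangulation instead); the camera centers and ray directions are made-up values.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the midpoint of the shortest segment between two rays.

    Each ray is given by a camera center c and a direction d (3-tuples).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w0 = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be triangulated")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


# Two cameras one unit apart, both looking at the point (0.5, 0, 2).
point = triangulate_midpoint(
    (0.0, 0.0, 0.0), (0.5, 0.0, 2.0),
    (1.0, 0.0, 0.0), (-0.5, 0.0, 2.0),
)
print(point)  # (0.5, 0.0, 2.0)
```

As the rays become closer to parallel (a short baseline or a very distant point), `denom` shrinks and the estimate becomes unstable, which mirrors the small-disparity case in stereo.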
SLAM: Simultaneous Localization and Mapping
SLAM solves two problems at once: building a map of an unknown environment while simultaneously tracking where the agent (robot or camera) is within that map. A robot uses SLAM to navigate a warehouse it has never seen before.
SLAM Loop
ROBOT STARTS at position (0, 0, 0) — map is empty.
At each time step:
1. Observe: Camera captures image + detects features.
2. Match: Match features to the existing map.
3. Localize: Estimate robot position from matched features.
4. Map update: Add newly seen features to the map.
5. Loop closure: Recognize a previously visited place →
correct accumulated position drift.
Visual map after 60 seconds:
┌──────────────────────────────────┐
│ ● ● │ ← 3D feature points (map)
│ ● ● ● ● │
│ ● ● ● │
│ ROBOT ──→ │ ← estimated robot path
│ start → → current position │
└──────────────────────────────────┘
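Steps 2 to 4 of the loop can be illustrated with a deliberately tiny 2D toy. It makes strong assumptions that real SLAM systems do not: the robot only translates (no rotation), landmarks carry known IDs so matching is trivial, measurements are noise-free offsets from the robot, and loop closure is not modeled. All names here are hypothetical.

```python
def slam_step(landmark_map, observations, predicted_pose):
    """Run one toy SLAM iteration and return the estimated (x, y) pose.

    landmark_map: {landmark_id: (x, y)} in world coordinates, updated in place.
    observations: {landmark_id: (dx, dy)} offsets measured from the robot.
    predicted_pose: fallback pose (for example from odometry) if nothing matches.
    """
    # Match: which observed landmarks are already in the map?
    matched = [lid for lid in observations if lid in landmark_map]

    # Localize: each match implies a robot position; average them.
    if matched:
        xs = [landmark_map[lid][0] - observations[lid][0] for lid in matched]
        ys = [landmark_map[lid][1] - observations[lid][1] for lid in matched]
        pose = (sum(xs) / len(xs), sum(ys) / len(ys))
    else:
        pose = predicted_pose

    # Map update: place newly seen landmarks using the estimated pose.
    for lid, (dx, dy) in observations.items():
        if lid not in landmark_map:
            landmark_map[lid] = (pose[0] + dx, pose[1] + dy)
    return pose


landmark_map = {}
pose = slam_step(landmark_map, {"a": (2.0, 1.0), "b": (4.0, -1.0)}, (0.0, 0.0))
# The robot actually moved 1 unit in x, but odometry wrongly predicts 1.5.
pose = slam_step(landmark_map, {"b": (3.0, -1.0), "c": (5.0, 2.0)}, (1.5, 0.0))
print(pose)               # (1.0, 0.0)
print(landmark_map["c"])  # (6.0, 2.0)
```

In the second step the re-observed landmark "b" overrides the drifting odometry prediction, and the new landmark "c" is placed relative to the corrected pose. Loop closure extends the same idea to places revisited after a long path.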
Key Takeaways
- A single camera loses depth — 3D vision methods recover it using two cameras, laser pulses, or learned inference.
- Stereo vision computes depth from disparity — the horizontal pixel shift between left and right camera images.
- LiDAR measures distance directly using laser time-of-flight and produces precise 3D point clouds.
- Monocular depth estimation uses a trained neural network (CNN or ViT based) to predict depth from a single image using visual cues.
- Structure from Motion reconstructs 3D scenes and camera positions from multiple 2D photos.
- SLAM builds a map and tracks position simultaneously — essential for autonomous robots.
