LeCun's bet on non-generative intelligence —
and why it matters for what our cameras can understand.
SVA MFA Interaction Design · Spring 2026
A generative model predicting the next video frame has to reconstruct every blade of grass, every water ripple, every carpet texture — details that are computationally expensive and fundamentally unpredictable.
A world model should instead represent a car approaching a fork in terms of position, velocity, and orientation — with a latent variable encoding whether it turns left or right.
Yann LeCun's 2022 paper "A Path Towards Autonomous Machine Intelligence" proposes six differentiable modules:
| Module | Role |
|---|---|
| Perception | Encode the world into representations |
| World Model | Predict future states, estimate missing info |
| Actor | System 1 (reactive) + System 2 (deliberative) |
| Cost Module | Intrinsic drives + learned critic |
| Short-Term Memory | Maintain state across time |
| Configurator | Executive control over all modules |
The world model sits at the center. V-JEPA is the first concrete implementation of this module.
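To make the data flow concrete, here is a minimal structural sketch of how the six modules could compose into a perception, prediction, and action loop. This is not Meta's code; every class, method name, and the System 1/System 2 switch below are assumptions made purely for illustration.

```python
# Hypothetical module interfaces; names and signatures are illustrative only.
class Agent:
    def __init__(self, perception, world_model, actor, cost, memory, configurator):
        self.perception = perception      # world -> representation
        self.world_model = world_model    # representation + action -> predicted state
        self.actor = actor                # proposes actions
        self.cost = cost                  # intrinsic drives + learned critic
        self.memory = memory              # short-term state across time
        self.configurator = configurator  # executive control over the other modules

    def step(self, observation, task):
        self.configurator.configure(task)             # set every module up for the task
        state = self.perception.encode(observation)   # encode the world
        self.memory.write(state)
        action = self.actor.react(state)              # System 1: fast reactive proposal
        predicted = self.world_model.predict(state, action)
        if self.cost.evaluate(predicted) > self.cost.threshold:
            # System 2: search for an action the world model predicts will lower the cost.
            action = self.actor.deliberate(state, self.world_model, self.cost)
        return action
```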
Instead of computing P(y|x), an energy function F(x, y) assigns a scalar compatibility score to every pair: low energy to pairs that fit together, high energy to pairs that don't.
The system doesn't predict y from x — it evaluates whether a proposed y is compatible with x.
For multiple valid futures, a latent variable z parameterizes the possibilities: the energy of a pair is the best any latent can achieve, F(x, y) = min_z E(x, y, z).
The collapse problem
If the encoders collapse to a constant output, the energy surface goes flat — it assigns low energy everywhere, and the model learns nothing.
| Approach | What It Predicts | Limitation |
|---|---|---|
| Generative (MAE, diffusion, autoregressive) | Raw pixels or tokens | Wastes capacity on irrelevant low-level details |
| Contrastive (SimCLR, CLIP, MoCo) | Aligned embeddings via negative pairs | Negative samples grow exponentially with dimension |
| Joint embedding (CLIP-style) | Matched cross-modal representations | Not predictive — no temporal dynamics |
| JEPA | Abstract representations of masked regions | New — still proving scalability |
The energy is prediction error: E(x, y, z) = D(s_y, Pred(s_x, z)), where s_x and s_y are the encoder's representations of the context and the target, and Pred maps s_x (plus the latent z) to a guess at s_y.
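A compact sketch of that energy in code, assuming `encoder` and `predictor` are stand-ins for the trained networks (V-JEPA itself uses separate context and EMA target encoders, and its "latent" is the set of positional mask tokens rather than an explicit z):

```python
import torch
import torch.nn.functional as F

def energy(encoder, predictor, x, y, z):
    """E(x, y, z): how badly the prediction of y's representation misses, given x and z."""
    s_x = encoder(x)                # representation of the context (visible video)
    s_y = encoder(y)                # representation of the target (masked region)
    s_y_hat = predictor(s_x, z)     # predict in representation space, not pixel space
    return F.l1_loss(s_y_hat, s_y)  # low energy = compatible (x, y) pair

# With several plausible futures, the energy of a pair is the best any latent can do:
# F(x, y) = min over z of E(x, y, z).
```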
Four training criteria (from LeCun's proposal): make s_x maximally informative about x; make s_y maximally informative about y; make s_y easily predictable from s_x; and minimize the information content of the latent z.
Tokenization
Video is split into tubelets of 2×16×16 (2 frames × 16px × 16px). These are the "words" the Vision Transformer reads.
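A sketch of what tubelet tokenization looks like in PyTorch. The embedding dimension and the single `Conv3d` "patch embed" layer are assumptions, but the 2×16×16 tubelet size matches the text:

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 256, 256)            # (batch, RGB, frames, height, width)
patch_embed = nn.Conv3d(
    in_channels=3, out_channels=1024,               # 1024-dim tokens (assumed width)
    kernel_size=(2, 16, 16), stride=(2, 16, 16))    # one step per 2x16x16 tubelet
tokens = patch_embed(video)                         # (1, 1024, 8, 16, 16)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, 2048, 1024): 2048 "words" for the ViT
```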
The masking strategy
Not random patches — those are trivially solvable via interpolation in redundant video. Instead: large contiguous blocks repeated across all frames.
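A hedged sketch of that masking scheme: sample a few large rectangles, take their union, and repeat the same spatial mask across every frame so the masked region forms a tube. Block count and sizes here are illustrative, not the paper's exact hyperparameters.

```python
import torch

def tube_mask(t=8, h=16, w=16, n_blocks=4, block_frac=0.5):
    """Boolean mask over a (t, h, w) token grid (matches the 8x16x16 grid above)."""
    spatial = torch.zeros(h, w, dtype=torch.bool)
    for _ in range(n_blocks):
        bh, bw = int(h * block_frac), int(w * block_frac)
        top = torch.randint(0, h - bh + 1, (1,)).item()
        left = torch.randint(0, w - bw + 1, (1,)).item()
        spatial[top:top + bh, left:left + bw] = True   # mask one contiguous block
    return spatial.unsqueeze(0).expand(t, h, w)        # same spatial mask in every frame

mask = tube_mask()
print(mask.float().mean())   # fraction of tokens masked (~0.9 in the actual recipe)
```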
Why 90%? Video is highly redundant, so a low masking ratio lets the model solve the task by interpolating from neighboring frames. Masking roughly 90% of the tubelets keeps the prediction problem hard, and keeps training cheap, since the encoder only processes the visible tokens.
The predictor
A narrow 12-layer transformer (~22M params). Receives visible token embeddings + learnable mask tokens with positional embeddings. Outputs predicted representations for masked positions.
Loss: L1 regression (robust to outliers; its minimizer is the conditional median rather than the mean).
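Putting the pieces together, a hedged sketch of one training step. It assumes `context_encoder`, `target_encoder` (an EMA copy that receives no gradients), `predictor`, `mask_token` (shape 1×1×dim), and `pos_embed` (shape 1×N×dim) already exist; names and shapes are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def training_step(video_tokens, visible_idx, masked_idx,
                  context_encoder, target_encoder, predictor, mask_token, pos_embed):
    # Encode only the visible ~10% of tubelet tokens.
    s_x = context_encoder(video_tokens[:, visible_idx] + pos_embed[:, visible_idx])

    # Targets: representations of the full clip from the frozen EMA encoder.
    with torch.no_grad():
        s_y = target_encoder(video_tokens + pos_embed)[:, masked_idx]

    # The predictor sees visible embeddings plus mask tokens carrying position info.
    queries = mask_token.expand(s_x.shape[0], masked_idx.numel(), -1) + pos_embed[:, masked_idx]
    s_y_hat = predictor(torch.cat([s_x, queries], dim=1))[:, -masked_idx.numel():]

    return F.l1_loss(s_y_hat, s_y)   # L1 regression in representation space
```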
No encoder fine-tuning — just attentive probing on top of frozen features:
Surpasses prior video self-supervised methods by 4–10 points.
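Attentive probing in sketch form: the backbone stays frozen, and only a learned query vector, one cross-attention layer, and a linear head are trained. Dimensions and the class count (174 for Something-Something v2) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim=1024, n_heads=16, n_classes=174):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # one learned query
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, frozen_tokens):                  # (B, N, dim) from the frozen encoder
        q = self.query.expand(frozen_tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, frozen_tokens, frozen_tokens)  # query attends over tokens
        return self.head(pooled.squeeze(1))            # logits, (B, n_classes)
```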
Four scaling axes took V-JEPA to V-JEPA 2:

1. Data (11×): VideoMix2M → VideoMix22M, ~22M samples, over 1 million hours of video
2. Model (2×): ViT-H (630M params) → ViT-g (~1B params) with 3D Rotary Position Embeddings (3D-RoPE); the full model is 1.2 billion parameters
3. Training duration (2.8×): 90K → 252K iterations
4. Resolution (8.4× training speedup from a progressive schedule): 16 frames at 256×256 → 64 frames at 384×384 during cooldown
| Benchmark | Task | V-JEPA 2 | Context |
|---|---|---|---|
| Something-Something v2 | Motion classification | 77.3% | InternVideo: 69.7% |
| Epic-Kitchens-100 | Action anticipation | 39.7 R@5 | 44% improvement over prior SOTA |
| Diving-48 | Fine-grained motion | 90.2% | Frozen backbone |
| PerceptionTest | Video QA | 84.0 | SOTA at 8B scale |
| TempCompass | Temporal QA | 76.9 | SOTA at 8B scale |
| ImageNet-1K | Image classification | 84.6% | Competitive with DINOv2 |
V-JEPA was separately evaluated on intuitive physics — detecting when things violate physical laws:
V-JEPA understands: object permanence, continuity of motion, and shape/color constancy. Impossible events in these categories produce a clear spike in prediction error, the model's "surprise" signal.
Still struggles with: collisions, solidity, gravity — these likely require the hierarchical representations LeCun has proposed but not yet built.
Turn the passive video model into an active world model for robot control:
Three input streams per timestep: the frozen V-JEPA encoding of the current camera frame, the robot's end-effector pose, and the commanded action (a change in end-effector pose and gripper state).
Planning: Model-Predictive Control with the Cross-Entropy Method (CEM). Sample 800 action sequences, roll each out through the world model, and pick the one whose predicted final state minimizes L1 distance to the goal in representation space (sketched below).
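A minimal sketch of that planner, assuming `world_model(state, action)` returns the predicted next latent state. The action dimensionality, horizon, and elite count are illustrative, and V-JEPA 2-AC's actual implementation may differ.

```python
import torch

def cem_plan(world_model, s_current, s_goal, horizon=2, action_dim=7,
             n_samples=800, n_elites=50, n_iters=5):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        costs = []
        for a in actions:                       # roll out each candidate action sequence
            s = s_current
            for t in range(horizon):
                s = world_model(s, a[t])        # predict the next latent state
            costs.append((s - s_goal).abs().mean())   # L1 distance in representation space
        costs = torch.stack(costs)
        elites = actions[costs.topk(n_elites, largest=False).indices]
        mean, std = elites.mean(0), elites.std(0)     # refit the sampling distribution
    return mean[0]                              # execute only the first action (MPC)

# Toy usage with a made-up linear "world model", purely to show the call pattern:
# plan = cem_plan(lambda s, a: s + 0.1 * a.sum(), torch.zeros(8), torch.ones(8))
```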
Training data: about 62 hours of unlabeled robot interaction video from the DROID dataset, with no rewards, no task labels, and no expert annotations.
Planning speed: roughly 16 seconds per planned action, versus minutes per action for generative world models such as Cosmos that have to decode pixels at every rollout step.
Deployed on Franka arms in two labs never seen during training, with uncalibrated cameras and tasks specified only as goal images: zero-shot reaching, grasping, and pick-and-place.
Compare: most robot policies need data from the deployment environment, calibrated cameras, or task demonstrations; here the same frozen world model plans in a new lab from a single goal image.
Core lineage
| Model | Year | Domain |
|---|---|---|
| I-JEPA | 2023 | Images |
| V-JEPA 1 | 2024 | Video |
| VL-JEPA | 2024 | Vision-language |
| V-JEPA 2 | 2025 | Video (scaled) |
| LeJEPA | 2025 | Theory (math foundations) |
| V-JEPA 2.1 | Mar 2026 | Dense self-supervision |
Community extensions
V-JEPA 2.1 adds four new ingredients: dense predictive loss (all tokens contribute to the loss, not just masked ones), deep self-supervision (the loss is also applied at intermediate layers), multi-modal tokenizers, and continued scaling.
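The first two ingredients can be sketched as a loss function. This assumes per-layer lists of predicted and target token representations, and is an illustration of the idea rather than the actual recipe.

```python
import torch
import torch.nn.functional as F

def dense_deep_loss(predictions, targets, layer_weights=None):
    """predictions/targets: lists of (B, N, D) tensors, one per supervised layer."""
    layer_weights = layer_weights or [1.0] * len(predictions)
    loss = 0.0
    for w, pred, tgt in zip(layer_weights, predictions, targets):
        # "Dense": every token contributes to the loss, not only the masked ones.
        # "Deep": the same regression loss is applied at intermediate layers too.
        loss = loss + w * F.l1_loss(pred, tgt.detach())
    return loss / sum(layer_weights)
```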
| Task | Score | What This Actually Measures |
|---|---|---|
| Robotic grasping | +20 pts over V-JEPA 2 | Can a robot pick up objects it's never seen, in a lab it's never been in? 65% → ~85% success. |
| Ego4D object interaction | 7.71 mAP | First-person video: predict what object you'll interact with next before you touch it. Measures anticipation, not recognition. |
| EPIC-KITCHENS | 40.8 R@5 | People cooking, filmed from head-mounted cameras. Predict the next action — "they're reaching for the drawer, they'll grab a knife." Tests temporal intent reasoning. |
| Something-Something v2 | 77.7% | Short clips of humans doing things with objects — "pushing something left," "covering something." The model must understand motion dynamics, not just recognize objects. |
| Depth estimation | 0.307 RMSE (lower is better) | NYUv2 indoor scenes: estimate how far away every pixel is from a single RGB camera. No stereo needed — the model learned 3D structure from video alone. |
| Robot navigation | 5.687 ATE (Absolute Trajectory Error; lower is better) | TartanDrive: off-road autonomous driving. Predict where you are after a sequence of movements. Tests spatial understanding at room/outdoor scale. |
The model sees only 3–4 seconds of video at a time (16 frames); events requiring longer causal reasoning are out of reach.
Autoregressive predictions accumulate error. Robot demos limited to ≤16 seconds. Multi-step tasks need manually specified visual sub-goals.
Action coordinates are learned implicitly rather than given in a calibrated frame, so significant camera angle changes break the system.
The big unsolved problems: long-horizon prediction without compounding error, hierarchical representations that can plan across multiple timescales, and robustness to viewpoint change.
| Mode | Scene | Response |
|---|---|---|
| Solo | One person, free movement | → Party mode |
| Duo / Group | Clustered, conversational | → Café / Networking |
| Presentation | One standing, others seated | → Full silence |
YOLO counts bodies. V-JEPA understands the spatial dynamics — the difference between five people collaborating and five people listening to a lecture.
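For the camera project, the natural recipe mirrors the frozen-backbone probing above: keep a pretrained V-JEPA encoder frozen and train only a small head on a few seconds of footage per labeled clip. Everything below, including the `frozen_encoder` stand-in, the dimensions, and the three labels, is an assumption for illustration.

```python
import torch
import torch.nn as nn

MODES = ["solo", "duo_group", "presentation"]

class RoomModeHead(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.head = nn.Linear(dim, len(MODES))

    def forward(self, frozen_tokens):            # (B, N, dim): spatiotemporal tokens
        clip_vector = frozen_tokens.mean(dim=1)  # pool over space and time
        return self.head(clip_vector)            # logits over room modes

# Usage sketch (encoder kept frozen; only the head is trained on labeled clips):
# logits = RoomModeHead()(frozen_encoder(clip).detach())
# mode = MODES[logits.argmax(-1)]
```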
[Image: classroom-mode]
V-JEPA's key insight: by placing learned encoders between raw perception and the prediction objective, the system gains the freedom to represent only what matters.
That's also the design challenge — decide what matters, and build a camera that understands it.