LeCun's bet on non-generative intelligence —
and why it matters for what our cameras can understand.
SVA MFA Interaction Design · Spring 2026
A generative model predicting the next video frame has to reconstruct every blade of grass, every water ripple, every carpet texture — details that are computationally expensive and fundamentally unpredictable.
A world model should instead represent a car approaching a fork in terms of position, velocity, and orientation — with a latent variable encoding whether it turns left or right.
Yann LeCun's 2022 paper "A Path Towards Autonomous Machine Intelligence" proposes six differentiable modules:
| Module | Role |
|---|---|
| Perception | Encode the world into representations |
| World Model | Predict future states, estimate missing info |
| Actor | System 1 (reactive) + System 2 (deliberative) |
| Cost Module | Intrinsic drives + learned critic |
| Short-Term Memory | Maintain state across time |
| Configurator | Executive control over all modules |
The world model sits at the center. V-JEPA is the first concrete implementation of this module.
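To make the data flow concrete, here is a minimal structural sketch of how the six modules could compose into a perception, prediction, and action loop. This is not Meta's code; every class, method name, and the System 1/System 2 switch below are assumptions made purely for illustration.

```python
# Hypothetical module interfaces; names and signatures are illustrative only.
class Agent:
    def __init__(self, perception, world_model, actor, cost, memory, configurator):
        self.perception = perception      # world -> representation
        self.world_model = world_model    # representation + action -> predicted state
        self.actor = actor                # proposes actions
        self.cost = cost                  # intrinsic drives + learned critic
        self.memory = memory              # short-term state across time
        self.configurator = configurator  # executive control over the other modules

    def step(self, observation, task):
        self.configurator.configure(task)             # set every module up for the task
        state = self.perception.encode(observation)   # encode the world
        self.memory.write(state)
        action = self.actor.react(state)              # System 1: fast reactive proposal
        predicted = self.world_model.predict(state, action)
        if self.cost.evaluate(predicted) > self.cost.threshold:
            # System 2: search for an action the world model predicts will lower the cost.
            action = self.actor.deliberate(state, self.world_model, self.cost)
        return action
```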
Instead of computing P(y|x), an energy function F(x, y) assigns a scalar compatibility score to every pair: low energy to pairs that fit together, high energy to pairs that don't.
The system doesn't predict y from x — it evaluates whether a proposed y is compatible with x.
For multiple valid futures, a latent variable z parameterizes the possibilities: the energy of a pair is the best any latent can achieve, F(x, y) = min_z E(x, y, z).
The collapse problem
If the encoders collapse to a constant output, the energy surface goes flat — it assigns low energy everywhere, and the model learns nothing.
| Approach | What It Predicts | Limitation |
|---|---|---|
| Generative (MAE, diffusion, autoregressive) | Raw pixels or tokens | Wastes capacity on irrelevant low-level details |
| Contrastive (SimCLR, CLIP, MoCo) | Aligned embeddings via negative pairs | Negative samples grow exponentially with dimension |
| Joint embedding (CLIP-style) | Matched cross-modal representations | Not predictive — no temporal dynamics |
| JEPA | Abstract representations of masked regions | New — still proving scalability |
The energy is prediction error: E(x, y, z) = D(s_y, Pred(s_x, z)), where s_x and s_y are the encoder's representations of the context and the target, and Pred maps s_x (plus the latent z) to a guess at s_y.
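A compact sketch of that energy in code, assuming `encoder` and `predictor` are stand-ins for the trained networks (V-JEPA itself uses separate context and EMA target encoders, and its "latent" is the set of positional mask tokens rather than an explicit z):

```python
import torch
import torch.nn.functional as F

def energy(encoder, predictor, x, y, z):
    """E(x, y, z): how badly the prediction of y's representation misses, given x and z."""
    s_x = encoder(x)                # representation of the context (visible video)
    s_y = encoder(y)                # representation of the target (masked region)
    s_y_hat = predictor(s_x, z)     # predict in representation space, not pixel space
    return F.l1_loss(s_y_hat, s_y)  # low energy = compatible (x, y) pair

# With several plausible futures, the energy of a pair is the best any latent can do:
# F(x, y) = min over z of E(x, y, z).
```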
Four training criteria (from LeCun's proposal): make s_x maximally informative about x; make s_y maximally informative about y; make s_y easily predictable from s_x; and minimize the information content of the latent z.
Tokenization
Video is split into tubelets of 2×16×16 (2 frames × 16px × 16px). These are the "words" the Vision Transformer reads.
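A sketch of what tubelet tokenization looks like in PyTorch. The embedding dimension and the single `Conv3d` "patch embed" layer are assumptions, but the 2×16×16 tubelet size matches the text:

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 16, 256, 256)            # (batch, RGB, frames, height, width)
patch_embed = nn.Conv3d(
    in_channels=3, out_channels=1024,               # 1024-dim tokens (assumed width)
    kernel_size=(2, 16, 16), stride=(2, 16, 16))    # one step per 2x16x16 tubelet
tokens = patch_embed(video)                         # (1, 1024, 8, 16, 16)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, 2048, 1024): 2048 "words" for the ViT
```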
The masking strategy
Not random patches — those are trivially solvable via interpolation in redundant video. Instead: large contiguous blocks repeated across all frames.
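A hedged sketch of that masking scheme: sample a few large rectangles, take their union, and repeat the same spatial mask across every frame so the masked region forms a tube. Block count and sizes here are illustrative, not the paper's exact hyperparameters.

```python
import torch

def tube_mask(t=8, h=16, w=16, n_blocks=4, block_frac=0.5):
    """Boolean mask over a (t, h, w) token grid (matches the 8x16x16 grid above)."""
    spatial = torch.zeros(h, w, dtype=torch.bool)
    for _ in range(n_blocks):
        bh, bw = int(h * block_frac), int(w * block_frac)
        top = torch.randint(0, h - bh + 1, (1,)).item()
        left = torch.randint(0, w - bw + 1, (1,)).item()
        spatial[top:top + bh, left:left + bw] = True   # mask one contiguous block
    return spatial.unsqueeze(0).expand(t, h, w)        # same spatial mask in every frame

mask = tube_mask()
print(mask.float().mean())   # fraction of tokens masked (~0.9 in the actual recipe)
```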
Why 90%? Video is highly redundant, so a low masking ratio lets the model solve the task by interpolating from neighboring frames. Masking roughly 90% of the tubelets keeps the prediction problem hard, and keeps training cheap, since the encoder only processes the visible tokens.
The predictor
A narrow 12-layer transformer (~22M params). Receives visible token embeddings + learnable mask tokens with positional embeddings. Outputs predicted representations for masked positions.
Loss: L1 regression (robust to outliers; its minimizer is the conditional median rather than the mean).
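Putting the pieces together, a hedged sketch of one training step. It assumes `context_encoder`, `target_encoder` (an EMA copy that receives no gradients), `predictor`, `mask_token` (shape 1×1×dim), and `pos_embed` (shape 1×N×dim) already exist; names and shapes are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def training_step(video_tokens, visible_idx, masked_idx,
                  context_encoder, target_encoder, predictor, mask_token, pos_embed):
    # Encode only the visible ~10% of tubelet tokens.
    s_x = context_encoder(video_tokens[:, visible_idx] + pos_embed[:, visible_idx])

    # Targets: representations of the full clip from the frozen EMA encoder.
    with torch.no_grad():
        s_y = target_encoder(video_tokens + pos_embed)[:, masked_idx]

    # The predictor sees visible embeddings plus mask tokens carrying position info.
    queries = mask_token.expand(s_x.shape[0], masked_idx.numel(), -1) + pos_embed[:, masked_idx]
    s_y_hat = predictor(torch.cat([s_x, queries], dim=1))[:, -masked_idx.numel():]

    return F.l1_loss(s_y_hat, s_y)   # L1 regression in representation space
```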
No encoder fine-tuning — just attentive probing on top of frozen features:
Surpasses prior video self-supervised methods by 4–10 points.
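Attentive probing in sketch form: the backbone stays frozen, and only a learned query vector, one cross-attention layer, and a linear head are trained. Dimensions and the class count (174 for Something-Something v2) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim=1024, n_heads=16, n_classes=174):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # one learned query
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, frozen_tokens):                  # (B, N, dim) from the frozen encoder
        q = self.query.expand(frozen_tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, frozen_tokens, frozen_tokens)  # query attends over tokens
        return self.head(pooled.squeeze(1))            # logits, (B, n_classes)
```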
Four scaling axes took V-JEPA to V-JEPA 2:

1. Data (11×): VideoMix2M → VideoMix22M, ~22M samples, over 1 million hours of video
2. Model (2×): ViT-H (630M params) → ViT-g (~1B params) with 3D Rotary Position Embeddings (3D-RoPE); the full model is 1.2 billion parameters
3. Training duration (2.8×): 90K → 252K iterations
4. Resolution (8.4× training speedup from a progressive schedule): 16 frames at 256×256 → 64 frames at 384×384 during cooldown
| Benchmark | Task | V-JEPA 2 | Context |
|---|---|---|---|
| Something-Something v2 | Motion classification | 77.3% | InternVideo: 69.7% |
| Epic-Kitchens-100 | Action anticipation | 39.7 R@5 | 44% improvement over prior SOTA |
| Diving-48 | Fine-grained motion | 90.2% | Frozen backbone |
| PerceptionTest | Video QA | 84.0 | SOTA at 8B scale |
| TempCompass | Temporal QA | 76.9 | SOTA at 8B scale |
| ImageNet-1K | Image classification | 84.6% | Competitive with DINOv2 |
V-JEPA was separately evaluated on intuitive physics — detecting when things violate physical laws:
V-JEPA understands: object permanence, continuity of motion, and shape/color constancy. Impossible events in these categories produce a clear spike in prediction error, the model's "surprise" signal.
Still struggles with: collisions, solidity, gravity — these likely require the hierarchical representations LeCun has proposed but not yet built.
Turn the passive video model into an active world model for robot control:
Three input streams per timestep: the frozen V-JEPA encoding of the current camera frame, the robot's end-effector pose, and the commanded action (a change in end-effector pose and gripper state).
Planning: Model-Predictive Control with the Cross-Entropy Method (CEM). Sample 800 action sequences, roll each out through the world model, and pick the one whose predicted final state minimizes L1 distance to the goal in representation space (sketched below).
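A minimal sketch of that planner, assuming `world_model(state, action)` returns the predicted next latent state. The action dimensionality, horizon, and elite count are illustrative, and V-JEPA 2-AC's actual implementation may differ.

```python
import torch

def cem_plan(world_model, s_current, s_goal, horizon=2, action_dim=7,
             n_samples=800, n_elites=50, n_iters=5):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        costs = []
        for a in actions:                       # roll out each candidate action sequence
            s = s_current
            for t in range(horizon):
                s = world_model(s, a[t])        # predict the next latent state
            costs.append((s - s_goal).abs().mean())   # L1 distance in representation space
        costs = torch.stack(costs)
        elites = actions[costs.topk(n_elites, largest=False).indices]
        mean, std = elites.mean(0), elites.std(0)     # refit the sampling distribution
    return mean[0]                              # execute only the first action (MPC)

# Toy usage with a made-up linear "world model", purely to show the call pattern:
# plan = cem_plan(lambda s, a: s + 0.1 * a.sum(), torch.zeros(8), torch.ones(8))
```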
Training data: about 62 hours of unlabeled robot interaction video from the DROID dataset, with no rewards, no task labels, and no expert annotations.
Planning speed: roughly 16 seconds per planned action, versus minutes per action for generative world models such as Cosmos that have to decode pixels at every rollout step.
Deployed on Franka arms in two labs never seen during training, with uncalibrated cameras and tasks specified only as goal images: zero-shot reaching, grasping, and pick-and-place.
Compare: most robot policies need data from the deployment environment, calibrated cameras, or task demonstrations; here the same frozen world model plans in a new lab from a single goal image.
Core lineage
| Model | Year | Domain |
|---|---|---|
| I-JEPA | 2023 | Images |
| V-JEPA 1 | 2024 | Video |
| VL-JEPA | 2024 | Vision-language |
| V-JEPA 2 | 2025 | Video (scaled) |
| LeJEPA | 2025 | Theory (math foundations) |
| V-JEPA 2.1 | Mar 2026 | Dense self-supervision |
Community extensions
V-JEPA 2.1 adds four new ingredients: dense predictive loss (all tokens contribute to the loss, not just masked ones), deep self-supervision (the loss is also applied at intermediate layers), multi-modal tokenizers, and continued scaling.
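The first two ingredients can be sketched as a loss function. This assumes per-layer lists of predicted and target token representations, and is an illustration of the idea rather than the actual recipe.

```python
import torch
import torch.nn.functional as F

def dense_deep_loss(predictions, targets, layer_weights=None):
    """predictions/targets: lists of (B, N, D) tensors, one per supervised layer."""
    layer_weights = layer_weights or [1.0] * len(predictions)
    loss = 0.0
    for w, pred, tgt in zip(layer_weights, predictions, targets):
        # "Dense": every token contributes to the loss, not only the masked ones.
        # "Deep": the same regression loss is applied at intermediate layers too.
        loss = loss + w * F.l1_loss(pred, tgt.detach())
    return loss / sum(layer_weights)
```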
| Task | Score | What This Actually Measures |
|---|---|---|
| Robotic grasping | +20 pts over V-JEPA 2 | Can a robot pick up objects it's never seen, in a lab it's never been in? 65% → ~85% success. |
| Ego4D object interaction | 7.71 mAP | First-person video: predict what object you'll interact with next before you touch it. Measures anticipation, not recognition. |
| EPIC-KITCHENS | 40.8 R@5 | People cooking, filmed from head-mounted cameras. Predict the next action — "they're reaching for the drawer, they'll grab a knife." Tests temporal intent reasoning. |
| Something-Something v2 | 77.7% | Short clips of humans doing things with objects — "pushing something left," "covering something." The model must understand motion dynamics, not just recognize objects. |
| Depth estimation | 0.307 RMSE (lower is better) | NYUv2 indoor scenes: estimate how far away every pixel is from a single RGB camera. No stereo needed — the model learned 3D structure from video alone. |
| Robot navigation | 5.687 ATE (Absolute Trajectory Error; lower is better) | TartanDrive: off-road autonomous driving. Predict where you are after a sequence of movements. Tests spatial understanding at room/outdoor scale. |
The model sees only 3–4 seconds of video at a time (16 frames); events requiring longer causal reasoning are out of reach.
Autoregressive predictions accumulate error. Robot demos limited to ≤16 seconds. Multi-step tasks need manually specified visual sub-goals.
Action coordinates are learned implicitly rather than given in a calibrated frame, so significant camera angle changes break the system.
The big unsolved problems: long-horizon prediction without compounding error, hierarchical representations that can plan across multiple timescales, and robustness to viewpoint change.
| Mode | Scene | Response |
|---|---|---|
| Solo | One person, free movement | → Party mode |
| Duo / Group | Clustered, conversational | → Café / Networking |
| Presentation | One standing, others seated | → Full silence |
YOLO counts bodies. V-JEPA understands the spatial dynamics — the difference between five people collaborating and five people listening to a lecture.
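For the camera project, the natural recipe mirrors the frozen-backbone probing above: keep a pretrained V-JEPA encoder frozen and train only a small head on a few seconds of footage per labeled clip. Everything below, including the `frozen_encoder` stand-in, the dimensions, and the three labels, is an assumption for illustration.

```python
import torch
import torch.nn as nn

MODES = ["solo", "duo_group", "presentation"]

class RoomModeHead(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.head = nn.Linear(dim, len(MODES))

    def forward(self, frozen_tokens):            # (B, N, dim): spatiotemporal tokens
        clip_vector = frozen_tokens.mean(dim=1)  # pool over space and time
        return self.head(clip_vector)            # logits over room modes

# Usage sketch (encoder kept frozen; only the head is trained on labeled clips):
# logits = RoomModeHead()(frozen_encoder(clip).detach())
# mode = MODES[logits.argmax(-1)]
```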
[Image: classroom-mode]
V-JEPA's key insight: by placing learned encoders between raw perception and the prediction objective, the system gains the freedom to represent only what matters.
That's also the design challenge — decide what matters, and build a camera that understands it.