SMART OBJECTS

World Models & V-JEPA

LeCun's bet on non-generative intelligence —
and why it matters for what our cameras can understand.

SVA MFA Interaction Design · Spring 2026

THE BIG IDEA

Predict meaning, not pixels.

A generative model predicting the next video frame has to reconstruct every blade of grass, every water ripple, every carpet texture — details that are computationally expensive and fundamentally unpredictable.

A world model should instead represent a car approaching a fork in terms of position, velocity, and orientation — with a latent variable encoding whether it turns left or right.

JEPA (Joint Embedding Predictive Architecture) predicts in abstract representation space. It learns what matters and discards the rest.
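
To make the contrast concrete, here is a minimal PyTorch sketch (the Linear modules are toy stand-ins for real encoders, predictors, and decoders): a generative model is graded on every pixel it reconstructs, while a JEPA-style model is graded only in the learned representation space.

    import torch
    import torch.nn.functional as F

    # Toy stand-ins for a ViT encoder, the JEPA predictor, and a generative decoder.
    encoder = torch.nn.Linear(3 * 64 * 64, 128)
    predictor = torch.nn.Linear(128, 128)
    decoder = torch.nn.Linear(128, 3 * 64 * 64)

    x = torch.randn(8, 3 * 64 * 64)   # context frames (flattened)
    y = torch.randn(8, 3 * 64 * 64)   # future frames (flattened)

    # Generative objective: reconstruct every pixel of the future frame.
    pixel_loss = F.mse_loss(decoder(predictor(encoder(x))), y)

    # JEPA objective: predict the future only in representation space, so the
    # encoder is free to discard unpredictable detail. (Regularization, covered
    # below, is what keeps the encoder from collapsing to a constant.)
    jepa_loss = F.l1_loss(predictor(encoder(x)), encoder(y).detach())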

LECUN'S VISION

A Cognitive Architecture for Machines

Yann LeCun's 2022 paper "A Path Towards Autonomous Machine Intelligence" proposes six differentiable modules:

Perception

Encode the world into representations

World Model

Predict future states, estimate missing info

Actor

System 1 (reactive) + System 2 (deliberative)

Cost Module

Intrinsic drives + learned critic

Short-Term Memory

Maintain state across time

Configurator

Executive control over all modules

The world model sits at the center. V-JEPA is the first concrete implementation of this module.

HOW IT WORKS

Energy, Not Probability

Instead of computing P(y|x), an energy function F(x, y) assigns:

  • Low energy → compatible (x, y) pairs
  • High energy → incompatible pairs

The system doesn't predict y from x — it evaluates whether a proposed y is compatible with x.

For multiple valid futures, a latent variable z parameterizes the possibilities:

F(x, y) = min_z E(x, y, z)
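
A toy sketch of that minimization (the predictor, dimensions, and sample-based search over z are all illustrative assumptions): the model scores a proposed future y by searching for the latent z that best explains it.

    import torch

    def energy(s_x, s_y, z, predictor):
        # E(x, y, z): prediction error when latent z selects one possible future.
        return torch.norm(predictor(torch.cat([s_x, z])) - s_y)

    def F_xy(s_x, s_y, predictor, n_z=64, z_dim=4):
        # F(x, y) = min_z E(x, y, z): score y by its *best* explanation.
        zs = torch.randn(n_z, z_dim)   # crude sampling in place of true minimization
        return min(energy(s_x, s_y, z, predictor) for z in zs)

    predictor = torch.nn.Linear(128 + 4, 128)        # hypothetical toy predictor
    s_x, s_y = torch.randn(128), torch.randn(128)    # embeddings of x and y
    print(F_xy(s_x, s_y, predictor))                 # low → compatible pair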

The collapse problem

If the energy function goes flat, it assigns low energy everywhere — the model learns nothing.

Contrastive fix: push energy up on negative samples. Problem: the number of negatives needed grows exponentially with representation dimension.
JEPA's fix: VICReg-style regularization: keep variance in each embedding dimension above a threshold and decorrelate the components. No negatives needed.
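
A minimal sketch of those two VICReg terms (the threshold gamma and the weighting are illustrative; the full method also includes an invariance term):

    import torch

    def vicreg_regularizer(emb, gamma=1.0, eps=1e-4):
        # emb: (batch, dim) encoder outputs.
        e = emb - emb.mean(dim=0)
        # Variance term: keep each dimension's std above gamma, so the
        # embeddings can't collapse to a constant (flat energy everywhere).
        std = torch.sqrt(e.var(dim=0) + eps)
        var_loss = torch.relu(gamma - std).mean()
        # Covariance term: penalize off-diagonal covariance so that the
        # dimensions carry decorrelated, non-redundant information.
        n, d = e.shape
        cov = (e.T @ e) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_loss = (off_diag ** 2).sum() / d
        return var_loss + cov_loss

    print(vicreg_regularizer(torch.randn(256, 128)))   # no negative samples needed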

LANDSCAPE

JEPA vs. the Alternatives

Approach | What It Predicts | Limitation
Generative (MAE, diffusion, autoregressive) | Raw pixels or tokens | Wastes capacity on irrelevant low-level details
Contrastive (SimCLR, CLIP, MoCo) | Aligned embeddings via negative pairs | Negative samples grow exponentially with dimension
Joint embedding (CLIP-style) | Matched cross-modal representations | Not predictive; no temporal dynamics
JEPA | Abstract representations of masked regions | Newer; still proving scalability

10,000+
GPU-hours for MAE to reach 71.5% on ImageNet

~2,500
GPU-hours for I-JEPA to reach 73.3% on ImageNet

ARCHITECTURE

How JEPA Predicts

x → x-encoder → s_x ; s_x + z (latent) → Predictor → ŝ_y (prediction)
y → y-encoder (EMA, stop gradient) → s_y (target)

The energy is the prediction error: E(x, y, z) = D(s_y, Pred(s_x, z))

Four training criteria:

  • Maximize information in encoder outputs
  • Minimize prediction error
  • Minimize information in latent z
  • EMA target encoder prevents collapse

Key insight: The encoders sit between raw perception and the prediction objective. This gives the system freedom to represent only what matters, which is the core advantage over pixel prediction.
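
A hedged toy training step tying the criteria together (module sizes, the momentum value, and the omission of the latent z are all simplifications for illustration):

    import copy
    import torch

    enc_x = torch.nn.Linear(128, 64)                 # toy stand-in for the x-encoder
    enc_y = copy.deepcopy(enc_x)                     # EMA target encoder
    for p in enc_y.parameters():
        p.requires_grad_(False)                      # stop gradient on target branch
    predictor = torch.nn.Linear(64, 64)
    opt = torch.optim.AdamW(list(enc_x.parameters()) + list(predictor.parameters()))

    def train_step(x, y, momentum=0.998):
        s_y = enc_y(y)                               # target (no gradient flows back)
        loss = torch.nn.functional.l1_loss(predictor(enc_x(x)), s_y)
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                        # EMA: target tracks online encoder
            for p_t, p_o in zip(enc_y.parameters(), enc_x.parameters()):
                p_t.mul_(momentum).add_(p_o, alpha=1 - momentum)
        return loss.item()

    print(train_step(torch.randn(32, 128), torch.randn(32, 128)))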

V-JEPA 1

Video + Extreme Masking

Tokenization

Video is split into tubelets of 2×16×16 (2 frames × 16px × 16px). These are the "words" the Vision Transformer reads.

The masking strategy

Not random patches — those are trivially solvable via interpolation in redundant video. Instead: large contiguous blocks repeated across all frames.

  • Short-range: ~15% per frame (8 blocks)
  • Long-range: ~70% per frame (2 blocks)
  • Overall: ~90% masked

Why 90%?

This extreme masking forces genuine scene understanding. You can't cheat by interpolating nearby pixels when 90% of the spatiotemporal volume is missing. The model must reason about what is in the scene and how it moves.
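
A toy sketch of the tokenization arithmetic and the block-masking idea (clip size, block count, and block size here are illustrative, not the paper's exact sampler):

    import numpy as np

    T, H, W = 16, 224, 224                  # clip: frames × height × width
    t, p = 2, 16                            # tubelet: 2 frames × 16 × 16 pixels
    grid = (T // t, H // p, W // p)         # (8, 14, 14) → 1,568 tubelet tokens

    def block_mask(grid, n_blocks=2, block_hw=(10, 10), rng=np.random.default_rng(0)):
        # Mask large contiguous spatial blocks, repeated across ALL time steps,
        # so the model can't fill gaps by copying from neighboring frames.
        _, gh, gw = grid
        spatial = np.zeros((gh, gw), dtype=bool)
        bh, bw = block_hw
        for _ in range(n_blocks):
            r = rng.integers(0, gh - bh + 1)
            c = rng.integers(0, gw - bw + 1)
            spatial[r:r + bh, c:c + bw] = True
        return np.broadcast_to(spatial, grid)   # same spatial mask at every time step

    mask = block_mask(grid)
    print(f"{mask.mean():.0%} of tubelets masked")   # ~50-80% in this toy setting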

The predictor

A narrow 12-layer transformer (~22M params). Receives visible token embeddings + learnable mask tokens with positional embeddings. Outputs predicted representations for masked positions.

Loss: L1 regression (robust to outliers; its minimizer is the conditional median rather than the mean).
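
A sketch of that predictor interface (dimensions, head count, and the vanilla TransformerEncoder are placeholders; the real predictor is a narrow ViT): visible embeddings and learnable mask tokens go in, predicted representations for the masked slots come out.

    import torch

    D, N_VIS, N_MASK = 384, 160, 1408       # dim / token counts (≈90% masked of 1,568)
    mask_token = torch.nn.Parameter(torch.zeros(1, 1, D))
    pos_embed = torch.nn.Parameter(torch.randn(1, N_VIS + N_MASK, D))
    predictor = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=D, nhead=6, batch_first=True),
        num_layers=12)

    def predict_masked(visible_emb):
        # visible_emb: (B, N_VIS, D) from the x-encoder.
        B = visible_emb.shape[0]
        mask_emb = mask_token.expand(B, N_MASK, D)       # learnable placeholders
        tokens = torch.cat([visible_emb, mask_emb], dim=1) + pos_embed
        return predictor(tokens)[:, N_VIS:]              # predictions for masked slots

    preds = predict_masked(torch.randn(2, N_VIS, D))
    # Training would compare preds to the target encoder's embeddings with L1 loss.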

V-JEPA 1

Results (Frozen Evaluation)

No encoder fine-tuning — just attentive probing on top of frozen features:

81.9%
Kinetics-400
72.2%
Something-Something v2
77.9%
ImageNet-1K

Surpasses prior video self-supervised methods by 4–10 points.

Flagship models: ViT-L/16 (~300M params) and ViT-H/16 (~630M params). Trained on VideoMix2M (~2M videos from HowTo100M, Kinetics, Something-Something v2) with batch size 3072 for 90K iterations.

V-JEPA 2

Four Scaling Ingredients

1. Data: 11×

VideoMix2M → VideoMix22M
~22M samples, >1 million hours of video

2. Model

ViT-H (630M) → ViT-g (~1B params)
Full model: 1.2 billion parameters
3D Rotary Position Embeddings (3D-RoPE)

3. Training duration: 2.8×

90K → 252K iterations

4. Resolution: 8.4× speedup

Progressive: 16 frames / 256×256 →
64 frames / 384×384 during cooldown

Cumulative impact: starting from V-JEPA 1's 84.2% average, the gains stack: data (+1.0) + model (+1.5) + training (+0.8) + resolution (+0.7) → 88.2% average across six benchmarks.

V-JEPA 2

Benchmark Results

Benchmark | Task | V-JEPA 2 | Context
Something-Something v2 | Motion classification | 77.3% | InternVideo: 69.7%
Epic-Kitchens-100 | Action anticipation | 39.7 R@5 | 44% improvement over prior SOTA
Diving-48 | Fine-grained motion | 90.2% | Frozen backbone
PerceptionTest | Video QA | 84.0 | SOTA at 8B scale
TempCompass | Temporal QA | 76.9 | SOTA at 8B scale
ImageNet-1K | Image classification | 84.6% | Competitive with DINOv2

Pattern: V-JEPA dominates on temporal and motion understanding. It's competitive (not dominant) on static appearance tasks. The model understands dynamics better than anything else at its scale.

INTUITIVE PHYSICS

The Model Understands Objects

V-JEPA was separately evaluated on intuitive physics — detecting when things violate physical laws:

V-JEPA understands:

  • Object permanence (72.1% vs. 52.5% baseline)
  • Continuity
  • Shape constancy
  • Support
  • Inertia

Still struggles with:

Object-to-object interactions

Collisions, solidity, gravity: these likely require the hierarchical representations LeCun has proposed but not yet built.

For reference: pixel-prediction models and multimodal LLMs both perform near chance on these physics tasks. V-JEPA is the only approach that shows real understanding.

V-JEPA 2-AC

From Watching to Acting

Turn the passive video model into an active world model for robot control:

Frozen V-JEPA 2 encoder → Action-Conditioned Predictor (~300M params) → Predicted future state

Three input streams per timestep:

  • Visual features — 16×16×1408 from frozen encoder
  • 7D actions — delta position, orientation, gripper
  • 7D proprioception — absolute end-effector state

Planning: Model-Predictive Control with the cross-entropy method (CEM). Sample 800 action sequences, roll each through the predictor, and pick the one that minimizes L1 distance to the goal in representation space.
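
A hedged sketch of that loop (the predictor's call signature, the horizon, iteration count, and noise scale are assumptions; the stand-in predictor below exists only to make the sketch run):

    import torch

    def cem_plan(s_goal, s_now, proprio, predictor, horizon=2,
                 n_samples=800, n_elite=40, iters=4):
        # Cross-entropy method over 7-D action sequences (delta pose + gripper).
        mu, sigma = torch.zeros(horizon, 7), 0.1 * torch.ones(horizon, 7)
        for _ in range(iters):
            acts = mu + sigma * torch.randn(n_samples, horizon, 7)
            costs = []
            for a in acts:
                s = s_now
                for step in range(horizon):              # roll out in latent space
                    s = predictor(s, a[step], proprio)
                costs.append((s - s_goal).abs().mean())  # L1 distance to goal
            elite = acts[torch.topk(-torch.stack(costs), n_elite).indices]
            mu, sigma = elite.mean(dim=0), elite.std(dim=0)
        return mu[0]                                     # execute first action, replan

    def toy_predictor(s, a, prop):                       # stand-in dynamics model
        return s + 0.01 * a.sum()

    s0, goal, prop = torch.randn(64), torch.randn(64), torch.zeros(7)
    print(cem_plan(goal, s0, prop, toy_predictor))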

Training data:

62 hours
of unlabeled robot video (DROID dataset, 23K trajectories)

Planning speed:

~16s
V-JEPA 2-AC
~4 min
Cosmos (pixel-gen)

V-JEPA 2-AC

Zero-Shot Robot Control

Deployed on Franka arms in two labs never seen during training, with uncalibrated cameras:

65–80%
Pick-and-place
65%
Grasping
100%
Reaching

Compare:

  • Octo (vision-language-action, 1M+ trajectories): 15% grasping
  • Cosmos (pixel generation): 0–30% manipulation

The takeaway: 62 hours of unlabeled video > 1M+ labeled trajectories. World models that predict in representation space learn physics that transfers to new environments. No task-specific training, no rewards, no environment data collection.

ECOSYSTEM

The Growing JEPA Family

Core lineage

Model | Year | Domain
I-JEPA | 2023 | Images
V-JEPA 1 | 2024 | Video
VL-JEPA | 2024 | Vision-language
V-JEPA 2 | 2025 | Video (scaled)
LeJEPA | 2025 | Theory (mathematical foundations)
V-JEPA 2.1 | Mar 2026 | Dense self-supervision

Community extensions

  • A-JEPA · Audio (SOTA on AudioSet)
  • S-JEPA · EEG signals
  • Brain-JEPA · Neuroscience
  • 3D-JEPA · 3D understanding
  • T-JEPA · Tabular data
  • ACT-JEPA · Robotics policy
  • UI-JEPA · User interfaces

VL-JEPA: 1.6B params; outperforms GPT-4o and Gemini-2.0 on world-modeling benchmarks with 50% fewer parameters and 2.85× faster decoding.

V-JEPA 2.1 — MARCH 2026

Dense Features Change Everything

Four new ingredients: a dense predictive loss (all tokens are supervised, not just the masked ones), deep self-supervision (losses at intermediate layers too), multi-modal tokenizers, and continued scaling.
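
A compact sketch of what dense, deep supervision could look like (the layer taps, L1 criterion, and equal weighting are assumptions for illustration, not the published recipe):

    import torch

    def dense_deep_loss(online_layers, target_layers, taps=(4, 8, 12)):
        # online_layers / target_layers: per-block activations, each (B, N, D).
        # Dense: every token contributes to the loss, not just the masked ones.
        # Deep: the loss is applied at intermediate layers, not only the output.
        losses = [torch.nn.functional.l1_loss(online_layers[i - 1],
                                              target_layers[i - 1].detach())
                  for i in taps]
        return sum(losses) / len(taps)

    layers = [torch.randn(2, 16, 64) for _ in range(12)]
    print(dense_deep_loss(layers, [l + 0.1 * torch.randn_like(l) for l in layers]))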

Task | Score | What This Actually Measures
Robotic grasping | +20 pts over V-JEPA 2 | Can a robot pick up objects it's never seen, in a lab it's never been in? 65% → ~85% success.
Ego4D object interaction | 7.71 mAP | First-person video: predict which object you'll interact with next, before you touch it. Measures anticipation, not recognition.
EPIC-KITCHENS | 40.8 R@5 | People cooking, filmed from head-mounted cameras. Predict the next action: "they're reaching for the drawer, they'll grab a knife." Tests temporal intent reasoning.
Something-Something v2 | 77.7% | Short clips of humans doing things with objects: "pushing something left," "covering something." The model must understand motion dynamics, not just recognize objects.
Depth estimation | 0.307 RMSE | NYUv2 indoor scenes: estimate how far every pixel is from a single RGB camera. No stereo needed; the model learned 3D structure from video alone.
Robot navigation | 5.687 ATE | TartanDrive: off-road autonomous driving. Predict where you are after a sequence of movements. Tests spatial understanding at room and outdoor scale.

The shift from V-JEPA 2 → 2.1: V-JEPA 2 understood scenes globally ("this is a kitchen"). V-JEPA 2.1 understands scenes densely ("the mug is on the counter, 1.2m away, and you're about to reach for it"). That's the difference dense features make.

OPEN CHALLENGES

What V-JEPA Can't Do Yet

Short temporal context

3–4 second clips (16 frames). Events requiring longer causal reasoning are out of reach.

Long-horizon planning

Autoregressive predictions accumulate error. Robot demos limited to ≤16 seconds. Multi-step tasks need manually specified visual sub-goals.

Camera sensitivity

Action coordinates are implicitly learned. Significant camera angle changes break the system.

The big unsolved problems:

  • H-JEPA — hierarchical predictions at multiple time scales (still theoretical)
  • Multi-modal — vision + audio + tactile + proprioception (nascent)
  • Language goals — no natural language interface for the planning system yet
  • Physical reasoning — ~60% on CausalVQA vs. ~95% human

The central open bet: Can non-contrastive, non-generative methods match the scaled generative paradigm (GPT-4, Sora, Gemini)?

FOR OUR CAMERAS

Why V-JEPA Matters Here

Core connection: V-JEPA understands what's happening in a scene over time — not just what objects are present. This is exactly what we need to classify classroom modes, not just count people.

Solo

One person, free movement
→ Party mode

Duo / Group

Clustered, conversational
→ Café / Networking

Presentation

One standing, others seated
→ Full silence

YOLO counts bodies. V-JEPA understands the spatial dynamics — the difference between five people collaborating and five people listening to a lecture.

The path: Record short clips of each mode → extract V-JEPA embeddings → train a lightweight classifier on top → wire into Discord bot: !classroom-mode
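
A hedged sketch of that path (the embedding dimension, pooling, file names, and the stand-in embed_clip helper are all placeholders; the real pipeline would run the frozen V-JEPA encoder):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    MODES = ["solo", "duo_group", "presentation"]
    DIM = 1024                       # placeholder pooled V-JEPA embedding size

    def embed_clip(clip: str) -> np.ndarray:
        # Stand-in: the real version runs the frozen V-JEPA encoder on the clip
        # and average-pools its token embeddings into one vector per clip.
        return np.random.default_rng(hash(clip) % 2**32).normal(size=DIM)

    # Stand-in dataset: a handful of labeled clips per classroom mode.
    clips = [f"{m}_{i}.mp4" for m in MODES for i in range(10)]
    labels = [MODES.index(c.rsplit("_", 1)[0]) for c in clips]

    X = np.stack([embed_clip(c) for c in clips])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    def classroom_mode(clip: str) -> str:    # what !classroom-mode would call
        return MODES[int(clf.predict(embed_clip(clip)[None, :])[0])]

    print(classroom_mode("presentation_demo.mp4"))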

SMART OBJECTS

The classroom changes all day.
The room never does.

V-JEPA's key insight: by placing learned encoders between raw perception and the prediction objective, the system gains the freedom to represent only what matters.

That's also the design challenge — decide what matters, and build a camera that understands it.

github.com/kandizzy/smart-objects-cameras