SMART OBJECTS

What Else Can Your Camera See?

Five project ideas from the oak-examples repo
that work on our hardware today.

SVA MFA Interaction Design · Spring 2026

CONTEXT

What You Already Have

Four detectors running on your OAK-D cameras, all controllable via Discord:

Detector               What it sees               Pipeline
person_detector.py     People in frame, count     YOLOv6 (single stage)
fatigue_detector.py    Drowsiness, head tilt      YuNet → MediaPipe landmarks
gaze_detector.py       Where someone is looking   YuNet → Head pose → Gaze ADAS
whiteboard_reader.py   Text on a whiteboard       PaddlePaddle detect + recognize

The pattern: every detector writes a JSON status file → the Discord bot reads it → you ask questions in chat and get answers.
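That write-then-read pattern can be sketched in a few lines. This is illustrative only: the file name and JSON fields here are assumptions, not the actual schema the existing detectors use.

```python
import json
import tempfile
import time
from pathlib import Path

# Hypothetical status file -- each real detector defines its own
# path and fields; this just shows the write/read handshake.
STATUS_FILE = Path(tempfile.gettempdir()) / "person_detector_status.json"

def write_status(person_count: int) -> None:
    """Detector side: overwrite the status file once per cycle."""
    STATUS_FILE.write_text(json.dumps({
        "detector": "person_detector",
        "person_count": person_count,
        "updated_at": time.time(),
    }))

def read_status(max_age_s: float = 10.0):
    """Bot side: return the latest status, or None if missing/stale."""
    try:
        status = json.loads(STATUS_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if time.time() - status.get("updated_at", 0) > max_age_s:
        return None  # detector stopped updating -- treat as offline
    return status

write_status(2)
print(read_status()["person_count"])  # → 2
```

The staleness check matters: a status file with no timestamp looks identical whether the detector is running or crashed an hour ago.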

IDEA 1

Hand Gesture Recognition

What it does

Detects hands in frame, maps 21 keypoints per hand, then classifies the pose into a gesture.

Built-in gestures:

FIST, OK, PEACE, ONE, TWO, THREE, FOUR, FIVE

MediaPipe Palm Detection → Hand Landmarker (21 pts) → Gesture classifier

Example location

oak-examples/neural-networks/
pose-estimation/hand-pose/

On our hardware

Works on RVC2
Default: ~8 FPS

IDEA 1

Gestures as Conversation

Core concept: Your hand becomes a controller. No keyboard, no mouse — just hold up a gesture and the camera responds.

Discord interaction ideas

  • !gesture — what gesture is the camera seeing right now?
  • !vote — thumbs up / thumbs down to vote on something
  • !gesture-trigger add PEACE "lights on" — bind a gesture to an action
  • Raise a fist to pause notifications, open palm to resume
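The `!gesture-trigger` binding above needs one practical safeguard: detections flicker frame to frame, so an action should only fire once a gesture is held steady. A toy sketch (class name, hold count, and re-fire behavior are all assumptions):

```python
from collections import deque

class GestureTrigger:
    """Map gestures to actions; fire only when the same gesture is
    reported for several consecutive frames, so a one-frame blip
    doesn't toggle the lights."""

    def __init__(self, hold_frames=5):
        self.bindings = {}
        self.recent = deque(maxlen=hold_frames)

    def bind(self, gesture, action):
        """e.g. bind('PEACE', 'lights on')"""
        self.bindings[gesture.upper()] = action

    def update(self, gesture):
        """Feed one detection per frame; returns an action when it fires."""
        self.recent.append(gesture.upper())
        if (len(self.recent) == self.recent.maxlen
                and len(set(self.recent)) == 1):
            action = self.bindings.get(self.recent[0])
            if action:
                self.recent.clear()  # don't re-fire while the gesture is held
                return action
        return None
```

At ~8 FPS, holding a gesture for 5 frames is well under a second of wait, which still feels immediate in chat.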

Wilder ideas

  • Silent classroom polling — "Hold up 1, 2, or 3"
  • Gesture-controlled slideshow — swipe to advance
  • Multi-camera gesture relay — gesture on orbit triggers action on horizon
  • Combine with gaze: gesture + looking at camera = confirmed command

IDEA 2

Full Body Pose Estimation

What it does

Detects people with YOLO, then estimates 17 body keypoints per person — head, shoulders, elbows, wrists, hips, knees, ankles. A full skeleton.

YOLOv6 nano → Lite-HRNet (17 keypoints)

Alternative: YOLOv8 Pose (single-stage, 17 keypoints in one pass)

Example location

oak-examples/neural-networks/
pose-estimation/human-pose/

On our hardware

Works on RVC2
Default: ~5 FPS

IDEA 2

Bodies as Input

Core concept: The camera understands posture and body language — not just "someone is there" but what they're doing.

Discord interaction ideas

  • !hand-raised — is anyone raising their hand? (wrist above shoulder)
  • !posture — standing, sitting, or slouching?
  • !activity — classify what the person is doing based on keypoint positions
  • Alert when someone stands up or sits down
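The `!hand-raised` check reduces to a keypoint comparison. A minimal sketch, assuming the model emits 17 keypoints in standard COCO order as `(x, y, confidence)` tuples with y growing downward; verify the indices against the actual model output before relying on them.

```python
# COCO-17 keypoint indices (standard ordering; an assumption here --
# check the model's documented output layout).
L_SHOULDER, R_SHOULDER = 5, 6
L_WRIST, R_WRIST = 9, 10

def hand_raised(kpts, min_conf=0.5):
    """kpts: list of 17 (x, y, confidence) tuples, image coordinates.
    A hand counts as 'raised' when a confidently detected wrist sits
    above the same-side confidently detected shoulder."""
    for wrist, shoulder in ((L_WRIST, L_SHOULDER), (R_WRIST, R_SHOULDER)):
        wx, wy, wc = kpts[wrist]
        sx, sy, sc = kpts[shoulder]
        if wc >= min_conf and sc >= min_conf and wy < sy:
            return True
    return False
```

The same pattern (compare two keypoints, threshold on confidence) covers `!posture` too, e.g. hip-to-shoulder vertical distance for slouching.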

Wilder ideas

  • Classroom hand-raise queue — first hand up gets called on first
  • Movement energy score — how active is the room?
  • Pose mirroring game — match the skeleton on screen
  • Combine with person detector: track individual skeletons over time

IDEA 3

Object Tracking with Persistent IDs

What it does

Detects objects with YOLO, then extracts a visual "fingerprint" (embedding) for each one. DeepSORT matches fingerprints across frames so each object keeps its ID — even if it leaves and returns.
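The fingerprint-matching step works roughly like this. A toy sketch, not the actual DeepSORT implementation (which also uses motion prediction): compare a new embedding against stored track embeddings by cosine similarity and reuse the best ID above a threshold.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_track(embedding, tracks, threshold=0.7):
    """tracks: {track_id: stored_embedding}. Return the best-matching
    existing ID, or None if nothing is similar enough (a new object).
    The threshold is an illustrative value, not a tuned one."""
    best_id, best_sim = None, threshold
    for track_id, stored in tracks.items():
        sim = cosine_sim(embedding, stored)
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id
```

Because matching is by appearance rather than position, an object that leaves the frame and returns can still land on its old ID.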

YOLOv6 nano → OSNet embedding → DeepSORT tracker

Example location

oak-examples/neural-networks/
object-tracking/deepsort-tracking/

On our hardware

Works on RVC2
Default: ~5 FPS

IDEA 3

Tracking as Memory

Core concept: The current person detector is goldfish-brained — it knows someone is here right now but not that they were here five seconds ago. Tracking adds continuity.

Discord interaction ideas

  • !track — list everyone currently tracked with their ID
  • !who-left — report IDs that disappeared in the last N minutes
  • !dwell-time — how long has person #3 been in frame?
  • Announce arrivals and departures to a Discord channel
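`!dwell-time` and `!who-left` are just bookkeeping on top of the tracker's IDs. A sketch of that layer (class name, timing parameters, and the 5-second "gone" cutoff are assumptions):

```python
import time

class DwellTracker:
    """Record first/last sighting per track ID, where IDs come from
    the tracker. Answers dwell-time and who-left style questions."""

    def __init__(self):
        self.first_seen = {}
        self.last_seen = {}

    def update(self, track_ids, now=None):
        """Call once per processed frame with the currently visible IDs."""
        now = time.time() if now is None else now
        for tid in track_ids:
            self.first_seen.setdefault(tid, now)
            self.last_seen[tid] = now

    def dwell_time(self, tid, now=None):
        """Seconds since this ID was first seen, or None if unknown."""
        now = time.time() if now is None else now
        if tid not in self.first_seen:
            return None
        return now - self.first_seen[tid]

    def who_left(self, within_s, gone_after_s=5, now=None):
        """IDs absent for more than gone_after_s but last seen
        within the past within_s seconds."""
        now = time.time() if now is None else now
        return [tid for tid, last in self.last_seen.items()
                if gone_after_s < now - last <= within_s]
```

A script like this can sit between the tracker's status output and the Discord bot without touching either.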

Wilder ideas

  • Trajectory heatmap — overlay paths people walked
  • Traffic flow — how many people crossed left-to-right vs. right-to-left?
  • Multi-camera handoff — person leaves orbit's view, appears in gravity's
  • Track non-person objects too — YOLO detects 80 COCO classes

IDEA 4

Human Re-Identification

What it does

Detects faces or bodies, computes a unique embedding, then compares it against previously seen embeddings using cosine similarity. Recognizes the same person appearing again — even after leaving and coming back.

Two modes:

  • Pose mode — SCRFD person detection + OSNet body embedding
  • Face mode — SCRFD/YuNet face detection + ArcFace embedding
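The `!register` / `!seen` flow described later boils down to a small registry over these embeddings. A toy sketch under stated assumptions: one stored embedding per name, cosine-similarity matching, and an illustrative (untuned) threshold.

```python
import math
import time

class IdentityRegistry:
    """Store one embedding per registered name; identify new
    embeddings by cosine similarity. Not the example's actual code."""

    def __init__(self, threshold=0.6):
        self.people = {}  # name -> {"embedding": [...], "last_seen": ts}
        self.threshold = threshold

    def register(self, name, embedding):
        self.people[name] = {"embedding": embedding,
                             "last_seen": time.time()}

    def identify(self, embedding):
        """Return the best-matching name, updating its last-seen time,
        or None for an unknown visitor."""
        best_name, best_sim = None, self.threshold
        for name, rec in self.people.items():
            sim = self._cosine(embedding, rec["embedding"])
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_name:
            self.people[best_name]["last_seen"] = time.time()
        return best_name

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
```

Note what this stores: vectors and timestamps, never images. That is the privacy-first property the next slide leans on.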

Example location

oak-examples/neural-networks/
reidentification/
human-reidentification/

On our hardware

Works on RVC2
Default: ~2 FPS

Slow but powerful. 2 FPS is fine for attendance-style use cases where you don't need real-time.

IDEA 4

Recognition as Relationship

Core concept: The camera doesn't just see a person — it remembers which person. It can greet regulars, notice when someone hasn't been around, or learn visitor patterns.

Discord interaction ideas

  • !attendance — who has the camera seen today?
  • !register "Alex" — name the current face for future recognition
  • !seen "Alex" — when was Alex last spotted?
  • Auto-greet returning people in Discord

Wilder ideas

  • Privacy-first: store only embeddings, never photos
  • Anonymous re-id: "Person A has visited 3 times" without knowing who A is
  • Pair with fatigue: "Alex looks tired today" (personalized observation)
  • Opt-in system: only track people who register themselves

IDEA 5

Segmentation & Silhouettes

What it does

Instead of drawing a bounding box, segmentation classifies every pixel — is it a person, or is it background? You get a precise silhouette, not a rectangle.

Two examples available:

  • Background blur — DeepLab V3+, blur everything that isn't a person
  • Depth crop — DeepLab + stereo depth, isolate people by distance and shape

DeepLab V3+ → Per-pixel mask
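Once you have the per-pixel mask, both examples are compositing operations. A sketch assuming the model's output has been reduced to an HxW boolean array (True = person pixel); function names are illustrative.

```python
import numpy as np

def silhouette(frame: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8 image; person_mask: HxW bool array.
    Returns a white-on-black silhouette -- presence and shape
    with no recoverable identity."""
    out = np.zeros_like(frame)
    out[person_mask] = 255
    return out

def blur_background(frame, blurred, person_mask):
    """Background-blur compositing: keep person pixels from the
    sharp frame, take everything else from a pre-blurred copy."""
    return np.where(person_mask[..., None], frame, blurred)
```

`!background` is the same trick inverted: keep pixels where the mask is False, accumulated over frames until every position has been person-free at least once.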

Example locations

oak-examples/neural-networks/
segmentation/blur-background/
oak-examples/neural-networks/
segmentation/depth-crop/

On our hardware

Works on RVC2
Blur: ~4 FPS
Depth crop: ~10 FPS

IDEA 5

Shape Without Identity

Core concept: Silhouettes are inherently privacy-preserving. You can see presence, movement, and form without ever capturing a recognizable face. This is a camera that sees people without watching them.

Discord interaction ideas

  • !silhouette — screenshot showing only person outlines
  • !privacy-mode on — switch from full frame to silhouette-only capture
  • !background — extract and share just the background (people removed)
  • !depth-mask — combine segmentation + depth to isolate by distance

Wilder ideas

  • Shadow puppet theater — silhouettes as art output
  • Anonymous occupancy — count body shapes, not faces
  • Background timelapse — capture the room without people over hours
  • Combine with depth: "how much of the room is occupied?"

COMBINING IDEAS

The Interesting Part Is the Overlap

Each idea is useful alone. Together, they start to describe a room that understands what's happening inside it.

Tracking + Pose

Person #3 raised their hand 12 seconds ago and is still waiting.

Re-id + Fatigue

Alex looks tired today. Send a private DM instead of a public alert.

Gesture + Segmentation

Privacy-safe voting: count raised fists from silhouettes, no faces stored.

Implementation is easy: each detector writes a JSON status file. A new script can read multiple status files and fuse the information — no need to modify existing detectors.
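A fusion script of that shape might look like this. The file names, field names, and the example rule are all hypothetical; the point is that fusion is pure file reading, with no changes to any detector.

```python
import json
from pathlib import Path

def read_all_statuses(status_dir: Path) -> dict:
    """Read every detector's status file from a shared directory,
    keyed by file stem (e.g. 'tracking_status')."""
    fused = {}
    for path in status_dir.glob("*_status.json"):
        try:
            fused[path.stem] = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip detectors that are down or mid-write
    return fused

def waiting_hand(fused: dict) -> bool:
    """Example tracking + pose fusion rule: someone is being tracked
    AND the pose detector reports a raised hand. Field names are
    assumptions about the status schema."""
    tracking = fused.get("tracking_status", {})
    pose = fused.get("pose_status", {})
    return bool(tracking.get("active_tracks")) and pose.get("hand_raised", False)
```

Each fusion rule is a few lines over two dictionaries, which is why combinations are cheap to prototype.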

REALITY CHECK

Working Within RVC2 Limits

Our OAK-D cameras use the RVC2 chip (Myriad X). It works — but it's not fast. Here's what to expect:

Example                     FPS on RVC2   Good for
Hand gestures               ~8 fps        Interactive commands, voting
Human pose                  ~5 fps        Posture checks, hand-raise detection
DeepSORT tracking           ~5 fps        Arrivals/departures, dwell time
Re-identification           ~2 fps        Attendance, periodic check-ins
Segmentation (blur)         ~4 fps        Privacy screenshots, silhouettes
Segmentation (depth crop)   ~10 fps       Distance-based isolation

Key insight: These frame rates are fine for conversational interaction. You're asking the camera a question via Discord and getting an answer — you're not streaming 60fps video. 2 FPS is plenty fast for "who's in the room?"

UPGRADE PATH

OAK 4: 40x Faster (Ships March 20)

The OAK 4 line uses the RVC4 platform (Qualcomm QCS8550) — 52 TOPS vs. ~1.4 TOPS on our current Myriad X. Every pipeline above would hit 30 FPS.

Model         Price   Depth
OAK 4 S       ~$749   No
OAK 4 D       ~$849   Yes
OAK 4 D Pro   ~$949   Yes + laser

8 GB RAM, 128 GB storage, 48 MP RGB. USB + PoE built into every unit.

One camera, both setups. The same OAK 4 D works for USB stations and the PoE spatial tracking rig — just plug in the right cable.

What it unlocks

  • All five ideas on this deck at 30 FPS instead of 2–8
  • RVC4-only examples: YOLO-World (detect anything by text), DINO tracking, people demographics
  • Standalone on-device apps — camera processes without a host
  • Larger, more accurate models that won't fit on Myriad X

Alternatively: offload to a GPU. Stream frames from the current OAK-D to your PC or a cloud GPU (RunPod) and run inference there. Same speed boost, no new hardware.

GETTING STARTED

Try an Example

All five examples live in the oak-examples repo and follow the same pattern:

$ ssh orbit
$ activate-oak
$ cd ~/oak-examples/neural-networks/pose-estimation/hand-pose/
$ pip install -r requirements.txt
$ python3 main.py

To make it a Smart Objects detector, follow the existing pattern:

  • Copy structure from the closest existing detector (e.g. fatigue_detector.py)
  • Write a JSON status file so the Discord bot can read it
  • Add --discord / --log / --display flags
  • Add new !commands to discord_bot.py
  • Announce startup and shutdown to Discord
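The checklist above can be sketched as a detector skeleton. Everything here is illustrative: the flag names come from the list, but the status path, fields, and the inference stub are assumptions about how a real detector would be wired up.

```python
import argparse
import json
import time

def build_parser():
    """Flags from the detector checklist."""
    p = argparse.ArgumentParser(description="gesture_detector sketch")
    p.add_argument("--discord", action="store_true",
                   help="announce startup/shutdown and events to Discord")
    p.add_argument("--log", action="store_true", help="write a log file")
    p.add_argument("--display", action="store_true",
                   help="show the annotated camera feed")
    p.add_argument("--status-file", default="/tmp/gesture_status.json")
    return p

def write_status(path, gesture):
    """The JSON status file the Discord bot polls."""
    with open(path, "w") as f:
        json.dump({"detector": "gesture_detector",
                   "gesture": gesture,
                   "updated_at": time.time()}, f)

def run_inference():
    """Placeholder: the real detector pulls a frame from the OAK
    pipeline and classifies the gesture."""
    return "PEACE"

def main():
    args = build_parser().parse_args()
    # announce startup to Discord here when --discord is set
    while True:
        write_status(args.status_file, run_inference())
        time.sleep(0.5)

if __name__ == "__main__":
    main()
```

From the bot's side, a new `!gesture` command is then just "read this status file and format the answer."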

REFERENCE

Example Paths

Idea                Path in oak-examples                                        Models
Hand gestures       neural-networks/pose-estimation/hand-pose/                  MediaPipe Palm + Hand Landmarker
Human pose          neural-networks/pose-estimation/human-pose/                 YOLOv6 + Lite-HRNet
Object tracking     neural-networks/object-tracking/deepsort-tracking/          YOLOv6 + OSNet + DeepSORT
Re-identification   neural-networks/reidentification/human-reidentification/   SCRFD/YuNet + OSNet/ArcFace
Segmentation        neural-networks/segmentation/blur-background/               DeepLab V3+
Depth crop          neural-networks/segmentation/depth-crop/                    DeepLab V3+ + StereoDepth

Browse all models: models.luxonis.com — filter by RVC2 to see what runs on our cameras.

SMART OBJECTS

Pick One. Make It Talk.

The camera already sees. Your job is to decide what it should say.

Start from an existing example. Write a status file.
Add a Discord command. Make the camera conversational.

github.com/kandizzy/smart-objects-cameras