SMART OBJECTS

What Else Can Your Camera See?

Five project ideas from the oak-examples repo
that work on our hardware today.

SVA MFA Interaction Design · Spring 2026

CONTEXT

What You Already Have

Four detectors running on your OAK-D cameras, all controllable via Discord:

Detector               What it sees               Pipeline
person_detector.py     People in frame, count     YOLOv6 (single stage)
fatigue_detector.py    Drowsiness, head tilt      YuNet → MediaPipe landmarks
gaze_detector.py       Where someone is looking   YuNet → Head pose → Gaze ADAS
whiteboard_reader.py   Text on a whiteboard       PaddlePaddle detect + recognize

The pattern: every detector writes a JSON status file → the Discord bot reads it → you ask questions in chat and get answers.
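That write-then-read pattern can be sketched in a few lines. This is illustrative only: the file name and JSON fields here are assumptions, not the actual schema the existing detectors use.

```python
import json
import tempfile
import time
from pathlib import Path

# Hypothetical status file -- each real detector defines its own
# path and fields; this just shows the write/read handshake.
STATUS_FILE = Path(tempfile.gettempdir()) / "person_detector_status.json"

def write_status(person_count: int) -> None:
    """Detector side: overwrite the status file once per cycle."""
    STATUS_FILE.write_text(json.dumps({
        "detector": "person_detector",
        "person_count": person_count,
        "updated_at": time.time(),
    }))

def read_status(max_age_s: float = 10.0):
    """Bot side: return the latest status, or None if missing/stale."""
    try:
        status = json.loads(STATUS_FILE.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if time.time() - status.get("updated_at", 0) > max_age_s:
        return None  # detector stopped updating -- treat as offline
    return status

write_status(2)
print(read_status()["person_count"])  # → 2
```

The staleness check matters: a status file with no timestamp looks identical whether the detector is running or crashed an hour ago.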

IDEA 1

Hand Gesture Recognition

What it does

Detects hands in frame, maps 21 keypoints per hand, then classifies the pose into a gesture.

Built-in gestures:

FIST, OK, PEACE, ONE, TWO, THREE, FOUR, FIVE

MediaPipe Palm Detection → Hand Landmarker (21 pts) → Gesture classifier

Example location

oak-examples/neural-networks/
pose-estimation/hand-pose/

On our hardware

Works on RVC2
Default: ~8 FPS

IDEA 1

Gestures as Conversation

Core concept: Your hand becomes a controller. No keyboard, no mouse — just hold up a gesture and the camera responds.

Discord interaction ideas

  • !gesture — what gesture is the camera seeing right now?
  • !vote — thumbs up / thumbs down to vote on something
  • !gesture-trigger add PEACE "lights on" — bind a gesture to an action
  • Raise a fist to pause notifications, open palm to resume
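The `!gesture-trigger` binding above needs one practical safeguard: detections flicker frame to frame, so an action should only fire once a gesture is held steady. A toy sketch (class name, hold count, and re-fire behavior are all assumptions):

```python
from collections import deque

class GestureTrigger:
    """Map gestures to actions; fire only when the same gesture is
    reported for several consecutive frames, so a one-frame blip
    doesn't toggle the lights."""

    def __init__(self, hold_frames=5):
        self.bindings = {}
        self.recent = deque(maxlen=hold_frames)

    def bind(self, gesture, action):
        """e.g. bind('PEACE', 'lights on')"""
        self.bindings[gesture.upper()] = action

    def update(self, gesture):
        """Feed one detection per frame; returns an action when it fires."""
        self.recent.append(gesture.upper())
        if (len(self.recent) == self.recent.maxlen
                and len(set(self.recent)) == 1):
            action = self.bindings.get(self.recent[0])
            if action:
                self.recent.clear()  # don't re-fire while the gesture is held
                return action
        return None
```

At ~8 FPS, holding a gesture for 5 frames is well under a second of wait, which still feels immediate in chat.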

Wilder ideas

  • Silent classroom polling — "Hold up 1, 2, or 3"
  • Gesture-controlled slideshow — swipe to advance
  • Multi-camera gesture relay — gesture on orbit triggers action on horizon
  • Combine with gaze: gesture + looking at camera = confirmed command

IDEA 2

Full Body Pose Estimation

What it does

Detects people with YOLO, then estimates 17 body keypoints per person — head, shoulders, elbows, wrists, hips, knees, ankles. A full skeleton.

YOLOv6 nano → Lite-HRNet (17 keypoints)

Alternative: YOLOv8 Pose (single-stage, 17 keypoints in one pass)

Example location

oak-examples/neural-networks/
pose-estimation/human-pose/

On our hardware

Works on RVC2
Default: ~5 FPS

IDEA 2

Bodies as Input

Core concept: The camera understands posture and body language — not just "someone is there" but what they're doing.

Discord interaction ideas

  • !hand-raised — is anyone raising their hand? (wrist above shoulder)
  • !posture — standing, sitting, or slouching?
  • !activity — classify what the person is doing based on keypoint positions
  • Alert when someone stands up or sits down
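The `!hand-raised` check reduces to a keypoint comparison. A minimal sketch, assuming the model emits 17 keypoints in standard COCO order as `(x, y, confidence)` tuples with y growing downward; verify the indices against the actual model output before relying on them.

```python
# COCO-17 keypoint indices (standard ordering; an assumption here --
# check the model's documented output layout).
L_SHOULDER, R_SHOULDER = 5, 6
L_WRIST, R_WRIST = 9, 10

def hand_raised(kpts, min_conf=0.5):
    """kpts: list of 17 (x, y, confidence) tuples, image coordinates.
    A hand counts as 'raised' when a confidently detected wrist sits
    above the same-side confidently detected shoulder."""
    for wrist, shoulder in ((L_WRIST, L_SHOULDER), (R_WRIST, R_SHOULDER)):
        wx, wy, wc = kpts[wrist]
        sx, sy, sc = kpts[shoulder]
        if wc >= min_conf and sc >= min_conf and wy < sy:
            return True
    return False
```

The same pattern (compare two keypoints, threshold on confidence) covers `!posture` too, e.g. hip-to-shoulder vertical distance for slouching.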

Wilder ideas

  • Classroom hand-raise queue — first hand up gets called on first
  • Movement energy score — how active is the room?
  • Pose mirroring game — match the skeleton on screen
  • Combine with person detector: track individual skeletons over time

IDEA 3

Object Tracking with Persistent IDs

What it does

Detects objects with YOLO, then extracts a visual "fingerprint" (embedding) for each one. DeepSORT matches fingerprints across frames so each object keeps its ID — even if it leaves and returns.
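The fingerprint-matching step works roughly like this. A toy sketch, not the actual DeepSORT implementation (which also uses motion prediction): compare a new embedding against stored track embeddings by cosine similarity and reuse the best ID above a threshold.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_track(embedding, tracks, threshold=0.7):
    """tracks: {track_id: stored_embedding}. Return the best-matching
    existing ID, or None if nothing is similar enough (a new object).
    The threshold is an illustrative value, not a tuned one."""
    best_id, best_sim = None, threshold
    for track_id, stored in tracks.items():
        sim = cosine_sim(embedding, stored)
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id
```

Because matching is by appearance rather than position, an object that leaves the frame and returns can still land on its old ID.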

YOLOv6 nano → OSNet embedding → DeepSORT tracker

Example location

oak-examples/neural-networks/
object-tracking/deepsort-tracking/

On our hardware

Works on RVC2
Default: ~5 FPS

IDEA 3

Tracking as Memory

Core concept: The current person detector is goldfish-brained — it knows someone is here right now but not that they were here five seconds ago. Tracking adds continuity.

Discord interaction ideas

  • !track — list everyone currently tracked with their ID
  • !who-left — report IDs that disappeared in the last N minutes
  • !dwell-time — how long has person #3 been in frame?
  • Announce arrivals and departures to a Discord channel
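`!dwell-time` and `!who-left` are just bookkeeping on top of the tracker's IDs. A sketch of that layer (class name, timing parameters, and the 5-second "gone" cutoff are assumptions):

```python
import time

class DwellTracker:
    """Record first/last sighting per track ID, where IDs come from
    the tracker. Answers dwell-time and who-left style questions."""

    def __init__(self):
        self.first_seen = {}
        self.last_seen = {}

    def update(self, track_ids, now=None):
        """Call once per processed frame with the currently visible IDs."""
        now = time.time() if now is None else now
        for tid in track_ids:
            self.first_seen.setdefault(tid, now)
            self.last_seen[tid] = now

    def dwell_time(self, tid, now=None):
        """Seconds since this ID was first seen, or None if unknown."""
        now = time.time() if now is None else now
        if tid not in self.first_seen:
            return None
        return now - self.first_seen[tid]

    def who_left(self, within_s, gone_after_s=5, now=None):
        """IDs absent for more than gone_after_s but last seen
        within the past within_s seconds."""
        now = time.time() if now is None else now
        return [tid for tid, last in self.last_seen.items()
                if gone_after_s < now - last <= within_s]
```

A script like this can sit between the tracker's status output and the Discord bot without touching either.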

Wilder ideas

  • Trajectory heatmap — overlay paths people walked
  • Traffic flow — how many people crossed left-to-right vs. right-to-left?
  • Multi-camera handoff — person leaves orbit's view, appears in gravity's
  • Track non-person objects too — YOLO detects 80 COCO classes

IDEA 4

Human Re-Identification

What it does

Detects faces or bodies, computes a unique embedding, then compares it against previously seen embeddings using cosine similarity. Recognizes the same person appearing again — even after leaving and coming back.

Two modes:

  • Pose mode — SCRFD person detection + OSNet body embedding
  • Face mode — SCRFD/YuNet face detection + ArcFace embedding
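The `!register` / `!seen` flow described later boils down to a small registry over these embeddings. A toy sketch under stated assumptions: one stored embedding per name, cosine-similarity matching, and an illustrative (untuned) threshold.

```python
import math
import time

class IdentityRegistry:
    """Store one embedding per registered name; identify new
    embeddings by cosine similarity. Not the example's actual code."""

    def __init__(self, threshold=0.6):
        self.people = {}  # name -> {"embedding": [...], "last_seen": ts}
        self.threshold = threshold

    def register(self, name, embedding):
        self.people[name] = {"embedding": embedding,
                             "last_seen": time.time()}

    def identify(self, embedding):
        """Return the best-matching name, updating its last-seen time,
        or None for an unknown visitor."""
        best_name, best_sim = None, self.threshold
        for name, rec in self.people.items():
            sim = self._cosine(embedding, rec["embedding"])
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_name:
            self.people[best_name]["last_seen"] = time.time()
        return best_name

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
```

Note what this stores: vectors and timestamps, never images. That is the privacy-first property the next slide leans on.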

Example location

oak-examples/neural-networks/
reidentification/
human-reidentification/

On our hardware

Works on RVC2
Default: ~2 FPS

Slow but powerful. 2 FPS is fine for attendance-style use cases where you don't need real-time.

IDEA 4

Recognition as Relationship

Core concept: The camera doesn't just see a person — it remembers which person. It can greet regulars, notice when someone hasn't been around, or learn visitor patterns.

Discord interaction ideas

  • !attendance — who has the camera seen today?
  • !register "Alex" — name the current face for future recognition
  • !seen "Alex" — when was Alex last spotted?
  • Auto-greet returning people in Discord

Wilder ideas

  • Privacy-first: store only embeddings, never photos
  • Anonymous re-id: "Person A has visited 3 times" without knowing who A is
  • Pair with fatigue: "Alex looks tired today" (personalized observation)
  • Opt-in system: only track people who register themselves

IDEA 5

Segmentation & Silhouettes

What it does

Instead of drawing a bounding box, segmentation classifies every pixel — is it a person, or is it background? You get a precise silhouette, not a rectangle.

Two examples available:

  • Background blur — DeepLab V3+, blur everything that isn't a person
  • Depth crop — DeepLab + stereo depth, isolate people by distance and shape

DeepLab V3+ → Per-pixel mask
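Once you have the per-pixel mask, both examples are compositing operations. A sketch assuming the model's output has been reduced to an HxW boolean array (True = person pixel); function names are illustrative.

```python
import numpy as np

def silhouette(frame: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8 image; person_mask: HxW bool array.
    Returns a white-on-black silhouette -- presence and shape
    with no recoverable identity."""
    out = np.zeros_like(frame)
    out[person_mask] = 255
    return out

def blur_background(frame, blurred, person_mask):
    """Background-blur compositing: keep person pixels from the
    sharp frame, take everything else from a pre-blurred copy."""
    return np.where(person_mask[..., None], frame, blurred)
```

`!background` is the same trick inverted: keep pixels where the mask is False, accumulated over frames until every position has been person-free at least once.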

Example locations

oak-examples/neural-networks/
segmentation/blur-background/
oak-examples/neural-networks/
segmentation/depth-crop/

On our hardware

Works on RVC2
Blur: ~4 FPS
Depth crop: ~10 FPS

IDEA 5

Shape Without Identity

Core concept: Silhouettes are inherently privacy-preserving. You can see presence, movement, and form without ever capturing a recognizable face. This is a camera that sees people without watching them.

Discord interaction ideas

  • !silhouette — screenshot showing only person outlines
  • !privacy-mode on — switch from full frame to silhouette-only capture
  • !background — extract and share just the background (people removed)
  • !depth-mask — combine segmentation + depth to isolate by distance

Wilder ideas

  • Shadow puppet theater — silhouettes as art output
  • Anonymous occupancy — count body shapes, not faces
  • Background timelapse — capture the room without people over hours
  • Combine with depth: "how much of the room is occupied?"

COMBINING IDEAS

The Interesting Part Is the Overlap

Each idea is useful alone. Together, they start to describe a room that understands what's happening inside it.

Tracking + Pose

Person #3 raised their hand 12 seconds ago and is still waiting.

Re-id + Fatigue

Alex looks tired today. Send a private DM instead of a public alert.

Gesture + Segmentation

Privacy-safe voting: count raised fists from silhouettes, no faces stored.

Implementation is easy: each detector writes a JSON status file. A new script can read multiple status files and fuse the information — no need to modify existing detectors.
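A fusion script of that shape might look like this. The file names, field names, and the example rule are all hypothetical; the point is that fusion is pure file reading, with no changes to any detector.

```python
import json
from pathlib import Path

def read_all_statuses(status_dir: Path) -> dict:
    """Read every detector's status file from a shared directory,
    keyed by file stem (e.g. 'tracking_status')."""
    fused = {}
    for path in status_dir.glob("*_status.json"):
        try:
            fused[path.stem] = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip detectors that are down or mid-write
    return fused

def waiting_hand(fused: dict) -> bool:
    """Example tracking + pose fusion rule: someone is being tracked
    AND the pose detector reports a raised hand. Field names are
    assumptions about the status schema."""
    tracking = fused.get("tracking_status", {})
    pose = fused.get("pose_status", {})
    return bool(tracking.get("active_tracks")) and pose.get("hand_raised", False)
```

Each fusion rule is a few lines over two dictionaries, which is why combinations are cheap to prototype.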

REALITY CHECK

Working Within RVC2 Limits

Our OAK-D cameras use the RVC2 chip (Myriad X). It works — but it's not fast. Here's what to expect:

Example                     FPS on RVC2   Good for
Hand gestures               ~8 fps        Interactive commands, voting
Human pose                  ~5 fps        Posture checks, hand-raise detection
DeepSORT tracking           ~5 fps        Arrivals/departures, dwell time
Re-identification           ~2 fps        Attendance, periodic check-ins
Segmentation (blur)         ~4 fps        Privacy screenshots, silhouettes
Segmentation (depth crop)   ~10 fps       Distance-based isolation

Key insight: These frame rates are fine for conversational interaction. You're asking the camera a question via Discord and getting an answer — you're not streaming 60fps video. 2 FPS is plenty fast for "who's in the room?"

UPGRADE PATH

OAK 4: 40x Faster (Ships March 20)

The OAK 4 line uses the RVC4 platform (Qualcomm QCS8550) — 52 TOPS vs. ~1.4 TOPS on our current Myriad X. Every pipeline above would hit 30 FPS.

Model         Price   Depth
OAK 4 S       ~$749   No
OAK 4 D       ~$849   Yes
OAK 4 D Pro   ~$949   Yes + laser

8 GB RAM, 128 GB storage, 48 MP RGB. USB + PoE built into every unit.

One camera, both setups. The same OAK 4 D works for USB stations and the PoE spatial tracking rig — just plug in the right cable.

What it unlocks

  • All five ideas on this deck at 30 FPS instead of 2–8
  • RVC4-only examples: YOLO-World (detect anything by text), DINO tracking, people demographics
  • Standalone on-device apps — camera processes without a host
  • Larger, more accurate models that won't fit on Myriad X

Alternatively: offload to a GPU. Stream frames from the current OAK-D to your PC or a cloud GPU (RunPod) and run inference there. Same speed boost, no new hardware.

GETTING STARTED

Try an Example

All five examples live in the oak-examples repo and follow the same pattern:

$ ssh orbit
$ activate-oak
$ cd ~/oak-examples/neural-networks/pose-estimation/hand-pose/
$ pip install -r requirements.txt
$ python3 main.py

To make it a Smart Objects detector, follow the existing pattern:

  • Copy structure from the closest existing detector (e.g. fatigue_detector.py)
  • Write a JSON status file so the Discord bot can read it
  • Add --discord / --log / --display flags
  • Add new !commands to discord_bot.py
  • Announce startup and shutdown to Discord
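The checklist above can be sketched as a detector skeleton. Everything here is illustrative: the flag names come from the list, but the status path, fields, and the inference stub are assumptions about how a real detector would be wired up.

```python
import argparse
import json
import time

def build_parser():
    """Flags from the detector checklist."""
    p = argparse.ArgumentParser(description="gesture_detector sketch")
    p.add_argument("--discord", action="store_true",
                   help="announce startup/shutdown and events to Discord")
    p.add_argument("--log", action="store_true", help="write a log file")
    p.add_argument("--display", action="store_true",
                   help="show the annotated camera feed")
    p.add_argument("--status-file", default="/tmp/gesture_status.json")
    return p

def write_status(path, gesture):
    """The JSON status file the Discord bot polls."""
    with open(path, "w") as f:
        json.dump({"detector": "gesture_detector",
                   "gesture": gesture,
                   "updated_at": time.time()}, f)

def run_inference():
    """Placeholder: the real detector pulls a frame from the OAK
    pipeline and classifies the gesture."""
    return "PEACE"

def main():
    args = build_parser().parse_args()
    # announce startup to Discord here when --discord is set
    while True:
        write_status(args.status_file, run_inference())
        time.sleep(0.5)

if __name__ == "__main__":
    main()
```

From the bot's side, a new `!gesture` command is then just "read this status file and format the answer."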

REFERENCE

Example Paths

Idea                Path in oak-examples                                        Models
Hand gestures       neural-networks/pose-estimation/hand-pose/                  MediaPipe Palm + Hand Landmarker
Human pose          neural-networks/pose-estimation/human-pose/                 YOLOv6 + Lite-HRNet
Object tracking     neural-networks/object-tracking/deepsort-tracking/          YOLOv6 + OSNet + DeepSORT
Re-identification   neural-networks/reidentification/human-reidentification/   SCRFD/YuNet + OSNet/ArcFace
Segmentation        neural-networks/segmentation/blur-background/               DeepLab V3+
Depth crop          neural-networks/segmentation/depth-crop/                    DeepLab V3+ + StereoDepth

Browse all models: models.luxonis.com — filter by RVC2 to see what runs on our cameras.

SMART OBJECTS

Pick One. Make It Talk.

The camera already sees. Your job is to decide what it should say.

Start from an existing example. Write a status file.
Add a Discord command. Make the camera conversational.

github.com/kandizzy/smart-objects-cameras