How a deterministic stereo-vision safety monitor works · Engineering notes

01 The one-bit problem

Walk onto almost any robot cell today and the safety story is the same: a light curtain or a 2D laser scanner draws a line, and the moment anything crosses it, the robot stops. One bit. Safe, or stopped.

That bit is honest but blunt. It treats a person standing two metres away exactly like a person reaching in at thirty centimetres. Full emergency halt, either way. The robot has no idea you're a comfortable distance off, so it assumes the worst and freezes. Multiply that across a shift and the throughput you lose to nuisance stops is enormous. People learn to keep clear, the line crawls, and the safety system gets blamed for being in the way.

The frustrating part is that the information to do better is right there. If you could measure how far away the nearest person is, continuously and densely across the whole scene, you wouldn't have to stop. You could slow down by exactly as much as the geometry demands and keep working the rest of the time.

Fig 1 · One bit vs. graded (whiteboard sketch)

BINARY → person at 2 m or 0.3 m, same response: STOP.
GRADED → far = FULL · nearer = 75% · closer = 50% · very close = STOP.

The whole idea in one picture. A binary sensor can only say stop; a graded one reads the distance and turns it into a speed.

02 Grade the speed, don't trip the wire

This isn't a new idea in the safety standards. It's called speed-and-separation monitoring, and the principle is simple: the closer the hazard, the slower the machine, all the way down to a stop only when it genuinely has to. Instead of one line in the sand you get a set of nested zones, and the robot's allowed speed is a function of which zone the nearest person is in.

Concretely, the closest measured distance maps to a speed scale: full speed when the area is clear, 75% as someone approaches, 50% when they're closer still, and a stop only in the innermost band. The robot keeps doing useful work right up until the geometry says it shouldn't, and it recovers the moment the person steps back.
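The mapping from distance to speed can be sketched in a few lines. This is only an illustration: the speed steps (stop, 50%, 75%, full) come from the text, but the zone boundaries in `ZONE_EDGES_M` and the function name `speed_scale` are hypothetical placeholders, not the deployed values.

```python
from bisect import bisect_right

# Hypothetical zone boundaries in metres. The article gives the speed steps
# but not the distances, so these three numbers are placeholders.
ZONE_EDGES_M = (0.5, 1.0, 2.0)
SPEED_SCALE = (0.0, 0.5, 0.75, 1.0)  # innermost band first


def speed_scale(nearest_m: float) -> float:
    """Map the closest measured distance to an allowed speed fraction."""
    # NaN or a negative distance is not a measurement: fail safe and stop.
    if nearest_m != nearest_m or nearest_m < 0:
        return 0.0
    # bisect_right counts how many zone edges lie at or below the distance,
    # which is exactly the index of the band the distance falls in.
    return SPEED_SCALE[bisect_right(ZONE_EDGES_M, nearest_m)]


for distance in (0.3, 0.7, 1.5, 3.0):
    print(f"{distance:.1f} m -> {speed_scale(distance):.0%}")
```

The lookup is a pure function of one number, with no state and no history, which is the property that makes the graded response easy to reason about.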

The robot doesn't need to know who or what is near it. It only needs to know, reliably and every frame, how far away the nearest thing is. That's a geometry problem, not a recognition problem, and geometry is something you can certify.

That last point is the whole reason this is built the way it is. The moment the safety decision is about measured distance rather than recognising objects, you can throw out the parts of the stack that are hard to trust. The neural network, the operating system, the unpredictable scheduler all go, replaced by a fixed circuit that does the same arithmetic in the same number of clock cycles on every frame.

03 How the depth gets made

To grade speed by distance you first need distance. Dense distance: a value for essentially every pixel in the scene, not a sparse point cloud. We get it the old-fashioned way, with stereo. Two global-shutter cameras a known baseline apart, a 940 nm VCSEL dot projector throwing eye-safe texture onto blank surfaces, and a matching engine that works out, for each pixel in the left image, how far it shifted in the right. That shift, the disparity, is inversely proportional to depth.
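The inverse relationship between disparity and depth is the standard pinhole stereo formula, Z = f · B / d. A minimal sketch, assuming a focal length of 800 px and a 60 mm baseline (both hypothetical; the article does not state the rig's calibration):

```python
from typing import Optional

FOCAL_PX = 800.0   # hypothetical focal length, in pixels
BASELINE_M = 0.06  # hypothetical baseline, in metres


def depth_m(disparity_px: float,
            focal_px: float = FOCAL_PX,
            baseline_m: float = BASELINE_M) -> Optional[float]:
    """Pinhole stereo: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        return None  # zero disparity means no valid match (or infinite range)
    return focal_px * baseline_m / disparity_px


# Doubling the disparity halves the depth.
for d in (8, 16, 32, 64):
    print(f"disparity {d:3d} px -> {depth_m(d):.2f} m")
```

One consequence worth noticing: because depth goes as 1/d, a one-pixel disparity step covers far more distance at long range than close in, so resolution is finest exactly where a person is nearest the hazard.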

The interesting bit is where that matching happens. There's no CPU in the loop, no GPU, no inference accelerator. The whole stereo matcher is a chain of hardwired logic stages inside the FPGA fabric of an AMD Kria K26, each one doing a fixed job and handing its result to the next, one pixel every clock cycle. Raw pixels go in the left end, a dense depth map comes out the right.

Fig 2 · The stereo pipeline (PL fabric): census → hamming → SGM → WTA → L–R → median

Stereo L + Stereo R → [Census 9×7] → [Hamming cost · 128 disp] → [SGM 4-path: L→R · R→L · T→B · TL→BR] → [WTA argmin] → [L–R consistency] → [3×3 median] → Dense depth map.
1 px / clock · ~117 µs fixed · no OS · no branch.

The datapath, end to end. Each box is real RTL on the device. The core matcher (census → hamming → SGM → WTA) is bit-exact against our Python golden reference; the two post-processing stages clean up occlusions and speckle.

Reading it left to right, here's what each stage actually does:

Census 9×7. For each pixel, encode how its 9×7 neighbourhood compares to the centre as a bit-string. This makes the match robust to brightness differences between the two cameras. We compare local structure, not raw intensity.
Hamming cost. For every pixel, score all 128 candidate disparities by the Hamming distance between census codes. Cheap, parallel, and a perfect fit for fabric.
4-path SGM. Semi-global matching aggregates those costs along four directions (L→R, R→L, T→B, TL→BR) so a pixel's chosen disparity agrees with its neighbours instead of being decided in isolation. That's what turns a noisy per-pixel guess into a smooth, dense surface.
WTA argmin. Winner-takes-all: pick the disparity with the lowest aggregated cost.
Left–right consistency. Cross-check the left and right solutions and throw out pixels where they disagree. That's how you catch occlusions honestly instead of inventing depth for them.
3×3 median. A final speckle clean-up, preserving real edges.
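The first two stages and the argmin can be sketched in plain Python. This is a software illustration, not the RTL or the golden reference: the bit convention (1 where a neighbour is darker than the centre), the reading of 9×7 as 9 wide by 7 tall, and all function names are assumptions, and the sketch skips SGM entirely by taking the argmin straight from the raw Hamming costs.

```python
import random

WIN_W, WIN_H = 9, 7  # census window; 9 wide x 7 tall is an assumed orientation
RX, RY = WIN_W // 2, WIN_H // 2


def census(img, x, y):
    """Encode the window around (x, y) as a bit-string of comparisons."""
    centre, code = img[y][x], 0
    for dy in range(-RY, RY + 1):
        for dx in range(-RX, RX + 1):
            if dx or dy:  # skip the centre pixel itself
                code = (code << 1) | (img[y + dy][x + dx] < centre)
    return code


def hamming(a, b):
    """Number of differing bits between two census codes."""
    return bin(a ^ b).count("1")


def wta_disparity(left, right, x, y, max_disp):
    """Score every candidate shift, then keep the one with the lowest cost."""
    ref = census(left, x, y)
    costs = [hamming(ref, census(right, x - d, y)) for d in range(max_disp)]
    return min(range(max_disp), key=costs.__getitem__)


# Synthetic pair: random texture, with the right view shifted by 3 pixels.
rng = random.Random(0)
W, H, SHIFT = 48, 9, 3
left = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
right = [[row[(x + SHIFT) % W] for x in range(W)] for row in left]

print(wta_disparity(left, right, x=24, y=4, max_disp=8))
```

On this synthetic pair the search should recover the 3-pixel shift, since the census code at the true disparity matches bit for bit. The hardware does the same scoring for 128 candidates per pixel in parallel rather than in a loop.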

Every one of those stages is pipelined. While the median filter is finishing pixel N, the census transform is already chewing on pixel N + thousands. Nothing waits on anything. The data flows straight through, like water through a series of locks.
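Of the stages above, SGM is the one whose behaviour is hardest to picture, so here is a sketch of a single aggregation path using the textbook recurrence. The penalties `p1` and `p2` and the toy cost values are hypothetical; the article does not give the values used in the fabric, and the real engine sums four such paths before the argmin.

```python
INF = float("inf")


def aggregate_path(costs, p1, p2):
    """One left-to-right SGM pass along a row.

    costs[x][d] is the raw matching cost at column x, disparity d.
    p1 penalises a one-step disparity change, p2 any larger jump.
    """
    n_disp = len(costs[0])
    prev = list(costs[0])
    out = [prev]
    for pixel_cost in costs[1:]:
        floor = min(prev)  # cheapest way to reach the previous pixel
        cur = []
        for d in range(n_disp):
            best = min(
                prev[d],                                   # same disparity
                prev[d - 1] + p1 if d > 0 else INF,        # one step down
                prev[d + 1] + p1 if d + 1 < n_disp else INF,  # one step up
                floor + p2,                                # any bigger jump
            )
            # Subtracting the floor keeps the running totals bounded.
            cur.append(pixel_cost[d] + best - floor)
        out.append(cur)
        prev = cur
    return out


def argmin(values):
    return min(range(len(values)), key=values.__getitem__)


# Toy row: the third pixel's raw cost slightly prefers disparity 1 (noise).
raw = [[0, 5, 5], [0, 5, 5], [4, 3, 5], [0, 5, 5]]
smooth = aggregate_path(raw, p1=2, p2=6)

print([argmin(c) for c in raw])     # per-pixel guess, decided in isolation
print([argmin(c) for c in smooth])  # after aggregation along the path
```

In the toy row the isolated argmin flips to disparity 1 at the noisy pixel, while the aggregated costs keep it at 0 with its neighbours. That is the "agrees with its neighbours" behaviour in miniature.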

04 Why determinism is the whole point

Here's the part that matters for safety, and it's easy to undersell. Because the datapath is hardwired, fully pipelined logic running one pixel per clock, the time from first pixel in to depth out is fixed. Not "fast on average". Not "usually under a budget". The same, to the clock cycle, on every frame.

That latency is roughly 117 µs, about 0.7% of a frame period. And critically, nothing in the path can make it vary:

No operating system. No scheduler to preempt the work, no other process to steal a time-slice.
No cache. No hit-or-miss lottery deciding whether this pixel is fast or slow.
No branch prediction. The circuit doesn't choose a path through code; the data path is the circuit.
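It helps to put the latency figure into clock cycles. The arithmetic below is a back-of-envelope check, assuming the 200 MHz target clock and a 60 fps frame period (the rate at which 117 µs works out to roughly 0.7% of a frame); neither the exact cycle count nor the per-stage breakdown is stated in the article.

```python
CLOCK_HZ = 200e6     # fabric clock target quoted in the article
LATENCY_S = 117e-6   # quoted pipeline latency (approximate)
FRAME_RATE_HZ = 60   # assumption: the frame rate behind the "0.7%" figure
LINE_WIDTH_PX = 1280

latency_cycles = round(LATENCY_S * CLOCK_HZ)
frame_fraction = LATENCY_S * FRAME_RATE_HZ
rows_of_delay = latency_cycles / LINE_WIDTH_PX  # at 1 px/clock, no blanking

print(f"{latency_cycles} cycles")
print(f"{frame_fraction:.2%} of a frame period")
print(f"about {rows_of_delay:.1f} image rows in flight")
```

Roughly 23,400 cycles, or about eighteen image rows' worth of pixels, is the kind of delay you would expect from a chain of windowed stages that each buffer a few lines. The point is that the number is a property of the circuit, not of the workload.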

Fig 3 · Fixed Δt, every frame (clock @ 200 MHz target)

FPGA: pixel-in → depth-out = Δt fixed (~117 µs), identical every frame.
CPU / GPU / NN: response time jitters frame to frame, can't promise the same.

A latency you can certify. A CPU, GPU or neural-net pipeline can be fast, but it can't promise the same response time on every frame. A fixed circuit can, and "every frame, on time" is exactly the property a safety case is built on.

This is the honest dividing line between this approach and a "camera plus deep-learning" safety pitch. A learned model can be impressively accurate on average, but its latency depends on the hardware's mood that millisecond, and its decision is a probability, not a proof. For a safety function, "usually correct, usually on time" isn't the bar. Always on time, always the same arithmetic is.

05 The evidence

Claims about safety are worth nothing without receipts, so here are the measured ones. The first is the one I'm proudest of. We run the same scene through the real RTL (verilated, full production geometry) and through a Python golden reference, and compare disparity for disparity.

The result is 100.0000% bit-exact over 1,009,968 disparities on the Middlebury artroom1 scene, mean absolute error of zero. Not "close enough", not "99-point-something". The hardware computes the identical answer the reference does, every pixel. (That figure supersedes an earlier 99.987% number, from before we found and fixed two long-standing RTL bugs.)
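The comparison itself is simple to state in code. This is a sketch of the kind of check described, not our actual harness; the function name and the toy data are hypothetical.

```python
def compare_disparities(rtl, golden):
    """Element-wise comparison of two equally sized disparity lists."""
    if len(rtl) != len(golden):
        raise ValueError("disparity maps differ in size")
    total = len(rtl)
    mismatches = sum(a != b for a, b in zip(rtl, golden))
    mae = sum(abs(a - b) for a, b in zip(rtl, golden)) / total
    return {
        "total": total,
        "mismatches": mismatches,
        "match_pct": 100.0 * (total - mismatches) / total,
        "mae": mae,
    }


golden = [12, 12, 13, 40, 41, 0]  # toy values, not real data
rtl = list(golden)
print(compare_disparities(rtl, golden))

# A single differing disparity is enough to break bit-exactness.
rtl[2] += 1
print(compare_disparities(rtl, golden))
```

A side observation, offered as an inference rather than something stated above: 1,009,968 equals 1272 × 794, which is what a 1280×800 frame leaves once a 9×7 window's border (8 columns, 6 rows) is trimmed.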

100.0000%: bit-exact RTL vs. golden reference, over 1,009,968 disparities

~117 µs: fixed pipeline latency, 0.7% of a frame, no jitter

1280×800 · 128: full-res dense depth, 128 disparity levels, 1 px/clock

60+ fps: fabric throughput, 1.63× margin over 120 fps capture

The rest of the spec sheet, plainly: stereo capture on OV9281 global-shutter sensors at 1280×800, a 940 nm VCSEL dot projector for eye-safe active illumination, and the whole pipeline on a single AMD Kria K26 (Zynq UltraScale+ MPSoC) targeting a 200 MHz fabric clock. A Cortex-R5 lockstep core sits alongside as an independent safety monitor, watching frame cadence, plausibility and consistency, while the geometric zone engine turns the depth map into that graded speed scale.
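The 1.63× margin quoted above falls out of the resolution and the clock. A quick check, assuming the 200 MHz target is met and ignoring blanking intervals (so this is an upper bound on the fabric's frame rate, not a measurement):

```python
WIDTH, HEIGHT = 1280, 800
CLOCK_HZ = 200e6    # fabric clock target
CAPTURE_FPS = 120   # capture rate the margin is quoted against

pixels = WIDTH * HEIGHT
frame_time_s = pixels / CLOCK_HZ  # time to stream one frame at 1 px/clock
max_fps = CLOCK_HZ / pixels       # frames per second the fabric can absorb
margin = max_fps / CAPTURE_FPS

print(f"{frame_time_s * 1e3:.2f} ms per frame")
print(f"{max_fps:.1f} fps ceiling")
print(f"{margin:.2f}x margin over capture")
```

At one pixel per clock the fabric finishes a frame in about 5.12 ms, so even a sensor running at 120 fps leaves the pipeline idle for more than a third of the time.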

The whole architecture is being engineered toward ISO 13849-1 PLd Cat 3*. The design choice that makes that tractable is deliberately boring: every unit runs identical firmware and an identical bitstream, and a Cat 3 configuration is just two of those units cross-checking each other. That means no separate "safety silicon", no per-unit special cases, and one body of certification evidence.
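What "cross-checking" could look like at the output is sketched below. This is purely illustrative: the article does not describe the voting policy, so the rule here (command the slower of the two requests, and stop if either channel is stale) and every name in it are assumptions.

```python
def cross_checked_scale(scale_a: float, scale_b: float,
                        fresh_a: bool = True, fresh_b: bool = True) -> float:
    """Combine two channels' speed requests, conservatively.

    Hypothetical policy: a channel that missed its frame deadline cannot be
    trusted, so the output is a stop. Otherwise the slower request wins.
    """
    if not (fresh_a and fresh_b):
        return 0.0
    return min(scale_a, scale_b)


print(cross_checked_scale(1.0, 1.0))                  # both clear
print(cross_checked_scale(1.0, 0.5))                  # one sees someone closer
print(cross_checked_scale(1.0, 1.0, fresh_b=False))   # one channel is stale
```

Because both units run the same bitstream, a policy like this needs no knowledge of which unit is which. Either one can veto the other.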

06 See it on the bench

None of this is a render. Everything above runs continuously on a development rig: a stereo HAT, a Kria K26, the governor closing the full capture-to-speed-scale loop. And you can watch it. The live bench streams real depth, the zone state and the graded speed it's commanding, in real time.

It's owner-gated (calibration and power controls stay locked down), but the depth view and telemetry are available behind access. If you're an integrator, or doing technical diligence, that's the fastest way to see whether "deterministic, measured, every frame" holds up under your own scrutiny.

*ISO 13849-1 PLd Cat 3 is a design target. The architecture is engineered toward it; it is not yet certified. Every other figure on this page is measured rather than projected. The 100.0000% bit-exact result compares verilated RTL (a simulation of the production design) against the Python reference at full production geometry; latency, resolution and throughput are from the deployed pipeline.