More detailed information can be found in the documentation folder of this repository, which contains a preliminary paper as well as the project presentation slides.
The paradigm shift toward Industry 5.0 requires a fundamental change in human-robot collaboration: the transition from reactive collision detection to human-centric, proactive safety frameworks. Although extrinsic sensing systems and resource-intensive AI-based perception and decision-making pipelines provide rich environmental context in complex scenarios, they suffer from occlusion and significant processing latencies. This research addresses these challenges by presenting a low-latency, intrinsic 3D motion tracking system that uses a Time-of-Flight sensor ring mounted directly on a cobot link. The core contribution is a data processing pipeline integrating a hysteresis-based temporal occupancy filter that eliminates ghosting artifacts and spatial merging. Exploiting the discrete topology of super-sparse voxel grids, we derive an optimized DBSCAN clustering configuration that ensures feature continuity while maintaining real-time performance. This approach reduces geometric detail to the level strictly necessary for robust tracking, thereby maximizing computational performance. Experimental validation in high-proximity scenarios, including human-to-human interactions, demonstrates stable tracking at internal frequencies of 14-16 Hz. The findings confirm that combining low-cost onboard sensing with a training-free, parameter-optimized super-sparse voxel pipeline yields predictable trajectories for collision avoidance, thus facilitating the inclusivity and safety requirements of modern industrial codes.
The pipeline was tested on the following system configuration:
- Microcontroller: Raspberry Pi Pico RP2040
- Time-of-Flight Sensor: VL53L7CX (8x8 multizone ranging sensor with 90° FoV) | Datasheet: https://www.st.com/resource/en/datasheet/vl53l7cx.pdf
- Cobot: Universal Robot UR10e
- Mounting Device: 3D printed single ring for 7 sensors
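As an illustration of the first pipeline stage, the sketch below back-projects one 8x8 distance frame from a single sensor into integer voxel coordinates. The voxel size, the even angular split of the 90° FoV across the zones, and all function names are illustrative assumptions, not the repository's actual implementation.

```python
import math

VOXEL_SIZE_MM = 50  # assumed grid resolution, not specified in the repo
FOV_DEG = 90.0      # VL53L7CX field of view per datasheet
ZONES = 8           # 8x8 multizone output

def zone_angles(ix, iy):
    """Approximate per-zone ray angles, assuming zones evenly split the FoV."""
    half = math.radians(FOV_DEG) / 2
    step = math.radians(FOV_DEG) / ZONES
    return -half + (ix + 0.5) * step, -half + (iy + 0.5) * step

def frame_to_voxels(distances_mm):
    """Project an 8x8 distance frame into integer voxel coordinates.

    distances_mm: 8x8 nested list; 0 or None marks an invalid zone.
    Returns a set of (vx, vy, vz) voxel indices in the sensor frame.
    """
    voxels = set()
    for iy in range(ZONES):
        for ix in range(ZONES):
            d = distances_mm[iy][ix]
            if not d:
                continue
            ax, ay = zone_angles(ix, iy)
            # Simple back-projection along the zone's ray direction.
            x = d * math.tan(ax)
            y = d * math.tan(ay)
            z = d
            voxels.add((int(x // VOXEL_SIZE_MM),
                        int(y // VOXEL_SIZE_MM),
                        int(z // VOXEL_SIZE_MM)))
    return voxels
```

The integer division onto the grid is what produces the super-sparse, discrete voxel representation the rest of the pipeline operates on.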
All experiments were conducted using Python 3.12.6 on Windows 11. The exact package versions are provided in the req.txt above. First, clone the repository and install the dependencies with:
```
pip install -r req.txt
```

Real-life testing revealed several coding and design flaws in the approach. Although the program handled the increased data volume from six additional sensors quite well, with cycle times of 65 to 85 ms, the main concern was significant problems with the detection and especially the distinction of different objects. These problems were primarily addressed by adjusting the parameters, particularly the epsilon of the DBSCAN (the distance at which two voxels are merged). A major source of errors was the static surroundings, which were often grouped together with nearby moving objects, or merged and then unmerged with other static objects, both resulting in motion errors. This could only be partially addressed by adjusting the DBSCAN parameters. Therefore, a significant adjustment was introduced: scanning for static objects before starting the main program as part of a preparation function. These static voxels are then assumed to be irrelevant for object detection. This way, the program reduces the merging errors considerably, but the process used to identify the static objects is still not fully reliable or completely developed.
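The preparation function described above might look like the following sketch. The 90% persistence threshold, the warm-up frame count, and the helper names are assumptions for illustration.

```python
def scan_static_voxels(read_frame_fn, frames=30):
    """Collect voxels that appear in nearly every warm-up frame; these are
    treated as static background.

    read_frame_fn: callable returning the current set of occupied voxels.
    Returns the set of voxels seen in at least 90% of the warm-up frames.
    """
    counts = {}
    for _ in range(frames):
        for v in read_frame_fn():
            counts[v] = counts.get(v, 0) + 1
    return {v for v, c in counts.items() if c >= 0.9 * frames}

def remove_static(frame_voxels, static_voxels):
    """Drop static background voxels before clustering."""
    return frame_voxels - static_voxels
```

Running `remove_static` on each incoming frame keeps the static surroundings out of the DBSCAN input, which is what reduces the merge/unmerge motion errors.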
Another problem was the cycle clock speed. Both classic boolean-based voxels and a time-based approach were tested. The main advantages of the boolean method are a theoretically better performance (although this could not be demonstrated in real-time testing) and the possibility of better movement reconstruction, i.e. the ability to trace the movement further back in time than with the time-based method, since each timeframe is recorded separately. The advantages of the time-based method are more consistent tracking, more adjustability in the form of the TIME_TOLERANCE variable, and better expandability toward additional sensor rings, as there are no fixed timeframes but a more fluid memorization of the received sensor data.
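A minimal sketch of the time-based variant, assuming each voxel stores its last-seen timestamp and expires after TIME_TOLERANCE seconds (the class name and the 0.5 s default are illustrative, not taken from the code):

```python
import time

TIME_TOLERANCE = 0.5  # seconds a voxel stays occupied after its last hit (assumed value)

class TimedVoxelGrid:
    """Time-based occupancy: each voxel stores its last-seen timestamp instead
    of a boolean, so stale voxels expire fluidly rather than per frame."""

    def __init__(self, tolerance=TIME_TOLERANCE):
        self.tolerance = tolerance
        self.last_seen = {}

    def update(self, voxels, now=None):
        """Refresh the timestamp of every voxel hit in this sensor reading."""
        now = time.monotonic() if now is None else now
        for v in voxels:
            self.last_seen[v] = now

    def occupied(self, now=None):
        """Return the currently occupied voxels, expiring stale ones."""
        now = time.monotonic() if now is None else now
        self.last_seen = {v: t for v, t in self.last_seen.items()
                          if now - t <= self.tolerance}
        return set(self.last_seen)
```

Because occupancy decays continuously rather than per frame, additional sensor rings can feed `update` at their own rates without any frame synchronization.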
After addressing these problems it was possible to obtain motion vectors that were relatively close to the real movement, although there was still significant noise, sometimes resulting in erratic but fundamentally correct findings (right direction, wrong velocity). Another problem occurred when trying to detect two separate movements, as the errors previously seen with the static surroundings now also applied to the two objects when they came close together. Nonetheless, the results were quite satisfactory.
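A motion vector can be estimated from the shift of a cluster's centroid between consecutive frames; a sketch, with an optional exponential smoothing step as one plausible way to damp the erratic velocity noise mentioned above (the smoothing and its alpha value are an assumption, not the repository's method):

```python
def centroid(cluster):
    """Mean position of a cluster of (x, y, z) voxel coordinates."""
    n = len(cluster)
    return tuple(sum(v[i] for v in cluster) / n for i in range(3))

def motion_vector(prev_cluster, curr_cluster, dt):
    """Velocity estimate (voxels per second) from the centroid shift
    between two frames dt seconds apart."""
    p, c = centroid(prev_cluster), centroid(curr_cluster)
    return tuple((c[i] - p[i]) / dt for i in range(3))

def smooth(prev_vec, new_vec, alpha=0.3):
    """Exponential smoothing to damp per-frame velocity noise."""
    return tuple(alpha * n + (1 - alpha) * p for p, n in zip(prev_vec, new_vec))
```

Smoothing trades responsiveness for stability: the direction estimate survives, while single-frame velocity spikes are averaged out.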
livetest.two.persons.mp4
This video shows the test with two moving objects from a bird's-eye view. At the beginning (0-2 seconds), one person approaches the sensors relatively fast, resulting in erratic behavior. Until 5 seconds, the movement can then be identified quite well. At 6 seconds, the second person approaches the sensors. While both persons are near the sensors, the detected objects merge, resulting in no (or very small) detected motion. Around 11 seconds, the first person exits the sensor range and both objects separate again. The second person's movement can then be observed well again until he exits the sensor range as well.
The parameters used in the program were adjusted empirically. For the DBSCAN parameters (eps and min_samples), the established k-distance plot and grid search with silhouette score methods were used to find optimal values for the given problem.
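The grid search portion can be sketched as follows, using scikit-learn's DBSCAN and silhouette score; the candidate value ranges and the function name are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def grid_search_dbscan(points, eps_values, min_samples_values):
    """Pick the (eps, min_samples) pair maximizing the silhouette score.

    points: (N, 3) array of voxel centers. Combinations yielding fewer than
    two clusters (ignoring noise) are skipped, since the silhouette score
    is undefined there.
    """
    best = (None, None, -1.0)
    for eps in eps_values:
        for ms in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=ms).fit_predict(points)
            mask = labels != -1  # drop noise points before scoring
            if len(set(labels[mask])) < 2:
                continue
            score = silhouette_score(points[mask], labels[mask])
            if score > best[2]:
                best = (eps, ms, score)
    return best
```

Note that scoring only the non-noise points biases the search toward parameter pairs that declare outliers as noise rather than forcing them into clusters, which suits the sparse voxel setting.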
The k-distance plot is a widely used diagnostic tool for selecting the epsilon parameter of density-based clustering algorithms such as DBSCAN. For voxel-based point clouds the plot conveys not only a scale for neighborhood density but also signatures of the underlying grid structure.

The elbow point is the location of maximal curvature in the k-distance curve. In practice it is the point where the plotted distances change from a relatively flat trend to a markedly increasing slope. Geometrically, this point separates points that reside in dense local neighborhoods from points that are isolated or belong to sparse clusters. The vertical coordinate at the elbow is commonly chosen as the epsilon value for DBSCAN. Intuitively, points that appear before the elbow have a small distance to their k-th nearest neighbor and therefore belong to dense cluster interiors, while points that appear after the elbow have a substantially larger k-distance and are likely to be noise or members of very sparse clusters. Setting epsilon to the elbow value implements the decision rule: every point whose k-distance is smaller than epsilon is considered part of a cluster, while points with a larger k-distance are treated as noise. Points inside clusters tend to have many nearby neighbors, so their k-distance stays small and the curve is flat. When the curve reaches the elbow, the population of points transitions from cluster interior to boundary or to background. This transition produces the characteristic knee shape that guides epsilon selection.

Voxelized point clouds are defined on a discrete integer grid. Distances between voxel centers are computed with the Euclidean norm d = √(Δx² + Δy² + Δz²), where each delta is an integer difference between voxel coordinates. Because the coordinate differences are integers, the set of possible distance values is discrete and limited. When the k-distance values of all points are sorted, many points frequently share identical or nearly identical distance values.
This results in extended horizontal segments in the sorted curve. Each flat segment corresponds to a bin of identical distance values produced by the grid geometry. When the next larger discrete distance appears the curve jumps to the next step. The more regular the grid and the coarser the voxel resolution the stronger the staircase effect. The staircase appearance is not a flaw but an artifact of discretization. It implies that small changes to epsilon within a flat segment will not change cluster assignments. Conversely, choosing epsilon values at jump points will change the number of neighbors for many points at once and may cause abrupt changes in the clustering outcome. In practice it is therefore advisable to choose epsilon near the top of a stable flat segment immediately before a jump, or to use algorithms that estimate the elbow by curvature rather than by manual inspection.
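The k-distance computation and a crude elbow estimate can be sketched as below; the brute-force neighbor search and the largest-jump heuristic are simplifications for illustration, not the project's exact procedure.

```python
import math

def k_distances(voxels, k=4):
    """Sorted distance to the k-th nearest neighbor for every voxel.

    On an integer voxel grid the possible distances are discrete (square
    roots of integer sums of squares), which produces the staircase shape
    described above: long flat segments separated by jumps.
    """
    dists = []
    for a in voxels:
        ds = sorted(math.dist(a, b) for b in voxels if b != a)
        if len(ds) >= k:
            dists.append(ds[k - 1])
    return sorted(dists)

def elbow_eps(kd):
    """Crude elbow estimate: the value just before the largest jump,
    i.e. the top of the last stable flat segment."""
    jumps = [(kd[i + 1] - kd[i], i) for i in range(len(kd) - 1)]
    _, i = max(jumps)
    return kd[i]
```

Choosing epsilon at the top of a flat segment, as `elbow_eps` does, follows the advice above: values inside a segment are equivalent, while values at a jump flip many neighborhoods at once.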
Fundamentally, there are two forms of limitations for the final program: theoretical limitations, which stem from the way the program works, and real-life limitations, which mainly come from uncertainties of the sensors.
The most critical limitation is the size of objects that can be detected, which is variable and depends on the distance from the sensors, the movement speed, and the material of the object (reflective or translucent materials cannot be detected). Testing revealed that rods with a diameter of around 2 cm can only be detected directly in front of the sensor ring, while objects with a diameter larger than 5 cm (a human arm, for example) can be detected at distances greater than 100 cm. Moving objects can generally be detected better, as they are often seen by more sensors, although it is difficult to give precise thresholds for this phenomenon, as it again also depends on the distance to the sensors.
The theoretical limitations were already addressed above; the main problems are the difficulties regarding DBSCAN, the detection of movement speeds with variable maximum and minimum speeds, the necessary detection of stationary objects, and the sometimes erratic detected motion vectors.
DBSCAN causes spatially distant points to be grouped together when used in environments with a limited number of voxels. This effect stems from the algorithm relying on local neighborhood density and k-nearest neighbors. Figure 1 clearly shows that the scanner initially detected a pool noodle as a separate object; however, when it came close to the wall, both were merged into a single cluster. Merging reduces the ability to separate objects by distance and to detect novel items reliably. Additionally, initial tests revealed that the sensor ring produces false points, which introduce noise into an already sparse voxel representation. It is therefore vital to handle static objects that interfere with the clustering. The solution is straightforward: record them beforehand so that these voxels are not taken into consideration by DBSCAN. This also allows for more empirical optimization of the DBSCAN parameters.
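The merging effect and the static-voxel remedy can be reproduced in a few lines; the wall and pool-noodle coordinates below are invented for the demonstration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

wall = [(x, 0, 5) for x in range(10)]   # static background voxels
noodle_far = [(4, 6, 5), (5, 6, 5)]     # moving object, well separated
noodle_near = [(4, 1, 5), (5, 1, 5)]    # same object close to the wall

def n_clusters(voxels, eps=1.5):
    """Number of DBSCAN clusters (noise excluded) over voxel centers."""
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(np.array(voxels, float))
    return len(set(labels) - {-1})

# Far from the wall the object is a separate cluster...
print(n_clusters(wall + noodle_far))    # 2 clusters
# ...but once within eps of it, DBSCAN chains both into one.
print(n_clusters(wall + noodle_near))   # 1 cluster
# Removing the pre-scanned static voxels restores separation.
print(n_clusters(noodle_near))          # 1 cluster: just the object
```

The chaining happens because density-reachability is transitive: a single voxel within eps of both groups bridges them, regardless of how far apart the group centers are.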
In this paper, a novel, low-cost intrinsic sensing system is presented. This system is designed to address the limitations of reactive, force-based collision detection and predictive collision avoidance in Industry 5.0 environments. By integrating a multi-sensor ToF array with a high-performance temporal voxel pipeline, the potential for real-time object tracking within super-sparse data structures has been demonstrated. The methodology applied for temporal occupancy filtering effectively resolves the chronic issues of ghosting and spatial merging that typically plague density-based clustering in discretized spaces. Moreover, the integration of a topologically grounded hyperparameter selection for DBSCAN offers a robust theoretical foundation for the development of future low-latency HRC applications. While the system demonstrated a high degree of correlation with ground-truth movements and displayed robustness in close-proximity interactions, critical limitations were identified with regard to sensitivity and low-profile geometric detection (d < 2 cm). Consequently, future work will concentrate on integrating adaptive background subtraction to accommodate semi-static environments and on developing a global optimization framework for hyperparameters to ensure cross-platform scalability. This research offers a foundation for cobotic workspaces that are safer, more resilient, and inclusive, aligning technological advancement with the human-centric priorities of the primary labor market.