Abstract

The paradigm shift toward Industry 5.0 requires a fundamental change in human-robot collaboration. This change involves transitioning from reactive collision detection to proactive, human-centric safety frameworks. Although extrinsic sensing systems and resource-intensive AI-based perception and decision-making pipelines provide rich environmental context in complex scenarios, they suffer from occlusion and significant processing latencies. This research addresses these challenges by presenting a low-latency, intrinsic 3D motion tracking system that uses a Time-of-Flight sensor ring mounted directly on a Cobot link. The core contribution is a data processing pipeline integrating a hysteresis-based temporal occupancy filter that eliminates ghosting artifacts and spatial merging. Exploiting the discrete topology of super-sparse voxel grids, we derive an optimized DBSCAN clustering configuration that ensures feature continuity while maintaining real-time performance. This approach reduces geometric detail to the level strictly necessary for robust tracking, thereby maximizing computational performance. Experimental validation in high-proximity scenarios, including human-to-human interactions, demonstrates stable tracking at internal frequencies of 14-16 Hz. The findings confirm that combining low-cost onboard sensing with a zero-shot sparse voxel pipeline yields predictable trajectories for collision avoidance, thus facilitating the inclusivity and safety requirements of modern industrial code.

Introduction

Industry 5.0 (I 5.0) signifies a pivotal realignment of industrial priorities, emphasizing human-centric collaboration, sustainability, and resilience rather than the pursuit of automation for its own sake [1]. It builds on the digital and cyber-physical foundations of previous industrial advancements, particularly Industry 4.0 (I 4.0). I 5.0 utilizes technologies such as the Internet of Things (IoT), artificial intelligence (AI), digital twins, industrial robots and additive manufacturing to facilitate smart factories, while redefining technological progress as a means to empower human workers and reduce environmental impact [2]. The focus of I 4.0 was predominantly on efficiency and connectivity, which gave rise to concerns regarding job security, rising unemployment due to automation, and environmental issues such as excessive energy consumption and electronic waste [1]. I 5.0 therefore aims to address these social and ecological challenges by leveraging human capabilities through collaborative machine systems, as opposed to replacing them [2].

Collaborative robots (Cobots) are designed to undertake repetitive or hazardous tasks in contact-rich environments, thereby allowing human operators to focus on more complex activities such as oversight, problem solving or, more generally, operations that add greater value [1]. This shift presents a range of practical opportunities, including mass customization, greater production flexibility, optimized resource use, and the inclusion of disabled people in the workforce. These opportunities can further help to meet growing consumer demand for personalized products while lowering material and energy footprints [2]. At the same time, the increased proximity of humans and machines gives rise to new and significant safety challenges. These challenges require robust, easy-to-handle technical solutions as well as regulatory guidance to ensure worker protection under all conditions [4]. In practice, common safety systems mostly rely on compliance control and collision detection [5], for example through torque sensors that provide reliable collision detection and force control. However, such systems only signal after an impact has occurred. Complementary approaches such as proximity sensing offer promising opportunities for achieving predictive protection and facilitating smoother and safer human-robot interaction (HRI) [4].

From an occupational health and safety perspective, human-robot collaboration (HRC) has been shown to reduce (musculoskeletal) health risk factors, decrease physical effort, improve coordination and efficiency, and lower exposure to hazards [4]. The study by Fournier et al. [4] also reports fewer errors per unit of time and maintained trust in cobotic systems, even when total error counts remain similar across setups.

By effectively lowering the physical barriers to entering the primary labor market, these ergonomic benefits lay the foundation for a more inclusive workforce. Beyond the individual health benefits, the ability to accommodate diverse physical impairments fosters compliance with legal frameworks. For example, § 154 of the German Social Code Book IX [6] promotes inclusion and reasonable accommodations for severely disabled employees. To ensure that the previously mentioned objectives of I 5.0 are aligned with the general goals of social participation and equity, Cobots must be implemented in a manner that is both safe and anticipatory. A critical first step toward ensuring safety through collision avoidance in HRC is the ability to reliably detect nearby objects in real time.

System Configuration

Hardware

The pipeline was tested on the following system configuration:

Microcontroller: Raspberry Pi Pico RP2040
Time-of-Flight Sensor: VL53L7CX (8x8 multizone ranging sensor with 90° FoV) | Datasheet: https://www.st.com/resource/en/datasheet/vl53l7cx.pdf
Cobot: Universal Robot UR10e
Mounting Device: 3D printed single ring for 7 sensors

Software

All experiments were conducted using Python 3.12.6 on Windows 11. The exact package versions are provided in req.txt in this repository. First, clone the repository, then install the dependencies with:

pip install -r req.txt

The Code

How the Program is structured

(Orange boxes represent different storage)

The program can be structured by these steps:

Collection of the Sensor-data (which is not a part of this project).
Storage of the collected data. There are exact (point-cloud) and discrete (voxel) options. The exact options have less built-in error, but the discrete options offer far better performance, which is critical for object or motion detection. Voxels are therefore the preferred option here. As the objective is to detect not only objects but also their movement, the objects must be tracked over time. There are several options: each timeframe could be stored as a separate grid, or all points could be stored in the same grid, with the voxels themselves holding the needed information. The first option gives a history of all received data, but its management can become difficult and runs the risk of memory errors. The second option restricts the size of the stored data significantly, again trading exactness for performance.
First analysis of the stored data to find objects. This can be done by cluster analysis. The DBSCAN algorithm is a good compromise between performance and accuracy, as it is one of the simplest algorithms that can separate nested objects (for example a ring with a different object inside) with reasonable accuracy.
The detected objects or groups of points must now be stored so that they can be used for motion detection later. Here, there are again two options: storing all points associated with the object, or simplifying the object to one point, for example the center. The latter would be far simpler, but the former has critical advantages. Firstly, the object can be tracked more easily, as more information about it is retained. Secondly, the dimensions of the object are preserved. This is particularly important, as some objects could otherwise become impossible to distinguish (the ring, for example, would have the same center as the object inside it). Therefore all associated points must be stored.
Tracking of the detected objects. This is arguably the most difficult part, as data from different timeframes is now required. The program tracks different objects using two different methods. First, it checks for overlap between the objects' last positions and the positions of the new, not yet associated clusters detected by DBSCAN. This method works reasonably well and has a very low error rate, but has some difficulties that are discussed later on. Notably, if no object can be associated by overlap, the reason can be that the object moved away too fast. The object is then associated with the nearest, not yet associated cluster within a given radius. Although this is more prone to false associations, it is a necessary enhancement of the basic method. Obtaining the motion vector for objects which have been tracked over time is then very simple.
After the movement vectors have been worked out, they can be used to predict the next movement. For this, the program only uses a very rudimentary algorithm. It could also track the whole history of detected movement vectors, which would make it possible to analyse the movement with more complex methods, but this would exceed the scope of this project.
Lastly an output is given. The determined movement-vector is shown including the approximate position of the related object.
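
The last steps of the list (motion vector and a rudimentary prediction) can be illustrated with a minimal sketch. It assumes that the motion vector is taken between the centroids of the voxel sets associated with one object in two consecutive observations, and that the prediction is a simple linear extrapolation. The function names `centroid`, `motion_vector` and `predict_position` are hypothetical and not taken from main.py.

```python
from typing import Iterable, Tuple

Voxel = Tuple[int, int, int]
Vec3 = Tuple[float, float, float]


def centroid(voxels: Iterable[Voxel]) -> Vec3:
    """Mean position of a set of voxel indices."""
    pts = list(voxels)
    if not pts:
        raise ValueError("an object needs at least one voxel")
    n = len(pts)
    return (
        sum(p[0] for p in pts) / n,
        sum(p[1] for p in pts) / n,
        sum(p[2] for p in pts) / n,
    )


def motion_vector(prev_voxels: Iterable[Voxel],
                  curr_voxels: Iterable[Voxel],
                  dt: float) -> Vec3:
    """Velocity in voxels per second between two observations of one object."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    c0, c1 = centroid(prev_voxels), centroid(curr_voxels)
    return tuple((b - a) / dt for a, b in zip(c0, c1))


def predict_position(curr_voxels: Iterable[Voxel],
                     velocity: Vec3,
                     dt: float) -> Vec3:
    """Assumed prediction rule: extrapolate the centroid linearly over dt."""
    c = centroid(curr_voxels)
    return tuple(ci + vi * dt for ci, vi in zip(c, velocity))


# Example: an object that moved two voxels along x within half a second.
prev = {(0, 0, 0), (1, 0, 0)}
curr = {(2, 0, 0), (3, 0, 0)}
velocity = motion_vector(prev, curr, dt=0.5)
next_centroid = predict_position(curr, velocity, dt=0.5)
```

Because all voxels of an object are kept, the centroid is only derived for the vector computation; the full voxel set remains available for the association step.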

About the Final Program:

The problem of latency was largely resolved by using the voxel approach. The Raspberry Pi Pico delivered the sensor data at approximately 8.33 Hz. The main program needed approximately 60 to 70 ms to complete one cycle (which includes data gathering and all following steps). This was also tested with larger data volumes from up to 7 sensors.

Experiments with classic boolean-based voxels as well as a time-based approach were carried out. The main advantages of the boolean method are theoretically better performance, although this could not be demonstrated in real-time testing, and the possibility of better movement reconstruction: because each timeframe is recorded separately, the movement can be traced further back in time than would be possible with the time-based method. The advantages of the time-based method are more consistent tracking, more adjustability in the form of the TIME_TOLERANCE parameter, and better expandability for possible additional sensor rings, as there are no discrete timeframes but a more fluid memorization of the received sensor data. The TIME_TOLERANCE parameter is critical in discretising the timeframes and is discussed further below. The final program makes use of the time-based approach.
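
The time-based storage can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in main.py: each voxel stores the timestamp of its last measurement, and a voxel counts as active while that timestamp is no older than TIME_TOLERANCE. The class name `TimedVoxelGrid`, the voxel size and the tolerance value are hypothetical.

```python
import math
from typing import Dict, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]
Point = Tuple[float, float, float]

TIME_TOLERANCE = 0.25  # seconds; illustrative value, not the project's setting
VOXEL_SIZE = 0.05      # metres; illustrative value


class TimedVoxelGrid:
    """Sparse voxel grid in which every voxel remembers when it was last hit."""

    def __init__(self, voxel_size: float = VOXEL_SIZE,
                 time_tolerance: float = TIME_TOLERANCE) -> None:
        self.voxel_size = voxel_size
        self.time_tolerance = time_tolerance
        self._last_seen: Dict[Voxel, float] = {}

    def to_voxel(self, point: Point) -> Voxel:
        """Map a metric point to its integer voxel index."""
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def insert(self, points: Iterable[Point], timestamp: float) -> None:
        """Refresh the timestamp of every voxel hit by a measurement."""
        for p in points:
            self._last_seen[self.to_voxel(p)] = timestamp

    def active(self, now: float) -> Set[Voxel]:
        """Voxels whose last measurement is at most time_tolerance old."""
        return {v for v, t in self._last_seen.items()
                if now - t <= self.time_tolerance}

    def prune(self, now: float) -> None:
        """Drop expired voxels so the dictionary stays small."""
        self._last_seen = {v: t for v, t in self._last_seen.items()
                           if now - t <= self.time_tolerance}


grid = TimedVoxelGrid()
grid.insert([(0.12, 0.0, 0.0)], timestamp=0.0)
grid.insert([(0.32, 0.0, 0.0)], timestamp=0.2)
active_early = grid.active(now=0.2)   # both voxels are still fresh
active_late = grid.active(now=0.3)    # the first voxel has expired
```

Since sensors write into the same dictionary whenever their data arrives, no common frame clock is needed, which matches the expandability argument made above.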

The program tracks different objects using two different methods. First, it checks for overlap between the objects' last positions and the positions of the new, not yet associated clusters detected by DBSCAN. This method works reasonably well and has a very low error rate, but has some difficulties that are discussed below. Notably, if no object can be associated by overlap, the reason can be that the object moved away too fast. The object is then associated with the nearest, not yet associated cluster within a given radius. Although this is more prone to false associations, it is a necessary enhancement of the basic method. Obtaining the motion vector for objects which have been tracked over time is then very simple.

Grey boxes: position of the object at t=0
Black boxes: position of the object at t=1
Red dots: active voxels at t=0
Green dots: active voxels at t=1
Yellow dots: voxels which are active at t=0 and t=1
Black dots: inactive voxels

The overlap solution in particular introduces the first major theoretical problem, because it can only detect motion up to a certain speed. The figure above illustrates the origin of this problem: the upper sketch shows the algorithm working as intended, as one time-step later the object still occupies some voxels of its last location. If the object moves too fast, however, no connection can be found and the object cannot be linked successfully. The upper limit on how fast an object can move before it can no longer be linked depends on the object's size, its distance to the sensors and the program's cycle time, and therefore cannot be determined exactly.
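
The two-stage association described above can be sketched as follows. The sketch assumes a greedy assignment in track order, voxel-set intersection as the overlap measure, and centroid distance for the fallback; the function name `associate` and the parameter `max_radius` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

Voxel = Tuple[int, int, int]


def centroid(voxels: Set[Voxel]) -> Tuple[float, float, float]:
    """Mean position of a non-empty voxel set."""
    n = len(voxels)
    return tuple(sum(v[i] for v in voxels) / n for i in range(3))


def associate(tracks: Dict[int, Set[Voxel]],
              clusters: List[Set[Voxel]],
              max_radius: float) -> Dict[int, Optional[int]]:
    """Map each track id to the index of a new cluster, or to None."""
    unassigned = set(range(len(clusters)))
    result: Dict[int, Optional[int]] = {tid: None for tid in tracks}

    # Stage 1: link a track to the free cluster sharing the most voxels with it.
    for tid, last in tracks.items():
        best = max(unassigned, key=lambda i: len(last & clusters[i]),
                   default=None)
        if best is not None and last & clusters[best]:
            result[tid] = best
            unassigned.discard(best)

    # Stage 2: fall back to the nearest free cluster within max_radius.
    for tid, last in tracks.items():
        if result[tid] is not None or not unassigned:
            continue
        c0 = centroid(last)
        best = min(unassigned,
                   key=lambda i: math.dist(c0, centroid(clusters[i])))
        if math.dist(c0, centroid(clusters[best])) <= max_radius:
            result[tid] = best
            unassigned.discard(best)
    return result


# Track 1 moved by one voxel (overlap), track 2 jumped by three voxels.
tracks = {1: {(0, 0, 0), (1, 0, 0)}, 2: {(10, 0, 0), (11, 0, 0)}}
clusters = [{(13, 0, 0), (14, 0, 0)}, {(1, 0, 0), (2, 0, 0)}]
links = associate(tracks, clusters, max_radius=4.0)
links_strict = associate(tracks, clusters, max_radius=2.0)
```

With a generous radius the fast object is still linked through the fallback; with a small radius it stays unlinked, which is the speed limit discussed above.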

Another problem stems from the inherent uncertainty of the voxel solution. The further away an object is from the sensor, the more voxels are available to represent each data input. This creates two critical regions. Beyond a certain distance, gaps begin to form between the activated voxels, creating unclear geometries with which the DBSCAN algorithm can also struggle. This can be counteracted by increasing the number of activated voxels with increasing distance to the sensor. The other critical region is the region directly in front of the sensor. Objects in close proximity can never be fully represented at the same voxel size and will activate most voxels directly in front of the sensor, which results in erratic movement detection. Two possible solutions for this problem are to increase the voxel density in the proximity of the sensors (for example by using Adaptive Mesh Refinement) or to use data from a different sensor whose view of the critical region is unobstructed. Both exceed the scope of this project.
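
The distance at which gaps start to form can be estimated with a rough geometric sketch. It assumes that each of the 8x8 zones contributes one point, that zones have equal angular width, and that the 90° field of view listed under Hardware applies per axis; whether that figure is a per-axis or a diagonal value should be checked against the datasheet, so it is kept as a parameter. The function names are hypothetical.

```python
import math


def zone_footprint(distance_m: float, fov_deg: float = 90.0,
                   zones_per_axis: int = 8) -> float:
    """Approximate edge length (m) covered by one central zone at a given range."""
    zone_angle = math.radians(fov_deg / zones_per_axis)
    return 2.0 * distance_m * math.tan(zone_angle / 2.0)


def gap_onset_distance(voxel_size_m: float, fov_deg: float = 90.0,
                       zones_per_axis: int = 8) -> float:
    """Range (m) beyond which neighbouring zone hits lie more than one voxel apart."""
    zone_angle = math.radians(fov_deg / zones_per_axis)
    return voxel_size_m / (2.0 * math.tan(zone_angle / 2.0))


footprint_at_1m = zone_footprint(1.0)
onset_for_5cm_voxels = gap_onset_distance(0.05)
```

Under these assumptions one zone spans roughly 0.2 m at a range of 1 m, so with voxels of a few centimetres several voxels per measurement would have to be activated at that range to keep surfaces connected.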

Preliminary Results from Testing:

Real-life testing revealed many coding and design flaws of the approach. Although the program handled the increased data from 6 additional sensors quite well, with cycle times of 65 to 85 ms, the main concern was significant problems with the detection and especially the distinction of different objects. These problems were primarily addressed by adjusting the parameters, particularly the epsilon of DBSCAN (the distance within which two voxels are merged). A major cause of errors was the static surroundings, which were often grouped together with nearby moving objects or merged and then unmerged with other static objects, both resulting in motion errors. This could only be partially addressed by adjusting the DBSCAN parameters. For this reason a significant adjustment was introduced: a preparation function scans for static objects before the main program starts. It is then assumed that these static voxels are not relevant for object detection. In this way the program is able to reduce the merging errors considerably, but the process used to identify the static objects is not yet fully reliable or fully developed.
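
The preparation step can be sketched as a simple background mask. The sketch assumes that a voxel is treated as static if it is active in a large fraction of the frames recorded while the scene is empty; the persistence threshold `min_fraction` and both function names are hypothetical and not taken from main.py.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def learn_static_voxels(frames: Iterable[Iterable[Voxel]],
                        min_fraction: float = 0.8) -> Set[Voxel]:
    """Voxels that are active in at least min_fraction of the calibration frames."""
    frame_sets = [set(f) for f in frames]
    if not frame_sets:
        return set()
    counts = Counter(v for f in frame_sets for v in f)
    threshold = min_fraction * len(frame_sets)
    return {v for v, n in counts.items() if n >= threshold}


def remove_static(active: Iterable[Voxel], static: Set[Voxel]) -> Set[Voxel]:
    """Active voxels that are passed on to clustering."""
    return set(active) - static


# Five calibration frames: one stable wall voxel, one flickering voxel,
# and one spurious point that appears only once.
calibration = [
    {(0, 0, 0), (1, 0, 0)},
    {(0, 0, 0)},
    {(0, 0, 0), (5, 5, 5)},
    {(0, 0, 0), (1, 0, 0)},
    {(0, 0, 0), (1, 0, 0)},
]
static = learn_static_voxels(calibration)
moving = remove_static({(0, 0, 0), (3, 2, 0)}, static)
```

A threshold below 1.0 tolerates occasional dropouts of a static voxel, while single false points from the sensor ring do not enter the mask. The flickering voxel in the example shows the remaining uncertainty: whether it is masked depends entirely on the chosen threshold.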

Another problem was the cycle time. Classic boolean-based voxels as well as the time-based approach were tested. The main advantages of the boolean method are theoretically better performance, although this could not be demonstrated in real-time testing, and the possibility of better movement reconstruction: because each timeframe is recorded separately, the movement can be traced further back in time than would be possible with the time-based method. The advantages of the time-based method are more consistent tracking, more adjustability in the form of the TIME_TOLERANCE variable, and better expandability for possible additional sensor rings, as there are no discrete timeframes but a more fluid memorization of the received sensor data.

After addressing these problems it was possible to obtain motion vectors that were relatively close to the real movement, although significant noise sometimes produced erratic but fundamentally correct results (correct direction, but wrong velocity). Another problem occurred when trying to detect two separate movements, as the errors previously seen with the static surroundings now also applied to the two objects when they came close together. Nonetheless, the results were quite satisfactory.

livetest.two.persons.mp4

This video shows the test with two moving objects in bird's-eye view. The beginning (0-2 seconds) shows one person approaching the sensors relatively fast, resulting in erratic behavior. Up to 5 seconds, the movement can then be identified quite well. At 6 seconds, the second person approaches the sensors. While both persons are near the sensors, the detected objects merge and result in no (or very small) detected motion. Around 11 seconds, the first person exits the sensor range and both objects separate again. The second person's movement can then be observed well again until this person exits the sensor range as well.

The parameters used in the program were adjusted empirically. For the DBSCAN parameters (eps and min_samples), the established "k-distance plot" and "grid search with silhouette score" methods were used to find suitable values for the given problem.

Hyperparameter Tuning of DBSCAN

The K distance plot is a widely used diagnostic tool for selecting the epsilon parameter for density-based clustering algorithms such as DBSCAN. For voxel-based point clouds the plot conveys not only a scale for neighborhood density but also signatures of the underlying grid structure. The elbow point is the location of maximal curvature in the K distance curve. In practice it is the point where the plotted distances change from a relatively flat trend to a markedly increasing slope. Geometrically this point separates points that reside in dense local neighborhoods from points that are isolated or belong to sparse clusters. The vertical coordinate at the elbow is commonly chosen as the epsilon value for DBSCAN.

Intuitively, points that appear before the elbow have small distance to their k nearest neighbor and therefore belong to dense cluster interiors. Points that appear after the elbow have substantially larger k distance and are likely to be noise or members of very sparse clusters. Setting epsilon to the elbow value implements the decision rule: every point whose k distance is smaller than epsilon is considered part of a cluster while points with larger k distance are treated as noise. Points inside clusters tend to have many nearby neighbors so their k distance stays small and the curve is flat. When the curve reaches the elbow the population of points transitions from cluster interior to boundary or to background. This transition produces the characteristic knee shape that guides epsilon selection.

Voxelized point clouds are defined on a discrete integer grid. Distances between voxel centers are computed with the Euclidean norm d = √(Δx^2 + Δy^2 + Δz^2) where each delta is an integer difference between voxel coordinates. Because the coordinate differences are integers the set of possible distance values is discrete and limited. When the k distance values for all points are sorted, many points frequently share identical or nearly identical distance values. This results in extended horizontal segments in the sorted curve. Each flat segment corresponds to a bin of identical distance values produced by the grid geometry. When the next larger discrete distance appears the curve jumps to the next step. The more regular the grid and the coarser the voxel resolution the stronger the staircase effect.

The staircase appearance is not a flaw but an artifact of discretization. It implies that small changes to epsilon within a flat segment will not change cluster assignments. Conversely, choosing epsilon values at jump points will change the number of neighbors for many points at once and may cause abrupt changes in the clustering outcome. In practice it is therefore advisable to choose epsilon near the top of a stable flat segment immediately before a jump, or to use algorithms that estimate the elbow by curvature rather than by manual inspection.
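
The staircase effect can be reproduced with a small brute-force sketch. It is not the repository's DBSCAN_k_distance_plot_for_multiple_k.py script; the function names are hypothetical, and distances are kept as squared integers so that equal values can be grouped exactly.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

Voxel = Tuple[int, int, int]


def k_squared_distances(voxels: Iterable[Voxel], k: int) -> List[int]:
    """Sorted squared distance from every voxel to its k-th nearest neighbour."""
    pts = list(voxels)
    if k < 1 or k >= len(pts):
        raise ValueError("k must satisfy 1 <= k < number of voxels")
    out = []
    for i, p in enumerate(pts):
        d2 = sorted(sum((a - b) ** 2 for a, b in zip(p, q))
                    for j, q in enumerate(pts) if j != i)
        out.append(d2[k - 1])
    return sorted(out)


def staircase(voxels: Iterable[Voxel], k: int) -> List[Tuple[float, int]]:
    """Plateaus of the k-distance curve as (distance, number of voxels) pairs."""
    counts = Counter(k_squared_distances(voxels, k))
    return [(math.sqrt(d2), n) for d2, n in sorted(counts.items())]


# A flat 3x3 patch of voxels plus one isolated voxel far away.
patch = [(x, y, 0) for x in range(3) for y in range(3)]
steps = staircase(patch + [(10, 0, 0)], k=4)
```

For this example the curve has only four levels (1, √2, 2 and 9 voxel lengths): any epsilon between 2 and 9 yields the same neighbourhoods, whereas moving epsilon across one of the levels changes the neighbour count of several voxels at once.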

Current Limitations:

Fundamentally there are two kinds of limitations for the final program: theoretical limitations, which stem from the way the program works, and real-life limitations, which mainly come from uncertainties of the sensors.

The most critical limitation is the size of objects which can be detected. It is variable and depends on the object's distance from the sensors, its movement speed and its material (reflective or translucent materials cannot be detected). Testing revealed that rods with a diameter of around 2 cm can only be detected directly in front of the sensor ring, while objects with a diameter larger than 5 cm (a human arm, for example) can be detected at a distance greater than 100 cm. Moving objects can generally be detected better, as they are often seen by more sensors. It is difficult to give precise thresholds for this effect, as it also depends on the distance to the sensors.

The theoretical limitations were already addressed previously, but the main problems are the difficulties regarding DBSCAN, the detection of movement speeds with variable maximal and minimal speeds, the necessary detection of stationary objects and the sometimes erratic detected motion vectors.

Preliminary Merging Problem of DBSCAN

DBSCAN can cause spatially distant points to be grouped together when used in environments with a limited number of voxels. This effect stems from the algorithm relying only on local neighborhood density (the eps radius and the min_samples threshold), so that a chain of neighboring voxels is enough to connect two objects. Figure 1 shows that the scanner initially detected a pool noodle as a separate object. However, when it came close to the wall, the scanner merged both into a single cluster. Merging reduces the ability to separate objects by distance and to detect novel items reliably. Additionally, initial tests revealed that the sensor ring produces false points, which introduce noise into an already sparse voxel representation. It is therefore vital to handle static objects that interfere with the clustering. The solution is straightforward: save them beforehand so that those voxels are not taken into consideration by DBSCAN. This also allows for a more meaningful empirical optimisation of the DBSCAN parameters.
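
The merging effect can be reproduced with a compact, pure-Python DBSCAN. This is an illustrative sketch only; the project presumably relies on a library implementation, and the eps and min_samples values as well as the wall and object geometry below are assumptions chosen for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Voxel = Tuple[int, int, int]


def dbscan(voxels: Iterable[Voxel], eps: float,
           min_samples: int) -> Dict[Voxel, int]:
    """Label every voxel with a cluster id (0, 1, ...) or -1 for noise."""
    pts = sorted(set(voxels))
    eps2 = eps * eps

    def neighbours(i: int) -> List[int]:
        # Includes the point itself, like the usual min_samples convention.
        return [j for j, q in enumerate(pts)
                if sum((a - b) ** 2 for a, b in zip(pts[i], q)) <= eps2]

    labels: Dict[int, int] = {}
    cluster = 0
    for i in range(len(pts)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1          # provisional noise, may become a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels.get(j) == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if j in labels:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # only core points expand the cluster
        cluster += 1
    return {pts[i]: lab for i, lab in labels.items()}


def n_clusters(labels: Dict[Voxel, int]) -> int:
    return len(set(labels.values()) - {-1})


wall = {(0, y, 0) for y in range(5)}


def small_object(x0: int) -> set:
    """A 2x2 block of voxels whose left edge is at x0."""
    return {(x, y, 0) for x in (x0, x0 + 1) for y in (1, 2)}


far = n_clusters(dbscan(wall | small_object(4), eps=1.5, min_samples=3))
near = n_clusters(dbscan(wall | small_object(1), eps=1.5, min_samples=3))
masked = n_clusters(dbscan(small_object(1) - wall, eps=1.5, min_samples=3))
```

With the object four voxels away, wall and object form two clusters; once the object touches the wall they collapse into one. Removing the previously stored wall voxels before clustering restores the object as its own cluster.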

Conclusion

In this paper, a novel, low-cost intrinsic sensing system is presented. This system is designed to address the limitations of reactive, force-based safety mechanisms and to support predictive collision avoidance in Industry 5.0 environments. By integrating a multi-sensor ToF array with a high-performance temporal voxel pipeline, the potential for real-time object tracking within super-sparse data structures has been demonstrated. The methodology applied for temporal occupancy filtering effectively resolves the chronic issues of ghosting and spatial merging that typically plague density-based clustering in discretized spaces. Moreover, the integration of a topologically grounded hyperparameter selection for DBSCAN offers a robust theoretical foundation for the development of future low-latency HRC applications. While the system demonstrated a high degree of correlation with ground-truth movements and displayed robustness in close-proximity interactions, critical limitations were identified with regard to sensitivity and the detection of low-profile geometries (d < 2 cm). Consequently, future work will concentrate on integrating adaptive background subtraction to accommodate semi-static environments and on developing a global optimization framework for hyperparameters to ensure cross-platform scalability. This research offers a foundation for cobotic workspaces that are safer, more resilient, and inclusive. It aligns technological advancement with the human-centric priorities of the primary labor market.

More detailed information can be found in the documentation folder of this repository, which contains a paper as well as the project presentation slides.

