
Academic RUL models are trained on run-to-failure data from test rigs. Real factory conditions — variable load, contamination, temperature cycling — break those assumptions in ways the published benchmarks don't capture. After running pilot deployments across four manufacturing sites, we have a clearer picture of where the gap between published RUL accuracy and field accuracy actually comes from.
The PRONOSTIA and C-MAPSS Problem
The PRONOSTIA bearing dataset and NASA's C-MAPSS turbofan engine dataset are the two most commonly cited benchmarks for RUL estimation research. Models trained on these datasets routinely report root mean square error (RMSE) of 10–20 hours on bearing RUL with test sets drawn from the same experimental conditions as training. Some deep learning approaches report RMSE under 5 hours. These numbers look compelling until you understand what the datasets represent.
PRONOSTIA bearings were run to failure under fixed radial load (4 kN), fixed shaft speed (1,800 RPM), and controlled temperature. The defect progression is monotonic — each bearing's health index declines smoothly from nominal to failure. C-MAPSS is similar: two of its four subsets use a single fixed operating condition, and the other two vary conditions only across six discrete, well-characterized regimes. Neither dataset contains examples of load transients, lubrication contamination, thermal cycling, or intermittent operation — all of which are routine in manufacturing environments.
Why Factory Conditions Break Monotonic Degradation Assumptions
In a manufacturing facility, a press motor might run at 40% load for two production shifts, then at 95% load during a surge period, then sit idle for a weekend. The bearing's cumulative fatigue damage does not follow a smooth curve — it accumulates fastest during high-load periods and effectively pauses during idle. A model that assumes monotonic degradation will produce RUL estimates that are too optimistic after idle periods and too pessimistic immediately after load spikes.
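The load dependence can be sketched with a Miner's-rule-style damage sum, where each operating period contributes damage scaled by a power of load. The cubic load-life exponent below is the standard textbook value for ball bearings; the schedule and scale factors are illustrative, not field data.

```python
def damage_increment(load_frac, hours, exponent=3.0):
    """Fatigue damage accrued over a period, relative to one hour at
    rated load. The cubic exponent reflects the standard ball-bearing
    load-life relation (L10 life ~ 1/P^3); treat the exact value as
    an illustrative assumption."""
    return hours * load_frac ** exponent

# A stretch of mixed duty: two shifts at 40% load, a 10-hour surge at
# 95% load, then an idle weekend (zero damage while stopped).
schedule = [(0.40, 80), (0.95, 10), (0.0, 48)]
total = sum(damage_increment(f, h) for f, h in schedule)
```

Under these assumptions the 10-hour surge accrues more fatigue damage than the 80 hours of two-shift operation — which is exactly why a model fitted to the recent trend misjudges RUL around load transitions.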
Contamination makes this worse. A bearing in a stamping press operates in a metal particulate environment. Periodic cleaning and relubrication can actually reset a stage-2 bearing back toward a healthier condition — the contamination-driven degradation is partially reversible in a way that fatigue-driven degradation is not. A health index that decreases monotonically cannot represent this. You need a model that can express recovery as well as degradation.
What Load Normalization Actually Requires
The standard approach to handling variable load is to normalize the vibration features against a load proxy — typically motor current draw or shaft torque — before feeding them to the anomaly model. This works reasonably well when the load proxy is accurate and sampled at the same rate as the vibration signal. It fails when the load proxy has significant lag (OPC-UA historian data is often averaged over 10-second windows, while vibration events may last milliseconds) or when multiple axes of load variation exist simultaneously.
In centrifugal pump applications, the relevant load axes are flow rate, suction pressure, and discharge pressure. Current draw is a proxy for hydraulic power, not specifically for bearing load. Using current draw as the sole normalization factor introduces systematic error that grows as the pump operates further from its best efficiency point. A pump running at 30% of design flow will show elevated vibration relative to a current-normalized baseline even when perfectly healthy, because the hydraulic forces at off-BEP operation are different in character from those at design point.
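Instead of dividing vibration by a single proxy, one can fit the expected healthy vibration as a function of the full operating point and score deviation from that expectation. The sketch below does this with an ordinary least-squares baseline over flow, suction, and discharge; the synthetic data and the quadratic off-BEP vibration curve are assumptions for illustration.

```python
import numpy as np

# Synthetic "healthy" operating data (hypothetical pump).
rng = np.random.default_rng(0)
n = 500
flow = rng.uniform(0.3, 1.0, n)        # fraction of design flow
suction = rng.uniform(0.8, 1.2, n)     # bar
discharge = rng.uniform(4.0, 6.0, n)   # bar
# Assumed healthy behavior: vibration rises quadratically off-BEP (flow = 1.0).
vib = 1.0 + 2.5 * (1.0 - flow) ** 2 + 0.1 * rng.standard_normal(n)

# Multi-axis baseline: expected vibration as a function of operating point.
X = np.column_stack([np.ones(n), flow, flow ** 2, suction, discharge])
coef, *_ = np.linalg.lstsq(X, vib, rcond=None)

def expected_vibration(f, s, d):
    return np.array([1.0, f, f ** 2, s, d]) @ coef

# Anomaly residual = measured minus expected at the current operating
# point. A healthy pump at 30% flow no longer looks anomalous, because
# the baseline already accounts for elevated off-BEP vibration.
residual = 3.2 - expected_vibration(0.30, 1.0, 5.0)
```

A current-normalized baseline collapses all three load axes into one number; the regression keeps them separate, which is what removes the systematic off-BEP error described above.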
Run-to-Failure Data Scarcity in Production Environments
PRONOSTIA generated run-to-failure data by intentionally running bearings until they seized. Manufacturing facilities, quite reasonably, replace bearings before catastrophic failure. This means the ground truth labels needed to train supervised RUL models are essentially unavailable in production environments. If your maintenance team is competent, you will not accumulate the run-to-failure examples needed to fit a supervised degradation model.
This is the core reason why anomaly detection — measuring deviation from a known-healthy baseline — is a more practical approach for factory RUL estimation than supervised regression models that need labeled failure trajectories. The anomaly score is computable from normal operating data alone. The RUL estimate is then derived from the rate of anomaly score change (the degradation velocity) rather than from a model trained on labeled failure progressions.
Degradation Velocity: A More Deployable RUL Proxy
EdgeRun's RUL estimation uses a degradation velocity model rather than a trajectory-fitting regression. The anomaly score for each asset is computed continuously as the reconstruction error from a variational autoencoder trained on the baseline calibration period. The degradation velocity is the rolling 7-day slope of the anomaly score. When velocity exceeds a per-asset threshold — calibrated from the initial deployment period — the system computes a projected crossing time to the alert threshold based on that velocity.
The result is expressed as a range rather than a point estimate: "bearing will likely exceed alert threshold in 18–36 hours." Expressing uncertainty explicitly is important for maintenance planning. A point estimate of 24 hours with no uncertainty bounds will be taken as a precise commitment and damage confidence in the system when the actual failure occurs at 30 hours. A range of 18–36 hours sets correct expectations and is more honest about what the model actually knows.
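The velocity-projection step can be sketched as a rolling least-squares slope whose standard error yields the range directly: projecting with the slope's plus/minus one-sigma bounds gives the early and late crossing times. This is an illustrative reconstruction of the idea, not EdgeRun's exact implementation.

```python
import numpy as np

def projected_crossing(scores, hours, alert_threshold, window=168):
    """Fit a linear trend to the trailing window (default 7 days of
    hourly samples) of anomaly scores and project when it crosses the
    alert threshold. Returns (low, high) in hours, using the slope's
    +/- 1-sigma bounds to express uncertainty."""
    t = np.asarray(hours[-window:], dtype=float)
    s = np.asarray(scores[-window:], dtype=float)
    A = np.column_stack([np.ones_like(t), t])
    coef, res, *_ = np.linalg.lstsq(A, s, rcond=None)
    intercept, slope = coef
    # Standard error of the slope from the residual variance.
    sigma2 = res[0] / (len(t) - 2) if res.size else 0.0
    slope_se = np.sqrt(sigma2 / np.sum((t - t.mean()) ** 2))
    current = intercept + slope * t[-1]
    gap = alert_threshold - current
    if gap <= 0:
        return 0.0, 0.0  # already past the alert threshold
    fast, slow = slope + slope_se, slope - slope_se
    low = gap / fast if fast > 0 else np.inf   # steeper slope: sooner
    high = gap / slow if slow > 0 else np.inf  # shallower slope: later
    return low, high

# Synthetic demo: anomaly score trending upward at ~0.01 per hour.
rng = np.random.default_rng(1)
hours = np.arange(200.0)
scores = 0.01 * hours + 0.05 * rng.standard_normal(200)
low, high = projected_crossing(scores, hours, alert_threshold=3.0)
```

Noisier scores widen the range automatically, which is the behavior you want: the bounds communicate how much the trend itself can be trusted.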
The Intermittent Operation Complication
For equipment that operates intermittently — batch process reactors, injection molding machines with scheduled production runs, stamping presses with shift-based operation — the degradation clock runs only during operation. RUL estimates expressed in calendar hours are misleading for this equipment class. The correct unit is operating hours, which requires integrating the uptime periods explicitly.
Most commercial predictive maintenance platforms expose RUL in calendar time. This is a user experience simplification that introduces real error for intermittent equipment. A bearing with 40 hours of RUL on a machine that runs 8 hours per day has 5 days of calendar RUL, not 40 hours. For a machine that runs 24 hours per day, the same 40-hour RUL becomes 1.7 days. If the maintenance planner assumes calendar time and the machine is intermittent, the urgency calculation is wrong.
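Converting operating-hour RUL to calendar time means walking the actual duty schedule rather than dividing by an average. A minimal sketch, assuming a repeating weekly schedule of runtime hours per day (a simplification; real schedules also have changeovers and unplanned stops):

```python
def calendar_hours_until(rul_op_hours, weekly_hours_by_day):
    """Walk a repeating weekly duty schedule (operating hours per day,
    Mon..Sun) and return calendar hours until the operating-hour RUL
    is consumed. Assumes each day's runtime occurs at the start of the
    day -- a simplification for illustration."""
    if not any(weekly_hours_by_day):
        raise ValueError("schedule has no runtime")
    remaining, calendar, day = rul_op_hours, 0.0, 0
    while True:
        runtime = weekly_hours_by_day[day % 7]
        if runtime >= remaining:
            # Threshold is crossed partway through this day's run.
            return calendar + remaining
        remaining -= runtime
        calendar += 24
        day += 1

# 8 hours/day, 5-day week: 40 operating hours spans one production week.
schedule = [8, 8, 8, 8, 8, 0, 0]
```

With this schedule, 40 operating hours of RUL resolves to 104 calendar hours, but 44 operating hours resolves to 172 calendar hours, because the extra 4 hours of runtime sits on the far side of an idle weekend. That discontinuity is exactly what a calendar-time display hides.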
What Field-Validated RUL Accuracy Actually Looks Like
Across our current pilot deployments, the median absolute error on bearing RUL estimates (measured for the eight failure events we detected in advance) is 11.4 hours, with a 90th percentile of 28 hours. These are not controlled test rig conditions — these are production assets with variable load, periodic maintenance, and contamination events. By comparison, the PRONOSTIA benchmark figures cited earlier (RMSE from under 5 hours to 20 hours) are measured in controlled conditions designed to produce clean degradation curves.
An 11-hour median error is still useful. For a bearing with 20 hours of RUL, an error of 11 hours means the alert arrives somewhere between 9 and 31 hours before failure — a wide enough window to schedule maintenance in most production environments. The goal is not precise failure time prediction; it is classification into actionable urgency buckets: "plan next scheduled stop," "plan within 48 hours," "interrupt current production run."
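The bucketing logic can be made explicit. The sketch below maps an RUL range onto the three urgency buckets named above; the 48-hour cutoff and the conservative use of the range's lower bound are illustrative policy choices, not EdgeRun's published defaults.

```python
def urgency_bucket(rul_low_hours, rul_high_hours, next_stop_hours):
    """Classify an RUL range into an actionable urgency bucket.
    Uses the lower bound of the range, so a wide (uncertain) range
    is treated conservatively."""
    if rul_low_hours > next_stop_hours:
        return "plan next scheduled stop"
    if rul_low_hours > 48:
        return "plan within 48 hours"
    return "interrupt current production run"
```

For the 18–36 hour example above, with the next scheduled stop 72 hours out, the lower bound of 18 hours forces the most urgent bucket — the point estimate of 27 hours alone would not have.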
See RUL estimation on your assets
Request a demo to understand how degradation velocity tracking applies to your equipment class.
Request a Demo