Marketplace: Condition Monitoring

Scheduling Corrective Action: How Does Next Tuesday Sound?

By Brogan Morton

Condition monitoring systems can make owners aware of equipment problems so that repairs can be scheduled. But is there such a thing as an ideal time for alerts?

The main goal of wind turbine predictive maintenance (PdM) is simple: to understand equipment health so that corrective actions can be planned and scheduled proactively, allowing maintenance to be done at the most cost-effective time. What is most cost-effective varies by operator, but considerations include the availability of replacement parts and lead times; the revenue impact of the required downtime (e.g., high-wind vs. low-wind periods); and the probability of damage spreading through the system, increasing the cost of repair.
The primary job of a condition monitoring system (CMS) is to alert an operator to a fault, with sufficient time to plan the corrective action and efficiently activate the supply chain. This clarification of the CMS goal is critical – an alert is only useful if it is early enough to allow for maintenance planning. If an alert is sent too late, the fault will be large and create secondary damage within the gearbox, greatly increasing the cost of repair and not allowing enough lead time to plan the maintenance efficiently. If an alert is sent too early, the fault may be so small that a visual inspection may not be able to identify it and it will be chalked up as a false alarm. The question then becomes: When is the right time to alert an operator?
Framing the issue
To start, it is important to clarify what an alert means. At its simplest, an alert is a message sent by the CMS to the operator that a fault has occurred. It is meant to spur the operator toward an appropriate corrective action. But what criterion is used to determine when to send the message? From the perspective of a CMS, an alert is a message, triggered by a parameter exceeding a predefined threshold, that action is required. An alert has three critical parts: the monitored parameter, the predefined threshold, and the required action.
Any parameter value below the threshold indicates the system is “unfaulted,” while an exceedance indicates it is “faulted.”
For simplicity’s sake, we will assume the monitored parameters are positively correlated with fault severity; that is, they increase when a fault is present.
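The three-part alert logic described above can be sketched in a few lines of Python. The function name, parameter names, and action strings here are illustrative, not taken from any particular CMS:

```python
# Minimal sketch of the three parts of an alert: a monitored
# parameter, a predefined threshold, and a required action.
# Names and values are illustrative only.

def check_alert(parameter_value: float, threshold: float) -> str:
    """Classify a component and return the required action."""
    if parameter_value > threshold:
        # Exceedance: the component is classified as "faulted".
        return "faulted: schedule inspection"
    # At or below the threshold: the component is "unfaulted".
    return "unfaulted: no action"

print(check_alert(95.0, 90.0))  # e.g., oil temperature above a 90 C limit
print(check_alert(70.0, 90.0))
```

Everything that follows in this article is, in effect, about how to choose the parameter and the threshold so that this simple comparison fires at the right time.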
The simplest example of this sort of alert is one that modern wind turbines already perform: an oil temperature exceedance. In this case, the monitored parameter is the oil temperature itself. The SCADA system captures this value from an onboard temperature probe. The threshold is a predefined temperature: in this case, 90°C.
If the oil temperature is greater than the threshold, the operator is alerted and can take corrective action. Unfortunately, wind turbine drivetrains experience wide swings in rotor speed and power output, so oil temperature varies widely even in a healthy gearbox. A fixed threshold must be set high enough to avoid false alarms during normal high-load operation, which means a simple threshold based on direct SCADA measurements, without additional processing, is not effective at catching a fault early.
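One common way to account for varying operating conditions, sketched here purely for illustration and not as the method of any specific CMS, is to compare each measurement against the value expected at the current operating point, e.g., by binning healthy historical data by power output and looking at the residual. All bin widths and data values below are made up:

```python
# Illustrative sketch: instead of a fixed threshold on raw oil
# temperature, compare each reading to the historical average for
# similar power output, and alert on the residual instead.

from statistics import mean

def power_bin(power_kw: float, bin_width: float = 500.0) -> int:
    """Map a power reading to a coarse operating-condition bin."""
    return int(power_kw // bin_width)

def baseline_by_bin(history: list[tuple[float, float]]) -> dict[int, float]:
    """history: (power_kw, oil_temp_c) pairs from healthy operation."""
    bins: dict[int, list[float]] = {}
    for power_kw, temp_c in history:
        bins.setdefault(power_bin(power_kw), []).append(temp_c)
    return {b: mean(temps) for b, temps in bins.items()}

def residual(power_kw: float, temp_c: float, baseline: dict[int, float]) -> float:
    """Measured temperature minus the healthy baseline for this power bin."""
    return temp_c - baseline[power_bin(power_kw)]

# Made-up healthy history: hotter oil at higher power is normal.
history = [(400, 55.0), (450, 57.0), (1800, 72.0), (1900, 74.0)]
baseline = baseline_by_bin(history)

# A 78 C reading is unremarkable at full power but anomalous at low power.
print(residual(1850, 78.0, baseline))  # small residual
print(residual(420, 78.0, baseline))   # large residual: worth an alert
```

The same 78°C reading produces a small residual at high power and a large one at low power, which is exactly the distinction a fixed threshold on the raw measurement cannot make.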
Many operators already understand – and have perhaps experienced firsthand – when it can be too early to alert: the confidence-eroding “false alarm.”
There are two different kinds of classification errors a CMS can make. The first occurs when the CMS classifies a component as unfaulted when that component actually is faulted. This is commonly referred to as a “missed detection.” The second type of error arises when a component is classified as faulted but no fault exists. This is the more familiar “false alarm.” See Figure 2 for a graphical representation of these two errors.
When parameter distributions are known for both unfaulted and faulted components, setting the threshold to balance missed detections against false alarms is straightforward.
Classification errors can occur because the parameter values used to trigger an alarm are inherently noisy due to the varying operating conditions of the wind turbine. This parameter variance (width of the distribution) can make it difficult to distinguish between an unfaulted and faulted component. For this reason, the rates at which these two errors occur are typically inversely correlated. If the threshold is moved to the right on the graph in Figure 2, the false alarm rate will be reduced while the missed detection rate will increase.
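The inverse relationship between the two error rates can be made concrete with a small sketch. Assuming, purely hypothetically, Gaussian parameter distributions for the unfaulted and faulted populations (the means and standard deviations below are invented), moving the threshold to the right lowers the false-alarm rate while raising the missed-detection rate:

```python
# Illustrative sketch of the Figure 2 trade-off, assuming
# (hypothetically) Gaussian distributions for the monitored
# parameter in unfaulted and faulted components.

from math import erf, sqrt

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def error_rates(threshold, mu_ok, sigma_ok, mu_fault, sigma_fault):
    # False alarm: an unfaulted component lands above the threshold.
    false_alarm = 1.0 - normal_cdf(threshold, mu_ok, sigma_ok)
    # Missed detection: a faulted component lands below the threshold.
    missed = normal_cdf(threshold, mu_fault, sigma_fault)
    return false_alarm, missed

# Made-up distributions: unfaulted ~ N(1.0, 0.3), faulted ~ N(2.0, 0.4)
for thr in (1.3, 1.5, 1.7):
    fa, md = error_rates(thr, 1.0, 0.3, 2.0, 0.4)
    print(f"threshold={thr}: false alarm={fa:.3f}, missed detection={md:.3f}")
```

Narrowing either distribution (better signal processing) shrinks the overlap, allowing both error rates to be low at the same threshold.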
In practice, setting appropriate thresholds is even more difficult because operators typically don’t have measurements of what a faulted component looks like (red distribution in the graph shown in Figure 2), so balancing false alarms against missed detections is impossible. In the best case, CMS thresholds are set based on knowing what an unfaulted component looks like (green distribution in the graph) and applying a predefined probability of false alarms. This threshold-setting trade-off highlights the fact that the signal processing done to generate the parameter value is critical. Good processing will reduce the variance of the distribution; poor processing will leave it large.
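Setting a threshold from healthy data alone, at a chosen false-alarm probability, can be sketched as follows. The assumption that healthy parameter values are roughly normally distributed, and all the numbers below, are illustrative:

```python
# Illustrative sketch: with no faulted-component data available,
# set the threshold so that a healthy component exceeds it with a
# chosen (small) probability. The missed-detection rate remains
# unknown; all we control is the false-alarm rate.

from statistics import NormalDist, mean, stdev

def threshold_for_false_alarm(healthy_values, p_false_alarm: float) -> float:
    """Threshold exceeded by a healthy component with probability
    p_false_alarm, assuming roughly normal healthy data."""
    mu = mean(healthy_values)
    sigma = stdev(healthy_values)
    # Inverse CDF of the fitted healthy distribution.
    return NormalDist(mu, sigma).inv_cdf(1.0 - p_false_alarm)

# Made-up healthy parameter values clustered around 1.0.
healthy = [0.9, 1.0, 1.1, 0.95, 1.05, 1.0, 0.98, 1.02]
thr = threshold_for_false_alarm(healthy, 0.001)  # 0.1% false-alarm target
print(round(thr, 3))
```

Note how the threshold scales with the spread of the healthy data: noisier processing (larger sigma) pushes the threshold higher, delaying detection of real faults.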
Even if the threshold is set so that false alarms are eliminated, there is still the possibility of alarming too early.
There are cases when a CMS can sense damage when it is very small – so small that a visual inspection might not detect it. An excellent example of this is a bearing that has a dent on the race due to debris over-rolling. In Figure 3, a borescope photo depicts a dent on the inner race of a bearing. The CMS detected the fault, but this fault could be very easy to miss during a visual inspection given its small size. If not for the professionalism of the borescope technician, this detection could have easily been chalked up to a false alarm.
Small faults like the previous example bring up another intriguing question: If a fault this size is detected and confirmed, is it appropriate to replace the component? One of the goals of a PdM program is to get the maximum service life out of all components. Thus, keeping the component in place until it must be replaced to avoid secondary damage is the intent. Depending on the location of the dent, it is conceivable that a dent can last for months or years before it starts to propagate into a larger fatigue-driven fault. Unfortunately, no one has a crystal ball to predict which dents will propagate and which won’t, so a wait-and-see approach would be appropriate. Once the fatigue process starts, there is no stopping it: The larger the fault gets, the faster the fault grows.
...Or too late?
Catastrophic failure of a component that stops power production is an obvious example of an alert that is too late, but it is not the only sign. What severity of fault is too much?
The primary concern about operating with a known fault is the possibility of creating secondary damage. Secondary damage is any additional system damage that occurs as a result of the initial (primary) fault. An excellent example of how secondary damage can be created is the fatigue failure of a rolling element bearing in turbine gearboxes (see Figure 4). A bearing fatigue fault creates metal debris particles that are liberated from the races. When those particles are over-rolled, they create dents on the contact surface (bearing or gears). These dents create stress concentrations on the contact surface and accelerate the fatigue cycle. The dents can eventually create entirely new faults on other components, thus the name “secondary damage.”
In Figure 4, a bearing fatigue fault (left) creates metal debris particles (center). These particles make their way into other components where they are over-rolled, creating dents (right). These dents turn into additional faults.
Because the primary goal of PdM is to perform corrective action when it is most cost-effective, changing a component after it has already created secondary damage is too late. Therefore, it is important to perform maintenance before a significant amount of metal debris has been created to minimize the probability of secondary damage.
There is a window of opportunity regarding fault severity that can truly optimize the maintenance cost. Ideally, the maximum service life of the component has been achieved, but further damage has not occurred or increased the cost of maintenance. Figure 5 shows an example of a fault where fatigue failure has just started. This is an excellent time to replace the component.
As previously noted, the earliest a system should alert an operator is when the fault is large enough to be physically verified. Anything sooner will be construed as a false alarm and will erode confidence in the system. The latest an alert should be sent is when there is a high probability of secondary damage. The optimal time to change a component is once the initial damage creates its first debris particle. As evidenced in Figure 5, the fault was caught at the ideal time.
The fault started as a crack on the bearing race. After enough loading cycles, the edges of the crack started to fatigue and created debris. This can be seen as the small pits near the middle of the crack. From here, the fault will continue to create debris at an increasing rate, escalating the possibility of secondary damage.
It is important to understand how a CMS will inform your PdM program. Operators must know when to expect alerts and how those alerts will drive action to reduce maintenance costs across the organization. Discuss alert timing with any prospective CMS providers to understand their philosophy as well as the technology’s capability. Dealing with this important integration issue early will pay dividends later for the PdM program.