Using Work-Orders (WOs) as the reliability data source for calculating equipment MTBF (Mean Time Between Failures) is common practice in many large organizations. Meanwhile, many software vendors exploit this demand by providing applications that extract WO data to generate MTBF values for the corresponding equipment.
This presentation describes the flaws in using WOs captured in a CMMS (Computerized Maintenance Management System) to derive equipment MTBF. First, I state the underlying assumptions made when WOs are used to derive MTBF. I then explain the issues associated with these assumptions in practice.
Work-Orders are not meant to serve as a data source for reliability analysis.
To derive MTBF from Work-Orders, the software developer makes the following assumptions:
- Each Corrective Maintenance Work-Order (CM-WO) for a piece of equipment corresponds to one failure of that equipment.
- The interval from repair start-time to repair end-time of a CM-WO is considered equipment downtime.
- The duration between two consecutive CM-WOs of a piece of equipment is considered operating time.
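Taken together, the three assumptions reduce to a simple calculation for one piece of equipment. The sketch below shows what a WO-based tool might do under those assumptions; it is not any vendor's actual implementation, and the `CorrectiveWorkOrder` record, the tag `P-101` and the day numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CorrectiveWorkOrder:
    """Hypothetical CM-WO record; times are day numbers in the observation window."""
    equipment: str
    repair_start: float
    repair_end: float


def naive_mtbf(work_orders, window_days):
    """MTBF for one piece of equipment, computed the way a WO-based tool would.

    Assumption 1: every CM-WO is exactly one failure.
    Assumption 2: repair_start..repair_end is the whole downtime.
    Assumption 3: all remaining time in the window is operating time.
    """
    if not work_orders:
        raise ValueError("no corrective work orders: MTBF is undefined")
    failures = len(work_orders)
    downtime = sum(wo.repair_end - wo.repair_start for wo in work_orders)
    operating_time = window_days - downtime
    return operating_time / failures


# Hypothetical CM-WOs of a single pump over one year.
wos = [
    CorrectiveWorkOrder("P-101", 70, 80),
    CorrectiveWorkOrder("P-101", 200, 205),
]
print(naive_mtbf(wos, window_days=365))  # → 175.0
```

Each of the problems below breaks one of the three inputs to this function: the failure count, the downtime, or the operating time.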
The following problems show where these assumptions break down in practice.
Problem 1: A failure may have multiple Work-Orders
It is common for multiple Notification-Orders/Work-Orders to be issued for a single failure, which may result in that failure being counted several times.
The analyst needs to ensure that WOs referring to the same failure are combined into a single event, to avoid counting the same event repeatedly. This is not an easy task!
If this problem is not handled properly, the first assumption will be violated.
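One way to approach the merging step is to group WOs on the same equipment whose repair intervals overlap or nearly touch. This is a heuristic assumed for illustration only; in practice, linking WOs through a shared notification number or failure description is usually more reliable. The `gap_days` threshold and the sample WOs are hypothetical.

```python
from collections import defaultdict


def merge_duplicate_work_orders(work_orders, gap_days=1.0):
    """Group (equipment, start, end) WO tuples into failure events.

    Heuristic (an assumption, not a CMMS feature): two WOs on the same
    equipment belong to the same failure if the later one starts no more
    than gap_days after the current event ends.
    """
    by_equipment = defaultdict(list)
    for equipment, start, end in work_orders:
        by_equipment[equipment].append((start, end))

    events = []
    for equipment, intervals in by_equipment.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + gap_days:
                # Overlapping or near-adjacent WO: extend the current event.
                cur_end = max(cur_end, end)
            else:
                events.append((equipment, cur_start, cur_end))
                cur_start, cur_end = start, end
        events.append((equipment, cur_start, cur_end))
    return events


wos = [
    ("P-101", 70, 75),    # hypothetical mechanical WO
    ("P-101", 74, 80),    # hypothetical electrical WO for the same failure
    ("P-101", 200, 205),
]
print(merge_duplicate_work_orders(wos))  # → [('P-101', 70, 80), ('P-101', 200, 205)]
```

Three WOs collapse into two failure events. A threshold that is too large would wrongly merge genuinely separate failures, which is why this step needs the analyst's judgment.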
Problem 2: Error in downtime calculation
Consider a Crude-Oil-Transport-Pump (COTP) system that consists of three pumps (A, B and C) with a “2 active and 1 standby” operating policy.
On day 50, Pump A failed; a Work-Order was raised, and repair started on day 70 and was completed on day 80. The actual downtime for Pump A was 30 days (day 50 to day 80), but the repair time was only 10 days.
If the analyst relies on the Work-Order record as the data source, the downtime would be only 10 days! To correct this error, the analyst needs to look up the malfunction start-time in the corresponding Notification-Order.
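The correction amounts to taking the downtime start from the Notification-Order rather than from the Work-Order. A minimal sketch using the Pump A numbers; the record layouts and the IDs `N-1001` and `WO-2001` are hypothetical and do not follow any particular CMMS schema.

```python
# Hypothetical records; field names are illustrative only.
notifications = {"N-1001": {"equipment": "Pump A", "malfunction_start": 50}}
work_orders = [
    {"wo": "WO-2001", "notification": "N-1001", "repair_start": 70, "repair_end": 80},
]


def downtimes(work_orders, notifications):
    """Return {wo_id: (repair_time, actual_downtime)} for each Work-Order."""
    result = {}
    for wo in work_orders:
        note = notifications[wo["notification"]]
        # What a WO-only tool reports as downtime.
        repair_time = wo["repair_end"] - wo["repair_start"]
        # The equipment is down from the malfunction start, not the repair start.
        actual_downtime = wo["repair_end"] - note["malfunction_start"]
        result[wo["wo"]] = (repair_time, actual_downtime)
    return result


print(downtimes(work_orders, notifications))  # → {'WO-2001': (10, 30)}
```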
This problem makes the second assumption difficult to satisfy.
Problem 3: Standby duration cannot be obtained from the CMMS
The following tri-state plot shows the operating history of three identical Gas-Turbines (with a “2 active and 1 standby” operating policy) in a power generation system from 1 September 2016 to 18 February 2019 (900 days). There were 5 failures.
If we rely on the CMMS, we observe 5 failures over 900 days for 3 identical units, with a combined downtime of 80 days for all Gas-Turbines. The CMMS cannot tell us how long each unit spent on standby.
The MTBF of an individual Gas-Turbine = (3 x 900 - 80) / 5 = 2620 / 5 = 524 days.
The problem with this calculation is that it treats the 900 calendar days of each unit (minus downtime) as operating time, and therefore gives credit for standby duration.
Standby duration should be removed from the calendar time.
The correct calculation is as follows:
From the operating profile, determine the total run time of each unit:
GT-7510A: 368 days
GT-7510B: 304 days
GT-7510C: 328 days
Total running time: 368 + 304 + 328 = 1000 days
Total number of failures: 5
MTBF = 1000/5 = 200 days
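The two calculations differ only in the exposure time placed in the numerator. The sketch below reproduces both with the Gas-Turbine figures, assuming the 80 days are the combined downtime of all three units and are therefore subtracted once from the 3 x 900 unit-days.

```python
CALENDAR_DAYS = 900
UNITS = 3
COMBINED_DOWNTIME = 80  # days, all three Gas-Turbines together
FAILURES = 5
RUN_TIME = {"GT-7510A": 368, "GT-7510B": 304, "GT-7510C": 328}  # days, from the profile


def mtbf(exposure_days, failures):
    """MTBF = exposure time / number of failures."""
    return exposure_days / failures


# Calendar-based exposure: standby time is (wrongly) counted as operating time.
calendar_exposure = UNITS * CALENDAR_DAYS - COMBINED_DOWNTIME
# Run-time-based exposure: only the time the units were actually running.
run_exposure = sum(RUN_TIME.values())

print(mtbf(calendar_exposure, FAILURES))  # → 524.0
print(mtbf(run_exposure, FAILURES))       # → 200.0
```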
Relying on CMMS information will overestimate the reliability of the assets; in this example, it overestimates MTBF by more than a factor of two.
Problem 4: Non-failure events (process events)
Work-Orders do not capture downtime events caused by process upsets. Normally, no Work-Order is generated for a downing event that is not due to equipment failure, so the corresponding downtime is not captured.
For example, suppose equipment A is down because equipment B has failed. A Work-Order is issued for equipment B but not for A, so the downtime of A is not captured.
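The gap can be made concrete by comparing the downtime of A seen through Work-Orders with the downtime seen in a complete event log. A minimal sketch; the event-log layout and all day numbers are hypothetical.

```python
# Hypothetical event log: (equipment, kind, start_day, end_day, has_work_order)
events = [
    ("B", "failure", 100, 104, True),   # B fails; a WO is raised for B
    ("A", "process", 100, 104, False),  # A is down because B failed; no WO for A
    ("A", "failure", 300, 302, True),   # A's own failure, with a WO
]


def downtime(events, equipment, work_orders_only):
    """Total downtime of one equipment, optionally only what WOs reveal."""
    total = 0
    for eq, _kind, start, end, has_wo in events:
        if eq != equipment:
            continue
        if work_orders_only and not has_wo:
            continue  # invisible to a WO-based tool
        total += end - start
    return total


print(downtime(events, "A", work_orders_only=True))   # → 2
print(downtime(events, "A", work_orders_only=False))  # → 6
```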
CMMS systems are designed by accountants and are great tools for accounting and financial purposes. Reliability engineers are unlikely to be involved in their design, so the requirements for reliability analysis are not considered. Using Work-Orders as a reliability data source is a recipe for failure for an enterprise reliability program.
A data collection system should capture the following information:
- Start-time and end-time of each equipment failure event, with its failure mode.
- Start-time and end-time of each process event, with its root cause.
- Equipment standby time.
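A record structure along the lines of the list above is enough to split an observation window into the three states (running, standby, down). The sketch below is one possible layout, not a prescribed schema: the `Event` and `State` names, the tag `GT-A` and the causes are hypothetical, and it assumes recorded periods for one piece of equipment do not overlap.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    DOWN_FAILURE = "down (failure)"
    DOWN_PROCESS = "down (process event)"
    STANDBY = "standby"


@dataclass
class Event:
    """One recorded non-running period (hypothetical field names)."""
    equipment: str
    state: State
    start: float     # day number
    end: float       # day number
    cause: str = ""  # failure mode or process root cause


def time_breakdown(events, equipment, window_start, window_end):
    """Return (run_time, {State: time}) for one piece of equipment.

    Assumes its recorded periods do not overlap and that any time not
    covered by a recorded period is running time.
    """
    totals = {s: 0.0 for s in State}
    for e in events:
        if e.equipment != equipment:
            continue
        # Clip each period to the observation window.
        start, end = max(e.start, window_start), min(e.end, window_end)
        if end > start:
            totals[e.state] += end - start
    run = (window_end - window_start) - sum(totals.values())
    return run, totals


events = [
    Event("GT-A", State.DOWN_FAILURE, 10, 20, "bearing seizure"),
    Event("GT-A", State.STANDBY, 20, 50),
    Event("GT-A", State.DOWN_PROCESS, 80, 85, "fuel gas trip"),
]
run, totals = time_breakdown(events, "GT-A", 0, 100)
print(run)  # → 55.0
```

Treating uncovered time as running time is only safe when standby periods are recorded as diligently as failures; otherwise the sketch repeats the calendar-time error of Problem 3.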
The analyst should avoid using calendar time to approximate equipment uptime unless the equipment operates 24/7 and non-operating time is negligible.
The objective of the data collection system is to capture events so that the historical operating status can be reproduced with reasonable precision (Fig. 2). This allows the analyst to accurately derive the failure rate behavior and downtime distribution of each critical piece of equipment.
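Once failure, process and standby periods are recorded, both MTBF and the downtime distribution follow directly from the records. A minimal sketch for one piece of equipment; all periods are hypothetical, periods are assumed not to overlap, and only the mean and median downtime are shown rather than a fitted distribution.

```python
import statistics

# Hypothetical records for one critical equipment over a 365-day window.
WINDOW_DAYS = 365
failures = [(40, 43), (150, 158), (290, 292)]  # (start, end) of failure downtime
process_events = [(100, 101)]                  # downtime not caused by own failure
standby = [(200, 260)]                         # periods on standby


def duration(periods):
    """Total length of non-overlapping (start, end) periods."""
    return sum(end - start for start, end in periods)


# Run time excludes every non-running state, not just failure downtime.
run_time = WINDOW_DAYS - duration(failures) - duration(process_events) - duration(standby)
mtbf = run_time / len(failures)
failure_downtimes = [end - start for start, end in failures]

print(run_time)                              # → 291
print(mtbf)                                  # → 97.0
print(statistics.median(failure_downtimes))  # → 3
print(statistics.mean(failure_downtimes))
```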