Monday, July 28, 2014

Functional models - a tool for incident prediction and avoidance?

In the June issue of Hydrocarbon Processing one finds an article titled "A new era emerging for incident prediction and avoidance strategies" by D. Hill. In this viewpoint on automation strategies D. Hill advocates the use of multivariable statistical process control (SPC), principal component analysis (PCA) and conditional logic as tools for predicting incidents before they happen.

However, these tools incorporate only a limited amount of process knowledge, and I believe that accident prediction could benefit significantly from using process knowledge. One tool which is able to capture process knowledge and use it to analyze the current state of the process is multilevel flow modelling (MFM), as developed by M. Lind at the Technical University of Denmark. MFM allows the engineer to create a model of the process in the goals - means domain. This model can then be used to reason about possible causes and consequences of currently measured process deviations, and the results can be presented to the operator in a way that allows the operator to easily take action on the measured deviations before they develop into a major plant event.

M. Lind and co-workers have demonstrated how MFM can be used to assist in HAZOP analysis of complex facilities such as chemical plants or nuclear power plants. Currently work is ongoing on using MFM for operator assistance during abnormal situations in nuclear power plants and other complex facilities.

The mentioned tools for assisting in HAZOP analysis can easily be modified to predict how minor observed process deviations can develop into major accidents, and what process consequences such deviations may have. This allows process knowledge to be used, since it is captured in the goals - means structure of the MFM models. Predicting accident impact on surrounding communities will require additional modelling efforts.
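
The kind of reasoning described above can be illustrated with a very small sketch. It is not an MFM model: a real MFM model is built from goals, functions and means-end relations, whereas this sketch only uses a flat cause - consequence table. The table entries and the function names are hypothetical and chosen purely for illustration.

```python
from collections import deque

# Hypothetical cause -> consequence relations for a distillation column.
# Illustrative only; not taken from any published MFM model.
CONSEQUENCES = {
    "loss of cooling water": ["low condenser duty"],
    "low condenser duty": ["high column pressure"],
    "reflux pump trip": ["low reflux flow"],
    "low reflux flow": ["high top temperature"],
    "high column pressure": ["relief valve lifts"],
    "high top temperature": ["off-spec distillate"],
}


def predict_consequences(deviation):
    """Return all downstream consequences of a deviation, nearest first."""
    seen, order, queue = {deviation}, [], deque([deviation])
    while queue:
        current = queue.popleft()
        for nxt in CONSEQUENCES.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def possible_causes(deviation):
    """Return all upstream deviations that could explain the observed one.

    Assumes the table contains no cycles.
    """
    causes = [c for c, effects in CONSEQUENCES.items() if deviation in effects]
    for cause in list(causes):
        causes.extend(c for c in possible_causes(cause) if c not in causes)
    return causes


if __name__ == "__main__":
    print(predict_consequences("loss of cooling water"))
    print(possible_causes("relief valve lifts"))
```

Presenting both lists to the operator for each measured deviation is the simplest form of the decision support discussed here; ranking the entries by likelihood or severity would require additional process knowledge.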

Are you getting to the bottom line?

The purpose of investigating a process accident or deviation is to reduce future operating cost by avoiding a repeat event. H.P. Bloch's reliability editorial "Small deviations can compromise equipment reliability" in the June 2014 issue of Hydrocarbon Processing details how a change from liquid oil lubrication to pure oil mist lubrication can result in reduced bearing reliability. My conclusion from reading the editorial is that the refinery had simply used a centrifugal pump designed for liquid oil lubrication in an application which used pure oil mist lubrication.

The pump which experienced thrust bearing failures was manufactured by a well-known company, and the bearings were in compliance with current recommended practice for centrifugal pumps: API-610. However, in this particular case, with a rotational speed of 3560 rpm and a linear velocity of 2780 fpm, better guidelines can be found in books (see the H.P. article for references) than in API-610.

So what is the issue here? When unexpected events happen during the operation of a plant, such as premature bearing failures, there is a learning opportunity. If one simply replaces the broken part without further investigation, then this learning opportunity is missed, and so is an opportunity to improve the long-term performance of the unit.

Unfortunately the editorial does not cover why pure oil mist lubrication was selected in this particular application, or why the pump design was not the best for this particular choice of lubrication. Neither do we know when these choices were made. My philosophy is that to gain maximum learning benefit from a plant deviation, human, technical and organisational factors must all be analysed. Stopping short of that has a lasting impact on your bottom line.


Sunday, July 06, 2014

Is the peer review process of scientific publications failing?

In old times the editors of scientific magazines chose who to send a paper to for review. However, lately this has changed. Now authors are typically asked to suggest three persons (the authors' friends?) whom they think could review their contribution. I think this is a dangerous development. I think the result is less critical reviews and a declining quality of published papers.

I will demonstrate what I think reviewers have missed by analysing two publications, both of which I definitely think deserved to be published, but which in my view could have been improved by a more critical review.

The first paper is "Methodology for the Generation and Evaluation of Safety System Alternatives Based on Extended Hazop" by N. Ramzan et al., and it was published in the March 2007 issue of Process Safety Progress (Vol. 26, No. 1, pp. 35-42). My first question is what is "Extended Hazop"? It turns out that by this the authors mean combining a traditional Hazop study with dynamic simulation. I think this is a great idea, and one which few people considered 7 years ago. I was fascinated by "Generation and Evaluation", and hoped the authors would present procedures both for generation of safety systems and for evaluation of safety systems. It turned out that I was disappointed on both counts.

The introduction in my view attempts to paint a picture that major accidents are particular to the chemical processing industries. This is done by mentioning only chemical processing industry events and by choosing to reference just another publication instead of event-focused publications, such as "The 100 Largest Losses (in the Hydrocarbon Industry) 1974-2013" published by Marsh Risk Consulting. This year's edition is the 23rd, but the authors should have had access to the 19th. Another more appropriate reference could be appendices 1-8 of Frank P. Lees' "Loss Prevention in the Process Industries" (2nd Edition) or the updated 3rd Edition edited by Sam Mannan. I also wonder what the authors mean by "the old concepts of accident prevention". The reference provided to "Chemical Process Safety: Fundamentals with Applications" by DA Crowl and JF Louvar does not provide the answer.
Then the introduction continues with a list of techniques claimed to be the most common in the chemical processing industries. However, the list is equally common to other process industries, and most of the techniques are also common to other industries. Quite honestly I do not understand the purpose of this list, nor of the reference provided for the claim that "no single technique can support all the aspects of safety/risk".
The next paragraph of the introduction starts "Several methodologies of risk analysis have been presented so far...". It would have been more correct to say "mentioned" instead of "presented". After this a list of books is referenced. Two are textbooks, by Skelton (name misspelled in the reference list) and by Wells respectively. The other two are guideline books by the CCPS. In my view the authors should choose to reference either textbooks or guideline books - but not both. The reference to Tixier et al.'s survey of tools is most welcome.
The third paragraph of the introduction mentions the lack of standards for safety/risk analysis methodology. But I would like to disagree with the statement that the tools used are based on "...judgemental contribution of the analyst or the plant manager". I certainly hope this is not the case!
The final paragraph of the introduction introduces the idea of using dynamic simulation to simulate large variations of design/operational parameters. To me simulation of variations in design parameters makes little sense after the design is complete, and I would also question whether dynamic simulations can actually cope with events such as loss of the reflux pump or loss of cooling water to the condenser. At the end of this paragraph the purpose of the paper is stated as "...a combination of conventional risk analysis techniques and process disturbance simulation...for safety/risk analysis and optimization". Nothing about generation of safety system alternatives or their evaluation! There seems to be a major difference between the title of the paper and the stated purpose at the end of the introduction. The peer review process should have pointed this discrepancy out.
The second section of the paper by Ramzan et al. describes a four step methodology based on extended hazop - without defining what is meant by extended hazop.
The first step is the usual task in preparation of a hazop study. Unfortunately most academic authors don't provide access to documents and other information collected in this first step. Such documents would of course be of tremendous value to others wanting to learn the hazop methodology.
The description of the second step starts with a very true statement, that "the biggest source of error in hazard analysis is failure to identify the ways in which a hazard can occur". These are normally called the causes of the hazards. Indeed, I find that many - especially students - have great difficulty distinguishing between hazards, their causes and their consequences. The section then continues with a discussion of a traditional hazop study and the requirements for conducting this, leading to a statement of the purpose of the work "...to identify weak points arising from disturbances in operation, which may or may not be hazardous, to improve safety, operability, and/or profit at the same time". That is not exactly what I expected from the title of the paper, and to me it sounds like a rather unclear purpose ("weak points"?) with possibly conflicting objectives (safety versus profit). The authors go on to state that the analysis of disturbances is based "on shortcut or simplified hand calculations supported by dynamic simulation". Unfortunately the authors do not explain what shortcut or simplified calculations are used in the distillation example of the paper, but they do have some very positive statements about Aspen Dynamics after briefly mentioning other dynamic simulators. The section then continues with a clear five point description of the differences between traditional hazop and hazop supported by dynamic simulation (Extended Hazop). These differences are 1) identifying consequences using dynamic simulation, 2) ranking consequences in eight classes, 3) identifying the frequency of each possible consequence, 4) documenting the results in an extended hazop worksheet, and 5) ranking the hazop results. This to me sounds like a description of QRA using a risk matrix with 9 consequence classes and 10 frequency classes.
Most of the industrial risk matrices I have seen are limited to 4 to 5 consequence classes and 3 to 5 frequency classes, since the uncertainty in the calculations and the assumptions they are based on do not justify a larger risk matrix.
The third step involves what is called either the risk potential matrix or the hazop decision matrix. Here the authors use consequences and hazards interchangeably. The matrix of 9 by 10 risk categories is condensed to just four risk levels: 1) intolerable, 2) acceptable for a short time, 3) risk reduction optional, and 4) acceptable. According to the authors the risk potential matrix is used for the following: a) status of plant safety, b) ranking of events, c) optimization proposals, d) improvements, and e) documentation. I can't help asking what happened to the safety system alternatives?
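
To make the condensation of a 9 by 10 matrix into four risk levels concrete, here is a minimal sketch. The level names are those listed above, but everything else is an assumption made for illustration: that class 1 is the mildest consequence and the rarest frequency, that the two class numbers are simply added, and the thresholds used. The paper's actual matrix boundaries are not reproduced here.

```python
RISK_LEVELS = {
    1: "intolerable",
    2: "acceptable for a short time",
    3: "risk reduction optional",
    4: "acceptable",
}


def risk_level(consequence_class, frequency_class):
    """Map one cell of a 9 x 10 risk matrix to one of four risk levels.

    Assumptions (hypothetical, not from the paper): class 1 is the
    mildest consequence / rarest frequency, and the thresholds on the
    summed score are invented for this example.
    """
    if not (1 <= consequence_class <= 9 and 1 <= frequency_class <= 10):
        raise ValueError("class outside the 9 x 10 matrix")
    score = consequence_class + frequency_class  # ranges from 2 to 19
    if score >= 15:
        return 1
    if score >= 11:
        return 2
    if score >= 7:
        return 3
    return 4


if __name__ == "__main__":
    for cell in [(9, 10), (5, 6), (3, 4), (1, 1)]:
        print(cell, RISK_LEVELS[risk_level(*cell)])
```

The sketch also shows why a fine-grained matrix buys little: 90 cells end up in four bins, so small errors in the class assignment rarely change the outcome except near a threshold.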
The fourth step is development and analysis of optimization proposals. Here the term "risk target" is introduced for the first time. This is a further indication that the article discusses a QRA approach to risk assessment. However, there is no clear description of how optimization proposals are developed, but we are told they are evaluated using dynamic simulation, event tree analysis and/or fault tree analysis.

It appears to me that the conclusions go beyond what has been presented in the paper. For example, I find the paper contains no analysis of operational failures. Neither does it contain any analysis of the effects of design improvements. A critical review should also have pointed out that the term "weak point" has not been defined. I agree with the authors' statement that dynamic simulation in combination with hazop is a powerful tool. However, I disagree with the authors' claims about the usefulness of the methodology for safety concept definition, safety analysis and safety system design. Did the reviewer actually read the conclusion?

The second paper I have looked at is "Modelling of BP Texas City refinery accident using dynamic risk assessment approach" by M. Kalantarnia et al., and this paper was published in 2010 in Process Safety and Environmental Protection (Vol. 88, pp. 191-199). I wonder what "dynamic risk assessment" could be? Is it something accident responders use during an event? Or is it the real-time risk state of the facility?
One should always be very careful with stating conclusions already in the abstract, since there is no room for a supporting argument. In this article the authors should in my view have avoided telling us what the main reason for events such as the BP Texas City refinery explosion and fire is. Especially when they are dead wrong! Such tragic events are not caused by "lack of effective monitoring and modelling approaches that provide early warning". Neither do such events occur "In spite of endless end-of-pipe safety systems". To me such statements are indications that the authors have not fully understood the event. Unfortunately, the internal preliminary accident report and the internal final accident report, both of which BP at the time made available on the internet, appear no longer to be accessible. These reports clearly show that the main cause of the event was a local management decision to start the unit before the turn-around was 100% complete (some instrumentation had not been re-commissioned) and before it was needed.
That being said, I do believe that accidents give an opportunity to look at how process monitoring could be improved, and the article by Kalantarnia et al. is indeed a very good example of this kind of research. However, implementation of the proposed system would not have prevented the events of March 23rd 2005 in Texas City. The local management failed to react to warnings in hazop updates several years before 2005, so I think they would also have overlooked the warnings of a sophisticated probability-based system.

In the first paragraph of the introduction the authors get it partly right and partly wrong. It is absolutely correct that a high level of safety and reliability requires "...the implementation of a strong safety culture within the facility". Unfortunately, not having such a culture was one of the major contributing factors to the event. However, to believe that safety and reliability are "maintained by strict regulatory procedures" is just wishful thinking, and in clear contrast to the end of the paragraph, which reads "...a culture to respect safety throughout the plant both by the personnel and the management is critical". The authors then continue with a more technical introduction in the second paragraph, where a reference to the original QRA work in the nuclear industry is encouraging. However, hazop never was and never will be part of the QRA toolset. A more critical review should have pointed this out to the authors. The references which the authors choose to mention do not all appear to be relevant for the present work. Again a more critical review could have pointed this out. I find that only the reference to the work of Khan and Amyotte (J.Loss Prev. Process Ind. (2002) pp. 467-575) is relevant for the contribution of the paper, while the others appear not to relate to the current work. In my view the highlight of the introduction is the review of work to use accident precursors in risk assessment. I really feel that here things sparkle. But I am left without a definition of dynamic risk assessment.
However, towards the end of the introduction it becomes clear that the main inspiration is the work of Meel and Seider (Chem.Eng.Sci., vol. 61, pp. 7036-7056). Here I get the idea that dynamic risk assessment is actually probabilistic risk assessment updated using real-time plant data and Bayes' theorem - I guess!

The second part of the article is titled "BP Texas City Refinery", but it could more appropriately have been titled "Raffinate splitter at ISOM unit of BP Texas City Refinery", and then the authors could have avoided filling a scientific paper with irrelevant data from the CSB report on the event, i.e. section 2.1 titled "A brief history" - which it is not, and the first two paragraphs of section 2.1 titled "Process description". A critical review should have pointed this out. Also half of section 2.3 titled "Accident description" could be omitted without losing information relevant for the present study. The final sentence of this paragraph should not have been admitted in a scientific publication: "The release of flammable let to an explosion and fire". It would be more correct to write "The released flammables found a source of ignition resulting in an explosion and fire", and one could add "which killed 15 persons and injured more than 170 persons".

The third part of the article, "Dynamic risk assessment", is the key contribution of the paper. It gives a nice, short five step description of the procedure before going into details about each step.
In the first step the authors define 18 failure scenarios. 17 of these are related to instrumentation failures, such as failure of the level sight glass or failure of the safety relief valve. This simple approach to scenario definition indicates a rather limited understanding of failure modes. I think it would be more appropriate to talk about failure of a safety valve to open on demand or leakage through the 1.5 inch reflux bypass valve on the reflux drum. How can a sight glass be labelled as a safety barrier? A reviewer should also look at inappropriate use of terminology. The 18th (or rather 1st) scenario is an operational scenario: excess feed loading. But what about others that actually happened on March 23rd, such as failure to start the flow of heavy distillate from the tower, or overheating of the feed to the raffinate splitter? Maybe these operational events are only described in BP's internal incident investigation reports, and hence the authors did not know about them. It would also have been interesting to see a discussion of the choice of parameter values (discrete value, parameter 'a', parameter 'b'). This would help others implement dynamic risk assessment. It is good that the choice of distribution function is discussed.
The second step is prior function calculation. To perform this calculation an event tree is needed. Part of this is shown in figure 3 of the paper, and it is incorrectly labelled "Part of event tree of ISOM unit in BP Texas City refinery" even though the event tree is just for the raffinate section of the ISOM unit. A mislabeling is also found in the event tree, where the "Raffinate splitter" has been labelled the "ISOM Tower". Furthermore, the initiating event of the tree is not defined. I wonder if excess feed loading is considered the initiating event?
The third step is the formation of the likelihood function, which comes with a nice discussion of the choice of this function. Unfortunately, the paper doesn't discuss how the collected accident precursor data, such as those shown in table 2 of the paper, are fitted to the distribution. But maybe that is trivial, and I just need to take a statistics course online.
The fourth step is the calculation of the posterior function from the prior function and the likelihood function. This apparently is where the event tree plays a key role. So it appears to me that essentially the paper shows a methodology for updating an event tree using likelihood functions for a single event - excess feed loading.
Essentially it is QRA for a single event, and this is a long way from a monitoring system to provide early warning.
The fifth and final step is consequence assessment. Unfortunately the authors limit themselves to the consequences asset loss, human fatality, environmental loss, and image loss, and hence neglect other forms of human loss. They go on to state that in the BP case the main focus is on asset loss and human fatality - hence neglecting the 180 persons injured in the event. They then match each group of end states to a severity class, i.e. A (process upset) - severity class 1, B (process shutdown) - severity class 2 and C (release) - severity class 4.
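
Steps two to four above can be illustrated with a small sketch. It assumes a Beta prior with parameters a and b for each safety barrier and a binomial likelihood built from precursor counts, which is a common conjugate pair. Whether this matches the paper's exact choice of distributions is an assumption on my part, as are all function names and numbers. The event tree is reduced to a single branch in which every barrier must fail, and the barriers are assumed to fail independently.

```python
def update_beta(a, b, failures, successes):
    """Posterior Beta parameters after observing demands on a barrier.

    With a Beta(a, b) prior and a binomial likelihood, the posterior
    is Beta(a + failures, b + successes).
    """
    return a + failures, b + successes


def mean_failure_probability(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def release_probability(barrier_params):
    """Probability of the end state where every barrier fails.

    Assumes independent barriers; barrier_params is a list of (a, b).
    """
    p = 1.0
    for a, b in barrier_params:
        p *= mean_failure_probability(a, b)
    return p


if __name__ == "__main__":
    # Hypothetical priors for two barriers, e.g. an alarm and a relief valve.
    priors = [(1, 9), (1, 9)]
    # Hypothetical precursor data for one year: (failures, successes).
    observed = [(2, 8), (0, 10)]
    posteriors = [update_beta(a, b, f, s)
                  for (a, b), (f, s) in zip(priors, observed)]
    print("prior release probability:    ", release_probability(priors))
    print("posterior release probability:", release_probability(posteriors))
```

Repeating the update each time new precursor data arrive gives a risk estimate that changes over time, which is how I read the term "dynamic" in the paper.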

The fourth part of the paper is titled "Result and discussion", and it starts with a claim which doesn't hold. The authors claim "The BP Texas City refinery has been analyzed..", when in reality it is the "Raffinate section of the ISOM unit at the BP Texas City refinery" which has been analyzed for a single scenario. I guess that the claim could be based on the fact that the ASP data are for the whole ISOM unit, and not just the raffinate splitter section.
Unfortunately the authors report the calculated risk in dollars, and to me it looks like the cost of unplanned shutdowns goes from 104 $ to 2610 $ and the cost of a release goes from 2590 $ to 96600 $. These are significant increases, but it remains unclear to me what these numbers would tell me as an operator or manager of this facility. I also wonder how things would change if there were additional releases or shutdowns in the ASP data for the last 2-3 years. How would that change the profile? The whole refinery actually had several hazop reviews during the time frame. Could such events be included in the ASP data?
I fail to see how the authors can claim that the data in figure 4 signal the absence of inspection and maintenance plans. I am certain that the Texas City refinery has such plans. However, they may not have satisfied generally accepted standards.
Events such as leakages, shutdowns or process upsets are clearly discrete events, for which a Poisson distribution would probably be a better choice than the linear model. The diagram indicates that there is little difference in the discrete numbers, but some in the cumulative.
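
For readers who want to try the Poisson alternative, here is a minimal sketch of the probability of seeing k events in a year, and of seeing at most k events, for a given mean event rate. The rate used in the example is hypothetical and not taken from the paper's data.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when the mean count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def poisson_cdf(k, rate):
    """Probability of at most k events when the mean count is `rate`."""
    return sum(poisson_pmf(i, rate) for i in range(k + 1))


if __name__ == "__main__":
    rate = 2.0  # hypothetical mean number of shutdowns per year
    for k in range(6):
        print(k, round(poisson_pmf(k, rate), 4), round(poisson_cdf(k, rate), 4))
```

Comparing these discrete and cumulative values with the observed yearly counts is one simple way to check whether a Poisson model describes the precursor data better than a linear trend.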

While I am critical on several points about this paper, my excitement about the approach remains very high. If one could develop warning systems based on alarms - which, if ignored, lead to negative consequences and which are more frequent than the ASPs considered in the present paper - then I see dynamic risk assessment as another tool in the control engineer's toolset.

However, I remain very critical of the current state of the peer review process and of the huge pressure to publish, which creates papers far worse than the two I have critiqued here. I also maintain that both the papers critiqued here contain significant scientific contributions.
If the peer review process is not improved, and simple counting of the number of publications and citations as a measure of scientific production is not changed, then I fear for those students of the future who have to review the literature within a subject area in order to obtain a graduate degree. Or maybe literature review will just disappear.
IBM and their Watson computer have made great progress in analysing unstructured information and helping doctors diagnose rare diseases, such as cancer. I wonder if we can teach Watson to help the human reviewer by e.g. rating the relevance of different citations and missing relevant citations? I think it would be worth a test, and hope one of the major scientific publishers will approach IBM about this idea.