## Abstract

The primary purpose of an environmental monitoring program is to provide oversight for microbiological cleanliness of manufacturing operation and document the state of control of the facility. Key to the success of the program is the establishment of alert and action limits. In practice, several statistical methods including normal, Poisson, and negative binomial modeling have been routinely used to set these limits. However, data collected from clean rooms or controlled locations often display excess of zeros and overdispersion, caused by sampling population heterogeneity. Such data render it inappropriate to use the traditional methods to set alert and action levels. In this paper, a method based on a zero-inflated negative binomial model is proposed for the above instances. The method provides an enhanced alternative for trending environmental data of classified rooms, and it is demonstrated to show a clear improvement in terms of model fitting and parameter estimation.

*LAY ABSTRACT:* The primary purpose of an environmental monitoring program is to provide oversight for microbiological cleanliness of manufacturing operation and document the state of control of the facility. Key to the success of the program is the establishment of alert and action limits. In practice, several statistical methods including normal, Poisson, and negative binomial modeling have been routinely used to set these limits. However, data collected from clean rooms or controlled locations often display excess of zeros and overdispersion, caused by sampling population heterogeneity. Such data render it inappropriate to use the traditional methods to set alert and action levels. In this paper, a method based on a zero-inflated negative binomial model is proposed for the above instances. The method provides an enhanced alternative for trending environmental data of classified rooms, and it is demonstrated to show a clear improvement in terms of model fitting and parameter estimation.

- Alert and action limits
- Environmental monitoring
- Zero-inflated model
- Negative binomial distribution
- Non-parametric methods
- Poisson distribution

## Introduction

Environmental monitoring (EM) in the manufacture of pharmaceuticals provides quality assurance that the products are free of microbial contamination. It is mandated both by regulatory guidance and law (1). The U.S. Food and Drug Administration (FDA) 2004 Guidance for Industry: Sterile Drug Products Produced by Aseptic Processing—Current Good Manufacturing Practice and the U.S. Code of Federal Regulations (CFR) governing the manufacture of pharmaceutical products (21 CFR 211) clearly state the need to establish and follow procedures to prevent microbiological contamination of sterile drug products. Samples at manufacturing sites are routinely collected, assayed, and analyzed. Cumulative environmental data are used to set up alert and action limits. Data from real-time monitoring are judged against the limits so as to detect any microbial excursion that might put the product at risk. Traditionally, alert and action limits are established, using a mean of +2 or +3 standard deviations (SDs) to render certain degrees of assurance that the classified environments are in control. This method hinges on the assumption that the data or data after a transformation such as logarithmic are normally distributed (2). In the event of non-normal data, nonparametric methods can be used. A simple alternative is to use the lower and upper percentiles of the historical data to set up the limits. Another method is the non-parametric tolerance described previously (3). However, the reliability of these non-parametric approaches depends heavily on the amount of data. In the absence of large data sets, non-parametric approaches run the risk of selecting as the control limits the outlying data points caused by out-of-control conditions, resulting in inflated estimates of alert and action limits. Therefore it is always preferable to use model-based methods.

A more frequently used model for count data of skewed empirical distribution is the Poisson model (4, 5). The Poisson model assumes that the data are uniformly dispersed. With the Poisson model, the mean of the distribution is equal to the variance. However, the real-life environmental data are collected in various locations and time points and are often over-dispersed. This is primarily caused by heterogeneity of sampling populations, as samples collected at different locations and times often come from different random processes. As a consequence, a single-parameter Poisson model is not adequate to describe environmental data consisting of several subpopulations and tends to underestimate the variability of the data, thus resulting in more frequent false alarms of microbial excursion. Hoffman (4) and Christensen *et al.* (5) suggest using the negative binomial distribution to correct the effect of overdispersion. The negative binomial is a generalization of the Poisson model. It models data at each sampling location through a Poisson distribution, but it allows the mean of the Poisson distribution to vary across locations according to a gamma distribution. Because the negative binomial has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean, thus accommodating overdispersion.

As noted by Caputo and Huffman (6), in Class 100 (ISO 5) clean rooms and isolation systems, most of time there is no observation of microbial recovery. These frequent zero test results make normal, Poisson, or negative binomial based models insufficient. They propose to monitor frequency of non-zero microbial recovery, as opposed to trend individual observations. However, by focusing the monitoring on the frequency of events, one might not fully utilize all information contained in the measurements. For example, the magnitude of intermittent non-zero results is not reflected in the frequency analysis. Even if there are excursions exceeding the alert or action limit, they may go undetected. In addition, the Caputo and Huffman's method does not account for heterogeneity of the subpopulations at various sampling locations.

In this paper, we develop a complementary approach to setting alert and action limits for classified systems. The method is based on a zero-inflated negative binomial model discussed by Erdman *et al.* (7, 8). Microbial recoveries of a clean room are considered as being generated from two stochastic processes. The first process, representing sterile sampled population, generates zero counts, whereas the second process, representing sampled population with some tendency to be nonsterile, produces count data from a negative binomial model. The chance for each process to produce the observed results is modelled through a Bernoulli distribution. The method provides a simple and flexible alternative to the approaches currently in use for setting alert and action limits, and it has been shown to provide better model fitting and more accurate estimates to the alert and action limits.

## Zero-Inflated Negative Binomial Model

Let *X = {X _{i}, i = 1, … , n}* denote the microbial counts of

*n*samples taken in a monitoring cycle.

*X*is model through the following zero-inflated negative binomial (ZINB) model: where

_{i}*W*is Bernoulli random variable with Pr[W = 1] = p,

*Y*is a degenerated random variable with Pr(Y = 0) = 1,

*Z*follows a negative binomial distribution NB(μ, κ), with density function given by where μ > 0 and

*k > 0.*The distribution of

*Z*has a mean μ and variance

*μ(1 + kμ).*It is evident that the variance is greater than mean, whereas the variance and mean are equal for the Poisson distribution. As a consequence, the negative binomial distribution is useful for modeling over-dispersed data. It is also assumed that

*W, Y,*and

*Z*are independent. The excess of zeros in the observed data is described by the random variable

*Y,*which has a point mass at zero. The density function of the observed microbial count

*X*is

_{i}Several observations can be made about the zero-inflated model specified through eqs 1–3. Firstly, when *p = 0, f(x _{i}|p, μ, k) = g(x_{i}|μ, k).* That is,

*X*follows a negative binomial distribution. Secondly, when

_{i}*p = 0, k → 0*and

*kμ → λ, f(x*This implies that

_{i}|p, μ, k) = g(x_{i}|μ, k) →*X*approximately follows a Poisson distribution with mean microbial count of λ. Lastly, under the above conditions, assuming that λ is large, because

_{i}*X*being a Poisson random variable of mean λ, can be expressed as sum of λ independent Poisson variables of mean 1, it is approximately normally distributed with both mean and variance of λ. As a result, the normal, Poisson, and negative binomial distribution can be viewed, either in an exact or approximately sense, as special cases of zero-inflated negative binomial models.

_{i},## Parameter Estimation

Let *n _{x} (0 < x ≤ m)* denote the number of observations out of

*n*samples

*X = {X*i = 1, … , n} that take value x. In particular,

_{i},*n*

_{0}is the number of zeros in the samples. The log-likelihood function of X is given by

The maximum likelihood estimators (MLEs) (*p^, μ^, k^)* of model parameters (*p, μ, k)* can be found by maximizing the log-likelihood function in eq 4. Both SAS, a commercially available statistical analysis package (7, 9), and an open access package R (10) can be used to obtain the MLEs. Alternatively, data may be fit to a zero-inflated negative binomial model using Bayesian methods and WinBUGS as described by Neelon *et al.* (11).

## Construction of Alert and Action Limits

Substituting the MLEs (*p^, μ^, k^)* into eq 3, we obtain an empirical estimate of the density function of the count data,

Let 1 − α_{1} and 1 − α_{2} be the confidence levels of alert and action limits, respectively, such that 0 < 1 − α_{1} < 1 − α_{2} < 1. In practice, α_{1} and α_{2} are conventionally chosen to be 0.05 and 0.01, respectively. Thus the alert limit (ALL) and action limit (ACL) can be determined as

The two limits are the numbers that enable the inequalities in eqs 6 and 7 to meet for the first time. They are the smallest integers that are bigger than or equal to the 100(1 − α_{1}) and 100(1 − α_{2}) percentiles of the distribution of *X _{i},* and they can be obtained through iterative calculation. However, the accuracy of the two limits depends how closely the MLEs (

*p^, μ^, k^)*approximate (

*p, μ, k).*In general, the larger the sample size

*n,*the more accurate the estimates are.

## Comparison with Other Methods

In this section, through a simulation study, we compare the performance of the zero-inflated negative binomial model with the traditional models for count data including normal, Poisson, and negative binomial. One thousand data sets indexed by *m (1 ≤ m ≤ 1000)* are simulated for sample size n = 100, 1000, and 10000, through the model specified in eqs 1–3 with (*p, μ, k) = (0.5, 2.8, 0.04), α _{1} = 0.05,* and α

_{2}= 0.01. For each data set, ZINB, normal, Poisson, and negative binomial models are fit to the data to obtain estimates θ^ of model parameters θ, which are (

*p, μ, k), (μ, o*and (

^{2}), λ,*μ, k)*for ZINB, normal, Poisson, and negative binomial models, respectively, with (

*μ, k)*being defined in eq 2,

*p*in eq 3, (

*μ, o*being the mean and variance of the normal distribution, and λ being the mean of the Poisson distribution. For the normal distribution and given (m, n), the one-sided 95% and 99% confidence limits are used as the estimates for ALL and ACL:

^{2})For the other three models, the estimates for ALL and ACL are determined to be the smallest integers greater than or equal to 95^{th} and 99^{th} percentile. Specifically, let *f^(x|θ^)* be the estimated probability density function, which, for instance, takes the functional form in eq 5 for the ZINB random variable. The estimates of ALL and ACL are given by

The mean ALL and ACL estimates and their SD are calculated as follows:

The true ALL and ACL are determined to be 5 and 7, respectively, based on the density function in eq 3 with model parameters (*p, μ, k)* being replaced by their true values (0.5, 2.8, 0.04). The results are summarized in Table I.

It is evident that the normal and Poisson models underestimate the dispersion of the data, resulting in narrower alert and action limits. The negative binomial, presumably appropriate for modeling overdispersion, overestimates the data dispersion to account for the excess of zeros, giving rise to wider limits. By contrast, the ZINB model provides the most accurate estimates of ALL and ACL. For a moderate sample size n = 100, the ALL and ACL are very close to their respective true limits. The estimated ALL and ACL converge to the true limits as the sample size increases. In addition, the precision of the ALL and ACL estimates, characterized by SD, improves as the sample size becomes larger. However, the sample size has little effect on the accuracy of the estimates based on the normal, negative binomial, and Poisson models. The results suggest that when data exhibit excess of zeros and overdispersion, ZINB demonstrates an excellent performance in estimating ALL and ACL and that the other three models result in increased risk of either false alarm of microbial excursions or under-detection of adverse events.

Figure 1 displays estimated density functions of the four models imposed on the histogram of the microbial data from the first round of simulation (m = 1 and n = 1). Because the data contain excess of zeros and exhibit overdispersion, none of the normal, negative, and Poisson models fits the data well, which becomes especially clear when comparing the actual with modeled counts of zero for the normal and Poisson model, and non-zero observed and modeled counts for the negative binomial. The superiority of the ZINB to the other models is visually apparent as it fits the empirical data far better than the other three models do.

## Discussion

Environmental monitoring of sterile drug production is a regulatory mandate. Microbial count data are collected at different time and various locations where significant activity or product exposure occurs during production. Aided with advanced technology in the identification and quantification of microbes, the EM data provide timely feedback on the cleanliness of the manufacturing facilities. Because the level of exposure to microbial contamination varies over time and from location to location, the quality of alert and action limits derived from the normal and Poisson models diminishes rapidly because these models assume a constant mean microbial count. Although the population heterogeneity can be modeled through a negative binomial, the model does not lend itself easily to account for excess of zero observations. Such inflated-zero phenomenon is very common in most Class 100 (ISO 5) clean rooms and isolation systems. Failing to account for variability due to location and a large number of zeros may lead to either overestimate or underestimate alert and action levels, and may compromise the soundness of the EM program.

The ZINB model, taking into account of dispersion due to sampling time and location, as well as excess of zeros, overcomes the shortcomings of the traditional models. It provides good fit to data, and results in accurate estimates of alert and action limits, clearly superior to the proposed models from the literature. In the event that data do not show excess zeros or overdispersion, ZINB is expected to give, at least, comparable performance as one of the traditional models which may be the real process by which the data are generated. This, by and large, is due to the fact that the normal, negative binomial, and Poisson models can be either special cases of ZINB or approximated by the ZINB as previously discussed. In addition, because EM data are usually abundant, accurate model parameter estimates can always be derived from fitting the ZINB model to the data. An alternative approach, perhaps more comprehensive, to setting alert and action limits begins with evaluation of data distribution, following the procedure depicted in Figure 2. Data are tested for normality using the Shapiro-Wilk test. A *P*-value less than 0.05 would suggest non-normal data. The Shapiro-Wilk test is a powerful test of normality that can show statistical significance in cases of non-essential deviations from normality (such as rounding or binning of data). As an alternative to the Shapiro-Wilk, a normality decision may be based on a visual inspection of the data using a normal probability plot. If the data do not contain zero values, they are tested for normality after log-transformation. Again a *P*-value less than 0.05 of the Shapiro-Wilk test would be indicative non-normal log-transformed data. A Poisson model is fit to the data. A goodness-of-fit test is performed to evaluate if the data follow Poisson distribution or not. The final outcome of this evaluation is the claim that either data are normally distributed, log-normal, following Poisson, negative binomial, zero-inflated negative binomial, or cannot be adequately described through a parametric model. For the first four scenarios, the corresponding parametric model is used to estimate alert and action limits. A non-parametric method is preferred for the last case.

Because the soundness of an EM program depends heavily on accurate determination of alert and action limits, blindingly applying traditional techniques such as normal distribution without understanding the data distribution may lead to inappropriate alert and action limits, causing high rate of false positive or negative outcomes. The above procedure helps find models suitable for the data. The effectiveness of the method can be evaluated through simulation. However, a full description of this is beyond the scope of this paper. With the help of statisticians, the method can be automated through R programming.

## Conclusions

We propose a zero-inflated, negative binomial model for environmental data collected from clean rooms or isolated systems in manufacture of pharmaceutical products. Through a simulation study, it is shown that in the presence of excess zeros and overdispersion, the model fits data well and provides accurate estimates of alert and action limits. It demonstrates enhanced performance when compared to traditional methods. The ZINB model-fitting can be easily carried out with both commercial and free software packages, making establishment of alert and action limits computationally and operationally less challenging. It is also worth reiterating that a more comprehensive solution to setting alert and action limits for an EM program should include an evaluation process that allows for testing data distribution and proper selection of models.

## Conflict of Interest Declaration

The authors declare that they have no competing interests.

## Acknowledgements

We would like to thank Dr. Lanju Zhang for useful discussion of the manuscript. We also would like to thank the referees for their helpful comments, which greatly helped improve the paper.

- © PDA, Inc. 2013