1010 USP39-NF34 ANALYTICAL DATA INTERPRETATION AND TREATMENT

<1010> ANALYTICAL DATA—INTERPRETATION AND TREATMENT

INTRODUCTION

This chapter provides information regarding acceptable practices for the analysis and consistent interpretation of data obtained from chemical and other analyses. Basic statistical approaches for evaluating data are described, and the treatment of outliers and comparison of analytical procedures are discussed in some detail.

[NOTE: It should not be inferred that the analysis tools mentioned in this chapter form an exhaustive list. Other, equally valid, statistical methods may be used at the discretion of the manufacturer and other users of this chapter.]

Assurance of the quality of pharmaceuticals is accomplished by combining a number of practices, including robust formulation design, validation, testing of starting materials, in-process testing, and final-product testing. Each of these practices is dependent on reliable test procedures. In the development process, test procedures are developed and validated to ensure that the manufactured products are thoroughly characterized. Final-product testing provides further assurance that the products are consistently safe, efficacious, and in compliance with their specifications.

Measurements are inherently variable. The variability of biological tests has long been recognized by the USP. For example, the need to consider this variability when analyzing biological test data is addressed in Analysis of Biological Assays <1034>. The chemical analysis measurements commonly used to analyze pharmaceuticals are also inherently variable, although less so than those of the biological tests. However, in many instances the acceptance criteria are proportionally tighter, and thus, this smaller allowable variability has to be considered when analyzing data generated using analytical procedures. If the variability of a measurement is not characterized and stated along with the result of the measurement, then the data can only be interpreted in the most limited sense. For example, stating that the difference between the averages from two laboratories when testing a common set of samples is 10% has limited interpretation, in terms of how important such a difference is, without knowledge of the intralaboratory variability.

This chapter provides direction for scientifically acceptable treatment and interpretation of data. Statistical tools that may be helpful in the interpretation of analytical data are described. Many descriptive statistics, such as the mean and standard deviation, are in common use. Other statistical tools, such as outlier tests, can be performed using several different, scientifically valid approaches, and examples of these tools and their applications are also included. The framework within which the results from a compendial test are interpreted is clearly outlined in General Notices and Requirements 7. Test Results. Selected references that might be helpful in obtaining additional information on the statistical tools discussed in this chapter are listed in Appendix G at the end of the chapter. USP does not endorse these citations, and they do not represent an exhaustive list. Further information about many of the methods cited in this chapter may also be found in most statistical textbooks.

PREREQUISITE LABORATORY PRACTICES AND PRINCIPLES

The sound application of statistical principles to laboratory data requires the assumption that such data have been collected in a traceable (i.e., documented) and unbiased manner. To ensure this, the following practices are beneficial.

Sound Record Keeping

Laboratory records are maintained with sufficient detail, so that other equally qualified analysts can reconstruct the experimental conditions and review the results obtained. When collecting data, the data should generally be obtained with more decimal places than the specification requires and rounded only after final calculations are completed as per the General Notices and Requirements.

Sampling Considerations

Effective sampling is an important step in the assessment of a quality attribute of a population. The purpose of sampling is to provide representative data (the sample) for estimating the properties of the population. How to attain such a sample depends entirely on the question that is to be answered by the sample data. In general, use of a random process is considered the most appropriate way of selecting a sample. Indeed, a random and independent sample is necessary to ensure that the resulting data produce valid estimates of the properties of the population. Generating a nonrandom or "convenience" sample risks the possibility that the estimates will be biased.

The most straightforward type of random sampling is called simple random sampling, a process in which every unit of the population has an equal chance of appearing in the sample. However, sometimes this method of selecting a random sample is not optimal because it cannot guarantee equal representation among factors (e.g., time, location, machine) that may influence the critical properties of the population. For example, if it requires 12 hours to manufacture all of the units in a lot and it is vital that the sample be representative of the entire production process, then taking a simple random sample after the production has been completed may not be appropriate because there can be no guarantee that such a sample will contain a similar number of units made from every time period within the 12-hour process. Instead, it is better to take a systematic random sample whereby a unit is randomly selected from the production process at systematically selected times or locations (e.g., sampling every 30 minutes from the units produced at that time) to ensure that units taken throughout the entire manufacturing process are included in the sample. Another type of random sampling procedure is needed if, for example, a product is filled into vials using four different filling machines. In this case it would be important to capture a random sample of vials from each of the filling machines. A stratified random sample, which randomly samples an equal number of vials from each of the four filling machines, would satisfy this requirement.

Regardless of the reason for taking a sample (e.g., batch-release testing), a sampling plan should be established to provide details on how the sample is to be obtained to ensure that the sample is representative of the entirety of the population and that the resulting data have the required sensitivity. The optimal sampling strategy will depend on knowledge of the manufacturing and analytical measurement processes. Once the sampling scheme has been defined, it is likely that the sampling will include some element of random selection. Finally, there must be sufficient sample collected for the original analysis, subsequent verification analyses, and other analyses. Consulting a statistician to identify the optimal sampling strategy is recommended.

Tests discussed in the remainder of this chapter assume that simple random sampling has been performed.
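The three sampling schemes described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lot layout (24 half-hour intervals, 50 units per interval, four filling machines labeled A to D), the sample sizes, and all function names are hypothetical and do not come from the chapter.

```python
import random

def simple_random_sample(units, k, rng):
    """Every unit of the population has an equal chance of being selected."""
    return rng.sample(units, k)

def systematic_random_sample(units_by_interval, rng):
    """Select one unit at random from each systematically chosen interval."""
    return [rng.choice(interval) for interval in units_by_interval]

def stratified_random_sample(units_by_stratum, k_per_stratum, rng):
    """Draw the same number of units at random from every stratum."""
    return {name: rng.sample(units, k_per_stratum)
            for name, units in units_by_stratum.items()}

# A fixed seed lets the selection be documented and reproduced (an assumption
# about local practice, not a requirement stated in the chapter).
rng = random.Random(2024)

# Hypothetical lot: units are (interval, machine, serial) tuples.
machines = "ABCD"
lot = [(t, machines[i % 4], i) for t in range(24) for i in range(50)]

simple = simple_random_sample(lot, 24, rng)

by_interval = [[u for u in lot if u[0] == t] for t in range(24)]
systematic = systematic_random_sample(by_interval, rng)

by_machine = {m: [u for u in lot if u[1] == m] for m in machines}
stratified = stratified_random_sample(by_machine, 6, rng)

# The systematic sample covers every 30-minute interval by construction;
# the simple random sample offers no such guarantee.
print(len({u[0] for u in simple}), len({u[0] for u in systematic}))
```

The simple random sample may miss some production intervals entirely, which is the weakness the text describes; the systematic and stratified versions trade a little simplicity for guaranteed coverage of time periods or machines.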

Use of Reference Standards

Where USP or NF tests or assays call for the use of a USP Reference Standard, only those results obtained using the specified USP Reference Standard are conclusive for purposes of demonstrating conformance to such USP or NF standards. While USP standards apply at all times in the life of an article from production to expiration, USP does not specify when testing must be done, or any frequency of testing. Accordingly, users of USP and NF apply a range of strategies and practices to assure articles achieve and maintain conformance with compendial requirements, including when and if tested. Such strategies and practices can include the use of secondary standards traceable to the USP Reference Standard, to supplement or support any testing undertaken for the purpose of conclusively demonstrating conformance to applicable compendial standards. Because the assignment of a value to a standard is one of the most important factors that influences the accuracy of an analysis, it is critical that this be done correctly.

System Performance Verification

Verifying an acceptable level of performance for an analytical system in routine or continuous use can be a valuable practice. This may be accomplished by analyzing a control sample at appropriate intervals, or by other means, such as monitoring the variation among the standards or background signal-to-noise ratios. Attention to the measured parameter, such as charting the results obtained by analysis of a control sample, can signal a change in performance that requires adjustment of the analytical system. An example of a control chart is provided in Appendix A.
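As a rough illustration of charting control-sample results, the sketch below derives limits from historical control data and flags new results that fall outside them. It assumes limits of the historical mean plus or minus three sample standard deviations, which is one common convention and not necessarily the construction used in Appendix A; the data and function names are hypothetical.

```python
import statistics

def control_limits(historical, k=3.0):
    """Return (lower, center, upper) limits as mean -/+ k sample standard deviations."""
    center = statistics.mean(historical)
    sd = statistics.stdev(historical)
    return center - k * sd, center, center + k * sd

def out_of_control(results, lower, upper):
    """Return (index, value) pairs for results outside the control limits."""
    return [(i, x) for i, x in enumerate(results) if not lower <= x <= upper]

# Hypothetical historical control-sample results (% of label claim).
historical = [99.8, 100.2, 100.1, 99.9, 100.0, 100.3, 99.7, 100.1, 99.9, 100.0]
lower, center, upper = control_limits(historical)

# Hypothetical new control results to be checked against the limits.
new_results = [100.1, 99.8, 101.2, 100.0]
flags = out_of_control(new_results, lower, upper)
print(flags)
```

A flagged point is a signal to examine the analytical system, not an automatic judgment on the samples tested alongside the control.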

Procedure Validation

All analytical procedures are appropriately validated as specified in Validation of Compendial Procedures <1225>. Analytical procedures published in the USP–NF have been validated and meet the Current Good Manufacturing Practices regulatory requirement for validation as established in the Code of Federal Regulations. A validated procedure may be used to test a new formulation (such as a new product, dosage form, or process intermediate) only after confirming that the new formulation does not interfere with the accuracy, linearity, or precision of the method. It may not be assumed that a validated procedure could correctly measure the active ingredient in a formulation that is different from that used in establishing the original validity of the procedure. [NOTE ON TERMINOLOGY—The definition of accuracy in <1225> and in ICH Q2 corresponds to unbiasedness only. In the International Vocabulary of Metrology (VIM) and documents of the International Organization for Standardization (ISO), accuracy has a different meaning. In ISO, accuracy combines the concepts of unbiasedness (termed trueness) and precision. This chapter follows the definition in <1225>, which corresponds only to trueness.]

MEASUREMENT PRINCIPLES AND VARIATION

All measurements are, at best, estimates of the actual ("true" or "accepted") value, for they contain random variability (also referred to as random error) and may also contain systematic variation (bias). Thus, the measured value differs from the actual value because of variability inherent in the measurement. If an array of measurements consists of individual results that are representative of the whole, statistical methods can be used to estimate informative properties of the entirety, and statistical tests are available to investigate whether it is likely that these properties comply with given requirements. The resulting statistical analyses should address the variability associated with the measurement process as well as that of the entity being measured. Statistical measures used to assess the direction and magnitude of these errors include the mean, standard deviation, and expressions derived therefrom, such as the percent coefficient of variation (%CV; also called the percent relative standard deviation, %RSD). The estimated variability can be used to calculate confidence intervals for the mean, or measures of variability, and tolerance intervals capturing a specified proportion of the individual measurements.

The use of statistical measures must be tempered with good judgment, especially with regard to representative sampling. Data should be consistent with the statistical assumptions used for the analysis. If one or more of these assumptions appear to be violated, alternative methods may be required in the evaluation of the data. In particular, most of the statistical measures and tests cited in this chapter rely on the assumptions that the distribution of the entire population is represented by a normal distribution and that the analyzed sample is a representative subset of this population. The normal (or Gaussian) distribution is bell-shaped and symmetric about its center and has certain characteristics that are required for these tests to be valid. The data may not always be expected to be normally distributed and may require a transformation to better fit a normal distribution. For example, there exist variables that have distributions with longer right tails than left. Such distributions can often be made approximately normal through a log transformation. An alternative approach would be to use "distribution-free" or "nonparametric" statistical procedures that do not require that the shape of the population be that of a normal distribution. When the objective is to construct a confidence interval for the mean or for the difference between two means, for example, then the normality assumption is not as important because of the central limit theorem. However, one must verify normality of data to construct valid confidence intervals for standard deviations and ratios of standard deviations, perform some outlier tests, and construct valid statistical tolerance limits. In the latter case, normality is a critical assumption. Simple graphical methods, such as dot plots, histograms, and normal probability plots, are useful aids for investigating this assumption.
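The effect of a log transformation on right-skewed data can be examined with the coordinates of a normal probability plot. The sketch below is a minimal example under stated assumptions: the plotting positions (i - 0.5)/n are one common convention among several, the straight-line correlation is used only as a rough numerical summary of the plot, and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean

def normal_probability_points(data):
    """Return (theoretical normal quantile, ordered observation) pairs."""
    n = len(data)
    ordered = sorted(data)
    # Plotting positions (i - 0.5) / n: an assumed convention, others exist.
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

def straightness(points):
    """Pearson correlation of the plot points; values near 1 suggest a straight line."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements with a long right tail.
raw = [1.2, 1.5, 1.9, 2.4, 3.1, 4.0, 5.6, 8.3, 13.0, 25.0]

r_raw = straightness(normal_probability_points(raw))
r_log = straightness(normal_probability_points([math.log(x) for x in raw]))

# The log-transformed data lie closer to a straight line than the raw data.
print(round(r_raw, 3), round(r_log, 3))
```

Plotting the returned pairs gives the normal probability plot itself; the correlation is a convenience for comparison and does not replace looking at the plot.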

A single analytical measurement may be useful in quality assessment if the sample is from a whole that has been prepared using a well-validated, documented process and if the analytical errors are well known. The obtained analytical result may be qualified by including an estimate of the associated errors. There may be instances when one might consider the use of averaging because the variability associated with an average value is always reduced as compared to the variability in the individual measurements. The choice of whether to use individual measurements or averages will depend upon the use of the measure and its variability. For example, when multiple measurements are obtained on the same sample aliquot, such as from multiple injections of the sample in an HPLC method, it is generally advisable to average the resulting data for the reason discussed above.

Variability is associated with the dispersion of observations around the center of a distribution. The most commonly used statistic to measure the center is the sample mean ($\bar{x}$):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Analytical procedure variability can be estimated in various ways. The most common and useful assessment of a procedure's variability is the determination of the standard deviation based on repeated independent¹ measurements of a sample. The sample standard deviation, s, is calculated by the formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n - 1}}$$

in which $x_i$ is the individual measurement in a set of n measurements; and $\bar{x}$ is the mean of all the measurements. The percent relative standard deviation (%RSD) is then calculated as:

$$\%RSD = \frac{s}{\bar{x}} \times 100$$

and expressed as a percentage. If the data require log transformation to achieve normality (e.g., for biological assays), then alternative methods are available².
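The sample mean, sample standard deviation, and %RSD defined above translate directly into code. The following sketch uses hypothetical assay results and illustrative function names.

```python
import math

def sample_mean(x):
    """Arithmetic mean of the measurements."""
    return sum(x) / len(x)

def sample_sd(x):
    """Sample standard deviation with n - 1 in the denominator."""
    m = sample_mean(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))

def percent_rsd(x):
    """Percent relative standard deviation: 100 * s / mean."""
    return 100.0 * sample_sd(x) / sample_mean(x)

# Hypothetical replicate assay results (% of label claim).
results = [99.5, 100.2, 100.4, 99.8, 100.1]

print(sample_mean(results), sample_sd(results), percent_rsd(results))
```

Python's standard library offers `statistics.mean` and `statistics.stdev`, which compute the same quantities; the explicit versions are shown only to mirror the formulas.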

¹ Multiple measurements (or, equivalently, the experimental errors associated with the multiple measurements) are independent from one another when they can be assumed to represent a random sample from the population. In such a sample, the magnitude of one measurement is not influenced by, nor does it influence the magnitude of, any other measurement. Lack of independence implies the measurements are correlated over time or space. Consider the example of a 96-well microtiter plate. Suppose that whenever the unknown causes that produce experimental error lead to a low result (negative error) for a sample placed in the first column, these same causes also lead to a low result for a sample placed in the second column. The two resulting measurements would then not be statistically independent. One way to avoid such possibilities would be to randomize the placement of the samples on the plate.

² When data have been log (base e) transformed to achieve normality, the %RSD is:

$$\%RSD = 100 \times \sqrt{e^{s^2} - 1}$$

This can be reasonably approximated by:

$$\%RSD = 100 \times \left(e^{s} - 1\right)$$

where s is the standard deviation of the log (base e) transformed data.

A precision study should be conducted to provide a better estimate of procedure variability. The precision study may be designed to determine intermediate precision (which includes the components of both "between-run" and "within-run" variability) and repeatability ("within-run" variability). The intermediate precision studies should allow for changes in the experimental conditions that might be expected, such as different analysts, different preparations of reagents, different days, and different instruments. To perform a precision study, the test is repeated several times. Each run must be completely independent of the others to provide accurate estimates of the various components of variability. In addition, within each run, replicates are made in order to estimate repeatability. See an example of a precision study in Appendix B.
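One common way to separate within-run and between-run variability is a balanced one-way random-effects analysis of variance. The sketch below assumes that design (the same number of replicates in every run), which may differ in detail from the worked example in Appendix B; the data and function name are hypothetical.

```python
import math
from statistics import mean

def precision_components(runs):
    """Return (repeatability SD, intermediate precision SD) for balanced runs.

    Uses the one-way random-effects ANOVA mean squares:
    within-run variance = MS_within,
    between-run variance = (MS_between - MS_within) / n, truncated at zero.
    """
    r = len(runs)
    n = len(runs[0])
    if r < 2 or n < 2 or any(len(run) != n for run in runs):
        raise ValueError("need at least 2 runs, each with the same number (>= 2) of replicates")
    run_means = [mean(run) for run in runs]
    grand_mean = mean(run_means)
    ms_within = sum((x - m) ** 2
                    for run, m in zip(runs, run_means)
                    for x in run) / (r * (n - 1))
    ms_between = n * sum((m - grand_mean) ** 2 for m in run_means) / (r - 1)
    var_between = max(0.0, (ms_between - ms_within) / n)
    return math.sqrt(ms_within), math.sqrt(var_between + ms_within)

# Hypothetical study: 3 independent runs, 2 replicates per run.
runs = [[100.0, 100.4], [99.0, 99.4], [101.0, 101.4]]
repeatability_sd, intermediate_sd = precision_components(runs)
print(round(repeatability_sd, 4), round(intermediate_sd, 4))
```

In this made-up data set the runs differ from each other far more than the replicates within a run, so the intermediate precision SD is much larger than the repeatability SD.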

A confidence interval for the mean may be considered in the interpretation of data. Such intervals are calculated from several data points using the sample mean ($\bar{x}$) and sample standard deviation (s) according to the formula:

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

in which $t_{\alpha/2,\,n-1}$ is a statistical number dependent upon the sample size (n), the number of degrees of freedom (n - 1), and the desired confidence level (1 - α). Its values are obtained from published tables of the Student t-distribution. The confidence interval provides an estimate of the range within which the "true" population mean (μ) falls, and it also evaluates the reliability of the sample mean as an estimate of the true mean. If the same experimental set-up were to be replicated over and over and a 95% (for example) confidence interval for the true mean is calculated each time, then 95% of such intervals would be expected to contain the true mean, μ. One cannot say with certainty whether or not the confidence interval derived from a specific set of data actually collected contains μ. However, assuming the data represent mutually independent measurements randomly generated from a normally distributed population, the procedure used to construct the confidence interval guarantees that 95% of such confidence intervals contain μ. Note that it is important to define the population appropriately so that all relevant sources of variation are captured.

[NOTE ON TERMINOLOGY: In the documents of the International Organization for Standardization (ISO), different terminology is used for some of the concepts described here. The term $s/\sqrt{n}$, which is commonly called the standard error of the mean, is called the standard uncertainty in ISO documents. The term $t_{\alpha/2,\,n-1}\,s/\sqrt{n}$ is called the expanded uncertainty, and $t_{\alpha/2,\,n-1}$ is called the coverage factor, by ISO. If the standard deviation is found by combining estimates of variability from multiple sources, it is called the combined standard uncertainty. Some of these sources could have nonstatistical estimates of uncertainty, called Type B uncertainties, such as uncertainty in calibration of a balance.]
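The confidence interval formula can be applied as in the sketch below. To stay within the standard library, the critical values are copied from a standard two-sided 95% Student t table for 1 to 10 degrees of freedom; the data and function name are hypothetical, and other confidence levels or larger samples would need additional table entries.

```python
import math
import statistics

# Two-sided 95% critical values t(0.025, df) from a standard Student t table.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def confidence_interval_95(data):
    """Return (lower, upper) bounds of the 95% confidence interval for the mean."""
    n = len(data)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("table covers only 2 to 11 observations")
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    half_width = T_CRIT_95[df] * s / math.sqrt(n)  # t * standard error of the mean
    return x_bar - half_width, x_bar + half_width

# Hypothetical replicate assay results (% of label claim).
results = [99.5, 100.2, 100.4, 99.8, 100.1]
lower, upper = confidence_interval_95(results)
print(round(lower, 2), round(upper, 2))
```

The interval is valid only under the assumptions stated in the text: independent measurements drawn at random from a normally distributed population.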

OUTLYING RESULTS

Occasionally, observed analytical results are very different from those expected. Aberrant, anomalous, contaminated, discordant, spurious, suspicious or wild observations; and flyers, rogues, and mavericks are properly called outlying results. Like all laboratory results, these outliers must be documented, interpreted, and managed. Such results may be accurate measurements of the entity being measured, but are very different from what is expected. Alternatively, due to an error in the analytical system, the results may not be typical, even though the entity being measured is typical. When an outlying result is obtained, systematic laboratory and, in certain cases, process investigations of the result are conducted to determine if an assignable cause for the result can be established. Factors to be considered when investigating an outlying result include—but are not limited to—human error, instrumentation error, calculation error, and product or component deficiency. If an assignable cause that is not related to a product or component deficiency can be identified, then retesting may be performed on the same sample, if possible, or on a new sample. The precision and accuracy of the procedure, the USP Reference Standard, process trends, and the specification limits should all be examined. Data may be invalidated, based on this documented investigation, and eliminated from subsequent calculations.

If no documentable, assignable cause for the outlying laboratory result is found, the result may be tested, as part of the overall investigation, to determine whether it is an outlier.

However, careful consideration is warranted when using these tests. Two types of errors may occur with outlier tests: (a) labeling observations as outliers when they really are not; and (b) failing to identify outliers when they truly exist. Any judgment about the acceptability of data in which outliers are observed requires careful interpretation.

"Outlier labeling" is informal recognition of suspicious laboratory values that should be further investigated with more formal methods. The selection of the correct outlier identification technique often depends on the initial recognition of the number and location of the values. Outlier labeling is most often done visually with graphical techniques. "Outlier identification" is the use of statistical significance tests to confirm that the values are inconsistent with the known or assumed statistical model.

When used appropriately, outlier tests are valuable tools for pharmaceutical laboratories. Several tests exist for detecting outliers. Examples illustrating three of these procedures, the Extreme Studentized Deviate (ESD) Test, Dixon's Test, and Hampel's Rule, are presented in Appendix C.

Choosing the appropriate outlier test will depend on the sample size and distributional assumptions. Many of these tests (e.g., the ESD Test) require the assumption that the data generated by the laboratory on the test results can be thought of as a random sample from a population that is normally distributed, possibly after transformation. If a transformation is made to the data, the outlier test is applied to the transformed data. Common transformations include taking the logarithm or square root of the data. Other approaches to handling single and multiple outliers are available and can also be used. These include tests that use robust measures of central tendency and spread, such as the median and median absolute deviation, and exploratory data analysis (EDA) methods. "Outlier accommodation" is the use of robust techniques, such as tests based on the order or rank of each data value in the data set instead of the actual data value, to produce results that are not adversely influenced by the presence of outliers. The use of such methods reduces the risks associated with both types of error in the identification of outliers.
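A rule built on the median and the median absolute deviation (MAD) can be sketched as follows. The scale factor 1.483 (which makes the MAD comparable to a standard deviation for normal data) and the cutoff of 3.5 are commonly quoted values for Hampel-type rules and are treated here as assumptions; Appendix C should be consulted for the exact procedure. The data and function name are hypothetical.

```python
import statistics

def hampel_flags(data, scale=1.483, cutoff=3.5):
    """Return (value, normalized deviation) pairs exceeding the cutoff.

    Deviations are measured from the median and normalized by scale * MAD,
    so a single extreme value has little effect on the yardstick itself.
    """
    med = statistics.median(data)
    abs_dev = [abs(x - med) for x in data]
    mad = statistics.median(abs_dev)
    if mad == 0:
        raise ValueError("MAD is zero; the rule cannot be applied to these data")
    normalized_mad = scale * mad
    return [(x, d / normalized_mad)
            for x, d in zip(data, abs_dev)
            if d / normalized_mad > cutoff]

# Hypothetical assay results with one suspicious low value.
results = [100.0, 100.1, 100.5, 99.9, 100.2, 99.8, 100.3, 100.0, 99.7, 95.7]
flags = hampel_flags(results)
print([value for value, _ in flags])
```

A flagged value is only a candidate outlier. As the chapter stresses, the statistical result supports, and never replaces, a documented laboratory investigation, and if the data were transformed the rule would be applied to the transformed values.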

"Outlier rejection" is the actual removal of the identified outlier from the data set. However, an outlier test cannot be the sole means for removing an outlying result from the laboratory data. An outlier test may be useful as part of the evaluation of the significance of that result, along with other data. Outlier tests have no applicability in cases where the variability in the product is what is being assessed, such as content uniformity, dissolution, or release-rate determination. In these applications, a value determined to be an outlier may in fact be an accurate result of a nonuniform product. All data, especially outliers, should be kept for future review. Unusual data, when seen in the context of other historical data, are often not unusual after all but reflect the influences of additional sources of variation.

In summary, the rejection or retention of an apparent outlier can be a serious source of bias. The nature of the testing as well as scientific understanding of the manufacturing process and analytical procedure have to be considered to determine the source of the apparent outlier. An outlier test can never take the place of a thorough laboratory investigation. Rather, it is performed only when the investigation is inconclusive and no deviations in the manufacture or testing of the product were noted. Even if such statistical tests indicate that one or more values are outliers, they should still be retained in the record. Including or excluding outliers in calculations to assess conformance to acceptance criteria should be based on scientific judgment and the internal policies of the manufacturer. It is often useful to perform the calculations with and without the outliers to evaluate their impact.

Outliers that are attributed to measurement process mistakes should be reported (i.e., footnoted), but not included in further statistical calculations. When assessing conformance to a particular acceptance criterion, it is important to define whether the reportable result (the result that is compared to the limits) is an average value, an individual measurement, or something else. If, for example, the acceptance criterion was derived for an average, then it would not be statistically appropriate to require individual measurements to also satisfy the criterion because the variability associated with the average of a series of measurements is smaller than that of any individual measurement.

对于那些测量过程错误导致的异常值都需要进行记录(如使用脚注),但是不用将其包含在接下来的计算中。当评价是否符合某一特定接受标准时,非常重要的一件事是确定需报告的结果(即与限值比较的结果)是均值、单次测量值,还是其他的值。比如,如果接受标准是来自于均值,那么要求单个测量值也满足这个标准在统计学意义上就是不适当的,因为一系列测量均值的变异性要小于任何一个单独测量值的变异性。

COMPARISON OF ANALYTICAL PROCEDURES

分析方法的比较

It is often necessary to compare two procedures to determine if their average results or their variabilities differ by an amount that is deemed important. The goal of a procedure comparison experiment is to generate adequate data to evaluate the equivalency of the two procedures over a range of values. Some of the considerations to be made when performing such comparisons are discussed in this section.

我们经常需要比较两种(分析)方法以确定它们的平均结果或变异性是否存在重要差异。方法比较实验的目的是获得足够的数据,以便评价在一定范围内两种方法的等效性。下面的内容给出了在进行这种比较时应该做出的考虑。

Precision 精密度

Precision is the degree of agreement among individual test results when the analytical procedure is applied repeatedly to a homogeneous sample. For an alternative procedure to be considered to have "comparable" precision to that of a current procedure, its precision (see Analytical Performance Characteristics in <1225>, Validation) must not be worse than that of the current procedure by an amount deemed important. A decrease in precision (or increase in variability) can lead to an increase in the number of results expected to fail required specifications. On the other hand, an alternative procedure providing improved precision is acceptable.

精密度是指使用分析方法对均质样本进行重复测定时,各实验结果一致的程度。因为一个替代方法应当被认为具有与现行方法“相似”的精密度,其精密度(参见<1225>中分析性能属性,确认)与现有方法相比必须不能存在明显的差异。精密度的下降(或者说变异的增大)可导致不符合规定质量标准的实验结果数量增加。另一方面,体现出更佳精密度的替代方法是可以接受的。

One way of comparing the precision of two procedures is by estimating the variance for each procedure (the sample variance, s², is the square of the sample standard deviation, s) and calculating a one-sided upper confidence limit for the ratio of (true) variances, where the ratio is defined as the variance of the alternative procedure to that of the current procedure. An example, with this assumption, is outlined in Appendix D. The one-sided upper confidence limit should be compared to an upper limit deemed acceptable, a priori, by the analytical laboratory. If the one-sided upper confidence limit is less than this upper acceptable limit, then the precision of the alternative procedure is considered acceptable in the sense that the use of the alternative procedure will not lead to an important loss in precision. Note that if the one-sided upper confidence limit is less than one, then the alternative procedure has been shown to have improved precision relative to the current procedure.

比较两种方法精密度的一种方式是通过评价每种方法的方差(样本方差s2即是样本标准偏差的平方),并计算替代方法与现用方法的(真)方差比值的单侧置信上限(one-sided upper confidence limit)。附录D给出了这种假设的一个具体实例。理所当然的,该单侧置信上限应该与分析实验室确定的可接受上限进行比较。如果所计算的单侧置信上限低于可接受上限,该替代方法的精密度就被认为可以接受,即认为使用该替代方法不会导致重要的精密度损失。应该注意的是,如果计算所得的单侧置信上限小于1,那么替代方法已经显示出比原使用方法的精密度高的结论。

The confidence interval method just described is preferred to applying the two-sample F-test to test the statistical significance of the ratio of variances. To perform the two-sample F-test, the calculated ratio of sample variances would be compared to a critical value based on tabulated values of the F distribution for the desired level of confidence and the number of degrees of freedom for each variance. Tables providing F-values are available in most standard statistical textbooks. If the calculated ratio exceeds this critical value, a statistically significant difference in precision is said to exist between the two procedures. However, if the calculated ratio is less than the critical value, this does not prove that the procedures have the same or equivalent level of precision; but rather that there was not enough evidence to prove that a statistically significant difference did, in fact, exist.

上述置信区间的方法特别适合用于两样本的F检验来判断方差比值的统计学显著性差异。要进行两样本的F检验,需要将样本方差比与临界值进行比较,临界值可以根据预期的置信度和每个方差的自由度在F分布表中查


出。大部分的统计书籍都提供这样的F值表。如果所计算的比值超过临界值,则认为两种方法的精密度在统计学上存在显著差异。但如果所计算的比值小于临界值,并非证明两种方法具有相同或等效水平的精密度,而只能认为没有足够的证据证明两者之间在统计学上有显著差异。

Accuracy 准确度

Comparison of the accuracy (see Analytical Performance Characteristics in <1225>, Validation) of procedures provides information useful in determining if the new procedure is equivalent, on the average, to the current procedure. A simple method for making this comparison is by calculating a confidence interval for the difference in true means, where the difference is estimated by the sample mean of the alternative procedure minus that of the current procedure.

一般认为,方法间准确度(参见<1225>中分析性能属性,确认)的比较,在确定新方法在平均水平上是否与现有方法等效方面可提供非常有用的信息。一个进行比较的简单方法就是计算真实均值之差异的置信区间,这里,该差异是通过替代方法测得结果的均值减去现用方法的结果均值进行评估的。

The confidence interval should be compared to a lower and upper range deemed acceptable, a priori, by the laboratory. If the confidence interval falls entirely within this acceptable range, then the two procedures can be considered equivalent, in the sense that the average difference between them is not of practical concern. The lower and upper limits of the confidence interval only show how large the true difference between the two procedures may be, not whether this difference is considered tolerable. Such an assessment can be made only within the appropriate scientific context. This approach is often referred to as TOST (two one-sided tests; see Appendix F).

理所当然的,计算所得的置信区间应该与实验室确定的置信上限和下限进行比较。如果置信区间完全落在其确定的可接受置信上下限内,那么可以认为两种方法是等效的;即认为两种方法的均值没有实际差异。该置信区间的上下限仅显示两种方法的真值差异有多大,而不是说明这种差异是否可以被容忍。对于是否可以容忍这种差异的评估只有在科学的背景下才能进行。这种方法一般被TOST(双单侧检验;参见附录F)。

The confidence interval method just described is preferred to the practice of applying a t-test to test the statistical significance of the difference in averages. One way to perform the t-test is to calculate the confidence interval and to examine whether or not it contains the value zero. The two procedures have a statistically significant difference in averages if the confidence interval excludes zero. A statistically significant difference may not be large enough to have practical importance to the laboratory because it may have arisen as a result of highly precise data or a larger sample size. On the other hand, it is possible that no statistically significant difference is found, which happens when the confidence interval includes zero, and yet an important practical difference cannot be ruled out. This might occur, for example, if the data are highly variable or the sample size is too small. Thus, while the outcome of the t-test indicates whether or not a statistically significant difference has been observed, it is not informative with regard to the presence or absence of a difference of practical importance.

上述这种置信区间的比较方法特别适合于使用t-检验去检测两均值差异的统计显著性问题。进行t检验的一种

方式是先计算其置信区间,然后检查其是否包含0值。当该置信区间不包括0值时,说明两方法的均值差有显著差异。但是,统计学上的显著差异对于实验室并不一定有多么重要的实际意义,因为差异的增大可能来自于高精密度的数据或者大样本量的数据。另一方面,当置信区间包括0值时,也会出现虽然结果显示两者无统计显著性差异,但也并不能排除存在具有重要实际意义的差异。比如,当数据具有较大变异性或者样本量太小时,这种情况常会发生。所以,不论t检验的结果是否显示有显著性差异,都不能充分证明是否存在有实际重要意义的差异。
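The accuracy comparison described above can be sketched in a few lines. The example below is illustrative only: the data, the acceptance range of ±1.0, and the function names are hypothetical, the interval uses a pooled variance (which assumes independent, approximately normal results with equal variances), and the Student t critical value is supplied by the user from a table rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference_ci(alternative, current, t_crit):
    """Confidence interval for mean(alternative) - mean(current).

    Assumes a pooled variance; t_crit is the tabulated Student t value
    for the chosen confidence level and n_alt + n_cur - 2 degrees of freedom.
    """
    n_alt, n_cur = len(alternative), len(current)
    diff = mean(alternative) - mean(current)
    pooled_var = ((n_alt - 1) * variance(alternative)
                  + (n_cur - 1) * variance(current)) / (n_alt + n_cur - 2)
    half_width = t_crit * sqrt(pooled_var * (1 / n_alt + 1 / n_cur))
    return diff - half_width, diff + half_width


def assess(ci, lower_accept, upper_accept):
    """Separate practical equivalence from statistical significance."""
    low, high = ci
    return {
        # Equivalence: the whole interval lies inside the acceptable range.
        "equivalent": lower_accept < low and high < upper_accept,
        # t-test view: the interval excludes zero.
        "statistically_significant": not (low <= 0.0 <= high),
    }


# Hypothetical results, six per procedure (10 degrees of freedom).
alt = [100.2, 99.8, 100.5, 100.1, 99.9, 100.3]
cur = [100.0, 99.7, 100.2, 99.9, 100.1, 99.8]

# 1.812 is the tabulated t value for a 90% two-sided interval with 10 df.
ci = mean_difference_ci(alt, cur, t_crit=1.812)
result = assess(ci, lower_accept=-1.0, upper_accept=1.0)
print(ci, result)
```

With these made-up numbers the interval includes zero and also sits well inside the acceptance range, so both questions happen to give a reassuring answer; the two flags are reported separately because, as the text notes, they need not agree.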

Determination of Sample Size

样本量计算

Sample size determination is based on the comparison of the accuracy and precision of the two procedures³ and is similar to that for testing hypotheses about average differences in the former case and variance ratios in the latter case, but the meaning of some of the input is different. The first component to be specified is δ, the largest acceptable difference between the two procedures that, if achieved, still leads to the conclusion of equivalence. That is, if the two procedures differ by no more than δ, on the average, they are considered acceptably similar. The comparison can be two-sided as just expressed, considering a difference of δ in either direction, as would be used when comparing means. Alternatively, it can be one-sided as in the case of comparing variances where a decrease in variability is acceptable and equivalency is concluded if the ratio of the variances (new/current, as a proportion) is not more than 1.0 + δ. A researcher will need to state δ based on knowledge of the current procedure and/or its use, or it may be calculated. One consideration, when there are specifications to satisfy, is that the new procedure should not differ by so much from the current procedure as to risk generating out-of-specification results. One then chooses δ to have a low likelihood of this happening by, for example, comparing the distribution of data for the current procedure to the specification limits. This could be done graphically or by using a tolerance interval, an example of which is given in Appendix E. In general, the choice for δ must depend on the scientific requirements of the laboratory.

根据两种方法进行准确性和精密度比较的需要来确定样本量³,在准确性比较时样本量类似于均值差异检验假设所需,在精密度比较时样本量类似于方差差异检验假设所需,但是计算样本量时所需的一些输入参量的意义是不同的。第一个所需参量是δ,它代表两种方法最大可接受的差异,如果满足条件就可以给出等效性结论。如果两种方法的差异小于δ,一般认为两者等效。考虑到在两个方向上δ的差异,方法的比较可以选择均值比较时所使用的双侧检验。或者,如果可以接受变异性降低,在比较方差时也可以选择单侧比较,并且如果方差比值(新方法方差/现行方法方差的比值)不大于1.0 +δ,新方法和现行方法就被认为是等效的。研究人员需要根据现行方法和/或其应用等的相关知识来规定δ值,或者计算δ值。当合规性检测时,其中的一项考虑就是新方法不应与现行方法出现较大差异,以导致出现超标结果(OOS)的风险。这时人们应该通过选择δ值来降低发生这种情况的可能性,比如,可以通过比较质量标准限度中现行方法的分布数据来确定。这需要使用图形法或通过使用容忍区间来完成,附录E给出了一个相应的使用实例。总之,δ值的选择要根据实验室的科学需求。

³ In general, the sample size required to compare the precision of two procedures will be greater than that required to compare the accuracy of the procedures.

³ 通常用来两种方法精密度比较所需的样本量应该大于准确度比较所需。

The next two components relate to the probability of error. The data could lead to a conclusion of similarity when the procedures are unacceptably different (as defined by δ). This is called a false positive or Type I error. The error could also be in the other direction; that is, the procedures could be similar, but the data do not permit that conclusion. This is a false negative or Type II error. With statistical methods, it is not possible to completely eliminate the possibility of either error. However, by choosing the sample size appropriately, the probability of each of these errors can be made acceptably small. The acceptable maximum probability of a Type I error is commonly denoted as α and is commonly taken as 5%, but may be chosen differently. The desired maximum probability of a Type II error is commonly denoted by β. Often, β is specified indirectly by choosing a desired level of 1 − β, which is called the "power" of the test. In the context of equivalency testing, power is the probability of correctly concluding that two procedures are equivalent. Power is commonly taken to be 80% or 90% (corresponding to a β of 20% or 10%), though other values may be chosen. The protocol for the experiment should specify δ, α, and power. The sample size will depend on all of these components. An example is given in Appendix E. Although Appendix E determines only a single value, it is often useful to determine a table of sample sizes corresponding to different choices of δ, α, and power. Such a table often allows for a more informed choice of sample size to better balance the competing priorities of resources and risks (false negative and false positive conclusions).

计算样本量的另外两个成分与误差概率相关。当两种方法存在不可接受的差异(如δ所定义)时,而数据却给出了具有相似性的结论。这被称为假阳性结果或I类错误。错误也可以来自另一个方向,方法是相似的但数据却不能支持该结论。这是假阴性结果或II类错误。使用统计方法,两种错误都是无法完全避免的。然而,通过选择合适的样本量,可以将发生这些错误的可能性有效地减小到一个可以接受的很小程度。I类错误的最大可接受概率一般用α表示,并且取值通常为5%(也可以取其他值)。II类错误的最大期望概率一般用β表示。一般用一个期望水平1-β,即称为检测“效能”来间接确定β。在等效性检测情形中,效能是判定两种方法等效的正确概率。尽管也可选择其他值,通常效能取值为80%或90%(及相应的β值为20%或10%)。实验方案中应规定δ,α和效能。样本量将与所有的这三个因素直接相关。附录E给出了相应的计算实例。尽管附录E仅确定了一个值,但通常根据不同的δ,α和效能来确定一个样本量表是很有用的,这样的样本量表提供了更多的选择,以便允许更好地平衡资源与风险(假阴性和假阳性)的问题。

APPENDIX A: CONTROL CHARTS

附录A:控制图

Figure 1 illustrates a control chart for individual values. There are several different methods for calculating the upper control limit (UCL) and lower control limit (LCL). One method involves the moving range, which is defined as the absolute difference between two consecutive measurements, $|x_i - x_{i-1}|$. These moving ranges are averaged ($\overline{MR}$) and used in the following formulas:

$$\mathrm{UCL} = \bar{x} + 3\,\frac{\overline{MR}}{d_2} \qquad\qquad \mathrm{LCL} = \bar{x} - 3\,\frac{\overline{MR}}{d_2}$$

图1显示了一个各单独数值的控制图。有几种不同的方法计算控制上限(UCL)和下限(LCL)。一个方法涉及到移动区间,这被定义为连续两次测量差值的绝对值(xi–xi-1)。这些移动区间被求平均值(MR)并被用在下列公式中:

where $\bar{x}$ is the sample mean, and $d_2$ is a constant commonly used for this type of chart and is based on the number of observations associated with the moving range calculation. Where n = 2 (two consecutive measurements), as here, $d_2$ = 1.128. For the example in Figure 1, the $\overline{MR}$ was 1.7:

$$\mathrm{UCL} = 102.0 + 3\,(1.7/1.128) = 106.5 \qquad\qquad \mathrm{LCL} = 102.0 - 3\,(1.7/1.128) = 97.5$$

其中x是样本的平均值,d2是通常用于这类图表中的一个常数,它也是基于与移动区间计算相关的观测。在这里n = 2 (两次连续测量),d2 = 1.128。图1中所示的例子中,MR值为1.7。
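As an illustration of the moving-range method, the following minimal sketch computes the limits both from summary statistics (the mean of 102.0 and average moving range of 1.7 quoted for Figure 1) and from a list of raw results. The function names are hypothetical, and since the individual values behind Figure 1 are not given, any raw data passed to the second function would be the user's own.

```python
from statistics import mean

D2_N2 = 1.128  # d2 constant for moving ranges of two consecutive points


def limits_from_summary(x_bar, mr_bar, d2=D2_N2):
    """Return (LCL, UCL) as x_bar -/+ 3 * (average moving range / d2)."""
    half_width = 3 * mr_bar / d2
    return x_bar - half_width, x_bar + half_width


def individuals_chart_limits(values):
    """Return (LCL, UCL) for an individuals chart from raw results."""
    # Moving range: absolute difference between consecutive measurements.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    return limits_from_summary(mean(values), mean(moving_ranges))


# Summary statistics quoted for Figure 1.
lcl, ucl = limits_from_summary(102.0, 1.7)
print(round(lcl, 1), round(ucl, 1))  # 97.5 106.5
```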

Other methods exist that are better able to detect small shifts in the process mean, such as the cumulative sum (also known as "CUSUM") and exponentially weighted moving average ("EWMA").

有一些其他的方法更适于检测方法均值的微小波动,例如累加和(也被称为“CUSUM”)和指数加权的移动均值(“EWMA”)。

Figure 1. Individual X or individual measurements control chart for control samples.

In this particular example, the mean for all the samples ($\bar{x}$) is 102.0, the UCL is 106.5, and the LCL is 97.5.

图1:控制样本的单一X值或单一测量控制图。

在本例中,所有样本(x)的均值为102.0,UCL为106.5,LCL为97.5。

APPENDIX B: PRECISION STUDY

附录B:精密度研究

Table 1 displays data collected from a precision study. This study consisted of five independent runs and, within each run, results from three replicates were collected.

表1显示了一个精密度研究中采集的数据。这项研究包括了5个独立的组,每组实验来自3次重复的结果。

Table 1. Data from a Precision Study

表1. 精密度研究数据

| Run Number | Replicate 1 | Replicate 2 | Replicate 3 | Mean | Standard deviation | %RSD (a) |
|---|---|---|---|---|---|---|
| 1 | 100.70 | 101.05 | 101.15 | 100.97 | 0.236 | 0.234% |
| 2 | 99.46 | 99.37 | 99.59 | 99.47 | 0.111 | 0.111% |
| 3 | 99.96 | 100.17 | 101.01 | 100.38 | 0.556 | 0.554% |
| 4 | 101.80 | 102.16 | 102.44 | 102.13 | 0.321 | 0.314% |
| 5 | 101.91 | 102.00 | 101.67 | 101.86 | 0.171 | 0.167% |

(a) %RSD (percent relative standard deviation) = 100% × (standard deviation/mean)

(a) %RSD (百分相对标准偏差) = 100% × (标准偏差/均值)

Table 1A. Analysis of Variance Table for Data Presented in Table 1

表1A 表1中数据的方差分析表

| Source of Variation | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Squares (a) (MS) | F = MSB/MSW |
|---|---|---|---|---|
| Between runs | 4 | 14.200 | 3.550 | 34.886 |
| Within runs | 10 | 1.018 | 0.102 | |
| Total | 14 | 15.217 | | |

(a) The Mean Squares Between (MSB) = SSBetween/dfBetween and the Mean Squares Within (MSW) = SSWithin/dfWithin

(a) 组间均方 (MSB) = SSBetween/dfBetween及组内均方 (MSW) = SSWithin/dfWithin

Performing an analysis of variance (ANOVA) on the data in Table 1 leads to the ANOVA table (Table 1A). Because there were an equal number of replicates per run in the precision study, values for VarianceRun and VarianceRep can be derived from the ANOVA table in a straightforward manner. The equations below calculate the variability associated with both the runs and the replicates where the MSwithin represents the "error" or "within-run" mean square, and MSbetween represents the "between-run" mean square.

对表1中的数据进行方差分析(ANOVA)可以得到方差分析表(表1A)。因为在精密度研究中每组重复同样次数,可以用一种直接的方式从方差分析表中得出VarianceRun值和VarianceRep值。下列公式可以计算与实验组(runs)相关的变异性和与重复性(replicates)相关的变异性,其中MSwithin表示“错误”或“组内”均方差,MSbetween表示“组间”均方差。

VarianceRun = (MSbetween − MSwithin)/(number of replicates per run) = (3.550 − 0.102)/3 = 1.149

VarianceRep = MSwithin = 0.102

[NOTE—It is common practice to use a value of 0 for VarianceRun when the calculated value is negative.]

Estimates can still be obtained with unequal replication, but the formulas are more complex. Many statistical software packages can easily handle unequal replication. Studying the relative magnitude of the two variance components is important when designing and interpreting a precision study. The insight gained can be used to focus any ongoing procedure improvement effort and, more importantly, it can be used to ensure that procedures are capable of supporting their intended uses. By carefully defining what constitutes a result (i.e., reportable value), one harnesses the power of averaging to achieve virtually any desired precision. That is, by basing the reportable value on an average across replicates and/or runs, rather than on any single result, one can reduce the %RSD, and reduce it in a predictable fashion.

[注意-通常当计算值为负值时,将VarianceRun值实际设为0]。当重复次数不等时也可以得出估计值,但是公式会比较复杂。许多统计软件包可以很容易解决这种情况。在设计和解释精密度研究时,研究两个方差成分相对大小是很重要的。研究所获得的洞察力可以用于关注任何正在进行的优化方法的努力,更重要的是,也可以用于确认方法是可以满足其预期用途的。通过仔细定义结果的组成(比如,报告值),利用平均值的力量就可以事实上获得任何预期的精密度。这就是说,如果基于多次测量及/或多组测量的平均值生成报告值,而不是基于单次测量生成报告值,一个人可以降低%RSD,并且是以可预见的方式降低。

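Table 2 shows the computed variance and %RSD of the mean (i.e., of the reportable value) for different combinations of number of runs and number of replicates per run using the following formulas:

Variance of the mean = VarianceRun/(number of runs) + VarianceRep/[(number of runs) × (number of replicates per run)]

Standard deviation of the mean = √(Variance of the mean)

%RSD = (Standard deviation of the mean/Mean) × 100%

对于实验组数和每组重复次数不同的组合时,表2使用下列公式计算出的方差及均值的%RSD: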

For example, the variance of the mean, standard deviation of the mean, and %RSD of a test involving two runs and three replicates per run are 0.592, 0.769, and 0.76%, respectively, as shown below.

比如,当一项研究包括2个实验组,每组3次重复实验时,如下所示均值的方差,标准偏差及%RSD分别为0.592, 0.769和0.76%。

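Variance of the mean = 1.149/2 + 0.102/(2 × 3) = 0.592

Standard deviation of the mean = √0.592 = 0.769

RSD = (0.769/100.96) × 100% = 0.76%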

where 100.96 is the mean for all the data points in Table 1. As illustrated in Table 2, increasing the number of runs from one to two provides a more dramatic reduction in the variability of the reportable value than does increasing the number of replicates per run.

其中100.96是表1中所有数据的平均值。如表2所示,与增加每组实验中重复次数相比,将实验组数量从1增加到2可以将报告值的方差显著减少。
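The variance-component arithmetic for this appendix can be reproduced from the Table 1 data. The sketch below is not a general ANOVA routine: it assumes equal replication in every run, as in the study described, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean

# Table 1: five runs, three replicates per run.
runs = [
    [100.70, 101.05, 101.15],
    [99.46, 99.37, 99.59],
    [99.96, 100.17, 101.01],
    [101.80, 102.16, 102.44],
    [101.91, 102.00, 101.67],
]


def variance_components(data):
    """One-way ANOVA variance components, assuming equal replication."""
    k, n = len(data), len(data[0])
    grand = mean([x for run in data for x in run])
    run_means = [mean(run) for run in data]
    ss_between = n * sum((m - grand) ** 2 for m in run_means)
    ss_within = sum((x - m) ** 2
                    for run, m in zip(data, run_means) for x in run)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * (n - 1))
    # A negative estimate is conventionally replaced by 0.
    var_run = max(0.0, (ms_between - ms_within) / n)
    return grand, ms_between, ms_within, var_run, ms_within


def precision_of_mean(var_run, var_rep, grand, n_runs, n_reps):
    """Variance, standard deviation, and %RSD of the reportable mean."""
    var_mean = var_run / n_runs + var_rep / (n_runs * n_reps)
    sd_mean = sqrt(var_mean)
    return var_mean, sd_mean, 100 * sd_mean / grand


grand, msb, msw, var_run, var_rep = variance_components(runs)
var_mean, sd_mean, rsd = precision_of_mean(var_run, var_rep, grand, 2, 3)
print(round(msb, 3), round(msw, 3), round(var_run, 3))
print(round(var_mean, 3), round(sd_mean, 3), round(rsd, 2))
```

Changing the last two arguments of `precision_of_mean` shows how the %RSD of the reportable value responds to adding runs versus adding replicates.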

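n = 9

| Data | Deviations from the Median | Absolute Deviations | Absolute Normalized |
|---|---|---|---|
| 100.3 | 0.3 | 0.3 | 2.02 |
| 100.2 | 0.2 | 0.2 | 1.35 |
| 100.1 | 0.1 | 0.1 | 0.67 |
| 100 | 0 | 0 | 0 |
| 100 | 0 | 0 | 0 |
| 100 | 0 | 0 | 0 |
| 99.9 | −0.1 | 0.1 | 0.67 |
| 99.7 | −0.3 | 0.3 | 2.02 |
| 99.5 | −0.5 | 0.5 | 3.37 |
| Median = 100 | | MAD = 0.1 | 0.14 |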
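The median and MAD (median absolute deviation) values tabulated here can be recomputed as follows. This is a sketch under one stated assumption: the scale constant 1.483 is not given in the surviving text and is inferred from the tabulated values (for example, 0.3/(1.483 × 0.1) = 2.02). The function name is hypothetical.

```python
from statistics import median

data = [100.3, 100.2, 100.1, 100, 100, 100, 99.9, 99.7, 99.5]


def normalized_abs_deviations(values, scale=1.483):
    """Return (median, MAD, normalized absolute deviations).

    scale is an assumed constant that makes the MAD comparable to a
    standard deviation for normally distributed data.
    """
    med = median(values)
    abs_dev = [abs(x - med) for x in values]
    mad = median(abs_dev)
    if mad == 0:
        raise ValueError("MAD is zero; normalized deviations are undefined")
    return med, mad, [d / (scale * mad) for d in abs_dev]


med, mad, normalized = normalized_abs_deviations(data)
print(med, round(mad, 2), [round(z, 2) for z in normalized])
```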

APPENDIX D: COMPARISON OF PROCEDURES—PRECISION

附录D:方法比较 - 精密度

The following example illustrates the calculation of a 90% confidence interval for the ratio of (true) variances for the purpose of comparing the precision of two procedures. It is assumed that the underlying distributions of the sample measurements are well-characterized by normal distributions. For this example, assume the laboratory will accept the alternative procedure if its precision (as measured by the variance) is no more than four-fold greater than that of the current procedure.

为了比较两种方法的精密度需要计算(真)方差比值的90%置信区间,下面的实例阐述了计算过程。需要假设样本测量值的分布本质上是良好的正态分布。对于本例,如果替代方法的精密度(以方差计算)不大于现行方法的4倍,实验室就可以接受替代方法。

To determine the appropriate sample size for precision, one possible method involves a trial and error approach using the following formula:

$$\text{Power} = \Pr\left[F > \tfrac{1}{4}\,F_{\alpha,\,n-1,\,n-1}\right]$$

对于精密度实验需要确定适当的样本量,使用下列公式的试错法是一种可能的确定方法:

where n is the smallest sample size required to give the desired power, which is the likelihood of correctly claiming the alternative procedure has acceptable precision when in fact the two procedures have equal precision; α is the risk of wrongly claiming the alternative procedure has acceptable precision; and the 4 is the allowed upper limit for an increase in variance. F-values are found in commonly available tables of critical values of the F-distribution. $F_{\alpha,\,n-1,\,n-1}$ is the upper α percentile of an F-distribution with n − 1 numerator and n − 1 denominator degrees of freedom; that is, the value exceeded with probability α. Suppose initially the laboratory guessed a sample size of 11 per procedure was necessary (10 numerator and denominator degrees of freedom); the power calculation would be as follows⁷:

其中n是获得预期效能的最小样本量,这样就有可能在两种方法实际上等精度时正确地证明替代方法具备适当的精度;α是做出等精度证明错误的概率;4是方差增长的允许下限。F值通常可以从F分布的临界值表中找到。Fα, n-1, n-1值是F分布的上四分位数,这个F分布具有以n-1为分子及分母的自由度;也就是说,这个值超出了概率α。假定实验室最初猜测每个方法11个样本是必需为样本量(分子及分母的自由度均为10),按下式计算效能7:

$$\Pr\left[F > \tfrac{1}{4}\,F_{\alpha,\,n-1,\,n-1}\right] = \Pr\left[F > \tfrac{1}{4}\,F_{.05,\,10,\,10}\right] = \Pr\left[F > (2.978/4)\right] = 0.6751$$

⁷ This could be calculated using a computer spreadsheet. For example, in Microsoft® Excel the formula would be: FDIST((R/A)*FINV(alpha, n - 1, n - 1), n - 1, n - 1), where R is the ratio of variances at which to determine power (e.g., R = 1, which was the value chosen in the power calculations provided in Table 6) and A is the maximum ratio for acceptance (e.g., A = 4). Alpha is the significance level, typically 0.05.

⁷ 可以使用计算表格进行这个运算。比如在Microsoft® Excel中公式应该是:FDIST((R/A)*FINV(alpha, n-1, n-1), n-1, n-1),其中R是用于计算效能的方差比值(如:R=1,这个值可以从效能计算表6中选择),A是最大可接受值(如:A=4)。Alpha是显著性水平,通常为0.05。

In this case the power was only 68%; that is, even if the two procedures had exactly equal variances, with only 11 samples per procedure, there is only a 68% chance that the experiment will lead to data that permit a conclusion of no more than a fourfold increase in variance. Most commonly, sample size is chosen to have at least 80% power, with choices of 90% power or higher also used. To determine the appropriate sample size, various numbers can be tested until a probability is found that exceeds the acceptable limit (e.g., power >0.90). For example, the power determination for sample sizes of 12–20 is displayed in Table 6. In this case, the initial guess at a sample size of 11 was not adequate for comparing precision, but 15 samples per procedure would provide a large enough sample size if 80% power were desired, or 20 per procedure for 90% power.

在这个例子当中效能仅为68%,这就是说,即使两个方法实际上是等方差的,当每个方法拥有11个样本时,实验仅有68%的机会得出方法没有超过4倍的结论。更常见的,样本量的选择至少能满足80%的效能,也会选择90%的效能或者更高。为了决定适当的样本量,会测试不同的数值直到结果超过了可接受的限度(如,效能>0.90)。例如,样本量为12–20时的效能测定值列在表6当中。本例中,最初的猜测样本量为11,其不具备足够的比较精度,但是每个方法15个样本就可以在预期80%效能时提供足够的样本量,或者在预期90%效能时需要20个样本。

Table 6. Power Determinations for Various Sample Sizes (Specific to the Example in Appendix D)

Typically the sample size for precision comparisons will be larger than for accuracy comparisons. If the sample size for precision is so large as to be impractical for the laboratory to conduct the study, there are some options. The first is to reconsider the choice of an allowable increase in variance. For larger allowable increases in variance, the required sample size for a fixed power will be smaller. Another alternative is to plan an interim analysis at a smaller sample size, with the possibility of proceeding to a larger sample size if needed. In this case, it is strongly advisable to seek professional help from a statistician.

精度度比较的典型样本量会比实际比较时的大一些。如果精密度的样本量过大对于实验室进行研究就不实际了,这时可以有一些选择。第一个是重新选择增加的允许值。如果对方差增加的允许值大一些,对于相同效能所需的样本量会小一些。另一个选择是计划使用小样本量进行一个中间分析,可以使用大样本量的概率。本例中,强烈建议向统计学家寻求帮助。

Now, suppose the laboratory opts for 90% power and obtains the results presented in Table 7 based on the data generated from 20 independent runs per procedure.

现在,假定基于每个方法20次独立测试所获得的数据,实验室选择了90%效能,表7显示所获得的结果。

Table 7. Example of Measures of Variance for Independent Runs (Specific to the Example in Appendix D)

| Procedure | Variance (standard deviation) | Sample Size | Degrees of Freedom |
|---|---|---|---|
| Alternative | 45.0 (6.71) | 20 | 19 |
| Current | 25.0 (5.00) | 20 | 19 |

Ratio = alternative procedure variance/current procedure variance = 45.0/25.0 = 1.8

Lower limit of confidence interval = ratio/F.05 = 1.8/2.168 = 0.83

Upper limit of confidence interval = ratio/F.95 = 1.8/0.461 = 3.90

For this application, a 90% (two-sided) confidence interval is used when a 5% one-sided test is sought. The test is one-sided, because only an increase in standard deviation of the alternative procedure is of concern. Some care must be exercised in using two-sided intervals in this way, as they must have the property of equal tails; most common intervals have this property. Because the one-sided upper confidence limit, 3.90, is less than the allowed limit, 4.0, the study has demonstrated that the alternative procedure has acceptable precision. If the same results had been obtained from a study with a sample size of 15 (as if 80% power had been chosen), the laboratory would not be able to conclude that the alternative procedure had acceptable precision (upper confidence limit of 4.47).

在这种情况下,当寻求单侧5%区间时,需要使用90%(双侧)置信区间。检验是单侧的,因为只有替代方法标准偏差的增加才是需要考虑的。此时使用双侧区间需要加以小心,因为他们必须具备等尾的特性-通常区间具备这一属性。因为单侧置信上限(3.90)小于允许限度(4.0),研究显示替代方法具有可接受的精密度。如果使用样本量15的研究获得了相同的结果,假设选择了80%效能,实验室不能做出替代方法具有可接受精密度的结论(此时置信上限为4.47)。
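The power calculation and the confidence limits in this appendix can be checked without statistical tables. The sketch below uses only the Python standard library, so the F-distribution tail probability is obtained from the regularized incomplete beta function (evaluated by a standard continued fraction) and critical values by bisection. All function names are hypothetical, and this is one possible implementation, not the method prescribed by the chapter; a statistics package would normally be used instead.

```python
from math import exp, lgamma, log, log1p


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for aa in (even, odd):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
             + a * log(x) + b * log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def f_sf(x, df_num, df_den):
    """Pr[F > x] for an F-distribution (same quantity as Excel FDIST)."""
    return betai(df_den / 2.0, df_num / 2.0, df_den / (df_den + df_num * x))


def f_crit(alpha, df_num, df_den):
    """Value exceeded with probability alpha (same quantity as Excel FINV)."""
    lo, hi = 0.0, 1.0
    while f_sf(hi, df_num, df_den) > alpha:
        hi *= 2.0
    for _ in range(100):  # bisection on the decreasing tail probability
        mid = 0.5 * (lo + hi)
        if f_sf(mid, df_num, df_den) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power(n, alpha=0.05, ratio=1.0, max_ratio=4.0):
    """Power for n results per procedure: Pr[F > (R/A) * F_alpha]."""
    df = n - 1
    return f_sf(ratio / max_ratio * f_crit(alpha, df, df), df, df)


def variance_ratio_ci(var_alt, var_cur, n_alt, n_cur, alpha=0.05):
    """Equal-tailed (1 - 2*alpha) confidence interval for the variance ratio."""
    ratio = var_alt / var_cur
    dfs = (n_alt - 1, n_cur - 1)
    return ratio / f_crit(alpha, *dfs), ratio / f_crit(1.0 - alpha, *dfs)


for n in range(11, 21):
    print(n, round(power(n), 4))
print(variance_ratio_ci(45.0, 25.0, 20, 20))
```

The loop prints the power for 11 to 20 results per procedure, the kind of listing Table 6 is described as containing, and the last line recomputes the confidence limits for the Table 7 data.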

APPENDIX E: COMPARISON OF PROCEDURES—DETERMINING THE LARGEST ACCEPTABLE DIFFERENCE, δ, BETWEEN TWO PROCEDURES

This Appendix describes one approach to determining the difference, δ, between two procedures (alternative − current), a difference that, if achieved, still leads to the conclusion of equivalence between the two procedures. Without any other prior information to guide the laboratory in the choice of δ, it is a reasonable way to proceed. Sample size calculations under various scenarios are discussed in this Appendix.

这是一种改进过的ESD检验,它可以从一个正态分布的总体当中发现预先设定数量(r)的异常值。对于仅检测1个异常值的情况,极端学生化偏离检验也就是常说的Grubb's检验。不建议将Grubb's检验用于多个异常值的检验。设定r=2,而n=10。

Tolerance Interval Determination

Suppose the process mean and the standard deviation are both unknown, but a sample of size 50 produced a mean and standard deviation of 99.5 and 2.0, respectively. These values were calculated using the last 50 results generated by this specific procedure for a particular (control) sample. Given this information, the tolerance limits can be calculated by the following formula:


$$\bar{x} \pm Ks$$

in which $\bar{x}$ is the mean; s is the standard deviation; and K is based on the level of confidence, the proportion of results to be captured in the interval, and the sample size, n. Tables providing K values are available. In this example, the value of K required to enclose 95% of the population with 95% confidence for 50 samples is 2.382⁸. The tolerance limits are calculated as follows:


99.5 ± 2.382 × 2.0

hence, the tolerance interval is (94.7, 104.3).
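The tolerance-limit arithmetic above, together with a comparison against specification limits, might be sketched as follows. The specification limits of 90.0 and 110.0 are hypothetical values chosen only for illustration, as are the function and variable names; K is taken from the text rather than computed.

```python
def tolerance_interval(x_bar, s, k):
    """Two-sided tolerance limits: x_bar -/+ k * s."""
    return x_bar - k * s, x_bar + k * s


# Mean, standard deviation, and K factor quoted in the example.
low, high = tolerance_interval(x_bar=99.5, s=2.0, k=2.382)
print(round(low, 1), round(high, 1))  # 94.7 104.3

# Hypothetical specification limits, for illustration only.
spec_low, spec_high = 90.0, 110.0
within_spec = spec_low <= low and high <= spec_high
# Margin left between the tolerance limits and each specification limit.
margins = (low - spec_low, spec_high - high)
print(within_spec, margins)
```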


Comparison of the Tolerance Limits to the Specification Limits

⁸ There are existing tables of tolerance factors that give approximate values and thus differ slightly from the values reported here.

APPENDIX G: ADDITIONAL SOURCES OF INFORMATION

附录G:其他的信息来源

There may be a variety of statistical tests that can be used to evaluate any given set of data. This chapter presents several tests for interpreting and managing analytical data, but many other similar tests could also be employed. The chapter simply illustrates the analysis of data using statistically acceptable methods. As mentioned in the Introduction, specific tests are presented for illustrative purposes, and USP does not endorse any of these tests as the sole approach for handling analytical data.

可能有许多统计检验可以用于评估任何给定的数据组。本章展示了一些检验被用来解释和管理分析数据,但是许多其他的相似检验也可以使用。本章使用统计上可接受的方法简单阐述了数据分析。如在“介绍”所述,为了说明的目的,使用了特定的方法,同时USP并不将这些方法中的任何一种检验作为处理分析数据的唯一方法。

Additional information and alternative tests can be found in the references listed below or in many statistical textbooks.

其他的信息和替代的检验可以从下列参考文献或者许多统计教材中获得。

Control Charts:

1. Manual on Presentation of Data and Control Chart Analysis, 6th ed., American Society for Testing and Materials (ASTM), Philadelphia, 1996.

2. Grant, E.L., Leavenworth, R.S., Statistical Quality Control, 7th ed., McGraw-Hill, New York, 1996.

3. Montgomery, D.C., Introduction to Statistical Quality Control, 3rd ed., John Wiley and Sons, New York, 1997.

4. Ott, E., Schilling, E., Neubauer, D., Process Quality Control: Troubleshooting and Interpretation of Data, 3rd ed., McGraw-Hill, New York, 2000.

Detectable Differences and Sample Size Determination:

1. CRC Handbook of Tables for Probability and Statistics, 2nd ed., Beyer W.H., ed., CRC Press, Inc., Boca Raton, FL, 1985.

2. Cohen, J., Statistical Power Analysis for the Behavioral Sciences, 2nd ed., Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.

3. Diletti, E., Hauschke, D., Steinijans, V.W., "Sample size determination for bioequivalence assessment by means of confidence intervals," International Journal of Clinical Pharmacology, Therapy and Toxicology, 1991; 29:1–8.

4. Fleiss, J.L., The Design and Analysis of Clinical Experiments, John Wiley and Sons, New York, 1986, pp. 369–375.

5. Juran, J.A., Godfrey, B., Juran's Quality Handbook, 5th ed., McGraw-Hill, 1999, Section 44, Basic Statistical Methods.

6. Lipsey, M.W., Design Sensitivity: Statistical Power for Experimental Research, Sage Publications, Newbury Park, CA, 1990.

7. Montgomery, D.C., Design and Analysis of Experiments, John Wiley and Sons, New York, 1984.

8. Natrella, M.G., Experimental Statistics Handbook 91, National Institute of Standards and Technology, Gaithersburg, MD, 1991 (reprinting of original August 1963 text).

9. Kraemer, H.C., Thiemann, S., How Many Subjects? Statistical Power Analysis in Research, Sage Publications, Newbury Park, CA, 1987.

10. van Belle G., Martin, D.C., "Sample size as a function of coefficient of variation and ratio of means," American Statistician, 1993; 47(3):165–167.

11. Westlake, W.J., response to Kirkwood, T.B.L.: "Bioequivalence testing: a need to rethink," Biometrics, 1981; 37:589–594.

General Statistics Applied to Pharmaceutical Data:

1. Bolton, S., Pharmaceutical Statistics: Practical and Clinical Applications, 3rd ed., Marcel Dekker, New York, 1997.

2. Bolton, S., "Statistics," Remington: The Science and Practice of Pharmacy, 20th ed., Gennaro, A.R., ed., Lippincott Williams and Wilkins, Baltimore, 2000, pp. 124–158.

3. Buncher, C.R., Tsay, J., Statistics in the Pharmaceutical Industry, Marcel Dekker, New York, 1981.

4. Natrella, M.G., Experimental Statistics Handbook 91, National Institute of Standards and Technology (NIST), Gaithersburg, MD, 1991 (reprinting of original August 1963 text).

5. Zar, J., Biostatistical Analysis, 2nd ed., Prentice Hall, Englewood Cliffs, NJ, 1984.

6. De Muth, J.E., Basic Statistics and Pharmaceutical Statistical Applications, 3rd ed., CRC Press, Boca Raton, FL, 2014.

General Statistics Applied to Analytical Laboratory Data:

1. Gardiner, W.P., Statistical Analysis Methods for Chemists, The Royal Society of Chemistry, London, England, 1997.

2. Kateman, G., Buydens, L., Quality Control in Analytical Chemistry, 2nd ed., John Wiley and Sons, New York, 1993.

3. Kenkel, J., A Primer on Quality in the Analytical Laboratory, Lewis Publishers, Boca Raton, FL, 2000.

4. Mandel, J., Evaluation and Control of Measurements, Marcel Dekker, New York, 1991.

5. Melveger, A.J., "Statistics in the pharmaceutical analysis laboratory," Analytical Chemistry in a GMP Environment, Miller J.M., Crowther J.B., eds., John Wiley and Sons, New York, 2000.

6. Taylor, J.K., Statistical Techniques for Data Analysis, Lewis Publishers, Boca Raton, FL, 1990.

7. Thode, H.C., Jr., Testing for Normality, Marcel Dekker, New York, NY, 2002.

8. Taylor, J.K., Quality Assurance of Chemical Measurements, Lewis Publishers, Boca Raton, FL, 1987.

9. Wernimont, G.T., Use of Statistics to Develop and Evaluate Analytical Methods, Association of Official Analytical Chemists (AOAC), Arlington, VA, 1985.

10. Youden, W.J., Steiner, E.H., Statistical Manual of the AOAC, AOAC, Arlington, VA, 1975.

Nonparametric Statistics:

1. Conover, W.J., Practical Nonparametric Statistics, 3rd ed., John Wiley and Sons, New York, 1999.

2. Gibbons, J.D., Chakraborti, S., Nonparametric Statistical Inference, 3rd ed., Marcel Dekker, New York, 1992.

3. Hollander, M., Wolfe, D., Nonparametric Statistical Methods, 2nd ed., John Wiley and Sons, NY, 1999.

Outlier Tests:

1. Barnett, V., Lewis, T., Outliers in Statistical Data, 3rd ed., John Wiley and Sons, New York, 1994. 2. B?hrer, A., ―One-sided and two-sided critical values for Dixon's Outlier Test for sample sizes up to n = 30,‖ Economic Quality Control, Vol. 23 (2008), No. 1, pp. 5–13.

3. Davies, L., Gather, U., ―The identification of multiple outliers,‖ Journal of the American Statistical Association (with comments), 1993; 88:782–801.

4. Dixon, W.J., ―Processing data for outliers,‖ Biometrics, 1953; 9(1):74–89.

5. Grubbs, F.E., “Procedures for detecting outlying observations in samples,” Technometrics, 1969; 11:1–21.

6. Hampel, F.R., “The breakdown points of the mean combined with some rejection rules,” Technometrics, 1985; 27:95–107.

7. Hoaglin, D.C., Mosteller, F., Tukey, J., eds., Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York, 1983.

8. Iglewicz, B., Hoaglin, D.C., How to Detect and Handle Outliers, American Society for Quality Control Quality Press, Milwaukee, WI, 1993.

9. Rosner, B., “Percentage points for a generalized ESD many-outlier procedure,” Technometrics, 1983; 25:165–172.

10. Standard E-178-94: Standard Practice for Dealing with Outlying Observations, American Society for Testing and Materials (ASTM), West Conshohocken, PA, September 1994.

11. Rorabacher, D.B., “Statistical treatment for rejections of deviant values: critical values of Dixon's ‘Q’ parameter and related subrange ratios at the 95% confidence level,” Analytical Chemistry, 1991; 63(2):139–146.

Precision and Components of Variability:

1. Hicks, C.R., Turner, K.V., Fundamental Concepts in the Design of Experiments, 5th ed., Oxford University Press, 1999 (section on Repeatability and Reproducibility of a Measurement System).

2. Kirk, R.E., Experimental Design: Procedures for the Behavioral Sciences, Brooks/Cole, Belmont, CA, 1968, pp. 61–63.

3. Kirkwood, T.B.L., “Geometric means and measures of dispersion,” Letter to the Editor, Biometrics, 1979; 35(4).

4. Milliken, G.A., Johnson, D.E., Analysis of Messy Data, Volume 1: Designed Experiments, Van Nostrand Reinhold Company, New York, NY, 1984, pp. 19–23.

5. Searle, S.R., Casella, G., McCulloch, C.E., Variance Components, John Wiley and Sons, New York, 1992.

6. Snedecor, G.W., Cochran, W.G., Statistical Methods, 8th ed., Iowa State University Press, Ames, IA, 1989.

7. Standard E-691-87: Practice for Conducting an Interlaboratory Study to Determine the Precision of a Test Method, ASTM, West Conshohocken, PA, 1994.

8. Hauck, W.W., Koch, W., Abernethy, D., Williams, R., “Making sense of trueness, precision, accuracy, and uncertainty,” Pharmacopeial Forum, 2008; 34(3).

Tolerance Interval Determination:

1. Hahn, G.J., Meeker, W.Q., Statistical Intervals: A Guide for Practitioners, John Wiley and Sons, New York, 1991.

2. Odeh, R.E., “Tables of two-sided tolerance factors for a normal distribution,” Communications in Statistics: Simulation and Computation, 1978; 7:183–201.
