Big Matters with Small Numbers: Narrow Populations      

 

            Presentation at the MCH Epi Conference on 15 December 2010 (PS PDF)

                        Pages 1 through 47 contain the core material and were discussed at the presentation on 15 December 2010.  Pages 48 through 54 contain the ancillary notes.

 

            Spreadsheet implementing the methods (XLS)

 

 

Big Matters with Small Numbers: Rare Events

 

            Presentation at the MCH Epi Conference on 11 December 2008 (PS PDF)

            Presentation at the NBDPN Annual Meeting on 24 February 2009 (PS PDF)

                        Pages 1 through 41 contain the core material and were discussed at the presentations on 11 December 2008 and 24 February 2009.  Pages 42 through 52 contain the ancillary notes.

           

            Down’s syndrome data set (XLS) {> 12 MB}

                                    Description on page 2 of presentation file

 

            Cleft lip/palate data set (XLS) {> 14 MB}

                                    Description on page 43 of presentation file

 

            Postneonatal mortality data set (XLS) {> 11 MB}

                                    Description on page 43 of presentation file

 

 

            Questions from presentation attendees and my answers:

           

            1. What are the advantages or disadvantages of rolling aggregation (e.g., looking at overlapping time periods such as 1991-1995, 1992-1996, etc.)?

 

            The main advantage of rolling aggregation (e.g., looking at 1991-1995, 1992-1996, …, 2001-2005) rather than aggregation based on nonoverlapping time periods (e.g., looking at 1991-1995, 1996-2000, 2001-2005) is that temporal trends may be perceived more easily.  For instance, a plot of estimated risks based on rolling five-year aggregation from 1991 to 2005 would have 11 points, whereas a plot of estimated risks based on nonoverlapping five-year time periods would have only 3 points.  A clear decreasing (or increasing) pattern formed from 11 points would be more persuasive of a temporal change than such a pattern formed from only 3 points.

 

            The main disadvantage of rolling aggregation is that estimated risks based on overlapping time periods cannot be compared using procedures for two independent samples, even if the aggregation overcomes the “small numbers” issue.  We cannot, for example, use an ordinary Z test or chi-square test to say whether the estimated risk over 1992-1996 is significantly different from the estimated risk over 1991-1995.  This is because the data from 1992-1996 are not statistically independent of the data from 1991-1995.
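            As a concrete illustration, here is a small Python/pandas sketch contrasting the two aggregation schemes.  All of the annual counts below are made up for demonstration purposes:

            import pandas as pd

            # Hypothetical annual event counts and denominators, 1991-2005
            years = range(1991, 2006)
            df = pd.DataFrame({"events": [12, 9, 11, 8, 10, 7, 9, 8, 6, 7, 5, 6, 4, 5, 4],
                               "births": [20000] * 15}, index=years)

            # Rolling five-year aggregation: 11 overlapping windows, 1991-1995, ..., 2001-2005
            rolling = df.rolling(5).sum().dropna()
            rolling["rate_per_10000"] = rolling["events"] / rolling["births"] * 10000

            # Nonoverlapping five-year aggregation: only 3 windows
            blocks = df.groupby((df.index - 1991) // 5).sum()
            blocks["rate_per_10000"] = blocks["events"] / blocks["births"] * 10000

            print(rolling)   # 11 points to plot
            print(blocks)    # only 3 points to plot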

 

            2. How do Wilson’s score intervals work?

 

            The familiar 95% confidence interval for a risk is given by

            \hat{p} \pm 1.96 \sqrt{ \hat{p} (1 - \hat{p}) / n }

and is called a Wald interval.  Thus, a number  p0  is included in the Wald interval if and only if 

            | \hat{p} - p_0 | < 1.96 \sqrt{ \hat{p} (1 - \hat{p}) / n }.

            An alternative confidence interval is the score interval, sometimes called the Wilson score interval.  A number  p0  is included in the score interval if and only if

            | \hat{p} - p_0 | < 1.96 \sqrt{ p_0 (1 - p_0) / n }.

Some algebraic manipulations put this in a more convenient form for computational purposes: the two endpoints of the score interval are

            [ 2 \hat{p} + 3.84 / n \pm \sqrt{ (2 \hat{p} + 3.84 / n)^2 - 4 \hat{p}^2 (1 + 3.84 / n) } ] / [ 2 (1 + 3.84 / n) ],

where 3.84 is simply 1.96 squared; the minus sign gives the lower endpoint and the plus sign gives the upper endpoint.

            The score interval has the advantage that, unlike the Wald interval, it can never contain a negative number.  Yet the score interval is also based on normal theory, so it is not really suitable for “small numbers” data, even though it may not be as glaringly bad as the Wald interval.
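            To make the comparison concrete, here is a minimal Python sketch (the function names are mine) that computes both intervals from the formulas above; note that the Wald lower endpoint can go negative while the score endpoints always stay within [0, 1]:

            import math

            def wald_interval(x, n, z=1.96):
                """Wald 95% CI: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
                p = x / n
                half = z * math.sqrt(p * (1 - p) / n)
                return p - half, p + half

            def wilson_interval(x, n, z=1.96):
                """Wilson score 95% CI, from solving |p_hat - p0| = z * sqrt(p0 * (1 - p0) / n)."""
                p = x / n
                z2n = z * z / n                      # 3.84 / n when z = 1.96
                center = 2 * p + z2n
                half = math.sqrt(center ** 2 - 4 * p ** 2 * (1 + z2n))
                denom = 2 * (1 + z2n)
                return (center - half) / denom, (center + half) / denom

            # Example: 2 events among 5000 births
            print(wald_interval(2, 5000))    # lower endpoint is negative
            print(wilson_interval(2, 5000))  # both endpoints stay in [0, 1]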

 

            3. Suppose we want to test the null hypothesis that  p1 = p2.  Can we do so by constructing a confidence interval for  p1, constructing a confidence interval for  p2, and rejecting the null hypothesis if and only if the two individual confidence intervals have no overlap?

 

            This strategy, commonly advocated for “small numbers” data, seems reasonable intuitively but is unnecessarily conservative.  If the confidence intervals have a little overlap, this strategy accepts the null hypothesis even though rejection may have been possible with a less conservative procedure.

            As an example, suppose that there are 2 events in one stratum of size 5000 and 10 events in another stratum of size 5000.  The binomial method on pages 18 to 26 of my MCH Epi presentation yields a confidence interval of 0.022 to 0.938 for the relative risk  p1/p2.  Because this interval excludes 1, the null hypothesis that  p1 = p2  should be rejected.  However, if we construct confidence intervals for  p1  and  p2  individually using the Poisson method on pages 11 to 17 of my MCH Epi presentation, we obtain 0.5 to 14.4 per 10000 and 9.6 to 36.7 per 10000.  These confidence intervals overlap, so we would unnecessarily accept the null hypothesis if we made a decision based on the individual confidence intervals.
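            For readers who want to reproduce these numbers without the spreadsheet, the following Python sketch implements one standard version of each calculation; treat it as an independent reconstruction rather than a transcription of the presentation.  The relative-risk interval conditions on the total number of events, so that the count in the first stratum is binomial, and converts a Clopper-Pearson interval back to the risk-ratio scale; the individual intervals are exact Poisson intervals obtained through the chi-square relationship:

            from scipy.stats import beta, chi2

            def poisson_exact_ci(x, alpha=0.05):
                """Exact (Garwood) CI for a Poisson mean, via the chi-square link."""
                lo = chi2.ppf(alpha / 2, 2 * x) / 2 if x > 0 else 0.0
                hi = chi2.ppf(1 - alpha / 2, 2 * x + 2) / 2
                return lo, hi

            def rr_exact_ci(x1, n1, x2, n2, alpha=0.05):
                """Exact CI for p1/p2: condition on x1 + x2, so x1 is binomial with
                success probability theta = n1*p1 / (n1*p1 + n2*p2); invert a
                Clopper-Pearson interval for theta back to the risk-ratio scale."""
                m = x1 + x2
                th_lo = beta.ppf(alpha / 2, x1, m - x1 + 1) if x1 > 0 else 0.0
                th_hi = beta.ppf(1 - alpha / 2, x1 + 1, m - x1) if x1 < m else 1.0
                # theta = RR / (RR + n2/n1)  =>  RR = (n2/n1) * theta / (1 - theta)
                k = n2 / n1
                return k * th_lo / (1 - th_lo), k * th_hi / (1 - th_hi)

            print(rr_exact_ci(2, 5000, 10, 5000))            # ~ (0.022, 0.938): excludes 1
            for x in (2, 10):                                # individual risks, per 10000
                lo, hi = poisson_exact_ci(x)
                print(lo / 5000 * 10000, hi / 5000 * 10000)  # ~ (0.5, 14.4) and (9.6, 36.7)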

 

            4. If rare events data arise from a survey with a complex design, how can we account for the survey design when constructing a confidence interval for a risk?  In particular, is the Poisson approach from the workshop still applicable with some simple modifications? 

 

            The Poisson approach from the workshop is no longer applicable.  However, there are a number of options for analyzing rare events data that arise from a survey with a complex design.  Some of them are discussed and exemplified on pages 64-68 of Analysis of Health Surveys by Korn and Graubard (Wiley, 1999).  The option that the authors favor, and that I agree appears most reasonable, begins with the calculation of an effective sample size using their formula 3.2-6 on page 65.  This effective sample size depends on the survey design.  Once the effective sample size has been calculated, formula 3.2-5 on page 65 can be applied to obtain the confidence interval.  Formula 3.2-5 is based on a principle that statisticians refer to as “pivoting the cumulative distribution function”.  This contrasts with the Poisson approach from the workshop, which is based on a principle that statisticians refer to as “inversion”.
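            I have not reproduced the book's formulas here, but the widely implemented version of this approach (for example, in SAS and SUDAAN) computes an effective sample size  n* = \hat p (1 - \hat p) / v(\hat p), where  v(\hat p)  is the design-based variance estimate of the risk, and then applies the pivoted (Clopper-Pearson) interval with  n*  and  x* = n* \hat p  in place of the actual sample size and count.  Here is a minimal Python sketch under that assumption; whether it matches formulas 3.2-5 and 3.2-6 exactly should be checked against the book:

            from scipy.stats import beta

            def korn_graubard_ci(p_hat, var_hat, alpha=0.05):
                """Assumed form of the effective-sample-size interval: Clopper-Pearson
                applied to n_eff = p_hat * (1 - p_hat) / var_hat, where var_hat is the
                design-based variance estimate of p_hat from the survey."""
                n_eff = p_hat * (1 - p_hat) / var_hat
                x_eff = n_eff * p_hat
                lo = beta.ppf(alpha / 2, x_eff, n_eff - x_eff + 1) if x_eff > 0 else 0.0
                hi = beta.ppf(1 - alpha / 2, x_eff + 1, n_eff - x_eff)
                return lo, hi

            # Hypothetical example: estimated risk 0.002 with design-based variance 1.6e-6
            # (a design effect of about 4 relative to a simple random sample of n = 5000)
            print(korn_graubard_ci(0.002, 1.6e-6))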

 

            5. How can we model temporal trends in rare event rates?

 

            One might like to plot the rare event rates and then use a regression model to fit a line or a parabola through the points so plotted.  Two difficulties arise: (i) rare event rates tend to be unstable from year to year, so such a plot may not elucidate the temporal trends clearly; and (ii) the usual assumption of normally distributed errors for regression modeling will not be valid.

            Both of these difficulties can be ameliorated by rolling aggregation (see question 1 above), but this introduces a third problem: (iii) rolling multi-year rare event rates are not statistically independent, which violates another one of the usual assumptions for regression modeling.  Therefore, to model rolling multi-year rare event rates, we need a regression-like approach that does not require the independence assumption.

            Three regression-like approaches that do not entail the independence assumption are: (1) generalized estimating equations; (2) mixed modeling, sometimes also called multilevel modeling or random coefficient analysis; and (3) time series modeling.  The first two approaches are described in Chapter 4 of Applied Longitudinal Data Analysis for Epidemiology by Twisk (Cambridge, 2003), while the third approach is the subject of Introduction to Time Series and Forecasting by Brockwell and Davis (Springer, 2002).
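            As a small illustration of approach (1), here is a hedged Python sketch using the generalized estimating equations implementation in statsmodels.  The data are entirely simulated (three hypothetical strata observed over the 11 rolling windows), the stratum serves as the GEE cluster, and an AR(1) working correlation is one reasonable choice for absorbing the dependence created by the overlapping windows:

            import numpy as np
            import statsmodels.api as sm

            # Simulated rolling five-year event counts for three strata (clusters),
            # windows 1991-1995 through 2001-2005, with a gently declining rate
            rng = np.random.default_rng(0)
            n_windows, n_strata = 11, 3
            t = np.tile(np.arange(n_windows, dtype=float), n_strata)
            groups = np.repeat(np.arange(n_strata), n_windows)
            births = np.full(t.shape, 50000.0)
            events = rng.poisson(50000 * 0.0004 * np.exp(-0.03 * t))

            X = sm.add_constant(t)                     # intercept + linear time trend
            model = sm.GEE(events, X, groups=groups,
                           time=t.astype(int),
                           family=sm.families.Poisson(),
                           cov_struct=sm.cov_struct.Autoregressive(),
                           exposure=births)
            result = model.fit()
            print(result.summary())                    # slope = log rate ratio per window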

 

            6. Does the inversion principle apply more generally than indicated in the workshop?  For instance, if we have plentiful-event data and use normal-theory methods, is accepting the null hypothesis that  p = p0  at level 0.05 in a hypothesis test equivalent to including  p0  in the 95% confidence interval for  p ?

 

            Yes, the inversion principle does apply more generally.  Accepting the null hypothesis that  p = p0  in a level 0.05 Wald hypothesis test is equivalent to including  p0  in the 95% Wald confidence interval for  p, and accepting the null hypothesis that  p = p0  in a level 0.05 score hypothesis test is equivalent to including  p0  in the 95% score confidence interval for  p.      

            However, the wrinkle here is that most people use the score method to carry out the hypothesis test while using the Wald method to construct the confidence interval! 

            My answer to question 2 above describes the Wald confidence interval and the score confidence interval.  And, as one may surmise from those descriptions, the Wald hypothesis test entails rejecting the null hypothesis that  p = p0  if  | \hat{p} - p_0 | > 1.96 \sqrt{ \hat{p} (1 - \hat{p}) / n }, whereas the score hypothesis test entails rejecting the null hypothesis if  | \hat{p} - p_0 | > 1.96 \sqrt{ p_0 (1 - p_0) / n }.
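            Continuing the numerical example from my answer to question 2 (2 events in a sample of 5000, so  \hat p = 0.0004), here is a minimal Python sketch of the two rejection rules.  The test value  p0 = 0.0012  is chosen, hypothetically, to fall outside the Wald interval but inside the score interval, so the two tests disagree, and each test agrees with its own confidence interval, just as the inversion principle requires:

            import math

            def wald_test(x, n, p0, z=1.96):
                """Reject H0: p = p0 if |p_hat - p0| > z * sqrt(p_hat * (1 - p_hat) / n)."""
                p = x / n
                return abs(p - p0) > z * math.sqrt(p * (1 - p) / n)

            def score_test(x, n, p0, z=1.96):
                """Reject H0: p = p0 if |p_hat - p0| > z * sqrt(p0 * (1 - p0) / n)."""
                p = x / n
                return abs(p - p0) > z * math.sqrt(p0 * (1 - p0) / n)

            x, n, p0 = 2, 5000, 0.0012
            print(wald_test(x, n, p0))   # True: Wald test rejects
            print(score_test(x, n, p0))  # False: score test does not reject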

 

            7. Can methods described in the workshop be carried out using familiar software packages?

 

            Yes, for the most part.  Both OPENEPI and SABER allow you to obtain a confidence interval for a risk like the one presented in the workshop.  (OPENEPI: Choose “Counts” and then “Proportion”.  SABER: Choose “Estimation/Testing” and then “Confidence Limits for a Binomial Proportion”.)  However, they do not allow you to obtain p-values for customized hypothesis tests involving a risk (cf. p. 14 of the workshop presentation).

            Likewise, both OPENEPI and SABER allow you to obtain a confidence interval for a relative risk like the one presented in the workshop and to obtain a p-value for testing the null hypothesis that the relative risk is one.  (OPENEPI: Choose “Counts” and then “Two by Two Table”.  SABER: Choose “Estimation/Testing” and then “Fisher’s Exact Test for 2 x 2 Tables”.)  However, they do not allow you to obtain p-values for customized hypothesis tests involving a relative risk (cf. p. 23 of the workshop presentation).