Big Matters with Small Numbers: Narrow Populations

Presentation at the MCH Epi Conference on 15 December 2010 (PS, PDF)

Pages 1 through 47 contain the core material and were discussed

at the presentation on 15 December 2010. Pages 48 through 54

contain the ancillary notes.

Spreadsheet implementing the methods (XLS)

Big Matters with Small Numbers: Rare Events

Presentation at the MCH Epi Conference on 11 December 2008 (PS, PDF)

Presentation at the NBDPN Annual Meeting on 24 February 2009 (PS, PDF)

Pages 1 through 41 contain the core material and were discussed at the

presentations on 11 December 2008 and 24 February 2009. Pages 42

through 52 contain the ancillary notes.

Down’s syndrome data set (XLS) {> 12 MB}

Description on page 2 of presentation file

Cleft lip/palate data set (XLS) {> 14 MB}

Description on page 43 of presentation file

Postneonatal mortality data set (XLS) {> 11 MB}

Description on page 43 of presentation file

Questions from presentation attendees and my answers:

1. What are the advantages or disadvantages of rolling aggregation (e.g., looking at overlapping time periods such as 1991-1995, 1992-1996, etc.)?

The main advantage of rolling aggregation (e.g., looking at 1991-1995, 1992-1996, …, 2001-2005) rather than aggregation based on nonoverlapping time periods (e.g., looking at 1991-1995, 1996-2000, 2001-2005) is that temporal trends may be perceived more easily. For instance, a plot of estimated risks based on rolling five-year aggregation from 1991 to 2005 would have 11 points, whereas a plot of estimated risks based on nonoverlapping five-year time periods would have only 3 points. A clear decreasing (or increasing) pattern formed from 11 points would be more persuasive of a temporal change than such a pattern formed from only 3 points.

The main disadvantage of rolling aggregation is that estimated risks based on overlapping time periods cannot be compared using procedures for two independent samples, even if the aggregation overcomes the “small numbers” issue. We cannot, for example, use an ordinary Z test or chi-square test to say whether the estimated risk over 1992-1996 is significantly different from the estimated risk over 1991-1995. This is because the data from 1992-1996 are not statistically independent of the data from 1991-1995.
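
The contrast between 11 rolling points and 3 nonoverlapping points can be sketched in a few lines of Python. The yearly event counts and denominators below are invented purely for illustration; the function name is mine.

```python
# Illustrative sketch: rolling vs. nonoverlapping five-year aggregation.
# The yearly counts and denominators below are made up for illustration.

years = list(range(1991, 2006))                            # 1991 through 2005
events = [4, 6, 3, 5, 7, 2, 6, 4, 5, 3, 6, 5, 4, 7, 5]    # hypothetical counts
denoms = [5000] * len(years)                               # hypothetical births per year

def window_rate(start_index, width=5):
    """Estimated risk over a width-year window starting at start_index."""
    e = sum(events[start_index:start_index + width])
    n = sum(denoms[start_index:start_index + width])
    return e / n

# Rolling five-year aggregation: 1991-1995, 1992-1996, ..., 2001-2005.
rolling = [window_rate(i) for i in range(len(years) - 4)]

# Nonoverlapping five-year aggregation: 1991-1995, 1996-2000, 2001-2005.
nonoverlapping = [window_rate(i) for i in range(0, len(years), 5)]

print(len(rolling), len(nonoverlapping))   # 11 points vs. 3 points
```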

2. How do the Wald and score confidence intervals for a risk differ?

The familiar 95% confidence interval for a risk is given by

\hat p ± 1.96 \sqrt{ \hat p (1 – \hat p) / n }

and is called a Wald interval. Thus, a number *p*_{0} is included in the Wald interval if and only if

| \hat p – *p*_{0} | ≤ 1.96 \sqrt{ \hat p (1 – \hat p) / n }.

An alternative confidence interval is the score interval, sometimes called the Wilson interval. A number *p*_{0} is included in the score interval if and only if

| \hat p – *p*_{0} | ≤ 1.96 \sqrt{ *p*_{0} (1 – *p*_{0}) / n }.

Some algebraic manipulations put this in a more convenient form for computational purposes: the endpoints of the score interval are

[ 2 \hat p + 3.84 / n ± \sqrt{ (2 \hat p + 3.84 / n)^2 – 4 \hat p^2 (1 + 3.84 / n) } ] / [ 2 (1 + 3.84 / n) ].

The score interval has the advantage that, unlike the Wald interval, it can never contain a negative number. Yet, the score interval is also based on normal theory, so it is not really suitable for “small numbers” data – even though it may not be as glaringly bad as the Wald interval.
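
Both intervals are easy to compute directly. The sketch below is my own implementation, not code from the workshop; the function names are invented, and z² is computed exactly rather than rounded to 3.84.

```python
import math

def wald_interval(x, n, z=1.96):
    """Wald 95% interval: p-hat +/- z*sqrt(p-hat*(1 - p-hat)/n).
    The lower limit can dip below zero."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def score_interval(x, n, z=1.96):
    """Score (Wilson) 95% interval via the computational form in the
    answer to question 2, with z*z used in place of the rounded 3.84."""
    p = x / n
    q = z * z / n
    disc = math.sqrt((2 * p + q) ** 2 - 4 * p * p * (1 + q))
    lo = (2 * p + q - disc) / (2 * (1 + q))
    hi = (2 * p + q + disc) / (2 * (1 + q))
    return lo, hi

# With 2 events out of 5000, the Wald lower limit is negative,
# while the score interval stays within [0, 1].
print(wald_interval(2, 5000))
print(score_interval(2, 5000))
```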

3. Suppose
we want to test the null hypothesis that *p*_{1}
= *p*_{2}. Can we do so by constructing a confidence
interval for *p*_{1}, constructing a
confidence interval for *p*_{2}, and rejecting the null
hypothesis if and only if the two individual confidence intervals have no
overlap?

This strategy, commonly advocated for “small numbers” data, seems reasonable intuitively but is unnecessarily conservative. If the confidence intervals have a little overlap, this strategy accepts the null hypothesis even though rejection may have been possible with a less conservative procedure.

As an
example, suppose that there are 2 events in one stratum of size 5000 and 10
events in another stratum of size 5000.
The binomial method on pages 18 to 26 of my MCH Epi
presentation yields a confidence interval of 0.022 to 0.938 for the relative
risk *p*_{1}/*p*_{2}, so the null hypothesis
that *p*_{1}
= *p*_{2} should be rejected. However, if we construct confidence intervals
for *p*_{1} and *p*_{2} individually using the Poisson method on
pages 11 to 17 of my MCH Epi presentation, we obtain
0.5 to 14.4 per 10000 and 9.6 to 36.7 per 10000. These confidence intervals overlap, so we
would unnecessarily accept the null hypothesis if we made a decision based on
the individual confidence intervals.
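
Assuming the Poisson method in the presentation is the usual exact Poisson interval, the individual limits in this example can be reproduced by bisecting the Poisson distribution function. This is a stand-in sketch of my own (the workshop's actual computation may be organized differently); the function names are invented.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def exact_poisson_ci(k, alpha=0.05):
    """Exact 95% CI for a Poisson mean, found by bisecting the CDF
    (the CDF is decreasing in lam, so bisection converges)."""
    def solve(target_cdf, kk):
        lo, hi = 0.0, 10.0 * (k + 5)
        for _ in range(200):
            mid = (lo + hi) / 2
            if poisson_cdf(kk, mid) > target_cdf:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(1 - alpha / 2, k - 1)
    upper = solve(alpha / 2, k)
    return lower, upper

n = 5000
for k in (2, 10):
    lo, hi = exact_poisson_ci(k)
    print(k, round(lo / n * 10000, 1), "to", round(hi / n * 10000, 1), "per 10000")
```

The two intervals printed should land close to the ones quoted above, and they overlap even though the relative-risk interval excludes one.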

4. If rare events data arise from a survey with a complex design, how can we account for the survey design when constructing a confidence interval for a risk? In particular, is the Poisson approach from the workshop still applicable with some simple modifications?

The Poisson
approach from the workshop is no longer applicable. However, there are a number of options for
analyzing rare events data that arise from a survey with a complex design. Some of them are discussed and exemplified on
pages 64-68 of *Analysis of Health Surveys*
by Korn and Graubard
(Wiley, 1999). The option that the
authors favor, and that I agree appears most reasonable, begins with the
calculation of an effective sample size using their formula 3.2-6 on page
65. This effective sample size depends
on the survey design. Once the effective
sample size has been calculated, formula 3.2-5 on page 65 can be applied to
obtain the confidence interval. Formula
3.2-5 is based on a principle that statisticians refer to as “pivoting the
cumulative distribution function”. This
contrasts with the Poisson approach from the workshop, which is based on a
principle that statisticians refer to as “inversion”.

5. How can we model temporal trends in rare event rates?

One might like to plot the rare event rates and then use a regression model to fit a line or a parabola through the points so plotted. Two difficulties are that: (i) rare event rates tend to be unstable from year to year, so such a plot may not elucidate the temporal trends clearly; and, (ii) the usual assumption of normally distributed errors for regression modeling will not be valid.

Both of these difficulties can be ameliorated by rolling aggregation (see question 1 above), but this introduces a third problem: (iii) rolling multi-year rare event rates are not statistically independent, which violates another one of the usual assumptions for regression modeling. Therefore, to model rolling multi-year rare event rates, we need a regression-like approach that does not require the independence assumption.

Three regression-like approaches that do not entail the independence assumption are: (1) generalized estimating equations; (2) mixed modeling, sometimes also called multilevel modeling or random coefficient analysis; and, (3) time series modeling. The first two approaches are described in Chapter 4 of *Applied Longitudinal Data Analysis for Epidemiology* by Twisk, and the third is described in *Introduction to Time Series and Forecasting* by Brockwell and Davis (Springer, 2002).

6. Does the
inversion principle apply more generally than indicated in the workshop? For instance, if we have plentiful-event data
and use normal-theory methods, is accepting the null hypothesis that *p* = *p*_{0} at level 0.05 in a hypothesis test equivalent
to including *p*_{0} in the 95%
confidence interval for *p*?

Yes, the
inversion principle does apply more generally.
Accepting the null hypothesis that
*p* = *p*_{0} in a level
0.05 Wald hypothesis test is equivalent to including *p*_{0} in the 95% Wald confidence interval for *p*,
and accepting the null hypothesis that *p* = *p*_{0} in a level 0.05 score hypothesis test is
equivalent to including *p*_{0} in the 95% score confidence interval for *p*.

However, the wrinkle here is that most people use the score method to carry out the hypothesis test while using the Wald method to construct the confidence interval!

My answer to question 2 above describes the Wald confidence interval and the score confidence interval. And, as one may surmise from those descriptions, the Wald hypothesis test entails rejecting the null hypothesis that *p* = *p*_{0} if | \hat p – *p*_{0} | > 1.96 \sqrt{ \hat p (1 – \hat p) / n }, whereas the score hypothesis test entails rejecting the null hypothesis if | \hat p – *p*_{0} | > 1.96 \sqrt{ *p*_{0} (1 – *p*_{0}) / n }.

7. Can methods described in the workshop be carried out using familiar software packages?

Yes, for the most part. Both OPENEPI and SABER allow you to obtain a confidence interval for a risk like the one presented in the workshop. (OPENEPI: Choose “Counts” and then “Proportion”. SABER: Choose “Estimation/Testing” and then “Confidence Limits for a Binomial Proportion”.) However, they do not allow you to obtain p-values for customized hypothesis tests involving a risk (Cf. p. 14 of the workshop presentation).

Likewise, both OPENEPI and SABER allow you to obtain a confidence interval for a relative risk like the one presented in the workshop and to obtain a p-value for testing the null hypothesis that the relative risk is one. (OPENEPI: Choose “Counts” and then “Two by Two Table”. SABER: Choose “Estimation/Testing” and then “Fisher’s Exact Test for 2 x 2 Tables”.) However, they do not allow you to obtain p-values for customized hypothesis tests involving a relative risk (Cf. p. 23 of the workshop presentation).