Multiple Logistic Regression:

Predicting Coronary Heart Disease

 

 

 

 

A thesis submitted to the Department of Mathematics of Southern Oregon University in partial fulfillment of the requirements for the degree of

 

 

BACHELOR OF SCIENCE

in

MATHEMATICS

 

 

 

 

 

 

 

 

 

 

 

Ashland, Oregon

2001


APPROVAL PAGE

 

 

 

To my heroes, the mathematicians and statisticians of the past and present,

for asking interesting questions for the mathematicians and statisticians of the future.


ACKNOWLEDGEMENTS

 

The author wishes to acknowledge Professors Daniel Kim and Lisa Ciasullo of the Southern Oregon University Mathematics Department for the assistance they provided. Thanks are also due to the entire Math 490 class, for their patience and countless suggestions.


ABSTRACT OF THESIS

 

APPLIED AND THEORETICAL MULTIPLE LOGISTIC REGRESSION

 

 

     This thesis addresses the topic of applied and theoretical multiple logistic regression. Specifically, this study traces the roots of the logistic regression model, presents a derivation of how the estimated regression coefficients are obtained, and applies this knowledge to analyze the Framingham Heart Study data.

     Research took three main routes: 1) a historical overview of the suspected origins of the logistic model and the logistic regression model; 2) the derivations of many important features of the logistic regression model, including how the estimated regression coefficients are obtained; 3) the application of this information to the Framingham Heart Study data, and the analysis of the results.

     The research shows that the risk of Coronary Heart Disease (CHD) increases with age, male gender, high blood pressure, high cholesterol, and the number of cigarettes smoked.


VITA

 

 

 

UNDERGRADUATE SCHOOLS ATTENDED:

     Southern Oregon University, Ashland, Oregon

 

DEGREES AWARDED:

Bachelor of Science in Mathematics, 2001, Southern Oregon University

 

AREAS OF SPECIAL INTEREST:

     Multiple Linear Regression

     Hypothesis Testing

     Economics

     Zheng Manqing style Taijiquan

 

AWARDS AND HONORS:

Southern Oregon University Calculus Scholar Award,

1999

Southern Oregon University Ida and Eugene Bowman

Award, 1999

 

PUBLICATIONS:

     "Problem # 688", The College Mathematics Journal, Vol.

31, No. 5, November, 2000.


TABLE OF CONTENTS

CHAPTER

I.   INTRODUCTION

II.  HISTORICAL OVERVIEW AND PRELIMINARIES
     Logistic Model
     Regression

III. THE LOGISTIC REGRESSION MODEL
          CHD vs. AGE
          Mean CHD vs. AGE
     Prediction Model
          Conditional Expectation
          Link Function
          Error Term
               Mean
               Variance
     Final Model

IV.  ESTIMATING THE REGRESSION COEFFICIENTS
     The Framingham Data
     Method of Maximum Likelihood
     Regression Matrices
     Iteratively Reweighting Algorithm
          Initial Estimates

V.   MODEL DIAGNOSTICS
     Pearson Residual
     Interpretation
     Conclusion

WORKS CITED

LIST OF TABLES AND FIGURES

Table 1.  AGE and CHD Status of 100 Subjects
Table 2.  Frequency Table
Table 3.  Descriptions of Variables
Table 4.  Correlation Matrix
Table 5.  Regression Matrices
Table 6.  Initial Estimates of Coefficients
Table 7.  Estimated Coefficients
Table 8.  Odds of Having CHD

Figure 1. CHD vs. AGE
Figure 2. Mean CHD vs. AGE
Figure 3. Prediction Model


Chapter I:    INTRODUCTION

 

     Logistic regression is one of the most popular and effective ways to analyze data with a binary outcome. This paper uses logistic regression to develop a model to assess the probability of someone having coronary heart disease (CHD). A probability of having CHD is obtained based on the following biological factors: age, gender, systolic blood pressure, diastolic blood pressure, total serum cholesterol, and number of cigarettes smoked per day. The data studied is a random sample of 1,000 from the original Framingham Heart Study, which was started in 1948 in Framingham, Massachusetts (Bown A-28).

     The function of the coronary arteries is to supply blood to the heart muscle. Coronary diseases can reduce oxygen intake to the heart, which may lead to heart attacks or death. Plaque build-up in the inner lining of an artery is the most common form of coronary disease.

     Unfortunately, heart disease is the number one killer of both men and women in the United States, and has been for many years. Using logistic regression, one can estimate the odds of having CHD, given various biological factors. Once the causes of CHD are established, people can assess their odds of having CHD, reassess their lifestyle choices, and hopefully avoid heart disease.


Chapter II:   HISTORICAL OVERVIEW AND PRELIMINARIES

 

     There is no exact date for when logistic regression was introduced. We can, however, logically trace its roots back to mathematicians in the 18th century who were studying differential equations. Using the results of differential equations, the economist and clergyman Thomas Malthus provided people with the alarming possibility that exponential human growth could threaten the global environment (de Steiguer 5). Indeed, any exponentially growing population sooner or later exceeds the physical and biological limits of its environment. Thus, the exponential model, while potentially providing good estimates of population trends, is not realistic in the long run.

     The exponential model overlooked the fact that every environment has finite space and resources. This inherent quality of an environment is called a carrying capacity. A carrying capacity is the maximum population size that an environment can sustain. As the population increases towards the carrying capacity, its growth must slow. Therefore, the population's growth rate is proportional both to the population itself and to the difference between the carrying capacity and the population (Ostebee 645).

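     In symbols, we can say that $P'(t) = k\,P(t)\,[C - P(t)]$, where $P(t)$ represents the population at time $t$, $P'(t)$ represents the growth rate of the population at time $t$, $C$ the long-term carrying capacity of the environment, and $k$ measures the population's reproduction rate.

     Separating variables yields:

$$\frac{dP}{P(C - P)} = k\,dt.$$

Notice that:

$$\int \frac{dP}{P(C - P)} = \frac{1}{C}\int \left(\frac{1}{P} + \frac{1}{C - P}\right) dP = \frac{1}{C}\ln\left|\frac{P}{C - P}\right|.$$

We can leave out the absolute-value sign if we note that $0 < P < C$.

     Therefore, solving for $P$ is straightforward:

$$\frac{1}{C}\ln\frac{P}{C - P} = kt + c \quad\Longrightarrow\quad \frac{P}{C - P} = e^{Ckt + Cc}.$$

Letting $d = e^{-Cc}$, we obtain the logistic model:

$$P(t) = \frac{C}{1 + d\,e^{-Ckt}}.$$
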

     To apply this function, we only need to choose appropriate values for the constants. These constants can be obtained from past data or estimated by various experiments.
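
     As a minimal sketch of how such a function could be evaluated, the following Python code assumes the growth law described above, in which the growth rate equals k times the population times the gap between the carrying capacity and the population. The symbol names, the initial population p0, and the numerical constants are illustrative assumptions, not values from any data set.

```python
import math

def logistic_population(t, carrying_capacity, k, p0):
    """Population at time t when P' = k * P * (C - P) and P(0) = p0."""
    C = carrying_capacity
    d = (C - p0) / p0          # constant fixed by the initial condition P(0) = p0
    return C / (1.0 + d * math.exp(-C * k * t))

# Hypothetical constants chosen only for illustration.
C, k, p0 = 1000.0, 0.0005, 10.0
for t in (0, 5, 10, 20, 40):
    print(t, round(logistic_population(t, C, k, p0), 1))
```

     The printed values start at the initial population and level off near the carrying capacity, which is the behavior the exponential model lacks.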

     Over time, statisticians realized that they could use the logistic model with various kinds of data, not just populations of people and animals. Also, they adapted the regression model to enable it to handle multiple predictors, which is a more realistic approach. All of these changes made logistic regression a very powerful and useful regression technique.

     Why would someone want to learn regression in the first place? If we are studying some phenomenon and collecting data on it, we will eventually want to be able to say something intelligent about it. However, data collection can be expensive, time consuming, and in some cases, impossible to carry out completely. This is where regression analysis comes in. We would like to find some function that will estimate data that we do not have, by using the data that we do have. Regression analysis is employed in all types of sciences (Mendenhall 544) and is an extremely powerful tool.


Chapter III:  THE LOGISTIC REGRESSION MODEL

 

     The transition from the logistic model to the logistic regression model is probably best illustrated by a concrete example. However, logistic regression cannot be used with just any type of data. To use logistic regression, the response variable must be polytomous. That is, the response variable can only take on a finite number of values. Most commonly, the response variable will only take on two values, in which case the response variable is called binary or dichotomous. The predictors, however, can either be continuous or discrete.

     Table 1 contains an example of data that is suitable to be used in logistic regression. The data illustrates a common situation in the medical field. Note that the response variable CHD can be regarded as a dichotomous variable by assigning appropriate codes to indicate the status of CHD. That is, we can assign 1 to CHD if the patient has the disease, and 0 to CHD otherwise.


Table 1 Age and CHD Status of 100 Subjects

 ID  AGE  CHD     ID  AGE  CHD     ID  AGE  CHD     ID  AGE  CHD
  1   20    0     26   35    0     51   44    1     76   55    1
  2   23    0     27   35    0     52   44    1     77   56    1
  3   24    0     28   36    0     53   45    0     78   56    1
  4   25    0     29   36    1     54   45    1     79   56    1
  5   25    1     30   36    0     55   46    0     80   57    0
  6   26    0     31   37    0     56   46    1     81   57    0
  7   26    0     32   37    1     57   47    0     82   57    1
  8   28    0     33   37    0     58   47    0     83   57    1
  9   28    0     34   38    0     59   47    1     84   57    1
 10   29    0     35   38    0     60   48    0     85   57    1
 11   30    0     36   39    0     61   48    1     86   58    0
 12   30    0     37   39    1     62   48    1     87   58    1
 13   30    0     38   40    0     63   49    0     88   58    1
 14   30    0     39   40    1     64   49    0     89   59    1
 15   30    0     40   41    0     65   49    1     90   59    1
 16   30    1     41   41    0     66   50    0     91   60    0
 17   32    0     42   42    0     67   50    1     92   60    1
 18   32    0     43   42    0     68   51    0     93   61    1
 19   33    0     44   42    0     69   52    0     94   62    1
 20   33    0     45   42    1     70   52    1     95   62    1
 21   34    0     46   43    0     71   53    1     96   63    1
 22   34    0     47   43    0     72   53    1     97   64    0
 23   34    1     48   43    1     73   54    1     98   64    1
 24   34    0     49   44    0     74   55    0     99   65    1
 25   34    0     50   44    0     75   55    1    100   69    1

(Hosmer 3)

 

     As researchers, the doctors would like to know what can be said about the relationship between the dependent variable CHD and the predictor variable AGE. The doctors start by constructing a scatterplot of CHD vs AGE.

     When analyzing data, especially in a regression setting, it is important to first create a scatterplot to roughly assess the relationship between the variables. However, in this case the dependent variable CHD is discrete, and as Figure 1 shows, a scatterplot is not very useful (Hosmer 2).

 

Figure 1 CHD vs AGE

 

     Because we essentially have two horizontal bands, it is clear that it will be nearly impossible to find a useful function that will predict CHD given AGE. It does, however, seem that as AGE increases there are more cases of CHD = 1, but there are several exceptions. In order to find a more exact relationship between AGE and CHD, we will focus our attention on the probability of a subject having CHD. To accomplish this, consider the proportion of individuals who have CHD within a certain AGE interval. Table 2 shows the results.

Table 2 Frequency Table

Age Group   Midpoint   Proportion with CHD
20-29         24.5            0.10
30-34         32              0.13
35-39         37              0.25
40-44         42              0.33
45-49         47              0.46
50-54         52              0.63
55-59         57              0.76
60-69         64.5            0.80
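
     The entries of Table 2 can be recomputed directly from Table 1. The sketch below is one way to do so in Python; the AGE and CHD lists are transcribed from Table 1 in ID order, and the function name and the tuple layout are illustrative choices rather than anything from the original analysis.

```python
# AGE and CHD for the 100 subjects of Table 1, listed in ID order.
AGE = [20, 23, 24, 25, 25, 26, 26, 28, 28, 29,
       30, 30, 30, 30, 30, 30, 32, 32, 33, 33,
       34, 34, 34, 34, 34, 35, 35, 36, 36, 36,
       37, 37, 37, 38, 38, 39, 39, 40, 40, 41,
       41, 42, 42, 42, 42, 43, 43, 43, 44, 44,
       44, 44, 45, 45, 46, 46, 47, 47, 47, 48,
       48, 48, 49, 49, 49, 50, 50, 51, 52, 52,
       53, 53, 54, 55, 55, 55, 56, 56, 56, 57,
       57, 57, 57, 57, 57, 58, 58, 58, 59, 59,
       60, 60, 61, 62, 62, 63, 64, 64, 65, 69]
CHD = [int(c) for c in (
    "0000100000" "0000010000" "0010000010" "0100001010" "0000100100"
    "1101010010" "1100101001" "1110111110" "0111101111" "0111110111")]

# Age intervals used in Table 2 (both endpoints inclusive).
GROUPS = [(20, 29), (30, 34), (35, 39), (40, 44),
          (45, 49), (50, 54), (55, 59), (60, 69)]

def grouped_proportions(ages, chd, groups):
    """Return (midpoint, group size, proportion with CHD) for each age interval."""
    rows = []
    for lo, hi in groups:
        members = [c for a, c in zip(ages, chd) if lo <= a <= hi]
        rows.append(((lo + hi) / 2, len(members), sum(members) / len(members)))
    return rows

for (lo, hi), (mid, n, prop) in zip(GROUPS, grouped_proportions(AGE, CHD, GROUPS)):
    print(f"{lo}-{hi}  midpoint={mid:4.1f}  n={n:2d}  proportion={prop:.3f}")
```

     Each proportion is simply the mean of the 0/1 CHD codes within the interval, which is why these values can be read as estimated probabilities.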

 

 

 

 

 

 

 

 

 

 

 

    

Next, we plot the proportion of individuals with CHD versus the midpoint of each AGE interval. Doing this produces Figure 2.

 

Figure 2 %CHD vs Mean AGE

 

     This scatterplot provides much more insight into the relationship between CHD and AGE than Figure 1. However, our goal is to find a functional form to describe this relationship. In order to do this, we will focus our attention on the conditional expectation.

     Let $\mathbf{x}$ denote the column vector of all $p$ predictors. The proportion of individuals with predictor values $\mathbf{x}$ who have the characteristic $Y = 1$ is denoted as $\pi(\mathbf{x}) = E(Y \mid \mathbf{x})$. Because these are proportions, $0 \le \pi(\mathbf{x}) \le 1$.

     Keep in mind that there are infinitely many functions whose values lie between 0 and 1. With hard theoretical work and empirical motivation, statisticians found a function that works effectively in many applications (Ryan 256), and has a similar structure to the logistic model from differential equations.

     Let $g(\mathbf{x})$, called the link function, denote a linear combination of the $p$ predictors.

$$g(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$

 

     It is an interesting question as to why a linear combination is used instead of some other relation. As it turns out, the link function has been found to work in many theoretical and applied settings (McCullagh 107), and is what works best in logistic regression.

     The main objective of logistic regression is to build a model to predict $\pi(\mathbf{x})$. The prediction model that has been found to work is:

$$\pi(\mathbf{x}) = \frac{e^{g(\mathbf{x})}}{1 + e^{g(\mathbf{x})}}.$$

 

     Graphically, this relationship is shown in Figure 3.

 

Figure 3 Prediction Model
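
     To make the S-shaped relationship concrete, the following Python sketch evaluates a prediction model of the form exp(g) / (1 + exp(g)), where g is a linear combination of the predictors. The coefficient values are hypothetical and chosen only to show the shape of the curve; they are not estimates from any data in this paper.

```python
import math

def linear_predictor(x, beta):
    """g(x) = beta[0] + beta[1]*x[0] + ... + beta[p]*x[p-1]."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def predicted_probability(x, beta):
    """pi(x) = exp(g) / (1 + exp(g)), arranged to avoid overflow for large |g|."""
    g = linear_predictor(x, beta)
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)

beta = [-4.0, 0.1]   # hypothetical intercept and slope, not fitted values
for age in (20, 40, 60, 80):
    print(age, round(predicted_probability([age], beta), 3))
```

     Whatever value the linear combination takes, the output stays strictly between 0 and 1, which is the property required of a proportion.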

 

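     The random variable $\varepsilon$ is called the error, and it expresses an observation's deviation from the conditional mean. In linear regression, for example, the common assumption is that $\varepsilon \sim N(0, \sigma^2)$. However, this is not the case with a binary response variable. In this situation, we may express the values of the response variable given $\mathbf{x}$ as:

$$y = \pi(\mathbf{x}) + \varepsilon.$$

     Here the quantity $\varepsilon$ may assume one of two possible values. When $y = 1$, $\varepsilon = 1 - \pi(\mathbf{x})$ with probability $\pi(\mathbf{x})$. When $y = 0$, $\varepsilon = -\pi(\mathbf{x})$ with probability $1 - \pi(\mathbf{x})$.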

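     Using the above information, we can calculate the mean and variance of the random variable $\varepsilon$.

     The expected value of a discrete random variable $Y$, with probability distribution function $p(y)$, is defined as:

$$E(Y) = \sum_{y} y\,p(y).$$

     Therefore, the expected value of $\varepsilon$ is found to be:

$$E(\varepsilon) = [1 - \pi(\mathbf{x})]\,\pi(\mathbf{x}) + [-\pi(\mathbf{x})]\,[1 - \pi(\mathbf{x})] = 0.$$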

 

     This says that even though some errors may be large or small, we can expect the mean of them to be zero, which is reassuring if we expect to have a good model.

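     Similarly, the variance of a discrete random variable $Y$, with probability distribution function $p(y)$ and mean $\mu = E(Y)$, is defined as:

$$V(Y) = E[(Y - \mu)^2] = \sum_{y} (y - \mu)^2\,p(y).$$

     Using this information, with the fact that $E(\varepsilon) = 0$, gives:

$$V(\varepsilon) = [1 - \pi(\mathbf{x})]^2\,\pi(\mathbf{x}) + [-\pi(\mathbf{x})]^2\,[1 - \pi(\mathbf{x})] = \pi(\mathbf{x})\,[1 - \pi(\mathbf{x})].$$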
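
     The mean and variance of the error can also be checked numerically by enumerating its two possible values. The short Python sketch below does this for a few probabilities chosen arbitrarily for illustration; the function name is hypothetical.

```python
def error_moments(pi):
    """Mean and variance of eps = y - pi when y equals 1 with probability pi."""
    outcomes = [(1 - pi, pi),       # y = 1: eps = 1 - pi, with probability pi
                (-pi, 1 - pi)]      # y = 0: eps = -pi,    with probability 1 - pi
    mean = sum(e * p for e, p in outcomes)
    variance = sum((e - mean) ** 2 * p for e, p in outcomes)
    return mean, variance

for pi in (0.1, 0.25, 0.5, 0.9):
    mean, variance = error_moments(pi)
    print(pi, mean, variance, pi * (1 - pi))
```

     Unlike in linear regression, the variance is not constant: it depends on the probability itself and is largest when that probability is one half.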

 

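     Because we know the structure of our entire regression model, we are now able to present the regression model in full:

$$y = \frac{e^{g(\mathbf{x})}}{1 + e^{g(\mathbf{x})}} + \varepsilon,$$

where $\varepsilon$ has mean 0 and variance $\pi(\mathbf{x})\,[1 - \pi(\mathbf{x})]$.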

     We have the model now, but notice that it depends on the unknown coefficients. These coefficients cannot just be 'guessed'. The coefficients are parameters that have to be estimated from the existing data. In the next chapter, we will develop a reliable estimation process for accomplishing this important task.


Chapter IV:   ESTIMATING THE REGRESSION COEFFICIENTS

 

     The data that this paper analyzes is a random sample from the original Framingham Heart Study (Bown A-28). There are many variables to consider when trying to determine what is responsible for CHD. The scientists involved with the Framingham study originally focused on the following variables: CHD, gender, age, systolic blood pressure, diastolic blood pressure, cholesterol level, and number of cigarettes smoked per day. These variables will be abbreviated as CHD, GEN, AGE, SBP, DBP, CHL, and CIG in subsequent graphs, tables, and equations. Table 3 lists descriptions of the variables, and their units of measurement.

 

Table 3 Descriptions of Variables

Response:

CHD: Coronary heart disease, 1: CHD present, 0: CHD absent

Predictors:

GEN: Gender, 1: male, 0: female

AGE: Age (years)

SBP: Systolic blood pressure, when heart is pumping (mm Hg, millimeters of mercury)

DBP: Diastolic blood pressure, when heart is at rest (mm Hg)

CHL: Total serum cholesterol level (mg/dL)

CIG: Cigarettes smoked (#/day)

 

     To illustrate the relationships between the six predictors, a correlation matrix is shown in Table 4.

 

 

Table 4 Correlation Matrix

          GEN       AGE       SBP       DBP       CHL       CIG
GEN     1
AGE    -.0310     1
SBP     .0108     .3532     1
DBP     .1279     .2298     .7930     1
CHL    -.0307     .2877     .2119     .1597     1
CIG     .3632    -.1525    -.0423    -.0281     .0436     1

 

     As Table 4 shows, the only high correlation is between DBP and SBP (.7930). There also seems to be moderate correlation between SBP and AGE (.3532), and between CIG and GEN (.3632).

     Before we can use logistic regression to analyze this data, we must ask if these predictors are any good. That is, do our predictors do a good job of distinguishing who has CHD? If so, then we expect the population means of the predictors for each group CHD = 1 and CHD = 0 to be different. Using the technique of multivariate analysis of variance (MANOVA), we can shed some light on the answer to this question.

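     The hypotheses we are concerned with, where $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_0$ denote the population mean vectors of the predictors for the groups CHD = 1 and CHD = 0, are:

$$H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 \quad \text{vs.} \quad H_a: \boldsymbol{\mu}_1 \neq \boldsymbol{\mu}_0.$$

     The idea of MANOVA relies on the F-test. The actual form of the test statistic (Johnson 224) yields a p-value which is less than .001. We therefore reject $H_0$ at the 5% level of significance, and conclude that the population means are, in fact, different.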

     The next step is to actually estimate the regression coefficients. The most common technique statisticians are familiar with is the method of least squares. However, the method of least squares fails in the case of logistic regression because the necessary assumptions are violated. In logistic regression, the method of maximum likelihood is used to estimate the coefficients.

     The sample likelihood function is defined as the joint probability function of the $n$ random variables, which constitute the sample. Specifically, for a sample size $n$ the corresponding random variables are:

$$Y_1, Y_2, \ldots, Y_n.$$

     In most cases, it is reasonable to assume that the $Y_i$, $i = 1, 2, \ldots, n$, are independent. By the multiplicative law for independent events, the joint probability function is then:

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \pi(\mathbf{x}_i)^{y_i}\,[1 - \pi(\mathbf{x}_i)]^{1 - y_i}.$$

     In words, the likelihood function gives the probability of observing a sequence of 0's and 1's, which correspond to people not having or having CHD. It should be noted that the assumption of independent Bernoulli random variables might not always be plausible; for the most part, however, we are safe with that assumption (Ryan 258).

     Maximum likelihood estimates are usually obtained by maximizing the logarithm of the likelihood function, especially when the likelihood function is complicated. This is acceptable because the logarithm function is monotonically increasing, so the likelihood function and its logarithm achieve their maximums at the same place. This also has the added bonus of making the calculus involved more tractable.

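     Taking the logarithm of each side of the likelihood equation produces:

$$\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left\{ y_i \ln \pi(\mathbf{x}_i) + (1 - y_i) \ln [1 - \pi(\mathbf{x}_i)] \right\}.$$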
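
     As an illustration of the likelihood and its logarithm, the Python sketch below evaluates both for a small set of binary responses. The eight observations and the two candidate coefficient vectors are hypothetical and serve only to show that the two functions rank candidate coefficients in the same order.

```python
import math

def prob(x, beta):
    """pi(x) = 1 / (1 + exp(-g)), an equivalent form of exp(g) / (1 + exp(g))."""
    g = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-g))

def likelihood(beta, X, y):
    """Product of pi^y * (1 - pi)^(1 - y) over independent observations."""
    value = 1.0
    for xi, yi in zip(X, y):
        p = prob(xi, beta)
        value *= p if yi == 1 else (1.0 - p)
    return value

def log_likelihood(beta, X, y):
    """Sum of y*ln(pi) + (1 - y)*ln(1 - pi) over the observations."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = prob(xi, beta)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Hypothetical toy data: one predictor, eight subjects.
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 1, 0, 1, 0, 1, 1]

for beta in ([0.0, 0.0], [-2.0, 0.45]):
    print(beta, likelihood(beta, X, y), log_likelihood(beta, X, y))
```

     Working with the sum rather than the product also avoids the numerical underflow that a product of many small probabilities would cause in larger samples.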

 

     To find our maximum likelihood estimates of $\boldsymbol{\beta}$, we want to set all the partial derivatives of $\ln L(\boldsymbol{\beta})$ equal to zero, and solve the resulting non-linear system simultaneously for the $\beta_j$. Because this is a non-linear system of equations, solving the system requires an approximation method. One of the most popular methods is the iteratively reweighting algorithm, which relies on Newton's Method.

     Introducing matrix notation will aid us in this process. Table 5 details the matrices involved with estimating the coefficients in logistic regression.

 

 

Table 5 Regression Matrices

 

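     The likelihood equations that we obtain from differentiating $\ln L(\boldsymbol{\beta})$ are:

$$\sum_{i=1}^{n} [y_i - \pi(\mathbf{x}_i)] = 0, \text{ and}$$

$$\sum_{i=1}^{n} x_{ij}\,[y_i - \pi(\mathbf{x}_i)] = 0, \text{ for } j = 1, 2, \ldots, p.$$

     More concisely, we can write all of the $p + 1$ likelihood equations using matrix notation, with $X$ the $n \times (p + 1)$ matrix of predictor values whose first column is all ones, $\mathbf{y}$ the vector of responses, and $\boldsymbol{\pi}$ the vector of the $\pi(\mathbf{x}_i)$, as:

$$X'(\mathbf{y} - \boldsymbol{\pi}) = \mathbf{0}.$$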
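
     The differentiation step behind these equations can be sketched as follows. The shorthand $\pi_i = \pi(\mathbf{x}_i)$, $g_i = g(\mathbf{x}_i)$, and $x_{i0} = 1$ is introduced here only for this sketch and is an assumption about notation, not necessarily that of Table 5.

```latex
\begin{align*}
\frac{d\pi_i}{dg_i} &= \frac{e^{g_i}}{(1 + e^{g_i})^2} = \pi_i (1 - \pi_i),
\qquad
\frac{\partial g_i}{\partial \beta_j} = x_{ij}, \\[4pt]
\frac{\partial \ln L}{\partial \beta_j}
&= \sum_{i=1}^{n} \left[ \frac{y_i}{\pi_i} - \frac{1 - y_i}{1 - \pi_i} \right]
   \pi_i (1 - \pi_i)\, x_{ij}
 = \sum_{i=1}^{n} x_{ij}\,(y_i - \pi_i),
\qquad j = 0, 1, \ldots, p.
\end{align*}
```

     The case $j = 0$ gives the first likelihood equation, since $x_{i0} = 1$ for every subject.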

 

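     To use the iteratively reweighting algorithm, we must use the idea of Newton's Method, which first requires determining the matrix of second partial derivatives of $\ln L(\boldsymbol{\beta})$. This is equivalent to computing $\frac{\partial}{\partial \boldsymbol{\beta}'}\,X'(\mathbf{y} - \boldsymbol{\pi})$, which equals $-X'\,\frac{\partial \boldsymbol{\pi}}{\partial \boldsymbol{\beta}'}$.

     From its definition, we can compute $\frac{\partial \pi_i}{\partial \beta_0} = \pi_i (1 - \pi_i)$ and $\frac{\partial \pi_i}{\partial \beta_j} = x_{ij}\,\pi_i (1 - \pi_i)$, where $\pi_i = \pi(\mathbf{x}_i)$. Therefore, $\frac{\partial \boldsymbol{\pi}}{\partial \boldsymbol{\beta}'} = WX$, where $W$ is the diagonal matrix with entries $\pi_i (1 - \pi_i)$, so the matrix of second partial derivatives is $-X'WX$.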

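     Now that we have all the necessary parts that Newton's Method requires, we can write:

$$X'(\mathbf{y} - \boldsymbol{\pi}^{(t)}) - X'W^{(t)}X\,(\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}) = \mathbf{0},$$

where $t$ is the iteration number, and $\boldsymbol{\pi}^{(t)}$ and $W^{(t)}$ (the diagonal matrix of the $\pi_i (1 - \pi_i)$) are evaluated at $\boldsymbol{\beta}^{(t)}$.

     Solving this for $\boldsymbol{\beta}^{(t+1)}$ yields:

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (X'W^{(t)}X)^{-1}\,X'(\mathbf{y} - \boldsymbol{\pi}^{(t)}).$$

     This algorithm is dependent on the initial estimate $\boldsymbol{\beta}^{(0)}$. The method for initially estimating these coefficients is shown in Table 6 (Hosmer 35).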

 

Table 6 Initial Estimates of Coefficients

 

Whereand, and
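
     The iteration can be sketched in a few lines of Python for the simplest case of an intercept and one predictor, where the linear system solved at each step is 2 by 2 and can be inverted by hand. This is a minimal illustration under stated assumptions, not the computation used for the Framingham data: the toy data are hypothetical, the iteration starts from zero rather than from the initial estimates of Table 6, and a fixed number of iterations replaces a convergence test.

```python
import math

def score(x, y, b0, b1):
    """The two likelihood equations: sum(y - pi) and sum(x * (y - pi))."""
    pi = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    u0 = sum(yi - p for yi, p in zip(y, pi))
    u1 = sum(xi * (yi - p) for xi, yi, p in zip(x, y, pi))
    return u0, u1

def fit_logistic_newton(x, y, beta=(0.0, 0.0), iterations=25):
    """Newton's Method for a logistic model with an intercept and one predictor."""
    b0, b1 = beta
    for _ in range(iterations):
        pi = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [p * (1.0 - p) for p in pi]              # weights pi * (1 - pi)
        u0, u1 = score(x, y, b0, b1)                 # score vector
        # Weighted cross-product matrix [[a, b], [b, c]]
        a = sum(w)
        b = sum(xi * wi for xi, wi in zip(x, w))
        c = sum(xi * xi * wi for xi, wi in zip(x, w))
        det = a * c - b * b
        # Newton step: solve the 2x2 system for the change in (b0, b1)
        b0 += (c * u0 - b * u1) / det
        b1 += (a * u1 - b * u0) / det
    return b0, b1

# Hypothetical toy data: one predictor, eight subjects.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(x, y)
print(b0, b1, score(x, y, b0, b1))
```

     At convergence both likelihood equations are satisfied to within rounding error, which is a convenient check that the iteration has reached the maximum likelihood estimates.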

 

 

     After this algorithm is carried out, we have our fitted model. Using the statistics software program S-PLUS 2000, the coefficients, shown in Table 7, were estimated.

 

Table 7 Estimated Coefficients

 

 

Therefore our fitted model is:

 

 

     It is tempting to want to substitute in an age, blood pressure, and so on, in order to estimate someone's probability of having CHD. However, we must know whether or not this fitted model is good before we can use it.

    
Chapter V:    MODEL DIAGNOSTICS

 

     The last chapter presented a method for estimating the coefficients involved in logistic regression. However, does going through that process guarantee that we have a good fitted model? The answer is a surprising "No". Also, what does it mean to have a "good" model? The branch of regression that attempts to answer these questions is called model diagnostics.

     The main focus of model diagnostics is on the concept of goodness of fit. There are many ways to measure goodness of fit. Some of the ways are through analysis of the likelihood ratios, deviance residuals, various forms of $R^2$, and through the Pearson residuals. This chapter focuses on the Pearson residuals because of their straightforward interpretations.

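     In order to build these residuals, let $m_j$ denote the number of subjects with the same covariate pattern $\mathbf{x} = \mathbf{x}_j$. Note that $\sum_j m_j = n$.

     Let $y_j$ denote the number of positive responses, $y = 1$, among the $m_j$ subjects with $\mathbf{x} = \mathbf{x}_j$. Note that $\sum_j y_j = n_1$, the total number of subjects with $y = 1$.

     We can now say that the expected number of positive responses is:

$$\hat{y}_j = m_j\,\hat{\pi}_j,$$

where $\hat{\pi}_j$ is the fitted probability for the $j$th covariate pattern.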

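     The Pearson residual is defined and denoted as:

$$r_j = \frac{y_j - m_j\,\hat{\pi}_j}{\sqrt{m_j\,\hat{\pi}_j\,(1 - \hat{\pi}_j)}},$$

and the summary statistic based on these results is:

$$X^2 = \sum_{j=1}^{J} r_j^2,$$

where $J$ denotes the number of distinct values of $\mathbf{x}$ observed.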
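
     A minimal Python sketch of these quantities follows. Each residual is the observed number of positive responses minus the expected number, divided by the estimated standard deviation of that count. The four covariate patterns and their fitted probabilities are hypothetical and are not taken from the Framingham fit.

```python
import math

def pearson_residual(y_j, m_j, pi_j):
    """(observed - expected) / sqrt(m * pi * (1 - pi)) for one covariate pattern."""
    expected = m_j * pi_j
    return (y_j - expected) / math.sqrt(expected * (1.0 - pi_j))

def pearson_chi_square(patterns):
    """Sum of squared Pearson residuals over all covariate patterns."""
    return sum(pearson_residual(y, m, p) ** 2 for y, m, p in patterns)

# Hypothetical patterns: (positive responses, subjects, fitted probability)
patterns = [(1, 10, 0.12), (4, 12, 0.30), (7, 13, 0.52), (9, 11, 0.78)]
for y_j, m_j, pi_j in patterns:
    print(y_j, m_j, pi_j, round(pearson_residual(y_j, m_j, pi_j), 3))
print("chi-square:", round(pearson_chi_square(patterns), 3))
```

     A residual near zero means the pattern's observed count is close to what the fitted model expects, so a small summary statistic indicates agreement between model and data.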

 

     Once again we set up two hypotheses:

$H_0$: The model does fit the data

vs.

$H_a$: The model does not fit the data

     This goodness of fit test relies on a $\chi^2$ distribution with $J - (p + 1)$ degrees of freedom, where $J$ is the number of distinct covariate patterns and $p$ is the number of predictors. For the Framingham data, the observed value of the test statistic is less than the theoretical $\chi^2$ critical value. Therefore, we fail to reject $H_0$ at the 5% level of significance, and conclude that there is no evidence that our model fits the data poorly.

     One way to interpret the model is through the estimated coefficients. In linear regression, the interpretation is straightforward. In logistic regression we are working with a non-linear function, and therefore our interpretation must change accordingly.

     In the Framingham study, for example, it is important to consider the odds of having CHD given certain biological characteristics. It is an interesting question to see how these odds change as the biological variables change. The interpretation in logistic regression is done by examining the odds of having a positive response.

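     Specifically, we are interested in the odds $\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}$. One can work out, from the definition of odds, that $\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})} = e^{g(\mathbf{x})}$, which is equivalent to:

$$\ln\left[\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}\right] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$

     Using these results we can assess the odds of having CHD, ceteris paribus. Increasing the $j$th predictor by $c$ units while holding the others fixed multiplies the odds by $e^{c\beta_j}$. This was done with all of the predictors, and the results are summarized in Table 8.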

 

Table 8 Odds of Having CHD

Predictor   Unit Change          Multiplicative Change in Odds
GEN         1 (female to male)   2.75
AGE         10 years             2
SBP         15 mm Hg             1.3
DBP         15 mm Hg             .97
CHL         20 mg/dL             1.15
CIG         5 cigarettes/day     1.06
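
     The entries of Table 8 are tied to the coefficients through the exponential function: a change of c units in one predictor, with the others held fixed, multiplies the odds by exp(c times that predictor's coefficient). The Python sketch below applies this relation in both directions. The per-unit coefficients it prints are backed out from the rounded entries of Table 8, so they are only approximations and should not be read as the estimates reported in Table 7; the function names are illustrative.

```python
import math

# (unit change c, multiplicative change in odds) as reported in Table 8
TABLE_8 = {"GEN": (1, 2.75), "AGE": (10, 2.0), "SBP": (15, 1.3),
           "DBP": (15, 0.97), "CHL": (20, 1.15), "CIG": (5, 1.06)}

def odds_multiplier(beta_j, c):
    """Factor by which the odds change when predictor j increases by c units."""
    return math.exp(c * beta_j)

def implied_coefficient(multiplier, c):
    """Per-unit coefficient consistent with a given odds multiplier."""
    return math.log(multiplier) / c

for name, (c, multiplier) in TABLE_8.items():
    beta_j = implied_coefficient(multiplier, c)
    print(f"{name}: implied coefficient {beta_j:+.4f}, "
          f"one-unit odds multiplier {odds_multiplier(beta_j, 1):.4f}")
```

     A multiplier above 1 corresponds to a positive coefficient and a multiplier below 1 to a negative one, which is why DBP stands apart from the other predictors in the table.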

 

 

     We can see that the unit changes in the majority of the predictors cause a multiplicative increase in the odds of having CHD. Particularly disturbing is the impact that being male has on the odds, and also the relationship that age has with having CHD.

     It is interesting, and somewhat confusing at first, to see that increases in SBP increase the odds of having CHD, while increases in DBP lower the odds of having CHD. This apparent anomaly may simply be the result of analyzing a random sample. It could also be related to the fact that SBP and DBP are highly correlated.

     Logistic regression is a very powerful tool in the sciences. It has its roots in calculus and differential equations, and has been added to and modernized by statisticians to make it what it is today.

     By using logistic regression on a random sample of the original Framingham Heart Study data, it was found that being male, increased age, increased systolic blood pressure, increased cholesterol, and an increased number of cigarettes smoked per day all increase the odds of having CHD.

     A more sophisticated study would possibly exclude some of these variables and include some others. Also, attention needs to be paid to possible confounding factors, some of which could be exercise, diet, lifestyle, stress levels, and other biological factors, such as genetics.

     There are numerous applications of logistic regression, and most are found in the sciences. Logistic regression has been successfully used for environmental modeling, remote sensing, and disease classification, just to name a few. With some work, logistic regression can be extended to handle multiple responses, which would allow for more realistic situations, such as a disease with multiple stages.


 

WORKS CITED

 

Bown, Fred, and Chase, Warren. General Statistics. New York: John Wiley and Sons, 2000.

 

De Steiguer, J.E. Age of Environmentalism. New York: McGraw-Hill, 1997.

 

Hosmer, David W., and Lemeshow, Stanley. Applied Logistic Regression. New York: John Wiley & Sons, 1989.

 

Johnson, Richard A., and Wichern, Dean W. Applied Multivariate Statistical Analysis. New Jersey: Prentice-Hall Inc., 1998.

 

McCullagh, P., and Nelder, J.A. Generalized Linear Models. Cambridge: University Press, 1989.

 

Mendenhall, William, and Schaeffer, Richard L., and Wackerly, Dennis D. Mathematical Statistics with Applications. New York: Duxbury Press, 1996.

 

Ostebee, Arnold, and Zorn, Paul. Calculus- From Graphical, Numerical, and Symbolic Points of View. New York: Harcourt Brace, 1997.

 

Ryan, Thomas P. Modern Regression Methods. New York: John Wiley & Sons, 1997.