
Thursday, December 2, 2010

Logistic regression analysis - Introduction to the logit and the odds ratio

Researchers are often interested in modeling the relationship between a set of predictors (i.e., independent variables) and a response (i.e., dependent variable). Linear regression is commonly used when the response variable is continuous. One assumption of linear models is that the residuals follow a normal distribution. That assumption fails when the response variable is categorical, so an ordinary linear model is not appropriate. This article introduces a regression model for a response variable that is dichotomous, having two categories. Examples include: whether a plant lives or dies, whether a survey respondent agrees or disagrees with a statement, or whether an at-risk child graduates or drops out of school.

In standard linear regression, the response variable (Y) is modeled as a linear function of the coefficients (B0, B1, etc.) applied to the predictor variables (X1, X2, etc.). A typical model looks like this:

Y = B0 + B1 * X1 + B2 * X2 + B3 * X3 + ... + E
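
As a concrete illustration, here is a minimal sketch of fitting a model of this form in Python; the simulated data and the statsmodels package are just assumptions for the example, not part of the discussion above:

# Minimal sketch: fit Y = B0 + B1*X1 + B2*X2 + E to simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                  # predictors X1, X2
y = 3 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)   # response Y with error E

X_design = sm.add_constant(X)            # adds the column of 1s for the intercept B0
fit = sm.OLS(y, X_design).fit()
print(fit.params)                        # estimates of B0, B1, B2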

For a dichotomous response variable, we could set up a similar linear model to predict each individual's category membership if numerical values are used to represent the two categories. Arbitrary values of 1 and 0 are chosen for mathematical convenience. Using the first example, we would let Y = 1 if a plant lives and Y = 0 if it dies.

The linear model does not work well for a couple of reasons. First, the response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not really of interest. Second, it is really the probability that each individual in the population responds with a 0 or a 1 that we are interested in modeling. For example, we may find that plants with a high level of fungal infection (X1) fall into the category "plant lives" (Y = 1) less often than plants with low levels of infection. Thus, as the level of infection rises, the probability of plant survival decreases.

Thus, we might consider modeling P, the probability, as the response variable. Again, there are problems. Although the general decrease in probability is accompanied by a general increase in infection level, we know that P, like all probabilities, can only fall within the bounds of 0 and 1. Therefore, it is better to assume that the relationship between X1 and P is sigmoidal (S-shaped), rather than a straight line.

It is possible, however, to find a linear relationship between X1 and a function of P. Although a number of functions work, one of the most useful is the logit function. It is the natural log of the odds that Y equals 1, which is simply the ratio of the probability that Y is 1 divided by the probability that Y is 0. The relationship between the logit of P and P itself is sigmoidal in shape. The regression equation that results is:

ln [P / (1-P)] = B0 + B1 * X1 + B2 * X2 + ...

Although the left side of this equation looks intimidating, this way of expressing the probability makes the right side of the equation linear and familiar to us. This helps us understand the meaning of the regression coefficients. The coefficients can easily be transformed so that their interpretation makes sense.
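
As a rough sketch of how this is fit in practice, here is the plant-survival example as a logistic regression in Python; the simulated infection data and the statsmodels package are assumptions for illustration only:

# Sketch: logistic regression of survival (Y = 1 lives, 0 dies) on infection level (X1).
# The fitted model is ln[P/(1-P)] = B0 + B1*X1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
infection = rng.uniform(0, 10, size=200)                   # X1: fungal infection level
p_alive = 1 / (1 + np.exp(-(2.0 - 0.5 * infection)))       # true sigmoid relationship
alive = rng.binomial(1, p_alive)                           # Y: 1 = lives, 0 = dies

X = sm.add_constant(infection)
fit = sm.Logit(alive, X).fit(disp=0)
print(fit.params)          # B0, B1 on the log-odds (logit) scale
print(np.exp(fit.params))  # exponentiated coefficients are odds ratios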

The logistic regression equation can be extended beyond the case of a dichotomous response variable to the cases of ordered categories and polytomous categories (more than two categories).

Wednesday, December 1, 2010

Logistic regression analysis - Understanding odds and probability

Probability and odds measure the same thing: the likelihood of a specific outcome. People use the terms odds and probability interchangeably in casual usage, but that is unfortunate. It just creates confusion, because they are not equivalent. They measure the same thing on different scales. Imagine how confusing it would be if people used Celsius and Fahrenheit interchangeably. "It will be 35 degrees today" could really lead you astray.

Think back to your introductory statistics course and all those probability problems about drawing red balls and white balls from an urn. In those problems, the probability of drawing a red ball was measured by how many balls there were in total and how many of them were red.

To measure the probability of an outcome, we need to know two things: how many times something happened and how many times it could have happened. The outcome of interest is called a success, whether or not it is actually a good outcome.

The other outcome is a failure. Each opportunity for the outcome to occur is called a trial. Since every trial ends in either a success or a failure, the number of successes and the number of failures add up to the total number of trials.

Probability is the number of successes that occurred relative to the total number of trials.

Odds are the number of successes that occurred relative to the number of failures that occurred.

For example, to predict the likelihood of an accident at a particular intersection, count every vehicle that passes through the intersection as a trial. Each trial has one of two outcomes: a safe passage or an accident. If the outcome we are most interested in modeling is an accident (morbid as that sounds), then an accident is a success.

Probability (success) = number of successes / total number of trials
Odds (success) = number of successes / number of failures

The odds are often written as:

Number of successes : 1 failure

Read it as "number of successes for every 1 failure." But often the 1 is dropped.

I see a lot of researchers get stuck when learning logistic regression because they are not used to thinking of likelihood on an odds scale.

Equal odds are 1: one success for every 1 failure. 1:1. Equal probability is .5: one success for every 2 trials.

Odds can range from 0 to infinity. Odds greater than 1 indicate that success is more likely than failure. Odds less than 1 indicate that failure is more likely than success.

Probability can range from 0 to 1. A probability greater than .5 indicates that success is more likely than failure. A probability less than .5 indicates that failure is more likely than success.

Example: Data from the last month show that at a particular intersection, 1,354 cars drove through it and 72 of them had an accident.

Accidents = successes = 72
Safe passages = failures = 1354 - 72 = 1282
Pr(accident) = 72/1354 = .053
Pr(safe passage) = 1282/1354 = .947
Odds(accident) = 72/1282 = .056
Odds(safe passage) = 1282/72 = 17.81

Now get out your calculator, because you'll see how these quantities relate to each other.

Odds(accident) = Pr(accident) / Pr(safe passage) = (72/1354) / (1282/1354) = 72/1282 = .056 (the denominators cancel). And Odds(accident) = 1 / Odds(safe passage) = 1/17.81 = .056.
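
The same arithmetic, written as a short Python sketch so you can check the conversions yourself (only the numbers from this example are used):

# The intersection example in code: probability and odds from the same counts.
accidents = 72
trials = 1354
safe_passages = trials - accidents            # 1282

pr_accident = accidents / trials              # .053
pr_safe = safe_passages / trials              # .947
odds_accident = accidents / safe_passages     # .056
odds_safe = safe_passages / accidents         # 17.81

# The two scales convert back and forth:
assert abs(odds_accident - pr_accident / pr_safe) < 1e-12
assert abs(odds_accident - 1 / odds_safe) < 1e-12
print(pr_accident, pr_safe, odds_accident, odds_safe)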

Friday, November 5, 2010

Logistic regression models for multinomial and ordinal variables

Multinomial (aka polytomous) logistic regression is a simple extension of the binomial logistic regression model. It is used when the dependent variable has more than two nominal (unordered) categories.

Dummy coding of independent variables is quite common. In multinomial logistic regression the dependent variable is dummy coded into multiple 1/0 variables. There is one variable for every category except one, so if there are M categories, there are M-1 dummy variables. Each category except one gets its own dummy variable. Each category's dummy variable has a value of 1 for cases in that category and a 0 for all others. One category, the reference category, does not need its own dummy variable, because it is uniquely identified by all the other dummy variables being 0.

Multinomial logistic regression then estimates a separate binary logistic regression model for each of those dummy variables. The result is M-1 binary logistic regression models. Each one tells the effect of the predictors on the probability of being in that category compared to the reference category. Each model has its own intercept and regression coefficients - the predictors can affect each category differently.

Why not just run a series of binary regression models? You could, and people used to, back before multinomial regression models were available in software. You will likely get similar results. But running them together means they are estimated simultaneously, so the parameter estimates are more efficient - there is less unexplained error.
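
Here is a minimal sketch of a multinomial logit fit in Python, assuming simulated data and the statsmodels package (the post does not tie this model to any particular software):

# Sketch: multinomial logistic regression with M = 3 unordered categories
# (category 0 is the reference), fit with statsmodels' MNLogit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Linear predictors for categories 1 and 2 relative to the reference category 0.
eta1 = 0.8 * x1 - 0.5 * x2
eta2 = -0.4 * x1 + 1.0 * x2
denom = 1 + np.exp(eta1) + np.exp(eta2)
probs = np.column_stack([1 / denom, np.exp(eta1) / denom, np.exp(eta2) / denom])
y = np.array([rng.choice(3, p=row) for row in probs])

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.MNLogit(y, X).fit(disp=0)
print(fit.params)   # M-1 = 2 sets of coefficients, each comparing one category to the reference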

Ordinal Logistic Regression: Proportional Odds Model

If the response categories are ordered, you could just run a multinomial regression model. The disadvantage is that you would be throwing away information about the ordering. An ordinal logistic regression model preserves that information, but it is slightly more involved.

In the proportional odds model, the event being modeled is not having an outcome in a single category, as it is in the binary and multinomial models. Rather, the event being modeled is having an outcome in a particular category or any previous category.

For example, for a response variable with three ordered categories, the possible events are defined as:

* being in group 1
* being in group 2 or 1
* being in group 3, 2, or 1

In the proportional odds model, each event has its own intercept, but the regression coefficients are the same. This means:

1. The overall odds of each event can differ, but
2. The effect of the predictors on the odds of an event occurring is the same for every category.
This is an assumption of the model that you need to check. It is often violated.

The model is written slightly differently in SPSS than usual, with a minus sign between the intercept and all the regression coefficients. This is a convention that ensures that, for positive regression coefficients, increases in the values of X lead to an increased probability of the higher-numbered response categories. In SAS, the sign is a plus, so increases in a predictor lead to an increased probability of the lower-numbered response categories. Make sure you understand how the model is set up in your statistical package before interpreting the results.
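
As a rough sketch of a proportional odds fit, here is one possibility in Python, assuming simulated data and statsmodels' OrderedModel (the same caution applies: check how your package writes the model before interpreting the coefficients):

# Sketch: proportional odds (ordinal logistic) model with 3 ordered categories.
# The slopes are shared across categories; only the intercepts/thresholds differ.
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=(n, 2))
latent = 1.0 * x[:, 0] - 0.7 * x[:, 1] + rng.logistic(size=n)  # common slopes
y = np.digitize(latent, bins=[-1.0, 1.0])                      # ordered categories 0, 1, 2

# OrderedModel supplies the thresholds itself, so no constant column is included.
fit = OrderedModel(y, x, distr="logit").fit(method="bfgs", disp=False)
print(fit.params)   # the slope coefficients followed by the threshold parameters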

Monday, November 1, 2010

Linear regression analysis - Interpreting interactions between continuous and categorical predictors

In linear regression, adding interaction terms can greatly expand understanding of the relationships among the variables in the model, and it allows more hypotheses to be tested. The downside is that it makes the regression parameters harder to interpret. This article explains how to interpret regression parameter estimates when the model includes an interaction.

The example is a model of the height of a shrub (Height) as a function of the amount of bacteria in the soil (Bacteria) and whether the shrub is located in partial or full sun (Sun). Height is measured in cm, Bacteria is measured in thousands per ml of soil, and Sun = 0 if the plant is in partial sun and Sun = 1 if the plant is in full sun. The regression equation was estimated as follows:

Height = 42 + 2.3 * Bacteria + 11 * Sun

It would be useful to add an interaction term to the model if we wanted to test the hypothesis that the relationship between the amount of bacteria in the soil and the height of the shrub was different in full sun than in partial sun. One possibility is that in full sun, plants with more bacteria in the soil tend to be taller, whereas in partial sun, plants with more bacteria in the soil tend to be shorter. Another possibility is that plants with more bacteria in the soil tend to be taller in both full and partial sun, but the relationship is much more dramatic in full sun than in partial sun.

The presence of a significant interaction indicates that the effect of one predictor variable on the response variable differs at different values of the other predictor variable. It is tested by adding a term to the model in which the two predictor variables are multiplied. The regression equation will look like this:

Height = B0 + B1 * Bacteria + B2 * Sun + B3 * Bacteria * Sun

Adding an interaction term to a model drastically changes the interpretation of all the coefficients. If there were no interaction term, B1 would be interpreted as the unique effect of Bacteria on Height. But since the interaction indicates that the effect of Bacteria on Height differs for different values of Sun, the unique effect of Bacteria on Height is not limited to B1; it also depends on the values of B3 and Sun. The unique effect of Bacteria is represented by everything that is multiplied by Bacteria in the model: B1 + B3 * Sun. B1 can now be interpreted as the unique effect of Bacteria on Height only when Sun = 0.

In our example, adding the interaction term gives the following model:

Height = 35 + 4.2 * Bacteria + 9 * Sun + 3.2 * Bacteria * Sun

Note that adding the interaction term changed the values of B1 and B2. The effect of Bacteria on Height is now 4.2 + 3.2 * Sun. For plants in partial sun, Sun = 0, so the effect of Bacteria is 4.2 + 3.2 * 0 = 4.2. So for two plants in partial sun, a plant with 1000 more bacteria/ml in the soil would be expected to be 4.2 cm taller than a plant with fewer bacteria. For plants in full sun, however, the effect of Bacteria is 4.2 + 3.2 * 1 = 7.4. So for two plants in full sun, a plant with 1000 more bacteria/ml in the soil would be expected to be 7.4 cm taller than a plant with fewer bacteria.

Because of the interaction, the effect of having more bacteria in the soil is different depending on whether a plant is in full or partial sun. Another way of saying this is that the slopes of the regression lines between height and bacteria count are different for the different categories of Sun. B3 indicates how different those slopes are.

Interpreting B2 is more difficult. B2 is the effect of Sun when Bacteria = 0. Since Bacteria is a continuous variable, it is unlikely to ever equal 0, so B2 can be virtually meaningless by itself. Instead, it is more useful to understand the effect of Sun, but again, this can be difficult. The effect of Sun is B2 + B3 * Bacteria, which is different at each of the infinite values of Bacteria. For that reason, often the only way to get an intuitive understanding of the effect of Sun is to plug a few values of Bacteria into the equation and see how Height, the response variable, changes.
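
One way to do that plugging in is with a short sketch like the following; it is plain Python arithmetic using only the fitted equation from this example:

# Plug values into Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun.
def predicted_height(bacteria, sun):
    return 35 + 4.2 * bacteria + 9 * sun + 3.2 * bacteria * sun

# Effect (slope) of Bacteria at each level of Sun: B1 + B3 * Sun
for sun in (0, 1):
    slope = 4.2 + 3.2 * sun
    print(f"Sun = {sun}: 1000 more bacteria/ml -> {slope:.1f} cm taller")

# Effect of Sun at a few values of Bacteria: B2 + B3 * Bacteria
for bacteria in (2, 5, 8):
    diff = predicted_height(bacteria, 1) - predicted_height(bacteria, 0)
    print(f"Bacteria = {bacteria}: full sun is {diff:.1f} cm taller than partial sun")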