class: center, middle, inverse, title-slide

# Bio-demographic Laboratory - Logistic regression and multinomial logit

### Nicola Barban
Alma Mater Studiorum Università di Bologna
Dipartimento di Scienze Statistiche

### 11 March 2021
---

# Logistic Regression

## Dataset: wells in Bangladesh

* Many drinking-water wells in Bangladesh and other South Asian countries are contaminated with arsenic. The problem affects more than 100 million people
* Arsenic exposure increases the risk of cancer and other diseases
* A research team from the United States and Bangladesh surveyed all the wells, measuring arsenic concentrations and labeling each well as **safe** or **unsafe**
* People living near unsafe wells were encouraged to switch to a safe well nearby

---

* A few years later, the researchers returned and recorded who had switched wells and who had not.
* We will use a **logistic regression analysis** to understand the factors associated with switching among households that were using unsafe wells.

$$ Y_i = 1 \text{ if household } i \text{ switched to a new well} $$

$$ Y_i = 0 \text{ if household } i \text{ continued using its own well} $$

---

# Independent variables

We are interested in the following factors:

* Distance to the closest safe well
* Arsenic level of the well currently in use
* Whether at least one household member is involved in a community organization
* Education level of the head of the household

---

# Binomial distribution

We can view `\(Y_i\)` as a random variable that takes the values 1 and 0 with probabilities `\(p_i\)` and `\(1-p_i\)`. The distribution of `\(Y_i\)` is called *Bernoulli* and can be written as:

$$ Pr(Y_i=y_i)= {p_i}^{y_i} (1-p_i)^{1-y_i} $$

---

# Modeling binary data

We would like the probabilities `\(p_i\)` to depend on a vector of observed covariates `\(X_i\)`.

* The simplest approach is to model `\(p_i\)` as a linear function of `\(X_i\)`. This is also called a *linear probability model.*
* The problem is that probabilities must lie between 0 and 1, while a linear predictor `\(X\beta\)` can take any real value.
* One solution is to transform the probability `\(p_i\)` into **odds**:

$$ odds_i= \frac{p_i}{1-p_i} $$

* Odds can take any positive value, and their logarithm — the **logit** — can take any real value:

$$ logit(p_i)= log \frac{p_i}{1-p_i} $$

---

# The logit function

<img src="images/logit.pdf" alt="The logit function">

---

# The logistic regression model

It would not make sense to fit a continuous linear model, `\(X\beta + \epsilon\)`, to data `\(y\)` that can only take the values 0 and 1. Instead, we model the probability that `\(y=1\)`,

$$ Pr(y_i=1) = logit^{-1}(X_i \beta), $$

assuming that the outcomes `\(y_i\)` are independent given their probabilities. `\(X\beta\)` is also called the *linear predictor*.

---

# The inverse logit function

The function `\(logit^{-1}(x)=\frac{e^x}{1+e^x}\)` maps continuous values to the interval `\((0,1)\)`.

* `\(Pr(y_i=1)= p_i\)`
* `\(logit(p_i) = X_i \beta\)`, where `\(logit(x) = log(x/(1-x))\)` is a function mapping the range `\((0,1)\)` to the range `\((-\infty, \infty)\)`.
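---

# The logit transformation in R

To make the transformation concrete, here is a minimal sketch using base R's built-in `qlogis()` (logit) and `plogis()` (inverse logit):

```r
# logit: probabilities in (0,1) -> real line
p <- c(0.1, 0.5, 0.9)
qlogis(p)    # log(p/(1-p)): -2.197  0.000  2.197

# inverse logit: real line -> probabilities in (0,1)
x <- c(-5, 0, 5)
plogis(x)    # exp(x)/(1+exp(x)): 0.0067  0.5  0.9933
```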
---

# Odds Ratio

Suppose we want to study the effect of being active in a community organization on the probability of switching. The variable `\(X_i\)` can take the values 0 or 1.

* Does `\(X_i\)` affect the probability of switching? By how much?
* We compare the *odds*, computing the ratio of `\((odds_i|X_i=1)\)` to `\((odds_i|X_i=0)\)`:

$$ OR_X = \frac{odds_i|X_i=1}{odds_i|X_i=0} = \frac{logit^{-1}(\alpha +\beta \times 1)}{1-logit^{-1}(\alpha +\beta \times 1 )} \times \frac{1 -logit^{-1}(\alpha +\beta \times 0)}{logit^{-1}(\alpha +\beta \times 0 )} $$

$$ = \frac{logit^{-1}(\alpha +\beta )}{1-logit^{-1}(\alpha +\beta )} \times \frac{1 -logit^{-1}(\alpha )}{logit^{-1}(\alpha )} \\ = e^{\alpha + \beta} \times \frac{1}{e^\alpha}=e^{\beta} $$

---

# Importing the data

```r
library(haven)
library(tidyverse)
library(tidymodels)

wells <- read_dta("data/all.dta") %>%
  filter(survey == 1)
#View(wells)
```

---

```r
mod0 <- glm(formula = switch ~ assn, family = "binomial", data = wells)
tidy(mod0)
```

```
## # A tibble: 2 x 5
##   term       estimate std.error statistic
##   <chr>         <dbl>     <dbl>     <dbl>
## 1 (Intercep…   -0.427    0.0418    -10.2 
## 2 assn         -0.114    0.0562     -2.02
## # … with 1 more variable: p.value <dbl>
```

```r
exp(-0.1136)
```

```
## [1] 0.8926149
```

---

```r
tidy(mod0) %>%
  mutate(OR = exp(estimate), OR.std = exp(std.error)) %>%
  select(OR, OR.std, p.value)
```

```
## # A tibble: 2 x 3
##      OR OR.std  p.value
##   <dbl>  <dbl>    <dbl>
## 1 0.652   1.04 1.62e-24
## 2 0.893   1.06 4.34e- 2
```

* The odds of switching to another well for people who are active in community organizations are 0.89 times the odds for people who are not active
* In other words, being active in a community organization is **associated with a decrease of (1-0.89)=11% in the odds of switching**

---

```r
mod1 <- glm(formula = switch ~ distcw, family = "binomial", data = wells)
summary(mod1)
```

```
## 
## Call:
## glm(formula = switch ~ distcw, family = "binomial", data = wells)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.277  -1.016  -0.767   1.192   2.617  
## 
## Coefficients:
##               Estimate Std. Error
## (Intercept)  0.2417606  0.0561907
## distcw      -0.0021560  0.0001473
##             z value Pr(>|z|)    
## (Intercept)   4.302 1.69e-05 ***
## distcw      -14.641  < 2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
## 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5330.9  on 4007  degrees of freedom
## Residual deviance: 5053.8  on 4006  degrees of freedom
## AIC: 5057.8
## 
## Number of Fisher Scoring iterations: 4
```

---

# Model with interactions

```r
wells <- wells %>% mutate(dist100 = distcw/100)

mod2 <- glm(formula = switch ~ dist100*arsenic + assn + ed4,
            family = "binomial", data = wells)
tidy(mod2)
```

```
## # A tibble: 6 x 5
##   term       estimate std.error statistic
##   <chr>         <dbl>     <dbl>     <dbl>
## 1 (Intercep…  -1.93      0.156     -12.3 
## 2 dist100     -0.175     0.0346     -5.05
## 3 arsenic      0.807     0.0643     12.5 
## 4 assn        -0.104     0.0628     -1.66
## 5 ed4          0.176     0.0400      4.41
## 6 dist100:a…   0.0566    0.0196      2.88
## # … with 1 more variable: p.value <dbl>
```

---

# Testing the parameters

We can test the hypothesis

$$ H_0: \beta_j=0 $$

by computing the statistic `\(z_j\)`, which in large samples is distributed as `\(N(0,1)\)`:

$$ z_j=\frac{ \hat{\beta}_j }{\sqrt{\hat{var}(\hat{\beta}_j)}}. $$

The test can be used to compute a 95% confidence interval for `\(\beta_j\)`:

$$ \hat{\beta}_j \pm 1.96 \sqrt{\hat{var}(\hat{\beta}_j)} $$

**The confidence intervals of the odds ratios are no longer symmetric!**
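---

# Odds-ratio confidence intervals in R

A minimal sketch of this asymmetry: build the Wald interval on the coefficient scale and then exponentiate (shown for `mod2` from the previous slides):

```r
tidy(mod2) %>%
  mutate(OR    = exp(estimate),
         lower = exp(estimate - 1.96 * std.error),  # interval built on the log-odds scale...
         upper = exp(estimate + 1.96 * std.error)) %>%  # ...is asymmetric around OR
  select(term, OR, lower, upper)
```

The interval is symmetric around `\(\hat{\beta}_j\)`, but after exponentiating it is no longer symmetric around `\(e^{\hat{\beta}_j}\)`.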
<img src="images/probit.pdf" alt=""> --- # Latent-data model * Logit: `\(Pr(y_i=1)=logit^{-1}(X_i \beta)= \frac{e^x}{1+e^x}\)` $$ y_i= \begin{cases} 1 & \text{if } z_i>0\\ 0 & \text{if } z_i<0\\ \end{cases} $$ $$ z_i=X_i\beta + \epsilon_i, \qquad \epsilon_i \sim logistic \text{ probability distribution} $$ * **Probit:** same model, but `\(\epsilon_i \sim N(0, 1)\)` * Probit coefs are logit coef divided by aprox. 1.6 [Why1.6?](http://www.johndcook.com/blog/2010/05/18/normal-approximation-to-logistic/) --- ```r glm(formula=switch~dist100*arsenic+assn+ed4, family=binomial(link = "probit"), data=wells) %>% summary() ``` ``` ## ## Call: ## glm(formula = switch ~ dist100 * arsenic + assn + ed4, family = binomial(link = "probit"), ## data = wells) ## ## Deviance Residuals: ## Min 1Q Median 3Q ## -3.4340 -0.7698 -0.5888 0.9990 ## Max ## 2.6235 ## ## Coefficients: ## Estimate Std. Error ## (Intercept) -1.12111 0.08737 ## dist100 -0.10430 0.01870 ## arsenic 0.45576 0.03513 ## assn -0.06536 0.03675 ## ed4 0.10630 0.02360 ## dist100:arsenic 0.03386 0.01093 ## z value Pr(>|z|) ## (Intercept) -12.832 < 2e-16 *** ## dist100 -5.577 2.44e-08 *** ## arsenic 12.973 < 2e-16 *** ## assn -1.778 0.07536 . ## ed4 4.504 6.68e-06 *** ## dist100:arsenic 3.097 0.00195 ** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' ## 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 5328.9 on 4005 degrees of freedom ## Residual deviance: 4346.6 on 4000 degrees of freedom ## (2 observations deleted due to missingness) ## AIC: 4358.6 ## ## Number of Fisher Scoring iterations: 5 ``` **Coefficients in a probit regression are typically close to logistic regression coefficients divided by 1.6** --- # Effetti Marginali Marginal effects for continuous variables measure the *instantaneous rate of change* (defined shortly). They are popular in some disciplines because they often provide a good approximation to the amount of change in `\(Y\)` that will be produced by a 1-unit change in `\(X_k\)`. 
---

# Marginal effects

Marginal effects for continuous variables measure the *instantaneous rate of change*. They are popular in some disciplines because they often provide a good approximation to the amount of change in `\(Y\)` produced by a 1-unit change in `\(X_k\)`.

In the linear regression model `\(Y_i=\alpha + \beta X_i + \epsilon_i\)`, the *marginal effect* is simply `\(\beta\)`.

---

<img src="images/scatter6.pdf" alt="">

---

<img src="images/scatter8.pdf" alt="">

---

# Predicted values

```r
library(margins)

mod1 <- glm(switch ~ arsenic, data = wells,
            family = binomial(link = "logit"))

augment(mod1, type.predict = "response") %>%
  ggplot(aes(x = arsenic, y = .fitted)) +
  geom_point()
```

<img src="lezione12_files/figure-html/unnamed-chunk-7-1.png" width="60%" />

---

# Conditional predicted value (cplot)

```r
cplot(mod1, x = "arsenic", se.type = "shade")
```

```
##       xvals     yvals     upper
## 1  1.000000 0.1953373 0.2109323
## 2  1.291667 0.2464163 0.2622056
## 3  1.583333 0.3057764 0.3217845
## 4  1.875000 0.3723702 0.3892481
## 5  2.166667 0.4441875 0.4630144
## 6  2.458333 0.5184150 0.5400169
## 7  2.750000 0.5918386 0.6162705
## 8  3.041667 0.6613798 0.6879197
## 9  3.333333 0.7245864 0.7520424
## 10 3.625000 0.7799208 0.8069980
## 11 3.916667 0.8267952 0.8523781
## 12 4.208333 0.8654088 0.8887081
## 13 4.500000 0.8964918 0.9170714
## 14 4.791667 0.9210512 0.9387777
## 15 5.083333 0.9401723 0.9551328
## 16 5.375000 0.9548891 0.9673086
## 17 5.666667 0.9661163 0.9762898
## 18 5.958333 0.9746235 0.9828677
## 19 6.250000 0.9810367 0.9876590
## 20 6.541667 0.9858527 0.9911339
##        lower
## 1  0.1797423
## 2  0.2306270
## 3  0.2897684
## 4  0.3554922
## 5  0.4253606
## 6  0.4968132
## 7  0.5674067
## 8  0.6348399
## 9  0.6971304
## 10 0.7528435
## 11 0.8012124
## 12 0.8421096
## 13 0.8759122
## 14 0.9033247
## 15 0.9252118
## 16 0.9424696
## 17 0.9559427
## 18 0.9663793
## 19 0.9744145
## 20 0.9805715
```

<img src="lezione12_files/figure-html/unnamed-chunk-8-1.png" width="60%" />

---

```r
wells_new <- wells %>%
  mutate(arsenic = runif(4008, min = 0, max = 10))

augment(mod1, type.predict = "response", newdata = wells_new) %>%
  ggplot(aes(x = arsenic, y = .fitted)) +
  geom_point()
```

<img src="lezione12_files/figure-html/unnamed-chunk-9-1.png" width="60%" />

---

# Marginal effects with interactions

```r
cplot(mod2, x = "dist100", dx = "arsenic", what = "effect",
      se.type = "shade")
```

<img src="lezione12_files/figure-html/unnamed-chunk-10-1.png" width="60%" />

---

```r
library(haven)
UKB <- read_dta("data/UKB_teaching.dta")
```

---

```r
names(UKB)
```

```
##  [1] "sex"
##  [2] "year_birth"
##  [3] "neb"
##  [4] "childless"
##  [5] "smoking_pregnancy_mom"
##  [6] "educ"
##  [7] "aam"
##  [8] "afs"
##  [9] "afb"
## [10] "meno"
## [11] "birthweight"
## [12] "teen"
## [13] "bmi"
## [14] "num_siblings"
## [15] "country"
## [16] "gdp_spl"
## [17] "region_gdp"
## [18] "pill_lad"
## [19] "idn"
```

---

```r
UKlog1 <- glm(childless ~ pill_lad + I(region_gdp), data = UKB,
              family = binomial(link = "logit"))

UKlog1 %>%
  tidy() %>%
  mutate(OR = exp(estimate), OR.std = exp(std.error)) %>%
  select(term, OR, OR.std, p.value)
```

```
## # A tibble: 10 x 4
##    term                OR OR.std  p.value
##    <chr>            <dbl>  <dbl>    <dbl>
##  1 (Intercept)      0.144   1.03 0.      
##  2 pill_lad         1.01    1.00 0.      
##  3 I(region_gdp)Lo… 1.27    1.03 1.24e-13
##  4 I(region_gdp)No… 0.971   1.03 3.08e- 1
##  5 I(region_gdp)Re… 1.19    1.03 1.53e- 7
##  6 I(region_gdp)Sc… 1.35    1.03 1.22e-19
##  7 I(region_gdp)So… 0.975   1.04 4.98e- 1
##  8 I(region_gdp)Wa… 0.967   1.04 4.15e- 1
##  9 I(region_gdp)We… 1.09    1.03 1.55e- 2
## 10 I(region_gdp)Yo… 0.935   1.03 3.66e- 2
```

---

```r
#cplot(UKlog1, x = "pill_lad", se.type = "shade")
```

---

```r
#cplot(UKlog1, x = "pill_lad", se.type = "shade", what = "effect")
```
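---

# Average marginal effects

Going back to the wells example, here is a minimal sketch using `margins()` from the **margins** package (loaded earlier), which computes average marginal effects — the derivative of the predicted probability averaged over the observed data:

```r
# average marginal effect of arsenic on Pr(switch = 1)
summary(margins(mod1))
```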
---

# Multinomial Logit Models

* An outcome is *nominal* when its categories are assumed to be unordered.
* For example, marital status can be grouped nominally into the categories divorced, never married, married, or widowed.
* Occupations might be organized as professional, white collar, or blue collar.
* The **multinomial logit model** (MNLM) is the most frequently used nominal regression model. It essentially fits separate binary logits for each pair of outcome categories (a sketch of this idea follows the descriptive slides below).

---

# Italian high school data

Data come from ITAGEN2.

```r
library(knitr)

itagen <- read_dta("data/itagen2_peso.dta") %>%
  filter(GRADE <= 4) %>%
  mutate(hschool = as_factor(hschool),
         generation = as_factor(generation),
         SEX = as_factor(SEX),
         educParents = as_factor(educParents),
         GRADE = as_factor(GRADE))
```

---

The table shows the choice of secondary school track for 1,294 students in Italy, by generation status.

```r
TAB <- table(itagen$hschool, itagen$generation) %>%
  prop.table(2) %>%
  round(2)
kable(TAB, format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> Natives </th>
   <th style="text-align:right;"> Second Generation </th>
   <th style="text-align:right;"> 1.5 Generation </th>
   <th style="text-align:right;"> Immigrants </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Vocational course </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Vocational School </td>
   <td style="text-align:right;"> 0.15 </td>
   <td style="text-align:right;"> 0.23 </td>
   <td style="text-align:right;"> 0.37 </td>
   <td style="text-align:right;"> 0.53 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Polytechnic </td>
   <td style="text-align:right;"> 0.34 </td>
   <td style="text-align:right;"> 0.38 </td>
   <td style="text-align:right;"> 0.34 </td>
   <td style="text-align:right;"> 0.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> High school </td>
   <td style="text-align:right;"> 0.51 </td>
   <td style="text-align:right;"> 0.39 </td>
   <td style="text-align:right;"> 0.29 </td>
   <td style="text-align:right;"> 0.17 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Not applicable </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
</tbody>
</table>

N. Barban and M. J. White (2011) Immigrants' Children's Transition to Secondary School in Italy. *International Migration Review*, 45(3):702–726

---

# Transition to Secondary School in Italy

* Choosing a secondary school is an important decision in tracked educational systems such as Italy's (or Germany's), since it influences educational achievement and labor-force trajectories.
* We analyze the effect of generational status and length of residence on the transition to secondary school among immigrants living in Italy.
* Children of immigrants *are more likely to have inferior outcomes* on the middle school exam and to *enroll in vocational and polytechnic schools*.
* After *controlling for the family's human capital* and other key background factors, immigrant students show a greater propensity to choose a vocational path.
* Differences remain when *previous scholastic results* are taken into account.
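---

# MNLM as paired binary logits

A minimal sketch of the idea that the MNLM "fits separate binary logits for each pair of outcome categories": restricting the data to two categories and fitting a binary logit gives coefficients close to — though not identical to — the corresponding multinomial contrast estimated with `multinom()` on the next slides. The object names `vp` and `voc` are introduced here for illustration:

```r
# binary logit on the Vocational School vs. High school subsample
vp <- itagen %>%
  filter(hschool %in% c("Vocational School", "High school")) %>%
  mutate(voc = as.numeric(hschool == "Vocational School"))

glm(voc ~ generation, family = "binomial", data = vp) %>% tidy()
```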
---

# The Multinomial Distribution

* Consider a random variable `\(Y_i\)` that may take one of several discrete values, which we index `\(1,2,\ldots,J\)`. In our example the response is **secondary school** and it takes the values Vocational, Polytechnic, and High school, which we index 1, 2, and 3.
* Let `\(\pi_{ij} = Pr(Y_i = j)\)` denote the probability that the i-th response falls in the j-th category. In the example, `\(\pi_{i1}\)` is the probability that the i-th respondent chooses the vocational track.
* Assuming that the response categories are mutually exclusive and exhaustive, we have `\(\sum_{j=1}^J \pi_{ij} = 1\)` for each `\(i\)`, i.e. the probabilities add up to one for each individual, and we have only `\(J-1\)` free parameters.
* The probability distribution of the counts `\(Y_{ij}\)` given the total `\(n_i\)` is given by the multinomial distribution.

---

# Multinomial logit model

In the multinomial logit model we assume that the log-odds of each response follow a linear model

$$\eta_{ij}=\log \frac{\pi_{ij}}{\pi_{iJ}} =\alpha_j+\sum_{k=1}^K x_{ik}\beta_{kj},$$

where `\(\alpha_j\)` is a constant and `\(\beta_{1j}, \ldots, \beta_{Kj}\)` is a vector of regression coefficients, for `\(j = 1, 2,\ldots, J-1\)`.

---

# Modeling the Probabilities

* The multinomial logit model may also be written in terms of the original probabilities `\(\pi_{ij}\)` rather than the log-odds.
* Starting from the previous equation and adopting the convention that `\(\eta_{iJ} = 0\)`, we can write:

$$\pi_{ij}=\frac{\exp(\eta_{ij})}{\sum_{k=1}^{J} \exp(\eta_{ik})}$$

---

```r
library("nnet")
itagen$hschool <- relevel(itagen$hschool, ref = "High school")
m <- multinom(hschool ~ generation, data = itagen)
```

```
## # weights:  15 (8 variable)
## initial  value 1379.857035 
## iter  10 value 1281.307255
## final  value 1278.258614 
## converged
```

```r
tidy(m)
```

```
## # A tibble: 8 x 6
##   y.level    term      estimate std.error
##   <chr>      <chr>        <dbl>     <dbl>
## 1 Vocationa… (Interce…   -1.22     0.103 
## 2 Vocationa… generati…    0.677    0.236 
## 3 Vocationa… generati…    1.47     0.224 
## 4 Vocationa… generati…    2.34     0.261 
## 5 Polytechn… (Interce…   -0.408    0.0777
## 6 Polytechn… generati…    0.357    0.201 
## 7 Polytechn… generati…    0.552    0.218 
## 8 Polytechn… generati…    0.986    0.272 
## # … with 2 more variables:
## #   statistic <dbl>, p.value <dbl>
```

```r
exp(2.319877)
```

```
## [1] 10.17442
```

---

```r
tidy(m) %>%
  mutate(expB = exp(estimate), expSE = exp(std.error)) %>%
  select(y.level, term, expB, expSE)
```

```
## # A tibble: 8 x 4
##   y.level      term           expB expSE
##   <chr>        <chr>         <dbl> <dbl>
## 1 Vocational … (Intercept)   0.296  1.11
## 2 Vocational … generationSe…  1.97  1.27
## 3 Vocational … generation1.…  4.35  1.25
## 4 Vocational … generationIm… 10.4   1.30
## 5 Polytechnic  (Intercept)   0.665  1.08
## 6 Polytechnic  generationSe…  1.43  1.22
## 7 Polytechnic  generation1.…  1.74  1.24
## 8 Polytechnic  generationIm…  2.68  1.31
```

`\(e^{2.32}\approx 10.2\)` tells us that the odds of ending up in vocational training rather than high school are about 10 times higher for immigrants **than for natives**.
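---

# From log-odds to probabilities

A minimal sketch of the formula `\(\pi_{ij}=\exp(\eta_{ij})/\sum_{k} \exp(\eta_{ik})\)`, recovering the predicted probabilities for natives (all covariates at the reference level, so `\(\eta_{ij}=\alpha_j\)`; the intercepts below are taken from `tidy(m)` above):

```r
eta <- c(`High school` = 0,            # reference category: eta_iJ = 0
         `Vocational School` = -1.22,  # intercepts from tidy(m)
         Polytechnic = -0.408)

exp(eta) / sum(exp(eta))               # softmax: probabilities sum to 1
```

These match the "Natives" column of the cross-tabulation shown earlier (0.51, 0.15, 0.34).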
---

# Model with interactions

```r
m2 <- multinom(hschool ~ GRADE*generation, data = itagen)
```

```
## # weights:  87 (56 variable)
## initial  value 1379.857035 
## iter  10 value 1066.535013
## iter  20 value 1046.522909
## iter  30 value 1044.911273
## iter  40 value 1044.708395
## iter  50 value 1044.704889
## final  value 1044.704870 
## converged
```

```r
tidy(m2)
```

```
## # A tibble: 56 x 6
##    y.level   term     estimate std.error
##    <chr>     <chr>       <dbl>     <dbl>
##  1 Vocation… (Interc…    11.1   2.17e- 1
##  2 Vocation… GRADEEx…   -16.2   8.07e- 1
##  3 Vocation… GRADEVe…   -13.4   3.07e- 1
##  4 Vocation… GRADEGo…   -11.7   2.59e- 1
##  5 Vocation… GRADESu…   -10.2   2.84e- 1
##  6 Vocation… GRADEI …     0     2.48e-15
##  7 Vocation… GRADENo…     0     NaN     
##  8 Vocation… generat…    -2.07  2.42e- 1
##  9 Vocation… generat…     6.00  2.53e- 1
## 10 Vocation… generat…     3.74  4.06e- 1
## # … with 46 more rows, and 2 more
## #   variables: statistic <dbl>,
## #   p.value <dbl>
```

---

```r
fitted(m2) %>% head()
```

```
##   High school Vocational School
## 2   0.3372108        0.17829337
## 3   0.1644749        0.40789088
## 4   0.8407941        0.00497500
## 5   0.6600994        0.06896491
## 6   0.8407941        0.00497500
## 7   0.8407941        0.00497500
##   Polytechnic
## 2   0.4844959
## 3   0.4276342
## 4   0.1542309
## 5   0.2709357
## 6   0.1542309
## 7   0.1542309
```

```r
newdata <- data.frame(
  generation = rep(c("Natives", "Second Generation",
                     "1.5 Generation", "Immigrants"), 4),
  GRADE = c(rep("Sufficient", 4), rep("Good", 4),
            rep("Very Good", 4), rep("Excellent", 4)))

P <- predict(m2, newdata = newdata, "probs") %>% data.frame()
```

---

```r
predictions1 <- cbind(newdata, P) %>%
  filter((GRADE == "Excellent" | GRADE == "Sufficient") &
           (generation == "Natives" | generation == "Immigrants")) %>%
  select(GRADE, generation, High.school)

predictions2 <- cbind(newdata, P) %>%
  filter((GRADE == "Excellent" | GRADE == "Sufficient") &
           (generation == "Natives" | generation == "Immigrants")) %>%
  select(GRADE, generation, Vocational.School)
```

---

```r
predictions1
```

```
##         GRADE generation High.school
## 1  Sufficient    Natives  0.16447492
## 4  Sufficient Immigrants  0.01563354
## 13  Excellent    Natives  0.84079410
## 16  Excellent Immigrants  0.64705153
```

* Natives who got Excellent on the previous exam have an 84% probability of choosing high school, while the same probability is 65% for recent immigrants

---

```r
predictions2
```

```
##         GRADE generation
## 1  Sufficient    Natives
## 4  Sufficient Immigrants
## 13  Excellent    Natives
## 16  Excellent Immigrants
##    Vocational.School
## 1       4.078909e-01
## 4       7.656167e-01
## 13      4.975000e-03
## 16      1.588798e-07
```

* Natives who got Sufficient on the previous exam have a 41% probability of choosing vocational school, while the same probability is 77% for recent immigrants

---
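# Visualizing the predicted probabilities

A minimal closing sketch (not part of the original analysis): reshape the predictions into long format and plot the probability of each track by grade and generation:

```r
cbind(newdata, P) %>%
  # column names follow data.frame() munging: "High school" -> High.school
  pivot_longer(cols = c(High.school, Vocational.School, Polytechnic),
               names_to = "track", values_to = "prob") %>%
  ggplot(aes(x = GRADE, y = prob, fill = track)) +
  geom_col(position = "dodge") +
  facet_wrap(~ generation)
```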