分析學優勢：邏輯回歸

本文僅為作者學完資料分析課程後的筆記記錄，對資料分析感興趣可在評論區留言，建群討論。以framingham.csv資料集為例

讀取資料集：framingham = read.csv("framingham.csv")。

建立邏輯回歸模型，先安裝，再載入caTools庫：library(caTools)

劃分資料集為訓練集train和測試集test：

先設置隨機種子，比如set.seed(1000);set.seed() 與 sample.split()配套使用才會保證每次結果不變化。

split = sample.split(framingham$TenYearCHD, SplitRatio = 0.65)，其中TenYearCHD為預建立模型的因變數；65%的資料集作為訓練集， 35%的為測試集；

train = subset(framingham, split==TRUE)；

test = subset(framingham, split==FALSE)。

建立模型：使用glm()函數， “glm”意為“廣義線性模型”， train中的TenYearCHD ~ .表示TenYearCHD作為因變數，其餘變數為引數，需要添加family=binomial參數。

framinghamLog = glm(TenYearCHD ~ ., data = train, family=binomial)；

總結模型，查看各個變數前係數以及變數對模型重要性和模型AIC：summary(framinghamLog)。 AIC相當於線性回歸中的Adjusted R-squared，

更好的模型通常有更小的AIC。

在測試集上做預測：

predictTest = predict(framinghamLog, type="response", newdata=test)， predictTest為測試集上每個物件對應的概率；可使用summary(predictTest)查看變數統計學資訊；。 type="response"是使模型給出我們概率。

使用混淆矩陣評估模型：

混淆矩陣：table(test$TenYearCHD, predictTest > 0.5)；

FALSE TRUE

0 1069 6

1 187 11

準確率：(1069+11)/(1069+6+187+11)

基線模型準確率：(1069+6)/(1069+6+187+11)

計算Sensitivity和specificity：（高閾值t：lower sensitivity（真陽性率） and a higher specificity；低閾值： a higher sensitivity and a lowerspecificity.）

Sensitivity 和 specificity

計算測試集的AUC，

先安裝載入ROCR包：

library(ROCR)

ROCRpred = prediction(predictTest, test$TenYearCHD)

as.numeric(performance(ROCRpred, "auc")@y.values)

使用quality.csv資料集，讀取數據quality = read.csv("quality.csv")

建模：

set.seed(88)

split = sample.split(quality$PoorCare, SplitRatio = 0.75)

qualityTrain = subset(quality, split == TRUE)

qualityTest = subset(quality, split == FALSE)

QualityLog = glm(PoorCare ~ StartedOnCombination+ProviderCount, data=qualityTrain, family=binomial)

summary(QualityLog)

predictTrain = predict(QualityLog, type="response")

在訓練集上分析預測結果predictions：

summary(predictTrain)

tapply(predictTrain, qualityTrain$PoorCare, mean)：輸出結果：0：0.19；1：0.44。即實際為poorcare(1)的平均概率0.44；實際為goodcare(0)的平均概率0.19。

計算混淆矩陣：

table(qualityTrain$PoorCare, predictTrain > 0.5)， 0.5為設置的閾值t，閾值t的選擇基於error，兩類error之間不存在偏好時，則選擇t=0.5；選擇閾值的好辦法是在訓練集上計算然後畫ROC曲線，選擇合適的閾值。

在訓練集上計算然後畫ROC曲線，選擇合適的閾值（不一定是0.5）：

先載入ROCR庫， library(ROCR)。

ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)

使用performance()函數：

ROCRperf = performance(ROCRpred, "tpr", "fpr")

畫圖ROC圖， colorize=TRUE控制顏色， print.cutoffs.at=seq(0,1,by=0.1)增加閾值標籤。

plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

測試集上做預測，計算AUC：

predictTest = predict(QualityLog, type="response", newdata=qualityTest)

ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)

auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)

訓練集上的ROC曲線

選擇閾值t，應該儘量使真陽性率更大，即sensitivity大，假陽性率更小。

處理缺失值問題：使用mice庫，將缺失值根據合理的資訊填充，即觀察物件數目沒變。 install.packages("mice")

library(mice)

那麼遇到缺失值如何處理？是填充還是移除？

移除後，數據會不會太少？移除數據後會不會對模型產生偏見？（可以subset()，看缺失值的觀察物件具體有多少，然後查看有缺失值的觀察的相關統計量與整個資料集是否有較大差別，如無較大差別，移除不會產生偏見。一般使用填充缺失值的方法。）見課程The Analytics Edge第三周的第三個Assignment。

以PollingData.csv為例，讀取數據polling = read.csv("PollingData.csv")

查看資料集統計學資訊、是否有缺失值summary(polling)

填充缺失值：

polling的Rasmussen, SurveyUSA兩個變數有缺失值，使用mice()函數，還有set.seed()

simple = polling[c("Rasmussen", "SurveyUSA")]

summary(simple)

set.seed(144)

imputed = complete(mice(simple))

summary(imputed)

polling$Rasmussen = imputed$Rasmussen

polling$SurveyUSA = imputed$SurveyUSA

summary(polling)

polling資料集已經沒有缺失值了！

寫成csv文件查看！

write.csv(polling,"Polling1.csv")

符號函數sign()：sign(20)=1；sign(-10)=-1；sign(0)=0

smart baseline model：table(sign(Train$Rasmussen))；table(Train$Republican, sign(Train$Rasmussen))

二分類中的Odds：（poor care：y=1;good care:y=0）

Odds=p(y=1)/p(y=0)

log(Odds)=beta0+beta1*x1+……+betan*xn， betai為正，增加Odds；log(Odds)稱為“Logits",Logits越大， p(y=1)越大。

以parole.csv資料集為例， parole = read.csv("parole.csv"),其中state和crime為parole資料集的至少3個level的兩個unorder factor varible。先使用as.factor()函數轉為factor varible。

state: a code for the parolee's state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. The three states were selected due to having a high representation in the dataset.

crime: a code for the parolee's main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.

parole$state = as.factor(parole$state)

parole$crime = as.factor(parole$crime)

建立模型：

set.seed(144)

library(caTools)

split = sample.split(parole$violator, SplitRatio = 0.7)

train = subset(parole, split == TRUE)

test = subset(parole, split == FALSE)

mod = glm(violator~., data=train, family="binomial")

summary(mod)的輸出結果：

模型總結

模型解釋：

log(odds) = -4.2411574 + 0.3869904*male + 0.8867192*race - 0.0001756*age + 0.4433007*state2 + 0.8349797*state3 - 3.3967878*state4 - 0.1238867*time.served + 0.0802954*max.sentence + 1.6119919*multiple.offenses + 0.6837143*crime2 - 0.2781054*crime3 - 0.0117627*crime4.

if a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny

This parolee has male=1, race=1, age=50, state2=0, state3=0, state4=0, time.served=3, max.sentence=12, multiple.offenses=0, crime2=1, crime3=0, crime4=0. We conclude that log(odds) = -1.700629.

Therefore, the odds ratio is exp(-1.700629) = 0.183, and the predicted probability of violation is 1/(1+exp(1.700629)) = 0.154.

以loans.csv檔為例（完整代碼）：

loans=read.csv("loans.csv")

str(loans)

summary(loans)

table(loans$not.fully.paid)

str(missing)

table(missing$not.fully.paid)

library(mice)

set.seed(144)

#對所有引數做imputation

vars.for.imputation = setdiff(names(loans), "not.fully.paid")

imputed = complete(mice(loans[vars.for.imputation]))

loans[vars.for.imputation] = imputed

library(caTools)

set.seed(144)

spl = sample.split(loans$not.fully.paid, 0.7)

train = subset(loans, spl == TRUE)

test = subset(loans, spl == FALSE)

mod = glm(not.fully.paid~., data=train, family="binomial")

summary(mod)

test$predicted.risk = predict(mod, newdata=test, type="response")

table(test$not.fully.paid, test$predicted.risk > 0.5)

library(ROCR)

pred = prediction(test$predicted.risk, test$not.fully.paid)

as.numeric(performance(pred, "auc")@y.values)

#模型的auc與選擇的閾值t有關。 The AUC deals with differentiating between a randomly selected positive and #negative example. It is independent of the regression cutoff selected.