Datathon Cajamar PythonHack 2016

Team: WoodenSpoonNinjas

We start by importing the packages and modules we are going to use.

In [1]:
from collections import Counter
import numpy as np
import pandas as pd
import seaborn

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, confusion_matrix, classification_report
from sklearn import metrics   # More metrics

from sklearn.model_selection import GridSearchCV  # Cross-validated search over a parameter grid

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
  
import itertools
C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

We load the training data (previously converted to CSV) as a pandas DataFrame and split it into the independent variables X and the dependent variable y. We then count the number of records in each class.

In [62]:
filename = 'pm_train.csv'
D = pd.read_csv(filename)
columns = D.columns

X = D.drop(["TARGET"],axis=1).values
y = D['TARGET'].values
print(Counter(y))
Counter({0: 464360, 1: 7478})

As can be seen, the classes are highly imbalanced: only about 1.6% of the samples (7,478 of 471,838) belong to class 1. At this point we have two options:

  1. Adjust the number of samples so that both classes are the same size (using undersampling, oversampling, or a combination of the two).
  2. Leave the number of samples unchanged and have the model account for the class prior in its predictions.

The evaluation criteria for the challenge state that correctly classifying both classes will be taken into account (both specificity and sensitivity are scored). We therefore follow option 1, so that our results are not biased towards class 0 by the imbalance. This also makes the results robust to future changes in the prior probability of a customer taking out or not taking out a loan.
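
To make option 1 concrete, here is a minimal sketch of random undersampling in plain Python. It only illustrates the idea and is not the imblearn implementation used in this notebook; the helper name `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=42):
    """Randomly drop samples until every class is as small as the rarest one.

    Hypothetical helper for illustration only; the notebook itself relies on
    imblearn's RandomUnderSampler.
    """
    rng = random.Random(seed)
    # Group the sample indices by class label
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    # Keep n_min randomly chosen samples from each class
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data with roughly the same 1.6% positive rate as the training set
y_toy = [1] * 16 + [0] * 984
X_toy = [[i] for i in range(len(y_toy))]
X_bal, y_bal = random_undersample(X_toy, y_toy)
counts = Counter(y_bal)
print(counts)
```

Both classes end up with 16 samples each. The cost is that most of the majority class is discarded, which is the trade-off accepted here in exchange for an unbiased operating point.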

In [17]:
method = "US"
if method == "US":
    from imblearn.under_sampling import RandomUnderSampler
    resamp = RandomUnderSampler(random_state=42)
    X_res, y_res = resamp.fit_sample(X, y)
elif  method == "OS":
    from imblearn.over_sampling import SMOTE
    resamp = SMOTE(random_state=42)
    X_res, y_res = resamp.fit_sample(X, y)
    sel = np.random.randint(0, y_res.shape[0], 20000)
    X_res = X_res[sel,:]
    y_res = y_res[sel]   
else:
    X_res, y_res = X, y

print(Counter(y_res))
Counter({0: 7478, 1: 7478})
In [18]:
X_sel = X_res  # Use all the resampled features for training

Cross-validation

As the classification model we use an eXtreme Gradient Boosting model from the xgboost library. To select the model's hyperparameters we use sequential cross-validation (tuning one group of hyperparameters at a time), because the limited time and computing power available make an exhaustive search over the full hyperparameter grid infeasible.
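
The sequential strategy can be sketched as follows. This is an illustrative outline, not the code used in this notebook: `sequential_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a cross-validated AUC. The grids are the ones searched later in this notebook.

```python
from itertools import product


def sequential_search(score, stages, start):
    """Tune one group of hyperparameters at a time, keeping the best so far.

    `score` maps a parameter dict to a number (higher is better), `stages` is
    a list of small grids, and `start` holds the initial parameter values.
    """
    best = dict(start)
    n_evals = 0
    for grid in stages:
        names = list(grid)
        candidates = []
        # Try every combination inside this stage, with all else held fixed
        for values in product(*(grid[name] for name in names)):
            params = dict(best, **dict(zip(names, values)))
            candidates.append((score(params), params))
            n_evals += 1
        best = max(candidates, key=lambda c: c[0])[1]
    return best, n_evals


stages = [
    {"max_depth": range(3, 10, 2), "min_child_weight": range(1, 6, 2)},
    {"gamma": [i / 10.0 for i in range(0, 5)]},
]
start = {"max_depth": 5, "min_child_weight": 1, "gamma": 0.0}


def toy_score(p):
    # Made-up objective whose maximum is at max_depth=5, min_child_weight=1, gamma=0
    return -((p["max_depth"] - 5) ** 2) - (p["min_child_weight"] - 1) ** 2 - p["gamma"]


best, n_sequential = sequential_search(toy_score, stages, start)
n_exhaustive = 4 * 3 * 5  # every combination of the three parameters
print(best, n_sequential, n_exhaustive)
```

For these three parameters the sequential search evaluates 17 candidates instead of 60, and each candidate costs one full 5-fold cross-validation. The saving grows with every parameter group added, at the price of possibly missing interactions between groups.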

Selecting the number of estimators

First, we fix every model parameter except the number of estimators and cross-validate that parameter alone. To do so, we first implement a helper function, modelfit, that performs the sweep.

In [19]:
def modelfit(alg, predictors, target, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(predictors, label=target)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
        alg.set_params(n_estimators=cvresult.shape[0])

        
    # Fit the algorithm on the data
    alg.fit(predictors, target, eval_metric='auc')

    # Predict training set:
    dtrain_predictions = alg.predict(predictors)
    dtrain_predprob = alg.predict_proba(predictors)[:, 1]

    # Print model report:
    print("Model Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(target, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(target, dtrain_predprob))
    

    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    plt.figure(figsize=(20,10))
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    return alg

We run the sweep and observe that it ends by reporting 385 as the optimal number of estimators. From this point on, we fix this hyperparameter and optimize the rest. (The log reproduced below stops at round 290, so it does not appear to come from the run that produced 385.)
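
The stopping rule behind the sweep can be illustrated with a simplified sketch. This is not xgboost's implementation; the function name `early_stopping_round` and the toy validation curve are hypothetical, and the sketch assumes a metric where higher is better, such as AUC.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, last_round_run) for a 'higher is better' metric.

    Training stops once `patience` rounds have passed without improving on
    the best score seen so far.
    """
    best_round = 0
    for rnd, value in enumerate(scores):
        if value > scores[best_round]:
            best_round = rnd
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(scores) - 1


# Toy validation curve: it improves up to round 3 and then plateaus
toy_auc = [0.80, 0.85, 0.88, 0.89, 0.889, 0.888, 0.889, 0.887, 0.886]
best_round, last_round = early_stopping_round(toy_auc, patience=3)
n_estimators = best_round + 1  # rounds are 0-indexed
print(best_round, last_round, n_estimators)
```

In modelfit the same idea is applied with `early_stopping_rounds=50` on the mean test AUC across folds, and the number of rows in the result returned by `xgb.cv` is then used as `n_estimators`.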

In [26]:
xgb1 = XGBClassifier(
 learning_rate=0.1,
 n_estimators= 1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=-1,
 scale_pos_weight=1,
 seed=27)
alg = modelfit(xgb1, X_sel, y_res)
[0]	train-auc:0.839808+0.00171059	test-auc:0.826817+0.0103184
[1]	train-auc:0.851201+0.00392178	test-auc:0.837704+0.00693683
[2]	train-auc:0.858353+0.00328648	test-auc:0.84396+0.0079885
[3]	train-auc:0.86212+0.00330253	test-auc:0.847497+0.00756405
[4]	train-auc:0.865265+0.00278012	test-auc:0.85011+0.00799946
[5]	train-auc:0.867921+0.00302978	test-auc:0.852881+0.00793075
[6]	train-auc:0.86987+0.00339854	test-auc:0.854506+0.00822721
[7]	train-auc:0.872393+0.00287162	test-auc:0.856482+0.00791109
[8]	train-auc:0.874293+0.00311432	test-auc:0.858005+0.00736747
[9]	train-auc:0.876129+0.00277235	test-auc:0.859624+0.007468
[10]	train-auc:0.877636+0.00269972	test-auc:0.860497+0.00767365
[11]	train-auc:0.879046+0.00246435	test-auc:0.861499+0.0079584
[12]	train-auc:0.880173+0.00245154	test-auc:0.862691+0.00771334
[13]	train-auc:0.881602+0.00205924	test-auc:0.86382+0.00788325
[14]	train-auc:0.882786+0.00191871	test-auc:0.864604+0.00805297
[15]	train-auc:0.884163+0.00179159	test-auc:0.865775+0.00806183
[16]	train-auc:0.885512+0.0017605	test-auc:0.866849+0.00814692
[17]	train-auc:0.886659+0.00186687	test-auc:0.867667+0.0079363
[18]	train-auc:0.887576+0.00194896	test-auc:0.868506+0.00804284
[19]	train-auc:0.888799+0.00182833	test-auc:0.869275+0.00826667
[20]	train-auc:0.889837+0.00170271	test-auc:0.869993+0.00814772
[21]	train-auc:0.891452+0.00148923	test-auc:0.871435+0.00832809
[22]	train-auc:0.892655+0.00129467	test-auc:0.872408+0.00839547
[23]	train-auc:0.893819+0.00118565	test-auc:0.873431+0.00861691
[24]	train-auc:0.894932+0.0015354	test-auc:0.874164+0.00817623
[25]	train-auc:0.896355+0.00163166	test-auc:0.875044+0.00776744
[26]	train-auc:0.897453+0.00158914	test-auc:0.875687+0.00798889
[27]	train-auc:0.898453+0.00157243	test-auc:0.876737+0.0079846
[28]	train-auc:0.899397+0.00149501	test-auc:0.87746+0.00806052
[29]	train-auc:0.900352+0.00143902	test-auc:0.878033+0.0078812
[30]	train-auc:0.90128+0.00147379	test-auc:0.878525+0.00798936
[31]	train-auc:0.902129+0.00134929	test-auc:0.879005+0.00811003
[32]	train-auc:0.902829+0.00138312	test-auc:0.879329+0.00815361
[33]	train-auc:0.903735+0.00155429	test-auc:0.879814+0.00788754
[34]	train-auc:0.904648+0.00159473	test-auc:0.880308+0.00787797
[35]	train-auc:0.905629+0.00163817	test-auc:0.880992+0.00776976
[36]	train-auc:0.906402+0.00170141	test-auc:0.881424+0.00781628
[37]	train-auc:0.90716+0.00162139	test-auc:0.881928+0.00787629
[38]	train-auc:0.907943+0.00155056	test-auc:0.882244+0.00779229
[39]	train-auc:0.90866+0.00152703	test-auc:0.882535+0.00788044
[40]	train-auc:0.909406+0.00157638	test-auc:0.882945+0.00796154
[41]	train-auc:0.910067+0.00172041	test-auc:0.883274+0.00780498
[42]	train-auc:0.91103+0.00159541	test-auc:0.883783+0.00769337
[43]	train-auc:0.911734+0.00140086	test-auc:0.884081+0.00783996
[44]	train-auc:0.912492+0.00140276	test-auc:0.884374+0.00802784
[45]	train-auc:0.913289+0.00133201	test-auc:0.884474+0.00802273
[46]	train-auc:0.913879+0.00142932	test-auc:0.884695+0.00795259
[47]	train-auc:0.914386+0.00150321	test-auc:0.884758+0.00808088
[48]	train-auc:0.91493+0.00140132	test-auc:0.884917+0.00808024
[49]	train-auc:0.915599+0.00143699	test-auc:0.885058+0.00814899
[50]	train-auc:0.91598+0.00149655	test-auc:0.885161+0.00815822
[51]	train-auc:0.916622+0.00138173	test-auc:0.88528+0.00814642
[52]	train-auc:0.91722+0.00160193	test-auc:0.885417+0.00822087
[53]	train-auc:0.917653+0.00149842	test-auc:0.885599+0.00832114
[54]	train-auc:0.918521+0.00146625	test-auc:0.885955+0.00825345
[55]	train-auc:0.919163+0.00142812	test-auc:0.886231+0.00824867
[56]	train-auc:0.919587+0.0014438	test-auc:0.886315+0.0082398
[57]	train-auc:0.920111+0.0014732	test-auc:0.88648+0.00814075
[58]	train-auc:0.920637+0.00148781	test-auc:0.886604+0.00809682
[59]	train-auc:0.921218+0.00138197	test-auc:0.886695+0.00810631
[60]	train-auc:0.921795+0.00133179	test-auc:0.886777+0.00812773
[61]	train-auc:0.922486+0.00136253	test-auc:0.887128+0.00804516
[62]	train-auc:0.92295+0.0015163	test-auc:0.887214+0.00800608
[63]	train-auc:0.923429+0.00144981	test-auc:0.887364+0.00805169
[64]	train-auc:0.923813+0.00134964	test-auc:0.887531+0.00807309
[65]	train-auc:0.924389+0.00149667	test-auc:0.887734+0.00799743
[66]	train-auc:0.925064+0.00154081	test-auc:0.887932+0.00792428
[67]	train-auc:0.925576+0.00152753	test-auc:0.887942+0.00788649
[68]	train-auc:0.926266+0.0014836	test-auc:0.888122+0.00794432
[69]	train-auc:0.926914+0.00145486	test-auc:0.888301+0.00790366
[70]	train-auc:0.92744+0.00133094	test-auc:0.888551+0.00798949
[71]	train-auc:0.927805+0.00125859	test-auc:0.888672+0.00803489
[72]	train-auc:0.928073+0.00128646	test-auc:0.888749+0.00803208
[73]	train-auc:0.928514+0.00122391	test-auc:0.888837+0.00801852
[74]	train-auc:0.929127+0.00126603	test-auc:0.888924+0.0079863
[75]	train-auc:0.929601+0.0014396	test-auc:0.889053+0.00780167
[76]	train-auc:0.930043+0.00136521	test-auc:0.889125+0.00786436
[77]	train-auc:0.930337+0.00139383	test-auc:0.889174+0.00789556
[78]	train-auc:0.93067+0.00137798	test-auc:0.889296+0.0078388
[79]	train-auc:0.931087+0.00130163	test-auc:0.889408+0.00791735
[80]	train-auc:0.931484+0.00121769	test-auc:0.889534+0.00797887
[81]	train-auc:0.931958+0.00137176	test-auc:0.889526+0.00792262
[82]	train-auc:0.932357+0.00146553	test-auc:0.889585+0.007838
[83]	train-auc:0.93248+0.00145219	test-auc:0.889633+0.00786695
[84]	train-auc:0.933023+0.00169685	test-auc:0.889761+0.0076771
[85]	train-auc:0.933342+0.00169777	test-auc:0.889835+0.00775004
[86]	train-auc:0.933727+0.00162007	test-auc:0.889807+0.00785038
[87]	train-auc:0.933988+0.00166163	test-auc:0.889856+0.00784914
[88]	train-auc:0.93442+0.00168123	test-auc:0.889878+0.00790147
[89]	train-auc:0.934753+0.00158494	test-auc:0.889931+0.00790539
[90]	train-auc:0.935113+0.00162298	test-auc:0.889949+0.00790664
[91]	train-auc:0.935476+0.00153185	test-auc:0.89015+0.00789432
[92]	train-auc:0.935859+0.00147743	test-auc:0.890144+0.00790411
[93]	train-auc:0.936157+0.00156178	test-auc:0.890267+0.00787377
[94]	train-auc:0.936452+0.00169788	test-auc:0.890325+0.00783706
[95]	train-auc:0.936883+0.00163079	test-auc:0.890412+0.00780692
[96]	train-auc:0.937252+0.00163381	test-auc:0.890428+0.00767434
[97]	train-auc:0.937587+0.00161101	test-auc:0.890601+0.00763028
[98]	train-auc:0.937971+0.0015287	test-auc:0.890675+0.00756303
[99]	train-auc:0.938328+0.0014919	test-auc:0.890627+0.0075496
[100]	train-auc:0.938596+0.00150155	test-auc:0.890664+0.00752659
[101]	train-auc:0.93892+0.00138564	test-auc:0.890646+0.00754067
[102]	train-auc:0.939307+0.00136734	test-auc:0.89065+0.00744262
[103]	train-auc:0.939596+0.0014258	test-auc:0.890754+0.00740473
[104]	train-auc:0.939898+0.00135405	test-auc:0.890817+0.00741607
[105]	train-auc:0.940332+0.00142098	test-auc:0.890928+0.00747616
[106]	train-auc:0.940723+0.00138305	test-auc:0.890907+0.00749774
[107]	train-auc:0.941038+0.00135597	test-auc:0.890871+0.00742629
[108]	train-auc:0.941328+0.00132047	test-auc:0.890945+0.00735449
[109]	train-auc:0.941599+0.00137816	test-auc:0.890968+0.00730955
[110]	train-auc:0.94193+0.00147774	test-auc:0.890945+0.00733953
[111]	train-auc:0.942242+0.00154171	test-auc:0.890905+0.00731596
[112]	train-auc:0.942511+0.00151938	test-auc:0.890916+0.00731485
[113]	train-auc:0.942797+0.00142057	test-auc:0.890955+0.00731142
[114]	train-auc:0.943141+0.0013883	test-auc:0.890885+0.00719454
[115]	train-auc:0.943439+0.00127918	test-auc:0.89085+0.00717085
[116]	train-auc:0.943794+0.00130832	test-auc:0.890848+0.00722965
[117]	train-auc:0.944128+0.00130446	test-auc:0.890863+0.0072349
[118]	train-auc:0.944345+0.00138975	test-auc:0.890904+0.00725959
[119]	train-auc:0.944655+0.00149278	test-auc:0.89089+0.00717782
[120]	train-auc:0.944954+0.00143313	test-auc:0.890981+0.00726935
[121]	train-auc:0.945228+0.0014605	test-auc:0.891109+0.00735707
[122]	train-auc:0.945561+0.00155553	test-auc:0.891147+0.00733941
[123]	train-auc:0.945819+0.00154362	test-auc:0.891155+0.00739481
[124]	train-auc:0.946168+0.00146946	test-auc:0.891254+0.0075166
[125]	train-auc:0.946498+0.00154163	test-auc:0.891372+0.00744421
[126]	train-auc:0.946821+0.00139312	test-auc:0.89146+0.00750559
[127]	train-auc:0.947133+0.0013764	test-auc:0.891564+0.00752445
[128]	train-auc:0.947412+0.00135602	test-auc:0.891536+0.00751082
[129]	train-auc:0.947624+0.00129916	test-auc:0.891635+0.00746061
[130]	train-auc:0.947962+0.00137498	test-auc:0.891737+0.00761118
[131]	train-auc:0.948173+0.00135039	test-auc:0.891812+0.00757783
[132]	train-auc:0.948483+0.00131977	test-auc:0.891844+0.00744337
[133]	train-auc:0.948772+0.00130914	test-auc:0.891867+0.00744883
[134]	train-auc:0.949043+0.00130312	test-auc:0.891859+0.00756676
[135]	train-auc:0.949314+0.00125486	test-auc:0.891909+0.00760991
[136]	train-auc:0.949589+0.00136811	test-auc:0.891932+0.00764438
[137]	train-auc:0.949734+0.00134653	test-auc:0.891868+0.00764772
[138]	train-auc:0.950072+0.00137056	test-auc:0.8919+0.00781653
[139]	train-auc:0.950321+0.00139727	test-auc:0.891884+0.0078568
[140]	train-auc:0.950536+0.00134886	test-auc:0.891951+0.00779773
[141]	train-auc:0.950738+0.00130258	test-auc:0.892009+0.00776325
[142]	train-auc:0.951028+0.00126636	test-auc:0.892062+0.00774706
[143]	train-auc:0.951241+0.00129043	test-auc:0.892072+0.00773679
[144]	train-auc:0.951476+0.00127922	test-auc:0.892165+0.00768395
[145]	train-auc:0.951764+0.00126498	test-auc:0.892122+0.00759501
[146]	train-auc:0.951996+0.00118851	test-auc:0.892192+0.00745657
[147]	train-auc:0.952211+0.00111042	test-auc:0.89223+0.00742677
[148]	train-auc:0.952483+0.000969332	test-auc:0.892283+0.00744594
[149]	train-auc:0.952765+0.000964397	test-auc:0.892304+0.00751167
[150]	train-auc:0.953157+0.00104862	test-auc:0.892299+0.00756297
[151]	train-auc:0.953447+0.00110265	test-auc:0.892297+0.00756652
[152]	train-auc:0.953726+0.00103682	test-auc:0.892294+0.00756972
[153]	train-auc:0.953952+0.000944316	test-auc:0.892316+0.00756884
[154]	train-auc:0.954122+0.000938449	test-auc:0.892336+0.0076407
[155]	train-auc:0.95439+0.000970227	test-auc:0.892377+0.00754935
[156]	train-auc:0.954589+0.00103038	test-auc:0.89236+0.00762426
[157]	train-auc:0.954868+0.000958644	test-auc:0.892446+0.00764348
[158]	train-auc:0.955084+0.000928824	test-auc:0.892536+0.00773142
[159]	train-auc:0.955349+0.000924746	test-auc:0.892525+0.00770877
[160]	train-auc:0.955545+0.000907382	test-auc:0.892448+0.0077939
[161]	train-auc:0.95572+0.000900311	test-auc:0.892482+0.00780491
[162]	train-auc:0.956016+0.00103221	test-auc:0.892513+0.00777771
[163]	train-auc:0.956241+0.0011118	test-auc:0.892538+0.00776957
[164]	train-auc:0.95647+0.0010137	test-auc:0.892515+0.00782419
[165]	train-auc:0.956692+0.0011412	test-auc:0.89247+0.00784827
[166]	train-auc:0.956886+0.00105306	test-auc:0.892551+0.00790477
[167]	train-auc:0.957053+0.00104328	test-auc:0.892548+0.00789113
[168]	train-auc:0.957207+0.00100365	test-auc:0.892562+0.00791809
[169]	train-auc:0.957423+0.00095785	test-auc:0.892591+0.00798499
[170]	train-auc:0.957625+0.000925259	test-auc:0.892588+0.00797537
[171]	train-auc:0.957804+0.00100174	test-auc:0.892644+0.00794921
[172]	train-auc:0.958046+0.0010544	test-auc:0.892646+0.00798134
[173]	train-auc:0.958317+0.00107945	test-auc:0.892668+0.00794809
[174]	train-auc:0.958537+0.00100461	test-auc:0.892665+0.00792561
[175]	train-auc:0.958898+0.00099376	test-auc:0.89258+0.0079785
[176]	train-auc:0.959109+0.000948883	test-auc:0.892579+0.0080536
[177]	train-auc:0.959346+0.00101093	test-auc:0.892566+0.00811446
[178]	train-auc:0.959543+0.000965447	test-auc:0.892508+0.00813308
[179]	train-auc:0.959738+0.000912229	test-auc:0.892557+0.00818612
[180]	train-auc:0.95998+0.000835913	test-auc:0.892611+0.00817993
[181]	train-auc:0.960294+0.000864174	test-auc:0.892601+0.00816707
[182]	train-auc:0.960525+0.000820458	test-auc:0.892607+0.00811211
[183]	train-auc:0.960796+0.000818428	test-auc:0.892623+0.0081187
[184]	train-auc:0.961035+0.00082342	test-auc:0.892601+0.00809693
[185]	train-auc:0.96118+0.000830082	test-auc:0.892586+0.0080894
[186]	train-auc:0.961481+0.000792472	test-auc:0.89259+0.00812179
[187]	train-auc:0.961712+0.000859902	test-auc:0.892542+0.00815413
[188]	train-auc:0.961898+0.000782129	test-auc:0.892556+0.00816704
[189]	train-auc:0.962082+0.000784246	test-auc:0.892662+0.00812324
[190]	train-auc:0.962253+0.000823369	test-auc:0.892658+0.00803654
[191]	train-auc:0.962494+0.000748088	test-auc:0.892664+0.00805484
[192]	train-auc:0.96268+0.000739652	test-auc:0.892724+0.0080689
[193]	train-auc:0.962823+0.000788884	test-auc:0.892676+0.0081296
[194]	train-auc:0.963081+0.000876163	test-auc:0.892667+0.0080821
[195]	train-auc:0.963302+0.000942668	test-auc:0.892725+0.00798224
[196]	train-auc:0.963594+0.000970716	test-auc:0.892757+0.00803451
[197]	train-auc:0.963805+0.000931765	test-auc:0.892762+0.00804723
[198]	train-auc:0.964053+0.000853161	test-auc:0.89283+0.00805416
[199]	train-auc:0.964247+0.000909006	test-auc:0.892828+0.00807433
[200]	train-auc:0.964396+0.000946844	test-auc:0.892817+0.00814038
[201]	train-auc:0.964519+0.000928561	test-auc:0.892793+0.00813787
[202]	train-auc:0.964782+0.000991937	test-auc:0.892722+0.00813407
[203]	train-auc:0.96496+0.00101755	test-auc:0.892715+0.00816163
[204]	train-auc:0.96518+0.000945682	test-auc:0.892668+0.00823581
[205]	train-auc:0.965357+0.000912264	test-auc:0.892646+0.00825223
[206]	train-auc:0.965553+0.000970866	test-auc:0.892657+0.00829376
[207]	train-auc:0.965718+0.000982086	test-auc:0.892603+0.00827964
[208]	train-auc:0.965945+0.00084441	test-auc:0.892572+0.00826787
[209]	train-auc:0.966141+0.000814813	test-auc:0.892598+0.00813844
[210]	train-auc:0.966294+0.000831657	test-auc:0.892618+0.0081623
[211]	train-auc:0.966468+0.00079087	test-auc:0.892609+0.00808418
[212]	train-auc:0.966692+0.000884318	test-auc:0.89253+0.00815186
[213]	train-auc:0.966902+0.000868741	test-auc:0.892517+0.0081919
[214]	train-auc:0.96706+0.000863936	test-auc:0.892525+0.00816293
[215]	train-auc:0.967242+0.000855471	test-auc:0.892629+0.00822494
[216]	train-auc:0.967448+0.00083466	test-auc:0.892618+0.0081786
[217]	train-auc:0.967606+0.000892583	test-auc:0.892655+0.00810419
[218]	train-auc:0.967752+0.00092212	test-auc:0.892682+0.00809369
[219]	train-auc:0.967993+0.000969974	test-auc:0.892624+0.00812013
[220]	train-auc:0.968197+0.000968315	test-auc:0.892679+0.00815448
[221]	train-auc:0.968466+0.000884835	test-auc:0.89263+0.00824214
[222]	train-auc:0.968626+0.000895346	test-auc:0.89265+0.00826365
[223]	train-auc:0.968753+0.000924859	test-auc:0.892676+0.0082901
[224]	train-auc:0.968897+0.000905584	test-auc:0.892683+0.00829321
[225]	train-auc:0.969077+0.000955964	test-auc:0.892618+0.00826414
[226]	train-auc:0.969247+0.000973507	test-auc:0.892704+0.00833391
[227]	train-auc:0.969395+0.000979427	test-auc:0.892712+0.00841924
[228]	train-auc:0.969567+0.000909364	test-auc:0.892717+0.00838647
[229]	train-auc:0.969708+0.000900707	test-auc:0.892785+0.00838381
[230]	train-auc:0.969876+0.000874282	test-auc:0.892806+0.00836928
[231]	train-auc:0.970041+0.000874155	test-auc:0.892785+0.00841202
[232]	train-auc:0.970146+0.000866204	test-auc:0.892826+0.0084143
[233]	train-auc:0.970261+0.000854212	test-auc:0.892837+0.00838475
[234]	train-auc:0.970485+0.000847455	test-auc:0.892819+0.00834369
[235]	train-auc:0.970662+0.000832753	test-auc:0.892838+0.00833353
[236]	train-auc:0.970852+0.000845171	test-auc:0.892889+0.00832432
[237]	train-auc:0.970989+0.000817857	test-auc:0.892935+0.00836841
[238]	train-auc:0.971124+0.000830063	test-auc:0.892914+0.00836071
[239]	train-auc:0.971286+0.000826974	test-auc:0.892923+0.00835543
[240]	train-auc:0.971432+0.000808239	test-auc:0.892957+0.00842647
[241]	train-auc:0.971576+0.00081742	test-auc:0.892976+0.00838875
[242]	train-auc:0.971701+0.000820922	test-auc:0.892927+0.00836717
[243]	train-auc:0.971858+0.000844415	test-auc:0.892937+0.00838817
[244]	train-auc:0.971996+0.000842006	test-auc:0.892915+0.00839736
[245]	train-auc:0.972084+0.000880706	test-auc:0.892919+0.00837073
[246]	train-auc:0.972213+0.000931898	test-auc:0.892889+0.00832425
[247]	train-auc:0.972293+0.000896904	test-auc:0.892869+0.00830585
[248]	train-auc:0.972377+0.000884336	test-auc:0.89286+0.00835436
[249]	train-auc:0.972498+0.000872528	test-auc:0.89283+0.00838855
[250]	train-auc:0.972647+0.000813344	test-auc:0.892788+0.00833792
[251]	train-auc:0.972753+0.000789313	test-auc:0.892788+0.00833449
[252]	train-auc:0.972954+0.00073121	test-auc:0.892799+0.00833111
[253]	train-auc:0.973086+0.00077297	test-auc:0.892824+0.0082933
[254]	train-auc:0.973196+0.00082005	test-auc:0.89284+0.00834559
[255]	train-auc:0.97335+0.000771615	test-auc:0.892823+0.00835421
[256]	train-auc:0.973514+0.000726582	test-auc:0.892807+0.00827661
[257]	train-auc:0.973606+0.000720471	test-auc:0.892786+0.00836738
[258]	train-auc:0.973707+0.000719945	test-auc:0.892788+0.00838894
[259]	train-auc:0.973813+0.000718357	test-auc:0.892793+0.00841035
[260]	train-auc:0.973881+0.000728191	test-auc:0.892801+0.00846073
[261]	train-auc:0.974043+0.000676733	test-auc:0.892843+0.00844469
[262]	train-auc:0.974173+0.000649674	test-auc:0.892782+0.00845797
[263]	train-auc:0.974341+0.000649604	test-auc:0.892693+0.00839131
[264]	train-auc:0.974459+0.000717722	test-auc:0.8927+0.0084097
[265]	train-auc:0.974626+0.000673987	test-auc:0.892693+0.00841141
[266]	train-auc:0.974767+0.000667471	test-auc:0.892666+0.00840575
[267]	train-auc:0.974914+0.000640906	test-auc:0.892699+0.00845917
[268]	train-auc:0.975048+0.000641332	test-auc:0.892604+0.0084603
[269]	train-auc:0.975195+0.000668933	test-auc:0.89263+0.00844446
[270]	train-auc:0.975294+0.00066686	test-auc:0.892616+0.00842676
[271]	train-auc:0.975436+0.000721942	test-auc:0.892657+0.00843918
[272]	train-auc:0.975575+0.000708615	test-auc:0.892622+0.00841796
[273]	train-auc:0.975717+0.000705867	test-auc:0.892664+0.00840106
[274]	train-auc:0.975901+0.000690303	test-auc:0.892689+0.00840826
[275]	train-auc:0.976079+0.000630773	test-auc:0.892591+0.00838097
[276]	train-auc:0.976275+0.00063996	test-auc:0.892615+0.00836929
[277]	train-auc:0.976425+0.000641166	test-auc:0.89264+0.00839487
[278]	train-auc:0.97658+0.000643692	test-auc:0.89264+0.00840323
[279]	train-auc:0.976699+0.00068269	test-auc:0.892687+0.0084274
[280]	train-auc:0.97681+0.000704768	test-auc:0.892662+0.00844418
[281]	train-auc:0.976963+0.00067372	test-auc:0.892656+0.00847922
[282]	train-auc:0.977063+0.000701606	test-auc:0.892608+0.00851185
[283]	train-auc:0.977195+0.00068829	test-auc:0.892607+0.00850083
[284]	train-auc:0.977276+0.000687733	test-auc:0.892607+0.00850641
[285]	train-auc:0.977436+0.000627868	test-auc:0.892633+0.00858339
[286]	train-auc:0.977528+0.000642833	test-auc:0.892598+0.00861011
[287]	train-auc:0.977666+0.000659727	test-auc:0.892587+0.00858336
[288]	train-auc:0.977771+0.000636126	test-auc:0.892583+0.00862277
[289]	train-auc:0.977929+0.000606055	test-auc:0.892559+0.00860138
[290]	train-auc:0.978052+0.000631095	test-auc:0.892476+0.0086017
Model Report
Accuracy : 0.8933
AUC Score (Train): 0.962860
In [22]:
# Save the trained model
#from sklearn.externals import joblib
#joblib.dump(alg,'imbalanced_alg.pkl')
#joblib.dump(alg,'balanced_undersampled_alg.pkl')
#alg = joblib.load('imbalanced_alg.pkl')
#alg = joblib.load('balanced_undersampled_alg.pkl')
In [ ]:
### Selection of max_depth and min_child_weight
param_test1 = {
 'max_depth': range(3,10,2),
 'min_child_weight': range(1,6,2)
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=385, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
 param_grid = param_test1, scoring='roc_auc', n_jobs=-1, iid=False, cv=5, verbose=10)

gsearch1.fit(X_sel, y_res)

Selecting the max_depth and min_child_weight parameters

This is the optimal selection found by cross-validation:

gsearch1.best_params_ = 
 {'max_depth': 5, 'min_child_weight': 1}
In [ ]:
param_test2 = {
 'gamma':[i/10.0 for i in range(0,5)]
}
gsearch2 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=385, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
 param_grid = param_test2, scoring='roc_auc', n_jobs=-1, iid=False, cv=5, verbose=10)
gsearch2.fit(X_sel, y_res)

Selecting the gamma parameter

This is the optimal selection found by cross-validation:

gsearch2.best_params_ = 
 {'gamma': 0.0}

Selecting the remaining parameters

The remaining parameters are selected in the same way. To keep this document short, the details are omitted here, but the final results are shown.

Final model

After the cross-validation runs we arrive at the final model shown below:

In [59]:
alg = XGBClassifier(
 learning_rate=0.1,
 n_estimators= 385,
 max_depth=5,  # maximum depth of the base trees (too much depth -> overfitting)
 min_child_weight=1,  # minimum sum of instance weights required in a child
 gamma=0,  # minimum loss reduction required to make a new split
 subsample=0.8,  # fraction of the samples used per tree, to limit overfitting
 colsample_bytree=0.8, # fraction of the columns used per tree
 objective= 'binary:logistic', # logistic objective for binary classification
 nthread=-1,  # use all available cores
 scale_pos_weight=1,  # 1: no extra weight on the positive class (the classes were balanced by resampling)
 seed=27)  # set a seed so that the results are reproducible

We split the (resampled) dataset into training and test sets and perform one last validation on the held-out part before applying the model to the test data. We do this to make sure that there is no overfitting and that our predictions are not biased by the class imbalance.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X_sel, y_res, test_size=0.4, random_state=0, stratify=y_res)
alg = alg.fit(X_train, y_train)
y_pred = alg.predict(X_test)
pred_prob = alg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, pred_prob)

plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='xgBoost')
plt.xlabel('False positive rate (`1 - specificity` or  `1-TNR`)')
plt.ylabel('True positive rate (sensitivity or recall)')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
#score = alg.score(X_test, y_test)
auc = roc_auc_score(y_test, pred_prob)
print("ROC-AUC = {}".format(auc))
ROC-AUC = 0.8902205725912139
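
As a reminder of what this number measures, ROC-AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data; `roc_auc_pairwise` is a hypothetical helper written for illustration, not the scikit-learn implementation used above.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n_pos * n_neg) version is for
    illustration only.
    """
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (not taken from the competition data)
y_toy = [0, 0, 1, 1]
score_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = roc_auc_pairwise(y_toy, score_toy)
print(auc_toy)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Read this way, an AUC of about 0.89 means that roughly 89% of positive/negative pairs in the hold-out split are ranked in the right order, regardless of the decision threshold.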
In [28]:
print(confusion_matrix(y_test, y_pred))
[[2341  651]
 [ 509 2482]]
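
Since the challenge scores both specificity and sensitivity, it is worth reading them directly off this matrix. The short calculation below uses only the four counts printed above; the variable names are our own.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
cm = [[2341, 651],
      [509, 2482]]
(tn, fp), (fn, tp) = cm

specificity = tn / (tn + fp)   # true negative rate (class 0)
sensitivity = tp / (tp + fn)   # true positive rate, or recall (class 1)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print("specificity = %.3f" % specificity)
print("sensitivity = %.3f" % sensitivity)
print("accuracy    = %.3f" % accuracy)
```

This gives a specificity of about 0.78 and a sensitivity of about 0.83 on the balanced hold-out split, with an overall accuracy of about 0.81.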
In [29]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    
    print(cm)
    
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, "%.2f" % cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.colorbar()    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

class_names = ["0", "1"]    
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure(figsize=(10,5))
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure(figsize=(10,5))
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()
Confusion matrix, without normalization
[[2341  651]
 [ 509 2482]]
Normalized confusion matrix
[[ 0.78  0.22]
 [ 0.17  0.83]]
C:\Anaconda3\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
In [42]:
filename = 'pm_test.txt'  # read the file with the test data

D = pd.read_csv(filename, delimiter="|", decimal=",", index_col = 0)

# Sort columns by name
D = D.reindex(columns=sorted(D.columns))  # reindex_axis is deprecated in newer pandas versions
In [49]:
# Predict on the test set
X_test_final = D.values
y_pred_final = alg.predict(X_test_final)
Counter(y_pred_final)  # count the number of samples in each class
Out[49]:
Counter({0: 154864, 1: 47653})
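
As a rough sanity check, the share of test records predicted as class 1 can be compared with what the hold-out rates would suggest. The calculation below uses only counts that appear in this notebook. It assumes, and this is only an assumption, that the test set has the same class prior as the original training data and that the hold-out sensitivity and specificity carry over.

```python
# Class counts in the original (imbalanced) training data
n_neg, n_pos = 464360, 7478
prior_pos = n_pos / (n_neg + n_pos)

# Rates measured on the balanced hold-out split
tpr = 2482 / (509 + 2482)   # sensitivity
fpr = 651 / (2341 + 651)    # 1 - specificity

# Assumption: the test set has the same class prior as the training data
expected_rate = prior_pos * tpr + (1 - prior_pos) * fpr

# Fraction of test records actually predicted as class 1
observed_rate = 47653 / (154864 + 47653)

print("expected: %.3f, observed: %.3f" % (expected_rate, observed_rate))
```

The expected share is about 0.227 and the observed one about 0.235. The two are close, which is consistent with the hold-out rates carrying over to the test set under the stated assumption.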

Generating the submission file

In [55]:
D_resp = pd.Series(y_pred_final, index = D.index, name="Respuesta").to_frame()
In [60]:
D_resp.to_csv("submission.txt", sep="|")