An analysis of Titanic, effectively Kaggle's tutorial competition
https://www.kaggle.com/code/sunghwankang/titanic
- Best score: 0.79904 (Random Forest)
- Features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
- One-hot encoded the Pclass, Sex, and Embarked features
- Filled missing values in numerical columns with the median
After switching to XGBoost the score actually dropped to 0.74; tuning the hyperparameters to reduce overfitting brought it back up slightly to 0.76. I did not try a grid search for the optimal parameters.
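For reference, a grid search over the pipeline could look roughly like the sketch below (the parameter grid is hypothetical, and pipeline, X_train and y_train are the objects defined in the modeling section further down):
# hypothetical grid search sketch, not something actually run for this submission
from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__n_estimators": [300, 500, 800],
    "classifier__learning_rate": [0.01, 0.02, 0.05],
    "classifier__subsample": [0.8, 0.9, 1.0],
    "classifier__gamma": [0, 1, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)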
EDA
Correlation
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

hm_df = train_data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
hm_df['Sex'] = hm_df['Sex'].map({'male': 0, 'female': 1}).astype(int)
sns.heatmap(hm_df.corr(), cmap=plt.cm.RdBu, annot=True, linewidths=1)
Survived shows a high correlation with Sex, Pclass, and Fare.
Survival rate by column
fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(20, 5))
sns.barplot(data=train_data, ax=ax[0], y='Survived', x='Pclass')
sns.barplot(data=train_data, ax=ax[1], y='Survived', x='Sex')
sns.barplot(data=train_data, ax=ax[2], y='Survived', x='SibSp')
sns.barplot(data=train_data, ax=ax[3], y='Survived', x='Parch')
sns.barplot(data=train_data, ax=ax[4], y='Survived', x='Embarked')
The survival rate differs sharply by Sex.
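The same pattern can be checked numerically; a quick sketch:
# survival rate by Sex and by Pclass, as a numeric cross-check of the bar plots
print(train_data.groupby('Sex')['Survived'].mean())
print(train_data.groupby('Pclass')['Survived'].mean())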
Age distribution by Survived
lbl_survived = 'survived'
lbl_not_survived = 'not survived'
fig, ax = plt.subplots(figsize=(5, 3))
ax = sns.distplot(train_data[train_data.Survived == 1].Age.dropna(), ax=ax, bins=20, label=lbl_survived)
ax = sns.distplot(train_data[train_data.Survived == 0].Age.dropna(), ax=ax, bins=20, label=lbl_not_survived)
ax.legend()
ax.set_ylabel('KDE')
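Note: sns.distplot (used here and below) is deprecated in recent seaborn versions; on newer releases roughly the same plot can be drawn with histplot, e.g.:
# rough histplot equivalent of the distplot call above, for newer seaborn versions
sns.histplot(train_data[train_data.Survived == 1].Age.dropna(), bins=20, kde=True, stat='density', ax=ax, label=lbl_survived)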
females = train_data[train_data.Sex == 'female']
males = train_data[train_data.Sex == 'male']
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax = sns.distplot(females[females.Survived == 1].Age.dropna(), ax=axes[0], bins=30, kde=False, label=lbl_survived)
ax = sns.distplot(females[females.Survived == 0].Age.dropna(), ax=axes[0], bins=30, kde=False, label=lbl_not_survived)
ax.legend()
ax.set_title('Female')
ax = sns.distplot(males[males.Survived == 1].Age.dropna(), ax=axes[1], bins=30, kde=False, label=lbl_survived)
ax = sns.distplot(males[males.Survived == 0].Age.dropna(), ax=axes[1], bins=30, kde=False, label=lbl_not_survived)
ax.legend()
ax.set_title('Male')
Females have a high survival rate across all age groups.
Males have a higher survival rate only at very young ages (around 2-3) and near age 80.
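A rough numeric check of this observation (the age bins below are arbitrary, chosen only for illustration):
# survival rate by age band and sex; bin edges are illustrative, not from the original analysis
age_band = pd.cut(train_data['Age'], bins=[0, 5, 15, 30, 50, 80])
print(train_data.groupby([age_band, 'Sex'])['Survived'].mean().unstack())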
Model setup (pipeline)
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
onehot_encoder = OneHotEncoder(sparse_output=False)
median_imputer = SimpleImputer(strategy="median")

# one-hot encode the categorical columns and median-impute the numerical ones;
# columns not listed are dropped (ColumnTransformer's default remainder='drop')
column_transformer = ColumnTransformer(
    [
        ("categorical", onehot_encoder, ["Pclass", "Sex", "Embarked"]),
        ("numerical", median_imputer, ["Age", "SibSp", "Parch", "Fare"]),
    ],
    verbose_feature_names_out=False,
)

pipeline = Pipeline(
    [
        ("preprocessing", column_transformer),
        ("classifier", XGBClassifier(n_estimators=500, learning_rate=0.02, subsample=0.9, gamma=1)),
    ]
)
pipeline.fit(X_train, y_train)
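X_train and y_train are not shown above; they are assumed to be the feature columns and the Survived column of train_data. The submission is also assumed to have been produced along these lines (test_data being the loaded test.csv; PassengerId and Survived are the competition's submission columns):
# assumed setup and submission step, not shown in the original code
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X_train, y_train = train_data[features], train_data["Survived"]

# the ColumnTransformer selects its columns by name, so the full test frame can be passed
predictions = pipeline.predict(test_data)
submission = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": predictions})
submission.to_csv("submission.csv", index=False)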
Feature Importance (XGBoost)
Gain
# XGBoost feature importances (https://mljar.com/blog/feature-importance-xgboost/)
# the sklearn wrapper's feature_importances_ uses importance_type, which defaults to 'gain' for tree boosters
feature_names = pipeline[:-1].get_feature_names_out()
mdi_importances = pd.Series(pipeline[-1].feature_importances_, index=feature_names).sort_values(ascending=True)
ax = mdi_importances.plot.barh()
ax.set_title("XGBoost Feature Importances (gain)")
ax.figure.tight_layout()
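feature_importances_ exposes only one importance type; the booster can report the others directly (a quick sketch; keys are f0, f1, ... unless feature names are set on the booster):
# compare other XGBoost importance types: 'weight' = number of splits, 'gain', 'cover', ...
booster = pipeline[-1].get_booster()
for imp_type in ["weight", "gain", "cover"]:
    print(imp_type, booster.get_score(importance_type=imp_type))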
Permutation Importance
# permutation importance
from sklearn.inspection import permutation_importance
result = permutation_importance(pipeline, X_train, y_train, n_repeats=10, random_state=1, n_jobs=2)
sorted_importances_idx = result.importances_mean.argsort()
importances = pd.DataFrame(result.importances[sorted_importances_idx].T, columns=X_train.columns[sorted_importances_idx])
ax = importances.plot.box(vert=False, whis=10)
ax.set_title("Permutation Importances")
ax.axvline(x=0, color="k", linestyle="--")
ax.set_xlabel("Decrease in accuracy score")
ax.figure.tight_layout()
SHAP
import shap
shap.initjs()
explainer = shap.Explainer(pipeline[-1], feature_names=feature_names)
observations = pipeline[:-1].transform(X_train)
shap_values = explainer(observations)
# summarize the effects of all the features
print("""\
To get an overview of which features are most important for a model we can plot the SHAP values of every feature for every sample.
The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output.
The color represents the feature value (red high, blue low).""")
shap.plots.beeswarm(shap_values)
# bar plot
print("We can also just take the mean absolute value of the SHAP values for each feature to get a standard bar plot")
shap.plots.bar(shap_values)
# create a dependence scatter plot to show the effect of a single feature across the whole dataset
print("""\
To understand how a single feature affects the output of the model we can plot the SHAP value of that feature vs. the value of the feature for all the examples in a dataset.
Since SHAP values represent a feature's responsibility for a change in the model output, the plot below represents the change in predicted value as Age changes.
Vertical dispersion at a single value of Age represents interaction effects with other features. To help reveal these interactions we can color by another feature.
If we pass the whole explanation tensor to the color argument the scatter plot will pick the best feature to color by.""")
shap.plots.scatter(shap_values[:,"Age"], color=shap_values)
# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[0])
print("""\
The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed)
to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue.
Another way to visualize the same explanation is to use a force plot.""")
# visualize the first prediction's explanation with a force plot
shap.plots.force(shap_values[0], matplotlib=True)
# visualize all the training set predictions
print("If we take many force plot explanations such as the one shown above, rotate them 90 degrees, and then stack them horizontally, we can see explanations for an entire dataset")
shap.plots.force(shap_values)
Inspecting the trees (XGBoost)
Tree #1 of the XGBClassifier
from xgboost import plot_tree  # rendering trees requires the graphviz package

xgb = pipeline[-1]
xgb.get_booster().feature_names = list(feature_names)  # show real feature names instead of f0, f1, ...
fig, ax = plt.subplots(figsize=(30, 30))
plot_tree(xgb, num_trees=1, rankdir='LR', ax=ax)
plt.show()
dump_list = xgb.get_booster().get_dump()
print("number of trees:", len(dump_list))