Visualizing multiclass probabilities with DecisionBoundaryDisplay


깨비아빠0 2023. 11. 16. 14:46


Figure: multiclass decision_function and predict_proba results, visualized

To visualize a model's predicted probabilities, matplotlib's pcolormesh function can render the predict_proba (or decision_function) output over a two-dimensional feature space as a gradient, as shown below. (source code)
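The manual approach described above can be sketched roughly as follows. This is a minimal sketch, not the code behind the linked example: the two-class iris subset, the grid margins, and the 100x100 resolution are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
mask = iris.target > 0                      # keep two classes only (binary problem)
X, y = iris.data[mask, :2], iris.target[mask]
clf = SVC(C=100, gamma="scale").fit(X, y)

# Evaluate the model on a regular grid that covers the data range.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100),
)
grid = np.c_[xx.ravel(), yy.ravel()]
scores = clf.decision_function(grid).reshape(xx.shape)  # one score per grid point

# Draw the scores as a gradient and overlay the training points.
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xx, yy, scores, cmap=plt.cm.RdBu, shading="auto")
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu, edgecolors="k")
fig.colorbar(mesh, ax=ax)
```

For a binary problem the decision_function output is a single value per sample, which is why it maps directly onto a one-dimensional colormap.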

Figure: SVM decision_function results, visualized (https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html#visualization)

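scikit-learn also provides the DecisionBoundaryDisplay class, which makes this task easy.

However, when a classifier with three or more target classes is passed to DecisionBoundaryDisplay, only the predict result can be visualized, as in the left chart below, and the smooth change in probability is not visible. (With a binary classifier, as in the right chart, the smooth change in probability can be seen.)

Figure: multiclass classifiers can only use predict (left); binary classifiers can visualize probabilities (right)

So I implemented a way to visualize the probability changes of multiclass classifiers as well.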

Using the existing DecisionBoundaryDisplay

The DecisionBoundaryDisplay.from_estimator function makes it easy to visualize a model's predictions.

classmethod from_estimator(estimator, X, *, grid_resolution=100, eps=1.0, plot_method='contourf', response_method='auto', xlabel=None, ylabel=None, ax=None, **kwargs)

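However, when a multiclass classifier is passed as the estimator argument, response_method is fixed to predict. (source code)

In other words, the visualization shows only the predicted class, not the gradual change in probability.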

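Multiclass probability visualization: issues and POC

Interpolating three or more colors

The most likely reason multiclass probability visualization is not supported is that there is no general way to blend gradients of three or more colors so that they remain clearly distinguishable. For example, if the three target classes are drawn in red, blue, and purple, color A mixed from "red 0%, blue 0%, purple 80%" can be identical to color B mixed from "red 80%, blue 80%, purple 0%".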
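The ambiguity can be checked numerically. The sketch below assumes simple additive mixing in RGB and picks pure red, blue, and magenta-like purple as the class colors; both the mixing rule and the exact color values are assumptions for illustration.

```python
import numpy as np

# Illustrative RGB values for the three class colors (assumed, not taken from any library).
red = np.array([1.0, 0.0, 0.0])
blue = np.array([0.0, 0.0, 1.0])
purple = np.array([1.0, 0.0, 1.0])
class_colors = np.stack([red, blue, purple])   # shape (3 classes, 3 channels)

def mix(weights):
    """Additively mix the class colors with one weight per class, clipped to [0, 1]."""
    return np.clip(np.asarray(weights) @ class_colors, 0.0, 1.0)

color_a = mix([0.0, 0.0, 0.8])   # purple only, at 80%
color_b = mix([0.8, 0.8, 0.0])   # red and blue at 80% each, no purple

print(color_a, color_b)
```

Both mixes land on the same RGB triple, so a viewer cannot tell a confident "purple" region from a region split between "red" and "blue".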

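With exactly three targets, one option is to assign one RGB channel to each class. That does not work for four or more targets, though, so a more general policy is needed. Since a fundamental solution seems hard to find, I settled on the following workarounds.

Show only the color gradient of the predicted class (= the argmax class)
Add an option to show only the color of a specific target class

Figure, from left: 1. only the predicted class's color, 2. all class probabilities mixed, 3-5. only a specific class's probability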

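The key question is whether pcolormesh can be given RGB colors directly. matplotlib normally samples colors through a colormap, and because a colormap is one-dimensional, it cannot easily show gradients of three or more colors at once. Fortunately, since matplotlib 3.7, pcolormesh accepts RGB (or RGBA) colors directly.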
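A minimal sketch of passing RGB values straight to pcolormesh, using the one-channel-per-class idea for three classes. It assumes Matplotlib 3.7 or later, and the "probabilities" are synthetic (a softmax over three made-up score fields), not the output of a real model.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic three-class "probabilities" on a 30 x 40 grid (illustration only).
xx, yy = np.meshgrid(np.linspace(-2, 2, 40), np.linspace(-2, 2, 30))
scores = np.stack(
    [-(xx + 1) ** 2 - yy**2, -(xx - 1) ** 2 - yy**2, -(xx**2) - (yy - 1) ** 2],
    axis=-1,
)
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
proba = exp / exp.sum(axis=-1, keepdims=True)   # shape (30, 40, 3), sums to 1 per point

# One RGB channel per class: the last axis is read as a color, not mapped via a colormap.
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xx, yy, proba, shading="nearest")
```

Because the array already holds colors, arguments such as cmap no longer affect the mesh.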

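Computing probabilities

The predicted probabilities for observations X come from the predict_proba function. For a multiclass model it returns a probability for every class, so nothing stands in the way of visualizing them.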
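For instance, with a three-class model the output has one column per class. The sketch below uses LogisticRegression on the first two iris features; the choice of model is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                  # first two features only
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)                  # shape (n_samples, n_classes)
print(proba.shape)
print(clf.classes_)                           # column order of proba
```

Each row sums to 1, and the columns follow the order of clf.classes_.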

Some models, however, do not expose predict_proba. (Example: SVC has no predict_proba unless it is created with probability=True.)

In that case, the CalibratedClassifierCV class can be used to obtain predicted probabilities through cross-validation.

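from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# X, y: training features and labels
clf = SVC(C=100, gamma="scale")
clf.fit(X, y)
# predict_proba cannot be called on this SVC (probability=False)

clf_calibrated = CalibratedClassifierCV(clf, cv=5)
clf_calibrated.fit(X, y)
# Once CalibratedClassifierCV is fitted, predict_proba is available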

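Classifiers use decision_function to make predictions, and its output is not a probability but a confidence score. Since the goal is only to see how confidence in the prediction changes, an actual probability is not strictly required, so using decision_function is a reasonable choice. It also avoids the time spent fitting CalibratedClassifierCV just to get probabilities. The decision_function output ranges over (-∞, ∞), however, so it has to be normalized to [0, 1]. I used a sigmoid function for the normalization, with an approach similar to Platt scaling to make the gradient easier to distinguish. (There is still a lot of room for improvement.)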
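To see what these raw scores look like for a multiclass model, here is a small sketch with the same SVC settings on the first two iris features. It relies on SVC's default decision_function_shape="ovr", which yields one score column per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                  # first two features only
clf = SVC(C=100, gamma="scale").fit(X, y)

scores = clf.decision_function(X)             # one column per class
print(scores.shape)
print(scores.min(), scores.max())             # not confined to [0, 1]
```

Unlike predict_proba output, these rows are not constrained to sum to 1, which is why a normalization step is needed before mapping them to colors.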

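Multiclass probability visualization

Implementation

Based on scikit-learn's DecisionBoundaryDisplay.from_estimator function, I wrote a decision_boundary_display_from_estimator function that supports multiclass probability visualization. The main changes are as follows.

Allow decision_function and predict_proba for multiclass models (when plot_method is pcolormesh)
Add the decision_normalize_method argument (how decision_function output is normalized)
Add the multiclass_mix_rest argument (whether to also show the colors of classes other than the predicted class)
Add the multiclass_target_index argument (show only the output for a specific multiclass target)
Change the default of the plot_method argument from contourf to pcolormesh

The original code raised an exception for three or more classes whenever response_method was anything other than auto or predict; the check now lets that case through when plot_method is pcolormesh. (source code)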

    if has_classes and len(estimator.classes_) > 2 and plot_method != "pcolormesh":
        if response_method not in {"auto", "predict"}:
            msg = (
                "Multiclass classifiers are only supported when response_method is"
                " 'predict' or 'auto'"
            )
            raise ValueError(msg)
        methods_list = ["predict"]

When decision_function is used as the response_method for a multiclass model, the response is normalized to [0, 1] as follows.

# Normalize the scores to [0, 1]
# TODO: find a suitable sigmoid coefficient (so values separate as clearly as with CalibratedClassifierCV predict_proba)
rmax = response.max()
mean = np.mean(response)
response = 1 / (1 + np.exp(-(response - mean) * 8 / (rmax - mean)))
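The behavior of this normalization is easier to see on a few numbers. The sketch below wraps the same formula in a function; the name sigmoid_normalize and the steepness parameter (standing in for the constant 8) are hypothetical, and the input scores are made up.

```python
import numpy as np

def sigmoid_normalize(response, steepness=8.0):
    """Squash raw scores into (0, 1) with a sigmoid centered on the mean score.

    The function name and `steepness` parameter are hypothetical; `steepness`
    plays the role of the constant 8 in the snippet above.
    """
    response = np.asarray(response, dtype=float)
    mean = response.mean()
    rmax = response.max()
    # The mean maps to 0.5 and the maximum maps to sigmoid(steepness).
    return 1.0 / (1.0 + np.exp(-(response - mean) * steepness / (rmax - mean)))

scores = np.array([-2.0, 0.0, 2.0])   # made-up scores with mean 0 and max 2
normalized = sigmoid_normalize(scores)
print(normalized)
```

The mean score lands exactly on 0.5 and the largest score on sigmoid(8), just below 1. Note that the formula divides by (rmax - mean), so it breaks down when all scores are identical.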

The code that turns probabilities or decision_function scores into colors is shown below.

The interpolation differs depending on the multiclass_mix_rest (mix all colors) and multiclass_target_index (show one target only) options.

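# Note: mpl.cm.get_cmap was deprecated in Matplotlib 3.7 and removed in 3.9;
# on newer versions, matplotlib.pyplot.get_cmap accepts the same argument.
cmap = mpl.cm.get_cmap(kwargs.get("cmap", None))
n_classes = len(estimator.classes_)
# One RGBA color per class, sampled evenly across the colormap
class_colors = [np.array(cmap(c / (n_classes - 1))) for c in range(n_classes)]

def get_color(proba):
    if multiclass_mix_rest:
        # Blend all class colors, each weighted by its probability
        colors = np.array([class_colors[c] * p for c, p in enumerate(proba)])
        color = np.sum(colors, axis=0)
    elif multiclass_target_index is not None:
        # Blend the color of one specific class with white
        c = multiclass_target_index
        w = proba[c]
        color = class_colors[c] * w + (1 - w)
    else:
        # Blend the color of the predicted class with white;
        # w rescales its probability from [1/n_classes, 1] to [0, 1]
        min_p = 1 / n_classes
        c = proba.argmax()
        w = (proba[c] - min_p) / (1 - min_p)
        color = class_colors[c] * w + (1 - w)

    return np.clip(color, 0, 1)

# Map each per-class vector on the last axis to an RGBA color
response = np.apply_along_axis(get_color, -1, response)
response_shape += (4,)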
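A worked example makes the two main blending rules concrete. This sketch reuses the formulas above on a single probability vector; the pure red, green, and blue class colors are hypothetical stand-ins for colors sampled from a colormap.

```python
import numpy as np

# Hypothetical RGBA class colors (red, green, blue); the code above samples them from a colormap.
class_colors = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
proba = np.array([0.6, 0.3, 0.1])
n_classes = len(proba)

# Predicted-class mode: rescale p from [1/n, 1] to [0, 1], then blend with white.
min_p = 1 / n_classes
c = proba.argmax()
w = (proba[c] - min_p) / (1 - min_p)          # (0.6 - 1/3) / (2/3) = 0.4
predicted_color = class_colors[c] * w + (1 - w)

# Mix mode: probability-weighted sum of all class colors.
mixed_color = proba @ class_colors

print(predicted_color)
print(mixed_color)
```

With a probability of 0.6 the predicted class gets a weight of only 0.4, so the result is a pale red: a probability of 1/n_classes maps to white and a probability of 1 maps to the full class color.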

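With this function, multiclass probabilities can be visualized as shown below. (The example code is included at the end.)

Package release

I published the implementation above as a package named kebiml. (For the release process, see https://freeislet.tistory.com/31)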

kebiml: "A simple private ml utilities" (pypi.org)

Installation:

pip install --upgrade kebiml

Usage:

import matplotlib.pyplot as plt

from kebiml.visutil.decision_boundary import decision_boundary_display_from_estimator

...

decision_boundary_display_from_estimator(estimator, X, ax=ax, cmap=plt.cm.rainbow_r)

Multiclass probability visualization example

The full code for the visualization test and its result are shown below.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.inspection import DecisionBoundaryDisplay

from kebiml.visutil.decision_boundary import decision_boundary_display_from_estimator


# Plot helper functions
cmap = plt.cm.rainbow_r

def scatter(X, y, ax, cmap=cmap):
    points = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors="k")

    # The 2-class subset (labels 0/1) corresponds to the last two iris classes
    labels = iris.target_names if y.max() > 1 else iris.target_names[-2:]
    ax.legend(points.legend_elements()[0], labels, loc="upper left")

def dbd(
    estimator,
    ax,
    title,
    plot_method="pcolormesh",
    response_method="auto",
    decision_normalize_method="sigmoid",
    multiclass_mix_rest=False,
    multiclass_target_index=None,
):
    ax.set_title(title)
    
    decision_boundary_display_from_estimator(
        estimator, X, ax=ax, cmap=cmap,
        plot_method=plot_method,
        response_method=response_method,
        decision_normalize_method=decision_normalize_method,
        multiclass_mix_rest=multiclass_mix_rest,
        multiclass_target_index=multiclass_target_index,
    )

    scatter(X, y, ax)


# import some data to play with
iris = load_iris()
X = iris.data[:, :2]  # we only take the first two features. ['sepal length (cm)', 'sepal width (cm)']
y = iris.target

# Build a 2-class subset (drop class 0, relabel the rest as 0/1)
X_2cls = X[y > 0]
y_2cls = y[y > 0]
y_2cls -= 1

# Train the models
clf_2cls = SVC(C=100, gamma="scale")
clf_2cls.fit(X_2cls, y_2cls)

clf = SVC(C=100, gamma="scale")
clf.fit(X, y)

clf_calibrated = CalibratedClassifierCV(clf, cv=5)
clf_calibrated.fit(X, y)

# Compare decision boundary displays
fig, axs = plt.subplots(3, 5, figsize=(20, 12))

# - Original DecisionBoundaryDisplay
axs[0, 0].set_title("org DecisionBoundaryDisplay (predict only)")
DecisionBoundaryDisplay.from_estimator(clf, X, plot_method="pcolormesh", ax=axs[0, 0], cmap=cmap)
scatter(X, y, axs[0, 0])

# - Original DecisionBoundaryDisplay w/ 2 classes
axs[0, 1].set_title("org with 2-classes")
DecisionBoundaryDisplay.from_estimator(clf_2cls, X_2cls, plot_method="pcolormesh", ax=axs[0, 1], cmap=plt.cm.RdBu)
scatter(X_2cls, y_2cls, axs[0, 1], cmap=plt.cm.RdBu)

# - Multiclass decision_function scores
dbd(clf, axs[1, 0], "3-classes decision_function")
# dbd(clf, axs[1, 0], "3-classes decision_function", decision_normalize_method="minmax")
dbd(clf, axs[1, 1], "(mix all classes)", multiclass_mix_rest=True)  # mix in the colors of the other classes along with the predicted class
dbd(clf, axs[1, 2], "(class 0: setosa)", multiclass_target_index=0)  # class 0 score
dbd(clf, axs[1, 3], "(class 1: versicolor)", multiclass_target_index=1)  # class 1 score
dbd(clf, axs[1, 4], "(class 2: virginica)", multiclass_target_index=2)  # class 2 score

# - Multiclass CalibratedClassifierCV.predict_proba probabilities
dbd(clf_calibrated, axs[2, 0], "CalibratedClassifierCV.predict_proba", response_method="predict_proba")
dbd(clf_calibrated, axs[2, 1], "(mix all classes)", response_method="predict_proba", multiclass_mix_rest=True)
dbd(clf_calibrated, axs[2, 2], "(class 0: setosa)", response_method="predict_proba", multiclass_target_index=0)
dbd(clf_calibrated, axs[2, 3], "(class 1: versicolor)", response_method="predict_proba", multiclass_target_index=1)
dbd(clf_calibrated, axs[2, 4], "(class 2: virginica)", response_method="predict_proba", multiclass_target_index=2)

for ax in axs.flat:
    if not ax.collections:
        ax.set_axis_off()

plt.show()
