Resolving Data Imbalance with SMOTE

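Before training a machine learning model on a dataset, class imbalance needs to be addressed. A model trained on heavily skewed data tends to overfit to the majority class and predict the minority class poorly. Rebalancing the training data with SMOTE is a preprocessing step aimed at improving minority-class prediction performance.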

The datasets and code below were used to load each dataset, preprocess it, resolve the class imbalance, and then train and evaluate a model.

Datasets used: KDD99, NSL-KDD, UNSW-NB15

Code based on Python + Pandas + scikit-learn + imbalanced-learn

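1. What is SMOTE?

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used oversampling technique for handling class imbalance.

Rather than simply duplicating existing minority-class samples, it creates new synthetic samples between them. For each minority-class sample, it finds the k nearest neighbors (k-NN) within the minority class and picks one of them at random. A new data point is then generated at a random position (a random value between 0 and 1) on the line segment connecting the original sample and the chosen neighbor.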
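
The interpolation step described above can be sketched in plain NumPy. This is an illustrative sketch only, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy points are assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # a sample cannot have more neighbors than other samples exist

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)   # base sample for each synthetic point
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbors to use
    chosen = neighbors[base, pick]
    lam = rng.random((n_new, 1))            # position on the segment, in [0, 1)

    # New point = base + lam * (neighbor - base)
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Hypothetical minority samples: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies on the segment between them, which is why the generated points stay inside the unit square here.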

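Apply SMOTE to training data only: applying it to the test set causes data leakage.
Sensitive to high-dimensional data: in high dimensions, interpolated samples may fall outside the real data distribution.
Class overlap: SMOTE interpolation can blur class boundaries, so hybrid methods such as SMOTE + TomekLinks or SMOTEENN are worth considering.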

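2. Load the dataset → preprocess → resolve class imbalance (SMOTE)

Class imbalance ratio = (number of majority-class samples / number of minority-class samples)
The closer this value is to 1, the better balanced the data.
The larger the value, the more severe the imbalance.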
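
A quick worked example of this ratio, using hypothetical label counts chosen only to make the arithmetic easy to follow:

```python
from collections import Counter

# Hypothetical label counts (assumption): 900 attack records, 100 normal records
labels = ["attack"] * 900 + ["normal"] * 100
counts = Counter(labels)

majority = max(counts.values())  # 900
minority = min(counts.values())  # 100
ratio = round(majority / minority, 2)
print(f"{ratio} : 1")  # 900 / 100 = 9.0, so this prints "9.0 : 1"
```

After a resampler has equalized the two classes, the same calculation gives 1.0.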

>KDD99 dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from collections import Counter

# ⚙ Helper that computes the class imbalance ratio (majority count / minority count)
def imbalance_ratio(counter):
    classes = counter.most_common()
    majority = classes[0][1]
    minority = classes[-1][1]
    return round(majority / minority, 2)

# 1. Load the data
df = pd.read_csv("Class_Imbalance_Sampling/KDD99/kddcup99_csv.csv", header=0, low_memory=False)

# 2. Collapse labels for binary classification (normal vs. attack)
df['label'] = df['label'].apply(lambda x: 'normal' if x == 'normal' else 'attack')

# 3. Separate inputs (X) and target (y)
X = df.drop(columns=['label'])
y = df['label']

# 4. Categorical features → one-hot encoding
X = pd.get_dummies(X, columns=['protocol_type', 'service', 'flag'])

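# 5. Train/test split (done before scaling so the scaler never sees test data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 6. Scaling: fit on the training split only, then apply to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)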

# 7. Print the class distribution and ratio before SMOTE
before_counts = Counter(y_train)
print("📊 Class distribution before SMOTE:", before_counts)
print(f"⚠ Imbalance ratio (before SMOTE): {imbalance_ratio(before_counts)} : 1")

# 8. Apply SMOTE (training split only)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# 9. Print the class distribution and ratio after SMOTE
after_counts = Counter(y_resampled)
print("🚀 Class distribution after SMOTE:", after_counts)
print(f"✅ Imbalance ratio (after SMOTE): {imbalance_ratio(after_counts)} : 1")

>NSL-KDD dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from collections import Counter

# ⚙ Helper that computes the class imbalance ratio (majority count / minority count)
def imbalance_ratio(counter):
    classes = counter.most_common()
    majority = classes[0][1]
    minority = classes[-1][1]
    return round(majority / minority, 2)

# 1. Load the NSL-KDD data (the file has no header row)
df = pd.read_csv("Class_Imbalance_Sampling/NSL-KDD/KDDTrain+.txt", header=None)

# 2. Assign column names (43 in total)
col_names = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
    "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root", "num_file_creations",
    "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
    "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate",
    "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
    "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "label", "label_index"
]
df.columns = col_names

# 3. Binarize the label (normal vs. attack)
df['label'] = df['label'].apply(lambda x: 'normal' if x == 'normal' else 'attack')

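# 4. Separate inputs (X) and target (y)
# The last column ('label_index') is the NSL-KDD difficulty score, not a traffic
# feature, so it is dropped from the inputs along with the label.
X = df.drop(columns=['label', 'label_index'])
y = df['label']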

# 5. One-hot encode the categorical features
X = pd.get_dummies(X, columns=['protocol_type', 'service', 'flag'])

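# 6. Train/test split (done before scaling so the scaler never sees test data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 7. Scaling: fit on the training split only, then apply to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)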

# 8. Print the class distribution and ratio before SMOTE
before_counts = Counter(y_train)
print("📊 Class distribution before SMOTE:", before_counts)
print(f"⚠ Imbalance ratio (before SMOTE): {imbalance_ratio(before_counts)} : 1")

# 9. Apply SMOTE (training split only)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# 10. Print the class distribution and ratio after SMOTE
after_counts = Counter(y_resampled)
print("🚀 Class distribution after SMOTE:", after_counts)
print(f"✅ Imbalance ratio (after SMOTE): {imbalance_ratio(after_counts)} : 1")

>UNSW-NB15 dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from collections import Counter
import numpy as np

# ⚙ Helper that computes the class imbalance ratio (majority count / minority count)
def imbalance_ratio(counter):
    classes = counter.most_common()
    majority = classes[0][1]
    minority = classes[-1][1]
    return round(majority / minority, 2)

# 1. Load the parquet file
df = pd.read_parquet("Class_Imbalance_Sampling/UNSW-NB15/UNSW_NB15_training-set.parquet")

# 2. Handle missing and infinite values
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)

# 3. Separate inputs (X) and target (y)
X = df.drop(columns=['label', 'attack_cat'])  # attack_cat is for multi-class use, so it is excluded
y = df['label']  # 0: normal, 1: attack

# 4. Encode categorical features (only non-numeric columns are affected)
X = pd.get_dummies(X)

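# 5. Train/test split (done before scaling so the scaler never sees test data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 6. Scaling: fit on the training split only, then apply to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)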
# 7. Print the class distribution and ratio before SMOTE
before_counts = Counter(y_train)
print("📊 Class distribution before SMOTE:", before_counts)
print(f"⚠ Imbalance ratio (before SMOTE): {imbalance_ratio(before_counts)} : 1")

# 8. Apply SMOTE (training split only)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# 9. Print the class distribution and ratio after SMOTE
after_counts = Counter(y_resampled)
print("🚀 Class distribution after SMOTE:", after_counts)
print(f"✅ Imbalance ratio (after SMOTE): {imbalance_ratio(after_counts)} : 1")
