A A
[ML] DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN์€ ๋ฐ€๋„ ๊ธฐ๋ฐ˜์˜ ๊ตฐ์ง‘ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ, ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์˜ ๋ฐ€์ง‘๋œ ์˜์—ญ์„ ๊ตฐ์ง‘์œผ๋กœ ์‹๋ณ„ํ•˜๊ณ , ๋ฐ€๋„๊ฐ€ ๋‚ฎ์€ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋Š” ๋…ธ์ด์ฆˆ๋กœ ๊ฐ„์ฃผํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. DBSCAN์˜ ๋ชฉํ‘œ๋Š” ๋ฐ์ดํ„ฐ์˜ ๋ฐ€์ง‘ ์˜์—ญ์„ ์ฐพ์•„๋‚ด์–ด, ๊ตฐ์ง‘์˜ ํฌ๊ธฐ๋‚˜ ํ˜•ํƒœ์— ๊ตฌ์• ๋ฐ›์ง€ ์•Š๊ณ  ์œ ์—ฐํ•˜๊ฒŒ ๊ตฐ์ง‘ํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

https://mrinalyadav7.medium.com/dbscan-algorithm-c894701306d5


DBSCAN์˜ ํŠน์ง•

  • ๋ฐ€๋„ ๊ธฐ๋ฐ˜ ๊ตฐ์ง‘ํ™”: DBSCAN์€ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์˜ ๋ฐ€๋„๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ตฐ์ง‘์„ ํ˜•์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์ผ์ • ๋ฐ€๋„ ์ด์ƒ์˜ ์˜์—ญ์„ ํ•˜๋‚˜์˜ ๊ตฐ์ง‘์œผ๋กœ ๋ฌถ์Šต๋‹ˆ๋‹ค.
  • ๋…ธ์ด์ฆˆ ์ฒ˜๋ฆฌ: ๋ฐ€๋„๊ฐ€ ๋‚ฎ์€ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋Š” ๋…ธ์ด์ฆˆ๋กœ ๊ฐ„์ฃผ๋˜๋ฉฐ, ๊ตฐ์ง‘์—์„œ ์ œ์™ธ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๊ตฐ์ง‘ํ™” ๊ณผ์ •์—์„œ ๋…ธ์ด์ฆˆ๋‚˜ ์ด์ƒ์น˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์œ ์—ฐํ•œ ๊ตฐ์ง‘ ํ˜•์„ฑ: DBSCAN์€ ๊ตฐ์ง‘์˜ ํฌ๊ธฐ๋‚˜ ํ˜•ํƒœ์— ๊ตฌ์• ๋ฐ›์ง€ ์•Š๊ณ , ๋ฐ์ดํ„ฐ์˜ ๋ฐ€๋„์— ๋”ฐ๋ผ ์œ ์—ฐํ•˜๊ฒŒ ๊ตฐ์ง‘์„ ํ˜•์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ๋น„์ •ํ˜•์ ์ธ ํ˜•ํƒœ๋ฅผ ๊ฐ€์งˆ ๋•Œ ํŠนํžˆ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ๋น„๊ต์  ๊ฐ„๋‹จํ•œ ๊ตฌํ˜„: DBSCAN ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋น„๊ต์  ๊ฐ„๋‹จํ•˜๊ณ , ๊ตฌํ˜„์ด ์šฉ์ดํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

DBSCAN์˜ ๊ธฐ๋ณธ ์›๋ฆฌ

๋ฐ€๋„ ๊ธฐ์ค€ (Density Criteria)

  • DBSCAN์€ ๋‘ ๊ฐœ์˜ ์ฃผ์š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:
    • ฯต(epsilone): ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ๊ฐ„์˜ ์ตœ๋Œ€ ๊ฑฐ๋ฆฌ๋กœ, ์ด ๊ฑฐ๋ฆฌ ์ด๋‚ด์˜ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋“ค์„ ํ•˜๋‚˜์˜ ๊ตฐ์ง‘์œผ๋กœ ๋ฌถ์Šต๋‹ˆ๋‹ค.
    • ์ตœ์†Œ ํฌ์ธํŠธ ์ˆ˜ (MinPts): ๊ตฐ์ง‘์„ ํ˜•์„ฑํ•˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ ์ตœ์†Œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด๋Š” ํ•ด๋‹น ์˜์—ญ์ด ๊ตฐ์ง‘์œผ๋กœ ๊ฐ„์ฃผ๋˜๊ธฐ ์œ„ํ•œ ์ตœ์†Œ ๋ฐ€๋„๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํฌ์ธํŠธ (Core Point)

  • ๋ฐ˜๊ฒฝ ฯต(epsilone) ์ด๋‚ด์— ์ตœ์†Œ MinPts ์ด์ƒ์˜ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๊ฐ€ ์žˆ๋Š” ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ํ•ต์‹ฌ ํฌ์ธํŠธ๋กœ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.
  • ํ•ต์‹ฌ ํฌ์ธํŠธ๋Š” ๊ตฐ์ง‘์˜ ์ค‘์‹ฌ ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.

๊ฒฝ๊ณ„ ํฌ์ธํŠธ (Border Point)

  • ๋ฐ˜๊ฒฝ ฯต(epsilone) ์ด๋‚ด์— MinPts ์ด์ƒ์˜ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋Š” ์—†์ง€๋งŒ, ํ•ต์‹ฌ ํฌ์ธํŠธ์˜ ๋ฐ˜๊ฒฝ ฯต(epsilone) ๋‚ด์— ์œ„์น˜ํ•œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ๊ฒฝ๊ณ„ ํฌ์ธํŠธ๋กœ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. ๊ฒฝ๊ณ„ ํฌ์ธํŠธ๋Š” ๊ตฐ์ง‘์— ์†ํ•˜์ง€๋งŒ, ๊ตฐ์ง‘์„ ํ™•์žฅํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.

๋…ธ์ด์ฆˆ ํฌ์ธํŠธ (Noise Point)

  • ํ•ต์‹ฌ ํฌ์ธํŠธ์™€ ๊ฒฝ๊ณ„ ํฌ์ธํŠธ ๋ชจ๋‘์— ํ•ด๋‹นํ•˜์ง€ ์•Š๋Š” ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋Š” ๋…ธ์ด์ฆˆ ํฌ์ธํŠธ๋กœ ์ •์˜๋ฉ๋‹ˆ๋‹ค.
  • ์ด๋Ÿฌํ•œ ํฌ์ธํŠธ๋“ค์€ ๊ตฐ์ง‘์— ์†ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

DBSCAN์˜ ๊ตฐ์ง‘ ํ˜•์„ฑ๊ณผ์ •

์ดˆ๊ธฐํ™”

  • ๋ชจ๋“  ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ๋ฐฉ๋ฌธํ•˜์ง€ ์•Š์€ ์ƒํƒœ๋กœ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.

๊ตฐ์ง‘ ํ™•์žฅ

  • ๋ฐฉ๋ฌธํ•˜์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ์„ ํƒํ•˜์—ฌ, ํ•ด๋‹น ํฌ์ธํŠธ๊ฐ€ ํ•ต์‹ฌ ํฌ์ธํŠธ์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.
  • ๋งŒ์•ฝ ํ•ต์‹ฌ ํฌ์ธํŠธ๋ผ๋ฉด, ๋ฐ˜๊ฒฝ ฯต(epsilone) ๋‚ด์˜ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ํฌํ•จํ•˜์—ฌ ์ƒˆ๋กœ์šด ๊ตฐ์ง‘์„ ํ˜•์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฒฝ๊ณ„ ํฌ์ธํŠธ๋Š” ๊ธฐ์กด ๊ตฐ์ง‘์— ํฌํ•จ๋˜๋ฉฐ, ๋…ธ์ด์ฆˆ ํฌ์ธํŠธ๋Š” ๊ตฐ์ง‘์— ํฌํ•จ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

๋ฐ˜๋ณต

  • ๋ชจ๋“  ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๊ฐ€ ๋ฐฉ๋ฌธ๋  ๋•Œ๊นŒ์ง€ ๊ตฐ์ง‘ ํ™•์žฅ ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์ด ๋๋‚˜๋ฉด ๋ฐ์ดํ„ฐ์…‹ ์ „์ฒด์˜ ๊ตฐ์ง‘ํ™”๊ฐ€ ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค.

DBSCAN์˜ ์žฅ, ๋‹จ์ 

https://towardsdatascience.com/how-to-use-dbscan-effectively-ed212c02e62

DBSCAN์˜ ์žฅ์ 

  1. ๊ตฐ์ง‘ ์ˆ˜ ๋ฏธ๋ฆฌ ์ง€์ • ๋ถˆํ•„์š”: DBSCAN์€ ๊ตฐ์ง‘์˜ ์ˆ˜๋ฅผ ์‚ฌ์ „์— ์ง€์ •ํ•  ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๊ตฐ์ง‘์˜ ์ˆ˜๋Š” ๋ฐ์ดํ„ฐ์˜ ๋ฐ€๋„์— ๋”ฐ๋ผ ์ž๋™์œผ๋กœ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค.
  2. ๋…ธ์ด์ฆˆ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: DBSCAN์€ ๋…ธ์ด์ฆˆ ๋ฐ์ดํ„ฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜์—ฌ ๊ตฐ์ง‘์—์„œ ์ œ์™ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋‹ค๋ฅธ ๊ตฐ์ง‘ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋น„๊ตํ•ด ํฐ ์žฅ์ ์ž…๋‹ˆ๋‹ค.
  3. ๋‹ค์–‘ํ•œ ๊ตฐ์ง‘ ํ˜•ํƒœ: DBSCAN์€ ๋ฐ์ดํ„ฐ์˜ ๊ตฐ์ง‘ ํ˜•ํƒœ๊ฐ€ ๊ตฌํ˜•(spherical)์ด ์•„๋‹ˆ์–ด๋„ ์œ ์—ฐํ•˜๊ฒŒ ๊ตฐ์ง‘ํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋น„์ •ํ˜•์ ์ธ ๊ตฐ์ง‘ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ํšจ๊ณผ์ ์œผ๋กœ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.
  4. ํ™•์žฅ์„ฑ: DBSCAN์€ ๊ตฐ์ง‘์˜ ํฌ๊ธฐ๋‚˜ ๋ฐ€๋„์— ๊ด€๊ณ„์—†์ด ์œ ์—ฐํ•˜๊ฒŒ ์ ์šฉ ๊ฐ€๋Šฅํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

DBSCAN์˜ ๋‹จ์ 

  1. ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ • ์–ด๋ ค์›€: ฯต(epsilone)๊ณผ MinPts์˜ ๊ฐ’์„ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ’๋“ค์ด ์ž˜๋ชป ์„ค์ •๋˜๋ฉด ๊ตฐ์ง‘ํ™” ๊ฒฐ๊ณผ๊ฐ€ ๋ถ€์ •ํ™•ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  2. ๋ฐ€๋„ ๋ณ€ํ™”์— ๋ฏผ๊ฐ: ๋ฐ์ดํ„ฐ์˜ ๋ฐ€๋„๊ฐ€ ๊ท ์ผํ•˜์ง€ ์•Š์„ ๊ฒฝ์šฐ, DBSCAN์˜ ๊ตฐ์ง‘ํ™” ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ€๋„๊ฐ€ ํฌ๊ฒŒ ๋ณ€ํ•˜๋Š” ์˜์—ญ์—์„œ๋Š” ๊ตฐ์ง‘์ด ์ œ๋Œ€๋กœ ํ˜•์„ฑ๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ๊ณ ์ฐจ์› ๋ฐ์ดํ„ฐ์—์„œ์˜ ์„ฑ๋Šฅ ์ €ํ•˜: ๊ณ ์ฐจ์› ๋ฐ์ดํ„ฐ์—์„œ๋Š” ๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ์ด ์–ด๋ ค์›Œ ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ณ ์ฐจ์› ๊ณต๊ฐ„์—์„œ๋Š” ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ๊ฐ„์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ๋งค์šฐ ๋น„์Šทํ•ด์ง€๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์–ด, ๊ตฐ์ง‘์„ ์ž˜ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

DBSCAN Example Code

# ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ž„ํฌํŠธ
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Iris ๋ฐ์ดํ„ฐ์…‹ ๋กœ๋“œ
iris = load_iris()
X = iris.data
y = iris.target

# ๋ฐ์ดํ„ฐ ํ‘œ์ค€ํ™”
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# DBSCAN ๋ชจ๋ธ ์ƒ์„ฑ ๋ฐ ํ•™์Šต
dbscan = DBSCAN(eps=0.5, min_samples=2)
labels = dbscan.fit_predict(X_scaled)
# ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ฒฐ๊ณผ ์‹œ๊ฐํ™” (์ฒซ ๋ฒˆ์งธ ๋‘ ํŠน์ง• ๊ณต๊ฐ„ ์‚ฌ์šฉ)
plt.figure(figsize=(10, 7))
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]  # ๋…ธ์ด์ฆˆ๋Š” ๊ฒ€์ •์ƒ‰์œผ๋กœ ํ‘œ์‹œ

    class_member_mask = (labels == k)
    xy = X_scaled[class_member_mask]

    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=10)

plt.title('DBSCAN Clustering of Iris Dataset (eps=0.5, min_samples=2)')
plt.xlabel('Feature 1 (Standardized)')
plt.ylabel('Feature 2 (Standardized)')
plt.show()