Clustering(1)

visualization /Information Visualization

Clustering(1)

뚜둔뚜둔 2024. 12. 16. 14:14

Clustering Analysis

Given a set of data point, partition them into a set of groups (i.e. cluster)
Learning philosophy
- Unsupervised learning with unknown cluster labels
- cf. classification is supervised learning with known class labels
Quality
- A good clustering method will produce high-quality clusters
  - High intra-class similarity -> Cohesive within clusters
  - Low inter-class similarity -> distinctive between clusters
- There are usually seoarate quality measures for the 'goodness' of a cluster for different applications
  - it is hard to define "similar enough" or "good ebough" -> The answer is typically highly subjective

Clustering Considerations

Partitioning criteria
- Sigle level VS hierarchical partitioning
- e.g. flat concepts (gender) vs multi-level concepts (county - city - contry - continent)
Cluster separation
- Exclusive 배타적 VS non-exclusive 비배타적
- e.g. a customer belongs to only one region VS a news article may belong to multiple topics 고객이 한 지역에만 속하는 것과 뉴스 기사가 여러 주제에 속할 수 있는 것
Similarity measure
- Distance-based VS connectivity-based
- e.g. Euclidean, road network, vector VS density or contiguity 인접성
Clustering space
- Full space VS subspaces
- e.g. Low dimensional VS high-dimensional

Clustering > Distance-based > Partitioning

Partitioning method
- Discovering the groupings in a dataset based on the distances of data objects 데이터 객체의 거리를 기반으로 데이터 세트에서 그룹화 찾기
Problem definition
- Given k, find a partition of k clusters that optimizes the chosen distance-based objective function
- e.g. minimize the sum of squared errors between objects and centers in clusters,클러스터의 객체와 중심 간의 제곱 오차 합을 최소화, σ𝐶𝑘∈𝐶 σ𝑥𝑖 ∈𝐶𝑘 𝑑(𝑥𝑖, 𝑐𝑘)2
Approaches
- Global optimal needs to exhaustively enumerate all possible partitions 전역 최적은 모든 가능한 분할을 철저히 열거해야함
- Heuristic methods (i.e. greedy algorithms): k-Means, k-Medoids, k-Medians, etc

K-means Clustering

각 군집을 중심점Centroid으로 표현
군집 내 데이터와 중심점 간의 제곱 오차합(SSE)을 최소화
절차:
1. k개의 초기 줌심점 선택
2. 각 데이터를 가장 가까운 중심점의 군집에 할당
3. 군집의 중심점 재계산
4. 수렴할 때까지 2-3번 반복
한계점:
- 연속적인 공간의 데이터에만 적용 가능
- 군집 수 k를 사전에 지정해야 함
- 초기값에 따라 지역 최적해에 빠질 수 있음
- 농즈와 이상치에 민감
- 볼록하지 않은 형태의 군집 발견이 어려움

K-means++ Clustering

k-means의 개선된 버전으로, 더 나은 초기 중심점 선택 방법 제시
기존 중심점들과 멀리 떨어진 데이터를 다음 중심점으로 선택할 확률이 높아지도록 설계

K-Medoids Clustering

실제 데이터 포인트 중 군집의 중앙에 위치한 데이터를 대표점으로 사용
k-Means보다 이상치에 덜 민감하나 계산 복잡도가 높음

Clustering > Distance-based > Hierarchical

Hierarchical Clustering

Generate a clustering hierarchy, visualized as a dendrogarm 덴드로그램으로 시각화된 클러스터링 계층을 생성
Not required to specify the number of clusters (k)
- the desired number of clusters can be obtained by cutting the dendrogram at the proper level
More deterministic than partitioning approaches 분할 방식 보다 더 결정적
- No iterative refinement required

Two Approaches

Agglomerative(i.e. bottom-up): 상향식 개별 데이터에서 시작하여 점진적으로 병합
Divisive(i.e. top-down):하향식 전체 데이터에서 시작하여 점진적으로 분할

Cluster distance measures for split/merge

Single link : 두 군집 간 가장 가까운 데이터 간 거리
Complete link : 두 군집 간 가장 먼 데이터 간 거리
Average link : 모든 데이터 쌍 간 거리의 평균
Ward link : 군집 병합시 SSE 증가량

BIRCH(Balanced Iterative Reducing and Clustering Using Hierarchies)

특징
- 효육적인 다단계 계층적 클러스터링 방법
- 클러스터링 특징 (CF) 개념 도입으로 군집 요약
- CF-tree를 통한 계층 구조 표현
구성 요소
- 클러스터링 특징(CF): (N, LS, SS)
  - N: 데이터 포인트 수
  - LS: 선형 합
  - SS: 제곱 합
- CF-tree: CF를 저장하는 높이 균형 트리
장점
- 데이터를 컴팩트하게 저장
- 다양한 거리 통계량 계산 가능
- 증분적 군집 병합 지원

저작자표시 비영리 동일조건 (새창열림)

'visualization > Information Visualization' 카테고리의 다른 글

Clustering(2) (0)	2024.12.16
Classificaiton(2) (0)	2024.12.16
Classification(1) (0)	2024.12.16

현재글Clustering(1)

데이터 처리중입니다. 잠시만 기다려 주세요... AI_Developer

#nlp #자연어처리 #ml #ai #MLOps

Mlflow, docker, K8S, 강화학습, 파이썬, 알고리즘, error, ml, 딥러닝, MLOps, 쿠버네티스, Airflow, AI, cicd, 머신러닝, reinforcement, conda, Python, Kubernetes, DevOps,

Today :
Yesterday :

일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

데이터 처리중입니다. 잠시만 기다려 주세요... AI_Developer

Clustering(1)

Clustering Analysis

Clustering Considerations

Clustering > Distance-based > Partitioning

K-means Clustering

K-means++ Clustering

K-Medoids Clustering

Clustering > Distance-based > Hierarchical

Hierarchical Clustering

Cluster distance measures for split/merge

BIRCH(Balanced Iterative Reducing and Clustering Using Hierarchies)

'visualization > Information Visualization' 카테고리의 다른 글

'visualization /Information Visualization'의 다른글

티스토리툴바

Clustering(1)

Clustering Analysis

Clustering Considerations

Clustering > Distance-based > Partitioning

K-means Clustering

K-means++ Clustering

K-Medoids Clustering

Clustering > Distance-based > Hierarchical

Hierarchical Clustering

Cluster distance measures for split/merge

BIRCH(Balanced Iterative Reducing and Clustering Using Hierarchies)

'visualization > Information Visualization' 카테고리의 다른 글

'visualization /Information Visualization'의 다른글

관련글

티스토리툴바

Clustering > Distance-based > Hierarchical