6. Data Mining - Association Analysis (2)


White Whale 2017. 12. 16. 18:46


1. Rule Generation from frequent itemset

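1. When the frequent itemset is {A,B,C,D}, each candidate rule splits it into a non-empty antecedent and a non-empty consequent (for example, ABC->D or AB->CD).

▪ If the frequent itemset contains k items, a total of 2^k - 2 candidate rules can be generated.

▪ The -2 accounts for the two excluded cases: the antecedent may be neither the empty set nor the full itemset.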
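The 2^k - 2 count can be checked by brute force. The following is a minimal Python sketch; the function name `candidate_rules` is made up for this illustration and does not come from any library.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Return every (antecedent, consequent) split with both sides non-empty."""
    items = sorted(itemset)
    rules = []
    # Antecedent sizes 1 .. k-1 skip the empty set and the full itemset.
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


rules = candidate_rules({"A", "B", "C", "D"})
for antecedent, consequent in rules:
    print("".join(antecedent), "->", "".join(consequent))
print(len(rules))  # 2**4 - 2 = 14
```

Enumerating all splits like this is exponential in k, which is why rule generation relies on confidence-based pruning rather than brute force.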

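2. Confidence does not have the anti-monotone property in general.

▪ This makes it hard to apply the Apriori property directly
- c(ABC->D) can be larger or smaller than c(AB->D)

▪ For rules generated from the same itemset, however, the anti-monotone property does hold with respect to the number of items in the consequent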

- c(ABC->D) >= c(AB->CD) >= c(A->BCD)
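The chain holds because all three rules share the numerator support(ABCD), while the support of the antecedent can only grow as items move from the antecedent to the consequent. Below is a small Python sketch of that check. The transaction list is hypothetical data invented for this illustration, and the helper names are assumptions, not part of the original post.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = support(X u Y) / support(X)."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)


# Hypothetical transactions, chosen only to illustrate the inequality.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "D"},
    {"B", "C"},
]

c1 = confidence({"A", "B", "C"}, {"D"}, transactions)  # 2 / 3
c2 = confidence({"A", "B"}, {"C", "D"}, transactions)  # 2 / 4
c3 = confidence({"A"}, {"B", "C", "D"}, transactions)  # 2 / 5
print(c1, c2, c3)
assert c1 >= c2 >= c3
```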

3. A candidate rule is generated by merging two rules

▪ Two candidate rules are merged to produce a new rule, but the confidence of the new rule is never higher than the confidence of either original rule.
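One common way to do the merge, sketched here as an assumption rather than as the post's exact procedure, is to take the union of the two consequents and keep the remaining items as the antecedent. Since the new antecedent is a subset of both original antecedents, its support cannot be smaller, so the confidence cannot be larger. The function name `merge_rules` is hypothetical.

```python
def merge_rules(rule_a, rule_b):
    """Merge two rules over the same itemset by uniting their consequents.

    Each rule is a pair (antecedent, consequent) of frozensets.
    Returns None when the merged rule would have an empty antecedent.
    """
    (ante_a, cons_a), (ante_b, cons_b) = rule_a, rule_b
    itemset = ante_a | cons_a
    if itemset != ante_b | cons_b:
        raise ValueError("both rules must come from the same itemset")
    consequent = cons_a | cons_b
    antecedent = itemset - consequent
    if not antecedent:
        return None
    return antecedent, consequent


# CD -> AB merged with BD -> AC gives D -> ABC.
rule_1 = (frozenset("CD"), frozenset("AB"))
rule_2 = (frozenset("BD"), frozenset("AC"))
merged = merge_rules(rule_1, rule_2)
print(sorted(merged[0]), "->", sorted(merged[1]))
```

If either parent rule already fails the confidence threshold, the merged rule can be skipped without computing its confidence.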

2. Maximal Frequent Itemset

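1. An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.

2. How to find maximal frequent itemsets

▪ First, find the frequent itemsets that lie on the border between the infrequent and frequent itemsets.

- d, bc, ad, abc

▪ Then check the immediate supersets of each itemset found above.

- d has the supersets ad, bd, and cd; ad is frequent, so d is not maximal frequent.

- bc has the supersets abc and bcd; abc is frequent, so bc is not maximal frequent.

- All immediate supersets of ad and abc are infrequent. Therefore ad and abc are maximal frequent.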
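The check above can be written as a short Python sketch. The set of frequent itemsets below is an assumption: it is reconstructed from the example on the premise that ad and abc are the only maximal frequent itemsets over the items a, b, c, d, so every subset of ad or abc is frequent and everything else is not.

```python
ITEMS = "abcd"

# Assumed frequent itemsets: all non-empty subsets of "ad" and "abc".
frequent = {
    frozenset(s)
    for s in ["a", "b", "c", "d", "ab", "ac", "ad", "bc", "abc"]
}


def immediate_supersets(itemset, items):
    """All itemsets obtained by adding exactly one missing item."""
    return [itemset | {i} for i in items if i not in itemset]


def maximal_frequent(frequent, items):
    """Frequent itemsets with no frequent immediate superset."""
    return {
        s
        for s in frequent
        if not any(sup in frequent for sup in immediate_supersets(s, items))
    }


maximal = maximal_frequent(frequent, ITEMS)
print(sorted("".join(sorted(s)) for s in maximal))
```

Checking only immediate supersets is enough because support is anti-monotone: if no superset with one extra item is frequent, no larger superset can be frequent either.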

3. Closed Itemset

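1. Closed itemset: an itemset is closed if its support is strictly greater than the support of every immediate superset (no immediate superset has the same support).

2. How to find closed frequent itemsets

▪ First, find all frequent itemsets. In the example, the minimum support count is 2, and the frequent itemsets are drawn with a blue border.

▪ Next, find the closed itemsets, i.e. those whose immediate supersets all have a smaller support. They are drawn with a thick border.

▪ Itemsets with a blue border (frequent itemsets) that are also closed are called closed frequent itemsets.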
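The two steps can be sketched in Python with a brute-force enumeration. The transactions below are hypothetical data made up for this illustration (they are not the data behind the figure in the post); only the minimum support count of 2 is taken from the text.

```python
from itertools import combinations

# Hypothetical transactions, not the dataset from the original figure.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
    {"c"},
]
MIN_SUPPORT = 2  # minimum support count, as in the text
items = sorted(set().union(*transactions))


def support(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


# Step 1: all frequent itemsets (brute force over every non-empty itemset).
all_itemsets = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
]
frequent = {s for s in all_itemsets if support(s) >= MIN_SUPPORT}


def is_closed(itemset):
    """True when every immediate superset has strictly smaller support."""
    return all(
        support(itemset | {i}) < support(itemset)
        for i in items
        if i not in itemset
    )


# Step 2: keep the frequent itemsets that are also closed.
closed_frequent = {s for s in frequent if is_closed(s)}
print(sorted("".join(sorted(s)) for s in closed_frequent))
# ['a', 'ab', 'abc', 'c']
```

Here b is frequent but not closed, because ab occurs in exactly the same three transactions as b.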

4. Maximal vs Closed Itemsets

1. Example

2. Result

▪ Gray: closed frequent itemset

▪ Blue band: maximal frequent itemset

▪ The sets are nested: Frequent itemsets ⊇ Closed frequent itemsets ⊇ Maximal frequent itemsets.
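The nesting can be verified directly from a table of support counts. In this sketch the support counts are hypothetical values chosen for illustration. It relies on the observation that an infrequent superset can never match the support of a frequent itemset, so only frequent supersets need to be compared.

```python
# Hypothetical support counts of the frequent itemsets (minimum support 2).
support = {
    frozenset(k): v
    for k, v in {
        "a": 4, "b": 3, "c": 3,
        "ab": 3, "ac": 2, "bc": 2,
        "abc": 2,
    }.items()
}
frequent = set(support)

# Closed: no frequent proper superset has the same support.
closed = {
    s for s in frequent
    if not any(s < t and support[t] == support[s] for t in frequent)
}

# Maximal: no frequent proper superset at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

assert maximal <= closed <= frequent
print(len(frequent), len(closed), len(maximal))  # 7 4 1
```

Every maximal frequent itemset is closed because all of its proper supersets are infrequent and therefore have strictly smaller support.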

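5. Evaluating Association Rules (Pattern Evaluation)

1. Association rule generation algorithms produce too many rules.

▪ Not every generated rule is useful

▪ If {A, B, C} -> {D} and {A, B} -> {D} have the same support and confidence, the two rules are redundant

▪ Interestingness measures are used to prune or rank the derived rules

▪ Support and confidence are themselves interestingness measures

2. Various interestingness measures can be computed from a contingency table

3. Drawback of confidence

▪ Judging by the confidence in the example alone, one would infer that people who drink tea tend to drink coffee.

▪ However, 80 percent of all 100 people drink coffee to begin with. Knowing that someone drinks tea therefore tells us little about whether that person drinks coffee.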
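To make the drawback concrete, here is a small Python sketch built on a 2x2 contingency table. The four cell counts are assumptions invented for this illustration; only the total of 100 people and the 80 percent share of coffee drinkers come from the text.

```python
# Hypothetical contingency table (counts of people); only the total of 100
# and the 80 coffee drinkers are taken from the text.
tea_and_coffee = 15
tea_only = 5
coffee_only = 65
neither = 15

n_total = tea_and_coffee + tea_only + coffee_only + neither
n_tea = tea_and_coffee + tea_only
n_coffee = tea_and_coffee + coffee_only

confidence = tea_and_coffee / n_tea          # c(Tea -> Coffee) = P(Coffee | Tea)
p_coffee = n_coffee / n_total                # baseline P(Coffee)
p_coffee_no_tea = coffee_only / (n_total - n_tea)  # P(Coffee | no Tea)

print(confidence, p_coffee, p_coffee_no_tea)  # 0.75 0.8 0.8125
```

With these assumed counts the rule Tea -> Coffee has a high confidence, yet tea drinkers are slightly less likely to drink coffee than the population as a whole, which is what confidence alone fails to show.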

6. Statistical Independence

1. Statistical independence

▪ P(S∩B) = P(S) × P(B) => Statistical independence
▪ P(S∩B) > P(S) × P(B) => Positively correlated
▪ P(S∩B) < P(S) × P(B) => Negatively correlated

2. Example

Population of 1000 students

- 600 students know how to swim (S)

- 700 students know how to bike (B)

- 420 students know how to swim and bike (S,B)

▪ P(S∩B) = 420/1000 = 0.42

▪ P(S) × P(B) = 0.6 × 0.7 = 0.42

▪ P(S∩B) = P(S) × P(B) => Statistical independence
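The same comparison can be sketched in Python. To avoid floating-point rounding when testing for equality, this version cross-multiplies the counts instead of comparing probabilities; the function name `correlation` is an assumption made for this illustration.

```python
def correlation(n_total, n_x, n_y, n_xy):
    """Classify X and Y by comparing P(X and Y) with P(X) * P(Y).

    P(X and Y) ? P(X) * P(Y)  is equivalent to  n_xy * n_total ? n_x * n_y,
    so the comparison can stay in exact integer arithmetic.
    """
    observed = n_xy * n_total
    expected = n_x * n_y
    if observed == expected:
        return "independent"
    if observed > expected:
        return "positively correlated"
    return "negatively correlated"


# The swim/bike numbers from the example: 420 * 1000 == 600 * 700.
print(correlation(1000, 600, 700, 420))  # independent
```

With the same marginals, a joint count above 420 would be classified as positively correlated and a count below 420 as negatively correlated.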

3. Measures that take into account statistical dependence

4. Evaluating association rules: Lift

▪ If Lift > 1, then X and Y appear together more often than expected under independence
▪ If Lift < 1, then X and Y appear together less often than expected under independence
▪ If Lift = 1, then X and Y are independent.
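Lift is commonly defined as Lift(X -> Y) = P(Y|X) / P(Y) = P(X∩Y) / (P(X) × P(Y)). The sketch below assumes that definition and computes it from raw counts with exact fractions. The function name `lift` and the second call, which uses a hypothetical joint count of 300, are illustrative assumptions.

```python
from fractions import Fraction


def lift(n_total, n_x, n_y, n_xy):
    """Lift(X -> Y) = P(X and Y) / (P(X) * P(Y)), computed from counts."""
    return Fraction(n_xy * n_total, n_x * n_y)


# Swim/bike numbers from the statistical independence example.
print(lift(1000, 600, 700, 420))         # 1, so S and B are independent

# Hypothetical variation: only 300 students know both.
print(float(lift(1000, 600, 700, 300)))  # below 1, negatively correlated
```

Unlike confidence, lift is symmetric in X and Y, so X -> Y and Y -> X always have the same lift.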

5. Example

▪ The lift is less than 1, so the two items are negatively correlated.

License: Attribution, NonCommercial, NoDerivatives
