COMP5318 Knowledge Discovery and Data Mining_2011 Semester 1_week3chap6_basic_association_analysis
University of Sydney_COMP5318 Knowledge Discovery and Data Mining_2011
Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6 of Introduction to Data Mining by Tan, Steinbach, Kumar.
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset in the market-basket transactions above
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example, for the rule {Milk, Diaper} → {Beer}:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
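The two metrics above can be sketched directly in Python (the dataset and rule come from the slides; the function names are illustrative):

```python
# Market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    """s(X -> Y) = sigma(X u Y) / |T|"""
    return support_count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)"""
    return support_count(X | Y, transactions) / support_count(X, transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X, Y, transactions))               # 0.4
print(round(confidence(X, Y, transactions), 2))  # 0.67
```

Only transactions 3 and 4 contain all of Milk, Diaper and Beer, while three transactions contain {Milk, Diaper}, which reproduces s = 2/5 and c = 2/3.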
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having:
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules

Example of rules from the market-basket transactions above:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation

[Itemset lattice over the items {A, B, C, D, E}: from the null set at the top, through the 1-itemsets A..E, the 2-itemsets AB..DE, and so on, down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database of transactions
– Match each of the N transactions against every one of the M candidates (w = maximum transaction width)
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !
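A minimal sketch of this brute-force counting loop, using the transactions above (candidate enumeration via itertools; variable names are illustrative):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))  # d = 6 unique items

# All M = 2^d - 1 non-empty candidate itemsets.
candidates = [frozenset(c) for k in range(1, len(items) + 1)
              for c in combinations(items, k)]
print(len(candidates))  # 63 = 2**6 - 1

# The brute-force loop: every transaction vs. every candidate (~ N*M*w work).
support = {c: sum(1 for t in transactions if c <= t) for c in candidates}
print(support[frozenset({"Bread", "Milk", "Diaper"})])  # 2
```

Even for d = 6 there are already 63 candidates; the point of the strategies below is to avoid materializing all 2^d of them.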
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1

If d = 6, R = 602 rules.
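Both forms of the rule count can be checked numerically (a quick sketch using the standard library's math.comb):

```python
from math import comb

def rule_count(d):
    """R = sum_{k=1}^{d-1} C(d,k) * sum_{j=1}^{d-k} C(d-k, j)"""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(rule_count(d))          # 602
print(3**d - 2**(d + 1) + 1)  # 602, the closed form
```

The outer sum picks a non-empty antecedent X of size k; the inner sum picks a non-empty consequent from the remaining d − k items.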
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemsets increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle:– If an itemset is frequent, then all of its subsets must also be frequent
Apriori principle holds due to the following property of the support measure:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
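The anti-monotone property can be spot-checked on the market-basket transactions (an illustrative sketch, not part of the slides):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):
    """Support: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# For itemset Y and every proper subset X of Y, check s(X) >= s(Y).
Y = {"Milk", "Diaper", "Beer"}
for k in range(1, len(Y)):
    for X in map(set, combinations(Y, k)):
        assert s(X) >= s(Y), (X, Y)
print("anti-monotone holds for all subsets of", Y)
```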
Illustrating Apriori Principle

[Itemset lattice over {A, B, C, D, E}: the itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search space without ever counting their support.]
Illustrating Apriori Principle

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                 Count
{Bread, Milk, Diaper}   2
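The 41-versus-13 candidate counts can be reproduced on the same dataset (a sketch; variable names are illustrative):

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3

# Without pruning: all 1-, 2- and 3-itemsets over the 6 items.
print(comb(6, 1) + comb(6, 2) + comb(6, 3))  # 41

def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

items = sorted(set().union(*transactions))              # 6 candidate 1-itemsets
frequent1 = [i for i in items if count({i}) >= minsup]  # 4 frequent items
pairs = list(combinations(frequent1, 2))                # C(4,2) = 6 candidate pairs
frequent2 = {p for p in pairs if count(p) >= minsup}    # 4 frequent pairs
# Apriori keeps only triplets all of whose 2-subsets are frequent: one here.
triplets = [t for t in combinations(frequent1, 3)
            if all(s in frequent2 for s in combinations(t, 2))]
print(len(items) + len(pairs) + len(triplets))  # 13 candidates with pruning
```

Coke and Eggs fall below minsup at level 1, so no pair or triplet containing them is ever generated.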
Apriori Algorithm
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  – Prune candidate itemsets containing subsets of length k that are infrequent
  – Count the support of each candidate by scanning the DB
  – Eliminate candidates that are infrequent, leaving only those that are frequent
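The loop above can be sketched end to end as a compact illustrative implementation (not the optimized hash-tree version the book describes):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(1 for t in transactions if c <= t) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= minsup}
    result, k = dict(frequent), 1
    while frequent:
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        cands = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset.
        cands = {c for c in cands
                 if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c: n for c, n in count(cands).items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(transactions, minsup=3)
print(freq[frozenset({"Milk", "Diaper"})])  # 3
```

On this dataset with minsup = 3, the result contains 4 frequent 1-itemsets and 4 frequent 2-itemsets; no 3-itemset reaches the threshold.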
Reducing Number of Comparisons
Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure
– Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets

[Diagram: N transactions are matched against a hash structure whose k buckets hold the candidate itemsets.]
Introduction to Hash Functions
A hash function h is a mapping from a set X to a range of integers [0..k−1]. Thus each element of the set is mapped into one of k buckets, and each bucket contains all the elements that h maps into it.
Example
A mod function is a good example of a hash function. For example, suppose we use h(x) = x mod 7. Then 0 to 6 are mapped to 0 to 6, but 7 is mapped to 0 and 8 to 1. Thus the range of mod 7 is [0..6]; these are the buckets of mod 7.
Example
Suppose X is the set of integers 1..1000. Then h(x) = x mod 7 partitions X into seven buckets:

bucket 0: 7, 14, 21, …
bucket 1: 1, 8, 15, 22, …
…
bucket 6: 6, 13, 20, 27, …
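The mod-7 bucketing can be sketched directly (an illustrative snippet, not from the slides):

```python
def h(x, k=7):
    """Hash an integer into one of k buckets."""
    return x % k

# Partition X = {1, ..., 1000} into 7 buckets.
buckets = {b: [] for b in range(7)}
for x in range(1, 1001):
    buckets[h(x)].append(x)

print(buckets[0][:4])  # [7, 14, 21, 28]
print(buckets[1][:4])  # [1, 8, 15, 22]
print(buckets[6][:4])  # [6, 13, 20, 27]
```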
Factors Affecting Complexity
Choice of minimum support threshold
– Lowering the support threshold results in more frequent itemsets
– This may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
– More space is needed to store the support count of each item
– If the number of frequent items also increases, both computation and I/O costs may increase

Size of database
– Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– Transaction width increases with denser data sets
– This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have the same support as their supersets.

Example: 15 transactions over 30 items A1..A10, B1..B10, C1..C10, where
– transactions 1–5 contain exactly {A1, …, A10},
– transactions 6–10 contain exactly {B1, …, B10},
– transactions 11–15 contain exactly {C1, …, C10}.

Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)

We need a compact representation.
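The count works out as follows (a quick check with math.comb): each of the three blocks of 10 always-co-occurring items contributes every non-empty subset of the block as a frequent itemset.

```python
from math import comb

# 3 blocks of 10 items; each non-empty subset of a block is frequent.
total = 3 * sum(comb(10, k) for k in range(1, 11))
print(total)  # 3069 = 3 * (2**10 - 1)
```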