香港科技大学:Clustering(PPT讲稿)

Clustering Instructor: Qiang Yang Hong Kong University of Science and Technology Yang @cs. ust. hk Thanks: J W. Han, I witten e frank
1 Clustering Instructor: Qiang Yang Hong Kong University of Science and Technology Qyang@cs.ust.hk Thanks: J.W. Han, I. Witten, E. Frank

Essentials erminology Objects rows= records Variables attributes= features a good clustering method high on intra-class similarity and low on inter-class similarity What is similarity? Based on computation of distance Between two numerical attributes Between two nominal attributes Mixed attributes
2 Essentials ◼ Terminology: ◼ Objects = rows = records ◼ Variables = attributes = features ◼ A good clustering method ◼ high on intra-class similarity and low on inter-class similarity ◼ What is similarity? ◼ Based on computation of distance ◼ Between two numerical attributes ◼ Between two nominal attributes ◼ Mixed attributes

The database XX Object i p
3 The database n n p i i p p x x x x x x 1 1 1 1 1 ... ... ... ... ... ... ... ... Object i

Numerical attributes Distances are normally used to measure the similarity or dissimilarity between two data objects Euclidean distance d(1)=,(x2-x,}+x2-x,P+,+|x,-x2) 12J2 Jp Where/=(Ⅻ…,)andj=(场灬加)le two p-dimensional records, Manhattan distance d(i)=x.-x,+|x.-x,|+.+1x2-x
4 Numerical Attributes ◼ Distances are normally used to measure the similarity or dissimilarity between two data objects ◼ Euclideandistance: where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional records, ◼ Manhattan distance | | ... | | ) 2 ( , ) (| | 2 2 1 1 2 2 p jp x i x j x i x j x i d i j = x − + − + + − ( , ) | | | | ... | | 1 1 2 2 p jp x i x j x i x j x i d i j = x − + − + + −

Binary variables([0, 1], or [true, false]) contingency table for binary data Row 10 sum 6 a+b Row i d c+d sum a+c b+d Simple matching coefficient btc +6+c+d Invariant of coding of binary variable: if you assign 1 to pass"and 0 to fail or the other way around, you'll get the same distance value 5
5 Binary Variables ({0, 1}, or {true, false}) ◼ A contingency table for binary data ◼ Simple matching coefficient ◼ Invariant of coding of binary variable: if you assign 1 to “pass” and 0 to “fail”, or the other way around, you’ll get the same distance value. a b c d b c d i j + + + ( , )= + sum a c b d p c d c d a b a b sum + + + + 0 1 1 0 Row i Row j

Nominal attributes a generalization of the binary variable in that it can take more than 2 states, e.g. red yellow, blue green Method 1: Simple matching m:# of matches, p: total of variables d(i,j)=Prm Method 2: use a large number of binary variables creating a new binary variable for each of the M nominal states
6 Nominal Attributes ◼ A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green ◼ Method 1: Simple matching ◼ m: # of matches, p: total # of variables ◼ Method 2: use a large number of binary variables ◼ creating a new binary variable for each of the M nominal states p p m d i j − ( , )=

Other measures of cluster distance Minimum distance d(Ci, ci)=min peCi,p,eg IP-PI Max distance maX p∈Ci,p∈C Mean distance (C1,C1)=m Avarage distance d(C2C)=∑∑|P 7
7 Other measures of cluster distance ◼ Minimum distance ◼ Max distance ◼ Mean distance ◼ Avarage distance ( , ) min | '| d Ci Cj = pCi, p'Cj P − P ( , ) max | '| d Ci Cj = pCi, p'Cj P − P d Ci Cj = mi − mj ( , ) = − p C i p C j P P n n d C C i j i j | '| 1 ( , )

Major clustering methods u Partition based(K-means) Produces sphere-like clusters Good when know number of clusters Small and med sized databases Hierarchical methods(agglomerative or divisive) Produces trees of clusters Fast Density based(dbscan Produces arbitrary shaped clusters Good when dealing with spatial clusters(maps) Grid-based Produces clusters based on grids Fast for large, multidimensional databases Model-based Based on statistical models Allow objects to belong to several clusters 8
8 Major clustering methods ◼ Partition based (K-means) ◼ Produces sphere-like clusters ◼ Good when ◼ know number of clusters, ◼ Small and med sized databases ◼ Hierarchical methods (Agglomerative or divisive) ◼ Produces trees of clusters ◼ Fast ◼ Density based (DBScan) ◼ Produces arbitrary shaped clusters ◼ Good when dealing with spatial clusters (maps) ◼ Grid-based ◼ Produces clusters based on grids ◼ Fast for large, multidimensional databases ◼ Model-based ◼ Based on statistical models ◼ Allow objects to belong to several clusters

The K-Means Clustering method: for numerical attributes Given k, the k-means algorithm is implemented in four steps Partition objects into k non-empty subsets Compute seed points as the centroids of the clusters of the current partition the centroid is the center, i. e. mean point, of the cluster) Assign each object to the cluster with the nearest seed point go back to Step 2, stop when no more new assignment
9 The K-Means Clustering Method: for numerical attributes ◼ Given k, the k-means algorithm is implemented in four steps: ◼ Partition objects into k non-empty subsets ◼ Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) ◼ Assign each object to the cluster with the nearest seed point ◼ Go back to Step 2, stop when no more new assignment

The mean point Y 2.5 2.75 The mean point can be a virtual point 10
10 The mean point 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 0 1 2 3 4 5 X X Y 1 2 2 4 3 3 4 2 2.5 2.75 The mean point can be a virtual point
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 电子科技大学:《计算机操作系统》课程教学资源(PPT课件讲稿)第三章 处理机的调度和死锁.ppt
- 《图像处理与计算机视觉 Image Processing and Computer Vision》课程教学资源(PPT课件讲稿)Chapter 11 Bundle adjustment Structure reconstruction SFM from N-frames.pptx
- 同济大学:《大数据分析与数据挖掘 Big Data Analysis and Mining》课程教学资源(PPT课件讲稿)关联规则 Association Rule.pptx
- 《程序设计基础》课程教学资源:实验教学大纲.pdf
- 白城师范学院:《数据库系统概论 An Introduction to Database System》课程教学资源(PPT课件讲稿)第二章 关系数据库(2.4 关系代数 2.5 关系演算 2.6 小结).ppt
- 安徽工贸职业技术学院:《计算机组装与维护》课程教学资源(PPT课件讲稿)项目五 微型计算机维护.ppt
- 曙光:并行程序设计简介(PPT讲座).ppt
- 《单片机原理与应用》课程教学资源(PPT课件讲稿)第7章 显示与开关/键盘输入及微型打印机接口设计.ppt
- 数据结构与算法(PPT课件讲稿)Data Structures and Algorithms.pptx
- 四川大学:《计算机操作系统 Operating System Principles》课程教学资源(PPT课件讲稿)第5章 死锁.ppt
- 四川大学:《Java面向对象编程》课程PPT教学课件(Object-Oriented Programming - Java)Unit 1.1 Java Applications 1.1.1 Applications in Java(熊运余).ppt
- 厦门大学:《大数据技术原理与应用》课程教学资源(PPT课件讲稿,2016)第8章 流计算.ppt
- Adaptive Dynamic Bipartite Graph Matching:A Reinforcement Learning Approach.pptx
- 中国科学技术大学:《网络安全协议》课程教学资源(PPT课件讲稿)第一章 网络安全综述 Network Security Protocols(薛开平).ppt
- 电子工业出版社:《计算机网络》课程教学资源(第五版,PPT课件讲稿)第二章 物理层.ppt
- Excel 2010高级使用技巧(PPT讲稿).ppt
- 《数据库原理》课程教学资源(PPT课件讲稿)第三章 关系数据库标准查询语言SQL.pps
- 河南中医药大学(河南中医学院):《计算机文化》课程教学资源(PPT课件讲稿)第一章 计算机网络概述(主讲:阮晓龙).pptx
- 《软件工程导论》课程教学资源(PPT课件讲稿)第9章 面向对象方法学.ppt
- 南京航空航天大学:《C++》课程电子教案(PPT课件讲稿)第3章 类的基础部分(主讲:陈哲).ppt
- 上海交通大学:TLS/SSL Security(PPT课件讲稿).pptx
- 山东大学计算机学院:《人机交互技术》课程教学资源(PPT课件讲稿)第7章 Web界面设计.ppt
- 山东大学:《微机原理及单片机接口技术》课程教学资源(PPT课件讲稿)第三章 IAP15W4K58S4单片机的硬件结构.ppt
- 南京大学:《面向对象技术 OOT》课程教学资源(PPT课件讲稿)面向方面的编程 Aspect Oriented Programming.ppt
- 武昌首义学院:Word的基本操作与技巧(PPT讲稿,主讲:张旋子).pptx
- 《VB程序设计》课程教学资源(PPT课件讲稿)第八章 过程.pps
- 湖南生物机电职业技术学院:《电子商务概论》课程教学资源(PPT课件)第五章 网络信息搜索.ppt
- 《电子商务》课程教学资源(PPT课件讲稿)第十章 网络营销.pptx
- 广西外国语学院:《计算机网络》课程教学资源(PPT课件讲稿)第7章 传输层协议——TCP与UDP.ppt
- 九州大学(日本国立综合大学):烟花算法爆炸因子分析及改良(艺术工学府:余俊).pptx
- 图像视频编码与表达的理论与方法(PPT讲稿)图像压缩标准JPEG.ppt
- 中国科学技术大学:《计算机视觉》课程教学资源(PPT课件讲稿)第九章 单幅图像深度重建 Depthmap Reconstruction Based on Monocular cues.ppt
- 电子工业出版社:《计算机网络》课程教学资源(第五版,PPT课件讲稿)第六章 应用层.ppt
- 《计算机导论》课程教学资源(PPT课件讲稿)第3章 计算机发展史和计算思维.pptx
- 武昌理工学院(武汉科技大学中南分校):Windows 2000/XP网络组建与系统管理(PPT课件讲稿,主讲:李燕).ppt
- 《网络编程实用教程(第三版)Network Application Programming》课程教学资源(PPT课件讲稿)第1章 概述.ppt
- 电子工业出版社:《计算机网络》课程教学资源(第五版,PPT课件讲稿)第十章 下一代因特网.ppt
- 南京大学:《面向对象技术 OOT》课程教学资源(PPT课件讲稿)对象序列化和持久化 Object Serialization and Persistence.ppt
- B-树、散列技术、散列表的概念、散列函数的构造方法、处理冲突的方法、散列表上的运算.ppt
- 四川大学:《软件测试与维护基础教程》课程教学资源(PPT课件讲稿)软件测试工具 Software Testing Tool.ppt