《知识发现和数据挖掘 Knowledge Discovery and Data Mining》课程教学课件(PPT讲稿)Chapter 10. Cluster Analysis:Basic Concepts and Methods

COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei C2012 Han Kamber pei. All rights reserved
1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei ©2012 Han, Kamber & Pei. All rights reserved

Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical methods Density-Based Methods Grid-Based methods Evaluation of clustering Summary
2 Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 2

What is Cluster Analysis? Cluster: a collection of data objects similar(or related) to one another within the same group dissimilar (or unrelated) to the objects in other groups Cluster analysis(or clustering, data segmentation,. Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters Unsupervised learning: no predefined classes (i.e, learning by observations vs learning by examples: supervised) Typical applications As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms
3 What is Cluster Analysis? ◼ Cluster: A collection of data objects ◼ similar (or related) to one another within the same group ◼ dissimilar (or unrelated) to the objects in other groups ◼ Cluster analysis (or clustering, data segmentation, …) ◼ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters ◼ Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) ◼ Typical applications ◼ As a stand-alone tool to get insight into data distribution ◼ As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species Information retrieval: document clustering Land use: ldentification of areas of similar land use in an earth observation database Marketing Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs City-planning Identifying groups of houses according to their house type, value, and geographical location Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults Climate: understanding earth climate, find patterns of atmospheric and ocean Economic Science: market resarch
4 Clustering for Data Understanding and Applications ◼ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species ◼ Information retrieval: document clustering ◼ Land use: Identification of areas of similar land use in an earth observation database ◼ Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs ◼ City-planning: Identifying groups of houses according to their house type, value, and geographical location ◼ Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults ◼ Climate: understanding earth climate, find patterns of atmospheric and ocean ◼ Economic Science: market resarch

Clustering as a Preprocessing Tool ( Utility) Summarization Preprocessing for regression, PCA, classification, and association analysis Compression Image processing: vector quantization Finding K-nearest Neighbors Localizing search to one or a small number of clusters Outlier detection Outliers are often viewed as those far away' from any cluster
5 Clustering as a Preprocessing Tool (Utility) ◼ Summarization: ◼ Preprocessing for regression, PCA, classification, and association analysis ◼ Compression: ◼ Image processing: vector quantization ◼ Finding K-nearest Neighbors ◼ Localizing search to one or a small number of clusters ◼ Outlier detection ◼ Outliers are often viewed as those “far away” from any cluster

Quality: What Is Good clustering? A good clustering method will produce high quality clusters high intra-class similarity: cohesive within clusters low inter-class similarity: distinctive between clusters The guality of a clustering method depends on the similarity measure used by the method a its implementation, and Its ability to discover some or all of the hidden patterns 6
Quality: What Is Good Clustering? ◼ A good clustering method will produce high quality clusters ◼ high intra-class similarity: cohesive within clusters ◼ low inter-class similarity: distinctive between clusters ◼ The quality of a clustering method depends on ◼ the similarity measure used by the method ◼ its implementation, and ◼ Its ability to discover some or all of the hidden patterns 6

Measure the quality of clustering Dissimilarity/Similarity metric Similarity is expressed in terms of a distance function typically metric: d(,D The definitions of distance functions are usually rather different for interval-scaled boolean categorical ordinal ratio, and vector variables Weights should be associated with different variables based on applications and data semantics Quality of clustering There is usually a separate "quality function that measures the goodness" of a cluster It is hard to define similar enough"or "good enough The answer is typically highly subjective
Measure the Quality of Clustering ◼ Dissimilarity/Similarity metric ◼ Similarity is expressed in terms of a distance function, typically metric: d(i, j) ◼ The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables ◼ Weights should be associated with different variables based on applications and data semantics ◼ Quality of clustering: ◼ There is usually a separate “quality” function that measures the “goodness” of a cluster. ◼ It is hard to define “similar enough” or “good enough” ◼ The answer is typically highly subjective 7

Considerations for Cluster Analysis Partitioning criteria Single level vs. hierarchical partitioning(often, multi-level hierarchical partitioning is desirable Separation of clusters EXclusive(e.g, one customer belongs to only one region)Vs non exclusive(e.g, one document may belong to more than one class Similarity measure Distance-based(e.g, Euclidian, road network, vector)VS connectivity-based(e.g, density or contiguity) Clustering space Full space(often when low dimensional)vs subspaces(often in high-dimensional clustering 8
Considerations for Cluster Analysis ◼ Partitioning criteria ◼ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable) ◼ Separation of clusters ◼ Exclusive (e.g., one customer belongs to only one region) vs. nonexclusive (e.g., one document may belong to more than one class) ◼ Similarity measure ◼ Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity) ◼ Clustering space ◼ Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) 8

Requirements and challenges Scalability Clustering all the data instead of only on samples Ability to deal with different types of attributes Numerical, binary, categorical, ordinal, linked, and mixture of the lese Constraint-based clustering User may give inputs on constraints Use domain knowledge to determine input parameters Interpretability and usability Others Discovery of clusters with arbitrary shape Ability to deal with noisy data Incremental clustering and insensitivity to input order High dimensionalit
Requirements and Challenges ◼ Scalability ◼ Clustering all the data instead of only on samples ◼ Ability to deal with different types of attributes ◼ Numerical, binary, categorical, ordinal, linked, and mixture of these ◼ Constraint-based clustering ◼ User may give inputs on constraints ◼ Use domain knowledge to determine input parameters ◼ Interpretability and usability ◼ Others ◼ Discovery of clusters with arbitrary shape ◼ Ability to deal with noisy data ◼ Incremental clustering and insensitivity to input order ◼ High dimensionality 9

Major Clustering Approaches o Partitioning approach Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors Typical methods: k-means, k-medoids, CLARANS Hierarchical approach Create a hierarchical decomposition of the set of data(or objects using some criterion Typical methods: Diana, Agnes, BIRCH, CAMELEON Density-based approach Based on connectivity and density functions Typical methods: DBSACN, OPTICS, DenClue Grid-based approach based on a multiple- level granularity structure Typical methods: STING, WaveCluster, CLIQUE 10
Major Clustering Approaches (I) ◼ Partitioning approach: ◼ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors ◼ Typical methods: k-means, k-medoids, CLARANS ◼ Hierarchical approach: ◼ Create a hierarchical decomposition of the set of data (or objects) using some criterion ◼ Typical methods: Diana, Agnes, BIRCH, CAMELEON ◼ Density-based approach: ◼ Based on connectivity and density functions ◼ Typical methods: DBSACN, OPTICS, DenClue ◼ Grid-based approach: ◼ based on a multiple-level granularity structure ◼ Typical methods: STING, WaveCluster, CLIQUE 10
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 《人工智能原理及应用》课程教学大纲 Artificial Intelligence Principles and Applications.doc
- 西安电子科技大学:《接入网技术及其应用》课程教学资源(PPT课件讲稿)第6章 接入网应用(徐展琦).ppt
- 《管理信息系统原理及开发》课程教学资源(PPT课件讲稿)第3、4讲 管理信息系统的系统设计.ppsx
- 西安电子科技大学:《现代密码学》课程教学资源(PPT课件讲稿)第四章 公钥密码(主讲:董庆宽).pptx
- 河南中医药大学(河南中医学院):《计算机文化》课程教学资源(PPT课件讲稿)第二章 计算机的前世今生(主讲:许成刚).ppt
- 《计算机软件及应用》课程教学资源(PPT课件讲稿)第2章 Photoshop CS入门基础.ppt
- 《大型机高级系统管理技术》课程教学资源(PPT课件讲稿)第4章 作业控制子系统.ppt
- 上海交通大学:《软件工程 Software Engineering》课程教学资源(PPT课件讲稿)软件开发过程 Software Development Processes.pptx
- 中国水利水电出版社:《计算机组装与维护实训教程》课程教学资源(PPT课件讲稿,共九章).ppt
- 《大学生计算机基础》课程教学资源(PPT讲稿)第三章 字处理软件(Word 2003).ppt
- 北京大学:《高级软件工程》课程教学资源(PPT课件讲稿)第六讲 网络环境中的软件质量.ppt
- 《计算机数据恢复技术》课程教学资源(PPT课件讲稿)第1章 数据恢复技术概述.ppt
- 中国科学技术大学:《现代密码学理论与实践》课程教学资源(PPT课件讲稿)第2章 传统加密技术 Classical Encryption Techniques.ppt
- 《计算机系统安全》课程教学资源(PPT课件讲稿)第六章 访问控制 Access Control.ppt
- 陕西师范大学:Neural Networks and Fuzzy Systems(PPT讲稿)Chapter 3 NEURONAL DYNAMICS II:ACTIVATION MODELS.ppt
- 中国铁道出版社:《局域网技术与组网工程》课程教学资源(PPT课件讲稿)第5章 Linux网络工程.ppt
- Parallel Algorithms Underlying MPI Implementations.ppt
- 《电子商务技术》课程教学资源(PPT课件讲稿)第五章 电子商务安全技术.ppt
- 电子工业出版社:《计算机网络》课程教学资源(第五版,PPT课件讲稿)第六章 应用层(谢希仁).ppt
- 南京大学:《面向对象技术 OOT》课程教学资源(PPT课件讲稿)并发对象 Concurrent Objects.ppt
- 中国科学技术大学:《信号与图像处理基础 Signal and Image Processing》课程教学资源(PPT课件讲稿)小波分析 Wavelet Analysis(主讲:曹洋).pptx
- 《计算机网络 Computer Networking》课程教学资源(PPT课件讲稿)Chapter 6 无线和移动网络 Wireless and Mobile Networks.ppt
- 《UNIX操作系统基础》课程教学资源(PPT课件讲稿)第三章 UNIX的文件与目录.ppt
- 上海交通大学:并发理论(PPT课件诗篇)Concurrency Theory.ppt
- 南京大学:《Java语言程序设计》课程教学资源(PPT课件讲稿)第2章 Java语言语法基础.ppt
- 南京大学:使用失效数据来引导决定(PPT讲稿,计算机系:赵建华).ppt
- 南京航空航天大学:《C++》课程电子教案(PPT课件讲稿)第3章 类的基础部分(主讲:陈哲).ppt
- 《软件工程导论》课程教学资源(PPT课件讲稿)第9章 面向对象方法学.ppt
- 河南中医药大学(河南中医学院):《计算机文化》课程教学资源(PPT课件讲稿)第一章 计算机网络概述(主讲:阮晓龙).pptx
- 《数据库原理》课程教学资源(PPT课件讲稿)第三章 关系数据库标准查询语言SQL.pps
- Excel 2010高级使用技巧(PPT讲稿).ppt
- 电子工业出版社:《计算机网络》课程教学资源(第五版,PPT课件讲稿)第二章 物理层.ppt
- 中国科学技术大学:《网络安全协议》课程教学资源(PPT课件讲稿)第一章 网络安全综述 Network Security Protocols(薛开平).ppt
- Adaptive Dynamic Bipartite Graph Matching:A Reinforcement Learning Approach.pptx
- 厦门大学:《大数据技术原理与应用》课程教学资源(PPT课件讲稿,2016)第8章 流计算.ppt
- 四川大学:《Java面向对象编程》课程PPT教学课件(Object-Oriented Programming - Java)Unit 1.1 Java Applications 1.1.1 Applications in Java(熊运余).ppt
- 四川大学:《计算机操作系统 Operating System Principles》课程教学资源(PPT课件讲稿)第5章 死锁.ppt
- 数据结构与算法(PPT课件讲稿)Data Structures and Algorithms.pptx
- 《单片机原理与应用》课程教学资源(PPT课件讲稿)第7章 显示与开关/键盘输入及微型打印机接口设计.ppt
- 曙光:并行程序设计简介(PPT讲座).ppt