同济大学:《大数据分析与数据挖掘 Big Data Analysis and Mining》课程教学资源(PPT课件讲稿)K-means & EM

Partitioning Algorithms a Partitioning method: Construct a partition of n data into a set of k clusters a Given: a set of data and the number k Find: a partition of K clusters that optimizes the chosen partitioning criterion ◆ Globally optimal a Intractable for many objective functions a Ergo, exhaustively enumerate all partitions e Effective heuristic methods: K-means and K- medoids algorithms
Partitioning Algorithms ◼ Partitioning method: Construct a partition of n data into a set of K clusters ◼ Given: a set of data and the number K ◼ Find: a partition of K clusters that optimizes the chosen partitioning criterion ◆ Globally optimal Intractable for many objective functions Ergo, exhaustively enumerate all partitions ◆ Effective heuristic methods: K-means and Kmedoids algorithms

K-Means a Assumes data points are real- valued vectors a Clusters based on centroids(aka the center of gravity or mean) of points in a cluster, c u(c) x x∈C a Reassignment of instances to clusters is based on distance to the current cluster centroids o(Or one can equivalently phrase it in terms of similarities)
K-Means ◼ Assumes data points are real-valued vectors. ◼ Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c: ◼ Reassignment of instances to clusters is based on distance to the current cluster centroids. ◆ (Or one can equivalently phrase it in terms of similarities) = x c x c | | 1 μ(c)

K-Means Algorithm Select K random data C1, C2,.CK] from data X1, 2… XN as seeds a Until clustering converges(or other stopping criterion) /partitioning For each point X Assign X; to the cluster c such that dist(x c)is minimal //NeXt, update the centroid of each cluster For each cluster c=(c)
K-Means Algorithm ◼ Select K random data {c1 , c2 ,… cK} from data {x1 , x2 ,… xN} as seeds. ◼ Until clustering converges (or other stopping criterion): //partitioning For each point xi : Assign xi to the cluster cj such that dist(xi , cj ) is minimal. //Next, update the centroid of each cluster For each cluster cj cj = (cj )

K Means Example(K-2) Pick seeds Reassign clusters Compute centroids Reassign clusters X Compute centroids Reassign clusters Converged
K Means Example (K=2) Pick seeds Reassign clusters Compute centroids x x Reassign clusters x x Compute centroids Reassign clusters Converged!

Example Assign Gate each the objects°23 cluster to most means similar reassign reassign K=2 Arbitrarily choose K object as initial cluster center Update the means 012345678910
Example 0123456789 10 0 1 2 3 4 5 6 7 8 9 10 0123456789 10 0 1 2 3 4 5 6 7 8 9 10 0123456789 10 0 1 2 3 4 5 6 7 8 9 10 0123456789 10 0 1 2 3 4 5 6 7 8 9 10 0123456789 10 0 1 2 3 4 5 6 7 8 9 10 K=2 Arbitrarily choose K object as initial cluster center Assign each objects to most similar center Update the cluster means Update the cluster means reassign reassign

Termination conditions a Several possibilities, e. g .Afixed number of iterations e data partition unchanged o Centroid positions dont change Does this mean that the data in a cluster are unchanged?
Termination conditions ◼ Several possibilities, e.g., ◆ A fixed number of iterations. ◆ data partition unchanged. ◆ Centroid positions don’t change. Does this mean that the data in a cluster are unchanged?

Convergence Why should the K-means algorithm ever reach a fixed point? A state in which clusters dont change K-means is a special case of a general procedure known as the Expectation Maximization(EM algorithm o EM is known to converge o Number of iterations could be large But in practice usually isnt
Convergence ◼ Why should the K-means algorithm ever reach a fixed point? ◆ A state in which clusters don’t change. ◼ K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. ◆ EM is known to converge. ◆ Number of iterations could be large. ➢ But in practice usually isn’t

Time Complexity a Computing distance between two data points IS O(D)Where d is the dimensionality of the vectors Reassigning clusters: O(KN) distance computations, or O(KND Computing centroids: Each point gets added once to some centroid: O(ND) a assume these two steps are each done once for /iterations: O(KND)
Time Complexity ◼ Computing distance between two data points is O(D) where D is the dimensionality of the vectors. ◼ Reassigning clusters: O(KN) distance computations, or O(KND). ◼ Computing centroids: Each point gets added once to some centroid: O(ND). ◼ Assume these two steps are each done once for I iterations: O(IKND)

Strengths of K-means clustering Relatively scalable in processing large data sets a Relatively efficient: O(tkn), where n is #f objects, k is clusters, and t is iterations. Normally, k, t<< n a Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms
Strengths of K-means clustering ◼ Relatively scalable in processing large data sets ◼ Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. ◼ Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms

Weaknesses of K-means clustering a Applicable only when the mean of objects is defined Need to specify k, the number of clusters in advance a Unable to handle noisy data and outliers a Not suitable to discover clusters with non-convex shapes, or clusters of very different size
Weaknesses of K-means clustering ◼ Applicable only when the mean of objects is defined ◼ Need to specify k, the number of clusters, in advance ◼ Unable to handle noisy data and outliers ◼ Not suitable to discover clusters with non-convex shapes, or clusters of very different size
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 中国医科大学计算机中心:《虚拟现实与增强现实技术概论》课程教学资源(PPT课件讲稿)第3章 虚拟现实系统的输出设备.pptx
- 香港中文大学:XML for Interoperable Digital Video Library.ppt
- 上海交通大学:《计算机图形学 Computer Graphics》课程教学资源(PPT讲稿)CHAPTER 4 THE VISUALIZATION PIPELINE.pptx
- 《网络搜索和挖掘关键技术 Web Search and Mining》课程教学资源(PPT讲稿)Lecture 09 Evaluation.ppt
- 长春工业大学:《网页设计与制作》课程教学资源(PPT课件)第5章 Div+CSS布局技术.ppt
- 合肥工业大学:《计算机网络技术》课程教学资源(PPT课件讲稿)第4章 交换网的运行.ppt
- 山东大学软件学院:非线性规划(PPT讲稿)一维搜索方法.ppt
- 《并发控制技术》课程教学资源(PPT课件讲稿)第7章 事务管理 transaction management.ppt
- 北京师范大学现代远程教育:《计算机应用基础》课程教学资源(PPT课件讲稿)第1章 计算机常识(主讲:马秀麟).pptx
- 南京大学:《面向对象技术 OOT》课程教学资源(PPT课件讲稿)面向对象的分析与设计简介 OOA & OOD:An introduction.ppt
- 中国科学技术大学:《计算机体系结构》课程教学资源(PPT课件讲稿)向量体系结构.pptx
- 中国科学技术大学:《现代密码学理论与实践》课程教学资源(PPT课件讲稿)第二部分 公钥密码和散列函数 第8章 数论入门(苗付友).pptx
- 《计算机网络技术》课程教学资源(PPT课件讲稿)第5章 广域网.ppt
- 香港城市大学:Rank Aggregation in MetaSearch.ppt
- Vitebi 译码.ppt
- 图形处理及多媒体应用(PPT课件讲稿).pps
- 北京师范大学现代远程教育:《计算机应用基础》课程教学资源(PPT课件讲稿)第5章 Microsoft Excel 2010.pptx
- Distributed Systems and Networking Programmin(SOAP – Introduction).ppt
- Coded Caching under Arbitrary Popularity Distributions.pptx
- 东南大学:《泛型编程 Generic Programming》课程教学资源(PPT课件讲稿)Chapter 14 Templates.ppt
- 北京大学:文本挖掘技术(PPT讲稿)文本分类 Text Categorization.ppt
- 《网页设计与制作》课程教学资源(PPT课件讲稿)第一章 HTML基础.ppt
- 清华大学:《计算机导论》课程电子教案(PPT教学课件)第1章 计算机发展简史.ppt
- 《网络搜索和挖掘关键技术 Web Search and Mining》课程教学资源(PPT讲稿)Lecture 06 Index Compression.ppt
- 嵌入式交叉开发环境的建立(PPT实验讲稿).ppt
- 西安交通大学:《微型计算机接口技术》课程教学资源(PPT课件讲稿)第五章 输入/输出控制接口.ppt
- 《TCP/IP协议及其应用》课程教学资源(PPT课件讲稿)第3章 IP寻址与地址解析.ppt
- 中国医科大学:《计算机网络实用教程》课程教学资源(PPT讲稿)高速局域网技术、交换式局域网技术、虚拟局域网技术、主要的城域网技术.ppt
- 《大学计算机基础》课程教学资源:作业习题.pdf
- 《计算机网络》课程教学资源(PPT课件讲稿)第一章 计算机网络概述.ppt
- 山西国际商务职业学院:《数据库应用程序设计》课程教学资源(PPT课件)第三章 数据与数据运算.pps
- 《C语言程序设计》课程电子教案(PPT课件讲稿)Chapter 02 用C语言编写程序.ppt
- 《数字图像处理》课程教学资源(PPT课件讲稿)第5章 图像复原.ppt
- 《数据结构 Data Structure》课程教学资源(PPT课件讲稿)06 非二叉树 Non-Binary Trees.ppt
- 《数据库系统概论 An Introduction to Database System》课程教学资源(PPT课件讲稿)第六讲 关系数据理论.ppt
- 南京大学:《面向对象技术 OOT》课程教学资源(PPT课件讲稿)并发对象 Concurrent Objects.ppt
- 电子工业出版社:《计算机网络》课程教学资源(第五版,PPT课件讲稿)第六章 应用层(谢希仁).ppt
- 《电子商务技术》课程教学资源(PPT课件讲稿)第五章 电子商务安全技术.ppt
- Parallel Algorithms Underlying MPI Implementations.ppt
- 中国铁道出版社:《局域网技术与组网工程》课程教学资源(PPT课件讲稿)第5章 Linux网络工程.ppt