Data Mining and Model Choice in Supervised Learning

Gilbert Saporta
Chaire de Statistique Appliquée & CEDRIC, CNAM, 292 rue Saint Martin, F-75003 Paris
gilbert.saporta@cnam.fr
http://cedric.cnam.fr/~saporta

Beijing, 2008

Outline
1. What is data mining?
2. Association rule discovery
3. Statistical models
4. Predictive modelling
5. A scoring case study
6. Discussion

1. What is data mining?
◼ Data mining is a new field at the frontiers of statistics and information technologies (database management, artificial intelligence, machine learning, etc.) which aims at discovering structures and patterns in large data sets.

1.1 Definitions
◼ U. M. Fayyad, G. Piatetsky-Shapiro: “Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
◼ D. J. Hand: “I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets.”

◼ The metaphor of Data Mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered by specific tools.
◼ Data Mining is concerned with data which were collected for another purpose: it is a secondary analysis of databases that are collected Not Primarily For Analysis, but for the management of individual cases (Kardaun, T. Alanko, 1998).
◼ Data Mining is not concerned with efficient methods for collecting data such as surveys and experimental designs (Hand, 2000).

What is new? Is it a revolution?
◼ The idea of discovering facts from data is as old as Statistics, which “is the science of learning from data” (J. Kettenring, former ASA president).
◼ In the 1960s: Exploratory Data Analysis (Tukey, Benzécri, etc.). « Data analysis is a tool for extracting the diamond of truth from the mud of data. » (J. P. Benzécri, 1973)

1.2 Data Mining started from:
◼ An evolution of DBMS towards Decision Support Systems using a Data Warehouse.
◼ Storage of huge data sets: credit card transactions, phone calls, supermarket bills. Gigabytes and terabytes of data are collected automatically.
◼ Marketing operations: CRM (customer relationship management).
◼ Research in Artificial Intelligence, machine learning, and KDD (Knowledge Discovery in Data Bases).

1.3 Goals and tools
◼ Data Mining is a « secondary analysis » of data collected for another purpose (e.g. management).
◼ Data Mining aims at finding structures of two kinds: models and patterns.
◼ Patterns
  ◼ A pattern is a characteristic structure exhibited by a small number of points: for example, a small subgroup of customers with a high commercial value or, conversely, a high risk.
  ◼ Tools: cluster analysis; visualisation by dimension reduction (PCA, CA, etc.); association rules.
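
To make the last tool concrete, here is a minimal sketch of how an association rule can be scored. It assumes the standard support and confidence measures, which the slide names only implicitly; the transactions, item names, and function names are invented for illustration.

```python
# Toy market-basket data (hypothetical; not taken from the talk).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A).

    Assumes the antecedent occurs in at least one transaction.
    """
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support(bread, butter) = {rule_support:.2f}")
print(f"confidence(bread -> butter) = {rule_confidence:.2f}")
```

A rule such as "bread -> butter" is usually retained only when both quantities exceed user-chosen thresholds. That is how a pattern concerning a small subgroup can be found without fitting a global model.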

Models
◼ Building models is a major activity for statisticians, econometricians, and other scientists. A model is a global summary of relationships between variables, which both helps to understand phenomena and allows predictions.
◼ DM is not concerned with estimation and tests of prespecified models, but with discovering models through an algorithmic search process exploring linear and non-linear models, explicit or not: neural networks, decision trees, Support Vector Machines, logistic regression, graphical models, etc.
◼ In DM, models do not come from a theory, but from data exploration.
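
As a rough illustration of "discovering models through an algorithmic search process", the sketch below searches exhaustively over one-variable threshold rules (decision stumps, the simplest decision trees) on a training sample, then checks the selected rule on held-out data. The data, the function names, and the use of a hold-out sample are assumptions made for illustration, not the author's procedure.

```python
# Hypothetical toy data: each row is (age, income); label 1 = "bad risk".
train_X = [(22, 15), (25, 40), (31, 18), (38, 52), (45, 20), (52, 60), (29, 16), (60, 45)]
train_y = [1, 0, 1, 0, 1, 0, 1, 0]
test_X = [(27, 17), (41, 55), (35, 28), (58, 48)]
test_y = [1, 0, 0, 0]

def error_rate(predict, X, y):
    """Share of misclassified observations."""
    return sum(predict(x) != label for x, label in zip(X, y)) / len(y)

def make_stump(var, threshold, label_below):
    """Rule: predict `label_below` if x[var] < threshold, else the other class."""
    return lambda x: label_below if x[var] < threshold else 1 - label_below

def search_stump(X, y):
    """Exhaustive search over variables, thresholds, and orientations."""
    best = None
    for var in range(len(X[0])):
        values = sorted({x[var] for x in X})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for label_below in (0, 1):
                err = error_rate(make_stump(var, threshold, label_below), X, y)
                if best is None or err < best[0]:
                    best = (err, var, threshold, label_below)
    return best

train_err, var, threshold, label_below = search_stump(train_X, train_y)
chosen = make_stump(var, threshold, label_below)
test_err = error_rate(chosen, test_X, test_y)
print(f"variable {var}, threshold {threshold}, "
      f"training error {train_err}, test error {test_err}")
```

On this toy sample the search selects an income threshold that fits the training data perfectly, yet the rule misclassifies one of the four held-out cases. The training error of a model found by search tends to be optimistic, which is why the choice among candidate models is better judged on data not used in the search.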

Process or tools?
◼ DM often appears as a collection of tools, usually presented in one package, in such a way that several techniques may be compared on the same data set.
◼ But DM is a process, not only tools: Data → (preprocessing) → Information → (analysis) → Knowledge