Beihang University: Data Mining - Concepts and Techniques, course teaching resources (PPT lecture slides), Chapter 03: Data Preprocessing

Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
  ◼ Data Quality
  ◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary

Data Quality: Why Preprocess the Data?
◼ Measures for data quality: a multidimensional view
  ◼ Accuracy: correct or wrong, accurate or not
  ◼ Completeness: not recorded, unavailable, …
  ◼ Consistency: some modified but some not, dangling, …
  ◼ Timeliness: is the data updated in a timely way?
  ◼ Believability: how far can the data be trusted to be correct?
  ◼ Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
◼ Data cleaning
  ◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data integration
  ◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
  ◼ Dimensionality reduction
  ◼ Numerosity reduction
  ◼ Data compression
◼ Data transformation and data discretization
  ◼ Normalization
  ◼ Concept hierarchy generation


Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  ◼ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ◼ e.g., Occupation=“ ” (missing data)
  ◼ Noisy: containing noise, errors, or outliers
    ◼ e.g., Salary=“−10” (an error)
  ◼ Inconsistent: containing discrepancies in codes or names, e.g.,
    ◼ Age=“42”, Birthday=“03/07/2010”
    ◼ Was rating “1, 2, 3”, now rating “A, B, C”
    ◼ Discrepancy between duplicate records
  ◼ Intentional (e.g., disguised missing data)
    ◼ Jan. 1 as everyone’s birthday?
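The kinds of dirty data listed on this slide can be flagged with a few simple checks. The following is a minimal sketch, not a complete validator: the function name `find_problems`, the record layout, and the reference date are all assumptions made for illustration, and the slide's birthday "03/07/2010" is assumed here to mean March 7, 2010.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    # Incomplete: an empty or whitespace-only value counts as missing.
    if not str(record.get("occupation", "")).strip():
        problems.append("incomplete: occupation is missing")
    # Noisy: a negative salary cannot be a correct value.
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stated age should match the age implied by the birthday.
    born = record["birthday"]
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    implied_age = today.year - born.year - (0 if had_birthday else 1)
    if record["age"] != implied_age:
        problems.append(
            f"inconsistent: age {record['age']} but birthday implies {implied_age}"
        )
    # Possibly disguised missing data: a default-looking Jan. 1 birthday.
    if (born.month, born.day) == (1, 1):
        problems.append("suspicious: birthday is Jan. 1 (possible default value)")
    return problems

# Hypothetical record built from the slide's examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 3, 7)}
problems = find_problems(record, today=date(2024, 1, 15))
for p in problems:
    print(p)
```

A Jan. 1 birthday is only reported as suspicious, not as an error, because some people really are born on that date; a value like this usually needs a human or a frequency check across the whole table to confirm.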

Incomplete (Missing) Data
◼ Data is not always available
  ◼ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
◼ Missing data may be due to
  ◼ equipment malfunction
  ◼ data that was inconsistent with other recorded data and thus deleted
  ◼ data not entered due to misunderstanding
  ◼ certain data not being considered important at the time of entry
  ◼ failure to register the history or changes of the data
◼ Missing data may need to be inferred

How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with
  ◼ a global constant: e.g., “unknown”, a new class?!
  ◼ the attribute mean
  ◼ the attribute mean for all samples belonging to the same class: smarter
  ◼ the most probable value: inference-based, such as a Bayesian formula or a decision tree
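Three of the automatic fill-in strategies above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names, the use of `None` as the missing-value marker, and the `income` and `risk` sample data are assumptions, not part of the slides.

```python
from statistics import mean

def fill_constant(values, constant="unknown"):
    """Replace every missing value (None) with one global constant."""
    return [constant if v is None else v for v in values]

def fill_mean(values):
    """Replace every missing value with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def fill_class_mean(values, labels):
    """Replace a missing value with the mean of its own class.

    Assumes every class has at least one observed value; a class with
    no observed values would raise KeyError.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: income with two missing entries, and a class label per row.
income = [30, None, 50, 90, None, 110]
risk = ["high", "high", "high", "low", "low", "low"]

print(fill_constant(income))          # both gaps become "unknown"
print(fill_mean(income))              # both gaps become 70
print(fill_class_mean(income, risk))  # gaps become 40 (high) and 100 (low)
```

The overall mean (70) sits between the two classes and fits neither well, while the class-conditional means (40 and 100) stay close to the values actually observed in each class. That is the sense in which the slide calls the class-based variant "smarter".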

Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
  ◼ faulty data collection instruments
  ◼ data entry problems
  ◼ data transmission problems
  ◼ technology limitation
  ◼ inconsistency in naming convention
◼ Other data problems which require data cleaning
  ◼ duplicate records
  ◼ incomplete data
  ◼ inconsistent data

How to Handle Noisy Data?
◼ Binning
  ◼ first sort the data and partition it into (equal-frequency) bins
  ◼ then smooth by bin means, by bin medians, by bin boundaries, etc.
◼ Regression
  ◼ smooth by fitting the data to regression functions
◼ Clustering
  ◼ detect and remove outliers
◼ Combined computer and human inspection
  ◼ detect suspicious values and have a human check them (e.g., deal with possible outliers)
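The binning approach can be sketched as follows: sort, split into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The function names and the `prices` list are illustrative assumptions. The sketch also assumes that `n_bins` does not exceed the number of values, and it sends ties in boundary smoothing to the lower boundary.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

# Illustrative values, not taken from the slides.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means removes the most variation within each bin. Smoothing by boundaries keeps the extreme values of each bin intact, which preserves more of the original spread.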

Data Cleaning as a Process
◼ Data discrepancy detection
  ◼ Use metadata (e.g., domain, range, dependency, distribution)
  ◼ Check field overloading
  ◼ Check the uniqueness rule, consecutive rule, and null rule
  ◼ Use commercial tools
    ◼ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    ◼ Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
◼ Data migration and integration
  ◼ Data migration tools: allow transformations to be specified
  ◼ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
◼ Integration of the two processes
  ◼ Iterative and interactive (e.g., Potter’s Wheel)
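The three rules named under discrepancy detection can each be expressed as a small check. This is a minimal sketch under stated assumptions: the function names, the field names, the sample rows, and the set of null markers (empty string, "?", `None`) are hypothetical. The consecutive rule is taken to mean that an integer field has no gaps between its lowest and highest value.

```python
def check_unique(rows, field):
    """Uniqueness rule: return the values of `field` that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[field]
        if value in seen:
            dupes.add(value)
        else:
            seen.add(value)
    return sorted(dupes)

def check_consecutive(rows, field):
    """Consecutive rule: return integers missing between the min and max of `field`."""
    present = {row[field] for row in rows}
    return [n for n in range(min(present), max(present) + 1) if n not in present]

def check_null(rows, field, null_markers=("", "?", None)):
    """Null rule: return indexes of rows whose `field` holds a null marker."""
    return [i for i, row in enumerate(rows) if row[field] in null_markers]

# Hypothetical rows for illustration only.
rows = [
    {"check_no": 101, "customer_id": "C1", "postal_code": "10001"},
    {"check_no": 102, "customer_id": "C2", "postal_code": ""},
    {"check_no": 104, "customer_id": "C2", "postal_code": "?"},
]

print(check_unique(rows, "customer_id"))    # ['C2']
print(check_consecutive(rows, "check_no"))  # [103]
print(check_null(rows, "postal_code"))      # [1, 2]
```

Each check only reports candidates. Whether a duplicate ID or a gap in the numbering is a real error still depends on the metadata for that field, which is why the slide describes the process as iterative and interactive.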