Tongji University: Big Data Analysis and Mining (大数据分析与数据挖掘) course teaching resources (PPT lecture slides): Data Preprocessing

Big Data Analysis and Mining
Lecture 2: Data Preprocessing
Weixiong Rao 饶卫雄
School of Software Engineering, Tongji University (同济大学软件学院), 2015 Fall
wxrao@tongji.edu.cn
Slide date: 2021/2/8
*Some of the slides are from Dr. Jure Leskovec and Prof. Zachary G. Ives

Data Preprocessing
◼ Data Preprocessing: An Overview
  ◆ Data Quality
  ◆ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary

Data Quality: Why Preprocess the Data?
◼ Measures for data quality: a multidimensional view
  ◆ Accuracy: are the recorded values correct or wrong, accurate or not?
  ◆ Completeness: are values not recorded, unavailable, ...?
  ◆ Consistency: have some copies been modified but others not? Are there dangling references?
  ◆ Timeliness: are the data updated in a timely manner?
  ◆ Believability: how far can the data be trusted to be correct?
  ◆ Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
◼ Data cleaning
  ◆ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data integration
  ◆ Integration of multiple databases, data cubes, or files
◼ Data reduction
  ◆ Dimensionality reduction
  ◆ Numerosity reduction
  ◆ Data compression
◼ Data transformation and data discretization
  ◆ Normalization
  ◆ Concept hierarchy generation


Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data, e.g., caused by faulty instruments, human or computer error, or transmission error
  ◆ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      e.g., Occupation=“ ” (missing data)
  ◆ Noisy: containing noise, errors, or outliers
      e.g., Salary=“−10” (an error)
  ◆ Inconsistent: containing discrepancies in codes or names
      e.g., Age=“42”, Birthday=“03/07/2010”
      e.g., the rating was “1, 2, 3” and is now “A, B, C”
      e.g., discrepancies between duplicate records
  ◆ Intentional (e.g., disguised missing data)
      e.g., Jan. 1 as everyone’s birthday?
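The three kinds of dirty data on this slide can be illustrated with a few rule-based checks. The following is only a minimal sketch under stated assumptions: the field names (`occupation`, `salary`, `age`, `birthday`), the helper names, and the reference date are hypothetical, and “03/07/2010” is read as March 7, 2010, which the slide does not specify.

```python
from datetime import date


def age_on(birthday, today):
    """Return the age in whole years on the date `today`."""
    years = today.year - birthday.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def find_problems(record, today):
    """Return a list of human-readable data quality problems for one record."""
    problems = []
    # Incomplete: empty or whitespace-only attribute value.
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation is missing")
    # Noisy: a value outside the plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: two attributes that contradict each other.
    birthday, age = record.get("birthday"), record.get("age")
    if birthday is not None and age is not None and age_on(birthday, today) != age:
        problems.append("inconsistent: age does not match birthday")
    return problems


# Hypothetical record combining the slide's three examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 3, 7)}
problems = find_problems(record, today=date(2015, 9, 1))
for p in problems:
    print(p)
```

Checks like these only flag suspicious records; deciding whether to repair, infer, or drop a value is a separate step.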

Incomplete (Missing) Data
◼ Data is not always available
  ◆ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
◼ Missing data may be due to
  ◆ equipment malfunction
  ◆ values deleted because they were inconsistent with other recorded data
  ◆ data not entered due to misunderstanding
  ◆ certain data not being considered important at the time of entry
  ◆ history or changes of the data not being recorded
◼ Missing data may need to be inferred

How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious, and often infeasible
◼ Fill it in automatically with
  ◆ a global constant, e.g., “unknown” (effectively a new class?!)
  ◆ the attribute mean
  ◆ the attribute mean for all samples belonging to the same class: smarter
  ◆ the most probable value: inference-based, such as a Bayesian formula or a decision tree
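The automatic fill-in strategies above can be sketched in a few lines. This is an illustrative sketch, not the lecture's own code: the attribute names (`income`, `risk`), the function names, and the sample rows are made up, and a missing value is assumed to be represented as `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_with_constant(rows, attr, constant="unknown"):
    """Replace missing values of `attr` with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values with the mean of `attr` within the same class.

    Assumes every class has at least one observed value of `attr`.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    class_means = {c: mean(vals) for c, vals in observed.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical sales data: income in thousands, risk as the class label.
customers = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 110, "risk": "high"},
]

by_mean = fill_with_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "risk")
print([r["income"] for r in by_mean])   # overall mean fills both gaps
print([r["income"] for r in by_class])  # each gap gets its own class mean
```

In this toy data the overall mean is 70, while the class means are 40 (low) and 100 (high), which shows why the class-conditional fill is described as smarter: it keeps the imputed value close to similar tuples.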

Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
  ◆ faulty data collection instruments
  ◆ data entry problems
  ◆ data transmission problems
  ◆ technology limitations
  ◆ inconsistency in naming conventions
◼ Other data problems that require data cleaning
  ◆ duplicate records
  ◆ incomplete data
  ◆ inconsistent data

How to Handle Noisy Data?
◼ Binning
  ◆ first sort the data and partition it into (equal-frequency) bins
  ◆ then smooth by bin means, by bin medians, by bin boundaries, etc.
◼ Regression
  ◆ smooth by fitting the data to regression functions
◼ Clustering
  ◆ detect and remove outliers
◼ Combined computer and human inspection
  ◆ detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
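The binning approach can be sketched as follows: sort the values, split them into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The function names and the price list are illustrative assumptions, and the sketch assumes there are at least as many values as bins.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Hypothetical sorted price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Ties in `smooth_by_boundaries` go to the lower boundary here; that choice is arbitrary and only one of several reasonable conventions.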