Chongqing University: Data Warehouse and Data Mining, course PPT teaching slides (English version), Chapter 3 Data Preprocessing

Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary

Why Data Preprocessing?
◼ Data in the real world is dirty
  ◼ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ◼ e.g., occupation=“ ”
  ◼ noisy: containing errors or outliers
    ◼ e.g., Salary=“-10”
  ◼ inconsistent: containing discrepancies in codes or names
    ◼ e.g., Age=“42”, Birthday=“03/07/1997”
    ◼ e.g., was rating “1, 2, 3”, now rating “A, B, C”
    ◼ e.g., discrepancy between duplicate records
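
The three kinds of dirtiness listed above can each be tested with a simple rule. The following is a minimal sketch, not a general-purpose validator. The function name `audit_record`, the field names, the sample records, and the fixed reference date are all hypothetical, and the birthday “03/07/1997” is assumed to mean March 7, 1997.

```python
from datetime import date


def audit_record(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []

    # Incomplete: the attribute is absent, None, or only whitespace.
    occupation = record.get("occupation")
    if occupation is None or not occupation.strip():
        problems.append("incomplete: occupation is missing")

    # Noisy: the value lies outside its plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            problems.append("inconsistent: age does not match birthday")

    return problems


# Hypothetical records modelled on the slide's examples.
dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
clean = {"occupation": "engineer", "salary": 5000, "age": 26, "birthday": date(1997, 3, 7)}
reference_day = date(2024, 1, 1)  # fixed so the example is reproducible

dirty_problems = audit_record(dirty, reference_day)
clean_problems = audit_record(clean, reference_day)
```

With these inputs, `dirty_problems` holds one entry per category and `clean_problems` is empty. Real checks would need domain-specific ranges and an agreed date format.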

Why Is Data Dirty?
◼ Incomplete data may come from
  ◼ “Not applicable” data value when collected
  ◼ Different considerations between the time when the data was collected and when it is analyzed
  ◼ Human/hardware/software problems
◼ Noisy data (incorrect values) may come from
  ◼ Faulty data collection instruments
  ◼ Human or computer error at data entry
  ◼ Errors in data transmission
◼ Inconsistent data may come from
  ◼ Different data sources
  ◼ Functional dependency violation (e.g., modify some linked data)
◼ Duplicate records also need data cleaning
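
Two of the causes above, duplicate records and functional dependency violations, can be checked mechanically. The sketch below assumes rows are plain dictionaries and that `zip` is expected to determine `city`. The helper names `exact_duplicates` and `fd_violations` and the sample rows are made up for illustration.

```python
from collections import Counter, defaultdict


def exact_duplicates(rows):
    """Return the rows that occur more than once, with their counts."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def fd_violations(rows, determinant, dependent):
    """Check the functional dependency determinant -> dependent.

    Returns each determinant value that maps to more than one dependent value.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Made-up rows: the second is an exact duplicate, the third breaks zip -> city.
rows = [
    {"customer_id": 1, "zip": "Z1", "city": "Alpha"},
    {"customer_id": 1, "zip": "Z1", "city": "Alpha"},
    {"customer_id": 2, "zip": "Z1", "city": "Beta"},
    {"customer_id": 3, "zip": "Z2", "city": "Beta"},
]

duplicates = exact_duplicates(rows)
violations = fd_violations(rows, "zip", "city")
```

Here `violations` reports that zip `Z1` maps to two different cities. The check only flags the conflict and cannot say which value is correct.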

Data Quality: Why Preprocess the Data?
◼ Measures for data quality: a multidimensional view
  ◼ Accuracy: correct or wrong, accurate or not
  ◼ Completeness: not recorded, unavailable, …
  ◼ Consistency: some modified but some not, dangling, …
  ◼ Timeliness: are the data updated in a timely way?
  ◼ Believability: how far can the data be trusted to be correct?
  ◼ Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
◼ Data cleaning
  ◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data integration
  ◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
  ◼ Dimensionality reduction
  ◼ Numerosity reduction
  ◼ Data compression
◼ Data transformation and data discretization
  ◼ Normalization
  ◼ Concept hierarchy generation

Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  ◼ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ◼ e.g., Occupation=“ ” (missing data)
  ◼ noisy: containing noise, errors, or outliers
    ◼ e.g., Salary=“−10” (an error)
  ◼ inconsistent: containing discrepancies in codes or names, e.g.,
    ◼ Age=“42”, Birthday=“03/07/2010”
    ◼ Was rating “1, 2, 3”, now rating “A, B, C”
    ◼ discrepancy between duplicate records
  ◼ intentional (e.g., disguised missing data)
    ◼ Jan. 1 as everyone’s birthday?
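
Disguised missing data, such as the January 1 birthday above, tends to show up as one value that is far more frequent than chance would suggest. One possible heuristic, sketched here under that assumption, is to flag any value whose share of a column exceeds a threshold. The function name `suspicious_defaults`, the 20% threshold, and the sample birthdays are illustrative choices, not part of the slides.

```python
from collections import Counter


def suspicious_defaults(values, min_share=0.2):
    """Return (value, share) pairs for values making up at least min_share of the column."""
    total = len(values)
    if total == 0:
        return []
    counts = Counter(values)
    return [(v, n / total) for v, n in counts.most_common() if n / total >= min_share]


# Made-up (month, day) birthdays; January 1 appears in half of the records.
birthdays = [(1, 1), (5, 17), (1, 1), (9, 2), (1, 1),
             (12, 30), (1, 1), (3, 8), (1, 1), (7, 21)]

flagged = suspicious_defaults(birthdays)
```

A flagged value is only a candidate for human review. Some attributes legitimately have a dominant value, so the threshold has to be chosen per attribute.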

Incomplete (Missing) Data
◼ Data is not always available
  ◼ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
◼ Missing data may be due to
  ◼ equipment malfunction
  ◼ data that was inconsistent with other recorded data and thus deleted
  ◼ data not entered due to misunderstanding
  ◼ certain data not being considered important at the time of entry
  ◼ history or changes of the data not being recorded
◼ Missing data may need to be inferred

How to Handle Missing Data?
◼ Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious and possibly infeasible
◼ Fill it in automatically with
  ◼ a global constant, e.g., “unknown” (which may end up acting as a new class)
  ◼ the attribute mean
  ◼ the attribute mean for all samples belonging to the same class: smarter
  ◼ the most probable value: inference-based, such as a Bayesian formula or a decision tree
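
The first three automatic strategies above can be sketched in a few lines. This is an illustration under simplifying assumptions: rows are dictionaries, a missing value is represented by `None`, and at least one value of the attribute is observed. The function names, the `income` and `cls` fields, and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_constant(rows, attr, constant="unknown"):
    """Replace every missing value of attr with one global constant."""
    return [{**r, attr: constant} if r[attr] is None else dict(r) for r in rows]


def fill_with_mean(rows, attr):
    """Replace every missing value of attr with the mean of the observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall} if r[attr] is None else dict(r) for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace a missing value with the mean of attr within the row's own class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    # Fall back to the overall mean when a class has no observed value.
    overall = mean(v for vals in by_class.values() for v in vals)
    return [
        {**r, attr: class_means.get(r[label], overall)} if r[attr] is None else dict(r)
        for r in rows
    ]


# Made-up sample: one missing income in each class.
rows = [
    {"income": 30.0, "cls": "low"},
    {"income": 50.0, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 90.0, "cls": "high"},
    {"income": 110.0, "cls": "high"},
    {"income": None, "cls": "high"},
]

by_const = fill_with_constant(rows, "income")
by_mean = fill_with_mean(rows, "income")
by_class = fill_with_class_mean(rows, "income", "cls")
```

On this sample the overall mean fills both gaps with 70.0, while the class-conditional mean fills them with 40.0 and 100.0. That difference is why the slide calls the per-class variant smarter. The inference-based option (Bayesian formula or decision tree) is not shown here.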

Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
  ◼ faulty data collection instruments
  ◼ data entry problems
  ◼ data transmission problems
  ◼ technology limitation
  ◼ inconsistency in naming convention
◼ Other data problems which require data cleaning
  ◼ duplicate records
  ◼ incomplete data
  ◼ inconsistent data

How to Handle Noisy Data?
◼ Binning
  ◼ first sort the data and partition it into (equal-frequency) bins
  ◼ then smooth by bin means, by bin medians, by bin boundaries, etc.
◼ Regression
  ◼ smooth by fitting the data to regression functions
◼ Clustering
  ◼ detect and remove outliers
◼ Combined computer and human inspection
  ◼ detect suspicious values automatically and have a human check them (e.g., deal with possible outliers)
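
The binning steps above can be sketched as follows: sort, split into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The function names and the sample prices are chosen for illustration. The sketch assumes there are at least as many values as bins, and it resolves ties between boundaries in favour of the lower one.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # first `extra` bins get one more
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Sample prices chosen for illustration.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

For these prices the bins are `[4, 8, 9, 15]`, `[21, 21, 24, 25]` and `[26, 28, 29, 34]`. The bin means are 9.0, 22.75 and 29.25, and boundary smoothing turns the first bin into `[4, 4, 4, 15]`. Smoothing by bin medians follows the same pattern with `statistics.median` in place of the mean.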