Peking University: Text Mining Techniques (PPT lecture notes), Text Categorization

Text Categorization PengBo 10/31/2010

Lecture Outline
◼ Text Categorization
  ◼ Problem definition
  ◼ Build a classifier
  ◼ Naïve Bayes Classifier
  ◼ K-Nearest Neighbor Classifier
  ◼ Evaluation

Definition
◼ Given:
  ◼ An instance x ∈ X, where X is the instance language or instance space.
    ◼ Issue: how to represent text documents.
  ◼ A fixed set of categories: C = {c1, c2, …, cn}
◼ Determine:
  ◼ The category of x: c(x) ∈ C, where c(x) is a categorization function.
  ◼ We want to know how to build categorization functions ("classifiers").
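To make the definition concrete, here is a minimal Python sketch of a categorization function c: X → C. It assumes instances are plain strings and categories are string labels. The names `build_majority_classifier` and `training` are hypothetical, and the majority-class rule is only a placeholder baseline used to show the interface, not a method from the lecture.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

Instance = str                      # assumption: a document is a plain string
Category = str                      # assumption: a category is a string label
Classifier = Callable[[Instance], Category]


def build_majority_classifier(
    training: Iterable[Tuple[Instance, Category]]
) -> Classifier:
    """Build a trivial classifier c(x) that always returns the most
    frequent category seen in the labeled training data."""
    counts = Counter(label for _, label in training)
    if not counts:
        raise ValueError("training data must not be empty")
    majority = counts.most_common(1)[0][0]

    def c(x: Instance) -> Category:
        # Ignores x entirely; a real classifier would inspect the document.
        return majority

    return c


# Hypothetical toy training set.
training = [
    ("stocks fell today", "finance"),
    ("bond yields rise", "finance"),
    ("team wins final", "sports"),
]
c = build_majority_classifier(training)
print(c("any new document"))  # prints "finance"
```

Any real classifier keeps the same shape: it is built from labeled examples and returns one element of C for each instance x.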

Text Categorization Examples
Assign labels to each document or web page:
◼ Labels are most often topics, such as Yahoo categories
  ◼ e.g., "finance", "sports", "news>world>asia>business"
◼ Labels may be genres
  ◼ e.g., "editorials", "movie-reviews", "news"
◼ Labels may be opinion
  ◼ e.g., "like", "hate", "neutral"
◼ Labels may be domain-specific and binary
  ◼ e.g., "interesting-to-me" : "not-interesting-to-me"
  ◼ e.g., "spam" : "not-spam"
  ◼ e.g., "contains adult language" : "doesn't"

Classification Methods
◼ Manual classification
  ◼ Used by Yahoo!, Looksmart, about.com, ODP, Medline
  ◼ Accurate but expensive to scale
◼ Automatic document classification
  ◼ Rule-based: hand-coded rule-based systems
    ◼ Spam mail filter, …
  ◼ Supervised learning of a document-label assignment function
    ◼ No free lunch: requires hand-classified training data
◼ Note that many commercial systems use a mixture of methods
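As an illustration of the hand-coded, rule-based approach, the sketch below shows a toy spam filter in Python. The patterns in `SPAM_RULES` and the function name `rule_based_spam_filter` are invented for this example and are not taken from any real filter. It shows why such systems are costly to maintain: every rule has to be written and updated by hand.

```python
import re

# Hypothetical hand-written rules; a real filter would need many more.
SPAM_RULES = [
    re.compile(r"\bfree money\b", re.IGNORECASE),
    re.compile(r"\bclick here\b", re.IGNORECASE),
    re.compile(r"\bwinner\b", re.IGNORECASE),
]


def rule_based_spam_filter(message: str) -> str:
    """Return "spam" if any hand-coded rule matches, else "not-spam"."""
    if any(rule.search(message) for rule in SPAM_RULES):
        return "spam"
    return "not-spam"


print(rule_based_spam_filter("Click here to claim FREE MONEY"))  # prints "spam"
print(rule_based_spam_filter("Meeting moved to 10am tomorrow"))  # prints "not-spam"
```

A supervised learner replaces the hand-written rule list with a function estimated from hand-classified training data.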

Think about it…
◼ How to represent text documents and categories?
  ◼ Vectors & Regions
  ◼ String & Language (models)
◼ How to build categorization functions?
  ◼ Closeness/Similarity to regions
  ◼ Probability to generate the string/language model
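The vector view can be sketched as follows. This is a minimal example that assumes whitespace tokenization and raw term counts (no stemming, stop-word removal, or tf-idf weighting), with cosine similarity as the closeness measure. The helper names `to_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def to_vector(text: str) -> Counter:
    """Represent a document as a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


d1 = to_vector("the senate passed the budget")
d2 = to_vector("the senate debated the budget")
d3 = to_vector("orchestra performs symphony")
print(cosine_similarity(d1, d2) > cosine_similarity(d1, d3))  # prints True
```

Documents that share many terms point in similar directions, which is what lets a class be treated as a region of the space.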

K-Nearest Neighbors

Classes in a Vector Space
[Figure: documents in a vector space, with classes Government, Science, Arts]

Classification Using Vector Spaces
◼ Each training doc is a point (vector) labeled by its topic (= class)
◼ Hypothesis: docs of the same class form a contiguous region of space
◼ We define surfaces to delineate classes in space

Test Document = Government
[Figure: a test document among the classes Government, Science, Arts]
◼ Is the similarity hypothesis true in general?
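One way to act on the contiguity hypothesis is a k-nearest-neighbor vote: label the test document with the majority class among its k most similar training documents. The sketch below assumes bag-of-words count vectors and cosine similarity. The training sentences and the function `knn_classify` are made up for illustration and are not from the lecture.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words count vector (assumption: whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(test_doc, training, k=3):
    """Return the majority label among the k training docs most similar
    to test_doc. Vote ties are broken by the nearest neighbor's order."""
    test_vec = to_vector(test_doc)
    ranked = sorted(
        training,
        key=lambda pair: cosine(test_vec, to_vector(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training documents.
training = [
    ("parliament passes new tax law", "Government"),
    ("senate votes on budget law", "Government"),
    ("minister announces election law", "Government"),
    ("physicists measure particle mass", "Science"),
    ("biologists sequence new genome", "Science"),
    ("gallery opens painting exhibition", "Arts"),
    ("orchestra performs new symphony", "Arts"),
]
print(knn_classify("parliament votes on election law", training, k=3))
# prints "Government"
```

The choice of k and of the similarity measure are design decisions. If the similarity hypothesis fails for a collection, the nearest neighbors will carry mixed labels and the vote becomes unreliable.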