The Hong Kong University of Science and Technology: Knowledge Management with Documents (PPT lecture notes)
Knowledge Management with Documents
Qiang Yang, HKUST
Thanks: Professor Dik Lee, HKUST
Keyword Extraction
◼ Goal:
  ◼ given N documents, each consisting of words,
  ◼ extract the most significant subset of words → keywords
◼ Example
  ◼ [All the students are taking exams] → [student, take, exam]
◼ Keyword Extraction Process
  ◼ remove stop words
  ◼ stem remaining terms
  ◼ collapse terms using thesaurus
  ◼ build inverted index
  ◼ extract key words - build key word index
  ◼ extract key phrases - build key phrase index
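The first steps of the process above (stop-word removal, stemming, inverted index) can be sketched roughly as follows. This is only an illustration: the stop-word set, the two-entry suffix list, and the function names `stem`, `extract_terms`, and `build_inverted_index` are assumptions made for this sketch, not part of the lecture. Note that the toy stemmer yields `tak` where the slide shows `take`; a real stemmer or lemmatizer would handle such cases with more rules.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"all", "the", "are", "a", "of", "to"}
SUFFIXES = ("ing", "s")  # toy suffix list, tried in this order


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in extract_terms(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = ["All the students are taking exams", "The exam of a student"]
index = build_inverted_index(docs)
```

Here `index["exam"]` lists both documents, because "exams" and "exam" collapse to the same stem before indexing.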
Stop Words and Stemming
◼ From a given Stop Word List
  ◼ [a, about, again, are, the, to, of, …]
  ◼ Remove them from the documents
◼ Or, determine stop words
  ◼ Given a large enough corpus of common English
  ◼ Sort the list of words in decreasing order of their occurrence frequency in the corpus
  ◼ Zipf’s law: Frequency × rank ≈ constant
    ◼ most frequent words tend to be short
    ◼ most frequent 20% of words account for 60% of usage
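The second option, deriving stop words from corpus frequencies, might look like the sketch below. The function names and the `top_k` cut-off are assumptions for illustration; the slide does not say how many of the most frequent words to treat as stop words.

```python
import re
from collections import Counter


def candidate_stop_words(corpus_texts, top_k):
    """Return the top_k most frequent words, in decreasing frequency order."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [word for word, _ in counts.most_common(top_k)]


def remove_stop_words(text, stop_words):
    """Drop every token that appears in the stop-word collection."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


corpus = ["the cat sat on the mat", "the dog and the cat"]
stops = candidate_stop_words(corpus, top_k=2)
```

On a corpus this small the result is not meaningful; the approach presumes a large sample of common English, as the slide states.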
Zipf’s Law -- An illustration

Rank (R)   Term   Frequency (F)   R*F / 10**6
1          the    69,971          0.070
2          of     36,411          0.073
3          and    28,852          0.086
4          to     26,149          0.104
5          a      23,237          0.116
6          in     21,341          0.128
7          that   10,595          0.074
8          is     10,009          0.081
9          was    9,816           0.088
10         he     9,543           0.095
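The last column of the table can be recomputed from the first three, which makes the "roughly constant" claim easy to check. The sketch below copies the rank, term, and frequency values from the table; the function name `rank_frequency_products` is an assumption. Recomputed values may differ from the slide in the last printed digit (for example, 8 × 10,009 = 80,072), presumably because of rounding or slightly different source counts.

```python
# (rank, term, frequency) rows copied from the table above.
ZIPF_ROWS = [
    (1, "the", 69971),
    (2, "of", 36411),
    (3, "and", 28852),
    (4, "to", 26149),
    (5, "a", 23237),
    (6, "in", 21341),
    (7, "that", 10595),
    (8, "is", 10009),
    (9, "was", 9816),
    (10, "he", 9543),
]


def rank_frequency_products(rows, scale=10**6):
    """Return (term, rank * frequency / scale) for every row."""
    return [(term, rank * freq / scale) for rank, term, freq in rows]


products = rank_frequency_products(ZIPF_ROWS)
for term, value in products:
    print(f"{term:5s} {value:.3f}")
```

While the frequencies fall by a factor of about seven from rank 1 to rank 10, the products stay within a factor of two of one another.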
Resolving Power of Word
[Figure: words in decreasing frequency order along the axis, with regions labeled "non-significant high-frequency terms" and "non-significant low-frequency terms", and a curve for the "presumed resolving power of significant words".]
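One way to act on this picture is to keep only the terms whose frequency falls between an upper and a lower cut-off. The sketch below assumes exactly that; the cut-off values and the function name `significant_terms` are illustrative choices, since the slide shows the idea only qualitatively.

```python
from collections import Counter


def significant_terms(counts, upper_cutoff, lower_cutoff):
    """Keep terms whose frequency lies between the two cut-offs (inclusive)."""
    return sorted(
        term for term, freq in counts.items()
        if lower_cutoff <= freq <= upper_cutoff
    )


# Made-up counts for illustration only.
counts = Counter({"the": 50, "database": 7, "index": 5, "zymurgy": 1})
kept = significant_terms(counts, upper_cutoff=10, lower_cutoff=2)
```

With these made-up counts, "the" is discarded as too frequent and "zymurgy" as too rare, leaving the mid-frequency terms.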
Stemming
◼ The next task is stemming: transforming words to root form
  ◼ Computing, Computer, Computation → comput
◼ Suffix based methods
  ◼ Remove “ability” from “computability”
  ◼ “…”+ness, “…”+ive → remove
  ◼ Suffix list + context rules
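A suffix list combined with a single context rule can be sketched as below. The suffix list is assembled from the examples on the slide plus "ation", "ing", and "er" so that the three "comput" examples work; the minimum stem length of 3 is an assumed context rule. This is far simpler than a production stemmer such as Porter's.

```python
# Longest suffixes are tried first so "ability" wins over shorter matches.
SUFFIXES = sorted(
    ["ability", "ation", "ness", "ive", "ing", "er"], key=len, reverse=True
)
MIN_STEM_LENGTH = 3  # assumed context rule: never leave a very short stem


def strip_suffix(word):
    """Remove the longest listed suffix that leaves a long-enough stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


stems = [strip_suffix(w) for w in ["Computing", "Computer", "Computation"]]
```

The context rule is what keeps short words such as "ring" or "her" intact even though they end in a listed suffix.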
Thesaurus Rules
◼ A thesaurus aims at
  ◼ classification of words in a language
  ◼ for a word, it gives related terms which are broader than, narrower than, same as (synonyms) and opposed to (antonyms) of the given word (other kinds of relationships may exist, e.g., composed of)
◼ Static Thesaurus Tables
  ◼ [anneal, strain], [antenna, receiver], …
  ◼ Roget’s thesaurus
  ◼ WordNet at Princeton
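Collapsing terms with a static thesaurus table can be as simple as a dictionary lookup. The two groups below come from the slide; using the first term of each group as the canonical form, and the names `THESAURUS_GROUPS`, `CANONICAL`, and `collapse_terms`, are assumptions made for this sketch.

```python
# Each group lists terms that the static table treats as equivalent.
THESAURUS_GROUPS = [["anneal", "strain"], ["antenna", "receiver"]]

# Assumed convention: the first term of a group is its canonical form.
CANONICAL = {term: group[0] for group in THESAURUS_GROUPS for term in group}


def collapse_terms(terms):
    """Replace each term by its canonical form; unknown terms pass through."""
    return [CANONICAL.get(term, term) for term in terms]


collapsed = collapse_terms(["receiver", "strain", "database"])
```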
Thesaurus Rules can also be Learned
◼ From a search engine query log
  ◼ After typing queries, browse…
  ◼ If query1 and query2 lead to the same document
    ◼ Then, Similar(query1, query2)
  ◼ If query1 leads to a document with title keyword K,
    ◼ Then, Similar(query1, K)
  ◼ Then, transitivity…
◼ Microsoft Research China’s work in WWW10 (Wen, et al.) on Encarta online
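The first rule and the transitivity step can be sketched with a union-find structure over queries. This is a simplified illustration and not the method of Wen et al.: the log format (pairs of query and clicked document id) and the function name `similar_query_groups` are assumptions, and the title-keyword rule is left out.

```python
from collections import defaultdict


def similar_query_groups(click_log):
    """Group queries that lead, directly or transitively, to a shared document.

    click_log: iterable of (query, clicked_document_id) pairs (assumed format).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_query_for_doc = {}
    for query, doc in click_log:
        find(query)  # register the query even if it shares no document
        if doc in first_query_for_doc:
            union(first_query_for_doc[doc], query)  # Similar(query1, query2)
        else:
            first_query_for_doc[doc] = query

    groups = defaultdict(set)
    for query in list(parent):
        groups[find(query)].add(query)
    return sorted(sorted(group) for group in groups.values())


log = [
    ("car", "d1"), ("automobile", "d1"),
    ("automobile", "d2"), ("auto", "d2"),
    ("python", "d3"),
]
groups = similar_query_groups(log)
```

In this made-up log, "car" and "auto" never share a document, yet they end up in one group through "automobile", which is the transitivity step.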
The Vector-Space Model
◼ T distinct terms are available; call them index terms or the vocabulary
◼ The index terms represent important terms for an application → a vector to represent the document
◼ Example: index terms (vocabulary) of a computer science collection
  ◼ T1 = architecture, T2 = bus, T3 = computer, T4 = database, T5 = xml
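Turning a document into a vector over the five index terms might look like this. The vocabulary comes from the slide; using raw term counts as the vector entries is an assumption, since the lecture defers the question of weights, and `to_vector` is a hypothetical name.

```python
import re

# Index terms T1..T5 from the computer science collection example.
VOCABULARY = ["architecture", "bus", "computer", "database", "xml"]


def to_vector(text, vocabulary=VOCABULARY):
    """Return one entry per index term; here simply the raw term count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in vocabulary]


vector = to_vector("Computer architecture: the bus connects the computer to memory")
```

Words outside the vocabulary, such as "memory", simply do not appear in the vector.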
The Vector-Space Model
◼ Assumptions: words are uncorrelated
◼ Given:
  1. n documents and a query
  2. The query is considered a document too
  3. Each is represented by t terms
  4. Each term j in document i has weight d_ij
  5. We will deal with how to compute the weights later

        T1    T2    …    Tt
  D1    d11   d12   …    d1t
  D2    d21   d22   …    d2t
  :     :     :          :
  Dn    dn1   dn2   …    dnt

  Q = (q1, q2, …, qt)
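The matrix and the query vector can be built with the same weighting function, which is what "the query is considered a document too" amounts to in code. In the sketch below, raw counts stand in for the weights d_ij, which the lecture computes later; the function name `build_matrix` and the sample texts are assumptions.

```python
def build_matrix(docs, query, vocabulary):
    """Return (matrix, q): one weight row per document plus the query row."""

    def weights(text):
        tokens = text.lower().split()
        # Placeholder weight: raw count of term j in the text.
        return [tokens.count(term) for term in vocabulary]

    matrix = [weights(doc) for doc in docs]  # row i holds d_i1 ... d_it
    q = weights(query)                       # the query is treated like a document
    return matrix, q


vocabulary = ["bus", "computer", "database"]
docs = ["computer bus", "database database computer"]
matrix, q = build_matrix(docs, "computer database", vocabulary)
```

Swapping in a different weighting scheme later only requires changing the inner `weights` function, since documents and the query share it.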