香港科技大学：基于文档的知识管理（PPT课件讲稿）Knowledge Management with Documents

点击下载完整版文档（PPT）

文档信息

资源类别：文库
文档格式：PPT
文档页数：70
文件大小：378KB
团购合买：点击进入团购

内容简介

香港科技大学：基于文档的知识管理（PPT课件讲稿）Knowledge Management with Documents

刷新页面文档预览

Knowledge Management with Documents Qiang Yang HKUST Thanks: Professor Dik Lee, HKUST

1 Knowledge Management with Documents Qiang Yang HKUST Thanks: Professor Dik Lee, HKUST

Keyword Extraction Goal given n documents each consisting of words, extract the most significant subset of words> keywords Example [All the students are taking exams]-->[student, take, exam Keyword Extraction Process remove stop words stem remaining terms collapse terms using thesaurus build inverted index extract key words-build key word index extract key phrases- build key phrase index

2 Keyword Extraction ◼ Goal: ◼ given N documents, each consisting of words, ◼ extract the most significant subset of words → keywords ◼ Example ◼ [All the students are taking exams] -- >[student, take, exam] ◼ Keyword Extraction Process ◼ remove stop words ◼ stem remaining terms ◼ collapse terms using thesaurus ◼ build inverted index ◼ extract key words - build key word index ◼ extract key phrases - build key phrase index i t

Stop Words and Stemming From a given Stop Word list a, about, again, are, the to, of, Remove them from the documents Or, determine stop words Given a large enough corpus of common English Sort the list of words in decreasing order of their occurrence frequency in the corpus Zipf's law: Frequency * rank x constant most frequent words tend to be short most frequent 20% of words account for 60% of usage

3 Stop Words and Stemming ◼ From a given Stop Word List ◼ [a, about, again, are, the, to, of, …] ◼ Remove them from the documents ◼ Or, determine stop words ◼ Given a large enough corpus of common English ◼ Sort the list of words in decreasing order of their occurrence frequency in the corpus ◼ Zipf’s law: Frequency * rank  constant ◼ most frequent words tend to be short ◼ most frequent 20% of words account for 60% of usage

Zipf's Law--An illustration Rank(R) Term Frequency(F)R*F(10**6) the 69971 0.070 of 36411 0.073 23456789 and 28.852 0.086 to 26.149 0.104 a 232370.116 In 21.341 0.128 that 10.5950074 10009 0.081 was 9816 0.088 10 he 9543 0.095

4 Zipf’s Law -- An illustration Rank(R) Term Frequency (F) R*F (10**6) 1 the 69,971 0.070 2 of 36,411 0.073 3 and 28,852 0.086 4 to 26,149 0.104 5 a 23,237 0.116 6 in 21,341 0.128 7 that 10,595 0.074 8 is 10,009 0.081 9 was 9,816 0.088 10 he 9,543 0.095

Resolving power of word Non-significant Non-significant high-frequency low-frequency terms terms Presumed resolving power of significant words Words in decreasing frequency order

5 Resolving Power of Word Words in decreasing frequency order Non-significant high-frequency terms Non-significant low-frequency terms Presumed resolving power of significant words

Stemming a The next task is stemming transforming words to root form Computing, Computer, Computation >comput Suffix based methods Remove ability"from"computability L+ness, +ive, >remove Suffix list context rules

6 Stemming ◼ The next task is stemming: transforming words to root form ◼ Computing, Computer, Computation →comput ◼ Suffix based methods ◼ Remove “ability” from “computability” ◼ “…”+ness, “…”+ive, → remove ◼ Suffix list + context rules

Thesaurus rules a thesaurus aims at classification of words in a language for a word it gives related terms which are broader than narrower than same as (synonyms)and opposed to(antonyms)of the given word (other kinds of relationships may exist, e.g, composed of) Static Thesaurus tables anneal, strain], [antenna, receiver] Roget's thesaurus WordNet at princeton 7

7 Thesaurus Rules ◼ A thesaurus aims at ◼ classification of words in a language ◼ for a word, it gives related terms which are broader than, narrower than, same as (synonyms) and opposed to (antonyms) of the given word (other kinds of relationships may exist, e.g., composed of) ◼ Static Thesaurus Tables ◼ [anneal, strain], [antenna, receiver], … ◼ Roget’s thesaurus ◼ WordNet at Preinceton

Thesaurus rules can also be learned From a search engine query log After typing queries browse If query 1 and query2 leads to the same document Then, Similar(query l, query2) If queryl leads to Document with title keyword K Then, Similar(query1, k) Then transitivity Microsoft research china s work in WWW10 (Wen et al )on Encarta online

8 Thesaurus Rules can also be Learned ◼ From a search engine query log ◼ After typing queries, browse… ◼ If query1 and query2 leads to the same document ◼ Then, Similar(query1, query2) ◼ If query1 leads to Document with title keyword K, ◼ Then, Similar(query1, K) ◼ Then, transitivity… ◼ Microsoft Research China’s work in WWW10 (Wen, et al.) on Encarta online

The vector-Space Model Distinct terms are available call them index terms or the vocabulary The index terms represent important terms for an application a vector to represent the document nor T1=architecture 2=bu T3==computer T4=database T5=xm computer science collection index terms or vocabulary of the colelction

9 The Vector-Space Model ◼ T distinct terms are available; call them index terms or the vocabulary ◼ The index terms represent important terms for an application → a vector to represent the document ◼ or T1=architecture T2=bus T3=computer T4=database T5=xml computer science collection index terms or vocabulary of the colelction

The vector-Space Model Assumptions: words are uncorrelated Given 1.n documents and a query TT 2. Query considered a document 1 012 too DD 2. Each represented by t terms 202122 3. Each term in document i has weight 4. We will deal with how to compute the weights later

10 The Vector-Space Model ◼ Assumptions: words are uncorrelated T1 T2 …. Tt D1 d11 d12 … d1t D2 d21 d22 … d2t : : : : : : : : Dn dn1 dn2 … dnt Given: 1. N documents and a Query 2. Query considered a document too 2. Each represented by t terms 3. Each term j in document i has weight 4. We will deal with how to compute the weights later ij d Q q q qt ... 1 2

共70页，可试读20页，点击继续阅读 ↓

刷新页面下载完整文档

VIP每日下载上限内不扣除下载券和下载次数；
按次数下载不扣除下载券；
注册用户24小时内重复下载只扣除一次；
顺序：VIP每日次数-->可用次数-->下载券；

点击下载完整版文档（PPT）