Web Search and Mining (course teaching resources, PPT lecture slides), Lecture 03: The term vocabulary and postings lists

Web Search and Mining
Lecture 3: The term vocabulary and postings lists

Recap of the previous lecture
▪ Basic inverted indexes:
  ▪ Structure: Dictionary and Postings
      Brutus → 1 2 4 11 31 45 173 174
      Caesar → 1 2 4 5 6 16 57 132
      Calpurnia → 2 31 54 101
  ▪ Key step in construction: Sorting
▪ Boolean query processing
  ▪ Intersection by linear time “merging”
  ▪ Simple optimizations
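
The linear-time "merging" from the recap can be sketched as a two-pointer walk over two postings lists sorted by docID. The function name `intersect` is illustrative, and the lists are the Brutus and Calpurnia postings from the recap slide; this is a minimal sketch, not a full query processor.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by ascending docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each pointer only moves forward, so the work is proportional to the combined length of the two lists.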

Plan for this lecture
Elaborate basic indexing
▪ Preprocessing to form the term vocabulary
  ▪ Documents
  ▪ Tokenization
  ▪ What terms do we put in the index?
▪ Postings
  ▪ Faster merges: skip lists
  ▪ Positional postings and phrase queries

Recall the basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
  → Tokenizer
Token stream: Friends Romans Countrymen
  → Linguistic modules
Modified tokens: friend roman countryman
  → Indexer
Inverted index:
  friend → 2 4
  roman → 1 2
  countryman → 13 16
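
The three pipeline stages can be sketched end to end as follows. Everything here is a simplifying assumption made for illustration: the regex tokenizer, the toy `normalize` step (lowercasing, a one-entry table of irregular plurals, and stripping a final "s"), and the two sample documents. Real linguistic modules are considerably more involved.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Tokenizer: split the text into runs of letters."""
    return re.findall(r"[A-Za-z]+", text)


# Hypothetical lookup table standing in for a real lemmatizer.
IRREGULAR = {"countrymen": "countryman"}


def normalize(token):
    """Linguistic modules (toy version): lowercase and crudely singularize."""
    t = token.lower()
    if t in IRREGULAR:
        return IRREGULAR[t]
    if t.endswith("s") and len(t) > 3:
        return t[:-1]
    return t


def build_index(docs):
    """Indexer: map each term to a sorted list of docIDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(tok) for tok in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


# Made-up documents keyed by docID.
docs = {2: "Friends, Romans, countrymen.", 4: "Friends of Caesar"}
index = build_index(docs)
print(index["friend"])      # [2, 4]
print(index["countryman"])  # [2]
```

Because documents are visited in ascending docID order, every postings list comes out sorted without a separate sorting pass.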

Document Delineation: Parsing a document
▪ What format is it in?
  ▪ pdf/word/excel/html?
▪ What language is it in?
▪ What character set is in use?
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …
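
As an illustration of the heuristic route, the sketch below guesses a format from leading "magic" bytes and picks a character set by trial decoding. The function names, the short list of signatures, and the UTF-8-then-Latin-1 fallback order are assumptions chosen for the example, not a recommended detector.

```python
def guess_format(data):
    """Guess a document format from the first bytes of the file."""
    head = data[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith(b"pk\x03\x04"):
        # ZIP container signature; modern Word and Excel files are ZIP archives
        return "zip"
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "html"
    return "unknown"


def decode_text(data):
    """Return (text, encoding), trying UTF-8 first and then Latin-1."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails;
        # it may still be the wrong guess for the actual content.
        return data.decode("latin-1"), "latin-1"


print(guess_format(b"%PDF-1.7 example"))          # pdf
print(decode_text("café".encode("latin-1"))[1])   # latin-1
```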

Document Delineation: Complications (format/language)
▪ Documents being indexed can include docs from many different languages
  ▪ A single index may have to contain terms of several languages.
▪ Sometimes a document or its components can contain multiple languages/formats
  ▪ French email with a German pdf attachment.
▪ What is a unit document?
  ▪ A file?
  ▪ An email? (Perhaps one of many in an mbox.)
  ▪ An email with 5 attachments?
  ▪ A group of files (PPT or LaTeX as HTML pages)
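
One possible answer to the unit-document question for email is to treat the body and each attachment as separate units, so that each can get its own format and language handling. The sketch below uses Python's standard `email` package; the helper name `indexable_units` and the sample message are made up for the example.

```python
from email.message import EmailMessage


def indexable_units(msg):
    """List (name, content type) for every non-container part of a message."""
    units = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip containers, keep only leaf parts
        name = part.get_filename() or "body"
        units.append((name, part.get_content_type()))
    return units


# A French email carrying a (fake) German PDF attachment.
msg = EmailMessage()
msg["Subject"] = "Rapport"
msg.set_content("Bonjour, voici le rapport.")
msg.add_attachment(b"%PDF-1.4 placeholder", maintype="application",
                   subtype="pdf", filename="bericht.pdf")

print(indexable_units(msg))
# [('body', 'text/plain'), ('bericht.pdf', 'application/pdf')]
```

Whether the parts are indexed as separate documents or merged back into one is a design decision that depends on what users expect a search hit to be.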

Vocabulary of Terms: Tokens and Terms

Vocabulary of Terms: Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
  ▪ Friends
  ▪ Romans
  ▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index entry, after further processing
  ▪ Described below
▪ But what are valid tokens to emit?

Vocabulary of Terms: Tokenization
▪ Issues in tokenization:
  ▪ Finland’s capital → Finland? Finlands? Finland’s?
  ▪ Hewlett-Packard → Hewlett and Packard as two tokens?
    ▪ state-of-the-art: break up hyphenated sequence.
    ▪ co-education
    ▪ lowercase, lower-case, lower case?
    ▪ It can be effective to get the user to put in possible hyphens
  ▪ San Francisco: one token or two?
    ▪ How do you decide it is one token?
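
To make these choices concrete, the sketch below exposes two of them as switches and shows how the same text yields different token streams. The function and its parameters `split_hyphens` and `strip_possessive` are hypothetical, and the example text uses a plain ASCII apostrophe.

```python
def tokenize(text, split_hyphens, strip_possessive):
    """Whitespace tokenizer with two policy switches."""
    tokens = []
    for chunk in text.split():
        chunk = chunk.strip(".,;:!?\"()")  # trim surrounding punctuation
        if not chunk:
            continue
        if strip_possessive and chunk.endswith("'s"):
            chunk = chunk[:-2]  # Finland's -> Finland
        if split_hyphens:
            tokens.extend(part for part in chunk.split("-") if part)
        else:
            tokens.append(chunk)
    return tokens


text = "Finland's capital, Hewlett-Packard, state-of-the-art"
print(tokenize(text, split_hyphens=True, strip_possessive=True))
# ['Finland', 'capital', 'Hewlett', 'Packard', 'state', 'of', 'the', 'art']
print(tokenize(text, split_hyphens=False, strip_possessive=False))
# ["Finland's", 'capital', 'Hewlett-Packard', 'state-of-the-art']
```

Whichever policy is chosen, queries have to be tokenized the same way as documents, otherwise a query token may never match an index term.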

Vocabulary of Terms: Numbers
▪ 3/20/91    Mar. 12, 1991    20/3/91
▪ 55 B.C.
▪ B-52
▪ My PGP key is 324a3df234cb23e
▪ (800) 234-2333
  ▪ Often have embedded spaces
  ▪ Older IR systems may not index numbers
    ▪ But often very useful: think about things like looking up error codes/stacktraces on the web
    ▪ (One answer is using n-grams: Lecture 2.2)
  ▪ Will often index “meta-data” separately
    ▪ Creation date, format, etc.
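
One way to keep such strings searchable is to let the tokenizer recognize a few numeric shapes before falling back to plain alphanumeric runs. The patterns below are illustrative assumptions covering only some of the slide's examples (a US-style phone number, a numeric date, a letter-hyphen-number model name); they are not a general solution.

```python
import re

# Alternatives are tried left to right, so specific shapes come first.
TOKEN_PATTERN = re.compile(r"""
    \(\d{3}\)\s*\d{3}-\d{4}      # phone number such as (800) 234-2333
  | \d{1,2}/\d{1,2}/\d{2,4}      # numeric date such as 3/20/91
  | [A-Za-z]+-\d+                # model name such as B-52
  | [A-Za-z0-9]+                 # any other alphanumeric run
""", re.VERBOSE)


def tokenize(text):
    """Return tokens, keeping the recognized numeric shapes in one piece."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Call (800) 234-2333 about the B-52 on 3/20/91"))
# ['Call', '(800) 234-2333', 'about', 'the', 'B-52', 'on', '3/20/91']
```

Keeping a date as one token does not settle what it means: 3/20/91 and 20/3/91 can denote the same day under different conventions, which is one reason to index metadata such as creation date separately.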