Key Technologies of Web Search and Mining (Web Search and Mining): Course Teaching Resources (Lecture Slides), Lecture 06 Index Compression

Web Search and Mining
Lecture 6: Index Compression

Last lecture – index construction
▪ Sort-based indexing
  ▪ Naïve in-memory inversion
  ▪ Blocked Sort-Based Indexing
    ▪ Merge sort is effective for disk-based sorting (avoid seeks!)
▪ Single-Pass In-Memory Indexing
  ▪ No global dictionary: generate a separate dictionary for each block
  ▪ Don't sort postings: accumulate postings in postings lists as they occur
▪ Distributed indexing using MapReduce
▪ Dynamic indexing: multiple indices, logarithmic merge

This lecture
▪ Collection statistics in more detail (with RCV1)
  ▪ How big will the dictionary and postings be?
▪ Dictionary compression
▪ Postings compression
[Slide figure: example postings lists for BRUTUS, CAESAR, and CALPURNIA]

Why compression (in general)?
▪ Use less disk space
  ▪ Saves a little money
▪ Keep more stuff in memory
  ▪ Increases speed
▪ Increase speed of data transfer from disk to memory
  ▪ [read compressed data | decompress] is faster than [read uncompressed data]
  ▪ Premise: decompression algorithms are fast
  ▪ True of the decompression algorithms we use

Why compression for inverted indexes?
▪ Dictionary
  ▪ Make it small enough to keep in main memory
  ▪ Make it so small that you can keep some postings lists in main memory too
▪ Postings file(s)
  ▪ Reduce disk space needed
  ▪ Decrease time needed to read postings lists from disk
  ▪ Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory
▪ We will devise various IR-specific compression schemes

Collection Statistics: Recall Reuters RCV1

symbol  statistic                    value
N       documents                    800,000
L       avg. # tokens per doc        200
M       terms (= word types)         ~400,000
        avg. # bytes per token       6 (incl. spaces/punct.)
        avg. # bytes per token       4.5 (without spaces/punct.)
        avg. # bytes per term        7.5
        non-positional postings      100,000,000
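These statistics support a quick back-of-the-envelope size estimate. The sketch below uses only the numbers from the table above; the byte figures are averages, so the results are rough magnitudes, not exact sizes.

```python
# Back-of-the-envelope sizes for Reuters RCV1, using the stats above.
N = 800_000             # documents
L = 200                 # avg. # tokens per document
BYTES_PER_TOKEN = 6     # incl. spaces/punctuation
POSTINGS = 100_000_000  # non-positional postings

tokens = N * L                         # total tokens in the collection
text_bytes = tokens * BYTES_PER_TOKEN  # raw text size
print(f"tokens: {tokens:,}")                    # 160,000,000
print(f"raw text: {text_bytes / 1e9:.2f} GB")   # ~0.96 GB

# Note postings (100M) < tokens (160M): repeated occurrences of a term
# within one document collapse into a single non-positional posting.
print(f"postings/tokens: {POSTINGS / tokens:.3f}")  # 0.625
```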

Index parameters vs. what we index (details: IIR Table 5.1, p. 80)

                 dictionary             non-positional index    positional index
                 size (K)  Δ%  cumul%   size (K)  Δ%  cumul%    size (K)  Δ%  cumul%
Unfiltered          484                 109,971                 197,879
No numbers          474   -2    -2      100,680   -8    -8      179,158   -9    -9
Case folding        392  -17   -19       96,969   -3   -12     179,158    0    -9
30 stopwords        391   -0   -19       83,390  -14   -24     121,858  -31   -38
150 stopwords       391   -0   -19       67,002  -30   -39      94,517  -47   -52
Stemming            322  -17   -33       63,812   -4   -42      94,517    0   -52

Exercise: give intuitions for all the '0' entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression
▪ Lossless compression: all information is preserved
  ▪ What we mostly do in IR
▪ Lossy compression: discard some information
  ▪ Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination
  ▪ Chap 7: prune postings entries that are unlikely to turn up in the top-k list for any query
    ▪ Almost no loss of quality for the top-k list
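The preprocessing steps named above can be sketched as a toy pipeline. This is purely illustrative: the stop list is tiny and the "stemmer" is a crude suffix stripper (nothing like a real Porter stemmer), but it shows how each lossy step shrinks the vocabulary.

```python
import re

docs = ["The 2 CARS raced.", "A car races; the cars stopped in 1999."]

def tokens(text):
    """Split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def preprocess(text, stop={"the", "a", "in"}):
    """Apply the lossy steps: numbers, case folding, stop words, stemming."""
    out = []
    for t in tokens(text):
        if t.isdigit():                    # number elimination
            continue
        t = t.lower()                      # case folding
        if t in stop:                      # stop-word removal
            continue
        t = re.sub(r"(ed|s)$", "", t)      # crude "stemming" (toy only)
        out.append(t)
    return out

raw_vocab = {t for d in docs for t in tokens(d)}
cooked_vocab = {t for d in docs for t in preprocess(d)}
print(len(raw_vocab), "->", len(cooked_vocab))  # 12 -> 4
```

Each step is irreversible: given only the processed tokens, you cannot recover the original capitalization, stop words, or numbers, which is exactly why these steps count as lossy compression.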

Vocabulary vs. collection size
▪ How big is the term vocabulary?
  ▪ That is, how many distinct words are there?
▪ Can we assume an upper bound?
  ▪ Not really: at least 70^20 ≈ 10^37 different words of length 20
▪ In practice, the vocabulary will keep growing with the collection size
  ▪ Especially with Unicode ☺
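The 70^20 figure above is easy to sanity-check: with an alphabet of about 70 symbols, length-20 strings alone already rule out any practical upper bound on vocabulary size.

```python
import math

ALPHABET = 70   # rough count of usable characters
LENGTH = 20     # word length considered on the slide

count = ALPHABET ** LENGTH
# 20 * log10(70) ~= 36.9, i.e. roughly 10^37 possible strings
print(f"~10^{math.log10(count):.1f} words of length {LENGTH}")
```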

Vocabulary vs. collection size: Heaps' law
▪ Heaps' law: M = kT^b
  ▪ M is the size of the vocabulary, T is the number of tokens in the collection
  ▪ Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
▪ In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½ (since log M = log k + b log T)
  ▪ It is the simplest possible relationship between the two in log-log space
▪ An empirical finding ("empirical law")
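Heaps' law can be applied directly to the RCV1 numbers from earlier slides. The sketch below uses k = 44 and b = 0.49, the values the IIR book fits for RCV1; treat them as illustrative parameters, not universal constants.

```python
def heaps_vocabulary(T, k=44, b=0.49):
    """Predicted vocabulary size M = k * T^b after seeing T tokens."""
    return k * T ** b

# RCV1: roughly 800,000 docs * 200 tokens/doc = 160M tokens.
T = 800_000 * 200
M = heaps_vocabulary(T)
print(f"predicted vocabulary: {M:,.0f} terms")
```

The prediction comes out near the ~400,000 distinct terms reported in the RCV1 statistics table, which is why Heaps' law is useful for sizing a dictionary before building the index.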