
Web Search and Mining (《网络搜索和挖掘关键技术》) course teaching resource (PPT slides): Lecture 06 Index Compression

Document info
Format: PPT
Pages: 46
File size: 746 KB

Web Search and Mining
Lecture 6: Index Compression



Last lecture: index construction
▪ Sort-based indexing
  ▪ Naïve in-memory inversion
  ▪ Blocked Sort-Based Indexing
    ▪ Merge sort is effective for disk-based sorting (avoid seeks!)
▪ Single-Pass In-Memory Indexing
  ▪ No global dictionary: generate a separate dictionary for each block
  ▪ Don't sort postings: accumulate postings in postings lists as they occur
▪ Distributed indexing using MapReduce
▪ Dynamic indexing: multiple indices, logarithmic merge


This lecture
▪ Collection statistics in more detail (with RCV1)
  ▪ How big will the dictionary and postings be?
▪ Dictionary compression
▪ Postings compression

(Example postings lists: Brutus → 1 2 4 11 31 45 173 174; Caesar → 1 2 4 5 6 16 57 132; Calpurnia → 2 31 54 101)


Why compression (in general)?
▪ Use less disk space
  ▪ Saves a little money
▪ Keep more stuff in memory
  ▪ Increases speed
▪ Increase speed of data transfer from disk to memory
  ▪ [read compressed data | decompress] is faster than [read uncompressed data]
  ▪ Premise: decompression algorithms are fast
    ▪ True of the decompression algorithms we use


Why compression for inverted indexes?
▪ Dictionary
  ▪ Make it small enough to keep in main memory
  ▪ Make it so small that you can keep some postings lists in main memory too
▪ Postings file(s)
  ▪ Reduce disk space needed
  ▪ Decrease time needed to read postings lists from disk
  ▪ Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory
▪ We will devise various IR-specific compression schemes


Collection Statistics

Recall Reuters RCV1:

symbol  statistic                     value
N       documents                     800,000
L       avg. # tokens per doc         200
M       terms (= word types)          ~400,000
        avg. # bytes per token        6 (incl. spaces/punct.)
        avg. # bytes per token        4.5 (without spaces/punct.)
        avg. # bytes per term         7.5
        non-positional postings       100,000,000
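The totals above can be sanity-checked with a little arithmetic. This sketch recomputes the collection size from N and L and estimates uncompressed sizes; the 4-byte docID is an illustrative assumption, not a figure from the slide.

```python
# Back-of-the-envelope sizes for Reuters RCV1, using the statistics above.
N = 800_000              # documents
L = 200                  # avg. tokens per document
postings = 100_000_000   # non-positional postings

tokens = N * L           # total tokens in the collection

# Per the table: 6 bytes per token including spaces/punctuation.
text_bytes = tokens * 6

# Assumption: each posting stored as an uncompressed 4-byte docID.
postings_bytes = postings * 4

print(f"tokens:   {tokens:,}")                               # 160,000,000
print(f"raw text: {text_bytes / 10**9:.2f} GB")              # 0.96 GB
print(f"postings: {postings_bytes / 10**6:.0f} MB (uncompressed)")  # 400 MB
```

So even before positional information, the uncompressed postings alone run to hundreds of megabytes, which is what motivates the compression schemes that follow.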


Index parameters vs. what we index (details: IIR Table 5.1, p. 80)

                 dictionary (terms)      non-positional index     positional index
                 size(K)  Δ%  cumul%     size(K)   Δ%  cumul%     size(K)   Δ%  cumul%
Unfiltered         484                   109,971                  197,879
No numbers         474    -2    -2       100,680   -8    -8       179,158   -9    -9
Case folding       392   -17   -19        96,969   -3   -12       179,158    0    -9
30 stopwords       391    -0   -19        83,390  -14   -24       121,858  -31   -38
150 stopwords      391    -0   -19        67,002  -30   -39        94,517  -47   -52
Stemming           322   -17   -33        63,812   -4   -42        94,517    0   -52

Exercise: give intuitions for all the '0' entries. Why do some zero entries correspond to big deltas in other columns?
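The percentage columns can be reproduced from the size columns. A minimal check of the cumulative effect of the full preprocessing chain (unfiltered through stemming), using the sizes from the table; note the book's table rounds some individual deltas slightly differently.

```python
# Cumulative size reductions relative to the unfiltered index
# (sizes in thousands, taken from the table above).
unfiltered = {"dictionary": 484, "non-positional": 109_971, "positional": 197_879}
stemmed    = {"dictionary": 322, "non-positional": 63_812,  "positional": 94_517}

for col in unfiltered:
    pct = 100 * (stemmed[col] - unfiltered[col]) / unfiltered[col]
    print(f"{col}: {pct:.0f}% cumulative")   # -33%, -42%, -52%
```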


Lossless vs. lossy compression
▪ Lossless compression: all information is preserved
  ▪ What we mostly do in IR
▪ Lossy compression: discard some information
  ▪ Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination
  ▪ Chap 7: prune postings entries that are unlikely to turn up in the top k list for any query
    ▪ Almost no loss of quality for the top k list
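The preprocessing-as-lossy-compression idea can be made concrete. A toy sketch (the three documents and the stopword list are made-up examples) showing how case folding and stop word removal each shrink the set of distinct terms, discarding information along the way:

```python
# Toy collection; the documents and stopword list are illustrative only.
docs = ["The cars were parked", "the car parked", "A CAR, a car"]

def tokenize(text):
    return text.replace(",", "").split()

raw_terms = {t for d in docs for t in tokenize(d)}

# Case folding: distinct terms collapse (lossy: "The" vs "the" is gone).
folded = {t.lower() for t in raw_terms}

# Stop word removal: drop very common words entirely (also lossy).
stopwords = {"the", "a", "were"}
filtered = folded - stopwords

print(len(raw_terms), len(folded), len(filtered))   # 9 6 3
```

Each step is irreversible, which is exactly what makes it "lossy"; the bet is that the discarded distinctions rarely matter for ranking.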


Vocabulary vs. collection size
▪ How big is the term vocabulary?
  ▪ That is, how many distinct words are there?
▪ Can we assume an upper bound?
  ▪ Not really: at least 70^20 ≈ 10^37 different words of length 20
▪ In practice, the vocabulary will keep growing with the collection size
  ▪ Especially with Unicode ☺
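The 70^20 bound is easy to verify: with an alphabet of roughly 70 characters, counting the strings of length exactly 20 already gives a 37-digit number.

```python
import math

# Number of distinct strings of length 20 over a ~70-character alphabet.
n = 70 ** 20
print(len(str(n)))    # 37 digits, i.e. on the order of 10**37
print(math.log10(n))  # ~36.9
```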


Vocabulary vs. collection size
▪ Heaps' law: M = kT^b
  ▪ M is the size of the vocabulary, T is the number of tokens in the collection
▪ Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
▪ In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
  ▪ It is the simplest possible relationship between the two in log-log space
▪ An empirical finding ("empirical law")
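Heaps' law is easy to apply numerically. In this sketch the parameters k = 44 and b = 0.49 are the least-squares fit reported for RCV1 in the IIR book; for other collections the parameters would have to be fit from data. The prediction for the first million tokens (about 38,000 terms) matches the book's figure, while the full-collection prediction somewhat overshoots the observed ~400,000 terms.

```python
# Heaps' law: M = k * T**b, with the RCV1 fit k=44, b=0.49 as defaults.
def heaps(T, k=44, b=0.49):
    return k * T ** b

for T in (10**6, 10**7, 160 * 10**6):   # 160M tokens ~ all of RCV1
    print(f"T = {T:>11,} -> predicted M ~ {heaps(T):,.0f}")
```

Because b ≈ 0.5, multiplying the collection size by 100 only multiplies the predicted vocabulary by about 10, but it never stops growing.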
