中国高校课件下载中心 》 教学资源 》 大学文库

清华大学:A Pivotal Prefix Based Filtering Algorithm for String Similarity Search(PPT讲稿)

文档信息
资源类别:文库
文档格式:PPTX
文档页数:54
文件大小:1.41MB
团购合买:点击进入团购
内容简介
清华大学:A Pivotal Prefix Based Filtering Algorithm for String Similarity Search(PPT讲稿)
刷新页面文档预览

snowbird UT. USA. 2014 A Pivotal Prefix Based Filtering Algorithm for String Similarity search Dong deng, guoliang Li, Jianhua Feng Database Group, Tsinghua University 小 1911 Present by dong deng

Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Search is Important Google Searches per Year 1,600,000,000,000 ■ Search queries 1,200,000,000,000 800000,000,000 40,000,000,000 19992000200120022003200420052006200720082009201020112012 Google searches per Year Source:http://www.internetlivestats.com/google-search-statistics/

Search is Important Source: http://www.internetlivestats.com/google-search-statistics/ Google Searches per Year

Speed matters But when questions aren' t answered quickly, people ask less GOOGLE FOUND THAT SLOWING SEARCH RESULTS BY JUST 4/10THS OF A SECOND would reduce the number of searches by 8,000,000 a d Source

Speed Matters Source:

Data is dirty DBLP Complete Search ypos Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. cAo00:1011 argyris Zymnis relaxed E Yai I, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 related

Data is Dirty • Typos • Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search

Similarity Search Query All the strings similar to the query String Dataset

Similarity Search Query String Dataset All the strings similar to the query

Edit distance ED( s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s For example: ED(sigcom, sigmod )=2 sitcom substitute c with m sIemon substitute m with d sigmod

• ED(r, s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s. • For example: ED(sigcom, sigmod) = 2 Edit Distance sigcom sigmom sigmod substitute c with m substitute m with d

Problem definition DEFINITION 1 (STRING SIMILARITY SEARCH). Given a string dataset R, a query string s, and an edit-distance thresh old T, the string similarity search with edit-distance con- straints finds all strings r ER such that ed(r,S)<T id string Query string lmyouteca s= yotubecom"andτ=2 r2 ubuntucom r3 utubbecou r4 youtbecom r5 yoytubeca ed(s, r4<=2 output ra as a result string dataset R

Problem Definition Query string s = “yotubecom” and τ = 2 string dataset R ed(s, r4 ) <= 2 output r4 as a result

Application Spell checking Copy detection Entity Linking ·Bⅰ informatic

Application • Spell Checking • Copy Detection • Entity Linking • Bioinformatic …

Challenge Naive method Time complexity: for each query O(RIs t

Challenge Naïve Method Time complexity: for each query 𝑂 |𝑅| |𝑠| τ

Filter-and-Verification framework Query string s Filter: Index No es Dataset r Signature()∩ Verify: Results Signature (r)=o? ED(rs)≤τ? Threshold T Pruning power Filtering Cost depends on depends on Sig(r)Ix(sig(s)) and Index: (sig(r)I(Probe set size) min((sig(r)(Sig(s) No Index: (sig(r)+(sig(s)I

No Filter-and-Verification Framework Dataset R Threshold τ Query string s Results Filter: Signature(s) ∩ Signature(r) = ϕ? Verify: ED(r,s) ≤ τ? Yes Index

刷新页面下载完整文档
VIP每日下载上限内不扣除下载券和下载次数;
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
相关文档