中国高校课件下载中心 》 教学资源 》 大学文库

清华大学:Pass-Join(A Partition based Method for Similarity Joins)

文档信息
资源类别:文库
文档格式:PPTX
文档页数:53
文件大小:1.07MB
团购合买:点击进入团购
内容简介
 Motivation & Problem Formulation  Partition-based Framework  Improving Substring Selection  Improving the Verification  Experiment  Conclusion
刷新页面文档预览

Pass-Join:A Partition based Method for Similarity Joins Guoliang Li (Tsinghua,China) Dong Deng (Tsinghua,China) Jiannan Wang (Tsinghua,China) h心 -1911- Jianhua Feng (Tsinghua,China)

Guoliang Li (Tsinghua, China) Dong Deng (Tsinghua, China) Jiannan Wang (Tsinghua, China) Jianhua Feng (Tsinghua, China)

Real-world Data is Rather Dirty! a DBLP Complete search ● Typo in“aut Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. □Awvw:051011 Argyris Zymnis Typo in“ title relaxe Yair Bartal, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 1/29/2021 related PassJoin a VLDB2012

 Typo in “author”  Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search 1/29/2021 Real-world Data is Rather Dirty! PassJoin @ VLDB2012 2

Similarity Join Equal Join Datasets Dataset s Conference Conference VLDB CIDR SIGMOD SIGMOD ICDE PVLDB 20/2021 PassJoin a VLDB2012

Similarity Join 1/29/2021 PassJoin @ VLDB2012 3 Conference VLDB SIGMOD ICDE Conference CIDR SIGMOD PVLDB Dataset R Dataset S Equal Join

Similarity Join Similarity join Datasets Dataset s Conference Conference VLDB CIDR SIGMOD SIGMOD ICDE PVLDB 20/2021 PassJoin a VLDB2012

Similarity Join 1/29/2021 PassJoin @ VLDB2012 4 Conference VLDB SIGMOD ICDE Conference CIDR SIGMOD PVLDB Dataset R Dataset S Similarity Join

Applications o Data Cleaning and Integration Near duplicate object detection o Collaborative Filtering 20/2021 PassJoin a VLDB2012

Applications Data Cleaning and Integration Near Duplicate Object Detection Collaborative Filtering  …….. 1/29/2021 PassJoin @ VLDB2012 5

Edit distance ED(r s): The minimum number of single-character edit operations(insertion/deletion/substitution to transform r to s Forexample: ED(hilton, huston)=2 hilton substitute i with u hulton substitute l with u huston Property:ED(r, s)2IIrI-Isl 20/2021 PassJoin a VLDB2012

 ED(r, s): The minimum number of single-character edit operations(insertion/deletion/substitution) to transform r to s.  For example: ED(hilton, huston) = 2  Property: ED(r, s) ≥ ||r|-|s|| Edit Distance 1/29/2021 PassJoin @ VLDB2012 6 hilton hulton huston substitute i with u substitute l with u

Problem formulation Give threshold t=3 [ID Strings $1 vankatesh S2 avataresha s3 kaushic chaduri S4 kaushik chakra S5 kaushuk chadhui s6=5ED=13ED=12 ED=14ED=12ED=12ED=12 ED=14ED=4ED=8 ED=4ED=8 1/29/2021 PassJoin a VLDB2012

Problem Formulation 1/29/2021 PassJoin @ VLDB2012 7 Give threshold τ=3 ED =5 ED =13 ED =12 ED =12 ED =14 ED =12 ED =12 ED =12 ED =14 ED =5 ED =4 ED =8 ED =4 ED =3 ED =8

Filter-and-refine methods Basic idea e Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs 20/2021 PassJoin a VLDB2012

Filter-and-refine Methods Basic idea  Filter a large number of dissimilar string pairs  Verify the remaining potentially similar pairs 1/29/2021 PassJoin @ VLDB2012 8

Filter-and-refine methods Give threshold t=3 DI Strings Length s1 vankatesh 9 Pruning Condition. s2/ avataresha s3 kaushic chaduri 15 S 3 s4 kaushik chakrab15 ss kaushuk chadhui|15 s6 caushik chakrabar|17 ED≤s1,2>=5113B1412E“112 E1当-44E23212E=5ED=4ED=8 ED=4ED=3ED=8 1/29/2021 PassJoin a VLDB2012

Filter-and-refine Methods 1/29/2021 PassJoin @ VLDB2012 9 Give threshold τ=3 ED =5 ED =13 ED =12 ED =12 ED =14 ED =12 ED =12 ED =12 ED =14 ED =5 ED =4 ED =8 ED =4 ED =3 ED =8 Pruning Condition: ||si | - |sj || > 3

Filter-and-refine methods ● Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs Drawbacks o Need to tune parameters Bad for short strings 20/2021 PassJoin a VLDB2012

Filter-and-refine Methods Basic idea  Filter a large number of dissimilar string pairs  Verify the remaining potentially similar pairs Drawbacks  Need to tune parameters  Bad for short strings 1/29/2021 PassJoin @ VLDB2012 10

刷新页面下载完整文档
VIP每日下载上限内不扣除下载券和下载次数;
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
相关文档