清华大学:An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints Dong Deng (Tsinghua,China) Guoliang Li (Tsinghua,China) Jianhua Feng (Tsinghua,China) h -1911-
Dong Deng (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

Outline ● Motivation ● Problem formulation Trie-based framework Trie-based algorithms o Optimizing partition Scheme ° Experiment ● Conclusion 1/28/2021 aste@ ICDE2012
Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion 1/28/2021 Taste @ ICDE2012 2/42

Named Entity recognition o Dictionary-based Ner Dictionary of Entities Documents Isaac newtone 1 Str- Isaac Newton was an english physicist. Sigmund Freud mathematician, astronomer, natural philosopher. alchemist, and theologian and one of the n e most English influential men in human history. His philosophiae Austrian Naturalis Principia Mathematica, published in 1687, physicist by itself considered to be among the most influential mathematician books in the history of science laying the groundwork astronomer for most of classical mechanics philosopher alchemist 2 Sigmund freud was an austrian psychiatrist who theologian founded the psychoanalytic school of psychology. Freud psychiatrist is best known for his theories of the unconscious mind economist and the defense mechanism of repression and for historian creating the clinical practice of psychoanalysis for sociologist curing psychopathology through dialogue between a patient and a psychoanalyst 1/28/2021 Taste@ ICDE2012 /42
Dictionary-based NER Named Entity Recognition 1/28/2021 Taste @ ICDE2012 Isaac Newton Sigmund Freud English Austrian physicist mathematician astronomer philosopher alchemist theologian psychiatrist economist historian sociologist ... 1 Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics. 2 Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst. Dictionary of Entities Documents 3/42

Automatically add the links ● Wikipedia Q2 KIPEDIA Levenshtein distance the Levenshtein distance fs a measuring the amount of difference between two sequences( ie an edit distance).The o refer specifically to Levenshtein distance The Levenshtein distance betweer(wo strings) defined as the mimimum number of edits needed to transform one string into the other, with the allowable edit operations bei Toolbox insertion deletion, or substitution of a singe character. It is named after Vladimir Levenshtein, who considered this distance in 1965. [11 b Print export For example, the Levenshtein distance between " kitten'andsittingis 3, since the following three edits change one into the other, and there is no way to do it with fewer than v languages three edits. the objective is to find matches for short strings, for instance, strings from a dictionary, in many longer texts, in situations where a sn Here. one of the strings is typically short, while the other is arbitrarily long. This has a wide range of applications, for insta ce, spell checke correcton syst optical character rece and software to assist natural language translation The Levenshtein distance can also be computed between two longer strings, but the cost to compute it, which is roughly proportional to the product of the two string lengths, Levenshtein distance is not the only popular notion of ed distance. Variations can be obtained by changing the set of allowable edit operations: for instance, is available, and each operation is assigned a cost(possibly infinite) further generalized by ence alignment algorithms such h make an operation s cost depend on where it is appled. Computing the Levenshtein distance is based on the observation that if we reserve a matri to hold the Levenshtein distances between all prefixes of the first string and all prefixes of the second, then we can compute the values in the matrix by flood fling the matrix, and thus find the distance between the two full strings as the last value http://en.wikipedia.org/wiki/levenshtein_disTance 1/28/2021 Taste@ ICDE2012 4
Wikipedia http://en.wikipedia.org/wiki/Levenshtein_distance Automatically add the links 1/28/2021 Taste @ ICDE2012 4/42

Real-world Data is Rather Dirty! a DBLP Complete search ● Typo in“aut Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. □Awvw:051011 Argyris Zymnis Typo in“ title relaxe Yair Bartal, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 1/28/2021 related Taste@ ICDE2012
Typo in “author” Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search 1/28/2021 Real-world Data is Rather Dirty! Taste @ ICDE2012 5/42

Approximate Entity Extraction Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities ● For example: Dictionary of entities D ocuments Isaac newton Sigmund freund was an austrian psychiatrist who <Sigmund Freud founded the psychoanalytic school of psychology physicist phys Freud is best known for his theories of the astronomer unconscious mind and the defense mechanism of alchemist theologian repression and for creating the clinical practice of economist psychoanalysis for curing psychopathology sociologist Irou Th dialogue between a patient and a g sychoanalayst 1/28/2021 Taste@ ICDE2012 6/42
Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities. For example: Approximate Entity Extraction 1/28/2021 Taste @ ICDE2012 Isaac Newton Sigmund Freud physicist astronomer alchemist theologian economist sociologist ... Sigmund Freund was an Austrian psychiatrest who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalayst. Dictionary of Entities Documents 6/42

Outline ● Motivation ● Problem formulation Trie-based framework Trie-based algorithms o Optimizing partition Scheme ° Experiment ● Conclusion 1/28/2021 aste@ ICDE2012
Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion 1/28/2021 Taste @ ICDE2012 7/42

Problem formulation o Approximate Entity Extraction: Given a dictionary of entities E=ley e,,.,en, a document D, and a predefined edit distance threshold t, approximate entity extraction finds all"similar"pairs such that ED(S, ej s t, where s is a substring of d and eie e (a) dictionary E (b) Document D ID Entities Len Document vancouver an efficient filter for approximates 2 vanateshe membership ChecKing kaushit 3 Ksurajit chaudrD tetretratttrti, >surajit chaudhuri, 4 caushit chaudui 15 vankatesh ganti dong rIn. 5 caushit chakra 15 vancouver, canada. sigmod 2008 1/28/2021 Taste@ ICDE2012
Problem Formulation 1/28/2021 Taste @ ICDE2012 Approximate Entity Extraction: Given a dictionary of entities E = {e1 , e2 , . . . , en }, a document D, and a predefined edit distance threshold τ, approximate entity extraction finds all “similar” pairs such that ED(s, ei) ≤ τ, where s is a substring of D and ei∈ E. 8/42

Edit distance ED(r s): The minimum number of single-character edit operations(insertion/deletion/substitution )to transform r to s o For example: ED(marios, maras)=2 8012345 m101 34 Dij=min (Di l, j +1,Di,jI+1,a 2 1 substitutea-p 321 Ⅰ sert o 0 if a,=b i4321 where t lifa.≠b 5432 654332 1/28/2021 Taste@ ICDE2012 9/42
ED(r, s): The minimum number of single-character edit operations(insertion/deletion/substitution) to transform r to s. For example: ED(marios, maras) = 2 Di,0 = i, D0,j = j, Di,j = min{Di-1, j + 1, Di, j-1 + 1, Di-1, j-1 + t i, j}, 0 if ai = bj 1 if ai bj . Edit Distance 1/28/2021 Taste @ ICDE2012 9/42 where t i, j = Substitute a->i Insert o

State-of-the-art methods NGPP Faerie Inverted Index Prefix of 1-variant family g-grams Filtering a substring matches with Overlap must be larger Condition the prefix of 1-variant of than a threshold partition Shortage ● Large Index size Need to tune parameters Not efficient for large threshold 1/28/2021 Taste@ ICDE2012
State-of-the-art Methods Shortage: Large Index Size Need to Tune Parameters Not efficient for large threshold 1/28/2021 Taste @ ICDE2012 10/42 NGPP Faerie Inverted Index Prefix of 1-variant family q-grams Filtering Condition a substring matches with the prefix of 1-variant of partition Overlap must be larger than a threshold
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 四川大学:《操作系统 Operating System》课程教学资源(PPT课件讲稿)Chapter 5 互斥与同步(Mutual Exclusion and Synchronization)5.4 Monitors 5.5 Message Passing 5.6 Readers/Writers Problem.ppt
- 上海交通大学:《程序设计》课程教学资源(PPT课件讲稿)第6章 过程封装——函数.ppt
- 《3ds Max》教学资源(PPT课件)第4章 基本三维模型的创建.ppt
- 南京大学:复杂系统学习(PPT课件讲稿)佩特里网 Petri Nets.pptx
- 香港科技大学:《软件开发》教学资源(PPT课件讲稿)Functions.ppt
- 《计算机文化基础》课程教学资源(PPT课件讲稿)第二章 Windows XP操作系统.ppt
- 电子科技大学:《计算机操作系统》课程教学资源(PPT课件讲稿)第五章 设备管理.ppt
- 山东大学:语音识别技术(PPT课件讲稿)自动语音识别 Automatic Speech Recognition.pptx
- 数据集成 Data Integration(PPT讲稿)成就与展望 Achievements and Perspectives.ppt
- 北京师范大学:拓扑序及其量子相变(PPT课件讲稿)Topological Order and its Quantum Phase Transition.ppt
- 计算机系教学资源(PPT课件讲稿)信息安全与保密技术.ppt
- 汤姆森 Thomson:利用Web of Knowledge对课题进行检索、分析、跟踪、管理.ppt
- 西安电子科技大学:《微机原理与接口技术》课程教学资源(PPT课件讲稿)第九章 定时/计数器8253.pptx
- 同济大学:聚类分析(PPT课件讲稿)Cluster Analysis.pptx
- 《数字图像处理学》课程教学资源(PPT课件讲稿)第2章 图像、图像系统与视觉系统.pptx
- 四川大学:《软件测试与维护基础教程》课程教学资源(PPT课件讲稿)软件测试工具 Software Testing Tool.ppt
- B-树、散列技术、散列表的概念、散列函数的构造方法、处理冲突的方法、散列表上的运算.ppt
- 南京大学:《面向对象技术 OOT》课程教学资源(PPT课件讲稿)对象序列化和持久化 Object Serialization and Persistence.ppt
- 电子工业出版社:《计算机网络》课程教学资源(第五版,PPT课件讲稿)第十章 下一代因特网.ppt
- 《网络编程实用教程(第三版)Network Application Programming》课程教学资源(PPT课件讲稿)第1章 概述.ppt
- 东南大学:《数据结构》课程教学资源(PPT课件讲稿)第三章 栈与队列.ppt
- 《计算机网络与因特网 Computer Networks and Internets》课程教学资源(PPT课件讲稿)Part II 物理层(信号、媒介、数据传输).ppt
- 合肥工业大学:《网络安全概论》课程教学资源(PPT课件讲稿)第2讲 密码学简介(主讲:苏兆品).ppt
- 长春大学:《计算机应用基础》课程教学资源(PPT课件讲稿)第一章 计算机基础知识(崔天明).ppt
- 《Java网站开发》教学资源(PPT讲稿)第9章 过滤器和监听器技术.ppt
- 北京师范大学现代远程教育:《计算机应用基础》课程教学资源(PPT课件讲稿)第2章 计算机网络应用.ppsx
- 清华大学:A Feature Weighting Method for Robust Speech Recognition(Speech Activities in CST).ppt
- 西安电子科技大学:《神经网络与模糊系统》课程教学资源(PPT课件讲稿)Chapter 6 结构和平衡 Architecture and Equilibria.ppt
- 北京大学:人工神经网络(PPT课件讲稿)Artificial Neural Networks,ANN.ppt
- 《计算机组成原理》课程教学资源(PPT课件讲稿)第4章 处理器(CPU).ppt
- 吉林大学:《C语言》课程教学资源(PPT课件讲稿)第6章 利用数组处理批量数据.ppt
- 《Vb程序设计教程》课程教学资源(PPT课件讲稿)第三章 VB语言基础.pps
- 安徽理工大学:《汇编语言》课程教学资源(PPT课件讲稿)第七章 高级汇编语言技术(主讲:李敬兆).ppt
- 《软件质量与测试》课程教学资源(PPT大纲课件,目录版).pptx
- 香港理工大学:Discovering Classification Rules.ppt
- 北京科技大学:物联网知识体系和学科建设(PPT讲稿,王志良).ppt
- 中国科学技术大学:《信号与图像处理基础 Signal and Image Processing》课程教学资源(PPT课件讲稿)傅里叶分析与卷积 Fourier Analysis and Convolution.pptx
- 沈阳理工大学:《单片机C语言应用程序设计》课程PPT教学课件(单片机C语言编程)04 C51编程设计(廉哲).pptx
- 《软件工程 Software Engineering》教学资源:课程教学大纲.pdf
- 上海交通大学:《编译器构造》课程教学资源(PPT讲稿,马融)Compiler.pptx