
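"High-Performance Computer Networks" course slides (lecture notes), Chapter 10: Typical Web Applications of Big Data, Lecture 54: Introduction to Web Information Retrieval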


Mining of Massive Web Data
Lecture 54: Introduction to Web Information Retrieval
More material: http://web.stanford.edu/class/cs276/
School of Computer Science and Technology, Wuhan University of Technology

Lecture 14: Introduction to Web Information Retrieval
• Introduction
• Information Retrieval
• Web Search
• IR History

Information Retrieval (IR)
• The indexing and retrieval of textual documents.
• Searching for pages on the World Wide Web is the most recent “killer app.”
• Concerned firstly with retrieving relevant documents to a query.
• Concerned secondly with retrieving from large sets of documents efficiently.
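The slide does not name a data structure for "indexing and retrieval". One common choice is an inverted index, which maps each term to the documents containing it, so a query touches only the relevant posting sets instead of scanning every document. The following is a minimal sketch under that assumption; the corpus, the tokenizer, and the function names are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy corpus, for illustration only.
corpus = {
    "doc1": "Information retrieval indexes textual documents.",
    "doc2": "Web search is a killer app of information retrieval.",
    "doc3": "Spiders crawl the World Wide Web.",
}
index = build_inverted_index(corpus)
print(sorted(lookup(index, "information retrieval")))
```

Building the index costs one pass over the corpus. Each query then intersects a few sets, which is the kind of trade-off that makes retrieval from large document sets efficient.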

Typical IR Task
• Given:
  - A corpus of textual natural-language documents.
  - A user query in the form of a textual string.
• Find:
  - A ranked set of documents that are relevant to the query.

IR System
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]

Relevance
• Relevance is a subjective judgment and may include:
  - Being on the proper subject.
  - Being timely (recent information).
  - Being authoritative (from a trusted source).
  - Satisfying the goals of the user and his/her intended use of the information (information need).

Keyword Search
• Simplest notion of relevance is that the query string appears verbatim in the document.
• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
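The two notions of relevance above can be sketched side by side. This is a minimal illustration, not a production ranking function. The case-insensitive comparison, the tokenizer, the scoring by raw occurrence counts, and the toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


def rank(query, corpus):
    """Return (doc_id, score) pairs with score > 0, highest score first."""
    scored = [(doc_id, bag_of_words_score(query, text))
              for doc_id, text in corpus.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    return sorted(matching, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy corpus, for illustration only.
corpus = {
    "doc1": "Web search applies information retrieval to the web.",
    "doc2": "Search the web: web pages link to other web pages.",
    "doc3": "A bat is a flying mammal.",
}
print(verbatim_match("web search", corpus["doc1"]))  # True
print(verbatim_match("web search", corpus["doc2"]))  # False
print(rank("web search", corpus))  # [('doc2', 4), ('doc1', 3)]
```

Here doc2 never contains the exact string "web search", so the verbatim test rejects it, yet the bag-of-words score ranks it first because the query words occur more often in it.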

Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  - “restaurant” vs. “café”
  - “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
  - “bat” (baseball vs. mammal)
  - “Apple” (company vs. fruit)
  - “bit” (unit of data vs. act of eating)
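Both failure modes can be reproduced with a plain word-overlap matcher. The sketch below assumes a hypothetical three-document corpus and a matcher that returns any document sharing at least one word with the query. It only demonstrates the problems and does not attempt to solve them.

```python
import re


def tokenize(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_hits(query, corpus):
    """Return sorted ids of documents sharing at least one word with the query."""
    query_terms = tokenize(query)
    return sorted(doc_id for doc_id, text in corpus.items()
                  if query_terms & tokenize(text))


# Hypothetical toy corpus, for illustration only.
corpus = {
    "cafe_guide": "The best café near the station serves lunch.",
    "baseball": "He swung the bat and hit a home run.",
    "zoology": "A bat is a nocturnal flying mammal.",
}

# Synonym problem: the relevant café document is missed.
print(keyword_hits("restaurant", corpus))  # []

# Ambiguity problem: both senses of "bat" are returned.
print(keyword_hits("bat", corpus))  # ['baseball', 'zoology']
```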

Web Search
• Application of IR to HTML documents on the World Wide Web.
• Differences:
  - Must assemble document corpus by spidering the web.
  - Can exploit the structural layout information in HTML (XML).
  - Documents change uncontrollably.
  - Can exploit the link structure of the web.
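A rough sketch of how a spider might assemble a corpus while recording the link structure follows. It uses only the Python standard library and runs against a hypothetical in-memory "web" (`FAKE_WEB`) instead of the network. The `fetch` callable, the `example.test` URLs, and the use of in-link counts as a signal are assumptions for illustration, not something the slides prescribe.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags, i.e. use the HTML structure."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(seed_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    corpus, link_graph = {}, {}
    queue, seen = deque([seed_url]), {seed_url}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable: skip it
            continue
        parser = LinkExtractor()
        parser.feed(html)
        out_links = [urljoin(url, href) for href in parser.links]
        corpus[url] = html
        link_graph[url] = out_links
        for target in out_links:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return corpus, link_graph


def inlink_counts(link_graph):
    """Count how many distinct crawled pages link to each URL."""
    counts = {}
    for targets in link_graph.values():
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical in-memory "web", so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a> <a href="/missing.html">x</a>',
}
corpus, link_graph = spider("http://example.test/", FAKE_WEB.get)
print(sorted(corpus))
print(inlink_counts(link_graph)["http://example.test/b.html"])  # 2
```

Because documents change uncontrollably, a real spider would also need to revisit pages and respect site policies. Both are left out here to keep the sketch short.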

Web Search System
[Figure: a spider crawls the Web to build the document corpus; the IR system takes a query string and the corpus as inputs and outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
