电子科技大学：《大数据分析与挖掘 Big Data Analysis and Mining》课程教学资源（课件讲稿）Lecture 3 Hashing.pdf_大学文库

文档信息

资源类别：文库
文档格式：PDF
文档页数：61
文件大小：2.11MB
团购合买：点击进入团购

内容简介

Locality-Sensitive Hashing Find Similar Items CASE STUDY Finding Similar Documents Min-Hashing Locality-Sensitive Hashing Learn to Hash

刷新页面文档预览

Lecture 3 HASHING

Why we need HASHING? Wal-Mart:267 million items/day;4PB data warehouse Sloan Digital Sky Survey:New Mexico telescope captures 200 GB image data/day T吉 nature Science The FOURTH PARADIGM DATA-TENS SCIENCEIN THE PETABYTE ERA data Challenge in big data applications: 1.Curse of dimensionality 2.Storage cost 3.Query speed

Challenge in big data applications: 1. Curse of dimensionality 2. Storage cost 3. Query speed • Wal-Mart: 267 million items/day; 4PB data warehouse • Sloan Digital Sky Survey: New Mexico telescope captures 200 GB image data/day Why we need HASHING?

Example 1.Information Retrieval h(Statue of Liberty)= h (Napoleon)= h (Napoleon)= 10001010 01100001 011001Q1 flipped bit Should be very different Should be similar

Example 1. Information Retrieval

Example 2.Storage Cost Gist vector Binary reduction 10 million images 20 GB 160MB 512values 128bits

Example 2. Storage Cost

Example 3.Fast Nearest Neighbor Search Given a query point g(high dimensional),return the points closest(similar)to g in the database. ● 98 0 KD-TREE KD-tree cannot handle high-dimensional data

Example 3. Fast Nearest Neighbor Search Given a query point q (high dimensional), return the points closest (similar) to q in the database. KD-TREE KD-tree cannot handle high-dimensional data

WHAT WILL WE TALK? 1.Locality-Sensitive Hashing (Shingling+MinHash) 2.Learning to Hash 7

7 1. Locality-Sensitive Hashing (Shingling+ MinHash) 2. Learning to Hash WHAT WILL WE TALK?

Locality-Sensitive Hashing Find Similar Items

Introduction Many Web-mining problems can be expressed as finding "similar"sets: 1.Pages with similar words,e.g.,for classification by topic. 2.NetFlix users with similar tastes in movies,for recommendation systems. 3.Movies with similar sets of fans. 4.Images of related things. 9

9 Many Web-mining problems can be expressed as finding “similar” sets: 1. Pages with similar words, e.g., for classification by topic. 2. NetFlix users with similar tastes in movies, for recommendation systems. 3. Movies with similar sets of fans. 4. Images of related things. Introduction Introduction

CASE STUDY Finding Similar Documents

Given a body of documents,e.g.,the Web,find pairs of documents with a lot of text in common, e.g.: -Mirror sites,or approximate mirrors. Application:Don't want to show both in a search -Plagiarism,including large quotations. -Similar news articles at many news sites. Application:Cluster articles by "same story." 11

11 • Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.: – Mirror sites, or approximate mirrors. • Application: Don’t want to show both in a search. – Plagiarism, including large quotations. – Similar news articles at many news sites. • Application: Cluster articles by “same story.” Introduction

共61页，可试读20页，点击继续阅读 ↓

刷新页面下载完整文档

VIP每日下载上限内不扣除下载券和下载次数；
按次数下载不扣除下载券；
注册用户24小时内重复下载只扣除一次；
顺序：VIP每日次数-->可用次数-->下载券；