University of Electronic Science and Technology of China: "Big Data Analysis and Mining" course teaching resources (lecture slides), Lecture 3 Hashing

Lecture 3 HASHING

Why do we need HASHING?
• Wal-Mart: 267 million items/day; 4 PB data warehouse
• Sloan Digital Sky Survey: New Mexico telescope captures 200 GB of image data/day
Challenges in big data applications:
1. Curse of dimensionality
2. Storage cost
3. Query speed

Example 1. Information Retrieval
h(Statue of Liberty) = 10001010
h(Napoleon) = 01100001
h(Napoleon) = 01100101 (one flipped bit)
• The codes for Statue of Liberty and Napoleon should be very different.
• The two codes for Napoleon should be similar.
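The slide does not name the measure used to compare codes. A common choice for binary codes is the Hamming distance (the number of differing bits), which is assumed in the minimal sketch below. The function name `hamming_distance` is illustrative, and the third code is taken to be the second code with one bit flipped, as the slide annotation suggests.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


# Codes from the slide (the last one assumed to differ by a single flipped bit).
liberty = "10001010"
napoleon_1 = "01100001"
napoleon_2 = "01100101"

print(hamming_distance(napoleon_1, napoleon_2))  # 1: similar items, similar codes
print(hamming_distance(liberty, napoleon_1))     # 6: different items, distant codes
```

Under this measure, similar items end up a few bits apart while unrelated items are many bits apart, which is the property the example asks of h.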

Example 2. Storage Cost
• Gist vector: 512 values per image; 10 million images take 20 GB
• Binary reduction: 128 bits per image; 10 million images take 160 MB
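The figures above can be checked with a few lines of arithmetic. The slide does not state the size of each Gist value; the sketch assumes 4 bytes per value (32-bit floats), which reproduces the quoted 20 GB. All constant names are illustrative.

```python
NUM_IMAGES = 10_000_000
GIST_VALUES = 512       # values per Gist vector (from the slide)
BYTES_PER_VALUE = 4     # assumption: each value is a 32-bit float
CODE_BITS = 128         # bits per binary code (from the slide)

gist_bytes = NUM_IMAGES * GIST_VALUES * BYTES_PER_VALUE
code_bytes = NUM_IMAGES * CODE_BITS // 8

print(gist_bytes / 1e9)          # 20.48 (GB)
print(code_bytes / 1e6)          # 160.0 (MB)
print(gist_bytes // code_bytes)  # 128 (compression factor)
```

Under that assumption, the binary codes are 128 times smaller than the original descriptors.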

Example 3. Fast Nearest Neighbor Search
Given a query point q (high dimensional), return the points closest (most similar) to q in the database.
KD-TREE: a KD-tree cannot handle high-dimensional data.
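For reference, the problem statement can be written as an exact linear scan. This baseline is not from the slides; it is a minimal sketch, assuming Euclidean distance, and the function name `nearest_neighbors` and the toy data are illustrative.

```python
import math


def nearest_neighbors(query, database, k=1):
    """Return the k database points closest to the query (exact linear scan)."""
    ranked = sorted(database, key=lambda point: math.dist(query, point))
    return ranked[:k]


# Toy 4-dimensional database; real descriptors have hundreds of dimensions.
database = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 1.0, 1.0, 1.0),
    (5.0, 5.0, 5.0, 5.0),
]
q = (0.9, 1.1, 1.0, 0.8)

print(nearest_neighbors(q, database, k=1))  # [(1.0, 1.0, 1.0, 1.0)]
```

Each query compares q against every stored point, so the cost grows with both the database size and the dimensionality. That cost is what hashing-based search tries to avoid.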

WHAT WILL WE TALK ABOUT?
1. Locality-Sensitive Hashing (Shingling + MinHash)
2. Learning to Hash

Locality-Sensitive Hashing Find Similar Items

Introduction
Many Web-mining problems can be expressed as finding "similar" sets:
1. Pages with similar words, e.g., for classification by topic.
2. NetFlix users with similar tastes in movies, for recommendation systems.
3. Movies with similar sets of fans.
4. Images of related things.
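This part of the slides does not yet say how the similarity of two sets is measured. A standard choice, assumed here, is the Jaccard similarity: the size of the intersection divided by the size of the union. The sketch applies it to the "users with similar tastes" case; the users and movie titles are made up for illustration.

```python
def jaccard_similarity(set_a: set, set_b: set) -> float:
    """Return |A intersect B| / |A union B|; two empty sets count as identical."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical users, each represented by the set of movies they liked.
alice = {"Alien", "Heat", "Up", "Jaws"}
bob = {"Alien", "Heat", "Up", "Big"}
carol = {"Big", "Elf"}

print(jaccard_similarity(alice, bob))    # 0.6 (3 shared out of 5 distinct)
print(jaccard_similarity(alice, carol))  # 0.0 (nothing in common)
```

The same function works for pages (sets of words), movies (sets of fans), or any other item that can be described as a set.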

CASE STUDY Finding Similar Documents

Introduction
• Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, e.g.:
  – Mirror sites, or approximate mirrors.
    • Application: Don't want to show both in a search.
  – Plagiarism, including large quotations.
  – Similar news articles at many news sites.
    • Application: Cluster articles by "same story."
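The lecture outline names Shingling and MinHash as the tools for this task, but this excerpt ends before they are described. The following is a rough sketch of the usual construction, not the lecturer's own code: documents become sets of k-character shingles, and a MinHash signature estimates the overlap between two shingle sets. The function names, the shingle length k = 5, the 100 hash functions, and the CRC32-based shingle IDs are all assumptions made for illustration.

```python
import random
import zlib

PRIME = 4_294_967_311  # a prime just above 2**32, so it exceeds every CRC32 value


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-character substrings of the normalized text."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(shingle_set: set, num_hashes: int = 100, seed: int = 0) -> list:
    """For each random hash h(x) = (a*x + b) mod PRIME, keep the minimum over the set."""
    if not shingle_set:
        raise ValueError("document is shorter than the shingle length")
    rng = random.Random(seed)  # same seed gives the same hash functions for every document
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; approximates set overlap."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "The quick brown fox jumps over the lazy dog near the river bank"
doc_b = "The quick brown fox jumps over the lazy dog near the river"
doc_c = "Hashing maps high dimensional data to compact binary codes"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimate_similarity(sig_a, sig_b))  # high: near-duplicate texts
print(estimate_similarity(sig_a, sig_c))  # low: unrelated texts
```

Each document is reduced to 100 integers regardless of its length, so near-duplicates such as mirror pages can be compared without keeping the full text in memory.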