北京大学:《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(PPT课件)MapReduce系统设计与实现 Web Search on MapReduce

Outline Last Course Review Reading Check Document Retrieval &Inverted Index ■PageRank Algorithm
Outline ◼ Last Course Review & Reading Check ◼ Document Retrieval &Inverted Index ◼ PageRank Algorithm

NC&IS Last Course Review
Last Course Review

ki vi k2 v2 ka va ka va ks Vs ke Ve map map map map a 1b 2 c3c6 a 5c 2 b 7 c 8 Shuffle and Sort:aggregate values by keys a 15 b27 c2368 reduce reduce reduce 12 S2 T3 S3
map map map map Shuffle and Sort: aggregate values by keys reduce reduce reduce k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 a 1 5 b 2 7 c 2 3 6 8 r1 s1 r2 s2 r3 s3

MapReduce "Runtime" a Handles scheduling Assigns workers to map and reduce tasks Handles "data distribution" Moves processes to data Handles synchronization Gathers,sorts,and shuffles intermediate data Handles errors and faults Detects worker failures and restarts -Everything happens on top of a distributed FS (later)
MapReduce “Runtime” ◼ Handles scheduling ◼ Assigns workers to map and reduce tasks ◼ Handles “data distribution” ◼ Moves processes to data ◼ Handles synchronization ◼ Gathers, sorts, and shuffles intermediate data ◼ Handles errors and faults ◼ Detects worker failures and restarts ◼ Everything happens on top of a distributed FS (later)

Ki vi k2 v2 ka ve ka va ks vs ke ve map map map map 1b2 a c 3 c 6 a 5c 2 b 7 c 8 combine combine combine combine 1b2 a c 9 a 5c 2 partition partition partition partition Shuffle and Sort:aggregate values by keys a15 b27 c2968 reduce reduce reduce S1 2S2 3s3
combine combine combine combine a 1 b 2 c 9 a 5 c 2 b 7 c 8 partition partition partition partition map map map map k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 Shuffle and Sort: aggregate values by keys reduce reduce reduce a 1 5 b 2 7 c 2 9 8 r1 s1 r2 s2 r3 s3 c 2 3 6 8

Word Count:Version 1 1:class MAPPER 2: method MAP(docid a,doc d) 3: H-new ASSOCIATIVEARRAY 4: for all term t∈doc d do 5: H{t}←-H{t}+1 Tally counts for entire document 6: for all term t∈Hdo 7: EMIT(term t,count H{t) Are combiners still needed?
Word Count: Version 1 Are combiners still needed?

Computing the Mean:Version 2 1:class MAPPER 2 method MAP(string t,integer r) 3: EMIT(string t,integer r) 1:class COMBINER 2: method CoMBINE(string t,integers [r1,r2,...) 3: sum←-0 4: cmt←-0 5: for all integer r E integers [r1,r2,...]do 6: sum←-sum+T 7: cnt←-cmt+1 8: EMIT(string t,pair (sum,cnt)) Separate sum and count 1:class REDUCER 2: method REDUCE(string t,pairs [(s1,c1),(s2,c2)...]) 3: sum←-0 4: cmt←-0 5: for all pair (s,c)E pairs [(s1,c1),(s2,c2)...]do 6: sum←-sum+s 7: cnt←-cnt+c 8: Tavg←-sum/cnt 9: EMIT(string t,integer ravg) Why doesn't this work?
Computing the Mean: Version 2 Why doesn’t this work?

Another Try:“Stripes” Idea:group together pairs into an associative array (a,b)→1 (a,c→2 (a,d)→5 a→{b:1,c:2,d:5,e:3,f:2} (a,e)3 (a,f)→2 Each mapper takes a sentence: Generate all co-occurring term pairs For each term,emit a{b:countp,c:countc,d:countd... Reducers perform element-wise sum of associative arrays a→{b:1,d:5,e:3} Key:cleverly-constructed data structure a→{b:1,c:2,d:2,f:2} brings together partial results a→{b:2,c:2,d:7,e:3,f:2}
◼ Idea: group together pairs into an associative array ◼ Each mapper takes a sentence: ◼ Generate all co-occurring term pairs ◼ For each term, emit a → { b: countb , c: countc , d: countd … } ◼ Reducers perform element-wise sum of associative arrays Another Try: “Stripes” (a, b) → 1 (a, c) → 2 (a, d) → 5 (a, e) → 3 (a, f) → 2 a → { b: 1, c: 2, d: 5, e: 3, f: 2 } a → { b: 1, d: 5, e: 3 } a → { b: 1, c: 2, d: 2, f: 2 } a → { b: 2, c: 2, d: 7, e: 3, f: 2 } +

f(BlA):“Pairs” (a,*)→32 Reducer holds this value in memory (a,b)→3 (a,b)→3/32 (a,b2)→12 (a,b2)→12/32 (a,b3)→7 (a,b3)→7/32 (a,b4)→1 (a,b4)→1/32 For this to work: Must emit extra(a,*for every b in mapper Must make sure all a's get sent to same reducer (use partitioner) Must make sure (a,*comes first(define sort order) Must hold state in reducer across different key-value pairs
f(B|A): “Pairs” ◼ For this to work: ◼ Must emit extra (a, *) for every bn in mapper ◼ Must make sure all a’s get sent to same reducer (use partitioner) ◼ Must make sure (a, *) comes first (define sort order) ◼ Must hold state in reducer across different key-value pairs (a, b1 ) → 3 (a, b2 ) → 12 (a, b3 ) → 7 (a, b4 ) → 1 … (a, *) → 32 (a, b1 ) → 3 / 32 (a, b2 ) → 12 / 32 (a, b3 ) → 7 / 32 (a, b4 ) → 1 / 32 … Reducer holds this value in memory

NC&IS Homework Reading
Homework Reading
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 北京大学:《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(PPT课件)MapReduce算法设计 Basic MapReduce Algorithm Design.ppt
- 北京大学:《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(PPT课件)MapReduce原理 MapReduce Theory and Practice.ppt
- 北京大学:《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(PPT课件)课程介绍 Introduction to Cloud Computing(主讲:彭波).ppt
- 《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(阅读材料)Data-Intensive Text Processing(MapReduce book 20100307).pdf
- 《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(阅读材料)MapReduce——Simplified Data Processing on Large Clusters.pdf
- 《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(阅读材料)The Google File System(GFS).pdf
- 《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(阅读材料)k-means++——The Advantages of Careful Seeding.pdf
- 《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(阅读材料)Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching.pdf
- 《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(阅读材料)The Anatomy of a Large-Scale Hypertextual Web Search Engine.pdf
- 上海中医药大学:课程教学大纲汇编合集——教学大纲(计算机中心、图书信息中心).pdf
- 北京中医药大学:《计算机基础》课程教学资源(PPT课件)第8章 模块.ppt
- 北京中医药大学:《计算机基础》课程PPT教学课件(Access 数据库程序设计)包装应用系统.ppt
- 北京中医药大学:《计算机基础》课程教学资源(PPT课件)第7章 宏.ppt
- 北京中医药大学:《计算机基础》课程教学资源(PPT课件)第5章 报表.ppt
- 北京中医药大学:《计算机基础》课程教学资源(教学大纲,Ⅱ).doc
- 北京中医药大学:《计算机基础》课程教学资源(电子教材)《Access 数据库程序设计》第5章 报表.doc
- 北京中医药大学:《计算机基础》课程教学资源(电子教材)《Access 数据库程序设计》第4章 窗体.doc
- 北京中医药大学:《计算机基础》课程教学资源(试卷习题)2009年9月全国计算机等级考试二级笔试试卷——Access 数据库程序设计(含答案).docx
- 北京中医药大学:《计算机基础》课程教学资源(试卷习题)2008年9月计算机等级考试二级(ACCESS真题试卷及答案).docx
- 北京中医药大学:《计算机基础》课程教学资源(试卷习题)全国计算机等级考试二级笔试试卷——Access 数据库程序设计.docx
- 北京大学:《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(PPT课件)Clustering问题 Clustering.ppt
- 北京大学:《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(PPT课件)并行与分布式系统基础 Introduction to Distributed Systems.ppt
- 北京大学:《大规模数据处理——云计算 Mass Data Processing Cloud Computing》课程教学资源(PPT课件)分布式文件系统 Distributed File systems.ppt
- 北京大学:《移动计算与无线网络》课程教学资源(学生PPT)课程实验——WLAN性能实证(802.11 Wlan无线通讯实验).ppt
- 北京大学:《移动计算与无线网络》课程教学资源(学生PPT)揭秘WLAN无线链路的丢包规律.ppt
- 北京大学:《移动计算与无线网络》课程教学资源(学生PPT)无线实验——距离障碍物等因素之影响.ppt
- 西安电子科技大学:《信息系统安全》课程教学资源(PPT课件讲稿)第一章 绪论(主讲教师:董庆宽).ppt
- 西安电子科技大学:《现代密码学》课程教学资源(PPT课件讲稿)第三章 分组密码.pptx
- 西安电子科技大学:《现代密码学》课程教学资源(PPT课件讲稿)第五章 消息认证算法.pptx
- 郑州大学:《计算机网络》课程电子教案(课件讲稿)第1章 概述.pdf
- 郑州大学:《计算机网络》课程电子教案(课件讲稿)第2章 物理层.pdf
- 郑州大学:《计算机网络》课程电子教案(课件讲稿)第3章 数据链路层.pdf
- 郑州大学:《计算机网络》课程电子教案(课件讲稿)第4章 网络层.pdf
- 郑州大学:《计算机网络》课程电子教案(课件讲稿)第5章 运输层.pdf
- 郑州大学:《计算机网络》课程电子教案(课件讲稿)第6章 应用层.pdf
- 唐山广播电视大学:Premiere Pro CC视频编辑——期末复习题及答案.doc
- 四川开放大学:《跨境电商》课程教学资源(试卷习题)期末考试试题一(试题).doc
- 四川开放大学:《跨境电商》课程教学资源(试卷习题)期末考试试题一(答案).doc
- 四川开放大学:《跨境电商》课程教学资源(试卷习题)期末考试试题三(试题).doc
- 四川开放大学:《跨境电商》课程教学资源(试卷习题)期末考试试题三(答案).doc