System Software and Software Security: course slides (PPT lecture notes, in English), Lecture-3-MR-model-and-systems

Building Big Data Processing Systems Based on Scale-Out Computing Models

Small Data: Locality of References
• Principle of Locality
  – A small set of data is accessed frequently, close together in time (temporal locality) and in address space (spatial locality)
  – Keeping it close to the processing unit is critical for performance
  – One of the few principles/laws in computer science
• Where can we get locality?
  – Everywhere in computing: architecture, software systems, applications
• Foundations of exploiting locality
  – Locality-aware architecture
  – Locality-aware systems
  – Locality prediction from access patterns
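To make the principle concrete, the following minimal sketch replays two synthetic access streams through a small LRU cache and compares hit ratios. The function name `lru_hit_ratio`, the cache capacity, and both streams are illustrative assumptions, not part of the lecture.

```python
from collections import OrderedDict


def lru_hit_ratio(accesses, capacity):
    """Replay an access stream through an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(accesses)


# Strong temporal locality: a hot set of 4 keys is reused over and over.
local_stream = [i % 4 for i in range(1000)]
# No reuse within the cache's reach: a cyclic scan over 100 distinct keys.
scan_stream = [i % 100 for i in range(1000)]

print("hot set:", lru_hit_ratio(local_stream, capacity=8))
print("scan   :", lru_hit_ratio(scan_stream, capacity=8))
```

With the hot set, only the first four accesses miss; with the cyclic scan, every key is evicted before it is touched again, so a cache of this size never helps.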

Conventional Databases: Move Data to Compute
• Centralized control to achieve ACID
  – Atomicity: if one part of a transaction fails, the entire transaction fails
  – Consistency: a transaction takes the database from one valid state to another valid state
  – Isolation: concurrent transactions do not interfere with each other (no sharing of intermediate results)
  – Durability: once a transaction is committed, its results are permanently stored
• A centralized approach (scale-up)
  – By contrast, scale-out means throughput increases as the number of nodes increases
• A vendor-controlled technical/business model
  – Expensive (designed for banks and high-profit organizations)
  – ACID may not be required for massive data processing
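As an illustration of atomicity and consistency, here is a minimal sketch using Python's built-in `sqlite3` module. The `account` table, the `transfer` helper, and the balances are hypothetical and chosen only for the example; the lecture does not name a particular database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint defines what a "valid state" means here (consistency).
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit to dst was rolled back together with the debit


def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # both updates succeed and commit
failed = transfer(conn, "alice", "bob", 500)  # debit violates CHECK, all undone
print(ok, failed, balances(conn))
```

In the second transfer the credit to `bob` is applied first and then undone when the debit fails, so no partial result survives.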

How to handle increasingly large volumes of data?
• A new paradigm (from the Ivy League model to the land-grant model)
  – 150+ years ago, Europe had completed the Industrial Revolution
  – But the US was still a backward agricultural country
  – Higher education is the foundation for becoming a strong industrial country
• Extending the Ivy League schools to accept students on a massive scale? Impossible!
• A new higher education model? A must!
• Land-grant university model: low cost and scalable
  – Lincoln signed the “Land Grant University Bill” (the Morrill Act) in 1862
  – It gave federal land to many states to build public universities
  – The mission was to build low-cost universities open to the masses
• The success of land-grant universities
  – Although the model is low cost and less selective in admissions, the excellence of education remains
  – Many world-class universities were born and established under this model: Cornell, MIT, Ohio State, Purdue, UC Berkeley, UIUC, Wisconsin, ...

Major Differences in New Infrastructure
• Shared with conventional databases
  – SQL continues
  – The enterprise data warehouse (EDW) framework continues
  – Other commonly used APIs and standards (e.g. JDBC, ODBC)
• Major differences
  – A scale-out computing model (e.g. MapReduce)
  – Based on commodity computing and storage systems
  – Scale-up software efforts and advanced hardware acceleration are additional efforts
  – Affordability is a requirement
  – Community-driven open source software

Major Issues of Big Data Systems
• Access patterns are unpredictable
  – Data analytics can be done in many different ways
• Locality may not be a major concern
  – Every piece of data may be important (e.g. in a keyword search)
• Major concerns
  – Scale-out: throughput increases as the number of nodes increases
  – Fault tolerance (nodes fail frequently)
  – Low-cost processing for increasingly large volumes
• These concerns are largely not addressed in existing systems

MapReduce Data Processing Engine
• A simple but effective programming model designed to process huge volumes of data concurrently
• Two unique properties
  – Minimum dependency among tasks (almost nothing is shared)
  – Simple task operations on each node (low-cost machines are sufficient)
• Two strong merits for big data analytics
  – Scalability (Amdahl's Law): increase throughput by increasing the number of nodes
  – Fault tolerance (quick and low-cost recovery from task failures)
• Hadoop is a widely used open-source implementation of MapReduce
  – Used for big data analytics by thousands of corporations and organizations that society depends on: AOL, Baidu, eBay, Facebook, IBM, NY Times, Yahoo!, ...
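The link between minimal task dependency and scalability can be sketched with Amdahl's Law: if a fraction p of the work can run in parallel on n nodes, the speedup is 1 / ((1 - p) + p / n). The symbols p and n, the function name, and the sample fractions below are assumptions chosen for illustration, not figures from the lecture.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Compare a mostly serial job with nearly independent (share-nothing) tasks.
for p in (0.50, 0.99, 1.00):
    row = ", ".join(f"n={n}: {amdahl_speedup(p, n):.1f}x" for n in (10, 100, 1000))
    print(f"p={p:.2f} -> {row}")
```

When tasks depend on each other, the serial fraction caps the speedup regardless of how many nodes are added; the closer p is to 1, the closer throughput comes to growing linearly with the number of nodes.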

MapReduce Operations on Hadoop
• Get the average salary of each of 2 organizations in a huge file: from {name: (org., salary)} to {org.: avg. salary}

Original key/value pairs: all the person names (key), each associated with its org name and salary (value)

Name     (org., salary)
Alice    (Org-1, 3000)
Bob      (Org-2, 3500)
...      ...

Result key/value pairs: two entries showing each org name and its average salary

org.     avg. salary
Org-1    ...
Org-2    ...
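The transformation from {name: (org., salary)} to {org.: avg. salary} can be sketched in plain Python as three steps: map, group by key, and reduce. This is a single-process illustration of the programming model, not Hadoop's API; the function names and the extra record for "Carol" are hypothetical.

```python
from collections import defaultdict

# Input key/value pairs: name -> (org, salary). "Carol" is a made-up record.
records = [
    ("Alice", ("Org-1", 3000)),
    ("Bob", ("Org-2", 3500)),
    ("Carol", ("Org-1", 5000)),
]


def map_fn(name, value):
    """Drop the name and emit (org, salary) as the new key/value pair."""
    org, salary = value
    return org, salary


def group_by_key(pairs):
    """Collect all values that share a key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(org, salaries):
    """Average all salaries emitted for one organization."""
    return org, sum(salaries) / len(salaries)


mapped = [map_fn(name, value) for name, value in records]
grouped = group_by_key(mapped)
averages = dict(reduce_fn(org, salaries) for org, salaries in grouped.items())
print(averages)
```

Each call to `map_fn` and `reduce_fn` touches only its own arguments, which is what allows a real engine to run them on different nodes.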

MapReduce Operations on Hadoop
• Calculate the average salary of every organization: from {name: (org., salary)} to {org.: avg. salary}
[Figure: the input file stored as blocks in the Hadoop Distributed File System (HDFS); one block is labeled “An HDFS block”]
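As a rough sketch of how a large file is divided into HDFS blocks, the helper below computes the byte range of each block. The slide does not state a block size; 128 MB is assumed here because it is the default `dfs.blocksize` in recent Hadoop releases, and the function name and the 300 MB file size are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return a list of (offset, length) byte ranges, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks


layout = split_into_blocks(300 * MB)
for index, (offset, length) in enumerate(layout):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

Only the final block is shorter than the block size; every other block is a full, independently storable unit that a map task can read without touching its neighbors.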

MapReduce Operations on Hadoop
• Calculate the average salary of every organization: from {name: (org., salary)} to {org.: avg. salary}
• Each map task takes 4 HDFS blocks as its input and extracts {org.: salary} as new key/value pairs, e.g. {Alice: (org-1, 3000)} to {org-1: 3000}
• 3 map tasks concurrently process the input data
[Figure: HDFS blocks feeding three Map tasks, whose outputs are marked as records of “org-1” and records of “org-2”]
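The map phase on this slide can be sketched as follows: 12 blocks are grouped into 3 input splits of 4 blocks each, and one map task per split runs concurrently. Threads stand in for cluster nodes purely for illustration; the synthetic block contents and all names are assumptions, and this is not how Hadoop schedules tasks internally.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCKS_PER_TASK = 4  # from the slide: each map task takes 4 HDFS blocks


def make_splits(blocks, blocks_per_task=BLOCKS_PER_TASK):
    """Group consecutive blocks into one input split per map task."""
    return [
        blocks[i:i + blocks_per_task]
        for i in range(0, len(blocks), blocks_per_task)
    ]


def map_task(split):
    """Turn every {name: (org, salary)} record into an (org, salary) pair."""
    output = []
    for block in split:
        for _name, (org, salary) in block:
            output.append((org, salary))
    return output


# Synthetic input: 12 blocks holding one made-up record each.
blocks = [
    [(f"person-{b}", (f"org-{b % 2 + 1}", 3000 + 100 * b))]
    for b in range(12)
]

splits = make_splits(blocks)
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    # pool.map preserves split order, so map_outputs[i] belongs to splits[i].
    map_outputs = list(pool.map(map_task, splits))

for task_id, pairs in enumerate(map_outputs):
    print(f"map task {task_id}: {pairs}")
```

No map task reads another task's split or output, so a failed task can be rerun on its own split without affecting the others.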
