Managing XML and Semistructured Data: Teaching Resources (PPT lecture slides), Part 04: Compressing XML Data
Part 4: Compressing XML Data
Managing XML and Semistructured Data
In this section
▪ XML Compression
  • Motivation
  • The State-of-the-Art
    ▪ Queriable compressors
    ▪ Non-queriable compressors
Resources
▪ XMill: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2000
▪ Others: XGrind, XPress, XQuec, XMLzip, …
▪ XCQ: From my publications
▪ XQZip: From my publications
▪ MQX: From my publications
Introduction
▪ More and more XML data is created
  • Duplicate structures (tags, paths, …)
  • Data inflation: data in XML is much larger than raw data
  • Compression: storage and data transfer
▪ General-purpose compressor (e.g. gzip)
  • Characteristics of XML data not utilized
  • Unqueriable
Compression: The Problem
▪ XML for exchange (space or time)
▪ But XML is verbose and inflated due to
  • Duplicated tags and paths
▪ Users prefer application-specific formats:
  • E.g. Web server logs
▪ Is XML doomed to fail?
▪ Solution: XML-specific compressor
  • Non-queriable: XMill
  • Queriable: XQzip
XML-Specific Compressors
▪ Unqueriable compression (e.g. XMill):
  • Full-chunked: data commonalities eliminated
  • Very good compression ratio
▪ Queriable compression (e.g. XGrind, XPRESS):
  • Fine-grained: data commonalities ignored
  • Inadequate compression ratio and time
  • Support simple path queries with atomic predicates
Issues in XML Compression
▪ Compression ratios, compression time, query coverage, memory usage, … (see my survey paper in WWWJ)
[Table: Comparison of existing technologies. Columns: compression scheme, compression ratio (compared with gzip), compression time (compared with gzip), memory usage for compression, technologies used.]
An Example: Web Server Logs
ASCII file, 15.9 MB (gzipped 1.6 MB):
202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)
XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB); the same record, with element markup such as the apache:host and apache:referer tags not reproduced here:
202.239.238.16 GET / HTTP/1.0 text/html 200 1997/10/01-00:00:02 4478 http://www.net.jp/ Mozilla/3.1[ja](I)
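The inflation described on this slide can be illustrated with a minimal Python sketch. It wraps the pipe-delimited record above in XML elements and reports raw and gzipped sizes. The element names in `FIELDS` and the `<entry>` wrapper are hypothetical (the slide's actual tag set is not fully recoverable), and a single record will not reproduce the 15.9 MB / 24.2 MB figures.

```python
import gzip
from xml.sax.saxutils import escape

# Hypothetical element names, one per pipe-delimited field of the log record.
FIELDS = ["host", "requestLine", "contentType", "statusCode", "date",
          "extra1", "byteCount", "extra2", "extra3", "referer", "userAgent"]

# The sample record from the slide.
RAW = ("202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02"
       "|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)")


def to_xml(line):
    """Wrap each non-empty field ("-" means empty) in its own element."""
    parts = line.split("|")
    elems = "".join(
        f"<{name}>{escape(value)}</{name}>"
        for name, value in zip(FIELDS, parts)
        if value != "-"
    )
    return f"<entry>{elems}</entry>"


def sizes(text):
    """Return (raw byte count, gzipped byte count) for a string."""
    data = text.encode("utf-8")
    return len(data), len(gzip.compress(data))


if __name__ == "__main__":
    print("raw record :", sizes(RAW))
    print("XML record :", sizes(to_xml(RAW)))
```

Running the sizes over a whole log file rather than one record would be the natural next step for comparing against the numbers on the slide.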
XMill
▪ First specialized compressor for XML data
  • SAX parser for parsing XML data
  • Still using gzip as its underlying compressor
  • Clever grouping of data into containers for compression
▪ Compress XML via three basic techniques
  • Compress the structure separately from the data
  • Group the data values according to their types
  • Apply semantic (specialized) compressors
▪ Downloadable:
  • www.cs.washington.edu/homes/suciu/XMILL
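As an illustration of the third technique, a semantic compressor exploits the known type of a value. The sketch below assumes a container that holds only IPv4 addresses and packs each one into 4 bytes before any general-purpose compression. The function names are hypothetical and this is not XMill's actual encoding, only an example of the idea.

```python
import ipaddress


def encode_ips(values):
    """Pack each dotted-quad string into 4 bytes and concatenate them."""
    return b"".join(ipaddress.IPv4Address(v).packed for v in values)


def decode_ips(blob):
    """Inverse of encode_ips: split into 4-byte chunks and format each one."""
    return [str(ipaddress.IPv4Address(blob[i:i + 4]))
            for i in range(0, len(blob), 4)]


if __name__ == "__main__":
    hosts = ["202.239.238.16", "203.237.165.15"]
    packed = encode_ips(hosts)
    text_size = len("\n".join(hosts).encode("ascii"))
    print(f"text: {text_size} bytes, packed: {len(packed)} bytes")
```

The packed container could then be handed to gzip like any other container; the gain comes from removing the textual redundancy first.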
XMill Architecture
[Figure 4: Architecture of the Compressor. The input XML file is read by a SAX parser. A path processor, driven by container expressions given on the command line, sends the structure to a structure container and the data values through semantic compressors 1 … k into data containers 1 … k held in main memory. The containers are then written to the compressed XML output file.]
How XMill Works: Three Ideas
Compress the structure separately from the data:
202.239.238.16 GET / HTTP/1.0 text/html 200 …
gzip Structure + gzip Data = 1.75 MB
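The first idea, together with the grouping of values into containers, can be sketched in Python as follows. The sketch uses the standard `xml.sax` parser to split a document into a token stream for the structure and one list of values per element path, then compresses each part separately with `zlib`. The class name, the token format (`T<n>`, `#`, `/`) and the sample element names are assumptions for illustration, attributes are ignored, and this is not XMill's actual file format.

```python
import zlib
import xml.sax
from collections import defaultdict


class ContainerSplitter(xml.sax.ContentHandler):
    """Route tag structure and text values into separate containers."""

    def __init__(self):
        super().__init__()
        self.tag_ids = {}                    # tag name -> small integer id
        self.structure = []                  # structure token stream
        self.containers = defaultdict(list)  # element path -> text values
        self.path = []                       # stack of open element names
        self.text = []                       # buffered character data

    def _flush_text(self):
        value = "".join(self.text).strip()
        self.text = []
        if value:
            self.containers["/".join(self.path)].append(value)
            self.structure.append("#")       # marks "a value goes here"

    def startElement(self, name, attrs):
        self._flush_text()
        tag_id = self.tag_ids.setdefault(name, len(self.tag_ids) + 1)
        self.structure.append(f"T{tag_id}")
        self.path.append(name)

    def endElement(self, name):
        self._flush_text()
        self.structure.append("/")
        self.path.pop()

    def characters(self, content):
        self.text.append(content)


def split_and_compress(xml_bytes):
    """Return the handler, the compressed structure, and compressed containers."""
    handler = ContainerSplitter()
    xml.sax.parseString(xml_bytes, handler)
    structure_blob = zlib.compress(" ".join(handler.structure).encode("utf-8"))
    data_blobs = {
        path: zlib.compress("\n".join(values).encode("utf-8"))
        for path, values in handler.containers.items()
    }
    return handler, structure_blob, data_blobs


SAMPLE = (b"<log>"
          b"<entry><host>202.239.238.16</host><status>200</status></entry>"
          b"<entry><host>203.237.165.15</host><status>200</status></entry>"
          b"</log>")

if __name__ == "__main__":
    handler, structure_blob, data_blobs = split_and_compress(SAMPLE)
    print("structure tokens:", " ".join(handler.structure))
    for path, blob in data_blobs.items():
        print(path, "->", len(blob), "compressed bytes")
```

Keeping values of the same path together places similar strings next to each other, which is what gives the underlying general-purpose compressor more redundancy to exploit.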