Wrapper Generation and HTML Reduction (PPT lecture slides)
Yu Li

Outline
⚫ The web page extraction problem
⚫ SGWrap System
⚫ Problems with HTML
⚫ HTML reduction
  ○ Basic idea
  ○ Problem definition and goals
  ○ Page model
  ○ Algorithm design
⚫ Future work
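
The page extraction problem
⚫ A large amount of data exists on the Web in the form of semi-structured HTML pages
⚫ Web data integration requires converting this semi-structured data into structured data
⚫ The task of page extraction: convert semi-structured Web data into structured data according to the user's requirements
⚫ A program that performs the page extraction task is usually called a wrapper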

The page extraction problem
[Figure: a sample HTML page listing Web robots in a table with Name and Detail columns (entries such as "Ahoy! The Homepage Finder", with Platform, Purpose, and Availability details) is connected by a mapping to a schema tree containing a robot element; the wrapper implements this mapping.]
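
The page extraction problem
⚫ Page extraction can be carried out by
  ○ Hand-writing a wrapper: a conventional programming language is used, and the mapping is hard-coded in the wrapper program
  ○ Generating a wrapper with a tool: the wrapper program is generated with computer assistance
    ⚫ Extraction rules, interaction style, maintenance
  ○ Fully automatic extraction
    ⚫ Page structure segmentation, annotation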
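
To make the hand-written option concrete, here is a minimal sketch in Python. It assumes a hypothetical sample page modeled on the Web robots example from these slides; `SAMPLE_PAGE`, `RobotWrapper`, and `extract_robots` are illustrative names and not part of SGWrap. The mapping from table columns to record fields is fixed in the code, which is exactly what makes such wrappers costly to maintain when the page layout changes.

```python
from html.parser import HTMLParser

# Hypothetical sample page, modeled on the Web robots example in these slides.
SAMPLE_PAGE = """
<html><body><table>
<tr><th>Name</th><th>Detail</th></tr>
<tr><td><a href="ahoy.html">Ahoy! The Homepage Finder</a></td>
    <td>Platform: UNIX<br>Purpose: maintenance<br>Availability: none</td></tr>
</table></body></html>
"""


class RobotWrapper(HTMLParser):
    """Hand-written wrapper: the page-to-schema mapping is hard-coded below."""

    def __init__(self):
        super().__init__()
        self.robots = []     # extracted structured records
        self._cells = None   # text fragments for each <td> of the current row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append([])
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) == 2:  # data rows only; the header row uses <th>
                self.robots.append(self._to_record(self._cells))
            self._cells = None
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._cells[-1].append(data.strip())

    @staticmethod
    def _to_record(cells):
        # Hard-coded mapping: column 0 is the name,
        # each "Key: value" fragment in column 1 becomes a field.
        record = {"name": " ".join(cells[0])}
        for line in cells[1]:
            key, _, value = line.partition(":")
            record[key.strip().lower()] = value.strip()
        return record


def extract_robots(html):
    """Run the wrapper over one page and return a list of dict records."""
    parser = RobotWrapper()
    parser.feed(html)
    parser.close()
    return parser.robots
```

Calling `extract_robots(SAMPLE_PAGE)` returns one record with the keys `name`, `platform`, `purpose`, and `availability`. Any change to the table layout requires editing `_to_record` by hand.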

SGWrap System
⚫ SGWrap = Schema Guided Wrapper Generation
[Figure: the user interacts with the SGWrap System, which generates a wrapper program; the wrapper program runs on HTML pages and outputs data.]

SGWrap System
⚫ SGWrap consists of three main parts:
  ○ SGWrap Runtime (Runtime, for short), which provides access to our algorithms for web page content extraction. It is the underlying functional layer of the whole system; to reuse or integrate a wrapper, you also need to reuse or integrate the Runtime itself.
  ○ SGWrap Compiler (Compiler, for short), which compiles SGWrap rules into a wrapper in both source code form and bytecode form. It works like a translator: the generated source code is human-readable and can be modified to meet special needs. The bytecode is simply produced with the help of Java's compiler, javac.exe.
  ○ Visual SGWrap, a visual tool for generating rules. You interact with it through simple select-and-click operations, and it then computes the appropriate rules.
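
The rule-to-source translation performed by the Compiler can be illustrated with a toy sketch. This is not SGWrap's rule format or its generated Java code: the rule dictionary, `compile_rules`, and the sample page are assumptions made for illustration, and the paths use Python's `xml.etree.ElementTree` syntax, where positions such as `td[1]` are 1-based.

```python
import xml.etree.ElementTree as ET


def compile_rules(rules, func_name="wrapper"):
    """Translate {field: ElementTree path} rules into Python source code.

    The result is plain, readable source that can be edited by hand
    before it is compiled and run.
    """
    lines = [f"def {func_name}(root):", "    record = {}"]
    for field, path in rules.items():
        lines.append(f"    node = root.find({path!r})")
        lines.append(
            f"    record[{field!r}] = "
            "node.text.strip() if node is not None and node.text else None"
        )
    lines.append("    return record")
    return "\n".join(lines) + "\n"


# Hypothetical rules: one path per schema field.
RULES = {
    "name": "./body/table/tr/td[1]/a",
    "platform": "./body/table/tr/td[2]",
}

# Hypothetical, well-formed sample page.
PAGE = (
    "<html><body><table><tr>"
    "<td><a>Ahoy! The Homepage Finder</a></td><td>UNIX</td>"
    "</tr></table></body></html>"
)

source = compile_rules(RULES)  # "source code form"
namespace = {}
# Compiling the generated source plays the role that javac plays for SGWrap.
exec(compile(source, "<generated wrapper>", "exec"), namespace)
record = namespace["wrapper"](ET.fromstring(PAGE))
```

Printing `source` shows an ordinary function definition, which mirrors the point made above: because the generated wrapper is readable source, it can be adjusted by hand before it is compiled.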

SGWrap System – basic usage
[Screenshot: Visual SGWrap with the page D:\Robots.htm loaded, showing the Web robots table. The Schema and Rule panes offer the commands Open DTD, Add Mapping, Remove Mapping, Generate Rule, and Save Rules. A mapping under the Web robots schema lists the fields DataItem, DataPath (/HTML/BODY/TABLE/TBODY/TR[1]/TD[0]/A), MetaData, and Function.]
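
The DataPath shown in the screenshot, /HTML/BODY/TABLE/TBODY/TR[1]/TD[0]/A, reads as a walk down the page's element tree. The sketch below shows one way such a path could be resolved. It assumes that an index counts same-named sibling elements starting from zero (suggested by `TD[0]`, but not confirmed by the slides) and that a missing index means the first such child; `resolve_data_path` and the sample page are hypothetical. For simplicity it uses `xml.etree.ElementTree`, which only accepts well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# One path step: a tag name with an optional [index], e.g. "TR[1]" or "A".
STEP = re.compile(r"^([A-Za-z][A-Za-z0-9]*)(?:\[(\d+)\])?$")


def resolve_data_path(root, path):
    """Follow a DataPath-style path from the root element.

    Assumption: indices are zero-based and count only children with the
    same tag name. Returns the matched element, or None if nothing matches.
    """
    steps = []
    for raw in path.split("/"):
        if not raw:
            continue
        match = STEP.match(raw)
        if match is None:
            raise ValueError(f"malformed path step: {raw!r}")
        steps.append((match.group(1).lower(), int(match.group(2) or 0)))
    if not steps:
        raise ValueError("empty path")

    # The first step must name the root element itself.
    if steps[0][0] != root.tag.lower():
        return None

    node = root
    for name, index in steps[1:]:
        same_tag = [child for child in node if child.tag.lower() == name]
        if index >= len(same_tag):
            return None
        node = same_tag[index]
    return node


# Hypothetical, well-formed page modeled on the Web robots example.
PAGE = """<html><body><table><tbody>
<tr><th>Name</th><th>Detail</th></tr>
<tr><td><a href="ahoy.html">Ahoy! The Homepage Finder</a></td><td>Platform: UNIX</td></tr>
</tbody></table></body></html>"""

root = ET.fromstring(PAGE)
link = resolve_data_path(root, "/HTML/BODY/TABLE/TBODY/TR[1]/TD[0]/A")
```

Under these assumptions, `TR[1]` skips the header row, `TD[0]` selects the name column, and `link.text` holds the robot's name. Real-world HTML is often not well-formed, so a practical system would need a tolerant HTML parser in place of `ElementTree`.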

SGWrap System – basic usage
⚫ 3 steps
  ○ Design rules using Visual SGWrap
  ○ Compile the rules into a program using SGWrapC
  ○ Test and apply the wrapper using SGWrap (Runtime)
⚫ There is a tutorial at http://idke.ruc.edu.cn/sgwrap/doc/A-10-Minutes-Tutorial.html (also included in the documentation of each installation)

Welcome to http://idke.ruc.edu.cn/sgwrap
[Screenshot: the SGWrap (Schema Guided Wrapper Generation) System homepage, with the navigation links Introduction, News, Updates, Download, Document, Background, History, Publications, Developer, Contact, and Acknowledgement.]
What is SGWrap System
Schema Guided Wrapper Generation System (SGWrap, for short) is a toolkit for web page information extraction. Through user interactions, it semi-automatically generates programs called wrappers, which are built from extraction rules. A wrapper for a set of web pages is a program that extracts content from the pages and outputs structured data for further processing. A wrapper, materialized as a Java program by the SGWrap system, can be easily generated for a given set of pages using the Visual SGWrap tool, and can be reused or integrated in many information systems.