香港科技大学:Web-log Mining:from Pages to Relations

Web-log Mining: from Pages to relations Qiang Yang HKUST Hong Kong, China 2021/1/26
2021/1/26 1 Web-log Mining: from Pages to Relations Qiang Yang HKUST, Hong Kong, China

A Spectrum of Web-Log Miners Knowledge Knowledge rich Have logical relations over page contents Database generated pages Can make accurate predictions even without data Knowledge middle Some hierarchical representations of ontology and content Most cases Interesting! Can predict based on similarity Knowledge poor Have only page-level logs No relational knowledge Can predict observed pages only when data is plenty 2021/1/26
2021/1/26 2 A Spectrum of Web-Log Miners ◼ Knowledge Rich: ◼ Have logical relations over page contents ◼ Database generated pages ◼ Can make accurate predictions even without data! ◼ Knowledge Middle: ◼ Some hierarchical representations of ontology and content ◼ Most cases! ◼ Can predict based on similarity ◼ Knowledge Poor: ◼ Have only page-level logs ◼ No relational knowledge ◼ Can predict observed pages only when data is plenty Interesting! Knowledge

1. Knowledge-poor --Web-log mining Association rule based predictive model ABC D Left Hand Side Right Hand A B CF Side ABE AB C B. C B. CD.G B CD G C D 2 Current observations Window of prediction A B C Size=m 1 Extract rules 2 Select rules 2021/1/26
2021/1/26 3 1. Knowledge-poor -- Web-log mining Current Observations Window of Prediction A B C Size=m A,B,C,D A,B,C,F A,B,E B,C B,C,D,G C, D Left Hand Side Right Hand Side A,B C B,C,D G Association Rule based Predictive Model ? 1 2 1 Extract rules 2 Select rules

Rule-Representation Methods (min sup=2) Subset 2 {A,C}→>C PI Substring A C C BC→C B A C Latest Substring A C C C→C Subsequence Latest Subsequence 2021/1/26
2021/1/26 5 Rule-Representation Methods (min sup=2) ◼ Subset {A, C}→C ◼ Substring “BC”→C ◼ Latest Substring “C”→C ◼ Subsequence ◼ Latest Subsequence W1 W2 A1 A2 A3 P1 P2 A B C A C B C A C D C A C D C

Rule- selection criteria Among the rules whose LHs matches W1 Longest-Match Selection Select a rule whose left hand side is the longest to apply Corresponds to using the strongest signature to predict Most Confident Select the rule with highest confidence to apply Pessimistic selection UCEN is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of error UCFE, N cO N 2021/1/26
2021/1/26 6 Rule-Selection Criteria ◼ Among the rules whose LHS matches W1, ◼ Longest-Match Selection ◼ Select a rule whose left hand side is the longest to apply ◼ Corresponds to using the strongest signature to predict ◼ Most Confident ◼ Select the rule with highest confidence to apply ◼ Pessimistic Selection ◼ UCF (E,N) is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of error N U E N conf CF p ( , ) = 1−

Comparison matrix Rule Longest Match Most Confident Pessimistic Rep/Methods Selection Selection Selection Subset rules Subsequence rules ?? subsequence rules Substring rules atest-substring Best Com pared with c4. 5 as well Comparison Criteria: Precision Model size 2021/1/26 7
2021/1/26 7 • Compared with C4.5 as well. • Comparison Criteria: Precision & Model Size Rule Rep/Methods Longest Match Selection Most Confident Selection Pessimistic Selection Subset rules ? ? ? Subsequence rules ? ? ? Latestsubsequence rules ? ? ? Substring rules ? ? ? Latest-substring rules ? ? Best! Comparison Matrix

Integrating with Caching Cache replacement algorithm a key value k(p) is assigned to each object p ■[ Arlltt et a.USENⅨ1998,Cao& irani,97] K(P)=L+ F(P*C(p)/S(p) C(p: Cost of loading a page(e.g. amount of time S(p): Size of a page F(p): Frequency Count of a page L: An Inflation factor to reflect cache aging 2021/1/26
2021/1/26 8 Integrating with Caching ◼ Cache replacement algorithm ◼ A key value K(p) is assigned to each object p ◼ [Arlltt et al. USENIX 1998, Cao & Irani, 97] ◼ K(p) = L + F(p) * C(p) / S(p) ◼ C(p): Cost of loading a page (e.g., amount of time) ◼ S(p): Size of a page ◼ F(p): Frequency Count of a page ◼ L: An Inflation factor to reflect cache aging

Predicting future frequency O1:0.70 Session 1 o2:0.90 O2:0.30 Predicted Frequency W1=0.70+0.60+0.70=2.00 O1:0.60 W2=0.90+0.70+0.90=2.50 ession O2:0.70 W2=0.30+0.20 0.50 o3:0.20 W4=0.11+0.30 0.41 Os:0.42 W=0.42+0.33 =0.7 O1:0.70 O2:0.90 Session 3 O4:0.30 O:0.33 K=L+(W+Fi*C/s Wi: Future frequency Fi Past frequen 2021/1/26
2021/1/26 9 Predicting future frequency Session 1 O1: 0.70 O2: 0.90 O3: 0.30 O4: 0.11 Session 2 O1: 0.60 O2: 0.70 O3: 0.20 O5: 0.42 Session 3 O1: 0.70 O2: 0.90 O4: 0.30 O5: 0.33 W1 = 0.70+0.60+0.70 = 2.00 W2 = 0.90+0.70+0.90 = 2.50 W3 = 0.30+0.20 = 0.50 W4 = 0.11+0.30 = 0.41 W5 = 0.42+0.33 = 0.75 Predicted Frequency Ki = L + ( Wi+Fi ) * Ci / Si Wi : Future frequency Fi : Past frequency

Byte-hit Rate measures bandwidth reduction Byte Hit Rate=/ Bytes answered by cache I I Total bytes Byte Hit Rate vs Cache Size on NASA web log NGRAM GDSF GDSize LRU 0 0.002 0.004 0006 0008 0.01 Cache Size(%) 2021/1/26
2021/1/26 10 Byte-hit Rate measures bandwidth reduction Byte Hit Rate vs Cache Size on NASA web log 0 1 0 2 0 3 0 4 0 5 0 0 0.002 0.004 0.006 0.008 0.01 Cache Size (%) Byte Hit Rate % NGRAM GDSF GDSize LFUDA LRU Byte Hit Rate = | Bytes answered by cache | | Total bytes |

Prefetch VS No-prefetch Relative Network traffic NASA Fractional Latency (NASA) 85 NASA Relative Network Traffic 75 Fractionallatency (NASA) 65 55 E+UmL 45 000m004000 35 002004000001 00002000400060008001 -o-Ngram +-Ngramt+Prefetch Cache size(%g) +导的门 2021/1/26
2021/1/26 11 Relative Network Traffic (NASA) NASA Relative Network Traffic 0 10 20 30 40 50 60 70 80 90 100 0 0.002 0.004 0.006 0.008 0.01 Cache size(%) Relative Network Traffic (%) Ngram Ngram+Prefetch NASA Fractional Latency 25 35 45 55 65 75 85 0 0.002 0.004 0.006 0.008 0.01 Cache size(%) Fractional latency (%) Ngram Ngram+Prefetch Fractional latency (NASA) Prefetch vs. No-prefetch
按次数下载不扣除下载券;
注册用户24小时内重复下载只扣除一次;
顺序:VIP每日次数-->可用次数-->下载券;
- 《PowerPoint》课程PPT教学课件:第六章 使用PowerPoint创建演示文稿.ppt
- 南京大学:《嵌入式网络物理系统》课程教学资源(PPT讲稿)时光自动机 Timed Automata.ppt
- 《C程序设计》课程PPT电子教案:第一章 概述.ppt
- 《算法设计与分析 Design and Analysis of Algorithms》课程PPT课件:Tutorial 10.pptx
- 中国科学技术大学:《现代密码学理论与实践》课程教学资源(PPT课件讲稿)第1章 引言(主讲:苗付友).pptx
- 东南大学:《数据结构》课程教学资源(PPT课件讲稿)随机算法(主讲:方效林).pptx
- 动态内存分配器的实现(实验PPT讲稿).pptx
- Java面向对象程序设计:Java的接口(PPT讲稿).pptx
- 赣南师范大学:《计算机网络技术》课程教学资源(PPT课件讲稿)第十章 Internet概述.ppt
- 《编译原理》课程教学资源(PPT课件讲稿)第四章 语法分析——自上而下分析.ppt
- 《网络搜索和挖掘技术》课程教学资源(PPT讲稿)Lecture 1:Web Search Overview & Web Crawling.ppt
- 《程序设计语言》课程PPT教学课件(章节大纲).ppt
- 长春大学旅游学院:《计算机网络与网络安全》课程教学资源(PPT课件)第6章 计算机网络与网络安全.ppt
- JavaScript编程基础(JavaScript语法规则).ppt
- 《面向对象程序设计》课程PPT教学课件:第1章 Visual Basic概述(主讲:高慧).ppt
- 西安电子科技大学:Operating-System Structures(PPT讲稿).pptx
- 电子科技大学计算机学院:《现代密码学》课程PPT教学课件(密码学基础)第一章 引言.ppt
- 山东大学:《微机原理及单片机接口技术》课程教学资源(PPT课件讲稿)第九章 模数转换器与数模转换器.ppt
- 香港浸会大学:《Data Communications and Networking》课程教学资源(PPT讲稿)Chapter 10 Circuit Switching and Packet Switching.ppt
- 杭州电子科技大学:《计算机、互联网和万维网简介》教学资源(PPT课件)Chapter 01 C++ Programming Basics.ppt
- 中国科学技术大学计算机学院:《高级操作系统 Advanced Operating System》课程教学资源(PPT课件)第四章 分布式进程和处理机管理(分布式处理机分配算法).ppt
- 清华大学:ICCV 2015 RIDE:Reversal Invariant Descriptor Enhancement.pptx
- 中国人民大学:Similarity Measures in Deep Web Data Integration.ppt
- 《数据结构》课程教学资源:课程PPT教学课件:绪论(数据结构讨论的范畴、基本概念、算法和算法的量度).ppt
- 《计算机组装与维修》课程教学资源(PPT课件讲稿)第二章 计算机系统维护维修工具使用.ppt
- 东南大学计算机学院:《操作系统概念 OPERATING SYSTEM CONCEPTS》课程教学资源(PPT课件)Operating-System Structures.ppt
- 《数字图像处理 Digital Image Processing》课程教学资源(PPT课件讲稿)第2章 图像分析.ppt
- 《EDA技术》实用教程(PPT讲稿)第5章 QuartusII 应用向导.ppt
- 香港浸会大学:《Data Communications and Networking》课程教学资源(PPT讲稿)Chapter 4 Transmission Media.ppt
- 北京大学:《搜索引擎 Search Engines》课程教学资源(PPT讲稿)Evaluating Search Engines(Search Engines Information Retrieval in Practice).ppt
- 西安电子科技大学:《8086CPU 指令系统》课程教学资源(PPT课件讲稿,共五部分,王晓甜).pptx
- 北京师范大学网络教育:《计算机应用基础》课程教学资源(PPT讲稿)第8章 计算机安全、第9章 多媒体技术.pptx
- 沈阳理工大学:《Java程序设计基础》课程教学资源(PPT课件讲稿)第1章 创建Java开发环境.ppt
- 成都信息工程大学(成都信息工程学院):分层分流培养个性发展的计算机卓越工程师——专业课分层教学探索与实践.ppt
- 厦门大学计算机科学系:《大数据技术原理与应用》课程教学资源(PPT课件)第十章 数据可视化.ppt
- SIGCOMM 2002:New Directions in Traffic Measurement and Accounting.ppt
- 计算机问题求解(PPT讲稿)图论中的其它专题.pptx
- 西安电子科技大学:《操作系统 Operating Systems》课程教学资源(PPT课件讲稿)Chapter 08 多处理器系统 Multiple Processor Systems.ppt
- 国家十一五规划教材:《电子商务案例分析》课程教学资源(PPT课件)第11章 网络社区模式案例分析.ppt
- 南京大学:《计算机图形学》课程教学资源(PPT课件讲稿)计算机图形学引言(主讲:路通).ppt