首页
 
第1页
了解哪些网页是最
网络搜索引擎的用户有用的是一个关键任务
网络信息检索(ir)的。 大多数与过去的作品
使用超链接分析的算法来解决这个概率,
莱姆。 然而,很少研究都集中在查询
独立的web数据网络红外清洗。 本文
我们首先提供重新之间的差异分析
trieval目标网页和普通的基于多
超过30万网站无论从文本获得页
检索会议(trec)和广泛使用中文
搜索引擎搜狗(www.sogou.com)。 我们进一步
提出了一种学习的数据清洗重新算法
ducing的网页不太可能对用户有用
请求。 我们发现存在着相当大的比重
低质量的网页中英两种网页
中文网页主体,目标页面和检索
可以确定使用查询无关的功能
和清洁的算法。 实验结果
表明我们的算法是有效减少大
网页的一部分在检索与小损失的目标
页。 它可以使web信息检索工具以满足
很大一部分用户的需要只有一小部分
在网络上的网页。 这些结果可以帮助网络搜索
发动机更充分地利用有限的存储
计算资源以提高搜索性能。
导言
在网络上数据的爆炸性增长使得informa公司,
重刑管理和知识发现越来越多昼夜温差,
ficult。 在web文档集合的规模已
其中大多数网站的主要障碍为基础的信息
管理技术,如网络信息重新
trieval(ir)和web数据挖掘。 在网页的数量
dexed通过网络信息检索工具(或搜索引擎)
一直在高增长速度。 例如,谷歌
(http://www.google.com/)索引超过80亿页
2004年12月:约20倍之多什么索引
在2000年(沙利文,2005年)。 然而,即使这样的
大量的网页仍无法涵盖整个规模
网络,其中已经包含超过200亿
表面网页和1300.00亿网页中深
2003年2月,根据项目的多少信息
(莱曼和瓦里安,2003)。
2005年9月,谷歌下降至其主页
在其指数,这表明著名的网页数
结束之间的商业搜索中文索引大小战争
希内斯(hedger已,2005年)。 这一事件还表明,web信息检索
工具目前更注重数据的数据质量比
数量。 这是众所周知网络是充满了嘈杂,
不可靠的,低质量,有时甚至相互矛盾的数据。
因此,数据清理的过程是必要的,然后再重新
trieval的执行。 此外,清洗的过程
应查询独立的因为清洗数据集
应该满足web各种搜索请求。 这里
出现该问题:质量为目标的网络估计
网页是密切相关的用户信息需求,这
体现在他们的查询,但过程的质量
估计已成为users'queries独立。 这就是为什么
网页质量评估被认为是最伟大的
挑战由henzinger,,和搜索引擎
西尔弗斯坦(2003年谷歌)。
虽然这是一项艰巨的任务进行web数据清洗,
荷兰查询独立,一些以前的作品可能有帮助。
学报信息科学与技术,58(12协会):001 - 015,2007
数据清洗的web信息检索
独立使用查询功能
刘依群,闵张和荣威岑
国家重点实验室智能技术与系统,清华大学,北京,
中华人民共和国中国。 电子邮箱:liuyiqun03@mails.tsinghua.edu.cn
韫茹
研发中心,搜狐公司,北京,中华人民共和国中国。 电子邮箱:ruliyun@sohu-rd.com
马少平
国家重点实验室智能技术与系统,清华大学,北京,
中华人民共和国中国。 电子邮箱:msp@tsinghua.edu.cn
接受2007年1月4日
© 2007威利期刊公司
在网上公布00 xxxxxx号2007
威利跨学科(www.interscience.wiley.com)。 日期:10.1002/asi.20633
asi5812_0150_20633.qxd 07-7-12下午8点13页1

第2页
例如,成功的pagerank(布林和佩奇,1998)
和其他超链接分析算法如点击数
(克莱因伯格,1999年)证明是可能的评价
网页的重要性没有查询的信息。 操作方法
以往,这些超链接分析算法估计质量
一个在其网页被访问的可能性而不是
它的用处对于搜索引擎用户。
为了净化网络数据按照其重要性
网络搜索的用户,我们提出一种新的数据清洗
方法:首先,我们试图找到检索之间的差异
目标网页(网页可用于某些web搜索答案
查询,看到目标页面检索功能)和普通
页。 分布的差异通过更多的分析
超过3700万个网页从bothanenglish语料库(。政府网站
在trec通过体)和汉语语料库(收集
sogou.com)。 据统计比较,有几个
查询独立的功能被发现能够告诉一个差别
3-2320/3-5973页面之间的这两种。 然后学习
算法基于这些功能的目的是净化
使用web数据检索的目标页面分类。
我们工作的主要贡献如下:
1。 阿查询的研究独立的网页功能
提请进行了明确的区别
是一吐温检索目标网页和普通网页。
2。 alearning为基础的方法来定位提出高质量的
网页会自动根据的机会
成为检索目标网页而不是probabili,
被访的关系。
3。 实现更好的检索性能的可能性
与清洗页面设置进行了讨论。
该文件的剩余部分安排如下:
相关工作给出了在web数据有关工作的简要回顾
清洗和网页分类。 特征检索
目标网页检索比较焦油之间的差异
获得普通网页和查询网页与独立功能
分析(既常用的功能和新亲
构成的)。 学习基于web数据清洗算法
介绍了清洁算法,包括数据的详情
查询无关的特点和学习方法。 那个
实验结果给出实验结果
和讨论以评估我们的算法的性能。
第6条赋予的文件和一些可能的结论
今后研究的问题。
相关工作
数据清洗的web信息检索
搜索引擎设计师已经认识到的重要性
数据清洗或网页质量评价相当长时间
时间。 2003年,henzinger和联营公司(2003年谷歌)
建议将是非常有用的搜索引擎
能够识别网页的质量独立
一个给定用户的要求。 然而,由于缺乏
大型网页令,一些研究人员,除了
那些在非常重视企业的有关问题。
根据义,刘,李(2003年),大部分工作
在ir研究为导向的网页数据清洗
分为两类:本地规模的数据清洗和
全球规模的数据清洗。 地方规模用来清洗
消除个别网页内的噪音,而全球规模
清洁与低质量的数据处理规模不小于
以上的网页。
目前已在当地几个大型网络数据工程
清洗以满足web数据挖掘研究人员的需要。
即使在万维网(www)的出现,有
已存在半年清理数据的几项研究,
如bibtex结构化文件格式。 目前,该
清洁类经常被描述为两个阶段进行:
分割成块的网页,然后定位
有用块或丢弃无用。 对于网页
分区,许多研究人员认为使用标签
信息以及划分的基础上的标签式网页
(buyukkokten,加西亚,莫利纳,&佩普基,2001;卡西宁,
阿尔托宁,科拉里,梅拉科斯基,与拉科,2000年;黄及富,
2000)。 一些技术(巴特勒,刘和浦,2001;恩布利,
江及伍,1999;黄和林,2004)是根据
分析双方的布局和实际内容的网站
页。 蔡,余,文,马(2003年)所使用的布局struc -
雷斯以建立一个网页视觉结构和实现
分割任务的视觉结构上。 此外
使用网页内的资料,研究人员试图寻找
在网站内噪声数据的共同风格(称为网站
样式由易树,刘,与李[2003])。 有了这种方法他们
在一个划分成主要内容的网站的网页
块和噪音块。 之后网页被划分成
几个街区,算法的学习机制
(宋,刘,温,与马,2004年)或超struc基于
ture(蔡,他,温家宝,与马,2004年)可以进行定位
重要的大厦或清洗不重要的。
大部分全球规模的数据清理工作的基础
对网页超链接结构。 最近,很多
研究人员一直试图改善现有的超链接分析
如pagerank算法(布林和佩奇,1998年)和点击数
(克莱因伯格,1999年)。 这种分析是基于两个
基本假设,根据建议的克拉斯韦尔和
同事(克拉斯韦尔,霍金,与罗伯逊,2001年):认可控制人的,
ommendation假设和主题地区的假设。
假设如果两个网页连接的超链接,
网页链接的建议由它的网页链接
建议 )和两个网页有着相似的话题
地方 )。 链接分析算法所使用的许多
商业(如网页的工作搜索引擎,
布林,,与格拉德,1998年),通过了许多
研究人员(如库马尔,拉加,拉贾戈帕兰,&
汤姆金斯,2006)。 这些算法依赖于这两条,
sumptions,只为一个理想的网络环境下举行。
然而,目前的万维网充满垃圾链接和
广告链接等的假设以及算法
他们依据的诺特运作良好。 教唆全球规模
数据清洗算法应该使用的其他信息
从内页和跨页。 然而,我们的
知识,出现了基于这种算法的研究很少。
2
学报信息科学与技术协会,2007年10月
日期:10.1002/asi
asi5812_0150_20633.qxd 07-7-12下午8点13分第2页

第3页
我们的数据清洗算法可以被视为全球
规模之一。 它使查询中使用的独立页面功能
无论从超链接分析和页面布局的分析。
这些功能是独立的用户请求以使
清洗过程可以进行估算的网页
质量脱机方式。
高品质网页分类
网页分类在我们的数据清洗
算法不同的潜在目标页面的检索
使用查询功能普通的独立。 重新
trieval目标页面分类问题有类似
困难与网页分类问题解
由于抄,汉,和张(2004年在缺乏)负
例子。 积极的例子可以说明一个数字
评审使用诸如池(霍金与技术
克拉斯韦尔,2005)。 不过,可能有多种原因
网页不是一个高品质的一所以均匀采样
不带偏见的几乎是不可能的。
几个学习无标记的数据为基础的机制
并积极提出一些例子以办公室策划,
plish的网页分类的工作。 技术
pebl学习框架(郁等人。,2004年),semisupervised
学习(nigam的麦卡勒姆,史朗和米切尔,2000),单
课堂学习(张健,1998年),和一类支持向量
机(osvm)(manevitz&优素福2002年)已经
以解决这个问题。 这些算法被证明是
与传统的有效的网页分类处理
的问题。 然而,他们可能不适用的检索
目标理由如下网页分类。
1。 它们不是用于低维和高
密度实例空间,这是重要问题
分类检索目标页面(这个数字查询
独立的功能通常是很小的,但人数
实例是巨大的)。
2。 这些算法的若干规定知识
在普遍建立积极的实例比例,
这不是供检索目标分类。
3。 这些算法通常耗时所以他们
不适合的清洗数十亿的网页重新任务
获得性的搜索引擎。
这些算法是不同的,我们的数据清洗方法
学习方法的基础上(米切尔,1997年的朴素贝叶斯,
第一章。 6),这被认为是有效和高效率
低维空间的实例。 此外,我们的方法
doesnotrequirepriorknowledgeoftheoriginaldataset.finally,
我们的方法是查询的应用独立的功能
从内页和跨页,其中涉及
信息比超链接分析算法。
特征检索目标网页
类型检索目标页面
据沙利文(sullivan公司,2003年),商业
搜索引擎处理用户请求每天数以百万计。
用户的要求产生了大量种类繁多的
检索查询和相应的目标页面。 但是,
这些目标网页被认为有共同的东西在com,
纹如他们的知名度和可靠性。 这些共同
属性的特征我们目前queryindependent结果
在查询的独立目标页面的检索功能。
在分析检索目标之间的差异
和普通的网页,我们首先制定明确的图片
什么是检索的目标页面。 论基础查询日志
分析alta vista酒厂,布罗德(2002)和罗斯和莱文森
(2004年)分为三类网络搜索查询:
导航,信息,和交易查询。 那个
之间的关系这三类问题及其
相应的目标网页列于表1。
这种分类分类是从大型
搜索引擎调查。 这是接受了大部分网络搜索
研究人员由节能全文检索最近通过的
干扰(trec)网站首曲目(霍金与克拉斯韦尔,2002;
霍金与克拉斯韦尔,2003;克拉斯韦尔及霍金,2004)。
主要的区别航行和informa -
tional /交易类型的查询是用户是否有
固定的搜索目标页面或没有。 对于航海类型的查询,
用户有一个固定的搜索和搜索目标的目的是
到达特定网页。 另一方面,当
用户提交的信息或交易类型的查询,
他或她没有一个固定的搜索目标网页。 代替,
用户有一个更广泛的搜查的目的并希望得到
就某个问题的资讯或服务的来源。
导航式查询有关两种
搜索任务:首页找到(目标页是一个家
页)发现和命名页(目标网页是partic,
ular页面描述它的名字叫做 命名页 )。
网络搜索任务处理的信息与交易
类型查询通常被称为 主题提取和肺心病,
响应目标页被称为 关键资源 成为一个
关键资源,页面应该是一个很好的切入点到web
网站。 本网站应提供可信的信息
某些主题。 应主要用于专题
而不是成为一个更大的地盘这也是主要的一部分
讨论主题(这个定义是从霍金&
克拉斯韦尔,2003)。 根据这一定义,大多数的家庭
网页属于关键资源的类别因为
几乎所有的人都进入网页能够满足需要
指定。 然后我们就可以进行分类网络搜索目标页
(包括网页,命名网页和)关键资源
学报信息科学与技术协会,2007年10月
3
日期:10.1002/asi
表1。 查询类型及其相应的搜索任务/目标页
(布罗德的基础上,2002年)此外,与搜索任务和目标网页类型
(根据克拉斯韦尔及霍金,2004)。
查询类型
搜索任务
目标网页类型
比例
导航
命名页面发现
命名页
约20%
主页发现
信息/
主题提取
关键资源
大约80%
交易
网页
asi5812_0150_20633.qxd 07-7-12下午8点13分第3页

第4页
到指定网页或关键资源,有两大类,一
见表1。
有两种方法可以到网站数据清理
任务:减少无用的网页或检索对象的定位。
减少无用的页面可能是困难的因为一个网页
可以是无用的各种原因:冗余,垃圾邮件,假
信息等。 因此,很难
拿起的各种无用的网页。 并没有减少一
单无用的网页可能会导致失败的
整个清洗任务。 在这种情况下,这将是一个更好的主意
侧重于寻找提取焦油因为只有高品质的网页,
获得网页被视为重要的红外面向网络数据
清洗任务。 然而,虽然人数检索
目标类型远比无用的网页
类型,还有两种类型的检索目标页面:导航,
igational网页和信息的网页。 是否应该
区别对待? 重新之间有什么区别,
trieval目标网页和普通网页? 我们回答这些
问题在大型网站上的统计分析
页面语料库。
查询无关的目标页面的检索功能
每个web页面是独立的几个特点
搜索引擎用户的要求(见例子见表4)。
这些功能大多是不相关的网页内容,如
在链接数,外的联系号码,pagerank值,和网址
长度,有些功能是内容相关的,如编码
信息和篇幅的链接锚文本。 所有这些有限元分析
雷斯不能自定义的搜索引擎用户让他们
被称为 查询独立的功能
如果我们要进行独立的数据清洗
用户查询,很自然地利用这些功能的使用是,
导致它们是独立的用户查询。 然而,
一直在查询的差异没有研究,独立
特征提取与目标页面和普通的网页
据我们所知。 在我们的研究,我们发现
这些功能可以告诉检索焦油之间的差异
获得网页和普通网页根据我们的分析
大型网页语料库。
weanalyzedtwodifferentwebpagecorpora:对。政府站心病,
1
,这是由120万英文网页,
与搜狗体,其中包含三千七点○万中
网页。
2
这两个语料库的一些特征
列于表2。
。gov是一trec(http://trec.nist.gov/)测试集,
在trec的跟踪服务网络2003年至2004年。
这令是组织严密许多web信息检索研究
已经执行的基础上它。 然而,抓取
5年前英寸政府网站域名的网页收集。
这可能是该研究限制使用此语料库
仅用于实验。
为了获得更近期的和有代表性的web测试
我们的清洁的研究数据收集,我们收集到的
搜狗体的随机取样的网页
网页收集索引sogou.com。 的规模
搜狗收集大约4.3%,全中文网页,
其中载有大约8.7亿在年底网页
2005年根据中国互联网络信息
中心(cnnic)的报告。
3
它不局限于特定的做,
主要的因此是较实用。政府网站数据集。
虽然中文网站集合可被视为一个特定的
在国际范围内收集,大部分的结论
不应该改变在多语言环境得到大大的话
查询,我们选择了独立的特点是不具体
中文。
一般而言,。政府网站收集了。政府网站域名
只是,这样网页的整体素质高于搜狗的,
这是抓取的网页构成的所有领域。 但是
后者令是最近的更具代表性。
我们建立检索目标网页样本集为语料库。
这些规定作为在学习过程中积极的例子。
为。政府网站设置的样本选自trec2003 - 2004
网络跟踪答案(克拉斯韦尔及霍金,2004年;霍金&
克拉斯韦尔,2003)和搜狗令规定的样本
标有三个使用池技术评审
4
(霍金与克拉斯韦尔,2005)。 我们使用下面的步骤
得到的检索目标页面设置示例:
1。 收集用户请求的数目可以代表
大多数用户的利益。
2。 从这些要求的流行抓取搜索结果
ular搜索引擎和建立一个每个重组后池
寻求与这些结果。
3。 衡量一个结果可以作为一个检索视为
目标页面为给定的要求和目标形成检索
网页样本集。
在这样的质量和检索的统一目标
网页抽样取决于选定的用户请求
可以代表大多数用户的利益。 trec提供网络跟踪
从搜索引擎收集的数百名查询
日志或设计评审。 此外,我们收集650
从搜狗搜索查询记录代表使用者的搜寻
在几个流行的领域例如电影,请求/电视明星,
歌曲,软件,电影,小说,电脑/电视游戏,人们的
4
学报信息科学与技术协会,2007年10月
日期:10.1002/asi
表2。 的统计。gov和搜狗语料库。
编号
总平均域名检索
语言的网页
大小
文件大小
限制
时间
。政府网站
1247753英语
18.1摹15.2 k表。政府网站
2002年年初
搜狗三千七百二十零万五千二百一十八中558.0摹15.0亩没有限制2005年11月
1
详细信息在网上http://es.csiro.au/trecweb/govinfo。
html格式。
2
简化版本可在网上http://www.sogou.com/labs/
分升/ t.html。
3
这份报告可在网上http://www.cnnic.cn/download/2006/
20060516.pdf。 但是,有没有其他语言版本。
4
可能检索的目标页面池是使用5众所周知的
中文搜索引擎:baidu.com,google.com,yisou.com,sogou.com,
和zhongsou.com。
asi5812_0150_20633.qxd 07-7-12下午8点13分第4页

第5页
名称,新闻专题,立场和体育。 有了这些
查询,检索我们建立了一个目标网页样本集节能,
tains 2,631页,。gov和48930的搜狗网页。
我们使用测试有关这些网页的有效半
数据清洗算法岬(1732页,。政府网站,
24927页搜狗)和其他算法
训练。
培训和搜狗体测试集是
最大的检索目标页面设置有史以来使用的web信息检索
就我们所知的研究人员。 然而,这是不可避免的
这些样本集只能涵盖了潜在用户的一小部分
建议要求网络搜索引擎。 幸运的是,
搜索用户的要求可分为三类
有可能组织一次这三个均匀采样
种查询。 搜狗都和我们的训练集
。政府网站语料库的设计涵盖所有三种网络
搜索查询,信息,导航,事务处理
正常运作。 这些查询类型设置的分数按
实际的网络搜索环境(见表1),这样我们页
集被认为最可靠的检索目标
网页样本集使用过。
此外,我们只有通过查询功能独立
我们的数据清洗算法。 这意味着尽管检索
目标网页是通过查询收集的,没有rela -
tionship之间的查询检索的内容和目标
页面属性我们发现。 有足够多的检索焦油
得到代表所有网页查询类型。 因此,抽样
可被视为一个模拟的实际users'requests
在我们的数据清洗算法的训练过程。
随着检索到语料库和相应的分析
目标样本集,我们发现目标页面的检索
完全不同的查询,从普通的独立功能
的。 相关值
5
目标和与检索
在一些普通的页面查询独立的功能
图1显示的。政府网站语料库。
在图1中,我们使用五个查询独立的功能
显示检索目标网页之间的差异
普通的网页。 这些功能是文件长度(数
在某网页的话),锚固长度(数目
字在链接锚文本为某网页),
的pagerank(获得使用由布林所描述的算法
&页[1998]),入度(数量在体),和outde,
格力(数目外链接)。 我们可以进行以下肥胖,
servations从图1所示的统计数据:
1。 检索目标网页和普通网页有不同的
统计分布的查询值无关
特征。 采取的pagerank,例如:相关
页面之间的命名和普通页值为0.07,
这是一个缺乏相互联系。
2。 检索的目标网页的两种,即名为
网页和关键资源的网页,在这些类似查询
独立的功能。 所有五个的相关值
页面功能之间的命名和关键资源网页
高于0.8的,显示这些二种
网页是成正比的。 这意味着尽管
目标页面的检索这两种类型的不同,
耳鼻喉科搜索请求,我们应该把目标页面检索
作为整个网络而不是独立红外定向数据
清洗研究。
为了了解如何检索目标网页的行为
不同于一般的网页,我们必须研究的度
(有多少个链接数为网页)的分布
在两个语料的网页。 的统计分布
在度值如图2所示。 在度分析
因为这是一个重要的web信息检索功能导向
在研究和超链接结构的关键因素之一
分析算法。
学报信息科学与技术协会,2007年10月
5
日期:10.1002/asi
图。 1。 在qurey差异独立的功能分布代表
由相关的值。 分类轴显示的查询独立的功能。
5
相关值定义为
,其中乔夫( 的x,y)
代表数组协方差的值 x和 y。 是用来描述
关系的两个或两个以上变量。 相关系数可以
范围从1.00至1.00。 值的1.00是一个完美的内加,
而略去相关值为1.00是一个完美的积极科雷,
lation。 值的0.00代表缺乏相互联系。
乔夫的x,y)/
x
š
ý
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
50%
45%
40%
35%
30%
25%
20%
15%
10%
5%
0%
100%
1
2
4
8
16
32
64
128
256
其他
在-
在- degreree
普通
检索目标
(一)
1
2
4
8
16
32
64
128
256
其他
在-
在- degreree
(二)
普通
检索目标
比例
比例
图。 2。 在不同的检索目标和普通度分布
在页(1)搜狗体和(b)。政府网站语料库。
asi5812_0150_20633.qxd 07-7-12下午8点13分第5页

第6页
在图2中,最普通的网页有一个小数量,
贝尔在这两个语料库三通。 大约90%的网页中
搜狗和50%的页面。政府网站少于2
在三通据统计。 然而,检索目标
页面有很多的,学位:90%的目标更高的检索
搜狗和75%英寸政府网站有两个以上的三通。
这可以解释的是检索目标
不仅受到欢迎的搜索用户而且也由其他网站
网站和网页。 我们还发现网页。政府网站普遍
有更比搜狗的联系,因此,。政府网站
网页更受欢迎。 这种差异是由于不同的,
这两个语料库入口建设战略。 。gov是
目的是收集高质量的网页。 因此,其收藏
仅限于。政府网站域名,其中网页质量
高于整个网络。 同时,搜狗是山口,
lected建立整个网站收集的代表
中文版网站以便在网页的质量损失是可以预见的。
除了相关的功能的超链接结构,我们有
其他几个资料来源以确定不同的,
如doclength功能,显示在分配办法,
图1。 表3显示了另外两项内容相关的功能
可用于分离取回奥尔迪目标页面,
冠状动脉的。
根据表3,很少有统一的检索目标
资源定位符(url)的问号。 这表明
从动态的网页(网址的信息
“?”总是确定一个动态页面)并没有被接受的
用户的静态网页。 这可能是由于低
质量论坛或博客内容,通常是提出
动态网页。 我们也可以看到在表3中的每
centage非gbk(汉字内码规范)
编码检索目标页是很低的。 这可能是解释
这一事实sogou.com是一个中文搜索引擎
用户主要是注意在simpli书面网页,
fied中。
对前面提到的统计分析的基础上,
我们发现检索目标网页的行为不同于
在一个查询号码,普通的独立的功能。
这些功能列于表4我们的学习为基础的数据
清洗方法取决于他们的网站上清洗
信息检索数据的研究。
学习基于web数据清洗算法
目前已进入积极的,例如一些研究,
基于web的网页分类根据部分高
质量网页分类。 如这些作品,
pebl学习框架(郁等人。,2004年),改进治癌,
tional学习算法。 我们过去的工作试图
网络中采用清洁的数据但这些算法
有限的成功。 2004年(刘,张,与马,2004年),我们
成功地减少了80%。与的id3政府网页
决策树算法。 amajority的web检索请求
(信息/交易类型的查询),也有更好
表现较整个心病设置清洗数据
脓液。 然而,决策树学习需要的知识
对在体正面的例子的比例,
这是很难获得。 2005年(刘,王,张,&
马,2005年),我们采用的k -均值算法以防止
积极的比例问题并选择了约一半
。政府网页检索性能获得类似的
与通用的网络搜索整个主体
测试集。 然而,k -均值算法容受
问题的时间效率低不适合practi -
cal申请表。
在这项研究中,我们采取天真贝叶斯学习算法,
rithm(米切尔,1997年,第一章。6)解决检索目标
网页分类问题因为这是其中最
对学习问题的实际和有效的办法,
荷兰国际集团进行分类的文本文件或网页。 它也可以亲
明确概率随网页是否是一个
检索目标页面,也可以通过潜在的结果
排名的搜索引擎。 我们还可以估计质量
网页的根据概率。
对于检索的目标页的分类,我们的问题
认为两起案件,该案件时分类是基于
只有一个特点案件多的功能时
参与。
案例1:单功能的分析。 如果我们采取只有一个
查询功能 独立的概率网络
页面 ,p的特征是一个检索目标页可以
记为
(0)
我们可以用贝叶斯定理改写这个表达式为
(1)
in equation (1), p ( p
target page ) is the proportion of
retrieval target pages in the whole page set. as mentioned,
this proportion is difficult to estimate in many cases, includ-
ing our problem of retrieval target page classification. 操作方法
ever, if we just compare the values of p ( p
target page |
p has feature a ) in a given web page corpus, p ( p
目标
page ) can be regarded as a constant value and would not
affect the comparative results. so in a fixed corpus such as
.gov/sogou, we can rewrite equation (1) as
p ( p
target page )
p ( p has feature a ƒ p
target page )
p ( p has feature a )
p ( p
target page ƒ p has feature a )
p ( p
target page ƒ p has feature a ).
6
journal of the american society for information science and technology—october 2007
doi: 10.1002/asi
表3。 content-related features of retrieval target pages and ordinary
页。
ordinary page
retrieval target page
url contains a
13.06%
1.87%
question mark (“?”)
encode is not gbk
字母a
14.04%
1.39%
字母a
gbk represents chinese internal code specification. 人们普遍
adopted by chinese web sites in mainland china.
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 6

第7页
(2)
now consider the terms in equation (2); p ( p has feature a |
p
target page ) can be estimated using the proportion of
a -featured pages in the retrieval target page set. p ( p has fea-
ture a ) equals the proportion of the pages with feature a in a
given corpus. here we obtain
(3)
if the user query set is large enough to represent most user
interests, the sampling of retrieval target page can be
regarded as an approximately uniform process. 因此,
we can rewrite the numerator of (3) as
(4)
substituting expressions (3) and (4) into (2), we obtain
(5)
since all terms in (5) can be obtained by statistical analy-
sis on a web page corpus, we can calculate the probability
# ( p has feature a )
# ( corpus )
# p ( p has feature a
p
target page sample set )
# ( target page sample set )
ñ
p ( p
target page ƒ p has feature a ) r
# p ( p has feature a
p
target page sample set )
# ( target page sample set )
# p ( p has feature a
p
target page )
# ( target page )
# ( p has feature a )
# ( corpus )
# p ( p has feature a
p
target page )
# ( target page )
ñ
p ( p has feature a ƒ p
target page )
p ( p has feature a )
p ( p has feature a ƒ p
target page )
p ( p has feature a )
p ( p
target page ƒ p has feature a )
of being a retrieval target for each page according to this
方程。
case 2: multiple feature analysis. if we use more than one
feature to classify retrieval target pages, the naive bayes the-
orem assumes that the following equation holds:
(6)
for the problem of page classification with query-inde-
pendent features, we further found that the following equation
also approximately holds according to table 5.
(7)
this means for the features in table 5, the attribute values
adopted in the retrieval target page classification process are
independent as well as conditionally independent given the
target value.
the correlation values in table 5 show that these features
are approximately independent of one another. 这可能是
explained by the fact that these features are obtained from
different information sources and thus have little chance af-
fecting one another. in other words, the following equations
hold approximately for the retrieval target page classifica-
tion task according to the naive bayes assumption and our
statistical analysis:
(8)
q
ñ
字母i
1
p ( p
target page ƒ p has feature a
字母i
q
ñ
字母i
1
p ( p has feature a
字母i
ƒ p
target page ) p ( p
target page )
p ( p has feature a
字母i
p ( p has feature a
1
2
, p , a
ñ
ƒ p
target page ) p ( p
target page )
p ( p has feature a
1
2
, p , a
ñ
p ( p
target page ƒ p has feature a
1
2
, p , a
ñ
q
ñ
1
p ( p has feature a
字母i
p ( p has feature a
1
2
, p , a
ñ
q
ñ
1
p ( p has feature a
字母i
ƒ p
target page )
p ( p has feature a
1
2
, p , a
ñ
ƒ p
target page )
journal of the american society for information science and technology—october 2007
7
doi: 10.1002/asi
表4。 query-independent features applied in our cleansing experiments.
特征
说明
content-related features
doclength
number of words in a web page
anchorlength
number of words in a web page's in-link anchor text
urllength
number of dashes “ ” in a web page's url
pagesize
storage size of a web page
copynumber
number of mirror copies of a web page
urlformat
whether a url contains a question mark
编码
whether the encode of a web page is gbk
hyperlink structure-related features
outdegree
number of out-links of a web page
indegree
number of in-links of a web page
的pagerank
pagerank value calculated according to algorithm (brin & page, 1998)
in-site-outdegree
number of links from a web page to other pages in the same site
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 7

第8页
if we substitute (5) into (8), we can get the following
equation, which is fit for multifeature cases:
(9)
according to this equation, the probability of a web
page's being a retrieval target can be calculated with infor-
mation from the web corpus and its corresponding retrieval
target sample set. as mentioned previously, we construct
retrieval target sample sets for .gov/sogou corpus and
we select several query-independent features (shown in
表5)。 therefore, it is possible for us to use the algorithm
to accomplish the classification task.
experimental results and discussions
评价方法
to our knowledge, there has been little research on the
evaluation of ir-oriented web data cleansing. in our previous
works (liu et al., 2004; liu et al., 2005), we chose the re-
trieval target page recall rate for a cleansed page set as the
evaluation metric. if the cleansed size is significantly
smaller than that of the original set while retrieval target re-
call is above a threshold t , we regard the cleansing method
as an effective one. t is set according to practical application
要求。 although this simple method can prove the
effectiveness of a certain method, it has difficulties in
determining which cleansing method has better perfor-
芒斯。 there is a trade-off between cleansed corpus size and
high-quality page recall. for example, one can improve
recall simply by cleansing fewer pages, but doing so can
increase the cleansed set size. this means we should take
both factors into consideration while evaluating data cleansing
方法。
in order to solve this problem of ir-oriented data cleansing
evaluation, we proposed a new metric, called high-quality
page average recall ( ar ). when web pages in a certain
corpus are ranked using a certain data cleansing method,
# ( p has feature a
字母i
# ( corpus )
b
智商
ñ
1
a# ( p has feature a
字母i
p
target page sample set )
( target page sample set )
ñ
p ( p
target page ƒ p has feature a
1
2
, p , a
ñ
average recall of high-quality pages is the mean of the recall
scores after each page counted.
(10)
similar to the famous ir evaluation metric average preci-
sion ( ap ), ar is a summary measure of a ranked page list.
ar can also be calculated by averaging the high-quality page
recall values at various points of cleansed set size because
the following equation holds:
(11)
this means ar value can be calculated with the area
under the size-recall curves. a set of such curves are shown
图3。 the category axis is the percentage of retained
web pages in the original set after data cleansing. 价值
axis is the percentage of retained high-quality pages after
data cleansing.
in figure 3, each curve shows the performance of a cer-
tain kind of data cleansing algorithm. if we use area ( c
字母i
)至
represent the area under curve c
字母i
, then we have
(12)
this means the following equation also holds according
to equation 11:
(13)
ar ( c 1)
ar ( c 2)
ar ( c 3).
area ( c 1)
area ( c 2)
area ( c 3).
1
0
recall ( s ) ds ,
š
#( clensed set ) #( original set )
字母a
#( original set )
1
recall ( i )n# ( original set )
字母a
#( original set )
1
recall ( i )n# ( original set )
8
journal of the american society for information science and technology—october 2007
doi: 10.1002/asi
表5。 correlation values between query-independent features of web page.
url format
编码
的pagerank
集群
doclength
url length
indegree
urlformat
1.00
0.15
0.15
0.01
0.04
0.10
0.00
编码
1.00
0.20
0.00
0.06
0.30
0.00
的pagerank
1.00
0.01
0.06
0.03
0.05
copynumber
1.00
0.01
0.10
0.00
doclength
1.00
0.04
0.00
urllength
1.00
0.02
indegree
1.00
芹菜
c1
c3
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
cleansed set size
high-quality recall
图。 3。 ar calculation with cleansed size-recall curves.
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 8

第9页
we can get the same conclusion by analyzing these three
algorithms represented by the curves. for example, c 1 corre-
sponds to an effective data cleansing algorithm because with
a small cleansed set size (such as 10%), it can cover a more
than average number of high-quality pages (about 35%). c 2
shows a random sampling method of web pages because the
high-quality page recall increases linearly with the cleansed
set size and its ar value is equal to 1 2. c 3 can be regarded
as showing a low-quality page identification method, because
it retains almost no high-quality pages when cleansed set size
is small (below 30%).
from equation (11) and figure 3 we can see that ar is re-
lated to both cleansed set size and high-quality recall values.
this means ar contains both cleansed-size-oriented and recall-
oriented aspects and is suitable for the task of web page data
cleansing evaluation.
data cleansing experimental results
as mentioned in query-independent features of retrieval
target pages, we keep about half of the retrieval target sample
pages for testing our data cleansing algorithm. the algo-
rithm's effectiveness is examined by the following means:
first, we see whether this algorithm can pick up high-quality
页。 we then compare the performance of the query-inde-
pendent features applied in the algorithm to find out which
one plays the most important role in data cleansing. 检索
performance of the cleansed web set is also examined. at last,
we will check out whether this algorithm can separate low
quality or even spam pages as well.
high-quality page classification. figure shows the distrib-
ution of the probabilities of being retrieval targets for pages
in .gov corpus.
in figure 4, the x axis is the probability of being a key
resource page, which belongs to one kind of retrieval target
pages; the y axis is the probability of being a named page,
which belongs to the other kind of retrieval target pages. 同样地
mentioned in types of retrieval target page, retrieval target
pages can be grouped into these two kinds of pages.
we can see that pages in the retrieval target test set
(marked using circle ? and triangle ?) have higher probabili-
ties according to our algorithm (almost all these pages are
located in the dashed rectangle). meanwhile, part of ordinary
pages (marked with ) is not highly evaluated by our algo-
rithm (outside the dashed rectangle). we can also find several
ordinary pages mixed with retrieval target pages in the
dashed rectangle. this can be explained by the fact that we
cannot include all high-quality pages in our test set, and
pages with high probabilities may also be high-quality pages.
table 6 shows the cleansed corpus size and its corre-
sponding retrieval target page recall for .gov and sogou
corpora.
from the statistics in table 6, our data cleansing algo-
rithm can retain most retrieval target pages while signifi-
cantly reducing corpus size. a large proportion (95.53%
in .gov and 92.73% in sogou) of retrieval target pages
remain in the cleansed corpus. however, there is a difference
in the cleansed sizes when the algorithm is applied to differ-
ent corpora. only less than 5% pages are regarded as impor-
tant by the algorithm for the sogou corpus while 52%
pages are retained for .gov. it may be explained by the fact
that the data quality of .gov corpus is much higher than that
of sogou. .gov was crawled in 2002 and its pages are
limited to .gov domain, whose content is more reliable than
that of the whole web. sogou corpus was collected in 2005;
it has many more spam and low-quality pages appearing on
the web and the crawled pages are not limited to a certain
领土。 it is reasonable to find a larger proportion of high-
quality pages in .gov than in sogou. however, compared
with .gov corpus, sogou corpus is closer to the practical
application environment for a web search engine.
according to the experimental results in the test set, it is
possible to satisfy more than 90% of user requests with a
small number of pages in the corpus, and these pages can be
located query-independently using our data cleansing algo-
rithm。 web search engines may be able to adopt hierarchy
structure in their data index. the cleansed page set can be
placed into a high-level, frequently used, quickly accessible
index, which can meet most users'requests. the other pages
can be placed into low-level indexes, because they are not so
important for users and can meet the rest of search needs that
cannot be handled in the high-level index.
effectiveness of query-independent features. figure 5
shows the size-recall curve of our data cleansing result in
sogou corpus. according to the definition of high-quality
page ar in evaluation methods, the ar value for our algorithm
journal of the american society for information science and technology—october 2007
9
doi: 10.1002/asi
图。 4。 distribution of the probabilities of being retrieval target pages in
.gov corpus.
表6。 cleansed corpus size and corresponding target page recall
(the proportion of retained target page after data cleansing) using data
cleansing algorithm.
cleansed
检索
检索
corpus size
target page
target page
(percentage of
记得
记得
original set)
(training set)
(test set)
.gov
52.00%
95.53%
93.57%
sogou
4.96%
92.73%
92.37%
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 9

第10页
is 0.9064. this means that the algorithm is effective because
the upper bound for ar value is 1.0000 and a random sam-
pling algorithm's ar is 0.5000.
furthermore, we want to find out which query-independent
feature is the most important in our algorithm. 我们希望
to answer the question, does this cleansing result arise
from one or two “key” features or from a “combined” effort?
if one or two features can make the algorithm work, it is not
necessary to use a learning mechanism in the algorithm.
in order to answer this question, we test ar values of our
data cleansing algorithm, each time with one single feature
删除。 the experimental results are shown in table 7.
we can see from table 7 that when pagerank or indegree
is dropped out, the ar value drops most. this means these two
features play important roles in our data cleansing algorithm.
however, our further experimental results in figure 6
suggest that the other features should not be treated as unim-
portant。 we can see that the performance becomes worse
when only pagerank is adopted to rank pages in the data
cleansing process. a data cleansing algorithm that combines
other features can gain better performance. this is in accord
with the conclusion of henzinger (henzinger et al., 2003)
that a better page quality estimation algorithm should involve
other sources of information rather than using the hyperlink
structure analysis alone. although the features have different
abilities in identifying high-quality pages, the cleansing per-
formance results from a joint effort of all features instead of
from one or two “key features.”
retrieval performance of the cleansed page set. our data
cleansing algorithm aims at reducing the index size of
web search engines and maintaining or improving retrieval
performance at the same time. in effectiveness of query-
independent features, we have shown that the cleansed set
contains most high-quality pages although its size is signifi-
cantly smaller than that of the original set. it is also necessary
to test whether the cleansed page set really helps improve re-
sult ranking of the retrieval systems, because the effectiveness
of the cleansing algorithm degrades if the algorithm reduces
index size at the cost of sacrificing retrieval performance.
in order to find out the effect of data cleansing on retrieval
performance, we built three cleansed page sets from .gov
corpus using our cleansing method, with about 1 4, 1 2, and
3 4 amounts of pages from the original set, respectively.
trec has developed several retrieval tasks on .gov, so it is
possible for us to find reliable test sets for retrieval experi-
发言:。 we chose two search tasks from trec 2003 and
2004 web tracks because they are designed to be simulations
of the practical web search environment.
the trec 2003 web track task was focused on locating
key resource pages for topic distillation queries. this query
type covers about 80% of web search requests according to
表1。 the trec 2004 task was designed to cover all types
of web search requests; its search request set includes 1 3
home page finding queries, 1 3 named page finding queries,
and 1 3 topic distillation queries. for both retrieval tasks, the
retrieval performances are evaluated by mean average preci-
sion (map) and b-pref (buckley & voorhees, 2004), which
are two of the most frequently used metrics in web search
research (they are also adopted by trec as the major metrics).
retrieval experiment results are shown in figure 7; we
tested how the retrieval performance varies with respect to
the cleansed corpus size. for the trec 2003 task, the
cleansed set gets best performance while it contains about
1 4 pages of the original .gov corpus. however, when 3 4
pages are retained, the cleansed set performs best for the
trec 2004 task. this difference can be explained by
the different task settings of trec 2003 and 2004. for trec
2003, the topic distillation task tries to locate reliable pages
(called key resource pages ) for certain topics, so its search
target pages are more likely to be left in our cleansed set. 字母a
large fraction of trec 2004 queries are of home page
or named page finding types. these kinds of pages, especially
named pages, are likely to be ordinary pages (such as a cer-
tain news page) and are not highly evaluated by our cleansing
算法。 so we have to discard fewer pages to meet such
navigational type search requests. this means for different
10
journal of the american society for information science and technology—october 2007
doi: 10.1002/asi
图。 5。 cleansed size–recall curve for our data cleansing algorithm in
sogou corpus.
table 7. effectiveness of the query-independent features in the data
cleansing algorithm.
the feature which is dropped out
url format
0.9037
编码
0.9032
的pagerank
0.8756
集群
0.9012
doclength
0.9031
url length
0.8984
indegree
0.8860
high quality page average recall
0.81
0.82
0.83
0.84
0.85
0.86
0.87
0.88
0.89
0.90
0.91
pagerank only
without pagerank
without indegree
所有功能
图。 6。 effectiveness of pagerank and other features in data cleansing.
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 10

第11页
search tasks, we should choose different parameters for
cleansed set size to get the best retrieval performance. 对于
topic distillation task, fewer pages are needed in the cleansed
set; for navigational type searches, a larger cleansed set should
使用。
although the retrieval performance varies with different
retrieval tasks, it is interesting to find out that for each re-
trieval task the cleansed set outperforms the original set with
both smaller size and better performance. for trec 2003,
the cleansed set contains only 25% of pages of the original
set, while the algorithm helps improve map by 9.93% and
b-pref by 28.3%. for trec 2004, the cleansed set had bet-
ter retrieval performance (1.25% using map, 3.65% using
b-pref) with 25% of pages reduced.
in these experiments, data cleansing affects retrieval per-
formance in two aspects:
1。 cleansed size. by data cleansing, it is possible to reduce
low-quality pages that may be ranked higher than high-
quality ones by retrieval algorithms. 例如,在
topic “ the white house president bush's cabinet
(trec 2004, topic 26), the file “g45–22-1096484” in
.gov is ranked in third place by bm2500 weighting
algorithm (robertson, walker, hancock-beaulieu, &
gatford, 1994) according to our experimental results.
this page is from the us embassy in tokyo and men-
tions bush's cabinet, but it does not give any detailed
information on that topic. in the cleansing process, this
page is reduced because it is not popular and contains no
important information. by this means other important
pages related to this topic can be ranked higher. 目前,
fore, map for this topic improves from 0.143 to 0.500
after data cleansing.
2。 target page recall. retrieval target pages, especially
those for navigational type searches, may also be reduced
by our data cleansing algorithm. such loss is not huge on
average (less than 10% according to table 6), but it may
result in search failure for some particular cases if all
search target pages for a certain topic are discarded by
the cleansing algorithm. with techniques such as hierar-
chy indexing system (discussed in retrieval performance
of the cleansed page set), it is possible for our cleansing
algorithm to be effective for a majority proportion of
web search requests while not affecting the others much.
journal of the american society for information science and technology—october 2007
11
doi: 10.1002/asi
地图
0.105
0.110
0.115
0.120
0.125
0.130
25%
50%
75%
100%
cleansed size
(一)
(二)
b-pref
0.060
0.065
0.070
0.075
0.080
0.085
0.090
0.095
25%
50%
75%
100%
cleansed size
cleansed size
(三)
(四)
cleansed size
地图
0.36
0.37
0.38
0.39
0.40
0.41
0.42
0.43
0.44
0.45
25%
50%
75%
100%
b-pref
0.29
0.30
0.31
0.32
0.33
0.34
0.35
0.36
0.37
0.38
25%
50%
75%
100%
图。 7。 retrieval experiment results on the cleansed .gov corpus. (a), (b): retrieval using trec 2003 data set; (c), (d): retrieval using trec 2004
数据集。
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 11

第12页
the preceding two factors affect the retrieval performance at
同一时间。 if search target pages tend to be highly evaluated
by our algorithms (such as informational/transactional type tar-
get pages), the cleansed size is the key factor. then a smaller
cleansed size leads to higher performance just as occurred in
trec 2003 experiments (the smallest cleansed set has the best
map). if search target pages are usually obtain a low evalua-
tion by our cleansing algorithm (such as navigational type tar-
get pages), retrieval target page recall plays a more important
role. in this case, a smaller size often causes lower recall and
thus produces loss in performance (see trec 2004 results).
however, this does not mean a failure of our algorithm,
because a relatively large cleansed set may still outperform the
original set because the “cleansed size” factor also works.
if we take a closer look at the experiment results of trec
2004 in figure 8, we can also see how the two factors affect
retrieval performance.
in figure 8(a), informational type searches have a similar
retrieval performance distribution with trec 2003 because
they share the same type of search task. therefore, cleansed set
size is the key factor that leads to high performance of the
smallest cleansed set. in figure 8(b), we obtain the highest
map with 75% pages retained. this means retrieval target
page recall plays an important role along with cleansed set size.
from the preceding experimental results we can conclude
that our cleansing algorithm can improve the overall ret-
rieval performance. effectiveness varies with specific
retrieval tasks and is heavily dependent on two key factors:
cleansed size and retrieval target page recall. these two fac-
tors are also the ones we want to evaluate in our average
recall ( ar ) measure, so retrieval effectiveness and data
cleansing success can to a certain extent be judged together
with this measure.
reducing low-quality and spam pages. because the learning
process is based on analysis into high-quality page samples,
our data cleansing algorithm is designed to cleanse web data
by picking up high-quality pages (retrieval target pages)
instead of reducing low-quality or spam pages. 但是,
experimental results in figure 9 show that our algorithm can
also reduce a part of these harmful pages.
for sogou corpus, we have three assessors pick up a
number of low-quality and spam pages according to the fol-
lowing rules: low-quality pages are pages that offer little or
biased information and are not as useful as retrieval target
pages, while spam pages are those making use of tricks in
hyperlink structure or page content to get higher rankings
than they should have in web search results; 3,272 low-
quality pages are annotated together with 120 spam pages by
the assessors.
according to figure 9, using pagerank can reduce more
spam pages than using indegree, while the latter feature is
more effective in separating low-quality pages. 但是,
the data cleansing algorithm that makes use of both features
can reduce 30% of spam pages as well as 15.26% of low-
quality pages. this means the algorithm can fully exploit the
advantages of both pagerank and indegree in reducing
spam and low-quality pages.
possibilities of applying data cleansing method
in web search engines
as shown in data cleansing experimental results, our
data cleansing method is effective for sogou and .gov
corpus in significantly reducing collection size as well as
retaining a large proportion of high-quality pages. cleansed
.gov had better retrieval performance than the original cor-
pus while only about 50% of pages needed to be indexed.
however, there are several major differences between these
corpus-based experiments and practical web search applica-
tions, and each of them may result in failure of the lab-oriented
方法。
first, the web page corpus only covers a small (usually
tiny) part of the world wide web, so it is not always true that
the methods that work well in corpora are also effective for
12
journal of the american society for information science and technology—october 2007
doi: 10.1002/asi
地图
0.342
0.344
0.346
0.348
0.350
0.352
0.354
0.356
0.358
25%
50%
75%
100%
cleansed size
(一)
(二)
25%
50%
75%
100%
cleansed size
地图
0.40
0.45
0.50
0.55
0.60
0.65
图。 8。 detail retrieval experiment results for trec 2004 tasks. (a): informational/transactional type searches; (b): navigational type searches.
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 12

第13页
the whole web collection. second, the features adopted in
experimental environments may not be so easily obtained
for practical applications. for example, it is much more dif-
ficult to obtain hyperlink-related features in a practical web
environment because the link graph is much more complex
than that of a web page corpus. last, the methods that work
well for corpora may not be applicable for practical web
search applications if they are too time-consuming or involve
too many parameters to be tuned.
our data cleansing method is effective in the corpus-
based experiments, but its effectiveness in practical web
search applications relies on the following aspects.
reliability of the data cleansing performance. uniform
sampling of web pages is a challenge for many web appli-
cation studies, so the reliability of our data cleansing method
relies on the degree to which our experiment settings are
similar to the practical web environment. to our knowledge,
our method is built on the largest web page corpus and
corresponding retrieval target page training set ever used
by web ir researchers. the sogou corpus contains 37 mil-
lion web pages and covers almost 5% of pages of the whole
chinese web collection. experiments on .govalso prove that
data cleansing can be effective with various language envi-
ronments or web page sizes.
although the experiment settings are different from a
practical web search application environment, we believe
that they are currently the closest to that environment.
parameter tuning and feature selection may vary with dif-
ferent application requirements, but the basic idea is reli-
able and generally applicable for real-world web search
发动机。
availability of the query-independent features applied in the
data cleansing method. the differences between retrieval
target pages and ordinary pages exist because of the wide
spread of unimportant, outdated, contradictory, and spam
data in the www. once these kinds of low-quality data exist
on the web, our learning-based query-independent data
cleansing method can be a reliable and effective way to
解决这个问题。
query-independent features selected in our cleansing
algorithm are collected from a practical search engine’s
operation process. they are all adopted in the preprocessing,
indexing, or ranking stage of a search engine. pagerank fea-
ture is important in result ranking, number of copies is a key
factor for result clustering, and these features should be col-
lected by search engines even without the cleansing process.
so the feature collection process does not require additional
workload for the search engine system.
applicability of the data cleansing method. the applicability
of the cleansing method depends on two factors: efficiency
和有效性。 data cleansing should be efficient so that
billions of web pages can be evaluated before they are placed
in data indexes. the algorithm should also be effective in
reducing both the index size and the reaction time while re-
taining high retrieval performance.
the time complexity of the cleansing algorithm depends
on the learning algorithm we selected. because naive bayes
learning is adopted in our algorithm, the cleansing process
(judging whether the page should be cleansed or not) is with
liner complexity. so its efficiency fits well for web search
申请。
effectiveness of the cleansing algorithm is decided by
several aspects. according to retrieval performance of the
cleansed page set, retrieval performance depends on both
the cleansed set size and the retrieval target page recall of the
cleansed set. a possible way to reduce the reaction time and
obtain high performance for search engines is to use a hierar-
chical indexing structure such as the one shown in figure 10.
as shown in figure 10, a large number of user requests
can be satisfied with a frequently visited page set. 这页
set is from the data cleansing results of the original set and
only contains a small proportion of pages in the whole index-
able collection. experiments with the sogou corpus show
that the cleansed set may contain less than 5% of pages but
fulfill a large fraction of user requests. when the cleansed set
journal of the american society for information science and technology—october 2007
13
doi: 10.1002/asi
0%
5%
10%
15%
20%
25%
30%
35%
data cleansing
pagerank only
indegree only
spam reduced
low quality reduced
图。 9。 effectiveness of reducing low-quality and spam pages using data
cleansing.
cleansed
设置
整个
收藏
web data
洁净
搜索
接口
查询
is result
sufficient?
查询
结果
result from the whole collection
图。 10。 a hierarchical indexing architecture for search engines based on
data cleansing.
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 13

第14页
fails to give sufficient answers for a certain query (such as a
named page type query whose retrieval target is reduced in
data cleansing), the system turns to the whole collection and
returns retrieval results to the search users.
结论和未来工作
we have shown that by using a web data cleansing algo-
rithm, it is possible to reduce the web data size significantly
while retaining most high-quality pages. our algorithm,
based on analysis into large-scale web corpora, exploits the
differences between high-quality pages and ordinary pages
在网络上。 we combine machine learning techniques and
descriptive analysis on the query-independent features of re-
trieval target pages to provide a better understanding of the
relationship between user requests and the index structure of
web ir tools.
in the near future, we hope to extend this work's frame-
work to include other applications such as low-quality page
reduction and personalized web search. we also plan to work
on a hierarchical storage model for web ir tools according to
our findings in this paper.
致谢
this work was supported by the national high technology
research and development program of china (863 program),
the chinese national key foundation research & develop-
ment plan (2004cb318108), the natural science foundation
(60223004, 60321002, 60303005, 60503064), and the key
project of chinese ministry of education (no. 104236).
at the early stages of this work, we benefited enormously
from discussions with yijiang jin, qi guo, kuo zhang, and
lei yang; we thank fan lin and xiaochuan wang from
sohu corporation r&d center for kindly offering help in
corpus construction; we also thank daxin jiang, xiaoge
wang, le zhao, and the anonymous referees of this paper,
for their valuable comments and suggestions.
参考文献
brin, s., & page, l. (1998). the anatomy of a large-scale hypertextual web
search engine. in proceedings of the seventh world wide web confer-
ence (pp. 107–117). brisbane. australia: elsevier science publishers bv
broder, a. (2002). a taxonomy of web search. sigir forum, 36(2), 3–10.
buckley, c., & voorhees, em (2004). retrieval evaluation with incomplete
信息。 in proceedings of the 27th annual international acm
sigir conference on research and development in information
retrieval (pp. 25–32). sheffield, england: acm press.
buttler, d., liu, l., & pu, c. (2001). a fully automated object extraction
system for the world wide web. in proceedings of the international con-
ference on distributed computing systems (pp. 361). 华盛顿特区:
ieee计算机学会。
buyukkokten, o., garcia-molina, h., & paepcke, a. (2001). 手风琴
summarization for end-game browsing on pdas and cellular phones.
proceedings of the sigchi conference on human factors in computing
systems (pp. 213–220). new york: acm press.
cai, d., he, x., wen, j., & ma, w. (2004). block-level link analysis. 微软
technical report msr-tr-2004-50. retrieved june 17, 2005, from
ftp://ftp.research.microsoft.com/pub/tr/tr-2004–50.pdf
cai, d., yu, s., wen, j., & ma, w. (2003). vips: a vision-based page
segmentation algorithm. microsoft technical report msr-tr-2003-79.
retrieved june 17, 2005, from ftp://ftp.research.microsoft.com/pub/tr/
tr-2003–79.pdf
craswell, n., & hawking, d. (2004). overview of the trec 2003 web
跟踪。 in em voorhees and lori p. buckland (eds.), nist special pub-
lication 500-261: the 13th text retrieval conference (trec 2004).
washington, dc: department of commerce and national institute of
标准和技术。
craswell, n., hawking, d., & robertson, s. (2001). effective site finding
using link anchor information. in proceedings of the 24th annual interna-
tional acm sigir conference on research and development in informa-
tion retrieval (pp. 250–257). new york: acm press.
denis, f. (1998). pac learning from positive statistical queries. in mm
richter, ch smith, r. wiehagen, & t. zeugmann (eds.), proceedings
of the ninth international conference on algorithmic learning theory:
lecture notes in computer science (vol. 1501, pp. 112–126). 伦敦:
施普林格出版社,1998。
embley, dw, jiang, y., & ng, y.-k. (1999年)。 record-boundary discovery in
web documents. in proceedings of the 1999 acm sigmod interna-
tional conference on management of data (pp. 467–478). 费城:
acm出版社。
hawking, d., & craswell, n. (2002). overview of the trec 2002 web
跟踪。 in em voorhees and lori p. buckland (eds.), nist special pub-
lication 500-251: the 11th text retrieval conference (trec 2002).
washington, dc: department of commerce and national institute of
标准和技术。
hawking, d., & craswell, n. (2003). overview of the trec 2003 web
跟踪。 in em voorhees and lori p. buckland (eds.), nist special pub-
lication 500-255: the 12th text retrieval conference (trec 2003)
(pp.78–92). washington, dc: department of commerce and national
institute of standards and technology.
hawking, d., & craswell, n. (2005). very large-scale retrieval and web
搜索。 in ellen voorhees & donna harman (eds.), trec: experiment
and evaluation in information retrieval. 马萨诸塞州剑桥:麻省理工学院出版社。
hedger, j. (2005). google takes backhanded bow out of size war with yahoo.
retrieved october 17, 2005, from http://www.searchenginejournal.
com/?p 2277
henzinger, mr, motwani, r., & silverstein, c. (2003). challenges in web
search engines. in proceedings of the 18th international joint conference on
artificial intelligence (pp. 1573–1579). san francisco: morgan kaufmann.
kaasinen, e., aaltonen, m., kolari, j., melakoski, s., & laakko, t. (2000).
two approaches to bringing internet services to wap devices. 计算机
networks: the international journal of computer and telecommunica-
tions networking, 33(6), 231–246.
kleinberg, jm (1999). authoritative sources in a hyperlinked environment.
journal of acm 46(5), 604–632.
kumar, r., raghavan, p., rajagopalan, s., & tomkins, a. (2006). core algo-
rithms in the clever system. acm transactions on internet technology,
6(2), 131–152.
liu, y., zhang, m., & ma, s. (2004). effective topic distillation with key
resource pre-selection. lecture notes in computer science, 3411, 129–140.
liu, y., wang, c., zhang, m., & ma, s. (2005). web data cleansing for in-
formation retrieval using key resource page selection. 特别兴趣
tracks and posters of the 14th international conference on world wide
web (pp. 1136–1137). new york: acm press.
lyman, p., & varian, hr (2003). how much information 2003? 本站
june 18, 2005, from http://www.sims.berkeley.edu/how-much-info-2003
manevitz, lm, & yousef, m. (2002). one-class svms for document clas-
sification。 journal of machine learning research, 2, 139–154.
mitchell, t. (1997). bayesian learning. in machine learning. 纽约:
mcgraw-hill education.
nigam, k., mccallum, ak, thrun, s., & mitchell, t. (2000). text classi-
fication from labeled and unlabeled documents using em.
learning, 39(2–3): 103–134.
page, l., brin, s., motwani, r., & winograd, t. (1998). the pagerank
citation ranking: bringing order to the web. retrieved october 17, 2005,
from http://citeseer.ist.psu.edu/page98pagerank.html.
14
journal of the american society for information science and technology—october 2007
doi: 10.1002/asi
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 14

第15页
robertson, se, walker, s., hancock-beaulieu, mm, & gatford, m. (1994).
okapi at trec-3. nistspecial publication 500-225: overview of the third
text retrieval conference (trec-3) (pp. 109–126). washington, dc: de-
partment of commerce and national institute of standards and technology.
rose, de, & levinson, d. (2004). understanding user goals in web
搜索。 in proceedings of the 13th international conference on world
wide web (www '04) (pp. 13–19). new york: acm press.
song, r., liu, h., wen, j., & ma, w. (2004). learning block importance
models for web pages. in proceedings of the 13th international confer-
ence on world wide web (pp. 203–211). new york: acm press.
sullivan, d. (2003). search engine sizes. search engine watch web site arti-
克莱斯。 retrieved december 10, 2005, from http://searchenginewatch.com/
reports/article.php/2156461
sullivan, d. (2005). searches per day. search engine watch web site arti-
克莱斯。 retrieved december 10, 2005, from http://searchenginewatch.com/
reports/article.php/2156481
wong, tl, & lam, w. (2004). a probabilistic approach for adapting infor-
mation extraction wrappers and discovering new attributes. 在诉讼
of the fourth ieee international conference on data mining (pp.
257–264). washington, dc: ieee computer society.
wong, w., & fu, aw (2000). finding structure and characteristics of web
documents for classification. in proceedings of the acm sigmod
workshop on research issues in data mining and knowledge discovery
(pp. 96–105). dallas, tx: acm press.
yi, l., liu, b., & li, x. (2003). eliminating noisy information in web pages
for data mining. in proceedings of the ninth acm sigkdd interna-
tional conference on knowledge discovery and data mining (pp.
296–305). new york: acm press.
yu, h., han, j., & chang, kc (2004). pebl: web page classification with-
out negative examples. ieee transactions on knowledge and data engi-
neering 16(1), 70–81.
journal of the american society for information science and technology—october 2007
15
doi: 10.1002/asi
asi5812_0150_20633.qxd 7/12/07 8:13 pm page 15
现在75互联星空 - rapidshare搜..
渤imdb:投稿区:加工时间..
allaboutbranding.com:..
不像多布斯,一些保守媒体认为birthers..
投稿承包委员会的示范| alamy ..
投稿:德州月刊2010年2月..
撰稿人爱德华默罗的原件,这点我相信收音机系列..
渴望在2008年的启示
朱利叶斯诺伊多费尔,投稿 - 数据中心的内容..
2012-2-9 5:46:33
N
oyayiya
hhyy 机北