Search Engine and Web Mining Group
Introduction
The Search Engine and Web Mining (SEWM) Group, part of the Network Lab
of the Department of Computer Science and Technology, Peking University,
has had a strong research history in the search engine area since the
inception of the web in China. The group plays a leading role in the search
engine community. The Tianwang search engine has been one of the best-known
search engines in China since 1997, and the SEWM group has been developing
it for over 9 years.
The group has strong technical advantages
in incremental crawling, full-text indexing and retrieval, and web archiving.
In 2001, the group began to collect and archive web pages
within China and to build the Chinese Web Archive, Web Infomall.
As of today, Web Infomall has accumulated over 2 billion web pages published
over the past years, and stands as the second-largest web archive in the
world after the Internet Archive.
Current Projects, Full Proposals
here (password protected)
Accomplishments
- Tianwang search engine, http://e.pku.edu.cn
- Chinese Web Archive (Web Infomall), http://www.infomall.cn
This achievement received a second-place Beijing Science and Technology
Advancement award in 2004.
- Chinese Digital Assets Library (CDAL).
- The first book on search engines published in China - "Search Engine:
Principle, Technology and Systems"
- Patent (no. 01109132.0) Method for judging the positional relevance
of a group of search key words in a web page.
- Initiated the annual Symposium on Search Engine and Web Mining and
has organized it since 2003.
People and Resources
The SEWM Group, led by Associate Professor Hongfei Yan, has four faculty
members, six PhD students, and seven master's students. Dr. Yan was
previously in charge of Tianwang's parallel upgrade, which scaled it from
a search engine of one million pages to one of tens of millions of pages.
He also pioneered the deployment of the first large-scale Chinese
Web Test collection, with 100 GB of web pages (CWT100g), and has organized
the annual Workshop on Chinese Web Information Retrieval Evaluation
since 2004. Other faculty include Dr. Bo Peng, an expert in large-scale
full-text indexing and retrieval; Mr. Zhengmao Xie, an expert in
high-performance large-scale web page crawling; and Professor Li Xiaoming,
who is in charge of interdisciplinary research on the integration of
information technology and social science.
Tianwang Search Engine and Web Infomall, consisting of 50 machines, are two
web service systems in daily operation, maintained by the SEWM group.
Additionally, the group provides many kinds of open access archives, including
collections of Chinese web pages, CWT100g, CWT200g, the Tianwang query log,
a Chinese web classification training set, etc.