Abstract
With many millions of pages, the Web has become an enormous information
source. This information takes the form of documents, images, videos, and
text. With such vast amounts of data, finding the right information is a
common problem, and users typically rely on search engines to locate the
content they need. Searching can be done manually through platforms such as
Google, or automatically by means of web crawlers. Because Web content lacks
semantic structure, search results can include varying types of information
relating to the same query, and these results often cannot be analyzed
directly to satisfy a specific interpretation need. The search result records
(SRRs) returned from the Web in response to manual or automatic queries take
the form of web pages that present results drawn from underlying databases.
These results can be put to further use in many applications, such as data
collection and price comparison.
Thus, there is a need to make the SRRs machine processable, and to achieve
that it is important that the SRRs are annotated in a meaningful fashion.
Annotation adds value to the SRRs: the collected data can be stored for
further analysis, the collection becomes easier to read and understand, and
the data is prepared for visualization. SRRs bearing the same concepts are
grouped together, making it easier to compare, analyze, and browse the
collection. The purpose of this research is to find out how search results
from the Web can be automatically annotated and restructured to allow data
visualization for users in a specific domain of discourse. A case study
application is implemented that uses a web crawler to retrieve web pages
about any topic in the public health domain. This research is a continuation
of the work done by Mr. Emmanuel Onu in the project “Proposal of a Tool to
Enhance Competitive Intelligence on the Web”.
1 Introduction
People from all walks of life use the Internet for many different tasks, such
as buying and selling items, social networking, digital libraries, and news.
Researchers need information from digital libraries and other online document
repositories to conduct their research and share information; scholars need
books from which to get information and knowledge; people communicate with one
another through email via the Web; others use social media to exchange
information as well as to chat casually; and some conduct transactions, such
as purchasing items and paying bills, over the Web. The World Wide Web is
today the main repository of “all kinds of information” and has so far been
very successful at disseminating information to humans. The Web has become the
preferred medium for many database applications, such as e-commerce and
digital libraries. Many of these applications store information in huge
databases that users access, query, and update through the Web. Improvements
in hardware technology have brought an increase in the storage capacity of
computers and servers, so many web servers store large amounts of data on
their storage drives. On some social media websites, e.g. Facebook [1], users
can upload pictures, videos, and other documents. YouTube [2] allows its users
to post videos of varying lengths to its servers. There are other automated
systems that collect large amounts of data on a daily basis. For example,
banking systems need to store daily Automated Teller Machine (ATM)
transactions as well as other customer transactions; monitoring systems
collect data about particular aspects of life, e.g. climate change; and online
shopping systems keep information about clients’ daily shopping experience.
These are but a few of the developments that have made a gigantic amount of
information and documents available on the Web.
However,
due to the heterogeneity and the lack of structure of Web information sources,
access to this huge collection of information has been limited to browsing and
searching. That is, to access a document, one either enters its URL (Uniform
Resource Locator) in a web browser or makes use of a search engine. The former
approach is suitable when you know what you are looking for and its exact
location on the Web. This is hardly ever the case, so most Web users locate
the content they are looking for by using search engines. Some of these are
software systems in which a user manually enters a search term and the search
engine retrieves documents matching that term, while other automated search
engines make use of a web crawler.
There
are several notable web-based search engines that index web documents and are
available to Web users; the most common are Google, Yahoo, and AltaVista. Such
systems search through collections of documents sourced from the Surface Web,
which is indexed by standard search engines, as well as the Deep Web, which
requires special tools to access. Most users benefit from such systems when
researching unfamiliar information or when tracing a website they know but
whose URL they cannot remember. Still, there are business disciplines, such as
Competitive Intelligence [3], that require a particular, domain-specific type
of information in order to make strategic business decisions. In such
scenarios, different tools are developed to help with information gathering
and analysis, and several other methods for searching and information
retrieval to gather intelligence also operate in such domain-specific areas.
For example, manually browsing the Internet could be the simplest method for
conducting a Competitive Intelligence task. Manual browsing, to a reasonable
level, guarantees the quality of the documents collected, which in turn
improves the quality of the knowledge that is discoverable [4]. However, the
challenge is that a lot of time is spent. According to Onu, a survey of over
300 Competitive Intelligence professionals shows that data collection is the
most time-consuming task in typical Competitive Intelligence projects,
accounting for more than 30% of the total time spent on a whole project.
Manually browsing the Internet to read the information on every page of a
website in order to locate useful information, and then to synthesize that
information, is mentally exhausting and overwhelming for Competitive
Intelligence professionals. There is undeniably a huge demand for collecting
data of interest from multiple websites across the Web. For example, in an
online shopping system that collects result records from different item sites,
there is a need to determine whether any two items retrieved in the search
result records refer to the same item. For an online book shopping system,
the ISBNs can be compared to achieve this; if ISBNs are not available, their
titles and authors could be used instead. Such a system is also expected to
list the price of an item from each site. Thus the system needs to know the
semantics of each data unit.
Unfortunately, the semantic labels of data units are often not provided in the
result pages. For instance, in Figure X, no semantic labels for the values of
title, author, publisher, etc., are given. Having semantic labels for data
units is important not only for the above record linkage task, but also for
storing collected search result records in a database table (e.g., by Deep Web
crawlers) for later analysis. Early applications required tremendous human
effort to annotate data units manually, which severely limited their
scalability.
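To make the record linkage idea above concrete, the following is a minimal
Python sketch of how two annotated book SRRs might be compared: match on ISBN
when both records carry one, otherwise fall back to a normalized title-and-author
comparison. The function and field names (same_book, isbn, title, author, price)
are illustrative assumptions, not part of any existing system.

def normalize(text):
    """Lower-case and collapse whitespace so superficial formatting
    differences do not prevent a match."""
    return " ".join(text.lower().split()) if text else ""

def same_book(record_a, record_b):
    """Decide whether two annotated search result records refer to the same
    book. Records are dicts with optional 'isbn', 'title', 'author' keys
    (hypothetical labels produced by an annotation step)."""
    isbn_a = (record_a.get("isbn") or "").replace("-", "")
    isbn_b = (record_b.get("isbn") or "").replace("-", "")
    if isbn_a and isbn_b:
        # ISBN is the strongest identifier when both records provide it.
        return isbn_a == isbn_b
    # Fall back to normalized title and author comparison.
    return (normalize(record_a.get("title")) == normalize(record_b.get("title"))
            and normalize(record_a.get("author")) == normalize(record_b.get("author")))

# Example: the same book listed on two sites at different prices.
a = {"isbn": "978-0132350884", "title": "Clean Code", "author": "Robert C. Martin", "price": 35.99}
b = {"title": "Clean  Code", "author": "robert c. martin", "price": 29.50}
print(same_book(a, b))  # True, via the title/author fallback

Once two records are known to refer to the same item, their annotated price
fields can be compared directly, which is exactly the kind of downstream use
that motivates meaningful annotation of SRRs.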
Different
tools have been developed to help search, gather, analyze, categorize, and
visualize large collections of web documents. One such tool is the CI Web
Snooper, proposed by Mr. Emmanuel Onu in his paper “Proposal of a Tool to
Enhance Competitive Intelligence on the Web”. This research is based on that
tool and continues the initial work from the above-mentioned paper. The CI Web
Snooper is a tool for searching and retrieving websites from the Internet that
can be used for information gathering and knowledge extraction. It uses a
real-time search technique so that the information it sources from the Web is
up to date. It has four major components: the User Interface, the Thesaurus
Model, the Web Crawler, and the Indexer. The User Interface allows the user to
specify a search query and the seed URLs for the Web Crawler to use in its
search. The Thesaurus Model is used to model the domain of interest and is key
to query reformulation and the indexing of web pages. The Web Crawler component
is responsible for finding and downloading web pages using a breadth-first
search algorithm that starts from the URLs specified by the user.
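As an illustration of the breadth-first crawling strategy described above, the
following is a minimal Python sketch. It is not the actual CI Web Snooper
implementation; the function name crawl and the depth and page limits are
illustrative assumptions, and link extraction is deliberately simplified.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_depth=2, max_pages=50):
    """Breadth-first crawl starting from user-supplied seed URLs.
    Returns a list of (url, html) pairs for downloaded pages."""
    queue = deque((url, 0) for url in seed_urls)   # FIFO frontier gives breadth-first order
    visited = set(seed_urls)
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                               # skip pages that fail to download
        pages.append((url, html))
        if depth < max_depth:
            # Naive link extraction; a real crawler would use a proper HTML parser.
            for link in re.findall(r'href="([^"#]+)"', html):
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in visited:
                    visited.add(absolute)
                    queue.append((absolute, depth + 1))
    return pages

Because the frontier is a first-in, first-out queue, all pages reachable at
depth n are downloaded before any page at depth n+1, which is what gives the
crawler its breadth-first behavior.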
Department: Computer Science (M.Sc Thesis)
Format: MS Word
Chapters: 1 - 5, Preliminary Pages, Abstract, References, Appendix.
No. of Pages: 34
Price: 20,000 NGN