UNIT I
1) Boolean retrieval:
The Boolean model of information retrieval (BIR)[1] is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one. It is used by many IR systems to this day.
The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms.
Given a finite set

T = {t_1, t_2, ..., t_j, ..., t_m}

of elements called index terms (e.g. words or expressions, which may be stemmed, describing or characterising documents, such as keywords given for a journal article), and a finite set

D = {D_1, ..., D_i, ..., D_n}, where each D_i is an element of the powerset of T,

of elements called documents. A query Q is a Boolean expression in normal form:

Q = (W_i OR W_k OR ...) AND ... AND (W_j OR W_s OR ...),

where W_i = t_i or W_i = NOT t_i (and similarly for W_k, W_j and W_s); t_i means that the term t_i is present in a document, whereas NOT t_i means that it is not. Equivalently, Q can be given in disjunctive normal form. An operation called retrieval, consisting of two steps, is defined as follows:

1. The sets S_j of documents are obtained that contain or do not contain the term t_j (depending on whether W_j = t_j or W_j = NOT t_j):

S_j = {D_i | W_j ∈ D_i}

2. Those documents are retrieved in response to Q which result from the corresponding set operations (union for OR, intersection for AND, complement for NOT) applied to the sets S_j; this set is the answer to Q.
Advantages of the Boolean model:
· Clean formalism
· Easy to implement
· Intuitive concept
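As a minimal sketch of how Boolean retrieval maps onto ordinary set operations (the toy documents and query below are made up for illustration):

# Minimal Boolean retrieval sketch: documents as sets of terms,
# queries evaluated with set intersection/union/difference.
documents = {                      # hypothetical toy collection
    "D1": {"brutus", "caesar", "calpurnia"},
    "D2": {"brutus", "caesar"},
    "D3": {"caesar", "cleopatra"},
}

all_docs = set(documents)

def containing(term):
    """Set of document ids whose term set contains the given term."""
    return {doc_id for doc_id, terms in documents.items() if term in terms}

# Query: brutus AND caesar AND NOT calpurnia
answer = containing("brutus") & containing("caesar") & (all_docs - containing("calpurnia"))
print(sorted(answer))              # -> ['D2']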
2) Recall the major steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in, as in the sketch below.
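A compact sketch of these four steps (the tiny example documents are hypothetical, and lowercasing stands in for fuller linguistic preprocessing):

import re
from collections import defaultdict

# 1. Collect the documents to be indexed (hypothetical toy collection).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

def tokenize(text):
    # 2. Tokenize: split on non-letter characters.
    # 3. Linguistic preprocessing: here, just lowercasing.
    return re.findall(r"[a-z]+", text.lower())

# 4. Index the documents that each term occurs in.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["sales"]))   # -> [1, 2, 3]
print(sorted(inverted_index["july"]))    # -> [2, 3]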
Obtaining the character sequence in a document
Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters. For the case of plain English text in ASCII encoding, this is trivial. But often things get much more complex. The sequence of characters may be encoded by one of various single-byte or multi-byte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards.
Choosing a document unit
The next phase is to determine what the document unit for indexing is. Thus far we have assumed that documents
are fixed units for the purposes of indexing. For example, we take each file in
a folder as a document. But there are many cases in which you might want to do
something different. A traditional Unix (mbox-format) email file stores a
sequence of email messages (an email folder) in one file, but you might wish to
regard each email message as a separate document. Many email messages now
contain attached documents, and you might then want to regard the email message
and each contained attachment as separate documents. If an email message has an
attached zip file, you might want to decode the zip file and regard each file
it contains as a separate document. Going in the opposite direction, various
pieces of web software (such as latex2html) take things that you might regard as a single document (e.g., a PowerPoint file or a LaTeX document) and split them into
separate HTML pages for each slide or subsection, stored as separate files. In
these cases, you might want to combine multiple files into a single document.
Tokenization
Given a character sequence and a defined
document unit, tokenization is the task of chopping it up into pieces,
called tokens , perhaps at the same time
throwing away certain characters, such as punctuation. Here is an example of
tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: |Friends| |Romans| |Countrymen| |lend| |me| |your| |ears|
These tokens are often
loosely referred to as terms or words, but it is sometimes important to make a
type/token distinction. A token is an instance of a sequence of characters in some
particular document that are grouped together as a useful semantic unit for
processing. A type is the class of all tokens
containing the same character sequence. A term is
a (perhaps normalized) type that is included in the IR system's dictionary.
Dropping common terms: stop words
Sometimes, some extremely common words which
would appear to be of little value in helping select documents matching a user
need are excluded from the vocabulary entirely. These words are called stop words . The general strategy for determining a
stop list is to sort the terms by collection frequency (the
total number of times each term appears in the document collection), and then
to take the most frequent terms, often hand-filtered for their semantic content
relative to the domain of the documents being indexed, as a stop
list , the members of which are then discarded during indexing.
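A brief sketch of building a stop list from collection frequency along these lines (the cutoff of 25 terms is an illustrative assumption; real stop lists are usually hand-filtered as described above):

from collections import Counter

def build_stop_list(tokenized_docs, num_stop_words=25):
    """Take the most frequent terms by collection frequency as a stop list."""
    collection_freq = Counter()
    for tokens in tokenized_docs:
        collection_freq.update(tokens)           # total occurrences across the collection
    return {term for term, _ in collection_freq.most_common(num_stop_words)}

def remove_stop_words(tokens, stop_list):
    """Discard stop list members during indexing."""
    return [t for t in tokens if t not in stop_list]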
Normalization (equivalence classing of terms)
Having broken up our documents (and also our query) into
tokens, the easy case is if tokens in the query just match tokens in the token
list of the document. However, there are many cases when two character
sequences are not quite the same but you would like a match to occur. For
instance, if you search for USA, you might hope to also match documents containing U.S.A.
Token normalization is the process of canonicalizing tokens so that matches
occur despite superficial differences in the character sequences of the tokens.
The most standard
way to normalize is to implicitly create equivalence classes , which are normally named after one member of the set. For
instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both
the document text and queries, then searches for one term will retrieve documents
that contain either.
The advantage of just using mapping rules that
remove characters like hyphens is that the equivalence classing to be done is
implicit, rather than being fully calculated in advance: the terms that happen
to become identical as the result of these rules are the equivalence classes.
It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
Figure 2.6: An example of how asymmetric expansion of query terms can usefully model users' expectations.
Accents and diacritics.
Diacritics on characters in English have a fairly marginal status,
and we might well want cliché and cliche to match, or naive and naïve. This can be done by normalizing tokens to remove
diacritics. In many other languages, diacritics are a regular part of the
writing system and distinguish different sounds. Occasionally words are
distinguished only by their accents. For instance, in Spanish, peña is `a cliff', while pena is `sorrow'.
Nevertheless, the important question is usually not prescriptive or linguistic
but is a question of how users are likely to write queries for these words. In
many cases, users will enter queries for words without diacritics, whether for
reasons of speed, laziness, limited software, or habits born of the days when
it was hard to use non-ASCII text on many computer systems. In these cases, it
might be best to equate all words to a form without diacritics.
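One common way to strip diacritics, sketched here with Python's standard unicodedata module (this is one reasonable normalization choice, not the only one):

import unicodedata

def remove_diacritics(token):
    """Map e.g. 'cliché' -> 'cliche' and 'naïve' -> 'naive'."""
    # Decompose characters into base characters plus combining marks,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_diacritics("cliché"), remove_diacritics("naïve"))   # cliche naive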
Capitalization/case-folding.
A common strategy is to do case-folding by reducing all letters to lower case. Often this is a good
idea: it will allow instances of Automobile at the beginning of a sentence to match with a query of automobile. It will
also help on a web search engine when most of your users type in ferrari when they are
interested in a Ferrari car. On the other
hand, such case folding can equate words that might better be kept apart. Many
proper nouns are derived from common nouns and so are distinguished only by
case, including companies (General Motors, The Associated Press),
government organizations (the Fed vs. fed) and person names (Bush, Black). We already mentioned an example of unintended query expansion with acronyms, which involved not only acronym normalization (C.A.T. -> CAT) but also case-folding (CAT -> cat).
3) Dictionaries and tolerant retrieval.
4) Index construction:
An inverted index for a given text collection can be quite large, especially if it contains full positional information about all term occurrences in the collection. In a typical collection of English text there is approximately one token for every 6 bytes of text (including punctuation and whitespace characters). Hence, if postings are stored as 64-bit integers, we may expect that an uncompressed positional index for such a collection consumes between 130% and 140% of the uncompressed size of the original text data. This estimate is confirmed by Table 6.1, which lists the three example collections used in this book along with their uncompressed size, their compressed size, and the sizes of an uncompressed and a compressed schema-independent index. An uncompressed schema-independent index for the TREC45 collection, for instance, consumes about 331 MB, or 122% of the raw collection size.
General-Purpose Data Compression
In general, a data compression algorithm takes a chunk of data A and transforms it into another chunk of data B, such that B is (hopefully) smaller than A, that is, it requires fewer bits to be transmitted over a communication channel or to be stored on a storage medium. Every compression algorithm consists of two components: the encoder (or compressor) and the decoder (or decompressor). The encoder takes the original data A and outputs the compressed data B. The decoder takes B and produces some output C.
Symbolwise Data Compression
When compressing a given chunk of data A, we are usually less concerned with A's actual appearance as a bit string than with the information contained in A. This information is called the message, denoted as M. Many data compression techniques treat M as a sequence of symbols from a symbol set S (called the alphabet):

M = ⟨σ1, σ2, ..., σn⟩, σi ∈ S.
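As an illustration of how postings lists are typically compressed, here is a sketch of gap encoding combined with variable-byte (vByte) coding; this is a standard index-compression technique, though not necessarily the exact scheme behind the figures quoted above:

def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte; the high bit marks a number's last byte."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunk[-1] += 128          # flag the final byte of this number
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

postings = [824, 829, 215406]                                            # document ids
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]   # store gaps, which are small
encoded = vbyte_encode(gaps)
assert vbyte_decode(encoded) == gaps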
UNIT II
1) Scoring, term weighting and the vector space model:
Thus far we have dealt with indexes that support Boolean queries: a document either matches or does not match a query. In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas.
Weighted zone scoring
Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value. More specifically, consider a set of documents, each of which has ℓ zones. Let g_1, ..., g_ℓ ∈ [0, 1] such that

Σ_{i=1}^{ℓ} g_i = 1.

For 1 ≤ i ≤ ℓ, let s_i be the Boolean score denoting a match (or absence thereof) between q and the ith zone. For instance, the Boolean score from a zone could be 1 if all the query term(s) occur in that zone, and zero otherwise; indeed, it could be any Boolean function that maps the presence of query terms in a zone to 0, 1. Then, the weighted zone score is defined to be

Σ_{i=1}^{ℓ} g_i s_i.     (6.1)

Weighted zone scoring is sometimes also referred to as ranked Boolean retrieval.
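A small sketch of weighted zone scoring over two zones (the zone names and weights are illustrative assumptions):

def weighted_zone_score(query_terms, doc_zones, zone_weights):
    """doc_zones maps zone name -> set of terms in that zone;
    zone_weights maps zone name -> g_i, with the g_i summing to 1.
    The Boolean zone score s_i is 1 if all query terms occur in the zone."""
    score = 0.0
    for zone, g in zone_weights.items():
        s = 1 if all(t in doc_zones.get(zone, set()) for t in query_terms) else 0
        score += g * s
    return score

doc = {"title": {"shakespeare", "sonnets"}, "body": {"shall", "i", "compare", "thee"}}
print(weighted_zone_score({"shakespeare"}, doc, {"title": 0.7, "body": 0.3}))   # 0.7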
Learning weights
How do we determine the weights g_i for weighted zone scoring? These weights could be specified by an expert (or, in principle, the user); but increasingly, these weights are "learned" using training examples that have been judged editorially. This latter methodology falls under a general class of approaches to scoring and ranking in information retrieval, known as machine-learned relevance. We provide a brief introduction to this topic here because weighted zone scoring presents a clean setting for introducing it.
Vector space model:
The vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.
Documents and queries are represented as vectors:

d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})
q = (w_{1,q}, w_{2,q}, ..., w_{t,q})

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below).
The definition
of term depends on the application. Typically terms are single
words, keywords, or longer phrases.
If the words are chosen to be the terms, the dimensionality of the vector is
the number of words in the vocabulary (the number of distinct words occurring
in the corpus).
Relevance rankings of documents in a keyword search can be
calculated, using the assumptions of document similarities theory, by
comparing the deviation of angles between each document vector and the original
query vector where the query is represented as the same kind of vector as the
documents.
In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:

cos θ = (d_2 · q) / (||d_2|| ||q||)

where d_2 · q is the intersection (i.e. the dot product) of the document vector d_2 and the query vector q, ||d_2|| is the norm of vector d_2, and ||q|| is the norm of vector q. The norm of a vector a is calculated as:

||a|| = sqrt(Σ_i a_i^2)

As all vectors under consideration by this model are elementwise nonnegative, a cosine value of zero means that the query and document vector are orthogonal and have no match (i.e. the query term does not exist in the document being considered). See cosine similarity for further information.
Advantages of the vector space model:
1. Simple model based on linear algebra
2. Term weights not binary
3. Allows computing a continuous degree of similarity between queries and documents
4. Allows ranking documents according to their possible relevance
5. Allows partial matching
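A compact sketch of tf-idf weighting and cosine ranking over a toy collection (the documents and query are made up; the log-based tf-idf variant used here is one common choice among several):

import math
from collections import Counter

docs = {
    "d1": "new york times".split(),
    "d2": "new york post".split(),
    "d3": "los angeles times".split(),
}

N = len(docs)
df = Counter(term for tokens in docs.values() for term in set(tokens))   # document frequencies

def tfidf_vector(tokens):
    tf = Counter(t for t in tokens if t in df)          # ignore terms unseen in the collection
    return {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vec = tfidf_vector("new new times".split())
ranking = sorted(docs, key=lambda d: cosine(query_vec, tfidf_vector(docs[d])), reverse=True)
print(ranking)   # d1 ranks first: it matches both query terms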
2) Computing scores in a complete search system
3) Evolution of information retrieval:
The long history of
information retrieval does not begin with the internet. It is only in the last
decade and a half
of the IEEE’s one
hundred years that web search engines have become pervasive and search has
become
integrated into the
fabric of desktop and mobile operating systems.
Early use of computers
for IR
To discuss means of
dealing with a perceived explosion in the amounts of scientific information
available, a
specially convened
conference was held by the UK’s Royal Society in 1948.
Ranked retrieval
The style of search
used by both the electro-mechanical and computer-based IR systems was so-called
Boolean
retrieval. A query was
a logical combination of terms (a synonym of word in IR literature), which
resulted in a
set of those documents
that exactly matched the query. Luhn [19] proposed and Maron, Kuhns, and Ray
[20]
tested an alternative
approach, where each document in the collection was assigned a score indicating
its
relevance to a given
query.
Learning to rank
Up to this point, the
ranking functions used in search engines were manually devised and tuned by
hand through
experimentation. Fuhr
[40] described work where the retrieval function was learned based on relevant
documents identified
for an existing set of queries
Web search
Until the rise of the
web, the collections that people searched were selected and edited from
authoritative
sources by the search
service providers. On the web, however, the situation was different. Search
engine
developers quickly realised that they could use the links between web pages to construct a crawler or robot to traverse and gather most web pages on the internet, thereby automating acquisition of content.
Information retrieval is the activity of obtaining information resources
relevant to an information need from a collection of information resources.
Searches can be based on metadata or on
full-text (or other content-based) indexing.
Automated information retrieval
systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and
other documents. Web search engines are the most visible IR
applications.
An
information retrieval process begins when a user enters a query into the system. Queries are formal statements
of information needs, for example search strings in web search
engines. In information retrieval a query does not uniquely identify a single
object in the collection.
Many different
measures for evaluating the performance of information retrieval systems have
been proposed. The measures require a collection of documents and a query. All
common measures described here assume a ground truth notion of relevancy: every
document is known to be either relevant or non-relevant to a particular query.
In practice queries may be ill-posed and there may be
different shades of relevancy.
Precision is the
fraction of the documents retrieved that are relevant to the user's information need.
In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into
account. It can also be evaluated at a given cut-off rank, considering only the
topmost results returned by the system. This measure is called precision
at n or P@n.
Note that the meaning
and usage of "precision" in the field of Information Retrieval
differs from the definition of accuracy and precision within other branches of science
and statistics.
Recall is the fraction
of the documents that are relevant to the query that are successfully
retrieved.
In binary classification,
recall is often called sensitivity. So it can be looked at as the probability that a
relevant document is retrieved by the query.
It is trivial to
achieve recall of 100% by returning all documents in response to any query.
Therefore recall alone is not enough but one needs to measure the number of
non-relevant documents also, for example by computing the precision.
Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

fall-out = |{non-relevant documents retrieved}| / |{all non-relevant documents}|
In binary classification, fall-out is closely related to specificity and is equal to (1 − specificity). It can be looked at as the probability that a
non-relevant document is retrieved by the query.
It is trivial to
achieve fall-out of 0% by returning zero documents in response to any query.
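A short sketch computing these three measures from sets of retrieved and relevant document ids (the example ids and collection size are arbitrary):

def evaluate(retrieved, relevant, collection_size):
    tp = len(retrieved & relevant)                        # relevant documents that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    non_relevant = collection_size - len(relevant)
    fall_out = (len(retrieved) - tp) / non_relevant if non_relevant else 0.0
    return precision, recall, fall_out

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 6}
print(evaluate(retrieved, relevant, collection_size=10))   # (0.5, 0.667, 0.286) approximately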
Relevance feedback is a feature of some information
retrieval systems.
The idea behind relevance feedback is to take the results that are initially
returned from a given query and to use information about whether or not those
results are relevant to perform a new query. We can usefully distinguish
between three types of feedback: explicit feedback, implicit feedback, and
blind or "pseudo" feedback.
Explicit feedback is
obtained from assessors of relevance indicating the relevance of a document
retrieved for a query. This type of feedback is defined as explicit only when
the assessors (or other users of a system) know that the feedback provided is
interpreted as relevance judgments.
Users may indicate
relevance explicitly using a binary or graded relevance
system. Binary relevance feedback indicates that a document is either relevant
or irrelevant for a given query. Graded relevance feedback indicates the
relevance of a document to a query on a scale using numbers, letters, or
descriptions (such as "not relevant", "somewhat relevant",
"relevant", or "very relevant"). Graded relevance may also
take the form of a cardinal ordering of documents created by an assessor; that
is, the assessor places documents of a result set in order of (usually
descending) relevance. An example of this would be the SearchWiki feature
implemented by Google on their search website.
The relevance feedback information needs to be combined with the original query to improve retrieval performance, for example by using the well-known Rocchio algorithm.
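A minimal sketch of the Rocchio update, with the α, β, γ weights shown as commonly used illustrative defaults rather than prescribed values:

from collections import defaultdict

def rocchio(query_vec, relevant_vecs, nonrelevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(relevant) - gamma*centroid(non-relevant).
    Vectors are dicts mapping term -> weight."""
    new_query = defaultdict(float)
    for term, w in query_vec.items():
        new_query[term] += alpha * w
    for vecs, coeff in ((relevant_vecs, beta), (nonrelevant_vecs, -gamma)):
        if not vecs:
            continue
        for vec in vecs:
            for term, w in vec.items():
                new_query[term] += coeff * w / len(vecs)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_query.items() if w > 0}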
A performance metric which became popular around 2005 to measure the usefulness of a ranking algorithm based on explicit relevance feedback is NDCG. Other measures include precision at k and mean average precision.
Implicit feedback is
inferred from user behavior, such as noting which documents they do and do not
select for viewing, the duration of time spent viewing a document, or page
browsing or scrolling actions [1].
There are two key differences from explicit feedback:
1. the user is not assessing relevance for the benefit of the IR system, but only satisfying their own needs, and
2. the user is not necessarily informed that their behavior (selected documents) will be used as relevance feedback.
An example of this is
the Surf Canyon browser extension, which advances search results from later
pages of the result set based on both user interaction (clicking an icon) and
time spent viewing the page linked to in a search result.
Pseudo relevance
feedback, also known as blind relevance feedback, provides a method for
automatic local analysis. It automates the manual part of relevance feedback,
so that the user gets improved retrieval performance without an extended
interaction. The method is to do normal retrieval to find an initial set of
most relevant documents, to then assume that the top "k" ranked
documents are relevant, and finally to do relevance feedback as before under
this assumption. The procedure is:
1. Take the results returned by the initial query as relevant results (only the top k, with k between 10 and 50 in most experiments).
2. Select the top 20-30 (indicative number) terms from these documents using for instance tf-idf weights.
3. Do query expansion: add these terms to the query, then match the returned documents for this expanded query and finally return the most relevant documents, as sketched below.
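A rough sketch of this procedure; it assumes a search(query) function returning ranked document ids and a tfidf_vector(doc_id) function returning a term-weight dict, both hypothetical placeholders here:

from collections import Counter

def pseudo_relevance_feedback(query_terms, search, tfidf_vector, k=10, num_expansion_terms=20):
    # 1. Assume the top-k documents of the initial run are relevant.
    top_docs = search(query_terms)[:k]
    # 2. Pick the highest-weighted terms from those documents.
    term_weights = Counter()
    for doc_id in top_docs:
        term_weights.update(tfidf_vector(doc_id))
    expansion = [t for t, _ in term_weights.most_common(num_expansion_terms) if t not in query_terms]
    # 3. Re-run the search with the expanded query.
    return search(list(query_terms) + expansion)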
Some experiments such
as results from the Cornell SMART system published in (Buckley et al.1995),
show improvement of retrieval systems performances using pseudo-relevance
feedback in the context of TREC 4 experiments.
This automatic
technique mostly works. Evidence suggests that it tends to work better than
global analysis.[1] Through a query expansion, some relevant
documents missed in the initial round can then be retrieved to improve the
overall performance. Clearly, the effect of this method strongly relies on the
quality of selected expansion terms. It has been found to improve performance
in the TREC ad hoc task[citation needed]. But it is not
without the dangers of an automatic process. For example, if the query is about
copper mines and the top several documents are all about mines in Chile, then
there may be query drift in the direction of documents on Chile. In addition,
if the words added to the original query are unrelated to the query topic, the
quality of the retrieval is likely to be degraded, especially in Web search,
where web documents often cover multiple different topics. To improve the
quality of expansion words in pseudo-relevance feedback, a positional relevance
feedback for pseudo-relevance feedback has been proposed to select from
feedback documents those words that are focused on the query topic based on
positions of words in feedback documents.[2] Specifically, the positional relevance model
assigns more weights to words occurring closer to query words based on the
intuition that words closer to query words are more likely to be related to the
query topic.
UNIT III
1) XML retrieval:
XML Retrieval, or XML Information Retrieval,[1] is the content-based retrieval of documents structured with XML (eXtensible Markup Language). As such it is used for computing the relevance of XML documents.
Information retrieval systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text – by which we mean "raw" text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title and salary. There are fundamental differences between information retrieval and database systems in terms of retrieval model, data structures and query language, as shown in Table 10.1. Some highly structured text search problems are most efficiently handled by a relational database, for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query:

select lastname from employees where job_desc like 'invoic%';
Basic XML concepts
An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag. An element can have one or more XML attributes. In the XML document in Figure 10.1, the scene element is enclosed by the two tags <scene ...> and </scene>. It has an attribute number with value vii and two child elements, title and verse.
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>
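A short sketch of loading this tree with Python's standard ElementTree module and reading the structure described above:

import xml.etree.ElementTree as ET

doc = """<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>"""

root = ET.fromstring(doc)                      # the ordered, labeled tree
scene = root.find("act/scene")
print(scene.get("number"))                     # attribute value: vii
print(scene.find("title").text)                # child element text: Macbeth's castle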
2) Probabilistic information retrieval:
During the discussion of relevance feedback in Section 9.1.2, we observed that if we have some known relevant and nonrelevant documents, then we can straightforwardly start to estimate the probability of a term t appearing in a relevant document, P(t|R = 1), and that this could be the basis of a classifier that decides whether documents are relevant or not.
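A small sketch of this kind of estimate: counting how often a term occurs in known relevant versus nonrelevant documents, with add-0.5 smoothing (a common choice, used here as an assumption rather than as the book's exact formula):

def term_probabilities(term, relevant_docs, nonrelevant_docs):
    """Each document is a set of terms; returns (P(t|R=1), P(t|R=0)) with add-0.5 smoothing."""
    r = sum(1 for d in relevant_docs if term in d)
    n = sum(1 for d in nonrelevant_docs if term in d)
    p_rel = (r + 0.5) / (len(relevant_docs) + 1.0)
    p_nonrel = (n + 0.5) / (len(nonrelevant_docs) + 1.0)
    return p_rel, p_nonrel

relevant = [{"gold", "mining"}, {"gold", "price"}]
nonrelevant = [{"olympic", "gold"}, {"silver", "price"}]
print(term_probabilities("mining", relevant, nonrelevant))   # (0.5, ~0.167)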
3) Language models for information retrieval:
Unigram models
A unigram model used in information retrieval can be treated as the combination of several one-state finite automata.[1] It splits the probabilities of different terms in a context, e.g. from

P(t1 t2 t3) = P(t1) P(t2 | t1) P(t3 | t1 t2)

to

P_uni(t1 t2 t3) = P(t1) P(t2) P(t3).

In this model, the probability of hitting each word depends only on that word's own probability, so we only have one-state finite automata as units. For each automaton, we only have one way to hit its only state, assigned with one probability. Viewing from the whole model, the sum of all the one-state-hitting probabilities should be 1. The following is an illustration of a unigram model of a document.
Terms      Probability in doc
a          0.1
world      0.2
likes      0.05
we         0.05
share      0.3
...        ...
The probability generated for a specific query is calculated as

P(query) = Π_{term in query} P(term)
For different documents, we can
build their own unigram models, with different hitting probabilities of words
in it. And we use probabilities from different documents to generate different
hitting probabilities for a query. Then we can rank documents for a query
according to the generating probabilities. Next is an example of two unigram
models of two documents.
Terms      Probability in Doc1      Probability in Doc2
a          0.1                      0.3
world      0.2                      0.1
likes      0.05                     0.03
we         0.05                     0.02
share      0.3                      0.2
...        ...                      ...
In information retrieval contexts, unigram language models are often smoothed to avoid instances where P(term) = 0. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to create a smoothed document model.[2]
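A sketch of query-likelihood scoring with this kind of linear interpolation (Jelinek-Mercer smoothing); the λ value of 0.5 is an illustrative assumption:

from collections import Counter

def smoothed_unigram_score(query_terms, doc_tokens, collection_tokens, lam=0.5):
    """P(query|doc) with P(t) = lam*P_ml(t|doc) + (1-lam)*P_ml(t|collection)."""
    doc_tf, coll_tf = Counter(doc_tokens), Counter(collection_tokens)
    doc_len, coll_len = len(doc_tokens), len(collection_tokens)
    score = 1.0
    for t in query_terms:
        p_doc = doc_tf[t] / doc_len if doc_len else 0.0
        p_coll = coll_tf[t] / coll_len if coll_len else 0.0
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

d1 = "we share a world".split()
d2 = "we likes a world a lot".split()
collection = d1 + d2
print(smoothed_unigram_score(["share", "world"], d1, collection))
print(smoothed_unigram_score(["share", "world"], d2, collection))   # lower: d2 lacks "share"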
4) Text classification:
Document classification or document
categorization is a problem
in library science, information science and computer science. The
task is to assign a document to one or more classes or categories. This may be
done "manually" (or "intellectually") or algorithmically. The
intellectual classification of documents has mostly been the province of
library science, while the algorithmic classification of documents is used
mainly in information science and computer science. The problems are
overlapping, however, and there is therefore also interdisciplinary research on
document classification.
The documents to be classified
may be texts, images, music, etc. Each kind of document possesses its special
classification problems. When not otherwise specified, text classification is
implied.
Documents may be classified
according to their subjects or according to other attributes (such as document type,
author, printing year etc.). In the rest of this article only subject
classification is considered. There are two main philosophies of subject
classification of documents: The content based approach and the request based approach.
"Content
based" versus "request based" classification[edit]
Content based classification is classification in which the weight given to particular
subjects in a document determines the class to which the document is assigned.
It is, for example, a rule in much library classification that at least 20% of
the content of a book should be about the class to which the book is assigned.[1] In
automatic classification it could be the number of times given words appears in
a document.
Request oriented
classification (or
-indexing) is classification in which the anticipated request from users is
influencing how documents are being classified. The classifier asks himself:
“Under which descriptors should this entity be found?” and “think of all the
possible queries and decide for which ones the entity at hand is relevant”
(Soergel, 1985, p. 230[2]).
Request oriented classification
may be classification that is targeted towards a particular audience or user
group. For example, a library or a database for feminist studies may
classify/index documents differently when compared to a historical library. It
is probably better, however, to understand request oriented classification as policy based classification:
The classification is done according to some ideals and reflects the purpose of
the library or database doing the classification. In this way it is not
necessarily a kind of classification or indexing based on user studies. Only if
empirical data about use or users are applied should request oriented
classification be regarded as a user-based approach.
Classification versus indexing
Sometimes a distinction is made
between assigning documents to classes ("classification") versus
assigning subjects to documents ("subject indexing")
but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These
terminological distinctions,” he writes, “are quite meaningless and only serve
to cause confusion” (Lancaster, 2003, p. 21[3]). The view that
this distinction is purely superficial is also supported by the fact that a
classification system may be transformed into a thesaurus and vice
versa (cf., Aitchison, 1986,[4] 2004;[5] Broughton,
2008;[6] Riesthuis
& Bliedung, 1991[7]). Therefore, to label a document (say by assigning a term from a controlled vocabulary to it) is at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).
Automatic document classification
Automatic document classification
tasks can be divided into three sorts: supervised
document classification where
some external mechanism (such as human feedback) provides information on the
correct classification for documents, unsupervised
document classification (also
known as document clustering), where the classification must be done entirely without
reference to external information, and semi-supervised
document classification, where parts of the documents are labeled by the
external mechanism.
5) Vector space classification:
The document representation in Naive Bayes is a sequence of terms or a binary vector of term occurrences. In this chapter we adopt a different representation for text classification, the vector space model, developed in Chapter 6. It represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term. Thus, the document space X, the domain of the classification function γ, is R^|V|. This chapter introduces a number of classification methods that operate on real-valued vectors.
The basic hypothesis in using the vector space model for
classification is the contiguity hypothesis .
Contiguity hypothesis. Documents in the same
class form a contiguous region and regions of different classes do not overlap.
There are many classification tasks, in particular the type of
text classification that we encountered in Chapter 13 , where classes can be
distinguished by word patterns. For example, documents in the class China tend
to have high values on dimensions like Chinese, Beijing, and Mao whereas
documents in the class UK tend to have high values for London, British and
Queen. Documents of the two classes therefore form distinct contiguous regions
as shown in Figure 14.1 and we can draw
boundaries that separate them and classify new documents. How exactly this is
done is the topic of this chapter.
Rocchio classification
Figure 14.1 shows three classes,
China, UK and Kenya, in a two-dimensional (2D) space. Documents are shown as circles,
diamonds and X's. The boundaries in the figure, which we call decision
boundaries , are chosen to separate the three classes, but are otherwise
arbitrary. To classify a new document, depicted as a star in the figure, we
determine the region it occurs in and assign it the class of that region -
China in this case. Our task in vector space classification is to devise
algorithms that compute good boundaries where ``good'' means high
classification accuracy on data unseen during training.
Perhaps the best-known way of computing good class boundaries is Rocchio classification, which uses centroids to define the boundaries. The centroid of a class c is computed as the vector average or center of mass of its members:

μ(c) = (1/|D_c|) Σ_{d ∈ D_c} v(d)

where D_c is the set of documents in class c and v(d) is the vector representation of document d.
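A brief sketch of Rocchio classification: compute one centroid per class and assign a new document to the class of the nearest centroid (Euclidean distance is used here for simplicity; the toy training vectors are made up):

import math
from collections import defaultdict

def centroid(vectors):
    """Component-wise average of a list of dict vectors (term -> weight)."""
    total = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    return {t: w / len(vectors) for t, w in total.items()}

def train_rocchio(labeled_vectors):
    """labeled_vectors: list of (class_label, vector) pairs."""
    by_class = defaultdict(list)
    for label, vec in labeled_vectors:
        by_class[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def distance(u, v):
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def classify(centroids, vec):
    return min(centroids, key=lambda label: distance(centroids[label], vec))

training = [("China", {"chinese": 2.0, "beijing": 1.0}),
            ("China", {"chinese": 1.0, "mao": 1.0}),
            ("UK", {"london": 2.0, "british": 1.0})]
model = train_rocchio(training)
print(classify(model, {"chinese": 1.0, "beijing": 1.0}))   # China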
UNIT IV
1) Support vector machines and machine learning on documents:
We will initially motivate and develop SVMs for the case of two-class data sets that are separable by a linear classifier (Section 15.1 ), and then extend the model in Section 15.2 to non-separable data, multi-class problems, and nonlinear models, and also present some additional discussion of SVM performance. The chapter then moves to consider the practical deployment of text classifiers in Section 15.3 : what sorts of classifiers are appropriate when, and how can you exploit domain-specific text features in classification? Finally, we will consider how the machine learning technology that we have been building for text classification can be applied back to the problem of learning how to rank documents in ad hoc retrieval (Section 15.4 ). While several machine learning methods have been applied to this task, use of SVMs has been prominent. Support vector machines are not necessarily better than other machine learning methods (except perhaps in situations with little training data), but they perform at the state-of-the-art level and have much current theoretical and empirical appeal.
Soft margin classification
For the very high dimensional problems common in
text classification, sometimes the data are linearly separable. But in the
general case they are not, and even if they are, we might prefer a solution
that better separates the bulk of the data while ignoring a few weird noise
documents.
Multiclass SVMs
SVMs are inherently two-class classifiers. The traditional way to do multiclass classification with SVMs is to use one of the methods discussed in Section 14.5. In particular, the most common technique in practice has been to build one-versus-rest classifiers (commonly referred to as "one-versus-all" or OVA classification), one per class, and to choose the class which classifies the test datum with greatest margin.
Nonlinear SVMs
Figure 15.6: Projecting data that is not linearly separable into a higher dimensional space can make it linearly separable.
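A hedged sketch of one-versus-rest text classification with a linear SVM, assuming scikit-learn is installed (the tiny training set is made up, so the learned boundaries are only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

train_texts = ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
               "London British Queen", "Tokyo Japan Chinese"]
train_labels = ["china", "china", "uk", "japan"]

vectorizer = TfidfVectorizer()                 # documents -> tf-idf vectors
X = vectorizer.fit_transform(train_texts)

# One linear SVM per class; the class with the largest decision value wins.
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(X, train_labels)

print(classifier.predict(vectorizer.transform(["Chinese Chinese Beijing"])))   # likely ['china'] on this toy data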
2) Flat clustering:
Clustering algorithms group a set of documents into subsets or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters. Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.
The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13 , page 13.1 ): our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.
The key input to a clustering algorithm is the distance measure. In Figure 16.1 , the distance measure is distance in the 2D plane. This measure suggests three different clusters in the figure. In document clustering, the distance measure is often also Euclidean distance. Different distance measures give rise to different clusterings. Thus, the distance measure is an important means by which we can influence the outcome of clustering.
Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters and will be covered in Chapter 17 . Chapter 17 also addresses the difficult problem of labeling clusters automatically.
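The most common flat clustering algorithm is K-means; here is a compact sketch over 2D points like those in the figure referred to above (the points, K = 2, and the simple deterministic seeding are illustrative assumptions):

def kmeans(points, k, iterations=20):
    """points: list of (x, y) tuples; returns a list of cluster assignments."""
    centroids = list(points[:k])           # simple deterministic seeding for this sketch
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
        assignment = [min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
                      for p in points]
        # Update step: recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return assignment

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
print(kmeans(points, k=2))   # -> [0, 1, 0, 1, 0, 1]: two tight clusters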
3) Hierarchical clustering:
Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering. Hierarchical clustering does not require us to prespecify the number of clusters, and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of K-means and EM.
Hierarchical agglomerative clustering
Hierarchical clustering algorithms are either top-down or
bottom-up. Bottom-up algorithms treat each document as a singleton cluster at
the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a
single cluster that contains all documents. Bottom-up hierarchical clustering
is therefore called hierarchical agglomerative clustering or HAC . Top-down clustering
requires a method for splitting a cluster. It proceeds by splitting clusters
recursively until individual documents are reached. See Section 17.6 . HAC is more
frequently used in IR than top-down clustering and is the main subject of this
chapter.
Figure 17.2: A simple, but inefficient HAC algorithm.
We will now refine this algorithm for the different similarity measures of single-link and complete-link clustering (Section 17.2) and group-average and centroid clustering (Sections 17.3 and 17.4). The merge criteria of these four variants of HAC are shown in Figure 17.3.
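A naive sketch of the bottom-up merge loop with single-link similarity; it is deliberately simple (and at least quadratic per merge), meant only to make the agglomeration step concrete:

def single_link_hac(vectors, similarity):
    """vectors: list of items; similarity: function(item, item) -> float.
    Returns the merge history as (cluster_a, cluster_b) pairs of index tuples."""
    clusters = [(i,) for i in range(len(vectors))]       # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose two most similar members are most similar (single link).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(similarity(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [clusters[a] + clusters[b]]
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
sim = lambda p, q: -((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)    # negative squared distance
print(single_link_hac(points, sim))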
5) Matrix decompositions and latent semantic indexing
On page 6.3.1 we introduced the notion of a term-document matrix: an M x N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1 we
first develop a class of operations from linear algebra, known as matrix
decomposition. In Section 18.2 we
use a special form of matrix decomposition to construct a low-rank approximation
to the term-document matrix. In Section 18.3 we
examine the application of such low-rank approximations to indexing and
retrieving documents, a technique referred to as latent semantic indexing.
While latent semantic indexing has not been established as a significant force
in scoring and ranking for information retrieval, it remains an intriguing
approach to clustering in a number of domains including for collections of text
documents (Section 16.6, page 16.6).
Understanding its full potential remains an area of active research.
Matrix decompositions
In this section we examine ways in which a square matrix can be factored into the product of
matrices derived from its eigenvectors; we refer to this process as matrix
decomposition . Matrix decompositions
similar to the ones in this section will form the basis of our principal
text-analysis technique in Section 18.3 , where we will look at
decompositions of non-square term-document matrices. The square decompositions
in this section are simpler and can be treated with sufficient mathematical
rigor to help the reader understand how such decompositions work. The detailed
mathematical derivation of the more complex decompositions in Section 18.2 is beyond the scope of
this book.
We begin by giving two theorems on the decomposition of a square matrix into the product of three matrices of a special form. The first of these gives the basic factorization of a square real-valued matrix into three factors. The second applies to square symmetric matrices and is the basis of the singular value decomposition described in Section 18.2.

Theorem. (Matrix diagonalization theorem) Let S be a square real-valued M x M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

S = U Λ U^(-1),

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order. If the eigenvalues are distinct, then this decomposition is unique. End theorem.
To understand how this theorem works, we note that U has the eigenvectors of S as its columns. Then we have

S U = S (u_1  u_2  ...  u_M)
    = (λ_1 u_1  λ_2 u_2  ...  λ_M u_M)
    = U Λ.

Thus, we have S U = U Λ, or S = U Λ U^(-1).
We next state a closely related decomposition of a symmetric square matrix into the product of matrices derived from its eigenvectors. This will pave the way for the development of our main tool for text analysis, the singular value decomposition (Section 18.2).

Theorem. (Symmetric diagonalization theorem) Let S be a square, symmetric real-valued M x M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

S = Q Λ Q^T,

where the columns of Q are the orthogonal and normalized (unit length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. Further, all entries of Q are real and we have Q^(-1) = Q^T. End theorem.

We will build on this symmetric diagonal decomposition to build low-rank approximations to term-document matrices.
Latent semantic indexing
We now discuss the approximation of a term-document matrix C by one of lower rank using the SVD. The low-rank approximation to C yields a
new representation for each document in the collection. We will cast queries
into this low-rank representation as well, enabling us to compute
query-document similarity scores in this low-rank representation. This process
is known as latent semantic indexing (generally abbreviated
LSI).
But first, we motivate such an approximation. Recall the vector space representation of documents and queries introduced in Section 6.3. We then use the new low-rank representation to compute cosine similarities, as in Section 6.3.1, between a query and a document, between two documents, or between two terms. Note especially that Equation 244 does not in any way depend on q being a query; it is simply a vector in the space of terms. This means that if we have an LSI representation of a collection of documents, a new document not in the collection can be "folded in" to this representation using Equation 244. This allows us to incrementally add documents to an LSI representation. Of course, such incremental addition fails to capture the co-occurrences of the newly added documents (and even ignores any new terms they contain). As such, the quality of the LSI representation will degrade as more documents are added and will eventually require a recomputation of the LSI representation.
The fidelity of the approximation of C_k to C improves as k, the number of singular values retained, increases.

Worked example. Consider the term-document matrix C:

          d1   d2   d3   d4   d5   d6
ship       1    0    1    0    0    0
boat       0    1    0    0    0    0
ocean      1    1    0    0    0    0
voyage     1    0    0    1    1    0
trip       0    0    0    1    0    1

Its singular value decomposition C = U Σ V^T has the singular values 2.16, 1.59, 1.28, 1.00 and 0.39 on the diagonal of Σ; a low-rank approximation C_k is obtained by keeping only the k largest of these singular values and zeroing out the rest.
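A sketch of this computation with NumPy, reproducing the singular values of the worked-example matrix and forming a rank-2 approximation (rank 2 is the truncation typically used when this example is discussed, taken here as an assumption):

import numpy as np

# Term-document matrix C from the worked example (rows: ship, boat, ocean, voyage, trip).
C = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
print(np.round(sigma, 2))            # approximately [2.16 1.59 1.28 1.   0.39]

k = 2                                # keep the two largest singular values
C_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]    # rank-k approximation of C

# Map a query vector into the k-dimensional LSI space: q_k = Sigma_k^{-1} U_k^T q.
q = np.array([0, 0, 0, 1, 1], dtype=float)         # query containing "voyage" and "trip"
q_k = np.diag(1 / sigma[:k]) @ U[:, :k].T @ q
print(np.round(q_k, 2))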
UNIT V
1) Web search basics:
It is crucial that we understand the users of web search as well. This is again a significant change from traditional information retrieval, where users were typically professionals with at least some training in the art of phrasing queries over a well-authored collection whose style and structure they understood well.
Index size and estimation
To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes – some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages; for instance, http://www.yahoo.com/any_string returns a valid HTML page rather than an error, politely informing the user that there is no such page at Yahoo! Such a "soft 404 error" is only one example of the many ways in which web servers can generate an infinite number of valid web pages.
2) Web crawling and indexes:
Web crawling is
the process by which we gather pages from the Web, in
order to index
them and support a search engine. The objective of crawling
is to quickly
and efficiently gather as many useful web pages as possible,
together with
the link structure that interconnects them.
Features a
crawler should provide
Distributed: The crawler should have the ability to execute in a
distributed
fashion across
multiple machines.
Scalable: The crawler architecture should permit scaling up
the crawl rate
by adding extra
machines and bandwidth.
Performance and
efficiency: The crawl system should
make efficient use of
various system
resources including processor, storage and network bandwidth.
Quality: Given that a significant fraction of all web pages
are of poor utility
for serving user
query needs, the crawler should be biased towards
fetching
“useful” pages first.
Freshness: In
many applications, the crawler should operate in continuous
mode: it should obtain fresh copies of
previously fetched pages. A search
engine crawler, for instance, can thus
ensure that the search engine’s index
contains a fairly current representation
of each indexed web page. For
such continuous crawling, a crawler
should be able to crawl a page with
a frequency that approximates the rate
of change of that page.
Crawler architecture
The simple scheme outlined
above for crawling demands several modules
that fit together as shown
in Figure 20.1.
1. The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous crawling, a URL may have been fetched previously but is back in the frontier for re-fetching). We describe this further in Section 20.2.3.
2. A DNS resolution module that determines the web server from which to fetch the page specified by a URL. We describe this further in Section 20.2.2.
3. A fetch module that uses the http protocol to retrieve the web page at a URL.
4. A parsing module that extracts the text and set of links from a fetched web page.
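A highly simplified sketch of a frontier, fetch and parse loop using only the Python standard library (it omits politeness, robots.txt handling, DNS caching and distribution, all of which a real crawler needs):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Parsing module: collect href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)            # URL frontier: URLs yet to be fetched
    seen, fetched = set(seed_urls), {}
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")   # fetch module
        except OSError:
            continue
        fetched[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return fetched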
3) Link analysis:
The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. In this chapter we focus on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query. We begin by reviewing some basics of the Web as a graph in Section 21.1, then proceed to the technical development of the elements of link analysis for ranking.
Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations amongst them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter's tally of in-links. This phenomenon is referred to as link spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough that it is feasible for web search engines to derive useful signals for ranking from more sophisticated link analysis. Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the front queues of Chapter 20.
The Web as a graph
Recall the notion of the web graph from Section 19.2.1 and particularly Figure 19.2. Our study of link analysis builds on two intuitions:
1. The anchor text pointing to page B is a good description of page B.
2. The hyperlink from A to B represents an endorsement of page B by the creator of page A.
This is not always the case; for instance, many links amongst pages within a single website stem from the use of a common template. For instance, most corporate websites have a pointer from every page to a page containing a copyright notice – this is clearly not an endorsement. Accordingly, implementations of link analysis algorithms will typically discount such "internal" links.
21.1.1 Anchor text and the web graph
The following fragment of HTML code from a web page shows a hyperlink pointing to the home page of the Journal of the ACM:

<a href="http://www.acm.org/jacm/">Journal of the ACM.</a>
PageRank
We now focus on scoring and ranking measures derived from the link structure alone. Our first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank.
Markov chains
A Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps in each of which a random choice is made. A Markov chain consists of N states. Each web page will correspond to a state in the Markov chain we will formulate.
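A compact sketch of PageRank by power iteration over a small made-up link graph, with a damping factor of 0.85 (a conventional choice, assumed here):

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping every page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}                     # start from the uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}   # teleportation component
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:                                          # dangling node: spread its rank everywhere
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print({p: round(r, 3) for p, r in pagerank(web).items()})   # C gets the highest score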