UNIT I
1)      Boolean retrieval:
The Boolean model of information retrieval (BIR)[1] is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one. It is used by many IR systems to this day.
2)      Definitions
The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms. Retrieval is based on whether or not the documents contain the query terms. Given a finite set

T = {t1, t2, ..., tj, ..., tm}

of elements called index terms (e.g. words or expressions, which may be stemmed, describing or characterising documents, such as keywords given for a journal article), and a finite set

D = {D1, ..., Di, ..., Dn}, where each Di is an element of the powerset of T,

of elements called documents. A query is a Boolean expression Q in conjunctive normal form:

Q = (Wi OR Wk OR ...) AND ... AND (Wj OR Ws OR ...),

with Wi = ti, Wk = tk, Wj = tj, Ws = ts, or Wi = NOT ti, Wk = NOT tk, Wj = NOT tj, Ws = NOT ts,

where ti means that the term ti is present in a document, whereas NOT ti means that it is not.

Equivalently, Q can be given in disjunctive normal form, too. An operation called retrieval, consisting of two steps, is defined as follows:

1. The sets Sj of documents are obtained that contain (or do not contain) the term tj, depending on whether Wj = tj or Wj = NOT tj:

Sj = {Di | tj ∈ Di}   (or its complement, for Wj = NOT tj)

2. Those documents are retrieved in response to Q which result from the corresponding set operations: the answer to Q is the intersection, over the AND clauses, of the unions of the sets Sj within each clause.

The model has several advantages:

·         Clean formalism
·         Easy to implement
·         Intuitive concept
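Under these definitions, Boolean retrieval reduces to set operations over the term sets of documents. A minimal Python sketch, with invented document contents:

```python
# A toy collection: each document is a set of index terms.
docs = {
    "D1": {"brutus", "caesar", "calpurnia"},
    "D2": {"brutus", "caesar"},
    "D3": {"caesar", "mercy"},
}

def matching(term):
    """Set S of document IDs whose term set contains `term`."""
    return {d for d, terms in docs.items() if term in terms}

# Query: brutus AND caesar AND NOT calpurnia
answer = matching("brutus") & matching("caesar") - matching("calpurnia")
print(sorted(answer))  # ['D2']
```

The AND clauses become set intersections, OR becomes union, and NOT becomes set difference against the whole collection.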

2) Recall the major steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in
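The four steps above can be sketched in a few lines of Python; the toy documents are invented, and lowercasing stands in for the linguistic preprocessing step:

```python
import re
from collections import defaultdict

# 1. Collect the documents to be indexed.
documents = {1: "new home sales top forecasts",
             2: "home sales rise in july",
             3: "increase in home sales in july"}

index = defaultdict(set)
for doc_id, text in documents.items():
    # 2. Tokenize, 3. preprocess (here: just lowercase).
    tokens = re.findall(r"[a-z]+", text.lower())
    # 4. Record which documents each term occurs in.
    for term in tokens:
        index[term].add(doc_id)

postings = {t: sorted(ids) for t, ids in index.items()}
print(postings["sales"])  # [1, 2, 3]
print(postings["july"])   # [2, 3]
```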

Obtaining the character sequence in a document

Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters. For the case of plain English text in ASCII encoding, this is trivial. But often things get much more complex. The sequence of characters may be encoded by one of various single byte or multibyte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards.
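As a quick sketch of this first step in Python, assuming a UTF-8 input (the text and encoding are illustrative; in practice the encoding often has to be detected or declared):

```python
# Bytes as they might arrive from a file or web server.
raw = "naïve café".encode("utf-8")
print(type(raw), len(raw))   # 12 bytes for 10 characters (ï and é are multibyte)

# The linear sequence of characters the indexer actually works with.
text = raw.decode("utf-8")
print(text, len(text))       # naïve café 10
```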

Choosing a document unit

The next phase is to determine what the document unit for indexing is. Thus far we have assumed that documents are fixed units for the purposes of indexing. For example, we take each file in a folder as a document. But there are many cases in which you might want to do something different. A traditional Unix (mbox-format) email file stores a sequence of email messages (an email folder) in one file, but you might wish to regard each email message as a separate document. Many email messages now contain attached documents, and you might then want to regard the email message and each contained attachment as separate documents. If an email message has an attached zip file, you might want to decode the zip file and regard each file it contains as a separate document. Going in the opposite direction, various pieces of web software (such as latex2html) take things that you might regard as a single document (e.g., a Powerpoint file or a LATEX document) and split them into separate HTML pages for each slide or subsection, stored as separate files. In these cases, you might want to combine multiple files into a single document.
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens , perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: [Friends] [Romans] [Countrymen] [lend] [me] [your] [ears]
These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system's dictionary.
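A minimal tokenizer for the example above might look like this; the regex-based splitting is one simple choice, not the only one:

```python
import re

text = "Friends, Romans, Countrymen, lend me your ears;"
tokens = re.findall(r"\w+", text)   # chop into tokens, throwing away punctuation
types_ = set(tokens)                # types: distinct character sequences

print(tokens)
print(len(tokens), len(types_))     # 7 tokens, 7 types (no sequence repeats here)
```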
Dropping common terms: stop words
Figure 2.5: A stop list of 25 semantically non-selective words which are common in Reuters-RCV1.
Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing.
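The strategy just described can be sketched as follows. The toy collection and the cutoff of three terms are arbitrary illustrations; a real stop list would be built from the full corpus and hand-filtered for semantic content:

```python
from collections import Counter

collection = ["the cat sat on the mat", "the dog and the cat", "a dog on a mat"]
collection_frequency = Counter(t for doc in collection for t in doc.split())

# Take the most frequent terms as the stop list (cutoff of 3 is arbitrary).
stop_words = {term for term, _ in collection_frequency.most_common(3)}
print(sorted(stop_words))

# During indexing, members of the stop list are simply discarded:
tokens = [t for t in "the cat sat".split() if t not in stop_words]
```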

Normalization (equivalence classing of terms)

Having broken up our documents (and also our query) into tokens, the easy case is if tokens in the query just match tokens in the token list of the document. However, there are many cases when two character sequences are not quite the same but you would like a match to occur. For instance, if you search for USA, you might hope to also match documents containing U.S.A.
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.
The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
Figure 2.6: An example of how asymmetric expansion of query terms can usefully model users' expectations.

Accents and diacritics.

Diacritics on characters in English have a fairly marginal status, and we might well want cliché and cliche to match, or naive and naïve. This can be done by normalizing tokens to remove diacritics. In many other languages, diacritics are a regular part of the writing system and distinguish different sounds. Occasionally words are distinguished only by their accents. For instance, in Spanish, peña is 'a cliff', while pena is 'sorrow'. Nevertheless, the important question is usually not prescriptive or linguistic but is a question of how users are likely to write queries for these words. In many cases, users will enter queries for words without diacritics, whether for reasons of speed, laziness, limited software, or habits born of the days when it was hard to use non-ASCII text on many computer systems. In these cases, it might be best to equate all words to a form without diacritics.


Capitalization/case-folding

A common strategy is to do case-folding by reducing all letters to lower case. Often this is a good idea: it will allow instances of Automobile at the beginning of a sentence to match with a query of automobile. It will also help on a web search engine when most of your users type in ferrari when they are interested in a Ferrari car. On the other hand, such case folding can equate words that might better be kept apart. Many proper nouns are derived from common nouns and so are distinguished only by case, including companies (General Motors, The Associated Press), government organizations (the Fed vs. fed) and person names (Bush, Black). We already mentioned an example of unintended query expansion with acronyms, which involved not only acronym normalization (C.A.T. → CAT) but also case-folding (CAT → cat).
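Both case-folding and diacritic removal can be sketched with Python's unicodedata module; combining them into a single normalization policy is an illustrative choice, not a requirement:

```python
import unicodedata

def normalize(token):
    """One possible policy: case-fold, then strip combining marks (diacritics)."""
    folded = token.lower()
    # NFD decomposes accented characters, e.g. é -> e + combining accent.
    decomposed = unicodedata.normalize("NFD", folded)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(normalize("Automobile"))  # automobile
print(normalize("naïve"))       # naive
print(normalize("cliché"))      # cliche
```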
3) Dictionaries and tolerant retrieval.

4) Index construction:
5) Index compression.

An inverted index for a given text collection can be quite large, especially if it contains full
positional information about all term occurrences in the collection. In a typical collection of
English text there is approximately one token for every 6 bytes of text (including punctuation
and whitespace characters). Hence, if postings are stored as 64-bit integers, we may expect that
an uncompressed positional index for such a collection consumes between 130% and 140% of
the uncompressed size of the original text data. This estimate is confirmed by Table 6.1, which
lists the three example collections used in this book along with their uncompressed size, their
compressed size, and the sizes of an uncompressed and a compressed schema-independent index.
An uncompressed schema-independent index for the TREC45 collection, for instance, consumes
about 331 MB, or 122% of the raw collection size.

General-Purpose Data Compression
In general, a data compression algorithm takes a chunk of data A and transforms it into another
chunk of data B, such that B is (hopefully) smaller than A, that is, it requires fewer bits to
be transmitted over a communication channel or to be stored on a storage medium. Every
compression algorithm consists of two components: the encoder (or compressor) and the decoder
(or decompressor). The encoder takes the original data A and outputs the compressed data B.
The decoder takes B and produces some output C. If the compression algorithm is lossless, the decoder's output C is identical to the original data A.
Symbolwise Data Compression
When compressing a given chunk of data A, we are usually less concerned with A’s actual
appearance as a bit string than with the information contained in A. This information is called
the message, denoted as M. Many data compression techniques treat M as a sequence of symbols
from a symbol set S (called the alphabet):
M = ⟨σ1, σ2, . . . , σn⟩,  σi ∈ S.
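The encoder/decoder split can be demonstrated with any general-purpose compressor; here Python's zlib stands in for the abstract algorithm. Because zlib is lossless, the decoder's output C equals the original A:

```python
import zlib

A = b"to be or not to be, that is the question; " * 20  # original data A
B = zlib.compress(A)      # encoder: A -> compressed data B
C = zlib.decompress(B)    # decoder: B -> output C

print(len(B) < len(A))    # True: B requires fewer bytes than A
print(C == A)             # True: lossless, so C is identical to A
```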

                              UNIT II
1)       Scoring, term weighting and the vector space model:

Thus far we have dealt with indexes that support Boolean queries: a document
either matches or does not match a query. In the case of large document
collections, the resulting number of matching documents can far exceed the
number a human user could possibly sift through. Accordingly, it is essential
for a search engine to rank-order the documents matching a query. To do
this, the search engine computes, for each matching document, a score with
respect to the query at hand. In this chapter we initiate the study of assigning
a score to a (query, document) pair. This chapter consists of three main ideas.

Weighted zone scoring
Given a Boolean query q and a document d, weighted zone scoring assigns
to the pair (q, d) a score in the interval [0, 1], by computing a linear combination
of zone scores, where each zone of the document contributes a Boolean
value. More specifically, consider a set of documents, each of which has ℓ
zones. Let g1, . . . , gℓ ∈ [0, 1] be such that Σ(i=1..ℓ) gi = 1. For 1 ≤ i ≤ ℓ, let si be the
Boolean score denoting a match (or absence thereof) between q and the ith
zone. For instance, the Boolean score from a zone could be 1 if all the query
term(s) occur in that zone, and zero otherwise; indeed, it could be any Boolean
function that maps the presence of query terms in a zone to 0, 1. Then,
the weighted zone score is defined to be

Σ(i=1..ℓ) gi si.    (6.1)
Weighted zone scoring is sometimes also referred to as ranked Boolean retrieval.
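A sketch of equation (6.1), with invented zones and weights; a naive substring test stands in for the Boolean zone-match function:

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Equation (6.1): the sum of g_i * s_i over the zones i."""
    score = 0.0
    for zone, g in weights.items():
        # s_i = 1 if all query terms occur in zone i (naive substring check).
        s = 1 if all(t in doc_zones.get(zone, "") for t in query_terms) else 0
        score += g * s
    return score

doc = {"title": "shakespeare macbeth", "body": "macbeth wrote by shakespeare ..."}
weights = {"title": 0.6, "body": 0.4}  # the g_i, summing to 1

print(weighted_zone_score(["macbeth"], doc, weights))  # 1.0 (matches both zones)
print(weighted_zone_score(["wrote"], doc, weights))    # 0.4 (body zone only)
```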
Learning weights
How do we determine the weights gi for weighted zone scoring? These
weights could be specified by an expert (or, in principle, the user); but increasingly,
these weights are “learned” using training examples that have
been judged editorially. This latter methodology falls under a general class
of approaches to scoring and ranking in information retrieval, known as
machine-learned relevance. We provide a brief introduction to this topic here
because weighted zone scoring presents a clean setting for introducing it;

Vector space model:
Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.
Documents and queries are represented as vectors:

dj = (w1,j, w2,j, ..., wt,j)
q = (w1,q, w2,q, ..., wt,q)
Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below).
The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If the words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).
Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.
In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:

cos θ = (d2 · q) / (‖d2‖ ‖q‖)

Where d2 · q is the dot product of the document vector (d2 in the figure to the right) and the query vector (q in the figure), ‖d2‖ is the norm of vector d2, and ‖q‖ is the norm of vector q. The norm of a vector is calculated as:

‖d‖ = √(Σi di²)
As all vectors under consideration by this model are elementwise nonnegative, a cosine value of zero means that the query and document vector are orthogonal and have no match (i.e. the query term does not exist in the document being considered). See cosine similarity for further information.
The vector space model has the following advantages over the Standard Boolean model:
1.     Simple model based on linear algebra
2.     Term weights not binary
3.     Allows computing a continuous degree of similarity between queries and documents
4.     Allows ranking documents according to their possible relevance
5.     Allows partial matching
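The cosine computation described above can be sketched directly; the term weights below are invented illustrative values rather than real tf-idf weights:

```python
import math

def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented weights over a three-term vocabulary [caesar, brutus, mercy].
q  = [1.0, 1.0, 0.0]
d1 = [0.9, 0.8, 0.1]
d2 = [0.0, 0.0, 1.0]
print(cosine(q, d1))  # close to 1: d1 shares the query's terms
print(cosine(q, d2))  # 0.0: orthogonal, no terms in common
```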

2) Computing scores in a complete search system

3)   Evolution of information retrieval:

The long history of information retrieval does not begin with the internet. It is only in the last decade and a half
of the IEEE’s one hundred years that web search engines have become pervasive and search has become
integrated into the fabric of desktop and mobile operating systems.
Early use of computers for IR
To discuss means of dealing with a perceived explosion in the amounts of scientific information available, a
specially convened conference was held by the UK’s Royal Society in 1948.

Ranked retrieval
The style of search used by both the electro-mechanical and computer-based IR systems was so-called Boolean
retrieval. A query was a logical combination of terms (a synonym of word in IR literature), which resulted in a
set of those documents that exactly matched the query. Luhn [19] proposed and Maron, Kuhns, and Ray [20]
tested an alternative approach, where each document in the collection was assigned a score indicating its
relevance to a given query.
Learning to rank
Up to this point, the ranking functions used in search engines were manually devised and tuned by hand through
experimentation. Fuhr [40] described work where the retrieval function was learned based on relevant
documents identified for an existing set of queries

Web search
Until the rise of the web, the collections that people searched were selected and edited from authoritative
sources by the search service providers. On the web, however, the situation was different. Search engine
developers quickly realised that they could use the links between web pages to construct a crawler or robot to
traverse and gather most web pages on the internet, thereby automating acquisition of content.

Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.
Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection.
Performance and correctness measures
Main article: Precision and recall
Many different measures for evaluating the performance of information retrieval systems have been proposed. The measures require a collection of documents and a query. All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and there may be different shades of relevancy.
Precision is the fraction of the documents retrieved that are relevant to the user's information need.
In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.
Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and statistics.
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.
In binary classification, recall is often called sensitivity. So it can be looked at as the probability that a relevant document is retrieved by the query.
It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore recall alone is not enough but one needs to measure the number of non-relevant documents also, for example by computing the precision.
Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

fall-out = |{non-relevant documents retrieved}| / |{all non-relevant documents}|

In binary classification, fall-out is closely related to specificity and is equal to 1 − specificity. It can be looked at as the probability that a non-relevant document is retrieved by the query.
It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.
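The three measures can be computed together from the retrieved set, the relevant set, and the collection size; the document IDs below are a toy illustration:

```python
def evaluate(retrieved, relevant, collection_size):
    """Precision, recall, and fall-out for one query with binary relevance."""
    rel_retrieved = len(retrieved & relevant)
    precision = rel_retrieved / len(retrieved) if retrieved else 0.0
    recall = rel_retrieved / len(relevant) if relevant else 0.0
    nonrelevant = collection_size - len(relevant)
    fallout = len(retrieved - relevant) / nonrelevant if nonrelevant else 0.0
    return precision, recall, fallout

retrieved = {1, 2, 3, 4}   # documents returned by the system
relevant = {1, 2, 5}       # ground-truth relevant documents
p, r, f = evaluate(retrieved, relevant, collection_size=10)
print(p, r, f)             # 0.5, 2/3, 2/7
```

Returning all ten documents would drive recall to 1 but fall-out to 1 as well, which is why the measures are reported together.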

4) Relevance feedback 

Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query and to use information about whether or not those results are relevant to perform a new query. We can usefully distinguish between three types of feedback: explicit feedback, implicit feedback, and blind or "pseudo" feedback.
Explicit feedback
Explicit feedback is obtained from assessors of relevance indicating the relevance of a document retrieved for a query. This type of feedback is defined as explicit only when the assessors (or other users of a system) know that the feedback provided is interpreted as relevance judgments.
Users may indicate relevance explicitly using a binary or graded relevance system. Binary relevance feedback indicates that a document is either relevant or irrelevant for a given query. Graded relevance feedback indicates the relevance of a document to a query on a scale using numbers, letters, or descriptions (such as "not relevant", "somewhat relevant", "relevant", or "very relevant"). Graded relevance may also take the form of a cardinal ordering of documents created by an assessor; that is, the assessor places documents of a result set in order of (usually descending) relevance. An example of this would be the SearchWiki feature implemented by Google on their search website.
The relevance feedback information needs to be interpolated with the original query to improve retrieval performance, as in the well-known Rocchio algorithm.
A performance metric which became popular around 2005 to measure the usefulness of a ranking algorithm based on explicit relevance feedback is NDCG. Other measures include precision at k and mean average precision.
Implicit feedback
Implicit feedback is inferred from user behavior, such as noting which documents they do and do not select for viewing, the duration of time spent viewing a document, or page browsing or scrolling actions [1].
The key differences of implicit relevance feedback from that of explicit include [2]:
1.     the user is not assessing relevance for the benefit of the IR system, but only satisfying their own needs and
2.     the user is not necessarily informed that their behavior (selected documents) will be used as relevance feedback
An example of this is the Surf Canyon browser extension, which advances search results from later pages of the result set based on both user interaction (clicking an icon) and time spent viewing the page linked to in a search result.
Blind feedback
Pseudo relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis. It automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, to then assume that the top "k" ranked documents are relevant, and finally to do relevance feedback as before under this assumption. The procedure is:
1.     Take the results returned by the initial query as relevant results (only the top k, with k typically between 10 and 50 in most experiments).
2.     Select top 20-30 (indicative number) terms from these documents using for instance tf-idf weights.
3.     Do Query Expansion, add these terms to query, and then match the returned documents for this query and finally return the most relevant documents.
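The three steps above can be sketched as follows. Plain term frequency stands in for tf-idf term selection, and the toy documents are invented; note how the example also exhibits the query-drift risk discussed in this section:

```python
from collections import Counter

def pseudo_feedback(query, ranked_docs, k=2, n_terms=3):
    # 1. Assume the top-k retrieved documents are relevant.
    feedback_docs = ranked_docs[:k]
    # 2. Select the most frequent terms from them (tf-idf in a real system).
    counts = Counter(t for doc in feedback_docs for t in doc.split())
    expansion = [t for t, _ in counts.most_common(n_terms) if t not in query]
    # 3. Expand the query; the caller would then rerun retrieval with it.
    return query + expansion

ranked = ["copper mine chile ore", "copper mine chile export", "gold mine peru"]
expanded = pseudo_feedback(["copper", "mines"], ranked)
print(expanded)  # note the drift toward "chile" from the top documents
```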
Some experiments, such as results from the Cornell SMART system published in (Buckley et al. 1995), show improvement of retrieval system performance using pseudo-relevance feedback in the context of TREC 4 experiments.
This automatic technique mostly works. Evidence suggests that it tends to work better than global analysis.[1] Through query expansion, some relevant documents missed in the initial round can be retrieved to improve the overall performance. Clearly, the effect of this method strongly relies on the quality of the selected expansion terms. It has been found to improve performance in the TREC ad hoc task. But it is not without the dangers of an automatic process. For example, if the query is about copper mines and the top several documents are all about mines in Chile, then there may be query drift in the direction of documents on Chile. In addition, if the words added to the original query are unrelated to the query topic, the quality of the retrieval is likely to be degraded, especially in Web search, where web documents often cover multiple different topics.
To improve the quality of expansion words in pseudo-relevance feedback, a positional relevance model has been proposed to select from feedback documents those words that are focused on the query topic, based on the positions of words in the feedback documents.[2] Specifically, the positional relevance model assigns more weight to words occurring closer to query words, based on the intuition that words closer to query words are more likely to be related to the query topic.

1)      XML retrieval: XML Retrieval, or XML Information Retrieval,[1] is the content-based retrieval of documents structured with XML (eXtensible Markup Language). As such it is used for computing the relevance of XML documents.
Information retrieval systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text, by which we mean "raw" text without markup. Databases are designed for querying relational data: sets of records that have values for predefined attributes such as employee number, title and salary. There are fundamental differences between information retrieval and database systems in terms of retrieval model, data structures and query language, as shown in Table 10.1. Some highly structured text search problems are most efficiently handled by a relational database, for example, if the employee table contains an attribute for short textual job descriptions and you want to find all employees who are involved with invoicing. In this case, the SQL query:

select lastname from employees where job_desc like 'invoic%';

Basic XML concepts
An XML document is an ordered, labeled tree. Each node of the tree is an
XML element and is written with an opening and closing tag. An element can
have one or more XML attributes. In the XML document in Figure 10.1, the
scene element is enclosed by the two tags <scene ...> and </scene>. It
has an attribute number with value vii and two child elements, title and verse.
<act number="I">
<scene number="vii">
<title>Macbeth’s castle</title>
<verse>Will I with wine and wassail ...</verse>
</scene>
</act>
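A fragment like the one above can be parsed into its ordered, labeled tree form with Python's standard library (the snippet is given here with closing tags so it is well-formed):

```python
import xml.etree.ElementTree as ET

xml_doc = """<act number="I">
  <scene number="vii">
    <title>Macbeth's castle</title>
    <verse>Will I with wine and wassail ...</verse>
  </scene>
</act>"""

root = ET.fromstring(xml_doc)          # the ordered, labeled tree
scene = root.find("scene")
print(scene.get("number"))             # vii -- an XML attribute
print([child.tag for child in scene])  # ['title', 'verse'] -- child elements
```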

2) Probabilistic information retrieval:

During the discussion of relevance feedback in Section 9.1.2, we observed
that if we have some known relevant and nonrelevant documents, then we
can straightforwardly start to estimate the probability of a term t appearing
in a relevant document P(t|R = 1), and that this could be the basis of a
classifier that decides whether documents are relevant or not.

3) Language models for information retrieval:

Unigram models

A unigram model used in information retrieval can be treated as the combination of several one-state finite automata.[1] It splits the probabilities of different terms in a context, e.g. from P(t1 t2 t3) = P(t1) P(t2 | t1) P(t3 | t1 t2) to P(t1) P(t2) P(t3).
In this model, the probability of hitting each word depends only on the word itself, so the units are one-state finite automata. For each automaton, there is only one way to hit its only state, assigned one probability. Viewed over the whole model, the sum of all the one-state-hitting probabilities should be 1.
The probability generated for a specific query is calculated as

P(query) = ∏ P(t), the product taken over all terms t in the query.
For different documents, we can build separate unigram models, with different hitting probabilities for the words in them, and use the probabilities from different documents to generate different hitting probabilities for a query. We can then rank documents for a query according to these generation probabilities.
In information retrieval contexts, unigram language models are often smoothed to avoid instances where P(term) = 0. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to create a smoothed document model.[2]
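A sketch of such a linearly interpolated document model (Jelinek-Mercer smoothing); the two documents and λ = 0.5 are illustrative values:

```python
from collections import Counter

# Toy two-document collection; contents invented.
docs = {"d1": "xyzzy begins a game of foo".split(),
        "d2": "foo foo bar game over".split()}
collection = [t for d in docs.values() for t in d]
cf = Counter(collection)                  # collection-wide term counts

def p_smoothed(term, doc_terms, lam=0.5):
    """Interpolate the document ML model with the collection ML model."""
    p_doc = Counter(doc_terms)[term] / len(doc_terms)
    p_coll = cf[term] / len(collection)
    return lam * p_doc + (1 - lam) * p_coll

def query_likelihood(query, doc_terms):
    score = 1.0
    for t in query.split():
        score *= p_smoothed(t, doc_terms)  # never zero if t occurs anywhere
    return score

for d in docs:
    print(d, query_likelihood("foo game", docs[d]))
```

Documents are then ranked by these query-generation probabilities; here d2 outscores d1 because "foo" occurs twice in it.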

4) Text classification:

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.
The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.
Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach.

"Content based" versus "request based" classification

Content based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a rule in much library classification that at least 20% of the content of a book should be about the class to which the book is assigned.[1] In automatic classification it could be the number of times given words appears in a document.
Request oriented classification (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks himself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230[2]).
Request oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request oriented classification as policy based classification: The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request oriented classification be regarded as a user-based approach.

Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions," he writes, "are quite meaningless and only serve to cause confusion" (Lancaster, 2003, p. 21[3]). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf., Aitchison, 1986,[4] 2004;[5] Broughton, 2008;[6] Riesthuis & Bliedung, 1991[7]). Therefore, the act of labeling a document (say, by assigning a term from a controlled vocabulary to it) is at the same time an act of assigning that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).

Automatic document classification

Automatic document classification tasks can be divided into three sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information, and semi-supervised document classification, where parts of the documents are labeled by the external mechanism.
5) Vector space classification:

Vector space classification

The document representation in Naive Bayes is a sequence of terms or a binary vector in {0,1}^|V|. In this chapter we adopt a different representation for text classification, the vector space model, developed in Chapter 6. It represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term. Thus, the document space X, the domain of the classification function γ, is R^|V|. This chapter introduces a number of classification methods that operate on real-valued vectors.
The basic hypothesis in using the vector space model for classification is the contiguity hypothesis .
Contiguity hypothesis. Documents in the same class form a contiguous region and regions of different classes do not overlap.
There are many classification tasks, in particular the type of text classification that we encountered in Chapter 13 , where classes can be distinguished by word patterns. For example, documents in the class China tend to have high values on dimensions like Chinese, Beijing, and Mao whereas documents in the class UK tend to have high values for London, British and Queen. Documents of the two classes therefore form distinct contiguous regions as shown in Figure 14.1 and we can draw boundaries that separate them and classify new documents. How exactly this is done is the topic of this chapter.
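The tf-idf document representation described above can be sketched as follows (a minimal illustration; the toy documents and terms are made up):

```python
import math

# Represent each document as a vector with one real-valued component,
# a tf-idf weight, for each vocabulary term.
docs = {
    "d1": ["chinese", "beijing", "chinese"],
    "d2": ["chinese", "shanghai"],
    "d3": ["london", "british", "queen"],
}

N = len(docs)
vocab = sorted({t for toks in docs.values() for t in toks})
# Document frequency: number of documents each term occurs in.
df = {t: sum(1 for toks in docs.values() if t in toks) for t in vocab}

def tfidf_vector(tokens):
    """One component per vocabulary term: term frequency * idf."""
    return [tokens.count(t) * math.log10(N / df[t]) for t in vocab]

vectors = {name: tfidf_vector(toks) for name, toks in docs.items()}
```

Documents in the same class then tend to have high values on the same dimensions, which is what the contiguity hypothesis exploits.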

Rocchio classification

Figure 14.1 shows three classes, China, UK and Kenya, in a two-dimensional (2D) space. Documents are shown as circles, diamonds and X's. The boundaries in the figure, which we call decision boundaries, are chosen to separate the three classes, but are otherwise arbitrary. To classify a new document, depicted as a star in the figure, we determine the region it occurs in and assign it the class of that region - China in this case. Our task in vector space classification is to devise algorithms that compute good boundaries where "good" means high classification accuracy on data unseen during training.
Figure 14.3: Rocchio classification.
Perhaps the best-known way of computing good class boundaries is Rocchio classification, which uses centroids to define the boundaries. The centroid of a class c is computed as the vector average or center of mass of its members:

    μ(c) = (1 / |D_c|) Σ_{d ∈ D_c} v(d)

where D_c is the set of documents in D whose class is c and v(d) is the vector space representation of d.
Figure 14.1: Vector space classification into three classes.
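A minimal sketch of Rocchio classification under these definitions (the 2D training vectors and class names are made up for illustration):

```python
import numpy as np

# Rocchio classification: the centroid of each class is the vector average
# of its members; a new document gets the class of the nearest centroid.
train = {
    "China": np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.1]]),
    "UK":    np.array([[0.1, 0.9], [0.2, 0.8], [0.1, 0.7]]),
}

# Centroid = center of mass of the class members.
centroids = {c: vecs.mean(axis=0) for c, vecs in train.items()}

def rocchio_classify(doc_vec):
    # Assign the class whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda c: np.linalg.norm(doc_vec - centroids[c]))
```

The decision boundaries this induces are the hyperplanes equidistant from pairs of centroids.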

       UNIT IV
1) Support vector machines and machine learning on documents:

Improving classifier effectiveness has been an area of intensive machine-learning research over the last two decades, and this work has led to a new generation of state-of-the-art classifiers, such as support vector machines, boosted decision trees, regularized logistic regression, neural networks, and random forests. Many of these methods, including support vector machines (SVMs), the main topic of this chapter, have been applied with success to information retrieval problems, particularly text classification. An SVM is a kind of large-margin classifier: it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
We will initially motivate and develop SVMs for the case of two-class data sets that are separable by a linear classifier (Section 15.1 ), and then extend the model in Section 15.2 to non-separable data, multi-class problems, and nonlinear models, and also present some additional discussion of SVM performance. The chapter then moves to consider the practical deployment of text classifiers in Section 15.3 : what sorts of classifiers are appropriate when, and how can you exploit domain-specific text features in classification? Finally, we will consider how the machine learning technology that we have been building for text classification can be applied back to the problem of learning how to rank documents in ad hoc retrieval (Section 15.4 ). While several machine learning methods have been applied to this task, use of SVMs has been prominent. Support vector machines are not necessarily better than other machine learning methods (except perhaps in situations with little training data), but they perform at the state-of-the-art level and have much current theoretical and empirical appeal.
Soft margin classification
For the very high dimensional problems common in text classification, sometimes the data are linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.
Figure 15.5:Large margin classification with slack variables.

Multiclass SVMs

SVMs are inherently two-class classifiers. The traditional way to do multiclass classification with SVMs is to use one of the methods discussed in Section 14.5. In particular, the most common technique in practice has been to build |C| one-versus-rest classifiers (commonly referred to as "one-versus-all" or OVA classification), and to choose the class which classifies the test datum with greatest margin. Another strategy is to build a set of |C|(|C|-1)/2 one-versus-one classifiers, and to choose the class that is selected by the most classifiers. While this involves building many more classifiers, the time for training each classifier may actually decrease, since the training data set for each classifier is much smaller.
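The OVA scheme can be sketched independently of the underlying binary learner. Below, a simple perceptron stands in for the binary linear classifier (an illustrative substitute, not an SVM); the combination rule, one trained classifier per class with the highest-scoring class winning, is the same:

```python
import numpy as np

# One-versus-rest (OVA) sketch. The perceptron here is a stand-in for any
# binary linear classifier such as an SVM: train one classifier per class,
# then choose the class whose classifier scores the test datum highest.
def train_binary(X, y, epochs=500):
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):               # yi is +1 or -1
            if yi * (w @ xi) <= 0:              # mistake: update weights
                w += yi * xi
    return w

def train_ova(X, labels):
    labels = np.array(labels)
    # One binary problem per class: this class (+1) versus the rest (-1).
    return {c: train_binary(X, np.where(labels == c, 1, -1))
            for c in sorted(set(labels))}

def classify_ova(models, x):
    xb = np.append(x, 1.0)
    return max(models, key=lambda c: models[c] @ xb)  # greatest score wins

# Three made-up, well-separated 2D classes.
X = np.array([[0., 0.], [1., 0.], [0., 1.],
              [5., 0.], [6., 0.], [5., 1.],
              [0., 5.], [0., 6.], [1., 5.]])
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
models = train_ova(X, labels)
```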

Nonlinear SVMs
Figure 15.6: Projecting data that is not linearly separable into a higher dimensional space can make it linearly separable.

2) Flat clustering:

Flat clustering

Clustering algorithms group a set of documents into subsets or clusters . The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.
Figure 16.1: An example of a data set with a clear cluster structure.
Clustering is the most common form of unsupervised learning . No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1 . It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.
The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13 , page 13.1 ): our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.
The key input to a clustering algorithm is the distance measure. In Figure 16.1 , the distance measure is distance in the 2D plane. This measure suggests three different clusters in the figure. In document clustering, the distance measure is often also Euclidean distance. Different distance measures give rise to different clusterings. Thus, the distance measure is an important means by which we can influence the outcome of clustering.
Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters and will be covered in Chapter 17 . Chapter 17 also addresses the difficult problem of labeling clusters automatically.
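The most common flat clustering algorithm, K-means, can be sketched as follows (illustrative, using Euclidean distance as the distance measure):

```python
import numpy as np

# K-means flat clustering: alternate between assigning each point to its
# nearest centroid and recomputing each centroid as its cluster's mean.
def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster.
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign, centroids

# Two obvious clusters in the 2D plane (made-up data).
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
assign, centroids = kmeans(X, 2)
```

Note that the result depends on the (random) initialization and on the prespecified k, two of the drawbacks that motivate hierarchical clustering in Chapter 17.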
3) Hierarchical clustering:
Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering. Hierarchical clustering does not require us to prespecify the number of clusters and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM.

Hierarchical agglomerative clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC . Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached. See Section 17.6 . HAC is more frequently used in IR than top-down clustering and is the main subject of this chapter.
Figure 17.2: A simple, but inefficient HAC algorithm.
A simple, naive HAC algorithm is shown in Figure 17.2. We first compute the N x N similarity matrix C. The algorithm then executes N - 1 steps of merging the currently most similar clusters. In each iteration, the two most similar clusters are merged and the rows and columns of the merged cluster i in C are updated. The clustering is stored as a list of merges in A. I indicates which clusters are still available to be merged. The function SIM(i, m, j) computes the similarity of cluster j with the merge of clusters i and m. For some HAC algorithms, SIM(i, m, j) is simply a function of SIM(i, j) and SIM(m, j), for example, the maximum of these two values for single-link.
We will now refine this algorithm for the different similarity measures of single-link and complete-link clustering (Section 17.2) and group-average and centroid clustering (Sections 17.3 and 17.4). The merge criteria of these four variants of HAC are shown in Figure 17.3.
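The naive HAC algorithm can be sketched for the single-link case as follows (illustrative and inefficient, in the spirit of Figure 17.2's simple algorithm):

```python
# Naive HAC (single-link): start with singleton clusters, repeatedly merge
# the two most similar clusters, and record the merges.
def single_link_hac(sim):
    """sim: N x N similarity matrix (higher = more similar)."""
    n = len(sim)
    clusters = {i: {i} for i in range(n)}   # active clusters
    merges = []
    while len(clusters) > 1:
        # Single-link similarity of two clusters: the similarity of their
        # two most similar members. Find the best pair of active clusters.
        best, pair = float("-inf"), None
        ids = sorted(clusters)
        for a in ids:
            for b in ids:
                if a < b:
                    s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                    if s > best:
                        best, pair = s, (a, b)
        a, b = pair
        clusters[a] |= clusters.pop(b)       # merge b into a
        merges.append((a, b, best))          # record the merge and its similarity
    return merges

# Three items where 0 and 1 are most similar (made-up similarities).
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
merges = single_link_hac(sim)
```

The list of merges is exactly the information needed to draw a dendrogram of the hierarchy.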
5) Matrix decompositions and latent semantic indexing:

In Section 6.3.1 we introduced the notion of a term-document matrix: an M x N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection. Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns. In Section 18.1.1 we first develop a class of operations from linear algebra, known as matrix decomposition. In Section 18.2 we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. While latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval, it remains an intriguing approach to clustering in a number of domains including for collections of text documents (Section 16.6). Understanding its full potential remains an area of active research.

Matrix decompositions

In this section we examine ways in which a square matrix can be factored into the product of matrices derived from its eigenvectors; we refer to this process as matrix decomposition . Matrix decompositions similar to the ones in this section will form the basis of our principal text-analysis technique in Section 18.3 , where we will look at decompositions of non-square term-document matrices. The square decompositions in this section are simpler and can be treated with sufficient mathematical rigor to help the reader understand how such decompositions work. The detailed mathematical derivation of the more complex decompositions in Section 18.2 are beyond the scope of this book.
We begin by giving two theorems on the decomposition of a square matrix into the product of three matrices of a special form. The first gives the basic factorization of a square real-valued matrix into three factors. The second applies to square symmetric matrices and is the basis of the singular value decomposition described in Section 18.2.
Theorem. (Matrix diagonalization theorem) Let S be a square real-valued M x M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition

    S = U Λ U^{-1}

where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order:

    Λ = diag(λ1, λ2, ..., λM),   λi ≥ λ(i+1)

If the eigenvalues are distinct, then this decomposition is unique. End theorem.
To understand how the matrix diagonalization theorem works, we note that U has the eigenvectors of S as columns:

    U = (u1  u2  ...  uM)

Then we have

    SU = S(u1  u2  ...  uM)
       = (λ1 u1  λ2 u2  ...  λM uM)
       = (u1  u2  ...  uM) diag(λ1, λ2, ..., λM)

Thus, we have SU = UΛ, or S = UΛU^{-1}.
We next state a closely related decomposition of a symmetric square matrix into the product of matrices derived from its eigenvectors. This will pave the way for the development of our main tool for text analysis, the singular value decomposition (Section 18.2 ).
Theorem. (Symmetric diagonalization theorem) Let S be a square, symmetric real-valued M x M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

    S = Q Λ Q^T

where the columns of Q are the orthogonal and normalized (unit length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. Further, all entries of Q are real and we have Q^{-1} = Q^T. End theorem.
We will build on this symmetric diagonal decomposition to build low-rank approximations to term-document matrices.
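The symmetric diagonalization theorem is easy to check numerically (numpy's eigh returns orthonormal eigenvectors for a symmetric matrix):

```python
import numpy as np

# Numerical check of the symmetric diagonalization theorem: for a symmetric
# real matrix S, eigh returns eigenvalues lam and orthonormal eigenvectors Q
# with S = Q diag(lam) Q^T and Q^{-1} = Q^T.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, Q = np.linalg.eigh(S)

reconstructed = Q @ np.diag(lam) @ Q.T
```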

Latent semantic indexing

We now discuss the approximation of a term-document matrix   by one of lower rank using the SVD. The low-rank approximation to   yields a new representation for each document in the collection. We will cast queries into this low-rank representation as well, enabling us to compute query-document similarity scores in this low-rank representation. This process is known as latent semantic indexing (generally abbreviated LSI).
But first, we motivate such an approximation. Recall the vector space representation of documents and queries introduced in Section 6.3. This vector space representation enjoys a number of advantages including the uniform treatment of queries and documents as vectors, the induced score computation based on cosine similarity, the ability to weight different terms differently, and its extension beyond document retrieval to such applications as clustering and classification. The vector space representation suffers, however, from its inability to cope with two classic problems arising in natural languages: synonymy and polysemy. Synonymy refers to a case where two different words (say car and automobile) have the same meaning. The vector space representation fails to capture the relationship between synonymous terms such as car and automobile, according each a separate dimension in the vector space. Consequently the computed similarity between a query q (say, car) and a document d containing both car and automobile underestimates the true similarity that a user would perceive. Polysemy on the other hand refers to the case where a term such as charge has multiple meanings, so that the computed similarity overestimates the similarity that a user would perceive. Could we use the co-occurrences of terms (whether, for instance, charge occurs in a document containing steed versus in a document containing electron) to capture the latent semantic associations of terms and alleviate these problems?
Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns, and a rank in the tens of thousands as well. In latent semantic indexing (sometimes referred to as latent semantic analysis (LSA)), we use the SVD to construct a low-rank approximation C_k to the term-document matrix, for a value of k that is far smaller than the original rank of C. In the experimental work cited later in this section, k is generally chosen to be in the low hundreds. We thus map each row/column (respectively corresponding to a term/document) to a k-dimensional space; this space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of CC^T and C^TC. Note that the matrix C_k is itself still an M x N matrix, irrespective of k.
Next, we use the new k-dimensional LSI representation as we did the original representation - to compute similarities between vectors. A query vector q is mapped into its representation in the LSI space by the transformation

    q_k = Σ_k^{-1} U_k^T q     (244)
Now, we may use cosine similarities as in Section 6.3.1 to compute the similarity between a query and a document, between two documents, or between two terms. Note especially that Equation 244 does not in any way depend on q being a query; it is simply a vector in the space of terms. This means that if we have an LSI representation of a collection of documents, a new document not in the collection can be "folded in" to this representation using Equation 244. This allows us to incrementally add documents to an LSI representation. Of course, such incremental addition fails to capture the co-occurrences of the newly added documents (and even ignores any new terms they contain). As such, the quality of the LSI representation will degrade as more documents are added and will eventually require a recomputation of the LSI representation.
The fidelity of the approximation of C_k to C leads us to hope that the relative values of cosine similarities are preserved: if a query is close to a document in the original space, it remains relatively close in the k-dimensional space. But this in itself is not sufficiently interesting, especially given that the sparse query vector q turns into a dense query vector q_k in the low-dimensional space. This has a significant computational cost, when compared with the cost of processing q in its native form.
Worked example. Consider the term-document matrix

             d1  d2  d3  d4  d5  d6
    ship      1   0   1   0   0   0
    boat      0   1   0   0   0   0
    ocean     1   1   0   0   0   0
    voyage    1   0   0   1   1   0
    trip      0   0   0   1   0   1
Its singular value decomposition is the product of three matrices, C = U Σ V^T. When applying the SVD to a term-document matrix, U is known as the SVD term matrix. The singular values are the diagonal entries of Σ, arranged in decreasing order.
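The LSI pipeline, truncating the SVD to rank k and folding a query in via Equation 244, can be sketched in numpy (the small 0/1 term-document matrix below is made up for illustration, not the worked example's):

```python
import numpy as np

# LSI sketch: truncate the SVD of a term-document matrix C to rank k, then
# fold a query vector into the k-dimensional space: q_k = Sigma_k^{-1} U_k^T q.
C = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.]])   # terms x documents (illustrative)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

Ck = Uk @ Sk @ Vtk                # rank-k approximation of C

def fold_in(q):
    # Map a (term-space) vector into the k-dimensional LSI space.
    return np.linalg.inv(Sk) @ Uk.T @ q

q = np.array([1., 0., 1., 0.])    # query containing terms 1 and 3
q_k = fold_in(q)
```

Cosine similarities between q_k and the columns of Sk @ Vtk would then serve as query-document scores in the LSI space.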

UNIT V
1) Web search basics:

The search user experience
It is crucial that we understand the users of web search as well. This is
again a significant change from traditional information retrieval, where users
were typically professionals with at least some training in the art of phrasing
queries over a well-authored collection whose style and structure they understood.
Index size and estimation
To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes - some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages; for instance, a request for a non-existent URL at Yahoo! returns a valid HTML page rather than an error, politely informing the user that there is no such page. Such a "soft 404 error" is only one example of many ways in which web servers can generate an infinite number of valid web pages.

2) Web crawling and indexes:
Web crawling is the process by which we gather pages from the Web, in
order to index them and support a search engine. The objective of crawling
is to quickly and efficiently gather as many useful web pages as possible,
together with the link structure that interconnects them.

Features a crawler should provide
Distributed: The crawler should have the ability to execute in a distributed
fashion across multiple machines.
Scalable: The crawler architecture should permit scaling up the crawl rate
by adding extra machines and bandwidth.
Performance and efficiency: The crawl system should make efficient use of
various system resources including processor, storage and network bandwidth.
Quality: Given that a significant fraction of all web pages are of poor utility
for serving user query needs, the crawler should be biased towards
fetching “useful” pages first.
Freshness: In many applications, the crawler should operate in continuous
mode: it should obtain fresh copies of previously fetched pages. A search
engine crawler, for instance, can thus ensure that the search engine’s index
contains a fairly current representation of each indexed web page. For
such continuous crawling, a crawler should be able to crawl a page with
a frequency that approximates the rate of change of that page.

Crawler architecture
The simple scheme outlined above for crawling demands several modules
that fit together as shown in Figure 20.1.
1. The URL frontier, containing URLs yet to be fetched in the current crawl
(in the case of continuous crawling, a URL may have been fetched previously
but is back in the frontier for re-fetching). We describe this further
in Section 20.2.3.
2. A DNS resolution module that determines the web server from which to
fetch the page specified by a URL.We describe this further in Section 20.2.2.
3. A fetch module that uses the http protocol to retrieve the web page at a
given URL.
4. A parsing module that extracts the text and set of links from a fetched web
page.
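The URL frontier of step 1 can be sketched as a simple data structure (a minimal illustration; a production frontier adds politeness and prioritization, Section 20.2.3):

```python
from collections import deque

# Minimal URL frontier: a FIFO queue of URLs yet to be fetched, with a
# "seen" set so each URL enters the frontier at most once.
class URLFrontier:
    def __init__(self, seeds):
        self.queue = deque()
        self.seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        # Extracted links go through here; duplicates are silently dropped.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        # The fetch module asks for the next URL to retrieve.
        return self.queue.popleft() if self.queue else None

frontier = URLFrontier(["http://a.example/", "http://b.example/"])
frontier.add("http://a.example/")   # duplicate: ignored
```

For continuous crawling, previously fetched URLs would be re-inserted with a priority reflecting each page's rate of change.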

3) Link analysis:
The analysis of hyperlinks and the graph structure of the Web has been instrumental
in the development of web search. In this chapter we focus on the
use of hyperlinks for ranking web search results. Such link analysis is one
of many factors considered by web search engines in computing a composite
score for a web page on any given query. We begin by reviewing some
basics of the Web as a graph in Section 21.1, then proceed to the technical
development of the elements of link analysis for ranking.
Link analysis for web search has intellectual antecedents in the field of citation
analysis, aspects of which overlap with an area known as bibliometrics.
These disciplines seek to quantify the influence of scholarly articles by analyzing
the pattern of citations amongst them. Much as citations represent the
conferral of authority from a scholarly article to others, link analysis on the
Web treats hyperlinks from a web page to another as a conferral of authority.
Clearly, not every citation or hyperlink implies such authority conferral; for
this reason, simply measuring the quality of a web page by the number of
in-links (citations from other pages) is not robust enough. For instance, one
may contrive to set up multiple web pages pointing to a target web page,
with the intent of artificially boosting the latter’s tally of in-links. This phenomenon
is referred to as link spam. Nevertheless, the phenomenon of citation
is prevalent and dependable enough that it is feasible for web search
engines to derive useful signals for ranking from more sophisticated link
analysis. Link analysis also proves to be a useful indicator of what page(s)
to crawl next while crawling the web; this is done by using link analysis to
guide the priority assignment in the front queues of Chapter 20.

The Web as a graph
Recall the notion of the web graph from Section 19.2.1 and particularly Figure
19.2. Our study of link analysis builds on two intuitions:
1. The anchor text pointing to page B is a good description of page B.
2. The hyperlink from A to B represents an endorsement of page B, by the
creator of page A. This is not always the case; for instance, many links
amongst pages within a single website stem from the user of a common
template. For instance, most corporate websites have a pointer from every
page to a page containing a copyright notice – this is clearly not an
endorsement. Accordingly, implementations of link analysis algorithms
will typically discount such "internal" links.
21.1.1 Anchor text and the web graph
The following fragment of HTML code from a web page shows a hyperlink
pointing to the home page of the Journal of the ACM:
<a href="">Journal of the ACM.</a>
We now focus on scoring and ranking measures derived from the link structure
alone. Our first technique for link analysis assigns to every node in the web
graph a numerical score between 0 and 1, known as its PageRank.
Markov chains
A Markov chain is a discrete-time stochastic process: a process that occurs in
a series of time-steps in each of which a random choice is made. A Markov
chain consists of N states. Each web page will correspond to a state in the
Markov chain we will formulate.
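This Markov chain view can be sketched with power iteration (the 3-page link structure and the teleport probability are made up for illustration):

```python
import numpy as np

# Each page is a state of the Markov chain. With probability 1-d the chain
# follows a random out-link of the current page; with probability d it
# teleports to a page chosen uniformly at random. Repeatedly multiplying a
# distribution by the transition matrix converges to the steady state,
# which is the PageRank vector.
links = {0: [1, 2], 1: [2], 2: [0]}      # page -> pages it links to
N, d = 3, 0.15                           # d: teleport probability

P = np.zeros((N, N))                     # row-stochastic transition matrix
for page, outs in links.items():
    for out in outs:
        P[page][out] = (1 - d) / len(outs)
    P[page] += d / N                     # spread the teleportation mass

x = np.full(N, 1.0 / N)                  # start from the uniform distribution
for _ in range(100):
    x = x @ P                            # one step of the Markov chain
```

Here page 2, which is linked to by both other pages, ends up with the highest score, matching the intuition that in-links confer authority.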
