ADVANCED DATA MINING UNIT 4



1) Web and Text Mining: Introduction
Web mining - is the application of data mining techniques to discover patterns from the Web. According to analysis targets, web mining can be divided into three different types, which are Web usage mining, Web content mining and Web structure mining.
Web usage mining
Web usage mining is the process of extracting useful information from server logs, i.e. finding out what users are looking for on the Internet. Some users might be looking at only textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further depending on the kind of usage data considered:
·         Web Server Data: The user logs are collected by the Web server. Typical data includes IP address, page reference and access time.
·         Application Server Data: Commercial application servers have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.
·         Application Level Data: New kinds of events can be defined in an application, and logging can be turned on for them thus generating histories of these specially defined events. It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the categories above.
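The Web server logs described above are the raw material of usage mining. As a minimal sketch, the snippet below parses one line of the widely used Common Log Format into the fields mentioned in the notes (IP address, page reference, access time); the exact log format is an assumption, since real servers can be configured differently.

```python
import re

# Common Log Format (an assumption; real server log formats vary):
# host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d+) \S+'
)

def parse_log_line(line):
    """Extract the IP address, page reference and access time from one log line."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {'ip': m.group('ip'), 'page': m.group('page'), 'time': m.group('time')}

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(line)
```

Records parsed this way can then be grouped by IP or session before pattern discovery is applied.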

2) Web content mining

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery and organization of Web-accessible information difficult. Search and indexing tools such as Lycos, AltaVista, WebCrawler, ALIWEB [6], MetaCrawler, and others provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.

3) Web structure mining

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:
1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location.
2. Mining the document structure: analysis of the tree-like tag structure of a page to describe HTML or XML usage.
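The best-known use of graph theory on hyperlink structure is link analysis such as PageRank. The sketch below, under the simplifying assumption that the site is represented as a dictionary mapping each page to its outgoing links, runs a basic power iteration; it is an illustration of the idea, not any search engine's actual algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simple power-iteration PageRank over a dict {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += damping * share
            else:
                # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# hypothetical three-page site: every page links back to "home",
# so "home" should receive the highest rank
graph = {'home': ['about'], 'about': ['home', 'contact'], 'contact': ['home']}
ranks = pagerank(graph)
```

Pages that many other pages point to accumulate rank, which is exactly the "node and connection" analysis the definition above describes.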


4) Text mining:
Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.
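The "turn text into data" step above can be sketched very simply: tokenize the raw text, compute a word frequency distribution, and emit a structured record suitable for a database or a mining algorithm. The feature choices here (token count, vocabulary size, top terms) are illustrative assumptions, not a standard schema.

```python
import re
from collections import Counter

def structure_text(doc):
    """Turn raw text into a structured record: tokens, frequencies, simple features."""
    tokens = re.findall(r'[a-z]+', doc.lower())   # crude lexical analysis
    freq = Counter(tokens)                        # word frequency distribution
    return {'n_tokens': len(tokens),
            'vocabulary': len(freq),
            'top_terms': freq.most_common(3)}

record = structure_text(
    "Text mining turns text into data; the data is then mined for patterns in the text.")
```

Real pipelines add parsing, stop-word removal and derived linguistic features, but the shape of the output (structured fields derived from unstructured input) is the same.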
5) Unstructured text:

About Unstructured Text

Data mining algorithms act on data that is numerical or categorical. Numerical data is ordered. It is stored in columns that have a numeric data type, such as NUMBER or FLOAT. Categorical data is identified by category or classification. It is stored in columns that have a character data type, such as VARCHAR2 or CHAR.
Unstructured text data is neither numerical nor categorical. Unstructured text includes items such as web pages, document libraries, PowerPoint presentations, product specifications, emails, comment fields in reports, and call center notes. It has been said that unstructured text accounts for more than three quarters of all enterprise data. Extracting meaningful information from unstructured text can be critical to the success of a business.

About Text Mining and Oracle Text

Text mining is the process of applying data mining techniques to text terms, also called text features or tokens. Text terms are words or groups of words that have been extracted from text documents and assigned numeric weights. Text terms are the fundamental unit of text that can be manipulated and analyzed.
Oracle Text is a Database technology that provides term extraction, word and theme searching, and other utilities for querying text. When columns of text are present in the training data, Oracle Data Mining uses Oracle Text utilities and term weighting strategies to transform the text for mining. Oracle Data Mining passes configuration information supplied by the user to Oracle Text and uses the results in the model creation process.
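Oracle's internal weighting strategy is not shown here, but TF-IDF is a common scheme for the step the paragraph describes: turning extracted text terms into numeric weights. The sketch below is a generic illustration under that assumption, with naive whitespace tokenization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Assign each term in each document a TF-IDF weight (a common weighting
    scheme; not necessarily the one Oracle Data Mining uses)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (tf[t] / len(tokens)) * math.log(n / df[t]) for t in tf})
    return weights

docs = ["oracle text extracts terms",
        "terms become weighted features",
        "oracle mines weighted terms"]
w = tfidf(docs)
```

Note how a term that occurs in every document ("terms") gets weight zero, while rarer terms get positive weights; this is the sense in which terms become "the fundamental unit of text that can be manipulated and analyzed."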
6) Episode rule discovery for text:
Discovering association rules is an important data mining problem. The problem was first defined in the context of market basket data to identify customers' buying habits [1].

Our overall goal is to analyze event sequences, discover recurrent patterns of events, and generate sequential association rules. Our approach is based on the concept of representative association rules combined with event constraints.

A sequential dataset is normalized and then discretized by forming subsequences using a sliding window [5]. Using a sliding window of size w, every normalized time stamp value x_t is used to compute each of the new sequence values y_(t - w/2) to y_(t + w/2). Thus, the dataset has been divided into segments, each of size w. The discretized version of the time series is obtained by using some clustering algorithm and a suitable similarity measure. Each cluster identifier is an event type, and the set of cluster labels is the class of events E.
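The normalize → window → discretize pipeline above can be sketched as follows. For simplicity the clustering step is replaced here by binning each window on its mean value, which is only a stand-in for a real clustering algorithm with a similarity measure; the window size and bin count are illustrative assumptions.

```python
def normalize(series):
    """Scale a time series into [0, 1] (one possible normalization)."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def windows(series, w):
    """Form overlapping subsequences of size w with a sliding window."""
    return [tuple(series[i:i + w]) for i in range(len(series) - w + 1)]

def discretize(subseqs, k=3):
    """Stand-in for the clustering step: bin each window by its mean value
    into k levels; each bin label plays the role of an event type."""
    labels = []
    for s in subseqs:
        mean = sum(s) / len(s)
        labels.append(min(int(mean * k), k - 1))
    return labels

series = [1, 2, 3, 10, 11, 12, 2, 1, 2]
events = discretize(windows(normalize(series), 3))
```

The resulting label sequence is the discretized time series; the set of distinct labels corresponds to the class of events E.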

Representative Episodal Association Rules

We use the set of frequent closed episodes FCE produced by the Gen-FCE algorithm to generate the representative episodal association rules that cover the entire set of association rules [4].

The cover of a rule r : X ⇒ Y, denoted by C(r), is the set of association rules that can be generated from r. That is, C(r : X ⇒ Y) = { X ∪ U ⇒ V | U, V ⊆ Y, U ∩ V = ∅, and V ≠ ∅ }. An important property of the cover operator stated in [4] is that if an association rule r has support s and confidence c, then every rule r′ ∈ C(r) has support at least s and confidence at least c.
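The cover definition above can be made concrete by enumerating it directly. The sketch below generates C(r : X ⇒ Y) for a small rule, representing each rule as a pair of frozensets (antecedent, consequent); for |Y| = m the cover contains 3^m − 2^m rules.

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, including the empty set, as frozensets."""
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def cover(X, Y):
    """Enumerate C(r : X => Y) = { X u U => V | U, V subsets of Y,
    U and V disjoint, V nonempty }."""
    X, Y = frozenset(X), frozenset(Y)
    rules = set()
    for U in subsets(Y):
        for V in subsets(Y):
            if V and not (U & V):          # V != empty set, U and V disjoint
                rules.add((X | U, V))
    return rules

# cover of the rule a => bc: yields a=>b, a=>c, a=>bc, ab=>c, ac=>b
rules = cover({'a'}, {'b', 'c'})
```

Because every rule in the cover holds with at least the support and confidence of r, it suffices to mine a small set of representative rules and derive the rest.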


7) Hierarchy of categories:
To model a product category hierarchy, this solution keeps each category in its own document that also has a list of its ancestors or “parents.” This document uses music genres as the basis of its examples:
[Figure: initial category hierarchy]
Because these kinds of categories change infrequently, this model focuses on the operations needed to keep the hierarchy up-to-date rather than the performance profile of update operations.

Schema

This schema has the following properties:
  • A single document represents each category in the hierarchy.
  • An ObjectId identifies each category document for internal cross-referencing.
  • Each category document has a human-readable name and a URL compatible slug field.
  • The schema stores a list of ancestors for each category to facilitate displaying a category and its ancestors using only a single query.
Consider the following prototype:
{
  "_id" : ObjectId("4f5ec858eb03303a11000002"),
  "name" : "Modal Jazz",
  "parent" : ObjectId("4f5ec858eb03303a11000001"),
  "slug" : "modal-jazz",
  "ancestors" : [
    { "_id" : ObjectId("4f5ec858eb03303a11000001"),
      "slug" : "bop",
      "name" : "Bop" },
    { "_id" : ObjectId("4f5ec858eb03303a11000000"),
      "slug" : "ragtime",
      "name" : "Ragtime" } ]
}

Operations

Read and Display a Category

Querying

Use the following operation to read and display a single category. The query uses the slug field to return the category information and a "bread crumb" trail from the current category to the top-level category.
category = db.categories.find_one(
    {'slug': slug},
    {'_id': 0, 'name': 1, 'ancestors.slug': 1, 'ancestors.name': 1})

Indexing

Create a unique index on the slug field with the following operation on the Python/PyMongo console:
>>> db.categories.ensure_index('slug', unique=True)
8) Text clustering:
Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.
Clustering in search engines
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is done by enterprise search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, or open source software such as Carrot2.
Examples:
·         Clustering divides the results of a search for "cell" into groups like "biology," "battery," and "prison."
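The "cell" example can be sketched with a toy greedy clusterer: represent each result snippet as a term-frequency vector and put it in the first cluster whose seed it resembles under cosine similarity. The snippets, threshold, and single-pass strategy are illustrative assumptions, far simpler than what Carrot2 or a commercial engine does.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(snippets, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose seed
    is similar enough, otherwise start a new cluster."""
    vectors = [Counter(s.lower().split()) for s in snippets]
    clusters = []   # each cluster: (seed vector, [snippet indexes])
    for i, v in enumerate(vectors):
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]

results = ["cell biology and the living cell",
           "battery cell voltage and charge",
           "the living cell in biology",
           "prison cell block"]
groups = cluster(results)
```

Even this crude similarity measure separates the biology, battery, and prison senses of "cell", which is the intuition behind search-result clustering.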

