UNIT I
1): Data mining tasks:
Data mining tasks describe the kinds of patterns that can be mined. Based on the kind of patterns to be mined, data mining functions fall into two broad categories, listed below:
· Descriptive
· Classification and Prediction
Descriptive
Descriptive functions deal with the general properties of data in the database. Here is the list of descriptive functions:
· Class/Concept Description
· Mining of Frequent Patterns
· Mining of Associations
· Mining of Correlations
· Mining of Clusters
Class/Concept Description
Class/concept refers to the data being associated with classes or concepts. For example, in a company the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:
· Data Characterization - This refers to summarizing the data of the class under study. This class under study is called the Target Class.
· Data Discrimination - This refers to the comparison of the target class with some predefined group or class.
2) Mining Frequent Patterns:
Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns:
· Frequent Item Set - This refers to a set of items that frequently appear together, for example milk and bread.
· Frequent Subsequence - A sequence of patterns that occur frequently, such as purchasing a camera being followed by purchasing a memory card.
· Frequent Sub Structure - Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
3) Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased together. This refers to the process of uncovering relationships among data and determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to analyze whether they have a positive, negative or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.
4): Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of a set of training data. The derived model can be presented in the following forms:
· Classification (IF-THEN) Rules
· Decision Trees
· Mathematical Formulae
· Neural Networks
Here is the list of functions involved in this:
· Classification - It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are well known.
· Prediction - It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
· Outlier Analysis - Outliers may be defined as data objects that do not comply with the general behaviour or model of the available data.
· Evolution Analysis - Evolution analysis refers to the description and modeling of regularities or trends for objects whose behaviour changes over time.
5): Cluster Analysis
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering is the process of grouping abstract objects into classes of similar objects.
Points to Remember
· A cluster of data objects can be treated as one group.
· While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
· The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Applications of Cluster Analysis
· Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
· Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns.
· In the field of biology it can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in populations.
· Clustering also helps in the identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value and geographic location.
· Clustering also helps in classifying documents on the web for information discovery.
· Clustering is also used in outlier detection applications such as detection of credit card fraud.
Clustering Methods
The clustering methods can be classified into the following categories:
- Partitioning Method
- Hierarchical Method
- Density-based Method
- Grid-Based Method
- Model-Based Method
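As a small illustration of the partitioning method, the sketch below runs k-means on a tiny synthetic dataset; it assumes scikit-learn is installed, and the data and number of clusters are illustrative only.

# A minimal sketch of partitioning-based clustering (k-means), assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])   # illustrative objects

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster label assigned to each object
print(kmeans.cluster_centers_)   # centroid of each cluster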
6) Outlier Analysis:
An outlier is a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
· Example: an unusual credit card purchase; in sports, Michael Jordan or Wayne Gretzky.
· Outliers are different from noise data: noise is random error or variance in a measured variable, and noise should be removed before outlier detection.
· Outliers are interesting: they violate the mechanism that generates the normal data.
· Outlier detection vs. novelty detection: at an early stage a new pattern is an outlier, but later it may be merged into the model.
· Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis.
There are 3 types of outliers:
-> Global outlier
-> Contextual outlier
-> Collective outlier
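A minimal sketch of detecting a global outlier with a simple z-score rule; the data (toy purchase amounts) and the cut-off value are illustrative assumptions, not a fixed standard.

# Flag values whose z-score magnitude exceeds an illustrative threshold.
import numpy as np

amounts = np.array([25, 30, 28, 27, 2500, 31, 29])   # e.g. credit card purchase amounts
z = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z) > 2.0]                   # 2.0 is an illustrative cut-off
print(outliers)                                       # prints the unusual purchase (2500)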
UNIT II
1): Classification by back propagation:
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses it to update the weights in an attempt to minimize the loss function.
The backpropagation learning algorithm can be divided into two phases: propagation and weight update.
Each propagation involves the following steps:
1. Forward propagation of a training pattern's input through the neural network in order to generate the propagation's output activations.
2. Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas of all output and hidden neurons.
For each weight-synapse, follow these steps:
1. Multiply its output delta and input activation to get the gradient of the weight.
2. Subtract a ratio (percentage) of the gradient from the weight.
This ratio (percentage) influences the speed and quality of learning; it is called the learning rate. The greater the ratio, the faster the neuron trains; the lower the ratio, the more accurate the training is. The sign of the gradient of a weight indicates where the error is increasing; this is why the weight must be updated in the opposite direction.
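A minimal numpy sketch of the two phases described above (forward propagation, then backward propagation of deltas and weight update). The network size, data and learning rate are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((4, 3))                   # 4 training patterns, 3 inputs (illustrative)
t = np.array([[0.], [1.], [1.], [0.]])   # illustrative target outputs

W1 = rng.normal(size=(3, 5))             # input -> hidden weights
W2 = rng.normal(size=(5, 1))             # hidden -> output weights
lr = 0.5                                 # learning rate (the "ratio" in the text)

for epoch in range(1000):
    # Phase 1: forward propagation of the input to get output activations
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)

    # Phase 2: backward propagation of the error to get the deltas
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # Weight update: gradient = input activation x output delta,
    # and a fraction (the learning rate) of the gradient is subtracted
    W2 -= lr * h.T @ delta_out
    W1 -= lr * X.T @ delta_hid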
2): Support vector machines:
In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
Given a set of training examples, each marked
as belonging to one of two categories, an SVM training algorithm builds a model
that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points
in space, mapped so that the examples of the separate categories are divided by
a clear gap that is as wide as possible. New examples are then mapped into that
same space and predicted to belong to a category based on which side of the gap
they fall on.
More formally, a support vector machine
constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used
for classification, regression, or other tasks.
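A minimal sketch of training an SVM classifier on two toy categories, assuming scikit-learn is available; the data and parameter values are illustrative.

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [4, 3], [3, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]              # each example marked as one of two categories

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # new examples assigned to a category
print(clf.support_vectors_)                    # the examples that define the margin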
3) Classification using frequent patterns:
The classification model is built in the feature space of single features as well as frequent patterns, i.e. the data is mapped to a higher-dimensional space.
• Feature combination can capture more
underlying semantics than
single features.
• Example: word phrases can improve the
accuracy of document
classification.
• FP-based classification been applied to many
problems:
– Graph classification
– Time series classification
– Protein classification
– Text classification
4) Other classification methods:
Tree-based frequent patterns
· [Fan et al. 2008] proposed building a decision tree in the space of frequent patterns as an alternative to the two-phase approach of [Cheng et al. 2007].
· Arguments against the two-phase approach:
1. The number of frequent patterns can be too large for effective feature selection in the second phase.
2. It is difficult to set the optimal min_sup threshold: a low min_sup generates too many candidates and is very slow, while a high min_sup may miss some important patterns.
· The tree-based approach instead proceeds as follows:
– Apply frequent pattern mining on the whole data.
– Select the best frequent pattern P to divide the data into two sets: one containing P and the other not.
– Repeat the procedure on each subset until a stopping condition is satisfied.
· Decreasing the support with smaller partitions makes the algorithm able to mine patterns with very low global support.
5) Genetic Algorithms
The idea of genetic algorithms is derived from natural evolution. In a genetic algorithm, an initial population is created first. This initial population consists of randomly generated rules. We can represent each rule by a string of bits.
For example, suppose that in a given training set the samples are described by two boolean attributes, A1 and A2, and that this training set contains two classes, C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 as the bit string 100. In this bit representation the two leftmost bits represent the attributes A1 and A2, respectively. Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note: If an attribute has K values where K > 2, then we can use K bits to encode the attribute values. The classes are encoded in the same manner.
Points to remember:
· Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules in the current population as well as offspring of these rules.
· The fitness of a rule is assessed by its classification accuracy on a set of training samples.
· Genetic operators such as crossover and mutation are applied to create offspring.
· In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
· In mutation, randomly selected bits in a rule's string are inverted.
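A minimal sketch of the loop described above: a small population of 3-bit rules (two attribute bits plus a class bit, following the example above) is evolved with selection, crossover and mutation. The tiny training set, fitness function and all parameter values are illustrative assumptions.

import random

random.seed(0)
# Hypothetical training samples: (A1, A2, class bit)
samples = [(1, 0, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]

def fitness(rule):
    # rule is [A1 bit, A2 bit, class bit]; count samples classified correctly
    correct = 0
    for a1, a2, cls in samples:
        if a1 == rule[0] and a2 == rule[1]:
            correct += int(cls == rule[2])
    return correct

def crossover(r1, r2):
    point = random.randint(1, 2)              # swap substrings at a random point
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule, rate=0.1):
    return [b ^ 1 if random.random() < rate else b for b in rule]   # invert random bits

population = [[random.randint(0, 1) for _ in range(3)] for _ in range(6)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)    # survival of the fittest
    parents = population[:2]
    child1, child2 = crossover(parents[0], parents[1])
    population = parents + [mutate(child1), mutate(child2)] + population[2:4]

print(population[0], fitness(population[0]))      # best rule found and its fitness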
Rough Set Approach
To discover structural relationships within imprecise and noisy data we can use rough sets.
Note: This approach can only be applied to discrete-valued attributes. Therefore, continuous-valued attributes must be discretized before it is used.
Rough set theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible, meaning the samples are identical with respect to the attributes describing the data.
There are some classes in given real-world data which cannot be distinguished in terms of the available attributes. We can use rough sets to roughly define such classes.
For a given class C, the rough set definition is approximated by two sets as follows:
· Lower Approximation of C - The lower approximation of C consists of all the data tuples that, based on the knowledge of the attributes, are certain to belong to class C.
· Upper Approximation of C - The upper approximation of C consists of all the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to C.
(Figure: the upper and lower approximations of class C.)
Fuzzy Set Approaches
Fuzzy set theory is also called possibility theory. This theory was proposed by Lotfi Zadeh in 1965 as an alternative to two-valued logic. It allows us to work at a high level of abstraction and also provides a means for dealing with imprecise measurements of data.
Fuzzy set theory also allows us to deal with vague or inexact facts. For example, being a member of the set of high incomes is inexact (e.g. if $50,000 is high, then what about $49,000 and $48,000?). Unlike a traditional crisp set, where an element either belongs to S or to its complement, in fuzzy set theory an element can belong to more than one fuzzy set.
For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing degrees. Fuzzy set notation for this income value is as follows:
m_medium_income($49K) = 0.15 and m_high_income($49K) = 0.96
where m is the membership function operating on the fuzzy sets medium_income and high_income, respectively.
(Figure: membership functions for medium_income and high_income.)
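A minimal sketch of the idea above: one income value can belong to several fuzzy sets with different degrees of membership. The piecewise-linear membership functions and their breakpoints are illustrative assumptions and do not reproduce the exact values 0.15 and 0.96 quoted above.

def medium_income(x):
    # illustrative triangular membership function for "medium income"
    if x <= 20000 or x >= 60000:
        return 0.0
    if x <= 35000:
        return (x - 20000) / 15000.0      # rising edge
    return (60000 - x) / 25000.0          # falling edge

def high_income(x):
    # illustrative ramp membership function for "high income"
    if x <= 40000:
        return 0.0
    if x >= 55000:
        return 1.0
    return (x - 40000) / 15000.0

x = 49000
print(medium_income(x), high_income(x))   # the value belongs to both sets, to differing degrees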
6) Rough set approach (formal definition):
In computer science, a rough set, first described by Polish computer scientist Zdzisław I. Pawlak, is a formal approximation of a crisp set (i.e., a conventional set) in terms of a pair of sets which give the lower and the upper approximation of the original set. In the standard version of rough set theory (Pawlak 1991), the lower- and upper-approximation sets are crisp sets, but in other variations the approximating sets may be fuzzy sets.
Information system framework
Let I = (U, A) be an information system (attribute-value system), where U is a non-empty finite set of objects (the universe) and A is a non-empty, finite set of attributes such that a : U -> V_a for every a in A. V_a is the set of values that attribute a may take. The information table assigns a value a(x) from V_a to each attribute a and object x in the universe U.
With any P ⊆ A there is an associated equivalence relation IND(P) = {(x, y) in U × U | for every a in P, a(x) = a(y)}. The relation IND(P) is called a P-indiscernibility relation. The partition of U it induces is the family of all equivalence classes of IND(P) and is denoted by U/IND(P) (or U/P). If (x, y) is in IND(P), then x and y are indiscernible (or indistinguishable) by the attributes from P.
Example: equivalence-class structure
For example, consider the following information table:
Sample Information System

Object   P1   P2   P3   P4   P5
O1        1    2    0    1    1
O2        1    2    0    1    1
O3        2    0    0    1    0
O4        0    0    1    2    1
O5        2    1    0    2    1
O6        0    0    1    2    2
O7        2    0    0    1    0
O8        0    1    2    2    1
O9        2    1    0    2    2
O10       2    0    0    1    0
When the full set of attributes P = {P1, P2, P3, P4, P5} is considered, we have the following seven equivalence classes:
{O1, O2}, {O3, O7, O10}, {O4}, {O5}, {O6}, {O8}, {O9}
Definition of a rough set
Let X ⊆ U be a target set that we wish to represent using attribute subset P; that is, we are told that an arbitrary set of objects X comprises a single class, and we wish to express this class (i.e., this subset) using the equivalence classes induced by attribute subset P. In general, X cannot be expressed exactly, because the set may include and exclude objects which are indistinguishable on the basis of the attributes in P.
For example, consider the target set X = {O1, O2, O3, O4}, and let the attribute subset P = {P1, P2, P3, P4, P5} be the full available set of features. The set X cannot be expressed exactly, because under P the objects O3, O7 and O10 are indiscernible. Thus, there is no way to represent any set X which includes O3 but excludes objects O7 and O10.
However, the target set X can be approximated using only the information contained within P by constructing the P-lower and P-upper approximations of X:
P-lower(X) = {x | [x]_P ⊆ X} = {O1, O2, O4}
P-upper(X) = {x | [x]_P ∩ X ≠ ∅} = {O1, O2, O3, O4, O7, O10}
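A minimal sketch of computing these approximations directly from the sample information system above; the object and attribute names follow the reconstructed table, and the target set is the one used in the example.

# Attribute vectors taken from the sample table above.
info = {
    'O1': (1, 2, 0, 1, 1), 'O2': (1, 2, 0, 1, 1), 'O3': (2, 0, 0, 1, 0),
    'O4': (0, 0, 1, 2, 1), 'O5': (2, 1, 0, 2, 1), 'O6': (0, 0, 1, 2, 2),
    'O7': (2, 0, 0, 1, 0), 'O8': (0, 1, 2, 2, 1), 'O9': (2, 1, 0, 2, 2),
    'O10': (2, 0, 0, 1, 0),
}

# Equivalence classes: objects with identical attribute vectors are indiscernible.
classes = {}
for obj, values in info.items():
    classes.setdefault(values, set()).add(obj)

X = {'O1', 'O2', 'O3', 'O4'}                                  # the target set from the example
lower = set().union(*(c for c in classes.values() if c <= X))  # classes entirely inside X
upper = set().union(*(c for c in classes.values() if c & X))   # classes that touch X
print(lower)   # objects certainly in X:  {O1, O2, O4}
print(upper)   # objects possibly in X:   {O1, O2, O3, O4, O7, O10}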
7) Fuzzy sets (formal definition):
Fuzzy sets are sets whose elements have degrees of membership. Fuzzy sets were introduced by Lotfi A. Zadeh[1] and Dieter Klaua[2] in 1965 as an extension of the classical notion of a set. At the same time, Salii (1965) defined a more general kind of structure called L-relations, which he studied in an abstract algebraic context.
A fuzzy set is a pair (A, m) where A is a set and m : A -> [0, 1]. For each x in A, the value m(x) is called the grade of membership of x in (A, m). For a finite set A = {x1, ..., xn}, the fuzzy set (A, m) is often denoted by {m(x1)/x1, ..., m(xn)/xn}.
Let x be in A. Then x is called not included in the fuzzy set (A, m) if m(x) = 0, x is called fully included if m(x) = 1, and x is called a fuzzy member if 0 < m(x) < 1.[6] The set {x in A | m(x) > 0} is called the support of (A, m) and the set {x in A | m(x) = 1} is called its kernel. The function m is called the membership function of the fuzzy set (A, m).
Fuzzy logic
As an extension of the case of multi-valued logic, valuations of propositional variables into a set of membership degrees can be thought of as membership functions mapping predicates into fuzzy sets (or, more formally, into an ordered set of fuzzy pairs, called a fuzzy relation). With these valuations, many-valued logic can be extended to allow for fuzzy premises from which graded conclusions may be drawn.[8]
This extension is sometimes called "fuzzy logic in the narrow sense" as opposed to "fuzzy logic in the wider sense," which originated in the engineering fields of automated control and knowledge engineering, and which encompasses many topics involving fuzzy sets and "approximate reasoning."[9]
Industrial applications of fuzzy sets in the context of "fuzzy logic in the wider sense" can be found at fuzzy logic.
Sometimes, more general variants of the notion of fuzzy set are used, with membership functions taking values in a (fixed or variable) algebra or structure L of a given kind; usually it is required that L be at least a poset or lattice. These are usually called L-fuzzy sets, to distinguish them from those valued over the unit interval. The usual membership functions with values in [0, 1] are then called [0, 1]-valued membership functions.
UNIT III
1): Density-based methods:
In density-based clustering,[8] clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points.
DBSCAN is a density-based clustering algorithm: it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.[2] OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius.
DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region[a] (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-environment of a different point and hence be made part of a cluster.
1.
DBSCAN does not require
one to specify the number of clusters in the data a priori, as opposed to k-means.
2.
DBSCAN can find
arbitrarily shaped clusters. It can even find a cluster completely surrounded
by (but not connected to) a different cluster. Due to the MinPts parameter, the
so-called single-link effect (different clusters being connected by a thin line
of points) is reduced.
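A minimal sketch of running DBSCAN, assuming scikit-learn; eps and min_samples correspond to the ε and minPts parameters described above, and their values (and the data) are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise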
2)
Optics:
Ordering points to identify the clustering
structure (OPTICS)
is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael
Ankerst, Markus M. Breunig, Hans-Peter
Kriegel and
Jörg Sander.[1] Its basic idea is similar to DBSCAN,[2] but
it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful
clusters in data of varying density. In order to do so, the points of the
database are (linearly) ordered such that points which are spatially closest
become neighbors in the ordering. Additionally, a special distance is stored
for each point that represents the density that needs to be accepted for a
cluster in order to have both points belong to the same cluster.
Like DBSCAN, OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood N_ε(p). Contrary to DBSCAN, OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a core distance that describes the distance to the MinPts-th closest point:
core-distance(p) = UNDEFINED if |N_ε(p)| < MinPts, otherwise the distance from p to its MinPts-th closest point.
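A minimal sketch of OPTICS, assuming scikit-learn; min_samples plays the role of MinPts and max_eps bounds the search radius, with illustrative values and data.

import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
clust = OPTICS(min_samples=2, max_eps=10).fit(X)
print(clust.labels_)         # cluster labels (-1 for noise)
print(clust.reachability_)   # per-point reachability distances used to build the ordering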
3) Denclue:
4) Grid-based methods: STING, CLIQUE:
Basic Grid-based Algorithm
- Define a set of grid cells.
- Assign objects to the appropriate grid cell and compute the density of each cell.
- Eliminate cells whose density is below a certain threshold t.
- Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
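A minimal sketch of the four steps just listed: points are mapped to grid cells, sparse cells are dropped, and adjacent dense cells are merged into clusters. The cell size, the density threshold t and the data are illustrative assumptions.

from collections import defaultdict

points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.1, 5.2), (5.3, 5.1), (9.9, 0.1)]
cell_size, t = 1.0, 2   # grid-cell width and minimum number of points per dense cell

# Steps 1-2: assign objects to grid cells and compute each cell's density
cells = defaultdict(list)
for x, y in points:
    cells[(int(x // cell_size), int(y // cell_size))].append((x, y))

# Step 3: eliminate cells whose density is below the threshold t
dense = {c for c, pts in cells.items() if len(pts) >= t}

# Step 4: form clusters from contiguous (adjacent) dense cells
clusters, seen = [], set()
for cell in dense:
    if cell in seen:
        continue
    cluster, stack = set(), [cell]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        cluster.add(c)
        stack.extend(n for n in dense
                     if abs(n[0] - c[0]) <= 1 and abs(n[1] - c[1]) <= 1 and n not in seen)
    clusters.append(cluster)

print(clusters)   # each cluster is a set of connected dense grid cells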
Advantages of Grid-based Clustering Algorithms
· Fast: no distance computations; clustering is performed on summaries and not on individual objects, so the complexity is usually O(number of populated grid cells) rather than O(number of objects).
· It is easy to determine which clusters are neighboring.
· Limitation: cluster shapes are limited to unions of grid cells.
Grid-Based Clustering Methods
· Use a multi-resolution grid data structure.
· Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset.
· Several interesting methods exist (in addition to the basic grid-based algorithm):
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)
STING: A Statistical Information Grid Approach
· Wang, Yang and Muntz (VLDB'97).
· The spatial area is divided into rectangular cells.
· There are several levels of cells corresponding to different levels of resolution: each cell at a higher level is partitioned into a number of smaller cells at the next lower level.
· Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
· Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, standard deviation, min, max, and the type of distribution (normal, uniform, etc.).
· A top-down approach is used to answer spatial data queries.
CLIQUE:
· Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98).
· Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
· CLIQUE can be considered both density-based and grid-based:
  - It partitions each dimension into the same number of equal-length intervals.
  - It partitions an m-dimensional data space into non-overlapping rectangular units.
  - A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter.
  - A cluster is a maximal set of connected dense units within a subspace.
· Outline of the method:
  - Partition the data space and find the number of points that lie inside each cell of the partition.
  - Identify the subspaces that contain clusters using the Apriori principle.
  - Identify clusters: determine dense units in all subspaces of interest, then determine connected dense units in all subspaces of interest.
5) Expectation–maximization (EM) algorithm:
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum
likelihood or maximum a
posteriori (MAP)
estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
The EM algorithm is used to find
the maximum likelihood parameters of a statistical model in cases
where the equations cannot be solved directly. Typically these models involve latent variables in
addition to unknown parameters and known
data observations. That is, either there are missing values among the
data, or the model can be formulated more simply by assuming the existence of
additional unobserved data points. For example, a mixture model can be
described more simply by assuming that each observed data point has a
corresponding unobserved data point, or latent variable, specifying the mixture
component that each data point belongs to.
Finding a maximum likelihood
solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values — viz. the
parameters and the latent variables — and simultaneously solving the resulting
equations. In statistical models with latent variables, this usually is not
possible. Instead, the result is typically a set of interlocking equations in
which the solution to the parameters requires the values of the latent
variables and vice-versa, but substituting one set of equations into the other
produces an unsolvable equation.
Description
Given a statistical model consisting of a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data
L(θ; X) = p(X | θ) = Σ_Z p(X, Z | θ).
However, this quantity is often intractable (e.g. if Z is a sequence of events, so that the number of values grows exponentially with the sequence length, making the exact calculation of the sum extremely difficult).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
Expectation step (E step): Calculate the expected value of the log-likelihood function with respect to the conditional distribution of Z given X under the current estimate of the parameters θ^(t):
Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log L(θ; X, Z) ]
Maximization step (M step): Find the parameter that maximizes this quantity:
θ^(t+1) = argmax_θ Q(θ | θ^(t))
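A minimal sketch of the E and M steps for a one-dimensional mixture of two Gaussians with a fixed common variance. The data, initial values and iteration count are illustrative assumptions; in practice a library implementation such as sklearn.mixture.GaussianMixture would normally be used.

import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.1, 4.9, 5.3])   # observed data
mu = np.array([0.0, 6.0])                      # initial means of the two components
pi = np.array([0.5, 0.5])                      # initial mixing weights
var = 1.0                                      # fixed common variance (a simplification)

for _ in range(20):
    # E step: expected membership (responsibility) of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate the parameters that maximize the expected log-likelihood
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

print(mu, pi)   # the means converge towards the two groups around 1 and 5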
6) Clustering high-dimensional data:
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands
of dimensions. Such high-dimensional data spaces are often encountered in
areas such as medicine, where DNA microarray technology can produce a large number of
measurements at once, and the clustering of text documents, where, if a
word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
Problems
According to Kriegel, Kröger & Zimek (2009), four problems
need to be overcome for clustering in high-dimensional data:
·
Multiple dimensions are hard to think in, impossible to
visualize, and, due to the exponential growth of the number of possible values
with each dimension, complete enumeration of all subspaces becomes intractable
with increasing dimensionality. This problem is known as the curse of dimensionality.
·
The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges: in particular, the discrimination between the nearest and the farthest point becomes meaningless.
Subspace clustering
(Figure: example 2D space with subspace clusters.)
Subspace clustering is the task of detecting all clusters in all subspaces. This means that a point might be a member of multiple clusters, each existing in a different subspace. Subspaces can be either axis-parallel or affine. The term is often used synonymously with general clustering of high-dimensional data.
Projected clustering
Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special distance function together with a regular clustering algorithm.
7) Clustering graph and network data:
Clustering data is a fundamental task in machine learning. In graph and network data (for example social or information networks), clustering groups the vertices so that vertices within a cluster are densely connected to each other and only sparsely connected to vertices in other clusters.
Unit 4
1) Web and text mining: introduction:
Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis targets, web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining.
Web usage mining is the process of extracting useful information from server logs, i.e. the process of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further depending on the kind of usage data considered:
·
Web Server Data: The
user logs are collected by the Web server. Typical data includes IP address,
page reference and access time.
·
Application Server
Data: Commercial application servers have significant features to enable
e-commerce applications to be built on top of them with little effort. A key
feature is the ability to track various kinds of business events and log them
in application server logs.
·
Application Level
Data: New kinds of events can be defined in an application, and logging can be
turned on for them thus generating histories of these specially defined events.
It must be noted, however, that many end applications require a combination of
one or more of the techniques applied in the categories above.
2) Web content mining
Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure of much of the ever-expanding information on the World Wide Web, such as hypertext documents, make automated discovery, organization and search difficult. Indexing tools for the Internet and the World Wide Web such as Lycos, AltaVista, WebCrawler, ALIWEB [6], MetaCrawler, and others provide some comfort to users, but they do not generally provide structural information nor categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.
3) Web structure mining
Web structure mining is the
process of using graph theory to analyze the node and connection structure of a
web site. According to the type of web structural data, web structure mining
can be divided into two kinds:
1. Extracting patterns from
hyperlinks in the web: a hyperlink is a
structural component that connects the web page to a different location.
2. Mining the document structure:
analysis of the tree-like structure of page structures to describe HTML or XML tag usage.
4) Text mining:
Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.
5) Unstructured text:
About Unstructured Text
Data mining algorithms act on data that is numerical or categorical. Numerical data is ordered; it is stored in columns that have a numeric data type, such as NUMBER or FLOAT. Categorical data is identified by category or classification; it is stored in columns that have a character data type, such as VARCHAR2 or CHAR.
Unstructured text data is neither numerical nor categorical. Unstructured text
includes items such as web pages, document libraries, Power Point
presentations, product specifications, emails, comment fields in reports, and
call center notes. It has been said that unstructured text accounts for more
than three quarters of all enterprise data. Extracting meaningful information
from unstructured text can be critical to the success of a business.
About Text Mining and Oracle Text
Text mining is the process of
applying data mining techniques to text terms, also called text features
or tokens. Text terms are words or groups of words that have been extracted
from text documents and assigned numeric weights. Text terms are the
fundamental unit of text that can be manipulated and analyzed.
Oracle Text
is a Database technology that provides term extraction, word and theme
searching, and other utilities for querying text. When columns of text are
present in the training data, Oracle Data Mining uses Oracle Text utilities and
term weighting strategies to transform the text for mining. Oracle Data Mining
passes configuration information supplied by the user to Oracle Text and uses
the results in the model creation process.
6) Episode rule discovery for text:
Discovering association rules is an important data mining problem. The problem was first defined in the context of market basket data to identify customers' buying habits [1]. Our overall goal is to analyze event sequences, discover recurrent patterns of events, and generate sequential association rules. Our approach is based on the concept of representative association rules combined with event constraints.
A sequential dataset is normalized and then discretized by forming subsequences using a sliding window [5]. Using a sliding window of size w, every normalized time stamp value x_t is used to compute each of the new sequence values y_{t-w/2} to y_{t+w/2}. Thus, the dataset is divided into segments, each of size w. The discretized version of the time series is obtained by using some clustering algorithm and a suitable similarity measure. Each cluster identifier is an event type, and the set of cluster labels is the class of events E.
Representative Episodal Association Rules
We use the set of frequent closed episodes FCE produced by the Gen-FCE algorithm to generate the representative episodal association rules that cover the entire set of association rules [4].
The cover of a rule r : X => Y, denoted by C(r), is the set of association rules that can be generated from r. That is, C(r : X => Y) = {X ∪ U => V | U, V ⊆ Y, U ∩ V = ∅, and V ≠ ∅}. An important property of the cover operator stated in [4] is that if an association rule r has support s and confidence c, then every rule r' ∈ C(r) has support at least s and confidence at least c.
7) Hierarchy of categories:
To model a product category
hierarchy, this solution keeps each category in its own document that also has
a list of its ancestors or “parents.” This document uses music genres as the
basis of its examples:
Initial category hierarchy
Because these kinds of categories
change infrequently, this model focuses on the operations needed to keep the
hierarchy up-to-date rather than the performance profile of update operations.
Schema
This schema has the following
properties:
- A single
document represents each category in the hierarchy.
- An ObjectId identifies each category document
for internal cross-referencing.
- Each
category document has a human-readable name and a URL compatible slug field.
- The schema
stores a list of ancestors for each category to facilitate displaying a
query and its ancestors using only a single query.
Consider the following prototype:
{ "_id" : ObjectId("4f5ec858eb03303a11000002"),
"name" : "Modal Jazz",
"parent" : ObjectId("4f5ec858eb03303a11000001"),
"slug" : "modal-jazz",
"ancestors" : [
{ "_id" : ObjectId("4f5ec858eb03303a11000001"),
"slug" : "bop",
"name" : "Bop" },
{ "_id" : ObjectId("4f5ec858eb03303a11000000"),
"slug" : "ragtime",
"name" : "Ragtime" } ]
}
Operations
Read and Display a Category
Querying
Use the following option to read
and display a category hierarchy. This query will use the slug field to
return the category information and a “bread crumb” trail from the current
category to the top level category.
category = db.categories.find(
{'slug':slug},
{'_id':0, 'name':1, 'ancestors.slug':1, 'ancestors.name':1 })
Indexing
Create a unique index on the slug field
with the following operation on the Python/PyMongo console:
>>> db.categories.ensure_index('slug', unique=True)
11) Text clustering:
Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document
organization, topic extraction and fast information
retrieval or filtering.
Document clustering involves the
use of descriptors and descriptor extraction. Descriptors are sets of words
that describe the contents within the cluster. Document clustering is generally
considered to be a centralized process. Examples of document clustering include
web document clustering for search users.
The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared with offline applications.
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by enterprise search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, or open source software such as Carrot2.
Examples:
·
Clustering divides the
results of a search for "cell" into groups like "biology,"
"battery," and "prison."
UNIT 5
1) Temporal and spatial data mining:
Introduction:
One of the main unresolved problems that arise
during the data
mining process is treating data that contains
temporal information. In this case,
a complete understanding of the entire phenomenon
requires that the data
should be viewed as a sequence of events.
Examples of temporal data: stock market data, robot sensors, weather data, biological data (e.g. monitoring fish populations), network monitoring, weblog data, and customer transactions.
Temporal data have a unique structure: high dimensionality and high feature correlation.
2) Temporal association rules:
Mining association rules in transaction data is an important data mining problem. Association rules represent directed relationships amongst variables wherein the existence of some variables predicts the existence of other variables in a transaction ({X -> Y}, for sets X and Y), with a certain degree of confidence.
Association Rule Mining Algorithms
Several algorithms have been proposed for finding association rules in transaction datasets. Of these, the Apriori algorithm proposed by [Agarwal et. al., 1994] is the most popular. The Apriori algorithm works in two steps: 1) Identification of frequent item sets in the dataset; these correspond to item sets which occur in the dataset with a frequency greater than a threshold value called minimum support. 2) Obtaining association rules from these frequent item sets which have a confidence greater than a threshold value called minimum confidence. The confidence of an association rule {X -> Y} is expressed as the ratio of the probability of X∪Y over the probability of X, i.e. the probability that the customer would purchase both X and Y given that he purchased X.
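A minimal sketch of computing support and confidence for a rule {X -> Y} over a tiny set of hypothetical transactions; the items and thresholds are illustrative.

transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "biscuits"},
    {"milk", "bread"}, {"milk", "eggs"},
]

X, Y = {"milk"}, {"bread"}
n = len(transactions)
support_x  = sum(1 for t in transactions if X <= t) / n          # how often X occurs
support_xy = sum(1 for t in transactions if (X | Y) <= t) / n    # how often X and Y co-occur
confidence = support_xy / support_x                              # estimate of P(Y | X)
print(support_xy, confidence)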
3)Sequential mining:
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.[1] It is usually presumed that the values are discrete; thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.
There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining, which is typically based on string processing algorithms, and itemset mining, which is typically based on association rule learning.
String mining typically deals with a
limited alphabet for items
that appear in a sequence, but the sequence itself may be typically very long.
Examples of an alphabet can be those in the ASCII character set used in natural language text, nucleotide
bases 'A', 'G', 'C' and 'T' in DNA sequences,
or amino acids
for protein sequences. In biology applications analysis of the arrangement of the alphabet in
strings can be used to examine gene and protein sequences to determine their properties. Knowing the
sequence of letters of a DNA or a protein is not an ultimate goal in itself. Rather, the major task
is to understand the sequence, in terms of its structure and biological function. This is typically achieved first by identifying individual
regions or structural units within each sequence and then assigning a function
to each structural unit. In many cases this requires comparing a given sequence
with previously studied ones. The comparison between the strings becomes
complicated when insertions, deletions and mutations occur in a string.
- Repeat-related problems: that deal with operations on single sequences and can
be based on exact
string matching or approximate
string matching methods for finding dispersed
fixed length and maximal length repeats, finding tandem repeats, and
finding unique subsequences and missing (un-spelled) subsequences.
- Alignment problems:
that deal with comparison between strings by first aligning one or more
sequences; examples of popular methods include BLAST
for comparing a single sequence with multiple sequences in a database, and
ClustalW
for multiple alignments. Alignment algorithms can be based on either exact
or approximate methods, and can also be classified as global alignments,
semi-global alignments and local alignment. See sequence alignment.
Some problems in sequence mining lend themselves to discovering frequent itemsets and the order in which they appear; for example, one is
seeking rules of the form "if a {customer buys a car}, he or she is likely
to {buy insurance} within 1 week", or in the context of stock prices,
"if {Nokia up and Ericsson Up}, it is likely that {Motorola up and Samsung
up} within 2 days". Traditionally, itemset mining is used in marketing
applications for discovering regularities between frequently co-occurring items
in large transactions. For example, by analysing transactions of customer shopping
baskets in a supermarket, one can produce a rule which reads "if a
customer buys onions and potatoes together, he or she is likely to also buy
hamburger meat in the same transaction".
4) GSP algorithm:
GSP Algorithm (Generalized Sequential Pattern algorithm) is an algorithm
used for sequence mining. The algorithms for solving sequence mining problems are
mostly based on the a priori
(level-wise) algorithm. One way to use the level-wise paradigm is to first
discover all the frequent items in a level-wise fashion. It simply means
counting the occurrences of all singleton elements in the database. Then, the transactions are filtered by removing the non-frequent items. At the end
of this step, each transaction consists of only the frequent elements it
originally contained. This modified database becomes an input to the GSP
algorithm. This process requires one pass over the whole database.
GSP Algorithm makes multiple
database passes. In the first pass, all single items (1-sequences) are counted.
From the frequent items, a set of candidate 2-sequences are formed, and another
pass is made to identify their frequency. The frequent 2-sequences are used to
generate the candidate 3-sequences, and this process is repeated until no more
frequent sequences are found. There are two main steps in the algorithm.
- Candidate Generation. Given the set of frequent
(k-1)-frequent sequences F(k-1), the candidates for the next pass are
generated by joining F(k-1) with itself. A pruning phase eliminates any
sequence, at least one of whose subsequences is not frequent.
- Support Counting. Normally, a hash tree–based search is employed for efficient support counting. Finally, non-maximal frequent sequences are removed.
F1 = the set of frequent 1-sequences
k = 2
do while F(k-1) != Null
    Generate candidate sets Ck (the set of candidate k-sequences)
    For all input sequences s in the database D do
        Increment the count of every a in Ck that s supports
    Fk = {a in Ck such that its frequency exceeds the threshold}
    k = k + 1
end do
Result = the union of all Fk (the set of all frequent sequences)
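A minimal sketch of the support-counting step: a candidate sequence is supported by a data sequence if its events appear in the same order. For simplicity (an assumption not made by full GSP), the candidates here are sequences of single items rather than itemsets.

def supports(data_seq, candidate):
    # Walk through the data sequence and match candidate items in order.
    i = 0
    for itemset in data_seq:
        if i < len(candidate) and candidate[i] in itemset:
            i += 1
    return i == len(candidate)

database = [
    [{"camera"}, {"memory card"}, {"tripod"}],
    [{"camera", "bag"}, {"memory card"}],
    [{"phone"}, {"case"}],
]
candidate = ["camera", "memory card"]
support = sum(supports(s, candidate) for s in database)
print(support)   # the candidate is frequent if this count exceeds the threshold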
5) SPADE:
Frequent
Sequence Mining is used to discover a set of patterns shared among objects
which have between them a specific order. For instance, a retail shop may
possess a transaction database which specifies which products were acquired by
each customer over time. In this case, the store may use Frequent Sequence
Mining to find that 40% of the people who bought the first volume of Lord of
the Rings came back to buy the second volume a month later. This kind of
information may be used to support directed advertising campaigns or
recommendation systems.
Another
application domain where Frequent Sequence Mining may be used is Web click log
analysis in Information Retrieval systems, in which case the system performance
may be refined by analyzing the sequence of interactions that the user exposed
while searching or browsing for a specific information. This kind of usage
becomes especially clear when we consider the huge amount of data obtained by
industrial search engines in the form of query logs. Google alone was reported
to answer 5.42 billion queries during December 2008 (Telecom Paper, 2009)
A sequence α is an ordered list of events <a1, a2, ..., am>. An event ai is a non-empty unordered set of items, ai ⊆ {i1, i2, ..., ik}. A sequence α = <a1, a2, ..., am> is a subsequence of β = <b1, b2, ..., bn> if and only if there exist i1, i2, ..., im such that 1 ≤ i1 < i2 < ... < im ≤ n and a1 ⊆ b(i1), a2 ⊆ b(i2), ..., am ⊆ b(im). Given a sequence database D = {s1, s2, ..., sn}, the support of a sequence α is the number of sequences in D which contain α as a subsequence. If the support of α is greater than a threshold minsup, then α is a frequent sequence (Peng and Liao, 2009).
Algorithm
An algorithm to Frequent Sequence Mining is the SPADE
(Sequential PAttern Discovery using Equivalence classes) algorithm. It uses a
vertical id-list database format, where we associate to each sequence a list of
objects in which it occurs. Then, frequent sequences can be found efficiently
using intersections on id-lists. The method also reduces the number of
databases scans, and therefore also reduces the execution time.
The first step of SPADE is to compute the frequencies of
1-sequences, which are sequences with only one item. This is done in a single
database scan. The second step consists of counting 2-sequences. This is done
by transforming the vertical representation into a horizontal representation
in memory, and counting the number of sequences for each pair of items using a
bidimensional matrix. Therefore, this step can also be executed in only one
scan.
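A minimal sketch of the vertical id-list idea behind SPADE: each item keeps the list of (sequence id, position) pairs where it occurs, and the support of a 2-sequence such as "a followed by b" is found by intersecting the id-lists. The id-lists below are illustrative.

idlist = {
    "a": [(1, 1), (2, 1), (3, 2)],   # (sequence id, position within the sequence)
    "b": [(1, 3), (2, 2), (4, 1)],
}

def support_followed_by(first, second):
    # Count the sequences in which `first` occurs at an earlier position than `second`.
    sids = set()
    for sid_a, pos_a in idlist[first]:
        for sid_b, pos_b in idlist[second]:
            if sid_a == sid_b and pos_b > pos_a:
                sids.add(sid_a)
    return len(sids)

print(support_followed_by("a", "b"))   # number of sequences containing a ... b in that order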
6) Spirit episode discovery:
8) Time series analysis:
A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones Industrial Average and the annual flow volume of the Nile River at Aswan. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. While regression analysis is often employed in such a way as to test theories that the current values of one or more independent time series affect the current value of another time series, this type of analysis of time series is not called "time series analysis", which focuses on comparing values of a single time series or multiple dependent time series at different points in time.[1]
Notation
A number of different notations are in use for time-series analysis. A common notation specifying a time series X that is indexed by the natural numbers is written
X = {X1, X2, ...}.
Another common notation is
Y = {Yt : t ∈ T}, where T is the index set.
Conditions
There are two sets of conditions under which much of the theory is built:
- Stationary process
- Ergodic process
However, ideas of stationarity must be expanded to consider two important ideas: strict stationarity and second-order stationarity. Both models and applications can be developed under each of these conditions, although the models in the latter case might be considered as only partly specified.
In
addition, time-series analysis can be applied where the series are seasonally
stationary or non-stationary. Situations where the amplitudes of
frequency components change with time can be dealt with in time-frequency
analysis which makes use of a time–frequency
representation of a
time-series or signal.[7]
Models
The general representation of an autoregressive model of order p, well known as AR(p), is
X_t = c + φ_1 X_{t-1} + φ_2 X_{t-2} + ... + φ_p X_{t-p} + ε_t,
where c is a constant, φ_1, ..., φ_p are the model parameters and ε_t is white noise.
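A minimal sketch of simulating an AR(2) process and re-estimating its coefficients by ordinary least squares; the true coefficients, noise level and series length are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.6, -0.3])            # illustrative true AR(2) coefficients
x = np.zeros(500)
for t in range(2, 500):
    x[t] = phi[0] * x[t - 1] + phi[1] * x[t - 2] + rng.normal(scale=0.5)

# Build the lagged design matrix and solve the least-squares problem
X = np.column_stack([x[1:-1], x[:-2]])
y = x[2:]
phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(phi_hat)    # should be close to the true coefficients (0.6, -0.3)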
8) Spatial mining:
Spatial data mining is the application of data
mining methods to spatial data. The end objective of spatial data mining is to
find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate
technologies, each with its own methods, traditions, and approaches to
visualization and data analysis. Particularly, most contemporary GIS have only
very basic spatial analysis functionality. The immense explosion in geographically
referenced data occasioned by developments in IT, digital mapping, remote
sensing, and the global diffusion of GIS emphasizes the importance of
developing data-driven inductive approaches to geographical analysis and
modeling.
Challenges
in Spatial mining: Geospatial data repositories tend to be very large.
Moreover, existing GIS datasets are often splintered into feature and attribute
components that are conventionally archived in hybrid data management systems.
Algorithmic requirements differ substantially for relational (attribute) data
management and for topological (feature) data management.[48] Related to this is the range and diversity of
geographic data formats, which present unique challenges. The digital
geographic data revolution is creating new types of data formats beyond the
traditional "vector" and "raster" formats. Geographic data
repositories increasingly include ill-structured data, such as imagery and geo-referenced multi-media.[49]
9) Spatial mining tasks:
Spatial Data Mining Tasks
- Geo-spatial warehousing and OLAP
- Spatial data classification/predictive modeling
- Spatial clustering/segmentation
- Spatial association and correlation analysis
- Spatial regression analysis
- Time-related spatial pattern analysis: trends, sequential patterns, partial periodicity analysis
- Many more to be explored
10) Spatial clustering:
Cluster analysis or clustering is the task of grouping a set of
objects in such a way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters).
It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
10) Data mining applications:
Data mining is widely used in diverse areas. There are a number of commercial data mining systems available today, yet there are many challenges in this field. This section lists the main application areas of data mining.
Data Mining Applications
Here is the list of areas where data mining is widely used:
·
Financial Data
Analysis
·
Retail Industry
·
Telecommunication
Industry
·
Biological Data
Analysis
·
Other Scientific
Applications
·
Intrusion Detection
FINANCIAL DATA ANALYSIS
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. A few typical cases:
· Loan payment prediction and customer credit policy analysis.
RETAIL INDUSTRY
Data mining has great application in the retail industry because it collects a large amount of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.
·
Design and
Construction of data warehouses based on benefits of data mining.
·
Multidimensional
analysis of sales, customers, products, time and region.
·
Analysis of
effectiveness of sales campaigns.
·
Customer Retention.
TELECOMMUNICATION INDUSTRY
Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is the reason why data mining has become very important in helping to understand the business.
·
Multidimensional
Analysis of Telecommunication data.
·
Fraudulent pattern
analysis.
·
Identification of
unusual patterns.
BIOLOGICAL DATA ANALYSIS
Nowadays we see vast growth in the field of biology, including genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. The following are aspects in which data mining contributes to biological data analysis:
· Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
·
Discovery of
structural patterns and analysis of genetic networks and protein pathways.
·
Association and path
analysis.
·
Visualization tools in
genetic data analysis.