UNIT I
1): Data mining tasks:
Data mining tasks describe the kinds of patterns that can be mined. Based on the kind of patterns to be mined, data mining functions fall into two broad categories, listed below:
· Descriptive
· Classification and Prediction
Descriptive
Descriptive functions deal with the general properties of data in the database. Here is the list of descriptive functions:
· Class/Concept Description
· Mining of Frequent Patterns
· Mining of Associations
· Mining of Correlations
· Mining of Clusters
Class/Concept Description
Class/concept refers to the data being associated with classes or concepts. For example, in a company the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:
· Data Characterization - This refers to summarizing the data of the class under study. This class under study is called the Target Class.
· Data Discrimination - This refers to the comparison of the target class with some predefined group or class.
2) Mining Frequent Patterns:
Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns:
· Frequent Item Set - This refers to a set of items that frequently appear together, for example milk and bread.
· Frequent Subsequence - A sequence of patterns that occur frequently, such as purchasing a camera being followed by purchasing a memory card.
· Frequent Sub Structure - Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
3) Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased together. This refers to the process of uncovering relationships among data and determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to analyze whether they have a positive, negative or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.
4): Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of a set of training data. The derived model can be presented in the following forms:
· Classification (IF-THEN) Rules
· Decision Trees
· Mathematical Formulae
· Neural Networks
Here is the list of functions involved in this:
· Classification - It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are well known.
· Prediction - It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
· Outlier Analysis - Outliers may be defined as data objects that do not comply with the general behaviour or model of the available data.
· Evolution Analysis - Evolution analysis refers to the description and modeling of regularities or trends for objects whose behaviour changes over time.
5): Cluster Analysis
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering is the process of grouping abstract objects into classes of similar objects.
Points to Remember
· A cluster of data objects can be treated as one group.
· While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
· The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Applications of Cluster Analysis
· Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
· Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on purchasing patterns.
· In the field of biology it can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in populations.
· Clustering also helps in the identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value and geographic location.
· Clustering also helps in classifying documents on the web for information discovery.
· Clustering is also used in outlier detection applications such as detection of credit card fraud.
Clustering Methods
The clustering methods can be classified into the following categories:
- Partitioning Method
- Hierarchical Method
- Density-based Method
- Grid-Based Method
- Model-Based Method
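As a small illustration of the partitioning method, the sketch below runs k-means on a tiny synthetic dataset; it assumes scikit-learn is installed, and the data and number of clusters are illustrative only.

# A minimal sketch of partitioning-based clustering (k-means), assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])   # illustrative objects

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster label assigned to each object
print(kmeans.cluster_centers_)   # centroid of each cluster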
6) Outlier Analysis:
An outlier is a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
· Example: an unusual credit card purchase; in sports, Michael Jordan or Wayne Gretzky.
· Outliers are different from noise data: noise is random error or variance in a measured variable, and noise should be removed before outlier detection.
· Outliers are interesting: they violate the mechanism that generates the normal data.
· Outlier detection vs. novelty detection: at an early stage a new pattern is an outlier, but later it may be merged into the model.
· Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis.
There are 3 types of outliers:
-> Global outlier
-> Contextual outlier
-> Collective outlier
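A minimal sketch of detecting a global outlier with a simple z-score rule; the data (toy purchase amounts) and the cut-off value are illustrative assumptions, not a fixed standard.

# Flag values whose z-score magnitude exceeds an illustrative threshold.
import numpy as np

amounts = np.array([25, 30, 28, 27, 2500, 31, 29])   # e.g. credit card purchase amounts
z = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z) > 2.0]                   # 2.0 is an illustrative cut-off
print(outliers)                                       # prints the unusual purchase (2500)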
UNIT II
1): Classification by back propagation:
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses it to update the weights in an attempt to minimize the loss function.
The backpropagation learning algorithm can be divided into two phases: propagation and weight update.
Each propagation involves the following steps:
1. Forward propagation of a training pattern's input through the neural network in order to generate the propagation's output activations.
2. Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas of all output and hidden neurons.
For each weight-synapse, follow these steps:
1. Multiply its output delta and input activation to get the gradient of the weight.
2. Subtract a ratio (percentage) of the gradient from the weight.
This ratio (percentage) influences the speed and quality of learning; it is called the learning rate. The greater the ratio, the faster the neuron trains; the lower the ratio, the more accurate the training is. The sign of the gradient of a weight indicates where the error is increasing; this is why the weight must be updated in the opposite direction.
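A minimal numpy sketch of the two phases described above (forward propagation, then backward propagation of deltas and weight update). The network size, data and learning rate are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((4, 3))                   # 4 training patterns, 3 inputs (illustrative)
t = np.array([[0.], [1.], [1.], [0.]])   # illustrative target outputs

W1 = rng.normal(size=(3, 5))             # input -> hidden weights
W2 = rng.normal(size=(5, 1))             # hidden -> output weights
lr = 0.5                                 # learning rate (the "ratio" in the text)

for epoch in range(1000):
    # Phase 1: forward propagation of the input to get output activations
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)

    # Phase 2: backward propagation of the error to get the deltas
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)

    # Weight update: gradient = input activation x output delta,
    # and a fraction (the learning rate) of the gradient is subtracted
    W2 -= lr * h.T @ delta_out
    W1 -= lr * X.T @ delta_hid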
2): Support vector machines:
In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
Given a set of training examples, each marked
as belonging to one of two categories, an SVM training algorithm builds a model
that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points
in space, mapped so that the examples of the separate categories are divided by
a clear gap that is as wide as possible. New examples are then mapped into that
same space and predicted to belong to a category based on which side of the gap
they fall on.
More formally, a support vector machine
constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used
for classification, regression, or other tasks.
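A minimal sketch of training an SVM classifier on two toy categories, assuming scikit-learn is available; the data and parameter values are illustrative.

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [4, 3], [3, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]              # each example marked as one of two categories

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # new examples assigned to a category
print(clf.support_vectors_)                    # the examples that define the margin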
3) Classification using frequent patterns:
The classification model is built in the feature space of single features as well as frequent patterns, i.e. the data is mapped to a higher-dimensional space.
• Feature combination can capture more
underlying semantics than
single features.
• Example: word phrases can improve the
accuracy of document
classification.
• FP-based classification been applied to many
problems:
– Graph classification
– Time series classification
– Protein classification
– Text classification
4) Other classification methods:
Tree-based frequent patterns
· [Fan et al. 2008] proposed building a decision tree in the space of frequent patterns as an alternative to the two-phase approach of [Cheng et al. 2007].
· Arguments against the two-phase approach:
1. The number of frequent patterns can be too large for effective feature selection in the second phase.
2. It is difficult to set the optimal min_sup threshold: a low min_sup generates too many candidates and is very slow, while a high min_sup may miss some important patterns.
· The tree-based approach instead proceeds as follows:
– Apply frequent pattern mining on the whole data.
– Select the best frequent pattern P to divide the data into two sets: one containing P and the other not.
– Repeat the procedure on each subset until a stopping condition is satisfied.
· Decreasing the support with smaller partitions makes the algorithm able to mine patterns with very low global support.
5) Genetic Algorithms
The idea of genetic algorithms is derived from natural evolution. In a genetic algorithm, an initial population is created first. This initial population consists of randomly generated rules. We can represent each rule by a string of bits.
For example, suppose that in a given training set the samples are described by two boolean attributes, A1 and A2, and that this training set contains two classes, C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 as the bit string 100. In this bit representation the two leftmost bits represent the attributes A1 and A2, respectively. Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note: If an attribute has K values where K > 2, then we can use K bits to encode the attribute values. The classes are encoded in the same manner.
Points to remember:
· Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules in the current population as well as offspring of these rules.
· The fitness of a rule is assessed by its classification accuracy on a set of training samples.
· Genetic operators such as crossover and mutation are applied to create offspring.
· In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
· In mutation, randomly selected bits in a rule's string are inverted.
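A minimal sketch of the loop described above: a small population of 3-bit rules (two attribute bits plus a class bit, following the example above) is evolved with selection, crossover and mutation. The tiny training set, fitness function and all parameter values are illustrative assumptions.

import random

random.seed(0)
# Hypothetical training samples: (A1, A2, class bit)
samples = [(1, 0, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]

def fitness(rule):
    # rule is [A1 bit, A2 bit, class bit]; count samples classified correctly
    correct = 0
    for a1, a2, cls in samples:
        if a1 == rule[0] and a2 == rule[1]:
            correct += int(cls == rule[2])
    return correct

def crossover(r1, r2):
    point = random.randint(1, 2)              # swap substrings at a random point
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule, rate=0.1):
    return [b ^ 1 if random.random() < rate else b for b in rule]   # invert random bits

population = [[random.randint(0, 1) for _ in range(3)] for _ in range(6)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)    # survival of the fittest
    parents = population[:2]
    child1, child2 = crossover(parents[0], parents[1])
    population = parents + [mutate(child1), mutate(child2)] + population[2:4]

print(population[0], fitness(population[0]))      # best rule found and its fitness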
Rough Set Approach
To discover structural relationships within imprecise and noisy data we can use rough sets.
Note: This approach can only be applied to discrete-valued attributes. Therefore, continuous-valued attributes must be discretized before it is used.
Rough set theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible, meaning the samples are identical with respect to the attributes describing the data.
There are some classes in given real-world data which cannot be distinguished in terms of the available attributes. We can use rough sets to roughly define such classes.
For a given class C, the rough set definition is approximated by two sets as follows:
· Lower Approximation of C - The lower approximation of C consists of all the data tuples that, based on the knowledge of the attributes, are certain to belong to class C.
· Upper Approximation of C - The upper approximation of C consists of all the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to C.
(Figure: the upper and lower approximations of class C.)
Fuzzy Set Approaches
Fuzzy set theory is also called possibility theory. This theory was proposed by Lotfi Zadeh in 1965 as an alternative to two-valued logic. It allows us to work at a high level of abstraction and also provides a means for dealing with imprecise measurements of data.
Fuzzy set theory also allows us to deal with vague or inexact facts. For example, being a member of the set of high incomes is inexact (e.g. if $50,000 is high, then what about $49,000 and $48,000?). Unlike a traditional crisp set, where an element either belongs to S or to its complement, in fuzzy set theory an element can belong to more than one fuzzy set.
For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing degrees. Fuzzy set notation for this income value is as follows:
m_medium_income($49K) = 0.15 and m_high_income($49K) = 0.96
where m is the membership function operating on the fuzzy sets medium_income and high_income, respectively.
(Figure: membership functions for medium_income and high_income.)
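A minimal sketch of the idea above: one income value can belong to several fuzzy sets with different degrees of membership. The piecewise-linear membership functions and their breakpoints are illustrative assumptions and do not reproduce the exact values 0.15 and 0.96 quoted above.

def medium_income(x):
    # illustrative triangular membership function for "medium income"
    if x <= 20000 or x >= 60000:
        return 0.0
    if x <= 35000:
        return (x - 20000) / 15000.0      # rising edge
    return (60000 - x) / 25000.0          # falling edge

def high_income(x):
    # illustrative ramp membership function for "high income"
    if x <= 40000:
        return 0.0
    if x >= 55000:
        return 1.0
    return (x - 40000) / 15000.0

x = 49000
print(medium_income(x), high_income(x))   # the value belongs to both sets, to differing degrees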
6) Rough set approach (formal definition):
In computer science, a rough set, first described by Polish computer scientist Zdzisław I. Pawlak, is a formal approximation of a crisp set (i.e., a conventional set) in terms of a pair of sets which give the lower and the upper approximation of the original set. In the standard version of rough set theory (Pawlak 1991), the lower- and upper-approximation sets are crisp sets, but in other variations the approximating sets may be fuzzy sets.
Information system framework
Let I = (U, A) be an information system (attribute-value system), where U is a non-empty finite set of objects (the universe) and A is a non-empty, finite set of attributes such that a : U -> V_a for every a in A. V_a is the set of values that attribute a may take. The information table assigns a value a(x) from V_a to each attribute a and object x in the universe U.
With any P ⊆ A there is an associated equivalence relation IND(P) = {(x, y) in U × U | for every a in P, a(x) = a(y)}. The relation IND(P) is called a P-indiscernibility relation. The partition of U it induces is the family of all equivalence classes of IND(P) and is denoted by U/IND(P) (or U/P). If (x, y) is in IND(P), then x and y are indiscernible (or indistinguishable) by the attributes from P.
Example: equivalence-class structure
For example, consider the following information table:
Sample Information System

Object   P1   P2   P3   P4   P5
O1        1    2    0    1    1
O2        1    2    0    1    1
O3        2    0    0    1    0
O4        0    0    1    2    1
O5        2    1    0    2    1
O6        0    0    1    2    2
O7        2    0    0    1    0
O8        0    1    2    2    1
O9        2    1    0    2    2
O10       2    0    0    1    0
When the full set of attributes P = {P1, P2, P3, P4, P5} is considered, we have the following seven equivalence classes:
{O1, O2}, {O3, O7, O10}, {O4}, {O5}, {O6}, {O8}, {O9}
Definition of a rough set
Let X ⊆ U be a target set that we wish to represent using attribute subset P; that is, we are told that an arbitrary set of objects X comprises a single class, and we wish to express this class (i.e., this subset) using the equivalence classes induced by attribute subset P. In general, X cannot be expressed exactly, because the set may include and exclude objects which are indistinguishable on the basis of the attributes in P.
For example, consider the target set X = {O1, O2, O3, O4}, and let the attribute subset P = {P1, P2, P3, P4, P5} be the full available set of features. The set X cannot be expressed exactly, because under P the objects O3, O7 and O10 are indiscernible. Thus, there is no way to represent any set X which includes O3 but excludes objects O7 and O10.
However, the target set X can be approximated using only the information contained within P by constructing the P-lower and P-upper approximations of X:
P-lower(X) = {x | [x]_P ⊆ X} = {O1, O2, O4}
P-upper(X) = {x | [x]_P ∩ X ≠ ∅} = {O1, O2, O3, O4, O7, O10}
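A minimal sketch of computing these approximations directly from the sample information system above; the object and attribute names follow the reconstructed table, and the target set is the one used in the example.

# Attribute vectors taken from the sample table above.
info = {
    'O1': (1, 2, 0, 1, 1), 'O2': (1, 2, 0, 1, 1), 'O3': (2, 0, 0, 1, 0),
    'O4': (0, 0, 1, 2, 1), 'O5': (2, 1, 0, 2, 1), 'O6': (0, 0, 1, 2, 2),
    'O7': (2, 0, 0, 1, 0), 'O8': (0, 1, 2, 2, 1), 'O9': (2, 1, 0, 2, 2),
    'O10': (2, 0, 0, 1, 0),
}

# Equivalence classes: objects with identical attribute vectors are indiscernible.
classes = {}
for obj, values in info.items():
    classes.setdefault(values, set()).add(obj)

X = {'O1', 'O2', 'O3', 'O4'}                                  # the target set from the example
lower = set().union(*(c for c in classes.values() if c <= X))  # classes entirely inside X
upper = set().union(*(c for c in classes.values() if c & X))   # classes that touch X
print(lower)   # objects certainly in X:  {O1, O2, O4}
print(upper)   # objects possibly in X:   {O1, O2, O3, O4, O7, O10}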
7) Fuzzy sets (formal definition):
Fuzzy sets are sets whose elements have degrees of membership. Fuzzy sets were introduced by Lotfi A. Zadeh[1] and Dieter Klaua[2] in 1965 as an extension of the classical notion of a set. At the same time, Salii (1965) defined a more general kind of structure called L-relations, which he studied in an abstract algebraic context.
A fuzzy set is a pair (A, m) where A is a set and m : A -> [0, 1]. For each x in A, the value m(x) is called the grade of membership of x in (A, m). For a finite set A = {x1, ..., xn}, the fuzzy set (A, m) is often denoted by {m(x1)/x1, ..., m(xn)/xn}.
Let x be in A. Then x is called not included in the fuzzy set (A, m) if m(x) = 0, x is called fully included if m(x) = 1, and x is called a fuzzy member if 0 < m(x) < 1.[6] The set {x in A | m(x) > 0} is called the support of (A, m) and the set {x in A | m(x) = 1} is called its kernel. The function m is called the membership function of the fuzzy set (A, m).
Fuzzy logic
As an extension of the case of multi-valued logic, valuations of propositional variables into a set of membership degrees can be thought of as membership functions mapping predicates into fuzzy sets (or, more formally, into an ordered set of fuzzy pairs, called a fuzzy relation). With these valuations, many-valued logic can be extended to allow for fuzzy premises from which graded conclusions may be drawn.[8]
This extension is sometimes called "fuzzy logic in the narrow sense" as opposed to "fuzzy logic in the wider sense," which originated in the engineering fields of automated control and knowledge engineering, and which encompasses many topics involving fuzzy sets and "approximate reasoning."[9]
Industrial applications of fuzzy sets in the context of "fuzzy logic in the wider sense" can be found at fuzzy logic.
Sometimes, more general variants of the notion of fuzzy set are used, with membership functions taking values in a (fixed or variable) algebra or structure L of a given kind; usually it is required that L be at least a poset or lattice. These are usually called L-fuzzy sets, to distinguish them from those valued over the unit interval. The usual membership functions with values in [0, 1] are then called [0, 1]-valued membership functions.
UNIT III
1): Density-based methods:
In density-based clustering,[8] clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points.
DBSCAN is a density-based clustering algorithm: it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.[2] OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius.
DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region[a] (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-environment of a different point and hence be made part of a cluster.
1.
DBSCAN does not require
one to specify the number of clusters in the data a priori, as opposed to k-means.
2.
DBSCAN can find
arbitrarily shaped clusters. It can even find a cluster completely surrounded
by (but not connected to) a different cluster. Due to the MinPts parameter, the
so-called single-link effect (different clusters being connected by a thin line
of points) is reduced.
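A minimal sketch of running DBSCAN, assuming scikit-learn; eps and min_samples correspond to the ε and minPts parameters described above, and their values (and the data) are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise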
2)
Optics:
Ordering points to identify the clustering
structure (OPTICS)
is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael
Ankerst, Markus M. Breunig, Hans-Peter
Kriegel and
Jörg Sander.[1] Its basic idea is similar to DBSCAN,[2] but
it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful
clusters in data of varying density. In order to do so, the points of the
database are (linearly) ordered such that points which are spatially closest
become neighbors in the ordering. Additionally, a special distance is stored
for each point that represents the density that needs to be accepted for a
cluster in order to have both points belong to the same cluster.
Like DBSCAN, OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood N_ε(p). Contrary to DBSCAN, OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a core distance that describes the distance to the MinPts-th closest point:
core-distance(p) = UNDEFINED if |N_ε(p)| < MinPts, otherwise the distance from p to its MinPts-th closest point.
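A minimal sketch of OPTICS, assuming scikit-learn; min_samples plays the role of MinPts and max_eps bounds the search radius, with illustrative values and data.

import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
clust = OPTICS(min_samples=2, max_eps=10).fit(X)
print(clust.labels_)         # cluster labels (-1 for noise)
print(clust.reachability_)   # per-point reachability distances used to build the ordering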
3) Denclue:
4) Grid-based methods: STING, CLIQUE:
Basic Grid-based Algorithm
- Define a set of grid cells.
- Assign objects to the appropriate grid cell and compute the density of each cell.
- Eliminate cells whose density is below a certain threshold t.
- Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
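A minimal sketch of the four steps just listed: points are mapped to grid cells, sparse cells are dropped, and adjacent dense cells are merged into clusters. The cell size, the density threshold t and the data are illustrative assumptions.

from collections import defaultdict

points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.1, 5.2), (5.3, 5.1), (9.9, 0.1)]
cell_size, t = 1.0, 2   # grid-cell width and minimum number of points per dense cell

# Steps 1-2: assign objects to grid cells and compute each cell's density
cells = defaultdict(list)
for x, y in points:
    cells[(int(x // cell_size), int(y // cell_size))].append((x, y))

# Step 3: eliminate cells whose density is below the threshold t
dense = {c for c, pts in cells.items() if len(pts) >= t}

# Step 4: form clusters from contiguous (adjacent) dense cells
clusters, seen = [], set()
for cell in dense:
    if cell in seen:
        continue
    cluster, stack = set(), [cell]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        cluster.add(c)
        stack.extend(n for n in dense
                     if abs(n[0] - c[0]) <= 1 and abs(n[1] - c[1]) <= 1 and n not in seen)
    clusters.append(cluster)

print(clusters)   # each cluster is a set of connected dense grid cells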
Advantages of Grid-based Clustering Algorithms
· Fast: no distance computations; clustering is performed on summaries and not on individual objects, so the complexity is usually O(number of populated grid cells) rather than O(number of objects).
· It is easy to determine which clusters are neighboring.
· Limitation: cluster shapes are limited to unions of grid cells.
Grid-Based Clustering Methods
· Use a multi-resolution grid data structure.
· Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset.
· Several interesting methods exist (in addition to the basic grid-based algorithm):
  - STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
  - CLIQUE: Agrawal, et al. (SIGMOD'98)
STING: A Statistical Information Grid Approach
· Wang, Yang and Muntz (VLDB'97).
· The spatial area is divided into rectangular cells.
· There are several levels of cells corresponding to different levels of resolution: each cell at a higher level is partitioned into a number of smaller cells at the next lower level.
· Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
· Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, standard deviation, min, max, and the type of distribution (normal, uniform, etc.).
· A top-down approach is used to answer spatial data queries.
CLIQUE:
· Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98).
· Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
· CLIQUE can be considered both density-based and grid-based:
  - It partitions each dimension into the same number of equal-length intervals.
  - It partitions an m-dimensional data space into non-overlapping rectangular units.
  - A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter.
  - A cluster is a maximal set of connected dense units within a subspace.
· Outline of the method:
  - Partition the data space and find the number of points that lie inside each cell of the partition.
  - Identify the subspaces that contain clusters using the Apriori principle.
  - Identify clusters: determine dense units in all subspaces of interest, then determine connected dense units in all subspaces of interest.
5) Expectation–maximization (EM) algorithm:
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum
likelihood or maximum a
posteriori (MAP)
estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
The EM algorithm is used to find
the maximum likelihood parameters of a statistical model in cases
where the equations cannot be solved directly. Typically these models involve latent variables in
addition to unknown parameters and known
data observations. That is, either there are missing values among the
data, or the model can be formulated more simply by assuming the existence of
additional unobserved data points. For example, a mixture model can be
described more simply by assuming that each observed data point has a
corresponding unobserved data point, or latent variable, specifying the mixture
component that each data point belongs to.
Finding a maximum likelihood
solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values — viz. the
parameters and the latent variables — and simultaneously solving the resulting
equations. In statistical models with latent variables, this usually is not
possible. Instead, the result is typically a set of interlocking equations in
which the solution to the parameters requires the values of the latent
variables and vice-versa, but substituting one set of equations into the other
produces an unsolvable equation.
Description
Given a statistical model consisting of a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data
L(θ; X) = p(X | θ) = Σ_Z p(X, Z | θ).
However, this quantity is often intractable (e.g. if Z is a sequence of events, so that the number of values grows exponentially with the sequence length, making the exact calculation of the sum extremely difficult).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
Expectation step (E step): Calculate the expected value of the log-likelihood function with respect to the conditional distribution of Z given X under the current estimate of the parameters θ^(t):
Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log L(θ; X, Z) ]
Maximization step (M step): Find the parameter that maximizes this quantity:
θ^(t+1) = argmax_θ Q(θ | θ^(t))
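A minimal sketch of the E and M steps for a one-dimensional mixture of two Gaussians with a fixed common variance. The data, initial values and iteration count are illustrative assumptions; in practice a library implementation such as sklearn.mixture.GaussianMixture would normally be used.

import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.1, 4.9, 5.3])   # observed data
mu = np.array([0.0, 6.0])                      # initial means of the two components
pi = np.array([0.5, 0.5])                      # initial mixing weights
var = 1.0                                      # fixed common variance (a simplification)

for _ in range(20):
    # E step: expected membership (responsibility) of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate the parameters that maximize the expected log-likelihood
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

print(mu, pi)   # the means converge towards the two groups around 1 and 5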
6) Clustering high-dimensional data:
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands
of dimensions. Such high-dimensional data spaces are often encountered in
areas such as medicine, where DNA microarray technology can produce a large number of
measurements at once, and the clustering of text documents, where, if a
word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
Problems
According to Kriegel, Kröger & Zimek (2009), four problems
need to be overcome for clustering in high-dimensional data:
·
Multiple dimensions are hard to think in, impossible to
visualize, and, due to the exponential growth of the number of possible values
with each dimension, complete enumeration of all subspaces becomes intractable
with increasing dimensionality. This problem is known as the curse of dimensionality.
·
The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges: in particular, the discrimination between the nearest and the farthest point becomes meaningless.
Subspace clustering
(Figure: example 2D space with subspace clusters.)
Subspace clustering is the task of detecting all clusters in all subspaces. This means that a point might be a member of multiple clusters, each existing in a different subspace. Subspaces can be either axis-parallel or affine. The term is often used synonymously with general clustering of high-dimensional data.
Projected clustering
Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special distance function together with a regular clustering algorithm.
7) Clustering graph and network data:
Clustering data is a fundamental task in machine learning. In graph and network data (for example social or information networks), clustering groups the vertices so that vertices within a cluster are densely connected to each other and only sparsely connected to vertices in other clusters.
Unit 4
1) Web and text mining: introduction:
Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis targets, web mining can be divided into three different types: Web usage mining, Web content mining and Web structure mining.
Web usage mining is the process of extracting useful information from server logs, i.e. the process of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further depending on the kind of usage data considered:
·
Web Server Data: The
user logs are collected by the Web server. Typical data includes IP address,
page reference and access time.
·
Application Server
Data: Commercial application servers have significant features to enable
e-commerce applications to be built on top of them with little effort. A key
feature is the ability to track various kinds of business events and log them
in application server logs.
·
Application Level
Data: New kinds of events can be defined in an application, and logging can be
turned on for them thus generating histories of these specially defined events.
It must be noted, however, that many end applications require a combination of
one or more of the techniques applied in the categories above.
2) Web content mining
Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure of much of the ever-expanding information on the World Wide Web, such as hypertext documents, make automated discovery, organization and search difficult. Indexing tools for the Internet and the World Wide Web such as Lycos, AltaVista, WebCrawler, ALIWEB [6], MetaCrawler, and others provide some comfort to users, but they do not generally provide structural information nor categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.
3) Web structure mining
Web structure mining is the
process of using graph theory to analyze the node and connection structure of a
web site. According to the type of web structural data, web structure mining
can be divided into two kinds:
1. Extracting patterns from
hyperlinks in the web: a hyperlink is a
structural component that connects the web page to a different location.
2. Mining the document structure:
analysis of the tree-like structure of page structures to describe HTML or XML tag usage.
4) Text mining:
Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.
5) Unstructured text:
About Unstructured Text
Data mining algorithms act on data that is numerical or categorical. Numerical data is ordered; it is stored in columns that have a numeric data type, such as NUMBER or FLOAT. Categorical data is identified by category or classification; it is stored in columns that have a character data type, such as VARCHAR2 or CHAR.
Unstructured text data is neither numerical nor categorical. Unstructured text
includes items such as web pages, document libraries, Power Point
presentations, product specifications, emails, comment fields in reports, and
call center notes. It has been said that unstructured text accounts for more
than three quarters of all enterprise data. Extracting meaningful information
from unstructured text can be critical to the success of a business.
About Text Mining and Oracle Text
Text mining is the process of
applying data mining techniques to text terms, also called text features
or tokens. Text terms are words or groups of words that have been extracted
from text documents and assigned numeric weights. Text terms are the
fundamental unit of text that can be manipulated and analyzed.
Oracle Text
is a Database technology that provides term extraction, word and theme
searching, and other utilities for querying text. When columns of text are
present in the training data, Oracle Data Mining uses Oracle Text utilities and
term weighting strategies to transform the text for mining. Oracle Data Mining
passes configuration information supplied by the user to Oracle Text and uses
the results in the model creation process.
6) Episode rule discovery for text:
Discovering association rules is an important data mining problem. The problem was first defined in the context of market basket data to identify customers' buying habits [1]. Our overall goal is to analyze event sequences, discover recurrent patterns of events, and generate sequential association rules. Our approach is based on the concept of representative association rules combined with event constraints.
A sequential dataset is normalized and then discretized by forming subsequences using a sliding window [5]. Using a sliding window of size w, every normalized time stamp value x_t is used to compute each of the new sequence values y_{t-w/2} to y_{t+w/2}. Thus, the dataset is divided into segments, each of size w. The discretized version of the time series is obtained by using some clustering algorithm and a suitable similarity measure. Each cluster identifier is an event type, and the set of cluster labels is the class of events E.
Representative Episodal Association Rules
We use the set of frequent closed episodes FCE produced by the Gen-FCE algorithm to generate the representative episodal association rules that cover the entire set of association rules [4].
The cover of a rule r : X => Y, denoted by C(r), is the set of association rules that can be generated from r. That is, C(r : X => Y) = {X ∪ U => V | U, V ⊆ Y, U ∩ V = ∅, and V ≠ ∅}. An important property of the cover operator stated in [4] is that if an association rule r has support s and confidence c, then every rule r' ∈ C(r) has support at least s and confidence at least c.
7) Hierarchy of categories:
To model a product category
hierarchy, this solution keeps each category in its own document that also has
a list of its ancestors or “parents.” This document uses music genres as the
basis of its examples:
Initial category hierarchy
Because these kinds of categories
change infrequently, this model focuses on the operations needed to keep the
hierarchy up-to-date rather than the performance profile of update operations.
Schema
This schema has the following
properties:
- A single
document represents each category in the hierarchy.
- An ObjectId identifies each category document
for internal cross-referencing.
- Each
category document has a human-readable name and a URL compatible slug field.
- The schema
stores a list of ancestors for each category to facilitate displaying a
query and its ancestors using only a single query.
Consider the following prototype:
{ "_id" : ObjectId("4f5ec858eb03303a11000002"),
"name" : "Modal Jazz",
"parent" : ObjectId("4f5ec858eb03303a11000001"),
"slug" : "modal-jazz",
"ancestors" : [
{ "_id" : ObjectId("4f5ec858eb03303a11000001"),
"slug" : "bop",
"name" : "Bop" },
{ "_id" : ObjectId("4f5ec858eb03303a11000000"),
"slug" : "ragtime",
"name" : "Ragtime" } ]
}
Operations
Read and Display a Category
Querying
Use the following option to read
and display a category hierarchy. This query will use the slug field to
return the category information and a “bread crumb” trail from the current
category to the top level category.
category = db.categories.find(
{'slug':slug},
{'_id':0, 'name':1, 'ancestors.slug':1, 'ancestors.name':1 })
Indexing
Create a unique index on the slug field
with the following operation on the Python/PyMongo console:
>>> db.categories.ensure_index('slug', unique=True)
11) Text clustering:
Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document
organization, topic extraction and fast information
retrieval or filtering.
Document clustering involves the
use of descriptors and descriptor extraction. Descriptors are sets of words
that describe the contents within the cluster. Document clustering is generally
considered to be a centralized process. Examples of document clustering include
web document clustering for search users.
The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared with offline applications.
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by enterprise search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, or open source software such as Carrot2.
Examples:
·
Clustering divides the
results of a search for "cell" into groups like "biology,"
"battery," and "prison."
UNIT 5
1) Temporal and spatial data mining:
Introduction:
One of the main unresolved problems that arise
during the data
mining process is treating data that contains
temporal information. In this case,
a complete understanding of the entire phenomenon
requires that the data
should be viewed as a sequence of events.
Examples of temporal data: stock market data, robot sensors, weather data, biological data (e.g. monitoring fish populations), network monitoring, weblog data, and customer transactions.
Temporal data have a unique structure: high dimensionality and high feature correlation.
2) Temporal association rules:
Mining association rules in transaction data is an important data mining problem. Association rules represent directed relationships amongst variables wherein the existence of some variables predicts the existence of other variables in a transaction ({X -> Y}, for sets X and Y), with a certain degree of confidence.
Association Rule Mining Algorithms
Several algorithms have been proposed for finding association rules in transaction datasets. Of these, the Apriori algorithm proposed by [Agarwal et. al., 1994] is the most popular. The Apriori algorithm works in two steps: 1) Identification of frequent item sets in the dataset; these correspond to item sets which occur in the dataset with a frequency greater than a threshold value called minimum support. 2) Obtaining association rules from these frequent item sets which have a confidence greater than a threshold value called minimum confidence. The confidence of an association rule {X -> Y} is expressed as the ratio of the probability of X∪Y over the probability of X, i.e. the probability that the customer would purchase both X and Y given that he purchased X.
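A minimal sketch of computing support and confidence for a rule {X -> Y} over a tiny set of hypothetical transactions; the items and thresholds are illustrative.

transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "biscuits"},
    {"milk", "bread"}, {"milk", "eggs"},
]

X, Y = {"milk"}, {"bread"}
n = len(transactions)
support_x  = sum(1 for t in transactions if X <= t) / n          # how often X occurs
support_xy = sum(1 for t in transactions if (X | Y) <= t) / n    # how often X and Y co-occur
confidence = support_xy / support_x                              # estimate of P(Y | X)
print(support_xy, confidence)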
3)Sequential mining:
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.[1] It is usually presumed that the values are discrete; thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.
There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining, which is typically based on string processing algorithms, and itemset mining, which is typically based on association rule learning.
String mining typically deals with a
limited alphabet for items
that appear in a sequence, but the sequence itself may be typically very long.
Examples of an alphabet can be those in the ASCII character set used in natural language text, nucleotide
bases 'A', 'G', 'C' and 'T' in DNA sequences,
or amino acids
for protein sequences. In biology applications analysis of the arrangement of the alphabet in
strings can be used to examine gene and protein sequences to determine their properties. Knowing the
sequence of letters of a DNA or a protein is not an ultimate goal in itself. Rather, the major task
is to understand the sequence, in terms of its structure and biological function. This is typically achieved first by identifying individual
regions or structural units within each sequence and then assigning a function
to each structural unit. In many cases this requires comparing a given sequence
with previously studied ones. The comparison between the strings becomes
complicated when insertions, deletions and mutations occur in a string.
- Repeat-related problems: that deal with operations on single sequences and can
be based on exact
string matching or approximate
string matching methods for finding dispersed
fixed length and maximal length repeats, finding tandem repeats, and
finding unique subsequences and missing (un-spelled) subsequences.
- Alignment problems:
that deal with comparison between strings by first aligning one or more
sequences; examples of popular methods include BLAST
for comparing a single sequence with multiple sequences in a database, and
ClustalW
for multiple alignments. Alignment algorithms can be based on either exact
or approximate methods, and can also be classified as global alignments,
semi-global alignments and local alignment. See sequence alignment.
Some problems in sequence mining lend themselves to discovering frequent itemsets and the order in which they appear; for example, one is
seeking rules of the form "if a {customer buys a car}, he or she is likely
to {buy insurance} within 1 week", or in the context of stock prices,
"if {Nokia up and Ericsson Up}, it is likely that {Motorola up and Samsung
up} within 2 days". Traditionally, itemset mining is used in marketing
applications for discovering regularities between frequently co-occurring items
in large transactions. For example, by analysing transactions of customer shopping
baskets in a supermarket, one can produce a rule which reads "if a
customer buys onions and potatoes together, he or she is likely to also buy
hamburger meat in the same transaction".
4) GSP algorithm:
GSP Algorithm (Generalized Sequential Pattern algorithm) is an algorithm
used for sequence mining. The algorithms for solving sequence mining problems are
mostly based on the a priori
(level-wise) algorithm. One way to use the level-wise paradigm is to first
discover all the frequent items in a level-wise fashion. It simply means
counting the occurrences of all singleton elements in the database. Then, the transactions are filtered by removing the non-frequent items. At the end
of this step, each transaction consists of only the frequent elements it
originally contained. This modified database becomes an input to the GSP
algorithm. This process requires one pass over the whole database.
GSP Algorithm makes multiple
database passes. In the first pass, all single items (1-sequences) are counted.
From the frequent items, a set of candidate 2-sequences are formed, and another
pass is made to identify their frequency. The frequent 2-sequences are used to
generate the candidate 3-sequences, and this process is repeated until no more
frequent sequences are found. There are two main steps in the algorithm.
- Candidate Generation. Given the set of frequent
(k-1)-frequent sequences F(k-1), the candidates for the next pass are
generated by joining F(k-1) with itself. A pruning phase eliminates any
sequence, at least one of whose subsequences is not frequent.
- Support Counting. Normally, a hash tree–based search is employed for efficient support counting. Finally, non-maximal frequent sequences are removed.
F1 = the set of frequent 1-sequences
k = 2
do while F(k-1) != Null
    Generate candidate sets Ck (the set of candidate k-sequences)
    For all input sequences s in the database D do
        Increment the count of every a in Ck that s supports
    Fk = {a in Ck such that its frequency exceeds the threshold}
    k = k + 1
end do
Result = the union of all Fk (the set of all frequent sequences)
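A minimal sketch of the support-counting step: a candidate sequence is supported by a data sequence if its events appear in the same order. For simplicity (an assumption not made by full GSP), the candidates here are sequences of single items rather than itemsets.

def supports(data_seq, candidate):
    # Walk through the data sequence and match candidate items in order.
    i = 0
    for itemset in data_seq:
        if i < len(candidate) and candidate[i] in itemset:
            i += 1
    return i == len(candidate)

database = [
    [{"camera"}, {"memory card"}, {"tripod"}],
    [{"camera", "bag"}, {"memory card"}],
    [{"phone"}, {"case"}],
]
candidate = ["camera", "memory card"]
support = sum(supports(s, candidate) for s in database)
print(support)   # the candidate is frequent if this count exceeds the threshold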
5) SPADE:
Frequent
Sequence Mining is used to discover a set of patterns shared among objects
which have between them a specific order. For instance, a retail shop may
possess a transaction database which specifies which products were acquired by
each customer over time. In this case, the store may use Frequent Sequence
Mining to find that 40% of the people who bought the first volume of Lord of
the Rings came back to buy the second volume a month later. This kind of
information may be used to support directed advertising campaigns or
recommendation systems.
Another
application domain where Frequent Sequence Mining may be used is Web click log
analysis in Information Retrieval systems, in which case the system performance
may be refined by analyzing the sequence of interactions that the user exposed
while searching or browsing for a specific information. This kind of usage
becomes especially clear when we consider the huge amount of data obtained by
industrial search engines in the form of query logs. Google alone was reported
to answer 5.42 billion queries during December 2008 (Telecom Paper, 2009)
A sequence α is an ordered list of events <a1, a2, ..., am>. An event ai is a non-empty unordered set of items, ai ⊆ {i1, i2, ..., ik}. A sequence α = <a1, a2, ..., am> is a subsequence of β = <b1, b2, ..., bn> if and only if there exist i1, i2, ..., im such that 1 ≤ i1 < i2 < ... < im ≤ n and a1 ⊆ b(i1), a2 ⊆ b(i2), ..., am ⊆ b(im). Given a sequence database D = {s1, s2, ..., sn}, the support of a sequence α is the number of sequences in D which contain α as a subsequence. If the support of α is greater than a threshold minsup, then α is a frequent sequence (Peng and Liao, 2009).
Algorithm
An algorithm to Frequent Sequence Mining is the SPADE
(Sequential PAttern Discovery using Equivalence classes) algorithm. It uses a
vertical id-list database format, where we associate to each sequence a list of
objects in which it occurs. Then, frequent sequences can be found efficiently
using intersections on id-lists. The method also reduces the number of
databases scans, and therefore also reduces the execution time.
The first step of SPADE is to compute the frequencies of
1-sequences, which are sequences with only one item. This is done in a single
database scan. The second step consists of counting 2-sequences. This is done
by transforming the vertical representation into a horizontal representation
in memory, and counting the number of sequences for each pair of items using a
bidimensional matrix. Therefore, this step can also be executed in only one
scan.
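A minimal sketch of the vertical id-list idea behind SPADE: each item keeps the list of (sequence id, position) pairs where it occurs, and the support of a 2-sequence such as "a followed by b" is found by intersecting the id-lists. The id-lists below are illustrative.

idlist = {
    "a": [(1, 1), (2, 1), (3, 2)],   # (sequence id, position within the sequence)
    "b": [(1, 3), (2, 2), (4, 1)],
}

def support_followed_by(first, second):
    # Count the sequences in which `first` occurs at an earlier position than `second`.
    sids = set()
    for sid_a, pos_a in idlist[first]:
        for sid_b, pos_b in idlist[second]:
            if sid_a == sid_b and pos_b > pos_a:
                sids.add(sid_a)
    return len(sids)

print(support_followed_by("a", "b"))   # number of sequences containing a ... b in that order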
6) Spirit episode discovery:
8) Time series analysis:
A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones Industrial Average and the annual flow volume of the Nile River at Aswan. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. While regression analysis is often employed in such a way as to test theories that the current values of one or more independent time series affect the current value of another time series, this type of analysis of time series is not called "time series analysis", which focuses on comparing values of a single time series or multiple dependent time series at different points in time.[1]
Notation
A number of different notations are in use for time-series analysis. A common notation specifying a time series X that is indexed by the natural numbers is written
X = {X1, X2, ...}.
Another common notation is
Y = {Yt : t ∈ T}, where T is the index set.
Conditions
There are two sets of conditions under which much of the theory is built:
- Stationary process
- Ergodic process
However, ideas of stationarity must be expanded to consider two important ideas: strict stationarity and second-order stationarity. Both models and applications can be developed under each of these conditions, although the models in the latter case might be considered as only partly specified.
In
addition, time-series analysis can be applied where the series are seasonally
stationary or non-stationary. Situations where the amplitudes of
frequency components change with time can be dealt with in time-frequency
analysis which makes use of a time–frequency
representation of a
time-series or signal.[7]
Models
The general representation of an autoregressive model of order p, well known as AR(p), is
X_t = c + φ_1 X_{t-1} + φ_2 X_{t-2} + ... + φ_p X_{t-p} + ε_t,
where c is a constant, φ_1, ..., φ_p are the model parameters and ε_t is white noise.
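A minimal sketch of simulating an AR(2) process and re-estimating its coefficients by ordinary least squares; the true coefficients, noise level and series length are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.6, -0.3])            # illustrative true AR(2) coefficients
x = np.zeros(500)
for t in range(2, 500):
    x[t] = phi[0] * x[t - 1] + phi[1] * x[t - 2] + rng.normal(scale=0.5)

# Build the lagged design matrix and solve the least-squares problem
X = np.column_stack([x[1:-1], x[:-2]])
y = x[2:]
phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(phi_hat)    # should be close to the true coefficients (0.6, -0.3)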
8) Spatial mining:
Spatial data mining is the application of data
mining methods to spatial data. The end objective of spatial data mining is to
find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate
technologies, each with its own methods, traditions, and approaches to
visualization and data analysis. Particularly, most contemporary GIS have only
very basic spatial analysis functionality. The immense explosion in geographically
referenced data occasioned by developments in IT, digital mapping, remote
sensing, and the global diffusion of GIS emphasizes the importance of
developing data-driven inductive approaches to geographical analysis and
modeling.
Challenges
in Spatial mining: Geospatial data repositories tend to be very large.
Moreover, existing GIS datasets are often splintered into feature and attribute
components that are conventionally archived in hybrid data management systems.
Algorithmic requirements differ substantially for relational (attribute) data
management and for topological (feature) data management.[48] Related to this is the range and diversity of
geographic data formats, which present unique challenges. The digital
geographic data revolution is creating new types of data formats beyond the
traditional "vector" and "raster" formats. Geographic data
repositories increasingly include ill-structured data, such as imagery and geo-referenced multi-media.[49]
9) Spatial mining tasks:
Spatial Data Mining Tasks
- Geo-spatial warehousing and OLAP
- Spatial data classification/predictive modeling
- Spatial clustering/segmentation
- Spatial association and correlation analysis
- Spatial regression analysis
- Time-related spatial pattern analysis: trends, sequential patterns, partial periodicity analysis
- Many more to be explored
10) Spatial clustering:
Cluster analysis or clustering is the task of grouping a set of
objects in such a way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters).
It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
10) Data mining applications:
Data mining is widely used in diverse areas. There are a number of commercial data mining systems available today, yet there are many challenges in this field. This section lists the main application areas of data mining.
Data Mining Applications
Here is the list of areas where data mining is widely used:
·
Financial Data
Analysis
·
Retail Industry
·
Telecommunication
Industry
·
Biological Data
Analysis
·
Other Scientific
Applications
·
Intrusion Detection
FINANCIAL DATA ANALYSIS
The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. A few typical cases:
· Loan payment prediction and customer credit policy analysis.
RETAIL INDUSTRY
Data mining has great application in the retail industry because it collects a large amount of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.
·
Design and
Construction of data warehouses based on benefits of data mining.
·
Multidimensional
analysis of sales, customers, products, time and region.
·
Analysis of
effectiveness of sales campaigns.
·
Customer Retention.
TELECOMMUNICATION INDUSTRY
Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is the reason why data mining has become very important in helping to understand the business.
·
Multidimensional
Analysis of Telecommunication data.
·
Fraudulent pattern
analysis.
·
Identification of
unusual patterns.
BIOLOGICAL DATA ANALYSIS
Nowadays we see vast growth in the field of biology, including genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. The following are aspects in which data mining contributes to biological data analysis:
· Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
·
Discovery of
structural patterns and analysis of genetic networks and protein pathways.
·
Association and path
analysis.
·
Visualization tools in
genetic data analysis.