UNIT 5
1) Temporal and spatial data mining:
Introduction:
One of the main unresolved problems that arises during the data mining
process is treating data that contains temporal information. In this case,
a complete understanding of the entire phenomenon requires that the data
be viewed as a sequence of events.
Examples of temporal data:
• Stock market data
• Robot sensors
• Weather data
• Biological data, e.g. monitoring fish populations
• Network monitoring
• Weblog data
• Customer transactions
Temporal data have a unique structure:
• High dimensionality
• High feature correlation
2) Temporal association rules:
Mining association rules in transaction data is an important data mining
problem. Association rules represent directed relationships among variables,
in which the presence of some variables predicts the presence of other
variables in a transaction ({X->Y}, for item sets X and Y) with a certain
amount of confidence.
Association Rule Mining Algorithms
Several algorithms have been proposed for finding association rules in
transaction datasets. Of these, the Apriori algorithm proposed by
[Agrawal and Srikant, 1994] is the most popular. The Apriori algorithm works
in two steps:
1) Identification of frequent item sets in the dataset. These are the item
sets that occur in the dataset with a frequency greater than a threshold value
called minimum support.
2) Derivation of association rules from these frequent item sets, keeping the
rules whose confidence is greater than a threshold value called minimum
confidence.
The confidence of an association rule {X->Y} is the ratio of the probability
that X and Y occur together to the probability of X, i.e. the conditional
probability that a customer purchases Y given that he or she purchased X:
confidence(X->Y) = support(X ∪ Y) / support(X).
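The two measures above can be sketched in a few lines. This is only an illustration, not the Apriori algorithm itself: it computes support and confidence directly for a given rule. The function names and the shopping-basket data are made up for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """confidence(X -> Y) = support(X and Y together) / support(X).

    Raises ZeroDivisionError if X never occurs in the transactions.
    """
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
]

rule_support = support({"onions", "potatoes", "burger"}, transactions)
rule_confidence = confidence({"onions", "potatoes"}, {"burger"}, transactions)
print(rule_support, rule_confidence)
```

Here the rule {onions, potatoes} -> {burger} holds in 2 of the 3 baskets that contain onions and potatoes, so its confidence is 2/3, while its support over all 4 baskets is 0.5.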
3) Sequential mining:
Sequential pattern mining is a topic of data mining concerned with finding
statistically relevant patterns between data examples where the values are
delivered in a sequence.[1] It is usually presumed that the values are
discrete, and thus time series mining is closely related, but usually
considered a different activity. Sequential pattern mining is a special case
of structured data mining.
There are several key traditional computational problems addressed within
this field. These include building efficient databases and indexes for
sequence information, extracting the frequently occurring patterns, comparing
sequences for similarity, and recovering missing sequence members. In general,
sequence mining problems can be classified as string mining, which is
typically based on string processing algorithms, and itemset mining, which is
typically based on association rule learning.
String mining typically deals with a limited alphabet for items that appear
in a sequence, but the sequence itself may be very long. Examples of an
alphabet are the ASCII character set used in natural language text, the
nucleotide bases 'A', 'G', 'C' and 'T' in DNA sequences, and the amino acids
in protein sequences. In biology applications, analysis of the arrangement of
the alphabet in strings can be used to examine gene and protein sequences to
determine their properties. Knowing the sequence of letters of a DNA or a
protein is not an ultimate goal in itself. Rather, the major task is to
understand the sequence in terms of its structure and biological function.
This is typically achieved first by identifying individual regions or
structural units within each sequence and then assigning a function to each
structural unit. In many cases this requires comparing a given sequence with
previously studied ones. The comparison between the strings becomes
complicated when insertions, deletions and mutations occur in a string.
String mining problems include:
- Repeat-related problems, which deal with operations on single sequences and
can be based on exact or approximate string matching methods for finding
dispersed fixed-length and maximal-length repeats, finding tandem repeats, and
finding unique subsequences and missing (un-spelled) subsequences.
- Alignment problems, which deal with comparison between strings by first
aligning one or more sequences. Examples of popular methods include BLAST, for
comparing a single sequence with multiple sequences in a database, and
ClustalW, for multiple alignments. Alignment algorithms can be based on either
exact or approximate methods, and can also be classified as global,
semi-global and local alignments.
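To illustrate approximate comparison of strings in the presence of insertions, deletions and mutations, here is a minimal sketch of the Levenshtein (edit) distance. It is a simple global measure chosen for illustration only; it is not how BLAST or ClustalW work, and the function name is an assumption of this example.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match, or substitute (mutation)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))      # one deletion
print(edit_distance("GATTACA", "GATTTCA"))  # one substitution
```

A small distance suggests the two strings are closely related. Real alignment tools additionally use scoring matrices and gap penalties, which this sketch leaves out.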
Some problems in sequence mining lend themselves to discovering frequent
itemsets and the order in which they appear. For example, one may seek rules
of the form "if a {customer buys a car}, he or she is likely to {buy
insurance} within 1 week", or, in the context of stock prices, "if {Nokia up
and Ericsson up}, it is likely that {Motorola up and Samsung up} within 2
days". Traditionally, itemset mining is used in marketing applications for
discovering regularities between frequently co-occurring items in large
transactions. For example, by analysing transactions of customer shopping
baskets in a supermarket, one can produce a rule which reads "if a customer
buys onions and potatoes together, he or she is likely to also buy hamburger
meat in the same transaction".
4) GSP algorithm:
The GSP algorithm (Generalized Sequential Pattern algorithm) is an algorithm
used for sequence mining. The algorithms for solving sequence mining problems
are mostly based on the Apriori (level-wise) algorithm. One way to use the
level-wise paradigm is to first discover all the frequent items in a
level-wise fashion. This simply means counting the occurrences of all
singleton elements in the database. Then, the transactions are filtered by
removing the non-frequent items. At the end of this step, each transaction
consists of only the frequent elements it originally contained. This modified
database becomes the input to the GSP algorithm. This process requires one
pass over the whole database.
The GSP algorithm makes multiple database passes. In the first pass, all
single items (1-sequences) are counted. From the frequent items, a set of
candidate 2-sequences is formed, and another pass is made to identify their
frequency. The frequent 2-sequences are used to generate the candidate
3-sequences, and this process is repeated until no more frequent sequences
are found. There are two main steps in the algorithm.
- Candidate Generation. Given the set of frequent (k-1)-sequences F(k-1), the
candidates for the next pass are generated by joining F(k-1) with itself. A
pruning phase eliminates any sequence at least one of whose subsequences is
not frequent.
- Support Counting. Normally, a hash tree–based search is employed for
efficient support counting. Finally, non-maximal frequent sequences are
removed.
In pseudocode:
F1 = the set of frequent 1-sequences
k = 2
do while F(k-1) != Null
    Generate the candidate set Ck (set of candidate k-sequences)
    for all input sequences s in the database D do
        Increment the count of every a in Ck that s supports
    end do
    Fk = {a ∈ Ck such that its frequency exceeds the threshold}
    k = k + 1
end do
Result = the set of all frequent sequences, i.e. the union of all Fk
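The level-wise loop can be sketched in Python as follows. This is a simplified illustration, not the full GSP algorithm, and it makes several assumptions:

- Every event holds a single item, so a sequence is just a tuple of items.
- Support is counted by a plain scan rather than a hash tree.
- The threshold is treated as inclusive (support >= min_support).
- Non-maximal sequences are not removed at the end.

The function names are chosen for this example.

```python
def is_subsequence(candidate, sequence):
    """True if the items of candidate occur in sequence in the same order."""
    remaining = iter(sequence)
    # "item in remaining" advances the iterator, so order is respected.
    return all(item in remaining for item in candidate)


def gsp(database, min_support):
    """Return {pattern: support} for every pattern with support >= min_support."""
    def count(pattern):
        return sum(is_subsequence(pattern, s) for s in database)

    # Pass 1: frequent 1-sequences.
    items = sorted({item for s in database for item in s})
    frequent = {(i,): count((i,)) for i in items if count((i,)) >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join: a and b combine when a without its first item
        # equals b without its last item.
        candidates = {a + b[-1:] for a in prev for b in prev if a[1:] == b[:-1]}
        # Prune: dropping any single item must leave a frequent (k-1)-sequence.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in prev for i in range(k))}
        # Support counting: one pass over the database per level.
        frequent = {}
        for c in sorted(candidates):
            support = count(c)
            if support >= min_support:
                frequent[c] = support
        result.update(frequent)
        k += 1
    return result


# Hypothetical database of four customer sequences.
database = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "c")]
patterns = gsp(database, min_support=2)
print(patterns)
```

With this data the item "d" is dropped in the first pass, and the only frequent 3-sequence is ("a", "b", "c"), supported by two of the four sequences.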
5) SPADE:
Frequent Sequence Mining is used to discover a set of patterns shared among
objects which have a specific order between them. For instance, a retail shop
may possess a transaction database which specifies which products were
acquired by each customer over time. In this case, the store may use Frequent
Sequence Mining to find that 40% of the people who bought the first volume of
Lord of the Rings came back to buy the second volume a month later. This kind
of information may be used to support directed advertising campaigns or
recommendation systems.
Another application domain where Frequent Sequence Mining may be used is Web
click log analysis in Information Retrieval systems, in which case the system
performance may be refined by analyzing the sequence of interactions that the
user performed while searching or browsing for specific information. This
kind of usage becomes especially clear when we consider the huge amount of
data obtained by industrial search engines in the form of query logs. Google
alone was reported to answer 5.42 billion queries during December 2008
(Telecom Paper, 2009).
A sequence α is an ordered list of events <a1, a2, ..., am>. An event is a
non-empty unordered set of items ai ⊆ {i1, i2, ..., ik}. A sequence
α = <a1, a2, ..., am> is a subsequence of β = <b1, b2, ..., bn> if and only
if there exist indices j1, j2, ..., jm such that 1 ≤ j1 < j2 < ... < jm ≤ n
and a1 ⊆ b_j1, a2 ⊆ b_j2, ..., am ⊆ b_jm. Given a sequence database
D = {s1, s2, ..., sn}, the support of a sequence α is the number of sequences
of D which contain α as a subsequence. If the support of α is bigger than a
threshold minsup, then α is a frequent sequence (Peng and Liao, 2009).
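The subsequence test and the support count defined above can be sketched directly. In this illustration each event is a Python set of items and a sequence is a list of events. The function names and the small database are assumptions made for the example.

```python
def contains(alpha, beta):
    """True if alpha is a subsequence of beta.

    Each event of alpha must be a subset of a distinct event of beta,
    and the matched events of beta must appear in the same order.
    """
    j = 0
    for event in alpha:
        # Greedily take the earliest event of beta that covers this event.
        while j < len(beta) and not event <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next event of alpha must match strictly later
    return True


def support(alpha, database):
    """Number of sequences in the database that contain alpha."""
    return sum(contains(alpha, s) for s in database)


# Hypothetical sequence database with three sequences.
D = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}, {"c", "d"}],
]

print(support([{"a"}, {"c"}], D))
print(support([{"b"}, {"a"}], D))
```

Matching greedily on the earliest possible event is sufficient here: if any valid set of indices exists, choosing the earliest match at each step never rules out a later one.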
Algorithm
An algorithm for Frequent Sequence Mining is the SPADE (Sequential PAttern
Discovery using Equivalence classes) algorithm. It uses a vertical id-list
database format, where we associate to each sequence a list of objects in
which it occurs. Then, frequent sequences can be found efficiently using
intersections on id-lists. The method also reduces the number of database
scans, and therefore also reduces the execution time.
The first step of SPADE is to compute the frequencies of 1-sequences, which
are sequences with only one item. This is done in a single database scan. The
second step consists of counting 2-sequences. This is done by transforming
the vertical representation into a horizontal representation in memory, and
counting the number of sequences for each pair of items using a
two-dimensional matrix. Therefore, this step can also be executed in only one
scan.
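The id-list idea can be illustrated with a small sketch. It is not the full SPADE algorithm: there are no equivalence classes and no in-memory horizontal transformation. It only builds the vertical format and joins two id-lists to get the support of a 2-sequence <x, y>. Using the position of an event within its sequence as the event id is an assumption of this example, as are the function names.

```python
from collections import defaultdict


def vertical_format(database):
    """Map each item to its id-list: a set of (sequence id, event id) pairs."""
    id_lists = defaultdict(set)
    for sid, sequence in enumerate(database):
        for eid, event in enumerate(sequence):
            for item in event:
                id_lists[item].add((sid, eid))
    return id_lists


def temporal_join(list_x, list_y):
    """Id-list of <x, y>: occurrences of y after some x in the same sequence."""
    first_x = {}
    for sid, eid in list_x:
        first_x[sid] = min(eid, first_x.get(sid, eid))
    return {(sid, eid) for sid, eid in list_y
            if sid in first_x and eid > first_x[sid]}


def support(id_list):
    """Number of distinct sequences that appear in an id-list."""
    return len({sid for sid, _ in id_list})


# Hypothetical sequence database: each sequence is a list of item sets.
database = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}, {"c", "d"}],
]

id_lists = vertical_format(database)
print(support(id_lists["a"]))                                 # 1-sequence <a>
print(support(temporal_join(id_lists["a"], id_lists["c"])))   # 2-sequence <a, c>
```

Once the id-lists are built from a single scan, the support of a longer pattern can be obtained by joining id-lists again, without going back to the original database.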
6) Spirit episode discovery:
7) Time series analysis:
A time series is a sequence of data points, measured typically at successive
points in time spaced at uniform time intervals. Examples of time series are
the daily closing value of the Dow Jones Industrial Average and the annual
flow volume of the Nile River at Aswan. Time series are very frequently
plotted via line charts. Time series are used in statistics, signal
processing, pattern recognition, econometrics, mathematical finance, weather
forecasting, earthquake prediction, electroencephalography, control
engineering, astronomy, communications engineering, and largely in any domain
of applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in
order to extract meaningful statistics and other characteristics of the data.
Time series forecasting is the use of a model to predict future values based
on previously observed values. While regression analysis is often employed in
such a way as to test theories that the current values of one or more
independent time series affect the current value of another time series, this
type of analysis of time series is not called "time series analysis", which
focuses on comparing values of a single time series or multiple dependent
time series at different points in time.[1]
Notation
A number of different notations are in use for time-series analysis. A common
notation specifying a time series X that is indexed by the natural numbers is
written
X = {X1, X2, ...}.
Another common notation is
Y = {Yt: t ∈ T},
where T is the index set.
Conditions
There are two sets of conditions under which much of the theory is built:
stationarity and ergodicity. However, ideas of stationarity must be expanded
to consider two important ideas: strict stationarity and second-order
stationarity. Both models and applications can be developed under each of
these conditions, although the models in the latter case might be considered
as only partly specified.
In addition, time-series analysis can be applied where the series are
seasonally stationary or non-stationary. Situations where the amplitudes of
frequency components change with time can be dealt with in time-frequency
analysis, which makes use of a time–frequency representation of a time series
or signal.[7]
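Models
The general representation of an autoregressive model, well known as AR(p),
is
X_t = c + φ_1 X_{t-1} + φ_2 X_{t-2} + ... + φ_p X_{t-p} + ε_t,
where c is a constant, φ_1, ..., φ_p are the model parameters, and ε_t is a
white-noise error term.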
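A minimal sketch of forecasting with an autoregressive model. It assumes the usual AR(p) form, in which the next value is a constant plus a weighted sum of the previous p values plus zero-mean noise. The coefficients are taken as given (estimating them from data is not shown), and the function and parameter names are assumptions of this example.

```python
def ar_forecast(history, coefficients, intercept=0.0, steps=1):
    """Forecast future values of an AR(p) series.

    coefficients[0] weights the most recent value, coefficients[1] the one
    before it, and so on. Future noise terms are replaced by their mean (zero).
    """
    p = len(coefficients)
    if len(history) < p:
        raise ValueError("need at least p past values")
    values = list(history)
    forecasts = []
    for _ in range(steps):
        recent = values[-p:][::-1]  # most recent value first
        nxt = intercept + sum(phi * x for phi, x in zip(coefficients, recent))
        values.append(nxt)          # later steps reuse earlier forecasts
        forecasts.append(nxt)
    return forecasts


# Hypothetical AR(2) model: next = 1.0 + 0.5 * last + 0.25 * second_last
print(ar_forecast([1.0, 2.0], [0.5, 0.25], intercept=1.0, steps=2))
```

For the history [1.0, 2.0] the first forecast is 1.0 + 0.5 * 2.0 + 0.25 * 1.0 = 2.25, and the second step feeds that forecast back in to give 2.625.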
8) Spatial mining:
Spatial data mining is the application of data
mining methods to spatial data. The end objective of spatial data mining is to
find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate
technologies, each with its own methods, traditions, and approaches to
visualization and data analysis. Particularly, most contemporary GIS have only
very basic spatial analysis functionality. The immense explosion in geographically
referenced data occasioned by developments in IT, digital mapping, remote
sensing, and the global diffusion of GIS emphasizes the importance of
developing data-driven inductive approaches to geographical analysis and
modeling.
Challenges in spatial mining: Geospatial data repositories tend to be very
large. Moreover, existing GIS datasets are often splintered into feature and
attribute components that are conventionally archived in hybrid data
management systems. Algorithmic requirements differ substantially for
relational (attribute) data management and for topological (feature) data
management.[48] Related to this is the range and diversity of geographic data
formats, which present unique challenges. The digital geographic data
revolution is creating new types of data formats beyond the traditional
"vector" and "raster" formats. Geographic data repositories increasingly
include ill-structured data, such as imagery and geo-referenced
multimedia.[49]
9) Spatial mining tasks:
Spatial data mining tasks include:
- Geo-spatial warehousing and OLAP
- Spatial data classification/predictive modeling
- Spatial clustering/segmentation
- Spatial association and correlation analysis
- Spatial regression analysis
- Time-related spatial pattern analysis: trends, sequential patterns, partial
periodicity analysis
- Many more to be explored
10) Spatial clustering:
Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more similar
(in some sense or another) to each other than to those in other groups
(clusters). It is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including
machine learning, pattern recognition, image analysis, information retrieval,
and bioinformatics.
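As an illustration of grouping spatial objects by similarity, here is a minimal k-means sketch on 2-D points. The notes do not name a specific clustering algorithm, so k-means is an assumption chosen for its simplicity. Similarity is taken to be Euclidean distance, and the starting centroids are supplied by the caller so the result is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids.

    Returns the final centroids and the clusters from the last assignment.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its old centroid.
        centroids = [
            (sum(px for px, _ in c) / len(c), sum(py for _, py in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical coordinates forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centers)
print(groups)
```

Points that lie close together end up in the same cluster. Density-based or hierarchical methods could be substituted where clusters are not roughly round or where the number of clusters is not known in advance.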
11) Data mining applications:
Data mining is widely used in diverse areas. A number of commercial data
mining systems are available today, yet there are many challenges in this
field. In this tutorial we will discuss the applications and trends of data
mining.
Data Mining Applications
Here is the list of areas where data mining is widely used:
· Financial Data Analysis
· Retail Industry
· Telecommunication Industry
· Biological Data Analysis
· Other Scientific Applications
· Intrusion Detection
FINANCIAL DATA ANALYSIS
The financial data in the banking and financial industry is generally
reliable and of high quality, which facilitates systematic data analysis and
data mining. Here is a typical case:
· Loan payment prediction and customer credit policy analysis.
RETAIL INDUSTRY
Data mining has great application in the retail industry because the industry
collects large amounts of data on sales, customer purchasing history, goods
transportation, consumption and services. It is natural that the quantity of
data collected will continue to expand rapidly because of the increasing
ease, availability and popularity of the web.
· Design and construction of data warehouses based on the benefits of data
mining.
· Multidimensional analysis of sales, customers, products, time and region.
· Analysis of the effectiveness of sales campaigns.
· Customer retention.
TELECOMMUNICATION INDUSTRY
Today the telecommunication industry is one of the most rapidly emerging
industries, providing various services such as fax, pager, cellular phone,
Internet messenger, images, e-mail, web data transmission etc. Due to the
development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has
become very important in helping to understand the business.
· Multidimensional analysis of telecommunication data.
· Fraudulent pattern analysis.
· Identification of unusual patterns.
BIOLOGICAL DATA ANALYSIS
Nowadays we see vast growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data
mining is a very important part of bioinformatics. The following are the
aspects in which data mining contributes to biological data analysis:
· Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
· Discovery of structural patterns and analysis of genetic networks and
protein pathways.
· Association and path analysis.
· Visualization tools in genetic data analysis.