ADVANCED DATA MINING UNIT 5



UNIT 5
1) Temporal and spatial data mining:
Introduction:
One of the main unresolved problems that arise during the data
mining process is treating data that contains temporal information. In this case,
a complete understanding of the entire phenomenon requires that the data
should be viewed as a sequence of events.

Examples of temporal data include:
·         Stock market data
·         Robot sensors
·         Weather data
·         Biological data, e.g. monitoring fish populations
·         Network monitoring
·         Weblog data
·         Customer transactions

Temporal data have a unique structure:
·         High dimensionality
·         High feature correlation
2) Temporal association rules:
Mining association rules in transaction data is an important data mining problem. Association rules represent directed relationships amongst variables wherein the existence of some variables predicts the existence of other variables in a transaction ({X->Y}, for itemsets X and Y), with a certain amount of confidence.
Association rule mining algorithms: Several algorithms have been proposed for finding association rules in transaction datasets. Of these, the Apriori algorithm proposed by [Agrawal et al., 1994] is the most popular. The Apriori algorithm works in two steps:
1) Identification of frequent itemsets in the dataset. These are itemsets which occur in the dataset with a frequency greater than a threshold value called minimum support.
2) Derivation of association rules from these frequent itemsets which have a confidence greater than a threshold value called minimum confidence. The confidence of an association rule {X->Y} is the ratio of the probability of X and Y occurring together to the probability of X, i.e. the probability that the customer will purchase both X and Y given that he purchased X.
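To make support and confidence concrete, here is a minimal Python sketch over a small, invented transaction set; the transactions and the example rule {diapers -> beer} are purely illustrative.

# Toy transaction database (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """Confidence of the rule X -> Y, i.e. support(X and Y) / support(X)."""
    return support(set(X) | set(Y)) / support(X)

print(support({"diapers", "beer"}))       # support of the combined itemset: 0.6
print(confidence({"diapers"}, {"beer"}))  # confidence of {diapers} -> {beer}: 0.75

Apriori would then keep only the itemsets whose support exceeds the minimum support and only the rules whose confidence exceeds the minimum confidence.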

3) Sequential mining:
Sequential Pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.[1] It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.
There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning.
String mining typically deals with a limited alphabet for items that appear in a sequence, but the sequence itself is typically very long. Examples of an alphabet are the ASCII character set used in natural language text, the nucleotide bases 'A', 'G', 'C' and 'T' in DNA sequences, or amino acids for protein sequences. In biology applications, analysis of the arrangement of the alphabet in strings can be used to examine gene and protein sequences to determine their properties. Knowing the sequence of letters of a DNA or a protein is not an ultimate goal in itself. Rather, the major task is to understand the sequence, in terms of its structure and biological function. This is typically achieved by first identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between the strings becomes complicated when insertions, deletions and mutations occur in a string.
  • Repeat-related problems: that deal with operations on single sequences and can be based on exact string matching or approximate string matching methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
  • Alignment problems: that deal with comparison between strings by first aligning one or more sequences; examples of popular methods include BLAST for comparing a single sequence with multiple sequences in a database, and ClustalW for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See sequence alignment.
Some problems in sequence mining lend themselves to discovering frequent itemsets and the order in which they appear; for example, one may seek rules of the form "if a {customer buys a car}, he or she is likely to {buy insurance} within 1 week", or, in the context of stock prices, "if {Nokia up and Ericsson up}, it is likely that {Motorola up and Samsung up} within 2 days". Traditionally, itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions. For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a rule which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction".
4) GSP algorithm:
GSP Algorithm (Generalized Sequential Pattern algorithm) is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the a priori (level-wise) algorithm. One way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion. It simply means counting the occurrences of all singleton elements in the database. Then, the transactions are filtered by removing the non-frequent items. At the end of this step, each transaction consists of only the frequent elements it originally contained. This modified database becomes an input to the GSP algorithm. This process requires one pass over the whole database.
GSP Algorithm makes multiple database passes. In the first pass, all single items (1-sequences) are counted. From the frequent items, a set of candidate 2-sequences are formed, and another pass is made to identify their frequency. The frequent 2-sequences are used to generate the candidate 3-sequences, and this process is repeated until no more frequent sequences are found. There are two main steps in the algorithm.
  • Candidate Generation. Given the set of frequent (k-1)-sequences F(k-1), the candidates for the next pass are generated by joining F(k-1) with itself. A pruning phase eliminates any sequence at least one of whose subsequences is not frequent.
  • Support Counting. Normally, a hash tree–based search is employed for efficient support counting. Finally, non-maximal frequent sequences are removed.
    F1 = the set of frequent 1-sequences
    k = 2
    do while F(k-1) != Null
        Generate candidate sets Ck (set of candidate k-sequences) from F(k-1)
        For all input sequences s in the database D do
            Increment count of all a in Ck if s supports a
        End do
        Fk = {a ∈ Ck such that its frequency exceeds the threshold}
        k = k + 1
    End do
    Result = set of all frequent sequences (the union of all Fk)
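The pseudocode above can be turned into a runnable sketch. The version below is a simplification, not the full GSP algorithm: each sequence element is assumed to be a single item rather than an itemset, the candidate-pruning phase is omitted, and a naive subsequence scan replaces the hash-tree support counting. The toy database and min_support value are invented for illustration.

from collections import defaultdict

def is_subsequence(candidate, sequence):
    """True if `candidate` occurs in `sequence` in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in candidate)

def gsp(database, min_support):
    """Simplified level-wise GSP over sequences of single items; support is an absolute count."""
    # Pass 1: count single items and keep the frequent 1-sequences.
    counts = defaultdict(int)
    for seq in database:
        for item in set(seq):
            counts[item] += 1
    frequent = {(item,) for item, c in counts.items() if c >= min_support}
    result = set(frequent)

    while frequent:
        # Candidate generation: join (k-1)-sequences whose tails and heads overlap.
        candidates = {a + (b[-1],) for a in frequent for b in frequent if a[1:] == b[:-1]}
        # Support counting: one pass over the database per level (naive scan, no hash tree).
        counts = defaultdict(int)
        for seq in database:
            for cand in candidates:
                if is_subsequence(cand, seq):
                    counts[cand] += 1
        frequent = {cand for cand, c in counts.items() if c >= min_support}
        result |= frequent
    return result

db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c", "d"], ["b", "c"]]
print(gsp(db, min_support=3))   # {('a',), ('b',), ('c',), ('a', 'c'), ('b', 'c')}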

5) SPADE:
Frequent Sequence Mining is used to discover a set of patterns shared among objects which have between them a specific order. For instance, a retail shop may possess a transaction database which specifies which products were acquired by each customer over time. In this case, the store may use Frequent Sequence Mining to find that 40% of the people who bought the first volume of Lord of the Rings came back to buy the second volume a month later. This kind of information may be used to support directed advertising campaigns or recommendation systems.
Another application domain where frequent sequence mining may be used is Web click log analysis in information retrieval systems, where system performance may be refined by analyzing the sequence of interactions that the user performed while searching or browsing for specific information. This kind of usage becomes especially clear when we consider the huge amount of data obtained by industrial search engines in the form of query logs. Google alone was reported to answer 5.42 billion queries during December 2008 (Telecom Paper, 2009).
A sequence α is an ordered list of events <a1, a2, ..., am>. An event ai is a non-empty unordered set of items {i1, i2, ..., ik}. A sequence α = <a1, a2, ..., am> is a subsequence of β = <b1, b2, ..., bn> if and only if there exist integers 1 ≤ j1 < j2 < ... < jm ≤ n such that a1 ⊆ bj1, a2 ⊆ bj2, ..., am ⊆ bjm. Given a sequence database D = {s1, s2, ..., sn}, the support of a sequence α is the number of sequences of D which contain α as a subsequence. If the support of α is greater than a minimum support threshold minsup, then α is a frequent sequence (Peng and Liao, 2009).
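The subsequence definition can be expressed directly in code. The following is a small illustrative check, assuming each event is represented as a Python set of items; the example sequences are made up.

def is_subsequence(alpha, beta):
    """True if alpha = <a1,...,am> is a subsequence of beta = <b1,...,bn>:
    each event aj must be a subset of some later event bi, in order."""
    j = 0  # index of the next event of alpha still to be matched
    for event in beta:
        if j < len(alpha) and alpha[j] <= event:   # aj ⊆ bi
            j += 1
    return j == len(alpha)

alpha = [{"a"}, {"b", "c"}]
beta = [{"a", "d"}, {"e"}, {"b", "c", "f"}]
print(is_subsequence(alpha, beta))  # True: {a} ⊆ {a,d} and {b,c} ⊆ {b,c,f}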
Algorithm
One algorithm for frequent sequence mining is SPADE (Sequential PAttern Discovery using Equivalence classes). It uses a vertical id-list database format, where we associate with each sequence a list of the objects (input sequences and positions) in which it occurs. Then, frequent sequences can be found efficiently using intersections on id-lists. The method also reduces the number of database scans, and therefore also reduces the execution time.
The first step of SPADE is to compute the frequencies of 1-sequences, which are sequences with only one item. This is done in a single database scan. The second step consists of counting 2-sequences. This is done by transforming the vertical representation into a horizontal representation in memory and counting the number of sequences for each pair of items using a two-dimensional matrix. Therefore, this step can also be executed in only one scan.
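The vertical id-list layout can be sketched as follows: each item is mapped to the (sequence id, event position) pairs where it occurs, and the support of a 2-sequence is obtained by joining two id-lists. This is only an illustration of the data layout, not the full SPADE algorithm, and the toy database is invented.

from collections import defaultdict

# Toy horizontal database: sequence id -> list of events (each event is a set of items).
database = {
    1: [{"a"}, {"b"}, {"c"}],
    2: [{"a", "c"}, {"b"}],
    3: [{"b"}, {"a"}],
}

# Build the vertical id-lists: item -> list of (sequence id, event position).
id_lists = defaultdict(list)
for sid, events in database.items():
    for pos, event in enumerate(events):
        for item in event:
            id_lists[item].append((sid, pos))

def support_2seq(x, y):
    """Number of input sequences containing x followed later by y (temporal join of two id-lists)."""
    sids = set()
    for sid_x, pos_x in id_lists[x]:
        for sid_y, pos_y in id_lists[y]:
            if sid_x == sid_y and pos_x < pos_y:
                sids.add(sid_x)
    return len(sids)

print(support_2seq("a", "b"))  # sequences 1 and 2 contain <a -> b>, so the support is 2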
6) SPIRIT and episode discovery:
7) Time series analysis:
A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones Industrial Average and the annual flow volume of the Nile River at Aswan. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. While regression analysis is often employed in such a way as to test theories that the current values of one or more independent time series affect the current value of another time series, this type of analysis of time series is not called "time series analysis", which focuses on comparing values of a single time series or multiple dependent time series at different points in time.[1]
 Notation
A number of different notations are in use for time-series analysis. A common notation specifying a time series X that is indexed by the natural numbers is written
X = {X1, X2, ...}.
Another common notation is
Y = {Yt : t ∈ T},
where T is the index set.

Conditions

There are two sets of conditions under which much of the theory is built:
·         Stationary process
·         Ergodic process
However, ideas of stationarity must be expanded to consider two important ideas: strict stationarity and second-order stationarity. Both models and applications can be developed under each of these conditions, although the models in the latter case might be considered as only partly specified.
In addition, time-series analysis can be applied where the series are seasonally stationary or non-stationary. Situations where the amplitudes of frequency components change with time can be dealt with in time-frequency analysis which makes use of a time–frequency representation of a time-series or signal.[7]
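As a rough illustration of what second-order stationarity asks for, the sketch below compares the mean and variance of the two halves of a toy series; the series (white noise plus a linear trend, which is clearly non-stationary) and the informal comparison are invented for illustration and are not a formal stationarity test.

import random
import statistics

random.seed(0)

# Toy series: white noise plus a linear trend, hence non-stationary.
series = [0.05 * t + random.gauss(0, 1) for t in range(200)]

half = len(series) // 2
first, second = series[:half], series[half:]

# Under weak (second-order) stationarity the mean and variance do not drift over time,
# so the two halves should look similar; a large gap hints at non-stationarity.
print("mean:     %.2f vs %.2f" % (statistics.mean(first), statistics.mean(second)))
print("variance: %.2f vs %.2f" % (statistics.variance(first), statistics.variance(second)))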

Models

The general representation of an autoregressive model of order p, written AR(p), is
Xt = c + φ1·Xt-1 + φ2·Xt-2 + ... + φp·Xt-p + εt,
where c is a constant, φ1, ..., φp are the parameters of the model and εt is white noise.

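As an illustration of the AR(p) formula, the following sketch simulates an AR(2) process; the coefficients, the noise scale and the series length are arbitrary, made-up values.

import random

random.seed(42)

def simulate_ar(phi, c=0.0, n=100):
    """Simulate X_t = c + phi[0]*X_{t-1} + ... + phi[p-1]*X_{t-p} + eps_t with Gaussian noise."""
    p = len(phi)
    x = [0.0] * p  # zero initial values
    for _ in range(n):
        eps = random.gauss(0, 1)
        x.append(c + sum(phi[i] * x[-1 - i] for i in range(p)) + eps)
    return x[p:]

series = simulate_ar(phi=[0.6, -0.3])  # an AR(2) process with illustrative coefficients
print(series[:5])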
8) Spatial mining:
Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis. Particularly, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven inductive approaches to geographical analysis and modeling.

Challenges in spatial mining: Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management.[48] Related to this is the range and diversity of geographic data formats, which present unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats. Geographic data repositories increasingly include ill-structured data, such as imagery and geo-referenced multi-media.[49]


9) Spatial mining tasks:

Spatial data mining tasks include:
·         Geo-spatial warehousing and OLAP
·         Spatial data classification/predictive modeling
·         Spatial clustering/segmentation
·         Spatial association and correlation analysis
·         Spatial regression analysis
·         Time-related spatial pattern analysis: trends, sequential patterns, partial periodicity analysis
·         Many more to be explored

10) Spatial clustering:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
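One common way to cluster spatial point data is k-means (Lloyd's algorithm). The sketch below is a minimal from-scratch version run on a handful of invented 2-D coordinates; in practice a library implementation would be used, and density-based methods such as DBSCAN are often preferred for spatial data because they find arbitrarily shaped clusters.

import math
import random

random.seed(1)

# Toy spatial points (e.g. x/y map coordinates), invented so that two groups are obvious.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(points, k=2)
print(centroids)   # two centroids, one near (1, 1) and one near (5, 5)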
11) Data mining applications:
Data mining is widely used in diverse areas. There are a number of commercial data mining systems available today, yet there are many challenges in this field. In this tutorial we will discuss the applications and trends of data mining.

Data Mining Applications

Here is the list of areas where data mining is widely used:
·         Financial Data Analysis
·         Retail Industry
·         Telecommunication Industry
·         Biological Data Analysis
·         Other Scientific Applications
·         Intrusion Detection

FINANCIAL DATA ANALYSIS

The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Here are a few typical cases:
·         Loan payment prediction and customer credit policy analysis.

RETAIL INDUSTRY

Data mining has great application in the retail industry because the retail industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.
·         Design and Construction of data warehouses based on benefits of data mining.
·         Multidimensional analysis of sales, customers, products, time and region.
·         Analysis of effectiveness of sales campaigns.
·         Customer Retention.

TELECOMMUNICATION INDUSTRY

Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business.
·         Multidimensional Analysis of Telecommunication data.
·         Fraudulent pattern analysis.
·         Identification of unusual patterns.

BIOLOGICAL DATA ANALYSIS

Nowadays we see vast growth in the field of biology, such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. Following are the aspects in which data mining contributes to biological data analysis:
·         Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
·         Discovery of structural patterns and analysis of genetic networks and protein pathways.
·         Association and path analysis.
·         Visualization tools in genetic data analysis.

