Unit 4
1) Web and text mining: introduction
Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three different types: Web usage mining, Web content mining, and Web structure mining.
Web usage mining
Web usage mining is the process of extracting useful information from server logs, i.e. finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further depending on the kind of usage data considered:
· Web Server Data: The user logs are collected by the Web server. Typical data includes IP address, page reference and access time.
· Application Server Data: Commercial application servers have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.
· Application Level Data: New kinds of events can be defined in an application, and logging can be turned on for them, thus generating histories of these specially defined events.
It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the categories above.
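As a small sketch of how Web server data might be extracted, the following assumes log lines in the Common Log Format and pulls out the IP address, access time and page reference mentioned above (the pattern and field names are illustrative, not from any particular server):

```python
import re

# Common Log Format: host ident authuser [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d+)'
)

def parse_log_line(line):
    """Extract IP address, access time and page reference from one log line."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return {"ip": m.group("ip"), "time": m.group("time"), "page": m.group("page")}

line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(line)
```

Records parsed this way can then be grouped by IP or session to study browsing behavior.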
2) Web content mining
Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure of much of the ever-expanding information on the World Wide Web, such as hypertext documents, makes automated discovery, organization, and indexing difficult. Search tools such as Lycos, Alta Vista, WebCrawler, ALIWEB [6], MetaCrawler, and others provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.
3) Web structure mining
Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:
1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location.
2. Mining the document structure: analysis of the tree-like structure of a page to describe HTML or XML tag usage.
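The hyperlink analysis above treats the site as a directed graph. A minimal sketch, with made-up page names, builds that graph and counts incoming links per page, one of the simplest structural importance signals:

```python
from collections import defaultdict

# Hypothetical hyperlink data: (source page, target page) pairs.
links = [
    ("/home", "/about"),
    ("/home", "/products"),
    ("/blog", "/products"),
    ("/about", "/home"),
]

def build_link_graph(edges):
    """Represent the site as a directed graph: page -> set of linked pages."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    return graph

def in_degree(edges):
    """Count incoming hyperlinks per page."""
    counts = defaultdict(int)
    for _, dst in edges:
        counts[dst] += 1
    return dict(counts)

graph = build_link_graph(links)
popularity = in_degree(links)
```

Algorithms such as PageRank and HITS refine this idea by weighting links recursively rather than just counting them.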
4) Text mining:
Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.
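The first step described above, structuring the input text, often starts with tokenization and a word frequency distribution. A minimal sketch (real pipelines would add stemming, stop-word removal and linguistic features):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    """Word frequency distribution: the starting point of lexical analysis."""
    return Counter(tokenize(text))

doc = "Text mining turns text into data. Mining text means deriving patterns."
freqs = word_frequencies(doc)
```

The resulting counts are the "structured data" on which pattern discovery then operates.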
5) Unstructured text:
About Unstructured Text
Data mining algorithms act on data that is numerical or categorical. Numerical data is ordered. It is stored in columns that have a numeric data type, such as NUMBER or FLOAT. Categorical data is identified by category or classification. It is stored in columns that have a character data type, such as VARCHAR2 or CHAR.
Unstructured text data is neither numerical nor categorical. Unstructured text
includes items such as web pages, document libraries, Power Point
presentations, product specifications, emails, comment fields in reports, and
call center notes. It has been said that unstructured text accounts for more
than three quarters of all enterprise data. Extracting meaningful information
from unstructured text can be critical to the success of a business.
About Text Mining and Oracle Text
Text mining is the process of
applying data mining techniques to text terms, also called text features
or tokens. Text terms are words or groups of words that have been extracted
from text documents and assigned numeric weights. Text terms are the
fundamental unit of text that can be manipulated and analyzed.
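The numeric weights assigned to text terms are often computed with a TF-IDF-style scheme: frequent in the document, rare across the corpus. A minimal pure-Python sketch of this idea (not Oracle's actual weighting, which is configurable):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by term frequency x inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # number of documents containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * math.log(n / df[term]) for term in tf})
    return weights

docs = ["data mining", "text mining", "text analytics"]
w = tf_idf(docs)
```

Terms that appear in every document get weight zero, so only discriminating terms survive as features.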
Oracle Text is a database technology that provides term extraction, word and theme searching, and other utilities for querying text. When columns of text are present in the training data, Oracle Data Mining uses Oracle Text utilities and term-weighting strategies to transform the text for mining. Oracle Data Mining passes configuration information supplied by the user to Oracle Text and uses the results in the model creation process.
6) Episode rule discovery for text:
Discovering association rules is an important data mining problem. The problem was first defined in the context of market basket data, to identify customers' buying habits [1]. Our overall goal is to analyze event sequences, discover recurrent patterns of events, and generate sequential association rules. Our approach is based on the concept of representative association rules combined with event constraints.
A sequential dataset is normalized and then discretized by forming subsequences using a sliding window [5]. Using a sliding window of size ω, every normalized time stamp value x_t is used to compute each of the new sequence values y_{t-ω/2} to y_{t+ω/2}. Thus, the dataset is divided into segments, each of size ω. The discretized version of the time series is obtained by using some clustering algorithm and a suitable similarity measure. Each cluster identifier is an event type, and the set of cluster labels is the class of events E.
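The windowing-and-labeling step can be sketched as follows. The window size and centroids here are illustrative stand-ins; the paper obtains cluster centers with a clustering algorithm rather than fixing them by hand:

```python
def sliding_windows(series, w):
    """Split a (normalized) time series into overlapping windows of size w."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

def discretize(series, w, centroids):
    """Map each window to the label of its nearest centroid (by window mean).

    Each centroid label acts as an event type, so the output is the
    discretized event sequence over the class of events E."""
    events = []
    for win in sliding_windows(series, w):
        mean = sum(win) / w
        label = min(centroids, key=lambda c: abs(centroids[c] - mean))
        events.append(label)
    return events

series = [0.1, 0.2, 0.9, 1.0, 0.15]
events = discretize(series, w=2, centroids={"low": 0.0, "high": 1.0})
```

Episode rules are then mined over this event sequence rather than the raw values.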
Representative Episodal Association Rules
We use the set of frequent closed episodes FCE produced by the Gen-FCE algorithm to generate the representative episodal association rules that cover the entire set of association rules [4].
The cover of a rule r : X ⇒ Y, denoted by C(r), is the set of association rules that can be generated from r. That is, C(r : X ⇒ Y) = { X ∪ U ⇒ V | U, V ⊆ Y, U ∩ V = ∅, and V ≠ ∅ }. An important property of the cover operator stated in [4] is that if an association rule r has support s and confidence c, then every rule r′ ∈ C(r) has support at least s and confidence at least c.
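The cover definition above can be enumerated directly for small rules; the following sketch generates C(r) for a rule X ⇒ Y by iterating over disjoint subset pairs of Y:

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of a set, as frozensets (including the empty set)."""
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def cover(X, Y):
    """C(r : X => Y): all rules (X | U) => V with U, V subsets of Y,
    U and V disjoint, and V nonempty."""
    rules = set()
    for U in subsets(Y):
        for V in subsets(Y):
            if V and not (U & V):
                rules.add((frozenset(X) | U, V))
    return rules

rules = cover({"a"}, {"b", "c"})
```

For r : a ⇒ bc this yields five rules (a ⇒ b, a ⇒ c, a ⇒ bc, ac ⇒ b, ab ⇒ c), illustrating why representative rules alone can stand in for their whole cover.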
7) Hierarchy of categories:
To model a product category
hierarchy, this solution keeps each category in its own document that also has
a list of its ancestors or “parents.” This document uses music genres as the
basis of its examples:
Initial category hierarchy
Because these kinds of categories
change infrequently, this model focuses on the operations needed to keep the
hierarchy up-to-date rather than the performance profile of update operations.
Schema
This schema has the following properties:
- A single document represents each category in the hierarchy.
- An ObjectId identifies each category document for internal cross-referencing.
- Each category document has a human-readable name and a URL-compatible slug field.
- The schema stores a list of ancestors for each category to facilitate displaying a category and its ancestors using only a single query.
Consider the following prototype:
{ "_id" : ObjectId("4f5ec858eb03303a11000002"),
  "name" : "Modal Jazz",
  "parent" : ObjectId("4f5ec858eb03303a11000001"),
  "slug" : "modal-jazz",
  "ancestors" : [
    { "_id" : ObjectId("4f5ec858eb03303a11000001"),
      "slug" : "bop",
      "name" : "Bop" },
    { "_id" : ObjectId("4f5ec858eb03303a11000000"),
      "slug" : "ragtime",
      "name" : "Ragtime" } ]
}
Operations
Read and Display a Category
Querying
Use the following operation to read and display a category hierarchy. This query uses the slug field to return the category information and a "bread crumb" trail from the current category to the top-level category.
category = db.categories.find(
    {'slug': slug},
    {'_id': 0, 'name': 1, 'ancestors.slug': 1, 'ancestors.name': 1})
Indexing
Create a unique index on the slug field
with the following operation on the Python/PyMongo console:
>>> db.categories.ensure_index('slug', unique=True)
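Keeping the hierarchy up-to-date means recomputing a document's ancestors array whenever its parent changes. A minimal pure-Python sketch of that computation, using a dict of slug-to-parent links in place of lookups against the categories collection (the data is the example genre hierarchy above):

```python
def build_ancestors(category, parent_of):
    """Walk parent links to the root, returning the ancestor list
    ordered nearest-parent first, as stored in the documents above.

    parent_of maps each category slug to its parent slug (None at the root);
    it stands in for queries against the categories collection."""
    ancestors = []
    node = parent_of.get(category)
    while node is not None:
        ancestors.append(node)
        node = parent_of.get(node)
    return ancestors

parent_of = {"modal-jazz": "bop", "bop": "ragtime", "ragtime": None}
trail = build_ancestors("modal-jazz", parent_of)
```

In the real schema this list would be written back into the category document with an update, and propagated to all descendants of the moved category.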
11) Text clustering:
Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document
organization, topic extraction and fast information
retrieval or filtering.
Document clustering involves the
use of descriptors and descriptor extraction. Descriptors are sets of words
that describe the contents within the cluster. Document clustering is generally
considered to be a centralized process. Examples of document clustering include
web document clustering for search users.
The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.
Clustering in search engines
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by enterprise search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, or open source software such as Carrot2.
Examples:
· Clustering divides the results of a search for "cell" into groups like "biology," "battery," and "prison."