MANOHAR RATHOD: Q & A on Cluster Analysis

Q Define Clustering?
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics.

---------------------------------------------------------------------------------------------------------
Q Different Applications of Cluster Analysis?
Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis.

Biology

In biology, clustering has many applications.

In imaging, data clustering may take different forms depending on the data dimensionality. For example, the SOCR EM Mixture model segmentation activity and applet shows how to obtain point, region or volume classification using the online SOCR computational libraries.

In the fields of plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.

In computational biology and bioinformatics:

In transcriptomics, clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.

In sequence analysis, clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.

In high-throughput genotyping platforms clustering algorithms are used to automatically assign genotypes.

In QSAR and molecular modeling studies, as well as in chemoinformatics.

Medicine, Psychology and Neuroscience

In medical imaging, such as PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three-dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate at which a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.

Market research

Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers.

Segmenting the market and determining target markets

Product positioning

New product development

Selecting test markets (see: experimental techniques)

Educational research

In educational research, the data for clustering can be students, parents, sex or test scores. Clustering is an important method for understanding the utility of clusters[4] in educational research. Cluster analysis in educational research can be used for data exploration,[5] cluster confirmation[6] and hypothesis testing.[6] Data exploration is used when there is little information about which schools or students will be grouped together.[5] It aims at discovering any meaningful clusters of units based on measures on a set of response variables. Cluster confirmation is used for confirming previously reported cluster results.[6] Hypothesis testing is used for arranging cluster structure.[6]

Example of cluster analysis in educational research

In 2002, Hattie used cluster analysis in the project 'Schools Like Mine' [7] to compare students’ achievement in literacy and numeracy by the type of school they attended. In the study, 2707 majority and minority students in New Zealand were classified into different clusters according to school size, student ethnicity, region, size of civil jurisdiction and socioeconomic status for comparison. The clusters in this research were calculated across five dimensions: decile, region, size, minority and rurality. All schools were placed into one of twenty clusters that are used in the asTTle software as a basis of student achievement comparison. The study analysed the power of socioeconomic status to describe schools and found it inadequate.

Euclidean distance was used as the distance measure for clustering in this research, so that schools that are alike were clustered together. In addition, a dendrogram was used. By clustering schools, Hattie suggested that school types had no significant relation with the performance of schools.
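
As a rough illustration of how Euclidean distance groups similar schools and of the merge order a dendrogram displays, here is a minimal Python sketch. The school names, the two profile variables and all numbers are made up for illustration, and single-linkage merging is an assumption; the source does not say which linkage rule the study used.

```python
import math
from itertools import combinations

# Hypothetical school profiles (made-up numbers): (decile, roll size in hundreds)
schools = {
    "A": (1.0, 2.0),
    "B": (1.5, 2.5),
    "C": (8.0, 9.0),
    "D": (9.0, 9.5),
}

def euclidean(p, q):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage_merges(points):
    """Repeatedly merge the two closest clusters and return the merge history."""
    def gap(pair):
        # Distance between two clusters = distance between their closest members.
        left, right = pair
        return min(euclidean(points[a], points[b]) for a in left for b in right)

    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=gap)
        merges.append((sorted(left), sorted(right), round(gap((left, right)), 3)))
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
    return merges

merges = single_linkage_merges(schools)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

Each printed line corresponds to one join in a dendrogram: the two low-decile schools merge first, then the two high-decile schools, and the two resulting groups merge last at a much larger distance.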

Common cluster techniques in educational research

All cluster techniques have two basic concerns: firstly, the measurement of similarity between individual profiles; and secondly, the use of that measure to form the groups or clusters. Brennan [8] described Iterative Relocation as the most important cluster technique in behavioral and educational research. It has been adopted at Lancaster to create typologies of pupils based on personality and behavioral items to identify types of students [9] and to isolate the skills considered to be important for certain grades of technologist in industry. Other similarity coefficients are available and the one chosen will depend upon the type of data gained.

Iterative Relocation proceeds through the following steps:

1. The number of groups needs to be decided. This is often an arbitrary decision, and the initial groupings are random.

2. The analysis then proceeds by computing the group profile (or group centroid) of each group, which is the cumulative frequencies of all variables measured. Each individual is compared with each of the group centroids. A number of formulae are available for measuring this similarity. Among others, Wishart [8] has found the error sum of squares, a measure of dissimilarity, to be one of the most successful coefficients for continuous data. Individuals are relocated to the group they most resemble, which alters the composition of the groups; once the group centroids have been recalculated, a new iteration cycle commences. This sequence of comparison and relocation continues until all individuals are in the group whose central profile is most similar to their own. The solution is then said to be stable.

3. The analysis is then continued by reducing the number of groups by one (N-1). This is achieved by a fusion process whereby a measure of dissimilarity (error sum of squares) between all pairs of group centroids is calculated again. The two most similar groups are then combined, reducing the number of groups by one.

4. The group centroids are recalculated and steps 2 and 3 are repeated until the solution is stable for the N-1 group level.

5. This process can continue until the two-group level is reached, at which point the analysis is complete.
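
The relocation and fusion procedure described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Brennan's or Wishart's implementation: the centroid is taken as the per-variable mean, squared Euclidean distance stands in for the error sum of squares, a deterministic round-robin start replaces the random initial grouping so the run is repeatable, and the data points are invented.

```python
def centroid(group):
    """Mean profile of a group of equal-length tuples (assumed centroid definition)."""
    return tuple(sum(col) / len(group) for col in zip(*group))

def sq_dist(p, q):
    """Squared Euclidean distance, used here as the dissimilarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def relocate_until_stable(groups):
    """Move each individual to the group with the nearest centroid until no one moves."""
    while True:
        centres = [centroid(g) for g in groups]
        new_groups = [[] for _ in groups]
        for g in groups:
            for person in g:
                nearest = min(range(len(centres)),
                              key=lambda i: sq_dist(person, centres[i]))
                new_groups[nearest].append(person)
        new_groups = [g for g in new_groups if g]  # drop groups that emptied out
        if sorted(map(sorted, new_groups)) == sorted(map(sorted, groups)):
            return new_groups  # stable: every individual stayed put
        groups = new_groups

def fuse_closest(groups):
    """Combine the two groups whose centroids are most similar."""
    centres = [centroid(g) for g in groups]
    pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
    i, j = min(pairs, key=lambda ij: sq_dist(centres[ij[0]], centres[ij[1]]))
    merged = groups[i] + groups[j]
    return [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]

def iterative_relocation(data, n_groups):
    """Return {number_of_groups: stable grouping}, from n_groups down to two."""
    # Round-robin start: a deterministic stand-in for the random initial grouping.
    groups = [data[i::n_groups] for i in range(n_groups)]
    solutions = {}
    while True:
        groups = relocate_until_stable(groups)
        solutions[len(groups)] = groups
        if len(groups) <= 2:
            return solutions
        groups = fuse_closest(groups)

# Invented two-variable profiles forming three loose groups.
data = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0),
        (1.2, 0.8), (5.2, 4.9), (9.1, 1.2),
        (4.9, 5.3), (0.8, 1.1), (8.8, 0.9)]

solutions = iterative_relocation(data, 3)
for n, groups in sorted(solutions.items(), reverse=True):
    print(n, "groups:", groups)
```

In this run the initial grouping mixes two of the three natural groups, one relocation pass untangles them, and the fusion step then combines the two groups with the closest centroids to give the two-group solution.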

Advantages of cluster analysis

Frisvad of BioCentrum-DTU said that cluster analysis is a good way to review data quickly, especially if the objects are classified into many groups.[10] In the ‘Schools Like Mine’ example,[7] 23 clusters of schools with different properties were clearly distinguished. It is easy for users to assign or nominate themselves into the cluster they would most like to compare with in a school cluster database[7] because each cluster is clearly named in understandable terms.

Cluster Analysis provides a simple profile of individuals.[7] It starts from a number of analysis units, each of which is described by a set of characteristics and attributes; in this example these are school size, student ethnicity, region, size of civil jurisdiction and socioeconomic status. Cluster Analysis also suggests how groups of units are determined such that units within groups are similar in some respect and unlike those from other groups.[11]

Disadvantages of cluster analysis

An object can be assigned to one cluster only.[7] For example, in 'Schools Like Mine', schools are automatically assigned into the first twenty-two clusters. However, if schools want to compare themselves with integrated schools, they will have to manually assign themselves into cluster twenty-three. Data-driven clustering may not represent reality, because once a school is assigned to a cluster, it cannot be assigned to another one. Some schools may have more than one significant property or fall on the edge of two clusters.[7]

Clustering may have detrimental effects on teachers who work in low-decile schools, students who are educated in them, and parents who support them, by telling them the schools are classified as ineffective, when in fact many are doing well in some unique aspects that are not sufficiently illustrated by the clusters formed.[7]

In k-means clustering methods, several analyses are often required before the number of clusters can be determined.[12] The result can also be very sensitive to the choice of initial cluster centres.[12]
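
The sensitivity to initial centres can be seen in a small, deliberately contrived Python sketch. The four points and both sets of starting centres are invented for illustration, and the `k_means` function is a plain textbook-style loop written for this example rather than any particular library's implementation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centres, max_iter=100):
    """Plain k-means from the given initial centres.

    Returns (final centres, total squared error)."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its cluster.
        new_centres = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    sse = sum(min(sq_dist(p, c) for c in centres) for p in points)
    return centres, sse

# Four corners of a wide rectangle: the natural split is left versus right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

_, sse_good = k_means(points, [(0.0, 0.5), (4.0, 0.5)])  # starts left/right
_, sse_bad = k_means(points, [(2.0, 0.0), (2.0, 1.0)])   # starts bottom/top

print("left/right start:", sse_good)
print("bottom/top start:", sse_bad)
```

Both runs are stable in the k-means sense, yet the second start is trapped in a top/bottom split with a much larger total squared error. This is one reason the analysis is commonly repeated from several starting points.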

Solution to problems of cluster analysis in educational research

Hattie stated that although cluster analysis provides an easy way to make comparisons between schools, no particular variable should be taken as the “short cut” for judging school quality.[7] In order to overcome the unit reassignment issue, some researchers suggest a nonhierarchical cluster method which allows units to be reassigned from one cluster to another. This operates through an iterative partitioning k-means algorithm, where k denotes the number of clusters.[6] Nevertheless, to conduct a k-means analysis, the number of clusters needs to be specified at the start. This limits the exploratory power of cluster analysis.

Cluster Analysis has to be very carefully used in classifying schools into groups because results are heavily influenced by partial sampling, choice of clustering criteria and compositional variables, as well as cluster labeling. Like assigning schools into different bands, clustering may bring about unnecessary comparisons and inappropriate discriminations among schools, thereby adversely affecting students.[7]

Other applications

Social network analysis

In the study of social networks, clustering may be used to recognize communities within large groups of people.

Software evolution

Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed. It is a form of restructuring and hence a form of direct preventative maintenance.

Image segmentation

Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.

Data mining

Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres.

Search result grouping

In the intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web-based clustering tools such as Clusty.

Slippy map optimization

Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This both makes the map faster and reduces the amount of visual clutter.
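
One simple way to reduce markers is to bucket points into grid cells and draw a single marker per occupied cell. The Python sketch below illustrates that idea only; it is an assumption for illustration, not a description of how Flickr or any specific map site implements it, and the coordinates and cell size are made up.

```python
from collections import defaultdict

def cluster_markers(points, cell_size):
    """Group (x, y) points by grid cell.

    Returns one (centre_x, centre_y, count) marker per occupied cell."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    markers = []
    for members in cells.values():
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        markers.append((cx, cy, len(members)))
    return markers

# Hypothetical photo positions in screen units.
photos = [(1, 1), (2, 3), (3, 2), (12, 11), (13, 14), (25, 5)]
markers = cluster_markers(photos, cell_size=10)
print(len(photos), "photos drawn as", len(markers), "markers:", sorted(markers))
```

A larger `cell_size` (for example at a lower zoom level) merges more photos into each marker, and a smaller one separates them again.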

IMRT segmentation

Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.

Grouping of Shopping Items

Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't have the concept of a SKU.)

Recommender systems

Recommender systems are designed to recommend new items based on a user's tastes. They sometimes use clustering algorithms to predict a user's preferences based on the preferences of other users in the user's cluster.
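
A minimal sketch of this idea in Python, assuming the users have already been assigned to clusters by some clustering algorithm. The user names, item names, ratings and cluster labels are all hypothetical, and averaging the ratings of cluster peers is just one simple prediction rule.

```python
def predict_rating(user, item, ratings, cluster_of):
    """Mean rating of `item` among the other users in `user`'s cluster.

    Returns None when nobody else in the cluster has rated the item."""
    peers = [u for u in ratings if u != user and cluster_of[u] == cluster_of[user]]
    scores = [ratings[u][item] for u in peers if item in ratings[u]]
    return sum(scores) / len(scores) if scores else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"film_a": 5, "film_b": 4},
    "bob": {"film_a": 4, "film_c": 5},
    "cat": {"film_a": 5, "film_c": 3},
    "dan": {"film_c": 1},
}
# Hypothetical cluster labels, assumed to come from an earlier clustering step.
cluster_of = {"ann": 0, "bob": 0, "cat": 0, "dan": 1}

print(predict_rating("ann", "film_c", ratings, cluster_of))
print(predict_rating("dan", "film_a", ratings, cluster_of))
```

Here the prediction for "ann" draws only on "bob" and "cat", who share her cluster, while "dan" gets no prediction because he has no cluster peers.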

Mathematical chemistry

Clustering is used to find structural similarity, among other things. For example, 3000 chemical compounds were clustered in the space of 90 topological indices.[13]

Climatology

Clustering is used to find weather regimes or preferred sea level pressure atmospheric patterns.[14]

Petroleum Geology

Cluster Analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.

Physical Geography

Clustering is applied to chemical properties at different sample locations.

Crime Analysis

Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.

Evolutionary algorithms

Clustering may be used to identify different niches within the population of an evolutionary algorithm so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies.
----------------------------------------------------------------------------------------------------------
Q