AliClu

Program description

AliClu is a Python implementation of a method that combines Temporal Needleman-Wunsch (TNW) [1] and agglomerative hierarchical clustering with validation [2]. The program receives a set of temporal sequences and outputs the final clusters. The program can be used in an automatic or semi-automatic manner. In the automatic mode, the program outputs the final clusters in separated CSV files and a pdf file with the cluster stability analysis. In the semi-automatic mode, the program first produces a pdf file with dendrograms, tables of averages of clustering indices and graphs with the standard deviation of Adjusted Rand. After analysing this pdf file, the user can input the final number of clusters and the program will output the final clusters and the cluster stability analysis. The python files required to run the program can be downloaded here.

Requirements

Install Python3 in your local machine
Install the required libraries by running the following code:
```
$ pip3 install -r requirements.txt
```

Note: In order to avoid library dependency problems it is recommend to use virtualenv, which is a tool to create isolated Python environments.

Run the following code for a linux machine:

$ pip3 install virtualenv		#to install virtualenv	
$ virtualenv [give_a_name]
$ source [give_a_name]/bin/activate	
$ pip3 install -r requirements.txt

Run the following code for a windows machine:

$ pip3 install virtualenv		#to install virtualenv	
$ virtualenv [give_a_name]
$ [give_a_name]\Scripts\activate	
$ pip3 install -r requirements.txt

Usage

Execute the python file AliClu.py:

$ python AliClu.py [-h] [-SS] [-g] [-tp] [-M] [-d] filename minClusters maxClusters automatic

The command-line options are:

positional arguments:
  filename		Input CSV file with temporal sequences for each patient.
  minClusters		Minimum number of clusters to be analysed.
  maxClusters		Maximum number of clusters to be analysed.
  automatic		1 to run AliClu automatically, 0 otherwise.

optional arguments:
  -h, --help			Show this help message and exit.
  -SS, --ScoringSystem		Pre-defined scoring system. To define a match and mismatch value for the scoring system,
				the argument should be "[match mismatch]".
  -g, --gap			Gap penalty. To analyse a range of gap penalties, the argument should be 
				"[min_gap max_gap step_gap]" where min_gap and max_gap are the minimum and
				maximum gap penalties to be analysed and step_gap is the space between 
				the gap values. To analyse only one gap value simply write "[gap_value]".
				Default the gap ranges from -1 to 1 in steps of 0.1.
  -tp, --temporalPenaltyConstant	
				Temporal Penalty Constant. Default tp is 0.25.
  -M, --bootStrapSamples	Number of bootstrap samples. Default M is 250.
  -d, --distanceMetric		Distance metric for agglomerative clustering. Only 5 distance metrics
				allowed: single, complete, average, centroid and ward. Default d is ward.

Input

The input file must be in comma-separated values (CSV) format, containing the temporal sequences for each subject.

The first line is the header with the names of the needed attributes. The header must be always "id_patient", "aux_encode".
The first column corresponds to the identifier of each subject whereas the second column refers to the temporal sequences in the prefix-encoded format.
All the other lines contains the temporal sequences for each subject.

Input file example:

id_patient, aux_encode
33496,"0.B,156.C,314.D,1696.Z"
33499,"0.B,148.Z"
33502,"0.A,120.B,2187.Z"
33505,"0.A,873.B,255.C,1145.Z"
33508,"0.A,358.B,420.C,963.D,239.Z"
33511,"0.A,2192.Z"		 
(...)

Another input file example is synthetic_temporal_sequences.csv. This file is used in Example 1 and Example 2 below.

Data preprocessing

Output

The program outputs the final clusters and the cluster stability analysis. The clusters are presented in separated CSV files in a new directory. For example, consider that the program yields 3 clusters. Then 3 CSV files will appear in a directory called '[distanceMetric]_gap_[chosen_gap]_Tp_[tp]_clusters_3':

Cluster 1 - N1 elements.csv
Cluster 2 - N2 elements.csv
Cluster 3 - N3 elements.csv

where [distanceMetric] is the distance metric used in agglomerative clustering, [chosen_gap] is the chosen gap penalty and [tp] the temporal penalty constant used in the program. N1, N2 and N3 are the number of elements in Cluster 1, 2 and 3, respectively.
The cluster stability analysis is presented in a pdf file named cluster_stability_analysis.pdf.

Example 1 - Automatic

In this example let us run AliClu in the automatic mode. The synthetic_temporal_sequences.csv is the input CSV file, which contains 3 clusters produced by 3 different continuous-time Markov Chains. Cluster 1 contain patients with id_patient from 0 to 9, Cluster 2 contain patients with id_patient from 10 to 19 and Cluster 3 has the remaining patients.

Let us choose 2 as the minimum number of clusters and 8 as the maximum number of clusters to be analysed. The command line to run AliClu and obtain the final clusters automatically is:

$ python AliClu.py synthetic_temporal_sequences.csv 2 8 1

Note that no optional argument was used, hence, all the other parameters used in AliClu are set to default. The program correctly founds the 3 clusters and outputs the following:

A directory 'ward_gap_0.8_Tp_0.25_clusters_3' is created with the final clusters in 3 separated CSV files (Cluster 1 - 10 elements.csv, Cluster 2 - 10 elements.csv, Cluster 3 - 10 elements.csv).
cluster_stability_analysis.pdf, with the cluster stability analysis.
full_analysis.pdf, with dendrograms, tables of averages of clustering indices and graphs with the standard deviation of Adjusted Rand. This file is used for consultation and it is the same as the semi_automatic_analysis.pdf presented in the next example.

Example 2 - Semi-automatic

In this example let us run AliClu in the semi-automatic mode. The input CSV file is again synthetic_temporal_sequences.csv and we choose 2 as the minimum number of clusters and 8 as the maximum number of clusters to be analysed.

The command line to run AliClu in the semi-automatic mode is:

$ python AliClu.py synthetic_temporal_sequences.csv 2 8 0

which will give the following in the terminal:

The program is asking the user to input the final number of clusters. In this stage, the program produced a pdf file named semi_automatic_analysis.pdf that is meant to auxiliate the user on making the final decision on the number of clusters. The pdf file (download for better visualization in your local pdf viewer) contains:

Note that pdf file does not contain the analysis for gap values from -1 to -0.5. The reason for this is that the pairwise alignments obtained with TNW for these gap values yield the same score and therefore it was decided to not perform agglomerative clustering.

After analysing the pdf file, the user chooses 3 as the final number of clusters:

Since more than one gap penalty was analysed, the user also has to input the chosen gap penalty. If only one gap penalty was analysed by the program it would not be necessary to input this information. Let us choose again 0.8:

The program ends and produces again the same outputs as in Example 1.

References

H. Syed and A. K. Das. Temporal Needleman-Wunsch. IEEE International Conference on Data Science and Advanced Analysis (DSAA), Dec. 2015.
H.-J. Mucha. On validation of hierarchical clustering. In R. Decker and H. J. Lenz, editors, Advances in Data Analysis, pages 115-122, Berlin, Heidelberg, 20017. Springer Berlin Heidelberg