Time-series clustering is a special type of clustering. A sequence of nominal symbols drawn from a particular alphabet is usually called a temporal sequence, whereas a sequence of continuous, real-valued elements is known as a time-series. A time-series is classified as dynamic data because its feature values change as a function of time: each point of a time-series holds one or more observations made in chronological order. Time-series data is a type of temporal data that is naturally high dimensional and large in size. Such data are of interest because of their ubiquity in areas ranging from science, engineering, business, finance, and economics to healthcare and government. Although each time-series consists of a large number of data points, it can also be treated as a single object. Clustering such complex objects is particularly advantageous because it leads to the discovery of interesting patterns in time-series datasets. Because these patterns can be either frequent or rare, several research challenges have arisen, such as developing methods to recognize dynamic changes in time-series, anomaly and intrusion detection, process control, and character recognition. To highlight the importance of and the need for clustering time-series datasets, the objectives of time-series clustering, which may overlap, are given as follows:
- Time-series databases contain valuable information that can be obtained through pattern discovery. Clustering is a common solution performed to uncover these patterns on time-series datasets.
- Time-series databases are very large and cannot be handled well by human inspectors. Hence, many users prefer to deal with structured datasets rather than very large datasets. As a result, time-series data are represented as a set of groups of similar time-series by aggregation of data in nonoverlapping clusters or by a taxonomy as a hierarchy of abstract concepts.
- Time-series clustering is the most-used approach as an exploratory technique, and also as a subroutine in more complex data mining algorithms, such as rule discovery, indexing, classification, and anomaly detection.
- Representing time-series cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of data, clusters, anomalies, and other regularities in datasets.
Time-series clustering is challenging for three main reasons. First, time-series data are often far larger than main memory and are consequently stored on disk, which slows the clustering process considerably. Second, time-series data are often high dimensional, which makes them difficult for many clustering algorithms to handle and further slows clustering. Third, there is the choice of the similarity measure used to form the clusters. Forming clusters requires finding similar time-series, which in turn requires time-series similarity matching: the process of calculating the similarity between whole time-series using a similarity measure. This process is also known as “whole sequence matching”, in which the whole lengths of the time-series are considered during distance calculation. The process is complicated because time-series data are naturally noisy and include outliers and shifts; in addition, the lengths of time-series vary, yet the distance between them still has to be calculated. These common issues have made the similarity measure a major challenge for data miners.
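Whole sequence matching can be illustrated with a minimal sketch. It assumes the Euclidean distance as the similarity measure and series of equal length, both chosen here only for illustration; the function names `euclidean_distance` and `distance_matrix` are hypothetical. The length check makes the varying-length problem described above concrete: this simple measure cannot compare series of different lengths.

```python
import math


def euclidean_distance(a, b):
    """Whole-sequence Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        # Assumption of this sketch: point-by-point matching needs equal lengths.
        raise ValueError("Euclidean matching needs series of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(series):
    """Symmetric matrix of pairwise distances between all series."""
    n = len(series)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean_distance(series[i], series[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy data: two rising series and one falling series.
series = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 5.0],
    [4.0, 3.0, 2.0, 1.0],
]
matrix = distance_matrix(series)
```

In this toy data the two rising series end up much closer to each other than either is to the falling one, which is the kind of structure a clustering algorithm would then exploit.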
Applications of time-series clustering
Clustering of time-series data is mostly used to discover interesting patterns in time-series datasets. This task falls into two categories: the first covers methods that find patterns appearing frequently in the dataset, and the second covers methods that discover patterns occurring surprisingly in the dataset. In brief, finding clusters of time-series can help answer the following real-world problems in different domains:
1- Anomaly, novelty or discord detection: discovering unusual and unexpected patterns that occur surprisingly in datasets. For example, in sensor databases, the time-series produced by the sensor readings of a mobile robot can be clustered in order to discover events.
2- Recognizing dynamic changes in time-series: detecting correlation between time-series. For example, in financial databases, it can be used to find companies with similar stock price movements.
3- Prediction and recommendation: a hybrid technique that combines clustering with function approximation per cluster can help users predict and recommend. For example, in scientific databases, it can address problems such as finding patterns of solar magnetic wind in order to predict today’s pattern.
4- Pattern discovery: discovering interesting patterns in databases. For example, in a marketing database, different daily sales patterns of a specific product in a store can be discovered.
Whole time-series clustering is the clustering of a set of individual time-series with respect to their similarity. Here, clustering means applying (usually conventional) clustering to discrete objects, where the objects are time-series. Subsequence clustering means clustering a set of subsequences of a time-series that are extracted via a sliding window, that is, clustering segments from a single long time-series. Time point clustering is another category that appears in some papers. It is the clustering of time points based on a combination of their temporal proximity and the similarity of the corresponding values. This approach is similar to time-series segmentation. It differs from segmentation, however, in that not all points need to be assigned to clusters, i.e., some of them are considered noise.
Essentially, subsequence clustering is performed on a single time-series, and Keogh and Lin (2005) argued that this type of clustering is meaningless. Time point clustering is also applied to a single time-series and is similar to time-series segmentation, as its objective is to find clusters of time points rather than clusters of time-series.
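The sliding-window extraction that subsequence clustering relies on can be sketched as follows. This is only an illustration of the extraction step, not a recommendation of the technique; the function name `sliding_windows` and the `width` and `step` parameters are hypothetical names introduced for this example.

```python
def sliding_windows(series, width, step=1):
    """Return every subsequence of `width` points, moving `step` points at a time."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    last_start = len(series) - width
    return [series[start:start + width] for start in range(0, last_start + 1, step)]


# A single long time-series (toy data) cut into overlapping segments.
long_series = [0, 1, 2, 3, 4, 5]
subsequences = sliding_windows(long_series, width=3)
sparser = sliding_windows(long_series, width=3, step=2)
```

With a step of one, consecutive windows share all but one point, so the extracted subsequences overlap heavily.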
In the shape-based approach, the shapes of two time-series are matched as well as possible by non-linear stretching and contracting of the time axes. This approach has also been labelled a raw-data-based approach because it typically works directly with the raw time-series data. Shape-based algorithms usually employ conventional clustering methods designed for static data, with the distance/similarity measure replaced by one appropriate for time-series. In the feature-based approach, the raw time-series are converted into feature vectors of lower dimension, and a conventional clustering algorithm is then applied to the extracted feature vectors. Usually in this approach, an equal-length feature vector is calculated from each time-series, and the Euclidean distance is measured between the vectors. In model-based methods, a raw time-series is transformed into model parameters (a parametric model for each time-series), and then a suitable model distance and a clustering algorithm (usually a conventional one) are chosen and applied to the extracted model parameters. However, it has been shown that model-based approaches usually have scalability problems and that their performance degrades when the clusters are close to each other. A review of the existing literature suggests that time-series clustering essentially has five components: dimensionality reduction or representation method, distance measurement, clustering algorithm, prototype definition, and evaluation.
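As an illustration of shape-based matching, the sketch below implements dynamic time warping (DTW), a measure the text above does not name but which is commonly used for non-linear stretching and contracting of the time axis. It is a minimal, unoptimized version that assumes the absolute difference as the point-wise cost and applies no warping-window constraint; the function name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as point cost."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] repeats the match with b[j-1]
                cost[i][j - 1],      # b[j-1] repeats the match with a[i-1]
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


# Same shape, different lengths: one point of the second series is stretched.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# Same peak, shifted in time.
shifted = dtw_distance([0, 0, 1, 0], [0, 1, 0, 0])
# Different levels: no warping can remove the offset completely.
offset = dtw_distance([1, 2, 3], [2, 3, 4])
```

Unlike a point-by-point Euclidean comparison, this measure accepts series of different lengths and assigns zero distance to the stretched and shifted pairs, at the price of quadratic time and memory in the series lengths.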
The general time-series clustering process uses some or all of these components, depending on the problem. Usually, the data are first approximated using a representation method so that they fit in memory. A clustering algorithm is then applied to the data using a distance measure. During clustering, a prototype is usually required to summarize the time-series in each cluster. Finally, the clusters are evaluated using evaluation criteria.
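The process above can be sketched end to end on toy data. Every concrete choice here is an assumption made for illustration and is not prescribed by the text: piecewise aggregate approximation (PAA) as the representation, the Euclidean distance as the measure, k-means as the clustering algorithm, the point-wise mean as the prototype, and the sum of squared errors as the evaluation criterion. All function names are hypothetical.

```python
import math


def paa(series, segments):
    """Representation: mean of each of `segments` equal-sized chunks."""
    if len(series) % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    size = len(series) // segments
    return [sum(series[i:i + size]) / size for i in range(0, len(series), size)]


def euclidean(a, b):
    """Distance measure between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_prototype(members):
    """Prototype: point-wise mean of the cluster members."""
    return [sum(column) / len(members) for column in zip(*members)]


def kmeans(vectors, k, iterations=10):
    """Clustering: plain k-means, initialised with the first k vectors."""
    prototypes = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: euclidean(v, prototypes[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old prototype if a cluster is empty
                prototypes[c] = mean_prototype(members)
    return labels, prototypes


def sse(vectors, labels, prototypes):
    """Evaluation: sum of squared distances to the assigned prototype."""
    return sum(euclidean(v, prototypes[lab]) ** 2 for v, lab in zip(vectors, labels))


raw = [
    [1, 1, 2, 2, 3, 3, 4, 4],  # rising
    [8, 8, 6, 6, 4, 4, 2, 2],  # falling
    [1, 2, 2, 3, 3, 4, 4, 5],  # rising
    [9, 7, 7, 5, 5, 3, 3, 1],  # falling
]
reduced = [paa(s, segments=4) for s in raw]
labels, prototypes = kmeans(reduced, k=2)
error = sse(reduced, labels, prototypes)
```

Each stage is a separate function, so any one of them can be replaced without touching the others, for example swapping the Euclidean distance for an elastic measure or the mean for another prototype definition.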