Efficient clustering algorithm for large datasets

Bibliographic Details
Main Author: Chen, Fangying.
Other Authors: Chen Lihui
Format: Final Year Project
Language: English
Published: 2010
Online Access: http://hdl.handle.net/10356/40791
Institution: Nanyang Technological University
Description
Summary: In data mining, clustering is used to identify interesting distributions and to discover groups in the underlying data. Traditional clustering algorithms either favor clusters of similar size and spherical shape or are very sensitive to outliers. CURE (Clustering Using REpresentatives), an algorithm proposed by Guha, Rastogi and Shim, alleviates both shortcomings by representing each cluster with a constant number of well-scattered points drawn from the cluster, which are then shrunk toward the cluster center by a specified fraction. To keep up with the rapid growth of databases, CURE combines two techniques, random sampling and partitioning, both of which reduce the input to the clustering process so that it fits in main memory. High-dimensional data is now common in real-life applications such as web documents, transaction data and gene expression data, so there is a pressing need for efficient high-dimensional clustering. The objective of this Final Year Project is to implement CURE in Java. The algorithm is first implemented for low-dimensional data and tested on sample datasets; a series of simulations with different parameter settings is carried out and a parameter sensitivity analysis is performed. Once verified on low-dimensional data, the program is modified to handle high-dimensional data, tested on high-dimensional sample datasets, and subjected to a similar parameter analysis. The implementation details, test results and performance evaluation are reported.
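The representative-point step described in the summary can be illustrated with a minimal Java sketch. This is not the project's own code, which is not reproduced in this record. The class name `CureRepresentatives`, the method names, and the parameters `c` (number of representatives) and `alpha` (shrink fraction) are all assumptions made for illustration, as is the use of Euclidean distance. The sketch picks well-scattered points with a farthest-point heuristic and then moves each one toward the cluster centroid by the fraction `alpha`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical.
final class CureRepresentatives {

    /** Component-wise mean of all points in the cluster. */
    static double[] centroid(List<double[]> cluster) {
        int dim = cluster.get(0).length;
        double[] mean = new double[dim];
        for (double[] p : cluster) {
            for (int d = 0; d < dim; d++) mean[d] += p[d];
        }
        for (int d = 0; d < dim; d++) mean[d] /= cluster.size();
        return mean;
    }

    /** Euclidean distance (an assumption; other metrics could be used). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Picks up to c well-scattered points: the first is the point farthest
     * from the centroid, each later one is farthest from those already chosen.
     */
    static List<double[]> scatteredPoints(List<double[]> cluster, int c) {
        double[] mean = centroid(cluster);
        List<double[]> chosen = new ArrayList<>();
        int limit = Math.min(c, cluster.size());
        while (chosen.size() < limit) {
            double[] best = null;
            double bestDist = -1.0;
            for (double[] p : cluster) {
                if (chosen.contains(p)) continue; // arrays compare by identity
                double dist;
                if (chosen.isEmpty()) {
                    dist = distance(p, mean);
                } else {
                    dist = Double.MAX_VALUE;
                    for (double[] q : chosen) dist = Math.min(dist, distance(p, q));
                }
                if (dist > bestDist) {
                    bestDist = dist;
                    best = p;
                }
            }
            if (best == null) break; // no unchosen point left
            chosen.add(best);
        }
        return chosen;
    }

    /** Moves each point toward the centroid by the fraction alpha (0..1). */
    static List<double[]> shrink(List<double[]> points, double[] mean, double alpha) {
        List<double[]> shrunk = new ArrayList<>();
        for (double[] p : points) {
            double[] s = new double[p.length];
            for (int d = 0; d < p.length; d++) s[d] = p[d] + alpha * (mean[d] - p[d]);
            shrunk.add(s);
        }
        return shrunk;
    }

    public static void main(String[] args) {
        // Four corners of a square plus its center; the centroid is (2, 2).
        List<double[]> cluster = new ArrayList<>();
        cluster.add(new double[] {0, 0});
        cluster.add(new double[] {4, 0});
        cluster.add(new double[] {0, 4});
        cluster.add(new double[] {4, 4});
        cluster.add(new double[] {2, 2});

        List<double[]> reps = shrink(scatteredPoints(cluster, 2), centroid(cluster), 0.5);
        for (double[] r : reps) System.out.println(Arrays.toString(r));
        // prints [1.0, 1.0] then [3.0, 3.0]
    }
}
```

In this toy cluster the two chosen points are opposite corners, and shrinking by 0.5 moves each halfway to the centroid. Shrinking pulls boundary points inward, which is how the summary's description of CURE dampens the influence of outliers; suitable values of `c` and `alpha` are exactly what a parameter sensitivity analysis would examine.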