Probabilistic approach for Text clustering in decentralized environment

Durga Prasanthi N, Krishnaiah N, Ananda Rao G

Abstract


Text clustering is an established technique for improving retrieval quality in both centralized and distributed environments. However, traditional text clustering algorithms do not scale to highly distributed environments such as peer-to-peer networks. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach to assign documents to clusters. It enables a peer to compare each of its documents with only a few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document-to-cluster assignment. Extensive experimental evaluation with up to 1 million peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.




This work is licensed under a Creative Commons Attribution 3.0 License.

ISSN : 2251-1563