Murtaza Munawar Fazal; Muhammad Rafi

Abstract

Document clustering is usually performed as an unsupervised task. It attempts to separate a document collection into groups of documents (clusters) by implicitly identifying the common patterns present in these documents. Semi-supervised approaches to this problem have recently reported promising results. In a semi-supervised approach, explicit background knowledge (for example, must-link or cannot-link information for a pair of documents) is used in the form of constraints to drive the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. The paper makes three main contributions: (i) each document is first transformed into a graph representation based on the Graph-of-Word approach, and word sequences of size 3 are extracted from this graph and used as features for the semi-supervised clustering; (ii) a similarity function based on common word sequences is proposed; and (iii) a constraint-based algorithm is designed to perform the actual clustering through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms recently proposed algorithms for document clustering in terms of the standard evaluation measures for the document clustering task.
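
The abstract does not give the exact feature extraction or the form of the similarity function, so the following is only a minimal sketch under stated assumptions. It approximates the size-3 word sequences with consecutive word triples taken directly from the text (no Graph-of-Word is built), and it uses a Jaccard overlap of shared sequences as a stand-in similarity. The names `word_sequences` and `sequence_similarity` are hypothetical and do not come from the paper.

```python
import re


def word_sequences(text, size=3):
    """Return the set of consecutive word sequences of the given size.

    Assumption: plain consecutive word triples stand in for the sequences
    the paper extracts from its Graph-of-Word representation.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def sequence_similarity(doc_a, doc_b, size=3):
    """Jaccard overlap of the word sequences shared by two documents.

    Assumption: Jaccard is an illustrative choice, not the similarity
    function proposed in the paper.
    """
    seqs_a = word_sequences(doc_a, size)
    seqs_b = word_sequences(doc_b, size)
    if not seqs_a or not seqs_b:
        return 0.0  # documents shorter than `size` words share nothing
    return len(seqs_a & seqs_b) / len(seqs_a | seqs_b)


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the cat sat on a rug"
    print(sorted(word_sequences(a)))
    print(sequence_similarity(a, b))
```

In a constrained setting, a score of this kind would be combined with the must-link and cannot-link constraints when assigning documents to clusters; how the paper does that is not described in the abstract.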

Keyword(s)

Document Clustering, Information Retrieval, Semi-supervised Techniques, Data Mining, Document Graph

A Semi-supervised approach to Document Clustering with Sequence Constraints

Volume 13, Issue No 1, 2015, JOURNAL OF INDEPENDENT STUDIES AND RESEARCH (JISR)

Related Articles

Comparative Analysis of Collaborative Filtering on GraphLab, MLlib and Mahout

Extracting patterns from Global Terrorist Dataset (GTD) Using Co-Clustering approach

Probabilistic Vs. Soft Computing for Classifying Credit Card Transactions. A Case Study of Pakistani's Credit Card Data

Graph Visualization Tools: A Comparative Analysis

Analysis of SSD Utilization by Graph Processing Systems
