Term Frequency – Inverse Document Frequency (TF-IDF) Algorithm and K-Means Clustering for Determining Document Categories

Ida Widaningrum; Dyah Mustikasari; Rizal Arifin; Siti Lathifah Tsaqila; Dwiyunia Fatmawati

Authors

Ida Widaningrum, Universitas Muhammadiyah Ponorogo
Dyah Mustikasari, Universitas Muhammadiyah Ponorogo
Rizal Arifin, Universitas Muhammadiyah Ponorogo
Siti Lathifah Tsaqila, Universitas Ahmad Dahlan
Dwiyunia Fatmawati, Universitas Muhammadiyah Ponorogo

Keywords:

document clustering, characteristics or categories, python, term frequency-inverse document frequency (tf-idf).

Abstract

Technology is developing rapidly, and one result is a growing volume of documents in the form of research articles. Searching for documents in a repository takes a long time if they are not stored in groups by category. One way to define document categories is clustering, which makes it easier to find documents that belong to a given category. The clustering process uses the Term Frequency - Inverse Document Frequency (TF-IDF) algorithm and K-Means: TF-IDF weights the terms in each document, and K-Means groups the weighted documents into clusters. The test dataset consists of 93 documents with varied themes and content. The quality of the K-Means clusters was assessed with the Silhouette score, which gave an optimal number of 4 clusters. This number was chosen by examining the fluctuation in cluster size and the thickness of the silhouette plot.
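The pipeline described in the abstract (TF-IDF weighting, K-Means grouping, Silhouette-based choice of the number of clusters) can be illustrated with a minimal standard-library Python sketch. This is not the authors' implementation. The four toy documents, the function names, the particular TF-IDF variant (relative term frequency times log(N/df), L2-normalized), and the "first k points" centroid initialization are all assumptions made for illustration; a real run on the 93-document dataset would more likely use a library and a randomized initialization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return L2-normalized TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))      # document frequency
    idf = {t: math.log(n / df[t]) for t in vocab}      # assumed IDF variant
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[t] / len(d) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100):
    """Plain K-Means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:        # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                 # keep old centroid if cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy corpus: two themes, interleaved.
docs = [
    "kmeans clustering groups documents".split(),
    "rice harvest and farm irrigation".split(),
    "clustering documents with kmeans".split(),
    "farm irrigation improves rice harvest".split(),
]

vectors = tfidf(docs)
labels = kmeans(vectors, 2)

# Try several cluster counts and keep the one with the highest mean silhouette.
scores = {k: silhouette(vectors, kmeans(vectors, k)) for k in (2, 3)}
best_k = max(scores, key=scores.get)
print(labels, scores, best_k)
```

The sketch selects the cluster count by mean silhouette score alone. The abstract additionally mentions inspecting cluster sizes and the thickness of the silhouette plot, which would require the per-document scores rather than only their mean.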

