Open source Arabic research paper dataset for natural language processing

Almutairi, Tahani M.; Saifuddin, Shireen R.; Alotaibi, Reem M.; Sarhan, Shahenda

doi:10.1038/s41598-025-16647-5


Bibliographic description

Almutairi, Tahani M.; Saifuddin, Shireen R.; Alotaibi, Reem M.; Sarhan, Shahenda. Open source Arabic research paper dataset for natural language processing. Scientific Reports. DOI: 10.1038/s41598-025-16647-5


Publication details

Source: Scientific Reports

Year: 2025

Language: English

Formal type: Journal article

MNiSW/MEiN type: other

Abstracts

Recent advances in natural language processing (NLP), applied linguistics, indexing, data mining, information retrieval, and machine translation have increased the need for robust datasets and corpora. Although many Arabic corpora exist, most are derived from social media platforms such as X or from news sources, leaving a significant gap in datasets tailored to academic research. To address this gap, the Arabic Research Papers Dataset (ARPD) was developed as a specialized resource of Arabic academic research papers. This paper explains the methodology used to construct the dataset, which consists of seven classes and is publicly available in several formats to benefit Arabic research. Experiments on ARPD assess its performance in classification and clustering tasks. Measured by the Davies–Bouldin index, most classical clustering algorithms perform worse than bio-inspired algorithms such as Particle Swarm Optimization (PSO) and Gray Wolf Optimization (GWO). For classification, the Support Vector Machine (SVM) algorithm achieved the highest accuracy, while the other classifiers ranged from 89% to 99%. These findings highlight the ARPD's potential to enhance Arabic academic research and support advanced NLP applications.
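The abstract compares clustering algorithms by the Davies–Bouldin index, for which lower values indicate more compact and better separated clusters. The following is a minimal sketch of that index in plain Python, assuming Euclidean distance and hard cluster labels. The function name `davies_bouldin` and the toy points are hypothetical, chosen for illustration only; they are not taken from the paper or from ARPD.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index for a hard clustering (lower is better).

    Assumes Euclidean distance and that no two centroids coincide.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # Centroid of each cluster: coordinate-wise mean of its members.
    centroids = {
        label: [sum(coord) / len(members) for coord in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any
    # other cluster, then average those worst cases over all clusters.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy 2-D data, made up for illustration (not from ARPD).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = davies_bouldin(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = davies_bouldin(points, [0, 1, 0, 1])   # each cluster straddles the gap
print(good, bad)
```

On this toy data the well-separated labelling scores lower than the straddling one, which is the direction of comparison the abstract relies on when ranking clustering algorithms.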

External links

PBN: 6957f614fdbe833fd967b663

DOI: 10.1038/s41598-025-16647-5

Website: https://www.nature.com/articles/s41598-…

Identifiers

ISSN: 2045-2322

BPP ID: (6, 8321) serial publication #8321

Metrics

MNiSW/MEiN points: 140.00

Impact Factor: 0

Index Copernicus: 0

Internal score: 0
