GitHub - superAX/Classifcation-Analysis-on-Textual-Data

Classification Analysis on Textual Data

Language: Python 3.6
Library: scikit-learn

This project extracts features from raw text and compares several classification approaches for assigning documents to topics.

In this project, we classify data from the "20 Newsgroups" dataset, a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. The dataset is split into two subsets: one for training (or development) and one for testing (or performance evaluation). It is a popular dataset for experiments in text applications of machine learning, such as text classification and text clustering.

We first represent the textual data as TF-IDF matrices.
Then, we use dimensionality reduction methods (PCA and NMF) to reduce their size.
Finally, we analyze and compare different classification methods (SVM, Logistic Regression, Naïve Bayes, Multiclass).
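
The three steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the tiny in-memory corpus, the `build_pipeline` helper, and all parameter values (such as `n_components=2`) are assumptions made for the example. It uses NMF for the reduction step, and the real experiments would run on the 20 Newsgroups data, which scikit-learn can download with `sklearn.datasets.fetch_20newsgroups`.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the 20 Newsgroups training
# and testing subsets (two topics instead of twenty).
train_texts = [
    "the graphics card renders the image on the screen",
    "new driver improves image rendering on the graphics card",
    "the pitcher threw the ball and the team won the game",
    "the team scored in the final inning of the baseball game",
]
train_labels = ["graphics", "graphics", "baseball", "baseball"]
test_texts = [
    "which graphics card renders this image fastest",
    "the baseball team lost the game",
]


def build_pipeline(classifier, n_components=2):
    """Chain TF-IDF features, NMF reduction, and a classifier."""
    return make_pipeline(
        # Step 1: represent raw text as a TF-IDF matrix.
        TfidfVectorizer(stop_words="english"),
        # Step 2: reduce the TF-IDF matrix to n_components columns.
        NMF(n_components=n_components, init="nndsvd",
            max_iter=500, random_state=0),
        # Step 3: classify the reduced representation.
        classifier,
    )


# Classifiers to compare, all with default settings (an assumption).
classifiers = {
    "svm": LinearSVC(),
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
}

predictions = {}
for name, clf in classifiers.items():
    model = build_pipeline(clf).fit(train_texts, train_labels)
    predictions[name] = list(model.predict(test_texts))
    print(name, predictions[name])
```

NMF is used here partly because its output is non-negative, so the same reduced features can be passed to `MultinomialNB`, which rejects negative values. A PCA-style reduction can produce negative components and would need a different Naïve Bayes variant or a rescaling step.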