MachineLearning

This repository was used in the Text Classification using Machine Learning session at the Lancaster Summer Schools in Corpus Linguistics and other Digital Methods (#LancsSS16 and #LancsSS17), held at Lancaster University, UK, on 12th to 15th July 2016 and 27th to 30th June 2017. http://ucrel.lancs.ac.uk/summerschool/nlp.php

Instructor: Dr. Mahmoud El-Haj http://www.lancaster.ac.uk/staff/elhaj

Slides are available online here:

Course: https://lancaster.box.com/s/fi15evvbtcs4ab0tx5zo8nxmy2yylztx

Workspace Setup: https://lancaster.box.com/s/j78l0b4197il98oze2gfqlidlsvg7jlt

The code trains classifiers for the chairman's statement, governance, and remuneration sections of 1,000 annual financial reports. Using the WEKA Java API, the code does the following:

  • Creates an ARFF file
  • Trains a model using different algorithms
  • Extracts n-gram features using the StringToWordVector filter
  • Reduces the feature set
  • Classifies unseen documents using the created models
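
As an illustration of the first step, here is a minimal sketch of building an ARFF file with one string attribute for the document text and one nominal class attribute. It uses only the Java standard library and is not the repository's own code. The class name `ArffSketch`, the relation name, the class labels, and the sample sentences are all hypothetical, and the escaping covers only the most common special characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not taken from the repository.
class ArffSketch {

    // Wrap a document in single quotes and escape the characters that
    // would otherwise break a quoted ARFF string value.
    static String quote(String text) {
        String escaped = text
                .replace("\\", "\\\\")   // backslash first, so later escapes are not doubled
                .replace("'", "\\'")
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t");
        return "'" + escaped + "'";
    }

    // Build the ARFF text: a header declaring a string attribute and a
    // nominal class attribute, then one data row per document.
    // Each element of docs is {document text, class label}.
    static String toArff(String relation, List<String> labels, List<String[]> docs) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {").append(String.join(",", labels)).append("}\n\n");
        sb.append("@data\n");
        for (String[] doc : docs) {
            if (!labels.contains(doc[1])) {
                throw new IllegalArgumentException("Unknown class label: " + doc[1]);
            }
            sb.append(quote(doc[0])).append(',').append(doc[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed label names, based on the three section types described above.
        List<String> labels = Arrays.asList("chairman", "governance", "remuneration");
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"It's been a strong year for the group.", "chairman"});
        docs.add(new String[] {"The board met eight times during the year.", "governance"});
        System.out.print(toArff("annual_reports", labels, docs));
    }
}
```

A file written this way can be loaded into WEKA, where the string attribute is typically converted to word or n-gram features before a classifier is trained.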