GitHub - hpicgs/unCover: An AI tool presented at KDIR 2023, that can distinguish human-written from AI-generated news articles using natural language processing text analysis, with an emphasis on explainability.

Detailed information about unCover can be found in the following publication:

Liebe L., Baum J., Schütze T., Cech T., Scheibel W. and Döllner J. (2023).
UNCOVER: Identifying AI Generated News Articles by Linguistic Analysis and Visualization.
In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN 978-989-758-671-2, SciTePress, pages 39-50. DOI: 10.5220/0012163300003598

Dataset: Pre-trained Models:

Our datasets and pre-trained models can be found on Zenodo through the batches above. Please note that for copyright reasons we removed the plain text of the scraped news articles and only left the metadata and the generated texts in the dataset files.

Prerequisites

If you want to run unCover locally without Docker, please make sure you have the following packages installed and working on your machine:

Anaconda or Miniconda
Java (Runtime Environment is sufficient)
Make and g++

Setup

To set up this project with docker compose, run the following command

Inside the training container, whenever you want to run any of the following scripts, make sure you first activate the environment with.

To take full advantage of all capabilities in this repository you should fill out all the information in .env.example and save it as .env. The credentials are only required for generation, however, fine-tuning the confidence thresholds to the used models will greatly benefit performance and is required to achieve good results.

When modifying the code, you can use the Dockerfiles in /deployment to re-build the containers.

Alternatively if you are only interested in running the web interface you can use the dedicated Docker container from Docker Hub: docker pull lucasliebe/uncover:latest.

Usage

There are multiple ways to use unCover, depending on your use case. First, you can run the server and use the web interface to explore the system. Second, you can generate your own news dataset following the proposed methods. Third, you can use the datasets to re-train and/ or test the unCover models.

Running unCover

To run and try out unCover in its base configuration follow the setup guide and run

This will start the process and ensure all sub-systems are installed and running, then you can access the web interface locally at http://localhost:8501/.

Scraping news articles

Scraping can be done either in query mode or dataset mode - query mode uses a search query provided by you to gather articles on Google News, while dataset mode looks up the personal pages of specific authors on their publication and collects hundreds of articles for a single person.

Query mode

Query mode is very simple and will fetch news from Google News depending on the query:

python3 generate_dataset.py --mode human --queries <query1,query2>

Dataset mode

Dataset mode is quicker and has higher output but takes a bit more work to set up. For one, every news publication has their own way of listing articles by author, therefore a helper functions is required for every single publication. At the moment, dataset mode only works for TheGuardian and N-tv.

Here is a practical example for dataset mode:

python3 generate_dataset.py --mode human \
    --publication <theguardian|n-tv> \
    --author <author name> \

The publication argument will set the website to collect articles from. The author argument is the most important one - you need to provide the URL appendix (only the last part of the url) of your specific author's personal page, where all their articles are listed.

Generating news articles

To generate news articles, we also scrape news from Google News, but this time we use the query as context for a generative language model. The model will then generate a new article based on this context. The following could is an example command:

python3 generate_dataset.py --mode ai --queries <query1,query2> --models <gpt4,gemini,...>

Training the models

New models can be trained using training.py for stylometry models and for the TEGM based prediction model. The commandline options for these scripts are documented in the scripts themselves. If your dataset is stored in a different location, please modify definitions.py accordingly.

Recreating the test performance

Including both generation methods,

python3 generate_dataset.py --mode test --queries <query1,query2> --models <gpt4,gemini,...>

has been used to create the test dataset including both human authors and AI models as text sources.

After creating this test dataset or using the provided dataset, test_performance.py can be used to create test results and a performance summary.

Citation

If you use unCover in your research, please cite our paper:

@inproceedings{kdir23uncover,
    author={Lucas Liebe and Jannis Baum and Tilman Schütze and Tim Cech and Willy Scheibel and Jürgen Döllner},
    title={UNCOVER: Identifying AI Generated News Articles by Linguistic Analysis and Visualization},
    booktitle={Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
    year={2023},
    pages={39-50},
    publisher={SciTePress},
    organization={INSTICC},
    doi={10.5220/0012163300003598},
    isbn={978-989-758-671-2},
}