GitHub - AluruLab/ParFastAAI: Parallel Algorithm for computing Average Amino Acid Identities (AAI) for bacterial and archaeal genomes.

A parallel version of Fast AAI

Installation

Pre-requisites

Linux (Tested with Ubuntu 22.04)
C++ 17 with Open MPI (Tested with gcc 11.4)
CMake (version 3.2, Optional)

Build par_fastaai

For a quick build, clone the repo and run build.sh. This build uses gcc and g++ to build the executable par_fastaai.x.

For building with CMake, execute the following steps:

Clone the repository

git clone --recurse-submodules https://github.com/AluruLab/ParFastAAI.git

Create a build directory

Configure with cmake

cd ParFastAAI/build
cmake ..

Build with make

The executable par_fastaai.x will be built in the build directory

Usage

Usage of Parallel FastAAI is as follows:

Usage: ./par_fastaai.x [OPTIONS] path_to_input_db path_to_output_file

Positionals:
  path_to_input_db TEXT:FILE REQUIRED
                              Path to the Input Database
  path_to_output_file TEXT REQUIRED
                              Path to output csv file.

Options:
  -h,--help                   Print this help message and exit
  -r,--query_db TEXT:FILE     Path to the Query Database [Optional (default: Same as the Input DB)]
  -s,--separator TEXT [,]     Field Separator in the output file [Optional (default: ,)].
  -q,--query_subset TEXT:FILE Path to Query List (Should be subset of genomoes in the input DB.)

Currently parallel FastAAI allows three types of usage to compute Average Jaccard Index (AJI) for a pairs of genomes. All of them require a SQLite database of tetramer and genome information of single copy proteins (SCP).

Given a database of SCPs and tetramer information, compute AJI for all the pairs of genomes in the database and outputs a csv file with AJI matrix. Output is a csv file with a square matrix, whose size is the number of genomes in the database.
Given a database of SCPs and tetramer information and a set of query genomes (a strict subset of the genomes in the database), compute AJI values for the query genomes against all the genomes in the database. Output is a csv file with AJI matrix of size : (Number of query genomes) X (Number of total genomes)
Given two databases of SCP and tetramer information - a main database and a query database, compute AJI (Average Jaccard Index) of the genomes in the query database against all the genomes in the main database. NOTE: NONE of the genomes in the query database should be from the input database. Outputs a csv file with AJI matrix of size: (Number of genomes in query db) X (Number of genomes in main db)

data/ directory contains the example databases, input and the output files.

Execution

Once the executable has been built, the following for more information on all the options that the executable accepts:

By default the program uses all the cores in the machine. To reduce the number of cores use the environment variable OMP_NUM_THREADS as described in the OpenMP documentation

 setenv OMP_NUM_THREADS 4,3,2

Parallel Algorithm

The parallel algorithm and the data structures used are described in the document Parallel Fast AAI

Licensing

Our code is licensed under the Apache License 2.0 (see LICENSE).