FASTA format

What is FASTA format?

FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by one or more lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters.
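
The layout described above can be sketched in a few lines of Python. This is only an illustration, not part of any official tool: the function name `format_fasta` and the default width of 70 are assumptions chosen so that every sequence line stays under the recommended 80 characters.

```python
def format_fasta(description, sequence, width=70):
    """Return one FASTA record as a string (hypothetical helper).

    The record is a '>' description line followed by the sequence,
    wrapped so that each sequence line holds at most `width` characters.
    """
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    # Drop any whitespace already embedded in the sequence before re-wrapping.
    compact = "".join(sequence.split())
    lines = [">" + description]
    lines.extend(compact[i:i + width] for i in range(0, len(compact), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_fasta("seq1 demo record", "ACGT" * 40), end="")
```

Only the sequence lines are wrapped here; the description must remain on a single line, so it is written as given.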

An example sequence in FASTA format is:

    >gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
    MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
    AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
    QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
    LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
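
Reading such a record back is equally simple. The following is a minimal sketch, assuming the input is already available as a string; `parse_fasta` is a hypothetical name. As a convenience that goes beyond the format description, it strips surrounding whitespace from each line and skips blank lines.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines between or after records
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before any '>' description line")
    if description is not None:
        yield description, "".join(chunks)


if __name__ == "__main__":
    demo = ">id1 first record\nACGT\nACGT\n>id2 second record\nMNSERSDV\n"
    for desc, seq in parse_fasta(demo):
        print(desc, len(seq))
```

Sequence lines belonging to one record are concatenated, so the line breaks in the file do not affect the returned sequence.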
    

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:

  • lower-case letters are accepted and are mapped into upper-case;
  • a single hyphen or dash can be used to represent a gap of indeterminate length;
  • in amino acid sequences, U and * are acceptable letters (see below);
  • any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
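
A query can be cleaned up along these lines before submission. The sketch below is an assumption about how one might apply the rules above, not a description of what any server does internally: `normalize_sequence` is a hypothetical helper, it replaces each digit with one copy of the chosen letter, and it also drops whitespace, which the list above does not mention.

```python
def normalize_sequence(sequence, unknown=None):
    """Upper-case a sequence and remove or replace digits.

    If `unknown` is None, digits are removed. Otherwise every digit is
    replaced by `unknown` (for example "N" for nucleic acids or "X" for
    amino acids). Whitespace is dropped; hyphens are kept as gaps.
    """
    cleaned = []
    for ch in sequence:
        if ch.isspace():
            continue
        if ch.isdigit():
            if unknown is not None:
                cleaned.append(unknown)
            continue
        cleaned.append(ch.upper())
    return "".join(cleaned)


if __name__ == "__main__":
    print(normalize_sequence("acgt 12 ac-gt"))
    print(normalize_sequence("acgt12ac", unknown="N"))
```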

The nucleic acid codes are:

        A --> adenosine           M --> A C (amino)
        C --> cytidine            S --> G C (strong)
        G --> guanosine           W --> A T (weak)
        T --> thymidine           B --> G T C
        U --> uridine             D --> G A T
        R --> G A (purine)        H --> A C T
        Y --> T C (pyrimidine)    V --> G C A
        K --> G T (keto)          N --> A G C T (any)
                                  -  gap of indeterminate length
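
The table above maps directly onto a lookup structure. The following sketch transcribes it into a Python dictionary and shows two possible uses; the names `NUCLEOTIDE_CODES`, `is_valid_nucleotide_sequence` and `codes_compatible` are illustrative assumptions, not part of any library.

```python
# Each code maps to the bases it can stand for, as listed in the table above.
NUCLEOTIDE_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def is_valid_nucleotide_sequence(sequence):
    """Return True if every character is a listed code or a gap ('-')."""
    allowed = set(NUCLEOTIDE_CODES) | {"-"}
    return all(ch in allowed for ch in sequence.upper())


def codes_compatible(code_a, code_b):
    """Return True if the two codes share at least one possible base."""
    bases_a = set(NUCLEOTIDE_CODES[code_a.upper()])
    bases_b = set(NUCLEOTIDE_CODES[code_b.upper()])
    return bool(bases_a & bases_b)


if __name__ == "__main__":
    print(is_valid_nucleotide_sequence("acgtnryk-"))
    print(codes_compatible("R", "Y"))
```

In this sketch T and U are treated as distinct symbols, exactly as the table lists them; a tool comparing DNA against RNA would need to decide separately whether to equate them.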

The accepted amino acid codes are:

        A ALA alanine                         P PRO proline
        B ASX aspartate or asparagine         Q GLN glutamine
        C CYS cysteine                        R ARG arginine
        D ASP aspartate                       S SER serine
        E GLU glutamate                       T THR threonine
        F PHE phenylalanine                   U     selenocysteine
        G GLY glycine                         V VAL valine
        H HIS histidine                       W TRP tryptophan
        I ILE isoleucine                      Y TYR tyrosine
        K LYS lysine                          Z GLX glutamate or glutamine
        L LEU leucine                         X     any
        M MET methionine                      *     translation stop
        N ASN asparagine                      -     gap of indeterminate length
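
A protein query can be checked against this table before use. The sketch below is illustrative only: `AMINO_ACID_CODES` and `invalid_residues` are hypothetical names, and the check is case-insensitive because lower-case letters are mapped to upper-case.

```python
# The 26 one-letter symbols listed in the table above, including '*' and '-'.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def invalid_residues(sequence):
    """Return a sorted list of the characters that are not in the table."""
    return sorted(set(sequence.upper()) - AMINO_ACID_CODES)


if __name__ == "__main__":
    print(invalid_residues("LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK"))
    print(invalid_residues("MJO1k"))
```

An empty result means every character is an accepted code. The letters J and O do not appear in the table, so this sketch reports them as invalid.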
    
