4 files

COI reference sequences from BOLD DB

posted on 15.09.2022, 11:45 authored by John Sundh

Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences
collected from the BOLD database. The fasta file
bold_clustered_cleaned.fasta.gz has record ids that can be queried in the Public
Data Portal, and each fasta header contains the taxonomic ranks and the BIN ID
assigned to the
record. The taxonomic information for each record is also given in the tab-separated
file bold_info_filtered.tsv.gz.
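
As a rough illustration of how the gzipped fasta file could be read, the sketch below yields each record's header and sequence using only the Python standard library. The helper name `read_fasta` is hypothetical and not part of the coidb package, and the header is returned as a raw string because the exact layout of the taxonomic ranks and BIN ID within it is not specified here.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file (gzipped or plain).

    Hypothetical helper for illustration only; the header is returned
    unparsed, without the leading '>'.
    """
    opener = gzip.open if path.endswith(".gz") else open
    header = None
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Example usage (assumes the file has been downloaded to the working directory):
# for header, seq in read_fasta("bold_clustered_cleaned.fasta.gz"):
#     print(header, len(seq))
```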

The dataset was last generated on February 18, 2022.


The code used to generate this dataset consists of a Snakemake workflow wrapped
in a Python package that can be installed with conda
(`conda install -c bioconda coidb`).

First, sequence and taxonomic information for records in the BOLD database is
downloaded from the GBIF Hosted Datasets. This data is then filtered to keep only
records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information
is parsed in order to assign species names and resolve higher-level ranks for each
BIN ID. Sequences are processed to remove gap characters and leading and trailing
`N`s. After this, any sequences with remaining non-standard characters are removed.
Sequences are then clustered at 100% identity using vsearch
(Rognes _et al._ 2016). This clustering is done separately for sequences assigned
to each BIN ID.   
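
The sequence cleaning step described above can be sketched as follows. This is a minimal illustration, not the code used in the coidb workflow: the function name `clean_sequence` is hypothetical, and it assumes that gaps are written as `-`, that "standard" characters means `A`, `C`, `G` and `T`, and that sequences may be upper-cased before checking.

```python
import re
from typing import Optional

# Assumption: only A, C, G and T count as standard characters.
STANDARD = re.compile(r"^[ACGT]+$")


def clean_sequence(seq: str) -> Optional[str]:
    """Return the cleaned sequence, or None if it should be discarded.

    Hypothetical helper mirroring the described steps:
    1. remove gap characters (assumed to be '-'),
    2. strip leading and trailing N characters,
    3. discard sequences that still contain non-standard characters.
    """
    seq = seq.upper().replace("-", "")
    seq = seq.strip("N")
    if not seq or not STANDARD.match(seq):
        return None
    return seq


print(clean_sequence("NN-ACG-TNN"))  # gaps and flanking Ns are removed
print(clean_sequence("ACNGT"))       # an internal N remains, so the sequence is dropped
```

Sequences that survive this kind of cleaning would then be passed, per BIN ID, to the clustering step.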


Swedish Biodiversity Data Infrastructure

Swedish Research Council

National Bioinformatics Infrastructure Sweden (NBIS)

Swedish Research Council


Swedish Biodiversity Data Infrastructure (SBDI)