COInr and mkCOInr: Building and customizing a nonredundant barcoding reference database from BOLD and NCBI using a semi‐automated pipeline
Journal article, Molecular Ecology Resources, Year: 2023

Abstract

Reference databases with wide taxonomic coverage are greatly needed in many fields of biology, particularly for the taxonomic assignment of metabarcoding sequences. It is therefore essential to be able to access and pool data from different primary databases. The COInr database is a freely available, easy-to-access database of COI reference sequences extracted from the BOLD and NCBI nucleotide databases. It is comprehensive, not limited to a taxon, a gene region or a taxonomic rank, and is therefore a good starting point for creating custom databases. Sequences are dereplicated between databases and within taxa. Each taxon has a unique taxonomic identifier (taxID), which is essential for avoiding ambiguous associations of homonyms and synonyms in the source databases. TaxIDs form a coherent hierarchical system fully compatible with the NCBI taxIDs, allowing full or ranked lineages to be created. The mkCOInr tool is a series of Perl scripts designed to download sequences from BOLD and NCBI, to build the COInr database and to customize it according to the users’ needs. It is possible to select or eliminate sequences for a list of taxa, select a specific gene region, select for minimum taxonomic resolution, add new custom sequences, and format the database for BLAST, VTAM, QIIME and the RDP classifier. This is a semiautomated pipeline using command lines in a Linux environment. The COInr database can be downloaded from https://doi.org/10.5281/zenodo.6555985, and mkCOInr and its full documentation are available at https://github.com/meglecz/mkCOInr.
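As an illustration of what dereplication between databases and within taxa can mean in practice, the following is a minimal Python sketch, not the mkCOInr implementation (which is written in Perl and may use different criteria). It assumes each record is a hypothetical (sequence ID, taxID, sequence) tuple and treats two records as redundant only when they share the same taxID and an identical sequence, ignoring case. The record IDs and sequences are invented for the example.

```python
def dereplicate(records):
    """Keep the first record seen for each (taxID, sequence) pair.

    records: iterable of (seq_id, taxid, sequence) tuples (hypothetical layout).
    Identical sequences under different taxIDs are kept, since they carry
    different taxonomic information.
    """
    seen = set()
    kept = []
    for seq_id, taxid, sequence in records:
        key = (taxid, sequence.upper())  # case-insensitive exact match
        if key not in seen:
            seen.add(key)
            kept.append((seq_id, taxid, sequence))
    return kept


# Invented example records standing in for the two source databases.
bold = [("BOLD_1", 7227, "ACGTACGT"), ("BOLD_2", 7227, "ACGTTTGA")]
ncbi = [("NCBI_1", 7227, "acgtacgt"), ("NCBI_2", 9606, "ACGTACGT")]

# Pooling first and dereplicating afterwards removes redundancy
# both between the sources and within each taxon.
pooled = dereplicate(bold + ncbi)
kept_ids = [seq_id for seq_id, _, _ in pooled]
print(kept_ids)
```

Here NCBI_1 is dropped because it duplicates BOLD_1 under the same taxID, while NCBI_2 is kept because the same sequence is attached to a different taxID.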
Main file
Meglécz_2023_MER.pdf (990.94 KB)
Origin: Publication funded by an institution
Licence: CC BY - Attribution

Dates and versions

hal-04010871, version 1 (02-03-2023)

Licence

Attribution - CC BY 4.0

Cite

Emese Meglécz. COInr and mkCOInr: Building and customizing a nonredundant barcoding reference database from BOLD and NCBI using a semi‐automated pipeline. Molecular Ecology Resources, In press, ⟨10.1111/1755-0998.13756⟩. ⟨hal-04010871⟩