### General information

- Author: Verena Kutschera
- Contact e-mail: generode@nbis.se
- DOI: 10.17044/scilifelab.19248172
- License: GPL 3.0 +
- This readme file was last updated: 22-02-2022

Please cite as: Verena E. Kutschera, Marcin Kierczak, Tom van der Valk, Johanna von Seth, Nicolas Dussex, Edana Lord, Marianne Dehasque, David W. G. Stanton, Payam Emami, Björn Nystedt, Love Dalén, David Díez-del-Molino (2022). Test dataset from: GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species. DOI 10.17044/scilifelab.19248172

### Dataset description

Files included in this repository:

- reference.tar.gz:
  - sumatran_rhino_22Jul2017_9M7eS_haploidified_headersFixed_Sc9M7eS_2_HRSCAF_41.fasta with scaffold ‘Sc9M7eS_2_HRSCAF_41’ from the Sumatran rhinoceros genome assembly (Dicerorhinus sumatrensis harrissoni; GenBank accession number GCA_014189135.1).
  - GCF_000283155.1_CerSimSim1.0_genomic.Sc9M7eS_2_HRSCAF_41.fasta with three scaffolds from the White rhinoceros genome assembly (Ceratotherium simum simum; GenBank accession number GCF_000283155.1; ‘NW_004454182.1’, ‘NW_004454248.1’, and ‘NW_004454260.1’).
  - GCF_000283155.1_CerSimSim1.0_genomic.Sc9M7eS_2_HRSCAF_41.gtf with gene predictions for the three White rhinoceros scaffolds in GTF format.
  - rhino_NC_012684.fasta with a Sumatran rhinoceros mitochondrial genome (GenBank accession number NC_012684.1).
- historical.tar.gz: FASTQ files with paired-end reads from three historical Sumatran rhinoceros samples from the now-extinct Malay Peninsula population that had mapped to the Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with a small proportion of randomly selected reads that mapped to the Sumatran rhinoceros mitochondrial genome or elsewhere in the genome. For two of the historical Sumatran rhinoceros samples, three sequencing libraries are available per sample that had been sequenced on two lanes each.
  For the third historical sample, 24 sequencing libraries are available that had been sequenced on two lanes (12 libraries per lane). SRA identifiers: JvS008 (SR08) ERS4044060, JvS009 (SR09) ERS4044061, JvS022 (SR22) ERS4044063.
- modern.tar.gz: FASTQ files with paired-end reads from three modern Sumatran rhinoceros samples from the now-extinct Malay Peninsula population that had mapped to the Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’, along with a small proportion of randomly selected reads that mapped to the Sumatran rhinoceros mitochondrial genome or elsewhere in the genome. For each of the three modern Sumatran rhinoceros samples, one sequencing library is available. SRA identifiers: JvS033 (KB6196) ERS4042484, JvS034 (KB6197) ERS4042485, JvS035 (KB6198) ERS4042486.
- gerp_outgroups.tar.gz: gzipped FASTA files with scaffolds from the genome assemblies of mammalian outgroup species for GERP analyses that contained genes with reciprocal blast hits to genes from the Sumatran rhinoceros scaffold ‘Sc9M7eS_2_HRSCAF_41’.
  Included species and GenBank accession numbers of genome assemblies:
  - Ailurus fulgens GCA_002007465.1
  - Antilocapra americana GCA_007570785.1
  - Balaenoptera acutorostrata GCF_000493695.1
  - Bubalus bubalis GCF_003121395.1
  - Camelus dromedarius GCF_000803125.2
  - Canis lupus GCF_014441545.1
  - Catagonus wagneri GCA_004024745.2
  - Cervus elaphus GCA_002197005.1
  - Diceros bicornis GCA_013634535.1
  - Enhydra lutris GCF_002288905.1
  - Equus asinus GCA_016077325.1
  - Giraffa camelopardalis GCA_017591445.1
  - Hippopotamus amphibius GCA_004027065.2
  - Hyaena hyaena GCF_003009895.1
  - Leptonychotes weddellii GCF_000349705.1
  - Lipotes vexillifer GCF_000442215.1
  - Manis javanica GCF_014570535.1
  - Mesoplodon bidens GCA_004027085.1
  - Ovis aries GCF_002742125.1
  - Panthera leo GCA_008795835.1
  - Paradoxurus hermaphroditus GCA_004024585.1
  - Physeter catodon GCF_002837175.2
  - Procyon lotor GCA_015708975.1
  - Spilogale gracilis GCA_004023965.1
  - Suricata suricatta GCF_006229205.1
  - Tapirus indicus GCA_004024905.1
  - Tragulus javanicus GCA_004024965.2
  - Tursiops truncatus GCF_011762595.1
  - Ursus maritimus GCF_017311325.1
  - Zalophus californianus GCF_009762305.2
- gerp_tree.nwk: Phylogeny in NEWICK format of the White rhinoceros and the 30 mammalian outgroup species, including divergence time estimates (in billions of years) from timetree.org.
- config.tar.gz: Configuration and metadata files used to analyze this test dataset with the GenErode pipeline (results presented in Kutschera et al. 2022):
  - config_mitogenomes.yaml: configuration file used to run the mitochondrial mapping step.
  - config_sum_rhino.yaml: configuration file used to run GenErode using the Sumatran rhinoceros scaffold as reference genome.
  - config_whi_rhino.yaml: configuration file used to run GenErode using the White rhinoceros scaffolds as reference genome.
  - rhino_3_historical_samples.txt: metadata file for historical samples.
  - rhino_3_modern_samples.txt: metadata file for modern samples.
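
The archives listed above need to be uncompressed into one dedicated directory before they can be used. A minimal Python sketch of that step is shown here; it is only an illustration, not part of GenErode. The function name `extract_archives` and the directory arguments are hypothetical, and the internal folder layout of each archive is not described in this readme, so the sketch simply extracts whatever each `*.tar.gz` file contains.

```python
import tarfile
from pathlib import Path


def extract_archives(download_dir, target_dir):
    """Extract every *.tar.gz archive found in download_dir into target_dir.

    Returns the names of the archives that were extracted, in sorted order.
    Both directory arguments are placeholders chosen for this example.
    """
    download_dir = Path(download_dir)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    for archive in sorted(download_dir.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Use the safer "data" extraction filter where this Python
            # version provides it; otherwise fall back to plain extraction.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(target_dir, filter="data")
            else:
                tar.extractall(target_dir)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_archives("downloads", "generode_test_data")` would unpack all downloaded archives from a `downloads` folder into `generode_test_data` (both names are made up here). Note that gerp_tree.nwk is not an archive, so it is not touched by this sketch and would have to be copied to the same directory separately.
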

To analyze this dataset with GenErode, please download this repository and uncompress all archives in a dedicated directory. The configuration and metadata files in `config/` can be used as examples for GenErode runs. Instructions on how to run the GenErode pipeline are provided in the GitHub repository (https://github.com/NBISweden/GenErode/wiki).

References:

- Kutschera VE, Kierczak M, van der Valk T, von Seth J, Dussex N, Lord E, et al. GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species. bioRxiv. 2022. doi: 10.1101/2022.03.04.482637