217 closed Salmonella reference genomes using PacBio sequencing

Abstract

Objectives

Whole Genome Sequencing (WGS) is widely used in food safety for the detection, investigation, and control of foodborne bacterial pathogens. However, the WGS data in most public databases, such as those of the National Center for Biotechnology Information (NCBI), consist primarily of Illumina short reads. Short reads cannot fully resolve repetitive regions, structural variations, and mobile genetic elements, or the genomic location of important genes such as antimicrobial resistance (AMR) and virulence genes. To address this limitation, we have contributed 217 closed circular Salmonella enterica genomes, generated using PacBio sequencing, to the NCBI Pathogen Detection (PD) database and GenBank. This dataset improves the accuracy of the genome representations in the database.

Data description

High-quality complete reference genomes generated from PacBio long reads can provide essential details that are not available in draft genomes assembled from short reads. A complete reference genome allows for more accurate data analysis and enables researchers to connect genome variations to known genes, regulatory elements, and other genomic features. The addition of 217 complete genomes from 78 different Salmonella serovars, each representing either a distinct SNP cluster within the NCBI PD database or a unique strain, significantly enriches the diversity of the reference genome database.

Objective

In 2012, the U.S. Food and Drug Administration’s Center for Food Safety and Applied Nutrition (FDA-CFSAN) launched the GenomeTrakr network [1], the first distributed network of laboratories that use WGS for foodborne pathogen identification. GenomeTrakr data are submitted to the NCBI PD database to assist in foodborne pathogen surveillance and outbreak detection. To date, WGS data from over 1.7 million isolates belonging to 81 different pathogen species, primarily obtained from Illumina short-read sequencers, have been collected, stored, and processed in NCBI. The NCBI PD web portal (https://www.ncbi.nlm.nih.gov/pathogens/) provides public access to this extensive collection of WGS data and offers high-resolution strain typing, outbreak investigation, and surveillance capabilities. However, the NCBI PD database currently contains only around 500 closed genomes, indicating a shortage of high-quality complete genomes.

Salmonella, one of the leading foodborne pathogens, causes widespread illness and poses a significant threat to public health worldwide. To address the need for more comprehensive genomic information, between 2018 and 2021 we focused on sequencing and completely closing a diverse set of Salmonella genomes using PacBio technology. This effort brings higher resolution, accuracy, and precision to various analyses, including the detection of structural variations, gene annotation, phylogenetic analysis, and comparative genomics. These analyses, which are often used during foodborne events, require critical information that is often lacking in short-read assemblies. The 217 Salmonella isolates in this collection were obtained from various food, clinical, and environmental sources. Both the closed genomes and the raw long reads have been deposited in NCBI (Table 1), expanding the genomic resources available for Salmonella surveillance.

Data description

NCBI PD processes the WGS data from bacterial and fungal pathogen genomes and places them into clusters based on the relatedness between genomes. For Salmonella specifically, approximately 711,000 genomes had been assigned to more than 30,000 clusters as of January 20, 2025. Within each cluster, a reference genome is used to construct a reference-based SNP matrix, which in turn is used to infer a phylogenetic tree. A high-quality closed reference genome would greatly help to identify lineage- or serovar-specific Salmonella fragments and to enhance the accuracy of phylogenetic analysis, SNP annotation, and other related analyses [2, 3]. This enhanced resolution helps trace the source of outbreaks more accurately, determine the pathways of contamination, and understand the genetic factors involved in virulence and resistance [4, 5, 6]. Consequently, it will lead to more effective and targeted interventions to control and prevent future outbreaks, ensuring better food safety and public health protection [7]. For our study, we carefully selected 217 Salmonella isolates for PacBio sequencing. Each isolate either represents a distinct NCBI SNP cluster or is a unique isolate that did not cluster with any other genome.
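To make the idea of a reference-based SNP matrix concrete, the following minimal Python sketch counts pairwise SNP differences among toy sequences that are assumed to be already aligned to the same reference coordinates. The isolate names, the sequences, and the functions `snp_distance` and `snp_distance_matrix` are hypothetical and serve only as an illustration; this is not the NCBI PD implementation.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count positions where two equal-length aligned sequences differ.

    Positions holding a gap ('-') or an ambiguous base ('N') in either
    sequence are skipped rather than counted as SNPs.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in skip and y not in skip
    )


def snp_distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], int]:
    """Return pairwise SNP distances keyed by (name_1, name_2), names sorted."""
    return {
        (m, n): snp_distance(aligned[m], aligned[n])
        for m, n in combinations(sorted(aligned), 2)
    }


# Toy alignment (hypothetical data, 10 reference positions).
aligned = {
    "reference": "ACGTACGTAC",
    "isolate_A": "ACGTACGTAT",
    "isolate_B": "ACCTACGNAT",
}

matrix = snp_distance_matrix(aligned)
for pair, dist in matrix.items():
    print(pair, dist)
```

In a sketch like this, every distance depends on which positions of the reference can be compared at all, which is one reason a closed reference without gaps or collapsed repeats is attractive for cluster-level analysis.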

Each isolate was cultured in Luria-Bertani (LB) medium at 35 °C overnight. DNA was extracted using the Maxwell RSC Cultured Cell DNA kit. The sequencing libraries were generated using the SMRTbell Template Prep Kit 1.0 following the manufacturer’s recommended microbial multiplexing protocol. Samples were sequenced in sets of four on the Pacific Biosciences (PacBio) Sequel platform (v2.1 chemistry, Sequel SMRT Cell 1M v2, 10-hour movie). The PacBio raw reads were de novo assembled using the PacBio Hierarchical Genome Assembly Process (HGAP) 4.0. The assembled genomes were circularized using Circlator [8]. If the corresponding Illumina short reads were available in NCBI, the closed genomes were further polished with those reads using Pilon [9]. Where short reads were unavailable, the closed genomes were polished with the PacBio raw reads using the Resequencing module provided by PacBio SMRT Link (versions 5.1 to 8.0) (Pacific Biosciences, Menlo Park, CA).
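The branching in the workflow above (assembly, circularization, then one of two polishing routes) can be summarized in a short Python sketch. The function name `plan_assembly_steps` and its boolean parameter are hypothetical; the sketch only lists the steps named in the text and does not invoke any of the tools or assume any of their command-line options.

```python
def plan_assembly_steps(illumina_reads_available: bool) -> list[str]:
    """Return the ordered post-sequencing steps for one isolate."""
    steps = [
        "De novo assembly of PacBio raw reads with HGAP 4.0",
        "Circularization with Circlator",
    ]
    # The polishing route depends on whether matching short reads exist in NCBI.
    if illumina_reads_available:
        steps.append("Polishing with Illumina short reads using Pilon")
    else:
        steps.append(
            "Polishing with PacBio raw reads using the SMRT Link Resequencing module"
        )
    return steps


for has_short_reads in (True, False):
    print(f"Illumina reads available: {has_short_reads}")
    for i, step in enumerate(plan_assembly_steps(has_short_reads), start=1):
        print(f"  {i}. {step}")
```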

Among the 217 closed genomes, CFSAN017963 and CFSAN000871 were previously sequenced on the PacBio RS II, assembled using HGAP 3.0, and published [10, 11]. They were re-sequenced on the PacBio Sequel, which provides higher throughput and higher consensus accuracy than the RS II, assembled using HGAP 4.0, and included in the current dataset for comparison purposes.

Table 1 Complete DNA sequences (.Fasta) for the 217 Salmonella genomes

Data availability

The data are listed in the manuscript (Table 1) and have been deposited in NCBI under the primary accession number PRJNA186035.
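As one possible way to locate the deposited records programmatically, the sketch below builds a query URL for the NCBI E-utilities ESearch endpoint using the accession number given above. The helper name `esearch_url` and the choice of `retmax` are assumptions made for illustration; the code only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for the given Entrez database and search term."""
    query = urlencode(
        {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Search the BioProject database for the accession cited in the text.
url = esearch_url("bioproject", "PRJNA186035")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching BioProject identifier, from which the linked assemblies and raw reads can be followed.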

Abbreviations

FDA-CFSAN: U.S. Food and Drug Administration’s Center for Food Safety and Applied Nutrition

NCBI: National Center for Biotechnology Information

SNP: Single Nucleotide Polymorphism

SRA: Sequence Read Archive

WGS: Whole Genome Sequencing

References

  1. Allard MW, Strain E, Melka D, Bunning K, Musser SM, Brown EW, Timme R. Practical value of food pathogen traceability through building a whole-genome sequencing network and database. J Clin Microbiol. 2016;54(8):1975–83. https://doi.org/10.1128/JCM.00081-16

  2. Fraser CM, Eisen JA, Nelson KE, Paulsen IT, Salzberg SL. The value of complete microbial genome sequencing (you get what you pay for). J Bacteriol. 2002;184(23):6403–5. https://doi.org/10.1128/JB.184.23.6403-6405.2002

  3. Korlach J. Returning to more finished genomes. Genom Data. 2014;2:46–8. https://doi.org/10.1016/j.gdata.2014.02.003

  4. Gilchrist CA, Turner SD, Riley MF, Petri WA Jr, Hewlett EL. Whole-genome sequencing in outbreak analysis. Clin Microbiol Rev. 2015;28(3):541–63. https://doi.org/10.1128/CMR.00075-13

  5. Zhou Y, Ren M, Zhang P, Jiang D, Yao X, Luo Y, Yang Z, Wang Y. Application of nanopore sequencing in the detection of foodborne microorganisms. Nanomaterials. 2022;12(9):1534. https://doi.org/10.3390/nano12091534

  6. Sikorski MJ, Hazen TH, Desai SN, Nimarota-Brown S, Tupua S, Sialeipata M, Rambocus S, Ingle DJ, Duchene S, Ballard SA, Valcanis M, Zufan S, Ma J, Sahl JW, Maes M, Dougan G, Thomsen RE, Robins-Browne RM, Levine MM, Rasko DA. Persistence of rare Salmonella Typhi genotypes susceptible to first-line antibiotics in the remote islands of Samoa. mBio. 2022;13(5):e0192022. https://doi.org/10.1128/mbio.01920-22

  7. Stevens EL, Carleton HA, Beal J, Tillman GE, Lindsey RL, Lauer AC, Pightling A, Jarvis KG, Ottesen A, Ramachandran P, Hintz L, Lee S, Katz, Jason P, Folster JM, Whichard E, Trees RE, Timme P, Mcdermott B, Wolpert M, Bazaco S, Zhao S, Lindley BB, Bruce PM, Griffin M, Hoffmann M, Wise R, Tauxe P, Gerner-Smidt M. Musser, Chris Braden. Use of whole genome sequencing by the federal interagency collaboration for genomics for food and feed safety in the United States. J Food Prot. 2022;85(5):755–772. https://doi.org/10.4315/JFP-21-437

  8. Hunt M, De Silva N, Otto TD, Parkhill J, Keane J, Harris SR. Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biol. 2015;16:294. https://doi.org/10.1186/s13059-015-0849-0

  9. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9(11):e112963. https://doi.org/10.1371/journal.pone.0112963

  10. Wang W, Hoffmann M, Laasri A, Jacobson AP, Melka D, Curry PE, Hammack TS, Zheng J. Complete genome sequence of Salmonella enterica subsp. enterica serovar Minnesota strain. Genome Biol Evol. 2017;9(10):2727–31. https://doi.org/10.1093/gbe/evx209

  11. Zheng J, Luo Y, Reed E, Bell R, Brown EW, Hoffmann M. Whole-genome comparative analysis of Salmonella enterica Serovar Newport strains reveals lineage-specific divergence. Genome Biol Evol. 2017;9(4):1047–50. https://doi.org/10.1093/gbe/evx065

  12. The 217 genomes were under the same BioProject number. PRJNA186035 - NCBI. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA186035

Funding

This study was supported by funding from the FDA Human Foods Program Intramural Funds.

Author information

Contributions

M.H. designed the study. J.J. and M.H. performed all sequencing and wet lab work. Y.L. assembled the sequencing data and closed the genomes. M.H. and M.B. submitted the closed genomes and raw reads to NCBI, respectively. Y.L. wrote the manuscript. All authors read, provided feedback, and approved the final manuscript.

Corresponding author

Correspondence to Yan Luo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Luo, Y., Jang, J.H., Balkey, M. et al. 217 closed Salmonella reference genomes using PacBio sequencing. BMC Genom Data 26, 15 (2025). https://doi.org/10.1186/s12863-025-01304-7

  • DOI: https://doi.org/10.1186/s12863-025-01304-7

Keywords