Assemble a bacterial genome from scratch
A hands-on tour of the pipeline I use for every bacterial isolate I sequence — from raw reads to an annotated, comparable genome. You don't need anything installed; the command blocks are illustrative, and the charts simulate realistic output.
step 01 Set the stage
You have a bacterial isolate — let's call it S. maltophilia SVIA2. You've purified its DNA, handed it to a sequencing core, and received back a pair of FASTQ files: forward reads (R1) and reverse reads (R2). Our goal is to turn those millions of short sequences into one coherent, annotated genome.
Paired-end sequencing reads both ends of the same DNA fragment, so we know roughly how far apart the two reads are. That distance information is gold later — it helps the assembler resolve repeats.
-rw-r--r-- 1 tem staff 1.2G apr 19 09:14 SVIA2_R1.fastq.gz
-rw-r--r-- 1 tem staff 1.2G apr 19 09:14 SVIA2_R2.fastq.gz
$ zcat raw/SVIA2_R1.fastq.gz | head -4
@M01234:15:000-ABCDE:1:1101:15789:1320 1:N:0:1
GATCGGAAGAGCACACGTCTGAACTCCAGTCA…
+
AAAAAFFFFFFFFGGGGGGGGGGHHHGGHHH…
Each 4-line block is one read: a header, the nucleotide sequence, a separator, and a quality string where each character is a confidence score. That's what we're working with — millions of these.
step 02 Meet your reads
Before doing anything clever, always count. How many reads? How long are they? What's the expected coverage?
Coverage is the number of times, on average, each base in the genome is sequenced. For bacterial isolates we aim for ~80–100× with short reads — enough that errors in individual reads get voted out by consensus, but not so much that the assembler chokes on redundancy.
- Reads × length ÷ genome size ≈ coverage
- 8 M reads × 151 bp ÷ 4.8 Mb ≈ 251× — plenty for S. maltophilia.
step 03 Quality control (FastQC & trimming)
Sequencers are wonderfully reliable at the start of a read and degrade toward the end. Plot per-base quality and you'll see it — a high plateau that slumps at the 3' end. Here's the typical shape, simulated:
Nothing alarming — bases 1–120 are solid, 130-plus dips. Time to trim.
raw/SVIA2_R1.fastq.gz raw/SVIA2_R2.fastq.gz \
qc/R1_paired.fq.gz qc/R1_unpaired.fq.gz \
qc/R2_paired.fq.gz qc/R2_unpaired.fq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50
Input Read Pairs: 8,214,332
Both Surviving: 8,011,410 (97.52%)
Forward Only: 142,881 · Reverse Only: 38,204
Dropped: 21,837 (0.27%)
Keep >95% of pairs after trimming. If you're losing 20% of your data, something is wrong with the library or adapters — go back to the sequencing core before assembling.
step 04 Assemble with SPAdes
SPAdes builds a de Bruijn graph from overlapping k-mers in your reads, then walks the graph to reconstruct long, contiguous sequences — contigs. For an isolate you won't usually get a single perfect chromosome; you'll get a set of contigs that together cover the genome.
-1 qc/R1_paired.fq.gz -2 qc/R2_paired.fq.gz \
-o asm/ -t 16 -k 21,33,55,77,99,127
==> Reading input
==> K-mer counting (k=21,33,55,77,99,127)
==> Assembly graph construction
==> Contig extraction & scaffolding
==> Finished. See asm/scaffolds.fasta
The resulting contigs — sorted longest to shortest — might look like this:
- N50 ≈ 412,803 bp — half the genome sits in contigs longer than this.
- Total assembly ≈ 4.82 Mb, close to the expected S. maltophilia genome size.
- GC content ≈ 66.3%, which matches the genus. A wildly different GC would hint at contamination.
step 05 Annotate with Prokka
A sequence is useful; genes are what we care about. Prokka hands your contigs to a pipeline of tools (Prodigal for CDS prediction, HMMER against curated databases, Aragorn for tRNAs, Barrnap for rRNAs) and produces a GFF you can drop into IGV.
--genus Stenotrophomonas --species maltophilia \
--cpus 16 asm/scaffolds.fasta
[predict] 4,401 CDS
[predict] 74 tRNAs · 6 rRNAs · 1 tmRNA
[done] wrote annot/SVIA2.gff, .gbk, .faa, .tbl
A tiny slice of what came back — some of which becomes the story you tell in the paper:
Two overlapping stories appear in a single annotated genome: a metabolic one (what can this bug eat?) and a resistance one (what can it tolerate?). Both were central to the Stenotrophomonas sp. Pemsol story — PAH degradation and MDR on one chromosome.
step 06 Compare & publish
One genome alone is a snapshot. Placed next to relatives, it becomes a datapoint. Before submitting to GenBank, I run a quick comparison against published strains:
Nothing surprising — SVIA2 sits comfortably in its genus. That's the null result you want before digging into what makes your isolate distinct (in SVIA2's case: an unusual configuration of PAH-catabolic clusters).
Then: deposit to GenBank / ENA, write a short Microbiology Resource Announcement, and link back here so students can see the whole pipeline end-to-end.
Next up (drafts): TUT.02 — how NGS sequencers actually work · TUT.03 — AMR from a genome point of view. Both coming this quarter.