▸ TUT.01 · interactive walkthrough · ≈ 25 min

Assemble a bacterial genome from scratch

A hands-on tour of the pipeline I use for every bacterial isolate I sequence — from raw reads to an annotated, comparable genome. You don't need anything installed; the command blocks are illustrative, and the charts simulate realistic output.

step 01 Set the stage

You have a bacterial isolate — let's call it S. maltophilia SVIA2. You've purified its DNA, handed it to a sequencing core, and received back a pair of FASTQ files: forward reads (R1) and reverse reads (R2). Our goal is to turn those millions of short sequences into one coherent, annotated genome.

why pairs?

Paired-end sequencing reads both ends of the same DNA fragment, so we know roughly how far apart the two reads are. That distance information is gold later — it helps the assembler resolve repeats.

~/projects/svia2/

$ ls -lh raw/
-rw-r--r-- 1 tem staff 1.2G apr 19 09:14 SVIA2_R1.fastq.gz
-rw-r--r-- 1 tem staff 1.2G apr 19 09:14 SVIA2_R2.fastq.gz
$ zcat raw/SVIA2_R1.fastq.gz | head -4
@M01234:15:000-ABCDE:1:1101:15789:1320 1:N:0:1
GATCGGAAGAGCACACGTCTGAACTCCAGTCA…
+
AAAAAFFFFFFFFGGGGGGGGGGHHHGGHHH…

Each 4-line block is one read: a header, the nucleotide sequence, a separator, and a quality string where each character is a confidence score. That's what we're working with — millions of these.

step 02 Meet your reads

Before doing anything clever, always count. How many reads? How long are they? What's the expected coverage?

> simulate: inspect reads

waiting…

Coverage is the number of times, on average, each base in the genome is sequenced. For bacterial isolates we aim for ~80–100× with short reads — enough that errors in individual reads get voted out by consensus, but not so much that the assembler chokes on redundancy.

Reads × length ÷ genome size ≈ coverage
8 M reads × 151 bp ÷ 4.8 Mb ≈ 251× — plenty for S. maltophilia.

step 03 Quality control (FastQC & trimming)

Sequencers are wonderfully reliable at the start of a read and degrade toward the end. Plot per-base quality and you'll see it — a high plateau that slumps at the 3' end. Here's the typical shape, simulated:

base 1base 75base 150

Nothing alarming — bases 1–120 are solid, 130-plus dips. Time to trim.

$ trimmomatic PE

$ trimmomatic PE -phred33 \
    raw/SVIA2_R1.fastq.gz raw/SVIA2_R2.fastq.gz \
    qc/R1_paired.fq.gz qc/R1_unpaired.fq.gz \
    qc/R2_paired.fq.gz qc/R2_unpaired.fq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50
Input Read Pairs: 8,214,332
Both Surviving: 8,011,410 (97.52%)
Forward Only: 142,881 · Reverse Only: 38,204
Dropped: 21,837 (0.27%)

rule of thumb

Keep >95% of pairs after trimming. If you're losing 20% of your data, something is wrong with the library or adapters — go back to the sequencing core before assembling.

step 04 Assemble with SPAdes

SPAdes builds a de Bruijn graph from overlapping k-mers in your reads, then walks the graph to reconstruct long, contiguous sequences — contigs. For an isolate you won't usually get a single perfect chromosome; you'll get a set of contigs that together cover the genome.

$ spades.py

$ spades.py --isolate \
-1 qc/R1_paired.fq.gz -2 qc/R2_paired.fq.gz \
-o asm/ -t 16 -k 21,33,55,77,99,127
==> Reading input
==> K-mer counting (k=21,33,55,77,99,127)
==> Assembly graph construction
==> Contig extraction & scaffolding
==> Finished. See asm/scaffolds.fasta

The resulting contigs — sorted longest to shortest — might look like this:

N50 ≈ 412,803 bp — half the genome sits in contigs longer than this.
Total assembly ≈ 4.82 Mb, close to the expected S. maltophilia genome size.
GC content ≈ 66.3%, which matches the genus. A wildly different GC would hint at contamination.

step 05 Annotate with Prokka

A sequence is useful; genes are what we care about. Prokka hands your contigs to a pipeline of tools (Prodigal for CDS prediction, HMMER against curated databases, Aragorn for tRNAs, Barrnap for rRNAs) and produces a GFF you can drop into IGV.

$ prokka

$ prokka --outdir annot/ --prefix SVIA2 \
--genus Stenotrophomonas --species maltophilia \
--cpus 16 asm/scaffolds.fasta
[predict] 4,401 CDS
[predict] 74 tRNAs · 6 rRNAs · 1 tmRNA
[done] wrote annot/SVIA2.gff, .gbk, .faa, .tbl

A tiny slice of what came back — some of which becomes the story you tell in the paper:

nahG · salicylate 1-monooxygenase

phnA · phenanthrene dioxygenase α

smeR · MDR efflux regulator

blaL1 · metallo-β-lactamase

qnrB · quinolone resistance

16S rRNA · 1542 bp

catA · catechol dioxygenase

recA · DNA recombination

smeE · efflux pump

dmpB · 2,3-dihydroxybiphenyl 1,2-dioxygenase

gyrB · DNA gyrase subunit B

alkB · alkane monooxygenase

interpret

Two overlapping stories appear in a single annotated genome: a metabolic one (what can this bug eat?) and a resistance one (what can it tolerate?). Both were central to the Stenotrophomonas sp. Pemsol story — PAH degradation and MDR on one chromosome.

step 06 Compare & publish

One genome alone is a snapshot. Placed next to relatives, it becomes a datapoint. Before submitting to GenBank, I run a quick comparison against published strains:

strain

size (Mb)

GC %

CDS

SVIA2

4.82

66.3

4,401

K279a

4.85

66.3

4,386

R551-3

4.57

66.9

4,112

Pemsol

4.91

66.1

4,528

Nothing surprising — SVIA2 sits comfortably in its genus. That's the null result you want before digging into what makes your isolate distinct (in SVIA2's case: an unusual configuration of PAH-catabolic clusters).

Then: deposit to GenBank / ENA, write a short Microbiology Resource Announcement, and link back here so students can see the whole pipeline end-to-end.

▸ tutorial complete · you've just assembled a genome

Next up (drafts): TUT.02 — how NGS sequencers actually work · TUT.03 — AMR from a genome point of view. Both coming this quarter.

step 1 of 6