Using CLI Tools for Federated FASTQ QC
Assumed Knowledge
This guide assumes you're already familiar with the concepts shown in the VCF QC tutorial (federated execution model, analyzer vs. aggregator roles, project / datastore setup, approvals). If not, read that first: see VCF QC Guide plus the background docs on Coding an Analysis and the Core SDK.
Summary
This tutorial shows how to run a simple, single‑round federated FASTQ quality control (QC) analysis in FLAME using the external command‑line tool FastQC and the provided reference script fastq_qc.py. It is an MVP / demonstration workflow and is not a substitute for comprehensive sequencing data QC or clinical validation pipelines.
Download
Download the full reference script: fastq_qc.py
Goal
Show how to use a command‑line tool (FastQC in our case) inside a single‑round federated analysis: handling temporary output, enforcing runtime constraints (quiet mode + timeout), parsing only the minimal required artifacts, and aggregating per‑file module statuses and basic stats across nodes without moving raw read data.
You will learn the exact CLI invocation pattern, where transient files live, and how to reproduce a node’s result locally using plain shell commands before (or after) federation.
Why Python?
FLAME provides no StarModel for Bash, so Python serves as the wrapper around our CLI tools.
What the Script Does
Condensed overview (details parallel the VCF QC example):
- Analyzer writes each candidate FASTQ object to a temp file and runs: fastqc --quiet --outdir <tmp> <file>
- Parses fastqc_data.txt (basic stats) and summary.txt (module statuses).
- Fails a file on structural / runtime issues, zero sequences, missing required stats, or any module with status FAIL.
- Marks warnings if one or more modules report WARN.
- Aggregator concatenates per‑node results into one JSON; single round (simple_analysis=True).
Anything not listed here works identically to the VCF QC pattern (data fan‑out, approvals, convergence logic).
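The parsing and pass/fail rules from the overview can be sketched roughly as follows. This is a minimal illustration, not the code of fastq_qc.py: the function names, the REQUIRED_STATS tuple, and the snake_case module naming in the reason string are assumptions. It relies only on the tab-separated layout of FastQC's summary.txt (status, module, filename) and the ">>Basic Statistics" block of fastqc_data.txt.

```python
# Hypothetical helpers; names and the exact set of required stats are assumptions.
REQUIRED_STATS = ("Total Sequences", "Sequence length", "%GC")


def parse_summary(text):
    """Map module name -> status from summary.txt (STATUS<TAB>module<TAB>filename)."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            statuses[parts[1].strip()] = parts[0].strip().upper()
    return statuses


def parse_basic_stats(text):
    """Collect the key/value rows of the '>>Basic Statistics' module."""
    stats, inside = {}, False
    for line in text.splitlines():
        if line.startswith(">>Basic Statistics"):
            inside = True
        elif line.startswith(">>END_MODULE"):
            if inside:
                break  # only the first module is needed
        elif inside and not line.startswith("#") and "\t" in line:
            key, value = line.split("\t", 1)
            stats[key.strip()] = value.strip()
    return stats


def _slug(module_name):
    # "Per tile sequence quality" -> "per_tile_sequence_quality"
    return module_name.lower().replace(" ", "_")


def evaluate(stats, statuses):
    """Apply the fail/warn rules and return a small result dict."""
    if any(key not in stats for key in REQUIRED_STATS):
        return {"pass": False, "warnings": False, "reason": "missing required stats"}
    total = stats["Total Sequences"]
    if not total.isdigit() or int(total) == 0:
        return {"pass": False, "warnings": False, "reason": "zero sequences"}
    failed = sorted(_slug(m) for m, s in statuses.items() if s == "FAIL")
    warned = sorted(_slug(m) for m, s in statuses.items() if s == "WARN")
    if failed:
        return {"pass": False, "warnings": bool(warned),
                "reason": "FAIL: " + ", ".join(failed)}
    reason = "WARN: " + ", ".join(warned) if warned else ""
    return {"pass": True, "warnings": bool(warned), "reason": reason}
```

Keeping parsing and evaluation separate makes it easy to check the rules against the text files of a local FastQC run before federating.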
Prerequisites
- A project (proposal) with at least one analyzer node and one aggregator node approved (see Project Guide).
- The genomics master image available (contains pysam and other basic genomics tools).
- MinIO (S3) datastores configured on each participating node. See admin docs for bucket setup: Bucket Setup & Data Store Management.
Filenames & Privacy
FASTQ filenames (object keys) are included verbatim in aggregated outputs. Ensure they do not contain sensitive identifiers. If needed, anonymize beforehand.
CLI Invocation Used
Each file is processed via (conceptually):
    import subprocess
    import tempfile

    with tempfile.TemporaryDirectory() as temp_dir:  # Clean workspace
        cmd = ["fastqc", "--quiet", "--outdir", temp_dir, path]
        result = subprocess.run(  # Controlled execution wrapper
            cmd,
            capture_output=True,  # Capture stdout/stderr instead of letting them leak
            text=True,  # Easier error string handling
            timeout=300,  # Hard upper bound per file (seconds)
        )
        if result.returncode != 0:
            # Mark file failed with sanitized reason (no raw stderr leakage)
            ...
        # Extract minimal metrics; discard directory when context exits
The design above deliberately places each FASTQ file in its own clean workspace (TemporaryDirectory) so concurrent processing cannot collide on filenames and every transient artifact (unzipped FastQC output, extracted text files) is guaranteed to be purged automatically on scope exit.
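As a hedged sketch of where the transient artifacts live: by default FastQC writes a <name>_fastqc.zip (plus an HTML report) into the output directory, and the archive contains fastqc_data.txt and summary.txt. The helper below, whose name is an assumption and which is not taken from fastq_qc.py, reads both text files straight from the archive so nothing needs to be unpacked by hand.

```python
import glob
import os
import zipfile


def read_fastqc_texts(outdir):
    """Return (fastqc_data, summary) text from the single *_fastqc.zip in outdir.

    Returns None if the archive or either text file is missing.
    """
    archives = glob.glob(os.path.join(outdir, "*_fastqc.zip"))
    if len(archives) != 1:
        return None  # each workspace is expected to hold exactly one report
    wanted = {"fastqc_data.txt": None, "summary.txt": None}
    with zipfile.ZipFile(archives[0]) as zf:
        for name in zf.namelist():
            base = name.rsplit("/", 1)[-1]
            if base in wanted:
                wanted[base] = zf.read(name).decode("utf-8", errors="replace")
    if None in wanted.values():
        return None
    return wanted["fastqc_data.txt"], wanted["summary.txt"]
```

Because the function only needs the output directory, it can be called inside the TemporaryDirectory block, before the workspace is purged.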
A per‑file temporary copy is necessary because FastQC (like many classic bioinformatics CLI tools such as samtools) operates on real filesystem paths; some tools also rely on filename extensions to auto‑detect compression (e.g. distinguishing gzipped input).
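One way to create such a copy while preserving the extension is sketched below. The helper name, the fixed "input" base name, and the list of accepted suffixes are assumptions for illustration; the reference script may filter and name files differently.

```python
import os

# Assumed set of accepted FASTQ extensions (hypothetical, adjust as needed).
KNOWN_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def write_temp_copy(data, key, workdir):
    """Write object bytes to workdir, keeping the FASTQ suffix of the object key.

    Returns the path of the copy, or None if the key has no known suffix.
    """
    suffix = next((s for s in KNOWN_SUFFIXES if key.lower().endswith(s)), None)
    if suffix is None:
        return None  # not a candidate FASTQ object
    # A fixed base name avoids unsafe characters from object keys on disk.
    path = os.path.join(workdir, "input" + suffix)
    with open(path, "wb") as handle:
        handle.write(data)
    return path
```

Checking the longer suffixes first ensures that a key ending in .fastq.gz keeps its .gz part, which is what lets a tool recognize compressed input.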
The explicit timeout=300 acts as a safety valve: a single pathological, corrupt or unexpectedly huge file cannot monopolize runtime resources or stall the federated round. Increase this value only if you routinely process very large read sets and have validated performance.
Failure reasons are intentionally sanitized: the script emits generic, high‑signal messages instead of raw FastQC stderr or stack traces, reducing the risk of leaking sensitive sequence fragments or environment internals across nodes.
If a timeout or execution failure occurs, the script simply marks that specific file as failed (with a concise reason) and continues processing the remaining files so one bad input never invalidates the entire node contribution.
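The timeout and sanitized-failure behaviour described above could look roughly like this. It is a sketch under assumptions: run_tool, check_files, and the reason strings are hypothetical and not taken from fastq_qc.py. Only subprocess.run and the exceptions it raises are standard library behaviour.

```python
import os
import subprocess


def run_tool(cmd, timeout=300):
    """Run a CLI command and return (ok, reason); reason never contains stderr."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "tool timed out"
    except FileNotFoundError:
        return False, "tool executable not found"
    except OSError:
        return False, "tool could not be started"
    if result.returncode != 0:
        # Report only the exit code, never the captured output.
        return False, f"tool exited with code {result.returncode}"
    return True, ""


def check_files(paths, make_cmd, timeout=300):
    """Process every path; a failure marks that file only and the loop continues."""
    results = []
    for path in paths:
        ok, reason = run_tool(make_cmd(path), timeout)
        results.append({"file": os.path.basename(path), "pass": ok, "reason": reason})
    return results
```

Passing the command builder as make_cmd keeps the runner testable without FastQC installed; for the real tool it would return the fastqc command list shown earlier.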
Output Structure
Example real output:
{
  "overall_pass": false,
  "warnings_present": true,
  "overall_total": 2,
  "failed_nodes": ["e58721f5-b971-4028-bbf7-362a65a0e660"],
  "nodes": [
    {
      "node_pass": true,
      "warnings_present": true,
      "valid_file_count": 2,
      "invalid_file_count": 0,
      "files": [
        {
          "file": "SRR062634.filt.fastq",
          "size_bytes": 80755971,
          "pass": true,
          "warnings": true,
          "reason": "WARN: per_tile_sequence_quality",
          "total_sequences": 308846,
          "sequence_length": 100,
          "gc_content": 40.0
        }
        // ... second file ...
      ],
      "node_id": "2a1828d3-e52a-4805-8dae-62eab8083031"
    },
    // ... second node ...
  ]
}
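To triage an aggregated result, a small reader such as the following may help. It is a sketch that uses only the field names visible in the example output; the function name summarize is hypothetical, and because the example does not show what a failed node's entry contains, every lookup falls back to a default.

```python
def summarize(report):
    """Return one tab-separated line per file that failed or raised warnings."""
    lines = []
    for node in report.get("nodes", []):
        node_id = node.get("node_id", "?")
        for entry in node.get("files", []):
            if not entry.get("pass", False):
                state = "FAIL"
            elif entry.get("warnings", False):
                state = "WARN"
            else:
                continue  # clean files are not listed
            lines.append("\t".join(
                [node_id, entry.get("file", "?"), state, entry.get("reason", "")]))
    return lines
```

Load the result with json.load first; the // placeholders in the example above are elisions for display and are not valid JSON.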
Troubleshooting
Issue | Hint |
---|---|
FastQC executable not found | Wrong image selected; use the genomics master image. |
FastQC timeout | Increase the timeout or subset the reads. |
All files warn on the same modules | Investigate data quality locally with the full HTML report. |
Node fails with no files | Check file extensions and datastore mapping; the key restriction may be incorrect. |
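For the first row, a quick preflight check can confirm whether the executable is on the PATH inside the selected image. This is a minimal sketch: the helper name is hypothetical, and the executable name fastqc is the one used in the invocation earlier in this guide.

```python
import shutil


def tool_available(name="fastqc"):
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(name) is not None
```

Running this check once at analyzer start gives a clear early signal instead of one failure per file.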
Author: Jules Kreuer