Pan1c: a Snakemake workflow for creating pangenomes at chromosome scale.
The workflow uses a set of Apptainer images:
- PanGeTools (Odgi, VG, ...): https://forgemia.inra.fr/alexis.mergez/pangetools
- PanGraTools (PGGB): https://forgemia.inra.fr/alexis.mergez/pangratools
- Pan1c-Apps (Python, Snakemake): https://forgemia.inra.fr/alexis.mergez/pan1capps
> An example of input files and a config file is available in `example/`.
# Prepare your data
This workflow accepts chromosome-level as well as contig-level assemblies, but requires a reference assembly.
**Fasta files need to be compressed** using **bgzip** (included in [PanGeTools](https://forgemia.inra.fr/alexis.mergez/pangetools)).
Sequence names of the reference **must follow this pattern**: `<sample>#<haplotype>#<contig or chromosome name>`.
For example, CHM13 chromosomes (haploid) must be named `CHM13#1#chr..`. Only the reference needs to follow this pattern for its sequence names. Sequences from the other haplotypes are renamed based on the reference and on their respective fasta file names.
Fasta file names **must also follow a pattern**: `<sample>.hap<haplotype>.fa.gz`. For CHM13 again, the fasta file should be named `CHM13.hap1.fa.gz`.
See [PanSN](https://github.com/pangenome/PanSN-spec) for more info on sequence naming.
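As a rough illustration of the two naming rules above, the following Python sketch parses a fasta file name and builds the matching PanSN-style sequence name. The helper names (`parse_fasta_name`, `pansn_name`) and the regular expression are assumptions made for this example. They are not part of the workflow, whose own renaming logic may differ. In particular, the regex assumes the haplotype is an integer and that the sample name contains no `#`.

```python
import re

# Assumed pattern for `<sample>.hap<haplotype>.fa.gz`
# (hypothetical helper, not part of Pan1c).
FASTA_NAME = re.compile(r"^(?P<sample>[^#/]+)\.hap(?P<haplotype>\d+)\.fa\.gz$")


def parse_fasta_name(filename):
    """Return (sample, haplotype) from a fasta file name, or raise ValueError."""
    match = FASTA_NAME.match(filename)
    if match is None:
        raise ValueError(
            f"{filename!r} does not match <sample>.hap<haplotype>.fa.gz"
        )
    return match.group("sample"), int(match.group("haplotype"))


def pansn_name(sample, haplotype, contig):
    """Build a PanSN-style sequence name: <sample>#<haplotype>#<contig>."""
    return f"{sample}#{haplotype}#{contig}"


sample, haplotype = parse_fasta_name("CHM13.hap1.fa.gz")
print(pansn_name(sample, haplotype, "chr1"))  # CHM13#1#chr1
```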
> Note: Input files should be read-only to prevent Snakemake from modifying them (which seems to happen in some rare cases).
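
One way to make the inputs read-only is to clear their write permission bits before launching the workflow. The Python sketch below does this for every `*.fa.gz` file in a directory. The function name `make_read_only` is hypothetical and not part of the workflow; on a POSIX system, `chmod a-w data/haplotypes/*.fa.gz` has the same effect.

```python
import stat
from pathlib import Path

# Write permission bits for user, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(directory):
    """Clear the write bits of every *.fa.gz file in `directory`.

    Returns the list of files whose permissions were updated.
    """
    changed = []
    for path in sorted(Path(directory).glob("*.fa.gz")):
        mode = stat.S_IMODE(path.stat().st_mode)
        path.chmod(mode & ~WRITE_BITS)
        changed.append(path)
    return changed
```

For instance, `make_read_only("data/haplotypes")` could be called once after the haplotypes have been copied into place.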
# Download apptainer images
Before running the workflow, some Apptainer images need to be downloaded. Use the `getApps.sh` script to do so:
```
./getApps.sh -a <apps directory>
```
# Running the workflow
Clone this repository and create a `data/haplotypes` directory where you will place all your haplotypes.
Update the reference name and the apptainer image directory in `config.yaml`.
Then, modify the variables in `runSnakemake.sh` to match your requirements (number of threads, memory, job name, email, etc.).
Navigate to the root directory of the repository and execute `sbatch runSnakemake.sh`!
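Before submitting the job, it can help to confirm that the repository layout matches the steps above. The following Python sketch is a minimal pre-flight check under stated assumptions: the function name `preflight` is hypothetical, and the list of required files (`Snakefile`, `config.yaml`, `runSnakemake.sh`) is taken from the file names mentioned in this README. It does not inspect the content of `config.yaml`.

```python
from pathlib import Path


def preflight(repo_root):
    """Return a list of problems; an empty list means the layout looks ready."""
    root = Path(repo_root)
    problems = []

    # Files expected at the root of the cloned repository.
    for name in ("Snakefile", "config.yaml", "runSnakemake.sh"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")

    # Directory that should hold the bgzip-compressed haplotypes.
    haplotypes = root / "data" / "haplotypes"
    if not haplotypes.is_dir():
        problems.append("missing directory: data/haplotypes")
    elif not any(haplotypes.glob("*.fa.gz")):
        problems.append("no *.fa.gz files in data/haplotypes")

    return problems
```

Calling `preflight(".")` from the root of the repository returns an empty list when nothing obvious is missing.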
# Outputs
The workflow generates several key files:
- Aggregated graph combining all chromosome-scale graphs (`output/pan1c.pggb.<panname>.gfa`)
- Chromosome-scale graphs (`data/chrGraphs/chr<id>.gfa`)
- Panacus HTML reports for each chromosome graph (`output/panacus.reports/chr<id>.histgrowth.html`)
- Statistics on input sequences, graphs and resources used by the workflow
- PAV matrices (optional) for each chromosome graph (`output/pav.matrices/chr<id>.pav.matrix.tsv`)
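To check that a run produced the key files listed above, the expected paths can be built from the pangenome name and the chromosome identifiers. The Python sketch below uses the path patterns from the list. The function names `expected_outputs` and `missing_outputs` are hypothetical, and substituting `<id>` directly into `chr<id>` is an assumption made for illustration.

```python
from pathlib import Path


def expected_outputs(panname, chromosome_ids, with_pav=False):
    """List the key output paths, relative to the workflow root."""
    paths = [Path("output") / f"pan1c.pggb.{panname}.gfa"]
    for cid in chromosome_ids:
        paths.append(Path("data/chrGraphs") / f"chr{cid}.gfa")
        paths.append(
            Path("output/panacus.reports") / f"chr{cid}.histgrowth.html"
        )
        if with_pav:  # PAV matrices are optional
            paths.append(
                Path("output/pav.matrices") / f"chr{cid}.pav.matrix.tsv"
            )
    return paths


def missing_outputs(root, panname, chromosome_ids, with_pav=False):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [
        path
        for path in expected_outputs(panname, chromosome_ids, with_pav)
        if not (root / path).exists()
    ]
```

An empty list from `missing_outputs(".", "<panname>", [...])` suggests that the key files are in place. It does not validate their content.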
Before running the workflow, the input directory should look like this:
```
├── data
│   └── haplotypes
│       ├── ref.hap<x>.fa.gz
│       ├── samp1.hap<x>.fa.gz
│       └── ...
```
The following tree is not exhaustive: temporary files are omitted and only key files are listed.
In this example, the pangenome is named `06AT-v3`.
```
Pan1c-06AT-v3
├── chrInputs
│
├── config.yaml
├── data
│   ├── chrGraphs
│   │   ├── chr<id>
│   │   ├── chr<id>.gfa
│   │   └── graphsList.txt
│   ├── chrInputs
│   │   └── chr<id>.fa.gz
│   ├── haplotypes
│   └── hap.ragtagged
│       ├── <sample>.hap<hid>
│       └── <sample>.hap<hid>.ragtagged.fa.gz
├── logs
│   ├── pan1c.pggb.06AT-v3.logs.tar.gz
│   └── pggb
│       ├── chr<id>.pggb.cmd.log
│       └── chr<id>.pggb.time.log
├── output
│   ├── figures
│   │   ├── chr<id>.1Dviz.png
│   │   └── chr<id>.pcov.png
│   ├── stats
│   │   ├── pan1c.pggb.06AT-v3.core.stats.tsv
│   │   ├── pan1c.pggb.06AT-v3.chrGraph.general.stats.tsv
│   │   └── pan1c.pggb.06AT-v3.chrGraph.path.stats.tsv
│   ├── pan1c.pggb.06AT-v3.gfa
│   ├── panacus.reports
│   │   └── chr<id>.histgrowth.html
│   └── chr<id>.stats.tsv
├── Pan1c-06AT-v3.log
├── README.md
├── runSnakemake.sh
├── scripts
│   └── ...
├── Snakefile
└── workflow.svg
```
This DAG shows the workflow for a pangenome of `Saccharomyces cerevisiae` using the `R64` reference.