Pan1c: a Snakemake workflow for building chromosome-scale pangenomes.
The workflow uses a set of Apptainer images:
- PanGeTools (Odgi, VG, ...): https://forgemia.inra.fr/alexis.mergez/pangetools
- Pan1c-Apps (Python, Snakemake): https://forgemia.inra.fr/alexis.mergez/pan1capps
> An example of input files and a config file is available in `example/`.
# Minimum image version
- PanGeTools >= v1.10.10
- Pan1c-env >= v1.1.1
- Pan1c-box >= v1.1.2
- minigraph-cactus >= v2.1.4b
This workflow accepts chromosome-level as well as contig-level assemblies, but it requires a reference assembly.
**Fasta files need to be compressed** using **bgzip** (included in [PanGeTools](https://forgemia.inra.fr/alexis.mergez/pangetools)).
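To check quickly whether a file was compressed with bgzip rather than plain gzip, one option is to inspect its first bytes. The sketch below assumes the usual BGZF layout (gzip header with the extra-field flag set and a `BC` subfield placed first, which is what bgzip writes). `looks_like_bgzf` is a hypothetical helper and is not part of Pan1c.

```python
import struct


def looks_like_bgzf(path):
    """Return True if the file starts with a BGZF block header (hypothetical helper)."""
    with open(path, "rb") as handle:
        header = handle.read(18)
    if len(header) < 18:
        return False
    magic, method, flags = header[0:2], header[2], header[3]
    # gzip magic bytes, DEFLATE method, and the FEXTRA flag (bit 2) must be present.
    if magic != b"\x1f\x8b" or method != 8 or not flags & 4:
        return False
    xlen = struct.unpack("<H", header[10:12])[0]
    # Assumption: the 'BC' subfield with a 2-byte payload comes first in the extra field.
    return xlen >= 6 and header[12:14] == b"BC" and header[14:16] == b"\x02\x00"
```

A file produced by plain `gzip` fails this check because its header has no extra field.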
Sequence names of the reference **must follow this pattern**:
`<sample>#<haplotype>#<contig or chromosome name>`
For example, CHM13 chromosomes (haploid) must be named `CHM13#1#chr..`. Only the reference needs to follow this pattern for its sequence names. Sequences of the other haplotypes are renamed based on the reference and on their respective FASTA file names.
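The reference naming pattern can be checked before launching the workflow. This is a minimal sketch, assuming the haplotype field is an integer and that `#` never appears inside a field. `parse_reference_name` is a hypothetical helper and is not part of Pan1c.

```python
import re

# <sample>#<haplotype>#<contig or chromosome name>
# Assumption: the haplotype is an integer and fields contain no '#' or whitespace.
PANSN_NAME = re.compile(r"^(?P<sample>[^#\s]+)#(?P<haplotype>\d+)#(?P<contig>[^#\s]+)$")


def parse_reference_name(name):
    """Split a reference sequence name into (sample, haplotype, contig)."""
    match = PANSN_NAME.match(name)
    if match is None:
        raise ValueError(f"not a <sample>#<haplotype>#<contig> name: {name!r}")
    return match["sample"], int(match["haplotype"]), match["contig"]


print(parse_reference_name("CHM13#1#chr1"))
```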
FASTA files **must also follow a pattern**:
`<sample>.hap<haplotype>.fa.gz`
With CHM13 again, the FASTA file should be named `CHM13.hap1.fa.gz`.
See [PanSN](https://github.com/pangenome/PanSN-spec) for more info on sequence naming.
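File names can be checked the same way. The sketch below extracts the sample and haplotype from a path, assuming the haplotype is an integer. `parse_fasta_name` is a hypothetical helper and is not part of Pan1c.

```python
import re
from pathlib import Path

# <sample>.hap<haplotype>.fa.gz (assumption: the haplotype is an integer)
FASTA_NAME = re.compile(r"^(?P<sample>.+)\.hap(?P<haplotype>\d+)\.fa\.gz$")


def parse_fasta_name(path):
    """Return (sample, haplotype) for a haplotype FASTA path, or raise ValueError."""
    filename = Path(path).name  # ignore the directory part
    match = FASTA_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a <sample>.hap<haplotype>.fa.gz file name: {filename!r}")
    return match["sample"], int(match["haplotype"])


sample, haplotype = parse_fasta_name("data/haplotypes/CHM13.hap1.fa.gz")
print(f"{sample}#{haplotype}")  # the <sample>#<haplotype> prefix implied by the file name
```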
You should preferably provide chromosome-level assemblies. However, because the haplotypes are renamed using RagTag, it is also possible to give scaffold-level or contig-level assemblies. Since RagTag scaffolds each assembly using the "reference" haplotype, it can also scaffold chromosome-level assemblies that contain unplaced scaffolds or contigs. If you don't want this behavior, remove all non-chromosome-level sequences from your FASTA files **before** providing them to Pan1c.
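Pruning can be done with any FASTA tool. As an illustration, here is a minimal Python sketch that keeps only the records whose names appear in a user-supplied set. `prune_fasta` and the sequence names used in the example are hypothetical, and the result still has to be compressed with bgzip afterwards.

```python
def prune_fasta(lines, keep):
    """Yield only the FASTA lines belonging to records whose name is in `keep`."""
    writing = False
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = name in keep
        if writing:
            yield line


# Hypothetical records: two chromosomes and one unplaced scaffold.
records = [">samp1#1#chrI\n", "ACGT\n", ">samp1#1#scaffold_12\n", "TTTT\n", ">samp1#1#chrII\n", "GG\n"]
print("".join(prune_fasta(records, {"samp1#1#chrI", "samp1#1#chrII"})), end="")
```

For a real file, `lines` could be the handle returned by `gzip.open(path, "rt")`, which also reads bgzip-compressed files.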
Before running the workflow, some Apptainer images need to be downloaded. Use the `getApps.sh` script to do so:
```
./getApps.sh -a <apps directory>
```
> Make sure to use the latest version, or the workflow might return errors.
Clone this repository and create a `data/haplotypes` directory where you will place all your haplotypes.
Update the reference name and the apptainer image directory in `config.yaml`.
Then, modify the variables in `runSnakemake.sh` to match your requirements (number of threads, memory, job name, email, etc.).
Navigate to the root directory of the repository and execute `sbatch runSnakemake.sh`!
The default script uses a single node and runs everything on it. This method only requires Apptainer, but it isn't the most efficient for job distribution.
## Cluster execution
To execute each step as a separate SLURM job, create a dedicated conda environment with this command:
```
conda create -n Pan1c -c conda-forge -c bioconda snakemake=8.4.7 snakemake-executor-plugin-slurm
```
This works by submitting one job that runs Snakemake, which in turn submits the other jobs. To do so, configure `runSnakemakeSLURM.sh` and submit it using `sbatch`.
> If you get out-of-memory (OOM) errors, use `mem_multiplier` in `config.yaml` to allocate more memory to jobs.
# Pan1c_View
[Pan1c_View](https://forgemia.inra.fr/philippe.bardou/pan1c_view) is an interface developed by Philippe Bardou to visualize the statistics generated from Pan1c pangenome graphs.
To use it, extract the Pan1c_View tarball generated by the workflow in the `project` folder of Pan1c_View, then follow [these](https://forgemia.inra.fr/philippe.bardou/pan1c_view#installation) instructions.
# Main outputs
- Aggregated graph including every chromosome-scale graph (`output/Pan1c.<gtool>.<pangenome_name>.gfa.gz`)
- Chromosome-scale graphs (`data/chrGraphs/Pan1c.<gtool>.<pangenome_name>.<chromosome>.gfa.gz`)
- Panacus HTML reports for each chromosome-scale graph (`output/panacus.reports/Pan1c.<gtool>.<pangenome_name>.<chromosome>.histgrowth.html`)
- Statistics on input sequences, graphs and resources used by the workflow (`output/stats`)
- Odgi 1D visualization of chromosome-scale graphs (`output/chrGraphs.figs`)
- (Optional) Pan1c-View tarball (`output/<pangenome_name>.Pan1c-View.data.tar.gz`)
- (Optional) SyRI structural variant figures (`output/asm.syri.figs`, `chrInput.syri.figs`)
- (Optional) Quast results on your input haplotypes (`output/Pan1c.<pangenome_name>.quast.report.html`)
- (Optional) Contig composition of chromosomes of your input haplotypes (`output/chr.contig`)
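
To check that a run produced its main graph outputs, the path templates listed above can be expanded in a small script. This is only a sketch: `expected_outputs` is a hypothetical helper, and the `gtool`, pangenome and chromosome values in the example are placeholders, not values defined by Pan1c.

```python
def expected_outputs(gtool, pangenome_name, chromosomes):
    """Expand the main output path templates for one pangenome (hypothetical helper)."""
    prefix = f"Pan1c.{gtool}.{pangenome_name}"
    paths = {"aggregated_graph": f"output/{prefix}.gfa.gz"}
    for chrom in chromosomes:
        paths[f"graph:{chrom}"] = f"data/chrGraphs/{prefix}.{chrom}.gfa.gz"
        paths[f"panacus:{chrom}"] = (
            f"output/panacus.reports/{prefix}.{chrom}.histgrowth.html"
        )
    return paths


# Placeholder values, chosen for illustration only.
for label, path in expected_outputs("mytool", "mypangenome", ["chr01", "chr02"]).items():
    print(label, path)
```

Each returned path can then be tested with `os.path.exists` from the root of the repository.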
├── data
│   └── haplotypes
│       ├── ref.hap<x>.fa.gz
│       ├── samp1.hap<x>.fa.gz
│       └── ...
This DAG shows the workflow for a *Saccharomyces cerevisiae* pangenome built with the `R64` reference.
# Authors and acknowledgment
- Alexis Mergez
- Martin Racoupeau
- Christophe Klopp
- Christine Gaspin
- Fabrice Legeai