# Pan1c (Pangenome at chromosome level) workflow

Pan1c is a Snakemake workflow for building pangenomes at the chromosome scale.
The workflow uses a set of Apptainer images:
- PanGeTools (Odgi, VG, ...): https://forgemia.inra.fr/alexis.mergez/pangetools
- Pan1c-Apps (Python, Snakemake): https://forgemia.inra.fr/alexis.mergez/pan1capps

> Example input files and a config file are available in `example/`.  

# Minimum image versions

- PanGeTools >= v1.10.10
- Pan1c-env >= v1.1.1
- Pan1c-box >= v1.1.2
- minigraph-cactus >= v2.1.4b

# Prepare your data

This workflow accepts chromosome-level as well as contig-level assemblies, but it requires a reference assembly.  
**FASTA files must be compressed** with **bgzip** (included in [PanGeTools](https://forgemia.inra.fr/alexis.mergez/pangetools)).
Sequence names of the reference **must follow this pattern**:  
`<sample>#<haplotype>#<contig or chromosome name>`  
For example, the chromosomes of CHM13 (haploid) must be named `CHM13#1#chr..`. Only the reference needs to follow this pattern for its sequence names. Sequences of the other haplotypes are renamed based on the reference and on the names of their respective FASTA files. 
FASTA file names **must also follow a pattern**:  
`<sample>.hap<haplotype>.fa.gz`  
Taking CHM13 again, the FASTA file should be named `CHM13.hap1.fa.gz`.  

See [PanSN](https://github.com/pangenome/PanSN-spec) for more info on sequence naming.  
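
The naming rules above can be checked before launching the workflow. The following is a minimal sketch, not part of Pan1c: the function names are hypothetical, and the regular expressions assume that the haplotype is an integer and that sample names contain no `#`, `.` or whitespace. The last helper is a heuristic that inspects the first bytes of a file to tell bgzip (BGZF) output from plain gzip output.

```python
import re

# <sample>#<haplotype>#<contig or chromosome name> (required for the reference).
# Assumption: integer haplotype, no '#' or whitespace inside a field.
SEQ_NAME = re.compile(r"^(?P<sample>[^#\s]+)#(?P<haplotype>\d+)#(?P<contig>[^#\s]+)$")

# <sample>.hap<haplotype>.fa.gz
# Assumption: the sample name itself contains no '.'.
FASTA_NAME = re.compile(r"^(?P<sample>[^.\s]+)\.hap(?P<haplotype>\d+)\.fa\.gz$")


def parse_sequence_name(name):
    """Return (sample, haplotype, contig), or None if the name does not match."""
    m = SEQ_NAME.match(name)
    return (m["sample"], int(m["haplotype"]), m["contig"]) if m else None


def parse_fasta_filename(filename):
    """Return (sample, haplotype), or None if the file name does not match."""
    m = FASTA_NAME.match(filename)
    return (m["sample"], int(m["haplotype"])) if m else None


def looks_like_bgzf(header):
    """Heuristic: gzip magic with the FEXTRA flag set and a leading 'BC' subfield."""
    return (
        len(header) >= 14
        and header[:4] == b"\x1f\x8b\x08\x04"
        and header[12:14] == b"BC"
    )


print(parse_sequence_name("CHM13#1#chr1"))       # ('CHM13', 1, 'chr1')
print(parse_fasta_filename("CHM13.hap1.fa.gz"))  # ('CHM13', 1)
```

To test a file on disk, pass its first bytes, for example `looks_like_bgzf(open(path, "rb").read(18))`.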

Ideally, provide only chromosome-level assemblies; however, because the haplotypes are renamed using RagTag, scaffold- or contig-level assemblies can also be given. Since RagTag scaffolds each assembly using the "reference" haplotype, it can scaffold chromosome-level assemblies that also contain unplaced scaffolds or contigs. If you don't want this behavior, remove all non-chromosome-level sequences from your FASTA files **before** providing them to Pan1c.
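
One possible way to prune a FASTA is sketched below. This is an illustration rather than a Pan1c script: `keep_sequences` is a hypothetical helper that works on decompressed FASTA lines and keeps only the records whose ID (the first word after `>`) is in a set you supply. The sequence names in the example are made up.

```python
def keep_sequences(lines, wanted):
    """Yield only the FASTA records whose ID is in `wanted`.

    `lines` is any iterable of text lines (for example an open text file);
    `wanted` is a set of sequence IDs to keep.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited word of the header.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


# Made-up records: two chromosomes and one unplaced scaffold.
fasta = [">chr1 some description\n", "ACGT\n", ">scaffold_12\n", "NNNN\n", ">chr2\n", "GG\n"]
print("".join(keep_sequences(fasta, {"chr1", "chr2"})), end="")
```

The pruned output must be compressed with bgzip again before it is placed in the input directory.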

# Download apptainer images

Before running the workflow, some Apptainer images need to be downloaded. Use the script `getApps.sh` to do so:
```
./getApps.sh -a <apps directory>
``` 

> Make sure to use the latest versions of the images; otherwise the workflow may fail with errors.

# Running the workflow

Clone this repository and create a `data/haplotypes` directory where you will place all your haplotypes.  
Update the reference name and the apptainer image directory in `config.yaml`.  
Then, modify the variables in `runSnakemake.sh` to match your requirements (number of threads, memory, job name, email, etc.).  

## Single machine mode

Navigate to the root directory of the repository and run `sbatch runSnakemake.sh`.
The default script uses a single node and runs everything on it. This method only requires Apptainer, but it isn't the most efficient for job distribution.

## Cluster execution

To execute each step as a separate SLURM job, create a dedicated conda environment with this command:
```
conda create -n Pan1c -c conda-forge -c bioconda snakemake=8.4.7 snakemake-executor-plugin-slurm
```
This works by submitting one job that runs Snakemake, which in turn submits the other jobs. To do so, configure `runSnakemakeSLURM.sh` and submit it using `sbatch`.
> If you get out-of-memory (OOM) errors, use `mem_multiplier` in `config.yaml` to allocate more memory to jobs.

# Pan1c_View

[Pan1c_View](https://forgemia.inra.fr/philippe.bardou/pan1c_view) is an interface developed by Philippe Bardou to visualize statistics generated from Pan1c pangenome graphs.  
To use it, extract the Pan1c_View tarball generated by the workflow in the `project` folder of Pan1c_View, then follow [these](https://forgemia.inra.fr/philippe.bardou/pan1c_view#installation) instructions. 

# Main outputs

The workflow generates several key files :
- Aggregated graph including all chromosome-scale graphs (`output/Pan1c.<gtool>.<pangenome_name>.gfa.gz`)  
- Chromosome-scale graphs (`data/chrGraphs/Pan1c.<gtool>.<pangenome_name>.<chromosome>.gfa.gz`)  
- Panacus HTML reports for each chromosome-level graph (`output/panacus.reports/Pan1c.<gtool>.<pangenome_name>.<chromosome>.histgrowth.html`)  
- Statistics on input sequences, graphs and resources used by the workflow (`output/stats`)
- Odgi 1D visualization of chromosome-level graphs (`output/chrGraphs.figs`)
- (Optional) Pan1c-View tarball (`output/<pangenome_name>.Pan1c-View.data.tar.gz`)
- (Optional) SyRI structural variant figures (`output/asm.syri.figs`, `chrInput.syri.figs`) 
- (Optional) Quast results on your input haplotypes (`output/Pan1c.<pangenome_name>.quast.report.html`)
- (Optional) Contig composition of chromosomes of your input haplotypes (`output/chr.contig`) 
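
To locate the main graph and report files from a script, the path patterns listed above can be filled in as in the sketch below. `expected_outputs` is a hypothetical helper, not part of the workflow, and the values `mytool`, `mypangenome`, `chr01` and `chr02` are placeholders; use the graph tool tag, pangenome name and chromosome names of your own run.

```python
def expected_outputs(gtool, pangenome_name, chromosomes):
    """Build the main output paths, relative to the repository root."""
    prefix = f"Pan1c.{gtool}.{pangenome_name}"
    return {
        "aggregated_graph": f"output/{prefix}.gfa.gz",
        "chromosome_graphs": [
            f"data/chrGraphs/{prefix}.{c}.gfa.gz" for c in chromosomes
        ],
        "panacus_reports": [
            f"output/panacus.reports/{prefix}.{c}.histgrowth.html" for c in chromosomes
        ],
    }


# Placeholder values, for illustration only.
paths = expected_outputs("mytool", "mypangenome", ["chr01", "chr02"])
print(paths["aggregated_graph"])  # output/Pan1c.mytool.mypangenome.gfa.gz
```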

# File architecture

```
Pan1c/
├── config.yaml
├── data
│   └── haplotypes
│       ├── ref.hap<x>.fa.gz
│       ├── samp1.hap<x>.fa.gz
│       └── ...
├── example
│   └── ...
├── getApps.sh
├── README.md
├── runSnakemake.sh
├── runSnakemakeSLURM.sh
├── scripts
│   └── ...
└── Snakefile
```

# Example DAG (Saccharomyces cerevisiae example)

This DAG shows the workflow for a pangenome of `Saccharomyces cerevisiae` using the `R64` reference.
![Workflow DAG](example/workflow.svg)

# Authors and acknowledgment

- Alexis Mergez
- Martin Racoupeau
- Christophe Klopp
- Christine Gaspin
- Fabrice Legeai

# Contact

[pan1c@inrae.fr](mailto:pan1c@inrae.fr)