# Pan1c (Pangenome at chromosome level) workflow

Pan1c is a Snakemake workflow for building pangenomes at the chromosome scale.
The workflow uses a set of Apptainer images:
- PanGeTools (Odgi, VG, ...): https://forgemia.inra.fr/alexis.mergez/pangetools
- PanGraTools (PGGB): https://forgemia.inra.fr/alexis.mergez/pangratools
- Pan1c-Apps (Python, Snakemake): https://forgemia.inra.fr/alexis.mergez/pan1capps

> Example input files and an example config file are available in `example/`.  

# Prepare your data
This workflow accepts both chromosome-level and contig-level assemblies, but it requires a reference assembly.  
**Fasta files need to be compressed** using **bgzip** (included in [PanGeTools](https://forgemia.inra.fr/alexis.mergez/pangetools)).
Sequence names of the reference **must follow this pattern**: `<sample>#<haplotype>#<contig or chromosome name>`.  
For example, CHM13 chromosomes (haploid) must be named `CHM13#1#chr..`. Only the reference needs to follow this pattern for its sequence names. Sequences of the other haplotypes are renamed based on the reference and their respective fasta file names. 
Fasta file names **must also follow a pattern**: `<sample>.hap<haplotype>.fa.gz`. Using CHM13 again, the fasta file should be named `CHM13.hap1.fa.gz`.  

See [PanSN](https://github.com/pangenome/PanSN-spec) for more info on sequence naming.  
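
The two naming patterns above can be checked before launching the workflow. The following is a minimal sketch, not part of Pan1c: the helper names (`parse_fasta_name`, `pansn_name`, `is_pansn`) are hypothetical, and it assumes that the haplotype identifier is made of digits and that sample names do not contain `#`.

```python
import re

# File name pattern described above: <sample>.hap<haplotype>.fa.gz
# Assumption: the haplotype identifier is numeric and the sample contains no '#'.
FASTA_NAME = re.compile(r"^(?P<sample>[^#]+)\.hap(?P<haplotype>\d+)\.fa\.gz$")


def parse_fasta_name(filename):
    """Return (sample, haplotype) parsed from a name such as 'CHM13.hap1.fa.gz'."""
    match = FASTA_NAME.match(filename)
    if match is None:
        raise ValueError(
            f"{filename!r} does not match <sample>.hap<haplotype>.fa.gz"
        )
    return match.group("sample"), int(match.group("haplotype"))


def pansn_name(sample, haplotype, contig):
    """Build a sequence name of the form <sample>#<haplotype>#<contig>."""
    return f"{sample}#{haplotype}#{contig}"


def is_pansn(name):
    """Check that a name has exactly three non-empty '#'-separated fields."""
    fields = name.split("#")
    return len(fields) == 3 and all(fields)
```

For instance, `parse_fasta_name("CHM13.hap1.fa.gz")` yields the sample and haplotype that `pansn_name` needs to build the prefix expected for the reference sequences.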

> Note: Input files should be read-only to prevent Snakemake from modifying them (which seems to happen in rare cases).  
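
One way to make the input files read-only is to clear their write permission bits. The sketch below is an illustration using only the Python standard library. The function name `make_read_only` is hypothetical, and it assumes that all inputs are regular files placed directly in one directory.

```python
import os
import stat
from pathlib import Path


def make_read_only(directory):
    """Clear the write bits (user, group, other) of every regular file in `directory`.

    Returns the names of the files whose permissions were updated.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            mode = stat.S_IMODE(path.stat().st_mode)
            os.chmod(path, mode & ~write_bits)  # keep read/execute bits untouched
            changed.append(path.name)
    return changed
```

Calling `make_read_only("data/haplotypes")` from the repository root would apply this to the haplotype directory used by the workflow.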

# Download apptainer images
Before running the workflow, some Apptainer images need to be downloaded. Use the `getApps.sh` script to do so:
```
./getApps.sh -a <apps directory>
``` 

# Running the workflow
Clone this repository and create a `data/haplotypes` directory where you will place all your haplotypes.  
Update the reference name and the apptainer image directory in `config.yaml`.  
Then, modify the variables in `runSnakemake.sh` to match your requirements (number of threads, memory, job name, email, etc.).  
Navigate to the root directory of the repository and execute `sbatch runSnakemake.sh`!
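
Before submitting the job, it can help to verify that the layout described above is in place. The following sketch is not part of the repository: `preflight` and its `reference_fasta` argument are hypothetical, and it only checks that the files exist, without reading `config.yaml` (whose keys are not described here).

```python
from pathlib import Path


def preflight(repo_root, reference_fasta):
    """Return a list of layout problems (an empty list means nothing was found).

    `reference_fasta` is the expected file name of the reference,
    for example 'CHM13.hap1.fa.gz' (hypothetical argument).
    """
    root = Path(repo_root)
    problems = []
    # Files that the steps above expect at the repository root.
    for required in ("config.yaml", "runSnakemake.sh", "Snakefile"):
        if not (root / required).is_file():
            problems.append(f"missing file: {required}")
    hap_dir = root / "data" / "haplotypes"
    if not hap_dir.is_dir():
        problems.append("missing directory: data/haplotypes")
    else:
        fastas = sorted(p.name for p in hap_dir.glob("*.fa.gz"))
        if not fastas:
            problems.append("no *.fa.gz files in data/haplotypes")
        elif reference_fasta not in fastas:
            problems.append(f"reference not found: {reference_fasta}")
    return problems
```

An empty list from `preflight(".", "CHM13.hap1.fa.gz")` only means the expected files are present. It does not validate their content.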

# Outputs
The workflow generates several key files :
- Aggregated graph including all chromosome-scale graphs (`output/pan1c.pggb.<panname>.gfa`)  
- Chromosome-scale graphs (`data/chrGraphs/chr<id>.gfa`)  
- Panacus HTML reports for each chromosome graph (`output/panacus.reports/chr<id>.histgrowth.html`)  
- Statistics on input sequences, graphs and resources used by the workflow 
- PAV matrices (optional) for each chromosome graph (`output/pav.matrices/chr<id>.pav.matrix.tsv`)
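
As a quick sanity check on the GFA outputs listed above, the record types of a graph can be counted. This is a generic sketch based on the GFA 1 text format, where each line starts with a record type such as `S` (segment), `L` (link) or `P` (path). It is not how Pan1c computes its own statistics, and `count_gfa_records` is a hypothetical helper.

```python
from collections import Counter


def count_gfa_records(gfa_path):
    """Count the lines of a GFA file by record type (first tab-separated field)."""
    counts = Counter()
    with open(gfa_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:  # skip empty lines
                counts[line.split("\t", 1)[0]] += 1
    return counts
```

For example, `count_gfa_records("data/chrGraphs/chr<id>.gfa")["S"]` gives the number of segments in one chromosome graph, with `<id>` replaced by an actual chromosome identifier.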

# File architecture
## Before running the workflow
```
Pan1c/
├── config.yaml
├── data
│   └── haplotypes
│       ├── ref.hap<x>.fa.gz
│       ├── samp1.hap<x>.fa.gz
│       └── ...
├── example
│   └── ...
├── getApps.sh
├── README.md
├── runSnakemake.sh
├── scripts
│   └── ...
└── Snakefile
```
## After running the workflow (Arabidopsis thaliana example)
The following tree is non-exhaustive for clarity. Temporary files are not listed, but key files are included.
The name of the pangenome is `06AT-v3`.
```
Pan1c-06AT-v3
├── chrInputs
│   
├── config.yaml
├── data
│   ├── chrGraphs
│   │   ├── chr<id>
│   │   ├── chr<id>.gfa
│   │   └── graphsList.txt
│   ├── chrInputs
│   │   └── chr<id>.fa.gz
│   ├── haplotypes
│   └── hap.ragtagged
│       ├── <sample>.hap<hid>
│       └── <sample>.hap<hid>.ragtagged.fa.gz
├── logs
│   ├── pan1c.pggb.06AT-v3.logs.tar.gz
│   └── pggb
│       ├── chr<id>.pggb.cmd.log
│       └── chr<id>.pggb.time.log
├── output
│   ├── figures
│   │   ├── chr<id>.1Dviz.png
│   │   └── chr<id>.pcov.png
│   ├── stats
│   │   ├── pan1c.pggb.06AT-v3.core.stats.tsv
│   │   ├── pan1c.pggb.06AT-v3.chrGraph.general.stats.tsv
│   │   └── pan1c.pggb.06AT-v3.chrGraph.path.stats.tsv
│   ├── pan1c.pggb.06AT-v3.gfa
│   ├── panacus.reports
│   │   └── chr<id>.histgrowth.html
│   └── chrGraphs.stats
│       └── chr<id>.stats.tsv
├── Pan1c-06AT-v3.log
├── README.md
├── runSnakemake.sh
├── scripts
│   └── ...
├── Snakefile
└── workflow.svg
```

# Example DAG (Saccharomyces cerevisiae example)
This DAG shows the workflow for a pangenome of `Saccharomyces cerevisiae` using the `R64` reference.
![Workflow DAG](example/workflow.svg)