# Pan1c (Pangenome at chromosome level) workflow

Pan1c: a Snakemake workflow for creating chromosome-scale pangenomes.
The workflow uses a set of Apptainer images:
- PanGeTools (Odgi, VG, ...): https://forgemia.inra.fr/alexis.mergez/pangetools
- Pan1c-Apps (Python, Snakemake): https://forgemia.inra.fr/alexis.mergez/pan1capps
> An example of input files and a config file is available in `example/`.  

# Minimum image version

- PanGeTools >= v1.10.10
- Pan1c-env >= v1.1.1
- Pan1c-box >= v1.1.2
- minigraph-cactus >= v2.1.4b

# Prepare your data

This workflow accepts both chromosome-level and contig-level assemblies, but it requires a reference assembly.  
**Fasta files need to be compressed** using **bgzip** (included in [PanGeTools](https://forgemia.inra.fr/alexis.mergez/pangetools)).
Sequence names of the reference **must follow this pattern**:  
`<sample>#<haplotype>#<contig or chromosome name>`  
For example, the chromosomes of CHM13 (haploid) must be named `CHM13#1#chr..`. Only the reference needs to follow this pattern for its sequence names; the sequences of the other haplotypes are renamed based on the reference and their respective FASTA file names. 
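
The naming rule for reference sequences can be checked before launching the workflow. The function below is a minimal sketch, not part of Pan1c: the name `bad_pansn_headers` is hypothetical, and it assumes the haplotype field is an integer and that sample and sequence names contain no `#`.

```shell
# Hypothetical helper (not shipped with Pan1c): read an uncompressed FASTA
# on stdin and print every sequence name that does NOT match
# <sample>#<haplotype>#<contig or chromosome name>.
# Assumption: <haplotype> is an integer.
bad_pansn_headers() {
    # Keep header lines, drop the leading '>' and any description after
    # the first whitespace, then print the names that break the pattern.
    grep '^>' \
        | sed 's/^>//; s/[[:space:]].*$//' \
        | grep -Ev '^[^#]+#[0-9]+#[^#]+$' || true
}

# Example (bgzip output is gzip-compatible, so gzip can decompress it):
#   gzip -dc CHM13.hap1.fa.gz | bad_pansn_headers
```

An empty output means every sequence name in the reference matches the pattern.
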
FASTA file names **must also follow a pattern**:  
`<sample>.hap<haplotype>.fa.gz`  
Taking CHM13 again as an example, the FASTA file should be named `CHM13.hap1.fa.gz`.  
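
As an illustration of the file naming rule, the sketch below splits a file name into its sample and haplotype parts. The function name `parse_hap_name` is hypothetical, and the check assumes the haplotype is an integer.

```shell
# Hypothetical helper (not shipped with Pan1c): split a file named
# <sample>.hap<haplotype>.fa.gz into "<sample> <haplotype>".
# Returns 1 if the name does not follow the pattern.
parse_hap_name() {
    name=$(basename "$1")
    case "$name" in
        ?*.hap[0-9]*.fa.gz) ;;      # rough shape check
        *) return 1 ;;
    esac
    stem=${name%.fa.gz}             # e.g. CHM13.hap1
    hap=${stem##*.hap}              # e.g. 1
    sample=${stem%.hap*}            # e.g. CHM13
    case "$hap" in
        ''|*[!0-9]*) return 1 ;;    # assumption: haplotype is an integer
    esac
    printf '%s %s\n' "$sample" "$hap"
}

# Example:
#   parse_hap_name data/haplotypes/CHM13.hap1.fa.gz
```

An uncompressed file can be compressed beforehand with `bgzip CHM13.hap1.fa`, which replaces it with `CHM13.hap1.fa.gz`.
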

See [PanSN](https://github.com/pangenome/PanSN-spec) for more info on sequence naming.  

Ideally, provide only chromosome-level assemblies. However, because the haplotypes are renamed using RagTag, scaffold-level or contig-level assemblies are also accepted. Since RagTag scaffolds each assembly against the "reference" haplotype, it will also scaffold chromosome-level assemblies that contain unplaced scaffolds or contigs. If you don't want this behavior, remove all non-chromosome-level sequences from your FASTA files **before** providing them to Pan1c.

# Download apptainer images

Before running the workflow, some Apptainer images need to be downloaded. Use the `getApps.sh` script to do so:
```
./getApps.sh -a <apps directory>
``` 

> Make sure to use the latest version, or the workflow might fail with errors.

# Running the workflow

Clone this repository and create a `data/haplotypes` directory where you will place all your haplotypes.  
Update the reference name and the apptainer image directory in `config.yaml`.  
Then, modify the variables in `runSnakemake.sh` to match your requirements (number of threads, memory, job name, email, etc.).  
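
The first step above (creating `data/haplotypes` and placing the haplotypes in it) could be scripted roughly as follows. This is only a sketch: the function name `stage_haplotypes` and the source directory are hypothetical, and it simply copies every file whose name looks like `<sample>.hap<haplotype>.fa.gz`.

```shell
# Hypothetical helper (not shipped with Pan1c): copy haplotype FASTA files
# from a source directory into data/haplotypes (created if missing).
stage_haplotypes() {
    src=$1
    dest=${2:-data/haplotypes}
    mkdir -p "$dest" || return 1
    for f in "$src"/*.hap*.fa.gz; do
        [ -e "$f" ] || continue     # the glob matched nothing
        cp "$f" "$dest/" || return 1
    done
}

# Example, run from the root of the cloned repository:
#   stage_haplotypes /path/to/my/assemblies
```

For large genomes, replacing `cp` with `ln -s` (using absolute paths) avoids duplicating the files.
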

## Single machine mode

Navigate to the root directory of the repository and execute `sbatch runSnakemake.sh`!
The default script uses a single node and runs everything on it. This method only requires Apptainer but isn't the most efficient for job distribution.

## Cluster execution

To execute each step as a separate SLURM job, create a dedicated conda environment with this command:
```
conda create -n Pan1c -c conda-forge -c bioconda snakemake=8.4.7 snakemake-executor-plugin-slurm
```
In this mode, a single job runs Snakemake, which in turn submits the other jobs. To use it, configure `runSnakemakeSLURM.sh` and submit it using `sbatch`.
> If you get out-of-memory (OOM) errors, use the `mem_multiplier` setting in `config.yaml` to allocate more memory to jobs.

# Pan1c_View

[Pan1c_View](https://forgemia.inra.fr/philippe.bardou/pan1c_view) is an interface developed by Philippe Bardou for visualizing statistics generated from Pan1c pangenome graphs.  
To use it, extract the Pan1c_View tarball generated by the workflow in the `project` folder of Pan1c_View, then follow [these](https://forgemia.inra.fr/philippe.bardou/pan1c_view#installation) instructions. 
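
The extraction step can be sketched as below. The function name `install_view_data` and both paths are assumptions made for illustration; only `tar` is required.

```shell
# Hypothetical helper (not shipped with Pan1c or Pan1c_View): unpack the
# Pan1c-View tarball into the `project` folder of a Pan1c_View checkout.
install_view_data() {
    tarball=$1      # e.g. output/<pangenome_name>.Pan1c-View.data.tar.gz
    view_dir=$2     # path to your local Pan1c_View checkout
    [ -f "$tarball" ] || { echo "missing tarball: $tarball" >&2; return 1; }
    mkdir -p "$view_dir/project" &&
    tar -xzf "$tarball" -C "$view_dir/project"
}

# Example:
#   install_view_data output/<pangenome_name>.Pan1c-View.data.tar.gz /path/to/pan1c_view
```
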

# Main outputs

The workflow generates several key files:
- Aggregated graph including every chromosome-scale graph (`output/Pan1c.<gtool>.<pangenome_name>.gfa.gz`)  
- Chromosome-scale graphs (`data/chrGraphs/Pan1c.<gtool>.<pangenome_name>.<chromosome>.gfa.gz`)  
- Panacus HTML reports for each chromosome-level graph (`output/panacus.reports/Pan1c.<gtool>.<pangenome_name>.<chromosome>.histgrowth.html`)  
- Statistics on input sequences, graphs and resources used by the workflow (`output/stats`)
- Odgi 1D visualizations of the chromosome-level graphs (`output/chrGraphs.figs`)
- (Optional) Pan1c-View tarball (`output/<pangenome_name>.Pan1c-View.data.tar.gz`)
- (Optional) SyRI structural variant figures (`output/asm.syri.figs`, `chrInput.syri.figs`) 
- (Optional) Quast results on your input haplotypes (`output/Pan1c.<pangenome_name>.quast.report.html`)
- (Optional) Contig composition of chromosomes of your input haplotypes (`output/chr.contig`) 
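
To show how the path templates above expand, the sketch below prints the three graph-related paths for one chromosome. The function name `pan1c_outputs` is hypothetical, and the argument values in the example are placeholders to be replaced with your own settings.

```shell
# Hypothetical helper (not shipped with Pan1c): print the aggregated graph,
# the chromosome graph and the Panacus report paths for one chromosome.
pan1c_outputs() {
    gtool=$1    # value of <gtool>
    name=$2     # value of <pangenome_name>
    chrom=$3    # value of <chromosome>
    printf '%s\n' \
        "output/Pan1c.$gtool.$name.gfa.gz" \
        "data/chrGraphs/Pan1c.$gtool.$name.$chrom.gfa.gz" \
        "output/panacus.reports/Pan1c.$gtool.$name.$chrom.histgrowth.html"
}

# Example: report which expected files are missing after a run.
#   pan1c_outputs "<gtool>" "<pangenome_name>" "<chromosome>" |
#       while read -r f; do [ -e "$f" ] || echo "missing: $f"; done
```
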

# File architecture

```
Pan1c/
├── config.yaml
├── data
│   └── haplotypes
│       ├── ref.hap<x>.fa.gz
│       ├── samp1.hap<x>.fa.gz
│       └── ...
├── example
│   └── ...
├── getApps.sh
├── README.md
├── runSnakemake.sh
├── runSnakemakeSLURM.sh
├── scripts
│   └── ...
└── Snakefile
```

# Example DAG (Saccharomyces cerevisiae example)

This DAG shows the workflow for a pangenome of `Saccharomyces cerevisiae` using the `R64` reference.
![Workflow DAG](example/workflow.svg)

# Authors and acknowledgment

- Alexis Mergez
- Martin Racoupeau
- Christophe Klopp
- Christine Gaspin
- Fabrice Legeai


[pan1c@inrae.fr](mailto:pan1c@inrae.fr) 