AlphaFold

AlphaFold can predict protein structures with atomic accuracy even where no similar structure is known. See the AlphaFold Homepage.

Available Modules¶

module load AlphaFold/3.0.2

Prerequisite

An extended guide to AlphaFold2 on the Mahuika cluster, with additional information such as visualisation of AlphaFold outputs, can be found here.

Description¶

AlphaFold is a deep-learning system, developed by Google DeepMind, that predicts a protein's three-dimensional structure from its amino acid sequence. It combines evolutionary information from a multiple sequence alignment (MSA) of related proteins with structural templates, and returns per-residue confidence estimates (pLDDT) alongside the predicted coordinates. AlphaFold 2 achieved breakthrough accuracy in the CASP14 assessment (2020), in many cases approaching experimental quality; the later AlphaFold 3 extends prediction beyond single proteins to complexes that also contain nucleic acids, ligands and ions.

This package provides an implementation of the inference pipeline of AlphaFold.

Referencing¶

Any publication that discloses findings arising from using this source code or the model parameters should cite the AlphaFold paper. Please also refer to the Supplementary Information for a detailed description of the method.

Home page is at https://github.com/deepmind/alphafold

AlphaFold Databases¶

AlphaFold databases are stored under the /opt/nesi/db/alphafold_db/ parent directory. To make the databases more convenient to use, we have prepared a module for each version. Running module spider AlphaFold2DB lists the available versions, named by when they were downloaded (Year-Month):

$ module spider AlphaFold2DB

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AlphaFold2DB: AlphaFold2DB/2022-06
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Description:
AlphaFold2 databases

 Versions:
         AlphaFold2DB/2022-06
         AlphaFold2DB/2023-04

Loading a module sets the $AF2DB variable, which points to the selected version of the database. For example:

$ module load AlphaFold2DB/2023-04

$ echo $AF2DB 
/opt/nesi/db/alphafold_db/2023-04

AlphaFold 3¶

AlphaFold 3 can predict the joint structure of complexes that can include proteins, nucleic acids (DNA/RNA), ligands, ions and modified residues. It takes its input as a JSON file, and the workflow is split into a CPU-bound data pipeline (genetic and template search) and a GPU-bound inference stage.

Note that AlphaFold 3 predicts the three-dimensional structure of these complexes (with per-residue and per-pair confidence estimates). It does not compute binding affinities, docking scores or other measures of binding strength.

Home page is at https://github.com/google-deepmind/alphafold3.

You must request the model parameters from Google DeepMind

The AlphaFold 3 model parameters are not redistributed by Mahuika. To obtain them you must agree to the AlphaFold 3 Model Parameters Terms of Use and request access from Google DeepMind yourself. The terms permit non-commercial use only and prohibit sharing the weights. Once you receive them, store them in a directory you control (for example under your project space) and point --model_dir at it.

AlphaFold 3 module and databases¶

The application and its databases are available as modules:

module spider AlphaFold

 Versions:
        AlphaFold/3.0.0
        AlphaFold/3.0.1
        AlphaFold/3.0.2

module spider AlphaFold3DB

 Versions:
        AlphaFold3DB/2024-12

Loading the AlphaFold3DB module sets an environment variable ($AF3DB) that points at the selected database version, so you can pass it (and the individual files within it) to the --db_dir and --*_database_path options:

module load AlphaFold3DB/2024-12
echo $AF3DB

On Mahuika you also need to load HMMER, which AlphaFold 3 uses for its genetic search. Loading it sets $HMMER_DIR, which you pass to the --*_binary_path options:

module load HMMER/3.4-GCC-12.3.0
echo $HMMER_DIR
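
Before submitting a long job it can help to confirm that the search tools are where the job script expects them. The following is a minimal sketch, not part of the module: `missing_binaries` is a hypothetical helper, and it assumes the executables sit directly under the directory you pass in (as in the `${HMMER_DIR}/jackhmmer` style paths used on this page).

```shell
#!/bin/bash
# Hypothetical helper: print every tool that is NOT an executable file in DIR.
# Usage: missing_binaries DIR TOOL [TOOL...]
missing_binaries() {
    local dir=$1
    shift
    local tool
    for tool in "$@"; do
        # -x is true only if the path exists and is executable
        if [ ! -x "${dir}/${tool}" ]; then
            echo "${tool}"
        fi
    done
}

# Example: check the five HMMER tools that AlphaFold 3 is given on this page.
# HMMER_DIR is set by the HMMER module; fall back to "." if it is unset.
missing=$(missing_binaries "${HMMER_DIR:-.}" hmmalign hmmbuild hmmsearch jackhmmer nhmmer)
if [ -n "${missing}" ]; then
    echo "Not found under ${HMMER_DIR:-.}:" >&2
    echo "${missing}" >&2
fi
```

If the check reports missing tools after the module is loaded, the binaries may live in a subdirectory; inspect `$HMMER_DIR` and adjust the `--*_binary_path` options to match.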

Input JSON¶

AlphaFold 3 reads a JSON description of the structure to fold. A single model handles both monomers and multimers — there is no model-preset option to set — and whether you fold a monomer or a complex is decided entirely by what you list under sequences in the JSON. The run_alphafold.py command is identical in either case.

Monomer — a single protein block with one chain id (fold_input.json):

{
  "name": "my_protein",
  "modelSeeds": [1],
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLDVLVGGGNTAEAFIHGVQTILTKPSLHALILEYSSQEMQE"
      }
    }
  ],
  "dialect": "alphafold3",
  "version": 1
}

Multimer — add more entries to the sequences list, each with its own chain id. This example is a heterodimer of two different proteins (chains A and B):

{
  "name": "my_complex",
  "modelSeeds": [1],
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLDVLVGGGNTAEAFIHGVQTILTKPSLHALILEYSSQEMQE"
      }
    },
    {
      "protein": {
        "id": "B",
        "sequence": "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH"
      }
    }
  ],
  "dialect": "alphafold3",
  "version": 1
}

For a homo-multimer (the same sequence repeated), give a single protein block a list of ids instead, for example "id": ["A", "B"]. A complex can also mix molecule types: alongside protein blocks, the sequences list accepts dna, rna and ligand entries, and ions are given as ligand entries (usually by CCD code, for example MG). See the input documentation for the full schema.
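
As an illustration of the two points above, the following sketch writes an input for a homodimer (one protein block with two ids) together with a ligand. The file name, the job name and the choice of ATP are arbitrary assumptions made for the example, and the complex is not meant to be biologically meaningful. The `ligand` block with `ccdCodes` follows the AlphaFold 3 input documentation, which remains the authority on the schema.

```shell
#!/bin/bash
# Write an example AlphaFold 3 input: a homodimer (chains A and B share one
# sequence) plus one ATP ligand (chain L). All names here are illustrative.
JSON_OUT=${JSON_OUT:-homodimer_atp.json}

cat > "${JSON_OUT}" <<'EOF'
{
  "name": "my_homodimer_atp",
  "modelSeeds": [1],
  "sequences": [
    {
      "protein": {
        "id": ["A", "B"],
        "sequence": "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH"
      }
    },
    {
      "ligand": {
        "id": "L",
        "ccdCodes": ["ATP"]
      }
    }
  ],
  "dialect": "alphafold3",
  "version": 1
}
EOF

echo "Wrote ${JSON_OUT}"
```

The resulting file is passed to run_alphafold.py with --json_path in exactly the same way as the monomer and heterodimer inputs.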

Example Slurm script¶

On Mahuika, AlphaFold 3 does not infer the database files or search-tool binaries automatically: you pass the genetic databases via the --*_database_path options (all found under $AF3DB), the HMMER binaries via the --*_binary_path options (under $HMMER_DIR), and your own copy of the model parameters via --model_dir. The following runs both the data pipeline and inference in a single GPU job for one input JSON:

#!/bin/bash -e

#SBATCH --account       nesi12345
#SBATCH --job-name      af3-example
#SBATCH --mem           24G
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-node A100:1
#SBATCH --time          02:00:00
#SBATCH --output        %j.out

module purge
module load AlphaFold/3.0.2
module load AlphaFold3DB/2024-12
module load HMMER/3.4-GCC-12.3.0

INPUT=/nesi/project/nesi12345/alphafold3/fold_input.json
OUTPUT=/nesi/project/nesi12345/alphafold3/results
MODEL_DIR=/nesi/project/nesi12345/alphafold3/models

run_alphafold.py \
--json_path=${INPUT} \
--model_dir=${MODEL_DIR} \
--output_dir=${OUTPUT} \
--db_dir=${AF3DB} \
--uniref90_database_path=${AF3DB}/uniref90_2022_05.fa \
--mgnify_database_path=${AF3DB}/mgy_clusters_2022_05.fa \
--uniprot_cluster_annot_database_path=${AF3DB}/uniprot_all_2021_04.fa \
--small_bfd_database_path=${AF3DB}/bfd-first_non_consensus_sequences.fasta \
--pdb_database_path=${AF3DB}/mmcif_files \
--seqres_database_path=${AF3DB}/pdb_seqres_2022_09_28.fasta \
--hmmalign_binary_path=${HMMER_DIR}/hmmalign \
--hmmbuild_binary_path=${HMMER_DIR}/hmmbuild \
--hmmsearch_binary_path=${HMMER_DIR}/hmmsearch \
--jackhmmer_binary_path=${HMMER_DIR}/jackhmmer \
--nhmmer_binary_path=${HMMER_DIR}/nhmmer

To fold several inputs, place all of the JSON files in a directory and replace --json_path=${INPUT} with --input_dir=/path/to/json_dir, or call run_alphafold.py once per file (for example from a job array).
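
For the job-array route, one possible pattern is to let each array task pick the N-th JSON file from the directory. This is a sketch under assumptions: `nth_json` is a hypothetical helper, the inputs are assumed to end in `.json` with no newlines in their names, and the array is assumed to be submitted with 1-based indices (for example `--array=1-10`).

```shell
#!/bin/bash
# Hypothetical helper: print the N-th (1-based) *.json file in DIR, sorted by name.
# Usage: nth_json DIR N
nth_json() {
    local dir=$1 n=$2
    # sort gives every array task the same ordering; sed -n "Np" keeps line N only
    find "${dir}" -maxdepth 1 -type f -name '*.json' | sort | sed -n "${n}p"
}

# Inside an array job Slurm sets SLURM_ARRAY_TASK_ID; default to 1 elsewhere.
JSON_DIR=${JSON_DIR:-.}
INPUT=$(nth_json "${JSON_DIR}" "${SLURM_ARRAY_TASK_ID:-1}")
echo "This task would fold: ${INPUT:-<no input found>}"
# ...then pass --json_path="${INPUT}" to run_alphafold.py as in the script above.
```

Because each task handles one file, a failed input can be resubmitted on its own by array index without repeating the others.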

Splitting the data pipeline and inference

The data pipeline (genetic/template search) is CPU-bound and does not need a GPU, while inference is GPU-bound. For large batches you can run the two stages as separate jobs — add --norun_inference to a CPU-only job to produce an enriched JSON (containing the MSAs), then feed that JSON into a GPU job with --norun_data_pipeline — so a GPU is not held idle during the search.
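
One way to organise the split is a single job script that takes the stage as a parameter and adds the matching flag. The sketch below only maps a stage name to the flags named above; `af3_stage_flag` and the stage names `data` and `inference` are hypothetical, and chaining the two jobs (for example with `sbatch --dependency=afterok:<jobid>`) is an assumption about how you might submit them, not something the module does for you.

```shell
#!/bin/bash
# Hypothetical helper: map a stage name to the run_alphafold.py flag that
# switches the *other* stage off.
af3_stage_flag() {
    case "$1" in
        data)      echo "--norun_inference" ;;      # CPU-only job: search only
        inference) echo "--norun_data_pipeline" ;;  # GPU job: reuse enriched JSON
        *)         echo "unknown stage: $1" >&2; return 1 ;;
    esac
}

STAGE=${STAGE:-data}
EXTRA_FLAG=$(af3_stage_flag "${STAGE}")
echo "Stage '${STAGE}' adds: ${EXTRA_FLAG}"
# In a real job script this flag would be appended to the run_alphafold.py
# command, e.g.  run_alphafold.py ... ${EXTRA_FLAG}
```

The data stage would be submitted without a GPU request, and the inference stage would be given the enriched JSON from the first stage as its --json_path.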

AlphaFold 3 troubleshooting¶

With the default configuration AlphaFold 3 fits inputs of up to roughly 5,120 tokens on an 80 GB GPU. For larger complexes, or if you hit out-of-memory errors during inference, enable unified memory so the GPU can spill into host memory:
```
# Unified memory settings as given in the AlphaFold 3 performance documentation
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export TF_FORCE_UNIFIED_MEMORY=true
export XLA_CLIENT_MEM_FRACTION=3.2
```
The data pipeline can be memory hungry for large sequences; increase --mem if the pipeline stage is killed.

AlphaFold 3 license¶

The source code and the model parameters are governed by separate terms — only the latter carries the non-commercial restriction:

Source code is released under the Apache License, Version 2.0.
Model parameters and any output generated using them are subject to the AlphaFold 3 Model Parameters Terms of Use, which permit non-commercial use only and prohibit redistributing the parameters. You must obtain the parameters directly from Google DeepMind (see the warning above).

AlphaFold module ( >= 2.3.2)¶

As of version 2.3.2 of AlphaFold, we recommend deploying AlphaFold via the module.

Example Slurm script for monomer¶

The input FASTA used in the following example is 3RGK (https://www.rcsb.org/structure/3rgk).

#!/bin/bash -e

#SBATCH --account       nesi12345
#SBATCH --job-name      af-2.3.2-monomer
#SBATCH --mem           24G
#SBATCH --cpus-per-task 8
#SBATCH --gpus-per-node A100:1
#SBATCH --time          02:00:00
#SBATCH --output        %j.out

module purge
module load AlphaFold2DB/2023-04
module load AlphaFold/2.3.2

INPUT=/nesi/project/nesi12345/alphafold/input_data
OUTPUT=/nesi/project/nesi12345/alphafold/results

run_alphafold.py --use_gpu_relax \
--data_dir=$AF2DB \
--uniref90_database_path=$AF2DB/uniref90/uniref90.fasta \
--mgnify_database_path=$AF2DB/mgnify/mgy_clusters_2022_05.fa \
--bfd_database_path=$AF2DB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$AF2DB/uniref30/UniRef30_2021_03 \
--pdb70_database_path=$AF2DB/pdb70/pdb70 \
--template_mmcif_dir=$AF2DB/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$AF2DB/pdb_mmcif/obsolete.dat \
--model_preset=monomer \
--max_template_date=2022-6-1 \
--db_preset=full_dbs \
--output_dir=$OUTPUT \
--fasta_paths=${INPUT}/rcsb_pdb_3RGK.fasta

Example Slurm script for multimer¶

The input FASTA used in the following example (test_multimer.fasta) contains two sequences:

>T1083
GAMGSEIEHIEEAIANAKTKADHERLVAHYEEEAKRLEKKSEEYQELAKVYKKITDVYPNIRSYMVLHYQNLTRRYKEAAEENRALAKLHHELAIVED
>T1084
MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH

#!/bin/bash -e

#SBATCH --account       nesi12345
#SBATCH --job-name      af-2.3.2-multimer
#SBATCH --mem           30G
#SBATCH --cpus-per-task 4
#SBATCH --gpus-per-node A100:1
#SBATCH --time          01:45:00
#SBATCH --output        slurmout.%j.out

module purge
module load AlphaFold2DB/2023-04
module load AlphaFold/2.3.2

INPUT=/nesi/project/nesi12345/input_data
OUTPUT=/nesi/project/nesi12345/alphafold/2.3_multimer

run_alphafold.py \
--use_gpu_relax \
--data_dir=$AF2DB \
--model_preset=multimer \
--uniprot_database_path=$AF2DB/uniprot/uniprot.fasta \
--uniref90_database_path=$AF2DB/uniref90/uniref90.fasta \
--mgnify_database_path=$AF2DB/mgnify/mgy_clusters_2022_05.fa \
--bfd_database_path=$AF2DB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref30_database_path=$AF2DB/uniref30/UniRef30_2021_03 \
--pdb_seqres_database_path=$AF2DB/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$AF2DB/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$AF2DB/pdb_mmcif/obsolete.dat \
--max_template_date=2022-6-1 \
--db_preset=full_dbs \
--output_dir=${OUTPUT} \
--fasta_paths=${INPUT}/test_multimer.fasta
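
The choice between the two presets follows from the FASTA file: the monomer script is given a single sequence, the multimer script a file with several records. The following small sketch counts the records and suggests a preset. `suggest_preset` is a hypothetical helper, and it assumes a standard FASTA in which every record starts with a `>` header line.

```shell
#!/bin/bash
# Hypothetical helper: suggest a --model_preset value from the number of
# FASTA records (header lines starting with '>') in a file.
# Usage: suggest_preset FILE
suggest_preset() {
    local fasta=$1 n
    # grep -c exits non-zero on a zero count; '|| true' stops 'bash -e' aborting
    n=$(grep -c '^>' "${fasta}" 2>/dev/null || true)
    n=${n:-0}
    if [ "${n}" -gt 1 ]; then
        echo "multimer"
    elif [ "${n}" -eq 1 ]; then
        echo "monomer"
    else
        echo "no FASTA records found in ${fasta}" >&2
        return 1
    fi
}

# Example use in a job script (the path is illustrative):
#   PRESET=$(suggest_preset "${INPUT}/test_multimer.fasta")
#   run_alphafold.py ... --model_preset="${PRESET}"
```

This is only a convenience: the multimer preset also needs the --uniprot_database_path and --pdb_seqres_database_path options shown in the script above, so the two command lines differ in more than the preset name.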

AlphaFold Singularity container (prior to v2.3.2)¶

If you would like to use a version prior to 2.3.2, this can be done via the Singularity containers.

We prepared a Singularity container image based on the official Dockerfile, with some modifications. The image (.simg) and the corresponding definition file (.def) are stored in /opt/nesi/containers/AlphaFold/.

Example Slurm scripts for Singularity container based AF2 deployment¶

Monomer¶

#!/bin/bash -e

#SBATCH --account       nesi12345
#SBATCH --job-name      alphafold2_monomer_example
#SBATCH --mem           30G
#SBATCH --cpus-per-task 6
#SBATCH --gpus-per-node A100:1 
#SBATCH --time          02:00:00
#SBATCH --output        slurmout.%j.out

module purge
module load AlphaFold2DB/2022-06
module load cuDNN/8.1.1.33-CUDA-11.2.0 Singularity/3.9.8

INPUT=/path/to/input_data
OUTPUT=/path/to/results

export SINGULARITY_BIND="$INPUT,$OUTPUT,$AF2DB"

singularity exec --nv /opt/nesi/containers/AlphaFold/alphafold_2.2.0.simg python /app/alphafold/run_alphafold.py \
--use_gpu_relax \
--data_dir=$AF2DB \
--uniref90_database_path=$AF2DB/uniref90/uniref90.fasta \
--mgnify_database_path=$AF2DB/mgnify/mgy_clusters_2018_12.fa \
--bfd_database_path=$AF2DB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniclust30_database_path=$AF2DB/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--pdb70_database_path=$AF2DB/pdb70/pdb70 \
--template_mmcif_dir=$AF2DB/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$AF2DB/pdb_mmcif/obsolete.dat \
--model_preset=monomer \
--max_template_date=2022-1-1 \
--db_preset=full_dbs \
--output_dir=$OUTPUT \
--fasta_paths=$INPUT/rcsb_pdb_3RGK.fasta

Multimer¶

#!/bin/bash -e

#SBATCH --account       nesi12345
#SBATCH --job-name      alphafold2_multimer_example
#SBATCH --mem           30G
#SBATCH --cpus-per-task 6
#SBATCH --gpus-per-node A100:1 
#SBATCH --time          02:00:00
#SBATCH --output        slurmout.%j.out

module purge
module load AlphaFold2DB/2022-06
module load cuDNN/8.1.1.33-CUDA-11.2.0 Singularity/3.9.8

INPUT=/path/to/input_data
OUTPUT=/path/to/results


export SINGULARITY_BIND="$INPUT,$OUTPUT,$AF2DB"

singularity exec --nv /opt/nesi/containers/AlphaFold/alphafold_2.2.0.simg python /app/alphafold/run_alphafold.py \
--use_gpu_relax \
--data_dir=$AF2DB \
--uniref90_database_path=$AF2DB/uniref90/uniref90.fasta \
--mgnify_database_path=$AF2DB/mgnify/mgy_clusters_2018_12.fa \
--bfd_database_path=$AF2DB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniclust30_database_path=$AF2DB/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--pdb_seqres_database_path=$AF2DB/pdb_seqres/pdb_seqres.txt \
--template_mmcif_dir=$AF2DB/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$AF2DB/pdb_mmcif/obsolete.dat \
--uniprot_database_path=$AF2DB/uniprot/uniprot.fasta \
--model_preset=multimer \
--max_template_date=2022-1-1 \
--db_preset=full_dbs \
--output_dir=$OUTPUT \
--fasta_paths=$INPUT/rcsb_pdb_3RGK.fasta

Explanation of Slurm variables and Singularity flags¶

The values for the --mem, --cpus-per-task and --time Slurm options are for 3RGK.fasta. Adjust them to suit your own input.
The --nv flag enables GPU support.
--pwd /app/alphafold is a workaround for this existing issue.

AlphaFold2 : Initial Release ( this version does not support `multimer`)¶

The input FASTA used in the following example and subsequent benchmarking is 3RGK (https://www.rcsb.org/structure/3rgk).

Troubleshooting¶

If you encounter the message "RuntimeError: Resource exhausted: Out of memory", add the following variables to the Slurm script.

For module-based runs:

export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

For Singularity-based runs:

export SINGULARITYENV_TF_FORCE_UNIFIED_MEMORY=1 
export SINGULARITYENV_XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

License and Disclaimer¶

This is not an officially supported Google product.

AlphaFold Code License¶

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Model Parameters License¶

The AlphaFold parameters are made available for non-commercial use only, under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode
