Getting Started
MiLoPYP is an open-source, dataset-specific framework based on contrastive learning. It works in two steps: fast visualization of molecular patterns, followed by accurate protein localization, with no manual annotation required. During the exploration step, it learns an embedding space for 3D macromolecules in which similar structures are grouped together and dissimilar ones are separated. The embedding space is then projected into 2D and 3D, which makes it easy to identify the distribution of macromolecular structures across an entire dataset. During the refinement step, examples of proteins identified during exploration are selected, and MiLoPYP learns to localize these proteins with high accuracy.
Each step can be used separately. To use only the refinement step (particle detection in tomograms), ground-truth particle coordinates must be provided for training. Typically, around 200 particles from several tomograms are needed for good performance. Training coordinates can be obtained either manually or from the exploration module.
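Before training the refinement step on its own, it can help to check how many particles a coordinate file contains and how they are spread across tomograms. The snippet below is a minimal sketch, not part of MiLoPYP. It assumes a whitespace-separated text file with one header row and the tomogram name in the first column (for example `image_name x_coord y_coord z_coord`). This layout and the function names `count_particles` and `summarize` are assumptions made for illustration, so adjust them to match your actual file.

```python
from collections import Counter


def count_particles(coord_file, has_header=True):
    """Count particle rows per tomogram.

    Assumed (hypothetical) layout: whitespace-separated columns with the
    tomogram name first, e.g. "image_name x_coord y_coord z_coord".
    """
    with open(coord_file) as handle:
        # Skip blank lines and split each remaining line into fields.
        rows = [line.split() for line in handle if line.strip()]
    if has_header and rows:
        rows = rows[1:]
    return Counter(fields[0] for fields in rows)


def summarize(counts, target=200):
    """Summarize the counts against the rough 200-particle guideline."""
    total = sum(counts.values())
    return {
        "tomograms": len(counts),
        "particles": total,
        "meets_target": total >= target,
    }
```

For example, `summarize(count_particles("data/training_coordinates.txt"))` returns the number of tomograms, the total number of particles, and whether the total reaches the target of about 200.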
Installation
The code was tested on CentOS Stream (version 8.0), using Anaconda Python 3.8, PyTorch version 1.11.0, and CUDA 10.2. NVIDIA GPUs with 32 GB of memory were used for training. Inference can be performed on either GPUs or CPUs.
After installing Anaconda:
- Create a new conda environment (optional, but recommended) and activate it.
- Clone the cet_pick repo.
- Install the requirements.
- Install PyTorch.
- Install the cet_pick package and its dependencies.
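Once the installation steps are complete, a quick check of the active environment can confirm that the Python, PyTorch, and CUDA versions are close to the tested configuration. The following is a minimal sketch that uses only the standard library and PyTorch's public attributes. The function name `environment_report` is hypothetical and not part of cet_pick.

```python
import platform


def environment_report():
    """Collect the Python, PyTorch, and CUDA versions of the active environment."""
    report = {
        "python": platform.python_version(),
        "torch": None,            # stays None if PyTorch is not installed
        "cuda": None,             # CUDA version PyTorch was built against
        "cuda_available": False,  # whether a usable GPU is visible
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False`, inference can still run on CPUs, but training is expected to need a GPU.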
Folder structure
MiLoPYP uses the following directory structure:
├── data                                   # training data
│   ├── sample_train_explore_img.txt
│   ├── sample_train_refine_img.txt
│   └── training_coordinates.txt
├── datasets                               # dataloading, sampling related code
│   ├── dataset_factory.py                 # dataset factory and sampling factory
│   ├── tomo_*.py                          # data factory for different modes
│   └── particle_*.py                      # sampling factory for different modes
├── trains                                 # model training modules
├── models                                 # model architectures for different modes
├── utils                                  # util functions
├── colormap                               # colormaps for 2D visualization
├── DCNv2                                  # deformable convolution related operations
├── opts.py                                # arguments for training
├── main.py                                # training for refinement module
├── simsiam_main.py                        # training for cellular content exploration module
├── simsiam_test_hm_3d.py                  # inference for cellular content exploration module
├── test.py                                # inference for refinement/particle detection module
├── interactive_to_training_coords.py      # convert output from interactive session
├── plot_2d.py                             # 2D visualization plots
└── phoenix_visualization.py               # 3D interactive session
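To confirm that a checkout matches the listing above, the expected entries can be checked programmatically. This is a minimal sketch that assumes all listed directories and scripts sit directly under a single root folder, as the listing suggests. The helper `missing_entries` is hypothetical and not part of the package.

```python
from pathlib import Path

# Names taken from the directory listing; their placement directly under
# one root folder is an assumption.
EXPECTED_DIRS = ["data", "datasets", "trains", "models", "utils", "colormap", "DCNv2"]
EXPECTED_FILES = [
    "opts.py",
    "main.py",
    "simsiam_main.py",
    "simsiam_test_hm_3d.py",
    "test.py",
    "interactive_to_training_coords.py",
    "plot_2d.py",
    "phoenix_visualization.py",
]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

An empty list from `missing_entries(".")` means every listed entry was found in the current directory.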
Sample datasets
Globular-shaped particles (EMPIAR-10304)
This dataset contains a subset of tilt-series from EMPIAR-10304 and all the necessary metadata to run MiLoPYP. To download and decompress, run:
wget https://nextpyp.app/files/data/milopyp_globular_tutorial.tbz
tar xvf milopyp_globular_tutorial.tbz
Tubular-shaped particles (EMPIAR-10987)
This dataset contains a subset of tilt-series from EMPIAR-10987 and all the necessary metadata to run MiLoPYP. To download and decompress, run:
wget https://nextpyp.app/files/data/milopyp_tubular_tutorial.tbz
tar xvf milopyp_tubular_tutorial.tbz
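On systems without `wget` or `tar`, the same download and decompression can be done from Python. The sketch below uses only the standard library. The helper names `download` and `extract_archive` are hypothetical. Opening the archive with mode `"r:*"` lets `tarfile` detect the compression format automatically, so it does not matter whether the `.tbz` file is bzip2- or gzip-compressed.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, dest_dir="."):
    """Download url into dest_dir and return the local file path."""
    target = Path(dest_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(target))
    return target


def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive (any supported compression) and list its members."""
    with tarfile.open(str(archive_path), mode="r:*") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(str(dest_dir), **kwargs)
    return names


if __name__ == "__main__":
    url = "https://nextpyp.app/files/data/milopyp_tubular_tutorial.tbz"
    print(extract_archive(download(url)))
```

Replacing the URL with the globular tutorial archive works the same way.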
MiLoPYP consists of two modules: a cellular content exploration module and a refinement (particle detection) module.
The quick tutorials contain a step-by-step guide on how to run MiLoPYP on two sample datasets.