Execute the STAVER workflow

The STAVER algorithm is implemented in the staver_pipeline module, a proteomics data analysis tool that covers the workflow from raw data preprocessing to final result output. This tutorial shows how to run the STAVER workflow from the command-line interface (CLI). For more details about the algorithm, please refer to the STAVER documentation.

[2]:

import pandas as pd
import numpy as np
import staver as st

import warnings
warnings.filterwarnings('ignore')

List all the optional command arguments

[3]:

%run ~/STAVER/staver/staver_pipeline.py  -h

usage: staver_pipeline.py [-h] -n NUMBER_THRESHODS -i DIA_PATH
                          [-ref REFERENCE_STANDARD_DATASET] -o
                          DIA_PEP_DATA_OUTPATH -op DIA_PROTEIN_DATA_OUTPATH
                          [-fdr FDR_THRESHOLD] [-c COUNT_CUTOFF_SAME_LIBS]
                          [-d COUNT_CUTOFF_DIFF_LIBS]
                          [-pep_cv PEPTIDES_CV_THRESH]
                          [-pro_cv PROTEINS_CV_THRESH]
                          [-na_thresh NA_THRESHOLD] [-top TOP_PRECURSOR_IONS]
                          [-norm NORMALIZATION_METHOD] [-suffix FILE_SUFFIX]
                          [-sample SAMPLE_TYPE] [-ver VERBOSE] [-v]

STAVER: A Standardized Dataset-Based Algorithm for Efficient Variation
Reduction in Large-Scale DIA MS Data

optional arguments:
  -h, --help            show this help message and exit
  -n NUMBER_THRESHODS, --thread_numbers NUMBER_THRESHODS
                        The number of thresholds for computer operations
  -i DIA_PATH, --input DIA_PATH
                        The DIA input data path
  -ref REFERENCE_STANDARD_DATASET, --reference_dataset_path REFERENCE_STANDARD_DATASET
                        The DIA standarde reference directory
  -o DIA_PEP_DATA_OUTPATH, --output_peptide DIA_PEP_DATA_OUTPATH
                        The processed DIA proteomics of peptide data output
                        path
  -op DIA_PROTEIN_DATA_OUTPATH, --output_protein DIA_PROTEIN_DATA_OUTPATH
                        The processed DIA proteomics protein data output path
  -fdr FDR_THRESHOLD, --fdr_threshold FDR_THRESHOLD
                        Setting the FDR threshold (default: 0.01)
  -c COUNT_CUTOFF_SAME_LIBS, --count_cutoff_same_libs COUNT_CUTOFF_SAME_LIBS
                        Setting the count cutoff of same files (default: 1)
  -d COUNT_CUTOFF_DIFF_LIBS, --count_cutoff_diff_libs COUNT_CUTOFF_DIFF_LIBS
                        Setting the count cutoff of different files (default:
                        2)
  -pep_cv PEPTIDES_CV_THRESH, --peptides_cv_thresh PEPTIDES_CV_THRESH
                        Setting coefficient of variation threshold for the
                        peptides (default: 0.3)
  -pro_cv PROTEINS_CV_THRESH, --proteins_cv_thresh PROTEINS_CV_THRESH
                        Setting coefficient of variation threshold for the
                        proteins (default: 0.3)
  -na_thresh NA_THRESHOLD, --na_threshold NA_THRESHOLD
                        Setting the minimum threshold for NUll peptides
                        (default: 0.3)
  -top TOP_PRECURSOR_IONS, --top_precursor_ions TOP_PRECURSOR_IONS
                        Setting the top high confidence interval precursor
                        ions (default: 6)
  -norm NORMALIZATION_METHOD, --normalization_method NORMALIZATION_METHOD
                        Specify data normalization method
  -suffix FILE_SUFFIX, --file_suffix FILE_SUFFIX
                        Set the suffix for folder specific identification
  -sample SAMPLE_TYPE, --sample_type SAMPLE_TYPE
                        Description of the sample type
  -ver VERBOSE, --verbose VERBOSE
                        Set the verbose mode for the output information
  -v, --version         show program's version number and exit
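The help text above can be read as an argparse definition. The following is a minimal sketch of how a few of these flags could be declared, not a copy of staver_pipeline.py: the destination names follow the metavars shown in the usage line, while the value types and the required/optional split are assumptions inferred from the help output.

```python
import argparse


def build_parser():
    """Illustrative subset of the STAVER flags (assumed types, not the real source)."""
    parser = argparse.ArgumentParser(prog="staver_pipeline.py")
    # Flags shown without square brackets in the usage line are treated as required.
    parser.add_argument("-n", "--thread_numbers", dest="number_threshods",
                        type=int, required=True)
    parser.add_argument("-i", "--input", dest="dia_path", required=True)
    parser.add_argument("-o", "--output_peptide", dest="dia_pep_data_outpath",
                        required=True)
    parser.add_argument("-op", "--output_protein", dest="dia_protein_data_outpath",
                        required=True)
    # Optional thresholds with the defaults listed in the help text.
    parser.add_argument("-fdr", "--fdr_threshold", type=float, default=0.01)
    parser.add_argument("-pep_cv", "--peptides_cv_thresh", type=float, default=0.3)
    parser.add_argument("-top", "--top_precursor_ions", type=int, default=6)
    return parser


# Hypothetical paths, used only to show how the flags are parsed.
args = build_parser().parse_args(
    ["-n", "16", "-i", "data/dia/", "-o", "results/pep/", "-op", "results/pro/"]
)
print(vars(args))
```

Flags that are not given on the command line fall back to their defaults, which is why the thresholds appear in the parsed values even though only the four required flags were passed.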

Run the staver_pipeline

(Estimated time: ~5 min for 20 samples)

Before running the pipeline, prepare the environment and the DIA dataset:

Preparing the Environment:
- Ensure that Python is installed on your system.
- Download or clone the STAVER repository to your local machine or HPC.
- Install the required packages by running pip install -r requirements.txt in the STAVER directory.

Setting Up the Parameters:
- Use the -n flag to set the number of threads for computation.
- The -i flag should point to your input DIA data path.
- If you have a reference dataset, use the -ref flag to provide its path; otherwise, the default dataset will be used.
- Define the output paths for peptide data with -o and protein data with -op.
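When the pipeline is launched from a script rather than a notebook cell, the flags listed above can be assembled programmatically. The sketch below is one possible way to do this: build_staver_command is a hypothetical helper (not part of STAVER), and all paths are placeholders. It only uses long flag names that appear in the help output.

```python
import shlex
import sys


def build_staver_command(script, threads, input_dir, pep_out, prot_out,
                         reference=None, **optional):
    """Assemble an argv list for staver_pipeline.py (hypothetical helper)."""
    cmd = [
        sys.executable, str(script),
        "--thread_numbers", str(threads),
        "--input", str(input_dir),
        "--output_peptide", str(pep_out),
        "--output_protein", str(prot_out),
    ]
    # The reference dataset is optional; omit the flag to use the default dataset.
    if reference is not None:
        cmd += ["--reference_dataset_path", str(reference)]
    # Extra keyword arguments become long flags, e.g. fdr_threshold=0.01.
    for name, value in optional.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_staver_command(
    "staver_pipeline.py", 16, "data/dia/",
    "results/peptides/", "results/proteins/",
    fdr_threshold=0.01,
)
print(shlex.join(cmd))
# To launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems with paths that contain spaces.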

Users can also configure the optional parameters, including the false discovery rate (FDR) threshold (-fdr), the coefficient of variation thresholds for peptides (-pep_cv) and proteins (-pro_cv), intensity and frequency thresholds, and the number of top precursor ions (-top), among others. These settings allow the STAVER workflow to be tailored to specific experimental requirements.
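To make the coefficient of variation (CV) thresholds concrete, the sketch below computes the CV (sample standard deviation divided by the mean) for three made-up peptides across three replicate runs and keeps those at or below 0.3, the default value of -pep_cv. It illustrates the general idea of a CV cutoff only; the peptide names and intensities are invented, and how STAVER applies the threshold internally is not shown in this tutorial.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Toy intensities per peptide across three replicate runs (made-up values).
peptides = {
    "PEPTIDE_A": [100.0, 110.0, 90.0],
    "PEPTIDE_B": [50.0, 20.0, 80.0],
    "PEPTIDE_C": [10.0, 10.5, 9.5],
}

cv_threshold = 0.3  # same value as the documented default of -pep_cv
cvs = {name: coefficient_of_variation(v) for name, v in peptides.items()}
kept = [name for name, cv in cvs.items() if cv <= cv_threshold]

for name, cv in cvs.items():
    print(f"{name}: CV = {cv:.2f}")
print("kept:", kept)
```

A lower threshold keeps only the most reproducible measurements, at the cost of discarding more peptides.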

For a full description of each parameter, see “The Detailed Description of the Parameters” in the Tutorials section of the STAVER documentation. Data preparation and format requirements are covered in the Data Ingestion subsection of the Tutorials.

[5]:

# Run staver_pipeline with 16 threads; the thresholds below match the documented defaults
%run ~/STAVER/staver/staver_pipeline.py \
        --thread_numbers 16 \
        --input /Volumes/T7_Shield/staver/data/likai-diann-raw-20/ \
        --reference_dataset_path /Volumes/T7_Shield/staver/data/likai-diann-raw \
        --output_peptide /Volumes/T7_Shield/staver/results/DIA_repeat20_2023010/peptides/ \
        --output_protein /Volumes/T7_Shield/staver/results/DIA_repeat20_2023010/proteins/ \
        --count_cutoff_same_libs 1 \
        --count_cutoff_diff_libs 2 \
        --fdr_threshold 0.01 \
        --peptides_cv_thresh 0.3 \
        --proteins_cv_thresh 0.3 \
        --na_threshold 0.3 \
        --top_precursor_ions 6 \
        --file_suffix _F1_R1

All parsed arguments:
number_threshods: 16
dia_path: /Volumes/T7_Shield/staver/data/likai-diann-raw-20/
reference_standard_dataset: /Volumes/T7_Shield/staver/data/likai-diann-raw
dia_pep_data_outpath: /Volumes/T7_Shield/staver/results/DIA_repeat20_2023010/peptides/
dia_protein_data_outpath: /Volumes/T7_Shield/staver/results/DIA_repeat20_2023010/proteins/
fdr_threshold: 0.01
count_cutoff_same_libs: 1
count_cutoff_diff_libs: 2
peptides_cv_thresh: 0.3
proteins_cv_thresh: 0.3
na_threshold: 0.3
top_precursor_ions: 6
normalization_method: median
file_suffix: _F1_R1
sample_type: None
verbose: False

===================== 'run_staver' function begins running... ======================


====================== 'load_data' function begins running... ======================

/Volumes/T7_Shield/staver/results/DIA_repeat20_2023010/peptides/
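Once a run has finished, the peptide and protein output directories can be inspected from Python. The sketch below simply lists whatever files are present under a directory; list_outputs is a hypothetical helper, the example paths are placeholders, and nothing is assumed about the file names or formats that STAVER writes.

```python
from pathlib import Path


def list_outputs(directory):
    """Return (relative path, size in bytes) for every file under directory."""
    root = Path(directory)
    if not root.is_dir():
        return []  # directory missing, e.g. the run has not finished yet
    return sorted(
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


# Placeholder paths; replace with the values passed to -o and -op.
for out_dir in ["results/peptides/", "results/proteins/"]:
    for name, size in list_outputs(out_dir):
        print(f"{out_dir}{name}\t{size} bytes")
```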
