STAVER shows robust generalization across various large-scale DIA datasets¶

To further validate the reliability of STAVER’s results and to demonstrate the robustness and advantages of the algorithm, we applied STAVER to a larger and more diverse DIA dataset from the “ProCan-DepMap-Sanger project” (Gonçalves, et al. 2022, Cancer Cell). A total of 1,326 samples were analyzed: 84 HEK293T cell line samples used for quality control and 1,242 cancer cell line samples derived from 9 typical cancer types (colorectal, SCLC, kidney, gastric, pancreatic, bladder, NSCLC, glioma, and hepatocellular cancer) (Figure RL2A-RL2B and Figure RL5A-RL5B).

We performed a comprehensive comparative analysis of the original data (without STAVER processing) and the STAVER-processed data, with the main results focusing on:

- (I) the reproducibility and reliability of the STAVER-processed data,
- (II) the robustness of the STAVER algorithm in uncovering inherent biological differences,
- (III) the reproducibility of the STAVER algorithm in identifying previously reported tumor biomarkers, and
- (IV) the robustness and broad applicability of the STAVER algorithm for disease diagnosis and classification.

[1]:

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import colorsys
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import scanpy as sc
import anndata
import os

# Use TrueType (Type 42) fonts so that text in exported PDFs remains editable
mpl.rcParams['pdf.fonttype'] = 42
# mpl.rcParams['figure.dpi']= 150

import warnings
warnings.filterwarnings('ignore')

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

Metadata¶

[2]:

metadata = pd.read_excel("~/STAVER-revised/Pan-cancer-cell-lines/subset_949_samples.xlsx",index_col=0)
metadata

[2]:

	Automatic_MS_filename	Batch	Date	Instrument	Cell_line	SIDM	Tissue_type	Cancer_type	Cancer_subtype	Project_Identifier
519	191012_b36-t2-8_00di6_00jm7_m06_s_1	P04	2019-10-12	M06	BFTC-905	SIDM00989	Bladder	Bladder Carcinoma	Urothelial carcinoma	SIDM00989;BFTC-905
521	191023_b61-t1-1_00dsu_00kp3_m03_s_1	P04	2019-10-23	M03	BFTC-905	SIDM00989	Bladder	Bladder Carcinoma	Urothelial carcinoma	SIDM00989;BFTC-905
522	191026_b36-t3-8_00di6_00kt6_m04_s_1	P04	2019-10-26	M04	BFTC-905	SIDM00989	Bladder	Bladder Carcinoma	Urothelial carcinoma	SIDM00989;BFTC-905
523	191026_b61-t2-1_00dsu_00ktw_m06_s_1	P04	2019-10-26	M06	BFTC-905	SIDM00989	Bladder	Bladder Carcinoma	Urothelial carcinoma	SIDM00989;BFTC-905
524	191125_b32-t4-8_00dge_00n1d_m03_s_1	P04	2019-11-25	M03	BFTC-905	SIDM00989	Bladder	Bladder Carcinoma	Urothelial carcinoma	SIDM00989;BFTC-905
...	...	...	...	...	...	...	...	...	...	...
6590	200131_b4-7-t3-1_00q3n_00rtc_m05_s_1	P06	2020-01-31	M05	COR-L95	SIDM00521	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00521;COR-L95
6591	200131_b4-9-t3-1_00q3p_00rte_m05_s_1	P06	2020-01-31	M05	NCI-H510A	SIDM00927	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00927;NCI-H510A
6592	200131_b4-10-t3-1_00q3q_00rtf_m05_s_1	P06	2020-01-31	M05	NCI-H2171	SIDM00733	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00733;NCI-H2171
6593	200131_b4-13-t3-1_00q3t_00rti_m05_s_1	P06	2020-01-31	M05	NCI-H1836	SIDM00770	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00770;NCI-H1836
6594	200201_b3-13-t3-2_00q3d_00rtm_m05_s_1	P06	2020-02-01	M05	IST-SL1	SIDM00223	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00223;IST-SL1

1242 rows × 10 columns

Tissue_type¶

[3]:

Tissue_type_counts = metadata['Tissue_type'].value_counts().reset_index()
Tissue_type_counts.columns = ['Tissue_type', 'Count']
Tissue_type_counts

[3]:

	Tissue_type	Count
0	Lung	273
1	Large Intestine	271
2	Kidney	181
3	Stomach	153
4	Pancreas	115
5	Bladder	96
6	Central Nervous System	78
7	Liver	75

[4]:

fig = px.pie(Tissue_type_counts, values='Count', names='Tissue_type', title="Diverse Tissue Type")
fig.update_traces(textinfo='label+percent', insidetextorientation='radial')

# Save figure to PDF
pio.write_image(fig, 'figs/Tissue_type.pdf')

Cancer_type¶

[5]:

Cancer_type_counts = metadata['Cancer_type'].value_counts().reset_index()
Cancer_type_counts.columns = ['Cancer_type', 'Count']
Cancer_type_counts

[5]:

	Cancer_type	Count
0	Colorectal Carcinoma	271
1	Small Cell Lung Carcinoma	187
2	Kidney Carcinoma	181
3	Gastric Carcinoma	153
4	Pancreatic Carcinoma	110
5	Bladder Carcinoma	96
6	Non-Small Cell Lung Carcinoma	86
7	Glioma	78
8	Hepatocellular Carcinoma	75
9	Other Solid Carcinomas	5

[6]:

fig = px.pie(Cancer_type_counts, values='Count', names='Cancer_type', title="Diverse Cancer Type")
# Update the labels to show both count and percentage
fig.update_traces(textinfo='label+percent', insidetextorientation='radial')

# Save figure to PDF
pio.write_image(fig, 'figs/Cancer_types.pdf')

Cancer_subtype¶

[7]:

Cancer_subtype = metadata['Cancer_subtype'].value_counts().reset_index()
Cancer_subtype.columns = ['Cancer_subtype', 'Count']

Cancer_subtype["Cancer_subtype_modify"] = [
    x if count > 14 else "Others"
    for x, count in zip(Cancer_subtype["Cancer_subtype"], Cancer_subtype["Count"])
]

Cancer_subtype

[7]:

	Cancer_subtype	Count	Cancer_subtype_modify
0	Small cell lung carcinoma	187	Small cell lung carcinoma
1	Colon adenocarcinoma	147	Colon adenocarcinoma
2	Kidney carcinoma	107	Kidney carcinoma
3	Bladder carcinoma	89	Bladder carcinoma
4	Squamous cell lung carcinoma	86	Squamous cell lung carcinoma
5	Gastric adenocarcinoma	75	Gastric adenocarcinoma
6	Hepatocellular carcinoma	72	Hepatocellular carcinoma
7	Low grade glioma	72	Low grade glioma
8	Clear cell renal cell carcinoma	64	Clear cell renal cell carcinoma
9	Pancreatic ductal adenocarcinoma	52	Pancreatic ductal adenocarcinoma
10	Cecum adenocarcinoma	46	Cecum adenocarcinoma
11	Pancreatic adenocarcinoma	44	Pancreatic adenocarcinoma
12	Colorectal carcinoma	43	Colorectal carcinoma
13	Rectal adenocarcinoma	35	Rectal adenocarcinoma
14	Gastric signet ring cell adenocarcinoma	29	Gastric signet ring cell adenocarcinoma
15	Gastric tubular adenocarcinoma	15	Gastric tubular adenocarcinoma
16	Pancreatic carcinoma	13	Others
17	Gastric carcinoma	12	Others
18	Urothelial carcinoma	7	Others
19	Gastric fundus carcinoma	7	Others
20	Papillary renal cell carcinoma	6	Others
21	Oligodendroglioma	6	Others
22	Gastic small cell neuroendocrine carcinoma	6	Others
23	Gastric small cell carcinoma	6	Others
24	Pancreatic somatostatinoma	5	Others
25	Renal pelvis and ureter urothelial carcinoma	4	Others
26	Hepatoblastoma	3	Others
27	Gastric choriocarcinoma	3	Others
28	Pancreatic adenosquamous carcinoma	1	Others
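The "Others" grouping above can also be written without an explicit loop. The following is a minimal sketch on toy data, assuming the same rule as the cell above (subtypes with a count of 14 or fewer are collapsed); the helper name `collapse_rare` and its parameters are hypothetical and not part of the original notebook.

```python
import pandas as pd


def collapse_rare(counts, label_col, count_col, min_count=15, other="Others"):
    """Return the labels, with categories below `min_count` replaced by `other`."""
    # Series.where keeps a label where the condition holds and substitutes `other` elsewhere
    return counts[label_col].where(counts[count_col] >= min_count, other)


# Toy counts table with the same layout as Cancer_subtype above
toy = pd.DataFrame({"Cancer_subtype": ["A", "B", "C"], "Count": [187, 15, 13]})
toy["Cancer_subtype_modify"] = collapse_rare(toy, "Cancer_subtype", "Count")
print(toy)
```

With `min_count=15` the condition is equivalent to `count > 14` in the list comprehension, so a subtype with exactly 15 samples keeps its own label.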

[8]:

fig = px.pie(Cancer_subtype, values='Count', names='Cancer_subtype_modify', title="Diverse Cancer subtype")
# Update the labels to show both count and percentage
fig.update_traces(textinfo='label+percent', insidetextorientation='radial')

# Save figure to PDF
pio.write_image(fig, 'figs/Diverse Cancer subtype modify.pdf')

The Sankey diagram delineates the relationships¶

[9]:

def aggregate_by_sum(df, column_name, threshold=14):
    """Keep only the rows whose `column_name` value occurs more than `threshold` times."""

    # Count the rows in each group of the specified column
    group_counts = df.groupby(column_name).size()

    # Select the groups whose row count exceeds the threshold
    kept_groups = group_counts[group_counts > threshold].index.tolist()

    # Filter the original DataFrame for rows belonging to the selected groups
    processed_data = df[df[column_name].isin(kept_groups)]

    return processed_data

# Apply the function: keep cancer subtypes with more than 10 samples
result_df = aggregate_by_sum(metadata, 'Cancer_subtype', 10)
result_df

[9]:

	Automatic_MS_filename	Batch	Date	Instrument	Cell_line	SIDM	Tissue_type	Cancer_type	Cancer_subtype	Project_Identifier
527	181010_e0022_p02_2178_1_s_m04_1	P02	2018-10-10	M04	SW1710	SIDM00420	Bladder	Bladder Carcinoma	Bladder carcinoma	SIDM00420;SW1710
528	181010_e0022_p02_2178_3_s_m04_1	P02	2018-10-10	M04	SW1710	SIDM00420	Bladder	Bladder Carcinoma	Bladder carcinoma	SIDM00420;SW1710
529	181010_e0022_p02_2178_2_s_m04_1	P02	2018-10-10	M04	SW1710	SIDM00420	Bladder	Bladder Carcinoma	Bladder carcinoma	SIDM00420;SW1710
530	181127_e0022_p02_2051_3_s_m04_1	P02	2018-11-27	M04	RT-112	SIDM00402	Bladder	Bladder Carcinoma	Bladder carcinoma	SIDM00402;RT-112
531	181127_e0022_p02_2051_1_s_m04_1	P02	2018-11-27	M04	RT-112	SIDM00402	Bladder	Bladder Carcinoma	Bladder carcinoma	SIDM00402;RT-112
...	...	...	...	...	...	...	...	...	...	...
6590	200131_b4-7-t3-1_00q3n_00rtc_m05_s_1	P06	2020-01-31	M05	COR-L95	SIDM00521	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00521;COR-L95
6591	200131_b4-9-t3-1_00q3p_00rte_m05_s_1	P06	2020-01-31	M05	NCI-H510A	SIDM00927	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00927;NCI-H510A
6592	200131_b4-10-t3-1_00q3q_00rtf_m05_s_1	P06	2020-01-31	M05	NCI-H2171	SIDM00733	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00733;NCI-H2171
6593	200131_b4-13-t3-1_00q3t_00rti_m05_s_1	P06	2020-01-31	M05	NCI-H1836	SIDM00770	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00770;NCI-H1836
6594	200201_b3-13-t3-2_00q3d_00rtm_m05_s_1	P06	2020-02-01	M05	IST-SL1	SIDM00223	Lung	Small Cell Lung Carcinoma	Small cell lung carcinoma	SIDM00223;IST-SL1

1188 rows × 10 columns

[10]:

def generate_color_map(labels, alpha_nodes=1.0, alpha_links=0.55):
    hues = [i/len(labels) for i in range(len(labels))]
    colors_hsv = [(h, 0.5, 0.8) for h in hues]

    colors_rgba_nodes = [colorsys.hsv_to_rgb(*hsv) for hsv in colors_hsv]
    colors_rgba_links = [(r, g, b, alpha_links) for r, g, b in colors_rgba_nodes]
    colors_rgba_nodes = [(r, g, b, alpha_nodes) for r, g, b in colors_rgba_nodes]

    colors_str_nodes = ["rgba({},{},{},{})".format(int(r*255), int(g*255), int(b*255), a) for r, g, b, a in colors_rgba_nodes]
    colors_str_links = ["rgba({},{},{},{})".format(int(r*255), int(g*255), int(b*255), a) for r, g, b, a in colors_rgba_links]

    return dict(zip(labels, colors_str_nodes)), dict(zip(labels, colors_str_links))


def draw_sankey(data, filename=None):
    all_labels = sorted(list(set(data['Tissue_type'].unique().tolist() + data['Cancer_type'].unique().tolist() + data['Cancer_subtype'].unique().tolist())))

    node_color_map, link_color_map = generate_color_map(all_labels)

    # Aggregate data
    links = data.groupby(['Tissue_type', 'Cancer_type', 'Cancer_subtype']).size().reset_index(name='count')

    final_source = []
    final_target = []
    final_value = []

    for index, row in links.iterrows():
        final_source.append(all_labels.index(row['Tissue_type']))
        final_target.append(all_labels.index(row['Cancer_type']))
        final_value.append(row['count'])

        final_source.append(all_labels.index(row['Cancer_type']))
        final_target.append(all_labels.index(row['Cancer_subtype']))
        final_value.append(row['count'])

    link_colors = [link_color_map[all_labels[s]] for s in final_source]

    fig = go.Figure(go.Sankey(
        node=dict(
                pad=15,
                thickness=20,
                line=dict(color="black", width=0.5),
                label=all_labels,
                color=[node_color_map[label] for label in all_labels]
        ),
        link=dict(
                source=final_source,
                target=final_target,
                value=final_value,
                color=link_colors
            )
    ))

    fig.update_layout(title_text="Sankey Diagram of Cancer Types and Subtypes", font_size=10)

    if filename:
        fig.write_image(filename)

    fig.show()

# Usage example
# df = pd.DataFrame(data)
draw_sankey(result_df, filename="sankey_diagram-3.pdf")
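The link-building step inside `draw_sankey` can be illustrated in isolation. The sketch below is an assumption-laden variant rather than the notebook's own code: it aggregates each pair of adjacent levels separately, so every Tissue_type to Cancer_type link appears once with its total count instead of once per subtype. The helper name `build_sankey_links` and the toy labels are hypothetical.

```python
import pandas as pd


def build_sankey_links(df, levels):
    """Build node labels and (source, target, value) lists for adjacent column pairs."""
    # One node per distinct label across all levels, in sorted order
    labels = sorted(set().union(*(df[col].unique() for col in levels)))
    index = {label: i for i, label in enumerate(labels)}

    source, target, value = [], [], []
    for left, right in zip(levels[:-1], levels[1:]):
        # Count the rows for every (left, right) combination that occurs
        pair_counts = df.groupby([left, right]).size()
        for (left_label, right_label), n in pair_counts.items():
            source.append(index[left_label])
            target.append(index[right_label])
            value.append(int(n))
    return labels, source, target, value


toy = pd.DataFrame({
    "Tissue_type": ["Lung", "Lung", "Lung", "Liver"],
    "Cancer_type": ["SCLC", "SCLC", "NSCLC", "HCC"],
    "Cancer_subtype": ["sclc-a", "sclc-b", "squamous", "hcc-a"],
})
labels, source, target, value = build_sankey_links(
    toy, ["Tissue_type", "Cancer_type", "Cancer_subtype"]
)
print(list(zip(source, target, value)))
```

The returned lists can be passed to the `source`, `target`, and `value` arguments of the `link` dictionary in `go.Sankey`. As in the original function, a label that occurs in two levels with identical spelling is mapped to a single node.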

The reproducibility and reliability of the STAVER-processed data¶

HEK293T Spearman correlation analysis¶

[6]:

# Load HEK293T data
HEK293T_rawdata = pd.read_csv("~/STAVER-revised/HEK_293T/HEK-QCS_matrix_rawdata.csv", index_col=0)
HEK293T_rawdata.dropna(thresh=5, inplace=True)
HEK293T_STAVER = pd.read_csv("~/STAVER-revised/HEK_293T/HEK-QCS_matrix_STAVER.csv", index_col=0)
HEK293T_STAVER.dropna(thresh=5, inplace=True)

[26]:

def correlation_heatmap(data, method = 'pearson', log_transformed = False, outpath = None, filename = None):
    """ Correlation heatmap and correlation matrix
    Args:
    --------------
    data -> Dataframe: dataframe of raw data; correlations are computed between its columns
    method -> str: pearson or spearman
    log_transformed -> bool: whether to use log10 transformed data
    outpath -> str: directory for the saved heatmap (PDF) and correlation matrix (CSV); nothing is saved if None
    filename -> str: prefix of the saved file names

    Return:
    -----------
    None. The correlation matrix of the experiments is written to CSV when outpath is given.

    """
    if log_transformed:
        data = np.log10(data)
    corr = data.corr(method = method)
    ## Palette selection reference: https://learnku.com/articles/39890
    plt.figure(figsize=(10, 10))
    # optional cmap: RdYlGn; RdYlGn_r; YlGn; rocket_r; YlGnBu; YlGnBu_r; YlOrBr; YlOrBr_r; YlOrRd; YlOrRd_r; RdBu_r
    sns.clustermap(corr, cmap="Reds", annot= False, robust=False, col_cluster=False, row_cluster=False, linewidths=.005,fmt=".2f")
    # sn.heatmap(corr, annot=True, cmap='vlag')  ## optional cmap: RdYlGn; RdYlGn_r; YlGn; rocket_r; YlGnBu; YlGnBu_r; YlOrBr; YlOrBr_r; YlOrRd; YlOrRd_r;
    if outpath:
        plt.savefig(f"{outpath}/{filename}_corr_heatmap.pdf")
        corr.to_csv(f"{outpath}/{filename}_corr_matrix.csv")
    plt.show()
    # return corr

The correlation heatmap of raw data¶

[27]:

# Expand '~' explicitly so that the figure is written under the home directory
outpath = os.path.expanduser('~/STAVER-revised/figs/')
correlation_heatmap(HEK293T_rawdata, method = 'spearman', log_transformed=False, outpath = outpath, filename = 'HEK-QCS_matrix_rawdata_corr_heatmap')

<Figure size 1000x1000 with 0 Axes>

_images/PanCancer_949_celline_analysis_21_1.png

The correlation heatmap of STAVER-processed data¶

[28]:

# Expand '~' explicitly so that the figure is written under the home directory
outpath = os.path.expanduser('~/STAVER-revised/figs/')
correlation_heatmap(HEK293T_STAVER, method = 'spearman', log_transformed=False, outpath = outpath, filename = 'HEK-QCS_matrix_STAVER_corr_heatmap')

<Figure size 1000x1000 with 0 Axes>

_images/PanCancer_949_celline_analysis_23_1.png

[93]:

def extract_lower_triangular(df_corr):
    """
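    Extract the correlation values below the diagonal of a correlation matrix.

    Parameters:
    ----
    df_corr : pandas.DataFrame
        The correlation matrix.

    Returns:
    -----
    pandas.DataFrame
        A DataFrame holding the row label, column label, and correlation value of each entry below the diagonal.
    """
    if not isinstance(df_corr, pd.DataFrame):
        raise ValueError("df_corr must be a pandas.DataFrame.")

    # Indices of the entries strictly below the diagonal
    tril_indices = np.tril_indices(df_corr.shape[0], k=-1)

    # Extract the correlation values below the diagonal
    lower_triangular_values = df_corr.to_numpy()[tril_indices]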

    return pd.DataFrame(
        {
            'Row': df_corr.index[tril_indices[0]],
            'Column': df_corr.columns[tril_indices[1]],
            'Correlation': lower_triangular_values,
        }
    )


def get_IQR(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    print(f"The IQR1 is: {Q1}")
    print(f"The IQR3 is: {Q3}")
    mean = data.mean()
    print(f"The mean is: {mean}")
    median = data.median()
    print(f"The median is: {median}")

The IQR of the HEK293T Spearman correlation matrix in the raw data¶

[105]:

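# Note: DataFrame.corr() defaults to Pearson; pass method='spearman' to obtain Spearman coefficients
res = extract_lower_triangular(HEK293T_rawdata.corr())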
get_IQR(res['Correlation'])

The IQR1 is: 0.8846399725
The IQR3 is: 0.907049235
The mean is: 0.893689345854834
The median is: 0.897238564

The IQR of the HEK293T Spearman correlation matrix in the STAVER-processed data¶

[106]:

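# Note: DataFrame.corr() defaults to Pearson; pass method='spearman' to obtain Spearman coefficients
res = extract_lower_triangular(HEK293T_STAVER.corr())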
get_IQR(res['Correlation'])

The IQR1 is: 0.9321159908757065
The IQR3 is: 0.9821799248475355
The mean is: 0.9523831119684066
The median is: 0.9657469442578246
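The two steps above (extracting the values below the diagonal, then summarizing them) can be combined into one helper. This is a minimal sketch on toy data, not the notebook's own function: the name `pairwise_correlation_summary` is hypothetical, and the method is passed explicitly because `DataFrame.corr()` uses Pearson unless told otherwise.

```python
import numpy as np
import pandas as pd


def pairwise_correlation_summary(matrix, method="spearman"):
    """Summarize the pairwise correlations between the columns of `matrix`."""
    corr = matrix.corr(method=method)
    # Keep each pair once: entries strictly below the diagonal
    rows, cols = np.tril_indices(corr.shape[0], k=-1)
    values = pd.Series(corr.to_numpy()[rows, cols])
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return {
        "Q1": q1,
        "median": values.median(),
        "Q3": q3,
        "IQR": q3 - q1,
        "mean": values.mean(),
    }


# Toy matrix: rows are proteins, columns are QC runs
toy = pd.DataFrame({
    "run1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "run2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "run3": [5.0, 4.0, 3.0, 2.0, 1.0],
})
summary = pairwise_correlation_summary(toy)
print(summary)
```

Here Q1 and Q3 are the first and third quartiles, and the interquartile range is their difference.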

The number of identified proteins in the raw data and STAVER data¶

[2]:

protein_num= pd.read_clipboard()
protein_num

[2]:

	Experiment	ldentification of Protein numbers	Type
0	190124_HEK-QCS_000F2_008AR_M04_S_1	5683	Raw data
1	190124_HEK-QCS_000F2_008IQ_M06_S_1	5923	Raw data
2	190125_HEK-QCS_000F2_008JO_M04_S_1	5617	Raw data
3	190128_HEK-QCS_000F2_008MR_M06_S_1	5873	Raw data
4	190129_HEK-QCS_000F2_008NM_M04_S_1	5465	Raw data
...	...	...	...
162	190102_HEK-QCS_000F2_007K6_M02_S_1	5150	STAVER
163	190401_HEK-QCS_000F2_00A13_M06_S_1	5817	STAVER
164	190507_HEK-QCS_000F2_00AQK_M03_S_1	5961	STAVER
165	190522_HEK-QCS_000F2_00BEO_M04_S_1	5283	STAVER
166	181213_HEK-QCS_000F2_006UR_M03_S_1	5519	STAVER

167 rows × 3 columns
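The table above was pasted from the clipboard, so the notebook does not show how the counts were obtained. One plausible derivation is sketched below, under the assumption that each matrix has proteins as rows and runs as columns, with NaN marking a protein that was not identified. The helper name `count_identified_proteins` and the column name "Identified proteins" are hypothetical.

```python
import numpy as np
import pandas as pd


def count_identified_proteins(matrix, label):
    """Count the non-missing proteins in each run (column) of `matrix`."""
    counts = matrix.notna().sum(axis=0)
    return pd.DataFrame({
        "Experiment": counts.index,
        "Identified proteins": counts.to_numpy(),
        "Type": label,
    })


# Toy matrices: three proteins measured in two runs
raw = pd.DataFrame({"run1": [1.0, np.nan, 3.0], "run2": [1.0, 2.0, 3.0]})
processed = pd.DataFrame({"run1": [1.0, np.nan, np.nan], "run2": [1.0, 2.0, np.nan]})

toy_counts = pd.concat(
    [count_identified_proteins(raw, "Raw data"),
     count_identified_proteins(processed, "STAVER")],
    ignore_index=True,
)
print(toy_counts)
```

The resulting long-format table has one row per run and data type, which is the layout `plot_protein_numbers` expects once the y-axis column name is adjusted.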

[28]:

def plot_protein_numbers(data, x_col, y_col, title, y_lim=None, height=6, aspect=1.3):
    """
    Creates a combined box and strip plot for protein number visualization.

    Args:
        data (pd.DataFrame): DataFrame containing the data to be plotted.
        x_col (str): Column name in `data` to be plotted on the x-axis.
        y_col (str): Column name in `data` to be plotted on the y-axis.
        title (str): Title of the plot.
        y_lim (tuple, optional): Tuple specifying the limits for the y-axis (e.g., (0, 7000)).
        height (float, optional): Height (in inches) of each facet. Defaults to 6.
        aspect (float, optional): Aspect ratio of each facet, so that aspect * height gives the width of each facet in inches. Defaults to 1.3.

    Example:
        >>> plot_protein_numbers(protein_num, 'Type', 'Identification of Protein numbers',
                                'Protein numbers of HEK293T QC samples', y_lim=(0, 7000), height=5, aspect=1)
    """
    custom_params = {"axes.spines.right": False, "axes.spines.top": False}
    sns.set_theme(style="ticks", rc=custom_params)

    # Create a box plot
    box_plot = sns.catplot(x=x_col, y=y_col, hue=x_col, kind="box", legend=False, height=height, aspect=aspect, data=data)

    # Overlay with a strip plot
    sns.stripplot(x=x_col, y=y_col, hue=x_col, jitter=True, dodge=True, marker='o', palette="Set2", alpha=0.5, data=data)

    # Set additional plot attributes
    box_plot.fig.suptitle(title)  # Set the title for the figure
    plt.legend(loc='lower right')
    if y_lim:
        plt.ylim(y_lim)

    plt.show()

# The protein numbers of the raw data and STAVER data
plot_protein_numbers(protein_num, 'Type', 'ldentification of Protein numbers',
                     'Identification of the raw and STAVER data', y_lim=(0, 7000), height=5, aspect=1)

_images/PanCancer_949_celline_analysis_31_0.png

The coefficients of variation (CVs) of the raw data and STAVER data¶

[130]:

# custom_params = {"axes.spines.right": False, "axes.spines.top": False}
# sns.set_theme(style="ticks", rc=custom_params)

def plot_molecular_variance(df, column_name):
    """
    Plots the density curves of the original and STAVER processed data for comparison.

    Args:
        df: A pandas dataframe containing the data.
        column_name: A string representing the column name of the data to be plotted.

    Returns:
        None
    """
    plt.figure(figsize=(5, 4))

    # Density plot of the original data
    sns.kdeplot(df[df[column_name]=='Raw data']['Coefficient of Variation [%]'], label='Original Density', color='blue', linestyle="--")

    # Density plot of the STAVER-processed data
    sns.kdeplot(df[df[column_name]=='STAVER']['Coefficient of Variation [%]'], label='STAVER Density', color='red')

    plt.legend()
    plt.title('Coefficient of Variation density curve')
    plt.xlabel("Coefficient of Variation [%]")
    plt.ylabel('Density')
    plt.show()

[131]:

protein_cv = pd.read_csv("~/STAVER-revised/Pan-cancer-cell-lines/HEK293T_Protein_CV_Compare.csv")
plot_molecular_variance(protein_cv, 'Type')

_images/PanCancer_949_celline_analysis_34_0.png
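The CV table is loaded precomputed, so the calculation itself is not shown. For reference, the coefficient of variation of a protein is its standard deviation across replicate runs divided by its mean, expressed as a percentage. The sketch below assumes intensities on a linear (not log-transformed) scale, proteins as rows, and replicate runs as columns; the helper name `coefficient_of_variation` is hypothetical.

```python
import numpy as np
import pandas as pd


def coefficient_of_variation(matrix):
    """Per-protein CV in percent: sample standard deviation / mean * 100."""
    mean = matrix.mean(axis=1, skipna=True)
    std = matrix.std(axis=1, ddof=1, skipna=True)  # ddof=1: sample standard deviation
    return (std / mean * 100).rename("Coefficient of Variation [%]")


# Toy intensities: rows are proteins, columns are replicate QC runs
toy = pd.DataFrame(
    {"run1": [100.0, 90.0], "run2": [100.0, 100.0], "run3": [100.0, 110.0]},
    index=["protein_A", "protein_B"],
)
cv = coefficient_of_variation(toy)
print(cv)
```

Missing values are skipped, so a protein's CV is based only on the runs in which it was quantified.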

The advantages of the STAVER algorithm in uncovering inherent biological differences¶

Distinct tumors often exhibit diverse molecular characteristics (Cell, 2014, PMID: 25109877; Cell, 2023, PMID: 37582357), with even different histological subtypes of the same tumor demonstrating significant molecular heterogeneity (Science, 2014, PMID: 25301631; Cancer Cell, 2023, PMID: 36563681), contributing to challenges in cancer treatment. Thus, accurately deciphering the inherent molecular heterogeneity of various cancer types, especially in high-dimensional data such as proteomics, is vital for precision treatments and ultimately improved patient outcomes.

To demonstrate the advantages of the STAVER algorithm in uncovering inherent biological differences, we comprehensively compared the original data and the STAVER-processed data for the 1,242 cancer cell line samples from diverse cancer types.

The UMAP analysis of 1,242 diverse cancer cell line samples¶

[98]:

def visualize_cancer_subtypes(proteomics_data_path,
                              subtypes_file_path,
                              color_params=['Batch', 'Tissue_type', 'Cancer_type'],
                              figsize_per_plot=(4.5, 4),
                              legend_loc="on data",
                              outpath='./',
                              filename=None,
                              add_outline=False,
                              show_plot=False):
    """
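    Visualize the given proteomics data and cancer subtypes using t-SNE and UMAP.

    Args:
        proteomics_data_path (str): Path to the proteomics data (proteins as rows, samples as columns; the table is transposed after loading).
        subtypes_file_path (str): Path to the file with samples and their corresponding cancer subtypes.
        color_params (list): List of parameters to color by in the plots.
        figsize_per_plot (tuple): Size of each individual plot.
        legend_loc (str): Legend placement passed to the scanpy plotting functions.
        outpath (str): Directory to save the plots.
        filename (str): Prefix of the saved files.
        add_outline (bool): Whether to outline groups of points in the plots.
        show_plot (bool): Passed as `show` to the scanpy plotting functions.

    Returns:
        anndata.AnnData: The scaled AnnData object with the t-SNE and UMAP embeddings. A combined t-SNE and UMAP figure is also saved to `outpath`.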
    """

    # Load data
    data = pd.read_csv(proteomics_data_path, index_col=0).T.replace(np.nan, 0)

    # Load subtype information
    subtypes = pd.read_csv(subtypes_file_path, index_col=0)

    # Check for missing subtype information
    if not all(sample in subtypes.index for sample in data.index):
        raise ValueError("Some samples lack subtype information!")

    # Convert data to AnnData format
    adata = anndata.AnnData(X=data)

    # Add subtype information to AnnData
    adata.obs = subtypes.loc[adata.obs_names]

    # Data normalization (if required)
    sc.pp.scale(adata)

    # Compute t-SNE and UMAP
    sc.tl.tsne(adata)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)

    # Set filenames if not provided
    if not filename:
        filename = os.path.basename(proteomics_data_path).split('.')[0]

    # # Visualize t-SNE
    # fig, axes = plt.subplots(figsize=(len(color_params)*figsize_per_plot[0], figsize_per_plot[1]), nrows=1, ncols=len(color_params))
    # for i, param in enumerate(color_params):
    #     sc.pl.tsne(adata, color=param, legend_loc=legend_loc, title=f"t-SNE colored by {param}", ax=axes[i], show=show_plot)
    # fig.tight_layout()
    # fig.savefig(os.path.join(outpath, f"{filename}_tsne.pdf"))
    # plt.close(fig)

    # # Visualize UMAP
    # fig, axes = plt.subplots(figsize=(len(color_params)*figsize_per_plot[0], figsize_per_plot[1]), nrows=1, ncols=len(color_params))
    # for i, param in enumerate(color_params):
    #     sc.pl.umap(adata, color=param, legend_loc=legend_loc, title=f"UMAP colored by {param}", ax=axes[i], show=show_plot)
    # fig.tight_layout()
    # fig.savefig(os.path.join(outpath, f"{filename}_umap.pdf"))
    # plt.close(fig)

    # Determine the number of plots (number of color_params x 2 for both t-SNE and UMAP)
    nrows = 2
    ncols = len(color_params)

    # Create a figure with the necessary number of subplots
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(ncols * figsize_per_plot[0], nrows * figsize_per_plot[1]))

    # Visualize t-SNE on the first row
    for i, param in enumerate(color_params):
        sc.pl.tsne(adata, color=param, legend_loc=legend_loc, title=f"t-SNE colored by {param}", add_outline=add_outline, ax=axes[0, i], show=show_plot)

    # Visualize UMAP on the second row
    for i, param in enumerate(color_params):
        sc.pl.umap(adata, color=param, legend_loc=legend_loc, title=f"UMAP colored by {param}", add_outline=add_outline, ax=axes[1, i], show=show_plot)

    # Save the combined visualization
    fig.tight_layout()
    fig.savefig(os.path.join(outpath, f"{filename}_combined.pdf"))
    plt.show()
    plt.close(fig)

    return adata

The UMAP analysis of the Rawdata¶

UMAP analysis indicated that the STAVER algorithm did not introduce batch effects, consistent with the original data (Figure RL2H-2I and Figure RL5H-5I). In the original data, the 1,242 cancer cell line samples were diffusely distributed across tissue sources and cancer types, with no clear separation. This lack of distinction was particularly evident among the bladder, colorectal, pancreatic, gastric, and hepatocellular cancer cell lines, which intermixed with one another (Figure RL2H and Figure RL5H).

[100]:

# Expand '~' explicitly: os.makedirs does not expand it
outpath = os.path.expanduser('~/DIA-STAVER/STAVER-nc-revised/revised-sript/figs-1/')
os.makedirs(outpath, exist_ok=True)
adata_rawdata = visualize_cancer_subtypes("~/STAVER-revised/PCA/rawdata_data.csv", "~/STAVER-revised/PCA/metadata_anatation_scanpy.csv", outpath=outpath, filename = "Raw_data_2")

WARNING: You’re trying to run this on 11880 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.

OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

_images/PanCancer_949_celline_analysis_39_2.png

The UMAP analysis of STAVER-processed data¶

In the STAVER-processed data, UMAP visualization clearly separated the 1,242 cancer cell line samples by tissue source and cancer type (Figure RL2I and Figure RL5I). Notably, the STAVER-processed data revealed that cancer cell lines from the digestive tract (intestinal, gastric, and pancreatic) were more similar to one another at the molecular level than to cell lines of non-digestive origin (lung, glioma, and kidney). These findings highlight the potential of the STAVER algorithm to elucidate the intrinsic biological differences among diverse tumor cell lines.

[102]:

# Expand '~' explicitly: os.makedirs does not expand it
outpath = os.path.expanduser('~/DIA-STAVER/STAVER-nc-revised/revised-sript/figs-1/')
os.makedirs(outpath, exist_ok=True)

adata_STAVER = visualize_cancer_subtypes("~/STAVER-revised/PCA/STAVER_data.csv", "~/STAVER-revised/PCA/metadata_anatation_scanpy.csv", outpath=outpath, filename = "STAVER_data_2")

WARNING: You’re trying to run this on 8202 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.

_images/PanCancer_949_celline_analysis_41_1.png

Cancer specific proteins¶

To assess the advantages of the STAVER algorithm in identifying tumor cell-specific proteins, we compared the distribution of cell-specific protein expression profiles between the original and STAVER-processed data. For this integrated analysis, we used the Human Protein Atlas (HPA) molecular annotations of tumor cell specificity together with the HPA expression profiles across various cancer cell lines.

The results demonstrated that the STAVER-processed data more accurately reflected cancer type-specific expression patterns than the original proteomic data.

[4]:

HPA_dataset = pd.read_csv("~/STAVER-nc-revised/resources/proteinatlas_56a92855.tsv", sep="\t")
HPA_dataset

[4]:

	Gene	Gene synonym	Ensembl	Gene description	Uniprot	Chromosome	Position	Protein class	Biological process	Molecular function	...	Pathology prognostics - Lung cancer	Pathology prognostics - Melanoma	Pathology prognostics - Ovarian cancer	Pathology prognostics - Pancreatic cancer	Pathology prognostics - Prostate cancer	Pathology prognostics - Renal cancer	Pathology prognostics - Stomach cancer	Pathology prognostics - Testis cancer	Pathology prognostics - Thyroid cancer	Pathology prognostics - Urothelial cancer
0	A1BG	NaN	ENSG00000121410	Alpha-1-B glycoprotein	P04217	19	58345178-58353492	Plasma proteins, Predicted intracellular prote...	NaN	NaN	...	unprognostic (1.09e-1)	unprognostic (2.59e-1)	unprognostic (2.10e-1)	unprognostic (1.47e-2)	unprognostic (1.37e-2)	unprognostic (4.19e-5)	unprognostic (2.37e-2)	unprognostic (1.94e-1)	unprognostic (1.72e-1)	unprognostic (6.72e-2)
1	A1CF	ACF, ACF64, ACF65, APOBEC1CF, ASP	ENSG00000148584	APOBEC1 complementation factor	Q9NQ94	10	50799409-50885675	Predicted intracellular proteins	mRNA processing	RNA-binding	...	unprognostic (7.38e-3)	NaN	unprognostic (1.30e-2)	unprognostic (2.46e-2)	unprognostic (1.20e-1)	unprognostic (1.90e-3)	unprognostic (1.97e-2)	unprognostic (2.77e-1)	unprognostic (2.19e-2)	unprognostic (8.50e-4)
2	A2M	CPAMD5, FWP007, S863-7	ENSG00000175899	Alpha-2-macroglobulin	P01023	12	9067664-9116229	Cancer-related genes, Candidate cardiovascular...	NaN	Protease inhibitor, Serine protease inhibitor	...	unprognostic (3.65e-2)	unprognostic (2.38e-1)	unprognostic (7.19e-2)	unprognostic (4.71e-2)	unprognostic (2.06e-2)	unprognostic (1.28e-2)	unprognostic (8.04e-3)	unprognostic (2.32e-2)	unprognostic (8.58e-2)	unprognostic (9.03e-3)
3	A2ML1	CPAMD9, FLJ25179, p170	ENSG00000166535	Alpha-2-macroglobulin like 1	A8K2U0	12	8822621-8887001	Disease related genes, Predicted intracellular...	NaN	Protease inhibitor, Serine protease inhibitor	...	unprognostic (7.58e-3)	unprognostic (2.63e-1)	unprognostic (1.57e-1)	unprognostic (1.15e-3)	unprognostic (2.03e-1)	unprognostic (1.06e-9)	unprognostic (2.28e-1)	unprognostic (3.07e-1)	unprognostic (5.88e-2)	unprognostic (2.42e-2)
4	A3GALT2	A3GALT2P, IGB3S, IGBS3S	ENSG00000184389	Alpha 1,3-galactosyltransferase 2	U3KPV4	1	33306766-33321098	Enzymes, Predicted membrane proteins	Lipid metabolism	Glycosyltransferase, Transferase	...	unprognostic (4.96e-2)	unprognostic (6.83e-2)	unprognostic (5.81e-2)	unprognostic (1.23e-1)	unprognostic (1.89e-1)	unprognostic (4.90e-8)	unprognostic (1.17e-1)	NaN	unprognostic (1.12e-2)	unprognostic (7.87e-2)
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20157	ZYG11A	ZYG11	ENSG00000203995	Zyg-11 family member A, cell cycle regulator	Q6WRX3	1	52842511-52894998	Predicted intracellular proteins	Ubl conjugation pathway	NaN	...	unprognostic (2.34e-1)	unprognostic (4.56e-2)	unprognostic (2.06e-2)	unprognostic (4.01e-2)	unprognostic (1.01e-1)	unprognostic (6.15e-3)	unprognostic (2.95e-1)	unprognostic (1.21e-1)	unprognostic (3.07e-1)	unprognostic (1.02e-1)
20158	ZYG11B	FLJ13456, ZYG11	ENSG00000162378	Zyg-11 family member B, cell cycle regulator	Q9C0D3	1	52726453-52827336	Predicted intracellular proteins	Ubl conjugation pathway	NaN	...	unprognostic (1.85e-1)	unprognostic (4.84e-3)	unprognostic (5.06e-2)	unprognostic (2.76e-1)	unprognostic (6.08e-2)	prognostic favorable (9.80e-7)	unprognostic (2.22e-1)	unprognostic (3.37e-1)	unprognostic (1.13e-1)	unprognostic (9.57e-2)
20159	ZYX	NaN	ENSG00000159840	Zyxin	Q15942	7	143381295-143391111	Plasma proteins, Predicted intracellular proteins	Cell adhesion, Host-virus interaction	NaN	...	unprognostic (1.66e-3)	unprognostic (2.60e-1)	unprognostic (4.22e-1)	unprognostic (1.98e-1)	unprognostic (2.43e-1)	prognostic unfavorable (7.92e-5)	unprognostic (1.39e-1)	unprognostic (8.12e-2)	unprognostic (1.95e-1)	unprognostic (6.72e-2)
20160	ZZEF1	FLJ10821, KIAA0399, ZZZ4	ENSG00000074755	Zinc finger ZZ-type and EF-hand domain contain...	O43149	17	4004445-4143030	Predicted membrane proteins	Transcription, Transcription regulation	Activator	...	unprognostic (1.44e-2)	unprognostic (1.23e-1)	unprognostic (2.21e-2)	prognostic favorable (6.08e-4)	unprognostic (1.54e-1)	unprognostic (1.38e-3)	unprognostic (6.09e-3)	unprognostic (1.80e-1)	unprognostic (9.43e-2)	unprognostic (6.46e-2)
20161	ZZZ3	ATAC1, DKFZP564I052	ENSG00000036549	Zinc finger ZZ-type containing 3	Q8IYH5	1	77562416-77683419	Predicted intracellular proteins	Transcription, Transcription regulation	DNA-binding	...	unprognostic (2.91e-1)	unprognostic (9.67e-2)	unprognostic (1.35e-1)	unprognostic (2.83e-1)	unprognostic (1.91e-1)	unprognostic (1.71e-1)	unprognostic (3.96e-1)	unprognostic (1.95e-1)	unprognostic (3.05e-2)	unprognostic (1.43e-1)

20162 rows × 89 columns

[6]:

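def get_hpa_expression_profile(protein):
    # Index the HPA table by gene symbol; set_index returns a new frame,
    # so HPA_dataset itself is left unchanged
    HPA = HPA_dataset[['Gene', 'RNA cell line specific nTPM']].set_index('Gene')

    # Return the cell-line-specific nTPM string for the requested gene
    return HPA.loc[protein, "RNA cell line specific nTPM"]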

CTSE overexpression in gastric, pancreatic, and colorectal cancer in the HPA dataset¶

[9]:

get_hpa_expression_profile("CTSE")

[9]:

'colorectal cancer: 28.4;Esophageal cancer: 48.8;Gastric cancer: 110.7;pancreatic cancer: 35.3'
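
The string returned above is transcribed by hand into the dictionary in the next cell. As a hedged alternative, the sketch below parses such a string programmatically. The function name `parse_hpa_ntpm`, the `label_map` from HPA cancer names to this cohort's cancer-type labels, and the `cancer_types` list are assumptions introduced for illustration; they are not part of the original notebook.

```python
def parse_hpa_ntpm(profile, label_map, cancer_types):
    """Turn an HPA 'name: value;name: value' string into a dict keyed by cancer type."""
    # Cancer types that HPA does not list keep a value of 0
    values = {cancer: 0.0 for cancer in cancer_types}
    for item in profile.split(";"):
        name, _, ntpm = item.partition(":")
        cancer = label_map.get(name.strip().lower())
        if cancer is not None:  # skip HPA entries outside the cohort
            values[cancer] = float(ntpm)
    return values


# Hypothetical mapping from HPA labels (lower-cased) to the labels used here
label_map = {
    "colorectal cancer": "Colorectal Carcinoma",
    "gastric cancer": "Gastric Carcinoma",
    "pancreatic cancer": "Pancreatic Carcinoma",
}
cancer_types = [
    "Bladder Carcinoma", "Colorectal Carcinoma", "Gastric Carcinoma",
    "Glioma", "Hepatocellular Carcinoma", "Kidney Carcinoma",
    "NSCLC", "Pancreatic Carcinoma", "SCLC",
]

ctse_profile = ("colorectal cancer: 28.4;Esophageal cancer: 48.8;"
                "Gastric cancer: 110.7;pancreatic cancer: 35.3")
ctse_values = parse_hpa_ntpm(ctse_profile, label_map, cancer_types)
```

Entries such as "Esophageal cancer" that have no counterpart in the cohort are dropped, which matches the hand-built dictionary that follows.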

[10]:

HPA_protein_CTSE = {
    'Bladder Carcinoma': 0,
    'Colorectal Carcinoma': 28.4,
    'Gastric Carcinoma': 110.7,
    'Glioma': 0,
    'Hepatocellular Carcinoma': 0,
    'Kidney Carcinoma': 0,
    'NSCLC': 0,
    'Pancreatic Carcinoma': 35.3,
    'SCLC': 0
}


HPA_protein_CTSE = pd.DataFrame.from_dict(HPA_protein_CTSE, orient='index', columns=['CTSE'])
HPA_protein_CTSE = HPA_protein_CTSE.sort_values(by=['CTSE'], ascending=False)
HPA_protein_CTSE

[10]:

	CTSE
Gastric Carcinoma	110.7
Pancreatic Carcinoma	35.3
Colorectal Carcinoma	28.4
Bladder Carcinoma	0.0
Glioma	0.0
Hepatocellular Carcinoma	0.0
Kidney Carcinoma	0.0
NSCLC	0.0
SCLC	0.0

[11]:

STAVER = pd.read_csv("~/STAVER-revised/model-evlauate/STAVER_data.csv", index_col=0)
raw_data = pd.read_csv("~/STAVER-revised/model-evlauate/raw_data.csv", index_col=0)

[12]:

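def process_data(data, target_gene):
    # Work on a copy so the input frame is not modified
    df = data[[target_gene, 'Cancer_type']].copy()
    # Drop the cancer types excluded from the comparison
    df = df[df['Cancer_type'] != 'Other Solid Carcinomas']
    # Treat zeros as missing values
    df[target_gene] = df[target_gene].replace(0, np.nan)
    # Median expression per cancer type, highest first
    df = df.groupby('Cancer_type').median()
    df = df.sort_values(by=[target_gene], ascending=False)
    return df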

[13]:

rawdata_CTSE = process_data(raw_data, "CTSE")
rawdata_CTSE.sort_values(by=['CTSE'], ascending=False, inplace=True)
rawdata_CTSE

[13]:

	CTSE
Cancer_type
Gastric Carcinoma	13.064867
Pancreatic Carcinoma	13.046290
Hepatocellular Carcinoma	12.683957
Colorectal Carcinoma	11.875120
Non-Small Cell Lung Carcinoma	11.738225
Bladder Carcinoma	11.182161
Small Cell Lung Carcinoma	10.270660
Kidney Carcinoma	10.095740
Glioma	NaN

[17]:

STAVER_CTSE = process_data(STAVER, "CTSE")
STAVER_CTSE.sort_values(by=['CTSE'], ascending=False, inplace=True)
STAVER_CTSE

[17]:

	CTSE
Cancer_type
Gastric Carcinoma	9.501038
Pancreatic Carcinoma	8.500000
Colorectal Carcinoma	8.050000
Bladder Carcinoma	NaN
Glioma	NaN
Hepatocellular Carcinoma	NaN
Kidney Carcinoma	NaN
Non-Small Cell Lung Carcinoma	NaN
Small Cell Lung Carcinoma	NaN
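
One way to summarize the three CTSE tables above is to compare which cancer types each dataset reports as expressing the protein. The sketch below uses the values printed above and a Jaccard index as the overlap measure. The Jaccard index and the helper names `detected` and `jaccard` are illustrative choices made here, not a metric or functions used by the original analysis.

```python
import math

# HPA cancer types with nonzero CTSE nTPM (from the HPA table above)
hpa_ctse = {
    "Gastric Carcinoma": 110.7,
    "Pancreatic Carcinoma": 35.3,
    "Colorectal Carcinoma": 28.4,
}

nan = float("nan")
# Median CTSE per cancer type, copied from the outputs above
raw_ctse = {
    "Gastric Carcinoma": 13.064867, "Pancreatic Carcinoma": 13.046290,
    "Hepatocellular Carcinoma": 12.683957, "Colorectal Carcinoma": 11.875120,
    "Non-Small Cell Lung Carcinoma": 11.738225, "Bladder Carcinoma": 11.182161,
    "Small Cell Lung Carcinoma": 10.270660, "Kidney Carcinoma": 10.095740,
    "Glioma": nan,
}
staver_ctse = {
    "Gastric Carcinoma": 9.501038, "Pancreatic Carcinoma": 8.500000,
    "Colorectal Carcinoma": 8.050000, "Bladder Carcinoma": nan,
    "Glioma": nan, "Hepatocellular Carcinoma": nan, "Kidney Carcinoma": nan,
    "Non-Small Cell Lung Carcinoma": nan, "Small Cell Lung Carcinoma": nan,
}


def detected(medians):
    """Cancer types with a non-missing, positive median."""
    return {k for k, v in medians.items() if not math.isnan(v) and v > 0}


def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


raw_overlap = jaccard(detected(hpa_ctse), detected(raw_ctse))        # 3 / 8
staver_overlap = jaccard(detected(hpa_ctse), detected(staver_ctse))  # 3 / 3
```

For CTSE, the set of cancer types detected in the STAVER-processed data coincides with the HPA set, whereas the original data report five additional cancer types.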

[ ]:

mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['figure.dpi'] = 150  # dpi is typically set between 150 and 300

custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)

def plot_barplot(df, target_col, log_transform=False, x_label=None, y_label=None, title=None, save_path=None):
    """
    Plots a barplot.

    Args:
        df (pd.DataFrame): The dataframe.
        target_col (str): The column name to be plotted as the target column.
        log_transform (bool): If True, plot log2(value + 1). Default is False.
        x_label (str): The x-axis label. Default is None.
        y_label (str): The y-axis label. Default is None.
        title (str): The title of the plot. Default is None.
        save_path (str): The path to save the image. Default is None, indicating no saving.

    Returns:
        None
    """

    plt.figure(figsize=(4, 2.7))
    # Use seaborn's barplot function to plot the graph, optionally as log2(value + 1)
    if log_transform:
        sns.barplot(x=df.index, y=np.log2(df[target_col]+1))
    else:
        sns.barplot(x=df.index, y=df[target_col])

    # Set the title and axis labels
    if title:
        plt.title(title)
    if x_label:
        plt.xlabel(x_label)
    else:
        plt.xlabel("Index")
    if y_label:
        plt.ylabel(y_label)
    else:
        plt.ylabel(target_col)

    plt.xticks(rotation=30, ha='right')  # Rotate the x-axis labels for better display

    # Save the image (if save_path is specified)
    if save_path:
        plt.savefig(save_path, bbox_inches='tight')

    # Show the image
    plt.show()

[33]:

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Global plotting parameter settings
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['figure.dpi'] = 150

custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params)

def plot_combined_barplots(datasets, target_col, titles, log_transforms, x_labels, y_labels, save_path=None):
    """
    Plots multiple barplots on a single figure, one for each dataset provided.
    """

    if not all(len(lst) == len(datasets) for lst in [titles, log_transforms, x_labels, y_labels]):
        raise ValueError("All list arguments must be of the same length as the 'datasets' list.")

    n = len(datasets)
    fig, axes = plt.subplots(1, n, figsize=(n * 4, 4.2))

    if n == 1:  # If there is only one dataset, axes will not be an array
        axes = [axes]

    for i, (df, title, log_transform, x_label, y_label) in enumerate(zip(datasets, titles, log_transforms, x_labels, y_labels)):
        ax = axes[i]
        plot_data = np.log2(df[target_col]+1) if log_transform else df[target_col]
        sns.barplot(ax=ax, x=df.index, y=plot_data)
        ax.set_title(title)
        ax.set_xlabel(x_label if x_label else "Index")
        ax.set_ylabel(y_label if y_label else target_col)

        # Set the tick parameters and label rotation
        ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha='right')

    # Fit the layout before saving so the saved file matches the displayed figure
    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, bbox_inches='tight')

    plt.show()


# The barplot of CTSE expression profiles
plot_combined_barplots(
    datasets=[HPA_protein_CTSE, rawdata_CTSE, STAVER_CTSE],
    target_col="CTSE",
    titles=["Expression profiles from HPA dataset (CTSE)",
            "Expression profiles from original data (CTSE)",
            "Expression profiles from STAVER data (CTSE)"],
    log_transforms=[True, False, False],
    x_labels=[None, None, None],
    y_labels=["Log2 (Relative expression)",
            "Log2 (Relative expression)",
            "Log2 (Relative expression)"],
    save_path="combined_barplots_CTSE.pdf"
)

_images/PanCancer_949_celline_analysis_53_0.png

GPA33 overexpression in colorectal and gastric cancer in the HPA dataset¶

[34]:

get_hpa_expression_profile("GPA33")

[34]:

'Bile duct cancer: 34.1;colorectal cancer: 100.5;Gastric cancer: 19.1'

[35]:

# Use a gene-specific name so the CTSE table defined earlier is not overwritten
HPA_GPA33_values = {
    'Bladder Carcinoma': 0,
    'Colorectal Carcinoma': 100.5,
    'Gastric Carcinoma': 19.1,
    'Glioma': 0,
    'Hepatocellular Carcinoma': 0,
    'Kidney Carcinoma': 0,
    'NSCLC': 0,
    'Pancreatic Carcinoma': 0,
    'SCLC': 0
}


HPA_protein_GPA33 = pd.DataFrame.from_dict(HPA_GPA33_values, orient='index', columns=['GPA33'])
HPA_protein_GPA33 = HPA_protein_GPA33.sort_values(by=['GPA33'], ascending=False)
HPA_protein_GPA33

[35]:

	GPA33
Colorectal Carcinoma	100.5
Gastric Carcinoma	19.1
Bladder Carcinoma	0.0
Glioma	0.0
Hepatocellular Carcinoma	0.0
Kidney Carcinoma	0.0
NSCLC	0.0
Pancreatic Carcinoma	0.0
SCLC	0.0

[39]:

rawdata_GPA33 = process_data(raw_data, "GPA33")
rawdata_GPA33.sort_values(by=['GPA33'], ascending=False, inplace=True)
rawdata_GPA33

[39]:

	GPA33
Cancer_type
Colorectal Carcinoma	16.132530
Gastric Carcinoma	14.659133
Non-Small Cell Lung Carcinoma	13.211101
Pancreatic Carcinoma	10.548243
Hepatocellular Carcinoma	9.851693
Kidney Carcinoma	9.565263
Bladder Carcinoma	9.219547
Small Cell Lung Carcinoma	9.125549
Glioma	9.036120

[41]:

STAVER_GPA33 = process_data(STAVER, "GPA33")
STAVER_GPA33.sort_values(by=['GPA33'], ascending=False, inplace=True)
STAVER_GPA33

[41]:

	GPA33
Cancer_type
Colorectal Carcinoma	9.632096
Gastric Carcinoma	7.820000
Bladder Carcinoma	NaN
Glioma	NaN
Hepatocellular Carcinoma	NaN
Kidney Carcinoma	NaN
Non-Small Cell Lung Carcinoma	NaN
Pancreatic Carcinoma	NaN
Small Cell Lung Carcinoma	NaN

[42]:

# The barplot of GPA33 expression profiles
plot_combined_barplots(
        datasets=[HPA_protein_GPA33, rawdata_GPA33, STAVER_GPA33],
        target_col="GPA33",
        titles=["Expression profiles from HPA dataset (GPA33)",
                "Expression profiles from original data (GPA33)",
                "Expression profiles from STAVER data (GPA33)"],
        log_transforms=[True, False, False],
        x_labels=[None, None, None],
        y_labels=["Log2 (Relative expression)",
                "Log2 (Relative expression)",
                "Log2 (Relative expression)"],
        save_path="combined_barplots_GPA33.pdf"
)

_images/PanCancer_949_celline_analysis_60_0.png

ADGRF1 overexpression in pancreatic cancer in the HPA dataset¶

[43]:

get_hpa_expression_profile("ADGRF1")

[43]:

'pancreatic cancer: 62.2'

[44]:

# Use a gene-specific name so the CTSE table defined earlier is not overwritten
HPA_ADGRF1_values = {
    'Bladder Carcinoma': 0,
    'Colorectal Carcinoma': 0,
    'Gastric Carcinoma': 0,
    'Glioma': 0,
    'Hepatocellular Carcinoma': 0,
    'Kidney Carcinoma': 0,
    'NSCLC': 0,
    'Pancreatic Carcinoma': 62.2,
    'SCLC': 0
}


HPA_protein_ADGRF1 = pd.DataFrame.from_dict(HPA_ADGRF1_values, orient='index', columns=['ADGRF1'])
HPA_protein_ADGRF1 = HPA_protein_ADGRF1.sort_values(by=['ADGRF1'], ascending=False)
HPA_protein_ADGRF1

[44]:

	ADGRF1
Pancreatic Carcinoma	62.2
Bladder Carcinoma	0.0
Colorectal Carcinoma	0.0
Gastric Carcinoma	0.0
Glioma	0.0
Hepatocellular Carcinoma	0.0
Kidney Carcinoma	0.0
NSCLC	0.0
SCLC	0.0

[45]:

rawdata_ADGRF1 = process_data(raw_data, "ADGRF1")
rawdata_ADGRF1.sort_values(by=['ADGRF1'], ascending=False, inplace=True)
rawdata_ADGRF1

[45]:

	ADGRF1
Cancer_type
Pancreatic Carcinoma	10.272240
Gastric Carcinoma	10.256628
Colorectal Carcinoma	10.093484
Bladder Carcinoma	9.910033
Glioma	NaN
Hepatocellular Carcinoma	NaN
Kidney Carcinoma	NaN
Non-Small Cell Lung Carcinoma	NaN
Small Cell Lung Carcinoma	NaN

[46]:

STAVER_ADGRF1 = process_data(STAVER, "ADGRF1")
STAVER_ADGRF1.sort_values(by=['ADGRF1'], ascending=False, inplace=True)
STAVER_ADGRF1

[46]:

	ADGRF1
Cancer_type
Pancreatic Carcinoma	7.594326
Bladder Carcinoma	NaN
Colorectal Carcinoma	NaN
Gastric Carcinoma	NaN
Glioma	NaN
Hepatocellular Carcinoma	NaN
Kidney Carcinoma	NaN
Non-Small Cell Lung Carcinoma	NaN
Small Cell Lung Carcinoma	NaN

[47]:

# The barplot of ADGRF1 expression profiles
plot_combined_barplots(
        datasets=[HPA_protein_ADGRF1, rawdata_ADGRF1, STAVER_ADGRF1],
        target_col="ADGRF1",
        titles=["Expression profiles from HPA dataset (ADGRF1)",
                "Expression profiles from original data (ADGRF1)",
                "Expression profiles from STAVER data (ADGRF1)"],
        log_transforms=[True, False, False],
        x_labels=[None, None, None],
        y_labels=["Log2 (Relative expression)",
                "Log2 (Relative expression)",
                "Log2 (Relative expression)"],
        save_path="combined_barplots_ADGRF1.pdf"
)

_images/PanCancer_949_celline_analysis_66_0.png

The reproducibility of the STAVER algorithm in identifying the previously reported tumor biomarkers¶

To further assess how reproducibly the STAVER algorithm identifies tumor biomarkers, we conducted a systematic literature review and selected a series of previously reported cancer biomarkers (Table RL1 and Table RL4). We then used the original and STAVER-processed proteomic data to examine the expression differences of these biomarkers across diverse cancer cell lines.

The results demonstrated that, compared with the original data, the STAVER-processed data identify the previously reported tumor biomarkers more accurately and with high reproducibility.

[49]:

# Read the curated biomarker table from the clipboard (it must be copied there beforehand)
reported_biomarkers = pd.read_clipboard()
reported_biomarkers

[49]:

	Tissue Type	Cancer Type	Reported Biomarker	Reference	PMID
0	Liver	Hepatocellular carcinoma	DCP	Journal of Hepatology, 2023	PMID: 37683735
1	Liver	Hepatocellular carcinoma	HSP70	Journal of hepatology, 2009	PMID: 19231003
2	Liver	Hepatocellular carcinoma	GGT1	Gastroenterology, 2017; BMC Cancer, 2019	PMID: 28711626; PMID: 31455253
3	Liver	Hepatocellular carcinoma	A1CF	Cell reports, 2019; The JCI, 2021	PMID: 31597092; PMID: 33445170
4	Kidney	Kidney cancer	CA9	European urology, 2014; Cancer research, 1997	PMID: 24821582; PMID: 9230182
5	Kidney	Kidney cancer	CD70	Cancer research, 2006	PMID: 16489038
6	Gastric	Gastric cancer	ERBB2	Annals of oncology, 2008	PMID: 18441328
7	Gastric	Gastric cancer	AGR3	In vivo, 2023	PMID: 36593009
8	Colorectal	Colorectal cancer	MUC2	Gastroenterology, 2005	PMID: 16285957
9	Colorectal	Colorectal cancer	EPCAM	British journal of cancer, 2014	PMID: 24786601
10	Colorectal	Colorectal cancer	CDX2	Annals of oncology, 2017	PMID: 28328000
11	Colorectal	Colorectal cancer	MUC13	Oncogene, 2019	PMID: 31427737
12	Brain	Glioma	GFAP	Brain, 2007; Biosens Bioelectron, 2020	PMID: 17998256; PMID: 33160234
13	Brain	Glioma	NES	Cancer cell, 2020; Mol Neurobiol, 2017	PMID: 32396858; PMID: 26768429
14	Pancreas	Pancreatic cancer	MUC1	Journal of gastroenterology, 2003	PMID: 14714254
15	Pancreas	Pancreatic cancer	ANO1	PNAS, 2019	PMID: 31182586
16	Lung	Non-small cell lung cancer	KRT5	Cancer cell, 2022	PMID: 36368318
17	Lung	Non-small cell lung cancer	DSC3	Journal of thoracic oncology, 2011	PMID: 21623236
18	Lung	Small cell lung cancer	CHGA	Cell reports, 2020	PMID: 33086069
19	Lung	Small cell lung cancer	NCAM1	Cancer cell, 2022	PMID: 36368318

Reported_cancer_biomarkers

All the results are shown in the above figure and can be reproduced with the R script boxplot_949_PanCancer.R.
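
As a rough Python counterpart to that comparison, the sketch below checks whether each reported biomarker has its highest median expression in the cancer type it was reported for. It is not the R script named above. The helper `top_cancer_type` and the toy table are hypothetical and only illustrate the idea; the toy values are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def top_cancer_type(data, gene, group_col="Cancer_type"):
    """Return the cancer type with the highest median expression of `gene`."""
    medians = data.groupby(group_col)[gene].median().dropna()
    return medians.idxmax() if not medians.empty else None


# Toy table: rows are cell lines, NaN marks a protein that was not quantified
toy = pd.DataFrame({
    "Cancer_type": ["Glioma", "Glioma", "Kidney Carcinoma", "Kidney Carcinoma"],
    "GFAP": [12.1, 11.8, np.nan, np.nan],
    "CA9": [np.nan, 6.0, 10.5, 10.9],
})

# Expected cancer type per biomarker, following the table above
expected = {"GFAP": "Glioma", "CA9": "Kidney Carcinoma"}

# True where the top-ranked cancer type matches the reported one
hits = {gene: top_cancer_type(toy, gene) == cancer
        for gene, cancer in expected.items()}
```

Running the same check on the original and on the STAVER-processed matrix would give one hit rate per dataset, which is a simple way to compare the two.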

The robustness and broad applicability of the STAVER algorithm for disease diagnosis and classification¶

To assess the value of the STAVER algorithm for potential clinical decision-making, we comprehensively evaluated the generalization performance and prediction accuracy of classification models constructed from the above tumor biomarkers.

To illustrate the robustness and broad applicability of the STAVER algorithm, we constructed three separate benchmark classification models based on different algorithms: a decision tree-based random forest, gradient boosting-based XGBoost, and a linear logistic regression model (Materials and methods section).

Compared with models built on the original proteomic data, the classification models built on STAVER-processed data showed better generalization performance in distinguishing cell lines of specific cancer types.

See the PanCancer_949_cellline_models.ipynb for more details.
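
For orientation only, the following sketch shows how such a benchmark could be set up with scikit-learn: the same classifiers are cross-validated on two versions of a feature matrix and their mean accuracies are compared. The data are synthetic, the names `cv_accuracy`, `clean`, and `noisy` are hypothetical, and XGBoost is left out to keep the example dependency-free. The actual models and settings are those in the notebook referenced above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 cancer types, 30 cell lines each, 5 marker proteins
n_per_class, n_features = 30, 5
labels = np.repeat([0, 1, 2], n_per_class)
signal = np.eye(3, n_features)[labels] * 3.0  # one marker per class
clean = signal + rng.normal(0, 1, size=signal.shape)
noisy = clean + rng.normal(0, 3, size=signal.shape)  # extra noise added


def cv_accuracy(X, y):
    """Mean 3-fold cross-validated accuracy for each benchmark model."""
    models = {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }
    return {name: cross_val_score(model, X, y, cv=3).mean()
            for name, model in models.items()}


scores_clean = cv_accuracy(clean, labels)
scores_noisy = cv_accuracy(noisy, labels)
```

Evaluating every model on both inputs with identical folds keeps the comparison between datasets separate from the choice of algorithm.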
