Enhancing Quantitative Data-Independent-Acquisition Proteomics of Partially Heterogeneous Samples with Missing Value Imputation

You can download our poster presented at ASMS 2021 or connect with us on Twitter if you have questions.

Introduction

Relative quantification of peptides is a common application of data-independent-acquisition mass spectrometry (DIA). One advantage of DIA over data-dependent acquisition (DDA) is the lower incidence of missing values due to inconsistent measurement or data processing steps, especially when employing strict control of false discovery rates (FDR). However, missing values can still be encountered in DIA results due to noise filtering/suppression (so-called “left censorship” of observed intensities), or when analyzing partially heterogeneous mixtures where some analytes can only be measured in a subset of samples. The consistency of DIA acquisition means missing values are usually caused by low intensity or absence of the analyte, while those encountered in DDA data are more likely to be missing-at-random.

Methods

Previous work on DIA quantification has often assumed that samples are homogeneous, or that measurements of low-intensity or absent analytes will reflect instrument noise and be suitable for downstream analysis. However, such background integration can also include interference, which heavily impacts results.

We have developed a relative protein quantification workflow for DIA that includes a configurable threshold for determining the absence of analytes in individual samples, missing-value imputation, and imputation-aware estimation of the statistical significance of quantitative changes. After global identification and FDR control, the threshold excludes peptides from quantification in runs where the identification is not accepted at the configured FDR level (q-value). Because all analytes are (putatively) present in at least one sample, this filtering does not affect the overall FDR of the experiment (controlled at 1% in the experiments below).
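The per-run threshold described above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function name `apply_run_threshold`, the dictionary layout, and the default of 0.05 are assumptions made for the example.

```python
def apply_run_threshold(measurements, q_threshold=0.05):
    """Mark a peptide as missing in runs where its q-value exceeds the threshold.

    measurements: dict mapping (peptide, run) -> (q_value, intensity).
    The layout and the default threshold are hypothetical choices for
    this sketch. Returns a dict mapping (peptide, run) -> intensity or None.
    """
    filtered = {}
    for key, (q_value, intensity) in measurements.items():
        # Keep the intensity only where the identification is accepted in this run.
        filtered[key] = intensity if q_value <= q_threshold else None
    return filtered


example = {
    ("PEPTIDEA", "run1"): (0.001, 5.0e5),
    ("PEPTIDEA", "run2"): (0.300, 1.2e3),
}
filtered = apply_run_threshold(example, q_threshold=0.05)
```

Values set to `None` here are the ones that would later be treated as missing and passed to imputation.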

Missing values are imputed by quantile regression imputation of left-censored data (QRILC), which draws values from a truncated log-normal distribution fit to the non-missing measurements for an analyte (in all categories). Significance estimates use degrees of freedom reduced by the number of imputed values.
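To make the imputation step concrete, here is a simplified, QRILC-style sketch for a single analyte. It is not the reference QRILC implementation: the plotting positions used for the quantile fit, the `eps` offset, and the function name `qrilc_like_impute` are assumptions for illustration. The idea is to regress the sorted log intensities on normal quantiles above the missing fraction, then draw replacements from the lower, truncated tail.

```python
import math
import random
from statistics import NormalDist, mean


def qrilc_like_impute(intensities, rng=None, eps=0.001):
    """Fill None entries with draws from a truncated normal on the log scale.

    intensities: list of positive floats or None (missing), one per run.
    Returns (filled_list, number_of_imputed_values).
    """
    rng = rng or random.Random()
    observed = sorted(math.log(x) for x in intensities if x is not None)
    n_missing = len(intensities) - len(observed)
    if n_missing == 0:
        return list(intensities), 0
    if len(observed) < 2:
        raise ValueError("need at least two observed values to fit")

    # Assumption: observed values occupy the quantiles above the missing fraction.
    lo = n_missing / len(intensities) + eps
    hi = 1.0 - eps
    step = (hi - lo) / (len(observed) - 1)
    z = [NormalDist().inv_cdf(lo + i * step) for i in range(len(observed))]

    # Least-squares line: observed = mu + sigma * z.
    z_bar, y_bar = mean(z), mean(observed)
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    sigma = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, observed)) / sxx
    if sigma <= 0:
        raise ValueError("observed values have no spread; cannot fit")
    fitted = NormalDist(y_bar - sigma * z_bar, sigma)

    filled = []
    for x in intensities:
        if x is None:
            # Inverse-CDF sampling restricted to the tail below quantile `lo`.
            u = rng.uniform(1e-12, lo)
            filled.append(math.exp(fitted.inv_cdf(u)))
        else:
            filled.append(x)
    return filled, n_missing
```

The returned count of imputed values is what a downstream significance test would subtract from its degrees of freedom.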

Results and Discussion

We initially analyzed the relationship between protein-level intensity and coefficient of variation (CV). We acquired two sets of three replicates, one with DDA techniques, and the other in parallel using DIA. Analyzing each set of samples separately, we controlled protein FDR at 1%, and computed the relationship between intensity and CV for proteins with measured intensities in all three replicates by determining the median CV and a 50% confidence interval in a sliding window over intensities. For analytes with a single missing value, we computed the mean and CV using QRILC. This allowed us to compare the agreement between the intensity-CV relationship deduced from all-present data and that deduced from analytes with missing values, and we found good agreement.
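The intensity-CV relationship described above can be sketched as a sliding window over proteins sorted by mean intensity. This is an illustrative assumption about the procedure: the window is defined here by a fixed number of proteins, the 50% interval is taken as the 25th to 75th percentile of CVs in the window, and the names `cv` and `sliding_cv_band` are hypothetical.

```python
from statistics import mean, median, quantiles, stdev


def cv(values):
    """Coefficient of variation: sample standard deviation over mean."""
    return stdev(values) / mean(values)


def sliding_cv_band(replicate_intensities, window=5):
    """Median CV and central 50% band in a sliding window over intensity.

    replicate_intensities: list of per-protein lists of replicate
    intensities, with no missing values.
    Returns a list of (median_intensity, q25, median_cv, q75) tuples.
    """
    points = sorted((mean(v), cv(v)) for v in replicate_intensities)
    bands = []
    for start in range(len(points) - window + 1):
        chunk = points[start:start + window]
        cvs = [c for _, c in chunk]
        q25, _, q75 = quantiles(cvs, n=4, method="inclusive")
        centre = median([m for m, _ in chunk])
        bands.append((centre, q25, median(cvs), q75))
    return bands
```

A CV computed from an imputed analyte can then be compared against the `q25` to `q75` band at the nearest intensity.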

For DIA, 10 of 22 QRILC-produced points fall inside the 50% confidence interval estimated from proteins with a quantitative value in each sample. We saw a larger number of analytes with missing values in the DDA experiment, and the distribution of QRILC-computed CVs seemed broader than that determined from proteins without missing values. More missing values are expected with DDA, but the increased spread of CVs may indicate that more of the DDA missing values are missing-at-random.

In an experiment where some samples were infected with tuberculosis (TB), nearly all pathogen proteins are left unquantified in uninfected samples when using reasonable threshold settings (0.01 to 0.2, compared to the global FDR threshold of 0.01). The threshold is stable and almost exclusively affects these false-positive proteins, leaving host cell proteins quantified.

Thresholding is again stable in a similar analysis of Plasmodium-infected cells. Consistency between runs indicates imputation is unlikely to recover false positives (though more replicates are required for imputation). Note that one replicate of this experiment exhibited significant carry-over from previous samples and was not used. Analysis was further complicated by peptides shared between proteomes, which were excluded from quantification to avoid spurious quantification of Plasmodium proteins; this had a small but noticeable effect.

Treating analytes as truly absent simplifies analysis of quantitative results, but risks confusing low intensity with absence. We analyzed six samples with synthetic peptides spiked into a complex proteome at extreme dilutions (2-3 orders of magnitude) below the limit of quantification.

Unlike false positives, these low-intensity analytes are not excluded at reasonable threshold settings. At more stringent settings, imputation (across only these runs) recovers quantities for a large proportion.

We further analyzed the statistical significance of spiked-in analytes (by ANOVA, p < 0.05, Benjamini-Hochberg corrected) across nine dilutions with three technical replicates at each concentration.
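For reference, the Benjamini-Hochberg correction mentioned above can be sketched in a few lines. The function name `benjamini_hochberg` is chosen for this example, and the code is a generic illustration of the standard step-up adjustment rather than the workflow's own implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
# Indices of analytes that pass the p < 0.05 cut-off after correction.
significant = [i for i, p in enumerate(adjusted) if p < 0.05]
```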

Reduced significance at strict settings is primarily caused by imputation, which is applied across all categories and reduces the degrees of freedom by the number of missing values. At setting 0.003, only one spiked-in protein is quantified in the three lowest concentrations, possibly due to shared fragments. The threshold has minimal impact on background proteins, with only one found significant, albeit at an extreme setting. This may be a random result, but it indicates that quantitative results are affected by very stringent thresholds, even if significance is not.
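The effect of imputation on the degrees of freedom can be illustrated with a one-way ANOVA F statistic. This sketch assumes that the reduction is applied to the within-group (residual) degrees of freedom; the text does not specify which term is adjusted, and the function name `anova_f_reduced_df` is hypothetical.

```python
from statistics import mean


def anova_f_reduced_df(groups, n_imputed=0):
    """One-way ANOVA F statistic with imputation-reduced residual df.

    groups: list of lists of (log) intensities, one list per category.
    n_imputed: number of values in `groups` that were imputed.
    Assumption: the residual degrees of freedom are N - k - n_imputed.
    Returns (f_statistic, df_between, df_within).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n - k - n_imputed
    if df_within <= 0:
        raise ValueError("too many imputed values for a significance estimate")
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within
```

With nine dilutions and three replicates each, the residual degrees of freedom start at 27 - 9 = 18, so each imputed value removes one of 18 and lowers the F statistic for the same sums of squares.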
