Methods
=======

Starting from a numerical organism count matrix (samples as columns,
organisms as rows, obtained from a taxonomic classifier) of the merged
reference and sink datasets, samples are first normalized relative to
each other, to correct for uneven sequencing depth, using the GMPR_
method (default). After normalization, Sourcepredict performs a
two-step prediction algorithm. First, it predicts the proportion of
unknown sources, *i.e.* sources not represented in the reference
dataset. Second, it predicts the proportion of each known source of the
reference dataset in the sink samples.

Organisms are represented by their taxonomic identifiers (TAXID).
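
The GMPR step can be illustrated with a minimal pure-Python sketch of
the published definition (this is not Sourcepredict's actual code):
each sample's size factor is the geometric mean, across all other
samples, of the median ratio of the organism counts they share.

.. code:: python

    import math
    from statistics import median

    def gmpr_size_factors(counts):
        """GMPR size factors for a list of samples, each a list of
        organism counts in the same organism order."""
        factors = []
        for j, sample in enumerate(counts):
            log_medians = []
            for k, other in enumerate(counts):
                if k == j:
                    continue
                # Median of count ratios over organisms present in both samples
                ratios = [c / o for c, o in zip(sample, other) if c > 0 and o > 0]
                if ratios:
                    log_medians.append(math.log(median(ratios)))
            # Geometric mean of the pairwise median ratios
            factors.append(math.exp(sum(log_medians) / len(log_medians)))
        return factors

    def gmpr_normalize(counts):
        """Divide each sample's counts by its GMPR size factor."""
        factors = gmpr_size_factors(counts)
        return [[c / f for c in sample] for sample, f in zip(counts, factors)]

    # Toy table: sample 1 is sample 0 sequenced at twice the depth,
    # so its size factor comes out larger
    samples = [[10, 0, 4, 6], [20, 0, 8, 12], [5, 3, 0, 7]]
    normalized = gmpr_normalize(samples)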

Prediction of unknown sources proportion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

| Let :math:`S_i \in \{S_1, .., S_n\}` be a sample from the normalized
  sinks dataset :math:`D_{sink}`,
  :math:`o_{j}^{\ i} \in \{o_{1}^{\ i},.., o_{n_o^{\ i}}^{\ i}\}` be an
  organism in :math:`S_i`, and :math:`n_o^{\ i}` be the total number of
  organisms in :math:`S_i`, with :math:`o_{j}^{\ i} \in \mathbb{Z}^+`.
| Let :math:`m` be the mean number of samples per class in the
  reference dataset, such that :math:`m = \frac{1}{O}\sum_{i=1}^{O}S_i`.
| For each sample :math:`S_i`, I define :math:`||m||` estimated samples
  :math:`U_k^{S_i} \in \{U_1^{S_i}, ..,U_{||m||}^{S_i}\}` to add to the
  reference dataset to account for the unknown source proportion in a
  test sample.

Separately for each :math:`S_i`, a proportion denoted
:math:`\alpha \in [0,1]` (default = :math:`0.1`) of each
:math:`o_{j}^{\ i}` organism of :math:`S_i` is added to each
:math:`U_k^{S_i}` sample such that
:math:`U_k^{S_i}(o_j^{\ i}) = \alpha \cdot x_{i \ j}`, where
:math:`x_{i \ j}` is sampled from a Gaussian distribution
:math:`\mathcal{N}\big(S_i(o_j^{\ i}),\ 0.01\big)`.

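A pure-Python sketch of this construction (the TAXIDs below are
hypothetical; it assumes :math:`||m||` means :math:`m` rounded to the
nearest integer and that :math:`0.01` is the standard deviation of the
Gaussian):

.. code:: python

    import random

    def make_unknown_samples(sink, m, alpha=0.1, seed=42):
        """Build ||m|| synthetic 'unknown' samples from one normalized
        sink sample, given as a dict mapping TAXID -> count."""
        rng = random.Random(seed)
        unknowns = []
        for _ in range(round(m)):
            # U_k(o_j) = alpha * x_j, with x_j ~ N(sink[o_j], 0.01)
            unknowns.append({taxid: alpha * rng.gauss(count, 0.01)
                             for taxid, count in sink.items()})
        return unknowns

    sink = {9606: 120.0, 562: 35.5, 1280: 8.2}  # hypothetical TAXID -> count
    unknowns = make_unknown_samples(sink, m=3.7)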
The :math:`||m||` :math:`U_k^{S_i}` samples are then added to the
reference dataset :math:`D_{ref}`, and labeled as *unknown*, to create
a new reference dataset denoted :math:`{}^{unk}D_{ref}`.

| To predict the proportion of unknown sources, a Bray-Curtis_ pairwise
  dissimilarity matrix of all :math:`S_i` and :math:`U_k^{S_i}` samples
  is computed using scikit-bio. This distance matrix is then embedded
  in two dimensions (default) with the scikit-bio implementation of
  PCoA.
| This sample embedding is divided into three subsets:
  :math:`{}^{unk}D_{train}` (:math:`64\%`), :math:`{}^{unk}D_{test}`
  (:math:`20\%`), and :math:`{}^{unk}D_{validation}` (:math:`16\%`).

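Sourcepredict delegates this step to scikit-bio; the Bray-Curtis
dissimilarity itself is simple enough to sketch directly (the PCoA
embedding step is omitted here):

.. code:: python

    def bray_curtis(u, v):
        """Bray-Curtis dissimilarity between two count vectors:
        1 - 2 * sum(min(u_i, v_i)) / (sum(u) + sum(v))."""
        shared = sum(min(a, b) for a, b in zip(u, v))
        return 1.0 - 2.0 * shared / (sum(u) + sum(v))

    def dissimilarity_matrix(samples):
        """Symmetric pairwise Bray-Curtis matrix with a zero diagonal."""
        n = len(samples)
        return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
                for i in range(n)]

    # Toy counts: a and b are identical, c shares no organisms with them
    a, b, c = [10, 0, 5], [10, 0, 5], [0, 8, 0]
    matrix = dissimilarity_matrix([a, b, c])
    # bray_curtis(a, b) → 0.0, bray_curtis(a, c) → 1.0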
| The scikit-learn implementation of the KNN algorithm is then trained
  on :math:`{}^{unk}D_{train}`, and the testing accuracy is computed
  with :math:`{}^{unk}D_{test}`.
| This trained KNN model is then corrected for probability estimation
  of the unknown proportion using the scikit-learn implementation of
  Platt_’s scaling method with :math:`{}^{unk}D_{validation}`.

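The role of the KNN model here can be sketched with a minimal
pure-Python version that returns class fractions among the k nearest
neighbors. This is only an illustration: the tool itself uses
scikit-learn, and the Platt-scaling correction (available in
scikit-learn as ``CalibratedClassifierCV(method='sigmoid')``) is
omitted from this sketch.

.. code:: python

    import math
    from collections import Counter

    def knn_proba(train_points, train_labels, query, k=3):
        """Class probabilities for `query` as label fractions among the
        k nearest training points (Euclidean distance in the embedding)."""
        dists = sorted((math.dist(p, query), label)
                       for p, label in zip(train_points, train_labels))
        votes = Counter(label for _, label in dists[:k])
        return {label: count / k for label, count in votes.items()}

    # Toy 2-D embedding: two 'known' reference points, one 'unknown' point
    points = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0)]
    labels = ['known', 'known', 'unknown']
    proba = knn_proba(points, labels, (0.1, 0.0), k=3)
    # → {'known': 2/3, 'unknown': 1/3}; p_u would be proba['unknown']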
The proportion of unknown sources in :math:`S_i`,
:math:`p_u \in [0,1]`, is then estimated using this trained and
corrected KNN model.

Ultimately, this process is repeated independently for each sink
sample :math:`S_i` of :math:`D_{sink}`.

Prediction of known source proportion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, only organism TAXIDs corresponding to the species taxonomic
level are retained using the ETE toolkit_. A weighted UniFrac (default)
pairwise distance_ matrix is then computed on the merged and
normalized training dataset :math:`D_{ref}` and test dataset
:math:`D_{sink}` with scikit-bio.

This distance matrix is then embedded in two dimensions (default)
using the scikit-learn implementation of t-SNE_.

The 2-dimensional embedding is then split back into the training
dataset :math:`{}^{tsne}D_{ref}` and the testing dataset
:math:`{}^{tsne}D_{sink}`.

| The training dataset :math:`{}^{tsne}D_{ref}` is further divided
  into three subsets: :math:`{}^{tsne}D_{train}` (:math:`64\%`),
  :math:`{}^{tsne}D_{test}` (:math:`20\%`), and
  :math:`{}^{tsne}D_{validation}` (:math:`16\%`).
| The KNN algorithm is then trained on the train subset, with a
  five-fold (default) cross-validation to find the optimal number of
  K-neighbors. The testing accuracy is then computed with
  :math:`{}^{tsne}D_{test}`. Finally, this second trained KNN model is
  also corrected for source proportion estimation using the
  scikit-learn implementation of Platt_’s method with
  :math:`{}^{tsne}D_{validation}`.

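The cross-validated search for the number of neighbors can be sketched
in pure Python (Sourcepredict itself relies on scikit-learn for this;
the fold assignment and candidate grid below are illustrative choices,
not the tool's defaults):

.. code:: python

    import math
    from collections import Counter

    def knn_predict(train, labels, query, k):
        """Majority label among the k nearest training points."""
        dists = sorted((math.dist(p, query), l) for p, l in zip(train, labels))
        return Counter(l for _, l in dists[:k]).most_common(1)[0][0]

    def best_k(points, labels, candidates=(1, 3, 5), folds=5):
        """Pick the k with the highest cross-validated accuracy: each
        point is scored by a model trained on the other folds."""
        scores = {}
        for k in candidates:
            correct = 0
            for i, (p, l) in enumerate(zip(points, labels)):
                train = [q for j, q in enumerate(points) if j % folds != i % folds]
                tlabs = [t for j, t in enumerate(labels) if j % folds != i % folds]
                correct += knn_predict(train, tlabs, p, k) == l
            scores[k] = correct / len(points)
        return max(scores, key=scores.get)

    # Two well-separated toy clusters in the embedding plane
    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.05, 5.05)]
    labels = ['a'] * 5 + ['b'] * 5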
The proportion :math:`p_{c_s} \in [0,1]` of each of the :math:`n_s`
sources :math:`c_s \in \{c_{1},\ ..,\ c_{n_s}\}` in each sample
:math:`S_i` is then estimated using this second trained and corrected
KNN model.

Combining unknown and source proportion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For each sample :math:`S_i` of the test dataset :math:`D_{sink}`, the
predicted unknown proportion :math:`p_{u}` is then combined with the
predicted proportion :math:`p_{c_s}` of each of the :math:`n_s`
sources :math:`c_s` of the training dataset, such that
:math:`\sum_{c_s=1}^{n_s} s_c + p_u = 1` where
:math:`s_c = p_{c_s} \cdot (1 - p_u)`.
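
For the final proportions to sum to one, the known-source
probabilities are scaled by the remaining :math:`(1 - p_u)` mass; a
minimal sketch (the source names are hypothetical):

.. code:: python

    def combine_proportions(p_unknown, known_proba):
        """Scale the known-source probabilities by (1 - p_u) so that,
        together with the unknown proportion, everything sums to 1."""
        combined = {src: p * (1.0 - p_unknown) for src, p in known_proba.items()}
        combined['unknown'] = p_unknown
        return combined

    # p_u = 0.2, and the known-source KNN says 75% soil / 25% gut
    mix = combine_proportions(0.2, {'soil': 0.75, 'gut': 0.25})
    # soil ≈ 0.6, gut ≈ 0.2, unknown = 0.2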

Finally, a summary table gathering the estimated source proportions is
returned as a ``csv`` file, together with the t-SNE embedding sample
coordinates.

.. _GMPR: https://peerj.com/articles/4600/
.. _Bray-Curtis: https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1942268
.. _scikit-learn: https://scikit-learn.org/stable/
.. _toolkit: http://etetoolkit.org/
.. _distance: https://www.ncbi.nlm.nih.gov/pubmed/17220268
.. _t-SNE: http://www.jmlr.org/papers/v9/vandermaaten08a.html
.. _Platt: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639