Commit f78d01d

update doc and bump version

Parent: 07420c3

File tree: 6 files changed (+123 -87 lines)

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -74,7 +74,7 @@ The documentation of SourcePredict is available here: [sourcepredict.readthedocs
 ### Environments included in the default source file
 
 - *Homo sapiens* gut microbiome
-- *Canis familiaris* gut microbiom
+- *Canis familiaris* gut microbiome
 - Soil microbiome
 
 ### Updating the source file
```

conda/meta.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 package:
   name: sourcepredict
-  version: "0.32"
+  version: "0.33"
 
 source:
   path: ../
```

docs/intro.md

Lines changed: 9 additions & 9 deletions

```diff
@@ -7,14 +7,14 @@ Prediction/source tracking of metagenomic samples source using machine learning
 
 ----
 
-[SourcePredict](https://github.com/maxibor/sourcepredict) is a Python package to classify and predict the source of metagenomics sample given a training set.
+SourcePredict [(github.com/maxibor/sourcepredict)](https://github.com/maxibor/sourcepredict) is a Python Conda package to classify and predict the origin of metagenomic samples, given a reference dataset of known origins, a problem also known as source tracking.
 
-The DNA shotgun sequencing of human, animal, and environmental samples opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics.
-One of the goals of metagenomics is to look at the composition of a sequencing sample with tools known as taxonomic classifiers.
-These taxonomic classifiers, such as Kraken for example, will compute the taxonomic composition in Operational Taxonomic Unit (OTU), from the DNA sequencing data.
+DNA shotgun sequencing of human, animal, and environmental samples has opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics.
+One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers.
+These taxonomic classifiers, such as [Kraken](https://ccb.jhu.edu/software/kraken/), will compute the organism taxonomic composition from the DNA sequencing data.
 
-When in most cases the origin of a metagenomic sample is known, it is sometimes part of the research question to infer and/or confirm its source.
-Using samples of known sources, a training set can be established with the OTU sample composition as features, and the source of the sample as class labels.
-With this training set, a machine learning algorithm can be trained to predict the source of unlabeled samples from their OTU taxonomic composition.
-
-SourcePredict performs the classification/prediction of unlabeled samples sources from their OTU taxonomic compositions.
+In cases where the origin of a metagenomic sample, its source, is unknown, it is often part of the research question to predict and/or confirm the source.
+Using samples of known sources, a reference dataset can be established with the taxonomic composition of the samples, *i.e.* the organisms identified in the samples as features, and the sources of the samples as class labels.
+With this reference dataset, a machine learning algorithm can be trained to predict the source of unknown samples (sinks) from their taxonomic composition.
+Other tools to predict the source of a sample already exist, such as [SourceTracker](https://www.nature.com/articles/nmeth.1650), which employs Gibbs sampling.
+However, because Sourcepredict uses a dimension reduction algorithm followed by K-Nearest-Neighbors (KNN) classification, interpreting the results is more straightforward thanks to the embedding of the samples in a human-observable low-dimensional space.
```
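To make the reference-dataset idea in the updated intro concrete, here is a minimal editor's sketch (not part of this commit, and not Sourcepredict's actual code): taxonomic counts as features, source environments as class labels, and a KNN classifier predicting the source of an unlabeled sink sample. The count profiles and labels are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy reference dataset: rows = samples, columns = organism (TAXID) counts.
# Labels give the source environment of each reference sample.
rng = np.random.default_rng(0)
X_ref = np.vstack([rng.poisson([50, 5, 1], size=(10, 3)),    # "gut"-like profiles
                   rng.poisson([2, 40, 30], size=(10, 3))])  # "soil"-like profiles
y_ref = ["gut"] * 10 + ["soil"] * 10

knn = KNeighborsClassifier(n_neighbors=3).fit(X_ref, y_ref)

sink = [[45, 8, 0]]                # unlabeled sample: counts resemble the gut profiles
prediction = knn.predict(sink)[0]
```

In Sourcepredict itself the features are first normalized and embedded before classification, as described in the `docs/methods.rst` changes below.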

docs/methods.rst

Lines changed: 97 additions & 64 deletions

```diff
@@ -1,84 +1,117 @@
 Methods
 =======
 
-All samples are first normalized to correct for uneven sequencing depth
-using GMPR_ (default). After normalization, Sourcepredict
-performs a two steps prediction: first a prediction of the proportion of
-unknown sources, i.e. not represented in the reference dataset. Then a
-prediction of the proportion of each known source of the reference
-dataset in the test samples.
-
-Organism are represented by their taxonomic identifiers (TAXID).
+Starting with a numerical organism count matrix (samples as columns,
+organisms as rows, obtained by a taxonomic classifier) of merged
+references and sinks datasets, samples are first normalized relative to
+each other, to correct for uneven sequencing depth using the GMPR_ method
+(default). After normalization, Sourcepredict performs a
+two-step prediction algorithm. First, it predicts the proportion of
+unknown sources, *i.e.* which are not represented in the reference
+dataset. Second, it predicts the proportion of each known source of the
+reference dataset in the sink samples.
+
+Organisms are represented by their taxonomic identifiers (TAXID).
```
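As an aside on the normalization step named in this hunk: GMPR stands for geometric mean of pairwise ratios. Below is a simplified editor's sketch of the idea (not Sourcepredict's implementation; `gmpr_size_factors` is a hypothetical helper name): for each pair of samples, take the median of count ratios over organisms present in both, then take the geometric mean of those medians as the sample's size factor.

```python
import numpy as np

def gmpr_size_factors(counts):
    """Simplified GMPR sketch. counts: (organisms x samples) matrix.

    r_ij = median of count ratios over organisms shared by samples i and j;
    size factor of sample i = geometric mean of r_ij over all j != i.
    """
    n = counts.shape[1]
    factors = np.zeros(n)
    for i in range(n):
        ratios = []
        for j in range(n):
            if i == j:
                continue
            shared = (counts[:, i] > 0) & (counts[:, j] > 0)
            if shared.any():
                ratios.append(np.median(counts[shared, i] / counts[shared, j]))
        factors[i] = np.exp(np.mean(np.log(ratios)))  # geometric mean
    return factors

counts = np.array([[10, 20, 0],
                   [5, 10, 5],
                   [0, 4, 2]], dtype=float)
sf = gmpr_size_factors(counts)
normalized = counts / sf  # each sample (column) divided by its size factor
```

The real GMPR method (linked in the document) handles zero-inflated microbiome counts more carefully; this sketch only conveys the pairwise-ratio idea.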
```diff
 
 Prediction of unknown sources proportion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-| Let :math:`S` be a sample of size :math:`O` organims from the test
-  dataset :math:`D_{sink}`
-| Let :math:`n` be the average number of samples per class in the
-  reference dataset.
-| I define :math:`U_n` samples to add to the training dataset to account
-  for the unknown source proportion in a test sample.
-
-To compute :math:`U_n`, a :math:`\alpha` proportion (default =
-:math:`0.1`) of each :math:`o_i` organism (with :math:`i\in[1,O]`) is
-added to the training dataset for each :math:`U_j` samples (with
-:math:`j\in[1,n]`), such as :math:`U_j(o_i) = \alpha\times S_(o_i)`
-
-The :math:`U_n` samples are then merged as columns to the reference
-dataset (:math:`D_{ref}`) to create a new reference dataset denoted
-:math:`D_{ref\ unknown}`
-
-| To predict this unknown proportion, the dimension of the reference
-  dataset :math:`D_{ref\ unknown}` (samples in columns, organisms as
-  rows) is first reduced to 20 with the scikit-learn_
-  implementation of PCA.
-| This reference dataset is then divided into three subsets:
-  :math:`D_{train\ unknown}` (64%), :math:`D_{test\ unknown}` (20%), and
-  :math:`D_{validation unknown}`\ (16%).
-
-| The scikit-learn implementation of K-Nearest-Neighbors (KNN) algorithm
-  is then trained on :math:`D_{train\ unknown}`, and the test accuracy
-  is computed with :math:`D_{test\ unknown}`.
-| The trained KNN model is then corrected for probability estimation of
-  unknown proportion using the scikit-learn implementation of the
-  Platt’s scaling method_ with :math:`D_{validation\ unknown}`.
-  This procedure is repeated for each sample of the test dataset.
-
-The proportion of unknown :math:`p_{unknown}` sources in each sample is
-then computed using the trained and corrected KNN model.
+| Let :math:`S_i \in \{S_1, .., S_n\}` be a sample from the normalized
+  sinks dataset :math:`D_{sink}`,
+  :math:`o_{j}^{\ i} \in \{o_{1}^{\ i},.., o_{n_o^{\ i}}^{\ i}\}` be an
+  organism in :math:`S_i`, and :math:`n_o^{\ i}` be the total number of
+  organisms in :math:`S_i`, with :math:`o_{j}^{\ i} \in \mathbb{Z}+`.
+| Let :math:`m` be the mean number of samples per class in the reference
+  dataset, such that :math:`m = \frac{1}{O}\sum_{i=1}^{O}S_i`.
+| For each :math:`S_i` sample, I define :math:`||m||` estimated samples
+  :math:`U_k^{S_i} \in \{U_1^{S_i}, ..,U_{||m||}^{S_i}\}` to add to the
+  reference dataset to account for the unknown source proportion in a
+  test sample.
+
+Separately for each :math:`S_i`, a proportion denoted
+:math:`\alpha \in [0,1]` (default = :math:`0.1`) of each of the
+:math:`o_{j}^{\ i}` organisms of :math:`S_i` is added to each of the
+:math:`U_k^{S_i}` samples such that
+:math:`U_k^{S_i}(o_j^{\ i}) = \alpha \cdot x_{i \ j}`, where
+:math:`x_{i \ j}` is sampled from a Gaussian distribution
+:math:`\mathcal{N}(S_i(o_j^{\ i}), 0.01)`.
+
+The :math:`||m||` :math:`U_k^{S_i}` samples are then added to the
+reference dataset :math:`D_{ref}`, and labeled as *unknown*, to create a
+new reference dataset denoted :math:`{}^{unk}D_{ref}`.
+
+| To predict the proportion of unknown sources, a Bray-Curtis_ pairwise
+  dissimilarity matrix of all :math:`S_i` and :math:`U_k^{S_i}` samples
+  is computed using scikit-bio. This distance matrix is then embedded in
+  two dimensions (default) with the scikit-bio implementation of PCoA.
+| This sample embedding is divided into three subsets:
+  :math:`{}^{unk}D_{train}` (:math:`64\%`), :math:`{}^{unk}D_{test}`
+  (:math:`20\%`), and :math:`{}^{unk}D_{validation}` (:math:`16\%`).
+
+| The scikit-learn implementation of the KNN algorithm is then trained
+  on :math:`{}^{unk}D_{train}`, and the training accuracy is computed
+  with :math:`{}^{unk}D_{test}`.
+| This trained KNN model is then corrected for probability estimation of
+  the unknown proportion using the scikit-learn implementation of
+  Platt_’s scaling method with :math:`{}^{unk}D_{validation}`.
+
+The proportion of unknown sources in :math:`S_i`, :math:`p_u \in [0,1]`,
+is then estimated using this trained and corrected KNN model.
+
+Ultimately, this process is repeated independently for each sink sample
+:math:`S_i` of :math:`D_{sink}`.
```
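The construction of the :math:`U_k^{S_i}` samples described in this hunk (an :math:`\alpha` fraction of each organism count, jittered by a narrow Gaussian) can be sketched as follows. This is an editor's illustration, and `make_unknown_samples` is a hypothetical helper name, not Sourcepredict's API.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_unknown_samples(sink, n_unknown, alpha=0.1):
    """For one sink sample (vector of organism counts), build n_unknown
    'unknown' samples: each organism count is alpha times a value drawn
    from a Gaussian centered on the sink count (standard deviation 0.01)."""
    sink = np.asarray(sink, dtype=float)
    return np.stack([alpha * rng.normal(sink, 0.01) for _ in range(n_unknown)])

sink = [120, 40, 0, 8]                            # counts for one sink sample
unknown = make_unknown_samples(sink, n_unknown=3)  # shape (3, 4), each row ~ 0.1 * sink
```

These synthetic rows are what gets appended to the reference dataset under the *unknown* label before the Bray-Curtis/PCoA embedding step.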
```diff
 
 Prediction of known source proportion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-First, only organism TAXID corresponding to the *species* taxonomic
-level are kept using ETE toolkit_. A distance matrix is then
-computed on the merged training dataset :math:`D_{ref}` and test dataset
-:math:`D_{sink}` using the scikit-bio implementation of weighted Unifrac
-distance_ (default).
+First, only organism TAXIDs corresponding to the species taxonomic level
+are retained using the ETE toolkit_. A weighted Unifrac (default)
+pairwise distance_ matrix is then computed on the merged and
+normalized training dataset :math:`D_{ref}` and test dataset
+:math:`D_{sink}` with scikit-bio.
```
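The pairwise distance matrix described here can be sketched in a few lines. Weighted Unifrac needs a taxonomy tree (hence scikit-bio in the actual tool), so this editor's illustration substitutes Bray-Curtis via SciPy as a stand-in metric with the same samples-by-samples output shape.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Merged reference + sink samples: rows = samples, columns = organism counts.
merged = np.array([[10.0, 0.0, 5.0],
                   [8.0, 1.0, 4.0],
                   [0.0, 12.0, 3.0]])

# Condensed pairwise dissimilarities, expanded to a square, symmetric
# (samples x samples) matrix with zeros on the diagonal.
dist = squareform(pdist(merged, metric="braycurtis"))
```

Any embedding step (PCoA, t-SNE with a precomputed metric) then operates on this square matrix.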
```diff
 
-The distance matrix is then embedded in two dimensions using the
-scikit-learn implementation of t-SNE_.
+This distance matrix is then embedded in two dimensions (default) using
+the scikit-learn implementation of t-SNE_.
 
 The 2-dimensional embedding is then split back to training
-:math:`D_{ref\ tsne}` and testing dataset :math:`D_{sink\ tsne}`.
-
-| The training dataset :math:`D_{ref\ tsne}` is further divided into
-  three subsets: :math:`D_{train\ tsne}` (64%), :math:`D_{test\ tsne}`
-  (20%), and :math:`D_{validation\ tsne}` (16%).
-| The scikit-learn implementation of K-Nearest-Neighbors (KNN) algorithm
-  is then trained on the train subset, and the test accuracy is computed
-  with :math:`D_{test\ tsne}`.
-| The trained KNN model is then corrected for source proportion
-  estimation using the scikit-learn implementation of the Platt’s method
-  with :math:`D_{validation\ tsne}`.
-
-The proportion of each source :math:`p_{c}` sources in each sample is
-then computed using the trained and corrected KNN model.
+:math:`{}^{tsne}D_{ref}` and testing dataset :math:`{}^{tsne}D_{sink}`.
+
+| The training dataset :math:`{}^{tsne}D_{ref}` is further divided into
+  three subsets: :math:`{}^{tsne}D_{train}` (:math:`64\%`),
+  :math:`{}^{tsne}D_{test}` (:math:`20\%`), and
+  :math:`{}^{tsne}D_{validation}` (:math:`16\%`).
+| The KNN algorithm is then trained on the train subset, with a five-fold
+  (default) cross validation to look for the optimum number of
+  K-neighbors. The training accuracy is then computed with
+  :math:`{}^{tsne}D_{test}`. Finally, this second trained KNN model is
+  also corrected for source proportion estimation using the scikit-learn
+  implementation of Platt’s method with
+  :math:`{}^{tsne}D_{validation}`.
+
+The proportion :math:`p_{c_s} \in [0,1]` of each of the :math:`n_s`
+sources :math:`c_s \in \{c_{1},\ ..,\ c_{n_s}\}` in each sample
+:math:`S_i` is then estimated using this second trained and corrected
+KNN model.
+
+Combining unknown and source proportion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For each sample :math:`S_i` of the test dataset :math:`D_{sink}`, the
+predicted unknown proportion :math:`p_{u}` is then combined with the
+predicted proportion :math:`p_{c_s}` for each of the :math:`n_s` sources
+:math:`c_s` of the training dataset such that
+:math:`\sum_{c_s=1}^{n_s} s_c + p_u = 1` where
+:math:`s_c = p_{c_s} \cdot (1 - p_u)`.
+
+Finally, a summary table gathering the estimated sources proportions is
+returned as a ``csv`` file, as well as the t-SNE embedding sample
+coordinates.
 
 .. _GMPR: https://peerj.com/articles/4600/
+.. _Bray-Curtis: https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1942268
 .. _scikit-learn: https://scikit-learn.org/stable/
 .. _method: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639
 .. _toolkit: http://etetoolkit.org/
 .. _distance: https://www.ncbi.nlm.nih.gov/pubmed/17220268
-.. _t-SNE: http://www.jmlr.org/papers/v9/vandermaaten08a.html
+.. _t-SNE: http://www.jmlr.org/papers/v9/vandermaaten08a.html
+.. _Platt: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639
```

Note: the committed text reads :math:`s_c = p_{c_s} \cdot p_u`, which would not satisfy :math:`\sum_{c_s=1}^{n_s} s_c + p_u = 1`; the scaling factor must be :math:`(1 - p_u)`, as reconstructed above.
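The train/test/validation split, cross-validated KNN, Platt-scaling correction, and final combination with the unknown proportion described in this file can be sketched end to end. This is an editor's illustration under stated assumptions: synthetic 2-D points stand in for the t-SNE coordinates, `p_u` is a made-up unknown proportion, and the scikit-learn calls mirror but are not Sourcepredict's code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import CalibratedClassifierCV

# 2-D embedded reference samples (stand-in for t-SNE coordinates),
# three known sources as class labels.
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=42)

# 64% / 20% / 16% split, as in the methods text.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.36, stratify=y, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=4 / 9, stratify=y_rest, random_state=42)

# KNN with a 5-fold cross-validated search for the number of neighbors.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
accuracy = grid.score(X_test, y_test)

# Platt scaling ("sigmoid") fitted on the held-out validation subset.
calibrated = CalibratedClassifierCV(
    KNeighborsClassifier(**grid.best_params_), method="sigmoid", cv=3)
calibrated.fit(X_val, y_val)
p_sources = calibrated.predict_proba(X[:1])[0]  # per-source probabilities

# Combining with a previously estimated unknown proportion p_u, so that
# the scaled source proportions plus p_u sum to 1.
p_u = 0.2  # hypothetical output of the unknown-source model
combined = p_sources * (1 - p_u)
```

Calibration on a held-out subset is what turns the raw KNN votes into usable proportion estimates, which is the whole point of the Platt-scaling step in the document.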

docs/usage.md

Lines changed: 14 additions & 11 deletions

````diff
@@ -1,18 +1,21 @@
 # Usage
 
 ```bash
-$ python sourcepredict -h
-usage: SourcePredict v0.32 [-h][-a alpha] [-s SOURCES][-l labels]
-[-n NORMALIZATION][-dt distance] [-me METHOD][-e embed] [-di DIM][-o output] [-se SEED][-k kfold] [-t THREADS]
+$ sourcepredict -h
+usage: SourcePredict v0.33 [-h] [-a ALPHA] [-s SOURCES] [-l LABELS]
+                           [-n NORMALIZATION] [-dt DISTANCE] [-me METHOD]
+                           [-e EMBED] [-di DIM] [-o OUTPUT] [-se SEED]
+                           [-k KFOLD] [-t THREADS]
                            otu_table
 
 ==========================================================
-SourcePredict v0.32
+SourcePredict v0.33
 Coprolite source classification
 Author: Maxime Borry
-Contact: <borry[at]shh.mpg.de>
+Contact: <borry[at]shh.mpg.de>
+Homepage & Documentation: github.com/maxibor/sourcepredict
+==========================================================
 
-# Homepage & Documentation: github.com/maxibor/sourcepredict
 
 positional arguments:
   otu_table             path to otu table in csv format
@@ -24,19 +27,19 @@ optional arguments:
                         data/modern_gut_microbiomes_sources.csv
   -l LABELS             Path to labels csv file. Default =
                         data/modern_gut_microbiomes_labels.csv
-  -n NORMALIZATION      Normalization method (RLE | CLR | Subsample | GMPR).
-                        Default = GMPR
+  -n NORMALIZATION      Normalization method (RLE | Subsample | GMPR). Default =
+                        GMPR
   -dt DISTANCE          Distance method. (unweighted_unifrac | weighted_unifrac)
                         Default = weighted_unifrac
   -me METHOD            Embedding Method. TSNE or UMAP. Default = TSNE
   -e EMBED              Output embedding csv file. Default = None
   -di DIM               Number of dimensions to retain for dimension reduction.
                         Default = 2
   -o OUTPUT             Output file basename. Default =
-                        &lt;sample_basename>.sourcepredict.csv
+                        <sample_basename>.sourcepredict.csv
   -se SEED              Seed for random generator. Default = 42
-  -k KFOLD              Number of fold for K-fold cross validation in feature
-                        selection and parameter optimization. Default = 5
+  -k KFOLD              Number of fold for K-fold cross validation in parameter
+                        optimization. Default = 5
   -t THREADS            Number of threads for parallel processing. Default = 2
 
 ```
````

sourcepredict

Lines changed: 1 addition & 1 deletion

```diff
@@ -112,7 +112,7 @@ Homepage & Documentation: github.com/maxibor/sourcepredict
 
 
 if __name__ == "__main__":
-    version = "0.32"
+    version = "0.33"
     warnings.filterwarnings("ignore")
     SINK, ALPHA, NORMALIZATION, SOURCES, LABELS, SEED, DISTANCE, METHOD, DIM, OUTPUT, EMBED_CSV, KFOLD, THREADS = _get_args()
     SEED = utils.check_gen_seed(SEED)
```
