Commit f78d01d

update doc and bump version

Parent: 07420c3

File tree: 6 files changed (+123 -87 lines)

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -74,7 +74,7 @@ The documentation of SourcePredict is available here: [sourcepredict.readthedocs
 ### Environments included in the default source file
 
 - *Homo sapiens* gut microbiome
-- *Canis familiaris* gut microbiom
+- *Canis familiaris* gut microbiome
 - Soil microbiome
 
 ### Updating the source file
```

conda/meta.yaml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 package:
   name: sourcepredict
-  version: "0.32"
+  version: "0.33"
 
 source:
   path: ../
```

docs/intro.md

Lines changed: 9 additions & 9 deletions

```diff
@@ -7,14 +7,14 @@ Prediction/source tracking of metagenomic samples source using machine learning
 
 ----
 
-[SourcePredict](https://github.com/maxibor/sourcepredict) is a Python package to classify and predict the source of metagenomics sample given a training set.
+SourcePredict [(github.com/maxibor/sourcepredict)](https://github.com/maxibor/sourcepredict) is a Python Conda package to classify and predict the origin of metagenomic samples, given a reference dataset of known origins, a problem also known as source tracking.
 
-The DNA shotgun sequencing of human, animal, and environmental samples opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics.
-One of the goals of metagenomics is to look at the composition of a sequencing sample with tools known as taxonomic classifiers.
-These taxonomic classifiers, such as Kraken for example, will compute the taxonomic composition in Operational Taxonomic Unit (OTU), from the DNA sequencing data.
+DNA shotgun sequencing of human, animal, and environmental samples has opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics.
+One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers.
+These taxonomic classifiers, such as [Kraken](https://ccb.jhu.edu/software/kraken/), will compute the organism taxonomic composition from the DNA sequencing data.
 
-When in most cases the origin of a metagenomic sample is known, it is sometimes part of the research question to infer and/or confirm its source.
-Using samples of known sources, a training set can be established with the OTU sample composition as features, and the source of the sample as class labels.
-With this training set, a machine learning algorithm can be trained to predict the source of unlabeled samples from their OTU taxonomic composition.
-
-SourcePredict performs the classification/prediction of unlabeled samples sources from their OTU taxonomic compositions.
+In cases where the origin of a metagenomic sample, its source, is unknown, it is often part of the research question to predict and/or confirm the source.
+Using samples of known sources, a reference dataset can be established with the taxonomic composition of the samples, *i.e.* the organisms identified in the samples as features, and the sources of the samples as class labels.
+With this reference dataset, a machine learning algorithm can be trained to predict the source of unknown samples (sinks) from their taxonomic composition.
+Other tools to predict the source of a sample already exist, such as [SourceTracker](https://www.nature.com/articles/nmeth.1650), which employs Gibbs sampling.
+However, because Sourcepredict uses a dimension reduction algorithm followed by K-Nearest-Neighbors (KNN) classification, interpreting the results is more straightforward thanks to the embedding of the samples in a human-observable low-dimensional space.
```
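To make the reference-dataset idea in the updated intro concrete, here is a minimal editor's sketch (not part of this commit, and not Sourcepredict's actual code): taxonomic counts as features, source environments as class labels, and a KNN classifier predicting the source of an unlabeled sink sample. The count profiles and labels are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy reference dataset: rows = samples, columns = organism (TAXID) counts.
# Labels give the source environment of each reference sample.
rng = np.random.default_rng(0)
X_ref = np.vstack([rng.poisson([50, 5, 1], size=(10, 3)),    # "gut"-like profiles
                   rng.poisson([2, 40, 30], size=(10, 3))])  # "soil"-like profiles
y_ref = ["gut"] * 10 + ["soil"] * 10

knn = KNeighborsClassifier(n_neighbors=3).fit(X_ref, y_ref)

sink = [[45, 8, 0]]                # unlabeled sample: counts resemble the gut profiles
prediction = knn.predict(sink)[0]
```

In Sourcepredict itself the features are first normalized and embedded before classification, as described in the `docs/methods.rst` changes below.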

docs/methods.rst

Lines changed: 97 additions & 64 deletions

```diff
@@ -1,84 +1,117 @@
 Methods
 =======
 
-All samples are first normalized to correct for uneven sequencing depth
-using GMPR_ (default). After normalization, Sourcepredict
-performs a two steps prediction: first a prediction of the proportion of
-unknown sources, i.e. not represented in the reference dataset. Then a
-prediction of the proportion of each known source of the reference
-dataset in the test samples.
-
-Organism are represented by their taxonomic identifiers (TAXID).
+Starting with a numerical organism count matrix (samples as columns,
+organisms as rows, obtained by a taxonomic classifier) of merged
+references and sinks datasets, samples are first normalized relative to
+each other, to correct for uneven sequencing depth using the GMPR_ method
+(default). After normalization, Sourcepredict performs a
+two-step prediction algorithm. First, it predicts the proportion of
+unknown sources, *i.e.* which are not represented in the reference
+dataset. Second, it predicts the proportion of each known source of the
+reference dataset in the sink samples.
+
+Organisms are represented by their taxonomic identifiers (TAXID).
```
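As an aside on the normalization step named in this hunk: GMPR stands for geometric mean of pairwise ratios. Below is a simplified editor's sketch of the idea (not Sourcepredict's implementation; `gmpr_size_factors` is a hypothetical helper name): for each pair of samples, take the median of count ratios over organisms present in both, then take the geometric mean of those medians as the sample's size factor.

```python
import numpy as np

def gmpr_size_factors(counts):
    """Simplified GMPR sketch. counts: (organisms x samples) matrix.

    r_ij = median of count ratios over organisms shared by samples i and j;
    size factor of sample i = geometric mean of r_ij over all j != i.
    """
    n = counts.shape[1]
    factors = np.zeros(n)
    for i in range(n):
        ratios = []
        for j in range(n):
            if i == j:
                continue
            shared = (counts[:, i] > 0) & (counts[:, j] > 0)
            if shared.any():
                ratios.append(np.median(counts[shared, i] / counts[shared, j]))
        factors[i] = np.exp(np.mean(np.log(ratios)))  # geometric mean
    return factors

counts = np.array([[10, 20, 0],
                   [5, 10, 5],
                   [0, 4, 2]], dtype=float)
sf = gmpr_size_factors(counts)
normalized = counts / sf  # each sample (column) divided by its size factor
```

The real GMPR method (linked in the document) handles zero-inflated microbiome counts more carefully; this sketch only conveys the pairwise-ratio idea.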
```diff
 
 Prediction of unknown sources proportion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-| Let :math:`S` be a sample of size :math:`O` organims from the test
-  dataset :math:`D_{sink}`
-| Let :math:`n` be the average number of samples per class in the
-  reference dataset.
-| I define :math:`U_n` samples to add to the training dataset to account
-  for the unknown source proportion in a test sample.
-
-To compute :math:`U_n`, a :math:`\alpha` proportion (default =
-:math:`0.1`) of each :math:`o_i` organism (with :math:`i\in[1,O]`) is
-added to the training dataset for each :math:`U_j` samples (with
-:math:`j\in[1,n]`), such as :math:`U_j(o_i) = \alpha\times S_(o_i)`
-
-The :math:`U_n` samples are then merged as columns to the reference
-dataset (:math:`D_{ref}`) to create a new reference dataset denoted
-:math:`D_{ref\ unknown}`
-
-| To predict this unknown proportion, the dimension of the reference
-  dataset :math:`D_{ref\ unknown}` (samples in columns, organisms as
-  rows) is first reduced to 20 with the scikit-learn_
-  implementation of PCA.
-| This reference dataset is then divided into three subsets:
-  :math:`D_{train\ unknown}` (64%), :math:`D_{test\ unknown}` (20%), and
-  :math:`D_{validation unknown}`\ (16%).
-
-| The scikit-learn implementation of K-Nearest-Neighbors (KNN) algorithm
-  is then trained on :math:`D_{train\ unknown}`, and the test accuracy
-  is computed with :math:`D_{test\ unknown}`.
-| The trained KNN model is then corrected for probability estimation of
-  unknown proportion using the scikit-learn implementation of the
-  Platt’s scaling method_ with :math:`D_{validation\ unknown}`.
-  This procedure is repeated for each sample of the test dataset.
-
-The proportion of unknown :math:`p_{unknown}` sources in each sample is
-then computed using the trained and corrected KNN model.
+| Let :math:`S_i \in \{S_1, .., S_n\}` be a sample from the normalized
+  sinks dataset :math:`D_{sink}`,
+  :math:`o_{j}^{\ i} \in \{o_{1}^{\ i},.., o_{n_o^{\ i}}^{\ i}\}` be an
+  organism in :math:`S_i`, and :math:`n_o^{\ i}` be the total number of
+  organisms in :math:`S_i`, with :math:`o_{j}^{\ i} \in \mathbb{Z}+`.
+| Let :math:`m` be the mean number of samples per class in the reference
+  dataset, such that :math:`m = \frac{1}{O}\sum_{i=1}^{O}S_i`.
+| For each :math:`S_i` sample, I define :math:`||m||` estimated samples
+  :math:`U_k^{S_i} \in \{U_1^{S_i}, ..,U_{||m||}^{S_i}\}` to add to the
+  reference dataset to account for the unknown source proportion in a
+  test sample.
+
+Separately for each :math:`S_i`, a proportion denoted
+:math:`\alpha \in [0,1]` (default = :math:`0.1`) of each of the
+:math:`o_{j}^{\ i}` organisms of :math:`S_i` is added to each of the
+:math:`U_k^{S_i}` samples such that
+:math:`U_k^{S_i}(o_j^{\ i}) = \alpha \cdot x_{i \ j}`, where
+:math:`x_{i \ j}` is sampled from a Gaussian distribution
+:math:`\mathcal{N}(S_i(o_j^{\ i}), 0.01)`.
+
+The :math:`||m||` :math:`U_k^{S_i}` samples are then added to the
+reference dataset :math:`D_{ref}`, and labeled as *unknown*, to create a
+new reference dataset denoted :math:`{}^{unk}D_{ref}`.
+
+| To predict the proportion of unknown sources, a Bray-Curtis_ pairwise
+  dissimilarity matrix of all :math:`S_i` and :math:`U_k^{S_i}` samples
+  is computed using scikit-bio. This distance matrix is then embedded in
+  two dimensions (default) with the scikit-bio implementation of PCoA.
+| This sample embedding is divided into three subsets:
+  :math:`{}^{unk}D_{train}` (:math:`64\%`), :math:`{}^{unk}D_{test}`
+  (:math:`20\%`), and :math:`{}^{unk}D_{validation}` (:math:`16\%`).
+
+| The scikit-learn implementation of the KNN algorithm is then trained
+  on :math:`{}^{unk}D_{train}`, and the training accuracy is computed
+  with :math:`{}^{unk}D_{test}`.
+| This trained KNN model is then corrected for probability estimation of
+  the unknown proportion using the scikit-learn implementation of
+  Platt_’s scaling method with :math:`{}^{unk}D_{validation}`.
+
+The proportion of unknown sources in :math:`S_i`, :math:`p_u \in [0,1]`,
+is then estimated using this trained and corrected KNN model.
+
+Ultimately, this process is repeated independently for each sink sample
+:math:`S_i` of :math:`D_{sink}`.
```
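The construction of the :math:`U_k^{S_i}` samples described in this hunk (an :math:`\alpha` fraction of each organism count, jittered by a narrow Gaussian) can be sketched as follows. This is an editor's illustration, and `make_unknown_samples` is a hypothetical helper name, not Sourcepredict's API.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_unknown_samples(sink, n_unknown, alpha=0.1):
    """For one sink sample (vector of organism counts), build n_unknown
    'unknown' samples: each organism count is alpha times a value drawn
    from a Gaussian centered on the sink count (standard deviation 0.01)."""
    sink = np.asarray(sink, dtype=float)
    return np.stack([alpha * rng.normal(sink, 0.01) for _ in range(n_unknown)])

sink = [120, 40, 0, 8]                            # counts for one sink sample
unknown = make_unknown_samples(sink, n_unknown=3)  # shape (3, 4), each row ~ 0.1 * sink
```

These synthetic rows are what gets appended to the reference dataset under the *unknown* label before the Bray-Curtis/PCoA embedding step.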
```diff
 
 Prediction of known source proportion
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-First, only organism TAXID corresponding to the *species* taxonomic
-level are kept using ETE toolkit_. A distance matrix is then
-computed on the merged training dataset :math:`D_{ref}` and test dataset
-:math:`D_{sink}` using the scikit-bio implementation of weighted Unifrac
-distance_ (default).
+First, only organism TAXIDs corresponding to the species taxonomic level
+are retained using the ETE toolkit_. A weighted Unifrac (default)
+pairwise distance_ matrix is then computed on the merged and
+normalized training dataset :math:`D_{ref}` and test dataset
+:math:`D_{sink}` with scikit-bio.
```
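The pairwise distance matrix described here can be sketched in a few lines. Weighted Unifrac needs a taxonomy tree (hence scikit-bio in the actual tool), so this editor's illustration substitutes Bray-Curtis via SciPy as a stand-in metric with the same samples-by-samples output shape.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Merged reference + sink samples: rows = samples, columns = organism counts.
merged = np.array([[10.0, 0.0, 5.0],
                   [8.0, 1.0, 4.0],
                   [0.0, 12.0, 3.0]])

# Condensed pairwise dissimilarities, expanded to a square, symmetric
# (samples x samples) matrix with zeros on the diagonal.
dist = squareform(pdist(merged, metric="braycurtis"))
```

Any embedding step (PCoA, t-SNE with a precomputed metric) then operates on this square matrix.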
```diff
 
-The distance matrix is then embedded in two dimensions using the
-scikit-learn implementation of t-SNE_.
+This distance matrix is then embedded in two dimensions (default) using
+the scikit-learn implementation of t-SNE_.
 
 The 2-dimensional embedding is then split back to training
-:math:`D_{ref\ tsne}` and testing dataset :math:`D_{sink\ tsne}`.
-
-| The training dataset :math:`D_{ref\ tsne}` is further divided into
-  three subsets: :math:`D_{train\ tsne}` (64%), :math:`D_{test\ tsne}`
-  (20%), and :math:`D_{validation\ tsne}` (16%).
-| The scikit-learn implementation of K-Nearest-Neighbors (KNN) algorithm
-  is then trained on the train subset, and the test accuracy is computed
-  with :math:`D_{test\ tsne}`.
-| The trained KNN model is then corrected for source proportion
-  estimation using the scikit-learn implementation of the Platt’s method
-  with :math:`D_{validation\ tsne}`.
-
-The proportion of each source :math:`p_{c}` sources in each sample is
-then computed using the trained and corrected KNN model.
+:math:`{}^{tsne}D_{ref}` and testing dataset :math:`{}^{tsne}D_{sink}`.
+
+| The training dataset :math:`{}^{tsne}D_{ref}` is further divided into
+  three subsets: :math:`{}^{tsne}D_{train}` (:math:`64\%`),
+  :math:`{}^{tsne}D_{test}` (:math:`20\%`), and
+  :math:`{}^{tsne}D_{validation}` (:math:`16\%`).
+| The KNN algorithm is then trained on the train subset, with a five-fold
+  (default) cross validation to look for the optimum number of
+  K-neighbors. The training accuracy is then computed with
+  :math:`{}^{tsne}D_{test}`. Finally, this second trained KNN model is
+  also corrected for source proportion estimation using the scikit-learn
+  implementation of Platt’s method with
+  :math:`{}^{tsne}D_{validation}`.
+
+The proportion :math:`p_{c_s} \in [0,1]` of each of the :math:`n_s`
+sources :math:`c_s \in \{c_{1},\ ..,\ c_{n_s}\}` in each sample
+:math:`S_i` is then estimated using this second trained and corrected
+KNN model.
+
+Combining unknown and source proportion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For each sample :math:`S_i` of the test dataset :math:`D_{sink}`, the
+predicted unknown proportion :math:`p_{u}` is then combined with the
+predicted proportion :math:`p_{c_s}` for each of the :math:`n_s` sources
+:math:`c_s` of the training dataset such that
+:math:`\sum_{c_s=1}^{n_s} s_c + p_u = 1` where
+:math:`s_c = p_{c_s} \cdot (1 - p_u)`.
+
+Finally, a summary table gathering the estimated sources proportions is
+returned as a ``csv`` file, as well as the t-SNE embedding sample
+coordinates.
 
 .. _GMPR: https://peerj.com/articles/4600/
+.. _Bray-Curtis: https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1942268
 .. _scikit-learn: https://scikit-learn.org/stable/
 .. _method: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639
 .. _toolkit: http://etetoolkit.org/
 .. _distance: https://www.ncbi.nlm.nih.gov/pubmed/17220268
-.. _t-SNE: http://www.jmlr.org/papers/v9/vandermaaten08a.html
+.. _t-SNE: http://www.jmlr.org/papers/v9/vandermaaten08a.html
+.. _Platt: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639
```

Note: the committed text reads :math:`s_c = p_{c_s} \cdot p_u`, which would not satisfy :math:`\sum_{c_s=1}^{n_s} s_c + p_u = 1`; the scaling factor must be :math:`(1 - p_u)`, as reconstructed above.
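The train/test/validation split, cross-validated KNN, Platt-scaling correction, and final combination with the unknown proportion described in this file can be sketched end to end. This is an editor's illustration under stated assumptions: synthetic 2-D points stand in for the t-SNE coordinates, `p_u` is a made-up unknown proportion, and the scikit-learn calls mirror but are not Sourcepredict's code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import CalibratedClassifierCV

# 2-D embedded reference samples (stand-in for t-SNE coordinates),
# three known sources as class labels.
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=42)

# 64% / 20% / 16% split, as in the methods text.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.36, stratify=y, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=4 / 9, stratify=y_rest, random_state=42)

# KNN with a 5-fold cross-validated search for the number of neighbors.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
accuracy = grid.score(X_test, y_test)

# Platt scaling ("sigmoid") fitted on the held-out validation subset.
calibrated = CalibratedClassifierCV(
    KNeighborsClassifier(**grid.best_params_), method="sigmoid", cv=3)
calibrated.fit(X_val, y_val)
p_sources = calibrated.predict_proba(X[:1])[0]  # per-source probabilities

# Combining with a previously estimated unknown proportion p_u, so that
# the scaled source proportions plus p_u sum to 1.
p_u = 0.2  # hypothetical output of the unknown-source model
combined = p_sources * (1 - p_u)
```

Calibration on a held-out subset is what turns the raw KNN votes into usable proportion estimates, which is the whole point of the Platt-scaling step in the document.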

docs/usage.md

Lines changed: 14 additions & 11 deletions

````diff
@@ -1,18 +1,21 @@
 # Usage
 
 ```bash
-$ python sourcepredict -h
-usage: SourcePredict v0.32 [-h][-a alpha] [-s SOURCES][-l labels]
-[-n NORMALIZATION][-dt distance] [-me METHOD][-e embed] [-di DIM][-o output] [-se SEED][-k kfold] [-t THREADS]
+$ sourcepredict -h
+usage: SourcePredict v0.33 [-h] [-a ALPHA] [-s SOURCES] [-l LABELS]
+                           [-n NORMALIZATION] [-dt DISTANCE] [-me METHOD]
+                           [-e EMBED] [-di DIM] [-o OUTPUT] [-se SEED]
+                           [-k KFOLD] [-t THREADS]
                            otu_table
 
 ==========================================================
-SourcePredict v0.32
+SourcePredict v0.33
 Coprolite source classification
 Author: Maxime Borry
-Contact: <borry[at]shh.mpg.de>
+Contact: <borry[at]shh.mpg.de>
+Homepage & Documentation: github.com/maxibor/sourcepredict
+==========================================================
 
-# Homepage & Documentation: github.com/maxibor/sourcepredict
 
 positional arguments:
   otu_table             path to otu table in csv format
@@ -24,19 +27,19 @@ optional arguments:
                         data/modern_gut_microbiomes_sources.csv
   -l LABELS             Path to labels csv file. Default =
                         data/modern_gut_microbiomes_labels.csv
-  -n NORMALIZATION      Normalization method (RLE | CLR | Subsample | GMPR).
-                        Default = GMPR
+  -n NORMALIZATION      Normalization method (RLE | Subsample | GMPR). Default =
+                        GMPR
   -dt DISTANCE          Distance method. (unweighted_unifrac | weighted_unifrac)
                         Default = weighted_unifrac
   -me METHOD            Embedding Method. TSNE or UMAP. Default = TSNE
   -e EMBED              Output embedding csv file. Default = None
   -di DIM               Number of dimensions to retain for dimension reduction.
                         Default = 2
   -o OUTPUT             Output file basename. Default =
-                        &lt;sample_basename>.sourcepredict.csv
+                        <sample_basename>.sourcepredict.csv
   -se SEED              Seed for random generator. Default = 42
-  -k KFOLD              Number of fold for K-fold cross validation in feature
-                        selection and parameter optimization. Default = 5
+  -k KFOLD              Number of fold for K-fold cross validation in parameter
+                        optimization. Default = 5
   -t THREADS            Number of threads for parallel processing. Default = 2
 
 ```
````

sourcepredict

Lines changed: 1 addition & 1 deletion

```diff
@@ -112,7 +112,7 @@ Homepage & Documentation: github.com/maxibor/sourcepredict
 
 
 if __name__ == "__main__":
-    version = "0.32"
+    version = "0.33"
     warnings.filterwarnings("ignore")
     SINK, ALPHA, NORMALIZATION, SOURCES, LABELS, SEED, DISTANCE, METHOD, DIM, OUTPUT, EMBED_CSV, KFOLD, THREADS = _get_args()
     SEED = utils.check_gen_seed(SEED)
```
