Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking. Sourcepredict solves this problem by applying machine learning classification to dimensionally reduced datasets.
With conda (recommended)
$ conda install -c conda-forge -c maxibor sourcepredict
With pip
$ pip install sourcepredict
- Sink taxonomic count file (see example file and documentation)
- Source taxonomic count file (see example file and documentation)
- Source label file (see example file and documentation)
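Before running Sourcepredict it can help to confirm that every source sample has a label. The sketch below is only an illustration and assumes a layout that should be checked against the example files: count tables with one row per taxon, a first column of taxon IDs and one column per sample, and a label file whose first column holds sample names. The function name `unlabeled_samples` and the toy sample names are hypothetical.

```python
import csv
import io


def unlabeled_samples(sources_file, labels_file):
    """Return source sample names that have no row in the label file.

    Assumed layout (verify against the example files):
    - sources: header row is <taxon id column>, sample1, sample2, ...
    - labels: header row, then one row per sample with the name first
    """
    header = next(csv.reader(sources_file))
    source_samples = header[1:]  # first column is assumed to hold taxon IDs
    label_rows = csv.reader(labels_file)
    next(label_rows)  # skip the label file header
    labeled = {row[0] for row in label_rows if row}
    return [name for name in source_samples if name not in labeled]


# Toy in-memory files standing in for the real CSVs
sources_csv = io.StringIO("TAXID,sampleA,sampleB,sampleC\n562,10,0,3\n1280,0,25,1\n")
labels_csv = io.StringIO(",labels\nsampleA,Canis_familiaris\nsampleB,Soil\n")

missing = unlabeled_samples(sources_csv, labels_csv)
print(missing)
```

With real data, the two `io.StringIO` objects would be replaced by open file handles on the source and label files.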
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Step 1: Checking for unknown proportion
  == Sample: ERR1915662 ==
	Adding unknown
	Normalizing (GMPR)
	Computing Bray-Curtis distance
	Performing MDS embedding in 2 dimensions
	KNN machine learning
	Training KNN classifier on 2 cores...
	-> Testing Accuracy: 1.0
	----------------------
	- Sample: ERR1915662
		 known:98.61%
		 unknown:1.39%
Step 2: Checking for source proportion
	Computing weighted_unifrac distance on species rank
	TSNE embedding in 2 dimensions
	KNN machine learning
	Performing 5 fold cross validation on 2 cores...
	Trained KNN classifier with 10 neighbors
	-> Testing Accuracy: 0.99
	----------------------
	- Sample: ERR1915662
		 Canis_familiaris:96.1%
		 Homo_sapiens:2.47%
		 Soil:1.43%
Sourcepredict result written to dog_test_sample.sourcepredict.csv
Sourcepredict outputs the predicted source contribution to each sink sample, and the embedding of all samples in the lower-dimensional space. See documentation for details.
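The log above names three ingredients: a pairwise distance (Bray-Curtis), an embedding in 2 dimensions, and a KNN classifier. The following sketch shows how these fit together on synthetic counts. It is not Sourcepredict's code: it uses plain NumPy, simple total-sum scaling in place of GMPR, classical MDS (principal coordinates) which may differ from the MDS variant Sourcepredict uses, and a bare nearest-neighbour vote without cross-validation. The toy profiles and the "dog"/"soil" labels are made up for illustration.

```python
import numpy as np


def pairwise_bray_curtis(table):
    """Bray-Curtis dissimilarity between every pair of rows of a non-negative table."""
    n = table.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(table[i] - table[j]).sum()
            den = (table[i] + table[j]).sum()
            dist[i, j] = dist[j, i] = num / den
    return dist


def classical_mds(dist, n_components=2):
    """Classical MDS: eigendecomposition of the double-centered squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Bray-Curtis is not Euclidean, so small negative eigenvalues are clipped to 0
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def knn_proportions(train_xy, train_labels, query_xy, k=10):
    """Fraction of the k nearest training points that carry each label."""
    d = np.linalg.norm(train_xy - query_xy, axis=1)
    nearest = train_labels[np.argsort(d)[:k]]
    return {str(lab): float(np.mean(nearest == lab)) for lab in np.unique(train_labels)}


# Synthetic data: 20 "dog" and 20 "soil" sources over 30 taxa, plus one dog-like sink
rng = np.random.default_rng(0)
dog_profile = np.r_[np.full(15, 50.0), np.full(15, 2.0)]
soil_profile = dog_profile[::-1]
counts = np.vstack([
    rng.poisson(dog_profile, size=(20, 30)),
    rng.poisson(soil_profile, size=(20, 30)),
    rng.poisson(dog_profile, size=(1, 30)),  # the sink sample is the last row
]).astype(float)
labels = np.array(["dog"] * 20 + ["soil"] * 20)

# Total-sum scaling (a stand-in for GMPR), then distance, embedding, and KNN vote
rel = counts / counts.sum(axis=1, keepdims=True)
embedding = classical_mds(pairwise_bray_curtis(rel))
proportions = knn_proportions(embedding[:-1], labels, embedding[-1], k=10)
print(proportions)
```

Embedding sources and sink together, as done here, places the sink in the same low-dimensional space as the labelled samples, so the neighbour vote can be read as a rough source proportion.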
Depending on the normalization method (-n), the embedding method (-me), the number of CPUs available for parallel processing (-t), and the data, the runtime should be between a few seconds and a few minutes per sink sample.
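When comparing runtimes across settings, it can be convenient to assemble the command line programmatically. The sketch below builds the argument list for the -n, -me, and -t options mentioned above. The helper name `build_sourcepredict_command` is hypothetical, and the values `GMPR` and `TSNE` are taken from the example log; whether the command line accepts exactly these spellings is an assumption to check against the documentation.

```python
import shlex


def build_sourcepredict_command(sink, sources, labels,
                                normalization=None, embedding=None, threads=None):
    """Assemble a sourcepredict argument list; unset options are left out."""
    cmd = ["sourcepredict", "-s", sources, "-l", labels]
    if normalization is not None:
        cmd += ["-n", normalization]   # normalization method
    if embedding is not None:
        cmd += ["-me", embedding]      # embedding method
    if threads is not None:
        cmd += ["-t", str(threads)]    # CPUs for parallel processing
    cmd.append(sink)
    return cmd


cmd = build_sourcepredict_command(
    "dog_example.csv", "sp_sources.csv", "sp_labels.csv",
    normalization="GMPR",  # value as printed in the example log (assumed valid for -n)
    embedding="TSNE",      # value as printed in the example log (assumed valid for -me)
    threads=2,
)
print(shlex.join(cmd))

# To actually run it (requires sourcepredict to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```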
The documentation of Sourcepredict is available at sourcepredict.readthedocs.io
- The sources were obtained with a simple Nextflow pipeline running Kraken2 with the MiniKraken2_v2_8GB database.
 See the documentation for more information on how to build a custom source file.
- The example source file is modern_gut_microbiomes_sources.csv
- The example label file is modern_gut_microbiomes_labels.csv
- Homo sapiens gut microbiome (1, 2, 3, 4, 5, 6)
- Canis familiaris gut microbiome (1)
- Soil microbiome (1, 2, 3)
If you wish to contribute to Sourcepredict, you are welcome and encouraged to do so by opening an issue or creating a pull request. All contributions will be made under the GPLv3 license. More information can be found on the contributing page.
Sourcepredict has been published in JOSS.
@article{Borry2019Sourcepredict,
	journal = {Journal of Open Source Software},
	doi = {10.21105/joss.01540},
	issn = {2475-9066},
	number = {41},
	publisher = {The Open Journal},
	title = {Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification},
	url = {http://dx.doi.org/10.21105/joss.01540},
	volume = {4},
	author = {Borry, Maxime},
	pages = {1540},
	date = {2019-09-04},
	year = {2019},
	month = {9},
	day = {4}
}
