-
Notifications
You must be signed in to change notification settings - Fork 28
Description
I discovered a bug in the windows version of Rsamtools, which does not occur in the same version on Ubuntu 18.04 or CentOS 7. It may be related to #5 and #12, but the examples I have might help with solving the issue.
I'm trying to extract sequences from a large reference database of virus references found here:
ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrl*.seq.gz (about 4.2GB in total) (my current database may differ a bit from the most recent database though, let me know if you would like to have the exact same database).
I'm using these same commands on the 3 systems, not all cases go wrong, but I have isolated a case in idx nr 2621935 which gives different results on windows as compared to Ubuntu and CentOS.
> gbvrl_fa <- open(FaFile("gbvrl.nt.fasta"))
> idx <- scanFaIndex(gbvrl_fa)
> sequence <- scanFa(gbvrl_fa, idx[2621935])
On windows I get this result:
A DNAStringSet instance of length 1
width seq names
[1] 4979 CATGGGTATGGCATCAGTCCTAGATAAAGGGACTGGCAAGT...TTTAACTGTCGTATCTCTTGTCTGTGGTATACTTAGTCTAG EU851411.1
On the other two platforms I get this result (which is the correct sequence, the sequence on windows is incorrect):
A DNAStringSet instance of length 1
width seq names
[1] 4979 TTGAACAAGTAACCAGTCGTAAG...CTTACGACTGGTTACTTGTTCAA EU851411.1
The accession numbers match, but the sequences don't...
A second hint is the last sequence in the index, which the 2 non-windows platforms retrieve just fine, but the windows version gives an error:
Windows:
> scanFa(gbvrl_fa, idx[length(idx)])
Error in value[[3L]](cond) :
record 1 (FN398484.1:1-903) contains invalid DNA letters
file: gbvrl.nt.fasta
The last hint is an error claiming that the reference is not present in the current index, which happens with some sequences:
Windows:
> scanFa(gbvrl_fa, idx[1400001])
Error in value[[3L]](cond) : record 1 (KY058615.1:1-1166) failed
file: gbvrl.nt.fasta
Sequences in the index before around 1,300,000 seem to succeed at retrieving the correct sequence and sequences after around 1,400,000 seem to fail, I don't know if that makes sense.
I've checked the Rsamtools and dependencies's versions and they match. Also the md5sum of all the fasta and fasta.fai files match.
I hope that these examples shine some light on the issues.
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] shinydashboard_0.7.1 rbokeh_0.5.0 shinyWidgets_0.4.8 shinycssloaders_0.2.0 stringr_1.4.0
[6] ggplot2_3.2.1 data.table_1.12.2 shiny_1.3.2 Rsamtools_2.0.3 Biostrings_2.52.0
[11] XVector_0.24.0 GenomicRanges_1.36.1 GenomeInfoDb_1.20.0 IRanges_2.18.3 S4Vectors_0.22.0
[16] BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 lattice_0.20-38 assertthat_0.2.1 digest_0.6.20 mime_0.7
[6] R6_2.4.0 evaluate_0.14 httr_1.4.1 pillar_1.4.2 zlibbioc_1.30.0
[11] rlang_0.4.0 lazyeval_0.2.2 rstudioapi_0.10 hexbin_1.27.3 DT_0.9
[16] rmarkdown_1.16 BiocParallel_1.18.1 htmlwidgets_1.5.1 RCurl_1.95-4.12 munsell_0.5.0
[21] compiler_3.6.1 httpuv_1.5.1 xfun_0.10 gistr_0.4.2 pkgconfig_2.0.3
[26] htmltools_0.3.6 tidyselect_0.2.5 tibble_2.1.3 GenomeInfoDbData_1.2.1 codetools_0.2-16
[31] crayon_1.3.4 dplyr_0.8.3 withr_2.1.2 later_0.8.0 bitops_1.0-6
[36] grid_3.6.1 jsonlite_1.6 xtable_1.8-4 gtable_0.3.0 magrittr_1.5
[41] scales_1.0.0 stringi_1.4.3 pryr_0.1.4 promises_1.0.1 tools_3.6.1
[46] glue_1.3.1 purrr_0.3.2 crosstalk_1.0.0 maps_3.3.0 yaml_2.2.0
[51] colorspace_1.4-1 BiocManager_1.30.8 knitr_1.25



