Commit d97d83c

Merge branch 'sbi2' into jtnystrom/compressed_input

2 parents: 88a50c6 + f20f493

29 files changed: +318 -107 lines changed

README.md

Lines changed: 12 additions & 5 deletions
```diff
@@ -14,13 +14,16 @@ and the fraction of reads assigned to each taxon.
 
 Slacken is based on Apache Spark and is thus a distributed application. It can run on a single machine, but can
 also scale to a cluster with hundreds or thousands of machines. It does not keep all data in RAM during processing, but
-processes data in batches.
+processes data in batches. On a 16-core PC, Slacken needs only 16 GB of RAM to classify with the genomes from the Kraken 2 standard library.
 
-We do not currently support translated mode (protein/AA sequence classification) but only nucleotide sequences. Also,
+Unfortunately, Slacken does not currently support translated mode (protein/AA sequence classification) but only nucleotide sequences. Also,
 Slacken has its own database format (Parquet based) and can not use pre-built Kraken 2 databases as they are.
 
 For more motivation and details, please see [our 2025 paper in NAR Genomics and Bioinformatics](https://academic.oup.com/nargab/article/7/2/lqaf076/8158581).
 
+**Users of version 1.x, please note the new command line syntax in version 2.0.** All commands and examples in this
+README and on the Wiki have been updated. [See the commands overview.](https://github.com/JNP-Solutions/Slacken/wiki/Slacken-commands-overview)
+
 Copyright (c) Johan Nyström-Persson 2019-2025.
 
 ## Contents
@@ -168,7 +171,7 @@ Here,
 * `--reads 100` is the threshold for including a taxon in the initial set (R100).
 * `-l /data/standard-224c` is required, and indicates where genomes for library building may be found.
 * `--bracken-length 150` specifies that Bracken weights for the given read length (150) should be generated. That can be slow, and
-also requires extra space, so we recommend omitting `--bracken-length` when Bracken is not needed.
+also requires extra space, so we recommend omitting `--bracken-length` when Bracken is not needed. When generating Bracken weights, we recommend giving Slacken at least 32 GB of RAM.
 
 When the command has finished, the following files will be generated:
 
@@ -404,13 +407,17 @@ These options may also be permanently configured by editing `slacken.sh`.
 
 Slacken can run on AWS EMR (Elastic MapReduce) and should also work similarly on other commercial cloud providers
 that support Apache Spark. In this scenario, data can be stored on AWS S3 and the computation can run on a mix of
-on-demand and spot (interruptible) instances. We refer the reader to the AWS EMR documentation for more details.
+on-demand and spot (interruptible) instances.
+
+A [tutorial on Slacken with AWS EMR](https://github.com/JNP-Solutions/Slacken/wiki/Classifying-metagenomic-samples-on-AWS-Elastic-MapReduce)
+is available. The tutorial shows how to use Slacken to classify samples using the public indexes on AWS S3.
 
 The cluster configuration we generally recommend is 4 GB RAM per CPU (but 2 GB per CPU may be enough for small workloads).
 For large workloads, the worker nodes should have fast physical hard drives, such as NVMe. On EMR Spark will automatically use
 these drives for temporary space. We have found the m7gd and m6gd machine families to work well.
 
-To run on AWS EMR, first, install the AWS CLI.
+The tutorial above shows how to run Slacken using the EMR GUI. You can also run it on EMR from the command line.
+To do this, first install the [AWS CLI](https://aws.amazon.com/cli/).
 Copy `slacken-aws.sh.template` to a new file, e.g. `slacken-aws.sh` and edit the file to configure
 some settings such as the S3 bucket to use for the Slacken jar. Then, create the AWS EMR cluster. You will receive a
 cluster ID, either from the web GUI or from the CLI. Set the `AWS_EMR_CLUSTER` environment variable to this id:
```

build.sbt

Lines changed: 20 additions & 1 deletion
```diff
@@ -2,7 +2,24 @@ name := "Slacken"
 
 version := "2.0.0"
 
-scalaVersion := "2.12.20"
+lazy val scala212 = "2.12.20"
+
+lazy val scala213 = "2.13.15"
+
+lazy val supportedScalaVersions = List(scala212, scala213)
+
+ThisBuild / scalaVersion := scala212
+
+lazy val root = (project in file(".")).
+  settings(
+    crossScalaVersions := supportedScalaVersions,
+    libraryDependencies ++= {
+      CrossVersion.partialVersion(scalaVersion.value) match {
+        case Some((2, 13)) => List("org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.0")
+        case _ => Nil
+      }
+    }
+  )
 
 val sparkVersion = "3.5.0"
 
@@ -22,6 +39,8 @@ libraryDependencies += "org.scalatest" %% "scalatest" % "latest.integration" % "
 
 libraryDependencies += "org.scalatestplus" %% "scalacheck-1-18" % "latest.integration" % "test"
 
+libraryDependencies += "org.scala-lang.modules" %% "scala-collection-compat" % "2.13.0"
+
 //The "provided" configuration prevents sbt-assembly from including spark in the packaged jar.
 libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
 
```
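The version-gated dependency above is standard sbt cross-building: `crossScalaVersions` lists the Scala versions to build against, and `CrossVersion.partialVersion` lets a setting differ per version. As an illustrative sketch only (not part of this commit), the same pattern can gate other settings, for example compiler flags whose spelling changed between 2.12 and 2.13:

```scala
// Illustrative sketch, not from this commit: the same partialVersion match used above
// for libraryDependencies can also select version-specific compiler flags.
scalacOptions ++= {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 13)) => Seq("-Wunused:imports")       // 2.13 spelling
    case _             => Seq("-Ywarn-unused:imports")  // 2.12 spelling
  }
}
```

With `crossScalaVersions` in place, prefixing an sbt task with `+` (for example `+compile` or `+test`) runs it for every listed Scala version.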

src/main/scala/com/jnpersson/fastdoop/IndexedFastaReader.scala

Lines changed: 1 addition & 0 deletions
```diff
@@ -24,6 +24,7 @@ import org.apache.hadoop.mapreduce.lib.input.FileSplit
 import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
 
 import scala.io.Source
+import scala.collection.BufferedIterator
 
 /**
  * FAI (fasta index) record.
```

src/main/scala/com/jnpersson/kmers/HDFSUtil.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -53,7 +53,7 @@ object HDFSUtil {
     new Iterator[T] {
       def hasNext: Boolean = rit.hasNext
 
-      def next: T = rit.next
+      def next(): T = rit.next
     }
 
   private def files(path: String)(implicit spark: SparkSession): Iterator[LocatedFileStatus] = {
```
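The `next` to `next()` change recurs throughout this commit (see also InputReader.scala and MinSplitter.scala below): `scala.collection.Iterator` declares `next()` with an empty parameter list, and Scala 2.13 is stricter about overriding it with a parameterless `def`. A minimal standalone sketch of the same adapter shape, using a plain `java.util.Iterator` as the wrapped source (the wrapper name is hypothetical, not Slacken code):

```scala
// Minimal sketch, not Slacken code: wrapping a Java iterator as a Scala Iterator.
// Declaring the override as `def next()` (with parentheses) matches the signature
// declared in scala.collection.Iterator and compiles cleanly on both 2.12 and 2.13.
import java.util.{Iterator => JIterator}

class JavaIteratorWrapper[T](underlying: JIterator[T]) extends Iterator[T] {
  def hasNext: Boolean = underlying.hasNext
  def next(): T = underlying.next()
}
```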

src/main/scala/com/jnpersson/kmers/MinimizerCLIConf.scala

Lines changed: 4 additions & 4 deletions
```diff
@@ -66,16 +66,16 @@ trait MinimizerCLIConf {
   this: ScallopConf =>
 
   protected def defaultK = 35
-  val k = opt[Int](descr = s"Length of each k-mer", default = Some(defaultK))
+  val k = opt[Int](descr = "Length of each k-mer", default = Some(defaultK))
 
   protected def defaultMinimizerWidth = 10
-  val minimizerWidth = opt[Int](name = "m", descr = s"Width of minimizers",
+  val minimizerWidth = opt[Int](name = "m", descr = "Width of minimizers",
     default = Some(defaultMinimizerWidth))
 
   validate (k) { k =>
     if (minimizerWidth() > k) {
       Left("-m must be <= -k")
-    } else Right(Unit)
+    } else Right(())
   }
 
   protected def defaultOrdering: String = "lexicographic"
@@ -115,7 +115,7 @@ trait MinimizerCLIConf {
   def defaultMinimizerSpaces: Int = 0
 
   val minimizerSpaces = opt[Int](name = "spaces",
-    descr = s"Number of masked out nucleotides in minimizer (spaced seed)",
+    descr = "Number of masked out nucleotides in minimizer (spaced seed)",
     default = Some(defaultMinimizerSpaces))
 
   /** Apply a spaced seed mask to minimizer priorities */
```
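The `Right(Unit)` to `Right(())` fix is more than stylistic: in term position, `Unit` refers to the `Unit` companion object rather than the unit value `()`, and the old code only type-checked because the compiler discarded that object to satisfy the expected `Either[String, Unit]` (newer compilers tend to warn about this). A standalone sketch of the same validation shape, with hypothetical names:

```scala
// Minimal sketch, not Slacken code: an Either[String, Unit]-style validator like the
// one in MinimizerCLIConf. Returning Right(()) yields the unit value directly instead
// of relying on the compiler to discard the Unit companion object.
object ValidationSketch {
  def validateWidths(m: Int, k: Int): Either[String, Unit] =
    if (m > k) Left("-m must be <= -k")
    else Right(())

  def main(args: Array[String]): Unit = {
    println(validateWidths(10, 35)) // Right(())
    println(validateWidths(40, 35)) // Left(-m must be <= -k)
  }
}
```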

src/main/scala/com/jnpersson/kmers/SparkTool.scala

Lines changed: 3 additions & 2 deletions
```diff
@@ -33,7 +33,7 @@ private[jnpersson] abstract class SparkTool(appName: String) {
       enableHiveSupport().
       getOrCreate()
 
-    //BareLocalFileSystem bypasses the need for winutils.exe on Windows and does no harm on other OS's
+    //BareLocalFileSystem bypasses the need for winutils.exe on Windows and does no harm on other OS's
     //This affects access to file:/ paths (effectively local files)
     sp.sparkContext.hadoopConfiguration.
       setClass("fs.file.impl", classOf[BareLocalFileSystem], classOf[FileSystem])
@@ -58,6 +58,7 @@ object SparkTool {
   }
 }
 
+//noinspection TypeAnnotation
 trait HasInputReader {
   this: ScallopConf =>
 
@@ -69,7 +70,7 @@ trait HasInputReader {
  * CLI configuration for a Spark-based application.
  */
 //noinspection TypeAnnotation
-class SparkConfiguration(args: Array[String])(implicit val spark: SparkSession) extends ScallopConf(args) {
+class SparkConfiguration(args: Seq[String])(implicit val spark: SparkSession) extends ScallopConf(args) {
   protected val showAllOpts =
     args.contains("--detailed-help") //to make this value available during the option construction stage
 
```

src/main/scala/com/jnpersson/kmers/SplitterFormat.scala

Lines changed: 1 addition & 0 deletions
```diff
@@ -21,6 +21,7 @@ import com.jnpersson.kmers.minimizer._
 import org.apache.spark.sql.SparkSession
 
 import java.util.Properties
+import scala.collection.compat.immutable.ArraySeq
 
 /** Logic for persisting minimizer formats (ordering and parameters) to files.
  * @param P the type of MinimizerPriorities that is being managed.
```
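`scala.collection.compat.immutable.ArraySeq` comes from the `scala-collection-compat` dependency added in `build.sbt` above: on Scala 2.13 it aliases the standard library's immutable `ArraySeq`, while on 2.12 the compat library supplies an equivalent. A minimal sketch of that cross-version usage, assuming nothing about how SplitterFormat itself uses the import (the object name is hypothetical):

```scala
// Minimal sketch, not Slacken code: wrapping an Array as an immutable ArraySeq in a
// way that compiles on both Scala 2.12 and 2.13 via scala-collection-compat.
import scala.collection.compat.immutable.ArraySeq

object ArraySeqSketch {
  def main(args: Array[String]): Unit = {
    val raw = Array("ACGT", "GGCA", "TTAA")
    // unsafeWrapArray wraps without copying; the caller must not mutate `raw` afterwards
    val wrapped: ArraySeq[String] = ArraySeq.unsafeWrapArray(raw)
    println(wrapped.mkString(", "))
  }
}
```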

src/main/scala/com/jnpersson/kmers/input/FileInputs.scala

Lines changed: 6 additions & 4 deletions
```diff
@@ -27,6 +27,8 @@ import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions.{collect_list, element_at, lit, monotonically_increasing_id, slice, substring}
 import org.apache.spark.sql.{Dataset, SparkSession}
 
+import scala.collection.parallel.immutable.ParVector
+import scala.collection.compat._
 
 /**
  * A set of input files that can be parsed into [[InputFragment]]
@@ -109,7 +111,7 @@ class FileInputs(val files: Seq[String], k: Int, inputGrouping: InputGrouping =
       case _ =>
         expandedFiles.map(forFile)
     }
-    val fs = readers.par.map(_.getInputFragments(withAmbiguous, sampleFraction)).seq
+    val fs = readers.to(ParVector).map(_.getInputFragments(withAmbiguous, sampleFraction)).seq
     spark.sparkContext.union(fs.map(_.rdd)).toDS()
   }
 
@@ -208,7 +210,7 @@ class FastqTextInput(file: String)(implicit spark: SparkSession) extends HadoopI
     }.rdd
 
   def getSequenceTitles: Dataset[SeqTitle] =
-    rdd.map(x => x(0)).toDS
+    rdd.map(x => x(0)).toDS()
 
   protected[input] def getFragments(): Dataset[InputFragment] =
     rdd.map(ar => {
@@ -239,7 +241,7 @@ class IndexedFastaInput(file: String, k: Int)(implicit spark: SparkSession)
     sc.newAPIHadoopFile(input, classOf[IndexedFastaFormat], classOf[Text], classOf[PartialSequence], conf).values
 
   def getSequenceTitles: Dataset[SeqTitle] =
-    rdd.map(_.getKey).toDS.distinct
+    rdd.map(_.getKey).toDS().distinct
 
   protected[input] def getFragments(): Dataset[InputFragment] = {
     val k = this.k
@@ -267,6 +269,6 @@ class IndexedFastaInput(file: String, k: Int)(implicit spark: SparkSession)
         val key = partialSeq.getKey.split(" ")(0)
         makeInputFragment(key, partialSeq.getSeqPosition, partialSeq.getBuffer, start, useEnd)
       }
-    }).toDS
+    }).toDS()
   }
 }
```
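The `readers.par` to `readers.to(ParVector)` change is the cross-version way to obtain a parallel collection: on Scala 2.13, parallel collections live in the separate `scala-parallel-collections` module (the dependency conditionally added in `build.sbt` above), and `scala.collection.compat._` provides the `.to(...)` conversion syntax on 2.12. A self-contained sketch of the same pattern, with dummy work items standing in for the input readers (the object name is hypothetical):

```scala
// Minimal sketch, not Slacken code: converting a sequential Seq to a ParVector so the
// per-element work runs in parallel, then returning to a sequential view with .seq.
// Mirrors the readers.to(ParVector).map(...).seq pattern in FileInputs.scala above.
import scala.collection.compat._
import scala.collection.parallel.immutable.ParVector

object ParVectorSketch {
  def main(args: Array[String]): Unit = {
    val inputs: Seq[Int] = 1 to 8
    // The map runs across the ParVector's worker threads; .seq gathers the results back
    val results = inputs.to(ParVector).map(n => n * n).seq
    println(results.mkString(", "))
  }
}
```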

src/main/scala/com/jnpersson/kmers/input/InputReader.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -106,7 +106,7 @@ class PairedInputReader(lhs: InputReader, rhs: InputReader)(implicit spark: Spar
   import spark.sqlContext.implicits._
   import PairedInputReader._
 
-  protected[input] def getFragments: Dataset[InputFragment] = {
+  protected[input] def getFragments(): Dataset[InputFragment] = {
     /* As we currently have no input format that correctly handles paired reads, joining the reads by
        header is the best we can do (and still inexpensive in the big picture).
        Otherwise, it is hard to guarantee that they would be paired up correctly.
```

src/main/scala/com/jnpersson/kmers/minimizer/MinSplitter.scala

Lines changed: 2 additions & 2 deletions
```diff
@@ -137,7 +137,7 @@ final case class MinSplitter[+P <: MinimizerPriorities](priorities: P, k: Int) {
     new Iterator[Supermer] {
       def hasNext: Boolean = window.hasNext
 
-      def next: Supermer = {
+      def next(): Supermer = {
         val p = window.next
 
         //TODO INVALID handling for computed priorities
@@ -184,7 +184,7 @@ final case class MinSplitter[+P <: MinimizerPriorities](priorities: P, k: Int) {
     new Iterator[Minimizer] {
       def hasNext: Boolean = window.hasNext
 
-      def next: Minimizer = {
+      def next(): Minimizer = {
         val p = window.next
 
         if (!matches.isValid(p)) {
```
