Closed
Description
Backend
VL (Velox)
Bug description
Using OSS Spark 3.5.2 and Gluten 1.4 with the configs listed below, I am not able to read from or write to S3. The tasks appear to be running, but they make no progress and report no errors.
If I disable the Gluten plugin, everything works. With Gluten enabled, reading data from S3 hangs in the same way as writing.
I don't see any errors or stack traces that would indicate what is happening.
Any hints?
25/09/10 12:03:31 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 5 (VeloxColumnarWriteFilesRDD[20] at parquet at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
25/09/10 12:03:31 INFO TaskSchedulerImpl: Adding task set 5.0 with 4 tasks resource profile 0
25/09/10 12:03:31 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 33) (jupyter, executor driver, partition 0, NODE_LOCAL, 9274 bytes)
25/09/10 12:03:31 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 34) (jupyter, executor driver, partition 1, NODE_LOCAL, 9274 bytes)
25/09/10 12:03:31 INFO TaskSetManager: Starting task 2.0 in stage 5.0 (TID 35) (jupyter, executor driver, partition 2, NODE_LOCAL, 9274 bytes)
25/09/10 12:03:31 INFO TaskSetManager: Starting task 3.0 in stage 5.0 (TID 36) (jupyter, executor driver, partition 3, NODE_LOCAL, 9274 bytes)
25/09/10 12:03:31 INFO Executor: Running task 0.0 in stage 5.0 (TID 33)
25/09/10 12:03:31 INFO Executor: Running task 2.0 in stage 5.0 (TID 35)
25/09/10 12:03:31 INFO Executor: Running task 3.0 in stage 5.0 (TID 36)
25/09/10 12:03:31 INFO Executor: Running task 1.0 in stage 5.0 (TID 34)
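One way to confirm from a saved driver log which tasks started but never completed is to pair the `Running task` lines with their `Finished task` counterparts. The following is a minimal sketch, not part of the original report: the function name `unfinished_tasks` is hypothetical, and it assumes the usual Spark executor log line `INFO Executor: Finished task ... (TID n)`, which does not appear in the excerpt above. Tasks that fail with an exception log a different line and would also show up as unfinished here.

```python
import re

# Matches the executor lines shown in the log excerpt, plus the assumed
# "Finished task" counterpart that Spark prints when a task completes.
TASK_RE = re.compile(
    r"INFO Executor: (Running|Finished) task (\d+\.\d+) in stage (\d+\.\d+) \(TID (\d+)\)"
)


def unfinished_tasks(log_lines):
    """Return {tid: (stage, task)} for tasks that logged Running but not Finished."""
    pending = {}
    for line in log_lines:
        match = TASK_RE.search(line)
        if match is None:
            continue  # not an executor task line
        state, task, stage, tid = match.groups()
        if state == "Running":
            pending[int(tid)] = (stage, task)
        else:
            pending.pop(int(tid), None)
    return pending


if __name__ == "__main__":
    sample = [
        "25/09/10 12:03:31 INFO Executor: Running task 0.0 in stage 5.0 (TID 33)",
        "25/09/10 12:03:31 INFO Executor: Running task 2.0 in stage 5.0 (TID 35)",
        "25/09/10 12:03:31 INFO Executor: Running task 3.0 in stage 5.0 (TID 36)",
        "25/09/10 12:03:31 INFO Executor: Running task 1.0 in stage 5.0 (TID 34)",
    ]
    for tid, (stage, task) in sorted(unfinished_tasks(sample).items()):
        print(f"TID {tid}: task {task} in stage {stage} never finished")
```

Run against the four lines above, all four TIDs of stage 5 remain pending, which matches the observation that the write stage never advances.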
Gluten version
Gluten-1.4
Spark version
Spark-3.5.x
Spark configurations
.master("local[*]")
.appName("s3a-committers-stable")
.config("spark.plugins", "org.apache.gluten.GlutenPlugin")
.config("spark.memory.offHeap.size", "8g")
.config("spark.shuffle.manager","org.apache.spark.shuffle.sort.ColumnarShuffleManager")
.config("spark.memory.offHeap.enabled", "true")
.config("spark.driver.extraClassPath","/opt/spark/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar")
.config("spark.executor.extraClassPath","/opt/spark/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar")
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled", "true")
.config("spark.hadoop.fs.s3a.use.instance.credentials","true")
# .config("spark.speculation", "false")
.config("spark.dynamicAllocation.enabled", "false")
.config("spark.gluten.velox.awsSdkLogLevel","debug")
.config("spark.log.level", "info")
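Because the job works once the plugin is disabled, it can help to build both variants from one place so that the Gluten settings are the only difference between runs. The following is a minimal sketch under that assumption: every key and value is copied from the configuration above, while the names `BASE_CONF`, `GLUTEN_CONF`, `session_conf`, and `build_session` are hypothetical.

```python
# Path copied from the configuration above.
GLUTEN_JAR = "/opt/spark/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar"

# Settings that apply with or without Gluten.
BASE_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.use.instance.credentials": "true",
    "spark.dynamicAllocation.enabled": "false",
    "spark.log.level": "info",
}

# Settings that only matter when the Gluten plugin is loaded.
GLUTEN_CONF = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "8g",
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    "spark.driver.extraClassPath": GLUTEN_JAR,
    "spark.executor.extraClassPath": GLUTEN_JAR,
    "spark.gluten.velox.awsSdkLogLevel": "debug",
}


def session_conf(use_gluten):
    """Return the config dict for one run, with or without the Gluten keys."""
    conf = dict(BASE_CONF)
    if use_gluten:
        conf.update(GLUTEN_CONF)
    return conf


def build_session(use_gluten):
    """Create a local SparkSession from session_conf (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the dicts work without pyspark

    builder = SparkSession.builder.master("local[*]").appName("s3a-committers-stable")
    for key, value in session_conf(use_gluten).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Note that `getOrCreate()` returns an already running session if one exists, so restarting the notebook kernel between the two runs is the safer way to make sure the plugin and classpath settings actually change.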
System information
Docker container based on a Jupyter image. Java 17 on Ubuntu.
Relevant logs
It just gets stuck there. Even with debug logging enabled, I don't see any errors or anything unusual that would indicate a problem.
25/09/10 12:05:09 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 249.9 KiB, free 8.3 GiB)
25/09/10 12:05:09 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 90.6 KiB, free 8.3 GiB)
25/09/10 12:05:09 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on jupyter:39287 (size: 90.6 KiB, free: 8.4 GiB)
25/09/10 12:05:09 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:1585
25/09/10 12:05:09 INFO DAGScheduler: Submitting 16 missing tasks from ShuffleMapStage 12 (MapPartitionsRDD[35] at genShuffleDependency at VeloxSparkPlanExecApi.scala:542) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
25/09/10 12:05:09 INFO TaskSchedulerImpl: Adding task set 12.0 with 16 tasks resource profile 0
25/09/10 12:05:09 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 49) (jupyter, executor driver, partition 0, PROCESS_LOCAL, 9875 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 1.0 in stage 12.0 (TID 50) (jupyter, executor driver, partition 1, PROCESS_LOCAL, 9888 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 2.0 in stage 12.0 (TID 51) (jupyter, executor driver, partition 2, PROCESS_LOCAL, 9889 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 3.0 in stage 12.0 (TID 52) (jupyter, executor driver, partition 3, PROCESS_LOCAL, 9877 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 4.0 in stage 12.0 (TID 53) (jupyter, executor driver, partition 4, PROCESS_LOCAL, 9888 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 5.0 in stage 12.0 (TID 54) (jupyter, executor driver, partition 5, PROCESS_LOCAL, 9889 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 6.0 in stage 12.0 (TID 55) (jupyter, executor driver, partition 6, PROCESS_LOCAL, 9877 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 7.0 in stage 12.0 (TID 56) (jupyter, executor driver, partition 7, PROCESS_LOCAL, 9888 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 8.0 in stage 12.0 (TID 57) (jupyter, executor driver, partition 8, PROCESS_LOCAL, 9889 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 9.0 in stage 12.0 (TID 58) (jupyter, executor driver, partition 9, PROCESS_LOCAL, 9877 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 10.0 in stage 12.0 (TID 59) (jupyter, executor driver, partition 10, PROCESS_LOCAL, 9888 bytes)
25/09/10 12:05:09 INFO TaskSetManager: Starting task 11.0 in stage 12.0 (TID 60) (jupyter, executor driver, partition 11, PROCESS_LOCAL, 9889 bytes)
25/09/10 12:05:09 INFO Executor: Running task 2.0 in stage 12.0 (TID 51)
25/09/10 12:05:09 INFO Executor: Running task 0.0 in stage 12.0 (TID 49)
25/09/10 12:05:09 INFO Executor: Running task 1.0 in stage 12.0 (TID 50)
25/09/10 12:05:09 INFO Executor: Running task 3.0 in stage 12.0 (TID 52)
25/09/10 12:05:09 INFO Executor: Running task 4.0 in stage 12.0 (TID 53)
25/09/10 12:05:09 INFO Executor: Running task 5.0 in stage 12.0 (TID 54)
25/09/10 12:05:09 INFO Executor: Running task 6.0 in stage 12.0 (TID 55)
25/09/10 12:05:09 INFO Executor: Running task 7.0 in stage 12.0 (TID 56)
25/09/10 12:05:09 INFO Executor: Running task 9.0 in stage 12.0 (TID 58)
25/09/10 12:05:09 INFO Executor: Running task 8.0 in stage 12.0 (TID 57)
25/09/10 12:05:09 INFO Executor: Running task 10.0 in stage 12.0 (TID 59)
25/09/10 12:05:09 INFO Executor: Running task 11.0 in stage 12.0 (TID 60)
25/09/10 12:09:30 INFO BlockManagerInfo: Removed broadcast_11_piece0 on jupyter:39287 in memory (size: 6.4 KiB, free: 8.4 GiB)
25/09/10 12:09:30 INFO BlockManagerInfo: Removed broadcast_9_piece0 on jupyter:39287 in memory (size: 40.4 KiB, free: 8.4 GiB)
[Stage 5:> (0 + 4) / 4][Stage 12:> (0 + 12) / 16]
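The console progress line above reads as `(finished + running) / total` for each stage. The following is a minimal sketch for extracting those numbers from such a line; the function name `parse_progress` is hypothetical and not part of Spark.

```python
import re

# One match per "[Stage N:===>  (finished + running) / total]" segment.
PROGRESS_RE = re.compile(r"\[Stage (\d+):[=>\s]*\((\d+) \+ (\d+)\) / (\d+)\]")


def parse_progress(line):
    """Return one dict per stage found in a Spark console progress line."""
    return [
        {"stage": int(s), "finished": int(f), "running": int(r), "total": int(t)}
        for s, f, r, t in PROGRESS_RE.findall(line)
    ]


if __name__ == "__main__":
    bar = "[Stage 5:> (0 + 4) / 4][Stage 12:> (0 + 12) / 16]"
    for entry in parse_progress(bar):
        print(entry)
```

For the line in this report, both stages show zero finished tasks: stage 5 has all 4 of its tasks running and stage 12 has 12 of 16 running, minutes after they were submitted.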