
[VL] Unable to read or write to S3A with Spark 3.5.2 and Gluten 1.4 #10670

@Neuw84

Description


Backend

VL (Velox)

Bug description

Using OSS Spark 3.5.2 and Gluten 1.4 with the configs below, I am not able to read from or write to S3. The tasks appear to start and run, but they make no progress and report no errors.

If I disable the Gluten plugin, everything works; with Gluten enabled I cannot read data from S3 either.

I don't see any errors or traces that would point to what is happening.

Any hints?
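
For reference, a minimal sketch of the kind of round trip that produces the log below (the bucket/prefix is a placeholder, not the real path):

    df = spark.range(1_000_000)
    # The write hangs with Gluten enabled; it works with the plugin disabled.
    df.write.mode("overwrite").parquet("s3a://my-bucket/gluten-test/")
    # Reading back hangs as well when Gluten is enabled.
    spark.read.parquet("s3a://my-bucket/gluten-test/").count()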

25/09/10 12:03:31 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 5 (VeloxColumnarWriteFilesRDD[20] at parquet at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
25/09/10 12:03:31 INFO TaskSchedulerImpl: Adding task set 5.0 with 4 tasks resource profile 0
25/09/10 12:03:31 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 33) (jupyter, executor driver, partition 0, NODE_LOCAL, 9274 bytes) 
25/09/10 12:03:31 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 34) (jupyter, executor driver, partition 1, NODE_LOCAL, 9274 bytes) 
25/09/10 12:03:31 INFO TaskSetManager: Starting task 2.0 in stage 5.0 (TID 35) (jupyter, executor driver, partition 2, NODE_LOCAL, 9274 bytes) 
25/09/10 12:03:31 INFO TaskSetManager: Starting task 3.0 in stage 5.0 (TID 36) (jupyter, executor driver, partition 3, NODE_LOCAL, 9274 bytes) 
25/09/10 12:03:31 INFO Executor: Running task 0.0 in stage 5.0 (TID 33)
25/09/10 12:03:31 INFO Executor: Running task 2.0 in stage 5.0 (TID 35)
25/09/10 12:03:31 INFO Executor: Running task 3.0 in stage 5.0 (TID 36)
25/09/10 12:03:31 INFO Executor: Running task 1.0 in stage 5.0 (TID 34)

Gluten version

Gluten-1.4

Spark version

Spark-3.5.x

Spark configurations


from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
      .master("local[*]")
      .appName("s3a-committers-stable")
      .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
      .config("spark.memory.offHeap.size", "8g")
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.driver.extraClassPath", "/opt/spark/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar")
      .config("spark.executor.extraClassPath", "/opt/spark/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .config("spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled", "true")
      .config("spark.hadoop.fs.s3a.use.instance.credentials", "true")
      # .config("spark.speculation", "false")
      .config("spark.dynamicAllocation.enabled", "false")
      .config("spark.gluten.velox.awsSdkLogLevel", "debug")
      .config("spark.log.level", "info")
      .getOrCreate()
)

System information

Docker container based on a Jupyter image; Java 17 on Ubuntu.

Relevant logs

The job just gets stuck at this point. With debug logging enabled I don't see any errors or anything unusual that would indicate a problem; it simply hangs.

25/09/10 12:05:09 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 249.9 KiB, free 8.3 GiB)
25/09/10 12:05:09 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 90.6 KiB, free 8.3 GiB)
25/09/10 12:05:09 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on jupyter:39287 (size: 90.6 KiB, free: 8.4 GiB)
25/09/10 12:05:09 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:1585
25/09/10 12:05:09 INFO DAGScheduler: Submitting 16 missing tasks from ShuffleMapStage 12 (MapPartitionsRDD[35] at genShuffleDependency at VeloxSparkPlanExecApi.scala:542) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
25/09/10 12:05:09 INFO TaskSchedulerImpl: Adding task set 12.0 with 16 tasks resource profile 0
25/09/10 12:05:09 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 49) (jupyter, executor driver, partition 0, PROCESS_LOCAL, 9875 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 1.0 in stage 12.0 (TID 50) (jupyter, executor driver, partition 1, PROCESS_LOCAL, 9888 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 2.0 in stage 12.0 (TID 51) (jupyter, executor driver, partition 2, PROCESS_LOCAL, 9889 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 3.0 in stage 12.0 (TID 52) (jupyter, executor driver, partition 3, PROCESS_LOCAL, 9877 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 4.0 in stage 12.0 (TID 53) (jupyter, executor driver, partition 4, PROCESS_LOCAL, 9888 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 5.0 in stage 12.0 (TID 54) (jupyter, executor driver, partition 5, PROCESS_LOCAL, 9889 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 6.0 in stage 12.0 (TID 55) (jupyter, executor driver, partition 6, PROCESS_LOCAL, 9877 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 7.0 in stage 12.0 (TID 56) (jupyter, executor driver, partition 7, PROCESS_LOCAL, 9888 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 8.0 in stage 12.0 (TID 57) (jupyter, executor driver, partition 8, PROCESS_LOCAL, 9889 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 9.0 in stage 12.0 (TID 58) (jupyter, executor driver, partition 9, PROCESS_LOCAL, 9877 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 10.0 in stage 12.0 (TID 59) (jupyter, executor driver, partition 10, PROCESS_LOCAL, 9888 bytes) 
25/09/10 12:05:09 INFO TaskSetManager: Starting task 11.0 in stage 12.0 (TID 60) (jupyter, executor driver, partition 11, PROCESS_LOCAL, 9889 bytes) 
25/09/10 12:05:09 INFO Executor: Running task 2.0 in stage 12.0 (TID 51)
25/09/10 12:05:09 INFO Executor: Running task 0.0 in stage 12.0 (TID 49)
25/09/10 12:05:09 INFO Executor: Running task 1.0 in stage 12.0 (TID 50)
25/09/10 12:05:09 INFO Executor: Running task 3.0 in stage 12.0 (TID 52)
25/09/10 12:05:09 INFO Executor: Running task 4.0 in stage 12.0 (TID 53)
25/09/10 12:05:09 INFO Executor: Running task 5.0 in stage 12.0 (TID 54)
25/09/10 12:05:09 INFO Executor: Running task 6.0 in stage 12.0 (TID 55)
25/09/10 12:05:09 INFO Executor: Running task 7.0 in stage 12.0 (TID 56)
25/09/10 12:05:09 INFO Executor: Running task 9.0 in stage 12.0 (TID 58)
25/09/10 12:05:09 INFO Executor: Running task 8.0 in stage 12.0 (TID 57)
25/09/10 12:05:09 INFO Executor: Running task 10.0 in stage 12.0 (TID 59)
25/09/10 12:05:09 INFO Executor: Running task 11.0 in stage 12.0 (TID 60)
25/09/10 12:09:30 INFO BlockManagerInfo: Removed broadcast_11_piece0 on jupyter:39287 in memory (size: 6.4 KiB, free: 8.4 GiB)
25/09/10 12:09:30 INFO BlockManagerInfo: Removed broadcast_9_piece0 on jupyter:39287 in memory (size: 40.4 KiB, free: 8.4 GiB)
[Stage 5:>                  (0 + 4) / 4][Stage 12:>               (0 + 12) / 16]
