Fix Kyuubi OOM bug when multiple batch jobs are submitted at once in large amount #7227

JoonPark1 · 2025-10-21T16:45:38Z

…cordingly once engine submit timeout is reached - prevent subsequent kyuubi OOM

Why are the changes needed?

This PR change is to address bug #7226. It updates the behavior of updating metadata store accordingly for batch jobs that have timed out due to waiting for available spark driver engine. This will fix the subsequent restarted kyuubi server from repeatedly polling for the spark application status of each and every batch job, which can cause consecutive OOM errors under k8 cluster deployment mode for kyuubi.

How was this patch tested?

This patch was tested through integration test that was added to test suite class called "SparkOnKubernetesTestsSuite.scala".

Was this patch authored or co-authored using generative AI tooling?

No!

…cordingly once engine submit timeout is reached - prevent subsequent kyuubi OOM

codecov-commenter · 2025-10-21T19:05:50Z

Codecov Report

❌ Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (3b205a3) to head (b952a15).
⚠️ Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
...kyuubi/engine/KubernetesApplicationOperation.scala	0.00%	14 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff           @@
##           master   #7227   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         696     696           
  Lines       43530   43543   +13     
  Branches     5883    5884    +1     
======================================
- Misses      43530   43543   +13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull Request Overview

This PR addresses issue #7226 by preventing Kyuubi OOM errors when multiple batch jobs time out waiting for Spark driver engines. When a batch job reaches the engine submit timeout, the metadata store is now properly updated with TIMEOUT state and NOT_FOUND engine state, preventing the restarted Kyuubi server from repeatedly polling these timed-out jobs.

Key Changes:

Updated timeout handling to persist batch job state when engine submission times out
Added metadata store update with proper error state and message on timeout
Added integration test to verify timeout behavior updates metadata correctly

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
KubernetesApplicationOperation.scala	Added metadata store update logic when driver pod is not found after submit timeout
SparkOnKubernetesTestsSuite.scala	Added integration test verifying timeout state is properly persisted to metadata store

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

kyuubi-server/src/main/scala/org/apache/kyuubi/engine/KubernetesApplicationOperation.scala

Copilot · 2025-10-24T19:57:13Z

...-it/src/test/scala/org/apache/kyuubi/kubernetes/test/spark/SparkOnKubernetesTestsSuite.scala

    assert(!failKillResponse._1)
  }
+  test(
+    "If spark batch reach timeout, it should have associated Kyuubi Application Operation be " +


Grammatical error in test description. Should be 'reaches timeout' instead of 'reach timeout', and 'should have the associated' instead of 'should have associated'.

Suggested change

"If spark batch reach timeout, it should have associated Kyuubi Application Operation be " +

"If spark batch reaches timeout, it should have the associated Kyuubi Application Operation be " +

Co-authored-by: Copilot <[email protected]>

turboFei · 2025-10-24T20:15:28Z

Hi @JoonPark1
Thanks for the contribution.

For this issue, does it has chance to update metadata in BatchJobSubmission?

kyuubi/kyuubi-server/src/main/scala/org/apache/kyuubi/operation/BatchJobSubmission.scala

Lines 169 to 189 in e8bbf52

    
           private def updateBatchMetadata(): Unit = { 
        
             val endTime = if (isTerminalState(state)) lastAccessTime else 0L 
        
             if (isTerminalState(state) && _applicationInfo.isEmpty) { 
        
               _applicationInfo = Some(ApplicationInfo.NOT_FOUND) 
        
             } 
        
             _applicationInfo.foreach { appInfo => 
        
               val metadataToUpdate = Metadata( 
        
                 identifier = batchId, 
        
                 state = state.toString, 
        
                 engineOpenTime = appStartTime, 
        
                 engineId = appInfo.id, 
        
                 engineName = appInfo.name, 
        
                 engineUrl = appInfo.url.orNull, 
        
                 engineState = getAppState(state, appInfo.state).toString, 
        
                 engineError = appInfo.error, 
        
                 endTime = endTime) 
        
               session.sessionManager.updateMetadata(metadataToUpdate) 
        
             } 
        
           }

JoonPark1 · 2025-10-24T20:45:11Z

Hey @turboFei. I believe the spark driver engine state and spark app state will be updated for metadata store...

turboFei · 2025-10-24T22:30:20Z

Hi @JoonPark1
Before this PR, it can not update the metadata with BatchJobSubmission:: updateBatchMetadata?

Could you provide more details?

JoonPark1 · 2025-10-24T22:39:21Z

@turboFei Sure! Once the kyuubi batch job times out because the elapsed time exceeds the configured submitTimeout property value (no spark driver is instantiated and in running state to handle the submitted batch job), the metadata about the spark application and the spark driver engine state is updated accordingly via "org.apache.kyuubi.server.metadata.MetadataManager" class' updateMetadata method which takes in the new up-to-date Metadata construct object instance (which is instance of class "org.apache.kyuubi.server.metadata.api.Metadata"). Then, internally within the manager class, the method calls upon the "org.apache.kyuubi.server.metadata.MetadataStore" class' updateMetadata method, which keeps the data regarding the state of each submitted kyuubi batch jobs utilizing spark compute engine as in-sync with the state of kyuubi's metadata store in relationalDB. As you can see, the whole flow does not need to invoke the BatchJobSubmission:: updateBatchMetadata to update the kyuubi's metadata store instance.

turboFei · 2025-10-26T21:38:35Z

Hi, I checked the call site.

getApplicationInfoByTag <- KyuubiApplicationManager::getApplicationInfo

KyuubiApplicationManager::getApplicationInfo <- BatchJobSumission::currentApplicationInfo

And finally, BatchJobSumission::updateBatchMetadata will update the metadata.

JoonPark1 · 2025-10-26T22:13:39Z

@turboFei. Hey Fei! You are right about the flow of the call-site... basically, the polling for status of submitted batch job originates from BatchJobSubmission:currentApplicationInfo... However, I don't think my change at the level of KubernetesApplicationOperation class calls the method BatchJobSumission::updateBatchMetadata. Do you suggest I refactor my change so that the update to kyuubi metadata happens through the BatchJobSubmission internal method? The alternative call-site flow will be BatchJobSumission::updateBatchMetadata() -> KyuubiSessionManager::updateMetadata(). I think this way also works as you proposed because it also works with relevant fields for Metadata instance associated with the specific batch job submission on kyuubi side... Let me know what you think.

turboFei · 2025-10-26T23:01:56Z

Hi @JoonPark1
I checked the description in #7226

However, when there is heavy load, it can cause kyuubi to store records via MetadataManager about the state of each batch job as "PENDING" and repeated polling about each batch job's status until it runs out of memory.

I just wonder why kyuubi server dit not update the metadata as the flow of the call-site I mentioned above.

I mean before this PR.

turboFei · 2025-10-26T23:08:47Z

And for KubernetesApplicationOperation, it used SharedIndexInformer to get the kubernetes application states.

kyuubi/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/KubernetesApplicationOperation.scala

Lines 151 to 162 in 07d1d5f

    
           val enginePodInformer = client.pods() 
        
             .withLabel(LABEL_KYUUBI_UNIQUE_KEY) 
        
             .inform(new SparkEnginePodEventHandler(kubernetesInfo)) 
        
           info(s"[$kubernetesInfo] Start Kubernetes Client POD Informer.") 
        
           enginePodInformers.put(kubernetesInfo, enginePodInformer) 
        
           if (sparkAppUrlSource == KubernetesApplicationUrlSource.SVC) { 
        
             info(s"[$kubernetesInfo] Start Kubernetes Client SVC Informer.") 
        
             val engineSvcInformer = client.services() 
        
               .inform(new SparkEngineSvcEventHandler(kubernetesInfo)) 
        
             engineSvcInformers.put(kubernetesInfo, engineSvcInformer) 
        
           } 
        
           client

Do you mean?

it did not update the metadata inside BatchJobSubmission, and the batch state was PENDING.
after kyuubi server restarted, due the batch state was PENDING, it tried to resubmit the batch
and failed to get the batch app state after submit timeout, application state was NOT_FOUND
then it still did not update the metadata inside BatchJobSubmission

JoonPark1 · 2025-10-27T14:05:32Z

@turboFei Hey Fei. You're train of thought is right on besides point #3. Basically, once kyuubi server restarts, it tries to resubmit the batch and poll for the batch status but once submitTimeout is reached, the metadata associated for that batch is not updated and remains as "PENDING" with engineState being "UNKNOWN". This does not stop the resubmission and repolling for the batch on subsequent restarts of kyuubi server. That's why my change aimed to mark it as one of final/terminal states for the batch once submit-timeout is reached: "TIMEOUT" for batch app status so that repeated polling is not happening because the kyuubi batch manager will know the state of batch is final and timed out waiting for the compute driver engine.

Reformat comment for better readability and fix scalastyle char-limit violations.

turboFei · 2025-10-27T18:34:15Z

but once submitTimeout is reached, the metadata associated for that batch is not updated and remains as "PENDING" with engineState being "UNKNOWN"

Is it possible to enhance the BatchJobSumission::updateBatchMetadata to cover this case?

JoonPark1 · 2025-10-27T19:38:26Z

Hey @turboFei. I'm reconsidering moving my metadata-updating logic directly into the BatchJobSubmission:updateBatchMetadata method like you suggested... Does this seem like a good logic to set the appState appropriate as TIMED_OUT (which is appropriate given the situation). This block will be placed just before the Metadata object instantiation that will be passed to KyuubiSessionManager::updateMetadata() call.

val curTime = System.currentTimeMillis();
    val submitTime = submitTime(); 
    val diff = curTime - submitTime 
    if(diff > KyuubiConf.ENGINE_SUBMIT_TIMEOUT && state.toString.equals("PENDING")){
      _applicationInfo = Some(ApplicationInfo.NOT_FOUND)
      state = OperationState.TIMEOUT
    }

Let me know if you think this refactoring done directly within the updateBatchMetadata method is more appropriate. This should update batch metadata store the same way as having the effect done at KyuubiApplicationOperation level.

JoonPark1 · 2025-10-28T00:08:56Z

This issue is specific to kyuubi v1.9.2 @turboFei

turboFei · 2025-10-28T01:29:08Z

This issue is specific to kyuubi v1.9.2 @turboFei

could you try to use the latest kyuubi?

turboFei · 2025-10-28T01:30:38Z

I am using the latest kyuubi based on master branch, and do not meet this issue.

JoonPark1 · 2025-10-28T13:37:39Z

Hey @turboFei... I know you said the current latest version of kyuubi does not face this issue. Have you tried replicating it and checking the state of metadata store to see if the polling for the batch job that timed out is updated appropriately?

turboFei · 2025-10-29T22:15:00Z

Hey @turboFei... I know you said the current latest version of kyuubi does not face this issue. Have you tried replicating it and checking the state of metadata store to see if the polling for the batch job that timed out is updated appropriately?

I remember I fixed many batch issues, not sure all of them backport to 1.9.2.

Could you try to use the latest version, I think they are compatible.

turboFei · 2025-10-29T22:16:01Z

BTW, the affected version mentioned in #7226 is 1.10.2, could you correct it to prevent misleading?

JoonPark1 · 2025-10-30T01:51:58Z

My bad @turboFei. The kyuubi version we are using that's being affected by the issue is v1.10.2!

turboFei · 2025-10-30T07:46:43Z

mysql> select count(*) from metadata where engine_state = 'NOT_FOUND';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (2.44 sec)

mysql> select count(*) from metadata where engine_state != 'NOT_FOUND';
+----------+
| count(*) |
+----------+
|   613153 |
+----------+
1 row in set (2.29 sec)

mysql>

JoonPark1 · 2025-10-30T15:40:56Z

@turboFei Hey Fei! I saw your state of metadata store... For the issue, when a kyuubi batch job times out, the engine_state column is set to "UNKNOWN", but my fix attempts to correct this to more appropriate state ("NOT_FOUND") as it makes sense to indicate the spark driver responsible for handling the batch job is not up. Let me know what you think whether my fix is appropriate given the circumstance. Basically, if the issue is already resolved in latest version, the number of records corresponding to batch jobs with engine_state UNKNOWN should be 0 ( could you verify it is indeed that)?

turboFei · 2025-11-03T03:27:57Z

Some comments:

I think you can try to use the latest master branch code
If it is a bug need to fix, please update the metadata inside BatchJobSubmission

turboFei · 2025-11-06T03:05:24Z

I think we can limit the number to recover in one batch,

kyuubi/kyuubi-server/src/main/scala/org/apache/kyuubi/server/KyuubiRestFrontendService.scala

Line 182 in 53074ec

    
           private[kyuubi] def recoverBatchSessions(): Unit = withBatchRecoveryLockRequired {

Do not recover all batches together.

For example, there are 400 batches to recover on restarting.

We can add a config, to recover 50 batches one time, and wait the batches engineId ready(app submitted) or wait a maximum interval before next recovery. how do you think about?

@JoonPark1

JoonPark1 · 2025-11-10T15:01:36Z

@turboFei That does sound like a good alternative to impose a limit on max # of batches to recover per each recovery attempt. Is this already an available configuration for kyuubi or does it needs to be added?

If it's not available, I think we can add as extra kyuubi server-side config and have it be part of updated KyuubiSessionManager::getBatchSessionsToRecover method as extra argument as a batchSize parameter to the method and deduced pagination offset to read from the MetadataStore instance. Additionally, in the KyuubiRestFrontendService::recoverBatchSessions(), we can have it call KyuubiSessionManager::getBatchSessionsToRecover repeatedly with pagination offset tracking to obtain sequence of instantiated KyuubiBatchSessions that correspond to batch records obtained from relational store until there are no more metadata records pertaining to batches to process.

turboFei · 2025-11-10T21:11:00Z

Is this already an available configuration for kyuubi or does it needs to be added?

Not yet now. would you like to contribute it?

JoonPark1 · 2025-11-10T21:20:26Z

@turboFei Sure. I would love to contribute it...

… on metadata store upon kyuubi server recovery - to prevent it from being overwhelmed and face OOM issue

fix KubernetesApplicationOperation so that metadata records update ac…

1cbeb7b

…cordingly once engine submit timeout is reached - prevent subsequent kyuubi OOM

github-actions bot added module:kubernetes module:server module:integration-tests labels Oct 21, 2025

fix styling for test suite for SparkOnKubernetes

de98cf8

pan3793 requested a review from turboFei October 24, 2025 08:56

turboFei requested a review from Copilot October 24, 2025 19:56

Copilot AI reviewed Oct 24, 2025

View reviewed changes

Apply suggestion from @Copilot

783954d

Co-authored-by: Copilot <[email protected]>

Refactor comment in KubernetesApplicationOperation

b952a15

Reformat comment for better readability and fix scalastyle char-limit violations.

add new kyuubi server-side config to handle recovery batch jobs based…

eb34301

… on metadata store upon kyuubi server recovery - to prevent it from being overwhelmed and face OOM issue

github-actions bot added the module:common label Nov 11, 2025

	"If spark batch reach timeout, it should have associated Kyuubi Application Operation be " +
	"If spark batch reaches timeout, it should have the associated Kyuubi Application Operation be " +

Fix Kyuubi OOM bug when multiple batch jobs are submitted at once in large amount #7227

Are you sure you want to change the base?

Fix Kyuubi OOM bug when multiple batch jobs are submitted at once in large amount #7227

Conversation

JoonPark1 commented Oct 21, 2025

Why are the changes needed?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

codecov-commenter commented Oct 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

turboFei commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoonPark1 commented Oct 24, 2025

Uh oh!

turboFei commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoonPark1 commented Oct 24, 2025

Uh oh!

turboFei commented Oct 26, 2025

Uh oh!

JoonPark1 commented Oct 26, 2025

Uh oh!

turboFei commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

turboFei commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoonPark1 commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

turboFei commented Oct 27, 2025

Uh oh!

JoonPark1 commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoonPark1 commented Oct 28, 2025

Uh oh!

turboFei commented Oct 28, 2025

Uh oh!

turboFei commented Oct 28, 2025

Uh oh!

JoonPark1 commented Oct 28, 2025

Uh oh!

turboFei commented Oct 29, 2025

Uh oh!

turboFei commented Oct 29, 2025

Uh oh!

JoonPark1 commented Oct 30, 2025

Uh oh!

turboFei commented Oct 30, 2025

Uh oh!

JoonPark1 commented Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

turboFei commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

turboFei commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoonPark1 commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

turboFei commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

codecov-commenter commented Oct 21, 2025 •

edited

Loading

turboFei commented Oct 24, 2025 •

edited

Loading

turboFei commented Oct 24, 2025 •

edited

Loading

turboFei commented Oct 26, 2025 •

edited

Loading

turboFei commented Oct 26, 2025 •

edited

Loading

JoonPark1 commented Oct 27, 2025 •

edited

Loading

JoonPark1 commented Oct 27, 2025 •

edited

Loading

JoonPark1 commented Oct 30, 2025 •

edited

Loading

turboFei commented Nov 3, 2025 •

edited

Loading

turboFei commented Nov 6, 2025 •

edited

Loading

JoonPark1 commented Nov 10, 2025 •

edited

Loading

turboFei commented Nov 10, 2025 •

edited

Loading