This repository was archived by the owner on Sep 12, 2023. It is now read-only.

Encounter nil error when job is in error state with TTL value set #170

@mirocody

Description


Hi community,
I am trying to deploy a simple task using a PyTorchJob with the following YAML:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob
  namespace: abc
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch

  runPolicy:
    ttlSecondsAfterFinished: 864000

The script exception.py does nothing but throw an exception, so the container goes into the error state. The training operator pod then logs the following:

E1026 03:50:23.343541       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 560 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16da180, 0x27a0b00)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x16da180, 0x27a0b00)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc000e89320, 0xc000703618, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, 0x0, 0x18987c0, ...)
        /go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:401 +0xbd
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000e89320, 0x18987c0, 0xc000703500, 0xc0008183c0, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:147 +0x76d
github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).Reconcile(0xc000e89320, 0x1b88fa0, 0xc000818270, 0xc000624f60, 0x13, 0xc000a1b590, 0x28, 0xc000818270, 0x40903b, 0xc000030000, ...)
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:159 +0x83c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x1750a40, 0xc000348340)
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x0)
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1b88ee0, 0xc000d26400)
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00026c750)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001121f50, 0x1b46440, 0xc000818180, 0xc000d26401, 0xc000a36240)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00026c750, 0x3b9aca00, 0x0, 0x1, 0xc000a36240)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00, 0x0, 0x1986d01)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00)
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x14a257d]

It looks like the assumption in this line (job.go:401 in CleanupJob, per the trace above) does not hold: the completion time is not set when the cleanup starts for a job that ended in the error state, so the controller hits a nil pointer dereference.
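
As a rough illustration of the failure mode (not the actual kubeflow/common code), here is a minimal Go sketch assuming the cleanup path computes the TTL expiry from jobStatus.CompletionTime. The jobStatus struct and expireAt function are hypothetical stand-ins; the point is that for a failed job CompletionTime can stay nil, and a guard like the one shown lets the controller skip cleanup instead of panicking.

package main

import (
	"fmt"
	"time"
)

// jobStatus mirrors only the shape that matters here: the completion time is a
// pointer (a *metav1.Time in the real API) and stays nil when the job fails
// before ever completing.
type jobStatus struct {
	CompletionTime *time.Time
}

// expireAt is a hypothetical stand-in for the TTL cleanup calculation.
// Dereferencing CompletionTime unconditionally is what would reproduce the
// "invalid memory address or nil pointer dereference" panic above.
func expireAt(status jobStatus, ttlSeconds int64) (time.Time, error) {
	if status.CompletionTime == nil {
		// Guard: skip or defer cleanup instead of crashing the controller.
		return time.Time{}, fmt.Errorf("completion time not set; cannot compute TTL expiry")
	}
	return status.CompletionTime.Add(time.Duration(ttlSeconds) * time.Second), nil
}

func main() {
	failed := jobStatus{CompletionTime: nil} // job ended in a Failed condition
	if _, err := expireAt(failed, 864000); err != nil {
		fmt.Println("skipping TTL cleanup:", err)
	}
}

With a nil check like this, a job that reaches the error state with ttlSecondsAfterFinished set would simply not be cleaned up by TTL (or could fall back to another timestamp), rather than crashing the reconcile loop.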
