Encounter NIL error when job is in error stage with TTL value set #170
Hi community,
I am trying to deploy a simple task as a PyTorchJob with the following YAML (note that runPolicy.ttlSecondsAfterFinished is set to 864000 seconds, i.e. 10 days):
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob
  namespace: abc
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
            - args:
                - |+
                  echo "Hello World!"
                  python -u exception.py
              command:
                - /usr/bin/env
                - bash
                - -c
              env:
                - name: LOCAL_RANK
                  value: '0'
              image: <centos>
              name: pytorch
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
            - args:
                - |+
                  echo "Hello World!"
                  python -u exception.py
              command:
                - /usr/bin/env
                - bash
                - -c
              env:
                - name: LOCAL_RANK
                  value: '0'
              image: <centos>
              name: pytorch
  runPolicy:
    ttlSecondsAfterFinished: 864000
```
The script exception.py does nothing but throw an exception, so that the container goes into an error state.
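The actual exception.py is not included in this report; a minimal sketch of a script with the described behavior might look like this:

```python
# exception.py -- hypothetical reproduction script (the real one is not shown
# in this issue). It only needs to exit with a non-zero status so that the
# container lands in the Error state.
raise RuntimeError("intentional failure to drive the container into Error")
```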
The training-operator pod then logs the following:

```
E1026 03:50:23.343541 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 560 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16da180, 0x27a0b00)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x16da180, 0x27a0b00)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc000e89320, 0xc000703618, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, 0x0, 0x18987c0, ...)
/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:401 +0xbd
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000e89320, 0x18987c0, 0xc000703500, 0xc0008183c0, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, ...)
/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:147 +0x76d
github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).Reconcile(0xc000e89320, 0x1b88fa0, 0xc000818270, 0xc000624f60, 0x13, 0xc000a1b590, 0x28, 0xc000818270, 0x40903b, 0xc000030000, ...)
/workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:159 +0x83c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x1750a40, 0xc000348340)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x0)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1b88ee0, 0xc000d26400)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00026c750)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001121f50, 0x1b46440, 0xc000818180, 0xc000d26401, 0xc000a36240)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00026c750, 0x3b9aca00, 0x0, 0x1, 0xc000a36240)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00, 0x0, 0x1986d01)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x14a257d]
```
It looks like the assumption in CleanupJob (job.go:401 in the trace above) that the completion time is already set when cleanup starts does not hold here: a job that ends in the error state can have a nil CompletionTime, and dereferencing it triggers the panic.
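For illustration, here is a minimal Go sketch of the failure mode. This is a hypothetical reduction, not the actual kubeflow/common source; the function and parameter names are assumptions:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasTTLExpired is a hypothetical reduction of the cleanup path: the TTL is
// measured from the job's completion time. For a job that ended in an error
// state, CompletionTime was never set, so the dereference below panics with
// the same nil pointer error as the trace above.
func hasTTLExpired(ttl *int32, completionTime *metav1.Time, now time.Time) bool {
	if ttl == nil {
		return false // no TTL configured, never clean up automatically
	}
	// The guard the trace suggests is missing:
	// if completionTime == nil { return false }
	expireAt := completionTime.Add(time.Duration(*ttl) * time.Second)
	return now.After(expireAt)
}

func main() {
	ttl := int32(864000)
	// nil CompletionTime, as for a PyTorchJob whose containers errored out:
	fmt.Println(hasTTLExpired(&ttl, nil, time.Now())) // panics: nil pointer dereference
}
```

With a nil check in place (or a fallback such as the job's last transition time for failed jobs), the controller could skip or defer cleanup instead of crashing.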