@yanjunz97 (Contributor) commented Apr 22, 2025

This PR adds support for connecting to the k8s API server through localhost when cpvm eth1 is down.

Testing done:
Override the Kubernetes service host with an inaccessible address and observe that the NSX Operator
runs as expected by connecting to the k8s API server through localhost.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 85.29412% with 5 lines in your changes missing coverage. Please review.

Project coverage is 75.79%. Comparing base (6822956) to head (f6c0470).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/util/kubernetes.go | 87.50% | 3 Missing and 1 partial ⚠️ |
| cmd/main.go | 0.00% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##             main    #1077      +/-   ##
==========================================
+ Coverage   75.77%   75.79%   +0.01%     
==========================================
  Files         145      146       +1     
  Lines       19708    19740      +32     
==========================================
+ Hits        14934    14962      +28     
- Misses       3863     3866       +3     
- Partials      911      912       +1     
| Flag | Coverage Δ |
|---|---|
| unit-tests | 75.79% <85.29%> (+0.01%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| pkg/util/cert.go | 55.15% <100.00%> (ø) |
| cmd/main.go | 0.00% <0.00%> (ø) |
| pkg/util/kubernetes.go | 87.50% <87.50%> (ø) |

 func main() {
 	log.Info("Starting NSX Operator")
-	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
+	mgr, err := ctrl.NewManager(pkgutil.GetConfig(), ctrl.Options{
Contributor

So the API server address switch can only occur at startup, right? Then what happens if eth1 goes down while the NSX Operator is running?

Contributor Author

Yes, it only occurs at startup.

The current case is when WCP is enabled between backup and restore: cpvm eth1 will be down after the NSX restore, and we rely on the NSX Operator to recover it. In this case, the NSX Operator always restarts because the NSX connection goes down during the restore. In other cases where eth1 is down, should we always expect the NSX or WCP side to bring it back, and is it acceptable that the NSX Operator does not work during that time?

If there is a use case where the NSX Operator should switch from the cluster IP to localhost at runtime, maybe we can leverage the liveness probe to force the NSX Operator to restart. Actually, we need to refactor the liveness probe in a follow-up PR, because it currently checks through eth1, i.e. it calls an API like http://172.26.0.3:8384/healthz

Contributor Author

@yanjunz97 yanjunz97 May 7, 2025

I've checked this in HA mode and found that the operator restarts automatically after eth1 goes down, because the lease renewal fails.
Updated: In non-HA mode, however, the operator does not restart, and API server calls fail with errors like {"error": "Put \"https://172.24.0.1:443/apis/crd.nsx.vmware.com/v1alpha1/namespaces/ns-1/subnetsets/pod-default/status\": http2: client connection lost"}

@zhengxiexie
Contributor

Can one of the admins verify this patch?
