2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -87,7 +87,7 @@
- add ut for ascend by ([@shijinye](https://github.com/shijinye)) in [#664](https://github.com/Project-HAMi/HAMi/pull/664)
- optimization map init in test by ([@lengrongfu](https://github.com/lengrongfu)) in [#678](https://github.com/Project-HAMi/HAMi/pull/678)
- Optimize monitor by ([@for800000](https://github.com/for800000)) in [#683](https://github.com/Project-HAMi/HAMi/pull/683)
- - fix code lint faild by ([@lengrongfu](https://github.com/lengrongfu)) in [#685](https://github.com/Project-HAMi/HAMi/pull/685)
+ - fix code lint failed by ([@lengrongfu](https://github.com/lengrongfu)) in [#685](https://github.com/Project-HAMi/HAMi/pull/685)
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by ([@Nimbus318](https://github.com/Nimbus318)) in [#687](https://github.com/Project-HAMi/HAMi/pull/687)
- fix vGPUmonitor deviceidx is always 0 by ([@lengrongfu](https://github.com/lengrongfu)) in [#684](https://github.com/Project-HAMi/HAMi/pull/684)
- add ut for pkg/scheduler/event.go by ([@Penguin-zlh](https://github.com/Penguin-zlh)) in [#688](https://github.com/Project-HAMi/HAMi/pull/688)
6 changes: 3 additions & 3 deletions blog/2024-12-31-post/index.md
@@ -1600,7 +1600,7 @@ type DevicePluginServer interface {
// Plugin can run device specific operations and instruct Kubelet
// of the steps to make the Device available in the container
Allocate(context.Context, *AllocateRequest) (*AllocateResponse, error)
- // PreStartContainer is called, if indicated by Device Plugin during registeration phase,
+ // PreStartContainer is called, if indicated by Device Plugin during registration phase,
// before each container start. Device plugin can run device specific operations
// such as resetting the device before making devices available to the container
PreStartContainer(context.Context, *PreStartContainerRequest) (*PreStartContainerResponse, error)
@@ -1678,7 +1678,7 @@ func (plugin *NvidiaDevicePlugin) WatchAndRegister() {
errorSleepInterval := time.Second * 5
successSleepInterval := time.Second * 30
for {
- err := plugin.RegistrInAnnotation()
+ err := plugin.RegisterInAnnotation()
if err != nil {
klog.Errorf("Failed to register annotation: %v", err)
klog.Infof("Retrying in %v seconds...", errorSleepInterval)
@@ -1692,7 +1692,7 @@ func (plugin *NvidiaDevicePlugin) WatchAndRegister() {
```

```golang
- func (plugin *NvidiaDevicePlugin) RegistrInAnnotation() error {
+ func (plugin *NvidiaDevicePlugin) RegisterInAnnotation() error {
devices := plugin.getAPIDevices()
klog.InfoS("start working on the devices", "devices", devices)
annos := make(map[string]string)
4 changes: 2 additions & 2 deletions changelog/source/v2.5.0.md
@@ -66,7 +66,7 @@ authors:
- add ut for ascend by ([@shijinye](https://github.com/shijinye)) in [#664](https://github.com/Project-HAMi/HAMi/pull/664)
- optimization map init in test by ([@lengrongfu](https://github.com/lengrongfu)) in [#678](https://github.com/Project-HAMi/HAMi/pull/678)
- Optimize monitor by ([@for800000](https://github.com/for800000)) in [#683](https://github.com/Project-HAMi/HAMi/pull/683)
- - fix code lint faild by ([@lengrongfu](https://github.com/lengrongfu)) in [#685](https://github.com/Project-HAMi/HAMi/pull/685)
+ - fix code lint failed by ([@lengrongfu](https://github.com/lengrongfu)) in [#685](https://github.com/Project-HAMi/HAMi/pull/685)
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by ([@Nimbus318](https://github.com/Nimbus318)) in [#687](https://github.com/Project-HAMi/HAMi/pull/687)
- fix vGPUmonitor deviceidx is always 0 by ([@lengrongfu](https://github.com/lengrongfu)) in [#684](https://github.com/Project-HAMi/HAMi/pull/684)
- add ut for pkg/scheduler/event.go by ([@Penguin-zlh](https://github.com/Penguin-zlh)) in [#688](https://github.com/Project-HAMi/HAMi/pull/688)
@@ -160,4 +160,4 @@ authors:
- phoenixwu0229 ([@phoenixwu0229](https://github.com/phoenixwu0229))
- chinaran ([@chinaran](https://github.com/chinaran))

- **Full Changelog**: https://github.com/Project-HAMi/HAMi/compare/v2.4.1...v2.5.0
\ No newline at end of file
+ **Full Changelog**: https://github.com/Project-HAMi/HAMi/compare/v2.4.1...v2.5.0
File renamed without changes.
@@ -1,5 +1,5 @@
---
- title: Goverance
+ title: Governance
---

Heterogeneous AI Computing Virtualization Middleware (HAMi), formerly known as k8s-vGPU-scheduler, is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a k8s cluster.
2 changes: 1 addition & 1 deletion docs/contributor/ladder.md
@@ -1,5 +1,5 @@
---
- title: Contributer Ladder
+ title: Contributor Ladder
---

This doc describes the different ways to get involved and level up within the project. You can see the different roles within the project in the contributor roles.
6 changes: 3 additions & 3 deletions docs/developers/Dynamic-mig.md
@@ -19,7 +19,7 @@ HAMi is done by using [hami-core](https://github.com/Project-HAMi/HAMi-core), wh
- CPU, Mem, and GPU combined schedule
- GPU dynamic slice: Hami-core and MIG
- Support node-level binpack and spread by GPU memory, CPU and Mem
- - A unified vGPU Pool different virtualization technics
+ - A unified vGPU Pool across different virtualization techniques
- Tasks can choose to use MIG, use HAMi-core, or use both.

### Config maps
@@ -104,7 +104,7 @@ data:

## Examples

- Dynamic mig is compatable with hami tasks, as the example below:
+ Dynamic mig is compatible with hami tasks, as the example below:
Just set `nvidia.com/gpu` and `nvidia.com/gpumem`.

```yaml
@@ -149,7 +149,7 @@ The Procedure of a vGPU task which uses dynamic-mig is shown below:

<img src="https://github.com/Project-HAMi/HAMi/blob/master/docs/develop/imgs/hami-dynamic-mig-procedure.png?raw=true" width = "800" />

- Note that after submited a task, deviceshare plugin will iterate over templates defined in configMap `hami-scheduler-device`, and find the first available template to fit. You can always change the content of that configMap, and restart vc-scheduler to customize.
+ Note that after a task is submitted, the deviceshare plugin will iterate over the templates defined in the configMap `hami-scheduler-device` and pick the first available template that fits. You can always change the content of that configMap and restart vc-scheduler to customize.
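That first-available-template walk can be sketched roughly as follows; the types, template names, and ConfigMap shape here are illustrative assumptions, not HAMi's actual schema:

```go
package main

import "fmt"

// MigTemplate is a stand-in for one entry of the template list in the
// hami-scheduler-device ConfigMap; the fields are assumptions for
// illustration only.
type MigTemplate struct {
	Name   string
	Memory int32 // MiB provided by one instance of this geometry
	Count  int32 // free instances of this geometry left on the GPU
}

// firstFit walks the templates in ConfigMap order and returns the first
// one that can satisfy the request, mirroring the first-available-template
// behavior described above.
func firstFit(templates []MigTemplate, reqMem int32) (MigTemplate, bool) {
	for _, t := range templates {
		if t.Memory >= reqMem && t.Count > 0 {
			return t, true
		}
	}
	return MigTemplate{}, false
}

func main() {
	// Hypothetical A100-style template list.
	templates := []MigTemplate{
		{Name: "1g.5gb", Memory: 4864, Count: 7},
		{Name: "2g.10gb", Memory: 9856, Count: 3},
		{Name: "3g.20gb", Memory: 19968, Count: 2},
	}
	if t, ok := firstFit(templates, 8000); ok {
		fmt.Println("selected template:", t.Name) // 2g.10gb
	}
}
```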

If you submit the example on an empty A100-PCIE-40GB node, it will select a GPU and choose the MIG template below:

4 changes: 2 additions & 2 deletions docs/developers/protocol.md
@@ -12,10 +12,10 @@ HAMi needs to know the spec of each AI devices in the cluster in order to schedu

```
hami.io/node-handshake-\{device-type\}: Reported_\{device_node_current_timestamp\}
- hami.io/node-\{deivce-type\}-register: \{Device 1\}:\{Device2\}:...:\{Device N\}
+ hami.io/node-\{device-type\}-register: \{Device 1\}:\{Device2\}:...:\{Device N\}
```

- The definiation of each device is in the following format:
+ The definition of each device is in the following format:
```
\{Device UUID\},\{device split count\},\{device memory limit\},\{device core limit\},\{device type\},\{device numa\},\{healthy\}
```
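For illustration, here is a minimal Go sketch of parsing one registered device entry from this annotation; the struct and helper names are assumptions based only on the format above, not HAMi's implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceInfo mirrors the comma-separated entry described above. The struct
// and helper below are illustrative assumptions, not HAMi's actual types.
type DeviceInfo struct {
	UUID       string
	SplitCount int
	MemLimit   int // device memory limit
	CoreLimit  int // device core limit
	Type       string
	Numa       int
	Healthy    bool
}

// parseRegisterAnnotation splits the annotation value into ":"-separated
// device entries, then each entry into its seven ","-separated fields.
func parseRegisterAnnotation(val string) ([]DeviceInfo, error) {
	var devices []DeviceInfo
	for _, entry := range strings.Split(val, ":") {
		f := strings.Split(entry, ",")
		if len(f) != 7 {
			return nil, fmt.Errorf("malformed device entry: %q", entry)
		}
		split, _ := strconv.Atoi(f[1])
		mem, _ := strconv.Atoi(f[2])
		core, _ := strconv.Atoi(f[3])
		numa, _ := strconv.Atoi(f[5])
		healthy, _ := strconv.ParseBool(f[6])
		devices = append(devices, DeviceInfo{
			UUID: f[0], SplitCount: split, MemLimit: mem,
			CoreLimit: core, Type: f[4], Numa: numa, Healthy: healthy,
		})
	}
	return devices, nil
}

func main() {
	devs, err := parseRegisterAnnotation(
		"GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec,10,40960,100,NVIDIA-A100-PCIE-40GB,0,true")
	fmt.Println(devs, err)
}
```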
4 changes: 2 additions & 2 deletions docs/developers/scheduling.md
@@ -82,7 +82,7 @@ GPU spread, use different GPU cards when possible, egs:

### Node-scheduler-policy

- ![node-shceduler-policy-demo.png](../resources/node-shceduler-policy-demo.png)
+ ![node-scheduler-policy-demo.png](../resources/node-scheduler-policy-demo.png)

#### Binpack

@@ -166,4 +166,4 @@ GPU1 Score: ((20+10)/100 + (1000+2000)/8000) * 10 = 6.75
GPU2 Score: ((20+70)/100 + (1000+6000)/8000) * 10 = 17.75
```

- So, in `Spread` policy we can select `GPU1`.
\ No newline at end of file
+ So, in `Spread` policy we can select `GPU1`.
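The arithmetic can be reproduced with a small sketch, assuming cores are a percentage out of 100 and memory is in the same unit as the 8000 total:

```go
package main

import "fmt"

// gpuScore reproduces the arithmetic above: projected core utilization plus
// projected memory utilization after placement, scaled by 10. This helper is
// a sketch of the documented formula, not HAMi's actual scorer.
func gpuScore(usedCore, reqCore, usedMem, reqMem, totalMem float64) float64 {
	return ((usedCore+reqCore)/100 + (usedMem+reqMem)/totalMem) * 10
}

func main() {
	gpu1 := gpuScore(20, 10, 1000, 2000, 8000)
	gpu2 := gpuScore(20, 70, 1000, 6000, 8000)
	fmt.Printf("GPU1=%.2f GPU2=%.2f\n", gpu1, gpu2) // GPU1=6.75 GPU2=17.75
	// Spread prefers the lower score (GPU1 here); Binpack would prefer the higher.
}
```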
2 changes: 1 addition & 1 deletion docs/get-started/deploy-with-helm.md
@@ -160,7 +160,7 @@ spec:
nvidia.com/gpumem: 10240 # Each vGPU contains 10240m device memory (Optional,Integer)
```

- #### 2. Verify in container resouce control {#verify-in-container-resouce-control}
+ #### 2. Verify in container resource control {#verify-in-container-resource-control}

Execute the following query command:

2 changes: 1 addition & 1 deletion docs/installation/how-to-use-volcano-vgpu.md
@@ -113,7 +113,7 @@ spec:
resources:
limits:
volcano.sh/vgpu-number: 2 # requesting 2 gpu cards
- volcano.sh/vgpu-memory: 3000 # (optinal)each vGPU uses 3G device memory
+ volcano.sh/vgpu-memory: 3000 # (optional) each vGPU uses 3G device memory
volcano.sh/vgpu-cores: 50 # (optional) each vGPU uses 50% core
EOF
```
6 changes: 3 additions & 3 deletions docs/installation/offline-installation.md
@@ -21,8 +21,8 @@ Load the images, tag them with your internal registry, and push them to your reg
docker load -i {HAMi_image}.tar
docker tag projecthami/hami:{HAMi version} {your_inner_registry}/hami:{HAMi version}
docker push {your_inner_registry}/hami:{HAMi version}
- docker tag docker.io/jettech/kube-webhook-certgen:v1.5.2 {your inner_regisry}/kube-webhook-certgen:v1.5.2
- docker push {your inner_regisry}/kube-webhook-certgen:v1.5.2
+ docker tag docker.io/jettech/kube-webhook-certgen:v1.5.2 {your_inner_registry}/kube-webhook-certgen:v1.5.2
+ docker push {your_inner_registry}/kube-webhook-certgen:v1.5.2
docker tag liangjw/kube-webhook-certgen:v1.1.1 {your_inner_registry}/kube-webhook-certgen:v1.1.1
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:{your kubernetes version} {your_inner_registry}/kube-scheduler:{your kubernetes version}
docker push {your_inner_registry}/kube-scheduler:{your kubernetes version}
@@ -31,7 +31,7 @@ docker push {your_inner_registry}/kube-scheduler:{your kubernetes version}
## Prepare HAMi chart

Download the charts folder from [github](https://github.com/Project-HAMi/HAMi/tree/master/charts),
- place it into $\{CHART_PATH\} inside cluser, then edit the following fields in $\{CHART_PATH\}/hami/values.yaml.
+ place it into $\{CHART_PATH\} inside cluster, then edit the following fields in $\{CHART_PATH\}/hami/values.yaml.

```yaml
scheduler:
2 changes: 1 addition & 1 deletion docs/key-features/device-resource-isolation.md
@@ -2,7 +2,7 @@
title: Device resource isolation
---

- A simple demostration for device isolation:
+ A simple demonstration for device isolation:
A task with the following resources.

```
2 changes: 1 addition & 1 deletion docs/releases.md
@@ -79,7 +79,7 @@ Hence, if an issue is important it is important to advocate its priority early i

<!-- ### Release Artifacts

- The HAMi container images are availble at `dockerHub`.
+ The HAMi container images are available at `dockerHub`.
You can visit `https://hub.docker.com/r/karmada/<component_name>` to see the details of images.
For example, [here](https://hub.docker.com/r/karmada/karmada-controller-manager) for karmada-controller-manager.

@@ -1,6 +1,6 @@
#!/bin/bash

- # genererate front-proxy-ca, server-ca
+ # generate front-proxy-ca, server-ca

set -e
set -o pipefail
@@ -1,6 +1,6 @@
#!/bin/bash

- # genererate CA & leaf certificates of etcd.
+ # generate CA & leaf certificates of etcd.

set -e
set -o pipefail
@@ -24,7 +24,7 @@ title: Enable Enflame GPU Sharing

## Enabling GCU-sharing Support

- * Deploy gcushare-device-plugin on enflame nodes (Please consult your device provider to aquire its package and document)
+ * Deploy gcushare-device-plugin on enflame nodes (Please consult your device provider to acquire its package and document)

> **NOTICE:** *Install only gpushare-device-plugin, don't install gpu-scheduler-plugin package.*

2 changes: 1 addition & 1 deletion docs/userguide/Hygon-device/enable-hygon-dcu-sharing.md
@@ -78,4 +78,4 @@ Launch your DCU tasks like you usually do

1. DCU-sharing in init container is not supported, pods with "hygon.com/dcumem" in init container will never be scheduled.

- 2. Only one vdcu can be aquired per container. If you want to mount multiple dcu devices, then you shouldn't set `hygon.com/dcumem` or `hygon.com/dcucores`
+ 2. Only one vdcu can be acquired per container. If you want to mount multiple dcu devices, then you shouldn't set `hygon.com/dcumem` or `hygon.com/dcucores`
@@ -23,7 +23,7 @@ title: Enable Illuvatar GPU Sharing

## Enabling GPU-sharing Support

- * Deploy gpu-manager on iluvatar nodes (Please consult your device provider to aquire its package and document)
+ * Deploy gpu-manager on iluvatar nodes (Please consult your device provider to acquire its package and document)

> **NOTICE:** *Install only gpu-manager, don't install gpu-admission package.*

@@ -40,7 +40,7 @@ the GPU device plugin (gpu-device) handles fine-grained allocation based on the

## Enabling topo-awareness scheduling

- * Deploy Metax GPU Extensions on metax nodes (Please consult your device provider to aquire its package and document)
+ * Deploy Metax GPU Extensions on metax nodes (Please consult your device provider to acquire its package and document)

* Deploy HAMi according to README.md

@@ -2,7 +2,7 @@
title: Binpack schedule policy
---

- To allocate metax device with mininum damage to topology, you need to only assign `metax-tech.com/gpu` with annotations `hami.io/node-scheduler-policy: "binpack"`.
+ To allocate a metax device with minimum damage to topology, you only need to assign `metax-tech.com/gpu` with the annotation `hami.io/node-scheduler-policy: "binpack"`.

```yaml
apiVersion: v1
@@ -2,7 +2,7 @@
title: Binpack schedule policy
---

- To allocate metax device with mininum damage to topology, you need to only assign `metax-tech.com/gpu` with annotations `hami.io/node-scheduler-policy: "binpack"`.
+ To allocate a metax device with minimum damage to topology, you only need to assign `metax-tech.com/gpu` with the annotation `hami.io/node-scheduler-policy: "binpack"`.

```yaml
metadata:
@@ -29,7 +29,7 @@ title: Enable Mthreads GPU sharing

## Enabling GPU-sharing Support

- * Deploy MT-CloudNative Toolkit on mthreads nodes (Please consult your device provider to aquire its package and document)
+ * Deploy MT-CloudNative Toolkit on mthreads nodes (Please consult your device provider to acquire its package and document)

> **NOTICE:** *You can remove mt-mutating-webhook and mt-gpu-scheduler after installation(optional).*

@@ -24,4 +24,4 @@ spec:
nvidia.com/gpu: 2 # requesting 2 vGPUs
```

- > **NOTICE:** * You can assign this task to multiple GPU types, use comma to seperate,In this example, we want to run this job on A100 or V100*
+ > **NOTICE:** *You can assign this task to multiple GPU types, use comma to separate. In this example, we want to run this job on A100 or V100*
4 changes: 2 additions & 2 deletions docs/userguide/NVIDIA-device/specify-device-type-to-use.md
@@ -11,13 +11,13 @@ For example, a task with the following annotation will be assigned to A100 or V1
```yaml
metadata:
annotations:
- nvidia.com/use-gputype: "A100,V100" # Specify the card type for this job, use comma to seperate, will not launch job on non-specified card
+ nvidia.com/use-gputype: "A100,V100" # Specify the card type for this job, use comma to separate, will not launch job on non-specified card
```

A task may use `nvidia.com/nouse-gputype` to avoid certain GPU types. In the following example, the job won't be assigned to a 1080 (including 1080Ti) or 2080 (including 2080Ti) card.

```yaml
metadata:
annotations:
- nvidia.com/nouse-gputype: "1080,2080" # Specify the blacklist card type for this job, use comma to seperate, will not launch job on specified card
+ nvidia.com/nouse-gputype: "1080,2080" # Specify the blacklist card type for this job, use comma to separate, will not launch job on specified card
```
4 changes: 2 additions & 2 deletions docs/userguide/monitoring/real-time-device-usage.md
@@ -14,9 +14,9 @@ It contains the following metrics:

| Metrics | Description | Example |
|----------|-------------|---------|
- | Device_memory_desc_of_container | Container device meory real-time usage | `{context="0",ctrname="2-1-3-pod-1",data="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",module="0",offset="0",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 0 |
+ | Device_memory_desc_of_container | Container device memory real-time usage | `{context="0",ctrname="2-1-3-pod-1",data="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",module="0",offset="0",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 0 |
| Device_utilization_desc_of_container | Container device real-time utilization | `{ctrname="2-1-3-pod-1",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 0 |
| HostCoreUtilization | GPU real-time utilization on host | `{deviceidx="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",zone="vGPU"}` 0 |
| HostGPUMemoryUsage | GPU real-time device memory usage on host | `{deviceidx="0",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",zone="vGPU"}` 2.87244288e+08 |
| vGPU_device_memory_limit_in_bytes | device limit for a certain container | `{ctrname="2-1-3-pod-1",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 2.62144e+09 |
- | vGPU_device_memory_usage_in_bytes | device usage for a certain container | `{ctrname="2-1-3-pod-1",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 0 |
\ No newline at end of file
+ | vGPU_device_memory_usage_in_bytes | device usage for a certain container | `{ctrname="2-1-3-pod-1",deviceuuid="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",podname="2-1-3-pod-1",podnamespace="default",vdeviceid="0",zone="vGPU"}` 0 |
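For reference, a minimal sketch of how a gauge shaped like `vGPU_device_memory_limit_in_bytes` could be declared and served with the Prometheus Go client; the metric and label names come from the table above, while the wiring and the port are hypothetical:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Gauge shaped like the vGPU_device_memory_limit_in_bytes row in the table;
// metric and label names are copied from there, everything else is a sketch.
var vgpuMemLimit = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "vGPU_device_memory_limit_in_bytes",
		Help: "device limit for a certain container",
	},
	[]string{"ctrname", "deviceuuid", "podname", "podnamespace", "vdeviceid", "zone"},
)

func main() {
	prometheus.MustRegister(vgpuMemLimit)
	// Sample value taken from the example column above.
	vgpuMemLimit.WithLabelValues(
		"2-1-3-pod-1", "GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec",
		"2-1-3-pod-1", "default", "0", "vGPU",
	).Set(2.62144e+09)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9394", nil) // port is a placeholder, not HAMi's
}
```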
@@ -4,7 +4,7 @@ title: Exclusive gpu usage

## Job description

- To allocate an exlusive GPU, you need only assign `volcano.sh/vgpu-number` without any other `volcano.sh/xxx` fields, as the example below:
+ To allocate an exclusive GPU, you need only assign `volcano.sh/vgpu-number` without any other `volcano.sh/xxx` fields, as the example below:

```yaml
apiVersion: v1
@@ -113,7 +113,7 @@ spec:
resources:
limits:
volcano.sh/vgpu-number: 2 # requesting 2 gpu cards
- volcano.sh/vgpu-memory: 3000 # (optinal)each vGPU uses 3G device memory
+ volcano.sh/vgpu-memory: 3000 # (optional) each vGPU uses 3G device memory
volcano.sh/vgpu-cores: 50 # (optional) each vGPU uses 50% core
EOF
```