Conversation

@yetanothertw commented Dec 30, 2025

Part of #4117

Summary

Generative AI disclosure

  1. Did you use a generative AI (GenAI) tool to assist in creating this contribution?
  • Yes
  • No

@yetanothertw self-assigned this Dec 30, 2025
@yetanothertw added the `documentation` and `Team:Admin` labels Dec 30, 2025
github-actions bot commented Dec 30, 2025

Vale Linting Results

Summary: 1 warning, 2 suggestions found

⚠️ Warnings (1)

| File | Line | Rule | Message |
| --- | --- | --- | --- |
| troubleshoot/elasticsearch/increase-tier-capacity.md | 51 | Elastic.DontUse | Don't use 'Note that'. |

💡 Suggestions (2)

| File | Line | Rule | Message |
| --- | --- | --- | --- |
| troubleshoot/elasticsearch/increase-tier-capacity.md | 51 | Elastic.Wordiness | Consider using 'to' instead of 'in order to'. |
| troubleshoot/elasticsearch/increase-tier-capacity.md | 67 | Elastic.Wordiness | Consider using 'impossible' instead of 'not possible'. |

github-actions bot commented Dec 30, 2025


::::::{tab-item} {{ech}}
In order to increase the disk capacity of the data nodes in your cluster:
::::::{applies-item} { ess: }
Collaborator commented:

these steps work for ech and ece (tweaking the first steps - you can use these)

the autoscaling UI has also since changed to a multiselect dropdown (both envs) - screenshots should be removed:

[screenshots of the updated autoscaling multiselect dropdown]

not sure about the limit reached stuff. assume it's right?


::::::{tab-item} Self-managed
In order to increase the data node capacity in your cluster, you will need to calculate the amount of extra disk space needed.
::::::{applies-item} { self: }
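
(For context on the "calculate the amount of extra disk space needed" step — not part of the quoted hunk — per-node disk usage can be checked with the standard cat allocation API. A minimal sketch; the column list is just an example:)

```console
GET _cat/allocation?v=true&h=node,disk.indices,disk.used,disk.avail,disk.percent
```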
Collaborator commented:

missing ECK steps.

cursor tells me that the steps for ECK are different ... would be good to get @eedugon's confirmation here

::::::{tab-item} {{eck}}
In order to increase the disk capacity of data nodes in your {{eck}} cluster, you can either add more data nodes or increase the storage size of existing nodes.

**Option 1: Add more data nodes**

Update the `count` field in your data node NodeSet to add more nodes:

```yaml subs=true
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {{version.stack}}
  nodeSets:
  - name: data-nodes
    count: 5  # Increase from previous count
    config:
      node.roles: ["data"]
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
```

Apply the changes:

```sh
kubectl apply -f your-elasticsearch-manifest.yaml
```

ECK will automatically create the new nodes and {{es}} will relocate shards to balance the load. You can monitor the progress using:

```console
GET /_cat/shards?v&h=state,node&s=state
```

**Option 2: Increase storage size of existing nodes**

If your storage class supports [volume expansion](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#expanding-persistent-volumes-claims), you can increase the storage size in the `volumeClaimTemplates`:

```yaml subs=true
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {{version.stack}}
  nodeSets:
  - name: data-nodes
    count: 3
    config:
      node.roles: ["data"]
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi  # Increased from previous size
```

Apply the changes. If the volume driver supports `ExpandInUsePersistentVolumes`, the filesystem will be resized online without restarting {{es}}. Otherwise, you may need to manually delete the Pods after the resize so they can be recreated with the expanded filesystem.

For more information, see [Updating deployments](/deploy-manage/deploy/cloud-on-k8s/update-deployments.md) and [Volume claim templates](/deploy-manage/deploy/cloud-on-k8s/volume-claim-templates.md).
::::::
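
Regarding the manual Pod deletion mentioned in Option 2 above, a minimal sketch (assuming the `quickstart` cluster and `data-nodes` NodeSet names from the example manifests; ECK names Pods `<cluster>-es-<nodeSet>-<ordinal>`):

```sh
# Check that the PVCs report the new size
kubectl get pvc -l elasticsearch.k8s.elastic.co/cluster-name=quickstart

# Delete Pods one at a time so the StatefulSet recreates them with the
# expanded filesystem; wait for the cluster to return to green in between
kubectl delete pod quickstart-es-data-nodes-0
```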

Contributor commented:

@shainaraskas, the previous look good to me!

We could link to our official doc about volume claim templates for ECK and volume expansion: https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s/volume-claim-templates#k8s-volume-claim-templates-update


::::::{tab-item} {{ech}}
In order to get the shards assigned we’ll need to increase the number of shards that can be collocated on a node in the cluster. We’ll achieve this by inspecting the system-wide `cluster.routing.allocation.total_shards_per_node` [cluster setting](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-get-settings) and increasing the configured value.
::::::{applies-item} { ess: }
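
(For reference, and not part of the quoted hunk: one way to inspect whether and how this setting is configured is the cluster settings API, then look for `cluster.routing.allocation.total_shards_per_node` in the response:)

```console
GET _cluster/settings?include_defaults=true&flat_settings=true
```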
Collaborator commented:

this setting is invalid in ESS

[screenshot showing the setting being rejected as invalid in ESS]

technically these steps work but only because they're being set in an invalid way

@eedugon would we expect people to ever work around non-whitelisted settings in this way?

regardless, this is another case where the ech and self-managed instructions are very similar. the difference between them raises a red flag for me - you can still add nodes in ECH, so checking the target tier and scaling up that tier should also be done before increasing the total number of shards per node. this is the same fix that is causing us grief over here

@eedugon commented Dec 31, 2025

@shainaraskas: `cluster.routing.allocation.total_shards_per_node` is a dynamic setting. When needed, it's recommended to set it through the cluster settings API rather than defining it statically in `elasticsearch.yml` (as that would require a rolling restart of all nodes).

So, in ECH, even if the setting is not whitelisted I don't think it's set in an invalid way when set through the cluster settings API.

Anyway, keep this in mind, as I think it's related to the existence of this document:

In the past, the default of that setting was 1000, and the most common reason to need it was as a temporary measure to allow an unexpected number of shards to be allocated. That's why this document was super useful.
Currently, that setting defaults to no limit, so it probably won't be needed anymore, except if a user wants to keep the number of shards under strict control.

> would we expect people to ever work around non-whitelisted settings in this way?

IMO, if the setting is dynamic and there are legitimate use cases for it, I'd say yes, without needing to whitelist it at the node config level. But it's just my opinion.

Anyway, this document probably won't be as useful as it was in the past, considering that today `cluster.routing.allocation.total_shards_per_node` does not have a limit by default.

Contributor commented:

Of course, if we still want to document this for ECH, we need to ensure the reader doesn't try to configure `cluster.routing.allocation.total_shards_per_node` as a user setting (because it's not whitelisted); they should do it with the cluster settings API.

Contributor commented:

Maybe (final thought here) we can rewrite the introduction a bit so users understand that there's a dynamic cluster setting (`cluster.routing.allocation.total_shards_per_node`) that sets the maximum number of shards a node can handle. In older versions that maximum defaulted to 1000, and that could cause the error `Total number of shards per node has been reached`.

If that setting is configured, the user might need to increase it if they have exceeded the number of shards on any of the nodes.

And the instructions to set it... I'd say they are the same regardless of the deployment type (I'd only suggest the dynamic way in this troubleshooting document).
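
For example, the dynamic way would look something like this (a sketch; 1200 is just a placeholder value):

```console
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.total_shards_per_node": 1200
  }
}
```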

Collaborator commented:

same issue on this page

To accomplish this, complete the following steps:

:::::::{tab-set}
:::::::{applies-switch}
Collaborator commented:

if the applies to says stack, we need procedures for all 4 deployment types. likely this will be a split between ECE/ECH and self/ECK, but we would need to verify in more detail


eedugon commented Dec 31, 2025

I have a comment / question about the self-managed tab, @shainaraskas and @yetanothertw, on this doc: https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/4473/troubleshoot/elasticsearch/increase-capacity-data-node

Don't you think the self-managed tab is not really about self-managed? If you read through it you will see that it first says:

> To increase the data node capacity in your cluster, you need to calculate the amount of extra disk space needed.

^^ That's regardless of the deployment type! Calculating or knowing how much extra disk you need (and which tier needs disk) is 1) not strictly related to "adding more disk", and 2) independent of the deployment type.

The only valid payload of that section is at the end of step 3; look at the two bullets:

[screenshot of the two bullets at the end of step 3]

And step 4 is not really a next step; it's an informational comment about adding more nodes (and again, valid for all deployment types!).

In short: the self-managed instructions say almost nothing in reality (I'm not saying there's much to say, but the content feels weird, as the majority of it is not really exclusive to self-managed).

My proposal is:

  • Have a common introduction for all deployment types, saying that:
    • It's a good practice to check and calculate which exact nodes are struggling with disk and the amount of extra space you need or want (and we can offer the steps mentioned previously, or a shorter summary)
    • Explain that the most common ways to add more disk are:
      • Add more nodes to the cluster (to the data tier that might be short of disk)
      • Expand the disks of the nodes (change or replace your nodes with bigger ones).

And then we can offer the instructions to execute the previous tasks in all deployment types, such as:

  • ECE or ECH: use autoscaling, add more capacity, or even change the hardware profile of the deployment.
  • ECK: as @shainaraskas mentioned
  • Self-managed: not much to say. If the user wants to add nodes, we can link to the install instructions. If they want to expand disks, they should know how to do it, as we don't offer specific instructions; it depends on the type of storage used, operating system, etc. It's not really related to Elastic.
