Description
Hi, I'm having a problem with retrieving data for the Console, Scalars, and Debug Samples sections.
The error is the following:
I upgraded clearml-server using the guide, but the error is the same.
I then inspected the logs of all the Docker containers.
In the Elasticsearch container I found this error:
{"@timestamp":"2025-10-24T15:06:02.310Z", "log.level": "WARN", "message":"path: /events-log-d1bd92a3b039400cbafc60a7a5b1e52b/_search, params: {index=events-log-d1bd92a3b039400cbafc60a7a5b1e52b}, status: 503", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[clearml][transport_worker][T#7]","log.logger":"rest.suppressed","elasticsearch.cluster.uuid":"OrRFbiRWSouzPWdpjCwrQg","elasticsearch.node.id":"_aiygnWISgetdkhPNK2w9A","elasticsearch.node.name":"clearml","elasticsearch.cluster.name":"clearml","error.type":"org.elasticsearch.action.search.SearchPhaseExecutionException","error.message":"all shards failed","error.stack_trace":"Failed to execute phase [query], all shards failed; shardFailures {[_na_][events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]: org.elasticsearch.action.
In the apiserver container I found this:
[2025-10-27 08:30:02,000] [9] [ERROR] [clearml.queue_metrics] Failed collecting queue metrics: ApiError(503, 'unavailable_shards_exception', '[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-10][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-10][ZETIJJoBxdNb2Or0-MCv], source[{"timestamp":1761553561994,"queue":"9b1b026812214d69906dfb7c365d4b1c","average_waiting_time":0,"queue_length":0}]}]]')
Traceback (most recent call last):
  File "/opt/clearml/apiserver/bll/queue/queue_metrics.py", line 320, in start
    queue_metrics.log_queue_metrics_to_es(queue.company, [queue])
  File "/opt/clearml/apiserver/bll/queue/queue_metrics.py", line 83, in log_queue_metrics_to_es
    self.es.index(index=es_index, document=queue_doc)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/_sync/client/utils.py", line 455, in wrapped
    return api(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/_sync/client/__init__.py", line 2470, in index
    return self.perform_request(  # type: ignore[return-value]
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
    response = self._perform_request(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/_sync/client/_base.py", line 352, in _perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.ApiError: ApiError(503, 'unavailable_shards_exception', '[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-10][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-10][0]] containing [index {[queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2025-10][ZETIJJoBxdNb2Or0-MCv], source[{"timestamp":1761553561994,"queue":"9b1b026812214d69906dfb7c365d4b1c","average_waiting_time":0,"queue_length":0}]}]]')
[2025-10-27 08:30:07,241] [9] [WARNING] [elastic_transport.node_pool] Node <Urllib3HttpNode(http://elasticsearch:9200)> has failed for 2 times in a row, putting on 2 second timeout
[2025-10-27 08:30:07,242] [9] [WARNING] [elastic_transport.transport] Retrying request after non-successful status 503 (attempt 1 of 3)
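
Both logs point at shards that are not active ("all shards failed", "primary shard is not active"), so it may help to list which shards Elasticsearch considers inactive. Below is a minimal sketch of how that could be checked with the standard `_cluster/health` and `_cat/shards` endpoints. The `ES_URL` value and the helper names (`fetch_json`, `inactive_shards`, `print_inactive_shards`) are my own assumptions for illustration and are not part of ClearML; the logs only show the in-network address `http://elasticsearch:9200`, so the URL has to be adjusted to wherever Elasticsearch is reachable.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable here. Inside the compose network the
# logs above show it as http://elasticsearch:9200 instead.
ES_URL = "http://localhost:9200"


def fetch_json(path, base_url=ES_URL):
    """GET a JSON response from the Elasticsearch REST API."""
    with urllib.request.urlopen(base_url + path, timeout=10) as resp:
        return json.load(resp)


def inactive_shards(rows):
    """Keep only the _cat/shards rows whose state is not STARTED."""
    return [row for row in rows if row.get("state") != "STARTED"]


def print_inactive_shards(base_url=ES_URL):
    """Print the cluster status and every shard that is not started."""
    health = fetch_json("/_cluster/health", base_url)
    print("cluster status:", health["status"])

    # Ask _cat/shards for JSON output with an explicit column list.
    rows = fetch_json(
        "/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason",
        base_url,
    )
    for row in inactive_shards(rows):
        print(
            row["index"],
            row["shard"],
            row["prirep"],
            row["state"],
            row.get("unassigned.reason"),
        )
```

Calling `print_inactive_shards()` should show whether the `events-log-*` and `queue_metrics_*` indices from the logs have unassigned primaries. If they do, `GET /_cluster/allocation/explain` is the usual next step for finding out why a shard cannot be allocated.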