Q&A: How to Troubleshoot Duplicate Data in Elasticsearch #24226
-
Question

We deployed multiple Vector instances in the aggregator pattern on Kubernetes, using a single consumer group to consume from multiple Kafka topics before writing the data to Elasticsearch. However, we occasionally observe duplicate entries in Elasticsearch (sometimes two copies, sometimes three), all sharing identical Kafka offsets. The logs only show Elasticsearch request timeouts. I'm unsure whether this stems from duplicate Kafka consumption or from Elasticsearch request retries.

Vector Config

```yaml
image:
replicas: 3
logLevel: "info"
customConfig:
```

Vector Logs

```
2025-11-13T06:17:49.557448Z WARN sink{component_kind="sink" component_id=es component_type=elasticsearch}:request{request_id=80870}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing …
```
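For context, the general shape of such an aggregator pipeline inside `customConfig` is sketched below. This is not the original configuration (which is truncated above): the broker address, topic names, consumer group id, and Elasticsearch endpoint are all placeholders; only the sink id `es` is taken from the log line.

```yaml
sources:
  kafka_in:
    type: kafka
    bootstrap_servers: "kafka-broker:9092"  # placeholder broker address
    group_id: "vector-aggregator"           # the single shared consumer group
    topics:
      - "topic-a"                           # placeholder topic names
      - "topic-b"

sinks:
  es:
    type: elasticsearch
    inputs:
      - kafka_in
    endpoints:
      - "http://elasticsearch:9200"         # placeholder endpoint
```

In this topology, Kafka assigns each partition to exactly one member of the consumer group at a time, so duplicates with identical offsets alongside timeout warnings point more toward the sink retrying a bulk request that actually succeeded, though a consumer-group rebalance that replays uncommitted offsets could produce the same symptom.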
-
Hey, I guess you can use the dedupe transform for this: identify the field that holds the unique value and, based on that field, eliminate the duplicate logs.
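A minimal sketch of that suggestion, assuming the unique value lives in a field named `event_id` and events arrive from a Kafka source with the id `kafka_in` (both names are hypothetical):

```yaml
transforms:
  dedupe_logs:
    type: dedupe
    inputs:
      - kafka_in            # hypothetical upstream component id
    cache:
      num_events: 5000      # per-instance in-memory LRU cache; 5000 is Vector's default
    fields:
      match:
        - "event_id"        # hypothetical field holding the unique value

sinks:
  es:
    type: elasticsearch
    inputs:
      - dedupe_logs         # route the sink through the transform
    endpoints:
      - "http://elasticsearch:9200"   # placeholder endpoint
```

One caveat: the dedupe transform keeps its cache in memory, per Vector instance, so with `replicas: 3` a duplicate that lands on a different replica than the original will pass through. If the duplicates come from sink retries, setting `id_key` on the elasticsearch sink to that same unique field is a complementary option, since retried writes then target the same document `_id` instead of creating new documents.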