Concept for patch: source startup delay #23775

johnhtodd · 2025-09-14T16:44:47Z

We are now using memory enrichment tables extensively. They're really useful, and I think they are a not-well-understood feature of Vector with amazing functionality. We're also using the sink-based storage function for memory tables: periodically storing the contents of those memory enrichment tables to disk. Why? Because we reload them when Vector restarts, so we don't have to generate the whole table again. Basically, we're saving the state of how to deal with or modify objects passing through Vector, so if a system crashes (or comes up as "new") we can recreate the memory table with some reasonable belief that it is close to what it was when the system went away.

But there is a problem: loading these disk files of saved state into the memory tables takes time. Sometimes it takes a few seconds, and in the worst cases it takes many minutes (we have some tables with hundreds of millions of records). But our other sources (file or Kafka sources in our case) start up immediately, so we have objects flowing through Vector while we have an incomplete ability to apply the correct actions to them, because the memory tables aren't full yet. We're processing events with "rules" that aren't complete, which leads to unpredictable outcomes that are often wildly incorrect.

Currently, we use some really ugly hacks to work around this, such as blocking Kafka origins with packet filters, or causing other transient faults that we manually clear once the state tables are ready for the other sources to start feeding data through the system.

My memory table example is just one reason that one might want to delay or halt ingestion of a source. I expect others have different but equally valid reasons that some sources should be delayed or ignored.

I would propose some method by which sources could be delayed upon startup, so that sources could be activated in an orderly, well-understood sequence rather than all at once. This would be an option available to all sources in their general configuration parameters.

The most trivial way of doing this would be a seconds-based timer that simply waited until expiry to begin processing the source.
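As a rough illustration of the timer idea, here is a minimal Python sketch. It is a stand-in for the behavior only, not Vector code: `delayed_source` and `startup_delay_secs` are hypothetical names, and the clock and sleep functions are injectable purely so the logic can be exercised without real waiting.

```python
import time
from typing import Callable, Iterable, Iterator


def delayed_source(
    events: Iterable[str],
    startup_delay_secs: float,  # hypothetical option name, not a Vector setting
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Return an iterator that yields events only after the delay expires."""
    # The deadline is fixed when the source is constructed ("startup"),
    # not when the first event is requested.
    deadline = clock() + startup_delay_secs

    def _gen() -> Iterator[str]:
        remaining = deadline - clock()
        while remaining > 0:
            sleep(remaining)
            remaining = deadline - clock()
        # Nothing is pulled from `events` before this point, so the
        # upstream (a file, a Kafka consumer, ...) is untouched while waiting.
        yield from events

    return _gen()
```

The point of the sketch is that the wait happens before anything is read from the upstream, so no events need to be buffered while the tables load.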

A much more flexible way to do this would be the ability to run a small VRL script at a regular interval (specifiable in the source config?) that produces a "true" or "false" outcome. The source would be enabled or disabled based on this outcome. We already use a variant of this logic in the user authentication model in the websocket sink (and other places?), where standalone VRL is used to generate a true/false outcome. Once the internal metrics are visible via VRL (this PR: #23430), it would be easy to examine the uptime of Vector itself and emulate the "delay for X seconds" model that I suggest as the "trivial" solution above. But a programmatic method would open the door to much more sophisticated methods of evaluating and activating/deactivating sources. This is perhaps parallel to the "backpressure" model that is already used, but it would be much simpler and would not have buffering issues. It is not a replacement for backpressure, but an adjunct to it. Sources would just be turned off until the VRL evaluated to "true".
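To make the condition-based idea concrete, here is a small Python sketch of a gate that re-evaluates a true/false condition at most once per interval. It is only a model of the proposed behavior: `SourceGate` and `evaluation_interval_secs` are hypothetical names, a Python callable stands in for the VRL program, and nothing here reflects an existing Vector API.

```python
import time
from typing import Callable


class SourceGate:
    """Cache a true/false condition and re-evaluate it once per interval."""

    def __init__(
        self,
        condition: Callable[[], bool],      # stand-in for the VRL program
        evaluation_interval_secs: float,    # hypothetical option name
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._condition = condition
        self._interval = evaluation_interval_secs
        self._clock = clock
        self._next_check = clock()  # the first call evaluates immediately
        self._open = False          # sources start out disabled

    def is_open(self) -> bool:
        """Return whether the source may currently read events."""
        now = self._clock()
        if now >= self._next_check:
            self._open = bool(self._condition())
            self._next_check = now + self._interval
        return self._open


# The "trivial" timer expressed as a condition on uptime: open after 60 seconds.
_started = time.monotonic()
uptime_gate = SourceGate(
    lambda: time.monotonic() - _started >= 60,
    evaluation_interval_secs=5,
)
```

Because the condition is re-evaluated on every interval, the same gate can also close again later, which is what distinguishes it from a one-shot startup timer.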

I'd like to get any opinions on which of these methods makes sense, or perhaps neither, or perhaps there are other models that already exist that I haven't thought of and that are a better solution.

pront (Maintainer) · 2025-09-15T18:41:41Z

Hi @johnhtodd, I would recommend writing an RFC before we go ahead with any implementation.

> But a programmatic method would open the door to much more sophisticated methods of evaluating and activating/deactivating sources.

This is very interesting, and we don't currently offer any flexibility in this area. Today, we build all or nothing. I could imagine a new synchronization mechanism where we tell Vector to wait until certain conditions are met, but there are several questions here:

- how can users define conditions? VRL programs?
- how do we observe these conditions? is it automatic or user sends a signal to the Vector process?
- to evaluate the condition, do we need to run Vector to get internal telemetry and/or Vector components?
  - This implies that we have to run a pipeline before and then another pipeline after. Could we have a Vector (prepare pipeline) to Vector (actual pipeline) setup?

5 replies

johnhtodd Sep 15, 2025
Author

> Hi @johnhtodd, I would recommend writing an RFC before we go ahead with any implementation.

Understood.

> how can users define conditions? VRL programs?

Yes, I'd say VRL is the thing that makes the most sense since it's already been done in at least one other area.

> how do we observe these conditions? is it automatic or user sends a signal to the Vector process?

I'd say it's automatic, but now that we have the concept of memory tables it may be possible to signal things through those. In effect: cross-event signaling.

> to evaluate the condition, do we need to run Vector to get internal telemetry and/or Vector components?

I would suggest that the VRL be left up to the user entirely, with a 'true' or 'false' outcome. Triggering the evaluation on some timed basis would be the trick. That could either be done as part of this new model ("evaluation_interval: 5" to run the VRL every 5 seconds?), or with something like the really ugly hacks I have done with "demo_logs" to fire a timed event that I need on a frequent basis. As long as shared memory enrichment tables are available in the VRL environment that is doing the evaluation, there are many different ways to collect and interpret signals across components.
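One way to picture signaling through a shared table is a readiness marker that the loader writes last and the gating condition reads. The Python sketch below uses a plain dict as a stand-in for a memory enrichment table; the `__ready__` key and both function names are hypothetical, and none of this is Vector's or VRL's actual API.

```python
from typing import Dict, Iterable, Tuple

# Stand-in for a shared memory enrichment table (a plain dict, not Vector's API).
MemoryTable = Dict[str, object]

READY_KEY = "__ready__"  # hypothetical marker key


def restore_table(table: MemoryTable, saved_rows: Iterable[Tuple[str, object]]) -> None:
    """Reload saved state, then publish the readiness marker as the last step."""
    for key, value in saved_rows:
        table[key] = value
    # Written only after every row is in place, so a reader never sees
    # "ready" on a partially loaded table.
    table[READY_KEY] = True


def table_ready(table: MemoryTable) -> bool:
    """The true/false condition a gated source would evaluate."""
    return table.get(READY_KEY) is True
```

A source gated on `table_ready` would stay off for exactly as long as the reload takes, whether that is seconds or many minutes, without anyone having to guess a safe delay.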

The most trivial approach I can see would rely on Vector's internal telemetry being visible somehow within VRL (see PR 23430), using the "uptime_seconds" timer as the gate for allowing a source to be activated. However, it could be much more sophisticated, such as looking at the "utilization" values, the time of day, or anything else.

> This implies that we have to run a pipeline before and then another pipeline after. Could we have a Vector (prepare pipeline) to Vector (actual pipeline) setup?

I'm not quite sure what you mean here.

pront Sep 16, 2025
Maintainer

Most of the above makes sense to me.

> I would suggest that the VRL be left up to the user entirely, with a 'true' or 'false' outcome.

> I'm not quite sure what you mean here.

Do you want to run a Vector pipeline to pass events to the VRL conditions? Or what data will the VRL conditions depend on? Purely outside of Vector and available at startup?

johnhtodd Sep 16, 2025
Author

> Do you want to run a Vector pipeline to pass events to the VRL conditions? Or what data will the VRL conditions depend on? Purely outside of Vector and available at startup?

I don't think it makes sense to pass events; that seems overly complex and sort of not consistent with the way it's been done before. If there is the ability to connect to and observe things on memory enrichment tables, and also to access vector metrics (per the PR that is pending) then for me that seems sufficient.

pront Sep 16, 2025
Maintainer

> I don't think it makes sense to pass events; that seems overly complex and sort of not consistent with the way it's been done before.

We are discussing a new feature so not sure what you mean by "done before" in this context.

> If there is the ability to connect to and observe things on memory enrichment tables, and also to access vector metrics (per the PR that is pending) then for me that seems sufficient.

Let's wait for the RFC and discuss the details there. I expect some problems will become more obvious as we dive deeper.

johnhtodd Sep 16, 2025
Author

> > I don't think it makes sense to pass events; that seems overly complex and sort of not consistent with the way it's been done before.
>
> We are discussing a new feature so not sure what you mean by "done before" in this context.

Sorry, I wasn't clear. I meant that using VRL in a semi-isolated way (not taking a steady stream of events) to output true/false decisions has already been done in the websocket authentication model.

> > If there is the ability to connect to and observe things on memory enrichment tables, and also to access vector metrics (per the PR that is pending) then for me that seems sufficient.
>
> Let's wait for the RFC and discuss the details there. I expect some problems will become more obvious as we dive deeper.

OK. It will take me a while on this.
