Hi @johnhtodd, I would recommend writing an RFC before we go ahead with any implementation.
This is very interesting and we don't offer any flexibility in this area. Today, we build all or nothing. I could imagine a new synchronization mechanism where we tell Vector to wait until certain conditions are met but there are several questions here:
We now use memory enrichment tables extensively. They're really useful, and I think they are a not-well-understood feature of Vector with amazing functionality. We also use the sink-based storage function for memory tables, periodically storing the contents of those memory enrichment tables to disk. Why? Because we reload them when Vector restarts, so we don't have to generate the whole table again. Basically, we're saving the state that determines how to handle or modify objects passing through Vector, so if a system crashes (or comes up as "new") we can recreate the memory table with reasonable confidence that it is close to what it was when the system went away.
But there is a problem: loading these disk files of saved state into the memory tables takes time. Sometimes it takes a few seconds; in the worst cases it takes many minutes (we have some tables with hundreds of millions of records). But our other sources (file or Kafka sources in our case) start up immediately, so objects are already flowing through Vector while the memory tables are still only partially loaded, and we cannot yet apply the correct actions to those objects. We end up processing events with "rules" that aren't complete, which leads to very unpredictable outcomes that are often wildly incorrect.
Currently, we use some really ugly hacks to work around this, such as blocking Kafka origins with packet filters, or causing other transient faults that we manually clear once the state tables are loaded and the other sources can start feeding data through the system.
My memory table example is just one reason that one might want to delay or halt ingestion of a source. I expect others have different but equally valid reasons that some sources should be delayed or ignored.
I would propose some method by which sources could be delayed at startup, so that sources could be activated in an orderly, well-understood sequence rather than all at once. The option would be available to every source as part of its general configuration parameters.
The most trivial way of doing this would be a seconds-based timer: the source simply waits until the timer expires before it begins processing.
A much more flexible way to do this would be the ability to run a small VRL script at a regular interval (specifiable in the source config?) that produces a "true" or "false" outcome. The source would be enabled or disabled based on this outcome. We already use a variant of this logic in the user authentication model of the websocket sink (and other places?), where standalone VRL is used to generate a true/false outcome. Once the internal metrics are visible via VRL (this PR: #23430), it would be easy to examine the uptime of Vector itself and emulate the "delay for X seconds" model that I describe as the "trivial" solution above. But a programmatic method would open the door to much more sophisticated ways of evaluating and activating/deactivating sources. This is perhaps parallel to the "backpressure" model that is already used, but it would be much simpler and would not have buffering issues. It is not a replacement for backpressure, but an adjunct to it. Sources would just be turned off until the VRL evaluated to "true".
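To make the two options concrete, here is a rough sketch of the gating logic in Python. It is not Vector code and not VRL; `SourceGate`, `uptime_at_least`, and the `interval_secs` parameter are hypothetical names invented for illustration. It only shows that a condition re-evaluated at an interval can switch a source on and off, and that the "delay for X seconds" timer falls out as one particular condition.

```python
import time
from typing import Callable


class SourceGate:
    """Hypothetical gate: re-evaluates a boolean condition at a fixed interval.

    Between evaluations the last result is cached, so the condition
    (standing in for a small VRL script) is not run for every event.
    """

    def __init__(
        self,
        condition: Callable[[], bool],
        interval_secs: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._condition = condition
        self._interval = interval_secs
        self._clock = clock
        self._enabled = False             # sources start disabled
        self._next_check = clock()        # evaluate on first use

    def is_open(self) -> bool:
        """Return True if the source may currently ingest data."""
        now = self._clock()
        if now >= self._next_check:
            self._enabled = bool(self._condition())
            self._next_check = now + self._interval
        return self._enabled


def uptime_at_least(
    seconds: float, clock: Callable[[], float] = time.monotonic
) -> Callable[[], bool]:
    """The 'trivial' timer expressed as a condition on process uptime."""
    started = clock()
    return lambda: clock() - started >= seconds


# Assumed usage: hold a source closed until a table has finished loading.
table_state = {"loaded": False}           # stand-in for real table status
table_gate = SourceGate(lambda: table_state["loaded"], interval_secs=5.0)
```

A source loop would check `gate.is_open()` before pulling from its input and skip or sleep while the gate is closed. Because the condition is re-evaluated on every interval, the same gate can also close a source again later, which a one-shot startup timer cannot do.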
I'd like to get opinions on which of these methods makes sense, or whether neither does, or whether there are other models that already exist which I haven't thought of and which would be a better solution.