From 810e70b3a0521dceb0306e08c0ea73835f3ea94d Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Fri, 24 Oct 2025 00:56:47 +0200 Subject: [PATCH 1/4] Long-term store: Refurbish section --- .../airflow/data-retention-hot-cold.md | 2 +- docs/solution/index.md | 17 ++- docs/solution/longterm/index.md | 103 ++++++++++++++++ docs/solution/time-series/index.md | 17 +-- docs/solution/time-series/longterm.md | 113 ------------------ docs/start/application/index.md | 33 ++++- 6 files changed, 153 insertions(+), 132 deletions(-) create mode 100644 docs/solution/longterm/index.md delete mode 100644 docs/solution/time-series/longterm.md diff --git a/docs/integrate/airflow/data-retention-hot-cold.md b/docs/integrate/airflow/data-retention-hot-cold.md index cddf6a4d..f6dc9bfd 100644 --- a/docs/integrate/airflow/data-retention-hot-cold.md +++ b/docs/integrate/airflow/data-retention-hot-cold.md @@ -1,5 +1,5 @@ (airflow-data-retention-hot-cold)= -# Build a hot and cold storage data retention policy in CrateDB with Apache Airflow +# Build a hot/cold storage data retention policy in CrateDB with Apache Airflow :::{article-info} --- diff --git a/docs/solution/index.md b/docs/solution/index.md index c2978188..b2a823cd 100644 --- a/docs/solution/index.md +++ b/docs/solution/index.md @@ -7,6 +7,7 @@ :hidden: time-series/index industrial/index +longterm/index analytics/index machine-learning/index ::: @@ -15,7 +16,7 @@ machine-learning/index ## Explanations :::{div} sd-text-muted -About time series data storage and analytics, and machine learning. +About time series and long-term data storage, real-time analytics, and machine learning. ::: ::::{grid} 1 2 2 2 @@ -36,6 +37,20 @@ and how to apply time series modeling and analysis procedures to your data. 
- Scientific computing ::: +:::{grid-item-card} {material-outlined}`manage_history;2em` Long-term store +:link: longterm-store +:link-type: ref +:link-alt: About storing time series data for the long term +Permanently keeping your raw data accessible for querying yields insightful +analysis opportunities other systems can't provide easily. ++++ +**What's inside:** +- Time-based bucketing. +- Advanced querying. +- Import data using Dask. +- Optimizing storage for historic time series data. +::: + :::{grid-item-card} {material-outlined}`model_training;2em` Machine learning :link: machine-learning :link-type: ref diff --git a/docs/solution/longterm/index.md b/docs/solution/longterm/index.md new file mode 100644 index 00000000..f5108a7f --- /dev/null +++ b/docs/solution/longterm/index.md @@ -0,0 +1,103 @@ +(longterm-store)= +(timeseries-longterm)= +(timeseries-long-term-storage)= + +# Long-term store + +:::{div} sd-text-muted +Never retire data just because your other systems can't handle the cardinality. +::: + +CrateDB stores large volumes of data, keeping it accessible for querying +and insightful analysis, even considering historic data records. + +Many organizations need to retain data for years or decades to meet regulatory +requirements, support historical analysis, or preserve valuable insights for +future use. However, traditional storage systems force you to choose between +accessibility and affordability, often leading to data exports, archival +systems, or downsampling that sacrifice query capabilities. + +CrateDB eliminates this trade-off by storing large volumes of data efficiently +while keeping it fully accessible for querying and analysis. Unlike systems +that struggle with high cardinality or require expensive tiered architectures, +CrateDB handles billions of unique records in a single platform, maintaining +fast query performance even on historic datasets spanning years. 
+ +By keeping all your data in one place, you avoid the complexity and costs of +exporting to specialized long-term storage systems, data lakes, or cold storage +tiers. Your historical data remains as queryable as your recent data, enabling +seamless analysis across any time range without data movement, ETL pipelines, +or rehydration processes. + +With CrateDB, compatible to PostgreSQL, you can do all of that using plain SQL. +Other than integrating well with commodity systems using standard database +access interfaces like ODBC or JDBC, it provides a proprietary HTTP interface +on top. + +## Use cases + +:::{rubric} Metrics and monitoring +::: + +::::{grid} 1 1 1 2 +:gutter: 2 +:padding: 0 + +:::{grid-item-card} Prometheus +:link: prometheus +:link-type: ref +Prometheus and similar monitoring systems excel at real-time alerting but face challenges +with long-term metric retention due to storage costs and query performance at scale. CrateDB +addresses these challenges by providing: +- **Scalable long-term storage**: Store years of metrics without compromising query performance. +- **High cardinality support**: Handle millions of unique label combinations that would overwhelm traditional TSDBs. +- **Rich SQL analytics**: Perform complex analytical queries on historic metrics using standard SQL. +- **Seamless integration**: Use CrateDB's Prometheus Adapter for transparent remote write/read operations. ++++ +Set up CrateDB as a long-term metrics store for Prometheus. +::: + +:::{grid-item-card} OpenTelemetry +:link: opentelemetry +:link-type: ref +OpenTelemetry and similar observability frameworks excel at generating rich telemetry data +but face challenges with long-term retention due to storage scale and query complexity. +CrateDB addresses these challenges by providing: +- **Scalable long-term storage**: Store large volumes of telemetry through CrateDB's distributed architecture. 
+- **Vendor-neutral ingestion**: Use OpenTelemetry SDKs/agents and Telegraf to send telemetry into your CrateDB observability pipeline. +- **Rich SQL analytics**: Run SQL/time-series queries, aggregations and joins on telemetry data for troubleshooting and analytics. +- **Flexible attribute mapping**: Customize which span/log/profile attributes become columns/tags for dimensional queries. ++++ +Set up CrateDB as a long-term observability backend for OpenTelemetry. +::: + +:::: + +## Related sections + +{ref}`metrics-store` includes information about how to +store and analyze high volumes of system monitoring information +like metrics and log data with CrateDB. + +{ref}`analytics` describes how +CrateDB provides real-time analytics on raw data stored for the long term. +Keep massive amounts of data ready in the hot zone for analytics purposes. + +[Optimizing storage efficiency for historic time series data] +illustrates how to reduce table storage size by 80%, +by using arrays for time-based bucketing, a historical table having +a dedicated layout, and querying using the UNNEST table function. + +{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>` +describes how to manage aging data by leveraging CrateDB cluster +features to mix nodes with different hardware setups, i.e. hot +nodes using the latest generation of NVMe drives for responding +to analytics queries quickly, and cold nodes that have access to +cheap mass storage for retaining historic data. + +{ref}`weather-data-storage` provides information about how to +use CrateDB for mass storage of synoptic weather observations, +allowing you to query them efficiently. 
+ + +[Optimizing storage efficiency for historic time series data]: https://community.cratedb.com/t/optimizing-storage-for-historic-time-series-data/762 diff --git a/docs/solution/time-series/index.md b/docs/solution/time-series/index.md index 4a4fd5ea..8b387e9d 100644 --- a/docs/solution/time-series/index.md +++ b/docs/solution/time-series/index.md @@ -69,21 +69,6 @@ Machine Learning on Time Series Data: EDA, Decomposition, AutoML. ::: -:::{grid-item-card} {material-outlined}`manage_history;2em` Long-term storage -:link: timeseries-longterm -:link-type: ref -:link-alt: About storing time series data for the long term - -Run efficient data operations for current and historical time series data. - -+++ -**What's inside:** -Time-based bucketing. -Import data using Dask. -Optimizing storage for historic time series data. -::: - - :::: @@ -92,6 +77,7 @@ Optimizing storage for historic time series data. **Domains:** {ref}`analytics` • {ref}`industrial` • +{ref}`longterm-store` • {ref}`machine-learning` • {ref}`metrics-store` @@ -114,7 +100,6 @@ Optimizing storage for historic time series data. Fundamentals Advanced analysis video -Long-term store ::: diff --git a/docs/solution/time-series/longterm.md b/docs/solution/time-series/longterm.md deleted file mode 100644 index f04633bb..00000000 --- a/docs/solution/time-series/longterm.md +++ /dev/null @@ -1,113 +0,0 @@ -(timeseries-longterm)= -(timeseries-long-term-storage)= -# Time series long-term storage - -CrateDB stores large volumes of data, keeping it accessible for querying -and insightful analysis, even considering historic data records. 
-**Never retire data just because your database can't handle the cardinality.** - - -:::{rubric} Use Cases and Tutorials -::: - - -::::{info-card} - -:::{grid-item} -:columns: auto 9 9 9 -**Optimizing storage for historic time series data** - -This tutorial illustrates how to reduce table storage size by 80%, -by using arrays for time-based bucketing, a historical table having -a dedicated layout, and querying using the UNNEST table function. - -{{ '{}[Optimizing storage for historic time series data]'.format(tutorial) }} -::: - -:::{grid-item} -:columns: 3 -{tags-primary}`Rich Time Series` -{tags-primary}`Storage Efficiency` - -{tags-secondary}`SQL` -::: - -:::: - - -:::{rubric} Related -::: - -::::{info-card} - -:::{grid-item} -:columns: auto 9 9 9 -**CrateDB as metrics and log data store for the long term** - -Store and analyze high volumes of system monitoring information. -Read more about using CrateDB as {ref}`metrics-store`. -::: - -:::{grid-item} -:columns: 3 -{tags-primary}`Long-term Storage` -{tags-primary}`Metrics` -{tags-primary}`Logging` -::: - -:::: - - -::::{info-card} - -:::{grid-item} -:columns: auto 9 9 9 -**CrateDB provides real-time analytics on raw data stored for the long term** - -Keep massive amounts of data ready in the hot zone for analytics purposes. -Read more about using CrateDB for {ref}`analytics`. -::: - -:::{grid-item} -:columns: 3 -{tags-primary}`Long-term storage` -{tags-primary}`Real-time analytics` -::: - -:::: - - -:::{rubric} Applications -::: - -::::{info-card} - -:::{grid-item} -:columns: auto 8 8 8 -**Storing and analyzing massive amounts of synoptic weather data** - -Wetterdienst uses CrateDB for mass storage of weather data, allowing you to -query it efficiently. It provides access to data at more than ten canonical -sources of raw weather data from domestic weather agencies. 
- -[![Wetterdienst Documentation](https://img.shields.io/badge/Documentation-Data%20Export-darkgreen?logo=Markdown)](https://wetterdienst.readthedocs.io/en/latest/usage/python-api.html#export) -[![Wetterdienst Project](https://img.shields.io/badge/Repository-Wetterdienst-darkblue?logo=GitHub)](https://github.com/earthobservations/wetterdienst) -::: - -:::{grid-item} -:columns: 4 -{tags-primary}`Earth Observations` -{tags-primary}`Metadata` -{tags-primary}`Sensor Data` -{tags-primary}`Rich Time Series` - -{tags-secondary}`pandas` -{tags-secondary}`Polars` -{tags-secondary}`SQL` -::: - -:::: - - - -[Optimizing storage for historic time series data]: https://community.cratedb.com/t/optimizing-storage-for-historic-time-series-data/762 diff --git a/docs/start/application/index.md b/docs/start/application/index.md index d3b0779f..7fa075a1 100644 --- a/docs/start/application/index.md +++ b/docs/start/application/index.md @@ -1,5 +1,5 @@ (example-applications)= -# Sample Applications +# Sample applications :::{rubric} Starter @@ -86,3 +86,34 @@ Users can ask questions of the knowledge base using natural language. ::: :::: + + +:::{rubric} Community +::: + +:::::{grid} 1 2 2 3 +:gutter: 2 + +::::{grid-item-card} +:link: https://wetterdienst.readthedocs.io/en/latest/usage/python-api.html#export +:link-type: url +(weather-data-storage)= +:::{rubric} Store and analyze massive amounts of synoptic weather data +::: +Wetterdienst uses CrateDB for mass storage of weather data, allowing you to +query it efficiently. It provides access to data at more than ten canonical +sources of raw weather data from domestic weather agencies. 
++++ +**What's inside:** + +{tags-primary}`Earth observations` +{tags-primary}`Metadata` +{tags-primary}`Sensor data` +{tags-primary}`Time series` + +{tags-secondary}`pandas` +{tags-secondary}`Polars` +{tags-secondary}`SQL` +:::: + +::::: From aeb62d72aada719fc989d955e7bcfe5acaa79073 Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sat, 25 Oct 2025 03:00:55 +0200 Subject: [PATCH 2/4] Long-term store: Add "Tools" section, bundling data retention utilities - Airflow-based data retention - CTK-based data retention --- docs/solution/longterm/index.md | 47 ++++++++++++++++++++++++++++----- 1 file changed, 40 insertions(+), 7 deletions(-) diff --git a/docs/solution/longterm/index.md b/docs/solution/longterm/index.md index f5108a7f..89c2498d 100644 --- a/docs/solution/longterm/index.md +++ b/docs/solution/longterm/index.md @@ -73,6 +73,43 @@ Set up CrateDB as a long-term observability backend for OpenTelemetry. :::: +## Tools + +### Automatic retention and expiration + +When operating a system storing and processing large amounts of data, +it is crucial to manage data flows and life-cycles well, which includes +handling concerns of data expiry, size reduction, and archival. + +Optimally, corresponding tasks are automated rather than manually +performed. CrateDB provides relevant integrations and standalone +applications for automatic data retention purposes. + +:::{rubric} Apache Airflow +::: + +{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>` +describes how to manage aging data by leveraging CrateDB cluster +features to mix nodes with different hardware setups, i.e. hot +nodes using the latest generation of NVMe drives for responding +to analytics queries quickly, and cold nodes that have access to +cheap mass storage for retaining historic data. + +:::{rubric} CrateDB Toolkit +::: + +[CrateDB Toolkit Retention and Expiration] is a data retention and +expiration policy management system for CrateDB, providing multiple +retention strategies. 
+ +:::{note} +The system derives its concepts from [InfluxDB data retention] ideas and +from the {ref}`Airflow-based data retention tasks for CrateDB `, +but aims to be usable as a standalone system in different software environments. +Effectively, it is a Python library and CLI around a policy management +table defined per [retention-policy-ddl.sql]. +::: + ## Related sections {ref}`metrics-store` includes information about how to @@ -88,16 +125,12 @@ illustrates how to reduce table storage size by 80%, by using arrays for time-based bucketing, a historical table having a dedicated layout, and querying using the UNNEST table function. -{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>` -describes how to manage aging data by leveraging CrateDB cluster -features to mix nodes with different hardware setups, i.e. hot -nodes using the latest generation of NVMe drives for responding -to analytics queries quickly, and cold nodes that have access to -cheap mass storage for retaining historic data. - {ref}`weather-data-storage` provides information about how to use CrateDB for mass storage of synoptic weather observations, allowing you to query them efficiently. 
+[CrateDB Toolkit Retention and Expiration]: https://cratedb-toolkit.readthedocs.io/retention.html +[InfluxDB data retention]: https://docs.influxdata.com/influxdb/v1/guides/downsample_and_retain/ [Optimizing storage efficiency for historic time series data]: https://community.cratedb.com/t/optimizing-storage-for-historic-time-series-data/762 +[retention-policy-ddl.sql]: https://github.com/crate/cratedb-toolkit/blob/main/cratedb_toolkit/retention/setup/schema.sql From 5230f1a1626ca05a33280f7640e9de61f6e48732 Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sat, 25 Oct 2025 03:55:59 +0200 Subject: [PATCH 3/4] Long-term store: Implement suggestions by CodeRabbit --- docs/solution/longterm/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/solution/longterm/index.md b/docs/solution/longterm/index.md index 89c2498d..a4379d06 100644 --- a/docs/solution/longterm/index.md +++ b/docs/solution/longterm/index.md @@ -29,7 +29,7 @@ tiers. Your historical data remains as queryable as your recent data, enabling seamless analysis across any time range without data movement, ETL pipelines, or rehydration processes. -With CrateDB, compatible to PostgreSQL, you can do all of that using plain SQL. +With CrateDB, compatible with PostgreSQL, you can do all of that using plain SQL. Other than integrating well with commodity systems using standard database access interfaces like ODBC or JDBC, it provides a proprietary HTTP interface on top. 
From 41f5039a01f872440b827eb7f4df2bf2de187050 Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Mon, 27 Oct 2025 11:07:10 +0100 Subject: [PATCH 4/4] Long-term store: "Automatic retention and expiration" to separate page --- docs/solution/longterm/index.md | 49 ++++++----------------------- docs/solution/longterm/retention.md | 40 +++++++++++++++++++++++ 2 files changed, 49 insertions(+), 40 deletions(-) create mode 100644 docs/solution/longterm/retention.md diff --git a/docs/solution/longterm/index.md b/docs/solution/longterm/index.md index a4379d06..9c971075 100644 --- a/docs/solution/longterm/index.md +++ b/docs/solution/longterm/index.md @@ -4,6 +4,11 @@ # Long-term store +:::{toctree} +:hidden: +retention +::: + :::{div} sd-text-muted Never retire data just because your other systems can't handle the cardinality. ::: @@ -73,43 +78,6 @@ Set up CrateDB as a long-term observability backend for OpenTelemetry. :::: -## Tools - -### Automatic retention and expiration - -When operating a system storing and processing large amounts of data, -it is crucial to manage data flows and life-cycles well, which includes -handling concerns of data expiry, size reduction, and archival. - -Optimally, corresponding tasks are automated rather than manually -performed. CrateDB provides relevant integrations and standalone -applications for automatic data retention purposes. - -:::{rubric} Apache Airflow -::: - -{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>` -describes how to manage aging data by leveraging CrateDB cluster -features to mix nodes with different hardware setups, i.e. hot -nodes using the latest generation of NVMe drives for responding -to analytics queries quickly, and cold nodes that have access to -cheap mass storage for retaining historic data. - -:::{rubric} CrateDB Toolkit -::: - -[CrateDB Toolkit Retention and Expiration] is a data retention and -expiration policy management system for CrateDB, providing multiple -retention strategies. 
- -:::{note} -The system derives its concepts from [InfluxDB data retention] ideas and -from the {ref}`Airflow-based data retention tasks for CrateDB `, -but aims to be usable as a standalone system in different software environments. -Effectively, it is a Python library and CLI around a policy management -table defined per [retention-policy-ddl.sql]. -::: - ## Related sections {ref}`metrics-store` includes information about how to @@ -120,6 +88,10 @@ like metrics and log data with CrateDB. CrateDB provides real-time analytics on raw data stored for the long term. Keep massive amounts of data ready in the hot zone for analytics purposes. +{ref}`retention` illustrates how to optimally implement data retention +procedures, to manage the life-cycle of data stored in CrateDB, handling +concerns of data expiry, size reduction, and archival. + [Optimizing storage efficiency for historic time series data] illustrates how to reduce table storage size by 80%, by using arrays for time-based bucketing, a historical table having @@ -130,7 +102,4 @@ use CrateDB for mass storage of synoptic weather observations, allowing you to query them efficiently. 
-[CrateDB Toolkit Retention and Expiration]: https://cratedb-toolkit.readthedocs.io/retention.html -[InfluxDB data retention]: https://docs.influxdata.com/influxdb/v1/guides/downsample_and_retain/ [Optimizing storage efficiency for historic time series data]: https://community.cratedb.com/t/optimizing-storage-for-historic-time-series-data/762 -[retention-policy-ddl.sql]: https://github.com/crate/cratedb-toolkit/blob/main/cratedb_toolkit/retention/setup/schema.sql diff --git a/docs/solution/longterm/retention.md b/docs/solution/longterm/retention.md new file mode 100644 index 00000000..ff8975e7 --- /dev/null +++ b/docs/solution/longterm/retention.md @@ -0,0 +1,40 @@ +(expiration)= +(retention)= + +# Automatic retention and expiration + +When operating a system storing and processing large amounts of data, +it is crucial to manage data flows and life-cycles well, which includes +handling concerns of data expiry, size reduction, and archival. + +Optimally, corresponding tasks are automated rather than manually +performed. CrateDB provides relevant integrations and standalone +applications for automatic data retention purposes. + +:::{rubric} Apache Airflow +::: + +{ref}`Build a hot/cold storage data retention policy <airflow-data-retention-hot-cold>` +describes how to manage aging data by leveraging CrateDB cluster +features to mix nodes with different hardware setups, i.e. hot +nodes using the latest generation of NVMe drives for responding +to analytics queries quickly, and cold nodes that have access to +cheap mass storage for retaining historic data. + +:::{rubric} CrateDB Toolkit +::: + +[CrateDB Toolkit Retention and Expiration] is a data retention and +expiration policy management system for CrateDB, providing multiple +retention strategies. + +The system derives its concepts from [InfluxDB data retention] ideas and +from the {ref}`Airflow-based data retention tasks for CrateDB `, +but aims to be usable as a standalone system in different software environments. 
+Effectively, it is a Python library and CLI around a policy management +table defined per [retention-policy-ddl.sql]. + + +[CrateDB Toolkit Retention and Expiration]: https://cratedb-toolkit.readthedocs.io/retention.html +[InfluxDB data retention]: https://docs.influxdata.com/influxdb/v1/guides/downsample_and_retain/ +[retention-policy-ddl.sql]: https://github.com/crate/cratedb-toolkit/blob/main/cratedb_toolkit/retention/setup/schema.sql