Merged
2 changes: 0 additions & 2 deletions docs/modules/ROOT/examples/extension-start.sh

This file was deleted.

2 changes: 0 additions & 2 deletions docs/modules/ROOT/examples/java-start.sh

This file was deleted.

37 changes: 29 additions & 8 deletions docs/modules/ROOT/pages/backfill-cli.adoc
Original file line number Diff line number Diff line change
@@ -11,7 +11,7 @@ Developers can also use the backfill CLI to trigger change events for downstream
== Installation

The CDC backfill CLI is distributed both as a JAR file and as a Pulsar-admin extension NAR file.
The Pulsar-admin extension is packaged with the DataStax Luna Streaming distribution in the /cliextensions folder, so you don't need to build from source unless you want to make changes to the code.
The Pulsar-admin extension is packaged with the IBM Elite Support for Apache Pulsar distribution in the `/cliextensions` folder, so you don't need to build from source unless you want to make changes to the code.

Both artifacts are built with Gradle.
To build the CLI, run the following commands:
@@ -50,19 +50,26 @@ Once the artifacts are generated, you can run the backfill CLI tool as either a
Java standalone::
+
--
[source,shell,subs="attributes+"]
[source,shell,subs="attributes+"]
----
include::example$java-start.sh[]
java -jar backfill-cli/build/libs/backfill-cli-{version}-all.jar --data-dir target/export --export-host 127.0.0.1:9042 \
--export-username cassandra --export-password cassandra --keyspace ks1 --table table1
----
--

Pulsar-admin extension::
+
--
include::partial$extension.adoc[]
The Pulsar-admin extension is packaged with the IBM Elite Support for Apache Pulsar (formerly DataStax Luna Streaming) distribution in the `/cliextensions` folder, so you don't need to build from source unless you want to make changes to the code.

. Move the generated NAR archive to the `/cliextensions` folder of your Pulsar installation (e.g. `/pulsar/cliextensions`).
. Modify the `client.conf` file of your Pulsar installation to include: `customCommandFactories=cassandra-cdc`.
. Run the following command (this assumes the https://docs.datastax.com/en/installing/docs/installTARdse.html[default installation] of DSE Cassandra):
+
[source,shell]
----
include::example$extension-start.sh[]
--data-dir target/export --export-host 127.0.0.1:9042 \
--export-username cassandra --export-password cassandra --keyspace ks1 --table table1
----
--
====
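For step 2 of the extension instructions, the resulting `client.conf` might look like the following sketch. The service URLs are the stock Pulsar defaults and are shown only for context; this procedure adds only the `customCommandFactories` line.

```properties
# /pulsar/conf/client.conf (illustrative)
webServiceUrl=http://localhost:8080/
brokerServiceUrl=pulsar://localhost:6650/
# Registers the backfill CLI NAR as a pulsar-admin custom command
customCommandFactories=cassandra-cdc
```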
@@ -255,64 +262,78 @@ be exported in subdirectories of the data directory specified here;
there will be one subdirectory per keyspace inside the data
directory, then one subdirectory per table inside each keyspace
directory.

|--help, -h
|Displays this help message.

|--dsbulk-log-dir=PATH, -l
|The directory where DSBulk should store its logs. The default is a
'logs' subdirectory in the current working directory. This
subdirectory will be created if it does not exist. Each DSBulk
operation will create a subdirectory inside the log directory
specified here. This option is not available in the Pulsar-admin extension.

|--export-bundle=PATH
|The path to a secure connect bundle to connect to the Cassandra
cluster, if that cluster is a DataStax Astra cluster. Options
--export-host and --export-bundle are mutually exclusive.
|The path to a Secure Connect Bundle (SCB) to connect to an Astra DB database. Options --export-host and --export-bundle are mutually exclusive.

|--export-consistency=CONSISTENCY
|The consistency level to use when exporting data. The default is
LOCAL_QUORUM.

|--export-max-concurrent-files=NUM\|AUTO
|The maximum number of concurrent files to write to. Must be a positive
number or the special value AUTO. The default is AUTO.

|--export-max-concurrent-queries=NUM\|AUTO
|The maximum number of concurrent queries to execute. Must be a
positive number or the special value AUTO. The default is AUTO.

|--export-splits=NUM\|NC
|The maximum number of token range queries to generate. Use the NC
syntax to specify a multiple of the number of available cores, e.g.
8C = 8 times the number of available cores. The default is 8C. This
is an advanced setting; you should rarely need to modify the default
value.

|--export-dsbulk-option=OPT=VALUE
|An extra DSBulk option to use when exporting. Any valid DSBulk option
can be specified here, and it will be passed as-is to the DSBulk
process. DSBulk options, including driver options, must be passed as
'--long.option.name=<value>'. Short options are not supported. For more options, see the https://docs.datastax.com/en/dsbulk/docs/reference/commonOptions.html[DSBulk common options reference].

|--export-host=HOST[:PORT]
|The host name or IP and, optionally, the port of a node from the
Cassandra cluster. If the port is not specified, it will default to
9042. This option can be specified multiple times. Options
--export-host and --export-bundle are mutually exclusive.

|--export-password
|The password to use to authenticate against the origin cluster.
Options --export-username and --export-password must be provided
together, or not at all. Omit the parameter value to be prompted for
the password interactively.

|--export-protocol-version=VERSION
|The protocol version to use to connect to the Cassandra cluster, e.g.
'V4'. If not specified, the driver will negotiate the highest
version supported by both the client and the server.

|--export-username=STRING
|The username to use to authenticate against the origin cluster.
Options --export-username and --export-password must be provided
together, or not at all.

|--keyspace=<keyspace>, -k
|The name of the keyspace containing the table to be exported.

|--max-rows-per-second=NUM
|The maximum number of rows per second to read from the Cassandra
table. Setting this option to any negative value or zero will
disable it. The default is -1.

|--table=<table>, -t
|The name of the table to export data from for CDC backfilling.

|--version, -v
|Displays version info.
|===
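Putting several of the options above together, a Java-standalone invocation that throttles reads and lowers the export consistency might look like the following sketch. The jar version (2.2.9), contact point, and credentials are placeholders; the DSBulk option shown (`dsbulk.log.verbosity`) is just one example of a valid long-form option.

```shell
# Sketch only: substitute your own jar path/version and connection details.
# The command is built into a variable and echoed so it can be inspected
# before actually running it against a cluster.
CMD="java -jar backfill-cli/build/libs/backfill-cli-2.2.9-all.jar \
  --data-dir target/export \
  --export-host 127.0.0.1:9042 \
  --export-username cassandra --export-password cassandra \
  --keyspace ks1 --table table1 \
  --export-consistency=LOCAL_ONE \
  --max-rows-per-second=1000 \
  --export-dsbulk-option=--dsbulk.log.verbosity=2"
echo "$CMD"
```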
8 changes: 4 additions & 4 deletions docs/modules/ROOT/pages/cdc-cassandra-events.adoc
@@ -1,6 +1,6 @@
= CDC for Cassandra Events
= CDC for Cassandra Events

The DataStax CDC for Cassandra agent pushes the mutation primary key for the CDC-enabled table into the Apache Pulsar events topic (also called the dirty topic). The messages in the data topic (or clean topic) are keyed messages where both the key and the payload are https://avro.apache.org/docs/current/spec.html#schema_record[AVRO records]: +
The {cdc_cass_first} agent pushes the mutation primary key for the CDC-enabled table into the Apache Pulsar events topic (also called the dirty topic). The messages in the data topic (or clean topic) are keyed messages where both the key and the payload are https://avro.apache.org/docs/current/spec.html#schema_record[AVRO records]:

* The message key is an AVRO record including all the primary key columns of your Cassandra table.
* The message payload is an AVRO record including regular columns from your Cassandra table.
@@ -18,9 +18,9 @@ Finally, the following CQL data types are encoded as AVRO logical types:

See https://avro.apache.org/docs/current/spec.html#Logical+Types[AVRO Logical Types] for more info on AVRO.

== Change Events Key
== Change Event's Key

For a given table, the change events key is an AVRO record that contains a field for each column in the primary key of the table at the time the event was created. Both the events and the data topics (also called the dirty and the clean topics) have the same message key, an AVRO record including the primary key columns.
For a given table, the change event's key is an AVRO record that contains a field for each column in the primary key of the table at the time the event was created. Both the events and the data topics (also called the dirty and the clean topics) have the same message key, an AVRO record including the primary key columns.
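As an illustration, a hypothetical table `ks1.table1` with primary key `(id, ts)` would produce a message key schema along these lines. The names and types are invented for the example; the `timestamp-millis` logical type follows the CQL-to-AVRO logical type mappings described above.

```json
{
  "type": "record",
  "name": "table1",
  "namespace": "ks1",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "ts", "type": { "type": "long", "logicalType": "timestamp-millis" } }
  ]
}
```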

== `INSERT` Event

2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/cdcExample.adoc
@@ -11,7 +11,7 @@ This installation requires the following. Latest version artifacts are available
** DSE - use `agent-dse4-<version>-all.jar`
** OSS C* - use `agent-c4-<version>-all.jar`
* Pulsar
** DataStax Luna Streaming - use `agent-dse4-<version>-all.jar`
** IBM Elite Support for Apache Pulsar - use `agent-dse4-<version>-all.jar`
* Pulsar C* source connector (CSC)
** Pulsar Cassandra Source NAR - use `pulsar-cassandra-source-<version>.nar`

51 changes: 20 additions & 31 deletions docs/modules/ROOT/pages/faqs.adoc
@@ -2,34 +2,30 @@

If you are new to {cdc_cass_first}, these frequently asked questions are for you.

== Introduction

=== What is {cdc_cass}?
== What is {cdc_cass}?

The {cdc_cass} is an open-source product from DataStax.

With {cdc_cass}, updates to data in Apache Cassandra are put into a Pulsar topic, which in turn can write the data to external targets such as Elasticsearch, Snowflake, and other platforms.
The {csc_pulsar_first} component is simple, with a 1:1 correspondence between the Cassandra table and a single Pulsar topic.

=== What are the requirements for {cdc_pulsar}?
== What are the requirements for {cdc_pulsar}?

Minimum requirements are:

* Cassandra version 3.11+ or 4.0+, DSE 6.8.16+ for near real-time event streaming CDC
* Cassandra version 3.0 to 3.10 for batch CDC
* Luna Streaming 2.8.0+ or Apache Pulsar 2.8.1+
* IBM Elite Support for Apache Pulsar (formerly DataStax Luna Streaming) or Apache Pulsar 2.8.1+
* Additional memory and CPU available on all Cassandra nodes

[NOTE]
====
Cassandra has supported batch CDC since Cassandra 3.0, but for near real-time event streaming, Cassandra 3.11+ or DSE 6.8.16+ are required.
Cassandra has supported batch CDC since Cassandra 3.0, but for near real-time event streaming, Cassandra 3.11+ or DSE 6.8.16+ are required.
====

// insert link to pulsar cluster system doc

Depending on the workloads of the CDC enabled C* tables, you may need to increase the CPU and memory specification of the C* nodes.
Depending on the workloads of the CDC enabled C* tables, you may need to increase the CPU and memory specification of the C* nodes.

=== What is the impact of the C* CDC solution on the existing C* cluster?
== What is the impact of the C* CDC solution on the existing C* cluster?

For each CDC-enabled C* table, C* needs extra processing cycles and storage to process the CDC commit logs. The impact for dealing with a single CDC-enabled table is small, but when there are a large number of C* tables with CDC enabled, the impact within C* increases. The performance impact occurs within C* itself, not the C* CDC solution with Pulsar.

@@ -39,7 +35,7 @@ For each C* write operation (one detected change-event), the Pulsar CSC connecto

In a worst-case scenario, where a CDC-enabled C* has 100% write workload, the CDC solution would double the workload by adding the same amount of read workload to C* table. Since the C* read is primary key-based, it will be efficient.

=== What are the {cdc_cass} limitations?
== What are the {cdc_cass} limitations?

{cdc_cass} has the following limitations:

@@ -50,8 +46,7 @@ In a worst-case scenario, where a CDC-enabled C* has 100% write workload, the CD
* Does not support range deletes.
* CQL column names must not match a Pulsar primitive type name (e.g. INT32); see the table below.

==== Table Pulsar primitive types

.Pulsar primitive types
[cols=2*, options=header]
[%autowidth]
|===
@@ -91,9 +86,9 @@ It stores the number of milliseconds since January 1, 1970, 00:00:00 GMT as an I

|===

=== What happens if Luna Streaming or Apache Pulsar is unavailable?
== What happens if the Apache Pulsar service is unavailable?

If the Pulsar cluster is down, the CDC agent on each C* node will periodically try to send the mutations, and will keep the CDC commitlog segments on disk until the data sending is successful.
If the Pulsar cluster is down, the CDC agent on each C* node will periodically try to send the mutations, and will keep the CDC commitlog segments on disk until the data sending is successful.

The CDC agent keeps track of the CDC commitlog segment offsets, so the CDC agent knows where to resume sending the mutation messages when the Pulsar cluster is back online.

@@ -108,14 +103,14 @@ WARN  [CoreThread-5] 2021-10-29 09:12:52,790  NoSpamLogger.java:98 - Rejecting M
----

To avoid or recover from this situation, increase the `cdc_total_space_in_mb` and restart the node.
To prevent hitting this new limit, increase the write throughput to Luna Streaming or Apache Pulsar, or decrease the write throughput to your node.
To prevent hitting this new limit, increase the write throughput to your Apache Pulsar cluster, or decrease the write throughput to your node.
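The `cdc_total_space_in_mb` setting lives in `cassandra.yaml` on each affected node; a sketch of the change follows. The 8192 value is only an example, and as noted above the node must be restarted for it to take effect.

```yaml
# cassandra.yaml (per node)
cdc_enabled: true
# Default is 4096 MB (or 1/8 of the volume holding cdc_raw, whichever is
# smaller); raise it to buffer mutations through longer Pulsar outages.
cdc_total_space_in_mb: 8192
```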

Increasing the Luna Streaming or Apache Pulsar write throughput may involve tuning the change agent configuration (the number of allocated threads, the batching delay, the number of inflight messages), the Luna Streaming or Apache Pulsar configuration (the number of partitions of your topics), or the {cdc_pulsar} configuration (query executors, batching and cache settings, connector parallelism).
Increasing the write throughput may involve tuning the change agent configuration (the number of allocated threads, the batching delay, the number of inflight messages), the Pulsar cluster configuration (the number of partitions of your topics), or the {cdc_pulsar} configuration (query executors, batching and cache settings, connector parallelism).

As a last resort, if losing data is acceptable in your CDC pipeline, remove `commitlog` files from the `cdc_raw` directory.
Restarting the node is not needed in this case.

=== I have multiple Cassandra datacenters. How do I configure {cdc_cass}?
== I have multiple Cassandra datacenters. How do I configure {cdc_cass}?

In a multi-datacenter Cassandra configuration, enable CDC and install the change agent in only one datacenter.
To ensure the data sent to all datacenters are delivered to the data topic, make sure to configure replication to the datacenter that has CDC enabled on the table.
@@ -125,36 +120,30 @@ To ensure all updates in DC2 and DC3 are propagated to the data topic, configure
For example, `replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3}`.
The data replicated to DC1 will be processed by the change agent and eventually end up in the data topic.
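Applied to an existing keyspace, the replication change described above would look like this (the keyspace name is illustrative, and a repair is typically run afterwards so existing data reaches the CDC-enabled datacenter):

```sql
-- Replicate ks1 to the CDC-enabled datacenter (dc1) as well as dc2 and dc3.
ALTER KEYSPACE ks1 WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 3, 'dc2': 3, 'dc3': 3
};
```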

=== Is {cdc_cass} an open-source project?
== Is {cdc_cass} an open-source project?

Yes, {cdc_cass} is open source using the Apache 2.0 license. You can find the source code on the GitHub repository https://github.com/datastax/cdc-apache-cassandra[datastax/cdc-apache-cassandra].

=== What does {cdc_cass} provide that I cannot get with open-source Apache Pulsar?
== What does {cdc_cass} provide that I cannot get with open-source Apache Pulsar?

In effect, {cdc_cass} implements the reverse of the Apache Pulsar and DataStax Cassandra sink connectors.
With those sink connectors, data is taken from a Pulsar topic and put into Cassandra.
With {cdc_cass}, updates to a Cassandra table are converted into events and put into a data topic.
From there, the data can be published to external platforms like Elasticsearch, Snowflake, and other platforms.

//=== Does {cdc_cass} support Kubernetes?

//Yes.
//You can run the {cdc_pulsar} on Luna Streaming or Apache Pulsar running on Minikube, Google Kubernetes Engine (GKE), Microsoft Azure Kubernetes Service, // Amazon Kubernetes Service (AKS), and other commonly used platforms.
//You can deploy the change agent with Cassandra on Kubernetes with the https://github.com/datastax/cass-operator[cass-operator].

=== Where is the {cdc_cass} public GitHub repository?
== Where is the {cdc_cass} public GitHub repository?

The source for this FAQ document is co-located with the {cdc_cass} repository code.
You can access the repository https://github.com/datastax/cdc-apache-cassandra[here].

=== How do I install {cdc_cass}?
== How do I install {cdc_cass}?

Follow the xref:install.adoc[install] instructions.

=== What is Prometheus?
== What is Prometheus?

https://prometheus.io/docs/introduction/overview/[Prometheus] is an open-source tool to collect metrics on a running app, providing real-time monitoring and alerts.

=== What is Grafana?
== What is Grafana?

https://grafana.com/[Grafana] is a visualization tool that helps you make sense of metrics and related data coming from your apps via Prometheus.
https://grafana.com/[Grafana] is a visualization tool that helps you make sense of metrics and related data coming from your apps via Prometheus.