
To perform advanced reporting or analytics on your ControlShift data, you can mirror all your data into an external database.
This can be a helpful tool for answering high-level questions about member engagement or integrating activity with data from other tools.
Once your data is in an external data warehouse replica, analysts can use SQL to answer questions about activity or join it
with data from other sources.

We provide a set of automated bulk exports and webhooks, along with examples (linked below) on how to use them.
<aside class="warning">
Note that this is <strong>not</strong> a real-time replica of the database.
A fresh copy of the data is provided nightly, but in between nightly exports, only <strong>inserts</strong> of new rows are exported. Updates and deletions will only be reflected in the nightly export.
If real-time data is required, we recommend consuming the specific relevant webhooks (e.g. <code>signature.updated</code>) rather than relying on the incremental bulk data exports.
</aside>

To get the data into your external system, you'll need to consume **bulk data exports**. There are two types of exports:

- The **full export** happens once a day, and includes a _complete_ copy of the current data in the tables.
- The **incremental export** happens once a _minute_, and includes only rows that have been _added_ to the table in the last minute.

A bulk data export (full or incremental) is a set of CSV files, one for each [ControlShift table](#bulk-data-bulk-data-data-schemas).

## How to use full and incremental exports

The data in the full exports should _replace_ the existing data in your mirror database.
**Refreshing your mirror database with the nightly full export is essential to ensuring an accurate copy of the data.**

If you're using the incremental exports, the data in them should be _added_ to your mirror database.
Remember, the incremental exports do _not_ include any updates or deletions of existing rows; you'll have to wait for the nightly export to receive fresh data with updates and deletions included.
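
As a rough illustration of the replace-versus-append distinction, here is a minimal sketch in Python that loads export files into a hypothetical Postgres mirror with `psycopg2`; the connection string, table name, and the presence of a CSV header row are all assumptions, not part of the API:

```python
import psycopg2

# Hypothetical connection string for your mirror database.
conn = psycopg2.connect("dbname=controlshift_mirror user=analytics")

def apply_full_export(table_name, csv_path):
    # Nightly full export: replace the table's contents entirely.
    with conn, conn.cursor() as cur, open(csv_path) as f:
        cur.execute(f"TRUNCATE {table_name}")
        cur.copy_expert(f"COPY {table_name} FROM STDIN WITH CSV HEADER", f)

def apply_incremental_export(table_name, csv_path):
    # Per-minute incremental export: append the newly added rows only.
    with conn, conn.cursor() as cur, open(csv_path) as f:
        cur.copy_expert(f"COPY {table_name} FROM STDIN WITH CSV HEADER", f)
```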

<aside class="notice">
Bulk data webhooks should be automatically included when adding a new webhook endpoint. Please contact support to report any issues with bulk data webhook generation. For testing, you can manually trigger these wehbooks by visiting <code>https://&lt;your controlshift instance&gt;/org/settings/integrations/webhook_endpoints</code> and clicking on "Trigger" under "Test Nightly Bulk Data Export Webhook".
</aside>
## Webhooks

When a new bulk data export is ready, you'll receive a webhook to each of your webhook endpoints.

- The webhook for full exports is [`data.full_table_exported`](#webhook-endpoints-data-full_table_exported).
- The webhook for incremental exports is [`data.incremental_table_exported`](#webhook-endpoints-data-incremental_table_exported).

<aside class="notice">
Bulk data webhooks should be automatically included when adding a new webhook endpoint. Please contact support to report any issues with bulk data webhook generation. For testing, you can manually trigger these webhooks by visiting <code>https://&lt;your controlshift instance&gt;/org/settings/integrations/webhook_endpoints</code> and clicking on "Trigger" under "Test Nightly Bulk Data Export Webhook".
</aside>

Your system should listen for those webhooks to know when and where to get the exported data.
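
As an illustration only, here is a minimal sketch of a webhook receiver using Flask; the payload field names (`type`, `data`, `table`, `url`) are assumptions for the sketch, so check the webhook endpoint reference for the exact structure of each event:

```python
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/controlshift-webhooks", methods=["POST"])
def handle_webhook():
    event = request.get_json()
    # Field names here are assumptions; consult the payload documentation for
    # data.full_table_exported and data.incremental_table_exported.
    if event.get("type") == "data.full_table_exported":
        replace_table(event["data"]["table"], event["data"]["url"])
    elif event.get("type") == "data.incremental_table_exported":
        append_rows(event["data"]["table"], event["data"]["url"])
    return "", 204

def replace_table(table_name, url):
    # Download promptly: access to export files expires 6 hours after generation.
    csv_body = requests.get(url).text
    ...  # truncate the mirror table and load the full CSV

def append_rows(table_name, url):
    csv_body = requests.get(url).text
    ...  # insert the new rows into the mirror table
```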

## Bulk data files

Each table exposed by the bulk data API is made available as a CSV file, with the URL to download each file sent via webhook.

We expire access to data 6 hours after it has been generated. This means that if you are building an automated system
to ingest data from this API, it must process webhook notifications within 6 hours.

When the **Compress bulk data exports** option is enabled (available on the Webhooks integration page), incremental and nightly bulk data export files will be compressed in [`bzip2` format](https://sourceware.org/bzip2/). This improves performance when fetching the files from S3, since they will be considerably smaller.
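
Compressed files can be read with any standard `bzip2` tooling. For example, a minimal Python sketch (the filename is a hypothetical example, and the `.bz2` suffix is an assumption):

```python
import bz2
import csv

# Hypothetical filename; assumes compressed exports carry a .bz2 suffix.
with bz2.open("signatures.csv.bz2", mode="rt", newline="") as f:
    for row in csv.DictReader(f):
        print(row)
```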


## Processing the exported data

It's possible to consume the Bulk Data API in its underlying format, as CSV files in an S3 bucket, or via a higher-level
HTTPS webhook API that is not specific to AWS or S3. Many data warehouse integration technologies, like [BigQuery S3 Transfers](https://cloud.google.com/bigquery/docs/s3-transfer),
[Airbyte](https://docs.airbyte.com/integrations/sources/s3/) or [AWS Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html),
can natively process files in S3 buckets. However, if you are using a different technology or want to implement a custom integration,
you can use our webhook events to get the same data in a cloud-platform-agnostic way.
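
If you do consume the S3 bucket directly, a minimal `boto3` sketch might look like the following; the bucket name and key prefix are purely hypothetical placeholders, since your actual bucket details come from your ControlShift configuration:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical placeholders for the bucket and prefix you were given.
BUCKET = "example-controlshift-bulk-exports"
PREFIX = "full/"

response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in response.get("Contents", []):
    key = obj["Key"]
    if key.endswith(".csv"):
        s3.download_file(BUCKET, key, key.split("/")[-1])
```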

We provide a [ControlShift to Redshift Pipeline](#bulk-data-controlshift-to-redshift-pipeline) as sample code that demonstrates how to use the high-level webhooks to mirror your ControlShift data into Redshift.
Similar strategies can be used to mirror your data into other data warehouses. We've designed the underlying APIs to work flexibly regardless of
your technical architecture. Since we expose the file events as standard HTTPS webhooks, they should be compatible with any programming language.


## Data schemas

The bulk data webhooks include exports of the following tables:

<% data.export_tables['tables'].each do |tbl_info| %>
* <%= tbl_info['table']['name'] %>
<% end %>

For full information on the schema of each table, use the `/api/bulk_data/schema.json` API endpoint.
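
For example, a minimal sketch of fetching the schema with Python's `requests`; the hostname is a placeholder, authentication is omitted, and the JSON structure shown is an assumption, so inspect the actual response to confirm how tables and columns are nested:

```python
import requests

# Placeholder hostname; add whatever authentication your instance requires.
BASE_URL = "https://your-controlshift-instance.example"

resp = requests.get(f"{BASE_URL}/api/bulk_data/schema.json")
resp.raise_for_status()
schema = resp.json()

# The structure below is an assumption for illustration.
for table in schema.get("tables", []):
    print(table.get("name"), [c.get("name") for c in table.get("columns", [])])
```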

### Interpreting the share_clicks table

The `share_clicks` table is designed to help you understand in detail how social media sharing influences member actions.
Additionally, the following columns apply for all kinds of unsubscribe:

| Column | Description |
|--------|-------------|
| created_at | When the unsubscribe was recorded in the platform. |
| updated_at | Same as `created_at` as unsubscribe records cannot be updated after creation. |

## ControlShift to Redshift pipeline

Setting up an Amazon Redshift integration is a great way to learn more about the actions your members are taking or to perform
sophisticated analytics, but it is an advanced topic that requires knowledge of Amazon Web Services, SQL, and Terraform.