diff --git a/source/includes/_bulk_data.md.erb b/source/includes/_bulk_data.md.erb
index 36521493e4a..d52839e24a8 100644
--- a/source/includes/_bulk_data.md.erb
+++ b/source/includes/_bulk_data.md.erb
@@ -2,44 +2,45 @@
 To perform advanced reporting or analytics on your ControlShift data, you can mirror all your data into an external database. This can be a helpful tool for answering high-level questions about member engagement or integrating activity with data from other tools.
 
-Once your data is in an external data warehouse replica analysts can use SQL to answer questions about activity or join it
+Once your data is in an external data warehouse replica, analysts can use SQL to answer questions about activity or join it
 with data from other sources.
 
-We provide a set of automated bulk exports and webhooks, along with examples (linked below) on how to use them.
+
 
-It's possible to consume the Bulk Data API in its underlying format as CSV files in an S3 bucket or as a higher level
-HTTPS Webhook API that is not specific to AWS or S3. Many data warehouse integration technologies like [BigQuery S3 Transfers](https://cloud.google.com/bigquery/docs/s3-transfer),
-[Airbyte](https://docs.airbyte.com/integrations/sources/s3/) or [AWS Data Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html)
-are able to natively process files in S3 buckets. However, if you are using a different technology or want to implement a custom integration
-you can use our webhook events to get the same data in a cloud platform agnostic way.
+To get the data into your external system, you'll need to consume **bulk data exports**. There are two types of exports:
 
-We provide a [ControlShift to Redshift Pipeline](#bulk-data-controlshift-to-redshift-pipeline) as an example of sample code that demonstrates how to use the high-level webhooks to mirror your ControlShift data into Redshift.
-Similar strategies can be used to mirror your data into other data warehouses. We've designed the underlying APIs to work flexibly regardless of
-your technical architecture. Since we expose the file events as standard HTTPS webhooks they should be compatible with any programming language.
+- The **full export** happens once a day and includes a _complete_ copy of the current data in the tables.
+- The **incremental export** happens once a _minute_ and includes only rows that have been _added_ to the tables in the last minute.
 
-## Export schedule and webhooks
+A bulk data export (full or incremental) is a set of CSV files, one for each [ControlShift table](#bulk-data-data-schemas).
 
-Every night, we'll export the most up-to-date version of all of your data into a set of CSV files, one for each internal ControlShift table. The [data.full_table_exported](#webhook-endpoints-data-full_table_exported) indicates such an export. These full CSV files should _replace_ the existing data in your mirror database.
+## How to use full and incremental exports
+
+The data in the full exports should _replace_ the existing data in your mirror database.
+Refreshing your mirror database with the nightly full export is essential for maintaining an accurate copy of the data.
 
-Additionally, once a minute, we'll produce CSV files with any new rows that have been _added_ to ControlShift's internal tables. The [data.incremental_table_exported](#webhook-endpoints-data-incremental_table_exported) webhooks indicates a set of these added-rows exports. Note that the incremental exports do _not_ include any updates or deletions of existing rows; you'll have to wait for the nightly export to receive fresh data with updates and deletions included.
+If you're using the incremental exports, the data in them should be _added_ to your mirror database.
+Remember, the incremental exports do _not_ include any updates or deletions of existing rows; you'll have to wait for the nightly export to receive fresh data with updates and deletions included.
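+
+To make the replace-versus-append distinction concrete, here is a minimal sketch of the two load strategies. It uses Python's
+built-in `sqlite3` module as a stand-in for your warehouse and assumes the mirror tables already exist with columns matching the
+CSV headers, and that the export files have already been downloaded. The function, table, and file names are illustrative only,
+not part of any ControlShift tooling.
+
+```python
+import csv
+import sqlite3
+
+
+def load_full_export(conn, table, csv_path):
+    """Full exports replace the table: clear it out, then insert every row."""
+    with open(csv_path, newline="") as f:
+        reader = csv.reader(f)
+        columns = next(reader)  # first CSV row is the header
+        placeholders = ", ".join(["?"] * len(columns))
+        conn.execute(f"DELETE FROM {table}")  # drop the stale copy first
+        conn.executemany(
+            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})", reader
+        )
+    conn.commit()
+
+
+def load_incremental_export(conn, table, csv_path):
+    """Incremental exports only contain newly added rows, so append them as-is."""
+    with open(csv_path, newline="") as f:
+        reader = csv.reader(f)
+        columns = next(reader)
+        placeholders = ", ".join(["?"] * len(columns))
+        conn.executemany(
+            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})", reader
+        )
+    conn.commit()
+
+
+conn = sqlite3.connect("controlshift_mirror.db")
+load_full_export(conn, "petitions", "petitions.csv")           # nightly: replace
+load_incremental_export(conn, "signatures", "signatures.csv")  # per-minute: append
+```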
 
-
+## Webhooks
 
-## Bulk Data Data Schemas
+When a new bulk data export is ready, you'll receive a webhook at each of your webhook endpoints.
 
-The bulk data webhooks include exports of the following tables:
+- The webhook for full exports is [`data.full_table_exported`](#webhook-endpoints-data-full_table_exported).
+- The webhook for incremental exports is [`data.incremental_table_exported`](#webhook-endpoints-data-incremental_table_exported).
 
-<% data.export_tables['tables'].each do |tbl_info| %>
-* <%= tbl_info['table']['name'] %>
-<% end %>
+
 
-For full information on the schema of each table, use the `/api/bulk_data/schema.json` API endpoint.
+Your system should listen for those webhooks to know when and where to get the exported data.
 
-## Bulk Data Files
+## Bulk data files
 
-Each table exposed by the bulk data API is made available as a CSV file with the URL to download each file sent via webhook.
+Each table exposed by the bulk data API is made available as a CSV file, with the URL to download each file sent via webhook.
 
 We expire access to data 6 hours after it has been generated. This means that if you are building an automated system
 to ingest data from this API it must process webhook notifications within 6 hours.
@@ -67,6 +68,30 @@ Finally, when the compression for data exports is enabled the filename includes
 
 When the **Compress bulk data exports** option is enabled (available at the Webhooks integration page), incremental and
 nightly bulk data export files will be compressed in [`bzip2` format](https://sourceware.org/bzip2/). This will improve
 the performance for fetching the files from S3 since they will be considerably smaller.
+
+## Processing the exported data
+
+It's possible to consume the Bulk Data API in its underlying format as CSV files in an S3 bucket or as a higher-level
+HTTPS Webhook API that is not specific to AWS or S3. Many data warehouse integration technologies like [BigQuery S3 Transfers](https://cloud.google.com/bigquery/docs/s3-transfer),
+[Airbyte](https://docs.airbyte.com/integrations/sources/s3/) or [AWS Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html)
+can natively process files in S3 buckets. However, if you are using a different technology or want to implement a custom integration,
+you can use our webhook events to get the same data in a cloud-platform-agnostic way.
+
+We provide a [ControlShift to Redshift pipeline](#bulk-data-controlshift-to-redshift-pipeline) as sample code that demonstrates how to use the high-level webhooks to mirror your ControlShift data into Redshift.
+Similar strategies can be used to mirror your data into other data warehouses. We've designed the underlying APIs to work flexibly regardless of
+your technical architecture. Since we expose the file events as standard HTTPS webhooks, they should be compatible with any programming language.
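+
+If you go the custom-integration route, the consumer can be small. The sketch below is an illustrative webhook receiver built
+with Python's standard library: it accepts the export webhooks, downloads the CSV while the URL is still valid (within 6 hours),
+and decompresses it when compression is enabled. It assumes a JSON payload, and the field names used here (`type`, `table`,
+`url`) are placeholders rather than the documented schema — check the webhook endpoint reference for the actual payload.
+`load_into_warehouse` stands in for whatever load logic your warehouse needs.
+
+```python
+import bz2
+import json
+import urllib.request
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+
+def load_into_warehouse(table, csv_bytes, replace):
+    """Placeholder: replace the table for full exports, append for incremental ones."""
+
+
+class BulkDataWebhookHandler(BaseHTTPRequestHandler):
+    def do_POST(self):
+        # Read and parse the webhook body.
+        length = int(self.headers.get("Content-Length", 0))
+        event = json.loads(self.rfile.read(length))
+
+        # Placeholder field names -- consult the webhook endpoint reference
+        # for the real payload schema.
+        kind = event.get("type")    # e.g. "data.full_table_exported"
+        table = event.get("table")  # which ControlShift table was exported
+        url = event.get("url")      # download URL, only valid for 6 hours
+
+        if kind in ("data.full_table_exported", "data.incremental_table_exported"):
+            raw = urllib.request.urlopen(url).read()
+            # Files are bzip2-compressed when "Compress bulk data exports" is
+            # enabled; the filename then ends in .bz2.
+            if url.split("?")[0].endswith(".bz2"):
+                raw = bz2.decompress(raw)
+            load_into_warehouse(table, raw, replace=(kind == "data.full_table_exported"))
+
+        # Acknowledge the webhook. A production consumer would enqueue the
+        # download instead of doing it inline before responding.
+        self.send_response(200)
+        self.end_headers()
+
+
+if __name__ == "__main__":
+    HTTPServer(("0.0.0.0", 8080), BulkDataWebhookHandler).serve_forever()
+```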
+
+
+## Data schemas
+
+The bulk data webhooks include exports of the following tables:
+
+<% data.export_tables['tables'].each do |tbl_info| %>
+* <%= tbl_info['table']['name'] %>
+<% end %>
+
+For full information on the schema of each table, use the `/api/bulk_data/schema.json` API endpoint.
+
 ### Interpreting the share_clicks table
 
 The `share_clicks` table is designed to help you understand in detail how social media sharing influences member actions.
@@ -135,7 +160,7 @@ Additionally the following columns apply for all kinds of unsubscribe:
 
 | created_at | When the unsubscribe was recorded in the platform. |
 | updated_at | Same as `created_at` as unsubscribe records cannot be updated after creation. |
 
-## ControlShift to Redshift Pipeline
+## ControlShift to Redshift pipeline
 
 Setting up an Amazon Redshift integration is a great way to learn more about the actions your members are taking or perform
 sophisticated analytics, but it is an advanced topic that requires knowledge of Amazon Web Services, SQL, and Terraform.
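+
+Whatever pipeline you build, the destination tables need to exist with columns matching the exports before the first full load.
+One way to bootstrap them is to read the `/api/bulk_data/schema.json` endpoint described above; the sketch below simply fetches
+and prints it. The hostname is a placeholder for your own ControlShift instance, and you may need to add whatever API
+authentication your organisation uses.
+
+```python
+import json
+import urllib.request
+
+# Placeholder hostname -- substitute your own ControlShift instance, and add
+# any API authentication your organisation requires.
+SCHEMA_URL = "https://your-organisation.example.com/api/bulk_data/schema.json"
+
+with urllib.request.urlopen(SCHEMA_URL) as response:
+    schema = json.load(response)
+
+# Inspect the table and column definitions, then create matching tables in
+# your mirror database before the first full export is loaded.
+print(json.dumps(schema, indent=2))
+```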