emimaesmith edited this page Feb 1, 2019 · 17 revisions

Strike is the name of Scale’s system job that monitors and handles incoming source data files that are ingested into Scale. A Strike job monitors a given workspace for new files. Scale administrators will want to create a Strike job for each data feed that will be processed by Scale.

When a new file is copied into the monitored workspace, its file name is checked against a number of rules using regular expressions configured for that Strike job. When the first rule that matches the new file’s name is reached, that rule’s other fields indicate how Strike should handle the file, such as tagging it with data type tags or moving the file to a new location in a different workspace.
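The first-match behavior described above can be sketched in a few lines of Python. This is only an illustration, not Scale's implementation: the `RULES` list and the `first_matching_rule` helper are hypothetical, and the use of `re.match` (anchored at the start of the name) is an assumption about how the expressions are applied.

```python
import re

# Hypothetical rules, shaped like entries of "files_to_ingest".
RULES = [
    {"filename_regex": r".*\.h5", "data_types": ["hdf5"]},
    {"filename_regex": r".*", "data_types": ["other"]},
]


def first_matching_rule(file_name, rules):
    """Return the first rule whose regex matches file_name, or None."""
    for rule in rules:
        # Assumption: the expression is matched from the start of the name.
        if re.match(rule["filename_regex"], file_name):
            return rule  # later rules are never consulted
    return None
```

Because evaluation stops at the first match, a catch-all rule such as `.*` only makes sense as the last entry in the list.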

Strike Configuration Specification Version 2.0

A valid Strike configuration is a JSON document with the following structure:

{
    "version": "2.0",
    "workspace": STRING,
    "monitor": {
        "type": STRING
    },
    "files_to_ingest": [
        {
            "filename_regex": STRING,
            "data_types": [
                STRING,
                STRING
            ],
            "new_workspace": STRING,
            "new_file_path": STRING
        }
    ],
    "recipe": {
        "name": STRING,
        "version": STRING
    }
}

version

Type: String
Required: No


Defines the version of the configuration used. This allows updates to be made to the specification while maintaining backwards compatibility by allowing Scale to recognize an older version and convert it to the current version. The default value, if not included, is the latest version (currently 2.0). It is recommended, though not required, that you include the version so that future changes to the specification will still accept your Strike configuration.

workspace

Type: String
Required: Yes


Specifies the name of the workspace that is being monitored. The type of the workspace (its broker type) will determine which types of monitor can be used.

monitor

Type: JSON Object
Required: Yes


Specifies the type and configuration of the monitor that will watch the workspace for new files.
  • type

    Type: String
    Required: Yes


    Specifies the type of the monitor to use. The other fields that configure the monitor are based upon the type of the monitor in the type field. Certain monitor types may only be used on workspaces with corresponding broker types. The valid monitor types are:
    • dir-watcher - A dir-watcher monitor watches a file directory for incoming files. This monitor may only be used with a host workspace.
    • s3 - An s3 monitor utilizes an Amazon Web Services (AWS) Simple Queue Service (SQS) to receive AWS S3 file notification events. This monitor may only be used with an s3 workspace.

    Additional monitor fields may be required depending on the type of monitor selected. See below for more information on each monitor type.

files_to_ingest

Type: Array
Required: Yes


A list of JSON objects that define the rules for how to handle files that appear in the monitored workspace. The array must contain at least one item. Each JSON object has the following fields:
  • filename_regex

    Type: String
    Required: Yes


    Defines a regular expression to check against the names of new files in the monitored workspace. When a new file appears in the workspace, the file’s name is checked against each expression in order of the files_to_ingest array. If an expression matches the new file name in the workspace, that file is ingested according to the other fields in the JSON object and all subsequent rules in the list are ignored (first rule matched is applied).
  • data_types

    Type: Array
    Required: No


    A list of strings. Any file that matches the corresponding file name regular expression will be tagged with these data type strings. If not provided, data_types defaults to [].
  • new_workspace

    Type: String
    Required: No


    Specifies the name of a new workspace to which the file should be moved. This allows the ingest process to move files to a different workspace after they appear in the monitored workspace.
  • new_file_path

    Type: String
    Required: No


    Specifies a new relative path for storing new files. If new_workspace is also specified, the file is moved to the new workspace at this new path location (instead of using the path the new file originally came in on). If new_workspace is not specified, the file is moved to this new path location within the original monitored workspace. In either case, Scale automatically appends three dynamically named directories, for the current year, month, and day, to the new_file_path value (i.e. new_file_path/YYYY/MM/DD).
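As a rough illustration of the date directories described for new_file_path, the sketch below builds a destination path. The `ingest_destination` helper is hypothetical, and it assumes the original file name is kept unchanged; the source does not say which clock or time zone Scale uses for "current", so the date is passed in explicitly.

```python
from datetime import date
import posixpath


def ingest_destination(new_file_path, file_name, today):
    """Join new_file_path, YYYY/MM/DD directories, and the file name."""
    # Zero-padded year/month/day directories, e.g. 2019/02/01.
    date_dirs = today.strftime("%Y/%m/%d")
    return posixpath.join(new_file_path, date_dirs, file_name)
```

For example, `ingest_destination("new/file/path", "a.h5", date(2019, 2, 1))` yields `new/file/path/2019/02/01/a.h5`.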

recipe

Type: JSON Object
Required: Yes


A JSON object that defines the name and version of the recipe that should be triggered when Strike has finished ingesting a file. A valid recipe configuration has the following fields:
  • name

    Type: String
    Required: Yes


    Defines the name of the Recipe Type that should be triggered when the Strike has finished ingesting a file.
  • version

    Type: String
    Required: Yes


    Defines the version of the Recipe Type that should be triggered when the Strike has finished ingesting a file.
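The required fields listed above lend themselves to a quick pre-flight check before a configuration is submitted. The following is a minimal sketch under stated assumptions, not Scale's own validator: `check_strike_config` is a hypothetical helper, it covers only the top-level fields, the rule list, and the recipe, and it assumes Python's `re` syntax for filename_regex.

```python
import re


def check_strike_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field in ("workspace", "monitor", "files_to_ingest", "recipe"):
        if field not in config:
            problems.append("missing required field: " + field)
    rules = config.get("files_to_ingest", [])
    if "files_to_ingest" in config and not rules:
        problems.append("files_to_ingest must contain at least one item")
    for index, rule in enumerate(rules):
        regex = rule.get("filename_regex")
        if regex is None:
            problems.append("rule %d: missing filename_regex" % index)
            continue
        try:
            re.compile(regex)  # catches glob-style patterns such as "*.h5"
        except re.error as exc:
            problems.append("rule %d: invalid regex: %s" % (index, exc))
    recipe = config.get("recipe", {})
    for field in ("name", "version"):
        if "recipe" in config and field not in recipe:
            problems.append("recipe: missing " + field)
    return problems
```

Returning a list of messages rather than raising on the first failure lets an administrator see every problem in one pass.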

Directory Watching Monitor

The directory watching monitor uses a workspace that mounts a host directory into the container and watches that directory for new files. Therefore, this monitor only works with a host workspace. When a new file appears in the mounted host directory, its file name is checked for the trailing suffix specified in the transfer_suffix configuration field. While the file name carries the suffix, the monitor continues tracking the size of the file and how long it takes to copy the file into the directory. When the file copy is complete, the process copying the file should rename the file to remove the transfer_suffix. Once the monitor sees the renamed file, it applies the files_to_ingest rules to it.

The monitor creates two sub-directories in the host directory, deferred and ingesting. If a copied file does not match any of the ingest rules, it is moved to the deferred directory. If the file matches an ingest rule, it is moved to ingesting and an ingest job is created to ingest it.
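The decision the monitor makes for each file name can be summarized as a small function. This is a sketch of the flow described above, not the monitor's actual code: `classify` is a hypothetical helper, and matching with `re.match` is an assumption.

```python
import re


def classify(file_name, transfer_suffix, rules):
    """Return 'transferring', 'ingesting', or 'deferred' for a file name."""
    if file_name.endswith(transfer_suffix):
        return "transferring"  # still being copied; keep tracking it
    for rule in rules:
        if re.match(rule["filename_regex"], file_name):
            return "ingesting"  # moved to ingesting/, ingest job created
    return "deferred"  # no rule matched; moved to deferred/
```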

Example directory watching monitor configuration:

{
    "version": "2.0",
    "workspace": "my-host-workspace",
    "monitor": {
        "type": "dir-watcher",
        "transfer_suffix": "_tmp"
    },
    "files_to_ingest": [
        {
            "filename_regex": ".*\\.h5",
            "data_types": [
                "data type 1",
                "data type 2"
            ],
            "new_workspace": "my-new-workspace",
            "new_file_path": "/new/file/path"
        }
    ],
    "recipe": {
        "name": "my-recipe",
        "version": "1.0.0"
    }
}

The directory watching monitor requires one additional field in its configuration:

transfer_suffix

Type: String
Required: Yes


Defines a suffix that the system or process transferring files into the directory appends to file names to indicate that the files are still transferring and have not yet finished being copied into the monitored directory.
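On the sending side, the convention amounts to "copy under a temporary name, then rename". The sketch below shows one way a transferring process might do this; `deliver` is a hypothetical helper and the default suffix `_tmp` is taken from the example configuration.

```python
import os
import shutil


def deliver(source_path, watched_dir, transfer_suffix="_tmp"):
    """Copy a file into watched_dir under a temporary name, then rename it."""
    final_path = os.path.join(watched_dir, os.path.basename(source_path))
    temp_path = final_path + transfer_suffix
    # While the copy runs, the monitor sees only the suffixed name.
    shutil.copyfile(source_path, temp_path)
    # Rename within the same directory once the copy is complete.
    os.replace(temp_path, final_path)
    return final_path
```

Renaming within the same directory avoids a second copy, so the file appears under its final name only after all of its bytes are in place.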

S3 Monitor (experimental)

The S3 monitor polls an AWS SQS queue for object creation notifications that describe new source data files available in an AWS S3 bucket (so this monitor only works with an S3 workspace). After the monitor finds a new file in the S3 bucket, it checks the file against the configured files_to_ingest rules.
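To make the notification step more concrete, the sketch below extracts object keys from an S3 event notification body. It is not the monitor's code: `created_object_keys` is a hypothetical helper, and it assumes the SQS message body is the S3 event JSON itself (not wrapped by another service such as SNS). Object keys in S3 notifications are URL-encoded, hence the decoding step.

```python
import json
from urllib.parse import unquote_plus


def created_object_keys(message_body):
    """Return the object keys named by ObjectCreated records in an S3 event."""
    event = json.loads(message_body)
    keys = []
    for record in event.get("Records", []):
        # Only creation events describe new files; skip removals and others.
        if record.get("eventName", "").startswith("ObjectCreated:"):
            keys.append(unquote_plus(record["s3"]["object"]["key"]))
    return keys
```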

Security: Use a dedicated IAM account rather than the root AWS account to limit the damage if credentials leak, and give that IAM account the minimum permissions needed to work with the bucket. Rotate the access keys periodically to further protect against leaks.

Security: While this monitor is in the experimental phase, the access keys are stored in plain text within the Scale database and exposed via the REST interface. A future version will maintain these values using a more appropriate encrypted store service.

Example S3 monitor configuration:

{
    "version": "2.0",
    "workspace": "my-s3-workspace",
    "monitor": {
        "type": "s3",
        "sqs_name": "my-sqs",
        "credentials": {
            "access_key_id": "AKIAIOSFODNN7EXAMPLE",
            "secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
        },
        "region_name": "us-east-1"
    },
    "files_to_ingest": [
        {
            "filename_regex": ".*\\.h5",
            "data_types": [
                "data type 1",
                "data type 2"
            ],
            "new_workspace": "my-new-workspace",
            "new_file_path": "/new/file/path"
        }
    ],
    "recipe": {
        "name": "my-recipe",
        "version": "1.0.0"
    }
}

The S3 monitor has the following additional fields in its configuration:

sqs_name

Type: String
Required: Yes


Defines the name of the SQS queue that should be polled for object creation notifications that describe new files in the S3 bucket.

credentials

Type: JSON Object
Required: No


Provides the necessary information to access the bucket. This attribute should be omitted when using IAM role-based security. If it is included for key-based security, then both sub-attributes must be included. An IAM account should be created and granted the appropriate permissions to the bucket before attempting to use it here.
  • access_key_id

    Type: String
    Required: No


    A unique identifier for the user account in IAM that will be used as a proxy for read and write operations within Scale.
  • secret_access_key

    Type: String
    Required: No


    A generated token that the system can use to prove it should be able to make requests on behalf of the associated IAM account without requiring the actual password used by that account.
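The "both or neither" rule for credentials can be checked mechanically. A minimal sketch, assuming the monitor configuration has already been parsed into a Python dict; `check_credentials` is a hypothetical helper.

```python
def check_credentials(monitor):
    """Return True if credentials are absent or carry both key fields."""
    credentials = monitor.get("credentials")
    if credentials is None:
        return True  # role-based security: nothing to check
    # Key-based security: both sub-attributes must be present and non-empty.
    return all(credentials.get(k) for k in ("access_key_id", "secret_access_key"))
```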

region_name

Type: String
Required: No


Specifies the AWS region where the SQS queue is located. This is not always required, as environment variables or configuration files could set the default region, but setting it is highly recommended so that the SQS region is stated explicitly.
