Preprocessing

Dataplug uses joblib to run preprocessing jobs. Joblib supports distributed backends, so preprocessing tasks can be parallelized and scaled out.

Dataplug lets you pass a configuration through to joblib to select a distributed backend, for instance Dask distributed.

co = CloudObject.from_s3(CSV, "s3://dataplug/some_csv_data.csv")

# joblib configuration goes here; for instance, add backend="dask" to use Dask distributed
parallel_config = {"verbose": 10}
co.preprocess(parallel_config=parallel_config)

The parallel_config parameter is passed directly to joblib when the Parallel instance is created. See the joblib documentation for the options it accepts.
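
To illustrate what such a configuration dictionary means on the joblib side, here is a minimal sketch that uses joblib directly, without Dataplug. It assumes the dictionary is unpacked as keyword arguments into the Parallel constructor; that is an assumption about Dataplug's internals, not something confirmed here. The function preprocess_chunk and the chosen option values are hypothetical placeholders.

```python
from joblib import Parallel, delayed


def preprocess_chunk(chunk_id):
    # Hypothetical stand-in for one unit of preprocessing work
    return chunk_id * chunk_id


# Same shape as the dictionary passed to co.preprocess();
# "threading" is one of joblib's built-in backends
parallel_config = {"n_jobs": 2, "backend": "threading", "verbose": 0}

# Assumption: the dictionary is unpacked into the Parallel constructor
results = Parallel(**parallel_config)(
    delayed(preprocess_chunk)(i) for i in range(4)
)
```

Any keyword accepted by Parallel (such as n_jobs, backend, or verbose) could be set this way. For the Dask backend, joblib generally expects a dask.distributed Client to exist before the Parallel instance is created, so that would likely need to be set up before calling co.preprocess.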