This Python-based data processing tool uses Apache Spark to orchestrate data flows, monitor ETL processes, retrieve data from various sources, and save the results. It is designed for enterprise environments and is compatible with Cloudera Data Platform (CDP) and modern Spark distributions. Users can extract data from sources such as Hive and Parquet, apply inline SQL transformations, validate data quality using Great Expectations, and then store the processed data in Hive, Kudu, or Parquet.
- Data Movement: Extract, Transform, Load (ETL) between Hive, Kudu, Impala, and Filesystem (Parquet).
- Transformation: Inline SQL transformations.
- Data Quality: Integrated Great Expectations for data validation.
- Monitoring: Automated tracking of ETL processes in a Hive table.
- Configuration: JSON-based configuration for easy management.
- Enterprise Grade: Python 3 support, Type Hinting, Logging, and Testing.
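
To make the Data Quality item concrete, the sketch below shows what the two expectation types used in this README's configuration example check. It is a plain-Python stand-in and does not call Great Expectations: the `CHECKS` registry and the `run_expectations` helper are hypothetical names introduced here, and rows are modeled as simple dictionaries rather than Spark DataFrames. As a simplification, the uniqueness check treats a missing value like any other value.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def not_null(rows: List[Row], column: str) -> bool:
    """Return True when no row has a missing (None) value in `column`."""
    return all(row.get(column) is not None for row in rows)


def unique(rows: List[Row], column: str) -> bool:
    """Return True when no value in `column` appears more than once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


# Hypothetical registry: maps the expectation "type" strings from the
# configuration onto the stand-in checks defined above.
CHECKS: Dict[str, Callable[..., bool]] = {
    "expect_column_values_to_not_be_null": not_null,
    "expect_column_values_to_be_unique": unique,
}


def run_expectations(rows: List[Row], expectations: List[Dict[str, Any]]) -> Dict[str, bool]:
    """Evaluate each {"type": ..., "kwargs": ...} entry and collect pass/fail."""
    results: Dict[str, bool] = {}
    for expectation in expectations:
        check = CHECKS[expectation["type"]]
        results[expectation["type"]] = check(rows, **expectation["kwargs"])
    return results
```

Passing two rows that share the same value in column `a` would report the not-null check as passed and the uniqueness check as failed.
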
- Clone the repository.
- Install dependencies:
pip install -r requirements.txt
pip install -e .

To build the distributable wheel package:
python setup.py bdist_wheel

Requirements:
- Python >= 3.8
- PySpark >= 3.0.0
- Great Expectations
Define your data movements in data_movements_{env}.json.
Example with Great Expectations:
{
  "data_movements": [
    {
      "name": "hive-to-hive-voice-data",
      "active": true,
      "source_type": "hive",
      "source_sql": "select '1' as a, '2' as b",
      "destination_type": "hive",
      "destination_mode": "overwrite",
      "destination_table": "default.sample_etl",
      "destination_sql": "SELECT a, b FROM",
      "expectations": [
        {
          "type": "expect_column_values_to_not_be_null",
          "kwargs": {"column": "a"}
        },
        {
          "type": "expect_column_values_to_be_unique",
          "kwargs": {"column": "a"}
        }
      ]
    }
  ]
}

Standard Execution:
python main.py -f flow_name -c spark_config.json

PySpark Submission:
pyspark --master yarn \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.bin.path=/path/to/datamov \
  --conf spark.pyspark.python=/usr/bin/python3

Run unit and integration tests:
pytest tests/

Project structure:
├── datamov
│ ├── connectors
│ │ └── spark_manager
│ └── core
│ ├── config_reader
│ ├── data_flow
│ ├── data_movements
│ ├── data_processor
│ ├── engine
│ ├── logger
│ └── validator
├── scripts
├── tests
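
As an illustration of how the `data_movements_{env}.json` configuration described earlier might be consumed, the following sketch parses the JSON, checks a few keys, and keeps only the entries marked `active`. It is not the tool's actual `config_reader`: the function names, the `REQUIRED_KEYS` tuple, the `dev` environment name, and the `disabled-flow` entry are all assumptions made for this example.

```python
import json
from typing import Any, Dict, List

# Assumed minimal set of keys; the real tool may require more or fewer.
REQUIRED_KEYS = ("name", "source_type", "destination_type")


def config_filename(env: str) -> str:
    """Build the config file name following the data_movements_{env}.json pattern."""
    return f"data_movements_{env}.json"


def load_active_movements(config_text: str) -> List[Dict[str, Any]]:
    """Parse the JSON text and return only movements whose "active" flag is true."""
    config = json.loads(config_text)
    active: List[Dict[str, Any]] = []
    for movement in config.get("data_movements", []):
        missing = [key for key in REQUIRED_KEYS if key not in movement]
        if missing:
            name = movement.get("name", "<unnamed>")
            raise ValueError(f"movement {name!r} is missing keys: {missing}")
        # Treat a missing "active" flag as inactive (an assumption of this sketch).
        if movement.get("active", False):
            active.append(movement)
    return active


# Sample input: the first entry mirrors the README example; the second is
# a hypothetical inactive entry added only to show the filtering.
SAMPLE_CONFIG = """
{
  "data_movements": [
    {"name": "hive-to-hive-voice-data", "active": true,
     "source_type": "hive", "destination_type": "hive"},
    {"name": "disabled-flow", "active": false,
     "source_type": "hive", "destination_type": "parquet"}
  ]
}
"""

if __name__ == "__main__":
    print(config_filename("dev"))
    for movement in load_active_movements(SAMPLE_CONFIG):
        print(movement["name"])
```

Validating keys before filtering means a malformed entry is reported even when it is inactive, which tends to surface configuration mistakes earlier.
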