Classify complaints registered at https://www.consumerfinance.gov/ as malicious (those that require immediate attention) or non-malicious.
- Use PySpark.
- Download the data from the website in parts.
- Convert the data files to Parquet format, since the dataset is large.
- Save the model to an S3 bucket in compressed format.
- Generate a new feature ['diff_in_days'].
- Impute values in ['diff_in_days'] using mean.
- Impute the missing values of ['company_response', 'consumer_consent_provided', 'submitted_via'] with most frequent items.
- Transform ['company_response', 'consumer_consent_provided', 'submitted_via'] using string indexer.
- Transform ['company_response', 'consumer_consent_provided', 'submitted_via'] using one hot encoder.
- Tokenize ['issue'].
- Hash the tokenized words.
- Create the transformed issue column using IDF (inverse document frequency).
- Apply a vector assembler to all transformed columns.
- Apply a standard scaler to the assembled column.
- The transformed file contains only the scaled and assembled columns and the target feature.
- Python
- PySpark
- PySpark ML
- Airflow as Scheduler
- MongoDB
- GCP Compute Engine
- S3 Bucket
- Artifact Registry
- Grafana
- Prometheus
- Node Exporter
- Promtail
- Loki
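
The "download in parts" and "convert to Parquet" steps listed near the top could be sketched roughly as follows. This is only an illustration: the API base URL and its query parameters (`date_received_min`, `date_received_max`, `field`, `format`) are assumptions about the public complaints search API and should be checked against the site, and the helper names `date_windows`, `part_url` and `json_part_to_parquet` are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple
from urllib.parse import urlencode

# Assumption: base URL of the complaints search API; verify before use.
BASE_URL = (
    "https://www.consumerfinance.gov/data-research/"
    "consumer-complaints/search/api/v1/"
)


def date_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive, non-overlapping (from, to) windows covering start..end."""
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


def part_url(from_date: date, to_date: date) -> str:
    """Build the download URL for one date window (parameter names are assumed)."""
    query = urlencode(
        {
            "date_received_min": from_date.isoformat(),
            "date_received_max": to_date.isoformat(),
            "field": "all",
            "format": "json",
        }
    )
    return f"{BASE_URL}?{query}"


def json_part_to_parquet(spark, json_path: str, parquet_dir: str) -> None:
    """Append one downloaded JSON part to a Parquet dataset.

    `spark` is an existing SparkSession. multiLine=True is used on the
    assumption that each part is a single JSON document (for example a
    top-level array) rather than one record per line.
    """
    df = spark.read.json(json_path, multiLine=True)
    df.write.mode("append").parquet(parquet_dir)


if __name__ == "__main__":
    # Print the URLs for ten days of data, split into four-day parts.
    for lo, hi in date_windows(date(2023, 1, 1), date(2023, 1, 10), days=4):
        print(part_url(lo, hi))
```

Splitting by date window keeps each request small and lets a scheduler resume from the last completed window; appending each part to one Parquet directory keeps the downstream read path simple.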
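
The transformation steps listed above (from generating ['diff_in_days'] through scaling) might look roughly like this in PySpark ML. It is a minimal sketch under several assumptions: Spark 3.x (where `OneHotEncoder` accepts multiple columns), source date columns named `date_received` and `date_sent_to_company`, illustrative output column names, and an arbitrary `num_features` for hashing. Because `Imputer` only handles numeric columns, the most-frequent-value fill is done here with a plain group-and-count helper, which is a stand-in and not necessarily how the project implements it.

```python
from typing import List

CATEGORICAL_COLS = ["company_response", "consumer_consent_provided", "submitted_via"]


def suffixed(cols: List[str], suffix: str) -> List[str]:
    """Derive output column names, e.g. 'submitted_via' -> 'submitted_via_idx'."""
    return [f"{c}_{suffix}" for c in cols]


def add_diff_in_days(df, start_col="date_received", end_col="date_sent_to_company"):
    """Add diff_in_days as a double; the source column names are assumptions."""
    from pyspark.sql import functions as F  # imported lazily; needs pyspark installed

    return df.withColumn(
        "diff_in_days",
        F.datediff(F.col(end_col), F.col(start_col)).cast("double"),
    )


def fill_most_frequent(df, cols: List[str]):
    """Replace nulls in each column with that column's most frequent non-null value."""
    from pyspark.sql import functions as F

    fill_values = {}
    for c in cols:
        top = (
            df.where(F.col(c).isNotNull())
            .groupBy(c)
            .count()
            .orderBy(F.desc("count"))
            .first()
        )
        if top is not None:
            fill_values[c] = top[c]
    return df.fillna(fill_values) if fill_values else df


def build_pipeline(num_features: int = 40):
    """Assemble the feature stages; requires an active SparkSession."""
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        IDF, HashingTF, Imputer, OneHotEncoder,
        StandardScaler, StringIndexer, Tokenizer, VectorAssembler,
    )

    idx_cols = suffixed(CATEGORICAL_COLS, "idx")
    ohe_cols = suffixed(CATEGORICAL_COLS, "ohe")

    # Mean imputation for the numeric feature.
    stages = [
        Imputer(strategy="mean", inputCols=["diff_in_days"],
                outputCols=["diff_in_days_imputed"])
    ]
    # One StringIndexer per categorical column, then one multi-column encoder.
    stages += [
        StringIndexer(inputCol=c, outputCol=o, handleInvalid="keep")
        for c, o in zip(CATEGORICAL_COLS, idx_cols)
    ]
    stages += [
        OneHotEncoder(inputCols=idx_cols, outputCols=ohe_cols),
        # Text branch: tokenize, hash term frequencies, reweight with IDF.
        Tokenizer(inputCol="issue", outputCol="issue_words"),
        HashingTF(inputCol="issue_words", outputCol="issue_tf",
                  numFeatures=num_features),
        IDF(inputCol="issue_tf", outputCol="issue_tfidf"),
        VectorAssembler(
            inputCols=["diff_in_days_imputed", *ohe_cols, "issue_tfidf"],
            outputCol="assembled",
        ),
        StandardScaler(inputCol="assembled", outputCol="scaled"),
    ]
    return Pipeline(stages=stages)
```

A possible usage, given a DataFrame `raw_df` and a target column name `target_col` supplied by the caller: apply `add_diff_in_days` and `fill_most_frequent(df, CATEGORICAL_COLS)`, replace null `issue` values with an empty string so the tokenizer does not fail, fit `build_pipeline()`, and then keep only `assembled`, `scaled` and `target_col` from the transformed output.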
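
For the "save the model to an S3 bucket in compressed format" step, one option is to zip the saved model directory and upload the archive. The sketch below assumes the model has already been written to a local directory, and that `boto3` is installed with credentials configured; the function names, bucket and key are hypothetical.

```python
import shutil
from pathlib import Path


def compress_model_dir(model_dir: str, archive_base: str) -> str:
    """Zip a saved model directory and return the path of the created archive.

    `archive_base` is the target path without the '.zip' extension.
    """
    if not Path(model_dir).is_dir():
        raise NotADirectoryError(model_dir)
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)


def upload_to_s3(archive_path: str, bucket: str, key: str) -> None:
    """Upload the archive to S3; bucket and key are chosen by the caller."""
    import boto3  # third-party dependency, imported lazily

    boto3.client("s3").upload_file(archive_path, bucket, key)
```

Zipping first turns the many small files of a saved Spark model into a single object, which is simpler to version and download.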