This is my first project using Airflow. The DAG:
- Gets a sample of the most recent English-language tweets
- Uses Latent Dirichlet Allocation (LDA) to model topics and extract the most significant words / word tokens
- Uploads the result as a JSON file to an S3 bucket
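The "extract most significant words" and "result as JSON" steps above can be sketched as follows. This is a minimal stdlib-only illustration, not the project's actual code: it assumes the LDA output is available as a per-topic word-weight mapping and shows how the top words could be picked and serialized before the S3 upload.

```python
import json

def top_words_per_topic(topic_word_weights, num_top_words=10):
    """Pick the highest-weighted words for each topic.

    `topic_word_weights` maps a topic id to a {word: weight} dict,
    e.g. the rows of an LDA topic-word matrix keyed by vocabulary.
    """
    return {
        topic: [
            word
            for word, _ in sorted(
                weights.items(), key=lambda kv: kv[1], reverse=True
            )[:num_top_words]
        ]
        for topic, weights in topic_word_weights.items()
    }

# Toy example with made-up weights for two topics.
weights = {
    0: {"data": 0.9, "airflow": 0.7, "cat": 0.1},
    1: {"tweet": 0.8, "lda": 0.6, "dog": 0.2},
}
result = top_words_per_topic(weights, num_top_words=2)
payload = json.dumps(result)  # the JSON string that would be uploaded to S3
```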
Prerequisites:
- Twitter developer account
- Existing AWS S3 bucket
Place a `.env` file in `dags/topics/` containing the following environment variables:
- `TWITTER_CONSUMER_KEY`
- `TWITTER_CONSUMER_SECRET`
- `TWITTER_ACCESS_TOKEN`
- `TWITTER_ACCESS_TOKEN_SECRET`
- `TWITTER_SAMPLE_SIZE` (optional)
- `LDA_NUM_TOPICS` (optional)
- `LDA_NUM_TOP_WORDS` (optional)
- `S3_ACCESS_KEY`
- `S3_SECRET_KEY`
- `S3_BUCKET`
- `AIRFLOW_SCHEDULE` (optional)
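Reading these variables might look roughly like the sketch below. It uses only the standard library; the defaults for the optional keys are illustrative assumptions, not the project's actual values.

```python
import os

def load_config():
    """Read the pipeline settings from the environment.

    Required keys raise KeyError if missing; the optional ones
    fall back to defaults (the defaults shown here are made up).
    """
    return {
        "consumer_key": os.environ["TWITTER_CONSUMER_KEY"],
        "consumer_secret": os.environ["TWITTER_CONSUMER_SECRET"],
        "access_token": os.environ["TWITTER_ACCESS_TOKEN"],
        "access_token_secret": os.environ["TWITTER_ACCESS_TOKEN_SECRET"],
        "sample_size": int(os.environ.get("TWITTER_SAMPLE_SIZE", "1000")),
        "num_topics": int(os.environ.get("LDA_NUM_TOPICS", "5")),
        "num_top_words": int(os.environ.get("LDA_NUM_TOP_WORDS", "10")),
        "s3_access_key": os.environ["S3_ACCESS_KEY"],
        "s3_secret_key": os.environ["S3_SECRET_KEY"],
        "s3_bucket": os.environ["S3_BUCKET"],
        "schedule": os.environ.get("AIRFLOW_SCHEDULE", "@daily"),
    }
```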
- Run `pip install -r requirements.txt` first
- Set `AIRFLOW_HOME` to your project path
- Get Airflow up and running:

```shell
airflow initdb
airflow scheduler
airflow webserver -p 8080
```
- Run tasks separately:
```shell
airflow test airflow-topics extract_tweets $(date +%F)
airflow test airflow-topics dump_topics $(date +%F)
airflow test airflow-topics push_results $(date +%F)
```
- Alternatively:
```shell
airflow trigger_dag airflow-topics
```
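For orientation, a DAG wired for these commands could look roughly like this. The task ids match the ones used with `airflow test` above; everything else (callables, start date, schedule) is a placeholder sketch for Airflow 1.x, not the project's actual DAG file.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Placeholder callables -- the real implementations live in dags/topics/.
def extract_tweets(**context):
    """Sample recent English-language tweets via the Twitter API."""

def dump_topics(**context):
    """Fit LDA and dump the top words per topic to a JSON file."""

def push_results(**context):
    """Upload the JSON result to the configured S3 bucket."""

with DAG(
    dag_id="airflow-topics",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # could be read from AIRFLOW_SCHEDULE instead
) as dag:
    extract = PythonOperator(task_id="extract_tweets",
                             python_callable=extract_tweets,
                             provide_context=True)
    topics = PythonOperator(task_id="dump_topics",
                            python_callable=dump_topics,
                            provide_context=True)
    push = PythonOperator(task_id="push_results",
                          python_callable=push_results,
                          provide_context=True)

    extract >> topics >> push  # run the three tasks in sequence
```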