This project is a serverless sentiment analysis data pipeline that collects posts from Reddit and Steam game reviews, analyzes sentiment, and stores results in AWS.
- Data Sources: Reddit posts and Steam game reviews
- Pipeline: AWS Step Functions orchestrating Lambda functions
- Storage: Kinesis Firehose → S3 → Iceberg tables in AWS Glue
- Schedule: Runs every 12 hours via EventBridge Scheduler
- Final Output: `sentiment` table in the `sentimental` Glue database
- AWS CLI configured with appropriate credentials
- AWS SAM CLI installed
- Python 3.12+
- Required Python packages: `tomli`, `boto3`
- An S3 bucket for deployment artifacts
- Reddit API credentials (Client ID and Secret)
- OpenAI API key (for sentiment analysis)
Ensure your AWS credentials are configured:
```bash
aws configure --profile <your-profile-name>
```

Copy the example configuration and fill in your parameters:

```bash
cp samconfig.toml.example samconfig.toml
```

Edit `samconfig.toml` with your values:
- `BucketName`: S3 bucket for storing data
- `RedditClientId`: Your Reddit API client ID
- `RedditClientSecret`: Your Reddit API client secret
- `OpenAIApiKey`: Your OpenAI API key (optional)
- AWS profile and region settings
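For reference, a filled-in `samconfig.toml` might look like the sketch below. The stack name, region, and all values are placeholders; only the parameter names come from the list above:

```toml
version = 0.1

[default.deploy.parameters]
stack_name = "sentiment-pipeline"   # placeholder stack name
profile = "your-profile-name"
region = "us-east-1"
parameter_overrides = "BucketName=my-sentiment-bucket RedditClientId=xxxx RedditClientSecret=xxxx OpenAIApiKey=sk-xxxx"
```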
Before deploying the SAM template, create the required Iceberg tables in Athena:
```bash
cd sentiment_getter/scripts
python create_iceberg_table.py
cd ../..
```

This script creates two tables:

- `post`: Stores raw posts from Reddit and Steam
- `sentiment`: Stores sentiment analysis results
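For reference, the DDL the script issues likely resembles the sketch below. The column set shown here is an illustrative assumption, not the script's actual schema; only the database name (`sentimental`) and the Iceberg table type come from this document:

```python
# Sketch of an Athena Iceberg CREATE TABLE statement. Column names are
# assumptions for illustration; check create_iceberg_table.py for the real schema.
def iceberg_ddl(database: str, table: str, location: str) -> str:
    """Build a CREATE TABLE statement for an Athena-managed Iceberg table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} (\n"
        "    id string,\n"
        "    source string,\n"
        "    body string,\n"
        "    created_at timestamp\n"
        ")\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )

ddl = iceberg_ddl("sentimental", "post", "s3://my-bucket/iceberg/post/")
```

The `TBLPROPERTIES ('table_type' = 'ICEBERG')` clause is what tells Athena to manage the table in Iceberg format rather than as a plain Glue table.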
Build the SAM application:
```bash
sam build
```

Deploy to AWS:

```bash
sam deploy
```

For first-time deployment, use guided mode:

```bash
sam deploy --guided
```

Create a keywords configuration file at `s3://<your-bucket>/config/keywords_config.json`:

```json
{
  "source": {
    "reddit": [
      {
        "keyword": "example_game",
        "subreddits": ["gaming"],
        "sort": "top",
        "time_filter": "day",
        "post_limit": 10
      }
    ],
    "steam": [
      {
        "keyword": "example_game",
        "sort": "top",
        "post_limit": 10
      }
    ]
  }
}
```

The pipeline is configured to run every 12 hours. To modify the schedule, edit the `ScheduleExpression` in `template.yaml`:

```yaml
ScheduleExpression: rate(12 hours)  # or use cron: cron(0 0,12 * * ? *)
```

To disable automatic execution, set `State: DISABLED` in the template.
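Before uploading the keywords file, you can sanity-check it locally. The required fields below mirror the example configuration above; the validation itself is a convenience sketch, not something the pipeline enforces:

```python
import json

# Fields each source entry carries in the example configuration above.
REQUIRED = {
    "reddit": {"keyword", "subreddits", "sort", "time_filter", "post_limit"},
    "steam": {"keyword", "sort", "post_limit"},
}

def validate_keywords_config(raw: str) -> list[str]:
    """Return a list of problems found in a keywords_config.json document."""
    problems = []
    sources = json.loads(raw).get("source", {})
    for name, required in REQUIRED.items():
        for i, entry in enumerate(sources.get(name, [])):
            missing = required - entry.keys()
            if missing:
                problems.append(f"{name}[{i}] missing: {sorted(missing)}")
    return problems

example = {
    "source": {
        "reddit": [{"keyword": "example_game", "subreddits": ["gaming"],
                    "sort": "top", "time_filter": "day", "post_limit": 10}],
        "steam": [{"keyword": "example_game", "sort": "top", "post_limit": 10}],
    }
}
assert validate_keywords_config(json.dumps(example)) == []
```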
To manually trigger the pipeline:
```bash
aws stepfunctions start-execution \
  --state-machine-arn <KeywordAPIPipelineArn> \
  --input '{}'
```

The ARN is available in the stack outputs after deployment.
- CloudWatch Logs: All Lambda functions and Step Functions log to CloudWatch
- Step Functions Console: Monitor pipeline executions and view execution history
- Athena: Query the
sentimenttable for analysis results - SNS Notifications: Error notifications sent to the configured SNS topic
Query sentiment data using Athena:
```sql
SELECT * FROM sentimental.sentiment
WHERE created_at >= current_date - interval '7' day
ORDER BY created_at DESC;
```
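Queries like the one above can also be parameterized and built from Python before handing them to Athena (for example via boto3's `start_query_execution`). This helper only builds the SQL string; running it still requires Athena access and a query result location:

```python
def recent_sentiment_query(days: int, database: str = "sentimental") -> str:
    """Build the Athena SQL for sentiment rows from the last `days` days."""
    if days < 1:
        raise ValueError("days must be a positive number of days")
    return (
        f"SELECT * FROM {database}.sentiment "
        f"WHERE created_at >= current_date - interval '{days}' day "
        "ORDER BY created_at DESC"
    )

# Pass the result to boto3's athena client, e.g.
# boto3.client("athena").start_query_execution(QueryString=recent_sentiment_query(7), ...)
```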