simon-milata/data-tech-stats

Data Tech Stats

Live Demo: https://dts.simonmilata.com/

A serverless app that collects, aggregates, and visualizes historical GitHub statistics for data-related repositories. The project focuses on end-to-end API design, serverless architecture, and cost-constrained cloud deployment.

Project goals

  • Deploy a public-facing API using FastAPI and AWS API Gateway.
  • Build an interactive frontend to visualize historical trends.
  • Implement a simple but realistic data ingestion and aggregation pipeline.
  • Keep the entire system free or as close to free as possible.
  • Understand architectural tradeoffs in a small production system.

High-level architecture

ETL pipeline

(EventBridge → Lambda → GitHub APIs → S3 → aggregation Lambda → S3)

[Diagram: ETL pipeline (dts-etl)]

  • Scheduled Extract Lambda fetches GitHub data
  • Raw snapshots are stored in S3 (partitioned by date)
  • Scheduled Aggregation Lambda produces weekly / monthly datasets
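
The partitioning and weekly roll-up described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the key layout (`raw/year=.../month=.../day=.../repos.json`), the function names, and the "keep the latest snapshot of each ISO week" rule are all assumptions made for the example.

```python
from datetime import date


def raw_snapshot_key(day: date, prefix: str = "raw") -> str:
    """Build a date-partitioned S3 key for one day's raw snapshot.

    The layout is hypothetical; the README only says snapshots are
    partitioned by date.
    """
    return f"{prefix}/year={day:%Y}/month={day:%m}/day={day:%d}/repos.json"


def weekly_latest(snapshots: dict[date, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Reduce daily snapshots to one entry per ISO week.

    Assumed rule: the latest available day in a week represents that week.
    """
    latest_day: dict[str, date] = {}
    for day in snapshots:
        iso_year, iso_week, _ = day.isocalendar()
        week = f"{iso_year}-W{iso_week:02d}"
        if week not in latest_day or day > latest_day[week]:
            latest_day[week] = day
    # Sort by week label so the output is in chronological order.
    return {week: snapshots[day] for week, day in sorted(latest_day.items())}
```

A monthly dataset could be produced the same way by grouping on `(day.year, day.month)` instead of the ISO week.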

API architecture

(Client → Cloudflare → API Gateway → Lambda → S3)

[Diagram: API architecture (dts-api)]

  • Cloudflare handles DNS, SSL termination, and basic protection
  • API Gateway routes requests to API Lambda
  • FastAPI runs inside Lambda using Mangum
  • API Lambda reads pre-aggregated data from S3
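
To make the read path concrete, here is a dependency-free sketch of a Lambda handler that serves pre-aggregated JSON. It deliberately uses a plain handler function instead of FastAPI and Mangum so it runs with the standard library only. The `interval` query parameter, the `aggregated/<interval>.json` key layout, and the injected `read_object` function are assumptions for illustration, not the project's real routes or storage layout.

```python
import json
from typing import Callable, Optional

ALLOWED_INTERVALS = ("weekly", "monthly")


def handler(event: dict, read_object: Callable[[str], bytes]) -> dict:
    """Answer an API Gateway proxy-style event from pre-aggregated data.

    `read_object` is a hypothetical callable that returns the bytes stored
    under a key. In a deployment it would typically wrap an S3 read such as
    boto3's s3.get_object(Bucket=..., Key=key)["Body"].read().
    """
    params: Optional[dict] = event.get("queryStringParameters")
    interval = (params or {}).get("interval", "weekly")
    if interval not in ALLOWED_INTERVALS:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "interval must be weekly or monthly"}),
        }
    # No aggregation happens at request time; the file is served as stored.
    data = json.loads(read_object(f"aggregated/{interval}.json"))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(data),
    }
```

Because the aggregation already happened in the ETL pipeline, the request path is a single object read plus JSON serialization, which keeps Lambda execution time short.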

Architecture Reasoning

  • Serverless (Lambda + API Gateway): Chosen for scale-to-zero capabilities. With only tens of daily invocations, a dedicated server would sit 99% idle; Lambda incurs zero cost when inactive.
  • S3 as Data Store: The data follows a "Write-Once-Read-Many" pattern, so S3 (about $0.023 per GB-month) is significantly cheaper than running and maintaining a database.
  • Cloudflare: Acts as the entry point for DNS, SSL, and basic bot protection. This allows the project to bypass AWS Route 53 hosted zone fees ($0.50/mo).

Cost Model & Predictions

  • Compute (Lambda): Monthly usage is ~8,610 GB-s, which is <3% of the 400,000 GB-s free monthly allowance.
  • Storage (S3): Accumulating ~1 MB/day (raw snapshots + aggregates). Even as the dataset grows, the storage cost is estimated to stay at a few cents for the first few years.
  • API Gateway: At current volumes (~1,800 requests/month), the cost is estimated at <$0.002/month.
  • Cloudflare: Using the Free Tier for DNS and SSL termination to maintain a total recurring cost of exactly $0.00.

While S3 storage and API calls technically accrue a few cents as the dataset grows, AWS typically waives these, resulting in a net cost of $0.00.
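
The figures above can be checked with a few lines of arithmetic. The sketch below takes the usage numbers and the S3 and Lambda allowances from this README; the API Gateway rate of $1.00 per million requests is an assumption (it matches the <$0.002 estimate, but the README does not state which API type or region is used), and free-tier offsets for S3 and API Gateway are ignored.

```python
LAMBDA_FREE_GB_SECONDS = 400_000       # monthly free allowance (from the README)
S3_USD_PER_GB_MONTH = 0.023            # storage price (from the README)
API_GW_USD_PER_MILLION = 1.00          # ASSUMED request price, not stated in the README


def lambda_free_tier_fraction(gb_seconds: float) -> float:
    """Share of the monthly Lambda free allowance that is consumed."""
    return gb_seconds / LAMBDA_FREE_GB_SECONDS


def api_gateway_monthly_cost(requests: int) -> float:
    """Request charges in USD for one month at the assumed rate."""
    return requests / 1_000_000 * API_GW_USD_PER_MILLION


def s3_monthly_cost(days_collected: int, mb_per_day: float = 1.0) -> float:
    """Monthly storage charge in USD once `days_collected` days are stored."""
    stored_gb = days_collected * mb_per_day / 1024
    return stored_gb * S3_USD_PER_GB_MONTH


print(f"Lambda free tier used: {lambda_free_tier_fraction(8_610):.2%}")
print(f"API Gateway per month: ${api_gateway_monthly_cost(1_800):.4f}")
print(f"S3 per month after 3 years: ${s3_monthly_cost(3 * 365):.4f}")
```

Under these assumptions, Lambda usage stays a little above 2% of the free allowance, and the storage charge after three years is still only a few cents per month.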

Tech stack

Backend & API

Data Engineering

  

AWS Infrastructure

    

Edge & DNS

Scope & limitations

  • Not designed for high write volume or real-time updates
  • No database (by design)
  • Focused on clarity and cost efficiency over scale
  • Data collection began on deployment; historical trends are built moving forward.

Frontend

This is a backend-centric project. I used AI to build the UI so I could focus entirely on the data engineering, serverless architecture, and API logic.
