From e01df1b24ddf2cfbb0c942a871ba427517872250 Mon Sep 17 00:00:00 2001
From: dzhelezov
Date: Thu, 16 Jan 2025 16:00:12 +0100
Subject: [PATCH] initial RFC for datalake for squids

---
 datalake/rfc.md | 349 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 349 insertions(+)
 create mode 100644 datalake/rfc.md

diff --git a/datalake/rfc.md b/datalake/rfc.md
new file mode 100644
index 0000000..aed1c2f
--- /dev/null
+++ b/datalake/rfc.md
@@ -0,0 +1,349 @@
# RFC: Community-Driven Blockchain Data Lake

**Date**: 2025-01-16
**Version**: 1.0 (Draft)
**Authors**: [Your Name], [Team Members], & Community Contributors

---

## Table of Contents
1. [Overview](#1-overview)
2. [Scope & Goals](#2-scope--goals)
3. [High-Level Architecture](#3-high-level-architecture)
4. [Data Submission & Governance](#4-data-submission--governance)
5. [Metadata Format](#5-metadata-format)
6. [Catalog (Directory) Service](#6-catalog-directory-service)
7. [Data Layout & Conventions](#7-data-layout--conventions)
8. [Real-Time vs. Historical Data Flow](#8-real-time-vs-historical-data-flow)
9. [AI Agent Usage Scenarios](#9-ai-agent-usage-scenarios)
10. [Security & Access Control](#10-security--access-control)
11. [Implementation Plan](#11-implementation-plan)
12. [Open Questions & Next Steps](#12-open-questions--next-steps)

---

## 1. Overview
This document specifies the design for a **community-driven data lake** containing **blockchain data** (e.g., DeFi events, token trades, NFT transfers, wallet balances). The goals are:

- **Ease of Contribution**: Community members can publish datasets without complex infrastructure.
- **Discoverability**: A central *catalog service* helps AI agents or human users find relevant datasets.
- **Lightweight Querying**: No heavy clusters (Spark, Trino, etc.) required; users or agents can query Parquet files directly on object storage (e.g., Cloudflare R2 or AWS S3), plus optional real-time feeds. A query sketch at the end of [Section 2](#2-scope--goals) illustrates this.
- **Append-Only Model**: Blockchain data is inherently append-only, which simplifies ingestion and versioning.

---

## 2. Scope & Goals

### 2.1 In-Scope
1. **Append-Only Storage**
   - Storing Parquet files with minimal partitioning.
   - “Historical” datasets that grow over time.

2. **Lightweight Metadata Standard**
   - JSON-based schema for describing each dataset.
   - Core fields: name, owner, schema, license, etc.

3. **Catalog/Directory Service**
   - An API or repository for dataset discovery.
   - Basic search (by dataset name, tags, chain, etc.).

4. **Community Contribution Workflow**
   - Guidelines for how contributors submit new datasets or updates.
   - Automated validation of metadata (CI pipelines, etc.).

5. **Basic Real-Time Integration**
   - Optional local or “script-based” indexing for near-live data.
   - Minimal overhead, no large streaming cluster.

### 2.2 Out of Scope
1. **Full Transactional Lakehouse**
   - E.g., Iceberg or Delta Lake with ACID transactions.

2. **Advanced Security/Entitlements**
   - Row-level or column-level security.

3. **Sophisticated Catalog Features**
   - Complex lineage, data masking, cross-dataset joins via a single interface.

4. **Large ETL Platforms**
   - Airflow, Spark streaming, etc.
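
To make the “Lightweight Querying” goal concrete, here is a minimal sketch of how a consumer might read a contributed dataset straight from object storage using DuckDB's Python API. The bucket, endpoint, credentials, and dataset layout are illustrative placeholders borrowed from the examples later in this RFC, not a finalized convention.

```python
# Sketch: query a community dataset directly from object storage with DuckDB.
# The `community-lake` bucket and `dex_swaps` layout are the hypothetical examples
# used throughout this RFC; the endpoint and credentials are placeholders.
import duckdb

con = duckdb.connect()                 # in-memory database, no server required
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Cloudflare R2 is S3-compatible, so it is reachable through DuckDB's s3:// support.
con.execute("SET s3_endpoint = '<account-id>.r2.cloudflarestorage.com';")
con.execute("SET s3_access_key_id = '<key-id>';")
con.execute("SET s3_secret_access_key = '<secret>';")

# Aggregate yesterday's swaps straight from the Parquet partitions.
result = con.sql("""
    SELECT chain, count(*) AS swaps, sum(amount_in) AS total_amount_in
    FROM read_parquet('s3://community-lake/dex_swaps/chain=*/date=*/*.parquet',
                      hive_partitioning = true)
    WHERE CAST(date AS DATE) >= current_date - 1
    GROUP BY chain
""").df()
print(result)
```

Postgres users could achieve the same with a Parquet foreign data wrapper; the point is that no dedicated query cluster sits between the consumer and the data.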

---

## 3. High-Level Architecture

```
            ┌───────────────────────────┐
            │   Catalog/Directory API   │
            │   (stores dataset meta)   │
            └─────────────┬─────────────┘
                          │
┌─────────────────────────▼─────────────────────────┐
│                   Object Storage                   │
│  ┌───────────────────────┬───────────────────────┐│
│  │ Contributed datasets  │ Partitioned by chain, ││
│  │ (Parquet files        │ date, etc.            ││
│  │  & metadata)          │                       ││
│  └───────────────────────┴───────────────────────┘│
└─────────────────────────┬─────────────────────────┘
                          │
┌─────────────────────────▼─────────────────────────┐
│             Consumer Tools & AI Agents             │
│  - DuckDB or Postgres FDW reading Parquet          │
│  - Real-time "local" DB for fresh data             │
└───────────────────────────────────────────────────┘
```

1. **Object Storage**: The main repository of historical blockchain datasets (Parquet).
2. **Catalog Service**: Stores & indexes metadata describing each dataset (name, schema, object path, etc.).
3. **Community Contributions**: Users submit new datasets or updates via a workflow (e.g., GitHub PR or REST API).
4. **Query**: AI agents or devs use DuckDB/Postgres FDW to read data *directly* from the object store.
5. **Real-Time**: An optional small pipeline writes near-live data to a local DB, which is queryable from the same environment.

---

## 4. Data Submission & Governance

### 4.1 Submission Workflow
1. **Contributor Creates Metadata**
   - Follows the standard JSON schema (see [Section 5](#5-metadata-format)).

2. **Contributor Publishes Data to Object Storage**
   - Places Parquet files in an agreed sub-path, e.g. `r2://community-lake/dex_swaps/chain=.../date=...` or `s3://community-lake/...`.

3. **Registration**
   - The contributor either:
     - **Option A**: Submits a Pull Request to a GitHub repository that stores the metadata, or
     - **Option B**: Calls a `POST /datasets` endpoint on the Catalog Service with the metadata JSON.

4. **Review & Validation**
   - Automated checks ensure required fields are present and no naming conflicts arise.
   - Maintainers approve or reject with feedback.

5. **Catalog Entry Created/Updated**
   - The dataset is now discoverable via `GET /datasets`.

### 4.2 Community Governance
- **Versioning**: Datasets can increment versions if the schema changes.
- **Owner & Maintainers**: The original submitter plus optional co-maintainers.
- **Open Review**: The community can comment or submit issues if data is incomplete or incorrect.

---

## 5. Metadata Format
We propose a **JSON-based** structure, heavily inspired by **Frictionless Data’s Table Schema** or defined as a lightweight custom schema. **Key fields**:

```json
{
  "name": "dex_swaps",
  "title": "DEX Swaps (Ethereum, BSC)",
  "description": "Community-contributed dataset of all DEX swap events.",
  "owner": "0x123abc",
  "license": "CC-BY-4.0",
  "version": "2025-01-16",
  "keywords": ["defi", "swap", "ethereum", "bsc"],
  "resources": [
    {
      "path": "r2://community-lake/dex_swaps/",
      "schema": {
        "fields": [
          {
            "name": "tx_hash",
            "type": "string"
          },
          {
            "name": "chain",
            "type": "string"
          },
          {
            "name": "timestamp",
            "type": "datetime"
          },
          {
            "name": "token_in",
            "type": "string"
          },
          {
            "name": "amount_in",
            "type": "decimal"
          }
        ],
        "primaryKey": ["tx_hash"]
      },
      "partitions": ["chain", "date"],
      "update_frequency": "daily"
    }
  ]
}
```

1. **`name`**: Unique slug for the dataset.
2. **`title`/`description`**: Human-friendly display fields.
3. **`owner`**: On-chain address or user identity.
4. **`license`**: E.g., CC-BY, MIT, etc.
5. **`keywords`**: For search.
6. **`resources`**: Each resource points to a data “path” plus a schema describing its fields.

**Validation**:
- A small tool or CI check ensures required fields exist and schema fields match known data types.
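
The validation step above can be a very small script. Below is a non-normative sketch; the required-field list and type whitelist are assumptions for illustration and would be finalized together with the metadata schema. The same check could run in CI on every pull request and behind the `POST /datasets` endpoint.

```python
# Sketch of a metadata validator. REQUIRED_FIELDS and ALLOWED_TYPES are
# illustrative assumptions, not a finalized specification.
import json
import sys

REQUIRED_FIELDS = {"name", "owner", "license", "version", "resources"}
ALLOWED_TYPES = {"string", "integer", "number", "decimal", "boolean", "date", "datetime"}

def validate_metadata(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file passes."""
    with open(path) as f:
        meta = json.load(f)

    problems = [f"missing required field: {field}"
                for field in sorted(REQUIRED_FIELDS) if field not in meta]

    for i, resource in enumerate(meta.get("resources", [])):
        if "path" not in resource:
            problems.append(f"resources[{i}] has no 'path'")
        for field in resource.get("schema", {}).get("fields", []):
            if field.get("type") not in ALLOWED_TYPES:
                problems.append(
                    f"resources[{i}] field {field.get('name')!r} "
                    f"has unknown type {field.get('type')!r}"
                )
    return problems

if __name__ == "__main__":
    issues = validate_metadata(sys.argv[1])
    for issue in issues:
        print(f"ERROR: {issue}")
    sys.exit(1 if issues else 0)
```

In a GitHub-based flow this would run as a PR check; in the REST flow the catalog service would apply the same rules before accepting a submission.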

---

## 6. Catalog (Directory) Service

### 6.1 Functionality
- **CRUD for Metadata**:
  - `POST /datasets` – register a new dataset
  - `GET /datasets` – list/search datasets
  - `GET /datasets/{name}` – fetch details for a specific dataset
  - `PUT /datasets/{name}` – update dataset metadata (e.g., a new version)

- **Search**:
  - Filter by keywords, chain, date, or partial name match.

### 6.2 Implementation Options
1. **Lightweight DB + REST**
   - A small Postgres or SQLite database plus a Node/Flask/FastAPI service.
   - Easiest to implement; well-known patterns.

2. **GitHub-Based**
   - All metadata lives in a central repository.
   - A static site or simple service reads from the repo and provides a JSON/GraphQL endpoint for search.

**Recommendation**: Start with a simple REST API service (Option 1). We can evolve to more advanced catalogs if needed. A non-normative sketch of such a service follows below.
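
The sketch below shows what the recommended Option 1 service could look like using FastAPI, with an in-memory dictionary standing in for the Postgres database. The endpoint shapes follow Section 6.1, the model mirrors the metadata fields from Section 5, and everything else is an assumption for illustration.

```python
# Sketch of the Option 1 catalog service (FastAPI; in-memory dict as a stand-in
# for Postgres). Field names mirror the metadata format in Section 5.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Community Data Lake Catalog")

class Dataset(BaseModel):
    name: str
    title: str = ""
    description: str = ""
    owner: str
    license: str
    version: str
    keywords: list[str] = []
    resources: list[dict] = []

# Registry keyed by dataset name; Phase 1 would back this with Postgres.
REGISTRY: dict[str, Dataset] = {}

@app.post("/datasets", status_code=201)
def register_dataset(dataset: Dataset) -> Dataset:
    if dataset.name in REGISTRY:
        raise HTTPException(status_code=409, detail="dataset name already registered")
    REGISTRY[dataset.name] = dataset
    return dataset

@app.get("/datasets")
def list_datasets(keyword: str | None = None) -> list[Dataset]:
    datasets = list(REGISTRY.values())
    if keyword:
        datasets = [d for d in datasets if keyword in d.keywords or keyword in d.name]
    return datasets

@app.get("/datasets/{name}")
def get_dataset(name: str) -> Dataset:
    if name not in REGISTRY:
        raise HTTPException(status_code=404, detail="dataset not found")
    return REGISTRY[name]

@app.put("/datasets/{name}")
def update_dataset(name: str, dataset: Dataset) -> Dataset:
    if name not in REGISTRY:
        raise HTTPException(status_code=404, detail="dataset not found")
    REGISTRY[name] = dataset
    return dataset
```

Saved as `catalog.py`, this runs with `uvicorn catalog:app --reload`; swapping the dictionary for Postgres and adding authentication are the Phase 1 follow-ups.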

---

## 7. Data Layout & Conventions

1. **Object Storage Bucket/Namespace Structure**
   - `r2://community-lake/{dataset_name}/[partition_key1=value1]/...[file.parquet]`
   - Example: `r2://community-lake/dex_swaps/chain=ethereum/date=2025-01-01/part-000.parquet`

2. **Partitions**
   - Typically by `chain`, `date`, or `block_range`.
   - Contributors should define them in their metadata’s `partitions` array.

3. **File Format**
   - **Parquet** is recommended for columnar analytics.
   - CSV/JSON are allowed but discouraged for large datasets.

---

## 8. Real-Time vs. Historical Data Flow

### 8.1 Historical (Append-Only)
- **Daily or hourly** Parquet snapshots appended to object storage.
- Typically derived from blockchain event logs or subgraphs.
- Minimal overhead; easy to query with DuckDB.

### 8.2 Real-Time (Optional)
- **Local Script/Repo**: The contributor might maintain a small “listener” script for near-live data that writes to a local DB (DuckDB or Postgres); a non-normative sketch is included in the appendix at the end of this document.
- The metadata could point to instructions or a Git repo link.
- **Users** or **AI agents** can clone and run it if they need real-time data.

---

## 9. AI Agent Usage Scenarios

1. **Agent Receives a Query**
   - E.g., “Fetch the total volume of ETH->USDC swaps in the last 24h.”

2. **Agent Calls the Catalog**
   - Finds the `dex_swaps` dataset and sees its storage path and schema.

3. **Agent Launches DuckDB**
   - `SELECT * FROM parquet_scan('r2://community-lake/dex_swaps/chain=ethereum/*.parquet') WHERE date >= current_date - 1;`

4. **Agent Provides the Answer**
   - The explicit schema in the metadata minimizes hallucinations.

---

## 10. Security & Access Control

1. **Public Read**
   - By default, all datasets are publicly readable if they have an open license.
   - Bucket/namespace policy: `GetObject` allowed for everyone.

2. **Write Permissions**
   - Only owners or verified contributors can upload new data to their dataset path.
   - The catalog service requires auth (API key, OAuth, etc.) to register or update datasets.

3. **Optional Private Datasets**
   - Could be done with restricted ACLs or a separate private bucket.
   - Not a primary focus for the initial launch.

---

## 11. Implementation Plan

### 11.1 Phase 1: MVP
1. **Repo Setup**
   - Create a GitHub repo: `community-blockchain-datalake`
   - Define the minimal JSON metadata schema and add examples.

2. **Catalog Service**
   - Simple Node.js/FastAPI service + Postgres DB.
   - Endpoints: `GET /datasets`, `GET /datasets/{id}`, `POST /datasets`, etc.

3. **Object Storage Configuration**
   - `community-lake` (public read), with partitions for the initial bootstrapped data.

4. **Bootstrap Datasets**
   - Add 2–3 example datasets (e.g., `dex_swaps`, `nft_transfers`, `token_prices`) to seed the lake.
   - Provide sample queries in the repo’s README.

5. **CI Validation**
   - A small script checks new metadata JSON for required fields upon PR or `POST /datasets`.

6. **Documentation**
   - Outline how contributors can create metadata, upload data, and register it in the catalog.

### 11.2 Phase 2: Enhancements
1. **Web UI**
   - A simple front end for browsing datasets (optional but user-friendly).

2. **Search & Filters**
   - Expand catalog queries (e.g., filter by chain or tags).

3. **Real-Time Integration Examples**
   - Provide reference code for a local listener that writes live data to a DuckDB table (a sketch is included in the appendix below).

4. **Advanced Metadata**
   - Stats (row counts, min/max block number), versioning.

5. **Community Governance Features**
   - Upvotes, star ratings, or “verified data” badges.

---

## 12. Open Questions & Next Steps

1. **What Should the Baseline Metadata Fields Be?**
   - Finalize required vs. optional fields in the JSON schema.

2. **Submission Workflow**
   - Decide between a purely GitHub-based flow and the REST API (or support both).

3. **Licensing Model**
   - Confirm the default license (e.g., CC0, CC-BY) for contributed data.

4. **Object Storage Costs & Funding**
   - Who pays for storage if community contributions grow large?

5. **Support for Non-EVM Chains**
   - Should the partitions always assume EVM-based addresses, or do we want to handle Solana, Aptos, etc.?

---

# Conclusion
This RFC defines a **community-driven** blockchain data lake architecture with a **metadata-driven catalog service** at its core. By adopting a **simple JSON format**, **append-only object storage (Parquet)**, and a **lightweight submission workflow**, we can rapidly onboard community datasets. AI agents and human analysts alike will benefit from consistent, discoverable data, without heavy infrastructure.

**Action**:
- Please review and provide feedback.
- Once finalized, the development team can begin Phase 1 implementation immediately.

**Contact**:
- [Your Name / Team] – _Slack/GitHub handle_
- [Project Repo or Issue Tracker Link] – for ongoing collaboration and discussion.
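
---

## Appendix (Non-Normative): Real-Time Listener Sketch

Sections 8.2 and 11.2 mention reference code for a local “listener” that keeps a fresh table next to the historical Parquet data. The sketch below polls an EVM JSON-RPC endpoint and appends block headers to a local DuckDB file; the RPC URL, table name, and polling interval are placeholders, and a real listener would track dataset-specific events (swaps, transfers, etc.) rather than bare headers.

```python
# Non-normative sketch of a local listener: poll an EVM JSON-RPC endpoint and
# append block headers to a local DuckDB table. All names and the endpoint
# are placeholders, not conventions defined by this RFC.
import time

import duckdb
import requests

RPC_URL = "https://example-rpc.invalid"   # placeholder endpoint
POLL_SECONDS = 5

con = duckdb.connect("live_blocks.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS live_blocks (
        number BIGINT PRIMARY KEY,
        hash   VARCHAR,
        ts     TIMESTAMP
    )
""")

def rpc(method: str, params: list):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    return requests.post(RPC_URL, json=payload, timeout=10).json()["result"]

last_seen = -1
while True:
    head = int(rpc("eth_blockNumber", []), 16)
    new_blocks = range(last_seen + 1, head + 1) if last_seen >= 0 else [head]
    for number in new_blocks:
        block = rpc("eth_getBlockByNumber", [hex(number), False])
        con.execute(
            "INSERT OR IGNORE INTO live_blocks VALUES (?, ?, to_timestamp(?))",
            [number, block["hash"], int(block["timestamp"], 16)],
        )
    last_seen = head
    time.sleep(POLL_SECONDS)
```

The resulting file can be attached from DuckDB (`ATTACH 'live_blocks.duckdb'`) and joined against the historical Parquet data, matching the “real-time local DB” consumer shown in the architecture diagram.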