diff --git a/apps/git-second-brain/README.md b/apps/git-second-brain/README.md
new file mode 100644
index 00000000..9bebfc7e
--- /dev/null
+++ b/apps/git-second-brain/README.md
@@ -0,0 +1,93 @@
+# Git Second Brain
+
+A RAG (Retrieval-Augmented Generation) application that lets you ask
+natural-language questions about **any Git repository** by analysing its
+commit history. The included example uses the **FastAPI** open-source project.
+
+Commits are embedded as vectors and stored in **Oracle AI Database 26ai**.
+At query time the most relevant commits are retrieved via `VECTOR_DISTANCE`
+and passed as context to an OpenAI model through **LangChain**, producing
+grounded answers with commit citations.
+
+## Project structure
+
+```
+git-second-brain/
+├── database/            # SQL scripts: user creation + schema setup
+├── data-loader/         # One-time ETL: parse commits, embed, load into Oracle 26ai
+├── app/                 # Streamlit chat UI + LangChain RAG chain
+├── diffs/               # Pre-extracted per-commit diff files
+└── fastapi_commits.txt  # Delimited commit metadata
+```
+
+| Folder           | Purpose                                                                                                                                                                                     | Details                                        |
+| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------- |
+| **database/**    | SQL scripts to create the Oracle user, table, indexes, and (optionally) the vector index.                                                                                                   | [database/README.md](database/README.md)       |
+| **data-loader/** | Reads the extracted commit metadata and diff files, generates 384-dim vector embeddings with `sentence-transformers`, and bulk-inserts everything into Oracle 26ai.                         | [data-loader/README.md](data-loader/README.md) |
+| **app/**         | Streamlit chat interface where users ask questions. A custom LangChain retriever queries Oracle 26ai vector search, and the retrieved commits are sent to OpenAI to generate a cited answer. | [app/README.md](app/README.md)                 |
+
+## Extracting repo data
+
+The examples below use **FastAPI**, but this works with **any Git repository**.
+
+```bash
+# Clone the target repo
+git clone https://github.com/tiangolo/fastapi.git
+mkdir diffs
+cd fastapi
+
+# Extract commit metadata with safe delimiters
+git log --all --no-merges \
+    --pretty=format:"<<<COMMIT>>>%n%H%n%an%n%aI%n%s%n<<<BODY>>>%n%b%n<<<END>>>%n" \
+    > ../fastapi_commits.txt
+
+# Extract diff stats as a single file
+git log --all --no-merges \
+    --pretty=format:"===SHA:%H===" --stat \
+    > ../diffs/all_diffs.txt
+
+cd ..
+```
+
+> **Tip:** The data loader caps at 3 000 commits by default, which keeps
+> indexing time under 10 minutes and still covers the most recent years of
+> FastAPI's history.
+
+## Prerequisites
+
+- Python 3.10+
+- Oracle AI Database 26ai (running and accessible)
+- OpenAI API key (for the chat app)
+
+## Quick start
+
+> **Important:** Load the environment variables from each folder's `.env` file
+> before running Python scripts. See each folder's README for details.
+
+```bash
+# 0. Set up the database
+cd database
+sqlplus system/Welcome_123@//localhost:1521/FREEPDB1 @01_create_user.sql
+sqlplus system/Welcome_123@//localhost:1521/FREEPDB1 @02_create_schema.sql
+cd ..
+
+# 1. Extract repo data (see "Extracting repo data" above)
+
+# 2. Load data into Oracle 26ai
+cd data-loader
+python -m venv .venv && .venv\Scripts\activate   # or source .venv/bin/activate
+pip install -r requirements.txt
+cp .env.example .env   # fill in your Oracle credentials
+# load env vars, then:
+python load_data.py
+cd ..
+
+# 3. 
Run the app +cd app +python -m venv .venv && .venv\Scripts\activate +pip install -r requirements.txt +cp .env.example .env # fill in Oracle + OpenAI credentials +# load env vars, then: +streamlit run app.py +``` + +See each folder's README for full setup and configuration details. diff --git a/apps/git-second-brain/app/.env.example b/apps/git-second-brain/app/.env.example new file mode 100644 index 00000000..8d6c6570 --- /dev/null +++ b/apps/git-second-brain/app/.env.example @@ -0,0 +1,7 @@ +# Oracle AI Database 26ai connection +ORACLE_USER=GITHUB_SECOND_BRAIN +ORACLE_PASSWORD= +ORACLE_DSN=localhost:1521/FREEPDB1 + +# OpenAI (can also be entered in the Streamlit sidebar) +OPENAI_API_KEY=sk-... diff --git a/apps/git-second-brain/app/README.md b/apps/git-second-brain/app/README.md new file mode 100644 index 00000000..f3582246 --- /dev/null +++ b/apps/git-second-brain/app/README.md @@ -0,0 +1,101 @@ +# Git Second Brain — App + +Streamlit chat UI that lets you ask natural-language questions about a +repository's commit history, powered by **Oracle AI Database 26ai Vector Search**, +LangChain, and OpenAI. + +## Architecture + +``` +User question + │ + ▼ +┌────────────────────┐ ┌──────────────────────────┐ +│ Streamlit (app.py)│─────▶│ OracleCommitRetriever │ +│ Chat interface │ │ sentence-transformers │ +└────────┬───────────┘ │ + Oracle 26ai vector │ + │ │ VECTOR_DISTANCE search │ + │ context docs └──────────────────────────┘ + ▼ +┌────────────────────┐ +│ LangChain RAG │ +│ ChatOpenAI (GPT) │ +└────────────────────┘ +``` + +## Prerequisites + +| Requirement | Version | +| ----------------------- | ----------------------------- | +| Python | 3.10+ | +| Oracle AI Database 26ai | Running and accessible | +| OpenAI API key | Any `gpt-4o-mini` capable key | + +The `data-loader/` must have been run first so the `FASTAPI_COMMITS` table is +populated with embeddings. + +## Setup + +```bash +cd app +python -m venv .venv + +# Windows +.venv\Scripts\activate +# Linux / macOS +source .venv/bin/activate + +pip install -r requirements.txt +``` + +Copy `.env.example` to `.env` and fill in your credentials: + +```bash +cp .env.example .env +``` + +## Running + +The app reads Oracle credentials from environment variables. Load them before +starting Streamlit: + +```bash +# Load env vars from .env (use your preferred method) +# Windows PowerShell: +Get-Content .env | ForEach-Object { if ($_ -match '^([^#].+?)=(.*)$') { [Environment]::SetEnvironmentVariable($Matches[1], $Matches[2]) } } + +# Linux / macOS: +# export $(grep -v '^#' .env | xargs) + +streamlit run app.py +``` + +The app opens at . + +## Smoke test + +A standalone script that verifies the vector-search round trip without +Streamlit or OpenAI. 
Requires the same environment variables: + +```bash +python smoke_test.py +``` + +## Files + +| File | Purpose | +| ------------------ | ------------------------------------------------------------- | +| `app.py` | Streamlit chat UI + LangChain RAG chain | +| `retriever.py` | LangChain `BaseRetriever` backed by Oracle 26ai vector search | +| `smoke_test.py` | Minimal end-to-end connectivity & vector-search test | +| `requirements.txt` | Pinned Python dependencies | +| `.env.example` | Template for required environment variables | + +## Environment variables + +| Variable | Required | Default | Description | +| ----------------- | -------- | ------- | ---------------------------------------------- | +| `ORACLE_USER` | Yes | — | Database username | +| `ORACLE_PASSWORD` | Yes | — | Database password | +| `ORACLE_DSN` | Yes | — | Connect string, e.g. `localhost:1521/FREEPDB1` | +| `OPENAI_API_KEY` | No | — | Can also be entered in the Streamlit sidebar | diff --git a/apps/git-second-brain/app/app.py b/apps/git-second-brain/app/app.py new file mode 100644 index 00000000..308ff44c --- /dev/null +++ b/apps/git-second-brain/app/app.py @@ -0,0 +1,171 @@ +""" +Git Second Brain - Streamlit Chat UI +Ask natural-language questions about FastAPI's commit history, +powered by Oracle AI Database 26ai Vector Search + LangChain + OpenAI. + +Run: + streamlit run app.py +""" + +import os + +import streamlit as st +from langchain_core.output_parsers import StrOutputParser +from langchain_core.prompts import ChatPromptTemplate +from langchain_openai import ChatOpenAI +from retriever import OracleCommitRetriever + +# ========================= Page config ========================= +st.set_page_config( + page_title="Git Second Brain", + page_icon="🧠", + layout="wide", +) + +# ========================= Sidebar ============================= +with st.sidebar: + st.title("Git Second Brain") + st.caption("Oracle AI Database 26ai + LangChain + OpenAI") + + openai_key = st.text_input( + "OpenAI API Key", + type="password", + value=os.getenv("OPENAI_API_KEY", ""), + help="Stored only in this session, never persisted.", + ) + + model_name = st.selectbox( + "Model", + ["gpt-4o-mini", "gpt-4o", "gpt-4.1-mini", "gpt-4.1-nano"], + index=0, + ) + + top_k = st.slider("Commits to retrieve", min_value=3, max_value=15, value=8) + + temperature = st.slider("Temperature", min_value=0.0, max_value=1.0, value=0.2, step=0.05) + + st.divider() + st.markdown( + "**How it works**\n\n" + "1. Your question is embedded with sentence-transformers\n" + "2. Oracle 26ai runs `VECTOR_DISTANCE` to find the most relevant commits\n" + "3. LangChain passes those commits as context to OpenAI\n" + "4. You get a grounded answer with commit citations" + ) + + st.divider() + st.markdown("**Sample questions**") + sample_questions = [ + "Why did FastAPI switch to Pydantic v2?", + "How has dependency injection evolved?", + "What were the biggest breaking changes in the last 2 years?", + "When did lifespan replace startup/shutdown events?", + "What security fixes were applied recently?", + ] + for q in sample_questions: + if st.button(q, use_container_width=True): + st.session_state["prefill"] = q + +# ========================= System prompt ======================= +SYSTEM_PROMPT = """\ +You are Git Second Brain, an AI assistant that answers questions about the +FastAPI open-source project by analyzing its Git commit history. + +You will receive a set of relevant commits retrieved from Oracle AI Database 26ai +via vector similarity search. 
Use ONLY these commits to answer the question. +If the commits do not contain enough information, say so honestly. + +Rules: +- Cite specific commits by their short SHA and date when supporting a claim. +- Summarize the narrative arc when multiple commits tell a story. +- Keep answers concise but thorough (3-6 paragraphs max). +- If you are unsure, say "Based on the commits I found..." to hedge. +- Never invent commit SHAs or dates. +""" + +RAG_TEMPLATE = ChatPromptTemplate.from_messages( + [ + ("system", SYSTEM_PROMPT), + ("human", "Retrieved commits:\n\n{context}\n\n---\nQuestion: {question}"), + ] +) + +# ========================= Init state ========================== +if "messages" not in st.session_state: + st.session_state.messages = [] + +if "retriever" not in st.session_state: + with st.spinner("Connecting to Oracle AI Database 26ai ..."): + st.session_state.retriever = OracleCommitRetriever(top_k=top_k) + +# ========================= Chat display ======================== +st.header("Ask your repo anything") + +for msg in st.session_state.messages: + with st.chat_message(msg["role"]): + st.markdown(msg["content"]) + if msg.get("sources"): + with st.expander(f"Retrieved commits ({len(msg['sources'])})"): + for doc in msg["sources"]: + meta = doc.metadata + st.markdown( + f"**`{meta['sha'][:10]}`** | {meta['date']} | " + f"*{meta['author']}*\n\n" + f"> {meta['subject']}" + ) + st.divider() + +# ========================= Chat input ========================== +prefill = st.session_state.pop("prefill", None) +user_input = st.chat_input("Ask about FastAPI's history ...") or prefill + +if user_input: + if not openai_key: + st.error("Please enter your OpenAI API key in the sidebar.") + st.stop() + + # Show user message + st.session_state.messages.append({"role": "user", "content": user_input}) + with st.chat_message("user"): + st.markdown(user_input) + + # Retrieve from Oracle 26ai + with st.chat_message("assistant"): + with st.spinner("Searching Oracle 26ai Vector Search ..."): + retriever = st.session_state.retriever + retriever.top_k = top_k + docs = retriever.invoke(user_input) + + context = "\n\n---\n\n".join(doc.page_content for doc in docs) + + # LangChain RAG chain + llm = ChatOpenAI( + model=model_name, + temperature=temperature, + api_key=openai_key, + ) + chain = RAG_TEMPLATE | llm | StrOutputParser() + + with st.spinner("Generating answer ..."): + answer = chain.invoke({"context": context, "question": user_input}) + + st.markdown(answer) + + # Show retrieved commits + with st.expander(f"Retrieved commits ({len(docs)})"): + for doc in docs: + meta = doc.metadata + st.markdown( + f"**`{meta['sha'][:10]}`** | {meta['date']} | " + f"*{meta['author']}*\n\n" + f"> {meta['subject']}" + ) + st.divider() + + st.session_state.messages.append( + { + "role": "assistant", + "content": answer, + "sources": docs, + } + ) diff --git a/apps/git-second-brain/app/requirements.txt b/apps/git-second-brain/app/requirements.txt new file mode 100644 index 00000000..e387f469 --- /dev/null +++ b/apps/git-second-brain/app/requirements.txt @@ -0,0 +1,6 @@ +oracledb>=2.2.0,<4 +sentence-transformers>=5.0,<6 +langchain>=1.2,<2 +langchain-core>=1.2,<2 +langchain-openai>=1.1,<2 +streamlit>=1.38,<2 diff --git a/apps/git-second-brain/app/retriever.py b/apps/git-second-brain/app/retriever.py new file mode 100644 index 00000000..49c5c817 --- /dev/null +++ b/apps/git-second-brain/app/retriever.py @@ -0,0 +1,107 @@ +""" +LangChain custom retriever backed by Oracle AI Database 26ai Vector Search. 
+ +This retriever embeds the user query with sentence-transformers, runs a +VECTOR_DISTANCE query against the FASTAPI_COMMITS table, and returns +LangChain Document objects with commit metadata. +""" + +import array +import os + +import oracledb +from langchain_core.callbacks import CallbackManagerForRetrieverRun +from langchain_core.documents import Document +from langchain_core.retrievers import BaseRetriever +from pydantic import Field, PrivateAttr +from sentence_transformers import SentenceTransformer + + +class OracleCommitRetriever(BaseRetriever): + """Retrieve FastAPI commits from Oracle AI Database 26ai via vector similarity. + + Configuration is read from environment variables: + ORACLE_USER – database username + ORACLE_PASSWORD – database password + ORACLE_DSN – Oracle connect string (host:port/service) + """ + + db_user: str = Field(default_factory=lambda: os.environ["ORACLE_USER"]) + db_password: str = Field(default_factory=lambda: os.environ["ORACLE_PASSWORD"], repr=False) + db_dsn: str = Field(default_factory=lambda: os.environ["ORACLE_DSN"]) + embed_model_name: str = "sentence-transformers/all-MiniLM-L6-v2" + top_k: int = 8 + + _embed_model: SentenceTransformer = PrivateAttr() + _conn: oracledb.Connection = PrivateAttr() + + def __init__(self, **kwargs): + super().__init__(**kwargs) + self._embed_model = SentenceTransformer(self.embed_model_name) + self._conn = oracledb.connect( + user=self.db_user, + password=self.db_password, + dsn=self.db_dsn, + ) + + def _get_relevant_documents( + self, + query: str, + *, + run_manager: CallbackManagerForRetrieverRun, + ) -> list[Document]: + """Embed the query and run vector search in Oracle 26ai.""" + vec = array.array( + "f", + self._embed_model.encode(query, normalize_embeddings=True).tolist(), + ) + + cur = self._conn.cursor() + cur.execute( + """ + SELECT sha, + TO_CHAR(commit_date, 'YYYY-MM-DD'), + author, + subject, + body, + files_changed + FROM FASTAPI_COMMITS + ORDER BY VECTOR_DISTANCE(embedding, :1, COSINE) + FETCH FIRST :2 ROWS ONLY + """, + [vec, self.top_k], + ) + + docs = [] + for sha, date_str, author, subject, body, files in cur: + body_text = body.read() if hasattr(body, "read") else (body or "") + files_text = files.read() if hasattr(files, "read") else (files or "") + + content = ( + f"Commit: {sha[:10]}\n" + f"Date: {date_str}\n" + f"Author: {author}\n" + f"Subject: {subject}\n" + f"Body: {body_text}\n" + f"Files changed:\n{files_text[:800]}" + ) + + docs.append( + Document( + page_content=content, + metadata={ + "sha": sha, + "date": date_str, + "author": author, + "subject": subject, + }, + ) + ) + + cur.close() + return docs + + def close(self): + """Clean up the database connection.""" + if self._conn: + self._conn.close() diff --git a/apps/git-second-brain/app/smoke_test.py b/apps/git-second-brain/app/smoke_test.py new file mode 100644 index 00000000..82ee99f8 --- /dev/null +++ b/apps/git-second-brain/app/smoke_test.py @@ -0,0 +1,62 @@ +""" +Smoke test: verify Python can query 26ai vector search end to end. 
+Run: python smoke_test.py +""" + +import array +import os +import sys + +import oracledb +from sentence_transformers import SentenceTransformer + +_REQUIRED_ENV = ("ORACLE_USER", "ORACLE_PASSWORD", "ORACLE_DSN") +_missing = [v for v in _REQUIRED_ENV if v not in os.environ] +if _missing: + sys.exit(f"ERROR: missing environment variables: {', '.join(_missing)}") + +DB_USER = os.environ["ORACLE_USER"] +DB_PASSWORD = os.environ["ORACLE_PASSWORD"] +DB_DSN = os.environ["ORACLE_DSN"] + +EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2" + +QUESTION = "Why did FastAPI adopt Pydantic v2?" + + +def main(): + print(f"Loading model: {EMBED_MODEL}") + model = SentenceTransformer(EMBED_MODEL) + + print(f"Encoding question: {QUESTION}") + vec = array.array("f", model.encode(QUESTION, normalize_embeddings=True).tolist()) + + print(f"Connecting to {DB_DSN} ...") + conn = oracledb.connect(user=DB_USER, password=DB_PASSWORD, dsn=DB_DSN) + cur = conn.cursor() + + print("Running vector search ...\n") + cur.execute( + """ + SELECT sha, commit_date, subject + FROM FASTAPI_COMMITS + ORDER BY VECTOR_DISTANCE(embedding, :1, COSINE) + FETCH FIRST 5 ROWS ONLY + """, + [vec], + ) + + print(f"{'#':<4} {'SHA':<12} {'DATE':<22} {'SUBJECT'}") + print("-" * 90) + for i, (sha, dt, subject) in enumerate(cur, 1): + short_sha = sha[:10] + date_str = dt.strftime("%Y-%m-%d %H:%M") if dt else "unknown" + print(f"{i:<4} {short_sha:<12} {date_str:<22} {subject[:60]}") + + cur.close() + conn.close() + print("\nSmoke test passed.") + + +if __name__ == "__main__": + main() diff --git a/apps/git-second-brain/data-loader/.env.example b/apps/git-second-brain/data-loader/.env.example new file mode 100644 index 00000000..1e3d2ee0 --- /dev/null +++ b/apps/git-second-brain/data-loader/.env.example @@ -0,0 +1,7 @@ +# Oracle AI Database 26ai connection +ORACLE_USER=GITHUB_SECOND_BRAIN +ORACLE_PASSWORD= +ORACLE_DSN=localhost:1521/FREEPDB1 + +# Optional: override the target schema (defaults to GITHUB_SECOND_BRAIN) +# ORACLE_SCHEMA=GITHUB_SECOND_BRAIN diff --git a/apps/git-second-brain/data-loader/README.md b/apps/git-second-brain/data-loader/README.md new file mode 100644 index 00000000..15f49411 --- /dev/null +++ b/apps/git-second-brain/data-loader/README.md @@ -0,0 +1,81 @@ +# Git Second Brain — Data Loader + +Reads a repository's commit history from a plain-text dump, generates vector +embeddings with `sentence-transformers`, and bulk-inserts everything into an +**Oracle AI Database 26ai** table. + +## How it works + +1. Parses `fastapi_commits.txt` (delimited commit metadata) and + `diffs/all_diffs.txt` (per-commit file-change stats). +2. For each commit, builds a combined text blob and encodes it with the + `all-MiniLM-L6-v2` model (384-dimensional vectors). +3. Inserts rows in batches into `FASTAPI_COMMITS`, handling duplicate SHAs + gracefully. 
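+
+In code, steps 2 and 3 boil down to encoding the combined commit text and
+binding the resulting 384-dim vector as a float array. A minimal sketch, with
+placeholder commit text and credentials (the real logic, including batching
+and duplicate-SHA handling, lives in `load_data.py`):
+
+```python
+import array
+
+import oracledb
+from sentence_transformers import SentenceTransformer
+
+# Same 384-dim model the loader uses
+model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+
+# Hypothetical commit blob; load_data.py builds the real one in build_content()
+commit_text = "Subject: Fix typo in docs\nAuthor: Jane Doe\nBody: (no body)"
+vec = model.encode(commit_text, normalize_embeddings=True)  # numpy array, length 384
+
+# Placeholder credentials -- use the values from your .env
+with oracledb.connect(
+    user="GITHUB_SECOND_BRAIN", password="...", dsn="localhost:1521/FREEPDB1"
+) as conn:
+    cur = conn.cursor()
+    cur.execute(
+        "INSERT INTO FASTAPI_COMMITS (sha, subject, content_for_embedding, embedding) "
+        "VALUES (:1, :2, :3, :4)",
+        ["0123abc...", "Fix typo in docs", commit_text, array.array("f", vec.tolist())],
+    )
+    conn.commit()
+```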
+ +## Prerequisites + +| Requirement | Version | +| ----------------------- | -------------------------------------------------------- | +| Python | 3.10+ | +| Oracle AI Database 26ai | Running, with the schema created via `database/` scripts | + +The following data files must exist relative to this folder: + +- `../fastapi_commits.txt` — commit metadata +- `../diffs/all_diffs.txt` — diff / file-change information (optional but recommended) + +## Setup + +```bash +cd data-loader +python -m venv .venv + +# Windows +.venv\Scripts\activate +# Linux / macOS +source .venv/bin/activate + +pip install -r requirements.txt +``` + +Copy `.env.example` to `.env` and fill in your credentials: + +```bash +cp .env.example .env +``` + +## Running + +Load your environment variables before running the script: + +```bash +# Load env vars from .env (use your preferred method) +# Windows PowerShell: +Get-Content .env | ForEach-Object { if ($_ -match '^([^#].+?)=(.*)$') { [Environment]::SetEnvironmentVariable($Matches[1], $Matches[2]) } } + +# Linux / macOS: +# export $(grep -v '^#' .env | xargs) + +python load_data.py +``` + +Progress is printed to stdout. A full run (~3 000 commits) takes a few minutes +depending on hardware and network latency to the database. + +## Environment variables + +| Variable | Required | Default | Description | +| ----------------- | -------- | --------------------- | ---------------------------------------------- | +| `ORACLE_USER` | Yes | — | Database username | +| `ORACLE_PASSWORD` | Yes | — | Database password | +| `ORACLE_DSN` | Yes | — | Connect string, e.g. `localhost:1521/FREEPDB1` | +| `ORACLE_SCHEMA` | No | `GITHUB_SECOND_BRAIN` | Target schema for the table | + +## Files + +| File | Purpose | +| ------------------ | ------------------------------------------- | +| `load_data.py` | Main loader script | +| `requirements.txt` | Pinned Python dependencies | +| `.env.example` | Template for required environment variables | diff --git a/apps/git-second-brain/data-loader/load_data.py b/apps/git-second-brain/data-loader/load_data.py new file mode 100644 index 00000000..4ea3b530 --- /dev/null +++ b/apps/git-second-brain/data-loader/load_data.py @@ -0,0 +1,239 @@ +""" +Load FastAPI commit history into Oracle AI Database 26ai with vector embeddings. + +Prerequisites: + 1. ../fastapi_commits.txt (delimited commit metadata, one block per commit) + 2. ../diffs/all_diffs.txt (git log --stat dump, delimited by ===SHA:hash===) + 3. pip install -r requirements.txt + 4. 
Schema created by running the SQL scripts in ../database/
+
+Run:
+    python load_data.py
+"""
+
+import array
+import os
+import re
+import sys
+
+import oracledb
+from sentence_transformers import SentenceTransformer
+
+# =========================== Config ===========================
+_REQUIRED_ENV = ("ORACLE_USER", "ORACLE_PASSWORD", "ORACLE_DSN")
+_missing = [v for v in _REQUIRED_ENV if v not in os.environ]
+if _missing:
+    print(f"ERROR: missing environment variables: {', '.join(_missing)}")
+    sys.exit(1)
+
+DB_USER = os.environ["ORACLE_USER"]
+DB_PASSWORD = os.environ["ORACLE_PASSWORD"]
+DB_DSN = os.environ["ORACLE_DSN"]
+DB_SCHEMA = os.getenv("ORACLE_SCHEMA", "GITHUB_SECOND_BRAIN")
+
+COMMITS_FILE = "../fastapi_commits.txt"
+DIFFS_FILE = "../diffs/all_diffs.txt"
+
+MAX_COMMITS = 3000
+BATCH_SIZE = 100
+EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # 384 dims
+# ==============================================================
+
+
+def parse_diffs(path):
+    """Parse the single-file diff dump into a dict keyed by SHA."""
+    diffs = {}
+    if not os.path.exists(path):
+        print(f"WARNING: {path} not found. Continuing without file-change info.")
+        return diffs
+
+    current_sha = None
+    buffer = []
+    with open(path, encoding="utf-8", errors="replace") as f:
+        for line in f:
+            m = re.match(r"===SHA:([0-9a-f]+)===", line.strip())
+            if m:
+                if current_sha:
+                    diffs[current_sha] = "".join(buffer).strip()
+                current_sha = m.group(1)
+                buffer = []
+            else:
+                buffer.append(line)
+    if current_sha:
+        diffs[current_sha] = "".join(buffer).strip()
+    return diffs
+
+
+def load_commits(path, limit):
+    """Load commits from a delimited plain-text dump.
+
+    Each block looks like:
+        <<<COMMIT>>>
+        <sha>
+        <author>
+        <date>
+        <subject>
+        <<<BODY>>>
+        <body>
+        <<<END>>>
+    """
+    commits = []
+    with open(path, encoding="utf-8", errors="replace") as f:
+        raw = f.read()
+
+    blocks = raw.split("<<<COMMIT>>>")
+    for block in blocks:
+        block = block.strip()
+        if not block:
+            continue
+        if len(commits) >= limit:
+            break
+
+        try:
+            header, rest = block.split("<<<BODY>>>", 1)
+        except ValueError:
+            continue
+
+        body, _, _ = rest.partition("<<<END>>>")
+        header_lines = header.strip().splitlines()
+        if len(header_lines) < 4:
+            continue
+
+        commits.append(
+            {
+                "sha": header_lines[0].strip(),
+                "author": header_lines[1].strip(),
+                "date": header_lines[2].strip(),
+                "subject": header_lines[3].strip(),
+                "body": body.strip(),
+            }
+        )
+
+    return commits
+
+
+def build_content(commit, files_changed):
+    """Combine commit fields into a single string for embedding."""
+    body = (commit.get("body") or "").strip() or "(no body)"
+    files = (files_changed or "").strip()[:1500] or "(unknown)"
+    return (
+        f"Subject: {commit.get('subject', '')}\n"
+        f"Author: {commit.get('author', '')}\n"
+        f"Date: {commit.get('date', '')}\n"
+        f"Body: {body}\n"
+        f"Files changed:\n{files}"
+    )
+
+
+def normalize_date(raw):
+    """Turn a git ISO date like 2024-03-12T10:15:30+01:00 into a clean string."""
+    if not raw:
+        return "1970-01-01T00:00:00"
+    # Strip timezone suffix, keep first 19 chars
+    cleaned = raw.split("+")[0].split("Z")[0][:19]
+    return cleaned if "T" in cleaned else "1970-01-01T00:00:00"
+
+
+def flush_batch(cursor, sql, batch, model):
+    """Encode texts in batch, insert via executemany."""
+    texts = [item[1] for item in batch]
+    vectors = model.encode(texts, normalize_embeddings=True, show_progress_bar=False)
+
+    rows = []
+    for (commit, content, files_changed), vec in zip(batch, vectors, strict=True):
+        rows.append(
+            (
+                commit.get("sha"),
+                (commit.get("author") or "")[:200],
+                normalize_date(commit.get("date", "")),
+                
(commit.get("subject") or "")[:1000], + commit.get("body") or "", + files_changed, + content, + array.array("f", vec.tolist()), + ) + ) + + try: + cursor.executemany(sql, rows) + except oracledb.IntegrityError: + # Retry one-by-one so a duplicate SHA does not kill the whole batch + inserted = 0 + for row in rows: + try: + cursor.execute(sql, row) + inserted += 1 + except oracledb.IntegrityError: + pass + return inserted + + return len(rows) + + +def main(): + if not os.path.exists(COMMITS_FILE): + print(f"ERROR: {COMMITS_FILE} not found in current directory.") + sys.exit(1) + + print("Loading embedding model (first run downloads ~90 MB)...") + model = SentenceTransformer(EMBED_MODEL) + + print(f"Parsing {DIFFS_FILE} ...") + diffs = parse_diffs(DIFFS_FILE) + print(f" parsed {len(diffs)} diffs") + + print(f"Loading {COMMITS_FILE} ...") + commits = load_commits(COMMITS_FILE, MAX_COMMITS) + print(f" loaded {len(commits)} commits") + + print(f"Connecting to Oracle AI Database 26ai at {DB_DSN} ...") + conn = oracledb.connect(user=DB_USER, password=DB_PASSWORD, dsn=DB_DSN) + cursor = conn.cursor() + + # Point all unqualified object references at the target schema. + # Schema names cannot be parameterised in DDL; validate against a strict + # allowlist pattern to prevent SQL injection. + if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$#]{0,127}", DB_SCHEMA): + print(f"ERROR: ORACLE_SCHEMA value '{DB_SCHEMA}' is not a valid Oracle identifier.") + sys.exit(1) + + cursor.execute(f"ALTER SESSION SET CURRENT_SCHEMA = {DB_SCHEMA}") + + insert_sql = f""" + INSERT INTO {DB_SCHEMA}.FASTAPI_COMMITS + (sha, author, commit_date, subject, body, + files_changed, content_for_embedding, embedding) + VALUES + (:1, :2, + TO_TIMESTAMP(:3, 'YYYY-MM-DD"T"HH24:MI:SS'), + :4, :5, :6, :7, :8) + """ + + batch = [] + total = 0 + for commit in commits: + sha = commit.get("sha") + if not sha: + continue + files_changed = diffs.get(sha, "") + content = build_content(commit, files_changed) + batch.append((commit, content, files_changed)) + + if len(batch) >= BATCH_SIZE: + total += flush_batch(cursor, insert_sql, batch, model) + conn.commit() + batch = [] + print(f" inserted {total} commits...") + + if batch: + total += flush_batch(cursor, insert_sql, batch, model) + conn.commit() + + print(f"Done. 
Inserted {total} commits.") + + cursor.close() + conn.close() + + +if __name__ == "__main__": + main() diff --git a/apps/git-second-brain/data-loader/requirements.txt b/apps/git-second-brain/data-loader/requirements.txt new file mode 100644 index 00000000..02397586 --- /dev/null +++ b/apps/git-second-brain/data-loader/requirements.txt @@ -0,0 +1,2 @@ +oracledb>=2.2.0,<4 +sentence-transformers>=5.0,<6 diff --git a/apps/git-second-brain/database/01_create_user.sql b/apps/git-second-brain/database/01_create_user.sql new file mode 100644 index 00000000..020954b6 --- /dev/null +++ b/apps/git-second-brain/database/01_create_user.sql @@ -0,0 +1,34 @@ +-- ===================================================================== +-- Git Second Brain – Oracle AI Database 26ai user setup +-- +-- Run as SYS / SYSTEM (or any DBA) against the target PDB: +-- sqlplus system/Welcome_123@//localhost:1521/FREEPDB1 @01_create_user.sql +-- ===================================================================== + +-- Drop the user if it already exists (CASCADE removes all owned objects) +BEGIN + EXECUTE IMMEDIATE 'DROP USER GITHUB_SECOND_BRAIN CASCADE'; +EXCEPTION + WHEN OTHERS THEN + IF SQLCODE != -01918 THEN -- ORA-01918: user does not exist + RAISE; + END IF; +END; +/ + +CREATE USER GITHUB_SECOND_BRAIN +IDENTIFIED BY "ChangeMe_123!" +DEFAULT TABLESPACE USERS +TEMPORARY TABLESPACE TEMP +PROFILE DEFAULT +ACCOUNT UNLOCK; + +GRANT CREATE SESSION TO GITHUB_SECOND_BRAIN; +GRANT CREATE TABLE TO GITHUB_SECOND_BRAIN; +GRANT CREATE VIEW TO GITHUB_SECOND_BRAIN; +GRANT CREATE SEQUENCE TO GITHUB_SECOND_BRAIN; +GRANT CREATE PROCEDURE TO GITHUB_SECOND_BRAIN; +GRANT CREATE TRIGGER TO GITHUB_SECOND_BRAIN; +GRANT CREATE TYPE TO GITHUB_SECOND_BRAIN; + +ALTER USER GITHUB_SECOND_BRAIN QUOTA UNLIMITED ON USERS; diff --git a/apps/git-second-brain/database/02_create_schema.sql b/apps/git-second-brain/database/02_create_schema.sql new file mode 100644 index 00000000..1b859f61 --- /dev/null +++ b/apps/git-second-brain/database/02_create_schema.sql @@ -0,0 +1,56 @@ +-- ===================================================================== +-- Git Second Brain – Oracle AI Database 26ai schema +-- +-- All objects live under the GITHUB_SECOND_BRAIN schema. +-- +-- Connect as a user with privileges on GITHUB_SECOND_BRAIN, for example: +-- sqlplus system/Welcome_123@//localhost:1521/FREEPDB1 @02_create_schema.sql +-- Or connect directly as GITHUB_SECOND_BRAIN if you have the password. 
+-- ===================================================================== + +ALTER SESSION SET CURRENT_SCHEMA = GITHUB_SECOND_BRAIN; + +-- Drop old objects if this script is rerun +BEGIN + EXECUTE IMMEDIATE 'DROP TABLE GITHUB_SECOND_BRAIN.FASTAPI_COMMITS PURGE'; +EXCEPTION WHEN OTHERS THEN NULL; +END; +/ + +-- ===================================================================== +-- Main table: one row per commit, with a 384-dim vector column +-- ===================================================================== +CREATE TABLE GITHUB_SECOND_BRAIN.FASTAPI_COMMITS ( + id NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, + sha VARCHAR2(64) UNIQUE NOT NULL, + author VARCHAR2(200), + commit_date TIMESTAMP, + subject VARCHAR2(1000), + body CLOB, + files_changed CLOB, + content_for_embedding CLOB, + embedding VECTOR(384, FLOAT32) +); + +-- Supporting indexes for metadata filters (date range, author) +CREATE INDEX GITHUB_SECOND_BRAIN.FASTAPI_COMMITS_DATE_IDX + ON GITHUB_SECOND_BRAIN.FASTAPI_COMMITS(commit_date); + +CREATE INDEX GITHUB_SECOND_BRAIN.FASTAPI_COMMITS_AUTHOR_IDX + ON GITHUB_SECOND_BRAIN.FASTAPI_COMMITS(author); + +COMMIT; + +-- ===================================================================== +-- Vector index +-- Create AFTER loading the data (faster build, better quality). +-- Run the block below once load_data.py finishes. +-- ===================================================================== +-- +-- CREATE VECTOR INDEX GITHUB_SECOND_BRAIN.FASTAPI_COMMITS_VEC_IDX +-- ON GITHUB_SECOND_BRAIN.FASTAPI_COMMITS (embedding) +-- ORGANIZATION INMEMORY NEIGHBOR GRAPH +-- DISTANCE COSINE +-- WITH TARGET ACCURACY 95; +-- +-- ===================================================================== diff --git a/apps/git-second-brain/database/README.md b/apps/git-second-brain/database/README.md new file mode 100644 index 00000000..440c7583 --- /dev/null +++ b/apps/git-second-brain/database/README.md @@ -0,0 +1,33 @@ +# Git Second Brain — Database + +SQL scripts for setting up the Oracle AI Database 26ai schema used by the +data loader and the app. + +## Prerequisites + +- Oracle AI Database 26ai (e.g. the free container image) +- A DBA connection to the target PDB (e.g. `SYSTEM`) + +## Scripts + +Run in order: + +| Script | Run as | Purpose | +| ---------------------- | ----------------------------- | ------------------------------------------------------------------------------- | +| `01_create_user.sql` | SYS / SYSTEM | Creates the `GITHUB_SECOND_BRAIN` user with required grants | +| `02_create_schema.sql` | SYSTEM or GITHUB_SECOND_BRAIN | Creates the `FASTAPI_COMMITS` table, indexes, and (optionally) the vector index | + +## Usage + +```bash +# Connect as SYSTEM to the pluggable database +sqlplus system/Welcome_123@//localhost:1521/FREEPDB1 + +# Then run each script +@01_create_user.sql +@02_create_schema.sql +``` + +> **Note:** The vector index creation is commented out in `02_create_schema.sql`. +> Create it **after** loading data with `data-loader/` for faster build times +> and better index quality.
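+
+If you prefer to do the post-load step from Python instead of SQL*Plus, the
+sketch below reuses the commented-out DDL from `02_create_schema.sql`
+(credentials are placeholders; run it only after `load_data.py` has finished):
+
+```python
+import oracledb
+
+# Placeholder credentials -- use the same values as the data loader's .env
+conn = oracledb.connect(
+    user="GITHUB_SECOND_BRAIN", password="...", dsn="localhost:1521/FREEPDB1"
+)
+cur = conn.cursor()
+
+# Sanity check: the loader should already have populated the table
+cur.execute("SELECT COUNT(*) FROM GITHUB_SECOND_BRAIN.FASTAPI_COMMITS")
+print("commits loaded:", cur.fetchone()[0])
+
+# Same DDL as the commented-out block in 02_create_schema.sql
+cur.execute("""
+    CREATE VECTOR INDEX GITHUB_SECOND_BRAIN.FASTAPI_COMMITS_VEC_IDX
+    ON GITHUB_SECOND_BRAIN.FASTAPI_COMMITS (embedding)
+    ORGANIZATION INMEMORY NEIGHBOR GRAPH
+    DISTANCE COSINE
+    WITH TARGET ACCURACY 95
+""")
+
+cur.close()
+conn.close()
+```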