The Animal Genetics Research Platform is a transformative cloud ecosystem designed to bridge the critical gap between advanced genomic science and practical livestock management. Our mission is to democratize access to cutting-edge genetic research, empowering farmers and researchers to collaborate in accelerating genetic gain, enhancing animal welfare, and driving sustainable agricultural production.
At the heart of the platform is Emilia AI, a RAG-powered (Retrieval-Augmented Generation) AI agent. Unlike standard models, Emilia provides grounded responses by orchestrating natural language queries across peer-reviewed literature (arXiv, PubMed, Nature) and real-time internal databases. This ensures that every breeding recommendation and research insight is anchored in scientific truth.
- Seamless Collaboration: Uniting Farmers, Researchers, and Students in a shared data environment.
- Intelligent Assistance: Leveraging LLMs and LangChain for natural language data exploration and literature summarization.
- Advanced Analytics: Professional-grade tools for genomic analysis, mating simulations, and heritability calculations.
- Educational Pathway: Providing hands-on experience with real-world datasets for the next generation of animal scientists.
- Frontend: React (TypeScript) with TailwindCSS, Vite, and Lucide Icons. Deployed via AWS Amplify.
- Main Backend (AI): FastAPI (Python) orchestrating LangChain, ChatGPT (OpenAI), ChromaDB, and Neo4j (Planned).
- Service Connectivity & Protocols:
  - LLM Orchestration: Uses LangChain (`ChatOpenAI`, `create_sql_agent`) to manage complex reasoning, tool-calling, and structured data summarization.
  - Literature Retrieval:
    - Live Fetching: Real-time integration with arXiv (SDK), PubMed (Biopython), and Nature (REST) to ensure data freshness.
    - Vector Store: ChromaDB is integrated for persistent semantic indexing and RAG-based retrieval of local research documents.
  - Database Access: `SQLAlchemy` for PostgreSQL; the `neo4j` driver for graph-based knowledge retrieval.
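The tool-calling flow that LangChain manages here can be pictured, very roughly and without the library, as dispatch over named tools. This is a stdlib-only sketch; the tool names and return values are illustrative, not the platform's:

```python
# Very rough sketch of the tool-calling pattern an LLM orchestrator manages:
# the model selects a named tool, the orchestrator runs it and feeds the
# observation back. Tool names here are illustrative, not the platform's.
TOOLS = {
    "sql_query": lambda q: f"rows for: {q}",
    "literature_search": lambda q: f"papers about: {q}",
}

def orchestrate(tool_name: str, tool_input: str) -> str:
    """Dispatch one tool call and return its observation."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](tool_input)

print(orchestrate("sql_query", "mean EBV by breed"))  # rows for: mean EBV by breed
```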
- User Backend: Bun.js (TypeScript) handling core user and farm data services.
  - Database Connectivity: Native `MongoClient` for DocumentDB/MongoDB; `supabase-js` for PostgreSQL interactions.
- Databases:
  - User Data: Supabase (PostgreSQL) for the current implementation (cost-optimized), with the code architecture fully ready for Amazon DocumentDB.
  - Research Data: PostgreSQL (RDS), Neo4j (Planned), and ChromaDB (Vector).
- Object Storage: AWS S3.
- Infrastructure: AWS (ECS Fargate, Copilot, CloudFormation, CloudFront, API Gateway, VPC Links, ACM, Amplify).
- Data Pipeline:
  - Offline Ingestion: Apache Airflow orchestrating scheduled literature harvesting and database migrations.
  - Live Retrieval & RAG: Real-time API integration within the FastAPI service:
    - LangChain Orchestrator: Manages the flow between user intent, literature retrieval, and final LLM refinement.
    - Source Connectors: `arxiv` SDK, `Bio.Entrez` (PubMed), and `requests` (Nature).
    - Vector Management: ChromaDB stores embeddings for retrieved content to enable semantic re-ranking and local knowledge retrieval.
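The live-retrieval step can be sketched as a parallel fan-out over source connectors followed by a merge into one grounded context. This is a minimal stdlib sketch with stubbed fetchers standing in for the real `arxiv`, `Bio.Entrez`, and `requests` connectors:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real connectors (arxiv SDK, Bio.Entrez, requests):
def search_arxiv(query):
    return [{"source": "arxiv", "title": f"arXiv result for {query}"}]

def search_pubmed(query):
    return [{"source": "pubmed", "title": f"PubMed result for {query}"}]

def search_nature(query):
    return [{"source": "nature", "title": f"Nature result for {query}"}]

def retrieve_literature(query):
    """Fan out to all sources in parallel and merge results into one context."""
    connectors = [search_arxiv, search_pubmed, search_nature]
    with ThreadPoolExecutor(max_workers=len(connectors)) as pool:
        batches = pool.map(lambda fn: fn(query), connectors)
    return [doc for batch in batches for doc in batch]

docs = retrieve_literature("sheep heritability")
print([d["source"] for d in docs])  # ['arxiv', 'pubmed', 'nature']
```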
- Emilia AI Co-pilot: A RAG-powered AI agent for breeding advice and research insights.
- Advanced Authentication: Secure user access via Supabase Auth supporting:
  - Multi-method Login: Email/Password and Web3/MetaMask wallet authentication.
  - Role-Based Access Control (RBAC): Specific interfaces for Researchers, Farmers, and Students.
  - Secure Sessions: Managed via Supabase with protected routes in the React frontend.
- Multi-Source RAG: Retrieval of peer-reviewed research from PubMed, arXiv, and Nature using ChromaDB and Neo4j (Planned).
- Intelligent SQL Agent: Dynamic generation of complex genetic queries for PostgreSQL with automated schema analysis.
- Interactive Artifacts: Real-time generation of Plotly charts and formatted data tables, handled via a frontend-backend artifact protocol.
- Conversational History: Persistent, user-specific chat sessions and message history stored in Supabase.
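The SQL agent's automated schema analysis boils down to collecting table and column metadata before the LLM writes a query. A minimal sketch, using `sqlite3` in place of the platform's PostgreSQL/SQLAlchemy stack (table and column names are invented for illustration):

```python
import sqlite3

# Minimal sketch of automated schema analysis, using sqlite3 in place of the
# platform's PostgreSQL + SQLAlchemy stack. The table is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (id INTEGER PRIMARY KEY, breed TEXT, ebv REAL)")

def describe_schema(conn):
    """Collect table/column metadata used to ground an LLM-generated SQL query."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    schema = {}
    for table in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = [(name, ctype) for _, name, ctype, *_ in cols]
    return schema

print(describe_schema(conn))
# {'animals': [('id', 'INTEGER'), ('breed', 'TEXT'), ('ebv', 'REAL')]}
```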
Extracted from @[docs/architecture/arch-overview.md] and project evolution plans:
- Breeding & Mating Engine: Advanced predictive models for animal pairing optimization and genetic gain analysis.
- Research Environment: Integrated RStudio and JupyterHub workspaces for deep statistical data analysis.
- Unified Dashboard: Comprehensive real-time KPIs and analytics for farmers and scientists.
- Species Expansion: Extending genomic models beyond sheep to other livestock species (cattle, goats, etc.).
- Enhanced Automation: Automated data collection pipelines and AI-driven breeding optimization.
- Edge Capabilities: Offline-first data entry for remote farm environments with cloud synchronization.
The chat system in the Main Backend implements a sophisticated multi-step pipeline for processing researcher and farmer queries, supporting both synchronous and long-running asynchronous execution:
- Intent Analysis: Every query is analyzed by an intelligent agent (`analyze_query_intent`) using a zero-shot LLM prompt to determine whether it requires Research (literature-based RAG) or Database (structured SQL) processing.
- Dual-Path Execution:
  - Research Queries: Orchestrates a RAG pipeline (`retrieve_literature_for_question`) that performs parallel searches across the arXiv, PubMed, and Nature APIs, merging results into a grounded context.
  - Database Queries: Invokes an Intelligent SQL Agent (`create_sql_agent` with `openai-tools`) that dynamically inspects the PostgreSQL schema, generates optimized SQL, and extracts structured data for visualization.
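The dual-path dispatch can be sketched as follows. Note that the real `analyze_query_intent` uses a zero-shot LLM prompt; this stand-in substitutes a toy keyword heuristic (with invented keywords) purely to show the routing shape:

```python
def analyze_query_intent(query: str) -> str:
    """Toy stand-in for the LLM-based intent classifier: route to the
    'database' path when the query mentions structured farm data,
    otherwise to the 'research' (literature RAG) path.
    The keyword set is illustrative, not the platform's."""
    db_keywords = {"herd", "ewe", "lambing", "ebv", "flock", "records"}
    return "database" if set(query.lower().split()) & db_keywords else "research"

def handle_query(query: str) -> str:
    intent = analyze_query_intent(query)
    if intent == "database":
        return "SQL agent path"   # create_sql_agent -> PostgreSQL
    return "RAG path"             # retrieve_literature_for_question -> arXiv/PubMed/Nature

print(handle_query("show ebv records for my flock"))    # SQL agent path
print(handle_query("latest research on wool quality"))  # RAG path
```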
- Connectivity & Orchestration:
  - API Integration: Uses the `OpenAI` client for answer refinement and synthetic dataset generation.
  - State Management: Persistent job tracking via a custom `job_store` service to manage long-running async tasks.
- Sync/Async Model:
  - Standard queries are processed via a synchronous `/chat` endpoint.
  - Complex analyses use an Async Job Store (`/chat/async`) with background task execution and frontend polling for job results.
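The job-store pattern behind the async path can be sketched in memory: submit a task, run it in the background, and let callers poll by job ID. The real `job_store` service presumably adds persistence and error states, which are omitted here:

```python
import threading
import time
import uuid

class JobStore:
    """Minimal in-memory sketch of the async job-store pattern: the async
    endpoint creates a job, a background worker fills in the result, and
    the client polls by job_id until status is 'done'."""
    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, task, *args):
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None}
        def worker():
            result = task(*args)
            with self._lock:
                self._jobs[job_id] = {"status": "done", "result": result}
        threading.Thread(target=worker, daemon=True).start()
        return job_id

    def poll(self, job_id):
        with self._lock:
            return dict(self._jobs[job_id])

store = JobStore()
job_id = store.submit(lambda q: f"answer to: {q}", "heritability of fleece weight")
while store.poll(job_id)["status"] != "done":
    time.sleep(0.01)
print(store.poll(job_id)["result"])  # answer to: heritability of fleece weight
```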
- Artifact Generation: Results are processed into a standard payload (`ChatArtifacts`) containing interactive Plotly charts and formatted tables.
- LLM Summarization: Uses LangChain and ChatGPT to provide grounded, natural language insights based on the retrieved artifacts and literature.
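The artifact payload contract between backend and frontend might look roughly like the following; the actual `ChatArtifacts` field names and example values are assumptions:

```python
from dataclasses import asdict, dataclass, field

# Hypothetical sketch of the artifact payload contract; the real
# ChatArtifacts field names and values may differ.
@dataclass
class ChatArtifacts:
    answer: str
    charts: list = field(default_factory=list)  # Plotly figure specs (JSON-serializable)
    tables: list = field(default_factory=list)  # formatted rows for the UI

payload = ChatArtifacts(
    answer="Average EBV rose 0.12 across the flock.",
    charts=[{"type": "bar", "x": ["2023", "2024"], "y": [0.31, 0.43]}],
    tables=[{"columns": ["year", "mean_ebv"], "rows": [["2023", 0.31], ["2024", 0.43]]}],
)
print(asdict(payload)["charts"][0]["type"])  # bar
```

Keeping the payload JSON-serializable (plain dicts and lists) is what lets the frontend render charts and tables without backend-specific types.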
```mermaid
graph TD
    User((User)) -->|Query/Prompt| AppServer[Application Server]
    AppServer -->|Query| Search[Search]
    Search -->|Fetch Information| KB[Knowledge Base]
    KB -->|Relevant information| Search
    Search -->|Enhanced context| AppServer
    AppServer -->|Enhanced Context| LLM[LLM]
    LLM -->|Generated text response| AppServer
    AppServer -->|Response| User
    KB -->|Internal Database, Web API, Web Search| DataSources[Data Sources]

    subgraph Knowledge Base
        DataSources
    end

    subgraph LLM Components
        LLM
        note[Perplexity, Llama Maverick]
    end

    style User fill:#f9f,stroke:#333,stroke-width:2px
    style AppServer fill:#bbf,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    style Search fill:#bfb,stroke:#333,stroke-width:2px
    style KB fill:#fbb,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    style LLM fill:#ffd,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    style DataSources fill:#ddd,stroke:#333,stroke-width:1px
```
Note: Components marked as (Planned) (e.g., Neo4j, DevOps Server) are part of the strategic roadmap. While the DocumentDB architecture is fully implemented and tested, the platform currently uses Supabase for cost optimization during this phase.
```mermaid
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#2563eb',
    'primaryTextColor': '#fff',
    'primaryBorderColor': '#1d4ed8',
    'lineColor': '#3b82f6',
    'secondaryColor': '#059669',
    'tertiaryColor': '#dc2626',
    'background': '#f8fafc',
    'mainBkg': '#ffffff',
    'secondBkg': '#f1f5f9'
  }
}}%%
flowchart TD
    A["Users<br>Farmers & Researchers"] L_A_B_0@==> B["Web & Mobile Apps<br>React + React Native"]
    B L_B_C_0@==> C["Main Server<br>2 APIs + AI + Research Tools"]
    C L_C_E_0@==> E["PostgreSQL (RDS)<br>Farm & Animal Data"]
    C L_C_G_0@==> G["DocumentDB<br>(Ready/Planned)"]
    C L_C_H_0@==> H["S3<br>Files & Backups"]
    F["Neo4j<br>(Planned)"] L_C_F_0@==> C
    I["Research APIs<br>PubMed + Nature"] L_I_J_0@==> J["ChromaDB<br>via LangChain"]
    J L_J_C_0@==> C
    D["DevOps Server<br>(Planned)"] L_D_C_0@==> C
    E L_E_F_0@== "Real-time ETL" ==> F

    A:::userStyle
    B:::frontendStyle
    C:::serverStyle
    E:::dataStyle
    G:::dataStyle
    H:::dataStyle
    F:::dataStyle
    I:::externalStyle
    J:::dataStyle
    D:::devopsStyle

    classDef userStyle fill:#2563eb,stroke:#1d4ed8,stroke-width:3px,color:#fff
    classDef frontendStyle fill:#059669,stroke:#047857,stroke-width:3px,color:#fff
    classDef serverStyle fill:#7c3aed,stroke:#6d28d9,stroke-width:3px,color:#fff
    classDef devopsStyle fill:#dc2626,stroke:#b91c1c,stroke-width:3px,color:#fff
    classDef dataStyle fill:#0891b2,stroke:#0e7490,stroke-width:3px,color:#fff
    classDef externalStyle fill:#6b7280,stroke:#4b5563,stroke-width:3px,color:#fff

    L_A_B_0@{ animation: fast }
    L_B_C_0@{ animation: fast }
    L_C_E_0@{ animation: fast }
    L_C_G_0@{ animation: fast }
    L_C_H_0@{ animation: fast }
    L_C_F_0@{ animation: fast }
    L_I_J_0@{ animation: fast }
    L_J_C_0@{ animation: fast }
    L_D_C_0@{ animation: fast }
    L_E_F_0@{ animation: fast }
```
The platform implements a Hybrid Sync/Async API Architecture to manage high-latency AI operations:
- Request Initiation: The React frontend communicates with services via environment-driven endpoints (`VITE_FASTAPI_URL`, `VITE_USER_BACKEND_URL`).
- API Gateway & VPC: Requests are routed through AWS API Gateway and VPC Links, ensuring secure internal communication to ECS-hosted containers.
- Chat Execution Patterns:
  - Synchronous Path: For simple queries, the frontend calls `/chat` (POST) and waits for a direct JSON response.
  - Asynchronous Path: For complex DB/Research queries (determined by `isLikelyDbQuery`), the frontend calls `/chat/async`.
  - Job Polling: Upon receiving a `job_id`, the frontend enters a polling loop via `/chat/async/{job_id}`, retrieving the final `ChatArtifacts` once the background task completes.
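The polling contract can be sketched as follows. The real frontend performs these calls in TypeScript over HTTP; here the same loop is shown in Python against a stubbed client, with the endpoint response shapes assumed:

```python
import time

# Stubbed client simulating the assumed async-chat contract; the real
# frontend makes these calls over HTTP in TypeScript.
class FakeChatAPI:
    def __init__(self):
        self._polls_left = 2  # pretend the job finishes after two polls

    def post_chat_async(self, message):
        return {"job_id": "job-1"}

    def get_chat_async(self, job_id):
        if self._polls_left > 0:
            self._polls_left -= 1
            return {"status": "running"}
        return {"status": "done", "artifacts": {"answer": "grounded answer"}}

def run_async_query(api, message, interval=0.01):
    """Submit a query, then poll until the background job completes."""
    job_id = api.post_chat_async(message)["job_id"]
    while True:
        state = api.get_chat_async(job_id)
        if state["status"] == "done":
            return state["artifacts"]
        time.sleep(interval)

print(run_async_query(FakeChatAPI(), "compare lambing rates by breed"))
```

A production loop would also cap the number of polls and surface a `"failed"` status; both are omitted here for brevity.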
- Service Interoperability:
  - Main Backend ↔ User Backend: Services share a common Supabase instance for persistent state while maintaining isolation for AI-specific workloads.
  - Backend → External APIs: All literature and LLM calls are orchestrated through the FastAPI service layer using the connectivity methods defined in the Tech Stack.
  - Data Persistence: Chat history and artifacts are persisted to Supabase (PostgreSQL) using the `appendMessage` workflow in the `chatStore`.
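The `appendMessage` workflow can be sketched with an in-memory store; the real `chatStore` writes to Supabase keyed by user and session, and the method and field names below are assumptions:

```python
from collections import defaultdict

# In-memory sketch of the appendMessage workflow; the real chatStore
# persists to Supabase (PostgreSQL). Field names here are assumptions.
class ChatStore:
    def __init__(self):
        self._sessions = defaultdict(list)

    def append_message(self, session_id, role, content):
        """Append one message to a session's history and return it."""
        message = {"role": role, "content": content}
        self._sessions[session_id].append(message)
        return message

    def history(self, session_id):
        return list(self._sessions[session_id])

store = ChatStore()
store.append_message("s1", "user", "What is the flock's mean EBV?")
store.append_message("s1", "assistant", "Mean EBV is 0.43.")
print(len(store.history("s1")))  # 2
```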
(GIF of Emilia AI usage will be placed here)