
πŸ” Cognivia AI


Cognivia AI is an AI-powered PDF search and question-answering platform: upload PDFs, ask questions in natural language, and get answers grounded in your documents, with persistent conversation memory.

Author: Muhammad Husnain Ali

πŸ› οΈ Technologies Used

Core Technologies

Data Processing & Storage

  • Pinecone - Vector database for similarity search
  • Supabase - PostgreSQL database for conversation history
  • PyPDF2 - PDF processing library

AI/ML Components

  • OpenAI Embeddings - text-embedding-3-small model (512 dimensions)
  • Vector Search - Semantic similarity matching
  • Conversation Memory - Context-aware chat history
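Semantic similarity matching boils down to comparing embedding vectors with a distance metric. As a minimal sketch (cosine similarity is a common choice for OpenAI embeddings; the actual metric is configured on the Pinecone index):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The document whose embedding is closest to the query embedding wins the search.
query = [0.1, 0.3, 0.5]
docs = {"doc_a": [0.1, 0.3, 0.5], "doc_b": [0.9, 0.1, 0.0]}
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
```

In production the 512-dimension vectors come from the OpenAI embeddings API and the comparison happens inside Pinecone; this sketch only illustrates the math.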

Development Tools

  • Python Virtual Environment - Dependency isolation
  • Environment Variables - Secure configuration management
  • SQL - Database schema management

πŸš€ Features

  • Advanced PDF Processing

    • Automatic text extraction and semantic chunking
    • Support for multiple PDF uploads
    • Intelligent document metadata preservation
    • OCR support for scanned documents
  • Optimized RAG (Retrieval-Augmented Generation)

    • Document-Only Responses: Strictly answers based on uploaded documents
    • Existing Document Support: Automatically detects and works with pre-existing PDFs
    • Similarity Threshold Filtering: Configurable relevance scoring
    • Generic Response Detection: Prevents hallucination and general knowledge responses
    • Source Attribution: Always cites document sources with page numbers
    • Context Validation: Ensures answers are grounded in document content
  • AI-Powered Question Answering

    • Natural language understanding with document constraints
    • Context-aware responses from your PDFs only
    • Multi-document correlation and analysis
    • Intelligent "I don't know" responses when information isn't available
  • Enterprise-Grade Vector Search

    • High-performance similarity matching with thresholds
    • Scalable document indexing
    • Real-time search capabilities
    • Configurable search parameters and document limits
  • Smart Conversation Management

    • Persistent chat history with Supabase
    • Context retention across sessions
    • Document-aware conversation flow
    • Multi-user support with session isolation
  • Modern Chatbot Interface

    • Chat Bubble Design: WhatsApp-style message interface
    • Real-time Conversations: Instant responses with typing indicators
    • Source Document Display: Expandable source citations
    • Responsive Design: Mobile-friendly chat experience
    • Document Status Tracking: Upload progress and document counts
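Document-only responses with source attribution are usually enforced at the prompt level. The actual prompt in qa_system.py isn't shown here; this is a hypothetical sketch of how such a prompt can be assembled (function and field names are illustrative):

```python
def build_prompt(question, chunks):
    """Assemble a document-only prompt.

    `chunks` holds dicts with 'text', 'source', and 'page' keys (an assumed
    shape, not the actual qa_system.py data model).
    """
    context = "\n\n".join(
        f"[{c['source']}, p. {c['page']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer ONLY from the context below. If the answer is not in the "
        'context, reply "I don\'t know based on the uploaded documents."\n'
        "Cite sources as [filename, page].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What is the refund policy?",
    [{"text": "Refunds within 30 days.", "source": "policy.pdf", "page": 4}],
)
```

Keeping the refusal instruction and the citation format in the prompt is what produces the "I don't know" behavior and the page-numbered citations described above.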

πŸ—οΈ Architecture

  • Frontend: Streamlit web interface
  • LLM: OpenAI GPT-3.5-turbo for intelligent responses
  • Embeddings: OpenAI text-embedding-3-small (512 dimensions)
  • Vector Store: Pinecone for document similarity search
  • Memory: Supabase PostgreSQL for conversation persistence
  • PDF Processing: PyPDF2 with intelligent text chunking
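Chunking with overlap keeps context that straddles chunk boundaries. A minimal character-based sketch (the real pdf_processor.py may split on sentences or tokens instead; the fixed-size approach here is an assumption):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters, so each chunk start
    advances by (chunk_size - overlap).
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 2500, chunk_size=1000, overlap=200)
```

With 2,500 characters, a 1,000-character chunk size, and a 200-character overlap, this yields three chunks whose edges repeat 200 characters each.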

βš™οΈ Requirements

  • Python 3.8+
  • OpenAI API key
  • Pinecone account and API key
  • Supabase project (for conversation memory)

πŸš€ Quick Setup

1. Clone and Setup Virtual Environment

# Clone the repository
git clone <repository-url>
cd ai-pdf-search-engine

# Create and activate virtual environment
# For Windows
python -m venv venv
.\venv\Scripts\activate

# For macOS/Linux
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Configure Environment

Create a .env file in the project root:

# Required API Keys
OPENAI_API_KEY=your_openai_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=your_index_name
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSION=512
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key

# RAG Optimization Settings (Optional)
SIMILARITY_THRESHOLD=0.7          # Document relevance threshold (0.0-1.0)
MAX_DOCUMENTS_PER_QUERY=5         # Maximum documents to retrieve
LLM_TEMPERATURE=0                 # Response creativity (0.0-1.0)
MAX_TOKENS=1000                   # Maximum response length

# Chatbot Settings (Optional)
MAX_CHAT_HISTORY=20               # Messages to keep in memory
ENABLE_SOURCE_DISPLAY=true        # Show source documents
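The optional variables above are typically read with fallbacks to their documented defaults. A sketch of how config.py might load them (the variable names follow the .env template above; the exact loading code is an assumption):

```python
import os

def load_rag_settings():
    """Read optional RAG settings from the environment, with documented defaults."""
    return {
        "similarity_threshold": float(os.getenv("SIMILARITY_THRESHOLD", "0.7")),
        "max_documents_per_query": int(os.getenv("MAX_DOCUMENTS_PER_QUERY", "5")),
        "llm_temperature": float(os.getenv("LLM_TEMPERATURE", "0")),
        "max_tokens": int(os.getenv("MAX_TOKENS", "1000")),
    }

settings = load_rag_settings()
```

Because every optional setting has a default, the application still starts with only the required API keys present.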

3. Setup Supabase Tables

  1. Navigate to your Supabase project dashboard
  2. Go to the SQL Editor
  3. Open the provided setup_supabase.sql file in the project root
  4. Execute the SQL commands to:
    • Create chat sessions and messages tables
    • Set up appropriate indexes
    • Enable Row Level Security (RLS)
    • Configure access policies

The SQL file includes all necessary table definitions, indexes, and security policies for the chat system.
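Once the tables exist, supabase_memory.py persists messages through the supabase-py client. As a hypothetical sketch (the table name chat_messages and its columns are assumptions; check setup_supabase.sql for the real schema):

```python
def save_message(client, session_id, role, content):
    """Insert one chat message row.

    `client` is a supabase-py Client, e.g. from
    supabase.create_client(SUPABASE_URL, SUPABASE_KEY).
    """
    client.table("chat_messages").insert(
        {"session_id": session_id, "role": role, "content": content}
    ).execute()
```

Keying rows by session_id is what enables the multi-user session isolation described in the features.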

4. Run Application

# Option 1: Use the runner script (recommended)
python run_app.py

# Option 2: Run directly with Streamlit
# Make sure your virtual environment is activated
# For Windows
.\venv\Scripts\activate

# For macOS/Linux
source venv/bin/activate

# Run the application
streamlit run app.py

5. Test the System

Run the test scripts to verify functionality:

# Test that the system only responds based on documents
python test_optimized_rag.py

# Demo working with existing documents (if any)
python demo_existing_docs.py

6. Deactivating Virtual Environment

When you're done working on the project, you can deactivate the virtual environment:

deactivate

πŸ—οΈ Project Structure

ai-pdf-search-engine/
β”œβ”€β”€ app.py                 # Streamlit web interface
β”œβ”€β”€ config.py             # Configuration and environment variables
β”œβ”€β”€ pdf_processor.py      # PDF text extraction and chunking
β”œβ”€β”€ vector_store.py       # Pinecone vector database integration
β”œβ”€β”€ qa_system.py          # Question-answering logic
β”œβ”€β”€ pdf_search_engine.py  # Main orchestration class
β”œβ”€β”€ supabase_memory.py    # Conversation memory with Supabase
β”œβ”€β”€ requirements.txt      # Python dependencies
β”œβ”€β”€ .env.example         # Environment variables template
β”œβ”€β”€ setup_supabase.sql   # Database schema for memory
β”œβ”€β”€ .gitignore          # Git ignore configuration
└── README.md           # This file

πŸ’‘ Advanced Configuration

RAG Optimization

# Fine-tune document retrieval and response quality
SIMILARITY_THRESHOLD = 0.7     # Higher = more strict document relevance
MAX_DOCUMENTS_PER_QUERY = 5    # More documents = better context, slower response
LLM_TEMPERATURE = 0            # 0 = deterministic, 1 = creative responses
MAX_TOKENS = 1000              # Longer responses vs. faster generation
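The effect of SIMILARITY_THRESHOLD and MAX_DOCUMENTS_PER_QUERY can be sketched as a simple filter over the (document, score) pairs returned by the vector store (illustrative names, not the actual vector_store.py code):

```python
def filter_by_threshold(results, threshold=0.7, max_docs=5):
    """Keep at most max_docs documents scoring at or above the threshold.

    `results` is a list of (document, similarity_score) pairs, assumed to be
    sorted best-first, as vector stores typically return them.
    """
    kept = [doc for doc, score in results if score >= threshold]
    return kept[:max_docs]

docs = filter_by_threshold([("a", 0.91), ("b", 0.72), ("c", 0.40)], threshold=0.7)
```

Raising the threshold trims weakly related chunks before they reach the LLM, which is what prevents off-topic context from diluting answers.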

Performance Tuning

# config.py
CHUNK_SIZE = 1000          # Adjust based on document complexity
CHUNK_OVERLAP = 200        # Increase for better context preservation
MAX_CHAT_HISTORY = 20      # Balance memory vs. performance
CACHE_TTL = 3600          # Cache lifetime in seconds
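CACHE_TTL implies time-based expiry. A minimal TTL cache sketch (whether the app caches exactly this way is an assumption):

```python
import time

class TTLCache:
    """Dict-like cache whose entries expire ttl seconds after being set."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self._store = {}  # key -> (value, stored_at)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and miss
            return default
        return value

cache = TTLCache(ttl=3600)
cache.set("answer:q1", "cached response")
```

A cache like this can sit in front of the embedding or retrieval step so repeated questions within the TTL window skip the API round trip.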

Scaling Considerations

  • Recommended Pinecone tier: Standard or Enterprise
  • Minimum RAM: 4GB
  • Recommended CPU: 4 cores
  • Storage: 10GB+ for document cache

πŸ”§ Troubleshooting

Common Issues

  1. PDF Processing Fails

    • Ensure PDF is not password protected
    • Check file permissions
    • Verify PDF is not corrupted
  2. Vector Store Errors

    • Confirm Pinecone API key is valid
    • Check index dimensions match configuration
    • Verify network connectivity
  3. Memory Issues

    • Clear browser cache
    • Restart application
    • Check Supabase connection
  4. Existing Documents Not Found

    • Verify correct Pinecone index name in .env
    • Check if using different API keys
    • Run python demo_existing_docs.py to diagnose
    • Use "Refresh Documents" button in the app

🀝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit changes (git commit -m 'Add AmazingFeature')
  4. Push to branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ™ Acknowledgments

  • OpenAI team for their powerful language models
  • Pinecone for vector search capabilities
  • Supabase team for the excellent database platform
  • LangChain community for the framework
  • All contributors and users of this project

πŸ“ž Support


Made with ❀️ by Muhammad Husnain Ali

About

Cognivia AI is a powerful AI-powered PDF search and question-answering system built with LangChain, Pinecone Vector Store, OpenAI, and Supabase. Upload PDFs, ask questions, and get intelligent answers with persistent conversation memory.
