Getting Started Guide

Welcome to Week 3 of the LLM learning track! This guide will get you up and running in 5 minutes.

🆓 This project uses FREE local models via Ollama - no API costs!


🎯 What You're Building

A reliable LLM system that extracts structured data from messy text:

"Invoice #123, total $456.78, due March 15th"
                    ↓
{
  "invoice_number": "123",
  "total_amount": 456.78,
  "due_date": "2025-03-15"
}

Every output is validated against a schema, and failed attempts are retried with feedback, targeting 99%+ reliability.
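To make "validation" concrete, here is a minimal standard-library sketch of the kind of checks a schema applies to the output above. The project itself uses Pydantic models for this; `validate_invoice` is a hypothetical stand-in, not the project's code.

```python
import json
from datetime import date


def validate_invoice(payload: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    number = payload.get("invoice_number")
    if not isinstance(number, str) or not number:
        errors.append("invoice_number must be a non-empty string")

    total = payload.get("total_amount")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total <= 0:
        errors.append("total_amount must be a positive number")

    try:
        date.fromisoformat(payload.get("due_date"))
    except (TypeError, ValueError):
        errors.append("due_date must be an ISO date (YYYY-MM-DD)")

    return errors


good = json.loads('{"invoice_number": "123", "total_amount": 456.78, "due_date": "2025-03-15"}')
bad = json.loads('{"invoice_number": 123, "total_amount": -5, "due_date": "March 15th"}')

print(validate_invoice(good))  # no problems
print(validate_invoice(bad))   # one problem per field
```

Returning a list of problems, rather than raising on the first one, makes it easy to report every issue back to the model in a single retry.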


⚡ Quick Start (3 Steps)

Step 1: Setup (2 minutes)

Install Ollama first:

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Or macOS with Homebrew
brew install ollama

# Start Ollama service
ollama serve

# Pull the model (in another terminal)
ollama pull llama3.2

Then set up the project:

# Run the quick start script
./quickstart.sh

This will:

  • ✅ Check Ollama installation
  • ✅ Download llama3.2 model if needed
  • ✅ Create virtual environment
  • ✅ Install dependencies
  • ✅ Run a demo extraction

OR manually:

# Start Ollama and pull the model (install Ollama first, as shown above)
ollama serve  # Keep running in one terminal
ollama pull llama3.2  # In another terminal

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment (optional - has defaults)
cp .env.example .env

Step 2: Run Your First Extraction (1 minute)

# Extract from a sample invoice
python cli.py extract \
  --input sample_inputs/invoice_tech.txt \
  --type invoice

You should see:

✓ Extraction succeeded after 1 attempt!

┌─ Extracted Data ───────────────────┐
│ {
│   "invoice_number": "INV-2025-0342",
│   "total_amount": 8470.43,
│   "vendor_name": "TechSupply Solutions Inc.",
│   ...
│ }
└────────────────────────────────────┘

Step 3: Try More Examples (2 minutes)

# Extract from an email
python cli.py extract \
  --input sample_inputs/email_project.txt \
  --type email

# Extract from a support ticket
python cli.py extract \
  --input sample_inputs/support_ticket_urgent.txt \
  --type support_ticket

# See all available schema types
python cli.py list-schemas

📚 What to Read Next

1. Understand the Concepts (15 minutes)

Read CONCEPTS.md to understand:

  • Why function calling matters
  • How guardrails work
  • What makes output reliable

2. Explore the Code (20 minutes)

Start with schemas:

# Open the schema definitions
code src/schemas.py

Look for:

  • InvoiceData - see how fields are defined
  • Field() validators - see validation rules
  • @field_validator - see custom validation

Then explore the engine:

# Open the extraction engine
code src/extractor.py

Look for:

  • extract() method - main entry point
  • _call_llm() - how we use function calling
  • Retry logic in the main loop
  • _build_validation_feedback() - error recovery
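Before reading the real code, it can help to see the general shape of a validate-and-retry loop. The sketch below is not the project's implementation: `validate`, `build_feedback`, `extract_with_retries`, and the fake model are hypothetical names, and the model call is replaced by a plain callable so that the example runs without Ollama.

```python
import json
from typing import Callable


def validate(payload: dict) -> list[str]:
    """Hypothetical stand-in for schema validation: returns a list of problems."""
    errors = []
    if not isinstance(payload.get("invoice_number"), str):
        errors.append("invoice_number: expected a string")
    if not isinstance(payload.get("total_amount"), (int, float)):
        errors.append("total_amount: expected a number")
    return errors


def build_feedback(errors: list[str]) -> str:
    """Turn validation errors into a corrective message for the next attempt."""
    return "Your previous answer was invalid. Fix these issues:\n- " + "\n- ".join(errors)


def extract_with_retries(
    text: str, call_llm: Callable[[str], str], max_attempts: int = 3
) -> tuple[dict, int]:
    """Call the model until its output validates; return (data, attempts_used)."""
    prompt = text
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        raw = call_llm(prompt)
        try:
            payload = json.loads(raw)
            errors = validate(payload) if isinstance(payload, dict) else ["expected a JSON object"]
        except json.JSONDecodeError as exc:
            errors = [f"output was not valid JSON: {exc.msg}"]
        if not errors:
            return payload, attempt
        # Append the feedback so the model can correct itself on the next try.
        prompt = f"{text}\n\n{build_feedback(errors)}"
    raise ValueError(f"Validation failed after {max_attempts} attempts: {errors}")


# Fake model for demonstration: wrong type first, corrected output second.
responses = iter([
    '{"invoice_number": 123, "total_amount": 456.78}',
    '{"invoice_number": "123", "total_amount": 456.78}',
])
data, attempts = extract_with_retries("Invoice #123, total $456.78", lambda prompt: next(responses))
print(data, attempts)
```

When you read src/extractor.py, compare how the real loop feeds validation errors back to the model with this simplified version.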

3. Run with Verbose Logging (10 minutes)

# See everything that happens
python cli.py extract \
  --input sample_inputs/invoice_tech.txt \
  --type invoice \
  --verbose

Then check the logs:

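# Show the newest entry in logs/
ls -lt logs/ | head -2

# Print only the most recent extraction log (a bare glob would print every log)
cat "$(ls -t logs/extraction_*.log | head -1)"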
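If you prefer to do this from Python, for example on Windows where `ls` and `head` may not be available, the following sketch finds the newest log. It assumes only what the commands above show, a `logs/` directory containing `extraction_*.log` files; `latest_log` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def latest_log(log_dir: str = "logs", pattern: str = "extraction_*.log") -> Optional[Path]:
    """Return the most recently modified log file, or None if there is none."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return None
    candidates = list(directory.glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


newest = latest_log()
if newest is None:
    print("No extraction logs found yet")
else:
    print(f"--- {newest} ---")
    print(newest.read_text(encoding="utf-8", errors="replace"))
```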

🎮 Interactive Learning Exercises

Exercise 1: Break It (5 minutes)

Create a file with incomplete data:

# Single quotes keep the shell from expanding $50 as a variable
echo 'Invoice #123, total $50' > test_invoice.txt

python cli.py extract \
  --input test_invoice.txt \
  --type invoice \
  --verbose

Questions:

  • What validation errors occur?
  • Does it retry? How many times?
  • What's the final error message?

Exercise 2: Test Validation (5 minutes)

Create JSON that is well-formed but violates the schema:

echo '{"invoice_number": 123}' > bad.json

python cli.py validate \
  --schema invoice \
  --file bad.json

Questions:

  • What error does Pydantic report?
  • Why is 123 invalid for invoice_number?
  • What would be valid?

Exercise 3: Add a Field (15 minutes)

Open src/schemas.py and add a new optional field to InvoiceData:

payment_method: Optional[str] = Field(
    None,
    description="Payment method (credit card, check, etc.)"
)

Save and run:

python cli.py extract \
  --input sample_inputs/invoice_tech.txt \
  --type invoice

Does it extract the new field?

Exercise 4: Create Your Own Schema (30 minutes)

Add a new extraction type for receipts:

  1. Add to src/schemas.py:
class ReceiptData(BaseModel):
    store_name: str
    purchase_date: str = Field(..., pattern=r"^\d{4}-\d{2}-\d{2}$")
    total: float = Field(..., gt=0)
    items: List[str]
    payment_method: Optional[str] = None
  2. Register it:
EXTRACTION_SCHEMAS["receipt"] = {
    "model": ReceiptData,
    "description": "Extract data from receipts",
    "name": "extract_receipt_data"
}
  3. Create a sample receipt in sample_inputs/receipt.txt

  4. Test it:

python cli.py extract \
  --input sample_inputs/receipt.txt \
  --type receipt
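In a function-calling setup, a registry entry like the one in step 2 is typically sent to the model as a tool definition. The guide does not show how this project performs that conversion, so the sketch below is an assumption: it uses the OpenAI-style tool format, a hand-written JSON Schema that mirrors `ReceiptData`, and a hypothetical helper named `to_tool_definition`.

```python
import json

# Hand-written JSON Schema mirroring the ReceiptData model from step 1.
receipt_parameters = {
    "type": "object",
    "properties": {
        "store_name": {"type": "string"},
        "purchase_date": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        "total": {"type": "number", "exclusiveMinimum": 0},
        "items": {"type": "array", "items": {"type": "string"}},
        "payment_method": {"type": ["string", "null"]},
    },
    # payment_method is optional, so it is not listed here.
    "required": ["store_name", "purchase_date", "total", "items"],
}


def to_tool_definition(name: str, description: str, parameters: dict) -> dict:
    """Wrap a schema in an OpenAI-style function tool (assumed format)."""
    return {
        "type": "function",
        "function": {"name": name, "description": description, "parameters": parameters},
    }


tool = to_tool_definition(
    "extract_receipt_data", "Extract data from receipts", receipt_parameters
)
print(json.dumps(tool, indent=2))
```

Seeing the schema spelled out shows why the field constraints matter: the model is told the expected shape up front, and the same rules are used to check its answer afterwards.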

๐Ÿ› Troubleshooting

"Ollama not found" or "command not found: ollama"

  • Install Ollama from https://ollama.ai
  • On macOS: brew install ollama
  • Verify: ollama --version

"Connection refused" to localhost:11434

  • Start Ollama: ollama serve
  • Check if running: curl http://localhost:11434/api/tags
  • Make sure no firewall is blocking port 11434

"Model not found"

  • Pull the model: ollama pull llama3.2
  • List installed models: ollama list
  • Try a different model: ollama pull mistral
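The connection and model checks above can also be scripted. This sketch queries the same `/api/tags` endpoint used in the `curl` command and looks for a model by name. `fetch_tags` and `has_model` are hypothetical helpers, and the response shape (a `models` list whose entries have a `name` such as `llama3.2:latest`) is an assumption about Ollama's API to verify against your installed version.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> Optional[dict]:
    """Return the parsed /api/tags response, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


def has_model(tags: dict, wanted: str) -> bool:
    """True if `wanted` matches an installed model, with or without a ':tag' suffix."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    return any(name == wanted or name.split(":")[0] == wanted for name in names)


tags = fetch_tags()
if tags is None:
    print("Ollama is not reachable; try `ollama serve`")
elif not has_model(tags, "llama3.2"):
    print("Model missing; try `ollama pull llama3.2`")
else:
    print("Ollama is running and llama3.2 is installed")
```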

"Validation failed after 3 attempts"

  • Check the input text - is it really an invoice/email/ticket?
  • Look at the validation errors in the output
  • Check logs for detailed error messages
  • Try adjusting the schema if it's too strict

"Module not found"

  • Make sure you activated the virtual environment
  • Run pip install -r requirements.txt again

💡 Tips for Learning

1. Use Verbose Mode

Always run with --verbose when learning:

python cli.py extract -i <file> -t <type> --verbose

2. Read the Logs

Logs show everything:

# Find latest log
ls -lt logs/ | head -2

# View only that log (a bare glob would print every log)
cat "$(ls -t logs/extraction_*.log | head -1)"

3. Test Edge Cases

Try inputs that should fail:

  • Missing fields
  • Wrong formats
  • Ambiguous data
  • Empty files
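One way to keep edge cases repeatable is to generate them as files. The inputs below are made-up examples, one per category in the list above, and `write_edge_cases` is a hypothetical helper; swap in whatever text you want to test.

```python
from pathlib import Path

# Hypothetical edge-case inputs, one per category from the list above.
EDGE_CASES = {
    "missing_fields.txt": "Invoice #123",
    "wrong_format.txt": "Invoice #123, total: lots, due: sometime next week",
    "ambiguous.txt": "Invoice #123 or maybe #124, total $50 or $500, due 03/04/05",
    "empty.txt": "",
}


def write_edge_cases(out_dir: str = "edge_cases") -> list[Path]:
    """Write each edge-case input to its own file and return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in EDGE_CASES.items():
        path = target / name
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

After calling `write_edge_cases()`, run each file through the CLI, for example `python cli.py extract --input edge_cases/empty.txt --type invoice --verbose`, and compare the validation errors across cases.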

4. Experiment with Temperature

# More deterministic
python cli.py extract -i <file> -t <type> --temperature 0.0

# More creative
python cli.py extract -i <file> -t <type> --temperature 0.7
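To build intuition for what the `--temperature` flag changes, here is a toy illustration of temperature-scaled sampling probabilities. It is a textbook softmax, not Ollama's actual sampler, and the logits are made up.

```python
import math


def token_probabilities(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by temperature; 0 is treated as greedy (argmax)."""
    if temperature <= 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.7, 1.5):
    print(t, [round(p, 3) for p in token_probabilities(logits, t)])
```

Lower temperatures concentrate probability on the top-scoring token, which is why values near 0.0 tend to suit structured extraction, where the same input should yield the same output.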

5. Use the Validation Command

Test schemas without LLM calls:

python cli.py validate -s invoice -f data.json

📖 Documentation Index


🎯 Learning Path

1. Quick Start (5 min)
   └─→ Get it running

2. Concepts (15 min)
   └─→ Understand why

3. Code Exploration (30 min)
   ├─→ Read schemas.py
   ├─→ Read extractor.py
   └─→ Run with --verbose

4. Hands-On (60 min)
   ├─→ Exercise 1: Break it
   ├─→ Exercise 2: Test validation
   ├─→ Exercise 3: Add a field
   └─→ Exercise 4: Create new schema

5. Deep Dive (60+ min)
   ├─→ Modify retry logic
   ├─→ Add custom validators
   ├─→ Integrate with your app
   └─→ Deploy to production

✅ Success Checklist

After completing this project, you should be able to:

  • Explain why function calling is more reliable than parsing text
  • Define a Pydantic schema with validation rules
  • Use the CLI to extract data from text
  • Understand how retry logic improves success rates
  • Debug validation failures using logs
  • Add a new extraction schema to the system
  • Configure temperature for determinism vs creativity
  • Validate JSON against schemas programmatically
  • Integrate the extractor into your own code
  • Explain when to use LLMs vs traditional parsing
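For the integration item, this guide does not document the project's Python API, so one conservative approach is to wrap the CLI. The sketch below uses only the flags shown in this guide (`--input`, `--type`, `--verbose`, `--temperature`); `build_extract_command` and `run_extract` are hypothetical helper names, and the code assumes it runs from the project root.

```python
import subprocess
import sys
from typing import Optional


def build_extract_command(
    input_path: str,
    schema_type: str,
    verbose: bool = False,
    temperature: Optional[float] = None,
) -> list[str]:
    """Assemble the argv for `python cli.py extract` using flags shown in this guide."""
    cmd = [sys.executable, "cli.py", "extract", "--input", input_path, "--type", schema_type]
    if verbose:
        cmd.append("--verbose")
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    return cmd


def run_extract(input_path: str, schema_type: str, **options) -> subprocess.CompletedProcess:
    """Run the CLI and capture its output as text (does not raise on failure)."""
    return subprocess.run(
        build_extract_command(input_path, schema_type, **options),
        capture_output=True,
        text=True,
        check=False,
    )


print(build_extract_command("sample_inputs/invoice_tech.txt", "invoice", temperature=0.0))
```

Passing the arguments as a list avoids shell quoting problems with file names. Once you have read src/extractor.py, calling its `extract()` method directly is likely a cleaner integration than shelling out.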

🚀 Next Steps

Once you're comfortable with this project:

  1. Extend it

    • Add new schemas (contracts, resumes, catalogs)
    • Add async support for batch processing
    • Build a web UI with Streamlit
  2. Apply it

    • Use it in your own projects
    • Process real documents
    • Build a data pipeline
  3. Continue learning

    • Week 4: RAG & Knowledge Integration
    • Week 5: Agents & Complex Workflows
    • Week 6: Production Deployment

๐Ÿค Need Help?

  • Check logs in logs/
  • Review error messages carefully
  • Read the relevant documentation section
  • Try the troubleshooting guide above
  • Experiment with simpler inputs first

Remember: The goal isn't just to make it work; it's to understand why it works and when to use these patterns.

Take your time, experiment, break things, and learn! 🎓