# AI-Powered IDP Schema Mapper

An AI-powered tool that maps fields between Identity Provider (IDP) API schemas (e.g., Okta, Leen) using embeddings and vector similarity search.

## Overview

This tool compares two schemas, identifies corresponding fields, and optionally uses associated documentation to improve match accuracy. It leverages:

  • OpenAI embeddings for semantic field representation
  • Pinecone vector database for similarity search
  • Bidirectional mapping capabilities with mutual match detection
  • Documentation processing for enhanced context
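To illustrate the core idea (this is not the project's own code), matching two field descriptions by embedding similarity reduces to comparing their vectors with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for OpenAI embeddings
# (real embeddings have hundreds or thousands of dimensions).
okta_field = [0.9, 0.1, 0.3]
leen_field = [0.8, 0.2, 0.25]
print(round(cosine_similarity(okta_field, leen_field), 3))  # → 0.991
```

In practice the tool delegates this comparison to Pinecone rather than computing it locally, but the confidence scores it reports are of this kind: values near 1.0 indicate semantically similar fields.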

## Project Structure and Component Purposes

### Core Components

  • Schema Parser (src/schema_mapper/schema/parser.py)

    • Extracts fields and metadata from OpenAPI JSON schemas
    • Handles nested objects, arrays, and references
    • Creates SchemaField objects for vectorization
  • Field Vectorizer (src/schema_mapper/embedding/vectorizer.py)

    • Creates embeddings for schema fields using OpenAI models
    • Manages vector storage in Pinecone
    • Handles similarity searches between fields
  • Schema Mapper (src/schema_mapper/matching/mapper.py)

    • Orchestrates the mapping process between schemas
    • Supports one-way and bidirectional mapping
    • Detects mutual matches and calculates confidence scores
    • Exports results in various formats
  • Documentation Processor (src/schema_mapper/docs_processor/processor.py)

    • Processes documentation files in multiple formats:
      • Markdown, HTML, plain text
      • YAML and JSON (including OpenAPI schemas)
      • PDF documents
      • Code files (JavaScript, TypeScript, Python, Java)
    • Intelligently extracts relevant documentation from different file types
    • Chunks and embeds documentation for context enrichment
    • Associates relevant documentation with schema fields
  • Configuration (src/schema_mapper/utils/config.py)

    • Manages application settings
    • Handles command-line arguments and config files
    • Provides default values and validation
  • CLI (src/schema_mapper/cli/main.py)

    • Entry point for command-line usage
    • Handles workflow execution
    • Manages input/output operations
  • Web UI (src/schema_mapper/streamlit_app.py)

    • Provides a user-friendly interface via Streamlit
    • Visualizes mapping results and confidence scores
    • Allows interactive configuration
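A hypothetical sketch of what a `SchemaField` record might carry. The field names here are assumptions based on the metadata the parser is described as extracting (path, type, description), not the project's actual class definition:

```python
from dataclasses import dataclass, field

@dataclass
class SchemaField:
    path: str              # dotted path within the schema, e.g. "profile.email"
    field_type: str        # JSON Schema type: "string", "integer", ...
    description: str = ""  # human-readable description used for embedding
    metadata: dict = field(default_factory=dict)

    def embedding_text(self) -> str:
        """Text handed to the embedding model: path, type, and description combined."""
        return f"{self.path} ({self.field_type}): {self.description}"

f = SchemaField(path="profile.email", field_type="string",
                description="Primary email address of the user")
print(f.embedding_text())  # → profile.email (string): Primary email address of the user
```

Combining path, type, and description into a single text gives the embedding model enough context to distinguish, say, a user's email field from an organization's contact email.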

### Additional Files

  • Setup and Config

    • setup.py - Package installation configuration
    • requirements.txt - Dependencies list
    • config-example.yaml - Example configuration template
    • sample_config.yaml - Configuration with sample data paths
  • Sample Data

    • sample_data/schemas/ - Example OpenAPI schemas
    • sample_data/docs/ - Documentation for sample schemas

## Data Flow

### 1. Input Data

The required input data includes:

  • OpenAPI JSON Schemas: Located at paths specified in your config file or via CLI arguments

    • Example: sample_data/schemas/okta_sample.json and sample_data/schemas/leen_sample.json
  • Optional Documentation: Located in directories specified in your config file

    • Example: sample_data/docs/okta_user_api.md and sample_data/docs/leen_identity_api.md
    • Can include GitHub repository structure with various file formats
  • Configuration: Either through a YAML config file, CLI arguments, or environment variables

    • API keys for OpenAI and Pinecone
    • Mapping direction and confidence thresholds
    • Input/output paths
    • Supported documentation formats

### 2. Processing Pipeline

  1. Schema Parsing

    • OpenAPIParser loads and extracts fields from source and target schemas
    • Fields are organized with their metadata (path, type, description)
  2. Documentation Processing (optional)

    • DocumentationProcessor reads, chunks, and indexes documentation files
    • Supports multiple file formats (markdown, HTML, YAML, JSON, PDF, code files)
    • Format-specific extractors identify relevant documentation (descriptions, comments, JSDoc, etc.)
    • Documentation is embedded and stored in Pinecone namespaces (e.g., "okta_docs", "leen_docs")
    • Fields can be enriched with relevant documentation context
  3. Field Vectorization

    • FieldVectorizer creates embeddings for each field using OpenAI
    • Vectors are stored in Pinecone under respective namespaces (e.g., "okta", "leen")
  4. Field Matching

    • SchemaMapper queries similar fields between schemas
    • For each source field, finds the most similar target fields
    • When bidirectional, performs both source→target and target→source mappings
    • Identifies mutual matches (fields that match to each other in both directions)
  5. Output Generation

    • Results are formatted as specified (JSON, CSV, Excel)
    • Includes source field, target match, confidence score, and mutual match flag
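The mutual-match detection in step 4 can be sketched with plain dictionaries. Here each direction's results map a field to its best match and score; all names are illustrative, not the project's actual API:

```python
def detect_mutual_matches(src_to_tgt, tgt_to_src):
    """A field pair is a mutual match when each side's best match is the other.

    Both arguments map field path -> (best matching field path, similarity score).
    """
    mutual = []
    for src, (tgt, score) in src_to_tgt.items():
        back = tgt_to_src.get(tgt)
        if back and back[0] == src:
            mutual.append({"source": src, "target": tgt,
                           "confidence": min(score, back[1]),  # conservative: weaker direction
                           "mutual": True})
    return mutual

# Toy bidirectional results, as would come back from the similarity queries.
okta_to_leen = {"profile.email": ("identity.email", 0.93),
                "profile.login": ("identity.username", 0.81)}
leen_to_okta = {"identity.email": ("profile.email", 0.91),
                "identity.username": ("profile.displayName", 0.64)}

print(detect_mutual_matches(okta_to_leen, leen_to_okta))
# → one mutual match: profile.email ↔ identity.email
```

`profile.login` matched `identity.username`, but the reverse lookup preferred a different field, so only the email pair survives as a mutual match. Taking the minimum of the two directional scores as the confidence is one reasonable choice; the project may combine them differently.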

### 3. Output Data

The output is written to the location specified in your config file (`output.output_path`):

  • Default: `output/mapping_results.[json|csv|xlsx]`
  • Sample: `sample_data/output/mapping_results.[json|csv|xlsx]`

Output formats available:

  • JSON: Detailed mapping with field metadata
  • CSV: Tabular format for spreadsheet analysis
  • Excel: Enhanced spreadsheet with formatting
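A single mapping entry in the JSON output might look like the following. The exact key names are an assumption; the documented contents are the source field, target match, confidence score, and mutual-match flag:

```python
import json

# Hypothetical shape of one entry in mapping_results.json.
record = {
    "source_field": "profile.email",
    "target_field": "identity.email",
    "confidence": 0.93,
    "mutual_match": True,
}
print(json.dumps(record, indent=2))
```

The CSV and Excel formats carry the same columns in tabular form.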

## Installation

### Prerequisites

  • Python 3.8 or higher
  • OpenAI API key
  • Pinecone API key

### Setup

1. Clone the repository:

   ```shell
   git clone https://github.com/example/schema-mapper.git
   cd schema-mapper
   ```

2. Install the package:

   ```shell
   pip install -e .
   ```

3. Set up environment variables (or use CLI parameters):

   ```shell
   # Create .env file
   echo "OPENAI_API_KEY=your_openai_key" > .env
   echo "PINECONE_API_KEY=your_pinecone_key" >> .env
   ```

## Usage

### Basic Command Line Usage

```shell
# Basic mapping from source to target schema
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen

# Bidirectional mapping with a higher confidence threshold
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --direction bidirectional --min-confidence 0.7

# Use documentation for context enrichment
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --source-docs-dir path/to/okta_docs --target-docs-dir path/to/leen_docs --use-docs

# Restrict which documentation formats are processed
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --source-docs-dir path/to/okta_docs --use-docs --doc-formats .md .yaml .json
```

### Using Configuration File

Create a YAML configuration file:

```yaml
openai:
  api_key: your_openai_key
  embedding_model: text-embedding-3-small

pinecone:
  api_key: your_pinecone_key
  index_name: schema-mapper

schema_mapping:
  source_schema: path/to/okta_schema.json
  target_schema: path/to/leen_schema.json
  source_name: okta
  target_name: leen
  direction: bidirectional
  top_k: 5
  min_confidence: 0.6
  detect_mutual_matches: true

documentation:
  source_docs_dir: path/to/okta_docs
  target_docs_dir: path/to/leen_docs
  chunk_size: 1000
  chunk_overlap: 200
  use_docs: true
  # Specify which file formats to process (optional)
  supported_formats:
    - .md       # Markdown
    - .json     # JSON (including OpenAPI schemas)
    - .yaml     # YAML configuration and schemas
    - .js       # JavaScript files with JSDoc comments

output:
  format: json
  output_path: output/mapping_results
```

Then run with:

```shell
schema-mapper --config your_config.yaml
```
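The `chunk_size` and `chunk_overlap` settings in the `documentation` section correspond to fixed-size chunking with overlap, so context is not lost at chunk boundaries. A minimal sketch of that idea (not the project's implementation):

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size chunks; consecutive chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

doc = "x" * 2500
chunks = chunk_text(doc)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

With the defaults, each chunk repeats the last 200 characters of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.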

### Web UI (Streamlit)

For a more interactive experience, use the Streamlit UI:

```shell
streamlit run src/schema_mapper/streamlit_app.py
```

The web interface allows you to:

  • Upload schema files
  • Configure mapping parameters
  • Visualize mapping results
  • Export in various formats

## Example Workflow

  1. Prepare your schemas - Ensure you have OpenAPI JSON schemas for both source and target systems

  2. Gather documentation (optional) - Collect relevant API documentation in various formats (markdown, JSON, YAML, code files with comments)

  3. Create a configuration file - Copy config-example.yaml and modify for your needs

  4. Run the mapping - Execute via CLI or web UI

  5. Review the results - Examine the mappings and confidence scores

  6. Iterate if needed - Adjust confidence thresholds or add more documentation context

## Troubleshooting

  • Missing API keys: Ensure OpenAI and Pinecone API keys are set
  • Schema parsing errors: Validate your OpenAPI schemas
  • Low confidence scores: Consider adding more documentation context
  • Memory issues: For large schemas, adjust batch sizes in vectorizer.py
  • File parsing errors: Check logs for format-specific errors; install optional dependencies as needed (PyPDF2, beautifulsoup4, etc.)

## License

MIT
