AI-Powered IDP Schema Mapper

An AI-powered tool that maps fields between Identity Provider (IDP) API schemas (e.g., Okta, Leen) using embeddings and vector similarity search.

Overview

This tool compares two schemas, identifies corresponding fields, and optionally uses associated documentation to improve match accuracy. It leverages:

- OpenAI embeddings for semantic field representation
- Pinecone vector database for similarity search
- Bidirectional mapping capabilities with mutual match detection
- Documentation processing for enhanced context
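
To make "vector similarity search" concrete, here is a minimal sketch that ranks candidate fields by cosine similarity. It rests on several assumptions: the three-dimensional vectors and the field names are made up for illustration, cosine similarity is assumed as the metric, and in the tool itself the search is delegated to Pinecone rather than computed locally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (hypothetical field names and values).
toy_vectors = {
    "okta:profile.email": [0.9, 0.1, 0.0],
    "leen:user.email_address": [0.8, 0.2, 0.1],
    "leen:user.created_at": [0.0, 0.1, 0.9],
}

# Rank the "leen" fields by similarity to one "okta" field.
query = toy_vectors["okta:profile.email"]
ranked = sorted(
    (name for name in toy_vectors if name.startswith("leen:")),
    key=lambda name: cosine_similarity(query, toy_vectors[name]),
    reverse=True,
)
print(ranked[0])
```

With these toy values the email-like field ranks first, which is the behavior the embedding approach relies on: semantically related fields end up close together in vector space.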

Project Structure and Component Purposes

Core Components

Schema Parser (src/schema_mapper/schema/parser.py)
- Extracts fields and metadata from OpenAPI JSON schemas
- Handles nested objects, arrays, and references
- Creates SchemaField objects for vectorization
Field Vectorizer (src/schema_mapper/embedding/vectorizer.py)
- Creates embeddings for schema fields using OpenAI models
- Manages vector storage in Pinecone
- Handles similarity searches between fields
Schema Mapper (src/schema_mapper/matching/mapper.py)
- Orchestrates the mapping process between schemas
- Supports one-way and bidirectional mapping
- Detects mutual matches and calculates confidence scores
- Exports results in various formats
Documentation Processor (src/schema_mapper/docs_processor/processor.py)
- Processes documentation files in multiple formats:
  - Markdown, HTML, plain text
  - YAML and JSON (including OpenAPI schemas)
  - PDF documents
  - Code files (JavaScript, TypeScript, Python, Java)
- Uses format-specific extractors to pull descriptions, comments, and JSDoc blocks from each file type
- Chunks and embeds documentation for context enrichment
- Associates relevant documentation with schema fields
Configuration (src/schema_mapper/utils/config.py)
- Manages application settings
- Handles command-line arguments and config files
- Provides default values and validation
CLI (src/schema_mapper/cli/main.py)
- Entry point for command-line usage
- Handles workflow execution
- Manages input/output operations
Web UI (src/schema_mapper/streamlit_app.py)
- Provides a user-friendly interface via Streamlit
- Visualizes mapping results and confidence scores
- Allows interactive configuration
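
The Schema Parser's handling of nested objects, arrays, and references can be illustrated with a small recursive walk. This is a sketch under stated assumptions, not the code in parser.py: the SchemaField shape used here (path, type, description) is inferred from the metadata listed in this README, and extract_fields and resolve_ref are hypothetical helper names. Only local references of the form "#/..." are handled.

```python
from dataclasses import dataclass


@dataclass
class SchemaField:
    # Hypothetical shape; the project's real SchemaField may differ.
    path: str
    type: str
    description: str = ""


def resolve_ref(ref, root):
    """Follow a local reference such as '#/components/schemas/User'."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are supported: {ref}")
    node = root
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def extract_fields(schema, root, prefix="", seen=()):
    """Flatten a schema into SchemaField entries with dotted paths."""
    if "$ref" in schema:
        ref = schema["$ref"]
        if ref in seen:  # stop on circular references
            return []
        return extract_fields(resolve_ref(ref, root), root, prefix, seen + (ref,))

    fields = []
    kind = schema.get("type", "object")
    if prefix:
        fields.append(SchemaField(prefix, kind, schema.get("description", "")))
    if kind == "array":
        # Array items get a "[]" suffix on the parent path.
        fields += extract_fields(schema.get("items", {}), root, prefix + "[]", seen)
    else:
        for name, sub in schema.get("properties", {}).items():
            child = f"{prefix}.{name}" if prefix else name
            fields += extract_fields(sub, root, child, seen)
    return fields


# Made-up document for illustration.
doc = {
    "components": {"schemas": {
        "User": {"type": "object", "properties": {
            "id": {"type": "string", "description": "Unique user ID"},
            "profile": {"$ref": "#/components/schemas/Profile"},
            "emails": {"type": "array", "items": {"type": "string"}},
        }},
        "Profile": {"type": "object", "properties": {
            "firstName": {"type": "string"},
        }},
    }}
}

fields = extract_fields(doc["components"]["schemas"]["User"], doc)
paths = [f.path for f in fields]
print(paths)
```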

Additional Files

Setup and Config
- setup.py - Package installation configuration
- requirements.txt - Dependencies list
- config-example.yaml - Example configuration template
- sample_config.yaml - Configuration with sample data paths
Sample Data
- sample_data/schemas/ - Example OpenAPI schemas
- sample_data/docs/ - Documentation for sample schemas

Data Flow

1. Input Data

The tool takes the following inputs:

OpenAPI JSON Schemas: Located at paths specified in your config file or via CLI arguments
- Example: sample_data/schemas/okta_sample.json and sample_data/schemas/leen_sample.json
Optional Documentation: Located in directories specified in your config file
- Example: sample_data/docs/okta_user_api.md and sample_data/docs/leen_identity_api.md
- Can include GitHub repository structure with various file formats
Configuration: Supplied through a YAML config file, CLI arguments, or environment variables
- API keys for OpenAI and Pinecone
- Mapping direction and confidence thresholds
- Input/output paths
- Supported documentation formats

2. Processing Pipeline

Schema Parsing
- OpenAPIParser loads and extracts fields from source and target schemas
- Fields are organized with their metadata (path, type, description)
Documentation Processing (optional)
- DocumentationProcessor reads, chunks, and indexes documentation files
- Supports multiple file formats (markdown, HTML, YAML, JSON, PDF, code files)
- Format-specific extractors identify relevant documentation (descriptions, comments, JSDoc, etc.)
- Documentation is embedded and stored in Pinecone namespaces (e.g., "okta_docs", "leen_docs")
- Fields can be enriched with relevant documentation context
Field Vectorization
- FieldVectorizer creates embeddings for each field using OpenAI
- Vectors are stored in Pinecone under respective namespaces (e.g., "okta", "leen")
Field Matching
- SchemaMapper queries similar fields between schemas
- For each source field, finds the most similar target fields
- When bidirectional, performs both source→target and target→source mappings
- Identifies mutual matches (fields that match to each other in both directions)
Output Generation
- Results are formatted as specified (JSON, CSV, Excel)
- Includes source field, target match, confidence score, and mutual match flag
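
The Field Matching step above can be sketched in a few lines. Everything here is illustrative: the vectors are toy values, confidence is assumed to equal cosine similarity, and a "mutual match" is assumed to mean that each field is the other's top-ranked match. The project may define either differently. The function names best_matches and mutual_matches are hypothetical; top_k and min_confidence mirror the configuration options of the same names.

```python
import math


def cosine(a, b):
    """Cosine similarity for non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def best_matches(sources, targets, top_k=5, min_confidence=0.6):
    """For each source field, keep up to top_k (target, score) pairs above the threshold."""
    results = {}
    for s_name, s_vec in sources.items():
        scored = [(t_name, cosine(s_vec, t_vec)) for t_name, t_vec in targets.items()]
        scored = [pair for pair in scored if pair[1] >= min_confidence]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[s_name] = scored[:top_k]
    return results


def mutual_matches(forward, backward):
    """Pairs where each field is the other's top-ranked match (assumed definition)."""
    pairs = []
    for s_name, matches in forward.items():
        if not matches:
            continue
        t_name = matches[0][0]
        back = backward.get(t_name, [])
        if back and back[0][0] == s_name:
            pairs.append((s_name, t_name))
    return pairs


# Toy embeddings (made-up field names and values).
okta = {
    "profile.email": [0.9, 0.1, 0.0],
    "profile.login": [0.5, 0.5, 0.1],
    "created": [0.0, 0.1, 0.9],
}
leen = {
    "email_address": [0.8, 0.2, 0.1],
    "created_at": [0.1, 0.0, 0.9],
}

forward = best_matches(okta, leen)    # source -> target
backward = best_matches(leen, okta)   # target -> source
mutual = mutual_matches(forward, backward)
print(mutual)
```

In this toy data, profile.login also matches email_address above the threshold, but email_address prefers profile.email, so only the email pair and the creation-date pair are flagged as mutual. That is the kind of disambiguation bidirectional mapping is meant to provide.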

3. Output Data

Output is written to the location specified in your config file (output.output_path):

Default: output/mapping_results.[json|csv|xlsx]
Sample: sample_data/output/mapping_results.[json|csv|xlsx]

Output formats available:

- JSON: Detailed mapping with field metadata
- CSV: Tabular format for spreadsheet analysis
- Excel: Enhanced spreadsheet with formatting
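
As a rough picture of what the JSON and CSV outputs might contain, the sketch below serializes two mapping rows with the Python standard library. The column names (source_field, target_field, confidence, mutual_match) are assumptions based on the fields this README says are included; the tool's actual keys and layout may differ, and Excel export is omitted because it needs a third-party library.

```python
import csv
import io
import json

# Assumed column names, chosen to mirror the fields described in this README.
FIELDNAMES = ["source_field", "target_field", "confidence", "mutual_match"]


def to_json(rows):
    """Serialize mapping rows as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize mapping rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example rows.
rows = [
    {"source_field": "profile.email", "target_field": "email_address",
     "confidence": 0.98, "mutual_match": True},
    {"source_field": "profile.login", "target_field": "email_address",
     "confidence": 0.86, "mutual_match": False},
]

json_text = to_json(rows)
csv_text = to_csv(rows)
print(csv_text)
```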

Installation

Prerequisites

- Python 3.8 or higher
- OpenAI API key
- Pinecone API key

Setup

Clone the repository:

git clone https://github.com/example/schema-mapper.git
cd schema-mapper

Install the package:

pip install -e .

Set up environment variables (or use CLI parameters):

# Create .env file
echo "OPENAI_API_KEY=your_openai_key" > .env
echo "PINECONE_API_KEY=your_pinecone_key" >> .env
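
Before a long run, it can help to confirm that both keys are visible to the Python process. The helper below is a hypothetical sketch, not part of the package: it inspects only the process environment (or a mapping you pass in), so it assumes the values from .env have already been loaded by whatever mechanism the tool or your shell uses.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(env=None):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "your_openai_key"}))
```

Calling missing_keys() with no argument checks os.environ; an empty list means both keys are present.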

Usage

Basic Command Line Usage

# Basic mapping from source to target schema
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen

# Bidirectional mapping with higher confidence threshold
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --direction bidirectional --min-confidence 0.7

# Use documentation for context enrichment
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --source-docs-dir path/to/okta_docs --target-docs-dir path/to/leen_docs --use-docs

# Restrict processing to specific documentation formats
schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --source-docs-dir path/to/okta_docs --use-docs --doc-formats .md .yaml .json
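
As a rough illustration of what restricting formats with --doc-formats might amount to, the sketch below filters a list of paths by file extension. select_docs is a hypothetical helper and the case-insensitive comparison is an assumption; how the tool actually walks documentation directories is not described in this README.

```python
from pathlib import PurePosixPath


def select_docs(paths, formats=(".md", ".yaml", ".json")):
    """Keep only paths whose extension is in the allowed set (case-insensitive)."""
    allowed = {fmt.lower() for fmt in formats}
    return [p for p in paths if PurePosixPath(p).suffix.lower() in allowed]


# Made-up file names for illustration.
candidates = ["docs/README.md", "docs/api.YAML", "docs/guide.pdf", "docs/Makefile"]
selected = select_docs(candidates)
print(selected)
```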

Using Configuration File

Create a YAML configuration file:

openai:
  api_key: your_openai_key
  embedding_model: text-embedding-3-small

pinecone:
  api_key: your_pinecone_key
  index_name: schema-mapper

schema_mapping:
  source_schema: path/to/okta_schema.json
  target_schema: path/to/leen_schema.json
  source_name: okta
  target_name: leen
  direction: bidirectional
  top_k: 5
  min_confidence: 0.6
  detect_mutual_matches: true

documentation:
  source_docs_dir: path/to/okta_docs
  target_docs_dir: path/to/leen_docs
  chunk_size: 1000
  chunk_overlap: 200
  use_docs: true
  # Specify which file formats to process (optional)
  supported_formats:
    - .md       # Markdown
    - .json     # JSON (including OpenAPI schemas)
    - .yaml     # YAML configuration and schemas
    - .js       # JavaScript files with JSDoc comments

output:
  format: json
  output_path: output/mapping_results

Then run with:

schema-mapper --config your_config.yaml
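
The chunk_size and chunk_overlap settings in the configuration above control how documentation is split before embedding. The sketch below shows one common interpretation, a sliding window measured in characters. That unit is an assumption (the tool may count tokens instead), and chunk_text is a hypothetical function name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 2500-character dummy document.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.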

Web UI (Streamlit)

For a more interactive experience, use the Streamlit UI:

streamlit run src/schema_mapper/streamlit_app.py

The web interface allows you to:

- Upload schema files
- Configure mapping parameters
- Visualize mapping results
- Export in various formats

Example Workflow

1. Prepare your schemas: Ensure you have OpenAPI JSON schemas for both source and target systems
2. Gather documentation (optional): Collect relevant API documentation in various formats (markdown, JSON, YAML, code files with comments)
3. Create a configuration file: Copy config-example.yaml and modify it for your needs
4. Run the mapping: Execute via CLI or web UI
5. Review the results: Examine the mappings and confidence scores
6. Iterate if needed: Adjust confidence thresholds or add more documentation context

Troubleshooting

- Missing API keys: Ensure the OpenAI and Pinecone API keys are set
- Schema parsing errors: Validate your OpenAPI schemas
- Low confidence scores: Consider adding more documentation context
- Memory issues: For large schemas, reduce the batch sizes in src/schema_mapper/embedding/vectorizer.py
- File parsing errors: Check the logs for format-specific errors; install optional dependencies as needed (PyPDF2, beautifulsoup4, etc.)
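
For schema parsing errors, a quick sanity check can rule out the most basic problems before running the mapper. The function below is a hypothetical sketch, not a full OpenAPI validator: it only confirms that the file is valid JSON, that the top level is an object, and that a version field and at least one schema-bearing section are present.

```python
import json


def check_openapi_document(text):
    """Return a list of problems found; an empty list means the basic checks passed."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]

    problems = []
    if "openapi" not in doc and "swagger" not in doc:
        problems.append("missing 'openapi' (3.x) or 'swagger' (2.0) version field")
    if not any(key in doc for key in ("paths", "components", "definitions")):
        problems.append("no 'paths', 'components', or 'definitions' section found")
    return problems


# Minimal made-up documents for illustration.
print(check_openapi_document('{"openapi": "3.0.0", "paths": {}}'))
print(check_openapi_document('{}'))
```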

License

MIT
