An AI-powered tool to map Identity Provider (IDP) API schemas (e.g., Okta, Leen) using embeddings and vector similarity search.
This tool compares two schemas, identifies corresponding fields, and optionally uses associated documentation to improve match accuracy. It leverages:
- OpenAI embeddings for semantic field representation
- Pinecone vector database for similarity search
- Bidirectional mapping capabilities with mutual match detection
- Documentation processing for enhanced context
- Schema Parser (src/schema_mapper/schema/parser.py)
  - Extracts fields and metadata from OpenAPI JSON schemas
  - Handles nested objects, arrays, and references
  - Creates SchemaField objects for vectorization
- Field Vectorizer (src/schema_mapper/embedding/vectorizer.py)
  - Creates embeddings for schema fields using OpenAI models
  - Manages vector storage in Pinecone
  - Handles similarity searches between fields
- Schema Mapper (src/schema_mapper/matching/mapper.py)
  - Orchestrates the mapping process between schemas
  - Supports one-way and bidirectional mapping
  - Detects mutual matches and calculates confidence scores
  - Exports results in various formats
- Documentation Processor (src/schema_mapper/docs_processor/processor.py)
  - Processes documentation files in multiple formats:
    - Markdown, HTML, plain text
    - YAML and JSON (including OpenAPI schemas)
    - PDF documents
    - Code files (JavaScript, TypeScript, Python, Java)
  - Intelligently extracts relevant documentation from different file types
  - Chunks and embeds documentation for context enrichment
  - Associates relevant documentation with schema fields
- Configuration (src/schema_mapper/utils/config.py)
  - Manages application settings
  - Handles command-line arguments and config files
  - Provides default values and validation
- CLI (src/schema_mapper/cli/main.py)
  - Entry point for command-line usage
  - Handles workflow execution
  - Manages input/output operations
- Web UI (src/schema_mapper/streamlit_app.py)
  - Provides a user-friendly interface via Streamlit
  - Visualizes mapping results and confidence scores
  - Allows interactive configuration
- Setup and Config
  - setup.py - Package installation configuration
  - requirements.txt - Dependencies list
  - config-example.yaml - Example configuration template
  - sample_config.yaml - Configuration with sample data paths
- Sample Data
  - sample_data/schemas/ - Example OpenAPI schemas
  - sample_data/docs/ - Documentation for sample schemas
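The Schema Parser's handling of nested objects, arrays, and references can be illustrated with a small sketch. This is not the code in parser.py: `FieldSketch` is a hypothetical stand-in for SchemaField, and `resolve_ref` and `iter_fields` are illustrative names. It assumes references are local JSON pointers of the form `#/components/schemas/...`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, Tuple


@dataclass
class FieldSketch:
    """Hypothetical stand-in for the project's SchemaField."""
    path: str
    type: str
    description: str = ""


def resolve_ref(spec: Dict[str, Any], ref: str) -> Dict[str, Any]:
    """Follow a local reference such as '#/components/schemas/User'."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are handled in this sketch: {ref}")
    node: Any = spec
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def iter_fields(
    spec: Dict[str, Any],
    schema: Dict[str, Any],
    prefix: str = "",
    seen: Tuple[str, ...] = (),
) -> Iterator[FieldSketch]:
    """Yield one FieldSketch per leaf field, using dotted paths."""
    if "$ref" in schema:
        ref = schema["$ref"]
        if ref in seen:  # stop on circular references
            return
        yield from iter_fields(spec, resolve_ref(spec, ref), prefix, seen + (ref,))
        return
    kind = schema.get("type", "object")
    if kind == "array":
        # Mark array items with a "[]" suffix on the path.
        yield from iter_fields(spec, schema.get("items", {}), prefix + "[]", seen)
        return
    if kind == "object" and "properties" in schema:
        for name, sub in schema["properties"].items():
            child = f"{prefix}.{name}" if prefix else name
            yield from iter_fields(spec, sub, child, seen)
        return
    yield FieldSketch(prefix, kind, schema.get("description", ""))


# Toy OpenAPI fragment, invented for illustration.
spec = {
    "components": {
        "schemas": {
            "Profile": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "Primary email"}
                },
            },
            "User": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "profile": {"$ref": "#/components/schemas/Profile"},
                    "groups": {"type": "array", "items": {"type": "string"}},
                },
            },
        }
    }
}

fields = list(iter_fields(spec, spec["components"]["schemas"]["User"]))
paths = [f.path for f in fields]
print(paths)
```

The dotted path plus type and description is the kind of text that could then be passed to an embedding model. The real parser may use a different path syntax.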
The required input data includes:
- OpenAPI JSON Schemas: Located at paths specified in your config file or via CLI arguments
  - Example: sample_data/schemas/okta_sample.json and sample_data/schemas/leen_sample.json
- Optional Documentation: Located in directories specified in your config file
  - Example: sample_data/docs/okta_user_api.md and sample_data/docs/leen_identity_api.md
  - Can include GitHub repository structure with various file formats
- Configuration: Either through a YAML config file, CLI arguments, or environment variables
  - API keys for OpenAI and Pinecone
  - Mapping direction and confidence thresholds
  - Input/output paths
  - Supported documentation formats
- Schema Parsing
  - OpenAPIParser loads and extracts fields from source and target schemas
  - Fields are organized with their metadata (path, type, description)
- Documentation Processing (optional)
  - DocumentationProcessor reads, chunks, and indexes documentation files
  - Supports multiple file formats (markdown, HTML, YAML, JSON, PDF, code files)
  - Format-specific extractors identify relevant documentation (descriptions, comments, JSDoc, etc.)
  - Documentation is embedded and stored in Pinecone namespaces (e.g., "okta_docs", "leen_docs")
  - Fields can be enriched with relevant documentation context
- Field Vectorization
  - FieldVectorizer creates embeddings for each field using OpenAI
  - Vectors are stored in Pinecone under respective namespaces (e.g., "okta", "leen")
- Field Matching
  - SchemaMapper queries similar fields between schemas
  - For each source field, finds the most similar target fields
  - When bidirectional, performs both source→target and target→source mappings
  - Identifies mutual matches (fields that match to each other in both directions)
- Output Generation
  - Results are formatted as specified (JSON, CSV, Excel)
  - Includes source field, target match, confidence score, and mutual match flag
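The Field Matching step, including mutual match detection, can be sketched in memory without OpenAI or Pinecone. This is an illustration under stated assumptions rather than the logic in mapper.py: the vectors are made-up toy embeddings, cosine similarity is assumed as the confidence score, and `top_matches` and `mutual_matches` are hypothetical helper names. The `top_k` and `min_confidence` defaults mirror the values in the example configuration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(
    queries: Dict[str, Vector],
    candidates: Dict[str, Vector],
    top_k: int = 5,
    min_confidence: float = 0.6,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each query field, keep up to top_k candidates above the threshold."""
    result: Dict[str, List[Tuple[str, float]]] = {}
    for name, vec in queries.items():
        scored = sorted(
            ((other, cosine(vec, cvec)) for other, cvec in candidates.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
        result[name] = [(o, s) for o, s in scored[:top_k] if s >= min_confidence]
    return result


def mutual_matches(
    source: Dict[str, Vector], target: Dict[str, Vector], **kwargs
) -> List[Tuple[str, str, float]]:
    """Pairs whose best match points back at each other in both directions."""
    forward = top_matches(source, target, **kwargs)
    backward = top_matches(target, source, **kwargs)
    pairs = []
    for src, hits in forward.items():
        if not hits:
            continue
        best_target, score = hits[0]
        back = backward.get(best_target)
        if back and back[0][0] == src:
            pairs.append((src, best_target, score))
    return pairs


# Toy three-dimensional "embeddings", invented for illustration.
okta = {"profile.email": [1.0, 0.0, 0.1], "profile.firstName": [0.0, 1.0, 0.0]}
leen = {
    "email_address": [0.9, 0.1, 0.0],
    "given_name": [0.1, 0.9, 0.0],
    "department": [0.0, 0.0, 1.0],
}

pairs = mutual_matches(okta, leen)
for src, tgt, score in pairs:
    print(f"{src} <-> {tgt} ({score:.2f})")
```

In this toy data `department` has no counterpart above the threshold, so it yields no match in either direction.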
The output will be stored at the location specified in your config file (output.output_path):
- Default: output/mapping_results.[json|csv|xlsx]
- Sample: sample_data/output/mapping_results.[json|csv|xlsx]
Output formats available:
- JSON: Detailed mapping with field metadata
- CSV: Tabular format for spreadsheet analysis
- Excel: Enhanced spreadsheet with formatting
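As a rough illustration of the JSON and CSV outputs, the sketch below serializes one mapping row with the Python standard library. The column names are hypothetical and chosen to mirror the fields listed above (source field, target match, confidence score, mutual match flag); the tool's actual keys may differ. Excel output is left out because it needs a third-party library.

```python
import csv
import io
import json
from typing import Any, Dict, List

# Hypothetical column names; the real output may use different keys.
COLUMNS = ["source_field", "target_field", "confidence", "mutual_match"]


def to_json(rows: List[Dict[str, Any]]) -> str:
    """Serialize mapping rows as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows: List[Dict[str, Any]]) -> str:
    """Serialize mapping rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "source_field": "profile.email",
        "target_field": "email_address",
        "confidence": 0.93,
        "mutual_match": True,
    }
]

print(to_json(rows))
print(to_csv(rows))
```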
- Python 3.8 or higher
- OpenAI API key
- Pinecone API key
- Clone the repository:
    git clone https://github.com/example/schema-mapper.git
    cd schema-mapper
- Install the package:
    pip install -e .
- Set up environment variables (or use CLI parameters):
    # Create .env file
    echo "OPENAI_API_KEY=your_openai_key" > .env
    echo "PINECONE_API_KEY=your_pinecone_key" >> .env

    # Basic mapping from source to target schema
    schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen
    # Bidirectional mapping with higher confidence threshold
    schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --direction bidirectional --min-confidence 0.7
    # Use documentation for context enrichment
    schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --source-docs-dir path/to/okta_docs --target-docs-dir path/to/leen_docs --use-docs
    # Limit processing to specific documentation formats
    schema-mapper --source-schema path/to/okta_schema.json --target-schema path/to/leen_schema.json --source-name okta --target-name leen --source-docs-dir path/to/okta_docs --use-docs --doc-formats .md .yaml .json

Create a YAML configuration file:
    openai:
      api_key: your_openai_key
      embedding_model: text-embedding-3-small
    pinecone:
      api_key: your_pinecone_key
      index_name: schema-mapper
    schema_mapping:
      source_schema: path/to/okta_schema.json
      target_schema: path/to/leen_schema.json
      source_name: okta
      target_name: leen
      direction: bidirectional
      top_k: 5
      min_confidence: 0.6
      detect_mutual_matches: true
    documentation:
      source_docs_dir: path/to/okta_docs
      target_docs_dir: path/to/leen_docs
      chunk_size: 1000
      chunk_overlap: 200
      use_docs: true
      # Specify which file formats to process (optional)
      supported_formats:
        - .md    # Markdown
        - .json  # JSON (including OpenAPI schemas)
        - .yaml  # YAML configuration and schemas
        - .js    # JavaScript files with JSDoc comments
    output:
      format: json
      output_path: output/mapping_results

Then run with:
    schema-mapper --config your_config.yaml

For a more interactive experience, use the Streamlit UI:
    streamlit run src/schema_mapper/streamlit_app.py

The web interface allows you to:
- Upload schema files
- Configure mapping parameters
- Visualize mapping results
- Export in various formats
- Prepare your schemas - Ensure you have OpenAPI JSON schemas for both source and target systems
- Gather documentation (optional) - Collect relevant API documentation in various formats (markdown, JSON, YAML, code files with comments)
- Create a configuration file - Copy config-example.yaml and modify for your needs
- Run the mapping - Execute via CLI or web UI
- Review the results - Examine the mappings and confidence scores
- Iterate if needed - Adjust confidence thresholds or add more documentation context
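When iterating on the confidence threshold, it can help to see how many matches survive at each candidate value before re-running the mapping. The helper below is a hypothetical sketch, not part of the tool, and the scores are made-up examples.

```python
from typing import Dict, Iterable


def count_by_threshold(
    scores: Iterable[float], thresholds: Iterable[float]
) -> Dict[float, int]:
    """Count how many confidence scores meet or exceed each threshold."""
    values = list(scores)
    return {t: sum(1 for s in values if s >= t) for t in thresholds}


# Made-up confidence scores from a previous run.
scores = [0.95, 0.82, 0.71, 0.64, 0.41]
summary = count_by_threshold(scores, [0.6, 0.7, 0.8])
for threshold, kept in summary.items():
    print(f"min_confidence={threshold}: {kept} matches kept")
```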
- Missing API keys: Ensure OpenAI and Pinecone API keys are set
- Schema parsing errors: Validate your OpenAPI schemas
- Low confidence scores: Consider adding more documentation context
- Memory issues: For large schemas, adjust batch sizes in vectorizer.py
- File parsing errors: Check logs for format-specific errors; install optional dependencies as needed (PyPDF2, beautifulsoup4, etc.)
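For the missing API key case, a quick pre-flight check along these lines can confirm that both variables are visible to the process. This is a sketch and not part of the tool; `missing_api_keys` is a hypothetical helper. It only inspects the process environment, so values kept in a .env file must already have been loaded.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


missing = missing_api_keys()
if missing:
    print("Missing API keys: " + ", ".join(missing))
else:
    print("Both API keys are set.")
```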
MIT