feature/add llm providers #27

Merged
leo-gan merged 4 commits into leo-gan:main from LeoGan:feat/add-llm-providers-and-refactor on Oct 4, 2025

Conversation

@leo-gan (Owner) commented on Oct 4, 2025:

No description provided.

google-labs-jules (bot) and others added 4 commits on October 4, 2025 at 19:49.
This commit introduces support for OpenAI and Anthropic as LLM providers, enhances model selection flexibility, and refactors core components.

Key changes include:
- Added `OpenAIProvider` and `AnthropicProvider` to integrate with their respective APIs.
- Updated `pyproject.toml` with `openai` and `anthropic` optional dependencies.
- Enhanced `conf.py` to allow custom model strings in the format "provider/model-name" via the `get_provider_and_model_name` function (see the sketch after this list).
- Updated the `ModelName` enum to replace `openai_gpt_3_5_turbo` with `openai_gpt_5`, and updated Anthropic models to `claude-4-sonet` and `claude-4.5-sonet`.
- Refactored the `get_provider` factory function in `llm_provider.py` to use a more maintainable dictionary mapping.
- Updated all `README.md` files to document the new providers, API keys, and the custom model name feature.
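
For readers who want a concrete picture of the two refactoring points above, here is a minimal, self-contained sketch of what the custom model-string parsing and the dictionary-based provider factory could look like. The names `get_provider_and_model_name`, `get_provider`, `OpenAIProvider`, and `AnthropicProvider` come from this PR; everything else (the `GoogleProvider` stub, the default fallback, the exact error handling) is an assumption for illustration, not the code merged here.

```python
# Illustrative sketch only; the bodies are assumptions, not the repository's actual code.

class GoogleProvider: ...      # hypothetical stand-in for the existing Gemini provider
class OpenAIProvider: ...      # named in this PR; the real classes call the provider APIs
class AnthropicProvider: ...


def get_provider_and_model_name(model_name: str) -> tuple[str, str]:
    """Split a custom "provider/model-name" string into its two parts."""
    if "/" in model_name:
        provider_name, actual_model_name = model_name.split("/", 1)
        return provider_name, actual_model_name
    return "google", model_name  # assumed default when no provider prefix is given


def get_provider(provider_name: str):
    """Dictionary mapping instead of an if/elif chain, per the refactor described above."""
    providers = {
        "google": GoogleProvider,
        "openai": OpenAIProvider,
        "anthropic": AnthropicProvider,
    }
    try:
        return providers[provider_name]()
    except KeyError as exc:
        raise ValueError(f"Unsupported provider: {provider_name}") from exc


# Example: get_provider_and_model_name("anthropic/claude-4.5-sonet")
# returns ("anthropic", "claude-4.5-sonet").
```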
@gemini-code-assist (bot) commented:

Summary of Changes

Hello @leo-gan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the pdf-anonymizer application by integrating support for additional Large Language Model (LLM) providers, specifically OpenAI and Anthropic. The changes introduce a more flexible model selection system, allowing users to specify models using a 'provider/model-name' format or predefined aliases. This expansion broadens the tool's utility and offers users more choices for anonymization, alongside updated documentation to guide the new configurations.

Highlights

  • New LLM Provider Support: The application now supports OpenAI and Anthropic Large Language Models, significantly expanding the range of available anonymization options.
  • Flexible Model Specification: Users can now specify models using a flexible 'provider/model-name' string format, in addition to predefined enum values, allowing for greater customization and future extensibility.
  • Updated Documentation and Configuration: All relevant documentation, including the root README, CLI README, and .env.template, has been updated to reflect the new providers, their installation requirements, necessary environment variables, and model naming conventions.
  • Core Logic Refinement: The core LLM calling logic has been refactored to dynamically determine the provider and model from the input string, improving modularity and adaptability for new integrations.

leo-gan merged commit 946767a into leo-gan:main on Oct 4, 2025 (2 checks passed).
leo-gan deleted the feat/add-llm-providers-and-refactor branch on October 4, 2025 at 20:57.
@gemini-code-assist (bot) left a review:


Code Review

This pull request successfully adds support for new LLM providers like OpenAI and Anthropic, and refactors the model handling to be more flexible. The code is generally well-structured. However, I've identified a critical issue in the Anthropic provider implementation related to streaming, a high-severity issue where a new parameter is not being passed, and a few medium-severity issues to improve consistency and user experience. Please see the detailed comments for suggestions.

Excerpt from the Anthropic provider:

```python
if isinstance(max_output_tokens, int) and max_output_tokens > 0
else DEFAULT_CHARACTERS_TO_ANONYMIZE // 4,
messages=[{"role": "user", "content": prompt}],
stream=True,
```


Severity: critical

The Anthropic API is called with `stream=True`, but the response is handled as a non-streaming response. When streaming, `client.messages.create` returns a `Stream` object that must be iterated over to assemble the full response. The current implementation (`response.content[0].text`) will fail because a `Stream` object does not have a `.content` attribute.

To fix this, either implement the logic to handle a streaming response or, more simply, set `stream=False` to receive the complete message object at once.

Suggested change:

```diff
-stream=True,
+stream=False,
```
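
If keeping `stream=True` is preferred over the simpler suggestion above, the streaming branch could be handled roughly as follows. This is a sketch based on the `anthropic` Python SDK's streaming event types (`content_block_delta` events carrying a `.delta.text` field); it reuses the surrounding variables from the excerpt (`client`, `model_name`, `prompt`, `max_output_tokens`, `DEFAULT_CHARACTERS_TO_ANONYMIZE`) and should be checked against the installed SDK version.

```python
# Sketch of the alternative fix: keep stream=True and assemble the text from the event stream.
response = client.messages.create(
    model=model_name,
    max_tokens=max_output_tokens
    if isinstance(max_output_tokens, int) and max_output_tokens > 0
    else DEFAULT_CHARACTERS_TO_ANONYMIZE // 4,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)

chunks = []
for event in response:
    # Text arrives incrementally in content_block_delta events; other event types
    # (message_start, content_block_start, message_stop, ...) carry no text.
    if event.type == "content_block_delta" and getattr(event.delta, "text", None):
        chunks.append(event.delta.text)
raw_text = "".join(chunks)
```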

```diff
-    raw_text = provider.call(prompt, model_enum.value)
+    provider_name, actual_model_name = get_provider_and_model_name(model_name)
+    provider = get_provider(provider_name)
+    raw_text = provider.call(prompt, actual_model_name)
```


Severity: high

The `LLMProvider.call` method was updated to accept a `max_output_tokens` parameter, but it is not being passed here. This makes the feature non-functional for providers like Anthropic that rely on it, as they will always use a default fallback. You should estimate a reasonable maximum for output tokens based on the input text length and pass it to the provider. A common heuristic is a fraction of the input length, for example `len(text) // 4`.

Suggested change:

```diff
-    raw_text = provider.call(prompt, actual_model_name)
+    max_output_tokens = len(text) // 4
+    raw_text = provider.call(prompt, actual_model_name, max_output_tokens=max_output_tokens)
```
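
For context, the provider interface implied by this comment might look roughly like the sketch below; the exact signature and defaults in `llm_provider.py` may differ.

```python
# Hypothetical sketch of the interface implied above, not the repository's actual code.
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    @abstractmethod
    def call(self, prompt: str, model_name: str, max_output_tokens: int | None = None) -> str:
        """Send the prompt to the given model and return the raw text response."""
        ...
```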

Comment on lines +99 to 105
```python
    provider_name, _ = get_provider_and_model_name(model_name)
    if provider_name == "google":
        if "gemini" in model_name and not os.getenv("GOOGLE_API_KEY"):
            logging.error(
                "Error: GOOGLE_API_KEY not found. Please set it in the .env file."
            )
            sys.exit(1)
```


Severity: medium

The API key check is only implemented for Google models. To improve user experience and provide early feedback for misconfigurations, you should add checks for the other providers that require API keys (OpenAI, Anthropic, OpenRouter, Hugging Face). This can be generalized using a map of providers to their required environment variable names, making it easier to maintain.

```python
    provider_name, _ = get_provider_and_model_name(model_name)

    api_key_env_vars = {
        "google": "GOOGLE_API_KEY",
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
        "openrouter": "OPENROUTER_API_KEY",
        "huggingface": "HUGGING_FACE_TOKEN",
    }

    if provider_name in api_key_env_vars:
        key_name = api_key_env_vars[provider_name]
        if not os.getenv(key_name):
            logging.error(f"Error: {key_name} not found. Please set it in the .env file.")
            sys.exit(1)
```

Comment on lines +150 to +152
```python
        completion = client.chat.completions.create(
            model=model_name, messages=[{"role": "user", "content": prompt}]
        )
```


Severity: medium

The `call` method receives `max_output_tokens`, but it is not being used when calling the OpenAI API. The OpenAI API supports a `max_tokens` parameter, which you should use to limit the response size. This improves consistency with other providers like Anthropic and can help control costs and prevent unexpectedly large responses.

Suggested change:

```diff
-        completion = client.chat.completions.create(
-            model=model_name, messages=[{"role": "user", "content": prompt}]
-        )
+        completion = client.chat.completions.create(
+            model=model_name,
+            messages=[{"role": "user", "content": prompt}],
+            max_tokens=max_output_tokens,
+        )
```
