Conversation
This commit introduces support for OpenAI and Anthropic as LLM providers, enhances model selection flexibility, and refactors core components. Key changes include:

- Added `OpenAIProvider` and `AnthropicProvider` to integrate with their respective APIs.
- Updated `pyproject.toml` with `openai` and `anthropic` optional dependencies.
- Enhanced `conf.py` to allow custom model strings in the format "provider/model-name" via the `get_provider_and_model_name` function (see the sketch after this list).
- Updated the `ModelName` enum to replace `openai_gpt_3_5_turbo` with `openai_gpt_5`, and updated the Anthropic models to `claude-4-sonet` and `claude-4.5-sonet`.
- Refactored the `get_provider` factory function in `llm_provider.py` to use a more maintainable dictionary mapping.
- Updated all `README.md` files to document the new providers, API keys, and the custom model name feature.
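For illustration, here is a minimal sketch of how the "provider/model-name" parsing and the dictionary-based factory described above might fit together; the Google default fallback and the stub provider classes are assumptions for this sketch, not the PR's actual code:

```python
# Hypothetical sketch of the "provider/model-name" parsing and the
# dictionary-based get_provider factory described in the PR summary.
# The stub classes and the Google default are assumptions for illustration.

class GoogleProvider: ...
class OpenAIProvider: ...
class AnthropicProvider: ...


def get_provider_and_model_name(model_name: str) -> tuple[str, str]:
    # "openai/gpt-5" -> ("openai", "gpt-5"); names without a slash fall
    # back to the assumed default provider.
    if "/" in model_name:
        provider, name = model_name.split("/", 1)
        return provider.lower(), name
    return "google", model_name


def get_provider(provider_name: str):
    providers = {
        "google": GoogleProvider,
        "openai": OpenAIProvider,
        "anthropic": AnthropicProvider,
    }
    if provider_name not in providers:
        raise ValueError(f"Unsupported LLM provider: {provider_name}")
    return providers[provider_name]()
```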
Summary of Changes

Hello @leo-gan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Code Review
This pull request successfully adds support for new LLM providers like OpenAI and Anthropic, and refactors the model handling to be more flexible. The code is generally well-structured. However, I've identified a critical issue in the Anthropic provider implementation related to streaming, a high-severity issue where a new parameter is not being passed, and a few medium-severity issues to improve consistency and user experience. Please see the detailed comments for suggestions.
```python
            if isinstance(max_output_tokens, int) and max_output_tokens > 0
            else DEFAULT_CHARACTERS_TO_ANONYMIZE // 4,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
```
The Anthropic API is called with `stream=True`, but the response is handled as a non-streaming response. When streaming, `client.messages.create` returns a `Stream` object that needs to be iterated over to assemble the full response. The current implementation (`response.content[0].text`) will fail because a `Stream` object does not have a `.content` attribute.

To fix this, you should either implement the logic to handle a streaming response or, more simply, set `stream=False` to receive the complete message object at once.
Suggested change:

```diff
-            stream=True,
+            stream=False,
```
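If streaming is actually desired, a minimal sketch of assembling the full text from the stream could look like the following; this assumes the `anthropic` Python SDK's streaming event types and reuses `model_name`, `prompt`, and `max_output_tokens` from the surrounding provider code:

```python
# Hypothetical sketch: collect text deltas from an Anthropic streaming
# response instead of switching to stream=False. Assumes model_name,
# prompt, and max_output_tokens are defined as in the surrounding code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

stream = client.messages.create(
    model=model_name,
    max_tokens=max_output_tokens,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)

chunks = []
for event in stream:
    # Text arrives incrementally in content_block_delta events.
    if event.type == "content_block_delta" and hasattr(event.delta, "text"):
        chunks.append(event.delta.text)
raw_text = "".join(chunks)
```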
```diff
-        raw_text = provider.call(prompt, model_enum.value)
+        provider_name, actual_model_name = get_provider_and_model_name(model_name)
+        provider = get_provider(provider_name)
+        raw_text = provider.call(prompt, actual_model_name)
```
The `LLMProvider.call` method was updated to accept a `max_output_tokens` parameter, but it's not being passed here. This makes the feature non-functional for providers like Anthropic that rely on it, as they will always use a default fallback. You should estimate a reasonable maximum for output tokens based on the input text length and pass it to the provider. A common heuristic is a fraction of the input length, for example `len(text) // 4`.
Suggested change:

```diff
-        raw_text = provider.call(prompt, actual_model_name)
+        max_output_tokens = len(text) // 4
+        raw_text = provider.call(prompt, actual_model_name, max_output_tokens=max_output_tokens)
```
```python
    provider_name, _ = get_provider_and_model_name(model_name)
    if provider_name == "google":
        if "gemini" in model_name and not os.getenv("GOOGLE_API_KEY"):
            logging.error(
                "Error: GOOGLE_API_KEY not found. Please set it in the .env file."
            )
            sys.exit(1)
```
The API key check is only implemented for Google models. To improve user experience and provide early feedback for misconfigurations, you should add checks for the other providers that require API keys (OpenAI, Anthropic, OpenRouter, Hugging Face). This can be generalized using a map of providers to their required environment variable names, making it easier to maintain.
```python
    provider_name, _ = get_provider_and_model_name(model_name)
    api_key_env_vars = {
        "google": "GOOGLE_API_KEY",
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
        "openrouter": "OPENROUTER_API_KEY",
        "huggingface": "HUGGING_FACE_TOKEN",
    }
    if provider_name in api_key_env_vars:
        key_name = api_key_env_vars[provider_name]
        if not os.getenv(key_name):
            logging.error(f"Error: {key_name} not found. Please set it in the .env file.")
            sys.exit(1)
```

```python
        completion = client.chat.completions.create(
            model=model_name, messages=[{"role": "user", "content": prompt}]
        )
```
The `call` method receives `max_output_tokens`, but it's not being used when calling the OpenAI API. The OpenAI API supports a `max_tokens` parameter, which you should use to limit the response size. This improves consistency with other providers like Anthropic and can help control costs and prevent unexpectedly large responses.
Suggested change:

```diff
-        completion = client.chat.completions.create(
-            model=model_name, messages=[{"role": "user", "content": prompt}]
-        )
+        completion = client.chat.completions.create(
+            model=model_name,
+            messages=[{"role": "user", "content": prompt}],
+            max_tokens=max_output_tokens,
+        )
```
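One caveat worth verifying (this is an assumption about the OpenAI API, not something confirmed in this PR): some newer chat models reject `max_tokens` in favor of `max_completion_tokens`, so a defensive variant might be:

```python
# Hypothetical variant: newer OpenAI chat models may only accept
# max_completion_tokens; check which parameter the targeted models expect.
completion = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": prompt}],
    max_completion_tokens=max_output_tokens,
)
```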