experimental multilingual idea #171

Draft
richard-rogers wants to merge 1 commit into main from dev/richard/multingual1
@richard-rogers (Contributor) commented Oct 26, 2023

Uses proposed schema chaining 1380 to support a schema per language for each metric module. Multiple languages can be selected when initializing a metric collection. Metrics are prefixed with the language code.
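
For illustration, a minimal sketch of the naming scheme described above, in plain Python. The `build_metric_names` helper and the base metric names are hypothetical and are not LangKit or whylogs APIs; the sketch only shows how one set of metric names per selected language might be produced, each prefixed with its language code.

```python
from typing import Dict, List

# Hypothetical base metric names, used only for illustration.
BASE_METRICS = ["response.toxicity", "response.relevance_to_prompt"]


def build_metric_names(languages: List[str]) -> Dict[str, List[str]]:
    """Return one list of language-prefixed metric names per language."""
    return {
        language: [f"{language}.{metric}" for metric in BASE_METRICS]
        for language in languages
    }


names = build_metric_names(["en", "fr"])
print(names["fr"])
```

Under this scheme, selecting both `en` and `fr` yields two separate columns for each metric rather than one shared column.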

@richard-rogers marked this pull request as draft on October 26, 2023, 16:48
@jamie256 (Collaborator) left a comment

Some initial comments. It's OK if we don't have other language models plugged in yet, but we should stub out how to swap them, or at least validate that the models match the configured language.

    _transformer_model = Encoder(transformer_name, custom_encoder)
    register_dataset_udf(
        [_prompt, _response],
        f"{language}.{_response}.relevance_to_{_prompt}",
Collaborator

This renaming, which prefixes the metric name with the language, will create a discontinuity with existing integrations and break backward compatibility.

We shouldn't prefix the metric name with the localization, at least not for the metrics from the original English-only launch of LangKit. Would it be better to put this in metadata, or in the platform in something like the column entity schema?

Contributor Author

Do you want, for example, to track English and French toxicity in the same column?

Contributor

Maybe we could keep the original name for English and add the language prefix only for other languages?
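
A minimal sketch of that naming rule, assuming English is the unprefixed default. The `metric_name` helper and `DEFAULT_LANGUAGE` constant are hypothetical names for illustration only and do not exist in LangKit.

```python
# Assumption: English is the default language and keeps its original names.
DEFAULT_LANGUAGE = "en"


def metric_name(language: str, base_name: str) -> str:
    """Keep the original name for English; prefix every other language."""
    if language == DEFAULT_LANGUAGE:
        return base_name
    return f"{language}.{base_name}"


print(metric_name("en", "response.toxicity"))
print(metric_name("fr", "response.toxicity"))
```

With a rule like this, existing English integrations would keep seeing the same column names, while other languages would land in their own prefixed columns.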

@@ -41,6 +39,16 @@ def init(lexicon: Optional[str] = None, config: Optional[LangKitConfig] = None):
_nltk_downloaded = True
Collaborator

I believe the downloaded lexicon is language-specific. We can't just rename the metric and still download the English-based corpus from nltk, right? At the least, we should perform a check and raise an error or log a warning in the many metrics whose existing models don't target languages other than en.
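
One way such a check could look, as a sketch under stated assumptions: the `SUPPORTED_LANGUAGES` set and the `check_languages` helper are hypothetical and not part of LangKit, and the set of languages a given lexicon or model actually supports would need to come from that resource itself.

```python
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)

# Hypothetical: the languages the bundled lexicon or model targets.
SUPPORTED_LANGUAGES = frozenset({"en"})


def check_languages(languages: Iterable[str], strict: bool = False) -> List[str]:
    """Return the supported languages; warn or raise for the others."""
    supported = []
    for language in languages:
        if language in SUPPORTED_LANGUAGES:
            supported.append(language)
        elif strict:
            # Fail fast when the caller asked for strict validation.
            raise ValueError(f"metric does not support language {language!r}")
        else:
            # Otherwise skip the language and leave a trace in the logs.
            logger.warning("skipping unsupported language %r", language)
    return supported


print(check_languages(["en", "fr"]))
```

Whether an unsupported language should raise or only log is a design choice; the `strict` flag here simply makes both options from the comment above available.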

input_output.init(config=config)
text_schema = udf_schema()
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
for language in langauges:
Contributor

typo here? "langauges"

textstat.init(config=config)
def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema:
for language in languages:
regexes.init(language, config=config)
Contributor

Looks like the indentation is wrong here.

Contributor

Considering that the modules are imported before init is called with the desired languages, does that mean that English will always be applied, and the others will be additional language-specific metrics?

3 participants