Conversation
jamie256
left a comment
There was a problem hiding this comment.
Some initial comments. Its ok if we don't have other languages models plugged in but we should stub out how to swap or at least validate these match the configured language.
| _transformer_model = Encoder(transformer_name, custom_encoder) | ||
| register_dataset_udf( | ||
| [_prompt, _response], | ||
| f"{language}.{_response}.relevance_to_{_prompt}", |
There was a problem hiding this comment.
this renaming prefixing the language in the metric name will create a discontinuity with existing integrations and break back-compat.
We shouldn't prefix the localization in the metric name, at least not for the original english only launch of LangKit. Better would be to put this in metadata or in the platform something like column entity schema?
There was a problem hiding this comment.
Do you want, for example, to track English and French toxicity in the same column?
There was a problem hiding this comment.
maybe we could keep the original name for english, and add the language prefix only for other languages?
| @@ -41,6 +39,16 @@ def init(lexicon: Optional[str] = None, config: Optional[LangKitConfig] = None): | |||
| _nltk_downloaded = True | |||
There was a problem hiding this comment.
The lexicon downloaded I believe is language specific, we can't just rename the metric but still download the english based corpus from nltk right? At least we should perform a check and raise an error or log a warning in many of these metrics where the existing models don't target other languages than en?
| input_output.init(config=config) | ||
| text_schema = udf_schema() | ||
| def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema: | ||
| for language in langauges: |
There was a problem hiding this comment.
typo here? "langauges"
| textstat.init(config=config) | ||
| def init(languages: List[str] = ["en"], config: Optional[LangKitConfig] = None) -> DeclarativeSchema: | ||
| for language in languages: | ||
| regexes.init(language, config=config) |
There was a problem hiding this comment.
looks like identation is wrong here
There was a problem hiding this comment.
Considering that the modules are imported before calling init with the desired languages, does that mean that english will always be applied, and others will be additional language-specific metrics?
Uses proposed schema chaining 1380 to support a schema per language for each metric module. Multiple languages can be selected when initializing a metric collection. Metrics are prefixed with the language code.