Update detectFormLanguages API endpoint to detect languages and form media #2967

Open
Sujan167 wants to merge 3 commits into dev from feat/2342-detect-form-language-and-media

Sujan167 (Contributor) commented Dec 3, 2025

What type of PR is this? (check all applicable)

  • 🍕 Feature
  • 🐛 Bug Fix
  • 📝 Documentation
  • 🧑‍💻 Refactor
  • ✅ Test
  • 🤖 Build or CI
  • ❓ Other (please specify)

Related Issue

Describe this PR

This PR cleans up and improves how we detect languages in XLSForms while also adding support for extracting media files.

  • detect_languages() now just detects languages.

  • get_media_files() handles finding the correct media column and collecting file references.

The endpoint simply calls these helpers and returns the combined result.

What’s changed compared to the old language detection:

  • It’s no longer tied to FastAPI or I/O; it works purely on the parsed sheets.

  • The old version only looked at label::, hint::, and required_message:: columns. The new version checks any column ending with ::lang, making detection more flexible.
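
As a rough illustration of the "any column ending with ::lang" rule, here is a minimal sketch that works on column headers only. It is not the code from this PR: the contents of INCLUDED_LANGUAGES, the behaviour of normalize_col (assumed here to lowercase and drop a trailing "(code)" suffix), and the sheets shape (sheet name mapped to a list of headers) are all assumptions made for the example.

```python
import re

# Hypothetical mapping for illustration; the real INCLUDED_LANGUAGES may differ.
INCLUDED_LANGUAGES = {"english": "en", "french": "fr", "nepali": "ne"}


def normalize_col(col: str) -> str:
    """Assumed normalization: trim, lowercase, drop a trailing "(code)" suffix."""
    col = str(col).strip().lower()
    return re.sub(r"\s*\([^)]*\)$", "", col)


def detect_languages(sheets: dict) -> list:
    """Collect known languages from any column header ending in ::<lang>."""
    detected = []
    for columns in sheets.values():
        for col in columns:
            match = re.search(r"::(\w+)$", normalize_col(col))
            if not match:
                continue
            lang = match.group(1)
            # Keep first-seen order and ignore languages we do not support.
            if lang in INCLUDED_LANGUAGES and lang not in detected:
                detected.append(lang)
    return detected


# Example input: header rows only, no file or HTTP handling involved.
sheets = {
    "survey": ["type", "name", "label::English (en)", "hint::french", "image::nepali"],
    "choices": ["list_name", "name", "label::english", "label::klingon"],
}
print(detect_languages(sheets))
```

Because the function only sees headers, it can be exercised without FastAPI or an uploaded file, which is the decoupling described above.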

Screenshots

[three screenshots attached]

Alternative Approaches Considered

Could have added a separate endpoint for media extraction and left the old language-detection code as-is, but that would introduce duplication and extra complexity. Refactoring kept things cleaner.

Review Guide

Notes for the reviewer. How to test this change?

Checklist before requesting a review

[optional] What gif best describes this PR or how it makes you feel?

github-actions bot added the enhancement, frontend, and backend labels Dec 3, 2025
Sujan167 requested a review from Anuj-Gupta4 December 3, 2025 04:18
detected_languages.append(match.group(2))
for col in df.columns:
    col_norm = normalize_col(col)
    match = re.search(r"::(\w+)$", col_norm)
Collaborator

Since this function is specifically for finding language columns, it would be better to name the language-bearing columns explicitly in the regex, such as label, hint, and required_message.
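
One way the suggestion above might look, as a hedged sketch rather than a proposed patch: the pattern accepts a language suffix only on the three column families named in the comment. LANGUAGE_COLUMN and language_of are hypothetical names, and the input is assumed to be already normalized (lowercased, no "(code)" suffix).

```python
import re
from typing import Optional

# Only label, hint and required_message columns are treated as translatable here.
LANGUAGE_COLUMN = re.compile(r"^(label|hint|required_message)::(\w+)$")


def language_of(col_norm: str) -> Optional[str]:
    """Return the language suffix of a translatable column, or None."""
    match = LANGUAGE_COLUMN.match(col_norm)
    # Group 1 is the column family, group 2 is the language.
    return match.group(2) if match else None


print(language_of("label::english"))   # matches a translatable column
print(language_of("image::english"))   # media column, so not a language column
```

With this pattern, media columns such as image::english no longer contribute to the detected languages, which is the trade-off against the more permissive ::lang rule.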

if lang in INCLUDED_LANGUAGES and lang not in detected:
    detected.append(lang)

if default_lang and default_lang not in detected:
Collaborator

Why are we checking for the same default_lang in detected twice?

# Priority 1: default language
# Priority 2: first detected language
# Priority 3: plain "image" column
if default:
Collaborator

This language selection logic is valid only if the user doesn't use advanced configuration.

else:
    lang_to_use = None

if lang_to_use:
Collaborator

I think it would be better to write a regex that simply looks for image:: and media::image:: columns, instead of also checking the language column.
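
A minimal sketch of that alternative, under stated assumptions: MEDIA_COLUMN and collect_media_files are hypothetical names, rows are assumed to be plain dicts keyed by column header, and a bare image column is also accepted because the diff context mentions it as a fallback. Unlike the language-priority logic in the diff, this collects file names from every image column regardless of language.

```python
import re

# Matches "image", "image::<lang>", "media::image" and "media::image::<lang>".
MEDIA_COLUMN = re.compile(r"^(?:media::)?image(?:::.+)?$")


def collect_media_files(rows: list) -> list:
    """Gather unique, non-empty file names from all image-style columns."""
    files = []
    for row in rows:
        for col, value in row.items():
            if not MEDIA_COLUMN.match(str(col).strip().lower()):
                continue
            name = str(value).strip() if value else ""
            # Skip blanks and keep first-seen order without duplicates.
            if name and name not in files:
                files.append(name)
    return files


rows = [
    {"name": "q1", "image::english": "tree.png", "media::image::french": "arbre.png"},
    {"name": "q2", "image": "tree.png"},
    {"name": "q3", "image::english": ""},
]
print(collect_media_files(rows))
```

Whether collecting across all languages is acceptable depends on how forms with advanced configuration reference their media, which the thread leaves open.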

if langs["default_language"]
else [],
"supported_languages": list(INCLUDED_LANGUAGES.keys()),
"media_files": media,
Member

We do already have an endpoint for returning which media is required, after the form is uploaded (ODK Central handles it this way).

It's not a bad idea to try to determine the media uploads directly from the form, before it's uploaded. I just wonder if there is a good reason ODK doesn't do this.

Contributor Author

You're right. However, in step 3 (project creation), the XLSForm won't have reached ODK yet, so we can't fetch that data from ODK at that stage (the existing code only does this after the form is submitted).

That's why we need to detect the media requirements directly from the custom backend.

Member

If extraction of the media fields is reliable this way, then personally I think it's a nicer user experience.

But equally, nothing is stopping us from posting the form to ODK as part of the project creation workflow, instead of at the end, resulting in the same experience and less work for us.

Let's park this until we refactor the project creation to rely more on ODK instead of FieldTM-specific stuff.

Anuj-Gupta4 (Collaborator)

The pre-commit failure seems to be genuine. The api container crashes if it's rebuilt.

NSUWAL123 mentioned this pull request Dec 4, 2025
Sujan167 force-pushed the feat/2342-detect-form-language-and-media branch from 55a748a to 370475f December 9, 2025 05:32