update detectFormLanguages API endpoint to detect languages and form media #2967

Open

Sujan167 wants to merge 3 commits into dev from feat/2342-detect-form-language-and-media

Contributor

Sujan167 commented Dec 3, 2025

What type of PR is this? (check all applicable)

Related Issue

Project creation should prompt the user for additional datasets / form media only if specified in a form field #2342

Describe this PR

This PR cleans up and improves how we detect languages in XLSForms while also adding support for extracting media files.

detect_languages() now just detects languages.
get_media_files() handles finding the correct media column and collecting file references.

The endpoint simply calls these helpers and returns the combined result.

What’s changed compared to the old language detection:

It is no longer tied to FastAPI or I/O; it works purely on the parsed sheets.
The old version only looked at label::, hint::, and required_message:: columns. The new version checks any column ending with ::lang, which makes detection more flexible.
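
As a rough illustration of the "any column ending with ::lang" rule, here is a minimal sketch that works on plain column headers rather than on an uploaded file. It is not the code from this PR: the `INCLUDED_LANGUAGES` contents, the `sheets` shape (sheet name mapped to a list of headers), and the lowercase normalisation are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for the backend's INCLUDED_LANGUAGES mapping.
INCLUDED_LANGUAGES = {"english": "en", "french": "fr", "spanish": "es"}

# Matches a trailing "::<lang>" suffix on a normalised column header.
LANG_SUFFIX = re.compile(r"::(\w+)$")


def detect_languages(sheets):
    """Return supported languages found in any '<column>::<lang>' header.

    `sheets` maps a sheet name to its list of column headers, so the
    function needs no FastAPI request or file handle.
    """
    detected = []
    for columns in sheets.values():
        for col in columns:
            match = LANG_SUFFIX.search(col.strip().lower())
            if not match:
                continue
            lang = match.group(1)
            # Suffixes such as "image" in "media::image" are filtered out
            # here because they are not supported languages.
            if lang in INCLUDED_LANGUAGES and lang not in detected:
                detected.append(lang)
    return detected


sheets = {
    "survey": ["type", "name", "label::English", "label::French", "image::French"],
    "choices": ["list_name", "name", "label::English", "guidance_hint::Spanish"],
}
print(detect_languages(sheets))
```

Because the suffix is checked against the supported-language mapping, a header like `guidance_hint::Spanish` is picked up even though the old label/hint/required_message check would have skipped it.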

Screenshots

Alternative Approaches Considered

We could have added a separate endpoint for media extraction and left the old language-detection code as-is, but that would have introduced duplication and extra complexity. Refactoring kept things cleaner.

Review Guide

Notes for the reviewer. How to test this change?

Checklist before requesting a review

📖 Read the Field-TM Contributing Guide: https://github.com/hotosm/field-tm/blob/main/CONTRIBUTING.md
📖 Read the HOT Code of Conduct: https://docs.hotosm.org/code-of-conduct
👷‍♀️ Create small PRs. In most cases, this will be possible.
✅ Provide tests for your changes.
📝 Use descriptive commit messages.
📗 Update any related documentation and include any relevant screenshots.

[optional] What gif best describes this PR or how it makes you feel?

Sujan167 temporarily deployed to test with GitHub Actions

December 3, 2025 04:04

Inactive

github-actions bot added the enhancement, frontend, and backend labels

pre-commit-ci bot temporarily deployed to test

December 3, 2025 04:05

Inactive

Sujan167 requested a review from Anuj-Gupta4

December 3, 2025 04:18

Anuj-Gupta4 reviewed

src/backend/app/central/central_routes.py Outdated

-                                      detected_languages.append(match.group(2))
+                      for col in df.columns:
+                          col_norm = normalize_col(col)
+                          match = re.search(r"::(\w+)$", col_norm)

Collaborator

Anuj-Gupta4 Dec 3, 2025

Since this function is specifically for finding language columns, it would be better to name the language columns in the regex explicitly, e.g. label, hint, and required_message.
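
One way the suggestion above could look, as a sketch only: the pattern is anchored to the three base column names mentioned in the comment, so suffixes on other columns are ignored. The helper name `language_of` and the lowercase normalisation are assumptions for the example, not code from the PR.

```python
import re

# Only headers whose base name is one of the translatable text columns
# named in the review comment count as language columns.
LANGUAGE_COLUMN = re.compile(r"^(?:label|hint|required_message)::(\w+)$")


def language_of(column):
    """Return the language suffix of a language column, else None."""
    match = LANGUAGE_COLUMN.match(column.strip().lower())
    return match.group(1) if match else None


for header in ["label::English", "hint::French", "image::French", "media::image"]:
    print(header, "->", language_of(header))
```

With this pattern, `image::French` and `media::image` no longer yield a language candidate, at the cost of missing translations that appear only on other columns.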

src/backend/app/central/central_routes.py Outdated

+                              if lang in INCLUDED_LANGUAGES and lang not in detected:
+                                  detected.append(lang)
+                  if default_lang and default_lang not in detected:

Collaborator

Anuj-Gupta4 Dec 3, 2025

Why are we searching for the same default_lang in detected twice?

src/backend/app/central/central_routes.py Outdated

+                  # Priority 1: default language
+                  # Priority 2: first detected language
+                  # Priority 3: plain "image" column
+                  if default:

Collaborator

Anuj-Gupta4 Dec 3, 2025

This language selection logic is valid only if the user doesn't use advanced configuration.

src/backend/app/central/central_routes.py Outdated

+                  else:
+                      lang_to_use = None
+                  if lang_to_use:

Collaborator

Anuj-Gupta4 Dec 3, 2025

I think it will be better to write a regex that simply looks for image:: and media::image::, instead of also checking for language column.
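
A sketch of what a language-independent lookup might look like, under a few assumptions: rows are plain dicts keyed by column header, empty cells are `""` or `None`, and the unsuffixed `image` and `media::image` headers should match as well as the `image::` and `media::image::` forms named above. The function name mirrors `get_media_files()` from the PR description, but the body is illustrative only.

```python
import re

# Matches "image", "image::<lang>", "media::image" and "media::image::<lang>".
IMAGE_COLUMN = re.compile(r"^(?:media::)?image(?:::.+)?$")


def get_media_files(rows):
    """Collect unique, non-empty file names from every image column."""
    files = []
    for row in rows:
        for col, value in row.items():
            if not IMAGE_COLUMN.match(col.strip().lower()):
                continue
            name = str(value).strip() if value is not None else ""
            if name and name not in files:
                files.append(name)
    return files


rows = [
    {"name": "q1", "label::English": "Roof type", "image": "roof.png"},
    {"name": "q2", "label::English": "Wall type", "media::image::English": "wall.jpg"},
    {"name": "q3", "label::English": "Notes", "image": ""},
    {"name": "q4", "label::English": "Roof again", "image": "roof.png"},
]
print(get_media_files(rows))
```

Since every image column is scanned regardless of its language suffix, no default-language or first-detected-language priority is needed to pick a column.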

Sujan167 force-pushed the feat/2342-detect-form-language-and-media branch from 9054dee to 5e111a0 Compare

December 3, 2025 05:58

Sujan167 had a problem deploying to test with GitHub Actions

December 3, 2025 05:58

Failure

pre-commit-ci bot had a problem deploying to test

December 3, 2025 05:59

Failure

pre-commit-ci bot had a problem deploying to test

December 3, 2025 06:10

Failure

spwoodcock reviewed

src/backend/app/central/central_routes.py

+                          if langs["default_language"]
+                          else [],
                           "supported_languages": list(INCLUDED_LANGUAGES.keys()),
+                          "media_files": media,

Member

spwoodcock Dec 3, 2025

We do already have an endpoint for returning which media is required, after the form is uploaded (ODK Central handles it this way).

It's not a bad idea to try to determine the media uploads directly from the form, before it's uploaded. I just wonder if there is a good reason ODK doesn't do this.

Contributor Author

Sujan167 Dec 3, 2025

You're right.
However, in step 3 (project creation), the XLSForm hasn't reached ODK yet, so we can't fetch that data from ODK at that stage (the existing code only handles it after the form is submitted).

That's why we need to detect the media requirements directly in our own backend.

Member

spwoodcock Dec 3, 2025

If extraction of the media fields is reliable this way, then personally I think it's a nicer user experience.

But equally, nothing is stopping us posting the form to ODK as part of the project creation workflow, instead of at the end - resulting in the same experience and less work for us.

Let's park this until we refactor project creation to rely more on ODK instead of FieldTM-specific stuff.

Collaborator

Anuj-Gupta4 commented Dec 3, 2025

The pre-commit failure seems to be genuine. The api container crashes if it's rebuilt.

NSUWAL123 mentioned this pull request

Upload Form Media #2969

Draft

7 tasks

Sujan167 added 3 commits

December 9, 2025 11:03


feat(central): update detectFormLanguages API endpoint to detect languages and form media.

67fda8b

feat(central): refine language detection logic and simplify media file retrieval

8e1bb3f

fix precommit

370475f

Sujan167 force-pushed the feat/2342-detect-form-language-and-media branch from 55a748a to 370475f Compare

December 9, 2025 05:32

Sujan167 temporarily deployed to test with GitHub Actions

December 9, 2025 05:32

Inactive

Labels

backend enhancement frontend