
Fix Unix timestamp normalisation #558

Merged
Ben-Hodgkiss merged 5 commits into main from date-patch-2
May 29, 2026


@Ben-Hodgkiss Ben-Hodgkiss commented May 29, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Description

The DateDataType.normalise() method handles Unix timestamps via a %s pattern branch. Over time, support has been extended to handle timestamps of varying digit lengths, with 12- and 13-digit values treated as milliseconds (divided by 1000) and 9- and 10-digit values treated as seconds.
A previous change extended this to include 11-digit millisecond timestamps, which appear in listed building designation data sourced from Historic England.

Two related issues have been found with the current logic:

  • 11-digit millisecond timestamps were not handled. Values such as 13392000000 and -49507200000 have 11 digits and, when treated as milliseconds, represent 1970-06-05 and 1968-06-07 respectively. These were previously rejected as invalid dates because 11 was not included in the permitted digit-length tuple.
  • Negative 10-digit timestamps were incorrectly treated as seconds. A value such as -6048000000 has a 10-digit absolute value and was being processed as seconds, placing it in 1778 and triggering a far-past-date issue. It should be treated as milliseconds, giving a date of 1969-10-23.

The root cause of the second issue is that the original length-based division logic assumes positive timestamps, where a 10-digit value in seconds lands in a plausible range (2001-2286). For negative values, a 10-digit second timestamp lands in 1938 or earlier, and values such as the one above go centuries into the past, outside any realistic designation date range.
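As a quick check of the arithmetic above, the example values can be converted both ways. This is only an illustrative sketch: `as_seconds` and `as_milliseconds` are hypothetical helpers, not functions from the codebase, and they add a `timedelta` to the epoch so that negative values behave the same on every platform.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch in UTC; adding a timedelta avoids platform issues with
# datetime.fromtimestamp() and negative values.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_seconds(value: int) -> str:
    """Interpret value as seconds since the epoch (hypothetical helper)."""
    return (EPOCH + timedelta(seconds=value)).date().isoformat()


def as_milliseconds(value: int) -> str:
    """Interpret value as milliseconds since the epoch (hypothetical helper)."""
    return (EPOCH + timedelta(milliseconds=value)).date().isoformat()


# 11-digit values from the first issue, read as milliseconds
print(as_milliseconds(13392000000))   # 1970-06-05
print(as_milliseconds(-49507200000))  # 1968-06-07

# 10-digit negative value from the second issue, read both ways
print(as_seconds(-6048000000))        # 1778-05-07 (far past)
print(as_milliseconds(-6048000000))   # 1969-10-23
```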

Fix
Two changes to the %s branch in DateDataType.normalise():

  • Add 11 to the tuple of digit lengths that are divided by 1000 before conversion.
  • For 9- and 10-digit values, rather than assuming seconds unconditionally, attempt the conversion and check whether the result falls within a plausible range (1800-2100). If it does not, retry by dividing by 1000 and converting as milliseconds.

This approach resolves the ambiguity by preferring seconds: a 9- or 10-digit value is reinterpreted as milliseconds only when its seconds interpretation falls outside the 1800-2100 window.
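The two changes can be sketched roughly as follows. This is not the code in `DateDataType.normalise()`; the function name `normalise_timestamp`, the constants, and the choice to treat the 1800-2100 window as inclusive are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MILLISECOND_LENGTHS = (11, 12, 13)   # always divided by 1000
SECOND_LENGTHS = (9, 10)             # seconds first, milliseconds as fallback
PLAUSIBLE_YEARS = range(1800, 2101)  # assumed inclusive of 2100


def normalise_timestamp(value: str):
    """Return an ISO date for a Unix timestamp string, or None if the
    value is not recognised as a timestamp (hypothetical sketch)."""
    try:
        number = int(float(value))
    except (ValueError, OverflowError):
        return None

    digits = len(str(abs(number)))

    if digits in MILLISECOND_LENGTHS:
        return (EPOCH + timedelta(milliseconds=number)).date().isoformat()

    if digits in SECOND_LENGTHS:
        candidate = EPOCH + timedelta(seconds=number)
        if candidate.year not in PLAUSIBLE_YEARS:
            # Implausible as seconds, so retry as milliseconds
            candidate = EPOCH + timedelta(milliseconds=number)
        return candidate.date().isoformat()

    # Other lengths fall through to the remaining date patterns
    return None


print(normalise_timestamp("1600000000"))   # 2020-09-13 (plausible as seconds)
print(normalise_timestamp("-6048000000"))  # 1969-10-23 (retried as milliseconds)
print(normalise_timestamp("13392000000"))  # 1970-06-05 (11 digits, milliseconds)
```

Trying seconds first keeps the existing behaviour for ordinary positive timestamps and only changes the outcome for values that would otherwise be flagged as implausible.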

Related Tickets & Documents

  • Ticket Link
  • Related Issue #
  • Closes #

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note
on the devices and browsers this has been tested on, as well as any relevant
images for UI changes.

Added/updated tests?

We encourage you to keep the code coverage percentage at 80% and above. Please refer to the Digital Land Testing Guidance for more information.

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

[optional] Are there any post deployment tasks we need to perform?

[optional] Are there any dependencies on other PRs or Work?

@Ben-Hodgkiss Ben-Hodgkiss changed the title Fix syntax error in test_date.py Fix Unix timestamp normalisation May 29, 2026
@Ben-Hodgkiss

Known limitations of Unix timestamp normalisation

This PR improves handling of Unix timestamps in DateDataType.normalise(), but there are known gaps in the ranges we currently support. I am logging them here for awareness and to ask whether we need to address them in future.

How timestamp length detection works

The %s branch identifies whether a value is a Unix timestamp by counting the number of digits in its absolute integer part, then decides whether to treat it as seconds or milliseconds based on that length:

  • 9-10 digits - treated as seconds; if the result falls outside 1800-2100, retried as milliseconds
  • 11-13 digits - always treated as milliseconds

Values with fewer than 9 digits (or more than 13) are not recognised as timestamps at all and fall through to other date patterns.

Unhandled ranges

Unix timestamps for dates in the following ranges are not currently normalised correctly:

| Format | Unhandled date range | Window size |
| --- | --- | --- |
| Seconds | 1966-10-31 to 1973-03-03 | ~6 years |
| Milliseconds | 1969-12-20 to 1970-01-12 | ~23 days |

The millisecond gap is negligible. The seconds gap is more notable but, as discussed below, positive values in that range cannot be safely handled without risking collisions with other date patterns.
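The bounds quoted for the two gaps are consistent with the smallest magnitudes at which each format is recognised: 10^8 seconds (the smallest 9-digit value) and 10^9 milliseconds (the smallest 10-digit value). The sketch below reproduces the figures under that assumption; the `gap` helper is hypothetical and exists only for this check.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def gap(limit: timedelta):
    """Return (start, end, days) for the window of dates lying closer to
    the epoch than the given limit (hypothetical helper)."""
    start = (EPOCH - limit).date()
    end = (EPOCH + limit).date()
    return start.isoformat(), end.isoformat(), (end - start).days


# Seconds: magnitudes below 10**8 have fewer than 9 digits
seconds_gap = gap(timedelta(seconds=10**8))

# Milliseconds: magnitudes below 10**9 have fewer than 10 digits
milliseconds_gap = gap(timedelta(milliseconds=10**9))

print(seconds_gap)       # ('1966-10-31', '1973-03-03', 2315)
print(milliseconds_gap)  # ('1969-12-20', '1970-01-12', 23)
```

2315 days is a little over six years, which matches the window size given for seconds.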

Why we haven't extended below 9 digits

Extending positive timestamp handling below 9 digits risks collisions with other date patterns:

  • 1-4 digits - would collide with %Y (interpreted as a year)
  • 8 digits - would collide with %Y%m%d (e.g. 20200102)

Negative values are safer to extend since they cannot be mistaken for years or compact date strings, but have not been extended as the use case has not been confirmed.
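The collision can be seen with the standard library alone. In this illustrative sketch, `matches` is a hypothetical helper that mimics trying a single date pattern; it is not how the codebase iterates over its patterns.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def matches(value: str, pattern: str):
    """Return an ISO date if value parses with pattern, else None
    (hypothetical helper)."""
    try:
        return datetime.strptime(value, pattern).date().isoformat()
    except ValueError:
        return None


# An 8-digit positive value is a valid compact date...
print(matches("20200102", "%Y%m%d"))  # 2020-01-02

# ...but read as a seconds timestamp it would mean something else entirely
print((EPOCH + timedelta(seconds=20200102)).date().isoformat())  # 1970-08-22

# A 4-digit value is a valid year
print(matches("1970", "%Y"))  # 1970-01-01

# A negative value matches neither pattern, so it cannot collide
print(matches("-20200102", "%Y%m%d"))  # None
print(matches("-1970", "%Y"))          # None
```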

Query

The seconds gap (1966-10-31 to 1973-03-03) is potentially the most significant, as this window overlaps with plausible listed building designation dates from the early statutory lists compiled from 1947 onwards. Do we have evidence of timestamps in this range appearing in source data? If so, we should consider:

  • Extending negative digit-length handling to cover 5-8 digit negative second timestamps
  • Accepting that positive sub-9-digit timestamps cannot be safely handled without risking collisions, and treating them as data quality issues instead

@Ben-Hodgkiss Ben-Hodgkiss merged commit 6cc66be into main May 29, 2026
5 checks passed
@Ben-Hodgkiss Ben-Hodgkiss deleted the date-patch-2 branch May 29, 2026 12:28