Catalog plugin#74
Draft
avaldebe wants to merge 61 commits into
And these are the contents of the catalog file:
>>> import polars as pl
>>> df = pl.read_parquet("catalog.parquet")
>>> df
shape: (14_893, 16)
┌─────────────────────────────────┬─────────┬──────────────┬──────────────────────────────┬───┬─────────┬───────────┬─────────────────────────┬─────────────────────────┐
│ filename ┆ Country ┆ Country Code ┆ Air Quality Station EoI Code ┆ … ┆ AggType ┆ Timezone ┆ Start ┆ End │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ enum ┆ enum ┆ str ┆ ┆ enum ┆ cat ┆ datetime[ns, UTC] ┆ datetime[ns, UTC] │
╞═════════════════════════════════╪═════════╪══════════════╪══════════════════════════════╪═══╪═════════╪═══════════╪═════════════════════════╪═════════════════════════╡
│ daily/CY/SPO-CY0002R_00005_102… ┆ Cyprus ┆ CY ┆ CY0002R ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2024-01-01 00:00:00 UTC ┆ 2024-12-31 23:00:00 UTC │
│ daily/CY/SPO-CY0002R_06001_100… ┆ Cyprus ┆ CY ┆ CY0002R ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2024-01-01 00:00:00 UTC ┆ 2024-12-31 23:00:00 UTC │
│ daily/CY/SPO-CY0004A_00005_100… ┆ Cyprus ┆ CY ┆ CY0004A ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2024-01-01 00:00:00 UTC ┆ 2024-12-31 23:00:00 UTC │
│ daily/DE/SPO.DE_DEBE010_PM2_da… ┆ Germany ┆ DE ┆ DEBE010 ┆ … ┆ day ┆ Etc/GMT+1 ┆ 2023-12-31 23:00:00 UTC ┆ 2024-12-31 23:00:00 UTC │
│ daily/DE/SPO.DE_DEBE034_PM1_da… ┆ Germany ┆ DE ┆ DEBE034 ┆ … ┆ day ┆ Etc/GMT+1 ┆ 2023-12-31 23:00:00 UTC ┆ 2024-12-31 23:00:00 UTC │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ hourly/XK/SPO-XK0012A_00007_10… ┆ Kosovo ┆ XK ┆ XK0012A ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2023-12-31 23:00:00 UTC ┆ 2025-05-14 04:00:00 UTC │
│ hourly/XK/SPO-XK0012A_00008_10… ┆ Kosovo ┆ XK ┆ XK0012A ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2023-12-31 23:00:00 UTC ┆ 2025-05-14 04:00:00 UTC │
│ hourly/XK/SPO-XK0012A_00010_10… ┆ Kosovo ┆ XK ┆ XK0012A ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2023-12-31 23:00:00 UTC ┆ 2025-05-14 04:00:00 UTC │
│ hourly/XK/SPO-XK0012A_00038_10… ┆ Kosovo ┆ XK ┆ XK0012A ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2023-12-31 23:00:00 UTC ┆ 2025-05-14 04:00:00 UTC │
│ hourly/XK/SPO-XK0012A_06001_10… ┆ Kosovo ┆ XK ┆ XK0012A ┆ … ┆ hour ┆ Etc/GMT+1 ┆ 2023-12-31 23:00:00 UTC ┆ 2025-05-14 04:00:00 UTC │
└─────────────────────────────────┴─────────┴──────────────┴──────────────────────────────┴───┴─────────┴───────────┴─────────────────────────┴─────────────────────────┘
>>> from pprint import pprint
>>> pprint(df.schema)
Schema([('filename', String),
('Country',
Enum(categories=['Andorra', 'Albania', 'Austria', 'Bosnia and Herzegovina', 'Belgium', 'Bulgaria', 'Switzerland', 'Cyprus', 'Czechia', 'Germany', 'Denmark', 'Estonia', 'Spain', 'Finland', 'France', 'United Kingdom', 'Greece', 'Croatia', 'Hungary', 'Ireland', 'Iceland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Montenegro', 'North Macedonia', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Serbia', 'Sweden', 'Slovenia', 'Slovakia', 'Turkey', 'Kosovo'])),
('Country Code',
Enum(categories=['AD', 'AL', 'AT', 'BA', 'BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR', 'GB', 'GR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'RS', 'SE', 'SI', 'SK', 'TR', 'XK'])),
('Air Quality Station EoI Code', String),
('Air Quality Station Name', String),
('Sampling Point Id', String),
('Air Pollutant', Categorical(ordering='physical')),
('Longitude', Float64),
('Latitude', Float64),
('Altitude', Float32),
('Air Quality Station Type',
Enum(categories=['background', 'industrial', 'traffic'])),
('Air Quality Station Area',
Enum(categories=['rural', 'rural-nearcity', 'rural-regional', 'rural-remote', 'suburban', 'urban'])),
('AggType', Enum(categories=['hour', 'day', 'var'])),
('Timezone', Categorical(ordering='physical')),
('Start', Datetime(time_unit='ns', time_zone='UTC')),
('End', Datetime(time_unit='ns', time_zone='UTC'))])
JohnPaton previously approved these changes on Oct 3, 2025
Uses the CLI plugin mechanism from #71 and the chained CLI commands from #73 to provide a new command:
The catalog file is meant to sit at the top of a data directory.
It contains the most relevant station metadata, restricted to the stations/pollutants that have observation files:
From the observation files the catalog contains
The catalog file should help narrow down the files that need to be opened for a particular task.
With this new command, several tasks we do daily at work can be put into a single command invocation.
This is what it does
I'll probably fine-tune the catalog options, but this should give you a good idea of what this PR wants to accomplish.