-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path01_develop_data_processing.qmd
More file actions
168 lines (136 loc) · 4.47 KB
/
01_develop_data_processing.qmd
File metadata and controls
168 lines (136 loc) · 4.47 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
---
title: "Processing StatCan Data"
format:
html:
toc: true
embed-resources: true
jupyter: python3
editor:
render-on-save: true
---
# Setup
## Parameters
```{python}
from pyprojroot import here
```
```{python}
# | tags: [parameters]
LABOUR_DATA_FILE = here() / "data" / "14100355.csv"
```
## Libraries
```{python}
import polars as pl
import polars.selectors as cs
from mizani.bounds import squish
from pyprojroot import here
from great_tables import GT, md, html
from plotnine import *
from labourcan.data_processing import read_labourcan
```
## Read data
[`read_labourcan`](https://github.com/wvictor14/labourcan/blob/main/py/labourcan/data_processing.py) returns a polars dataframe with columns:
- Unnecessary metadata columns removed
- Filtered to seasonally adjusted estimates only
- Additional `YEAR`, `MONTH`, and `DATE_YMD` columns extracted from `REF_DATE`
- Sorted chronologically by year and month
```{python}
labour = read_labourcan(LABOUR_DATA_FILE)
labour.glimpse()
```
# Process data
For our graphic, we need to make this variable: **signed cenetered rank in % change in # of jobs**
*Signed* referring to that the rank is effectively computed over negative and positive %change separately, and in opposite directions (descending for negative, and ascending for positive).
Meaning the highest negative value (closest to 0) is `rank = -1`, and lowest positive value (closets to 0) is `rank = 1`
For example:
- Input is vector of % change which can be positive or negative: `[-0.01, -0.02, 0.01, 0.02]`
- Output: `[-1, -2, 1, 2]`
## % Change per month
First, compute % change from previous month. This needs to be done over different subsets of data:
- Industry
- Geolocation
- Labour Force Characteristic (If provided)
- Gender
- Age group
In the seasonally adjusted dataset, only Industry and Geolocation are provided. The LFC is total employment, the Gender is both, and Age group is all.
```{python}
labour_processed = (
# if we sort acesnding by time, then lag value is the month before
labour.sort(["Industry", "YEAR", "MONTH"])
.with_columns(
LAGGED_VALUE=pl.col("VALUE")
.shift(1)
.over(["Industry"])
)
# compute percent difference
.with_columns((pl.col("VALUE") - pl.col("LAGGED_VALUE")).alias("DIFF"))
.with_columns((pl.col("DIFF") / pl.col("LAGGED_VALUE")).alias("PDIFF"))
.select(
pl.col("Industry"),
cs.matches("Labour force characteristics"),
pl.col("DATE_YMD"),
pl.col("YEAR"),
pl.col("MONTH"),
cs.matches("VALUE"),
cs.matches("DIFF"),
)
.sort(["Industry", "YEAR", "MONTH", "PDIFF"])
)
labour_processed.glimpse()
```
## Signed Centered Rank
Now we can compute the `signed centered rank`.
Define `centered_rank_expr` function which takes a polars series and returns an expression, meaning it can be used in a polars `with_columns` call, which is nice because it can take advantage of polars lazy-evaluation optimization.
Below is the definition and a test-case.
```{python}
def centered_rank_expr(col):
"""
- Largest negative value gets rank -1
- Smallest positive value gets rank +1
- Zero gets rank 0
"""
return (
pl.when(col < 0)
.then(
# minus the total # of -ve values
(col.rank(method="ordinal", descending=True) * -1) + (col > 0).sum()
)
.when(col == 0)
.then(pl.lit(0))
.when(col > 0)
.then(col.rank(method="ordinal") - (col < 0).sum())
.otherwise(pl.lit(None))
)
# test it on this subset of data
test_series = (
# .filter(pl.col("Labour force characteristics") == "Employment")
labour_processed
.with_columns(pl.col("PDIFF").round(decimals=4))
.filter(pl.col("YEAR") == 2025, pl.col("MONTH") == 1)
.select(pl.col("PDIFF"))
.sample(n=10, seed=1)
.select("PDIFF")
)
test_series.with_columns(centered_rank_expr(pl.col("PDIFF")).alias("rank")).sort(
"PDIFF"
)
```
Looks good, so now we can apply to the data:
```{python}
labour_processed = labour_processed.with_columns(
centered_rank_across_industry=centered_rank_expr(pl.col("PDIFF")).over(
["YEAR", "MONTH"]
)
)
labour_processed.glimpse()
```
Check output visually for 1 year 1 month
```{python}
# check 1 year 1 month
(
labour_processed
.with_columns(pl.col("PDIFF").round(decimals=4))
.filter(pl.col("YEAR") == 2025)
.sort(["YEAR", "MONTH", "PDIFF"])
.select(["YEAR", "MONTH", "Industry", "VALUE", "DIFF", "PDIFF", cs.matches("rank")])
)
```