Integrate county level data (WIP)#52
Conversation
lucasmbrown
left a comment
There was a problem hiding this comment.
Overall looking super great, what an impressive accomplishment to integrate all this on your own. Exciting to see all this. I left lots of comments throughout, but that's only because you added such a large volume of functionality. Looking great and I'll probably be a quick thumb on a second pass!
| @@ -0,0 +1,383 @@ | |||
| ,locationID,slug,name,level,city,county,state,country,lat,long,population,aggregate,tz,cases,deaths,recovered,active,tested,hospitalized,hospitalized_current,discharged,icu,icu_current,date | |||
There was a problem hiding this comment.
Love some good sample test data! So awesome. Small nit: should we create a new directory for this called test_fixtures or something like that? We have some "real" / production data in /data so it might be good to keep that clean.
There was a problem hiding this comment.
Also, for the two new files that are in /covid, we could maybe move those into the same folder. It looks like you haven't used them yet in tests but that would be great.
There was a problem hiding this comment.
This reminds me -- we don't ever run these tests, haha. Just created #53 to track it.
There was a problem hiding this comment.
Ah yes. Tests in a dark room haha.
Agreed on fixtures folder
The csvs in /covid are an accident I think haha, but will move those too!
| COUNTY_POSITIVITY_3DRA_FIELD = "COVID+ RATE (3DRA)" | ||
| COUNTY_POSITIVITY_COLOR_FIELD = "COVID+ COLOR" | ||
| _COUNTY_NUM_LAGS = 14 | ||
| _, _COUNTY_NEW_CASES_LAG_FIELDS = generate_lag_column_name_formatter_and_column_names( |
There was a problem hiding this comment.
FYI, there might be a better way to do this -- I wrote this but it still feels a little hacky to me, so I'm open to all other suggestions!
There was a problem hiding this comment.
Yeah I think I adopted the "unless I'm refactoring everything or it's necessary, don't change pattern" mindset. Maybe we can track a few issues on refactors we know we should do?
| _GREEN = "Green" | ||
| _YELLOW = "Yellow" | ||
| _RED = "Red" | ||
| _DEEP_SHIT = "Dark Red" |
There was a problem hiding this comment.
as fun as that is, maybe we do _DARK_RED just since this is open source (and about a tragic topic)? But also I would be fine leaving it as is!
There was a problem hiding this comment.
Yeah good to change this before it's merged
| _RED = "Red" | ||
| _DEEP_SHIT = "Dark Red" | ||
|
|
||
| # Define the upper bounds for each color for the new cases per million metric. |
|
|
||
| # NaN rows where the 3DRA is NaN. | ||
| # Note: this handles a common county-level case where a county simply does not have any testing data. | ||
| county_df.loc[ |
There was a problem hiding this comment.
I'm surprised fit_and_predict_cubic_spline_in_r doesn't automatically return NaNs if its input data is NaN -- we could make that part of the function? Or this works just fine!
There was a problem hiding this comment.
It breaks if you have all-NaN I think. Like the fitting portion was breaking on me. I can investigate further if we want!
|
|
||
|
|
||
| def calculate_state_summary(transformed_df, columns): | ||
| def compute_lagged_frame( |
There was a problem hiding this comment.
As you said this logic is definitely complex but I'm sure it's way faster than the terrible for loops I had running! Should we potentially replace the other lagging parts of transform with this to also speed those up? (If we want to do that, we could potentially implement the text fixtures first so that we can tell if they worked successfully without changing the underlying data.)
There was a problem hiding this comment.
yeah I think the real improvement would be to apply transformations across states in parallel, which is maybe a larger refactor I didn't want to do here. I'm happy to add it in or just plug this function in on a state-by-state level
| return consecutive_true_series, consecutive_false_series | ||
|
|
||
|
|
||
| def get_color_series_from_range(series, color_range_dict): |
| covidtracking_df = extract_covidtracking_historical_data() | ||
| cdc_ili_df = extract_cdc_ili_data() | ||
| covidatlas_df = extract_covid_atlas_data() | ||
| # covidtracking_df = extract_covidtracking_historical_data() |
There was a problem hiding this comment.
Are these intentinoally commented out, or was that just to speed up local development of your new feature?
There was a problem hiding this comment.
just for local dev!
|
|
||
| """ | ||
| print("Starting to ETL...") | ||
| logger.info("Starting to ETL...") |
There was a problem hiding this comment.
hell yea! lol.
We have a bunch of other prints we could remove as well in this PR or future work...
There was a problem hiding this comment.
yep for sure. I'll see if it's an easy copy -replace
|
|
||
| covidtracking_df = extract_covidtracking_historical_data() | ||
| cdc_ili_df = extract_cdc_ili_data() | ||
| covidatlas_df = extract_covid_atlas_data() |
There was a problem hiding this comment.
If we don't think we'll ever need to mix county-level and state-level data, I'm wondering if we want to actually put this in a separate function -- and maybe the function is invoked separately by main() so it's easy to run one command that updates all data, but there's a bit cleaner separation of concerns.
I feel like extract_transform_and_load_covid_data has gotten to be a beast and now might be a good time to refactor a bit!
There was a problem hiding this comment.
agreed it would be better, but the status quo here isn't much different. We don't do a lot where we merge all the state data into one state object and then do additional things with that. As far as main is concerned, county-data is similar to state-level beds data. Happy to refactor, but perhaps we could save some of these for some "internal code cleanup" tasks once the county-level functionality is squared away
Closes #43.
Some caveats here:
Ok that's all I'll preface with haha, please rip to shreds!