Uncovering hidden patterns, improving predictive analytics, and creating a competitive edge through location intelligence.
From raw data to an interactive dashboard: how to load, clean, analyze and visualize a historical freight dataset
- Exporting to CSV with Sheets
- Loading into BigQuery
- Cleaning the data
- Data enrichment
- Spatial clustering
- Weighted spatial averages
- Visualization tools
BigQuery supports querying Drive data as an external data source. But since the original dataset was provided as an .xlsx file with three tabs (zip_to_zip, accessorials and data), it was first necessary to split each tab into a separate file before loading the data into BigQuery.
After uploading the original file to Drive, each tab was copied to a separate spreadsheet and renamed accordingly.
Any stray carriage returns were removed prior to loading into BQ, using Google Sheets' Find and Replace tool to search for line-break characters with a regex.
After retrieving the Drive URI for each spreadsheet, a permanent table linked to the external data source was created using the bq command-line tool's mk command:
```bash
bq mk \
--external_table_definition=[SCHEMA]@[SOURCE_FORMAT]=[DRIVE_URI] \
DATASET.TABLE
```

```bash
bq mk \
--external_table_definition=Shipping_Warehouse:STRING,Assigned_Customer_Warehouse:STRING,Base_T_L_H_Nbr:INTEGER,SO____T_R_:STRING,Customer:INTEGER,custname:STRING,shipdate:DATE,Zip:STRING,Has_Stops_Over:BOOLEAN,_Contract__Freight_:STRING,_Freight_Amount_:STRING,_Discrepancy_Amount_:STRING,_Grand_Total_:STRING,T_L_H_Comments:STRING,Posted:STRING,Customer_PO_Number:STRING,Carrier:STRING,Shipment_:STRING,State:STRING@GOOGLE_SHEETS=https://docs.google.com/spreadsheets/d/1lHzmgg0C9kXHTFI3a5d_AORTB_CB7xTG44pIcvRGYGM/edit?usp=sharing \
kellogg.raw_data

bq mk \
--external_table_definition=Shipping_Warehouse:STRING,Assigned_Customer_Warehouse:STRING,Base_T_L_H_Nbr:INTEGER,SO____T_R_:STRING,Customer:INTEGER,custname:STRING,shipdate:DATE,Zip:STRING,Has_Stops_Over:BOOLEAN,_Contract__Freight_:STRING,_Freight_Amount_:STRING,_Discrepancy_Amount_:STRING,_Grand_Total_:STRING,T_L_H_Comments:STRING,Posted:STRING,Customer_PO_Number:STRING,Carrier:STRING,Shipment_:STRING,State:STRING@GOOGLE_SHEETS=https://docs.google.com/spreadsheets/d/1n1km521AlxMzY8NmHPht4KnJXDtI_NQF3Imt_XP_1wY/edit?usp=sharing \
kellogg.raw_accessorials

bq mk \
--external_table_definition=Ship_Zip:INTEGER,Deliver_Zip:INTEGER,_Grand_Total_:STRING,State:STRING,Customer:INTEGER,custname:STRING,shipdate:DATE,Has_Stops_Over:BOOLEAN,_Contract__Freight_:STRING,_Freight_Amount_:STRING,_Discrepancy_Amount_:STRING,T_L_H_Comments:STRING,Posted:BOOLEAN,Customer_PO_Number:STRING,Carrier:STRING,Shipment_:STRING@GOOGLE_SHEETS=https://docs.google.com/spreadsheets/d/1n4dAba8CmGCKxr70yf0Z7xO5xxdwO7iBTOecLRLgi2s/edit?usp=sharing \
kellogg.raw_zip_to_zip
```
After loading the data into BQ, a regex function strips special characters such as $ and - from the currency columns. The same function converts zeroed values to NULL.
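A minimal sketch of that cleanup, assuming the raw _Grand_Total_ column from the schema above, fits in a single expression:

```sql
-- Sketch of the currency cleanup: strip everything except digits and the
-- decimal point, cast, and turn zeroed values into NULL.
SELECT
  NULLIF(
    SAFE_CAST(REGEXP_REPLACE(_Grand_Total_, r'[^0-9.]', '') AS NUMERIC),
    0
  ) AS grand_total
FROM kellogg.raw_zip_to_zip;
```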
Next, duplicated shipments are discarded, and so are the rows with null 'grand_total' amounts.
Finally, the external connection data is materialized into a native BigQuery table using the project's default column naming convention.
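Putting the three steps together, the materialization could look like the sketch below. The kellogg.loads target name, the deduplication key (Shipment_) and the kept columns are assumptions for illustration, not the project's actual script:

```sql
-- Hypothetical materialization: one clean row per shipment, with columns
-- renamed to the project's convention. QUALIFY keeps a single row per load.
CREATE OR REPLACE TABLE kellogg.loads AS
SELECT
  Shipment_   AS load_id,
  Customer    AS consignee_id,
  custname    AS consignee_name,
  shipdate    AS ship_date,
  Ship_Zip    AS ship_zip,
  Deliver_Zip AS deliver_zip,
  NULLIF(
    SAFE_CAST(REGEXP_REPLACE(_Grand_Total_, r'[^0-9.]', '') AS NUMERIC), 0
  ) AS final_price
FROM kellogg.raw_zip_to_zip
WHERE _Grand_Total_ IS NOT NULL                -- drop rows with null grand totals
QUALIFY ROW_NUMBER()
  OVER (PARTITION BY Shipment_ ORDER BY shipdate) = 1;  -- discard duplicates
```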
Since the original file has three tabs with duplicated columns, a script summarizes the distinct and null counts for each dimension and the min-max ranges for each metric.
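A sketch of such a profiling query, shown here for one dimension and one metric using the materialized column names, might be:

```sql
-- Hypothetical profiling query: distinct/null counts for a dimension,
-- min/max/avg for a metric; repeated per column and per table.
SELECT
  COUNT(*)                      AS row_count,
  COUNT(DISTINCT consignee_id)  AS dstc_consignee_id,
  COUNTIF(consignee_id IS NULL) AS null_consignee_id,
  MIN(final_price)              AS min_final_price,
  MAX(final_price)              AS max_final_price,
  ROUND(AVG(final_price), 2)    AS avg_final_price
FROM kellogg.loads;
```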
| Column | Distinct | Null | Min | Max | Avg |
|---|---|---|---|---|---|
| consignee_id | 2978 | 0 | 2800401 | 9905624 | - |
| consignee_name | 2978 | 0 | - | - | - |
| ship_date | 0 | 0 | 2018-07-02 | 2021-06-21 | - |
| has_stops_over | 2 | 0 | - | - | - |
| contract_price | 335 | 48370 | 10 | 2000 | 464.68 |
| freight_price | 4195 | 2483 | 10 | 50000 | 500.51 |
| discrepancy_amount | 2487 | 62079 | 0.04 | 4150 | 339.14 |
| final_price | 4634 | 0 | 23.14 | 50000 | 564.94 |
| was_posted | 1 | 0 | - | - | - |
| cust_po | 78173 | 3067 | - | - | - |
| carrier | 392 | 2 | - | - | - |
| load_id | 82323 | 0 | - | - | - |
| state | 52 | 22 | - | - | - |
At this step, all of the metrics show healthy null counts and min-max ranges (apart from a suspicious 50k final_price max value).
| Table | row_count | dstc_consignee_id | ... | min_shipped_date | max_shipped_date | null_final_price | min_final_price | max_final_price | avg_final_price | ... | dstc_load_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| data | 82323 | 2978 | ... | 2018-07-02 | 2021-06-21 | null | 23.14 | 50000 | 564.94 | ... | 82323 |
| accessorials | 82323 | 2978 | ... | 2018-07-02 | 2021-06-21 | null | 23.14 | 50000 | 564.94 | ... | 82323 |
| zip_to_zip | 82323 | 2978 | ... | 2018-07-02 | 2021-06-21 | null | 23.14 | 50000 | 564.94 | ... | 82323 |
Since no discrepancies were found, with every column holding the same values across tables, the analysis from this point on is based almost exclusively on the zip_to_zip table. The only exception is the shipper_id, which will be joined in from the accessorials table in the next step.
To run a geospatial analysis over the data, we first need to give it some geographical context. The only field in the dataset that carries spatial information is the zip code. Luckily, BigQuery has a public US zip codes dataset, geo_us_boundaries, which makes it possible to geocode each load to a specific origin and destination in space. Besides geocoding the data, the loads_locations script accomplishes a few other things:
- The shipper_id field, absent in the zip_to_zip table, is joined in from the accessorials table.
- Since the scope of the project is the analysis of ground transport, any overseas territory data is discarded.
- Consignees with inconsistent or bad zip code data are disregarded.

At the end of this step, the point count drops from 82,323 to 76,655 loads. Following the cleaning and enrichment, the data is finally ready to be analyzed.
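A rough sketch of how loads_locations could be assembled follows. The accessorials join key, the shipper_id column name, the materialized kellogg.accessorials table and the overseas-territory filter list are all assumptions; bigquery-public-data.geo_us_boundaries.zip_codes is the public dataset mentioned above:

```sql
-- Hypothetical geocoding step: attach origin/destination points to each load.
CREATE OR REPLACE TABLE kellogg.loads_locations AS
SELECT
  l.*,
  a.shipper_id,                                  -- joined from accessorials
  ST_CENTROID(o.zip_code_geom) AS origin_point,  -- one point per origin zip
  ST_CENTROID(d.zip_code_geom) AS dest_point     -- one point per destination zip
FROM kellogg.loads AS l
JOIN kellogg.accessorials AS a USING (load_id)
JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS o
  ON o.zip_code = LPAD(CAST(l.ship_zip AS STRING), 5, '0')
JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS d
  ON d.zip_code = LPAD(CAST(l.deliver_zip AS STRING), 5, '0')
-- The inner joins silently drop loads whose zips are missing from the
-- boundaries table, which is how bad zip data falls out of the point count.
WHERE o.state_code NOT IN ('PR', 'VI', 'GU')     -- assumed overseas filter
  AND d.state_code NOT IN ('PR', 'VI', 'GU');
```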