slowings/Phase3_Project

Phase 3 Final Project: Ensuring Access to Water

Student name: Sarah Lowing

Student pace: Self paced

Project review date: 6/5/2023

Instructor: Abhineet Kulkarni

Blog post URL: https://wordpress.com/post/datamonsterdotblog.wordpress.com/47

Repo Structure

|--Phase3_Project
|   |--ipynb_checkpoints
|   |--CSVFiles
|   |--Phase3_Project.ipynb
|   |--README.md
|   |--SANDBOX.ipynb

Introduction

For the Phase 3 final project we will develop a model to predict water well failure in Tanzania using information gathered by the Tanzanian government and hosted as a competition by DrivenData. We'll depart from the competition to address the needs of our client, a local NGO working with an international funding partner to locate the wells at greatest risk of failure in order to determine which geographic regions to direct their funding towards.

Water Management

Tanzania faces increased demand for water based on population growth projections. To meet the needs of this growing population, the country has transferred water rights and management to national and regional authorities. These authorities face many challenges, such as increased contamination of groundwater storage from mining and agricultural runoff, frequent cholera outbreaks, and naturally occurring fluoride deposits in hazardous quantities. Changing climate conditions pose additional threats: shifted rainfall patterns bring storms with more intense rainfall and increased flooding, leaving less water to filter into the underground aquifers and lakes from which the wells draw. More recent multi-year drought cycles compound the problem. As a result, paradoxically, this flood-prone country faces increasing water shortages.

It's important to clarify that the word "well" can mean many things in this dataset, from complex mechanical pump sites with extensive filtration to a hand pump that brings up untreated groundwater (water from lakes, rivers, and streams).

Some key factors we will take into account:

Pollution- Which water points have harmful concentrations of pollutants

Population- Which water points have the highest population density

Salinization- Which water points are at greatest risk from sea level rise

Screenshot 2023-06-12 at 12 38 29 PM

EDA

Please note that, in addition to the Phase3_Project notebook, there is a separate Sandbox notebook where the bulk of our initial analysis and preliminary modelling takes place.

By eliminating duplicate and unnecessary columns we were able to reduce our dataframe to the following columns:

amount_tsh - TSH (total static head) measures the vertical distance water travels to the pump site. About 70% of records have a value of 0, indicating the well is actually groundwater/sourced from lakes, streams, rivers, etc. Some values look erroneous, such as 138,000, but may be legitimate if, for instance, the water travels through piping to a house in a town.

latitude and longitude - Geographic coordinates of the well

installer- Organization that installed the well

basin - Geographic water basin: Lake Victoria, Pangani, Rufiji, Internal, Lake Tanganyika, Wami/Ruvu, Lake Nyasa, Ruvuma/Southern Coast, Lake Rukwa

district_code - Geographic location (coded)

population - Population around the well

gps_height - Height of pump head above sea level

quality_group - Quality of the water: good, salty, unknown, milky, colored, fluoride

source_class - Source of the water: spring, shallow well, borehole, river/lake, rainwater harvesting, dam, other

extraction_type - Kind of extraction at the waterpoint: gravity, handpump, other, submersible, motorpump, rope pump, wind-powered, etc.

After cleaning and consolidating column values, we are ready for modelling.
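The column reduction described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame holds made-up rows, `unused_column` is a hypothetical stand-in for any dropped column, and the helper names `select_columns` and `zero_share` are assumptions introduced here.

```python
import pandas as pd

# Columns retained after cleaning, as listed in this README.
KEEP = [
    "amount_tsh", "gps_height", "latitude", "longitude", "installer",
    "basin", "district_code", "population", "quality_group",
    "source_class", "extraction_type",
]

# Made-up rows standing in for the real data (hypothetical values).
raw = pd.DataFrame({
    "amount_tsh": [0.0, 0.0, 50.0, 0.0],
    "gps_height": [1390, 0, 686, 263],
    "basin": ["Lake Nyasa", "Pangani", "Pangani", "Rufiji"],
    "population": [109, 280, 250, 58],
    "unused_column": ["a", "b", "c", "d"],  # hypothetical column to drop
})


def select_columns(df, keep):
    """Return a copy of df with only the columns from keep that exist in df."""
    present = [col for col in keep if col in df.columns]
    return df[present].copy()


def zero_share(df, column):
    """Return the fraction of rows where the given column equals 0."""
    return float((df[column] == 0).mean())


clean = select_columns(raw, KEEP)
share = zero_share(clean, "amount_tsh")
print(list(clean.columns), share)
```

Running `zero_share` on the real `amount_tsh` column is one way to check the roughly 70% share of zeros noted above.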

Modelling

We ran three successive models: a logistic regression, a decision tree, and a random forest. We used the information generated by a LabelViz plot of the decision tree to inform our understanding of some of the key variables influencing water security in Tanzania. A feature importance plot of our random forest confirmed our findings from the decision tree.
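A rough sketch of fitting the three model types with scikit-learn is shown here. It assumes synthetic data from `make_classification` in place of the cleaned well data, and the hyperparameters (`max_depth=5`, `n_estimators=100`) are illustrative guesses rather than the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data (assumption: class 1 = non functional).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative hyperparameters, not the notebook's actual settings.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        # Recall on class 1: share of failed wells the model catches.
        "recall_class_1": recall_score(y_test, pred, pos_label=1),
        "f1_weighted": f1_score(y_test, pred, average="weighted"),
    }

# Per-feature importances from the fitted random forest.
importances = models["random_forest"].feature_importances_
print(results)
```

Comparing `recall_class_1` across the three entries mirrors the focus on catching failed wells, and `importances` is the array a feature importance plot would be drawn from.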

Findings and Conclusion

Screenshot 2023-06-12 at 9 19 33 AM

By far the most relevant factor in well failure is location. As we have been discussing, there is a notable lack of working wells in the southeastern region of the country, and our analysis has confirmed this. The next most important factor in well failure is extraction type, meaning the well pump or simple machine used to move water. Pumps from less established manufacturers are the most likely to fail. Our best model, the decision tree classifier (DTC), predicted well functionality accurately, with an overall F1 score of 94%. More importantly, the recall score for class 1 (non functional wells) was a robust 91%, an improvement over our baseline model.

Decision Tree Classification Report:

| well status | precision | recall | f1-score | support |
|---|---|---|---|---|
| functional | 0.93 | 0.98 | 0.96 | 7877 |
| non functional | 0.98 | 0.95 | 0.94 | 6520 |

Screenshot 2023-06-13 at 3 46 00 PM

As we can see from the confusion matrix above, our model captured 95% of failed wells, identifying 6,180 failed wells in our test data, and misclassified only 313 wells as failing. Performance on class 0 (functional wells) was even more impressive: the model identified 98% of functional wells, or 7,804 wells, with only 100 false positives.
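For readers less familiar with how precision, recall, and F1 follow from a confusion matrix, the arithmetic can be sketched as below. The counts are purely illustrative and are not the project's results, and `class_metrics` is a helper name introduced here.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only (NOT the project's confusion matrix).
# "Positive" here means non functional (class 1).
tp = 90  # failed wells correctly flagged as failed
fn = 10  # failed wells missed by the model
fp = 5   # functional wells wrongly flagged as failed
tn = 95  # functional wells correctly identified

non_functional = class_metrics(tp=tp, fp=fp, fn=fn)
# For class 0 the roles swap: its true positives are the tn above,
# its false positives are the fn above, and its false negatives are the fp.
functional = class_metrics(tp=tn, fp=fn, fn=fp)

print("non functional:", non_functional)
print("functional:", functional)
```

Recall for the non functional class is the figure that matters most for this project, since a missed failing well is costlier than a false alarm.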

Next steps

Digging in deeper:

  • Use mining site locations to anticipate water quality degradation
  • Identify communities whose nearest water access is more than 10 miles away
  • Use well depth and gps height to create a new column for wells at greatest risk of salinization
  • Use a black-box model to improve F1 and recall scores to get the most accurate picture of where wells will fail
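As one possible starting point for the distance-based step above, the sketch below flags communities whose nearest well is more than 10 miles away using the haversine great-circle formula. The community names, coordinates, and helper functions are all hypothetical, and straight-line distance is an assumption that ignores roads and terrain.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_well_miles(location, wells):
    """Distance in miles from one location to its closest well."""
    lat, lon = location
    return min(haversine_miles(lat, lon, w_lat, w_lon) for w_lat, w_lon in wells)


def underserved(communities, wells, limit_miles=10.0):
    """Names of communities whose nearest well is farther than limit_miles."""
    return [
        name
        for name, location in communities.items()
        if nearest_well_miles(location, wells) > limit_miles
    ]


# Hypothetical coordinates for illustration only.
wells = [(-6.0, 35.0)]
communities = {"A": (-6.05, 35.0), "B": (-6.5, 35.0)}
flagged = underserved(communities, wells)
print(flagged)
```

In practice the `wells` list would likely be restricted to functional wells, using the latitude and longitude columns kept in the cleaned dataframe.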

About

Cumulative project for Flatiron Phase 3: Predicting well failure in Tanzania
