TABLE OF CONTENTS
- Introduction
- Styles
- Coding Practices
There are already many style guides out there for Python, so instead of reinventing the wheel, this document borrows (i.e. copies and pastes) heavily from them, with a few tweaks here and there. The styles captured here are emphasized because of their usefulness in creating consistent, maintainable, and readable code.
There are also a lot of styles not captured here that are still worth knowing about, and reading the following guides is highly encouraged. The guides listed below agree on many coding conventions, but offer different perspectives on the reasoning behind certain styles.
- PEP 8 - The official style guide of the Python community
- Google Python Style Guide - If it’s good for Google, it should be good enough for us
- The Hitchhiker's Guide to Python
Don't. Some of the above guides (specifically Google's) give leeway for breaking conventions. The most valid reason for this has to do with backwards compatibility. Backwards compatibility is not something we need to worry about at our organization.
There is a lot of code that existed at KIPP NorCal before this style guide, and so there will be code that doesn't conform to this guide. Whenever refactoring this code, we should also work to clean up areas that don't follow convention.
Some conventions lay out multiple options for a style (although usually no more than two). Whichever option you choose when building new code, use it consistently. Do not switch between them. When working on existing code, stick to the convention that the original author chose.
We will follow what PEP 8 has laid out for naming conventions. Below is a copy and paste of some key points.
Never use the characters ‘l’ (lowercase letter el), ‘O’ (uppercase letter oh), or ‘I’ (uppercase letter eye) as single character variable names.
In some fonts, these characters are indistinguishable from the numerals one and zero. When tempted to use ‘l’, use ‘L’ instead.
Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.
Class names should normally use the CapWords (aka CamelCase) convention.
There is a note in PEP 8 where "The naming convention for functions may be used instead". Disregard this. We will always use CapWords.
Function names should be lowercase, with words separated by underscores.
There is a note in PEP 8 where underscores are used optionally to improve readability. Disregard this. Underscores always improve readability.
Use single leading underscore names to denote a method or a function as private. Additionally from PEP 8 - "weak “internal use” indicator. E.g. from M import * does not import objects whose name starts with an underscore".
def _single_leading_underscore() -> None:
    """This will not get imported."""
    return None

Use single trailing underscore to avoid conflicts with Python keywords:
def single_trailing_underscore_(x: int) -> bool:
    class_ = 5  # Can use with variables, too!
    if x == class_:
        return True
    else:
        return False

Always use self for the first argument to instance methods.
Always use cls for the first argument to class methods.
If a function argument’s name clashes with a reserved keyword, it is generally better to append a single trailing underscore rather than use an abbreviation or spelling corruption. Thus class_ is better than clss. (Perhaps better is to avoid such clashes by using a synonym.)
Variables should follow the naming conventions of methods/functions.
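Putting the naming conventions above together, a hypothetical module might look like this (all names are illustrative):

```python
# employee_report.py -- module names are short and all-lowercase

MAX_RETRIES = 3  # module-level constants are all-caps with underscores


class EmployeeReport:  # class names use CapWords
    def build_report(self) -> dict:  # functions/methods are lowercase_with_underscores
        report_data = {"retries": MAX_RETRIES}  # variables follow function naming
        return report_data

    def _format_row(self) -> str:  # leading underscore marks this method as private
        return "row"
```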
Follow the guide for imports in PEP 8. In general, here are the key points:
Each module or package import needs its own line.
BAD
import os, sys
GOOD
import os
import sys

Importing multiple items that are contained within a module or package on one line is okay.
# Both acceptable
from os import chdir, getcwd
from os import chdir
from os import getcwd

Imports should always be at the top of a file after any module docstrings and before any module globals or constants are defined.
Imports should be grouped into three groups with a blank line between them. Within the groups, it is recommended to alphabetize the imports by module or package. The three groupings are:
- standard library imports
- related third party imports
- local application/library specific imports
Example:
"""
Module docstring
"""
import datetime
import os

import pandas
import requests

from some_local_module import foo

SOME_CONSTANT = None

Avoid using wildcard imports (*). This method of import loads everything from that module or package directly into your module's namespace. In most cases, this is unnecessary. If you only need one object from a package or module, then explicitly import that object instead of everything with a wildcard. If you do need everything from a package or module, then import the package or module itself to avoid namespace issues.
EXAMPLE
# instead of this
from os import *
# do this if you need all of the os library
import os
# or this if you need one function
from os import getcwd

The exception to this rule is when using Django (Galaxy). It is common to import models into views.py using a wildcard, or to import views into urls.py using a wildcard import. This is because of Django's structure, and it is idiomatic for the Django framework.
TODO: Add content here
Type hinting is a little controversial in the Python community. Many people feel that it goes against dynamic typing, which is one of the features of Python that makes it unique.
For KIPP NorCal, many of our repos are designed for our own business purposes with a very specific set of requirements in mind, and type hinting can be a useful way to document code. Type hints can also speed up code development with auto-completion in IDEs. Type hints can also help catch bugs when used with a linter (such as Pylint or Flake8).
The syntax for function annotations was defined in PEP 3107, and type hints themselves were standardized in PEP 484. Below are some examples:
def simple_example(a: str) -> None:
    """Takes one string param (a) and returns None."""
    print(a)
def foo(a: str, b: bool = False) -> bool:
    """
    Takes a string parameter with a boolean parameter defaulting to False.
    Returns a boolean.
    """
    if b:
        return True
    else:
        return False
from typing import Union

def bar(a: Union[int, float]) -> Union[None, int]:
    """
    One parameter (a) which can be an integer or a float.
    Returns either an integer or None.
    """
    if isinstance(a, int):
        return a
    else:
        return None

The general rule of thumb for good commenting is that your comments should add context that might not be apparent in the code. Comments should not restate exactly what your code is doing.
BAD
# print results
print(results)

TODO: Add content
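A GOOD counterpart explains the "why" the code cannot show; the scenario below is a hypothetical illustration:

```python
# The vendor API returns amounts in cents, so convert to dollars
# before reporting (finance expects dollars in the export).
balance_cents = 12345
balance_dollars = balance_cents / 100
```

The comment adds context (the vendor's unit convention) that a reader could not recover from the division alone.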
For docstrings, follow PEP 257.
Here is an example for a single line docstring:
def foo():
    """A single line docstring."""

Here is an example of a multi-line docstring. These two are equivalent:
def foo():
    """A multi-line
    docstring.
    """

def bar():
    """
    A multi-line
    docstring.
    """

Whenever handling exceptions with a try/except block, do not use a bare except as this can hide bugs in your code. Always capture specific exceptions.
BAD
def raises_an_exception(some_list):
    try:
        return some_list[1000]
    except:  # or except Exception:
        return None
GOOD
def raises_an_exception(some_list):
    try:
        return some_list[1000]
    except IndexError:
        return None

The one exception to this rule is using a bare try/except block within the if __name__ == '__main__' block. The bare try/except blocks there capture the error, log it, and send a Slack notification before terminating the code.
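That idiom might look like the following sketch. The notify_slack helper is a hypothetical stand-in for whatever Slack utility a repo actually uses:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def notify_slack(message: str) -> None:
    # Hypothetical helper; a real repo would post to a Slack webhook here.
    logger.info("Would send to Slack: %s", message)


def main(fail: bool = False) -> None:
    # Stand-in for the real job; raises to demonstrate the error path.
    if fail:
        raise RuntimeError("something broke")


if __name__ == "__main__":
    try:
        main()
    except:  # bare except is acceptable only at this top level
        logger.exception("Job failed")
        notify_slack("Job failed -- check the logs")
        raise  # re-raise so the job terminates with a non-zero exit code
```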
Whenever opening files, always use the with statement. This creates a context manager, which reads more cleanly and reduces the risk of corrupting your file.
BAD
f = open("file.txt", "w")
f.write("Hello World!")
f.close()
GOOD
with open("file.txt", "w") as f:
    f.write("Hello World!")

Python doesn't have a stance on whether single or double quotes are better for strings. Since the vast majority of our existing code uses double quotes, let's stick to double quotes.
When the signature of a function or a call to a function exceeds the set line length, then separate the signature or call over multiple lines with each parameter getting its own line.
EXAMPLE
# Pretend these are really long
# Long function signature
def some_long_func(
    a: str,
    b: str,
    c: str,
    d: str,
) -> None:
    # Do some stuff
    return None

# Long function call
some_long_func(
    "my",
    "dog",
    "eats",
    "rocks",
)

You don't have to only do this when exceeding the line limit. If at any point you feel that breaking these up over multiple lines makes your code more readable, then have at it!
TODO: Add content here
Development should always be done on your local machine to avoid breaking production or losing data. The pipelines on our servers are for production and should always be on the main branch whenever possible.
A high level development workflow example:
- Write a tech spec for the product/feature you're building
- Prep for development
- If new work, create a new repo and create a development branch
- Add branch protections to require reviews on PRs and to block commits to main
- If refactoring/creating a feature, checkout whichever branch you are planning to develop off of (this should almost always be main), and run git pull to get the most up-to-date code. Then create your development branch.
- A helpful hint is to use a Jira issue ID or a SemVer version number in the branch name. This can help point to documentation for your work in case you forget what the branch was for or in case someone else needs to look at the code.
- Write your code. Commit and push often. You'll be happy you did if anything happens to your computer.
- Test your code. This can look different from project to project. Whatever you choose to do, make sure you are covering your edge cases and the code is working as expected.
- Once done with dev and testing, open a PR in GitHub
- Once the PR is approved and merged to main:
- If new repo:
- ssh into server
- run git clone <git repo address> in the jobs directory
- build docker image
- schedule job to run in crontab
- If existing repo:
- ssh into server and navigate to the repo's directory (should be in /home/data_admin/jobs)
- Make sure the repo is on the main branch and run git pull
- Rebuild docker image
- Check that the new image name matches the image name in the command in crontab, so it will still run as expected. If not, then rename the image or update the command
New repos or enhancements to code need to be accompanied by a tech spec. Annual code rollover and bug fixes do not require tech specs; however, one is highly encouraged for large bug fixes.
There is a Python/Dev folder on our Data Team drive where tech specs (and other docs) live. There is a template tech spec that can be found here. All of our specs are stored in their respective repo's folder stored here.
The spec template is meant to be flexible. Use what you need and delete what you don't. Feel free to add sections if needed. The purpose of these documents is to capture the why, the how, the expected outcomes and decisions made around new work or an enhancement.
One thing that is required with tech specs is the naming convention. Names need to follow: [SemVer number] - [Repo name] - [Title of work]. An example would be 4.0 - Google Accounts - Internal Refactor.
SemVer stands for semantic versioning. It might seem like overkill, but using semantic versioning when naming our specs helps give a timeline to the specs.
We use Pipenv to manage our environments in dev and production. We'll give a brief overview below, but Pipenv documentation can be found here if you want to know more.
Pipenv generates two files: Pipfile and Pipfile.lock. Both of these files need to be included in our git repos.
The Pipfile is a file that tracks all of our dependencies for a repo with broad versioning. Most of our dependencies will have an * which indicates that we are using the latest version of a package. Some packages may also indicate that we are only using a version before/after some specific version.
The Pipfile.lock file is a file that tracks the specific packages that are installed in the environment (based on what is in the Pipfile). The Pipfile.lock file is not meant to be edited.
It is recommended by Pipenv and others to always set up your environment by installing from the Pipfile.lock file. We don't do it this way, so forget what you just read.
We always build our docker images by installing Pipenv and then running pipenv install --skip-lock. This command will create a virtual environment with the most up-to-date versions of a repo's dependencies allowed by the Pipfile. The benefit to this is that it ensures our repos are operating on the most up-to-date code possible. The downside is that sometimes a new release of a package might not be compatible with your code or other dependencies in your repo. This is where the Pipfile.lock comes in handy.
The Pipfile.lock file is our plan for handling any dependency issues. While developing, keep your Pipfile.lock up to date by running pipenv update regularly. If you hit an issue, you can fall back to the Pipfile.lock on main in GitHub. When you finish developing, make sure your most up-to-date Pipfile.lock is included in your PR so we are able to recreate the last known stable environment for the repo.
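A sketch of the day-to-day commands this workflow implies (assuming Pipenv is installed and you are in the repo's root directory):

```shell
# Install dependencies at the newest versions the Pipfile allows,
# without pinning to Pipfile.lock (this is how our Docker images build)
pipenv install --skip-lock

# While developing, refresh the lock file regularly so it records
# the last known-good set of versions
pipenv update

# If a new release breaks something, fall back to the pinned versions
# recorded in the Pipfile.lock from main
pipenv sync
```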
Below is a guide to the general practices we follow for the structure and naming conventions of our repos. This section is not meant to be mandatory, as the needs of each repo's structure differ depending on the complexity of the code. It is strongly encouraged to follow the below conventions if your code begins to get complex. The benefit of following them is that others will be able to understand your code and its intended purpose just from the structure and names of packages.
A flat layout is where all of the files of the repo are in the root directory. This is recommended for smaller projects with few files and little code where there isn't any real benefit to structuring the code.
This is a specific layout where none of your code is in the root directory. Instead, the code is inside a src directory which contains a package where your code lives. The name of the package in the src directory should be the same as the name for your repo.
The biggest benefit to this layout is that it allows you to install your code as an editable package. This type of install is beneficial for testing, and you can create a tests package in your root directory alongside the src directory. More information can be found here. # TODO: Add link
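Assuming a hypothetical repo named my_repo, a src layout might look like:

```
my_repo/
├── Pipfile
├── Pipfile.lock
├── src/
│   └── my_repo/
│       ├── __init__.py
│       └── main.py
└── tests/
    └── test_main.py
```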
A hybrid layout is the middle ground between the flat layout and the src layout. This is for projects that are complex enough that separating the code into packages helps with organization, but where a full src layout might be too much. With this layout, usually a main.py file sits in the root directory and the rest of the code lives in packages that are also in the root directory alongside main.py.
Here are common names that you may find among our repos and an explanation of what they are and what their purpose is. As mentioned in the intro of the section, every repository does not need to have these packages unless there is a need for it. If you find yourself needing to build separate workflows, then create a workflows package to store them in.
These packages have classes that are abstract representations of a concept and manage a state (ex. a class that represents an employee) or a value (ex. a class that manages a queue).
Not to be confused with git repos, these packages contain code that implements the repository design pattern. The classes that implement this pattern usually wrap a data resource (ex. an external API or a data warehouse connection) with common CRUD operations.
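As an illustrative sketch (the connection object and table name are hypothetical), a repository class might wrap a warehouse connection like this:

```python
class StudentRepository:
    """Wraps a data warehouse connection with common CRUD operations."""

    def __init__(self, connection) -> None:
        # connection is any DB-API-style object with an execute() method
        self.connection = connection

    def get_by_id(self, student_id: int):
        # Read operation: fetch a single row by primary key
        return self.connection.execute(
            "SELECT * FROM students WHERE id = ?", (student_id,)
        )

    def create(self, name: str):
        # Create operation: insert a new row
        return self.connection.execute(
            "INSERT INTO students (name) VALUES (?)", (name,)
        )
```

Callers (services, typically) only see get_by_id and create; the SQL and the connection details stay hidden behind the repository.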
These packages have two uses.
One use is that they sit between a repository object and the business logic and perform some transformation to data. Sometimes they get data from a repository object and prepare it for use by a workflow, or take data from a workflow and shape it for insertion into a repository.
Another use is where they perform a common operation that is shared across multiple workflows.
Sessions are objects that manage metadata around a running job. They might keep track of runtime arguments or any other data that the job relies on.
This package is for ancillary parts of the code. Run time args, exceptions, data maps, or helper functions that might be used across packages.
Workflows are the business logic. Some code bases may only have one job to perform and a workflow package might not be needed; others might have multiple workflows where each one handles a different edge case. Workflows are typically built using services and don't usually work directly with repositories.
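Putting the layers together, a hypothetical workflow might use a service, which in turn shapes data coming from a repository (all names below are illustrative):

```python
class EnrollmentService:
    """Sits between a repository and the business logic."""

    def __init__(self, repository) -> None:
        self.repository = repository

    def get_active_students(self) -> list:
        # Shape raw repository rows for use by a workflow:
        # keep only active students and normalize the name casing
        rows = self.repository.fetch_all()
        return [name.title() for name, active in rows if active]


class EnrollmentWorkflow:
    """Business logic; built on services, not directly on repositories."""

    def __init__(self, service: EnrollmentService) -> None:
        self.service = service

    def run(self) -> list:
        # In a real job this would carry out the full business process
        return self.service.get_active_students()
```

Note that the workflow never touches the repository directly; swapping the data source only requires changing what is passed into the service.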