A dynamic system to email K-1 PDFs to investors.
Every tax year we need to send hundreds of K-1 tax forms to investors. We receive these forms from our external accountants. This project automates the process of matching K-1 PDFs to an internal investor contact table and then emailing the attachments with an interpolated email body. The emails are sent via the Outlook API.
- k1_processor.py: Contains the K1BatchProcessor class that handles all the work
- main.py: Entry point
- config.py: Set running parameters here for each code run (not tracked due to constant changing, but can be recreated from config.pytemplate)
- auth.py: Microsoft API authentication (more on this below)
- logger.py: Logging configuration
Inside the entry point, you can choose which of the external methods get run. The methods are called from outside the class to allow for step-by-step processing, instead of being forced to run everything in one shot. This is specifically built in as a safeguard because emailing investors and handling tax data is extremely sensitive.
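
As a rough illustration of this step-by-step design, the sketch below runs only the steps that are switched on. It is not the actual main.py: `StubProcessor`, `STEP_ORDER`, and `run_steps` are hypothetical names, and the stub stands in for K1BatchProcessor so the example runs on its own.

```python
# Hypothetical sketch of a step-gated entry point; not the project's main.py.

class StubProcessor:
    """Stand-in for K1BatchProcessor that only records which methods ran."""

    def __init__(self):
        self.calls = []

    def extract_entities(self):
        self.calls.append("extract_entities")

    def match_files_and_keys(self):
        self.calls.append("match_files_and_keys")

    def send_emails(self):
        self.calls.append("send_emails")


# The three processing methods, in the order the workflow needs them.
STEP_ORDER = ("extract_entities", "match_files_and_keys", "send_emails")


def run_steps(processor, enabled):
    """Call each enabled step in workflow order and return the names that ran."""
    ran = []
    for name in STEP_ORDER:
        if name in enabled:
            getattr(processor, name)()
            ran.append(name)
    return ran
```

For example, `run_steps(processor, {"extract_entities", "match_files_and_keys"})` stops after matching, so the results can be reviewed before any email is sent.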
- Manually copy K-1 PDFs into the files directory, into their respective investment folders.
- Ensure investors.xlsx contains correct investor information.
- Set running parameters in config.py, which get imported into the entry point (create config.py from config.pytemplate if it does not exist). See the docstring of the __init__() method of K1BatchProcessor for explanations of how to set the config parameters.
- Instantiating the K1BatchProcessor class in the entry point ensures the correct folder structure (as explained below) and gathers the K-1s from the folders to prepare for processing. "Managers" K-1s are excluded as they are not emailed to investors.
- The extract_entities() method reads the PDFs and attempts to extract the issuing entity and receiving entity from each. These are stored in a pickle cache to speed up future runs on the same files (in the case of staggered emailing, testing, or any other required re-run). The cache will be loaded if it exists; otherwise extraction will be run on all gathered files.
- The match_files_and_keys() method attempts to match the extracted entity information from each file to an investor contact in investors.xlsx to prepare for emailing.
- The send_emails() method sends emails with K-1 attachments to the matched investors. You will be prompted to confirm (y/n) that you want to send emails (another safeguard).
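
The caching and confirmation behaviour described in the steps above could look roughly like the following. This is a minimal sketch under assumptions, not the code in k1_processor.py: `load_or_extract`, `confirm_send`, and their parameters are hypothetical, and the extraction function is passed in so the example needs no PDF library.

```python
import pickle
from pathlib import Path


def load_or_extract(cache_path, files, extract_fn):
    """Return the cached results if the cache file exists, else extract and cache."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip extraction entirely.
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    # Cache miss: run extraction on every gathered file, then store the results.
    results = [extract_fn(f) for f in files]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(results, fh)
    return results


def confirm_send(count, ask=input):
    """Ask for an explicit y/n answer; anything other than 'y' counts as no."""
    answer = ask(f"Send {count} emails? (y/n) ")
    return answer.strip().lower() == "y"
```

Treating every answer except `y` as a refusal keeps the default on the safe side, which fits the safeguard role of the prompt.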
These directories and their contents are not tracked; however, logs, snapshots, and investors.xlsx are synced to S3.
- cache: Contains a single-file pickle cache of extracted entities from each K-1 file
- dumps: Stores text files of the extracted text from each K-1 page
- files: Contains folders for each investment, holding the K-1 PDFs
- logs: Stores text logs of standard output (print statements, etc.) from code runs, and csv logs of unmatched files
- snapshots: Stores snapshots of investors.xlsx as backups
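
Since instantiating the class is what ensures this folder structure, a check along the following lines is one way it could be done. This is only a sketch: `REQUIRED_DIRS` and `ensure_directories` are hypothetical names, and the real class may handle this differently.

```python
from pathlib import Path

# Directory names taken from the list above.
REQUIRED_DIRS = ("cache", "dumps", "files", "logs", "snapshots")


def ensure_directories(root):
    """Create any missing working directory under root; return the names created."""
    root = Path(root)
    created = []
    for name in REQUIRED_DIRS:
        path = root / name
        if not path.exists():
            path.mkdir(parents=True)
            created.append(name)
    return created
```

Calling it a second time on the same root creates nothing and returns an empty list, so it is safe to run on every instantiation.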
- Every time the class is instantiated, a timestamped snapshot of investors.xlsx is stored inside the snapshots directory for safety.
- A timestamped text file is stored inside the logs directory, containing the standard output from every code run. The print_k1_array() method can be called to append the k1_array (i.e., the result of the extract_entities() method) to this log file. This does not print the k1_array to the terminal, to avoid crowding.
- A timestamped csv file is stored inside the logs directory whenever extract_entities() is called, containing a table of the K-1 files that did not match any investor contact in investors.xlsx.
- A timestamped csv file is stored inside the logs directory whenever send_emails() is run, containing all attempted investor rows along with the sent status and timestamp.
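
A timestamped snapshot like the one taken of investors.xlsx can be produced with the standard library alone. The sketch below is an assumption about how it might work: `snapshot_file` and the `name_YYYYMMDD_HHMMSS` naming scheme are hypothetical, not taken from the project.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_file(source, snapshot_dir, now=None):
    """Copy source into snapshot_dir under a timestamped name; return the new path."""
    source = Path(source)
    snapshot_dir = Path(snapshot_dir)
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    # The timestamp format is an assumption; `now` can be passed in for testing.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    target = snapshot_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file metadata
    return target
```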
The logs directory, snapshots directory, and investors.xlsx file are synced to S3 whenever changes are made to them. These changes are tracked during code runs using instance variables as flags (e.g., self.logs_changed).
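
The flag-based sync can be pictured as follows. This is a sketch, not the project's implementation: only `self.logs_changed` is named in this README, so `SyncTracker`, the other two flag names, and the injected `upload` callable are hypothetical. In practice the callable would wrap the actual S3 upload.

```python
class SyncTracker:
    """Tracks which synced targets changed during a run and uploads only those."""

    # Targets that are synced to S3, per the paragraph above.
    TARGETS = ("logs", "snapshots", "investors")

    def __init__(self, upload):
        self.upload = upload  # hypothetical callable that performs the S3 upload
        self.logs_changed = False
        self.snapshots_changed = False  # assumed flag name
        self.investors_changed = False  # assumed flag name

    def sync(self):
        """Upload every target whose flag is set, reset the flag, return what synced."""
        synced = []
        for target in self.TARGETS:
            flag = f"{target}_changed"
            if getattr(self, flag):
                self.upload(target)
                setattr(self, flag, False)
                synced.append(target)
        return synced
```

Resetting each flag after its upload means a second `sync()` call in the same run does nothing unless something has changed again.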
As mentioned, emails are sent via the Outlook API, which is authenticated inside auth.py using the msal package. The credentials that are fed to msal are stored in AWS Parameter Store (AWS is our cloud platform; we use Azure only because the Outlook API requires it). Your environment therefore needs to be configured with AWS credentials before the Outlook API can be authenticated.
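
The flow from Parameter Store to an access token might look roughly like this. It is a sketch under several assumptions and not the contents of auth.py: the parameter names are invented, the client-credentials flow is assumed (the project may use a different msal flow), and the function names are hypothetical. The `get_parameter` and msal calls themselves are standard boto3 and msal usage.

```python
# Hypothetical parameter names; the real names in Parameter Store are not given here.
PARAMETER_NAMES = {
    "client_id": "/k1-emailer/client_id",
    "client_secret": "/k1-emailer/client_secret",
    "tenant_id": "/k1-emailer/tenant_id",
}


def fetch_credentials(ssm_client, names=PARAMETER_NAMES):
    """Read each credential from Parameter Store, decrypting SecureString values."""
    creds = {}
    for key, name in names.items():
        response = ssm_client.get_parameter(Name=name, WithDecryption=True)
        creds[key] = response["Parameter"]["Value"]
    return creds


def acquire_token(creds):
    """Exchange the credentials for a Microsoft Graph token (assumed client-credentials flow)."""
    import msal  # imported here so the sketch loads even where msal is not installed

    app = msal.ConfidentialClientApplication(
        creds["client_id"],
        client_credential=creds["client_secret"],
        authority=f"https://login.microsoftonline.com/{creds['tenant_id']}",
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(result.get("error_description", "token request failed"))
    return result["access_token"]
```

With AWS credentials configured, usage would be along the lines of `acquire_token(fetch_credentials(boto3.client("ssm")))`.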