aivenpy is a Python-based site monitoring system that uses Aiven for PostgreSQL and Aiven for Apache Kafka to:
- Monitor any website
- Check, using a regular expression, whether a particular item is present on the site
- Publish website health stats (response_code, response_time, etc.) to Kafka
- Receive stats from Kafka and insert them into the database
Before proceeding further [Update as of 5 May 2021]
The system is now entirely automated. The Kafka producer is scheduled to run every 10 minutes, and the Kafka consumer has been scheduled to start with the release tag v.x.x (see https://github.com/briwoto/aivenpy/actions). So, if you plan to run this project on your local machine, go to GitHub Actions and stop the release build; this stops the consumer in the pipeline so that you can start it on your local machine.
- Framework
- How to run the project
- Project Architecture
- Code flow
- Github actions pipeline
- Advantages of the project implementation
- Tests
- If I had more time
- Contact
For development: I used open-source libraries instead of a full-fledged framework like Django or Flask, and focused on code clarity and clean architecture.
For testing: I used the pytest framework.
If you simply want to run the project without knowing about its mechanics, installing Docker is sufficient.
If you want to run the code yourself on your local machine, you will need the following pre-installed on your system:
Regardless of which of the two methods you choose, you will need the values of the following environment variables:
- AVDATABASE
- AVHOST
- AVPASSWORD
- AVPORT
- AVUSER
- AV_BASE_URL
- AV_KFUSER
- AV_KFPASSWORD
- AV_KFPORT
- BOOTSTRAP_SERVER
- CA_PEM
First, clone the repo to your local machine.

Clone with SSH:

```
git clone git@github.com:briwoto/aivenpy.git
```

OR

Clone with HTTPS:

```
git clone https://github.com/briwoto/aivenpy.git
```
NOTE: The step below for setting environment variables is for Mac/Linux. If you are using Windows, please look up the steps to set environment variables on Windows.
If you need the env var values to run the project, feel free to Contact Me.
The authentication method used for Kafka is SASL_SSL PLAIN, so you need a pem file in addition to the credentials (notice the `CA_PEM` variable mentioned in the pre-requisites). Aiven Kafka provides a CA file, which you have to convert to a `pem` file; the contents of that file are then saved in an environment variable.

First, convert the crt to a pem file. You can use this link for reference:

Once you have the `.pem` file, open it in any text editor, copy the contents, and paste them as the value of the `CA_PEM` environment variable. You can use this link for reference:
How to Export a Multi-line Environment Variable in Bash/Terminal e.g: RSA Private Key
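The two steps above can be sketched as follows. This is a minimal example, not the project's documented procedure; the first command only creates a throwaway certificate so the snippet is self-contained (in practice you would use the CA file downloaded from the Aiven console):

```shell
# For demonstration only: create a throwaway certificate standing in for
# Aiven's CA file (in practice, use the ca file downloaded from Aiven).
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out ca.crt \
    -days 1 -nodes -subj "/CN=demo" 2>/dev/null

# Convert the crt to a pem file (a no-op if it is already PEM-encoded):
openssl x509 -in ca.crt -out ca.pem -outform PEM

# Export the multi-line file contents as the CA_PEM environment variable:
export CA_PEM="$(cat ca.pem)"
```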
If you want to run the project via Method A, i.e. Docker, follow these steps:

Build the image

Run the command:

```
docker build -t consumer .
```

Run the container

Add values to the corresponding env vars below and run the command:

```
docker run -it \
  -e AVDATABASE="" \
  -e AVHOST="" \
  -e AVPORT="" \
  -e AVUSER="" \
  -e AVPASSWORD="" \
  -e AV_BASE_URL="" \
  -e AV_KFUSER="" \
  -e AV_KFPASSWORD="" \
  -e AV_KFPORT="" \
  -e BOOTSTRAP_SERVER="" \
  -e CA_PEM="" \
  consumer
```
If you want to run the project via Method B, i.e. directly through Python, first make sure that all the pre-requisites mentioned above are fulfilled. Then follow these steps:

Copy env vars to bashrc

Open a terminal tab/window and run the command:

```
sudo nano /etc/bashrc
```

The terminal might ask for a password. If it does, type the password and press Enter. Once the bashrc file opens, export all the environment variables with the correct values.

Example:

```
export AVUSER=dummyname
export AVPASSWORD=whateveryourpasswordis
...
...
```

Save the bashrc file and open a new terminal window.

Create and activate a virtual environment

Run this command to create a virtual environment:

```
python3 -m venv venv
```

Then, run this command to activate it:

```
source venv/bin/activate
```

Install dependencies

Run this command to install all dependencies:

```
pip install -r requirements.txt
```

Run the site monitor & Kafka producer

If you want to get fresh stats from the websites and publish them to Kafka, run:

```
make monitor_producer
```

If you want to run the Kafka consumer, run:

```
make consumer
```
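The two make targets wrap the project's two entry-point scripts, `monitor_producer.py` and `consumer.py`. A minimal Makefile along these lines is an assumption, not necessarily the repository's actual file:

```make
monitor_producer:
	python monitor_producer.py

consumer:
	python consumer.py
```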
The project consists of 3 layers:
- The Interactive layer says "what" to do
- The Business layer decides "how" to do it
- The Service layer decides "where" to go to get/update data
The interactive layer is the entry point of the project. This is the layer that:
- Interacts with the site monitor
- Tells the Kafka producer (business layer) to send data to Kafka (service layer)
- Starts the consumer
- Sends messages from the consumer to the database
Any interaction with a business layer component is done at this layer. Any interaction between two business layer components is also done at this layer.
The business layer is the backbone of the project. All logic, data extraction, and manipulation happen at this layer. The interactive layer says what to do; the business layer decides "how" to do it.
The service layer is nothing but the connection with in-house/third-party services, used to get/update what we need.
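The separation can be pictured with a toy example. All names here are illustrative stubs, not the project's actual functions; the point is only the direction of the calls between layers:

```python
# Interactive layer says *what* to do, business layer decides *how*,
# service layer knows *where* the data lives.

def fetch_sites():
    """Service layer: where the list of monitored sites comes from."""
    return ["https://example.com"]

def collect_stats(url):
    """Business layer: how stats are computed for one site (stubbed here)."""
    return {"url": url, "response_code": 200}

def run_monitor():
    """Interactive layer: what to do -- wire the other two layers together."""
    return [collect_stats(url) for url in fetch_sites()]
```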
There are two separate executions:
- The site monitor and Kafka producer are coupled into `monitor_producer.py`.
- The Kafka consumer is executed through `consumer.py`.
When you directly run

```
python monitor_producer.py
```

the code for the site monitor and producer is executed. The sequence of code execution is as follows:

- First, aiven.py calls the `config` module. The `pem` file is updated here. The `logger` library, which is used in all the programs to log info/warnings to the console, is also initiated.
- Then, aiven calls the `monitor` module to get the stats of the target website:
  - First, all the sites (whose stats we want) are fetched from the `sites` table
  - Then, stats are collected for each website
  - In addition to the stats, the code also checks a regular expression to verify whether a particular item is present in `response.text`
- Upon receiving the stats and the `regexp` boolean value, aiven.py sends these stats to the producer
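The stat-collection step might look roughly like this. This is a sketch, not the project's actual code: function and field names are assumptions based on the description above, and the standard library's `urlopen` is used here although the project likely uses `requests`:

```python
import re
import time
from urllib.request import urlopen  # project likely uses `requests` instead

def build_stats(url, status_code, body, elapsed, pattern):
    """Assemble the record that would be handed to the Kafka producer.
    Field names are assumptions based on the README's description."""
    return {
        "url": url,
        "response_code": status_code,
        "response_time": round(elapsed, 3),
        "regexp": bool(re.search(pattern, body)),
    }

def check_site(url, pattern):
    """Fetch the site and collect its stats (performs a network call)."""
    start = time.monotonic()
    with urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        return build_stats(url, resp.status, body,
                           time.monotonic() - start, pattern)
```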
When you want to start the consumer, consumer.py needs to be run. You can simply run the make command to start it:

```
make consumer
```

- Once the Kafka consumer connection is set up, messages are received by consumer.py
- Each message, when received, is formatted via the `consumer_queries` module
- Once formatted, the data is inserted into the database
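A `consumer_queries`-style formatting helper might look like this. The field names and the tuple shape are illustrative assumptions; the real module is not shown in this README:

```python
import json

def format_message(raw: bytes) -> tuple:
    """Turn a raw Kafka message into a row ready for database insertion
    (field names are assumptions, not the project's actual schema)."""
    stats = json.loads(raw)
    return (
        stats["url"],
        int(stats["response_code"]),
        float(stats["response_time"]),
        bool(stats["regexp"]),
    )
```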
With the intent of keeping the architecture clean and isolating each layer, the talker module was created. Any interaction that needs to happen with the database happens via the talker module.
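The idea can be sketched with an injected backend. This is an illustration of the pattern, not the project's actual talker code; `sqlite3` stands in for Postgres so the snippet is self-contained:

```python
import sqlite3

class SqliteBackend:
    """Stand-in backend; the real project would have a Postgres equivalent."""
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE stats (url TEXT, response_code INTEGER)")

    def insert_stats(self, url, response_code):
        self.conn.execute("INSERT INTO stats VALUES (?, ?)", (url, response_code))

    def fetch_all(self):
        return self.conn.execute("SELECT url, response_code FROM stats").fetchall()

class Talker:
    """All database interaction goes through here. The backend is injected,
    so moving to a new database only means writing a new backend class."""
    def __init__(self, backend):
        self.backend = backend

    def save(self, url, response_code):
        self.backend.insert_stats(url, response_code)

    def history(self):
        return self.backend.fetch_all()
```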
Instead of looping with a while statement to get stats periodically, I set up a GitHub Actions pipeline.
As of now, the site monitor is completely automated. The stats are collected every 10 minutes via the GitHub Actions pipeline. This means that, every 10 minutes:
- A request is sent to the target website
- The collected stats are sent to Kafka
Visit the below link to view the status of each build:
https://github.com/briwoto/aivenpy/actions
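In GitHub Actions, a run-every-10-minutes schedule is expressed with a cron trigger. The following is a sketch of what such a workflow could look like; the repository's actual workflow file, job names, and steps may differ, and secrets handling is elided:

```yaml
name: monitor-producer

on:
  schedule:
    - cron: "*/10 * * * *"   # every 10 minutes

jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install -r requirements.txt
      - run: make monitor_producer
        # the env vars listed under pre-requisites would be injected here,
        # typically from repository secrets
```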
- Although I can't claim the project to be "100% clean", the focus while writing the code was on clean architecture. For instance:
- The `talker.py` module embodies the idea that the database service should be plug-and-play. If, in the future, we decide to move from Postgres to another database, or from on-prem to cloud services, we should only need to create the new connections without touching any of the existing code.
- Every package at every layer is basically an app in itself. This opens up the possibility of evolving the project so that the producer, consumer, and database all sit on different systems.
The test suite is divided into 3 test files:
- test_producer
- test_db
- test_site_monitor
The tests are very basic at this point; the coverage is at a unit-test level. For lack of time, I could not write detailed integration tests.
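A unit-level test in this style might look like the following. The function under test here is a stand-in written for illustration; the real test files and the monitor's actual function names are not shown in this README:

```python
import re

def item_present(body, pattern):
    """Stand-in for the monitor's regex check (illustrative, not the
    project's actual function)."""
    return bool(re.search(pattern, body))

def test_item_present():
    assert item_present("<h1>Example Domain</h1>", r"Example\s+Domain")

def test_item_absent():
    assert not item_present("<h1>Nothing here</h1>", r"Example\s+Domain")
```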
- I would like to spin off the project into different sub-projects, to "completely" isolate each app.
- Bug: The logger logs the same output twice in the console. I couldn't find time to fix this. I confirmed, however, that the data in the database is not getting duplicated; the issue is only with the logger function.
- Segregate the talker module into sub-modules, where one module would be responsible for formatting the data and another for sending it to the postgres.py db layer.
- Include a broader set of tests, given more time on the project.
- Implement an `outbox`. The idea here is to:
  - Take the stats and insert them into the `outbox` table. Also add a `topic_name` to this data, and a boolean flag that tells whether the data has been published to Kafka or not.
  - Create a common producer app that runs periodically: it fetches the unpublished data from `outbox` and sends it to Kafka.
  - Finally, update the boolean flag in the outbox table to indicate that the data has been sent to Kafka.

This approach is highly scalable when we have multiple apps and each app has its own Kafka topic to send data to. Following the outbox approach, there is no need for each team to implement its own Kafka producer. Since the outbox table will contain a column called `topic_name`, we can have one single producer that (a) picks up data from the outbox and sends it to Kafka; and (b) updates the boolean flag for that row in the table, to indicate that the data has been sent.
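The outbox idea above can be sketched in a few lines. This is an illustration of the pattern, not an implemented part of the project; `sqlite3` stands in for Postgres, and the `send` callable stands in for a Kafka producer:

```python
import sqlite3

# In-memory stand-in for the real database; table/column names follow
# the description above but are otherwise assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    topic_name TEXT,
    payload TEXT,
    published INTEGER DEFAULT 0)""")

def enqueue(topic_name, payload):
    """Step 1: insert the stats with their topic and an unpublished flag."""
    conn.execute("INSERT INTO outbox (topic_name, payload) VALUES (?, ?)",
                 (topic_name, payload))

def publish_pending(send):
    """Steps 2 and 3: fetch unpublished rows, hand each to send(topic,
    payload), then flip the published flag. Returns rows published."""
    rows = conn.execute(
        "SELECT id, topic_name, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        send(topic, payload)
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```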
If you would like to talk more about the project and its architecture, or if you face any issues while running it, feel free to contact me (email mentioned below).
With best regards,
Rahul Singh
rahul.beck@gmail.com
