GitHub - jumanji27/data-engineer-test-task: An interview test task 03.2023

Run containers with DBs:

docker-compose up --build --force-recreate -d hb_db wwc_db

When they are ready, run the parser container:

docker-compose up --build --force-recreate normalizer
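Starting the DB containers with -d does not by itself guarantee that the databases already accept connections when the normalizer starts. One way to check readiness before running the normalizer is to poll the database port. This is a minimal sketch, not part of the repository: the function name `wait_for_port` is hypothetical, and the real host and port come from the dbs list in ./configs/normalizer.yml.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: host and port are assumptions and should be taken
    from the dbs list in ./configs/normalizer.yml.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only shows that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 5432, timeout=60.0)` would block for up to a minute; the port here is a placeholder. An open port only shows that the server is listening, so a database may still need a moment before it answers queries.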

To configure the type of game, change the type field in ./configs/normalizer.yml and run the container again.

To access the DBs, see the dbs list in ./configs/normalizer.yml.

To run the tests (they work only after the normalizer has run):

docker-compose up --build --force-recreate tests

Important notes:

For the hb dataset, it is possible to detect user location by IP using an external service (ip2location.com, whatismyipaddress.com). I deliberately didn't do it because it's inaccurate, requires external dependencies, and I don't think it's part of the task.
I used only data tests to check my code. I think that is enough to verify the correctness of the codebase in this case. I deliberately omitted unit tests.
As for the data pipeline for user events, it's a completely different task. For that type of data, I would prefer a time series database like TimescaleDB, InfluxDB, or AWS Timestream. They can easily write a massive amount of time-indexed records (like events), but they're slower on complicated queries when we want to read and aggregate our data. If we want fast performance in both reading and writing, we need to set up something like HDFS and MapReduce.
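To illustrate the data tests mentioned in the notes above: a data test asserts properties of the normalized tables rather than of individual functions. The sketch below is not the repository's test suite. It uses an in-memory SQLite database as a stand-in for the real DBs, and the `users` table, its columns, and the helper names are all hypothetical.

```python
import sqlite3

# Each check is a query that counts violating rows; zero means the check passes.
# Table and column names are assumptions made for illustration only.
CHECKS = {
    "null ids": "SELECT COUNT(*) FROM users WHERE id IS NULL",
    "duplicate ids": (
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM users WHERE id IS NOT NULL GROUP BY id HAVING COUNT(*) > 1)"
    ),
    "missing country": "SELECT COUNT(*) FROM users WHERE country IS NULL OR country = ''",
}


def run_data_tests(conn: sqlite3.Connection) -> list:
    """Return the names of the failed checks; an empty list means all passed."""
    return [name for name, sql in CHECKS.items() if conn.execute(sql).fetchone()[0] > 0]


def make_demo_db(rows) -> sqlite3.Connection:
    """Build an in-memory stand-in for a normalized table (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, country TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn
```

Calling `run_data_tests(make_demo_db([(1, "DE"), (2, "US")]))` returns an empty list, while a table with a repeated id or an empty country reports the corresponding check names. Expressing each check as a "count of violations" query keeps the tests independent of how the normalizer is implemented.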
