jumanji27/data-engineer-test-task


Start the database containers:

docker-compose up --build --force-recreate -d hb_db wwc_db
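One way to tell when the databases are accepting connections is to poll their TCP ports. The following is a minimal Python sketch of that idea, not part of the repository: `wait_for_port` is a hypothetical helper, and the host and port values you pass to it are assumed to come from the dbs list in ./configs/normalizer.yml. An open port only shows that the server is listening, so treat it as a rough readiness signal.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Hypothetical helper for illustration; returns True once a connection
    is accepted and False if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused, timeout)
            # while the server is not yet listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 5432)` would block for up to a minute until something listens on that port; 5432 here is only a placeholder, not a value taken from the project's configuration.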

When they are ready, run the parser container (the normalizer service):

docker-compose up --build --force-recreate normalizer

To change the game type, edit the type field in ./configs/normalizer.yml and run the container again.

For database connection details, see the dbs list in ./configs/normalizer.yml.

To run the tests (this works only after the normalizer has run):

docker-compose up --build --force-recreate tests

Important notes:

  • For the hb dataset, the user's location could be detected by IP using an external service (ip2location.com, whatismyipaddress.com). I deliberately didn't do this because it's inaccurate, requires external dependencies, and I don't think it's part of the task.
  • I used only data tests to check my code. I think that's enough to verify the correctness of the codebase in this case. I deliberately omitted unit tests here.
  • A data pipeline for user events is a completely different task. For that type of data, I would prefer a time series database such as TimescaleDB, InfluxDB, or Amazon Timestream. They can easily write a massive number of time-indexed records (like events), but they're slow on complicated queries when we want to read and aggregate the data. If we want fast performance in both reading and writing, we need to set up something like HDFS and MapReduce.
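
The data tests mentioned above check properties of the normalized output rather than individual functions. The following is a minimal Python sketch of that idea, using an in-memory SQLite table as a stand-in for the real databases. The function `run_data_checks`, the table and column names, and the specific checks are all hypothetical and are not taken from the repository's test suite.

```python
import sqlite3


def run_data_checks(conn: sqlite3.Connection, table: str, key: str, required: list) -> list:
    """Return a list of failure messages; an empty list means every check passed.

    Hypothetical illustration. `table`, `key`, and `required` are interpolated
    into SQL, so they must be trusted identifiers, never user input.
    """
    failures = []
    cur = conn.cursor()

    # Check 1: the table contains at least one row.
    total = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total == 0:
        failures.append(f"{table} is empty")

    # Check 2: the key column has no duplicate values.
    dupes = cur.execute(
        f"SELECT COUNT(*) FROM "
        f"(SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    if dupes:
        failures.append(f"{table}.{key} has {dupes} duplicated value(s)")

    # Check 3: required columns contain no NULLs.
    for col in required:
        nulls = cur.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
        if nulls:
            failures.append(f"{table}.{col} has {nulls} NULL value(s)")

    return failures
```

Checks like these can only run against populated tables, which is consistent with the tests working only after the normalizer has run.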

About

An interview test task 03.2023
