mlfaker

[Put CI Badge Here]

Note that this is a stub README that contains boilerplate for many of the common operations done inside an ML repo. You should customize it appropriately for your specific project.

Quickly and easily mock data in pandas for various ML applications

Setup

Basics

Clone the code to your machine using the standard Git clone command. If you have SSH keys set up, the command is:

git clone git@github.com:manifoldai/mlfaker.git

Project goal

This project serves two primary goals:

Make it easy to mock data to accelerate development for collaborative ML teams (think about mocking APIs before the backend is built)
Generate structured data where the user provides the causal structure between variables

Initially, the generated data is intended to fit in a pandas DataFrame on a single node (under 10 GB). As the project matures, we can start thinking about mocking big data using dask/spark.

What is this project like?

At its core, this project is like an extension of faker for machine learning purposes. It exists because faker has two limitations:

It doesn't easily support pandas DataFrames, a data scientist's workhorse
It doesn't allow for complicated relationships between variables, which is the whole point of studying data

Now, sklearn has a basic tool for this. While that tool can be considered a subset of this package, it only supports numpy arrays, whereas here the primary data structure is a pandas DataFrame. We aim to extend it so that users can generate data with more control over the underlying generative process, as well as complicated relationships between variables. We frame all of these data-generating processes in terms of DAGs, which we find to be a useful framing when considering causal effects in your datasets.
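To make the numpy-versus-pandas gap concrete, here is a minimal sketch. The text above does not name the sklearn tool; we assume it is one of the generators in sklearn.datasets, and use make_regression as a representative. The column names are our own choice, not something sklearn or mlfaker provides.

```python
import pandas as pd
from sklearn.datasets import make_regression

# Assumption: make_regression stands in for the "basic tool" mentioned above.
# It returns plain numpy arrays: X has shape (n_samples, n_features), y has shape (n_samples,).
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)

# Wrapping the arrays by hand is the repetitive step this project wants to remove.
cols = [f"x{i}" for i in range(X.shape[1])]  # hypothetical column names
df = pd.DataFrame(X, columns=cols)
df["y"] = y
```

Note that the generator gives no control over how the features relate to one another, only over how the target depends on them.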

1. Mock data

We often find ourselves mocking tabular data in ML projects. Here are a few scenarios:

Tests
Waiting on other developers to build ETL process while building out other pieces
Test a model on noise (target independent of features) or basic linear data (y linearly dependent on features)

All of these tasks are straightforward in pandas, but time-consuming and repetitive. This project aims to make them one-liners.
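For reference, this is roughly what the noise and linear scenarios look like when written by hand with numpy and pandas. It is a sketch only: mock_noise and mock_linear are hypothetical helper names for illustration, not the mlfaker API.

```python
import numpy as np
import pandas as pd


def mock_noise(n_rows, n_features, seed=0):
    """Features and target are drawn independently, so y carries no signal."""
    rng = np.random.default_rng(seed)
    cols = [f"x{i}" for i in range(n_features)]
    df = pd.DataFrame(rng.normal(size=(n_rows, n_features)), columns=cols)
    df["y"] = rng.normal(size=n_rows)
    return df


def mock_linear(n_rows, coefs, noise_scale=0.1, seed=0):
    """Target is a linear combination of the features plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    coefs = np.asarray(coefs, dtype=float)
    cols = [f"x{i}" for i in range(len(coefs))]
    df = pd.DataFrame(rng.normal(size=(n_rows, len(coefs))), columns=cols)
    noise = rng.normal(scale=noise_scale, size=n_rows)
    df["y"] = df[cols].to_numpy() @ coefs + noise
    return df


noise_df = mock_noise(n_rows=1000, n_features=3)
linear_df = mock_linear(n_rows=1000, coefs=[2.0, -1.0, 0.5])
```

A model fit on noise_df should find nothing, while a linear model fit on linear_df should recover coefficients close to the ones passed in.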

2. Generate data from causal structure

Thinking of data in terms of generative processes, and having a playground to generate data, has immense value in ML. Let's say you have a conceptual model (i.e. a causal DAG) of your system. What does data generated from this look like? Does this look like my data? What happens if I have a collider in my system? My linear model says this effect is positive, does that match the generative process? Is conditioning on other covariates leading me to draw the wrong conclusions? These are important questions, and ones we hope you're asking yourself. Building simple models and testing them is the key to gaining intuition and understanding what's actually going on.
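The collider question can be tried out in a few lines. The sketch below uses plain numpy and pandas rather than mlfaker, and the noise scales and sample size are arbitrary choices. Here x and y are generated independently, and z is a collider (x -> z <- y).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# x and y are generated independently: neither causes the other.
x = rng.normal(size=n)
y = rng.normal(size=n)
# z is a collider: both x and y point into it.
z = x + y + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x": x, "y": y, "z": z})

# Ordinary least squares of y on x alone, then on x and z together.
target = df["y"].to_numpy()
X_alone = np.column_stack([np.ones(n), df["x"]])
X_with_z = np.column_stack([np.ones(n), df["x"], df["z"]])
coef_alone = np.linalg.lstsq(X_alone, target, rcond=None)[0]
coef_with_z = np.linalg.lstsq(X_with_z, target, rcond=None)[0]

print("effect of x on y, unadjusted:", round(coef_alone[1], 3))
print("effect of x on y, adjusted for z:", round(coef_with_z[1], 3))
```

The unadjusted coefficient sits near zero, which matches the generative process. Adjusting for the collider makes the coefficient on x strongly negative (about -0.8 in expectation for these noise scales), even though x has no effect on y. This is a case where conditioning on a covariate leads to the wrong conclusion.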

This project aims to provide a friendly interface for engineers and scientists to mock data from a causal DAG:

DAG represented as a matrix -> causal DAG -> generated data
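The pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the mlfaker implementation: we assume the matrix is a weighted adjacency matrix where adj[i, j] is the linear effect of node i on node j, and that each node is its parents' weighted sum plus Gaussian noise. The function names topological_order and sample_from_dag are hypothetical.

```python
import numpy as np
import pandas as pd


def topological_order(adj):
    """Kahn's algorithm; adj[i, j] != 0 means an edge i -> j."""
    n = adj.shape[0]
    in_degree = (adj != 0).sum(axis=0)
    ready = [i for i in range(n) if in_degree[i] == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for child in np.flatnonzero(adj[node]):
            in_degree[child] -= 1
            if in_degree[child] == 0:
                ready.append(child)
    if len(order) != n:
        raise ValueError("adjacency matrix contains a cycle, so it is not a DAG")
    return order


def sample_from_dag(adj, names, n_rows, noise_scale=1.0, seed=0):
    """Linear-Gaussian sampling: each node = weighted sum of parents + noise."""
    adj = np.asarray(adj, dtype=float)
    rng = np.random.default_rng(seed)
    data = np.zeros((n_rows, len(names)))
    # Visit parents before children so their values exist when needed.
    for node in topological_order(adj):
        noise = rng.normal(scale=noise_scale, size=n_rows)
        data[:, node] = data @ adj[:, node] + noise
    return pd.DataFrame(data, columns=names)


# Edges: x -> z (weight 2.0), x -> y (weight 1.0), z -> y (weight 0.5)
names = ["x", "z", "y"]
adj = [
    [0.0, 2.0, 1.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
df = sample_from_dag(adj, names, n_rows=5000)
```

Because the generative weights are known, you can fit a model to df and check whether it recovers them, which is the kind of playground described above.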
