How to access all the content partners data

Prerequisites

Have Python 3.11 installed on your system (Windows or Linux).

Setup

Copy the '.env-template' file and rename the copy to '.env'.
In the newly created '.env' file:
- Fill in the USER_EMAIL and PASSWORD fields with the credentials linked to your CP account.

  (If needed, these credentials can be found in the hasura-prd-secrets secret in Kubernetes. The credentials are not used there; they were only added to that secret so that they are stored somewhere.)

Run the following commands in a terminal:

Linux:

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip wheel
python -m pip install -r requirements.txt

Windows:

py -3.11 -m venv .venv
.venv\Scripts\activate
python -m pip install -U pip wheel
python -m pip install -r requirements.txt
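
Before running the scripts, it can help to confirm that the '.env' file actually contains both fields. The sketch below is only an illustration: it assumes the file uses plain KEY=VALUE lines, and the helper names parse_env and missing_keys are hypothetical, not part of this repository.

```python
from pathlib import Path

# The two fields the setup instructions ask you to fill in.
REQUIRED_KEYS = ("USER_EMAIL", "PASSWORD")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env_text: str) -> list:
    """Return the required keys that are absent or empty."""
    values = parse_env(env_text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file found; copy .env-template first.")
    else:
        missing = missing_keys(env_path.read_text(encoding="utf-8"))
        print("Missing or empty fields:", missing if missing else "none")
```

This check only looks at the file's contents; it does not verify that the credentials are valid for your CP account.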

Run the example scripts

Download metadata

The download_meemoo_ai_data.py script is meant to be run once to pull all the data in JSON format. For each of the following tables, the data is saved in the ./output/<table_name> folder:

face_(image)_tasks: all face tasks where faces/persons are detected in videos(/images) (runs only once per fragment)
face_(image)_matches: all matches of the persons detected in the videos(/images) (runs each night)
refset_persons: metadata of all persons in the refset (only metadata, no fingerprints or images)
speech_tasks: all speech tasks where speech-to-text (STT) is run on video/audio
ner_tasks: all NER tasks where NER is run on the transcriptions
speakerrecognition_tasks: all speaker recognition tasks where speakers are detected
speakerrecognition_: all matches of speakers detected.
summary_tasks: all summary tasks.

Each JSON file represents one page of data; the suffix of the file name indicates which page it is.
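
As a rough illustration of working with these paged files, the sketch below loads every JSON page found in one table folder. The function name load_pages is hypothetical, and nothing is assumed about the structure inside each page. Files are read in lexicographic file name order, which may differ from numeric page order (for example, a page 10 can sort before a page 2).

```python
import json
from pathlib import Path


def load_pages(table_dir) -> list:
    """Return the parsed content of every .json page in a table folder."""
    pages = []
    # sorted() gives a stable, lexicographic order by file name.
    for path in sorted(Path(table_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            pages.append(json.load(handle))
    return pages


if __name__ == "__main__":
    # Assumes the ./output/<table_name> layout described above.
    pages = load_pages(Path("output") / "speech_tasks")
    print(f"Loaded {len(pages)} page(s)")
```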

If you want to retrieve new data, remove the ./output folder if it exists and run the script again. The script should finish in under 5 minutes, depending on how much data has been processed.

# run this from an activated virtual environment
python download_meemoo_ai_data.py
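
The cleanup step before a fresh download could be sketched as follows. This is an assumption-laden example: the folder name is taken from the ./output/<table_name> path mentioned above, and reset_folder is a hypothetical helper, not something the repository provides.

```python
import shutil
from pathlib import Path


def reset_folder(folder) -> bool:
    """Delete the folder and everything in it; return True if it existed."""
    folder = Path(folder)
    if folder.is_dir():
        shutil.rmtree(folder)
        return True
    return False
```

Calling reset_folder("output") from the repository root and then running download_meemoo_ai_data.py would give a clean download. Note that the deletion is permanent, including any raw files stored under that folder.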

Download raw data

In addition, the download_meemoo_raw_files.py script can be run to download all the raw files for the ner_tasks and speech_tasks tables. The files are stored per task:

speech_tasks/tasks/<task_id>
- metadata.json: metadata of that individual task
- speechmatics_raw.json: raw JSON response from Speechmatics
- subtitles.srt: subtitles in SRT format
- subtitles.vtt: subtitles in VTT format
- transcription.txt: raw text of the transcription
- audio_classfication.json: metadata from the audio classification

ner_tasks/tasks/<task_id>
- metadata.json: metadata of that individual task
- textrazor_raw.json: raw TextRazor response
- processed.json: processed NER data with timestamps
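
To illustrate the layout above, the following sketch walks the speech task folders and reads metadata.json and transcription.txt for each task. The function name iter_speech_tasks is hypothetical, and the location of the speech_tasks folder (for example under ./output) is an assumption, so it is passed in as a parameter. Missing files are returned as None rather than treated as errors.

```python
import json
from pathlib import Path


def iter_speech_tasks(speech_tasks_dir):
    """Yield (task_id, metadata, transcription) for every task folder."""
    tasks_dir = Path(speech_tasks_dir) / "tasks"
    if not tasks_dir.is_dir():
        return  # nothing downloaded yet
    for task_dir in sorted(p for p in tasks_dir.iterdir() if p.is_dir()):
        meta_path = task_dir / "metadata.json"
        text_path = task_dir / "transcription.txt"
        metadata = (
            json.loads(meta_path.read_text(encoding="utf-8"))
            if meta_path.exists()
            else None
        )
        transcription = (
            text_path.read_text(encoding="utf-8") if text_path.exists() else None
        )
        # The folder name is the <task_id> from the layout above.
        yield task_dir.name, metadata, transcription
```

The same pattern would apply to ner_tasks/tasks/<task_id>, reading processed.json instead of transcription.txt.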

The script downloads ALL the raw data, so make sure that the tasks folder inside the ner_tasks and speech_tasks folders does not exist before you start. The script takes a long time because it downloads the raw files for every NER and speech task. The progress bar gives an indication of the remaining duration.

# run this from an activated virtual environment
python download_meemoo_raw_files.py
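
A quick pre-check for leftover tasks folders might look like the sketch below. It assumes that the ner_tasks and speech_tasks folders live under a common root such as ./output; that location and the helper name existing_task_folders are assumptions for illustration only.

```python
from pathlib import Path

# Tables for which the raw files script creates a tasks folder.
RAW_TABLES = ("ner_tasks", "speech_tasks")


def existing_task_folders(output_root) -> list:
    """Return the tasks folders that already exist under output_root."""
    root = Path(output_root)
    candidates = [root / table / "tasks" for table in RAW_TABLES]
    return [path for path in candidates if path.is_dir()]


if __name__ == "__main__":
    leftovers = existing_task_folders("output")
    if leftovers:
        print("Remove these folders before downloading raw files:")
        for path in leftovers:
            print(" ", path)
    else:
        print("No existing tasks folders found.")
```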
