viaacode/give-graphql-query-examples

How to access all the content partners data

Prerequisites

  • Have Python 3.11 installed on your system (Windows/Linux)

Setup

  • Copy the '.env-template' file and rename the file to '.env'

  • In the newly created '.env' file:

    • Fill in the USER_EMAIL and PASSWORD fields with the credentials linked to your CP account

      (If needed, these credentials can be found in the hasura-prd-secrets secret in Kubernetes. Nothing uses the secret there; it only serves as a place to store the credentials.)

  • Run the following commands in a terminal:

Linux:

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip wheel
python -m pip install -r requirements.txt

Windows:

py -3.11 -m venv .venv
.venv\Scripts\activate
python -m pip install -U pip wheel
python -m pip install -r requirements.txt
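Before running the scripts, it can help to confirm that the '.env' file actually contains both credentials. The sketch below is not part of the repository: `read_env_file` and `missing_credentials` are hypothetical helpers, and the parser assumes plain `KEY=VALUE` lines, which may differ from how the example scripts load the file.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_credentials(env: dict[str, str]) -> list[str]:
    """Return the required field names that are absent or empty."""
    return [name for name in ("USER_EMAIL", "PASSWORD") if not env.get(name)]
```

Calling `missing_credentials(read_env_file())` from the repository root returns an empty list when both fields are filled in.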

Run the example scripts

Download metadata

The download_meemoo_ai_data.py script is meant to be run once to pull all the data in JSON format. The data is saved in a ./output/<table_name> folder for each of the following tables:

  • face_(image)_tasks: all face tasks where faces/persons are detected in videos(/images) (runs only once per fragment)
  • face_(image)_matches: all matches of the persons detected in the videos(/images) (runs each night)
  • refset_persons: metadata of all persons in the refset (only metadata, no fingerprints or images)
  • speech_tasks: all speech tasks where speech-to-text (STT) is run on video/audio
  • ner_tasks: all NER tasks where named-entity recognition (NER) is run on the transcriptions
  • speakerrecognition_tasks: all speaker recognition tasks where speakers are detected
  • speakerrecognition_: all matches of the speakers detected
  • summary_tasks: all summary tasks

Each JSON file represents one page of data; the suffix of the file name indicates which page it contains.
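Reading a table back in page order could look like the sketch below. It is only an illustration: `page_number` and `load_pages` are hypothetical helpers, and the code assumes that the page suffix is an integer at the end of the file name (before `.json`), which the text above does not spell out. The structure inside each page is not assumed; every file is simply parsed as JSON.

```python
import json
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Extract the trailing integer from a file stem, or -1 if there is none."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else -1


def load_pages(table_dir: str) -> list:
    """Return the parsed content of every JSON page file, in page order."""
    # Sort numerically so that page 10 comes after page 2.
    files = sorted(Path(table_dir).glob("*.json"), key=page_number)
    return [json.loads(f.read_text(encoding="utf-8")) for f in files]
```

For example, `load_pages("output/speech_tasks")` would return one parsed object per page file in that folder.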

If you want to retrieve new data, remove the output folder if it exists and run the script again. The script should finish in under 5 minutes, depending on how much data has been processed.

# run this from an activated virtual environment
python download_meemoo_ai_data.py
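The "remove the folder, then re-run" step can also be scripted. This is a minimal sketch under assumptions: `remove_output_folder` and `refresh_metadata` are hypothetical helpers, the folder is assumed to be `output` in the current working directory (following the ./output/<table_name> path mentioned above), and the script is assumed to be started from the repository root inside the activated virtual environment.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def remove_output_folder(folder: str = "output") -> bool:
    """Delete the folder if it exists; return True when something was removed."""
    path = Path(folder)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


def refresh_metadata(folder: str = "output") -> None:
    """Remove old results, then run the download script with the current interpreter."""
    remove_output_folder(folder)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run([sys.executable, "download_meemoo_ai_data.py"], check=True)
```

Using `sys.executable` keeps the download inside the same virtual environment that runs the helper.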

Download raw data

In addition, the download_meemoo_raw_files.py script can be run to download all the raw files of the ner_tasks and speech_tasks tables. The files are stored per task:

  • speech_tasks/tasks/<task_id>
    • metadata.json: metadata of that individual task
    • speechmatics_raw.json: raw JSON response from Speechmatics
    • subtitles.srt: subtitles in srt format
    • subtitles.vtt: subtitles in vtt format
    • transcription.txt: raw text of transcription
    • audio_classfication.json: metadata from the audio classification
  • ner_tasks/tasks/<task_id>
    • metadata.json: metadata of that individual task
    • textrazor_raw.json: raw TextRazor response
    • processed.json: processed NER data with timestamps
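Given the folder layout listed above, walking over the downloaded speech tasks might look like this sketch. The helpers `read_speech_task` and `iter_speech_tasks` are hypothetical, the `root` argument is an assumption (the text does not state where the speech_tasks folder lives), and the code assumes every task folder contains a metadata.json while transcription.txt may be absent.

```python
import json
from pathlib import Path
from typing import Iterator


def read_speech_task(task_dir: str) -> dict:
    """Load metadata.json and, if present, transcription.txt for one task."""
    task_path = Path(task_dir)
    metadata = json.loads((task_path / "metadata.json").read_text(encoding="utf-8"))
    transcription_file = task_path / "transcription.txt"
    text = (
        transcription_file.read_text(encoding="utf-8")
        if transcription_file.exists()
        else None
    )
    return {"task_id": task_path.name, "metadata": metadata, "transcription": text}


def iter_speech_tasks(root: str) -> Iterator[dict]:
    """Yield one record per folder under <root>/speech_tasks/tasks."""
    tasks_dir = Path(root) / "speech_tasks" / "tasks"
    for task_path in sorted(p for p in tasks_dir.iterdir() if p.is_dir()):
        yield read_speech_task(str(task_path))
```

The same pattern would apply to ner_tasks/tasks/<task_id>, reading processed.json instead of transcription.txt.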

The script downloads ALL the raw data, so make sure that the tasks folder inside both the ner_tasks and speech_tasks folders doesn't exist before you start. The script takes a long time because it has to download the raw files for every NER and speech task. The progress bar gives an indication of the remaining duration.

# run this from an activated virtual environment
python download_meemoo_raw_files.py
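A small pre-flight check can confirm that no tasks folder is left over from an earlier run. This is a sketch only: `existing_task_folders` is a hypothetical helper, and `root` is an assumed parameter for the directory that contains the ner_tasks and speech_tasks folders.

```python
from pathlib import Path


def existing_task_folders(root: str) -> list[Path]:
    """Return the tasks folders that already exist under ner_tasks and speech_tasks."""
    candidates = [Path(root) / name / "tasks" for name in ("ner_tasks", "speech_tasks")]
    return [path for path in candidates if path.exists()]
```

An empty result means the raw download can start from a clean state; otherwise the returned paths are the folders to remove first.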
