How to extract the phonemes? #11

@thomas-endres-tng

Unfortunately, your documentation on phonemes gives no details beyond a link to CMU Sphinx.

I did a bit of research and ended up with the following code:

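import json
import wave

import pocketsphinx as ps
from pocketsphinx import Decoder

ASSUMED_FRAME_RATE = 30  # video frames per second

def create_phoneme(audio_wave_file):
    with wave.open(audio_wave_file, "rb") as audio:
        # Phoneme-level (allphone) decoding with the en-us phone language model.
        decoder = Decoder(samprate=audio.getframerate(), allphone=ps.get_model_path("en-us/en-us-phone.lm.bin"))
        decoder.start_utt()
        # Pass only the PCM samples; PocketSphinx expects 16-bit mono audio.
        decoder.process_raw(audio.readframes(audio.getnframes()), full_utt=True)
        decoder.end_utt()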

    input_phoneme_list = []
    if decoder.hyp():
        segments = decoder.seg()
        for seg in segments:
            input_phoneme_list.append({'phone': seg.word, 'phone_end_frame': seg.end_frame})
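    else:
        raise RuntimeError('Phoneme recognition failed')

    # Segment end frames are in decoder frames (100 per second by default); convert to video frames.
    total_number_of_frames_in_audio = int(input_phoneme_list[-1]['phone_end_frame'] / 100 * ASSUMED_FRAME_RATE)
    print(total_number_of_frames_in_audio)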

    frame_index = 0
    phone_list = []
    phone_index = 0

    while frame_index < total_number_of_frames_in_audio:
        if (frame_index * 100 / ASSUMED_FRAME_RATE) < input_phoneme_list[phone_index]['phone_end_frame']:
            phone_list.append(input_phoneme_list[phone_index]['phone'])
            frame_index += 1
        else:
            phone_index += 1

    # Map phone labels to the integer indices defined in phindex.json.
    with open("phindex.json") as f:
        ph2index = json.load(f)
    phonemes = []
    for p in phone_list:
        if p in ph2index:
            phonemes.append(ph2index[p])
        else:
            print(f"Unknown phoneme found: {p}. Using silence instead.")
            phonemes.append(31)  # Silence

    phone_list = phonemes

    print("Phoneme generation done")

    return phone_list
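
To make the alignment step easier to check on its own, here is a minimal sketch of the same resampling logic without the decoder. The function name `expand_segments`, the `(phone, end_frame)` tuple layout, and the example segments are made up for illustration; the 100 decoder frames per second is the same assumption as in the code above.

```python
DECODER_FRAMES_PER_SECOND = 100  # assumption: same constant as in the code above


def expand_segments(segments, video_fps, decoder_fps=DECODER_FRAMES_PER_SECOND):
    """Repeat each phone label once per video frame it covers.

    segments: list of (phone, end_frame) tuples with end_frame given in
    decoder frames, sorted in ascending order (hypothetical layout).
    """
    if not segments:
        return []

    # Total number of video frames spanned by the decoded audio.
    total_video_frames = int(segments[-1][1] / decoder_fps * video_fps)

    labels = []
    seg_index = 0
    for frame in range(total_video_frames):
        # Position of this video frame, expressed in decoder frames.
        position = frame * decoder_fps / video_fps
        # Advance to the segment that contains this position.
        while position >= segments[seg_index][1]:
            seg_index += 1
        labels.append(segments[seg_index][0])
    return labels


# Made-up segments: three phones, each 10 decoder frames (0.1 s) long.
example = [("SIL", 10), ("HH", 20), ("AH", 30)]
print(expand_segments(example, video_fps=30))
```

With these made-up segments each phone covers three video frames at 30 fps, so any mismatch with the samples would have to come from the decoder output or the frame rate rather than from this loop.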

I'm using the phindex.json file from https://github.com/FuxiVirtualHuman/AAAI22-one-shot-talking-face/blob/main/phindex.json and an ASSUMED_FRAME_RATE of 30 (this seems to match the number of phonemes in your samples better than the 25 referenced in the papers).
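
For what it's worth, the frame rate guess can be checked roughly like this: divide the number of phoneme labels in a sample by the duration of the matching wave file. This is only a sketch; `implied_frame_rate` and `wav_duration_seconds` are hypothetical helper names, and it assumes a sample has exactly one label per video frame.

```python
import wave


def wav_duration_seconds(wav_path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(wav_path, "rb") as audio:
        return audio.getnframes() / audio.getframerate()


def implied_frame_rate(num_phoneme_labels, duration_seconds):
    """Labels per second, assuming one phoneme label per video frame."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return num_phoneme_labels / duration_seconds
```

For example, a sample with 60 labels for a 2 second clip would imply 30 frames per second, while 50 labels would imply 25.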

However, my phonemes look quite different from the ones in your samples for the sample wave files. What am I doing wrong?
