Parsing XML with Greek letters #2

@Turch99

Hello!
Thanks again for your script!
It works great!
However, I have a question. It is more about Python in general, so I will be grateful for any answer if you find the time.
I want to parse an XML file in Greek. Do I understand correctly that all I have to change are the word_pattern and allowed_symbols arguments?
Should the first one hold a regex pattern, and the second one the decimal code points of the allowed characters?
I experimented with different regex formulas like /^[A-Za-zΑ-Ωα-ωίϊΐόάέύϋΰήώ]+$/
or [\u0370-\u03ff\u1f00-\u1fff].
The script runs without errors, but it doesn't output any data.
My question is: am I making a mistake in the regex pattern, or is there something else in the script that I am not setting up?

My script looks like this:

from datetime import datetime
from os import listdir

from bs4 import BeautifulSoup

import frequency_analysis

start = datetime.now()
file_list = listdir('444/')

# One of the regex options I tried; I also tried others.
word_pattern = '^[A-Za-zΑ-Ωα-ωίϊΐόάέύϋΰήώ]+$'
# Decimal code points of the allowed characters; I also tried other ranges.
allowed_symbols = [*range(913, 1000)]

with frequency_analysis.Analysis(
    word_pattern=word_pattern, allowed_symbols=allowed_symbols
) as analyze:
    for n, file in enumerate(file_list):
        # Read each XML file as UTF-8 text.
        with open('444/' + file, mode='r', encoding='utf-8') as f:
            data = f.read()
        bs_data = BeautifulSoup(data, 'xml')

        # Count the words of every <s> (sentence) element.
        for sentence in bs_data.find_all('s'):
            analyze.count_all(sentence.text.split(), pos=True)
        print(n, file)

print('fin at:', datetime.now().strftime('%H:%M:%S'))
print('total time:', datetime.now() - start)
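
One way to narrow this down is to test the pattern and the code-point range from the script on their own, outside frequency_analysis. The sketch below uses only the standard `re` module; the `check` helper and the sample words are hypothetical and exist only for illustration. It relies on two facts: `range(913, 1000)` covers U+0391 to U+03E7 only, so it contains no Latin letters, no Ά (902) or ΐ (912), and nothing from the Greek Extended block (U+1F00 to U+1FFF). Whether any of this explains the empty output depends on how frequency_analysis applies the two arguments, which is not shown here.

```python
import re

# Pattern and code-point range copied from the script above.
word_pattern = '^[A-Za-zΑ-Ωα-ωίϊΐόάέύϋΰήώ]+$'
allowed_symbols = [*range(913, 1000)]

# Hypothetical sample tokens, chosen only for illustration:
# monotonic Greek, capital with tonos, polytonic Greek, plain Latin.
samples = ['λόγος', 'Άνθρωπος', 'ἄνθρωπος', 'hello']


def check(token):
    """Return (matches_pattern, code points outside allowed_symbols)."""
    matches = re.fullmatch(word_pattern, token) is not None
    outside = [ord(ch) for ch in token if ord(ch) not in allowed_symbols]
    return matches, outside


for token in samples:
    matches, outside = check(token)
    print(f'{token!r}: matches pattern = {matches}, '
          f'code points outside allowed_symbols = {outside}')
```

If a token matches the pattern but has code points outside allowed_symbols (or the other way round), the two arguments disagree about which characters are valid, and that mismatch is worth ruling out before looking elsewhere in the script.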
