
Data preprocessing error #1

@xilu0

Description


Hello, developer.
I'd like to learn from and try out this project. I pulled the Docker image and installed the dependency packages, but I ran into some errors. The first is:

with open(fileName, 'r', encoding='iso-8859-1') as f:  # TODO: Solve Iso encoding pb !
TypeError: 'encoding' is an invalid keyword argument for this function

An open() call raised this error saying the `encoding` keyword argument is invalid. I couldn't see what was wrong, so I deleted that argument and the script got past it.
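A note on that TypeError, as a minimal sketch (the file name below is a hypothetical stand-in, not a path from the project): the Python 2 built-in `open()` has no `encoding` keyword, which is exactly the TypeError shown above. `io.open()` accepts `encoding` on both Python 2 and Python 3, so swapping it in keeps the decoding behavior instead of just deleting the argument:

```python
import io

# Hypothetical stand-in for the corpus file read in textdata.py.
fileName = 'sample_lines.txt'

# Write one line containing byte 0x97, which is valid ISO-8859-1
# but not valid ASCII (the same byte the later traceback complains about).
with io.open(fileName, 'w', encoding='iso-8859-1') as f:
    f.write(u'hello \x97 world\n')

# io.open() takes `encoding` on Python 2 as well, unlike the built-in open().
with io.open(fileName, 'r', encoding='iso-8859-1') as f:
    lines = f.read().splitlines()
```

With `io.open`, each line comes back as a unicode string already decoded from ISO-8859-1, so no implicit ASCII decoding happens later.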
Afterwards, following the error prompts, I downloaded and added the corpus data, including nltk_data/tokenizers/punkt.zip. Then I hit another error when running this script:
python deepqa2/dataset/preprocesser.py
I tried both Python 2 and Python 3 with the same result. I downloaded the corpus and extracted it on Linux, and got an ASCII decode error. I can't tell what is actually being decoded. I tried reading the script; it seems to download a file, but I'm not sure whether it's that file or the corpus I added. The error output is below. If you see this, please give me some guidance. Many thanks!

root@66004f351bea:/deepqa2# python deepqa2/dataset/preprocesser.py
('Saving logs into', '/deepqa2/logs/root.log')
2017-05-22 02:11:02,781 - __main__ - INFO - Corpus Name cornell
Max Length 20
2017-05-22 02:11:02,782 - dataset.textdata - INFO - Training samples not found. Creating dataset...
Extract conversations:   3%|####2                                                                                                                                                         | 2260/83097 [00:03<02:25, 554.00it/s]
Traceback (most recent call last):
  File "deepqa2/dataset/preprocesser.py", line 42, in <module>
    main()
  File "deepqa2/dataset/preprocesser.py", line 39, in main
    'datasetTag': ''}))
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 79, in __init__
    self.loadCorpus(self.samplesDir)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 235, in loadCorpus
    self.createCorpus(cornellData.getConversations())
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 306, in createCorpus
    self.extractConversation(conversation)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 323, in extractConversation
    targetWords = self.extractText(targetLine["text"], True)
  File "/deepqa2/deepqa2/dataset/../dataset/textdata.py", line 340, in extractText
    sentencesToken = nltk.sent_tokenize(line)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 11: ordinal not in range(128)

loadCorpus
createCorpus
extractConversation

I'm not sure which of these three methods raised the error.
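A sketch of the likely cause, assuming Python 2 semantics: after the `encoding` argument was removed, the corpus lines are read as raw byte strings, and `nltk.sent_tokenize()` then implicitly decodes them with the default `ascii` codec, which fails on byte 0x97 (a dash-like character in the corpus's ISO-8859-1/Windows-1252 text). Decoding each line to unicode before tokenizing avoids the implicit ASCII decode:

```python
# Bytes as they would be read from the corpus file; 0x97 is not ASCII.
raw = b'I was \x97 surprised.'

# Explicit decode to unicode, so NLTK never falls back to the ascii codec.
line = raw.decode('iso-8859-1')

# sentencesToken = nltk.sent_tokenize(line)  # now receives unicode text
```

The same effect is achieved more cleanly at read time by keeping the `encoding='iso-8859-1'` argument via `io.open()` rather than the Python 2 built-in `open()`.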
