
Sentence tokenizer not working on Full stop  #76

@shivambatra76


I passed the following input to clean_text_by_sentences, which I imported as:

from summa.preprocessing.textcleaner import clean_text_by_sentences as _clean_text_by_sentences

text='''Ad sales boost Time Warner profit
Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.
'''
This is the output I received after preprocessing. As you can see, the second unit should have been split at the full stop in "year-earlier.The firm", where there is no space after the full stop. Instead, the text is only split at the line break and at full stops that are followed by a space.

[Original unit: 'Ad sales boost Time Warner profit' --- Processed unit: 'ad sale boost time warner profit',
Original unit: 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales.' --- Processed unit: 'quarter profit media giant timewarn jump bn £m month decemb m year earlier firm biggest investor googl benefit sale high speed internet connect higher advert sale',
Original unit: 'TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.' --- Processed unit: 'timewarn said fourth quarter sale rose bn bn',
Original unit: 'Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.' --- Processed unit: 'profit buoy gain offset profit dip warner bros user aol']
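
For anyone hitting the same thing, here is a minimal workaround sketch, assuming the splitter only breaks sentences where a full stop is followed by whitespace (which is what the output above suggests; I have not confirmed this against summa's source). The helper name add_space_after_full_stop is hypothetical and not part of summa, and the re.split call is only a stand-in for a whitespace-based splitter, not summa's actual pattern. The idea is to insert the missing space before the text reaches the tokenizer, while leaving decimals such as $1.13bn alone.

```python
import re

# Hypothetical helper (not part of summa): matches a full stop that sits
# directly between a lowercase letter and an uppercase letter, e.g. the
# one in "year-earlier.The". Decimals such as "$11.1bn" do not match
# because the character before the full stop is a digit.
_MISSING_SPACE = re.compile(r'(?<=[a-z])\.(?=[A-Z])')


def add_space_after_full_stop(text):
    """Insert a space after full stops glued to the next sentence."""
    return _MISSING_SPACE.sub('. ', text)


sample = "from $639m year-earlier.The firm rose 2% to $11.1bn from $10.9bn."
fixed = add_space_after_full_stop(sample)

# Stand-in splitter (an assumption, not summa's regex): break only where
# sentence punctuation is followed by whitespace.
before = re.split(r'(?<=[.!?])\s+', sample)
after = re.split(r'(?<=[.!?])\s+', fixed)

print(len(before), before)
print(len(after), after)
```

The cleaned string can then be passed to _clean_text_by_sentences in place of the raw text. Note that this heuristic would also add a space inside tokens that legitimately contain a lowercase letter, a full stop, and an uppercase letter with no space, so it may need adjusting for other inputs.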
