General Design
This is the general design of the evaluation system, which will accept a set of articles, summarize them, then evaluate the summarizations.
This does not cover the generation of a model for the summarizer, which will be done separately.
This describes a single doc reader instance, but if needed for performance, we could break it up into parallel instances. Once the topics are extracted, each topic can correspond to an instance that resolves the paths for its docset.
Input is the name of an XML document. Example:
<TACtaskdata year="2010" track="SUMMARIZATION" task="GUIDED" dataset="TEST">
<topic id = "D1003A" category = "4">
<title> Giant Panda </title>
<docsetA id = "D1003A-A">
<doc id = "XIN_ENG_20041019.0235" />
<doc id = "AFP_ENG_20050128.0218" />
<doc id = "XIN_ENG_20050222.0273" />
<doc id = "AFP_ENG_20050328.0133" />
</docsetA>
<docsetB id = "D1003A-B">
<!-- Ignore docsetB -->
</docsetB>
</topic>
<!-- More topics follow -->
</TACtaskdata>
- Each doc ID must be parsed in order to identify the relevant document. The file naming scheme is a bit inconsistent, so we will need special logic for converting a document ID to a fully qualified path for that document. For example:
  - APW19990421.0284 is located at /dropbox/17-18/573/AQUAINT/apw/1999/19990421_APW_ENG, under the <DOC> element with a <DOCNO> matching APW19990421.0284.
  - XIN_ENG_20050415.0040 is located at /dropbox/17-18/573/AQUAINT-2/data/xin_eng/xin_eng_200504.xml, under the <DOC> element with an id attribute of "XIN_ENG_20050415.0040".
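The reading step described above could be sketched roughly as follows. This is a minimal sketch, not the final doc reader: the function names `doc_id_to_path` and `read_topics` and the `"path"` key are hypothetical, the input is assumed to be well-formed XML, and the two path patterns are generalized only from the two examples given here. Since the naming scheme is inconsistent, other sources will likely need extra special cases.

```python
import re
import xml.etree.ElementTree as ET

AQUAINT_ROOT = "/dropbox/17-18/573/AQUAINT"
AQUAINT2_ROOT = "/dropbox/17-18/573/AQUAINT-2/data"

# e.g. XIN_ENG_20050415.0040 -> source, year, month
_AQUAINT2_ID = re.compile(r"^([A-Z]+_[A-Z]+)_(\d{4})(\d{2})\d{2}\.\d{4}$")
# e.g. APW19990421.0284 -> source, year, month+day
_AQUAINT_ID = re.compile(r"^([A-Z]+)(\d{4})(\d{4})\.\d{4}$")


def doc_id_to_path(doc_id):
    """Map a doc ID to a file path, based only on the two documented examples."""
    m = _AQUAINT2_ID.match(doc_id)
    if m:
        source, year, month = m.groups()
        source = source.lower()
        return f"{AQUAINT2_ROOT}/{source}/{source}_{year}{month}.xml"
    m = _AQUAINT_ID.match(doc_id)
    if m:
        source, year, month_day = m.groups()
        # Assumption: every AQUAINT file name ends in _<SOURCE>_ENG, as in the APW example.
        return f"{AQUAINT_ROOT}/{source.lower()}/{year}/{year}{month_day}_{source}_ENG"
    raise ValueError(f"Unrecognized doc ID format: {doc_id}")


def read_topics(xml_text):
    """Extract topics and their docsetA doc IDs; docsetB is ignored."""
    root = ET.fromstring(xml_text)
    topics = []
    for topic in root.iter("topic"):
        docset_a = topic.find("docsetA")
        docs = docset_a.iter("doc") if docset_a is not None else []
        topics.append({
            "id": topic.get("id"),
            "title": topic.findtext("title", default="").strip(),
            "category": topic.get("category"),
            # "path" is a hypothetical intermediate key; loading "contents" is left out here.
            "docset": [{"id": d.get("id"), "path": doc_id_to_path(d.get("id"))} for d in docs],
        })
    return {"topics": topics}
```

Keeping the path logic in one small function should make it easy to add further special cases as they turn up in the corpora.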
Output can be a Python dict formatted like the following:
{
# Other metadata tags for the dataset go here
"topics": [
{
"id": "D1003A",
"title": "Giant Panda",
"category": "4",
# Other metadata tags for the topic go here
"docset": [
{
"id": "XIN_ENG_20041019.0235",
# Other metadata tags for the document go here
"contents": "<full HTML goes here>"
},
{
"id": "AFP_ENG_20050128.0218",
"contents": "<full HTML goes here>"
},
# More docs ...
]
},
# More topics ...
]
}
Input is a document set, which is an element of "topics" from the above component, formatted as a Python dict. Example:
{
"id": "D1003A",
"title": "Giant Panda",
"category": "4",
# Other metadata tags for the topic go here
"docset": [
{
"id": "XIN_ENG_20041019.0235",
# Other metadata tags for the document go here
"contents": "<full HTML goes here>"
},
{
"id": "AFP_ENG_20050128.0218",
"contents": "<full HTML goes here>"
},
# More docs ...
]
}
Output is a plain-text summary of the document set. Quoting the Deliverable #2 assignment description:
- Each summary can be no longer than 100 words (whitespace-delimited tokens). Summaries over the size limit will be truncated.
- Each summary should be well-organized, in English, using complete sentences. It should have one sentence per line. (Other formats can be used, but require modifications to the scoring configuration.) A blank line may be used to separate paragraphs, but no other formatting is allowed (such as bulleted points, tables, bold-face type, etc.).
- Summaries should be based only on the 'A' group of documents for each of the topics in the specification file.
- All processing of documents and generation of summaries must be automatic.
- Please include a file for each summary, even if the file is empty.
- Each file will be read and assessed as a plain text file, so no special characters or markups are allowed.
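One way to stay within the constraints above is to enforce the word limit ourselves before writing each file, rather than relying on truncation by the scorer. The following is a minimal sketch under that assumption; the function name `fit_summary` is hypothetical, and it assumes the summarizer hands over an ordered list of sentence strings. It keeps whole sentences only, so the result may come in under 100 words.

```python
def fit_summary(sentences, max_words=100):
    """Keep whole sentences, in order, while the total stays within max_words.

    Words are whitespace-delimited tokens, matching the assignment's definition.
    Returns plain text with one sentence per line.
    """
    kept = []
    used = 0
    for sentence in sentences:
        # Collapse internal whitespace so each sentence occupies exactly one line.
        tokens = sentence.split()
        if not tokens:
            continue
        if used + len(tokens) > max_words:
            break
        kept.append(" ".join(tokens))
        used += len(tokens)
    return "\n".join(kept)
```

An empty sentence list yields an empty string, which can still be written out so that every topic has a file.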
A sample config file is provided at /mnt/dropbox/17-18/573/code/ROUGE/rouge_run_ex.xml.
We will need to create our own config file to point to the root directory of our model files, and replace the <PEER-ROOT> string /dropbox/14-15/573/Data/mydata with a directory based on the system's location. The last part of this path is outputs/D2. The path root can be locally configurable for testing in a developer sandbox, but can default to a common deployment location on Patas.
Additionally, the <PEER> elements should be updated to match the required summarization filename format:
You should name your output files as:
- Given topic ID e.g. D0901A
- Split into: id_part1 = D0901, and id_part2 = A
- Output file name should be: [id_part1]-A.M.100.[id_part2].[some_unique_alphanum]
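The naming rule above could be captured in a small helper so that the summarizer and the config generator agree on file names. This is a sketch only: `summary_filename` and its `unique_id` parameter are hypothetical names, and it assumes the final character of the topic ID is always the part to split off, as in the D0901A example.

```python
def summary_filename(topic_id, unique_id):
    """Build [id_part1]-A.M.100.[id_part2].[unique_id] from a topic ID."""
    if len(topic_id) < 2:
        raise ValueError(f"Topic ID too short to split: {topic_id!r}")
    if not unique_id.isalnum():
        raise ValueError(f"unique_id must be alphanumeric: {unique_id!r}")
    # Assumption: the last character is id_part2, everything before it is id_part1.
    id_part1, id_part2 = topic_id[:-1], topic_id[-1]
    return f"{id_part1}-A.M.100.{id_part2}.{unique_id}"
```

For the topic ID in the example, `summary_filename("D0901A", "1")` gives `D0901-A.M.100.A.1`.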
We can write a simple script that generates this config file, which can be run at the deployment location.
A simple Bash script is sufficient for running ROUGE, but we can use Python to run the commands if desired. Here, $CONFIG_PATH refers to the config file generated above.
The command to run is:
/dropbox/17-18/573/code/ROUGE/ROUGE-1.5.5.pl \
-e /dropbox/17-18/573/code/ROUGE/data \
-a -n 4 -x -m -c 95 -r 1000 -f A -p 0.5 -t 0 -l 100 -s \
-d $CONFIG_PATH
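If we go the Python route, the same command could be wrapped as shown below. This is a sketch rather than a settled design: the function names and the `output_path` parameter are hypothetical, and the flags are copied unchanged from the command above. Passing the arguments as a list avoids shell quoting problems with the config path.

```python
import subprocess

ROUGE_DIR = "/dropbox/17-18/573/code/ROUGE"


def build_rouge_command(config_path, rouge_dir=ROUGE_DIR):
    """Return the ROUGE command from this section as an argument list."""
    return [
        f"{rouge_dir}/ROUGE-1.5.5.pl",
        "-e", f"{rouge_dir}/data",
        "-a", "-n", "4", "-x", "-m", "-c", "95", "-r", "1000",
        "-f", "A", "-p", "0.5", "-t", "0", "-l", "100", "-s",
        "-d", config_path,
    ]


def run_rouge(config_path, output_path):
    """Run ROUGE and save its standard output to output_path (a hypothetical results file)."""
    with open(output_path, "w") as out:
        # check=True raises CalledProcessError if ROUGE exits with a non-zero status.
        subprocess.run(build_rouge_command(config_path), stdout=out, check=True)
```

Splitting command construction from execution lets the argument list be checked in a developer sandbox where ROUGE itself is not installed.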