General Design

Austin Almond edited this page Apr 19, 2018 · 27 revisions

This is the general design of the evaluation system, which will accept a set of articles, summarize them, then evaluate the summarizations.

This does not cover the generation of a model for the summarizer, which will be done separately.

Components

Doc Reader

This describes a single doc reader instance. If performance requires it, we could split the work across parallel instances: once the topics are extracted, each topic can be assigned to its own instance, which resolves the paths for that topic's docset.

Input

Input is the name of an XML document. Example:

<TACtaskdata year="2010" track="SUMMARIZATION" task="GUIDED"  dataset="TEST">

<topic id = "D1003A" category = "4">
        <title> Giant Panda </title>
        <docsetA id = "D1003A-A">
                <doc id = "XIN_ENG_20041019.0235" />
                <doc id = "AFP_ENG_20050128.0218" />
                <doc id = "XIN_ENG_20050222.0273" />
                <doc id = "AFP_ENG_20050328.0133" />
        </docsetA>
        <docsetB id = "D1003A-B">
                <!-- Ignore docsetB -->
        </docsetB>
</topic>

<!-- More topics follow -->

</TACtaskdata>

Document ID Parsing

  • Each doc ID must be parsed in order to identify the relevant document. The file naming scheme is a bit inconsistent, so we will need special logic for converting a document ID to a fully qualified path for that document. For example:
    • APW19990421.0284 is located at /dropbox/17-18/573/AQUAINT/apw/1999/19990421_APW_ENG, under the <DOC> element with a <DOCNO> matching APW19990421.0284.
    • XIN_ENG_20050415.0040 is located at /dropbox/17-18/573/AQUAINT-2/data/xin_eng/xin_eng_200504.xml, under the <DOC> element with an id attribute of "XIN_ENG_20050415.0040".
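
As a minimal sketch of the ID-to-path logic, the function below handles only the two ID shapes shown in the examples above. The names `resolve_doc_path`, `CORPUS_ROOT`, and both regular expressions are hypothetical, and the `_ENG` file suffix for the older corpus is generalized from the single APW example, so other sources will likely need their own cases.

```python
import re

# Assumed corpus root, taken from the two example paths above.
CORPUS_ROOT = "/dropbox/17-18/573"

# AQUAINT-style IDs, e.g. APW19990421.0284 (source, year, month+day, sequence).
AQUAINT_ID = re.compile(r"^([A-Z]{3})(\d{4})(\d{4})\.\d{4}$")
# AQUAINT-2-style IDs, e.g. XIN_ENG_20050415.0040 (source_lang, year+month, day, sequence).
AQUAINT2_ID = re.compile(r"^([A-Z]{3}_[A-Z]{3})_(\d{6})\d{2}\.\d{4}$")


def resolve_doc_path(doc_id: str, root: str = CORPUS_ROOT) -> str:
    """Map a document ID to the file expected to contain that document."""
    match = AQUAINT2_ID.match(doc_id)
    if match:
        source, year_month = match.group(1).lower(), match.group(2)
        return f"{root}/AQUAINT-2/data/{source}/{source}_{year_month}.xml"

    match = AQUAINT_ID.match(doc_id)
    if match:
        source, year, month_day = match.groups()
        # Assumption: every source uses the "<date>_<SOURCE>_ENG" file name
        # seen in the APW example; this may not hold for other sources.
        return f"{root}/AQUAINT/{source.lower()}/{year}/{year}{month_day}_{source}_ENG"

    raise ValueError(f"Unrecognized document ID format: {doc_id}")
```

The function returns only the file path. Finding the matching <DOC> element inside that file (by <DOCNO> or by id attribute) is a separate step.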

Output

Output can be a Python dict formatted like the following:

{
  # Other metadata tags for the dataset go here
  "topics": [
    {
      "id": "D1003A",
      "title": "Giant Panda",
      "category": "4",
      # Other metadata tags for the topic go here
      "docset": [
        {
          "id": "XIN_ENG_20041019.0235",
          # Other metadata tags for the document go here
          "contents": "<full HTML goes here>"
        },
        {
          "id": "AFP_ENG_20050128.0218",
          "contents": "<full HTML goes here>"
        },
        # More docs ...
      ]
    },
    # More topics ...
  ]
}
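
A structure like this could be built from the topic file with the standard library's ElementTree. This is only a sketch: the function name `read_topics` is hypothetical, storing the dataset-level attributes as top-level keys is an assumption, and "contents" is left as None because loading the document text is a separate step.

```python
import xml.etree.ElementTree as ET


def read_topics(xml_text: str) -> dict:
    """Parse topic XML into a dict with a "topics" list, using docsetA only."""
    root = ET.fromstring(xml_text)

    # Assumption: dataset metadata (year, track, task, dataset) is kept
    # as top-level keys next to "topics".
    dataset = dict(root.attrib)
    dataset["topics"] = []

    for topic in root.iter("topic"):
        docset_a = topic.find("docsetA")
        doc_elements = docset_a.iter("doc") if docset_a is not None else []
        dataset["topics"].append({
            "id": topic.get("id"),
            "title": (topic.findtext("title") or "").strip(),
            "category": topic.get("category"),
            # docsetB is ignored; contents are filled in later.
            "docset": [{"id": doc.get("id"), "contents": None}
                       for doc in doc_elements],
        })
    return dataset
```

To read from a file instead of a string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.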

Document Set Summarizer

Input

Input is a document set: one element of the "topics" list produced by the Doc Reader, formatted as a Python dict. Example:

{
  "id": "D1003A",
  "title": "Giant Panda",
  "category": "4",
  # Other metadata tags for the topic go here
  "docset": [
    {
      "id": "XIN_ENG_20041019.0235",
      # Other metadata tags for the document go here
      "contents": "<full HTML goes here>"
    },
    {
      "id": "AFP_ENG_20050128.0218",
      "contents": "<full HTML goes here>"
    },
    # More docs ...
  ]
}

Summarization Logic

See Summarizer Architecture

Output

Output is a plain-text summary of the document set. Quoting the Deliverable #2 assignment description:

  • Each summary can be no longer than 100 words (whitespace-delimited tokens). Summaries over the size limit will be truncated.
  • Each summary should be well-organized, in English, using complete sentences. It should have one sentence per line. (Other formats can be used, but require modifications to the scoring configuration.) A blank line may be used to separate paragraphs, but no other formatting is allowed (such as bulleted points, tables, bold-face type, etc.).
  • Summaries should be based only on the 'A' group of documents for each of the topics in the specification file.
  • All processing of documents and generation of summaries must be automatic.
  • Please include a file for each summary, even if the file is empty.
  • Each file will be read and assessed as a plain text file, so no special characters or markups are allowed.
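
Since over-length summaries are truncated anyway, the summarizer could enforce the 100-word limit itself before writing the file. The sketch below is one way to do that; the function name `truncate_summary` is hypothetical, and it counts whitespace-delimited tokens as the assignment describes while keeping the one-sentence-per-line layout.

```python
def truncate_summary(summary: str, max_words: int = 100) -> str:
    """Cut a summary to at most max_words whitespace-delimited tokens."""
    kept_lines = []
    remaining = max_words
    for line in summary.splitlines():
        if remaining <= 0:
            break
        words = line.split()
        if not words:
            # Keep blank lines, which are allowed as paragraph separators.
            kept_lines.append("")
            continue
        kept_lines.append(" ".join(words[:remaining]))
        remaining -= len(words)
    # Drop any leading or trailing blank lines left after truncation.
    return "\n".join(kept_lines).strip()
```

This cuts mid-sentence when the limit falls inside a line. Dropping the whole final sentence instead would be a design choice for the summarizer, not something the assignment text requires.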

Evaluator

Config file

A sample config file is provided at /mnt/dropbox/17-18/573/code/ROUGE/rouge_run_ex.xml.

We will need to create our own config file that points to the root directory of our model files and replaces the <PEER-ROOT> string /dropbox/14-15/573/Data/mydata with a directory based on the system's location. The last part of this path is outputs/D2. The path root can be configured locally for testing in a developer sandbox, and can default to a common deployment location on Patas.

Additionally, the <PEER> elements should be updated to match the required summarization filename format:

You should name your output files as:

  • Given topic ID e.g. D0901A
  • Split into:
    • id_part1 = D0901, and
    • id_part2 = A
  • Output file name should be: [id_part1]-A.M.100.[id_part2].[some_unique_alphanum]
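
The naming rule above can be sketched as a small helper. The function name `summary_filename` and the `unique_id` parameter are hypothetical; the split assumes the topic ID always ends in a single letter, as in D0901A.

```python
def summary_filename(topic_id: str, unique_id: str) -> str:
    """Build the output file name for a topic, e.g. D0901A -> D0901-A.M.100.A.<unique_id>."""
    if len(topic_id) < 2:
        raise ValueError(f"Topic ID too short to split: {topic_id!r}")
    # Assumption: the final character is id_part2 and the rest is id_part1.
    id_part1, id_part2 = topic_id[:-1], topic_id[-1]
    return f"{id_part1}-A.M.100.{id_part2}.{unique_id}"


print(summary_filename("D0901A", "1"))  # D0901-A.M.100.A.1
```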

We can write a simple script that generates this config file, which can be run at the deployment location.
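
One piece of such a script might look like the sketch below, which rewrites every <PEER-ROOT> value in an existing config. It assumes the sample config is well-formed XML that ElementTree can parse. The function name `set_peer_root` and the `system_root` parameter are hypothetical, and updating the <PEER> file names is not covered here.

```python
import posixpath
import xml.etree.ElementTree as ET


def set_peer_root(config_xml: str, system_root: str) -> str:
    """Return config_xml with every <PEER-ROOT> pointing at <system_root>/outputs/D2."""
    root = ET.fromstring(config_xml)
    # The design calls for the peer root to end in outputs/D2.
    peer_root = posixpath.join(system_root, "outputs", "D2")
    for element in root.iter("PEER-ROOT"):
        element.text = peer_root
    return ET.tostring(root, encoding="unicode")
```

Passing `system_root` as an argument keeps the path configurable for a developer sandbox, with the caller free to supply the common deployment location as a default.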

ROUGE script

A simple Bash script is sufficient for running ROUGE, but we can use Python to run the commands if desired. Here, $CONFIG_PATH refers to the config file generated above.

The command to run is:

/dropbox/17-18/573/code/ROUGE/ROUGE-1.5.5.pl \
  -e /dropbox/17-18/573/code/ROUGE/data \
  -a -n 4 -x -m -c 95 -r 1000 -f A -p 0.5 -t 0 -l 100 -s \
  -d "$CONFIG_PATH"
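
If we run ROUGE from Python instead, a sketch using the standard `subprocess` module could look like this. The flags are copied from the command above; the function names `build_rouge_command` and `run_rouge` and the `ROUGE_HOME` constant are hypothetical.

```python
import subprocess

# Install location taken from the command above.
ROUGE_HOME = "/dropbox/17-18/573/code/ROUGE"


def build_rouge_command(config_path: str, rouge_home: str = ROUGE_HOME) -> list:
    """Assemble the ROUGE command as an argument list (no shell quoting needed)."""
    return [
        f"{rouge_home}/ROUGE-1.5.5.pl",
        "-e", f"{rouge_home}/data",
        "-a", "-n", "4", "-x", "-m", "-c", "95", "-r", "1000",
        "-f", "A", "-p", "0.5", "-t", "0", "-l", "100", "-s",
        "-d", config_path,
    ]


def run_rouge(config_path: str) -> str:
    """Run ROUGE and return its standard output; raises if ROUGE exits non-zero."""
    result = subprocess.run(
        build_rouge_command(config_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Keeping command construction separate from execution makes the argument list easy to check in a sandbox where ROUGE is not installed.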

Sequence Diagram

[Figure: sequence diagram of the high-level architecture]
