
Using FlexiTerm

Unai Lopez edited this page Jul 5, 2018 · 2 revisions

Software requirements to run FlexiTerm:

  • Java version "1.6.0" or above, for example:
    • Java(TM) SE Runtime Environment (build 1.6.0-b105)
    • Java HotSpot(TM) Client VM (build 1.6.0-b105, mixed mode)

How to run FlexiTerm:

  1. Place the plain text files to be processed into the "text" folder.
  2. OPTIONAL: Replace the file stoplist.txt in the "resources" folder with your own, if needed.
  3. From the command line, run FlexiTerm.bat (Windows) or FlexiTerm.sh (Unix/Linux), located in the "script" folder.
  4. Check the results in the "out" folder. They are presented in several formats: txt, csv and html.
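
The run step above can also be scripted. The following is a minimal Python sketch, not part of FlexiTerm: the function names `script_path` and `run_flexiterm` and the installation folder passed as `root` are hypothetical, and it assumes the launcher scripts expect to be started from inside the "script" folder.

```python
import platform
import subprocess
from pathlib import Path


def script_path(root, system=None):
    """Return the launcher script for the given OS name (defaults to the current OS)."""
    system = system or platform.system()
    name = "FlexiTerm.bat" if system == "Windows" else "FlexiTerm.sh"
    return Path(root) / "script" / name


def run_flexiterm(root):
    """Run the launcher from inside the "script" folder and return its exit code."""
    script = script_path(root).resolve()
    # Assumption: the script expects "script" as its working directory.
    if script.suffix == ".bat":
        command = ["cmd", "/c", script.name]
    else:
        command = ["sh", script.name]
    return subprocess.run(command, cwd=script.parent).returncode
```

Calling `run_flexiterm("path/to/FlexiTerm")` would then start the run, after which the results should appear in the "out" folder as described in step 4.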

Folder structure:

  • bin: Binary (Java .class) files
  • config: Contains "settings.txt" file with configuration options for FlexiTerm
  • lib: External libraries required by FlexiTerm
  • out: Output files
  • resources: Contains text resources required by FlexiTerm, including:
    • resources/dict: WordNet files used by WordNet.java
    • resources/models: Models used by the Stanford CoreNLP
    • resources/stoplist.txt : Stoplist used to filter out stopwords
    • resources/dictionary.txt : A list of distinct tokens used as a dictionary by Jazzy to suggest similar tokens
  • script: Windows/Unix scripts to run FlexiTerm
  • src: Source (.java) files
  • text: Input text files
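
As a quick sanity check before running, the layout above can be verified with a short Python sketch. This helper is hypothetical and not shipped with FlexiTerm. It only checks the top-level folders, and it assumes all of them exist before the first run, which may not hold for "out".

```python
from pathlib import Path

# Top-level folders named in the folder structure list.
EXPECTED_FOLDERS = ["bin", "config", "lib", "out", "resources", "script", "src", "text"]


def missing_folders(root):
    """Return the expected folders that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]
```

An empty list from `missing_folders("path/to/FlexiTerm")` means every listed folder is present.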

Output files format:

  • output.csv : A table of results: Rank | Term | Score | Frequency
  • output.html : A table of results: ID | Term variants | Score | Rank
  • output.txt : A list of recognised term variants ordered by their scores.
  • output.mixup : A Mixup file used by MinorThird to annotate term occurrences in text.
  • text.html : Input text annotated with occurrences of terms listed in output.html.
  • log.txt : Listing output used for debugging.
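
To post-process the results, output.csv can be loaded with a few lines of Python. The sketch below rests on assumptions the list above does not confirm: a comma delimiter, columns in the order Rank, Term, Score, Frequency, and an optional header row. The names `TermRow` and `read_terms` are hypothetical, and the sample rows are invented for illustration.

```python
import csv
import io
from typing import NamedTuple


class TermRow(NamedTuple):
    rank: int
    term: str
    score: float
    frequency: int


def read_terms(lines, delimiter=","):
    """Parse rows given in the order Rank, Term, Score, Frequency."""
    rows = []
    for record in csv.reader(lines, delimiter=delimiter):
        if len(record) < 4:
            continue  # skip blank or short lines
        try:
            rank = int(record[0])
        except ValueError:
            continue  # skip a header row, if the file has one
        rows.append(TermRow(rank, record[1], float(record[2]), int(record[3])))
    return rows


# Hypothetical sample data, not real FlexiTerm output.
sample = io.StringIO("Rank,Term,Score,Frequency\n1,example term,2.5,4\n2,other term,1.0,3\n")
terms = read_terms(sample)
```

For a real run, pass an open file instead of the sample, for example `open("out/output.csv", newline="")`, and adjust `delimiter` if the file turns out to use a different separator.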

Format of configuration file settings.txt:

  • term pattern(s)
  • stoplist
  • edit distance threshold
  • minimum term candidate frequency
  • minimum (implicit) acronym frequency
  • acronym recognition mode
  • profiling of the runtime, divided into blocks (0 is off, 1 is on)

Default parameters in settings.txt:

  • pattern = "(((((NN|JJ) )*NN) IN (((NN|JJ) )*NN))|((NN|JJ )*NN POS (NN|JJ )*NN))|(((NN|JJ) )+NN)"
  • max = 3 : Jazzy edit distance threshold: how many edit operations apart two tokens may be and still be treated as similar. Reduce to require closer matches.
  • min = 2 : Term candidate frequency threshold: occurrence > min. Increase for better precision.
  • MIN = 9 : Implicit acronym frequency threshold: occurrence > MIN. Increase for better precision.
  • acronyms = explicit : Acronyms have to be explicitly defined in text using parentheses.
  • profiling = 0 : Profiling is disabled
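
The default parameters can be explored with a small Python sketch. It is an illustration under assumptions, not FlexiTerm's implementation: the pattern is matched against a space-separated sequence of part-of-speech tags, plain Levenshtein distance stands in for Jazzy's own measure, and the helper names are hypothetical. How FlexiTerm applies the pattern internally, and whether the comparison against max is inclusive, is not stated here.

```python
import re

# Default term pattern from settings.txt, copied verbatim.
PATTERN = re.compile(
    r"(((((NN|JJ) )*NN) IN (((NN|JJ) )*NN))|((NN|JJ )*NN POS (NN|JJ )*NN))|(((NN|JJ) )+NN)"
)


def matches_pattern(tags):
    """True if the whole space-separated tag sequence matches the pattern."""
    return PATTERN.fullmatch(" ".join(tags)) is not None


def edit_distance(a, b):
    """Plain Levenshtein distance (a stand-in; Jazzy's own measure may differ)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def passes_frequency(occurrences, threshold=2):
    """Strict comparison, as in 'occurrence > min'."""
    return occurrences > threshold
```

Under these assumptions, an adjective followed by a noun (`["JJ", "NN"]`) matches the pattern while a single noun (`["NN"]`) does not. A candidate seen exactly 2 times fails the default min = 2 because the comparison is strict.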
