SWI-Prolog Natural Language Processing Primitives

2 Porter Stem -- Determine stem and related routines

The library(porter_stem) library implements the stemming algorithm described in Porter, 1980, “An algorithm for suffix stripping”, Program, Vol. 14, no. 3, pp. 130-137. The library also provides some additional predicates that are commonly used in the context of stemming.

porter_stem(+In, -Stem)
Determine the stem of In. In must represent ISO Latin-1 text. The porter_stem/2 predicate first maps In to lower case, then removes all accents as in unaccent_atom/2 and finally applies the Porter stem algorithm.
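The following is a minimal usage sketch, assuming the library is installed and can be loaded with use_module/1. The helper name stem_demo/0 is hypothetical and only serves the illustration; the stems shown in the comments follow the classic examples for the Porter algorithm.

```prolog
:- use_module(library(porter_stem)).

% stem_demo/0 is a hypothetical helper, not part of the library.
stem_demo :-
    porter_stem(running, S1),        % S1 = run
    porter_stem('Connections', S2),  % input is mapped to lower case first: S2 = connect
    format("~w ~w~n", [S1, S2]).
```

Calling stem_demo prints both stems on one line.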
unaccent_atom(+In, -ASCII)
If In is general ISO Latin-1 text with accents, ASCII is unified with a plain ASCII version of the string. Note that the current version only deals with ISO Latin-1 atoms.
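As a small sketch of the accent removal described above, the query below strips the accent from a single ISO Latin-1 character. The helper name unaccent_demo/1 is hypothetical, and the accented letter is written as a character escape on the assumption that this avoids any dependency on the encoding of the source file.

```prolog
:- use_module(library(porter_stem)).

% unaccent_demo/1 is a hypothetical helper, not part of the library.
% '\u00E9' is e-acute (ISO Latin-1 code 233).
unaccent_demo(Plain) :-
    unaccent_atom('caf\u00E9', Plain).   % expected: Plain = cafe
```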
tokenize_atom(+In, -TokenList)
Break the text In into words, numbers and punctuation characters. Tokens are created according to the following rules:

[-+][0-9]+(\.[0-9]+)?([eE][-+][0-9]+)?    number
[:alpha:][:alnum:]+                       word
[:space:]+                                skipped
anything else                             single-character

Character classification is based on C-library functions such as iswalnum(). Recognised numbers are passed to Prolog read/1, supporting unbounded integers.

It is likely that future versions of this library will provide tokenize_atom/3 with additional options to modify space handling as well as the definition of words.
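The tokenisation rules above can be tried out with a sketch like the following. The helpers tokenize_demo/1, token_kinds/2 and token_kind/2 are hypothetical names introduced for this illustration; token_kinds/2 merely labels each token by its Prolog type, to make visible that numbers come back as numbers while words and punctuation come back as atoms.

```prolog
:- use_module(library(porter_stem)).
:- use_module(library(apply)).       % maplist/3

% tokenize_demo/1 is a hypothetical helper, not part of the library.
tokenize_demo(Tokens) :-
    tokenize_atom('this is a test', Tokens).   % Tokens = [this, is, a, test]

% token_kinds/2 (hypothetical) wraps every token as number(T) or atom(T).
token_kinds(Text, Kinds) :-
    tokenize_atom(Text, Tokens),
    maplist(token_kind, Tokens, Kinds).

token_kind(T, number(T)) :- number(T), !.
token_kind(T, atom(T)).
```

A query such as token_kinds('pay 42 now!', Kinds) shows how a mixed input is split into word, number and single-character tokens.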

atom_to_stem_list(+In, -ListOfStems)
Combines the three routines above, returning a list that holds an atom with the stem of each word encountered and a number for each number encountered.
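A minimal sketch of the combined routine, assuming the library is loaded as shown. The helper name stems_demo/1 is hypothetical. Both input words are inflections of "connect" in Porter's classic example, so each is expected to reduce to the same stem.

```prolog
:- use_module(library(porter_stem)).

% stems_demo/1 is a hypothetical helper, not part of the library.
stems_demo(Stems) :-
    atom_to_stem_list('connected connections', Stems).
```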

2.1 Origin and Copyright

The code is based on the original Public Domain implementation by Martin Porter as can be found at http://www.tartarus.org/martin/PorterStemmer/. The code has been modified by Jan Wielemaker. He removed all global variables to make the code thread-safe, added the unaccent and tokenize code and created the SWI-Prolog binding.