Interface WordTokenizer

  • All Known Implementing Classes:
    AbstractWordTokenizer, DocumentWordTokenizer, FileWordTokenizer, StringWordTokenizer

    public interface WordTokenizer

    An interface for objects that take a text-based medium as input and iterate through the words in the text stored in that medium. Examples of such media include Strings, Documents, Files and TextComponents.

    After the object is instantiated, and before the first call to nextWord() is made, the following methods should throw a WordNotFoundException:
    getCurrentWordEnd(), getCurrentWordPosition(), isNewSentence() and replaceWord().

    A call to nextWord() when hasMoreWords() returns false should throw a WordNotFoundException.
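
    The contract above can be sketched with a small whitespace-based tokenizer. This is only an illustration, not one of the library's implementations: the class WhitespaceTokenizerSketch and its helper collectWords are hypothetical, WordNotFoundException is redeclared locally as an unchecked exception so that the example compiles on its own, and treating the end index as exclusive is an assumption that the interface does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the library's WordNotFoundException (assumption:
// declared here, unchecked, only so this sketch is self-contained).
class WordNotFoundException extends RuntimeException {
    WordNotFoundException(String message) {
        super(message);
    }
}

// Hypothetical tokenizer that mirrors a subset of the WordTokenizer methods.
class WhitespaceTokenizerSketch {
    private final String text;
    private int nextStart;          // start of the look-ahead word, -1 if none
    private int nextEnd;            // end (exclusive) of the look-ahead word
    private int currentStart = -1;  // -1 means nextWord() has not been called
    private int currentEnd = -1;
    private int wordCount = 0;

    WhitespaceTokenizerSketch(String text) {
        this.text = text;
        findNext(0);
    }

    // Locates the next run of non-whitespace characters at or after 'from'.
    private void findNext(int from) {
        int i = from;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
        if (i >= text.length()) {
            nextStart = -1;
            nextEnd = -1;
            return;
        }
        int j = i;
        while (j < text.length() && !Character.isWhitespace(text.charAt(j))) j++;
        nextStart = i;
        nextEnd = j;
    }

    public boolean hasMoreWords() {
        return nextStart >= 0;
    }

    public String nextWord() {
        if (!hasMoreWords()) {
            throw new WordNotFoundException("No more words found");
        }
        // The look-ahead word becomes the current word, then we look ahead again.
        currentStart = nextStart;
        currentEnd = nextEnd;
        wordCount++;
        findNext(currentEnd);
        return text.substring(currentStart, currentEnd);
    }

    public int getCurrentWordPosition() {
        if (currentStart < 0) throw new WordNotFoundException("No current word");
        return currentStart;
    }

    public int getCurrentWordEnd() {
        if (currentStart < 0) throw new WordNotFoundException("No current word");
        return currentEnd;
    }

    public int getCurrentWordCount() {
        return wordCount;
    }

    // Typical consumer loop: guard every nextWord() call with hasMoreWords().
    static List<String> collectWords(String text) {
        WhitespaceTokenizerSketch tokenizer = new WhitespaceTokenizerSketch(text);
        List<String> words = new ArrayList<>();
        while (tokenizer.hasMoreWords()) {
            words.add(tokenizer.nextWord());
        }
        return words;
    }
}
```

    Callers that always check hasMoreWords() before nextWord(), as collectWords does, should never see the exception during normal iteration.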

    Author:
    Jason Height (jheight@chariot.net.au)
    • Method Summary

      Modifier and Type Method Description
      java.lang.String getContext()
      Returns the context text that is being tokenized (should include any changes that have been made).
      int getCurrentWordCount()
      Returns the number of word tokens that have been processed so far.
      int getCurrentWordEnd()
      Returns an index representing the end location of the current word in the text.
      int getCurrentWordPosition()
      Returns an index representing the start location of the current word in the text.
      boolean hasMoreWords()
      Indicates whether there are more words left.
      boolean isNewSentence()
      Returns true if the current word is at the start of a sentence.
      java.lang.String nextWord()
      Returns the next word in the iteration.
      void replaceWord(java.lang.String newWord)
      Replaces the current word token.
    • Method Detail

      • getContext

        java.lang.String getContext()
        Returns the context text that is being tokenized (should include any changes that have been made).
        Returns:
        the text being searched.
      • getCurrentWordCount

        int getCurrentWordCount()
        Returns the number of word tokens that have been processed so far.
        Returns:
        the number of words found so far.
      • getCurrentWordEnd

        int getCurrentWordEnd()
        Returns an index representing the end location of the current word in the text.
        Returns:
        index of the end of the current word in the text.
        Throws:
        WordNotFoundException - current word has not yet been set.
      • getCurrentWordPosition

        int getCurrentWordPosition()
        Returns an index representing the start location of the current word in the text.
        Returns:
        index of the start of the current word in the text.
        Throws:
        WordNotFoundException - current word has not yet been set.
      • isNewSentence

        boolean isNewSentence()
        Returns true if the current word is at the start of a sentence.
        Returns:
        true if the current word starts a sentence.
        Throws:
        WordNotFoundException - current word has not yet been set.
      • hasMoreWords

        boolean hasMoreWords()
        Indicates whether there are more words left.
        Returns:
        true if more words can be found in the text.
      • nextWord

        java.lang.String nextWord()
        Returns the next word in the iteration. Any implementation should return the current word and then replace the current word with the next word found in the input text (if one exists).
        Returns:
        the next word in the iteration.
        Throws:
        WordNotFoundException - search string contains no more words.
      • replaceWord

        void replaceWord(java.lang.String newWord)
        Replaces the current word token.

        When a word is replaced, the WordTokenizer should reposition itself so that the words that were added are not rechecked. This is not mandatory, since some applications may not need it.

        Parameters:
        newWord - the string which should replace the current word.
        Throws:
        WordNotFoundException - current word has not yet been set.
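
        The repositioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: ReplacingTokenizerSketch and its helper fixAll are hypothetical names, words are split on whitespace only, and WordNotFoundException is redeclared locally as an unchecked exception so that the example is self-contained.

```java
// Local stand-in for the library's WordNotFoundException (assumption:
// declared here, unchecked, only so this sketch is self-contained).
class WordNotFoundException extends RuntimeException {
    WordNotFoundException(String message) {
        super(message);
    }
}

// Hypothetical tokenizer showing one way to skip replacement text.
class ReplacingTokenizerSketch {
    private final StringBuilder text;
    private int currentStart = -1;  // -1 means nextWord() has not been called
    private int currentEnd = -1;    // end (exclusive) of the current word
    private int nextStart;          // start of the look-ahead word, -1 if none
    private int nextEnd;

    ReplacingTokenizerSketch(String text) {
        this.text = new StringBuilder(text);
        findNext(0);
    }

    // Locates the next run of non-whitespace characters at or after 'from'.
    private void findNext(int from) {
        int i = from;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
        if (i >= text.length()) {
            nextStart = -1;
            nextEnd = -1;
            return;
        }
        int j = i;
        while (j < text.length() && !Character.isWhitespace(text.charAt(j))) j++;
        nextStart = i;
        nextEnd = j;
    }

    public boolean hasMoreWords() {
        return nextStart >= 0;
    }

    public String nextWord() {
        if (!hasMoreWords()) {
            throw new WordNotFoundException("No more words found");
        }
        currentStart = nextStart;
        currentEnd = nextEnd;
        findNext(currentEnd);
        return text.substring(currentStart, currentEnd);
    }

    public void replaceWord(String newWord) {
        if (currentStart < 0) {
            throw new WordNotFoundException("No current word to replace");
        }
        text.replace(currentStart, currentEnd, newWord);
        // Reposition: the current word now spans the whole replacement, and the
        // look-ahead is recomputed from its end, so words inside the
        // replacement text are never returned by nextWord().
        currentEnd = currentStart + newWord.length();
        findNext(currentEnd);
    }

    // Returns the text including every replacement made so far.
    public String getContext() {
        return text.toString();
    }

    // Usage helper: replaces every occurrence of 'wrong' with 'right'.
    static String fixAll(String input, String wrong, String right) {
        ReplacingTokenizerSketch tokenizer = new ReplacingTokenizerSketch(input);
        while (tokenizer.hasMoreWords()) {
            if (tokenizer.nextWord().equals(wrong)) {
                tokenizer.replaceWord(right);
            }
        }
        return tokenizer.getContext();
    }
}
```

        Because the look-ahead is recomputed from the end of the replacement, a multi-word replacement such as "the big" is skipped as a unit and iteration resumes with the word that followed the original.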