Class RegexFilter

  • All Implemented Interfaces:
    java.io.Serializable, java.lang.Cloneable, NodeFilter

    public class RegexFilter
    extends java.lang.Object
    implements NodeFilter
    This filter accepts all string nodes matching a regular expression. Because this searches Text nodes, it is only useful for finding small fragments of text that are unlikely to be broken up by a tag. To find large fragments of text, you should convert the page to plain text with something like the StringBean and then apply the regular expression.
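The limitation described above can be illustrated without the parser at all. The following is a minimal sketch using only java.util.regex. The class name SplitTextDemo, the simplified date pattern, and the hand-written text fragments are assumptions made for illustration. The fragments stand in for the Text nodes one might get from markup such as `<p>Released 2004-<b>05</b>-12</p>`, and no HTML is actually parsed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the library.
final class SplitTextDemo {
    // Simplified yyyy-mm-dd pattern, used only for this illustration.
    static final Pattern FULL_DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // True if at least one fragment contains a match on its own,
    // which is the situation a per-node filter is in.
    static boolean anyNodeMatches(List<String> textNodes) {
        for (String node : textNodes) {
            if (FULL_DATE.matcher(node).find()) {
                return true;
            }
        }
        return false;
    }

    // True if the fragments match once joined into one plain-text string.
    static boolean plainTextMatches(List<String> textNodes) {
        return FULL_DATE.matcher(String.join("", textNodes)).find();
    }

    public static void main(String[] args) {
        // Assumed result of splitting the date across an inline tag.
        List<String> nodes = Arrays.asList("Released 2004-", "05", "-12");
        System.out.println(anyNodeMatches(nodes));   // false
        System.out.println(plainTextMatches(nodes)); // true
    }
}
```

No single fragment contains the whole date, so a node-by-node search misses it, while the same pattern applied to the joined text finds it.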

    For example, to look for dates use:

       (19|20)\d\d([- \\/.](0[1-9]|1[012])[- \\/.](0[1-9]|[12][0-9]|3[01]))?
     
    as in:
     Parser parser = new Parser ("http://cbc.ca");
     RegexFilter filter = new RegexFilter ("(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");
     NodeIterator iterator = parser.extractAllNodesThatMatch (filter).elements ();
     
    which matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of five separators: a dash, a space, either kind of slash, or a period. The year is matched by (19|20)\d\d, which uses alternation to allow either 19 or 20 as the first two digits. The round brackets are mandatory. The month is matched by 0[1-9]|1[012], again enclosed by round brackets to keep the two options together. By using character classes, the first option matches a number between 01 and 09, and the second matches 10, 11 or 12. The last part of the regex, 0[1-9]|[12][0-9]|3[01], matches the day and consists of three options. The first matches the numbers 01 through 09, the second 10 through 29, and the third matches 30 or 31. The day and month are optional, but must occur together because of the ()? bracketing after the year.
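To see how the expression behaves on sample input, here is a small sketch that exercises the same pattern with java.util.regex directly, without the parser. The class name DatePatternDemo, the helper firstDate, and the sample strings are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the library.
final class DatePatternDemo {
    // The date expression from the description, written as a Java string literal.
    static final Pattern DATE = Pattern.compile(
        "(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");

    // Returns the first substring matched by the pattern, or null if there is none.
    static String firstDate(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDate("Posted 2004-05-12 by staff")); // 2004-05-12
        System.out.println(firstDate("Archive 2004/05/12"));         // 2004/05/12
        System.out.println(firstDate("Copyright 1999"));             // 1999
        System.out.println(firstDate("Due 2004-13-01"));             // 2004
        System.out.println(firstDate("Call 555-1234"));              // null
    }
}
```

Because the month and day group is optional, an input with an invalid month such as 2004-13-01 still matches, but only on the year. Callers that need a full date may want to check the length of the match or test whether group 2 is non-null.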
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int FIND
      Use find() match strategy.
      static int LOOKINGAT
      Use lookingAt() match strategy.
      static int MATCH
      Use matches() match strategy.
      protected java.util.regex.Pattern mPattern
      The compiled regular expression to search for.
      protected java.lang.String mPatternString
      The regular expression to search for.
      protected int mStrategy
      The match strategy.
    • Constructor Summary

      Constructors 
      Constructor Description
      RegexFilter()
      Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.
      RegexFilter(java.lang.String pattern)
      Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.
      RegexFilter(java.lang.String pattern, int strategy)
      Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.
    • Method Summary

      Modifier and Type Method Description
      boolean accept(Node node)
      Accept string nodes that match the regular expression.
      java.lang.String getPattern()
      Get the search pattern.
      int getStrategy()
      Get the search strategy.
      void setPattern(java.lang.String pattern)
      Set the search pattern.
      void setStrategy(int strategy)
      Set the search strategy.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOOKINGAT

        public static final int LOOKINGAT
        Use lookingAt() match strategy.
        See Also:
        Constant Field Values
      • mPatternString

        protected java.lang.String mPatternString
        The regular expression to search for.
      • mPattern

        protected java.util.regex.Pattern mPattern
        The compiled regular expression to search for.
    • Constructor Detail

      • RegexFilter

        public RegexFilter()
        Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.
      • RegexFilter

        public RegexFilter(java.lang.String pattern)
        Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.
        Parameters:
        pattern - The pattern to search for.
      • RegexFilter

        public RegexFilter(java.lang.String pattern,
                           int strategy)
        Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.
        Parameters:
        pattern - The pattern to search for.
        strategy - The type of match:
        1. MATCH: uses the matches() method, which attempts to match the entire input sequence against the pattern.
        2. LOOKINGAT: uses the lookingAt() method, which attempts to match the input sequence, starting at the beginning, against the pattern.
        3. FIND: uses the find() method, which scans the input sequence looking for the next subsequence that matches the pattern.
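The three strategies are named after methods of java.util.regex.Matcher, so their differences can be sketched with the standard library alone. The class StrategyDemo, its helper methods, and the sample pattern are assumptions made for illustration. The sketch does not call RegexFilter itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it shows only the underlying Matcher methods.
final class StrategyDemo {
    // Whole input must match (the behaviour MATCH is named after).
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    // A prefix of the input must match (LOOKINGAT).
    static boolean lookingAt(String regex, String text) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    // Any subsequence of the input may match (FIND).
    static boolean find(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "\\d{4}";
        String[] samples = { "2004", "2004 budget", "budget 2004" };
        for (String s : samples) {
            System.out.println("\"" + s + "\""
                + " matches=" + matches(regex, s)
                + " lookingAt=" + lookingAt(regex, s)
                + " find=" + find(regex, s));
        }
    }
}
```

For the pattern \d{4}, the text "2004" satisfies all three methods, "2004 budget" satisfies lookingAt() and find() but not matches(), and "budget 2004" satisfies only find().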
    • Method Detail

      • getPattern

        public java.lang.String getPattern()
        Get the search pattern.
        Returns:
        Returns the pattern.
      • setPattern

        public void setPattern(java.lang.String pattern)
        Set the search pattern.
        Parameters:
        pattern - The pattern to set.
      • getStrategy

        public int getStrategy()
        Get the search strategy.
        Returns:
        Returns the strategy.
      • setStrategy

        public void setStrategy(int strategy)
        Set the search strategy.
        Parameters:
        strategy - The strategy to use. One of MATCH, LOOKINGAT or FIND.
      • accept

        public boolean accept(Node node)
        Accept string nodes that match the regular expression.
        Specified by:
        accept in interface NodeFilter
        Parameters:
        node - The node to check.
        Returns:
        true if the regular expression matches the text of the node, false otherwise.