Package org.apache.fop.util
Class CharUtilities
java.lang.Object
    org.apache.fop.util.CharUtilities
public class CharUtilities extends java.lang.Object
This class provides utilities to distinguish various kinds of Unicode whitespace and to get character widths in a given FontState.
-
-
Field Summary
Modifier and Type  Field
    Description
static char CARRIAGE_RETURN
    carriage return
static char CODE_EOT
    Character code used to signal a character boundary in inline content, such as an inline with borders and padding or a nested block object.
static int EOT
    Character class: Boundary between text runs
static char IDEOGRAPHIC_SPACE
    Ideographic space
static char LINE_SEPARATOR
    line-separator
static int LINEFEED
    Character class: Line feed
static char LINEFEED_CHAR
    linefeed character
static char LRE
    left-to-right embedding
static char LRM
    left-to-right mark
static char LRO
    left-to-right override
static char MISSING_IDEOGRAPH
    missing ideograph
static char NBSPACE
    non-breaking space
static char NEXT_LINE
    next line control character
static int NONWHITESPACE
    Character class: non-whitespace
static char NOT_A_CHARACTER
    Unicode value indicating that the character is "not a character".
static char NULL_CHAR
    null char
static char OBJECT_REPLACEMENT_CHARACTER
    Object replacement character
static char PARAGRAPH_SEPARATOR
    paragraph-separator
static char PDF
    pop directional formatting
static char RLE
    right-to-left embedding
static char RLM
    right-to-left mark
static char RLO
    right-to-left override
static char SOFT_HYPHEN
    soft hyphen
static char SPACE
    normal space
static char TAB
    normal tab
static int UCWHITESPACE
    Character class: Unicode white space
static char WORD_JOINER
    word joiner
static int XMLWHITESPACE
    Character class: XML whitespace
static char ZERO_WIDTH_JOINER
    zero-width joiner
static char ZERO_WIDTH_NOBREAK_SPACE
    zero-width no-break space (= byte order mark)
static char ZERO_WIDTH_SPACE
    zero-width space
-
Constructor Summary
Modifier  Constructor
    Description
protected CharUtilities()
    Utility class: Constructor prevents instantiating when subclassed.
-
Method Summary
Modifier and Type  Method
    Description
static java.lang.String charToNCRef(int c)
    Convert a single Unicode scalar value to an XML numeric character reference.
static int classOf(int c)
    Return the appropriate CharClass constant for the type of the passed character.
static java.lang.Iterable<java.lang.Integer> codepointsIter(java.lang.CharSequence s)
    Creates an iterator over the code points of a CharSequence.
static java.lang.Iterable<java.lang.Integer> codepointsIter(java.lang.CharSequence s, int beginIndex, int endIndex)
    Creates an iterator over the code points of a sub-CharSequence.
static boolean containsSurrogatePairAt(java.lang.CharSequence chars, int index)
    Tells whether there is a surrogate pair starting at the given index in the CharSequence.
static java.lang.String format(int c)
    Format a character for debugging output: prefixed with "0x", padded left with '0', and either 4 or 6 hex digits wide depending on whether it is in the BMP.
static int incrementIfNonBMP(int codePoint)
    Returns 1 if codePoint is not in the BMP.
static boolean isAdjustableSpace(int c)
    Method to determine if the character is an adjustable space.
static boolean isAlphabetic(int c)
    Indicates whether a character is classified as "Alphabetic" by the Unicode standard.
static boolean isAnySpace(int c)
    Determines if the character represents any kind of space.
static boolean isBmpCodePoint(int codePoint)
    Determine whether the specified character (Unicode code point) is in the Basic Multilingual Plane (BMP).
static boolean isBreakableSpace(int c)
    Helper method to determine if the character is a space with normal behavior.
static boolean isExplicitBreak(int c)
    Indicates whether the given character is an explicit break character.
static boolean isFixedWidthSpace(int c)
    Method to determine if the character is a (breakable) fixed-width space.
static boolean isNonBreakableSpace(int c)
    Method to determine if the character is a non-breaking space.
static boolean isSameSequence(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
    Determine if two character sequences contain the same characters.
static boolean isSurrogatePair(char ch)
    Determine if the given character is part of a surrogate pair.
static boolean isZeroWidthSpace(int c)
    Method to determine if the character is a zero-width space.
static java.lang.String padLeft(java.lang.String s, int width, char pad)
    Pad a string s on the left out to the given width using the padding character pad.
static java.lang.String toNCRefs(java.lang.String s)
    Convert a string to a sequence of ASCII or XML numeric character references.
-
-
-
Field Detail
-
CODE_EOT
public static final char CODE_EOT
Character code used to signal a character boundary in inline content, such as an inline with borders and padding or a nested block object.
See Also:
Constant Field Values
-
UCWHITESPACE
public static final int UCWHITESPACE
Character class: Unicode white space
See Also:
Constant Field Values
-
LINEFEED
public static final int LINEFEED
Character class: Line feed
See Also:
Constant Field Values
-
EOT
public static final int EOT
Character class: Boundary between text runs
See Also:
Constant Field Values
-
NONWHITESPACE
public static final int NONWHITESPACE
Character class: non-whitespace
See Also:
Constant Field Values
-
XMLWHITESPACE
public static final int XMLWHITESPACE
Character class: XML whitespace
See Also:
Constant Field Values
-
NULL_CHAR
public static final char NULL_CHAR
null char
See Also:
Constant Field Values
-
LINEFEED_CHAR
public static final char LINEFEED_CHAR
linefeed character
See Also:
Constant Field Values
-
CARRIAGE_RETURN
public static final char CARRIAGE_RETURN
carriage return
See Also:
Constant Field Values
-
TAB
public static final char TAB
normal tab
See Also:
Constant Field Values
-
SPACE
public static final char SPACE
normal space
See Also:
Constant Field Values
-
NBSPACE
public static final char NBSPACE
non-breaking space
See Also:
Constant Field Values
-
NEXT_LINE
public static final char NEXT_LINE
next line control character
See Also:
Constant Field Values
-
ZERO_WIDTH_SPACE
public static final char ZERO_WIDTH_SPACE
zero-width space
See Also:
Constant Field Values
-
WORD_JOINER
public static final char WORD_JOINER
word joiner
See Also:
Constant Field Values
-
ZERO_WIDTH_JOINER
public static final char ZERO_WIDTH_JOINER
zero-width joiner
See Also:
Constant Field Values
-
LRM
public static final char LRM
left-to-right mark
See Also:
Constant Field Values
-
RLM
public static final char RLM
right-to-left mark
See Also:
Constant Field Values
-
LRE
public static final char LRE
left-to-right embedding
See Also:
Constant Field Values
-
RLE
public static final char RLE
right-to-left embedding
See Also:
Constant Field Values
-
PDF
public static final char PDF
pop directional formatting
See Also:
Constant Field Values
-
LRO
public static final char LRO
left-to-right override
See Also:
Constant Field Values
-
RLO
public static final char RLO
right-to-left override
See Also:
Constant Field Values
-
ZERO_WIDTH_NOBREAK_SPACE
public static final char ZERO_WIDTH_NOBREAK_SPACE
zero-width no-break space (= byte order mark)
See Also:
Constant Field Values
-
SOFT_HYPHEN
public static final char SOFT_HYPHEN
soft hyphen
See Also:
Constant Field Values
-
LINE_SEPARATOR
public static final char LINE_SEPARATOR
line-separator
See Also:
Constant Field Values
-
PARAGRAPH_SEPARATOR
public static final char PARAGRAPH_SEPARATOR
paragraph-separator
See Also:
Constant Field Values
-
MISSING_IDEOGRAPH
public static final char MISSING_IDEOGRAPH
missing ideograph
See Also:
Constant Field Values
-
IDEOGRAPHIC_SPACE
public static final char IDEOGRAPHIC_SPACE
Ideographic space
See Also:
Constant Field Values
-
OBJECT_REPLACEMENT_CHARACTER
public static final char OBJECT_REPLACEMENT_CHARACTER
Object replacement character
See Also:
Constant Field Values
-
NOT_A_CHARACTER
public static final char NOT_A_CHARACTER
Unicode value indicating that the character is "not a character".
See Also:
Constant Field Values
-
-
Method Detail
-
classOf
public static int classOf(int c)
Return the appropriate CharClass constant for the type of the passed character.
Parameters:
c - character to inspect
Returns:
the determined character class
-
isBreakableSpace
public static boolean isBreakableSpace(int c)
Helper method to determine if the character is a space with normal behavior. Normal behavior means that it is not non-breaking.
Parameters:
c - character to inspect
Returns:
true if the character is a normal space
-
isZeroWidthSpace
public static boolean isZeroWidthSpace(int c)
Method to determine if the character is a zero-width space.
Parameters:
c - the character to check
Returns:
true if the character is a zero-width space
-
isFixedWidthSpace
public static boolean isFixedWidthSpace(int c)
Method to determine if the character is a (breakable) fixed-width space.
Parameters:
c - the character to check
Returns:
true if the character has a fixed width
-
isNonBreakableSpace
public static boolean isNonBreakableSpace(int c)
Method to determine if the character is a non-breaking space.
Parameters:
c - character to check
Returns:
true if the character is a non-breaking space
-
isAdjustableSpace
public static boolean isAdjustableSpace(int c)
Method to determine if the character is an adjustable space.
Parameters:
c - character to check
Returns:
true if the character is adjustable
-
isAnySpace
public static boolean isAnySpace(int c)
Determines if the character represents any kind of space.
Parameters:
c - character to check
Returns:
true if the character represents any kind of space
-
isAlphabetic
public static boolean isAlphabetic(int c)
Indicates whether a character is classified as "Alphabetic" by the Unicode standard.
Parameters:
c - the character
Returns:
true if the character is "Alphabetic"
-
isExplicitBreak
public static boolean isExplicitBreak(int c)
Indicates whether the given character is an explicit break character.
Parameters:
c - the character to check
Returns:
true if the character represents an explicit break
-
charToNCRef
public static java.lang.String charToNCRef(int c)
Convert a single Unicode scalar value to an XML numeric character reference. If the value is in the BMP, four digits are used; otherwise six digits are used.
Parameters:
c - a Unicode scalar value
Returns:
a string representing a numeric character reference
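The four-or-six digit rule can be illustrated with a small JDK-only sketch. This is not FOP's code: the class name NcRefSketch is hypothetical, and the hexadecimal "&#x...;" form with upper-case digits is an assumption, since the description above does not say which radix or letter case is used.

```java
final class NcRefSketch {
    // Hypothetical stand-in for CharUtilities.charToNCRef, not FOP's code.
    static String charToNCRef(int c) {
        // Four digits inside the BMP, six beyond it (per the description above).
        int digits = (c > 0xFFFF) ? 6 : 4;
        // Assumption: hexadecimal "&#x...;" form with upper-case digits.
        return String.format("&#x%0" + digits + "X;", c);
    }

    public static void main(String[] args) {
        System.out.println(charToNCRef(0x00E9));  // prints &#x00E9;
        System.out.println(charToNCRef(0x1F600)); // prints &#x01F600;
    }
}
```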
-
toNCRefs
public static java.lang.String toNCRefs(java.lang.String s)
Convert a string to a sequence of ASCII or XML numeric character references.
Parameters:
s - a Java string (encoded in UTF-16)
Returns:
a string representing a sequence of numeric character references or ASCII characters
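A rough sketch of the idea, using only the JDK, might look like the following. Several details are assumptions rather than documented behavior: that printable ASCII (0x20 to 0x7E) passes through unchanged, that everything else becomes a hexadecimal reference, and that the string is walked by code point. The class name NcRefsSketch is hypothetical.

```java
final class NcRefsSketch {
    // Hypothetical stand-in for CharUtilities.toNCRefs, not FOP's code.
    static String toNCRefs(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= 0x20 && cp < 0x7F) {
                // Assumption: printable ASCII is copied through as-is.
                sb.append((char) cp);
            } else {
                // Assumption: hex reference, 4 digits in the BMP, 6 otherwise.
                int digits = (cp > 0xFFFF) ? 6 : 4;
                sb.append(String.format("&#x%0" + digits + "X;", cp));
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNCRefs("caf\u00E9")); // prints caf&#x00E9;
    }
}
```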
-
padLeft
public static java.lang.String padLeft(java.lang.String s, int width, char pad)
Pad a string s on the left out to the given width using the padding character pad.
Parameters:
s - string to pad
width - width of field to add padding
pad - character to use for padding
Returns:
padded string
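As an illustration, left padding can be sketched in a few lines of plain Java. The class name PadLeftSketch is hypothetical, and returning the input unchanged when it is already at least width characters long is an assumption; the description above does not cover that case.

```java
final class PadLeftSketch {
    // Hypothetical stand-in for CharUtilities.padLeft, not FOP's code.
    static String padLeft(String s, int width, char pad) {
        StringBuilder sb = new StringBuilder();
        // Prepend pad characters until the total length reaches width.
        for (int i = s.length(); i < width; i++) {
            sb.append(pad);
        }
        // Assumption: strings already at least width long come back unchanged.
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(padLeft("2A", 4, '0')); // prints 002A
    }
}
```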
-
format
public static java.lang.String format(int c)
Format a character for debugging output: the result is prefixed with "0x", padded left with '0', and either 4 or 6 hex digits wide depending on whether the character is in the BMP.
Parameters:
c - character code
Returns:
formatted character string
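The described output shape can be sketched with String.format. The class name FormatSketch is hypothetical, and the use of lower-case hex digits is an assumption, since the description does not specify letter case.

```java
final class FormatSketch {
    // Hypothetical stand-in for CharUtilities.format, not FOP's code.
    static String format(int c) {
        // 4 hex digits for BMP code points, 6 for supplementary ones.
        int digits = (c > 0xFFFF) ? 6 : 4;
        // Assumption: lower-case hex digits.
        return String.format("0x%0" + digits + "x", c);
    }

    public static void main(String[] args) {
        System.out.println(format(0x2A));    // prints 0x002a
        System.out.println(format(0x1F600)); // prints 0x01f600
    }
}
```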
-
isSameSequence
public static boolean isSameSequence(java.lang.CharSequence cs1, java.lang.CharSequence cs2)
Determine if two character sequences contain the same characters.
Parameters:
cs1 - first character sequence
cs2 - second character sequence
Returns:
true if both sequences have the same length and the same character sequence
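A content comparison like this is useful because String.equals returns false for any argument that is not a String, even when the characters match. The sketch below shows one way to compare by length and then char by char; the class name SameSequenceSketch is hypothetical and this is not FOP's code.

```java
final class SameSequenceSketch {
    // Hypothetical stand-in for CharUtilities.isSameSequence, not FOP's code.
    static boolean isSameSequence(CharSequence cs1, CharSequence cs2) {
        if (cs1.length() != cs2.length()) {
            return false; // different lengths can never match
        }
        for (int i = 0; i < cs1.length(); i++) {
            if (cs1.charAt(i) != cs2.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CharSequence builder = new StringBuilder("abc");
        System.out.println("abc".equals(builder));          // prints false
        System.out.println(isSameSequence("abc", builder)); // prints true
    }
}
```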
-
isBmpCodePoint
public static boolean isBmpCodePoint(int codePoint)
Determine whether the specified character (Unicode code point) is in the Basic Multilingual Plane (BMP). Such code points can be represented using a single char.
Parameters:
codePoint - the character (Unicode code point) to be tested
Returns:
true if the specified code point is between Character#MIN_VALUE and Character#MAX_VALUE inclusive; false otherwise
See Also:
Character.isBmpCodePoint(int) from Java 1.7
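One common way to express this range test is a single shift, sketched below next to the JDK's own Character.isBmpCodePoint for comparison. The class name BmpSketch is hypothetical, and the shift is only one possible implementation, not necessarily the one FOP uses.

```java
final class BmpSketch {
    // Hypothetical stand-in for CharUtilities.isBmpCodePoint, not FOP's code.
    static boolean isBmpCodePoint(int codePoint) {
        // True exactly when the value fits in 16 bits (0x0000 to 0xFFFF).
        return codePoint >>> 16 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBmpCodePoint(0xFFFF));            // prints true
        System.out.println(isBmpCodePoint(0x10000));           // prints false
        System.out.println(Character.isBmpCodePoint(0x10000)); // prints false
    }
}
```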
-
incrementIfNonBMP
public static int incrementIfNonBMP(int codePoint)
Returns 1 if codePoint is not in the BMP. This function is particularly useful in for loops over strings where, in the presence of surrogate pairs, you need to skip one iteration.
Parameters:
codePoint - the code point to test
Returns:
1 if codePoint > 0xFFFF, 0 otherwise
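The loop pattern mentioned above might look like the following sketch. The helper countCodePoints and the class name IncrementSketch are hypothetical and exist only to show the extra index bump that skips the low surrogate.

```java
final class IncrementSketch {
    // Hypothetical stand-in for CharUtilities.incrementIfNonBMP.
    static int incrementIfNonBMP(int codePoint) {
        return codePoint > 0xFFFF ? 1 : 0;
    }

    // Hypothetical helper: counts code points with an ordinary char-indexed loop.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            int cp = s.codePointAt(i);
            count++;
            // A supplementary code point occupies two chars, so skip the second one.
            i += incrementIfNonBMP(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (a surrogate pair), 'b'
        System.out.println(countCodePoints("a\uD83D\uDE00b")); // prints 3
    }
}
```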
-
isSurrogatePair
public static boolean isSurrogatePair(char ch)
Determine if the given character is part of a surrogate pair.
Parameters:
ch - character to be checked
Returns:
true if ch is a high surrogate or a low surrogate
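The documented return value maps directly onto two JDK predicates, as in this sketch. The class name SurrogateSketch is hypothetical and the body is an illustration rather than FOP's source.

```java
final class SurrogateSketch {
    // Hypothetical stand-in for CharUtilities.isSurrogatePair, not FOP's code.
    static boolean isSurrogatePair(char ch) {
        // Either half of a pair counts; the check looks at one char only.
        return Character.isHighSurrogate(ch) || Character.isLowSurrogate(ch);
    }

    public static void main(String[] args) {
        System.out.println(isSurrogatePair('\uD83D')); // prints true
        System.out.println(isSurrogatePair('A'));      // prints false
    }
}
```

Note that despite the method name, the test is applied to a single char, so it cannot tell whether the other half of the pair is actually present.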
-
containsSurrogatePairAt
public static boolean containsSurrogatePairAt(java.lang.CharSequence chars, int index)
Tells whether there is a surrogate pair starting at the given index in the CharSequence. If the character at index is a high surrogate, then the character at index+1 is checked to be a low surrogate. If a malformed surrogate pair is encountered, an IllegalArgumentException is thrown.
high surrogate: [0xD800 - 0xDBFF]
low surrogate: [0xDC00 - 0xDFFF]
Parameters:
chars - CharSequence to check
index - index in the CharSequence where to start the check
Returns:
true if there is a well-formed surrogate pair at index
Throws:
java.lang.IllegalArgumentException - if surrogate pairs are used incorrectly
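The check described above can be sketched as follows. The class name SurrogatePairAtSketch and the exception messages are hypothetical, and treating an isolated low surrogate at index as malformed is an assumption; the description only spells out the high-surrogate case.

```java
final class SurrogatePairAtSketch {
    // Hypothetical stand-in for CharUtilities.containsSurrogatePairAt.
    static boolean containsSurrogatePairAt(CharSequence chars, int index) {
        char ch = chars.charAt(index);
        if (Character.isHighSurrogate(ch)) {
            // A high surrogate must be followed by a low surrogate.
            if (index + 1 < chars.length()
                    && Character.isLowSurrogate(chars.charAt(index + 1))) {
                return true;
            }
            throw new IllegalArgumentException(
                    "isolated high surrogate at index " + index);
        }
        if (Character.isLowSurrogate(ch)) {
            // Assumption: a low surrogate with no preceding high one is malformed.
            throw new IllegalArgumentException(
                    "isolated low surrogate at index " + index);
        }
        return false; // ordinary BMP character
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";
        System.out.println(containsSurrogatePairAt(s, 0)); // prints false
        System.out.println(containsSurrogatePairAt(s, 1)); // prints true
    }
}
```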
-
codepointsIter
public static java.lang.Iterable<java.lang.Integer> codepointsIter(java.lang.CharSequence s)
Creates an iterator over the code points of a CharSequence.
Parameters:
s - CharSequence to iterate over
Returns:
code point iterator for the given CharSequence
See Also:
codepointsIter(CharSequence, int, int)
-
codepointsIter
public static java.lang.Iterable<java.lang.Integer> codepointsIter(java.lang.CharSequence s, int beginIndex, int endIndex)
Creates an iterator over the code points of a sub-CharSequence.
Parameters:
s - CharSequence to iterate over
beginIndex - lower range
endIndex - upper range
Returns:
code point iterator for the given sub-CharSequence
See Also:
Bug JDK-5003547
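To show what such an Iterable can look like, here is a JDK-only sketch of both overloads. It is not FOP's implementation: the class name CodePointsSketch is hypothetical, and treating beginIndex as inclusive and endIndex as exclusive (as in CharSequence.subSequence) is an assumption, since the parameter descriptions above do not say.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class CodePointsSketch {
    // Hypothetical stand-in for the three-argument codepointsIter.
    static Iterable<Integer> codepointsIter(CharSequence s, int beginIndex, int endIndex) {
        // Assumption: beginIndex inclusive, endIndex exclusive.
        final CharSequence sub = s.subSequence(beginIndex, endIndex);
        return () -> new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < sub.length();
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int cp = Character.codePointAt(sub, next);
                next += Character.charCount(cp); // 1 char in the BMP, 2 otherwise
                return cp;
            }
        };
    }

    // Hypothetical stand-in for the one-argument overload.
    static Iterable<Integer> codepointsIter(CharSequence s) {
        return codepointsIter(s, 0, s.length());
    }

    public static void main(String[] args) {
        for (int cp : codepointsIter("a\uD83D\uDE00b")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

Returning an Iterable rather than an Iterator lets callers use the enhanced for loop directly, as the main method does.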
-
-