com.lizardtech.djvu.text
Class DjVuText

java.lang.Object
  extended by com.lizardtech.djvu.DjVuObject
      extended by com.lizardtech.djvu.text.DjVuText
All Implemented Interfaces:
Codec, DjVuInterface

public class DjVuText
extends DjVuObject
implements Codec

This class implements annotations understood by the DjVu plugins and encoders, using the contents of TXT chunks. The contents of the FORM:TEXT should be passed to decode for parsing, which initializes this class and fills in the decoded data.

Description of the text contained in a DjVu page. This class contains the textual data for the page. It describes the text as a hierarchy of zones corresponding to page, column, region, paragraph, line, word, etc. The piece of text associated with each zone is represented by an offset and a length describing a segment of a global UTF8 encoded byteArray.

Constants are used to tell what a zone describes. This can be useful for a copy/paste application. The deeper we go into the hierarchy, the higher the constant.
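
As a rough illustration of the description above, the sketch below models a zone as an offset and a length into one shared UTF8 byte array, and collects all zones of one type beneath a parent. ZoneSketch, its fields and methods, and the numeric type values are hypothetical stand-ins and not the actual DjVuText.Zone API; the only property taken from the text is that deeper zone types use higher constants.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a text zone; this is not the real DjVuText.Zone. */
final class ZoneSketch {
    // Assumed numeric values: only the ordering (deeper zone = higher value)
    // is taken from the documentation.
    static final int PAGE = 1;
    static final int LINE = 5;
    static final int WORD = 6;

    // Sample page text; the escape \u00e9 occupies two bytes in UTF-8.
    static final byte[] SAMPLE_TEXT =
            "h\u00e9llo world\nbye".getBytes(StandardCharsets.UTF_8);

    final int type;
    final int offset; // byte offset into the shared UTF-8 array
    final int length; // byte length of this zone's text
    final List<ZoneSketch> children = new ArrayList<>();

    ZoneSketch(int type, int offset, int length) {
        this.type = type;
        this.offset = offset;
        this.length = length;
    }

    ZoneSketch add(ZoneSketch child) {
        children.add(child);
        return this;
    }

    /** Decodes this zone's segment of the shared byte array. */
    String text(byte[] utf8) {
        return new String(utf8, offset, length, StandardCharsets.UTF_8);
    }

    /** Appends every descendant of the wanted type to out, in tree order. */
    static void collect(int wantedType, ZoneSketch parent, List<ZoneSketch> out) {
        for (ZoneSketch child : parent.children) {
            if (child.type == wantedType) {
                out.add(child);
            } else if (child.type < wantedType) {
                // Only a shallower zone can contain zones of the wanted type.
                collect(wantedType, child, out);
            }
        }
    }

    /** Builds a two-line page whose offsets are byte positions in SAMPLE_TEXT. */
    static ZoneSketch samplePage() {
        ZoneSketch line1 = new ZoneSketch(LINE, 0, 12)
                .add(new ZoneSketch(WORD, 0, 6))
                .add(new ZoneSketch(WORD, 7, 5));
        ZoneSketch line2 = new ZoneSketch(LINE, 13, 3)
                .add(new ZoneSketch(WORD, 13, 3));
        return new ZoneSketch(PAGE, 0, SAMPLE_TEXT.length).add(line1).add(line2);
    }

    public static void main(String[] args) {
        List<ZoneSketch> words = new ArrayList<>();
        collect(WORD, samplePage(), words);
        for (ZoneSketch word : words) {
            System.out.println(word.text(SAMPLE_TEXT));
        }
    }
}
```

Because offsets and lengths count bytes rather than characters, the first word in the sample spans six bytes although it has five characters.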


Nested Class Summary
static class DjVuText.Zone
          Data structure representing document textual components.
 
Field Summary
static int CHARACTER
          Indicates a character zone.
static int COLUMN
          Indicates a column zone.
static int end_of_column
          VT: Vertical Tab
static int end_of_line
          LF: Line Feed
static int end_of_paragraph
          US: Unit Separator
static int end_of_region
          GS: Group Separator
 boolean isUTF8
          True if UTF8 encoded.
static int LINE
          Indicates a line zone.
static int PAGE
          Indicates a page zone.
 DjVuText.Zone page_zone
          Main zone in the document.
static int PARAGRAPH
          Indicates a paragraph zone.
static int REGION
          Indicates a region zone.
protected  byte[] textByteArray
          Textual data for this page.
static int WORD
          Indicates a word zone.
 
Fields inherited from class com.lizardtech.djvu.DjVuObject
hasReferences
 
Constructor Summary
DjVuText()
          Creates a new DjVuText object.
 
Method Summary
static DjVuText createDjVuText(DjVuInterface ref)
          Creates an instance of DjVuText with the options inherited from the specified reference.
 void decode(DataPool pool)
          Decodes the hidden text layer TXT into internal representation.
 java.util.Vector find_text_with_rect(GRect box, java.lang.StringBuffer text)
          Find the text inside the specified bounding box.
 java.util.Vector find_text_with_rect(GRect box, java.lang.StringBuffer text, int padding)
          Find the text inside the specified bounding box.
 void get_zones(int zone_type, DjVuText.Zone parent, java.util.Vector zone_list)
          Get all zones of zone type zone_type under node parent.
 int getLength(int from, int end)
          Count the number of characters.
 java.lang.String getString(int start, int end)
          Query the string from the specified range of bytes.
 boolean has_valid_zones()
          Tests whether there is a meaningful zone hierarchy.
 DjVuText init(DataPool pool)
          Searches a file for TXTz and TXTa chunks and decodes each of them.
 DjVuText init(IFFInputStream iff)
          Searches a file for TXTz and TXTa chunks and decodes each of them.
 boolean isImageData()
          Query if this is image data.
 int length()
          Get the number of bytes of hidden text.
 void normalize_text()
          Normalize textual data.
 int search_string(java.util.Vector zone_list, java.lang.String string, int from, boolean search_fwd, boolean match_case)
          Searches the TXT chunk for the given string.
 int search_string(java.util.Vector zone_list, java.lang.String string, int from, boolean search_fwd, boolean match_case, boolean whole_word)
          Searches the TXT chunk for the given string.
 void setTextByteArray(byte[] textByteArray)
          Set the text data from an array of bytes.
 int startsWith(java.lang.String substring, int from, boolean match_case)
          Returns the position of the first character in the text beyond the found string, if the text contains the same words as the substring in the same order (but possibly with a different number of separators between words).
 java.lang.String toString()
          Query the entire text layer as a string.
 
Methods inherited from class com.lizardtech.djvu.DjVuObject
checkLockTime, create, create, createSoftReference, createWeakReference, getDjVuOptions, getFromReference, invoke, setDjVuOptions
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface com.lizardtech.djvu.DjVuInterface
getDjVuOptions, setDjVuOptions
 

Field Detail

PAGE

public static final int PAGE
Indicates a page zone.

See Also:
Constant Field Values

COLUMN

public static final int COLUMN
Indicates a column zone.

See Also:
Constant Field Values

REGION

public static final int REGION
Indicates a region zone.

See Also:
Constant Field Values

PARAGRAPH

public static final int PARAGRAPH
Indicates a paragraph zone.

See Also:
Constant Field Values

LINE

public static final int LINE
Indicates a line zone.

See Also:
Constant Field Values

WORD

public static final int WORD
Indicates a word zone.

See Also:
Constant Field Values

CHARACTER

public static final int CHARACTER
Indicates a character zone.

See Also:
Constant Field Values

end_of_column

public static final int end_of_column
VT: Vertical Tab

See Also:
Constant Field Values

end_of_region

public static final int end_of_region
GS: Group Separator

See Also:
Constant Field Values

end_of_paragraph

public static final int end_of_paragraph
US: Unit Separator

See Also:
Constant Field Values

end_of_line

public static final int end_of_line
LF: Line Feed

See Also:
Constant Field Values

page_zone

public DjVuText.Zone page_zone
Main zone in the document. This zone represents the page.


isUTF8

public boolean isUTF8
True if UTF8 encoded.


textByteArray

protected byte[] textByteArray
Textual data for this page. The content of this byteArray is encoded using UTF8. UTF8 coincides with ASCII for the first 128 characters (codes 0 through 127). Columns, regions, paragraphs and lines are delimited by the following control characters:

Name                         Octal   ASCII name
DjVuText.end_of_column       013     VT, Vertical Tab
DjVuText.end_of_region       035     GS, Group Separator
DjVuText.end_of_paragraph    037     US, Unit Separator
DjVuText.end_of_line         012     LF, Line Feed
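
To make the role of these delimiters concrete, the following sketch splits a decoded string at any of the four control characters. The class SeparatorSketch and the method splitUnits are hypothetical names used only for illustration; the octal values are the ones listed in the table above, and nothing here is taken from the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the constants mirror the octal values listed above. */
final class SeparatorSketch {
    static final int END_OF_COLUMN = 013;    // VT, Vertical Tab
    static final int END_OF_REGION = 035;    // GS, Group Separator
    static final int END_OF_PARAGRAPH = 037; // US, Unit Separator
    static final int END_OF_LINE = 012;      // LF, Line Feed

    static boolean isSeparator(int c) {
        return c == END_OF_COLUMN || c == END_OF_REGION
                || c == END_OF_PARAGRAPH || c == END_OF_LINE;
    }

    /** Splits text at any structural separator, dropping empty pieces. */
    static List<String> splitUnits(String text) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isSeparator(c)) {
                if (current.length() > 0) {
                    units.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            units.add(current.toString());
        }
        return units;
    }

    public static void main(String[] args) {
        String text = "first line" + (char) END_OF_LINE
                + "second line" + (char) END_OF_PARAGRAPH
                + "next paragraph" + (char) END_OF_COLUMN;
        for (String unit : splitUnits(text)) {
            System.out.println(unit);
        }
    }
}
```

The sketch treats all four separators alike; a caller that needs to distinguish paragraphs from columns would compare against the individual constants instead.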

Constructor Detail

DjVuText

public DjVuText()
Creates a new DjVuText object.

Method Detail

createDjVuText

public static DjVuText createDjVuText(DjVuInterface ref)
Creates an instance of DjVuText with the options inherited from the specified reference.

Parameters:
ref - Object to inherit DjVuOptions from.
Returns:
a new instance of DjVuText.

isImageData

public boolean isImageData()
Query if this is image data.

Specified by:
isImageData in interface Codec
Returns:
false

getLength

public int getLength(int from,
                     int end)
Count the number of characters.

Parameters:
from - byte position to start counting from
end - byte position to stop counting
Returns:
The number of characters that start within the range.
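
Since positions are byte offsets into UTF8 data, the character count of a range can differ from its byte count. The sketch below shows one plausible way to count characters in a byte range, by counting every byte that is not a UTF8 continuation byte. Utf8CountSketch and countCharacters are hypothetical names, and this is an assumption about the general technique, not a copy of what getLength does internally.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper illustrating character counting over UTF-8 bytes. */
final class Utf8CountSketch {
    /**
     * Counts the characters (code points) whose first byte lies in
     * [from, end). Continuation bytes have the bit pattern 10xxxxxx
     * and are skipped.
     */
    static int countCharacters(byte[] utf8, int from, int end) {
        int start = Math.max(0, from);
        int stop = Math.min(end, utf8.length);
        int count = 0;
        for (int i = start; i < stop; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The escape \u00e9 is encoded as two bytes in UTF-8.
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, "
                + countCharacters(bytes, 0, bytes.length) + " characters");
    }
}
```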

getString

public java.lang.String getString(int start,
                                  int end)
Query the string from the specified range of bytes.

Parameters:
start - byte position of the first character.
end - byte position to end the string
Returns:
The converted string

setTextByteArray

public void setTextByteArray(byte[] textByteArray)
Set the text data from an array of bytes. First try interpreting the array as UTF8. If that fails, treat each byte as one character.

Parameters:
textByteArray - array of bytes to interpret
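
The fallback described above can be sketched with the standard java.nio.charset API: a strict UTF8 decode is attempted first, and on failure every byte is mapped to one character. DecodeFallbackSketch is a hypothetical name, and the choice of ISO-8859-1 for the byte-per-character fallback is an assumption made for this sketch, not something the documentation states.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing a strict UTF-8 decode with a per-byte fallback. */
final class DecodeFallbackSketch {
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumed fallback: ISO-8859-1 maps each byte to exactly one character.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] valid = {'h', (byte) 0xC3, (byte) 0xA9};  // valid UTF-8
        byte[] invalid = {(byte) 0xFF, 'a'};             // 0xFF never occurs in UTF-8
        System.out.println(decode(valid).length());
        System.out.println(decode(invalid).length());
    }
}
```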

decode

public void decode(DataPool pool)
            throws java.io.IOException
Decodes the hidden text layer TXT into internal representation. NOTE: All separators (except word) are replaced with line feeds.

Specified by:
decode in interface Codec
Parameters:
pool - The chunk to decode.
Throws:
java.io.IOException - if an error occurs.

find_text_with_rect

public java.util.Vector find_text_with_rect(GRect box,
                                            java.lang.StringBuffer text,
                                            int padding)
Find the text inside the specified bounding box.

Parameters:
box - bounding box to search
text - buffer to fill with the text found
padding - number of pixels to add to each rectangle
Returns:
a vector of the smallest level rectangles representing the text found

find_text_with_rect

public java.util.Vector find_text_with_rect(GRect box,
                                            java.lang.StringBuffer text)
Find the text inside the specified bounding box.

Parameters:
box - bounding box to search
text - buffer to fill with the text found
Returns:
a vector of the smallest level rectangles representing the text found

get_zones

public void get_zones(int zone_type,
                      DjVuText.Zone parent,
                      java.util.Vector zone_list)
Get all zones of zone type zone_type under node parent. zone_list contains the return value.

Parameters:
zone_type - the zone type to list.
parent - parent zone to start from
zone_list - vector to add the zones to

has_valid_zones

public boolean has_valid_zones()
Tests whether there is a meaningful zone hierarchy.

Returns:
true if there are valid zones

init

public DjVuText init(IFFInputStream iff)
              throws java.io.IOException
Searches a file for TXTz and TXTa chunks and decodes each of them.

Parameters:
iff - input stream to read.
Returns:
the initialized DjVuText object
Throws:
java.io.IOException - if an IO error occurs.

init

public DjVuText init(DataPool pool)
              throws java.io.IOException
Searches a file for TXTz and TXTa chunks and decodes each of them.

Parameters:
pool - data pool to read.
Returns:
the initialized DjVuText object
Throws:
java.io.IOException - if an IO error occurs.

length

public int length()
Get the number of bytes of hidden text.

Returns:
number of bytes

normalize_text

public void normalize_text()
Normalize textual data. Assuming that a zone hierarchy has been built and represents the reading order, this function reorganizes the byteArray textByteArray by gathering the highest level text available in the zone hierarchy. The text offsets and lengths are recomputed for all the zones in the hierarchy. Separators are inserted where appropriate.


search_string

public int search_string(java.util.Vector zone_list,
                         java.lang.String string,
                         int from,
                         boolean search_fwd,
                         boolean match_case,
                         boolean whole_word)
Searches the TXT chunk for the given string. If an occurrence of the string is found, the start position of the matching text is returned. If no match is found, the return value is -1.

Parameters:
zone_list - A list of smallest zones covering the text.
string - String to be found. May contain spaces as word separators.
from - Position returned by last search. If from is out of bounds of textByteArray it will be set to -1 for searching forward and textByteArray.length for searching backwards.
search_fwd - TRUE means to search forward. FALSE - backward.
match_case - If set to FALSE the search will be case-insensitive.
whole_word - If set to TRUE the function will try to find a whole word matching the passed string. The word separators are all blank and punctuation characters. The passed string may not contain word separators; that is, it must be a whole word.
Returns:
Start of text if found, otherwise -1.
Throws:
java.lang.IllegalArgumentException - if the search string contains no non-whitespace characters
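
The match_case and whole_word options can be illustrated on a plain Java string. The sketch below is a simplified, forward-only search that starts at index from (inclusive) and ignores zones entirely. WholeWordSearchSketch and find are hypothetical names, and treating every character that is neither a letter nor a digit as a word separator is an approximation of "blank and punctuation characters"; none of this reflects how search_string is implemented.

```java
/** Hypothetical, simplified forward search; not the library's search_string. */
final class WholeWordSearchSketch {
    // Approximation: anything that is not a letter or digit separates words.
    static boolean isWordSeparator(char c) {
        return !Character.isLetterOrDigit(c);
    }

    /** Returns the start index of the first match at or after from, or -1. */
    static int find(String text, String word, int from,
                    boolean matchCase, boolean wholeWord) {
        if (word.trim().isEmpty()) {
            throw new IllegalArgumentException(
                    "search string contains no non-whitespace characters");
        }
        int n = word.length();
        for (int pos = Math.max(0, from); pos + n <= text.length(); pos++) {
            // regionMatches takes ignoreCase as its first argument.
            if (!text.regionMatches(!matchCase, pos, word, 0, n)) {
                continue;
            }
            if (!wholeWord) {
                return pos;
            }
            boolean startOk = pos == 0 || isWordSeparator(text.charAt(pos - 1));
            boolean endOk = pos + n == text.length()
                    || isWordSeparator(text.charAt(pos + n));
            if (startOk && endOk) {
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "The cat concatenates cats.";
        System.out.println(find(text, "cat", 0, true, true));
        System.out.println(find(text, "cat", 5, true, false));
    }
}
```

With whole_word semantics, the occurrences of "cat" inside "concatenates" and "cats" are skipped because a letter follows or precedes them.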

search_string

public int search_string(java.util.Vector zone_list,
                         java.lang.String string,
                         int from,
                         boolean search_fwd,
                         boolean match_case)
Searches the TXT chunk for the given string. If an occurrence of the string is found, the start position of the matching text is returned. If no match is found, the return value is -1. Does not try to match the whole word.

Parameters:
zone_list - A list of smallest zones covering the text.
string - String to be found. May contain spaces as word separators.
from - Position returned by last search. If from is out of bounds of textByteArray it will be set to -1 for searching forward and textByteArray.length for searching backwards.
search_fwd - TRUE means to search forward. FALSE - backward.
match_case - If set to FALSE the search will be case-insensitive.
Returns:
Start of text if found, otherwise -1.
Throws:
java.lang.IllegalArgumentException - if the search string contains no non-whitespace characters

startsWith

public int startsWith(java.lang.String substring,
                      int from,
                      boolean match_case)
Returns the position of the first character in the text beyond the found string, if the text contains the same words as the substring in the same order (but possibly with a different number of separators between words). The 'separators' in this function are blank and 'end_of_...' characters. If the text is not found, the initial from value is returned. Note that the returned position may differ from (substring.length+from) because the number of spaces between words in the substring and in the text may differ.

Parameters:
substring - string to search for
from - start position
match_case - true if case sensitive
Returns:
end position if the substring is found
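
The separator-tolerant matching described for startsWith can be sketched on a plain Java string as follows. StartsWithSketch is a hypothetical class, its argument order differs from the real method because the text is passed explicitly, and requiring at least one separator between consecutive words in the text is an assumption of this sketch. Separators are the blank plus the four 'end_of_...' control characters.

```java
/** Hypothetical sketch of word-wise prefix matching with flexible separators. */
final class StartsWithSketch {
    static boolean isSeparator(char c) {
        // Blank plus end_of_column, end_of_region, end_of_paragraph, end_of_line.
        return c == ' ' || c == 013 || c == 035 || c == 037 || c == 012;
    }

    /** Returns the position just past the match, or from if there is no match. */
    static int startsWith(String text, String substring, int from, boolean matchCase) {
        if (from < 0 || from > text.length()) {
            return from;
        }
        int t = from;          // cursor in text
        int s = 0;             // cursor in substring
        boolean first = true;
        while (true) {
            while (s < substring.length() && isSeparator(substring.charAt(s))) {
                s++;
            }
            if (s == substring.length()) {
                return t;      // every word matched (or there were no words)
            }
            int before = t;
            while (t < text.length() && isSeparator(text.charAt(t))) {
                t++;
            }
            if (!first && t == before) {
                return from;   // assumed rule: words must be separated in text
            }
            int wordStart = s;
            while (s < substring.length() && !isSeparator(substring.charAt(s))) {
                s++;
            }
            int wordLength = s - wordStart;
            if (!text.regionMatches(!matchCase, t, substring, wordStart, wordLength)) {
                return from;
            }
            t += wordLength;
            first = false;
        }
    }

    public static void main(String[] args) {
        // Three blanks in the text, one in the substring.
        System.out.println(startsWith("hello   world again", "hello world", 0, true));
    }
}
```

In the example in main, the returned position is 13 rather than 11 (substring length plus from), because the text holds three blanks where the substring has one.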

toString

public java.lang.String toString()
Query the entire text layer as a string.

Overrides:
toString in class java.lang.Object
Returns:
the converted string