DjVuText

com.lizardtech.djvu.text
Class DjVuText

java.lang.Object
  com.lizardtech.djvu.DjVuObject
      com.lizardtech.djvu.text.DjVuText

All Implemented Interfaces:: Codec, DjVuInterface

public class DjVuText
extends DjVuObject
implements Codec

This class implements annotations understood by the DjVu plugins and encoders, using the contents of TXT chunks. The contents of the FORM:TEXT should be passed to decode for parsing, which initializes this object and fills in the decoded data.

Description of the text contained in a DjVu page. This class contains the textual data for the page. It describes the text as a hierarchy of zones corresponding to page, column, region, paragraph, line, word, and so on. The piece of text associated with each zone is represented by an offset and a length describing a segment of a global UTF-8 encoded byte array.
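The offset-and-length scheme described above can be illustrated with a minimal sketch. The class ZoneSketch, its fields, and its text method are hypothetical names introduced here for illustration; they are not the fields of DjVuText.Zone.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a zone; it is NOT the library's DjVuText.Zone.
final class ZoneSketch {
    final String label;  // illustrative label such as "line" or "word"
    final int offset;    // byte offset into the shared UTF-8 array
    final int length;    // byte length of the zone's text

    ZoneSketch(String label, int offset, int length) {
        this.label = label;
        this.offset = offset;
        this.length = length;
    }

    // Decode only this zone's slice of the shared byte array.
    String text(byte[] utf8) {
        return new String(utf8, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Both accented letters take two bytes in UTF-8, so the byte
        // offsets differ from the character positions.
        byte[] utf8 = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        ZoneSketch line = new ZoneSketch("line", 0, utf8.length);
        ZoneSketch word = new ZoneSketch("word", 7, 6);
        System.out.println(line.label + ": " + line.text(utf8));
        System.out.println(word.label + ": " + word.text(utf8));
    }
}
```

Because every zone points into the same array, a parent zone and its children share storage, and the offsets are byte positions rather than character positions.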

Constants are used to tell what a zone describes. This can be useful for a copy/paste application. The deeper we go into the hierarchy, the higher the constant.

Nested Class Summary
`static class`	`DjVuText.Zone` Data structure representing document textual components.

Field Summary
`static int`	`CHARACTER` Indicates a character zone.
`static int`	`COLUMN` Indicates a column zone.
`static int`	`end_of_column` VT: Vertical Tab
`static int`	`end_of_line` LF: Line Feed
`static int`	`end_of_paragraph` US: Unit Separator
`static int`	`end_of_region` GS: Group Separator
`boolean`	`isUTF8` True if UTF8 encoded.
`static int`	`LINE` Indicates a line zone.
`static int`	`PAGE` Indicates a page zone.
`DjVuText.Zone`	`page_zone` Main zone in the document.
`static int`	`PARAGRAPH` Indicates a paragraph zone.
`static int`	`REGION` Indicates a region zone.
`protected byte[]`	`textByteArray` Textual data for this page.
`static int`	`WORD` Indicates a word zone.

Fields inherited from class com.lizardtech.djvu.DjVuObject
`hasReferences`

Constructor Summary
`DjVuText()` Creates a new DjVuText object.

Method Summary
`static DjVuText`	`createDjVuText(DjVuInterface ref)` Creates an instance of DjVuText with the options inherited from the specified reference.
`void`	`decode(DataPool pool)` Decodes the hidden text layer TXT into internal representation.
`java.util.Vector`	`find_text_with_rect(GRect box, java.lang.StringBuffer text)` Find the text specified by the rectangles.
`java.util.Vector`	`find_text_with_rect(GRect box, java.lang.StringBuffer text, int padding)` Find the text specified by the rectangles.
`void`	`get_zones(int zone_type, DjVuText.Zone parent, java.util.Vector zone_list)` Get all zones of zone type zone_type under node parent.
`int`	`getLength(int from, int end)` Count the number of characters.
`java.lang.String`	`getString(int start, int end)` Query the string from the specified range of bytes.
`boolean`	`has_valid_zones()` Tests whether there is a meaningful zone hierarchy.
`DjVuText`	`init(DataPool pool)` Searches a file for TXTz and TXTa chunks and decodes each of them.
`DjVuText`	`init(IFFInputStream iff)` Searches a file for TXTz and TXTa chunks and decodes each of them.
`boolean`	`isImageData()` Query if this is image data.
`int`	`length()` Get the number of bytes of hidden text.
`void`	`normalize_text()` Normalize textual data.
`int`	`search_string(java.util.Vector zone_list, java.lang.String string, int from, boolean search_fwd, boolean match_case)` Searches the TXT chunk for the given string.
`int`	`search_string(java.util.Vector zone_list, java.lang.String string, int from, boolean search_fwd, boolean match_case, boolean whole_word)` Searches the TXT chunk for the given string.
`void`	`setTextByteArray(byte[] textByteArray)` Set the text data from an array of bytes.
`int`	`startsWith(java.lang.String substring, int from, boolean match_case)` Returns the position of the first character beyond the found string, if the text contains the same words as the substring in the same order (but possibly with a different number of separators between words).
`java.lang.String`	`toString()` Query the entire text layer as a string.

Methods inherited from class com.lizardtech.djvu.DjVuObject
`checkLockTime, create, create, createSoftReference, createWeakReference, getDjVuOptions, getFromReference, invoke, setDjVuOptions`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Methods inherited from interface com.lizardtech.djvu.DjVuInterface
`getDjVuOptions, setDjVuOptions`

Field Detail

PAGE

public static final int PAGE

Indicates a page zone.

See Also:: Constant Field Values

COLUMN

public static final int COLUMN

Indicates a column zone.

See Also:: Constant Field Values

REGION

public static final int REGION

Indicates a region zone.

See Also:: Constant Field Values

PARAGRAPH

public static final int PARAGRAPH

Indicates a paragraph zone.

See Also:: Constant Field Values

LINE

public static final int LINE

Indicates a line zone.

See Also:: Constant Field Values

WORD

public static final int WORD

Indicates a word zone.

See Also:: Constant Field Values

CHARACTER

public static final int CHARACTER

Indicates a character zone.

See Also:: Constant Field Values

end_of_column

public static final int end_of_column

VT: Vertical Tab

See Also:: Constant Field Values

end_of_region

public static final int end_of_region

GS: Group Separator

See Also:: Constant Field Values

end_of_paragraph

public static final int end_of_paragraph

US: Unit Separator

See Also:: Constant Field Values

end_of_line

public static final int end_of_line

LF: Line Feed

See Also:: Constant Field Values

page_zone

public DjVuText.Zone page_zone

Main zone in the document. This zone represents the page.

isUTF8

public boolean isUTF8

True if UTF8 encoded.

textByteArray

protected byte[] textByteArray

Textual data for this page. The content of this byte array is encoded in UTF-8, which coincides with ASCII for the first 128 code points (0 through 127). Columns, regions, paragraphs and lines are delimited by the following control characters:

Name	Octal	ASCII name
DjVuText.end_of_column	013	VT, Vertical Tab
DjVuText.end_of_region	035	GS, Group Separator
DjVuText.end_of_paragraph	037	US, Unit Separator
DjVuText.end_of_line	012	LF, Line Feed
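As an illustration of how these delimiters partition the text, here is a minimal sketch that splits a decoded string on one separator at a time. The class SeparatorSketch and its split method are assumptions made for this example. The constant values are copied from the octal column of the table above and are not read from the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the constants mirror the octal values in the table.
final class SeparatorSketch {
    static final char END_OF_COLUMN = '\013';    // VT, Vertical Tab
    static final char END_OF_REGION = '\035';    // GS, Group Separator
    static final char END_OF_PARAGRAPH = '\037'; // US, Unit Separator
    static final char END_OF_LINE = '\012';      // LF, Line Feed

    // Split text into the pieces terminated by the given separator.
    // A trailing piece without a terminator is kept if it is non-empty.
    static List<String> split(String text, char separator) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == separator) {
                parts.add(text.substring(start, i));
                start = i + 1;
            }
        }
        if (start < text.length()) {
            parts.add(text.substring(start));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "one" + END_OF_LINE + "two" + END_OF_LINE + END_OF_PARAGRAPH
                    + "three" + END_OF_LINE + END_OF_PARAGRAPH;
        for (String paragraph : split(text, END_OF_PARAGRAPH)) {
            System.out.println(split(paragraph, END_OF_LINE));
        }
    }
}
```

Splitting first on the paragraph separator and then on the line separator recovers a two-level structure from the flat text, which is useful when no zone hierarchy is available.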

Constructor Detail

DjVuText

public DjVuText()

Creates a new DjVuText object.

Method Detail

createDjVuText

public static DjVuText createDjVuText(DjVuInterface ref)

Creates an instance of DjVuText with the options inherited from the specified reference.

Parameters:: ref - Object to inherit DjVuOptions from.
Returns:: a new instance of DjVuText.

isImageData

public boolean isImageData()

Query if this is image data.

Specified by:: isImageData in interface Codec

Returns:: false

getLength

public int getLength(int from,
                     int end)

Count the number of characters.

Parameters:: from - byte position to start counting from; end - byte position to stop counting
Returns:: The number of characters that start within the range.
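One common way to count characters in a UTF-8 byte range is to count the bytes that are not continuation bytes. The sketch below takes that approach. It is an assumption about how such a count can be done, not a copy of getLength, and the class and method names are hypothetical.

```java
// Hypothetical helper illustrating character counting over UTF-8 bytes.
final class Utf8CountSketch {
    // Count the bytes in [from, end) that begin a character. In UTF-8,
    // continuation bytes have the bit pattern 10xxxxxx; every other byte
    // starts a new character.
    static int countCharacters(byte[] utf8, int from, int end) {
        int start = Math.max(from, 0);
        int stop = Math.min(end, utf8.length);
        int count = 0;
        for (int i = start; i < stop; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(utf8.length);                            // byte count
        System.out.println(countCharacters(utf8, 0, utf8.length));  // character count
    }
}
```

With this rule, a range that begins in the middle of a multi-byte character does not count that character, since its start byte lies outside the range.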

getString

public java.lang.String getString(int start,
                                  int end)

Query the string from the specified range of bytes.

Parameters:: start - byte position of the first character.; end - byte position to end the string
Returns:: The converted string

setTextByteArray

public void setTextByteArray(byte[] textByteArray)

Set the text data from an array of bytes. First try interpreting the array as UTF8. If that fails consider each byte one character.

Parameters:: textByteArray - array of bytes to interpret
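The try-UTF-8-then-fall-back behaviour can be sketched with the standard java.nio.charset API. The class DecodeFallbackSketch is hypothetical, and the use of ISO-8859-1 to model "each byte is one character" is an assumption. The library may map single bytes differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the body of setTextByteArray.
final class DecodeFallbackSketch {
    // Strict UTF-8 decoding: malformed input raises an exception instead
    // of being silently replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decode as UTF-8 when valid; otherwise treat each byte as one
    // character (modelled here with ISO-8859-1, an assumption).
    static String decode(byte[] bytes) {
        return isValidUtf8(bytes)
            ? new String(bytes, StandardCharsets.UTF_8)
            : new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] valid = {'h', (byte) 0xC3, (byte) 0xA9};  // "h" + e-acute in UTF-8
        byte[] invalid = {'h', (byte) 0xE9};             // truncated UTF-8 sequence
        System.out.println(isValidUtf8(valid) + " " + decode(valid).length());
        System.out.println(isValidUtf8(invalid) + " " + decode(invalid).length());
    }
}
```

A flag such as isUTF8 would naturally record which branch was taken, so that later byte-to-character conversions use the same interpretation.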

decode

public void decode(DataPool pool)
            throws java.io.IOException

Decodes the hidden text layer TXT into internal representation. NOTE: All separators (except word) are replaced with line feeds.

Specified by:: decode in interface Codec

Parameters:: pool - The chunk to decode.
Throws:: java.io.IOException - if an error occurs.

find_text_with_rect

public java.util.Vector find_text_with_rect(GRect box,
                                            java.lang.StringBuffer text,
                                            int padding)

Find the text specified by the rectangles.

Parameters:: box - bounding box to search; text - buffer to fill with the text found; padding - number of pixels to add to each rectangle
Returns:: a vector of the smallest level rectangles representing the text found

find_text_with_rect

public java.util.Vector find_text_with_rect(GRect box,
                                            java.lang.StringBuffer text)

Find the text specified by the rectangles.

Parameters:: box - bounding box to search; text - buffer to fill with the text found
Returns:: a vector of the smallest level rectangles representing the text found

get_zones

public void get_zones(int zone_type,
                      DjVuText.Zone parent,
                      java.util.Vector zone_list)

Get all zones of zone type zone_type under node parent. zone_list contains the return value.

Parameters:: zone_type - the zone type to list.; parent - parent zone to start from; zone_list - vector to add the zones to
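Collecting all zones of one type under a parent amounts to a recursive tree walk that appends matches to a caller-supplied list. The sketch below shows that pattern on a hypothetical Node type. The numeric constants are assumed values chosen only to respect the rule that deeper zone types have higher constants, and the traversal is a simplification, not the library's get_zones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree used to illustrate the traversal.
final class ZoneTreeSketch {
    // Assumed values; only their ordering (deeper is higher) matters here.
    static final int PAGE = 1;
    static final int LINE = 5;
    static final int WORD = 6;

    static final class Node {
        final int type;
        final List<Node> children = new ArrayList<>();

        Node(int type) {
            this.type = type;
        }

        // Returns this node so that calls can be chained.
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Append every descendant of parent whose type equals zoneType.
    // Children with a shallower type are searched recursively; children
    // that are already deeper than zoneType are skipped.
    static void collect(int zoneType, Node parent, List<Node> out) {
        for (Node child : parent.children) {
            if (child.type == zoneType) {
                out.add(child);
            } else if (child.type < zoneType) {
                collect(zoneType, child, out);
            }
        }
    }

    public static void main(String[] args) {
        Node page = new Node(PAGE)
            .add(new Node(LINE).add(new Node(WORD)).add(new Node(WORD)))
            .add(new Node(LINE).add(new Node(WORD)));
        List<Node> words = new ArrayList<>();
        collect(WORD, page, words);
        System.out.println(words.size());
    }
}
```

Passing the result list as a parameter, as get_zones does with zone_list, lets a caller accumulate zones from several parents into one list.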

has_valid_zones

public boolean has_valid_zones()

Tests whether there is a meaningful zone hierarchy.

Returns:: true if there are valid zones

init

public DjVuText init(IFFInputStream iff)
              throws java.io.IOException

Searches a file for TXTz and TXTa chunks and decodes each of them.

Parameters:: iff - input stream to read.
Returns:: the initialized DjVuText object
Throws:: java.io.IOException - if an IO error occurs.

init

public DjVuText init(DataPool pool)
              throws java.io.IOException

Searches a file for TXTz and TXTa chunks and decodes each of them.

Parameters:: pool - data pool to read.
Returns:: the initialized DjVuText object
Throws:: java.io.IOException - if an IO error occurs.

length

public int length()

Get the number of bytes of hidden text.

Returns:: number of bytes

normalize_text

public void normalize_text()

Normalize textual data. Assuming that a zone hierarchy has been built and represents the reading order, this function reorganizes the byte array textByteArray by gathering the highest level text available in the zone hierarchy. The text offsets and lengths are recomputed for all the zones in the hierarchy. Separators are inserted where appropriate.

search_string

public int search_string(java.util.Vector zone_list,
                         java.lang.String string,
                         int from,
                         boolean search_fwd,
                         boolean match_case,
                         boolean whole_word)

Searches the TXT chunk for the given string. If an occurrence of the string is found, the start of the text is returned. If no match is found, the return value is -1.

Parameters:: zone_list - A list of smallest zones covering the text.; string - String to be found. May contain spaces as word separators.; from - Position returned by last search. If from is out of bounds of textByteArray it will be set to -1 for searching forward and textByteArray.length for searching backwards.; search_fwd - TRUE means to search forward. FALSE - backward.; match_case - If set to FALSE the search will be case-insensitive.; whole_word - If set to TRUE the function will try to find a whole word matching the passed string. The word separators are all blank and punctuation characters. The passed string may not contain word separators, that is it must be a whole word.
Returns:: Start of text if found, otherwise -1.
Throws:: java.lang.IllegalArgumentException - if the search string contains no non-whitespace characters
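The interplay of from, search_fwd, match_case and whole_word can be illustrated on a plain String. The sketch below is an assumption-laden model, not the library's search_string. It works on character indices rather than byte positions, ignores zone_list, and treats every character that is not a letter or digit as a word separator. The class and method names are hypothetical.

```java
// Hypothetical whole-word search over a String.
final class WholeWordSearchSketch {
    // Assumption: anything that is not a letter or digit separates words.
    static boolean isWordSeparator(char c) {
        return !Character.isLetterOrDigit(c);
    }

    // Returns the start index of the next whole-word match after (or
    // before) the position returned by the previous search, or -1.
    static int search(String text, String word, int from,
                      boolean forward, boolean matchCase) {
        if (word.trim().isEmpty()) {
            throw new IllegalArgumentException("search string has no non-whitespace characters");
        }
        // An out-of-bounds position restarts the search from the far end.
        if (from < 0 || from >= text.length()) {
            from = forward ? -1 : text.length();
        }
        int step = forward ? 1 : -1;
        for (int i = from + step; i >= 0 && i < text.length(); i += step) {
            if (!text.regionMatches(!matchCase, i, word, 0, word.length())) {
                continue;
            }
            int end = i + word.length();
            boolean startOk = i == 0 || isWordSeparator(text.charAt(i - 1));
            boolean endOk = end == text.length() || isWordSeparator(text.charAt(end));
            if (startOk && endOk) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "cat concat cat.";
        int pos = -1;
        // Feed each result back in as 'from' to visit every match in turn.
        while ((pos = search(text, "cat", pos, true, true)) != -1) {
            System.out.println(pos);
        }
    }
}
```

Feeding each returned position back in as from is what makes a "find next" loop work: the search resumes one position past the previous hit rather than finding it again.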

search_string

public int search_string(java.util.Vector zone_list,
                         java.lang.String string,
                         int from,
                         boolean search_fwd,
                         boolean match_case)

Searches the TXT chunk for the given string. If an occurrence of the string is found, the start of the text is returned. If no match is found, the return value is -1. Does not try to match the whole word.

Parameters:: zone_list - A list of smallest zones covering the text.; string - String to be found. May contain spaces as word separators.; from - Position returned by last search. If from is out of bounds of textByteArray it will be set to -1 for searching forward and textByteArray.length for searching backwards.; search_fwd - TRUE means to search forward. FALSE - backward.; match_case - If set to FALSE the search will be case-insensitive.
Returns:: Start of text if found, otherwise -1.
Throws:: java.lang.IllegalArgumentException - if the search string contains no non-whitespace characters

startsWith

public int startsWith(java.lang.String substring,
                      int from,
                      boolean match_case)

Returns the position of the first character beyond the found string, if the text contains the same words as the substring in the same order (but possibly with a different number of separators between words). The 'separators' in this function are blank and 'end_of_...' characters. If the text is not found, the initial from value is returned. Note that the returned position may differ from (substring.length+from) because the number of spaces between words may differ between the substring and the text.

Parameters:: substring - string to search for; from - start position; match_case - true if case sensitive
Returns:: end position if the substring is found
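The separator-tolerant matching described above can be sketched as a word-by-word comparison that skips runs of separators on both sides. This is a hypothetical model on character indices, not the library's implementation. The separator set (space, tab and the four end_of_... control characters) and the requirement of at least one separator between words in the text are assumptions.

```java
// Hypothetical model of separator-tolerant prefix matching.
final class StartsWithSketch {
    // Assumption: blanks plus the four end_of_... control characters.
    static boolean isSeparator(char c) {
        return c == ' ' || c == '\t'
            || c == '\012' || c == '\013' || c == '\035' || c == '\037';
    }

    static boolean sameChar(char a, char b, boolean matchCase) {
        return matchCase ? a == b : Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    // Returns the index just past the match, or 'from' if there is none.
    static int startsWith(String text, String substring, int from, boolean matchCase) {
        if (from < 0 || from > text.length()) {
            return from;
        }
        int t = from;
        int s = 0;
        boolean firstWord = true;
        while (true) {
            // Skip separators in the pattern; running out means all words matched.
            while (s < substring.length() && isSeparator(substring.charAt(s))) s++;
            if (s == substring.length()) {
                return firstWord ? from : t;
            }
            // Skip any number of separators in the text.
            int before = t;
            while (t < text.length() && isSeparator(text.charAt(t))) t++;
            if (!firstWord && t == before) {
                return from;  // words must be separated in the text too
            }
            // Compare one word character by character.
            while (s < substring.length() && !isSeparator(substring.charAt(s))) {
                if (t >= text.length() || !sameChar(text.charAt(t), substring.charAt(s), matchCase)) {
                    return from;
                }
                t++;
                s++;
            }
            firstWord = false;
        }
    }

    public static void main(String[] args) {
        // Three spaces in the text, one in the pattern.
        System.out.println(startsWith("hello   world!", "hello world", 0, true));
    }
}
```

In the usage above the pattern has 11 characters but the match spans 13, which is the situation the note about (substring.length+from) warns about.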

toString

public java.lang.String toString()

Query the entire text layer as a string.

Overrides:: toString in class java.lang.Object

Returns:: the converted string
