jwo.utils.syntaxhighlighter
Class Scanner

java.lang.Object
  extended by jwo.utils.syntaxhighlighter.Scanner
All Implemented Interfaces:
TokenTypes
Direct Known Subclasses:
JavaScanner, ScriptScanner

public class Scanner
extends Object
implements TokenTypes

A Scanner object provides a lexical analyser and a resulting token array. Incremental rescanning is supported, e.g. for use in a token colouring editor. This is a base class dealing with plain text, which can be extended to support other languages.

The actual text is assumed to be held elsewhere, e.g. in a document. The change() method is called to report the position and length of a change in the text, and the scan() method is called to perform scanning or rescanning. For example, to scan an entire document held in a character array text in one go:

 scanner.change(0, 0, text.length);
 scanner.scan(text, 0, text.length);
 

For incremental scanning, the position() method is used to find the text position at which rescanning should start. For example, a syntax highlighter might contain this code:

 // Where to start rehighlighting, and a segment object to receive text
 int firstRehighlightToken;
 Segment text = new Segment();

 ...

 // Whenever the text changes, e.g. on an insert or remove or read.
 firstRehighlightToken = scanner.change(offset, oldLength, newLength);
 repaint();

 ...

 // in repaintComponent
 int offset = scanner.position();
 if (offset < 0) return;
 int tokensToRedo = 0;
 int amount = 100;
 while (tokensToRedo == 0 && offset >= 0)
 {
    int length = doc.getLength() - offset;
    if (length > amount) length = amount;
    try { doc.getText(offset, length, text); }
    catch (BadLocationException e) { return; }
    tokensToRedo = scanner.scan(text.array, text.offset, text.count);
    offset = scanner.position();
    amount = 2*amount;
 }
 for (int i = 0; i < tokensToRedo; i++)
 {
    Token t = scanner.getToken(firstRehighlightToken + i);
    int length = t.symbol.name.length();
    int type = t.symbol.type;
    doc.setCharacterAttributes (t.position, length, styles[type], false);
 }
 firstRehighlightToken += tokensToRedo;
 if (offset >= 0) repaint(2);
 

Note that change can be called at any time, even between calls to scan. Only a small number of characters is passed to scan at a time, so that each call does only a short burst of scanning and the program's user interface does not freeze.

Version:
1.0, February 2006; modified 29th May 2006.
Author:
Henk Muller; minor modifications by Jo Wood.

Field Summary
protected  char[] buffer
          The current buffer of text being scanned.
protected  int end
          The end offset in the buffer.
protected  int start
          The current offset within the buffer, at which to scan the next token.
protected  int state
          The current scanner state, as a representative token type.
protected  HashMap symbolTable
          The symbol table can be accessed by initSymbolTable or lookup, if they are overridden.
 
Fields inherited from interface jwo.utils.syntaxhighlighter.TokenTypes
BRACKET, CHARACTER, COMMENT, END_COMMENT, END_TAG, IDENTIFIER, KEYWORD, KEYWORD2, LITERAL, MID_COMMENT, NUMBER, OPERATOR, PUNCTUATION, SEPARATOR, START_COMMENT, STRING, TAG, typeNames, UNRECOGNIZED, URL, WHITESPACE, WORD
 
Constructor Summary
Scanner()
          Creates a new Scanner representing an empty text document.
 
Method Summary
 int change(int start, int len, int newLen)
          Sets the position of an edit, the length of the text being replaced, and the length of the replacement text, to prepare for rescanning.
 int find(int p)
          Finds the index of the valid token starting before, but nearest to, text position p.
 Token getToken(int n)
          Finds the nth token, or null if it is not currently valid.
protected  void initSymbolTable()
          Creates the initial symbol table.
protected  Symbol lookup(int type, String name)
          Looks up a symbol in the symbol table.
 int position()
          Finds out at what text position any remaining scanning work should start.
protected  int read()
          Reads one token from the start of the current text buffer, given the start offset, end offset, and current scanner state.
 void remove(int symType)
          Removes all symbols of the given type from the symbol table.
 int scan(char[] array, int offset, int length)
          Scans or rescans a given read-only segment of text.
 int size()
          Finds the number of available valid tokens, not counting tokens in or after any area yet to be rescanned.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

buffer

protected char[] buffer
The current buffer of text being scanned.


start

protected int start
The current offset within the buffer, at which to scan the next token.


end

protected int end
The end offset in the buffer.


state

protected int state
The current scanner state, as a representative token type.


symbolTable

protected HashMap symbolTable
The symbol table can be accessed by initSymbolTable or lookup, if they are overridden. Symbols are inserted with symbolTable.put(sym,sym) and extracted with symbolTable.get(sym).
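
The put(sym,sym) and get(sym) idiom can be illustrated with a minimal, self-contained sketch. SymbolSketch and SymbolTableSketch are hypothetical stand-ins, not the library's classes, and the sketch assumes that symbols compare equal when their type and name match:

```java
import java.util.HashMap;

// Hypothetical stand-in for the library's Symbol class: a (type, name) pair.
final class SymbolSketch {
    final int type;
    final String name;

    SymbolSketch(int type, String name) {
        this.type = type;
        this.name = name;
    }

    // Assumption: equality and hashing on (type, name) are what let
    // get(sym) find an equal symbol that was stored earlier.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SymbolSketch)) return false;
        SymbolSketch s = (SymbolSketch) o;
        return type == s.type && name.equals(s.name);
    }

    @Override
    public int hashCode() {
        return 31 * type + name.hashCode();
    }
}

// Hypothetical table that maps each symbol to itself, so that equal
// symbols are represented by one shared instance.
final class SymbolTableSketch {
    private final HashMap<SymbolSketch, SymbolSketch> table = new HashMap<>();

    // Returns the shared instance for (type, name), storing it on first use.
    SymbolSketch intern(int type, String name) {
        SymbolSketch probe = new SymbolSketch(type, name);
        SymbolSketch existing = table.get(probe);   // extract with get(sym)
        if (existing != null) return existing;
        table.put(probe, probe);                    // insert with put(sym, sym)
        return probe;
    }

    int size() {
        return table.size();
    }
}
```

Because each symbol is both key and value, two tokens with the same type and name end up referring to the same object, so they can be compared by reference.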

Constructor Detail

Scanner

public Scanner()
Creates a new Scanner representing an empty text document. For non-incremental scanning, use change() to report the document size, then pass the entire text to the scan() method in one go, or if coming from an input stream, a buffer's worth at a time.

Method Detail

read

protected int read()

Reads one token from the start of the current text buffer, given the start offset, end offset, and current scanner state. The method moves the start offset past the token, updates the scanner state, and returns the type of the token just scanned.

The scanner state is a representative token type. It is either the state left after the last call to read, or the type of the old token at the same position if rescanning, or WHITESPACE if at the start of a document. The method succeeds in all cases, returning whitespace or comment or error tokens where necessary. Each line of a multi-line comment is treated as a separate token, to improve incremental rescanning. If the buffer does not extend to the end of the document, the last token returned for the buffer may be incomplete and the caller must rescan it. The read method can be overridden to implement different languages. The default version splits plain text into words, numbers and punctuation.

Returns:
Type of token that has been read.
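
As a rough illustration of the read contract for plain text, the following self-contained sketch splits a buffer into words, numbers, whitespace and punctuation. It is not the library's implementation: the class name, the token type values and the character classification rules are all assumptions made for the example.

```java
// Hypothetical, simplified plain-text tokenizer illustrating the read() contract.
final class PlainTextReadSketch {
    // Assumed type codes; the real values live in TokenTypes.
    static final int WHITESPACE = 0, WORD = 1, NUMBER = 2, PUNCTUATION = 3;

    char[] buffer;          // text being scanned
    int start;              // offset at which to scan the next token
    int end;                // end offset in the buffer
    int state = WHITESPACE; // state at the start of a document

    PlainTextReadSketch(char[] buffer, int start, int end) {
        this.buffer = buffer;
        this.start = start;
        this.end = end;
    }

    // Assumed classification of a single character.
    private static int classify(char c) {
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isLetter(c)) return WORD;
        if (Character.isDigit(c)) return NUMBER;
        return PUNCTUATION;
    }

    // Reads one token at start, moves start past it, updates the state and
    // returns the token type. The caller must ensure start < end.
    int read() {
        int type = classify(buffer[start]);
        start++;
        if (type != PUNCTUATION) {
            // A run of characters of the same class forms one token;
            // each punctuation character is a token of its own.
            while (start < end && classify(buffer[start]) == type) start++;
        }
        state = type;
        return type;
    }
}
```

A language-specific subclass would replace the classification step with its own rules, for example switching into a comment state when it sees a comment opener.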

size

public int size()
Finds the number of available valid tokens, not counting tokens in or after any area yet to be rescanned.

Returns:
Number of available valid tokens.

getToken

public Token getToken(int n)
Finds the nth token, or null if it is not currently valid.

Parameters:
n - Token number to search for.
Returns:
Token found, or null if not.

find

public int find(int p)
Finds the index of the valid token starting before, but nearest to, text position p. This uses an O(log(n)) binary chop search.

Parameters:
p - Text position to search for.
Returns:
Index of the valid token starting before, but nearest to, the given position.
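
The binary chop can be sketched over a sorted array of token start positions. This is an illustration only: the class name and the array representation are hypothetical, and the sketch assumes that a token starting exactly at p counts as a match and that -1 is returned when no token qualifies.

```java
// Hypothetical sketch of an O(log(n)) binary chop over token start positions.
final class TokenFindSketch {
    // Returns the greatest index i such that positions[i] <= p,
    // or -1 if every token starts after p.
    // The positions array must be sorted in ascending order.
    static int find(int[] positions, int p) {
        int lo = 0;
        int hi = positions.length - 1;
        int result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;  // midpoint without int overflow
            if (positions[mid] <= p) {
                result = mid;           // candidate; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;           // token starts after p; look earlier
            }
        }
        return result;
    }
}
```

With tokens starting at positions 0, 5, 9 and 14, a search for position 7 yields index 1, the token that starts at 5.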

change

public int change(int start,
                  int len,
                  int newLen)
Sets the position of an edit, the length of the text being replaced, and the length of the replacement text, to prepare for rescanning.

Parameters:
start - Position of edit.
len - Length of text being replaced.
newLen - Length of the replacement text.
Returns:
Index of the token at which rescanning will start.
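
One way to derive the three arguments is from Swing document events: an insertion replaces nothing, and a removal is replaced by nothing. The sketch below assumes the text lives in a javax.swing.text.Document. ChangeSink and ChangeReporter are hypothetical names, with ChangeSink standing in for a scanner's change method.

```java
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;

// Hypothetical callback with the same shape as change(start, len, newLen).
interface ChangeSink {
    int change(int start, int len, int newLen);
}

// Translates Swing document events into change(start, len, newLen) calls.
final class ChangeReporter implements DocumentListener {
    private final ChangeSink sink;

    ChangeReporter(ChangeSink sink) {
        this.sink = sink;
    }

    @Override
    public void insertUpdate(DocumentEvent e) {
        // Insertion: no text is replaced, e.getLength() characters are new.
        sink.change(e.getOffset(), 0, e.getLength());
    }

    @Override
    public void removeUpdate(DocumentEvent e) {
        // Removal: e.getLength() characters are replaced by nothing.
        sink.change(e.getOffset(), e.getLength(), 0);
    }

    @Override
    public void changedUpdate(DocumentEvent e) {
        // Attribute-only change: the text is unchanged, nothing to report.
    }
}
```

A highlighter could register such a listener with doc.addDocumentListener(...) and pass a lambda that forwards to its scanner and records the returned token index.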

position

public int position()
Finds out at what text position any remaining scanning work should start.

Returns:
Position of remaining scanning work or -1 if scanning is complete.

initSymbolTable

protected void initSymbolTable()
Creates the initial symbol table. This can be overridden to enter keywords, for example. The default implementation does nothing.


remove

public void remove(int symType)
Removes all symbols of the given type from the symbol table. This can be used to reset part of the symbol table while retaining symbols of other types.

Parameters:
symType - Type of symbol to remove from table.

lookup

protected Symbol lookup(int type,
                        String name)
Looks up a symbol in the symbol table. This can be overridden to implement keyword detection, for example. The default implementation just uses the table to ensure that there is only one shared occurrence of each symbol.

Parameters:
type - Type of symbol to look for.
name - Name of symbol to look for.
Returns:
Symbol found, or new symbol if not present in lookup table.
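
Keyword detection through lookup might look like the following self-contained sketch. It does not extend the real Scanner class: KeywordLookupSketch, Sym, the type codes and the keyword list are all hypothetical, and for brevity the table is keyed by a string built from type and name rather than by the symbol itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of keyword detection in an overridden lookup().
final class KeywordLookupSketch {
    // Assumed type codes; the real values live in TokenTypes.
    static final int WORD = 1, KEYWORD = 2;

    // Hypothetical stand-in for the library's Symbol class.
    static final class Sym {
        final int type;
        final String name;

        Sym(int type, String name) {
            this.type = type;
            this.name = name;
        }
    }

    private final Map<String, Sym> table = new HashMap<>();

    KeywordLookupSketch() {
        initSymbolTable();
    }

    // Plays the role of an overridden initSymbolTable(): pre-enter keywords.
    void initSymbolTable() {
        for (String k : new String[] {"if", "else", "while"}) {
            table.put(key(KEYWORD, k), new Sym(KEYWORD, k));
        }
    }

    // Plays the role of an overridden lookup(): a word whose name was entered
    // as a keyword is returned as the keyword symbol; anything else is shared
    // through the table, created on first use.
    Sym lookup(int type, String name) {
        if (type == WORD) {
            Sym keyword = table.get(key(KEYWORD, name));
            if (keyword != null) return keyword;
        }
        return table.computeIfAbsent(key(type, name), k -> new Sym(type, name));
    }

    private static String key(int type, String name) {
        return type + ":" + name;
    }
}
```

With this arrangement the scanner can classify every identifier-like token as a word and leave the keyword decision to the table.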

scan

public int scan(char[] array,
                int offset,
                int length)
Scans or rescans a given read-only segment of text. The segment is assumed to represent a portion of the document starting at position(). If the result is 0, the call should be retried with a longer segment.

Parameters:
array - Text to scan.
offset - Offset within array at which to start scanning.
length - Number of characters to scan, starting at offset.
Returns:
Number of tokens successfully scanned, excluding any partial token at the end of the text segment but not at the end of the document.


Copyright Jo Wood, 1996-2009, last modified, 17th April, 2009