Okapi-Pack

Centre For Interactive Systems Research
City University
London EC1V 0BH


Appendix E:   Making an Okapi Database

Making an Okapi database is a two-stage process:

  1. Conversion of the source data to Okapi Exchange Format,
  2. Conversion to Okapi Runtime Format.


1. Okapi Exchange Format.

This is very simple. Every record in the database must have the same number of fields (though any fields in a record may of course be empty). The maximum number of fields is 31 except for databases of "text" type (intended for free text, where records may be up to several megabytes long), in which case there may not be more than four fields. Each field in an exchange format record is terminated by an end of field character (field_mark) and each record by an end of record character (record_mark). These may be any two distinct characters which do not occur in the raw data; unless otherwise specified the final conversion program expects them to be 0x1E and 0x1D respectively.

Following the final field, but preceding the record mark, there may be one or more additional temporary fields which will provide part of the contents of the "fixed" area at the beginning of the corresponding runtime record. The first of these fields, if present, contains a number representing the bitwise "or" of the limit criteria which the record satisfies; other such fields might contain accession/modification dates or other codable information, but nothing is defined at the time of writing.

Thus an exchange record is

    <field contents> <field mark> [ <field contents> <field mark> ] <record mark>

Field contents may be anything that does not contain the field or record mark characters; usually they are plain ASCII text. Historically, certain characters had special meanings in certain types of field (see Field Types), but in general they do not.
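The record layout above can be sketched in a few lines of Python. This is only an illustration, not part of Okapi: the function name encode_record is hypothetical, and the marks are the defaults (0x1E and 0x1D) mentioned earlier.

```python
FIELD_MARK = b"\x1e"   # default field terminator expected by the conversion program
RECORD_MARK = b"\x1d"  # default record terminator


def encode_record(fields):
    """Serialise one record: each field is followed by a field mark,
    and the whole record is closed by a record mark."""
    out = bytearray()
    for field in fields:
        # Field contents must not contain either mark character.
        if FIELD_MARK in field or RECORD_MARK in field:
            raise ValueError("field contains a field or record mark")
        out += field + FIELD_MARK
    out += RECORD_MARK
    return bytes(out)


# An empty field still gets its field mark, so the field count stays constant.
record = encode_record([b"A title", b"", b"An author"])
```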

The indexing application, "indexer", expects to find a text file containing the exchange format database, which it converts to an Okapi runtime database. If, however, you are not using "indexer", it is often unnecessary to hold any data in exchange format: the output of the program or script that converts your raw data to exchange format can be piped directly into the program that converts exchange format to runtime format.

1.1. Converting to exchange format

In general, you have to write your own program or script to do this. However, quite a lot of bibliographic-type data is in some kind of ISO 2709-related form (e.g. all the varieties of MARC). We have examples of programs which convert this type of data, but they always need customising for any new application. For most raw data, though, it is a matter of hacking at awk, perl or similar scripts until the results seem to be satisfactory. It is only by experience of a particular data source that one learns what types of error and exception to look for, what is safe and what is not.
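As an illustration of the kind of script meant here, the following Python sketch converts tab-separated lines to exchange format. It is not one of the example programs mentioned above. The input layout (one record per line, fields separated by tabs), the field count NUM_FIELDS, and the function names are all assumptions made for the example; real data will need its own checks.

```python
import sys

FIELD_MARK = b"\x1e"   # default field terminator
RECORD_MARK = b"\x1d"  # default record terminator
NUM_FIELDS = 3         # hypothetical: every record must have the same number of fields


def convert(lines, out):
    """Write one exchange record per tab-separated input line; return the record count."""
    count = 0
    for line in lines:
        line = line.rstrip(b"\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(b"\t")
        if len(fields) > NUM_FIELDS:
            raise ValueError("record %d: too many fields" % (count + 1))
        # Short records are padded with empty fields.
        fields += [b""] * (NUM_FIELDS - len(fields))
        for field in fields:
            if FIELD_MARK in field or RECORD_MARK in field:
                raise ValueError("record %d: mark character in data" % (count + 1))
            out.write(field + FIELD_MARK)
        out.write(RECORD_MARK)
        count += 1
    return count


def main():
    # Read raw data from stdin and write exchange format to stdout.
    convert(sys.stdin.buffer, sys.stdout.buffer)
```

Saved as a script that ends with a call to main() (the name to_exchange.py is hypothetical), the output can be piped straight into the runtime conversion program, which reads from stdin, so that no exchange format file has to be kept:

    python3 to_exchange.py < raw.tsv | convert_runtime -c /okapi/databases med.sample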


2. Okapi Runtime Format.

This is almost equally simple. A record is:

    <fixed field> <field directory> <field> [ <field> ] [ <padding> ]

There are no field or record marks.


Table 2.1

Fixed field
    If the database has limiting facilities the first two bytes of the fixed field contain the record's limit mask as a 16-bit unsigned value.

Field directory
    For a non-text database this consists of a 16-bit unsigned field length for each data field. For a text database these directory fields are 24 bits long. Each one contains the length in bytes of the corresponding data field.

Fields
    May contain anything, or nothing. It is not normal for databases other than databases of type "text" to contain newline characters. Interfaces to search programs would normally format to suit the required display.

Padding
    Ultimately, database records have to be addressed by their offset in a disk file or sequence of files. This addressing is limited to 31 (or possibly only 30) bits. This would limit the total size of a database to about two gigabytes or less. Hence records may be padded on the end, if necessary, so that their length is a multiple of a small power of two. Increasing this power by one doubles the maximum possible size of the database. This information is recorded in the database parameters as "rec_mult" (see Database Parameters). For example, if rec_mult is 4, the maximum size of the database will be eight gigabytes. Of course if rec_mult is large compared to the mean record length rather a lot of space will be wasted; the mean amount of wasted space per record is (rec_mult - 1)/2 bytes. Any character may be used for padding; the runtime conversion program actually inserts plus signs.
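The padding arithmetic can be checked with a short Python sketch. The function names are hypothetical, and 31 address bits are assumed (the text allows that it may be only 30); the mean-padding figure further assumes that record lengths are evenly spread modulo rec_mult.

```python
ADDRESS_BITS = 31  # assumption: the text says 31, or possibly only 30


def padded_length(record_length, rec_mult):
    """Round record_length up to the next multiple of rec_mult."""
    return -(-record_length // rec_mult) * rec_mult


def pad_record(record, rec_mult):
    """Pad a record with plus signs, as the runtime conversion program does."""
    return record + b"+" * (padded_length(len(record), rec_mult) - len(record))


def max_database_bytes(rec_mult, address_bits=ADDRESS_BITS):
    """Largest addressable database when offsets count units of rec_mult bytes."""
    return rec_mult * 2 ** address_bits


def mean_padding(rec_mult):
    """Mean wasted bytes per record: (rec_mult - 1)/2."""
    return (rec_mult - 1) / 2


print(padded_length(1001, 4))            # 1004
print(max_database_bytes(4) // 2 ** 30)  # 8 (gigabytes)
print(mean_padding(4))                   # 1.5
```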


2.1. Converting to runtime format

Once the database has been converted into exchange format it must be converted into Okapi runtime format. The result is a database that is searchable by internal record number.

There is a standard program called convert_runtime to do this. It reads from stdin and writes a runtime bibfile (in <OKAPI_ROOT>/bibfiles), directory file, and (in the case of text-type databases) a paragraph file. It also fills in certain information in the main database parameter file which must exist and be writable before the program can run. The main database parameter file is found in <BSS_PARMPATH>.

"convert_runtime" is called by indexer with the following parameters:

convert_runtime -c <BSS_PARMPATH> <db_name> < <exchange format file>

e.g.

convert_runtime -c /okapi/databases med.sample < /okapi/datafiles/med.exch
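If the conversion is driven from a script rather than from "indexer", the same call can be made from Python with the standard subprocess module. This is a sketch only: the helper names are hypothetical, and the use of check=True assumes that convert_runtime signals failure through a non-zero exit status, which is not stated here.

```python
import subprocess


def convert_command(parm_path, db_name):
    """Build the argument list for the call shown above."""
    return ["convert_runtime", "-c", parm_path, db_name]


def run_conversion(parm_path, db_name, exchange_file):
    """Run convert_runtime with the exchange format file on its stdin."""
    with open(exchange_file, "rb") as src:
        # Assumption: a non-zero exit status means the conversion failed.
        return subprocess.run(convert_command(parm_path, db_name),
                              stdin=src, check=True)


# Equivalent to the example above (requires convert_runtime on the PATH):
# run_conversion("/okapi/databases", "med.sample", "/okapi/datafiles/med.exch")
```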


3. Extended Use of "convert_runtime".

"convert_runtime" is used with very few options when called by "indexer". However, there are many command line switches that can be used with the program. The general use of the command is:

convert_runtime [-c <ctrl directory>] [-a] [-num <maxrecs>] [-treclimits]
    [-fixedlimit <fixedlim>] [-halfcollection] [-version] [-help]
    [-rm <record terminator character>] [-fm <field terminator character>]
    [-phoney_fcno] [-skip <skipnum>] [-checkpoint <interval>] [-nopar]
    <database name> < <input file>

Typing "convert_runtime -help" will list the above switches.

When running the program, the database name (which is also the name of the main parameter file) must come last. The directory holding the parameter file is taken from:

  1. the arg following -c,
  2. the environment variable BSS_PARMPATH, or
  3. the predefined CONTROL_DIR,

in that order.
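The lookup order can be expressed as a small Python function. This illustrates the order of precedence only and is not the program's own code; the function name is hypothetical and the value given for CONTROL_DIR is a made-up stand-in for the predefined default.

```python
import os

CONTROL_DIR = "/okapi/parms"  # hypothetical stand-in for the predefined default


def parameter_directory(c_arg=None, environ=None, control_dir=CONTROL_DIR):
    """Return the parameter directory, trying -c, BSS_PARMPATH, then CONTROL_DIR."""
    environ = os.environ if environ is None else environ
    if c_arg:
        return c_arg                    # 1. the arg following -c
    if environ.get("BSS_PARMPATH"):
        return environ["BSS_PARMPATH"]  # 2. the environment variable
    return control_dir                  # 3. the predefined CONTROL_DIR
```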


Table 3.1. "convert_runtime" switches

-a
    Causes the database to be appended to an existing one of the same name.

-num <maxrecs>
    Limits total database size to <maxrecs> records.

-skip <num>
    Causes <num> input records to be skipped before processing starts.

-checkpoint <num>
    Causes files to be flushed and stats displayed after every <num> output records. The default is 5000; <num>=0 prevents checkpointing.

-treclimits
    Inserts predefined doclength limit bits (see the code).

-fixedlimit <fixedlim>
    ORs <fixedlim> into the limits field of each record.

-halfcollection
    Sets the '1' or '2' bit in the limits field according to whether the record number is odd or even.

-rm <record terminator character>
    Sets the character to be used as the record terminator. Defaults to 0x1D.

-fm <field terminator character>
    Sets the character to be used as the field terminator. Defaults to 0x1E.

-phoney_fcno
    Puts a dummy entry in the second field of each paragraph record (use this arg if field 1 may have more than 1 'word' in it).

-nopar
    Prevents the paragraph file being made (only applies to a text database).
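The effect of -fixedlimit and -halfcollection on the limits field can be sketched as follows. This is a reading of the descriptions above, not the program's code: the function name is hypothetical, and it is assumed that the '1' bit goes with odd record numbers and the '2' bit with even ones.

```python
def limits_field(record_number, fixedlim=0, halfcollection=False):
    """Combine the limit bits set by -fixedlimit and -halfcollection."""
    limits = fixedlim  # -fixedlimit: ORed into every record
    if halfcollection:
        # Assumption: odd record numbers get the '1' bit, even ones the '2' bit.
        limits |= 1 if record_number % 2 else 2
    return limits


print(limits_field(3, fixedlim=4, halfcollection=True))  # 5
```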





Last modified:   12th November 2001