Call (800) 766-1884 for Oracle support & training
Free Oracle Tips

Oracle Consulting Support
Oracle Upgrades
Use New Oracle Features
Oracle Replication Support
Oracle Training
Remote Oracle DBA
System Documentation
Oracle Tips
Oracle Performance
 

Free Oracle Tips


 

HTML Text

BC Oracle tuning

Oracle training

Oracle support

Remote Oracle

Redneck
 


Oracle Intermedia Tips


From the Oracle alchemist:

This is a document that some Oracle iM group guy sent to Markd when we were trying to get the user_datastore to work with GuideStar's site. My apologies for the giant preformatted text chunk.
iMT New Features #0


This is a series of documents covering new features in the product formerly
known as ConText, now known as interMedia Text (with inter in italics), and
referred to here as iMT. They should give you a general understanding of
the basic iMT 8i capabilities. These are not a replacement for the real
documentation, so do not go into exhaustive detail.

Issue #1 is a nutshell introduction to iMT, including a full quick start
example. We'll then move on to indexing, including coverage of all indexing
objects. There's one issue on DML, a few on querying, and we wrap up with
document services.



iMT New Features #1
iMT in a Nutshell


What does it do?
----------------
interMedia Text is an extension to the Oracle database which allows content-
based query -- find documents which contain this word -- on tables of
documents, all using familiar, legal SQL. With iMT, traditionally file-based
text applications can be migrated to Oracle.

The easiest way to grasp what it does is to see it in action...


How do you work this thing?
---------------------------
Here is a quick example. We'll start by creating a table to hold our
documents. For now, we'll use a small varchar2 column to hold the document
text. We'll also need a primary key to identify each document.

create table docs (id number primary key, text varchar2(80));

Nothing special about that. Let's populate it with two example documents:

insert into docs values (1, 'first document');
insert into docs values (2, 'second document');
commit;

Now comes the fun part -- we build a Text index:

create index doc_index on docs(text)
indextype is ctxsys.context;

This is, for the most part, familiar DDL -- the difference is the indextype
clause, which instructs Oracle to build a Text index instead of a regular,
b-tree index.

Once the index has been created, we can run content-based queries on our
huge database of two documents using the contains function:

select id from docs
where contains(text, 'first') > 0;

This will find all rows in docs where the text column contains the word
"first", (which is document 1, for those not following closely). The > 0 part
is necessary to make it legal Oracle SQL, which does not support boolean
return values for functions (at this time).


Integration Architecture
------------------------
In 8i, Text code is more tightly integrated with the database. For those
of you familiar with previous versions of ConText, here's a shocker -- in the
previous example, no ctxsrv is required. The new Data Cartridge facility
in 8i allows us to integrate our text-specific code with the database code.
The result is a specialized "domain index" which works like any b-tree index.
It's all transparent to the user -- it looks like Oracle just knows how to
deal with text. Unlike previous versions, there are no servers to start up,
there is no query rewrite, and creating indexes is done through familiar SQL
rather than through a custom PL/SQL interface.


Next Issue
----------
Next issue concentrates on installing and setting up iMT, then we'll
delve into indexing and the indexing objects.
iMT New Features #2
Installation and Setup


Migrating from Previous Versions
--------------------------------
Due to significant architectural changes, iMT is not backward-compatible
with previous versions of ConText. Existing installations and applications
need to be migrated to work with 8i. There is a whole manual which details
the steps of this process, so I won't go into it here, but you should be aware
that some of those steps need to be completed on the existing system *before*
you install 8i.


Installation
------------
Text is part of the interMedia product suite, so if you choose to install
interMedia, you get Text. iMT can also be installed separately, if you
choose a custom install.


Important Files
---------------
?/bin/ctxsrv
This may be renamed ctxsrv8 or some such. 8i still has the ctxsrv server.
However, the only valid personality mask is M. You need to run it only
when you want background DML. Batch DML (sync) does not require a running
server. We'll go into more depth in the DML issue coming up.

?/ctx/admin
Data dictionary scripts. See below for details.

?/ctx/admin/defaults
Language-specific default preferences. See below, and an upcoming issue on
the preference system for more details.

?/ctx/data
This directory has data files needed by iMT, which includes the
linguistic lexicon files, the Xerox stemming files, and the Korean lexer
dictionary. If you are getting strange internal errors in linguistics,
stemming, etc. check this directory.

?/ctx/bin
This directory is for user-defined filters, which we'll talk about in
the upcoming filter issue. This directory should also contain the ctxhx
program which is used by the INSO filter.

?/ctx/lib
This directory has platform-specific shared libraries and template files
used by the INSO filter. It also contains the safe callout used by
indexing and document services -- libctxx8.so (name will change from
platform to platform -- look for "ctxx")


Data Dictionary Installation
----------------------------
iMT is integrated with the Oracle Database Creation Assistant (DBCA)
so the ctxsys data dictionary should be installed when using this tool.
However, should things go awry, here are the steps to install it manually:

0. Before starting, make sure that:
* The iMT files are installed
* The database does not have a ctxsys user
* Current directory is ?/ctx/admin
* You can sqlplus internal

1. Create the ctxsys user. Pass it the ctxsys password, default
tablespace, and temporary tablespace as arguments.

sqlplus internal @dr0csys

2. Install the data dictionary:

sqlplus ctxsys/ @dr0inst

The argument is the full path to the ctxx library, for instance:

sqlplus ctxsys/ @dr0inst $ORACLE_HOME/ctx/lib/libctxx8.so

3. Install appropriate language-specific default preferences. There
are forty-odd scripts in ?/ctx/admin/defaults which create language-
specific default preferences. They are named in the form drdefXX.sql,
where XX is the language code (from the Server Reference Manual).

To install the US defaults, for instance:

sqlplus ctxsys/ @defaults/drdefus.sql

We'll talk more about this in the upcoming preference system issue.

After these steps, iMT should be installed and working.



Post-Installation Setup
-----------------------
If this database was an existing ConText site, make sure to remove
text_enable from the init.ora. It is no longer used in 8i, and will actually
prevent 8i from operating properly -- you'll get errors like "cannot find
package DR_REWRITE".

Finally, ensure that the Net8 listener is running and is configured to
invoke external procedures. A brief description of the process is below, with
complete details are in the Oracle 8i Server Administrator's Guide.


1. Add an entry to the tnsnames.ora:

extproc_connection_data =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = ipc)
(KEY = DBSID))
(CONNECT_DATA = (SID = ep_agt1)))

DBSID is the database SID. ep_agt1 can be named anything.
extproc_connection_data should not be changed.

2. Add the following to the listener SID_LIST:

(SID_DESC = (SID_NAME = ep_agt1)
(ORACLE_HOME = /oracle)
(ENVS = LD_LIBRARY_PATH=/oracle/ctx/lib)
(PROGRAM = extproc))

ep_agt1 matches the CONNECT_DATA SID for extproc_connection_data
in the tnsnames.ora. The PROGRAM section tells the Net8 listener to
start the external procedure process. The ENVS section, which
is shown here for UNIX, will ensure that the environment includes
?/ctx/lib in LD_LIBRARY_PATH. This is needed so that indexing can
use the INSO filters. On NT, you may need to have ORACLE_HOME set
in this section as well.

3. Since the extproc_connection_data ADDRESS section specifies ipc,
make sure that the ADDRESS_LIST of listener.ora accepts ipc connections.


A quick way to test the Net8 configuration is to do:

exec ctx_output.start_log('log')

from SQL*Plus. If you get the error:

DRG-50704: Net8 listener is not running or cannot start external procedures

then things aren't set up correctly. Some of the things to check:

* listener is not running
* listener.ora is not configured for extproc
* need to reload the listener
* tnsnames.ora is not configured for extproc
* listener does not accept ipc connections


Next Issue
----------
There's a lot of technical information in this issue. The next issue
is a higher-level overview of indexing, followed by a discussion of the
preference system in issue #4.

iMT New Features #3
Indexing


Creating Indexes
----------------
When using iMT, you start by creating a Text index. Without a b-tree
index, value queries are slower; without a Text index, contains queries
are simply not possible. As we've seen in issue #1, the index is created
using the create index command:

create index INDEXNAME on TABLE(COLUMN)
indextype is ctxsys.context

Unlike previous versions, there is no separate policy creation step. We'll
talk more about this in the next issue, which discusses the preference system.
View indexing is not allowed in 8i, consistent with regular b-tree indexes.
Parallel index creation is also not supported in this first 8i version.

Composite indexes are not supported -- only one column is allowed in the
column list. This column must be one of the following types: CHAR, VARCHAR,
VARCHAR2, LONG, LONG RAW, BLOB, CLOB, BFILE. Date, number, and nested table
columns cannot be indexed. Object columns also cannot be indexed, but their
attributes can be, provided they are atomic datatypes.

The table must also have a primary key constraint. This is needed mainly
for identifying the documents for document services, and may be used in the
future for other purposes. Composite primary keys are supported, up to
16 columns.

The issuing user does not need the ctxapp role to create an index. If the
user has Oracle grants to create a b-tree index on the column, then they
have sufficient permission to create a Text index.

Unlike previous versions, the issuing owner, table owner, and index owner
can all be different users, just like regular b-tree indexes.

IMPORTANT: If a syntax error occurs in the create index statement, the
index is still created. This is different from regular b-tree indexes;
before you reissue the corrected statement, you must drop the failed index
first.

BORKBORKBORKBORKBORK
If an error occurs during actual indexing (e.g. you run out of tablespace)
then you can pick up where you left off (after correcting the problem,
of course) using alter index:

alter index INDEXNAME rebuild parameters ('resume')

Storage clauses to the create index statement are not used. Index storage
parameters are provided using the STORAGE object, which we will discuss in
a later issue. Partitioning clauses are also not supported at this time,
although you can manually partition the index tables if you wish, again using
the STORAGE object.

Once the index is created, any export will include the index definition.
At import time, imp will re-create the index by issuing the create index
statement.


The Indexing Pipeline
---------------------
Behind the scenes, Oracle detects that this is a Text index and
fires up the safe callout to perform the text indexing. The index is
created using a pipeline of steps we call "The Indexing Pipeline". (Give
us a break -- we're engineers, not advertising executives)

The diagram below shows the indexing pipeline, if you are reading this in
a fixed-width font:


+-----------+ +----------+ filtered +-----------+
| | doc data | | doc text | |
| datastore |---------->| filter |----------->| sectioner |---------+
| | | | | |--+ |
+-----------+ +----------+ +-----------+ | |
^ | |
| section offsets | |
|column data +---------------------------------+ |
| | |
| V |
+==========+ +----------+ +-----------+ |
| | | | | | |
| database |<-----------| engine |<-----------| lexer |<--------+
| | index data| | tokens | | plain
+==========+ +----------+ +-----------+ text


Let's step through this pipeline, stage-by-stage. Each stage will be
covered in depth in later issues, so I'll describe them only briefly here.

We start with the datastore. This stage loops over the rows of the table
and reads the data out of the column. Unlike previous versions, the rows are
not read in any particular order. The datastore passes out document data.
Usually, this will simply be the column data, but some datastores use the
column data as a pointer to the document data. The URL_DATASTORE, for
instance, uses the column data as a URL, does a GET, and passes out the
returned data.

The filter takes the document data and transforms it to some kind of text
representation. This is needed when storing binary documents such as Word
or Acrobat files. Unlike previous versions, the output of the filter does
not have to be plain text -- it can be a text format such as XML or HTML.

The sectioner, new for 8i, takes the output from the filter, and converts
it to plain text. There are different sectioners for different text formats,
including XML and HTML. Conversion to plain text includes detecting important
section tags, removing "invisible" information, and reformatting the text.

The lexer takes the plain text from the sectioner and splits it into
discrete tokens. We have lexers for whitespace-delimited languages, and
specialized lexers for Asian languages, where segmentation is quite a bit
more complex. The basic lexer also includes theme functionality to build
unified text/theme indexes.

Finally, the engine takes all the tokens from the lexer, the section
offsets from the sectioner, and a list of low-information words called
the stoplist, and builds an inverted index. An inverted index stores
tokens, and the documents in which these tokens occur. For instance,
our example from issue #1:

insert into docs values (1, 'first document');
insert into docs values (2, 'second document');

would produce an inverted index like this:

DOCUMENT ---> doc 1 position 2, doc 2 position 2
FIRST ---> doc 1 position 1
SECOND ---> doc 2 position 1

We don't actually store the document primary key. Instead, we assign each
document an identifier called a docid, and use that in the inverted index.


Logging
-------
Creating an index can take a long time. The safe callout can create a log
file which can be monitored during indexing. start_log starts the logging,
and should be called before issuing the create index, in the same session:

ctx_output.start_log('mylog');

This will create a mylog file in ?/ctx/log which can be used to monitor
indexing progress. The directory can be changed by ctxsys, using the system
parameter LOG_DIRECTORY. See documentation for details.

IMPORTANT: due to an installation bug, ?/ctx/log is not created at install
time. You may get errors like "unable to open file". Simply have the DBA
create this directory.

Logging is halted when the session exits or when end_log is called:

ctx_output.end_log;


Errors During Indexing
----------------------
Processing errors encountered during indexing -- filter errors, file cannot
be read, etc. -- do not bring indexing to a halt. Instead, the error is
logged and indexing continues to the next file.

You can see these logged errors in the ctx_user_index_errors view. It
will tell you the rowid of the failed document, and the error which was
encountered. You need to manually empty this table when you are finished
reviewing the errors:

delete from ctx_user_index_errors;


Next Issue
----------
So far we've seen how to create an index on simple data. Next issue
discusses the preference system, which is used to customize Text indexes to
handle binary data, different languages, and other specialized situations.



iMT New Features #4
The Preference System


Classes, Objects, Preferences
-----------------------------
The preference system allows you to customize Text indexing. Each stage
of the indexing pipeline is really an interface into which customized
objects can be plugged in. These stages are called "classes" in iMT lingo,
and they can be found in the ctx_classes view:

DATASTORE Reads column data and returns document data

FILTER Reads document data and returns formatted document text

SECTION_GROUP Reads formatted document text and returns section
offsets and plain document text

LEXER Reads plain document text and breaks it into tokens

WORDLIST Contains query expansion defaults

STOPLIST Contains a list of low-information words

STORAGE Index table storage parameters

We'll cover each class is specific detail in later chapters. Each class
has one or more "objects" which are like behavioral templates. The DATASTORE
class, for instance, has the following objects:

DIRECT_DATASTORE Passes column data directly as document data

DETAIL_DATASTORE Concatenates multiple rows of a detail table
to construct document data

FILE_DATASTORE Interprets column data as a filename. Reads file
and passes file contents as document data

URL_DATASTORE Interprets column data as a URL. Performs a GET and
passes return as document data

USER_DATASTORE Invokes a stored procedure to synthesize document data

Each class' objects will be covered in more detail in later chapters. Objects
can be found in the ctx_objects view.

But objects are only templates, and cannot be used directly in an index.
Instead, you create a "preference" from one of these object templates,
customize it by setting "attributes", then use the preferences to create the
index. Let's take a closer look:


Creating Preferences
--------------------
You create a preference using ctx_ddl.create_preference, specifying an
object to use as template:

ctx_ddl.create_preference('mypref', 'FILE_DATASTORE');

This creates the preference mypref, using the FILE_DATASTORE object. If you
are familiar with previous versions, the object names have been changed,
and there is no comment argument.

This is done in PL/SQL, so surround it with "begin" and "end", or use "exec"
if you are in SQL*Plus. Also, in order to create preferences, the issuing
owner must have the ctxapp role.

Once the preference is created, we can customize it using set_attribute:
For our example, let's say all our files were in /docs. We can simply
set the PATH attribute to communicate this to iMT:

ctx_ddl.set_attribute('mypref', 'PATH', '/docs');

For ConText experts, set_attribute now comes *after* create_preference,
which allows us to do attribute-level validation, and necessitates the first
argument, which is the preference name.

This attribute is a text attribute, so we set it to a text value. There are
also number attributes and boolean attributes, which should be set to 'TRUE'
or 'FALSE' (or 'YES'/'NO'). The attributes for each object and their types
can be found in view ctx_object_attributes or in the documentation.

Some attributes have a list of allowed values. The allowed values can be
found in the view ctx_object_attribute_lov. Unlike previous versions, LOV
attributes can be set using a descriptive "tag" rather than a numeric value.


Non-preference Classes
----------------------
Preferences are used for all classes except the SECTION_GROUP and STOPLIST
classes. These classes use specialized objects and have their own API.
We'll discuss this in the respective upcoming issues.


Using Preferences
-----------------
Once you've created your preferences, you can build all kinds of customized
indexes by mixing and matching them, chinese menu-style. Because all objects
of a class adhere to a common interface, any filter can be hooked up to any
datastore, etc. Got a bunch of Korean Word 97 files? No problem -- hook
up a FILE_DATASTORE preference, an INSO_FILTER preference, and a KOREAN_LEXER
preference.

You plug in your preferences using the parameters clause of create index:

create index doc_index on docs(text)
indextype is ctxsys.context
parameters ('datastore mypref')

This tells create index to use mypref as the datastore. Each of the classes
has a parameters keyword to name a preference. The keyword is the same as
the class name, except for SECTION_GROUP, whose keyword is the phrase
"section group". You can specify multiple classes by simply tacking the
keyword-preference pairs on the end of the string:

parameters('datastore mypref filter myfilter')

Any user can use any preference. To specify a preference in another user's
schema, simply add the owner name in the parameters string:

parameters('datastore kenny.mypref')

Unlike previous versions, the index makes a copy of the preference settings,
which means that after use in an index, the preference can be modified or
dropped without affecting the index. Note that only the preference owner can
modify or drop it.


The Default System
------------------
In the example above, we are setting only the datastore class. Since we
did not specify preferences for the other classes in the parameters string,
they get their values from the default system.

The default system uses system parameters. System parameters are name-
value pairs which apply to the iMT installation as a whole -- kind of our
version of init.ora. They can be seen in the view ctx_parameters.

The values for the parameters used by the default system are preference
names. For instance, in our example we have not specified a lexer preference.
iMT gets the value of the system parameter DEFAULT_LEXER, and uses the
preference named there. This allows the administrator to set the default
behavior for the system.

The set of parameters used by the default system are:

DEFAULT_DATASTORE
The default datastore preference. At install time (the default default)
it is set to CTXSYS.DEFAULT_DATASTORE, a preference which uses the
DIRECT_DATASTORE object.

DEFAULT_FILTER_BINARY
This is the filter preference to use when indexing binary datatypes such
as RAW, LONG RAW, or BLOB. At install time it is set to
CTXSYS.INSO_FILTER, which uses the INSO_FILTER object.

DEFAULT_FILTER_FILE
This is the filter preference to use when indexing files, either BFILE
or the FILE_DATASTORE. At install time it is set to CTXSYS.INSO_FILTER.

DEFAULT_FILTER_TEXT
This is the filter preference to use when indexing text datatypes such
as CHAR, VARCHAR2, or CLOB. At install time it is set to
CTXSYS.NULL_FILTER, which does no filtering.

DEFAULT_SECTION_HTML
This is the section group to use when the filter class uses the INSO
filter, or the datastore uses the URL_DATASTORE. The INSO filter
filters binary files to HTML. More on this in the filter issue.
At install time it is set to CTXSYS.HTML_SECTION_GROUP, which merely
converts the HTML to plain text.

DEFAULT_SECTION_TEXT
This is the section group to use in all other cases. At install time
it is set to CTXSYS.NULL_SECTION_GROUP, which does no sectioning.

DEFAULT_LEXER
This is the default lexer preference. At install time it is set to
CTXSYS.DEFAULT_LEXER. The definition of that preference is language-
specific. See below for more details.

DEFAULT_WORDLIST
This is the default wordlist preference. At install time it is set to
CTXSYS.DEFAULT_WORDLIST. The definition of that preference is language-
specific. See below for more details.

DEFAULT_STOPLIST
This is the default stoplist. At install time it is set to
CTXSYS.DEFAULT_STOPLIST. The contents of that stoplist are language-
specific. See below for more details.

DEFAULT_STORAGE
This is the default storage preference. At install time it is set to
CTXSYS.DEFAULT_STORAGE, which has no settings.


Language-Specific Defaults
--------------------------
The LEXER, WORDLIST, and STOPLIST classes are the most affected by the
language of the documents. Every installation gets DEFAULT_LEXER,
DEFAULT_WORDLIST, and DEFAULT_STOPLIST preferences, but the definition of
these preferences will depend on the language chosen at install time.

?/ctx/admin/defaults holds language-specific scripts for each Oracle-
supported language. Based on the language chosen, DBCA runs the matching
language-specific script, which creates the default preferences most
appropriate for that language.

The result is that a German instance gets a DEFAULT_LEXER which is case-
sensitive and does German decompounding, a DEFAULT_STOPLIST which has common
German words, and a DEFAULT_WORDLIST with German stemming. A Japanese
instance gets different defaults, more tailored to Japanese.


Browsing the Data Dictionary
----------------------------
Here is a list of views useful for browsing the iMT data dictionary:

ctx_classes Available classes

ctx_objects Available objects for each class

ctx_object_attributes Attribute names and types for each object

ctx_object_attribute_lov Allowed values for each LOV object attribute

ctx_preferences Preferences

ctx_preference_values Attribute values of each preference

ctx_user_index_objects Objects used by the user's indexes

ctx_user_index_values Attribute values for the user's indexes


Next Issue
----------
We will now cover each class and their objects in a bit more detail,
starting with datastores in issue #5, and continuing down the indexing
pipeline in subsequent issues.
iMT New Features #5
Datastores


The DATASTORE class
-------------------
The datastore class is responsible for reading the column data from the
database, getting the document data based on the column data, and passing that
up to the filter.

It is important to keep in mind that iMT is an extension to the database,
and so only works when some form of the document data is already in a table.
Datastore objects which access external resources such as files and web pages
still require a table of pointers to those resources. iMT does not, at least
in this version, crawl to discover new documents or access out-of-database
foreign data sources.

Datastore preferences are plugged into indexes using the datastore keyword:

create index
...
parameters ('datastore mydstore');

The datastore class has five objects.


DIRECT_DATASTORE
----------------
The direct datastore is the simplest case -- it assumes that document
data is stored in the indexed column. It is so simple it has no attributes
to customize. There should be no reason to create a preference based on
the DIRECT_DATASTORE object -- CTXSYS.DEFAULT_DATASTORE can be used for
any index.

IMPORTANT: If you are indexing BFILEs, make sure the user ctxsys has
READ permission on the BFILE directory.


FILE_DATASTORE
--------------
The file datastore reads the column data as a filename. It opens the file
and returns the contents as the document data. The indexed column cannot be
LONG or LOB types. The files must be accessible from the database host
machine.

This object has one attribute: PATH. PATH is optional, and specifies the
directory where the documents are stored.

IMPORTANT: If PATH is specified, the column data must be simple file names;
if not specified, the column data must be full path names. You cannot mix and
match -- for instance, with PATH of /A, and a column of B/C.TXT it will NOT
find /A/B/C.TXT.

On some platforms, PATH can specify multiple directories. If you do this,
make sure that your documents have unique names -- if a document C.TXT is in
directories /A and /B, and /A is in the PATH before /B, only /A/C.TXT will
be indexed.

There is not much of a difference between this object and BFILEs. It's a
matter of choice. I personally think you should try to use BFILEs, since
this is a mainstream Oracle feature and thus should be improved over time.
Like BFILE's, when the file changes, the row will not be automatically
reindexed. You should update the column value to itself to trigger a reindex.


URL_DATASTORE
-------------
The URL datastore reads the column data as a URL. It performs a GET and
returns the contents as the document data. The URLs for the documents must
in the table -- the URL datastore does not crawl. The indexed column cannot
be LONG or LOB types.

http, ftp, and file protocols are supported. The URL datastore also has
multi-threaded read (on most platforms) to maximize throughput. It is very
customizable, and has several attributes:

TIMEOUT Read request timeout in seconds
MAXTHREADS Maximum number of threads to use for multithreaded read
URLSIZE Maximum length of an URL
MAXURLS Maximum number of URL's to read at one time
MAXDOCSIZE Maximum length of document
HTTP_PROXY HTTP proxy
FTP_PROXY FTP proxy
NO_PROXY Domain for proxy exclusion

Like the file datastore, if the web page changes, then you should manually
update the column value to itself to trigger the reindex.


DETAIL_DATASTORE
----------------
Many legacy applications store text line-by-line, in multiple rows of a
detail table. The detail datastore constructs documents by concatenating
the text of these lines into a single, coherent document.

The detail datastore must be told the specifics of the detail table:

DETAIL_TABLE Name of the detail table
DETAIL_KEY The foreign key column(s) in the detail table
DETAIL_LINENO The sequence column in the detail table
DETAIL_TEXT The text column in the detail table

For instance, let's say the master and detail tables look like this:

the_master my_detail

ID TITLE M_ID SEQ LINETEXT
1 Grate Expectations 1 1 It was the best of times
1 2 it was the worst of times.

"That's G-R-A-T-E Expectations, also by Edmund Wells."

The attributes of the preference should be set like this:

DETAIL_TABLE my_detail
DETAIL_KEY m_id
DETAIL_LINENO seq
DETAIL_TEXT linetext

There is one final attribute: BINARY. By default, the detail datastore
treats each row as a separate line, and, when concatenating the contents,
will automatically stick a newline at the end of the text for each row.
BINARY set to TRUE suppresses this. In our example, we should set BINARY
to FALSE or simply leave it unset.

The index itself should to be built on the master table:

create index myidx on master(somecolumn)
indextype is ctxsys.context
parameters ('datastore my_detail')

The master table must have a primary key column, just like any other table
indexed by iMT. This primary key column is used to find the corresponding
detail rows, where detail.fk = master.pk.

The indexed column can be any column of allowable type -- the contents are
not used by iMT. If you can afford the space, you can add a dummy char(1)
column called "text" or "body", to make the queries more readable:

select author, title from master
where contains(text, 'best & worst') > 0;

If detail rows are added, removed, or changed without changing the master
text column, you should manually update the text column to itself to trigger
a reindex.

For those of you familiar with previous versions of ConText, this is
similar to the master detail new datastore. The old master detail datastore
is no longer supported


USER_DATASTORE
--------------
The user datastore is new for 8i. You write a stored procedure which,
given a rowid, synthesizes a document. iMT calls this procedure whenever
document data is required. Let's look at an example. Say we have a table
like this:

articles
id number
author varchar2(80)
title varchar2(120)
text clob

and we want to automatically have author and title be part of the indexed
document text. We can write a stored procedure following the user datastore
interface:

create procedure myproc(rid in rowid, tlob in out clob) is
offset number := 1;
begin
for c1 in (select author, title, text from articles
where rowid = rid)
loop
append_varchar_to_lob(tlob, c1.title, offset);
append_varchar_to_lob(tlob, 'by '||c1.author, offset);
dbms_lob.append(tlob, c1.text);
end loop;
end;

This procedure takes in a rowid and a temporary clob locator, and concatenates
all the articles columns into the temp clob. I am assuming a helper procedure
append_varchar_to_lob for simplicity. It is not difficult to write.

To ensure that the DBA has signed off on the code, only ctxsys-owned
stored procedures are allowed for the user datastore. Thus, we need to
do something like this as ctxsys:

create procedure s_myproc(rid in rowid, tlob in out clob) is
begin
appowner.myproc(rid, tlob);
end;

And, we need to make sure that the index owner can execute the stub
procedure, so:

grant execute on s_myproc to appowner

Now, back as the app owner, we create the preference, setting the PROCEDURE
attribute to the name of the ctxsys stub procedure:

ctx_ddl.create_preference('myud', 'user_datastore');
ctx_ddl.set_attribute('myud', 'procedure', 's_myproc');

When we then create the index on articles(text) using this preference, iMT
indexing will see author and title in the document text.

The user datastore can be used for any on-the-fly document synthesis,
including more complex master-detail relationships, nested tables, text
preprocessing, or multicolumn concatenation, like the example.

There are four constraints on the procedures used in user datastores:

1. They must be owned by ctxsys

2. They must be executable by the index owner

3. They cannot issue DDL or transaction control statements like "commit"

4. They cannot be safe callouts or call safe callouts

If you change the stored procedure, indexes based upon it will not be notified,
so you should manually recreate such indexes. iMT cannot tell what you are
doing in the stored procedure, so if the stored procedure makes use of other
columns, and those column values change, the row will not be reindexed. The
row is only reindexed when the indexed column changes.


Next Issue
----------
Next up is the filter class, which has only three objects. That will
be followed by what is shaping up to be a longish issue on sectioners and
section indexing.
iMT New Features #6
Filters

The FILTER Class
----------------
The filter class takes the document data provided by the datastore class,
and filters it to readable text, passing it to the sectioner. For 8i, this
does not have to be plain text -- it can be a text format such as HTML.

Filter preferences are plugged into indexes using the filter keyword:

create index
...
parameters ('filter myfilter');

The filter class has three objects.


NULL_FILTER
-----------
The null filter is used when the document contents are not in a binary
format. It simply passes text from the datastore to the sectioner. The
default CTXSYS.NULL_FILTER preference can be used in any index to employ
the null filter.


CHARSET_FILTER
--------------
The charset filter is new for 8i. It converts documents from a foreign
character set to the database character set. This is most useful for
Japanese customers, who have to deal with two widely-used and incompatible
character sets (and one infrequently-used incompatible character set).

The charset filter has one attribute, CHARSET, which takes the NLS name
of the source character set. A list of NLS charsets can be found in the
Oracle National Language Support Guide. The destination character set
is always the database character set, so does not need to be specified.

Additionally, you can specify JAAUTO, which is a custom setting for Japanese
character set auto-detection. We will automatically detect Shift-JIS, JIS7,
or EUC for each document and convert it to the database charset if needed.


USER_FILTER
------------
The user filter is a method for plugging in your own filters. You write a
filtering program, place it in ?/ctx/bin, and the indexing engine will invoke
it to filter each document. Here's an example -- we'll create an uppercase
filter which will uppercase every word.

We start by creating a program to do the filtering -- in this case, we'll
write a perl script:

#!/usr/local/bin/perl

open(IN, $ARGV[0]);
open(OUT, ">".$ARGV[1]);

while ()
{
tr/a-z/A-Z/;
print OUT;
}

close(IN);
close(OUT);

If you don't know perl, you should learn it, BUT it's not important for now.
Just trust that this script converts a file to uppercase.

User filter programs like this take two arguments. The first argument is
the name of the input file. The second argument is the name of the output
file. Our filter reads the contents of the input file, filters it, and writes
the output to the output file.

We place the program (let's call it upcase.pl) in ?/ctx/bin, and ensure
that it's executable by the oracle operating-system user. Now we create the
preference:

ctx_ddl.create_preference('mypref', 'USER_FILTER');
ctx_ddl.set_attribute('mypref', 'EXECUTABLE', 'upcase.pl');

When this preference is used in an index, the indexing engine will invoke
the user filter program for each document when the text is required.


INSO_FILTER
-----------
The INSO filter automatically recognizes and filters over a hundred different
formats, including Word 97 and Acrobat. The full list can be found in the
documentation.

The INSO filter does not have any attributes at this time, so the
CTXSYS.INSO_FILTER preference can be used in any index needing filtering.

IMPORTANT: This filter outputs HTML, not plain text. Make sure you employ
the HTML section group or all these tags will be indexed. The default system
will employ the HTML section group when the INSO filter is detected. See
issue #3 for more on the default system.

The INSO filter uses code from Inso, "the world leader in filtering and
viewer technology," and Adobe. We have avoided integrating this code
directly, instead following the user filter architecture. The user filter
executable for the INSO filter is ctxhx.

ctxhx requires shared libraries and other files (.flt and .tpt files)
located in ?/ctx/lib. The installer should copy the correct platform-
specific files to this directory. Additionally, ensure that the external
procedure agent has ?/ctx/lib in the LD_LIBARY_PATH environment variable
(PATH on NT). This can be done using the ENVS section in the listener.ora.
On NT you may need to have ORACLE_HOME set in the ENVS section, as well.
See issue #2 for more details.

If you are encountering problems using this filter, it can be run by hand.
First get a formatted binary file. Let's call it testfile.doc. You do:

$ORACLE_HOME/ctx/bin/ctxhx testfile.doc testfile.txt

This should create the file testfile.txt with the HTML representation. The
error messages you get, if any, should help you determine what's wrong.

The INSO filter is supported only on Solaris, HP-UX, AIX, and NT. Other
platforms are not able to use this filter. We are currently working on
filtering solutions for other platforms.


Where's the HTML filter ?
-------------------------
Some of you familiar with previous versions of ConText may be wondering
about the HTML filter. In 8i, this code has been moved to the new sectioner
class, which allows any filter to spit out HTML if it so desires.


Next Issue
----------
Next issue concentrates on the new sectioner class, and covers all aspects
of section indexing.
iMT New Features #7
Section Groups


The SECTION_GROUP Class
-----------------------
The section group class is new for 8i, and incorporates functionality
that was part of the wordlist and filter classes in ConText. It takes a
text format, such as XML or HMTL, as input. It is unique in having two
outputs -- the section boundaries, which go to the engine, and the plaintext,
which goes to the lexer.


Creating Section Groups
-----------------------
Section groups are not created using create_preference -- they have their
own API in ctx_ddl:

ctx_ddl.create_section_group('mygroup','html_section_group');

The first argument is the name of the new section group. The second argument
is the section group type. This specifies the input text format, and tells
the sectioner the rules for detecting section boundaries and transforming to
plaintext. Each section group type is covered briefly below.

An empty section group can transform formatted test to plaintext, but will
not index any section boundaries. You must first tell it which sections to
index by adding sections to the group.

Sections have three important attributes: TAG, NAME, and TYPE. TAG tells the
section group how to recognize the section. Because the section group already
knows the format, you do not need to specify start and end tags; a section in
an HTML section group with TAG of B knows that the section starts with and
ends with , for instance. Tags are unique across the sections of a
section group.

NAME is how you refer to the section in queries. You may want to name the B
tag BOLD, for readability. Multiple tags can be mapped to the same name --
they are then treated as instances of the same section. H1, H2, H3 can all be
treated as instances of the HEADLINE section. We recommend avoiding non-
alphanumeric characters -- such as underscore -- in section names. Using
these characters will force you to escape the names in queries.

TYPE is totally new for 8i. There are three different types of sections:
ZONE, SPECIAL, and FIELD.


Section Types
-------------
ZONE sections are like sections from previous versions. iMT records where
in the document the section start and end tags occur. WITHIN queries check
that the hit words occur between the start and end word offset.

If a ZONE section repeats, each instance is treated separately in query
semantics. See Examples, below. ZONE sections can enclose other sections,
including themselves, and can be enclosed by other sections. ZONE sections
are added like this:

ctx_ddl.add_zone_section('groupname', 'sectionname', 'tag');

SPECIAL sections are so named because they are not recognized by tags.
There are two special sections in 8i -- SENTENCE and PARAGRAPH -- both
recognized by punctuation in the lexer. They index the start and end of
sentences and paragraphs, respectively. You add special sections like this:

ctx_ddl.add_special_section('groupname', 'sectionname');

No tag argument is needed. There are only two allowed values for section
name: SENTENCE, and PARAGRAPH.

ZONE and SPECIAL sections index section start and end word offsets, but do
nothing to the words in the section. FIELD sections, on the other hand,
extract their contents and index them separately from other words in the
document. WITHIN queries are run on this separate, smaller index. This makes
field section query faster than zone section query -- up to three times as
fast in tests we've conducted -- especially when the section tags occur in
every document.

This speed comes at the cost of flexibility. FIELD sections are meant for
non-repeating, non-overlapping sections. If a field section repeats, it is
treated as a continuation of the section, not a distinct instance. If a field
section is overlapped by itself or by another field section, it is implicitly
closed at the point where the other section starts. Also, there is a maximum
of 64 field sections in any section group. This is 64 section names, not
tags. Remember that you can map multiple tags to the same section name. You
add field sections like this:

ctx_ddl.add_field_section('groupname', 'sectionname', 'tag');

This can be overwhelming, so let's work through a couple of examples to
illustrate the different types of sections and the impact on query semantics.
We'll use the following document as an example:

ratox
tiger rabbit
dragonsnake

This is an XML-style markup, but without a DTD, so we will use the basic
section group type:

ctx_ddl.create_section_group('mygroup','basic_section_group');

Let's start with ZONE sections:

ctx_ddl.add_zone_section('mygroup', 'asec', 'a');
ctx_ddl.add_zone_section('mygroup', 'bsec', 'b');
ctx_ddl.add_zone_section('mygroup', 'csec', 'c');

This tells the section group to recognize A tags as the section ASEC, B tags
as the section BSEC, etc. To do section queries, we use the WITHIN operator.
Remember to use the section name and not the tag:

contains(text, 'rat within asec') > 0

This finds the document.

contains(text, 'tiger within asec') > 0

This does not find the document. Although it has "tiger", it does not occur
in the ASEC. If instead of the original setup we had mapped A and B to the
same section name:

ctx_ddl.add_section('mygroup', 'asec', 'a');
ctx_ddl.add_section('mygroup', 'asec', 'b');

Then both:

contains(text, 'rat within asec') > 0
contains(text, 'tiger within asec') > 0

would find the document, because both A and B are treated as ASEC.

An important facet of ZONE section query semantics is that each instance
is treated distinctly. That is, a query like this:

contains(text, '(tiger and rabbit) within bsec') > 0

finds the document, but a query like this:

contains(text, '(rat and ox) within asec') > 0

does not find the document. Although the document has "rat" and has "ox",
and "rat" is in ASEC, and "ox" is in ASEC, "rat" and "ox" are not within
the SAME ASEC. Note that

contains(text, '(dragon and snake) within csec') > 0

finds the document, since they are both in the outer CSEC, even though
the inner CSEC contains only "snake".

Special sections follow the same semantics, so a query like this:

contains(text, '(jumbo and shrimp) within sentence') > 0

is looking for documents with "jumbo" and "shrimp" in the SAME sentence.

Now let's look at field sections. What if in setup we made ASEC a field
section instead of a zone section:

ctx_ddl.add_field_section('mygroup', 'asec', 'a');

Unlike zone sections, each instance of a field section is considered a
continuation of the section. The query

contains(text, '(rat and ox) within asec') > 0

finds the document, although it didn't when ASEC was a zone section. The
field section simply stitches the two instances together. Field sections
work best on sections like BSEC, which does not repeat like ASEC nor
nest like CSEC.

One last subtlety with field sections. If ASEC is a zone section, then

contains(text, 'rat') > 0

finds the document -- it contains the word rat. However, remember that field
sections work by extracting the document contents and indexing them separately.
Thus, if ASEC were a field section, this query would not find the document.
"rat" is in ASEC, which is separate from the document contents.

You can, however, change this by making the field section visible. There
is an optional boolean fourth argument to add_field_section. If this boolean
argument is set to true:

ctx_ddl.add_field_section('mygroup', 'asec', 'a', TRUE);

then the field section is visible, and the section contents are visible to
non-within queries, just like zone sections. This is accomplished by double-
indexing the word -- once as part of the extracted section, and once as part
of the body, so this option has index space cost.


Using Section Groups
--------------------
Section groups are plugged into indexes using the section group keyword:

create index
...
parameters ('section group mygroup');

Unlike previous versions, you do not need to set an attribute in the wordlist
object. You also no longer need to set STARTJOINS and ENDJOINS in the lexer.
However, the WHITESPACE, NEWLINE, and PUNCTUATIONS settings in the basic lexer
affect sentence and paragraph boundary recognition, which then impact special
sections.

There are five section group types.


NULL_SECTION_GROUP
------------------
The null section group is used when there is no section information to
be extracted -- it simply passes the text through to the lexer. Only
special sections can be added to the null section group.


BASIC_SECTION_GROUP
-------------------
The basic section group is for simple XML-style markup without going full-
bore on the DTD stuff. It can be used in custom situations where full XML
is unnecessary. Tags are in the form .... Entities and tag
attributes are not supported. The only processing it does for plaintext is
removal of the markup tags.


HTML_SECTION_GROUP
------------------
The HTML section group is for HTML. Good name, huh? It knows the rules for
HTML, including ISO Latin-1 entity translation, HTML to plaintext conversion,
and omission. It knows HTML 4.0 tags, and has code to deal with unknown
tags.

SCRIPT and STYLE contents, and comments are removed. Contents of the TITLE
section are preserved. We are currently working on META tag information
indexing for future versions.

For those familiar with previous versions, there is no need for KEEP_TAG
anymore -- this is automatic when you add the tag to the section group.


XML_SECTION_GROUP
-----------------
The XML section group is for XML. It does no processing to plaintext
except for entity translation and tag removal. It can handle non-system,
non-parameter entities in the internal DTD. It cannot access external DTD's
in this version.


NEWS_SECTION_GROUP
------------------
The news section group handles newsgroup-style postings -- RFC-1036 in
particular. This format is used in newsgroups and is similar to the one
used for e-mail. Note that while RFC-1036 defines the set of allowable header
fields, we do not enforce this -- any header fields can be used.

Messages in this format consist of a header section, a blank line, and a
body section. The news section group recognizes TAG: as
the start of sections, and as the end of the section. It stops
looking for this pattern when the blank line is encountered.

Translation to plaintext consists of removing all header lines which are
not important to section searching. A message like this:

From: me
To: you
X-ref: 5!oowj

hello! How are you!

with only the from and to tags in the section group, will look like this
to the indexing engine:

me
you
hello! How are you!


Next Issue
----------
Next up is the lexer, followed by a single issue on the other three
indexing classes.
iMT New Features #8
Lexers

The LEXER Class
---------------
The lexer class gets plaintext from the sectioner and splits it into words.
This is not a easy as it sounds, especially for non-whitespace delimited
languages.

Lexer preferences are plugged into indexes using the lexer keyword:

create index
...
parameters ('lexer mylex');

The lexer class has five objects.


BASIC_LEXER
-----------
The basic lexer can be used for most European languages. It is programmed
with default rules for splitting whitespace-delimited text into words. You
can modify its parsing behavior to some extent with the lexical attributes.

Ordinarily, non-alphanumeric characters end a word. SQL*Plus is lexed as
"SQL" and "Plus". The JOINS attributes allow you to modify this behavior by
declaring sets of non-alphabetic characters to treat as valid word letters.

PRINTJOINS are non-alphanumeric characters which are treated as valid
characters, and become part of the word. Declaring PRINTJOINS of "*" lets
SQL*Plus be lexed as "SQL*Plus".

SKIPJOINS are like PRINTJOINS, but they do not become part of the word.
Declaring SKIPJOINS of "*" lets SQL*Plus be lexed as "SQLPlus".

STARTJOINS are like PRINTJOINS, but they only come at the start of the
word. If they occur in the middle of a word, a new word is started. Having
"*" as STARTJOINS lets SQL*Plus be lexed as "SQL" and "*Plus". Multiple
STARTJOINS are allowed at the start of a word.

ENDJOINS are like STARTJOINS, but for the end of the word. They implicitly
end a word, too. So, "*" as ENDJOINS lexes SQL*Plus as "SQL*" and "Plus".
Multiple ENDJOINS are allowed at the end of a word. STARTJOINS and ENDJOINS
used to be important for section searching in previous versions, but with the
new sectioner class, they are no longer needed for that purpose.

Each of these four JOINS attributes is a set of characters. Setting
PRINTJOINS to "_%*" means that "_", "%", and "*" are all PRINTJOINS.

NUMJOIN is a single character which is the numeric join character. NUMGROUP
is the single numeric group character. They are defaulted based on your
NLS_LANG setting. For US, NUMJOIN is "." and NUMGROUP is ",".

Finally, CONTINUATION are those characters which indicate line continuation.
These and the following newline are removed from the text. Default value
is hyphen and backslash.

Then there are three attributes important for sentence/paragraph indexing.
PUNCTUATIONS are the set of punctuation marks (?!. by default). WHITESPACE
is a set of additional whitespace characters other than space and tab, which
you get for free and cannot change. NEWLINE can be set to NEWLINE or
CARRIAGE_RETURN. This controls whether lines are ended by \n or \r\n.

A sentence is then recognized as a PUNCTUATION followed by one or more
WHITESPACE. A paragraph is a PUNCTUATION followed by a NEWLINE. There are
other patterns which are variations on this, but that's the basic idea.

The third set of basic lexer attributes controls term normalization -- the
process of converting words to a standard form. BASE_LETTER is a boolean
attribute which, if set to YES, will convert accented characters to their
unaccented forms.

By default, each word is uppercased during indexing so that queries are
case-independent. MIXED_CASE is a boolean, which, if set to YES, does not
do this. This makes indexing and queries case-sensitive.

ALTERNATE_SPELLING is an attribute which, if set, uses letter transformation
rules for common alternate spellings. It can be set to GERMAN, DANISH, or
SWEDISH. In GERMAN mode, for instance, A-umlaut gets transformed to AE.
Since the lexer is used at both indexing and query time, query for a word
containing A-umlaut will find the word spelled with A-umlaut or AE.

Finally, COMPOSITE controls word decompounding. In German, for instance,
multiple words are often combined into a single string. This makes lexing
difficult, because the words are no longer whitespace-delimited. If COMPOSITE
is set, then the lexer breaks up these compound words, which allows iMT to
index and query as usual. COMPOSITE can be set to GERMAN or DUTCH. Unset
or DEFAULT means no decompounding.

The last set of basic lexer attributes control text and theme indexing.
New for 8i, the basic lexer can generate and index themes for each document,
in addition to splitting it up into words. This merges the functionality
found in the separate THEME LEXER and BASIC LEXER of previous versions.

INDEX_TEXT is a boolean which controls word indexing. INDEX_THEMES
controls theme indexing. Setting INDEX_THEMES to NO makes it a text-only
index. Setting INDEX_TEXT to NO makes it a theme-only index. They cannot,
of course, be both NO. Indexing themes takes longer and uses a bit more
index space, but improves the efficacy of ABOUT queries. We'll talk more
about this when we discuss ABOUT query.

The CTXSYS.DEFAULT_LEXER preference has theme indexing ON for the English
language-specific scripts.


JAPANESE_VGRAM_LEXER
--------------------
Japanese is not a whitespace-delimited language (except in textbooks) so
it is very difficult for computers to pick out individual words. The Japanese
V-gram lexer gets around this problem by indexing overlapping clusters of
characters. The Japanese word ABC might be decomposed into AB and BC, for
instance. Query terms are also decomposed this way. Thus, the contains query
is not looking for words, but for patterns of characters.

The Japanese lexer does not have any attributes in this version.


CHINESE_VGRAM_LEXER
-------------------
Chinese is also not whitespace-delimited, so we have a similar solution
for Chinese, as well. The Chinese v-gram lexer also does not have any
attributes.


CHINESE_LEXER
-------------
New for 8i is a segmenting Chinese lexer, which can actually recognize
some Chinese words as whole entities. The rest of the text is still broken
into v-grams, but this should be more space-efficient and have faster query
than the pure v-gram method. The Chinese segmentation lexer does not have
any attribtues.


KOREAN_LEXER
------------
Korean is whitespace-delimited, but has problems with verbs, which can
have thousands of different forms. Our Korean lexer is a lexicon-driven
engine (using a third-party 3-soft lexicon) which simply eliminates verbs
from indexing. New for 8i is the ability to eliminate adverbs and adjectives,
do various form conversions and perform morphological and segmentation
decomposition.


Next Issue
----------
Next issue we'll finish up indexing with a combined issue on the wordlist,
stoplist, and storage classes.



iMT New Features #9
Other Indexing Classes

The STOPLIST Class
------------------
Not every word in a document is worth indexing. Linguistic lubricant like
prepositions and conjunctions are important for language understanding, but
are not very useful for information retrieval -- they are very common and so
convey very little information about the document by their presence. Most
of the time, it's not worth the space to index these words.

The stoplist class holds a list of these words, called stop words. During
indexing, the engine consults the stoplist and filters out the stop words.
The stoplist class does not use the create_preference API. Instead, it has
its own API:

ctx_ddl.create_stoplist('mylist');

Words are added using add_stopword:

ctx_ddl.add_stopword('mylist', 'the');

Here we have added the article "THE" to the stoplist. You can see the words
in a stoplist using the ctx_stopwords view. A stoplist can have a maximum
of 4095 stopwords. Each can be up to 64 bytes in length.

Stopwords are case-sensitive in a stoplist, but if the index is not case-
sensitive, the stopwords will not be case-sensitive. So, you could add "THE"
and "the" to a stoplist, but if the index is not case-sensitive, there would
be no difference -- you'd just be wasting a stopword slot.

Stoplists are plugged into indexes using the stoplist keyword:

create index
...
parameters ('stoplist mylist');

New for 8i, stopwords can be added to the index without re-indexing:

alter index myidx rebuild parameters ('add stopword AND');

This adds "AND" as a new stopword after indexing. Remember that the index
makes its own copy of everything, so this does not affect the stoplist used to
create the index.

The default stoplist, CTXSYS.DEFAULT_STOPLIST, is language-specific, and
default stoplists are provided for most European languages.


Other Stop Objects
------------------
Stoplists can also hold two other kinds of objects: stop themes and stop
classes. Stop classes are classes of words, rather than individual words.
The only stop class available in 8i is NUMBERS, which stops all numbers from
being indexed. Alphanumeric words, such as 123.456.789 and abc231 are still
indexed. This behavior depends on the lexer to recognize numbers.

Stop themes are used only during theme indexing. The entries added as
stop themes are themed, and these themes AND THEIR CHILDREN are not indexed.
For instance, say you enter "biology" as a stop theme. This will prevent
themes like "Physical Sciences:Biology:Genetics" and "Physical Sciences:
Biology:Botany" from being indexed. This is a made-up example, and does
not actually reflect the knowledge base.

Stop themes are used only during theme indexing, so adding "biology" as a
stop theme will not stop the word "biology" or biology-associated words from
being indexed.


The WORDLIST Class
------------------
The wordlist class is an odd class because it has no effect on indexing.
Instead, it holds fuzzy and stem expansion settings used at query time.

Wordlist preferences are plugged into indexes using the wordlist keyword:

create index
...
parameters ('wordlist mylist');

There is only one wordlist object: BASIC_WORDLIST, with the following
attributes:

STEMMER
This attribute is set to the default stemming type. Stemming is an
expansion of a word to different forms. The stem expansion of GO might
include GO, GOING, WENT, and GONE. Obviously the rules for this vary
from language to language. We'll discuss it a bit more in the issue on
expansion query operators.

STEMMER can be set to one of the following: ENGLISH (inflectional),
(English) DERIVATIONAL, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, and
NULL, which means no stemming. The CTXSYS.DEFAULT_WORDLIST setting is
language-specific.

FUZZY_MATCH
This attribute is set to the default type of fuzzy match. Fuzzy match
is an expansion technique designed to find words close in form, such as
mistyped or mis-OCR'ed versions.

FUZZY_MATCH can be set to one of the following: GENERIC, JAPANESE_VGRAM,
KOREAN, CHINESE_VGRAM, ENGLISH, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH,
and OCR. The CTXSYS.DEFAULT_WORDLIST setting is language-specific.

FUZZY_SCORE
This is the default score floor for fuzzy expansions. New for 8i
is the ability to limit the fuzzy expansion to the best matches.
The fuzzy score is a measure of how close the expanded word is to
the query word -- higher is better. Setting fuzzy score means that
fuzzy expansions below this score will not be produced. You can set
this from 0 to 80.

FUZZY_NUMRESULTS
This is the default maximum number of fuzzy expansions. Setting fuzzy
numresults limits the fuzzy expansion to a certain number of the best
matching words. You can set this up to 5000.


The STORAGE Class
-----------------
The storage class holds storage parameters for the index tables. The Text
index is made up of five oracle objects, so the single storage clause in the
create index statement is insufficient to specify storage.

Storage preferences are plugged into indexes using the storage keyword:

create index
...
parameters ('storage mystore');

There is only one storage object: BASIC_STORAGE, with the following attributes:

I_TABLE_CLAUSE
This attribute is the storage parameters for the I table, which is
the main index table. There will be at least one row per unique word
in your index -- probably multiple rows per word. Each row has a BLOB
which tries to stay inline, making the row a bit large. The value is
simply tacked on to the end of the create table statement, so you could
set it like this:

ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE',
'tablespace foo storage (initial 1K)');

K_TABLE_CLAUSE
This attribute is the same, but for the K table, which is an IOT for
rowid to docid conversion. There will be one row per row in your base
table. Each row consists of a number and a rowid.

R_TABLE_CLAUSE
This attribute is the same, but for the R table, which is a BLOB table
for docid to rowid conversion. There will be only a few rows in this
table, and will have a large BLOB in each row.

N_TABLE_CLAUSE
This attribute is the same, but for the N table, which has invalid
docids waiting for garbage collection. The number of rows in this table
will depend on the frequencies of update and delete DML, and of garbage
collection optimization. Each row is short, holding two numbers. We will
discuss optimization in another issue.

I_INDEX_CLAUSE
This attribute is the same, but for the unique index on the I table.


Indexing Memory
---------------
Those of you familiar with previous versions may be wondering where the
index memory setting is. For new users, iMT builds an inverted index, so
documents are not added to the index one at a time, like a b-tree index.
Instead, information from multiple documents is buffered up and periodically
flushed to the index during indexing. In previous versions, you could set the
size of this buffer in the analogue of the storage class. Setting this
buffer higher reduces index fragmentation but uses more memory. We will be
discussing index fragmentation in a later issue.

In 8i, index memory can be set directly in the parameters string. This
allows you to set index memory on a per-operation basis. The keyword used
is MEMORY:

create index
...
parameters ('memory 2M');

Here we will use a paltry 2 megabytes of memory for indexing. If not
specified, the system parameter DEFAULT_INDEX_MEMORY is consulted for the
systemwide default index memory. At install time, this is 12M.

The amount of memory cannot exceed system parameter MAX_INDEX_MEMORY.
This allows the system administrator to disallow outrageous amounts of index
memory.

The MEMORY keyword is also allowed in the alter index parameter string
for batch DML (background DML uses DEFAULT_INDEX_MEMORY). This allows you to
have different memory sizes for create index and DML, something which is not
possible in previous versions. You simply specify a different memory amount
in the create index parameter string then in the alter index parameter string.


Next Issue
----------
This finishes indexing. Next issue discusses keeping the indexes
up-to-date via DML. iMT New Features #10
DML


iMT DML Processing
------------------
Most text search engines have asynchronous document updates and additions --
changes to the index are usually done in batches, after and separately from
the document changes. This is due to three aspects of inverted indexing:

1. Text indexing a single document is a lot of work. The values are
usually long, and at the very least you have to break it into words.
Depending on the objects, you may also have to do web page requests,
filtering, and HTML parsing.

2. Inverted indexes, composed of lists of documents by word, are best
updated in batches of documents at a time. Updating the index
one document at a time leads to either word lists one document long
or a lot of appending to existing lists.

3. Most text applications are fairly static, having relatively lower DML
frequency, and less need for full consistency semantics than traditional
transaction processing systems.

iMT faces the same challenges. Here is what iMT does for each DML on an
indexed column:

INSERT
The document rowid is placed into a queue, dr$pending, for later addition
to the text index. Queries before this DML is processed will not find
the new document contents.

UPDATE
The old document contents are invalidated immediately, and the document
rowid is placed into the dr$pending queue for later reindexing. Queries
before this DML is processed will not find the old contents, but
neither will it find the new contents.

DELETE
The old document contents are invalidated immediately. No further
DML processing is required -- queries will no longer find the document.

Thus, iMT has synchronous invalidation but asynchronous addition. This
extends to transactions, as well:

delete from table where contains(text, 'delword') > 0;

select * from table where contains(text, 'delword') > 0;
--> you get no rows returned

rollback;

select * from table where contains(text, 'delword') > 0;
--> you get the rows returned

and:

insert into table values (1, 'insword');

select * from table where contains(text, 'insword') > 0
--> you do not get the row

commit;

select * from table where contains(text, 'insword') > 0
--> you still do not get the row

The synchronous invalidation is new to this version. Also new is that the DML
notification is done through integrated kernel code, and not through triggers
as in previous versions.


Processing Additions
--------------------
So how do you process your inserts and updates when they are sitting in the
queue? iMT provides two methods: sync and background.

Sync is the manual way -- you control when and how it's invoked. Each
invocation processes all pending inserts and updates for an index:

BORKBORKBORKBORKBORK

alter index myindex rebuild online parameters ('sync')

The ONLINE keyword is very important. Without this, during the sync operation
queries are blocked and DML on the base table fails. Make sure to use it
for sync operations.

Background DML requires you to start a ctxsrv background daemon in the
operating system:

ctxsrv -user ctxsys/

Once started, the daemon polls the dr$pending queue for DML, and automatically
processes additions as they become available.

You can use either or both -- it's largely a matter of your application
requirements. Background DML scans for DML constantly. This leads to new
additions being indexed automatically and quickly. However, it also tends to
process documents in smaller batches, which increases index fragmentation.
Sync DML is done at the user's request, so usually the batches are larger and
thus there's less index fragmentation. A happy compromise could possibly be
reached by invoking sync through dbms_sql in a dbms_job which runs at longer
intervals. We'll talk more about index fragmentation in the next issue.


The DML Queues
--------------
The dr$pending queue holds the documents waiting to be indexed. It should
be queried through the ctx_user_pending view, which makes it more readable.
Additionally, there is a dr$waiting queue which is used when documents are
waiting to be indexed and they are updated again. The drq_inprog, drq_batches,
and drq_batchno tables from previous versions are all no longer needed.


Next Issue
----------
Next issue we discuss optimization, including a discussion on index
fragmentation. That will finish the indexing side, then we'll shift to
a series of issues on query syntax and semantics.
iMT New Features #11
Optimization


Index Fragmentation
-------------------
The Text index is an inverted index, so essentially consists of words and
lists of documents which contain that word. When new documents are added
to this index, these lists have to be extended. Instead of appending to these
lists, we simply create more rows for the word. However, this can lead to
many rows with sub-optimal, short lists -- called "index fragmentation".
For instance, say we have the following in our index:

DOG DOC 1 DOC 3 DOC 5

Now a new document with "dog" comes along. We will simply add it to the
index as a new row:

DOG DOC 1 DOC 3 DOC 5
DOG DOC 7

Subsequent DML will also create new rows:

DOG DOC 1 DOC 3 DOC 5
DOG DOC 7
DOG DOC 9
DOG DOC 11

This is why background DML generally produces more index fragmentation than
spot invocation of sync: processing DML less frequently in larger batches
allows newly added rows to have longer lists, which reduces the number of rows
in the index table.

Keep in mind that a certain amount of fragmentation is expected in 8i. iMT
will try to keep index table rows below 4000 bytes long, to inline the LOB
and to speed up index row merges used during phrase search, AND, section
search, etc. However, frequent DML will probably result in much more
fragmentation than this.


Document Invalidation
---------------------
During an update or delete, the old document information must be removed
from the index so that queries will no longer hit it. However, because the
inverted index consists of words and rows of occurrences, this would entail
finding all words in the old version of the document (which may not be
available), and removing an occurrence from the lists of those rows in the
index table. This is too much work to do synchronously, and deletes must be
synchronous in the integrated model.

Instead, iMT marks the old document contents as invalid and does not touch
the index. The mark signals queries to remove the document from any query
result sets. However, this leaves the old information in the index, taking
up space in the index tables.


Optimization
------------
To solve these potential problems in the index, you run optimization.
Optimization has two modes: FAST and FULL. FAST optimization targets
fragmentation only:

alter index myindex rebuild online parameters ('optimize fast');

The ONLINE keyword is important -- without it, queries are blocked and DML
fails with an error. FAST optimization runs through the whole index table
and glues fragmented rows together, to a maximum of 4000 bytes for a single
row. This reduces the number of rows in the index table.

FULL optimization does both defragmentation and garbage collection, which
removes the old information left over after document invalidation:

alter index myindex rebuild online parameters ('optimize full');

It loops through each row of the index table, but, unlike fast optimization,
cracks open each row's list, removing old document information. It also
glues separate rows together where possible.

Because garbage collection is more involved and time-consuming than
defragmentation, FULL optimization does not have to be run on the whole
table at one time. Instead, you can run it for a limited period of time:

... parameters ('optimize full maxtime 5')

which means run for a maximum of 5 minutes. At the end of 5 minutes,
it will stop. The next time you run optimization, it will pick up where
it left off. This lets you do a little of bit optimization each night
during off times, and ensure that it is not loading the system when
the off time ends.

Garbage collection is needed only after document invalidation, which
happens only for delete and updates. If your document table is append-only,
meaning that documents are inserted and never changed or removed, then
it should suffice to run only defragmentation.


Optimization Concurrency
------------------------
Unlike previous versions, optimization (either mode) does not block
background DML processing. However, because oracle prevents alter index
operations from running concurrently, sync invocations will be prevented.


Next Issue
----------
That finishes indexing and index maintenance. Next we'll turn our
attention to querying for a couple of issues. iMT New Features #12
Querying


The Basics
----------
If you've been following the issues so far, we've covered all aspects
of building your index and keeping it up to date. However, indexes alone
are worthless -- they are a means to an end, which in our case is content-
based query.

IMPORTANT: Unlike a value query, which is slower without a b-tree index,
contains queries are completely disallowed without a Text index.

You do a query using the contains operator:

select id
from texttab
where contains(textcol, 'query') > 0

The first argument to contains is the name of the text column. The second
is the text query, which is limited to 2000 bytes in length. It returns
a number, which indicates the strength of match. 0 means not a match, so
we constrain it to > 0. The result is that this query finds the id's
of all rows in texttab where the textcol has the word "query".


Scoring
-------
Of course, now you're wondering how to use that return value in the select
list and order by clause. We've provided an operator just for that purpose:

select id, score(1)
from texttab
where contains(textcol, 'query', 1) > 0
order by score(1) desc

The "1" in the score and contains operators is called the "contains label"
and it's a number which links the score in the select list to the contains
in the where clause -- it can be any number, so long as it matches.

Score can be between 0 and 100, but the top-scoring document in the query
will not necessarily have a score of 100 -- scoring is relative, not absolute.
This means that scores are not comparable across indexes, or even across
different queries on the same index. Score for each document is computed
using the standard Salton formula:

3f(1+log(N/n))

Where f is the frequency of the search term in the document, N is the total
number of rows in the table, and n is the number of rows which contain the
search term. This is converted into an integer in the range 0 - 100.

Query operators like AND and OR operate on the scores of their operands.
As we discuss each operator in upcoming issues, we'll discuss their impact
on operand scores.


Where Can I Use contains?
-------------------------
Unlike the transparent query rewrite of previous versions, you can include
a contains clause in any SQL statement using any tool which can issue oracle
SQL. Here is a short list of new places where contains queries can pop up:

* subqueries

* virtual tables

* PL/SQL

* View definitions

* DML (insert as select, e.g.)

We are also no longer subject to the limitation on the length of the contains
SQL statement, although there is a limit on the query term length.


Other Query Methods
-------------------
BORKBORKBORKBORK
For those of you familiar with previous versions, there is no more two-step
or text cursors (a.k.a. in-memory query). If you absolutely must have result
tables, use insert as select. Instead of text cursors, use real PL/SQL
cursors. To sort by score, use an order by clause with a FIRST_ROWS hint.
The hint will avoid an oracle SORT ORDER BY. We'll discuss this further in
an upcoming issue solely on query optimization.

However, we still have count_hits. count_hits is a fast way to get the
hit count for a text query, in PL/SQL:

num_hits := ctx_query.count_hits('indexname','query',TRUE);

The first argument is the name of the index. This should be a string, so
use quotes if you are using a literal value. The second argument is the
query string, and the third is a boolean toggle called "exact". If exact
is TRUE, then the result is accurate. If exact is FALSE, then the result
is only an upper bound. The actual count could be lower. exact FALSE
is faster, because it doesn't screen out invalid documents (see issues on
DML and optimization for more details).

You could also do "select count(*)" with a contains clause, and this is
recommended when you have structured conditions, but count_hits is faster
for text-only queries.


Simple Queries
--------------
Enough about contains -- let's talk about what you can stick in there as
a text query. The simplest query is a single word:

contains(text, 'dog') > 0

You can escape your word with curlies:

contains(text, '{dog}') > 0

This is useful when you want to query on a reserved word or your word has
special characters which are query operators. The list of reserved words
and the query operator characters are in the documentation.

You can query on a phrase just as easily:

contains(text, 'dog my cat') > 0

Unlike internet search engines, this searches for the phrase "dog my cat",
not just any document which has these words.

If your phrase contains a stopword:

contains(text, 'dog the cat') > 0

then the stopword ("the" in this case) is treated as a wildcard, matching any
word. This query would match all the following documents:

dog the cat
dog my cat
dog cat cat

but not this document:

dog cat frog

In other words, the stopword must match something. Stopwords alone
disappear from the query. The query:

contains(text, 'the & cat') > 0

is reduced to

contains(text, 'cat') > 0

Queries on just stopwords will return no rows.


Next Issue
----------
We'll now cover each query operator, in groups. These are not official
groupings, just convenient for this kind of bulletin.

iMT New Features #13
Query Operators Part 1


Boolean Operators
-----------------
Boolean operators are for combining the results of subqueries using boolean
algebra. iMT has three boolean operators:

AND (&)
OR (|)
AND and OR are the basic boolean operators. However, the operands are
returning numeric scores, rather than boolean values, so we've implemented
AND as a minimum of its operand scores, and OR as a maximum of its operand
scores. This results in expected behavior -- if any operand of AND scores
0 (the word is not in the document) -- then the AND retuns 0 (the document
is not returned). You can use either word or symbol: 'dog and cat' is the
same as 'dog & cat'.

NOT (~)
NOT is not the unary negator. Instead, it is "AND NOT". 'dog NOT cat'
returns all document which have "dog" except those which also have "cat".
The score returned is the score of the left child.


Subqueries and Grouping
-----------------------
You can use parentheses for subqueries and precedence, as in:

(dog my cat AND fido) OR horse

Without parentheses, the operators follow a precedence hierarchy which can
be found in the documentation. Since AND is higher precedence than OR,
the parens are not actually needed in the example above. However, they are
needed in this query:

dog my cat AND (fido OR horse)

in order to override the precedence hierarchy.


Scoring Operators
-----------------
The scoring operators operate on the scores of their operands. There
are four scoring operators:

WEIGHT (*)
The weighting operator multiplies a search term's score to make it more
or less important in the query. The multiplier can be from .1 to 10:

contains(text, '(dog*2) AND cat') > 0

This query looks for documents with "dog" and "cat", but the score for
"dog" is multiplied by 2. This makes documents with "dog" more likely
to have a higher score in the result set than those with "cat".
WEIGHT does not have a word equivalent -- you must use the character *.

THRESHOLD (>)
The threshold operator filters out documents below a particular score.
'dog > 10' will result only those documents containing "dog" which
score higher than 10. You can also do it this way:

contains(text, 'dog') > 10

but threshold is implemented at a lower level, so will be faster.
THRESHOLD does not have a word equivalent -- you must use >.

MINUS (-)
The minus operator takes the left operand score and subtracts the right
operand score. 'dog - cat' looks for documents with "dog", but those
documents which also have "cat" will probably score lower than those
without. This is not the same as NOT, which completely filters out
the document if the right operand search term is present.

ACCUM (,)
Accumulate groups several words or phrases, and scores higher when more
of its operands are hit. 'dog, cat, frog' will score any document with
all three words higher than any document with two of the three, which
will score higher than any document with only one of the three. This
is changed from previous versions, where ACCUM merely added the scores
of its operands. You can either use the symbol , or the word ACCUM.


Set Operators
-------------
For those of you familiar with previous versions, the MAXDOC (:) and
FIRST/NEXT (#) operators have been removed. You can use cursor fetching
loops to get the same effect.


Word Expansion Operators
------------------------
Word expansion operators expand a word to find similar forms of the word
in the document set. For 8i, the limit on maximum number of expansions has
been removed. However, the more words a term expands to, the slower the query
will run. There are five word expansion operators:

WILDCARD (% _)
You can use the SQL like wildcards % and _ to find words matching a
pattern. 'do%', for instance, finds all documents with words beginning
with do, such as dog, door, etc. This is done via a like query on the
index table, so a wildcard query like '%do%' will be slow, as a full
table scan of the index table is required.

FUZZY (?)
Fuzzy query finds words with similar form, using a proprietary
algorithm. This is useful for finding mis-typed or mis-OCR'd words.
The fuzzy operator is ?, as in '?dog'. You can change the rules used
for fuzzy and limit the expansion using the wordlist class. FUZZY has
no word equivalent -- you must use the ? operator.

STEM ($)
Stem query finds words with the same stem form, using integrated
Xerox linguistic code. This is useful for finding "GOING" and "WENT"
from "GO", for instance. You can change the settings for stem expansion
using the wordlist class preferences. STEM has no word equivalent --
you must use the $ operator, as in '$go'.

SOUNDEX (!)
Soundex query finds words which sound alike. This no longer uses a
separate table, but is instead a specialized fuzzy expansion. SOUNDEX
has no word equivalent -- you must use the ! operator, as in '!dog'.

EQUIV (=)
Equivalence is for manually inputting the various forms of a word, as
in 'dog = cat', searching for documents with 'dog' or 'cat', treating
them as different forms of the same word. Equiv only works on simple
words -- you cannot equiv phrases. You can either use the symbol = or
the word EQUIV.


Proximity Operator
------------------
Other than phrases, iMT has one proximity operator -- NEAR, which finds
documents where the input phrases are close to each other. The closer
the words are to each other, the higher the score. NEAR has two forms.
The shorthand form uses the ; character;

dog ; cat ; boat

This query finds documents which have dog, cat, and boat, and scores the
document higher the closer they are to each other.

The function form has options to set range and directionality:

NEAR((dog,boat), 10, TRUE)

The first argument to NEAR is the list of words. In the example, we are
looking for dog and boat. The second argument is the maximum allowed span.
In the example, we are constraining it to within a 10-word span. That means
there cannot be more than 10 words between "dog" and "boat". This range can
be up to 100. It is optional -- if omitted, then it defaults to 100. The
third argument is directionality. If TRUE, as in the example, then the words
have to appear in the order specified in the list. If FALSE, the words can
appear in any order, as long as they fit within the specified word span. This
can be omitted, in which case FALSE is the default.


Next Issue
----------
These are the core operators. Next issue we'll look at the rest of
the operators. This will be followed by an issue on query tuning.


iMT New Features #14
Query Operators Part 2


Within
------
The within operator limits a subquery to a particular section:

dog within title

will find all documents with "dog", but only if it occurs in the "title"
section. The name of the section is used rather than the tag. The "title"
section has to have defined in your section group -- you cannot specify
arbitrary tags. If your section name has non-alphanumeric characters, you
should enclose the name in curlies:

dog within {my_title}

which is why we don't recommend using non-alphanumeric characters in section
names. Within is high on the precedence hierarchy, so if you have a subquery,
it is safer to use parentheses to group it:

(dog and cat) within title

If you don't use parentheses, it will search for documents with "dog" anywhere
and with "cat" in the title section:

dog and cat within title == dog and (cat within title)

which is a different query.

If the section is a zone or special section, the subquery is constrained
to a particular instance of the section:

(dog and cat) within sentence

means dog and cat within the SAME sentence. This is different from:

(dog within sentence) and (cat within sentence)

for instance -- which means dog in any sentence, and cat in any sentence.
Section types and query semantics were covered in issue #7.

Within does not nest, so this is not allowed:

(dog within paragraph) within title <---- BAD


ABOUT
-----
BORKBORKBORKBORK
ABOUT is an advanced query operator. We take the input to ABOUT and do
our best to increase precision and recall. If your index has a theme
component (INDEX_THEMES is YES for BASIC_LEXER) then ABOUT does a theme
query. It determines the theme of the input term, and finds documents
with the same theme. This allows a query like this:

contains(text, 'about(canines)')

to find relevant documents even if they don't have the word "canines" in
them -- they might have "dogs" or "wolves" instead, for instance. It will
also filter out documents which have the word "canines" but are not
significantly about canines, such as an article on dentistry, for instance.

The input to ABOUT does not have to be a single term. It can be an
unstructured phrase:

about(japanese banking investments in indonesia)

This phrase will most likely have multiple themes -- banking, Japan, etc. The
documents retrieved by this query will have all those themes. This allows you
to use plainly-worded descriptive phrases instead of rigid query syntax and
still find relevant documents.

If your index does not have a theme component, then ABOUT will do a query
expansion to increase recall -- it will split the input into words, stem each,
then string it back together using accumulate:

about(go home now)

is transformed to

$go,$home,$now

Note that the purpose of ABOUT is to have iMT do what it can to find the
best-fitting documents. It does not have a formulaic definition like the
other operators. The underlying expansion algorithm may change as we do
continual improvement.

ABOUT can be combined with other operators:

contains(text, 'about(dogs) and $cat')

for instance. This allows you to do a combined theme and text search, which
was not as easy to do in previous versions.


Thesaurus Operators
-------------------
The thesaurus operators allow you to expand a thesaurus phrase using a
thesaurus. The SYN operator, for instance, expands a word into synonyms:

SYN(dog) == {dog} | {mutt} | {canine}

The expansion uses a previously loaded thesaurus. This is usually a thesaurus
of your own creation -- iMT does not install a thesaurus by default, although
a sample thesaurus is included in ?/ctx/sample/thes. You can see the expansion
by using the ctx_thes package functions:

declare
output varchar2(80);
begin
output := ctx_thes.syn('dog');
dbms_output.put_line(output);
end;

Thesaurus operators take simple thesaurus terms as input -- they cannot
nest or take expansions as input:

SYN($dog) <-- BAD
SYN(BT(dog)) <-- BAD

We will cover the individual thesaurus operators in the next issue.


SQE
---
SQE's are Stored Query Expressions -- a macro operation. You store a query
string and give it a short name, using the ctx_query PL/SQL package:

ctx_query.store_sqe('cats', 'cat = feline = kitty');

The first argument is the name of the SQE -- "cats". The second argument is
the stored expression. When I use it in a query:

contains(text, 'SQE(cats)')

iMT expands it to the stored query expression, making this equivalent to:

contains(text, 'cat = feline = kitty')

SQE text can even refer to other SQE's:

ctx_query.store_sqe('animals', 'frog | hog | sqe(cats)');

Although if you set up a circular reference, you will get an error at
query time.

There are fundamental changes to SQE in 8i. First, stored query
expressions are no longer tied to an index -- you can use any SQE in any
query on any index. You can even use other user's SQE's by prepending
the owner name:

SQE(bob.cats)

Second, SQE partial results are no longer stored. Instead, SQE works as a
macro operation, replacing lexically during query parsing. Finally, there
is no longer the notion of session-level SQE -- store_sqe does a commit.


PL/SQL
------
For those of you familiar with previous versions, the PL/SQL operator (@)
has been removed. This is due mainly to stricter transactional restrictions
resulting from tighter integration with the kernel


Next Issue
----------
Next issue we'll take a closer look at the thesaurus functionality.
iMT New Features #15
Thesaurus


Overview of Thesaurus Functionality
-----------------------------------
iMT provides a program to load formatted, file-based thesauri into ctxsys-
owned tables in the database. These thesauri can then be used at query
time to expand query terms for improved recall. iMT also provides PL/SQL
functions for browsing the thesaurus while it's in the database, and a program
for exporting the thesaurus back out to a file.

These thesauri are usually customer-defined. iMT provides a sample
thesaurus in ?/ctx/sample/thes, but this is not imported by default.


The Thesaurus File
------------------
The first step in thesaurus import is construction of the thesaurus file.
This is a list of terms in the thesaurus, each followed by their relationship
to other terms. For instance:

canine
SYN mutt
BT mammal
NT wolf

lizard
BT reptile

The first term in this thesaurus is "canine". The second line declares that
"canine" has a synonym, "mutt". The following lines declare that the broader
term of "canine" is "mammal", and a narrower term of "canine" is "wolf".

Terms, such as "canine" or "lizard" in our example above, must be on their
own line, and must be at the start of the line -- no whitespace is allowed
between the beginning of the line and the term.

Relationship lines, such as "SYN mutt", must have whitespace before them.
The relationship word (SYN) is not case-sensitive. You cannot have more than
one relationship word / word pair on a single line:

canine
SYN mutt SYN doggie <--- BAD

Relationships cannot be on the same line as the term:

canine SYN mutt <--- BAD

These rules are required for proper parsing of the thesaurus file. The more
relaxed rules in previous versions were not able to correctly parse certain
cases, especially when terms had common substrings with relationship words,
such as "NT 4.0".


Thesaurus Relationships
-----------------------
The thesaurus file supports the following set of relationship words:

SYN
Synonymity. A SYN B means that A and B mean the same thing.
A SYN B implies B SYN A. A SYN B and B SYN C implies A SYN C.

UF
Use For. Same as SYN.

PT
Preferred term. A PT B means that B is the preferred term for A.
There can be only one preferred term for any thesaurus term. If
multiple PT lines are seen, latter PT lines override previous lines.

USE
Same as PT.

SEE
Same as PT.

BT
Broader Term. A BT B means that B is a broader term of A. A BT B
implies B NT A (A is a narrower term of B). Terms can have multiple
broader terms. Circular references should not be introduced.

You can specify whole hierarchies under a single term using BTn:

dog
BT1 canine
BT2 mammal
BT3 vertebrate
BT4 animal

Here we aren't declaring multiple broader terms of dog, but instead
the whole hierarchy above dog -- canine is the first-level BT of dog,
then mammal is the second-level BT -- meaning that mammal is the BT of
canine. BTn goes up to BT16.

BTP
Broader Term Partative. A BTP B means that B is a broader term of
A in the sense that A is a part of B. Hard drive BTP Computer, for
instance -- Computer is a broader term of hard drive in the sense that
a hard drive is a part of a computer (NCA notwithstanding). A BTP B
implies B NTP A, but does not imply A BT B -- the partative hierarchy
is separate from the normal hierarchy. This relationship is sometimes
called "meronymity", although not by anyone with a life.

BTG
Broader Term Generic. A BTG B means that B is a broader term of A
in the sense that B is a generic name for A. Rats BTG Rodents, for
instance. A BTG B implies B NTG A, but does not imply A BT B. This
relationship is sometimes called "holonymity".

BTI
Broader Term Instance. A BTI B means that B is a broader term of A
in the sense that A is an instance of B. Cinderella BTI Fairy Tale,
for instance. A BTI B implies B NTI A, but does not imply A BT B.
I don't know any fancy name for this relationship.

NT
Narrower Term. The opposite of BT. You can use NTn, just like BTn.

NTP
Narrower Term Partative. The opposite of BTP.

NTG
Narrower Term Generic. The opposite of BTG.

NTI
Narrower Term Instance. The opposite of BTI.

RT
Related Term. A RT B means that A and B are associated, but are not
synonyms and cannot be arranged into a hierarchy. A RT B implies
B RT A. A RT B and B RT C does not imply A RT C.

:
Foreign language translation. can be any label you wish
to use for the language, 10 characters or less. This applies to
the immediately preceding term, even if it is a relationship word:

cat
french: chat
RT hat
french: chapeau

Here "chapeau" is the french translation for "hat", not "cat".

SN
Scope note. You can attach a little note to a term as a comment.
This can be up to 2000 characters. If you need multiple lines,
simply repeat the SN keyword -- the text is concatenated:

cat
SN I had a cat once. He was brown and
SN grey and well-behaved.

Our thesaurus functionality also supports homographic disambiguation
using parenthesis:

mercury (god)
BT Greek mythology

mercury (element)
SYN quicksilver

as well as compound terms using the + sign:

thermometer
SYN temperature + instrument


Importing a Thesaurus File
--------------------------
Once you have the file, you need to import it into the iMT data dictionary.
>>From the OS command line:

ctxload -user ctxsys/ctxsys -thes -name mythes -file mythes.txt

The login user can be any CTXAPP user -- it does not have to be ctxsys.
-thes is the toggle for thesaurus import. mythes is the name for the
new thesaurus, and mythes.txt is the file to import.

If a thesaurus of this name already exists, you will get an error and
you must drop the existing thesaurus before importing this one. Thesaurus
names are database-global.

Optionally, you can specify the thescase argument to make the thesaurus
terms case-sensitive:

ctxload ... -thescase Y

The default is N, meaning that the thesaurus is not case-sensitive. Mercury
(the god) and mercury (the element) are the same term if found in the
thesaurus file.


Thesaurus Query Operators
-------------------------
Once the thesaurus is loaded, any user can use it in any query on any index
using the thesaurus operators. These mimic the relationship names for the
most part. For instance:

contains(text, 'SYN(dog, mythes)') > 0

will search for documents which contain any of the synonyms of "dog" defined
in the "mythes" thesaurus. The actual expansion uses OR like this:

{canine}|{doggie}|{mutt}|{dog}

Homographic disambiguators are not included in the expansion:

mercury (element)
SYN quicksilver

SYN(quicksilver) === {quicksilver}|{mercury}

Compound phrases are treated as a conjunctive:

thermometer
SYN temperature + instrument

SYN(thermometer) == {thermometer}|({temperature}&{instrument})

Thesaurus operators take simple thesaurus terms as input -- they cannot nest
or take expansions as input:

SYN($dog) <-- BAD
SYN(BT(dog)) <-- BAD

Here are the specific thesaurus functions:

SYN( term [, thesname] )
The input term is always part of the output; if no synonyms are found
then term is the expansion. thesname is the name of the thesaurus
to use. If omitted, thesaurus DEFAULT is consulted. There is no
DEFAULT thesaurus installed automatically -- it is up to the user to
load a thesaurus named DEFAULT.

PT( term [, thesname] )
Preferred term. Only the preferred term is returned in the expansion.

BT( term [, level [,thesname]] )
BTP( term [, level [,thesname]] )
BTG( term [, level [,thesname]] )
BTI( term [, level [,thesname]] )
NT( term [, level [,thesname]] )
NTP( term [, level [,thesname]] )
NTG( term [, level [,thesname]] )
NTI( term [, level [,thesname]] )
Broader/Narrower terms of all types. The optional second argument is
a numeric scope indicating how many levels of