InvertedIndex: EDIT Keyword to Map Words to Data Base

Fast full text searches with words in arbitrary order, as typical for search engine indexing algorithms
Phrase searches with words in the specified order (not necessarily adjacent), similar to the use of quotation marks in search engines.

Basic principle without further options:
1. - EDIT(Text=original, IDentification=docname, InvertedIndex=InvIdx) ! to build or update InvIdx
2. - result = EDIT(Text=query, InvertedIndex=InvIdx) ! ⇾ all occurrences of query in original(s)
- To search for the index position of a substring in a string without first building an inverted index, use the INDEX function.
- To build and query a vocabulary of words and/or expressions contained in strings or text files it is more efficient to use the EDIT keyword Lexicon.
- To build an inverted index of one or more files, READ file after file into an in-memory buffer.
  ⇾ How to read a disk file to a character variable and write it back
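
The two-step principle (build, then query) can be illustrated outside HicEst. The following is a minimal Python sketch of an inverted index, not HicEst's internal format or algorithm: the functions build, query_any_order and query_in_order, the 1-based positions and the word definition (runs of letters, digits and underscore) are assumptions made for illustration only.

```python
import re
from bisect import bisect_right
from collections import defaultdict

def tokenize(text):
    """Yield (word, position) pairs; case is ignored, positions are 1-based."""
    for m in re.finditer(r"\w+", text):
        yield m.group().lower(), m.start() + 1

def build(index, text, doc_id="noID"):
    """Step 1: register every word occurrence of text under doc_id."""
    for word, pos in tokenize(text):
        index[word].setdefault(doc_id, []).append(pos)

def query_any_order(index, query):
    """Step 2a: IDs of documents that contain all query words, in any order."""
    words = [w for w, _ in tokenize(query)]
    if not words:
        return set()
    return set.intersection(*(set(index.get(w, {})) for w in words))

def query_in_order(index, query):
    """Step 2b: IDs of documents that contain the query words in the given
    order, not necessarily adjacent."""
    words = [w for w, _ in tokenize(query)]
    hits = set()
    for doc_id in query_any_order(index, query):
        last = 0
        for w in words:
            positions = index[w][doc_id]        # ascending word positions
            i = bisect_right(positions, last)   # first occurrence after `last`
            if i == len(positions):
                break                           # word does not occur later on
            last = positions[i]
        else:
            hits.add(doc_id)
    return hits

index = defaultdict(dict)
build(index, "To be or not to be, that is the question.", "Hamlet")
build(index, "The question is not to be asked.", "Other")
print(sorted(query_any_order(index, "question be")))  # ['Hamlet', 'Other']
print(sorted(query_in_order(index, "be question")))   # ['Hamlet']
```

The any-order query only intersects the ID sets per word, while the in-order query additionally walks the stored positions, which is why positions are kept in the index at all.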

keyword	position	short	action
Text	first, required	T=ORIGINAL	in-memory string to be analyzed.
IDentification	middle optional	ID=name	include a name (e.g. file name) in InvIdx (default is "noID")
EXTRA	middle optional	EXTRA=".()"	include non-alphanumeric 1-character words in InvIdx, e.g. "." or "(" or ")" or linefeed, for use in later queries (e.g. to find line boundaries if EXTRA included the linefeed character $LF)
ID1	middle optional	id1=left	start position of an ID-string in the "ORIGINAL"
ID2	middle optional	id2=right	its end position
Mark1	middle optional	M1	start position of a clipping of ORIGINAL
Mark2	middle optional	M2	its end position
OFfSet	middle optional	ofs=p1	added to InvIdx positions. Useful if ORIGINAL is just a separate clipping of the complete document generated by another statement.
Option	middle optional	opt = 1	1 to respect case (default is 0: ignore case). Use for special problems only, since e.g. "The" and "the" will then be indexed separately.
SorTSequence	middle optional	sts=32	sort the query result (32 means: sort by column 3 first, then by column 2)
InvertedIndex	last, required (except DO)	ii=invidx	receives the inverted index. If the name ID is already indexed, the existing index is updated.
DO	last optional	DO	DO or DO=count can be used to break up ORIGINAL by resetting Mark1 and Mark2 in the loop.
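
The SorTSequence value can be read as a list of column numbers, most significant sort key first. The Python sketch below illustrates that reading of sts=32 on a small made-up table; the helper name sort_rows and the sample rows are hypothetical, and nothing is assumed about the actual column layout of a HicEst query result.

```python
def sort_rows(rows, sts):
    """Sort rows by the columns named by the digits of sts (1-based),
    the first digit being the primary sort key."""
    columns = [int(digit) - 1 for digit in str(sts)]
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))

# Hypothetical three-column rows, for illustration only
rows = [
    ("docB", 20, 1),
    ("docA", 10, 2),
    ("docC", 10, 1),
]
result = sort_rows(rows, 32)   # column 3 first, then column 2
```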

Possible choices to set ID at build:

EDIT(T=..., ID="any string, no LineFeeds", II=...) ! file name is a good choice for later queries
EDIT(T=Full_Doc, [Move], Id1, [Move], ID2, II=...) ! [Move] is any combination of EDITs loc commands
EDIT(T=Part, OFfSet=ofs, ID="anyName", II=...) ! ofs is start byte of Part in Full_Doc

once_only_shakespeare_example
: Build an inverted index of Shakespeare's Complete Works in one go:
- Initialize:
  - CHARACTER ALL*1E6, InvIdx*1E6, found*1E5
  - ! Read works to ALL:
  - OPEN(FIle="E:\TEXTS\Shakespeare\SHAKESPEARE COMPLETE WORKS.txt", UNFormatted)
  - READ(FIle="E:\TEXTS\Shakespeare\SHAKESPEARE COMPLETE WORKS.txt", CLoSe=1) ALL
  - InvIdx = " " ! default initialize
  The all-default statement to build an inverted index from ALL (5,590,193 bytes):
  - EDIT(Text=ALL, ID="Shakespeare", InvertedIndex=InvIdx)
  InvIdx now has 4,211,280 bytes; the build took 22.4 sec on a somewhat antiquated laptop (T6670 @ 2.2 GHz).
frame_part_document
of ORIGINAL with the SetPosition and Mark keywords, e.g.
- EDIT(T=ORIGINAL, SetPos=1124245, Mark1, Right=41, M2, ID="Hamlet", II=InvIdx)
- adds "To be or not to be, that is the question." to InvIdx with ID=Hamlet

Indexing several clippings of ORIGINAL in a single statement can be controlled by Mark1 and Mark2.

The example shown will separately index text clippings between empty lines (2 pairs of carriage-return linefeed)
- EDIT(T=ORIGINAL, M1, Right=$CRLF//$CRLF, M2, II=InvIdx, Right, DO)
IDs written to InvIdx will be the positions M1, M2 in ORIGINAL "; startbyte endbyte"
- TIME(CPU=1) ! forces DO-progress to be shown on Windows status bar
Including something like
- ..., IDentification="E:\Shakespeare\Hamlet.txt", ...
will register the clipping as "E:\Shakespeare\Hamlet.txt; startbyte endbyte" in the IDs section of InvIdx.
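
The effect of the DO loop over Mark1 and Mark2 can be pictured as splitting ORIGINAL at empty lines and recording the start and end byte of every clipping. The Python sketch below only illustrates that idea. The function name clippings is hypothetical, positions are taken as 1-based and inclusive, and the separator is assumed not to belong to the clipping; HicEst may place M2 differently.

```python
def clippings(original, separator="\r\n\r\n"):
    """Yield (start, end, text) for each non-empty clipping between separators.
    start and end are 1-based, inclusive positions in original."""
    start = 0
    while start < len(original):
        end = original.find(separator, start)
        if end == -1:
            end = len(original)          # last clipping runs to the end
        if end > start:
            yield start + 1, end, original[start:end]
        start = end + len(separator)

text = "first block\r\n\r\nsecond block\r\n\r\nthird"
name = "Hamlet.txt"                      # hypothetical ID prefix
ids = [f"{name}; {s} {e}" for s, e, _ in clippings(text)]
print(ids)  # ['Hamlet.txt; 1 11', 'Hamlet.txt; 16 27', 'Hamlet.txt; 32 36']
```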

: The ID1, ID2 markers can be used to extract an individual IDname from ORIGINAL:

CHARACTER TB="by William Shakespeare", TE="THE END"
! Build a complete inverse index of "SHAKESPEARE COMPLETE WORKS.txt":
EDIT(T=ALL, R=TB, L=$CR, Opt=128, L='\a', Opt=0, L=$LF, R, ID1, R=$CR, L, ID2, M1, Opt=1, R=TE, L, M2, Opt=0, II=InvIdx, DO) ! 21.2 sec on T6670 @ 2.2 GHz

The statement above looks terrifying, but is really quite straightforward:

keyword short	full keyword	action
T	Text	defines string ALL as the ORIGINAL
R	Right	find right TB ("by William Shakespeare")
L	Left	find left carriage return ($CR)
Opt	Option	128 == Regex (regular expression)
L	Left	find left \a (alphabetic character, Regex)
Opt	Option	0 == all standard again
L	Left	find left line feed ($LF, CR+LF == new line in Windows)
R	Right	no argument: Move right 1 character
ID1	ID1	start position for IDentification string
R	Right	find right carriage return
L	Left	move left 1 character
ID2	ID2	end position for IDentification string
M1	Mark1	start position for clipping
Opt	Option	1 == case (upper case / lower case)
R	Right	find right TE ("THE END", upper case! )
L	Left	move left 1 character
M2	Mark2	end position for clipping
Opt	Option	0 == all standard again
II	InvertedIndex	defines string InvIdx to receive the result
DO	DO	repeat until error (here: no TB found anymore)

The first entries to the IDs section of InvIdx are:

THE SONNETS; 7920 113221

ALLS WELL THAT ENDS WELL; 113786 263466

THE TRAGEDY OF ANTONY AND CLEOPATRA; 264042 436228

AS YOU LIKE IT; 436783 576608
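
For readers who find the cursor movements hard to follow, the same walk can be sketched in Python: for every "by William Shakespeare", take the last non-blank line before it as the ID, then clip up to the next upper-case "THE END". This is an illustration under assumptions, not a port of the HicEst statement: the function name works is hypothetical, and the exact byte positions HicEst stores for M1 and M2 may differ from the 1-based bounds used here.

```python
TB = "by William Shakespeare"   # begin marker, as in the statement above
TE = "THE END"                  # end marker, matched case-sensitively

def works(all_text):
    """Yield (title, m1, m2): the title line and 1-based clipping bounds."""
    pos = 0
    while True:
        tb = all_text.find(TB, pos)
        if tb == -1:
            return                             # no further TB: the loop ends
        head = all_text[:tb].rstrip()          # up to the last non-blank char before TB
        title = head[head.rfind("\n") + 1:]    # last non-blank line before TB
        te = all_text.find(TE, tb)
        if te == -1:
            return
        yield title, len(head) + 1, te         # from after the title to just before TE
        pos = te + len(TE)

sample = (
    "THE SONNETS\r\n\r\nby William Shakespeare\r\n\r\n"
    "From fairest creatures\r\nTHE END\r\n\r\n"
    "AS YOU LIKE IT\r\n\r\nby William Shakespeare\r\n\r\n"
    "All the world's a stage\r\nTHE END\r\n"
)
ids = [f"{title}; {m1} {m2}" for title, m1, m2 in works(sample)]
```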

add_and_update
of InvIdx is automatic. To update, the ID must be referenced by its full name, e.g. "AS YOU LIKE IT; 436783 576608"
- a new ID is appended in the IDs section, words are integrated in the index
- the index of an existing ID is overwritten
- if updated with Text=" " the index for this ID is removed, but the ID name is retained.
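
The three update rules can be mirrored in a few lines of Python. The class TinyIndex below is a hypothetical stand-in for InvIdx, written only to make the rules concrete; it says nothing about how HicEst stores or rewrites the index internally.

```python
import re

class TinyIndex:
    """Hypothetical stand-in for InvIdx: an IDs section plus word postings."""

    def __init__(self):
        self.ids = []        # IDs in order of first registration
        self.postings = {}   # word -> {doc_id: [1-based positions]}

    def edit(self, text, doc_id="noID"):
        if doc_id in self.ids:
            # existing ID: its old index entries are overwritten
            for word in list(self.postings):
                self.postings[word].pop(doc_id, None)
                if not self.postings[word]:
                    del self.postings[word]
        else:
            # new ID: appended to the IDs section
            self.ids.append(doc_id)
        for m in re.finditer(r"\w+", text):
            word = m.group().lower()
            self.postings.setdefault(word, {}).setdefault(doc_id, []).append(m.start() + 1)

idx = TinyIndex()
idx.edit("to be or not", "A")
idx.edit("be quick", "B")
idx.edit("to sleep", "A")    # overwrites the index of A
idx.edit(" ", "A")           # removes the index of A, keeps the ID name
```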
stopwords
are to be ignored. They are neither registered during build, nor are they relevant in a query.
- EDIT(Text="a and it the", ID="IGNORE", II=InvIdx) ! will ignore these words in all later IDs.
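
As an illustration of the behaviour described above, the following Python sketch treats the reserved ID "IGNORE" as a stopword list that is skipped both at build and at query time. The class StopwordIndex and its methods are hypothetical names; only the "IGNORE" convention itself comes from the text.

```python
import re

def words(text):
    """Return (word, 1-based position) pairs, ignoring case."""
    return [(m.group().lower(), m.start() + 1) for m in re.finditer(r"\w+", text)]

class StopwordIndex:
    def __init__(self):
        self.ignore = set()   # stopwords registered under ID "IGNORE"
        self.postings = {}    # word -> {doc_id: [positions]}

    def edit(self, text, doc_id="noID"):
        if doc_id == "IGNORE":
            self.ignore.update(w for w, _ in words(text))
            return
        for w, pos in words(text):
            if w not in self.ignore:              # not registered during build
                self.postings.setdefault(w, {}).setdefault(doc_id, []).append(pos)

    def query(self, text):
        wanted = [w for w, _ in words(text) if w not in self.ignore]  # not relevant in a query
        if not wanted:
            return set()
        return set.intersection(*(set(self.postings.get(w, {})) for w in wanted))

idx = StopwordIndex()
idx.edit("a and it the", "IGNORE")
idx.edit("The cat and the hat", "doc1")
print(sorted(idx.postings))   # ['cat', 'hat']
print(idx.query("the hat"))   # {'doc1'}
```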
preset_lexicon
to index and analyze later documents with a constant list of exclusive words:
- InvIdx =" "
- EDIT(Text="Include these words only ( $ and underscore OK)", ID="CONST", ii=InvIdx)
InvIdx is set to: FMT=bin4 LEX=CONST ... NWORDS=8 NIDS=1 ...
It contains the word list: " $ and include ok only these underscore words".
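
The word list above can be reproduced with a short Python sketch, which also shows what "exclusive" means for later documents: only words from the constant list are indexed. The names lexicon and index_with_lexicon are hypothetical, and the word pattern (letters, digits, underscore and $) is an assumption inferred from the example text, not a statement about HicEst's tokenizer.

```python
import re

# Assumption: letters, digits, underscore and $ form words; "(" and ")" do not
WORD = re.compile(r"[\w$]+")

def lexicon(text):
    """Return the sorted list of distinct lower-cased words of text."""
    return sorted({w.lower() for w in WORD.findall(text)})

def index_with_lexicon(text, allowed):
    """Index only those words of text that occur in the constant list."""
    postings = {}
    for m in WORD.finditer(text):
        w = m.group().lower()
        if w in allowed:
            postings.setdefault(w, []).append(m.start() + 1)   # 1-based position
    return postings

const = lexicon("Include these words only ( $ and underscore OK)")
print(" ".join(const))   # $ and include ok only these underscore words
later = index_with_lexicon("These are the only words", set(const))
```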
testing_invidx
is facilitated if the index is written in hexadecimal format instead of the default 32-bit binary integer "FMT=int4":
- InvIdx = "FMT=hexN" ! with N = 1..9: hex numbers are written with a minimum width of N characters, adjusted automatically
⇾ Inverted index data structure