[Air-l] neat indexing system

jeremy hunsinger jhuns at vt.edu
Thu Feb 28 04:56:49 PST 2002


actually this is the way the search engine we developed at cddc 
works, so i find it interesting to see another one just like it.
the method is simple:

take any text
strip html
put the text into a row of table(1) with an index number (TID)
parse the text into words
put each of those words into its own row of a new table(2), along with the TID
you can build an index from table(2) easily
	hint:
	you select the unique words using sql (select distinct)
	you match words using sql select ... like
then you rewrite the text as html, with each word linked to its entry in 
the index, and insert that into a column in table(1)
then you can display the text as a hyperlinked index.
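
The steps above can be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite database, not the cddc code: the table names (texts, words), the column names, the /index/<word> link scheme and the helper function names are all assumptions made for the example.

```python
import re
import sqlite3
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


def link_word(match):
    # Hypothetical URL scheme: each word links to /index/<lowercased word>.
    word = match.group(0)
    return '<a href="/index/%s">%s</a>' % (word.lower(), word)


def build_index(docs):
    db = sqlite3.connect(":memory:")
    # table(1): one row per text, keyed by TID, plus the hyperlinked version
    db.execute("CREATE TABLE texts (tid INTEGER PRIMARY KEY, body TEXT, linked TEXT)")
    # table(2): one row per word occurrence, carrying the TID of its text
    db.execute("CREATE TABLE words (word TEXT, tid INTEGER)")
    for raw in docs:
        body = strip_html(raw)
        tid = db.execute("INSERT INTO texts (body) VALUES (?)", (body,)).lastrowid
        words = re.findall(r"[a-z0-9']+", body.lower())
        db.executemany("INSERT INTO words VALUES (?, ?)", [(w, tid) for w in words])
        linked = re.sub(r"[A-Za-z0-9']+", link_word, body)
        db.execute("UPDATE texts SET linked = ? WHERE tid = ?", (linked, tid))
    return db


def index_terms(db):
    # "select unique": the distinct words form the index
    return [r[0] for r in db.execute("SELECT DISTINCT word FROM words ORDER BY word")]


def texts_for(db, pattern):
    # "select like": which texts contain a word matching the pattern
    rows = db.execute(
        "SELECT DISTINCT tid FROM words WHERE word LIKE ? ORDER BY tid", (pattern,)
    )
    return [r[0] for r in rows]
```

For example, after db = build_index(["<p>Love me do</p>", "<b>All you need is love</b>"]), texts_for(db, "lo%") lists both texts, and the linked column of the first text holds the clickable version of its words.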

our (center for digital discourse and culture, myself and my assistants) 
innovations on this design:
1. we added to table2 the ability to attach definitions to individual word 
entries, so that if you click a word to find where it is indexed you 
can put in a definition if one is not present or read the definition 
that is present.  The definitions are all indexed also.
2. we added a table3 made up of 2 to 4 word phrases, which speeds up 
searching and makes proper names much easier to find. I am 
currently developing code that will clean table 3 based upon what people 
tend to search on, ie table 4
3. we save all search strings and the first 3 answers in table 4
4. we began stripping common words in the base parser for table 2 
(the, is, are, an, so, that, etc). this speeds parsing and indexing 
immensely.
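
A rough Python sketch of items 2 and 4, with one guess at the cleanup mentioned in item 2. The function names are made up for the example, the stop list holds only the words named above, and prune_phrases is purely an assumption about what "cleaning table 3" could mean (keep a phrase only if it occurs in some logged search string), not the code under development.

```python
import re

# Only the words named in the post; a real stop list would be longer.
STOP_WORDS = {"the", "is", "are", "an", "so", "that"}


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def index_words(text):
    """Words for table 2, with the common words stripped."""
    return [w for w in tokenize(text) if w not in STOP_WORDS]


def phrases(text, min_len=2, max_len=4):
    """Rows for table 3: every run of 2 to 4 consecutive words."""
    words = tokenize(text)
    out = []
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out


def prune_phrases(phrase_rows, search_log):
    """Assumed cleanup rule: keep phrases seen inside a logged search (table 4)."""
    # Pad with spaces so phrases only match on whole-word boundaries.
    searched = [" %s " % " ".join(tokenize(s)) for s in search_log]
    return [p for p in phrase_rows if any(" %s " % p in s for s in searched)]
```

Stop words are left in the phrases here so that names beginning with "the" stay findable; whether table 3 should drop them too is a design choice the post does not settle.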


the basic code for this will be released in a few weeks on sourceforge; 
search for cddc.  We're releasing the complete initial codebase.

and I guess i should probably generate some sort of paper out of this:)







On Thursday, February 28, 2002, at 04:26 AM, Zunt at aol.com wrote:

> I've not run across this method before, and thought folks on this list
> might enjoy puzzling over it.
>
> http://www.ugcs.caltech.edu/~harel/lyrics.html
>
> The website contains a collection of texts (popular song lyrics).  Having
> made a selection from the contents, you can click on various linked words
> within the text (not all possible words are linked).  That action triggers
> (a) enumeration of the texts in the library that contain the target word,
> and (b) a hyperlinked index connecting you back to those available texts.
> Each instance of target word use appears in the index list.
>
> It looks to me like quite a bit of HTML page generation is done
> automatically via scripting on the server side.
>
> Cheers,
>
> Bob Briggs
> Westport, MA
jeremy hunsinger
jhuns at vt.edu
on the ibook
www.cddc.vt.edu
www.cddc.vt.edu/jeremy
www.dromocracy.com




