[Air-l] neat indexing system
jeremy hunsinger
jhuns at vt.edu
Thu Feb 28 04:56:49 PST 2002
actually, this is how the search engine we developed at cddc works, so i
find it interesting to see another one just like it
the method is simple:
take any text
strip html
put the text into a row of table (1) with an index number (TID)
parse the text into words
put each of those words into its own row of a second table (2), along with the TID
you can easily build an index from table (2)
hint:
you select the unique words using sql (SELECT DISTINCT)
you match words using sql (SELECT ... LIKE)
then you parse the text into html, with each word linked to its entry in
the index, and insert that into a column in table (1)
then you can display the text as a hyperlinked index.
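The steps above can be sketched roughly as follows. This is only a minimal illustration using Python and an in-memory SQLite database, not the cddc code: the table and column names (`texts`, `words`, `linked`), the helper functions, and the `/index?word=` link target are all assumptions made up for the example.

```python
import html
import re
import sqlite3
from urllib.parse import quote

WORD = r"[A-Za-z']+"  # assumed definition of a "word"


def strip_html(raw):
    """Crude tag stripper: drop <...> runs, decode entities, collapse spaces."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", raw))
    return " ".join(text.split())


def link_words(body):
    """Wrap each word in a link to its index entry (hypothetical URL scheme)."""
    parts = re.split("(" + WORD + ")", body)  # odd positions hold the words
    out = []
    for i, part in enumerate(parts):
        if i % 2:
            out.append('<a href="/index?word=%s">%s</a>'
                       % (quote(part.lower()), html.escape(part)))
        else:
            out.append(html.escape(part))
    return "".join(out)


def build_index(conn, documents):
    """Fill table (1) with texts and table (2) with one row per word."""
    conn.execute("CREATE TABLE IF NOT EXISTS texts "
                 "(tid INTEGER PRIMARY KEY, body TEXT, linked TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT, tid INTEGER)")
    for raw in documents:
        body = strip_html(raw)
        cur = conn.execute("INSERT INTO texts (body, linked) VALUES (?, ?)",
                           (body, link_words(body)))
        tid = cur.lastrowid
        conn.executemany("INSERT INTO words (word, tid) VALUES (?, ?)",
                         [(w, tid) for w in re.findall(WORD, body.lower())])


def word_index(conn):
    """The 'select unique' hint: every distinct word, alphabetically."""
    rows = conn.execute("SELECT DISTINCT word FROM words ORDER BY word")
    return [r[0] for r in rows]


def search(conn, pattern):
    """The 'select like' hint: TIDs of texts with a word matching pattern."""
    rows = conn.execute("SELECT DISTINCT tid FROM words "
                        "WHERE word LIKE ? ORDER BY tid", (pattern,))
    return [r[0] for r in rows]


def linked_text(conn, tid):
    """Fetch the hyperlinked version of a text from table (1)."""
    row = conn.execute("SELECT linked FROM texts WHERE tid = ?", (tid,)).fetchone()
    return row[0] if row else None
```

With something like this, `search(conn, "lov%")` would return the TIDs of every text containing a word that starts with "lov", and `linked_text(conn, tid)` gives back the version of the text whose words point into the index.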
our (center for digital discourse and culture, myself and my assistants)
innovations on this design:
1. we added to table 2 the ability to attach definitions to individual word
entries, so that if you click a word to find where it is indexed you
can read the definition that is present, or put one in if none is
present. The definitions are all indexed also.
2. we added a table 3 made up of 2 to 4 word phrases, which speeds up
searching and lets you find proper names much more easily. I am
currently developing code that will clean table 3 based on what people
tend to search for, i.e. table 4
3. we save all search strings and the first 3 answers in table 4
4. we began stripping common words (the, is, are, an, so, that, etc.) in
the base parser for table 2; this speeds parsing and indexing
immensely.
the basic code for this will be released in a few weeks on sourceforge;
search for cddc. We're releasing the complete initial codebase.
and I guess i should probably generate some sort of paper out of this :)
On Thursday, February 28, 2002, at 04:26 AM, Zunt at aol.com wrote:
> I've not run across this method before, and thought folks on this list
> might enjoy puzzling over it.
>
> http://www.ugcs.caltech.edu/~harel/lyrics.html
>
> The website contains a collection of texts (popular song lyrics).
> Having made a selection from the contents, you can click on various
> linked words within the text (not all possible words are linked). That
> action triggers (a) enumeration of the texts in the library that
> contain the target word, and (b) a hyperlinked index connecting you
> back to those available texts. Each instance of target word use
> appears in the index list.
>
> It looks to me like quite a bit of HTML page generation is done
> automatically via scripting on the server side.
>
> Cheers,
>
> Bob Briggs
> Westport, MA
>
>
> _______________________________________________
> Air-l mailing list
> Air-l at aoir.org
> http://www.aoir.org/mailman/listinfo/air-l
>
>
jeremy hunsinger
jhuns at vt.edu
on the ibook
www.cddc.vt.edu
www.cddc.vt.edu/jeremy
www.dromocracy.com