[futurebasic] Re: [FB] Metaphone


From: Ken Shmidheiser <k.shmidheiser@...>
Date: Tue, 4 Sep 2001 04:04:51 -0400
Responding to my simple query, jonathan asked:

>> I have already created a Soundex function that is smaller than the
>> one posted on Pix Mix and works very well, but I feel the full-blown
>> Metaphone function is what I need for a project I'm exploring.
>>
>> Also, has anyone here experimented with the Levenshtein distance
>> algorithm? Any code on that?
>
>i read this post and then stopped and read again. it looks like english,
>well, a strong resemblance. the words seem to be english, and i have
>vaguely heard of soundex. but this is incomprehensible and so must have
>largely exceeded my personal Levenshtein distance.
>
>ken, for those who will be unable to help you, but would still like to
>understand... could you post a translation... y'know, words of 0.8 syllables
>and all the rest.

My dear gnomish lad, jonathan:

First it was Master Charles (Dickman) who dared descendeth with me to 
the secret vaults of MacDom in quest of the Holy Hex.

Now you come forward as a journeyman in the very essence of the word.

Are you prepared?

Have you cleansed your conscience of C++ impurities as we prepare for 
a flight across the space-time continuum in search of understanding 
that will amaze and astonish you?

The answer rests in a journey via the SHMIDDYEEQUSALSMC2IMMOBILIZER 
Time Machine (Model 0.8) that may well leave your pointy noggin 
shaken-- but not stirred -- in your trek for the crystal key of 
understanding.

I warn you -- and the stowaways who think they have smuggled 
themselves aboard unnoticed -- that this trip is not for the 
faint-hearted. But traversing the Levenshtein distance in search of 
the Metaphone grail brings rewards of such magnitude as to render a 
Golden Mallet as valueless as chaff.

Even the Mighty FB^3 Lord Staz may discover a nugget that could well 
find its way etched into the scintillating disks of future CDs, pray 
he no longer accuse my virgin lips of having caressed any noxious 
substance.

But enough of that. The flux capacitor is arcing and time waits for no one.


'------ BOARD HERE FOR THE JOURNEY... (or be smart and move on to the 
next thread)------


Up, Up And Away as we are whisked northward in our chariot, my 
black-clad Parisienne!

As the towering Eiffel fades in the distance and we head across the 
channel, there-- closing quickly at 12 o'clock-- are the chalky 
cliffs of Dover. A snap roll to the East brings us past Brighton 
Beach... home sweet home for our young expatriated gnome.

But where are those hordes of tourists from London and points north?

Wait!

Your eyes deceive you not:  That is royalty stirring below. And 
Brighton is still the playground of princes. Commoners have not yet 
tarnished the sacred soil of privilege.

You see, my young jonnie, we are traveling back in time...

To the turn of the previous century.

To a kinder, gentler era.

Sweeping across the Atlantic Ocean our Model 0.8 accelerates.

In a word, we are boogyin'!

And there, rapidly approaching over the cowling, is the gift of your 
adopted country to ours.

In her majestic splendor, torch hand extended in a timeless gesture 
of welcome, is "The Lady."

The Statue of Liberty.

And do you see them swarming below?

Of course, dummy. (Apologies to Terrald)

On Ellis Island.

Yes, we call them "huddled masses."

The sound?!?!?

Why, that's yearning, my impertinent apprentice!

Sure it's a lot of tired, wretched and poor.

In fact, the Ellis Island officials have a problem:

How do you count all those huddled masses?!?!?

(Of course the IRS was watching back then. Would you let me tell the 
story without interrupting?)

Why, most of the huddled masses don't even speak English.

Pity their wretched souls.

As we hover over Ellis, let me tweak the Model 0.8's timespanalyzer throttle.

Got it.

Here we are: April 2, 1918, and the Great War-era huddled masses are 
really rollin' in at Ellis, but we have moved south a little to 
Pennsylvania.

(Yes this involves computers, gimme a minute, will ya?!?!)

See him?

That's Robert C. Russell.

(Of course he's headed to the Patent Office. Can't you see the trail 
of attorneys behind him you pointy-headed dunce!)

Robert C. (no, not our Robert C., stupid, he's back in gWorld. This 
is Robert C. Russell) has invented a way to count huddled masses. He 
does it by coding 8 phonetic sound types with a few additional rules. 
Even when the Ellis Island bureaucrats screw up some poor Italian 
guy's name-- like  Sylvain Guillemette to Sonny Gilleyman-- Russell's 
system assists with the indexing based on phonetics, the American 
English pronunciation of the name.

Well why else do you think they'd call it Soundex?

No, Rolexes haven't even been invented, and even if they were the 
huddled masses couldn't afford them you silly imp.

(Tell me: Why do I even bother?)

Of course there's more to the story.

In 1922, Robert C. together with Margaret K. Odell obtain a second 
patent with some variations. This they sell to various commercial and 
governmental organizations.

Now, let's see what this ole Model 0.8 can do... hang on.

The years are flyin' by now.

The Depression.

Soundex is taken up in a modified form in the 1930's by the Social 
Security Administration under a work creation scheme to extract 
certain data from the U.S. Census and to index its records. And of 
course it's in common use for indexing immigration records.

Inching forward on the timespanalyzer throttle I bring us back safely 
into our friendly era of mass mutual destruction we call the 21st 
Century.

Robert and Margaret are gone and-- fortunately for us-- so are their patents.

Long expired.

Soundex is in the public domain!

And very alive and happy.

<http://www.nara.gov/genealogy/soundex/soundex.html>

Hard as it is to believe, the algorithm--in all kinds of modified 
forms-- is in use everywhere.

<http://www.bluepoof.com/Soundex/info2.html>

We have none other than the Algorithmmeister Donald E. Knuth himself 
to thank for that. In his book "The Art of Computer Programming", Volume 3 
"Sorting and Searching" (Addison-Wesley) he describes an algorithm to 
encode names using the Soundex system.

1.) Remember the initial letter.
2.) Convert each letter (including the first) according to the 
following table. Ignore punctuation such as apostrophes, spaces and 
hyphens.
         0 = AEIOUWYH
         1 = BPFV
         2 = CSKGJQXZ
         3 = DT
         4 = L
         5 = MN
         6 = R
3.) Change all consecutive duplicate digits to a single example. e.g. 
change 22 to 2
4.) Replace the first digit by the letter remembered in step 1.
5.) Delete all zeros.
6.) Adjust to four characters by truncating or padding to the right with zeros.
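
For anyone who wants to trace the six steps by hand, here is a rough 
sketch of them in Python rather than FB. It follows the table and steps 
exactly as listed above; the names TABLE, CODES and soundex are my own 
inventions, and letters that are not in the table are simply skipped, 
which is my assumption about what "ignore punctuation" should mean.

```python
# A minimal sketch of the six steps listed above (names are hypothetical).
TABLE = ["AEIOUWYH", "BPFV", "CSKGJQXZ", "DT", "L", "MN", "R"]
CODES = {letter: str(digit)
         for digit, group in enumerate(TABLE)
         for letter in group}


def soundex(word):
    # Step 2 (first half): ignore anything that is not in the table.
    letters = [c for c in word.upper() if c in CODES]
    if not letters:
        return ""

    # Step 1: remember the initial letter.
    first = letters[0]

    # Step 2: convert every letter, including the first, to its digit.
    digits = [CODES[c] for c in letters]

    # Step 3: squeeze runs of the same digit down to one.
    squeezed = [digits[0]]
    for d in digits[1:]:
        if d != squeezed[-1]:
            squeezed.append(d)

    # Step 4: the first digit gives way to the remembered letter.
    # Step 5: delete all zeros from what is left.
    tail = "".join(d for d in squeezed[1:] if d != "0")

    # Step 6: truncate or pad on the right with zeros to four characters.
    return (first + tail + "000")[:4]
```

Calling soundex("Smith") and soundex("Smythe") both give S530 under 
these steps.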

The neat thing about Soundex for the computer world is that it allows 
"fuzzy" searches.

Shades of artificial intelligence!

Of course I'll explain. Sit your little gnome butt down on that toadstool.

Several names with similar phonetic sounds may be indexed under the 
same number with Soundex.

For instance, "Smith" and "Smythe" both equate to a Soundex index 
number of S530.

It's-- as Herbie Glunder would say--"trivial" to index a list of 
words with Soundex, and then cross index the index to produce fuzzy 
searches, i.e., the user inputs "Smith" and "Smythe" pops back as an 
alternative spelling, along with all the other S530 index words.
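
Here is one way the "index the list, then cross index the index" idea 
might look, again as a hedged Python sketch and not anybody's official 
method. The helper names build_index and sound_alikes are made up for 
the illustration, and the small soundex function is repeated so the 
snippet stands on its own.

```python
from collections import defaultdict

# Same table as in the six steps; all names here are hypothetical.
TABLE = ["AEIOUWYH", "BPFV", "CSKGJQXZ", "DT", "L", "MN", "R"]
CODES = {ch: str(d) for d, group in enumerate(TABLE) for ch in group}


def soundex(word):
    letters = [c for c in word.upper() if c in CODES]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters]
    # Keep a digit only if it differs from the one right before it.
    kept = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = "".join(d for d in kept[1:] if d != "0")
    return (letters[0] + tail + "000")[:4]


def build_index(words):
    # Map each Soundex number to every word that produces it.
    index = defaultdict(list)
    for word in words:
        index[soundex(word)].append(word)
    return index


def sound_alikes(index, query):
    # Everything filed under the query's Soundex number, minus the query.
    matches = index.get(soundex(query), [])
    return [w for w in matches if w.lower() != query.lower()]
```

With an index built from Smith, Smythe, Schmidt, Jonathan and Johnson, 
asking for the sound-alikes of "Smith" hands back Smythe and Schmidt, 
since all three land on S530.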

Wanna little wider search?

Simple, expand your search to a range of Soundex numbers, say S520 through S536 (the digits only run from 0 to 6).

You maka the index, you picka the search parameters.

But Soundex works for more than names-- it also works for words.

Ah, now you're beginning to see.

You can index an entire dictionary of names or places or words or 
phrases with Soundex, and then use that index to look up similar 
sounding words.

Of course that's how spell checkers and search engines work. You 
don't think we've come all this way for nothing, do you?!?!?

In other worlds, this stuff is already all figured out.

This may look somewhat familiar (ignore the "Visual", it's still 
BASIC, like in Future....)

<http://www.developersdomain.com/vb/articles/soundex.htm>

Here's the same thing on Robert P.'s level:

<http://www.codeproject.com/cpp/spellchecker_mg.asp>

And where do you get a dictionary to index?

Ah!

The secret.

(Watch as Staz begins twitching.)

A very accommodating man named Grady Ward has assembled some of the 
finest, commercial quality dictionaries and thesauri, along with 
word, name and place lists available... and in several languages.

They're so big he's nicknamed them the "Moby" files.

Call me Ishmaelheiser!

And, unbelievably, they are free.

I mean like totally free.

You can download them, you can trade them, you can modify them...

You can even sell them.

Free.

<http://www.dcs.shef.ac.uk/research/ilash/Moby/>

or here to download the entire 25MB compressed file.

<ftp://ftp.dcs.shef.ac.uk/share/ilash/Moby/moby.tar.Z>

In fact, Grady's dictionaries and thesauri form the heart of many 
commercial products.

But don't tell anyone.

This is bigtime programmer secret stuff.

You know, the secret coder handshake, etc.

Yes, gimme a minute and I'll let you figure out your Soundex index number.

Here goes:

'------ BEGIN KEN'S WONDERFUL FB^3 SOUNDEX CODE ----------

LOCAL FN getSoundCodeNumber$( character AS STR255 )
  DIM codeNumStr AS STR255

  'Accepts a character and returns the
  'appropriate number from the Soundex table

  codeNumStr = ""

  SELECT CASE character
    CASE "B", "F", "P", "V"
      codeNumStr = "1"
    CASE "C", "G", "J", "K", "Q", "S", "X", "Z"
      codeNumStr = "2"
    CASE "D", "T"
      codeNumStr = "3"
    CASE "L"
      codeNumStr = "4"
    CASE "M", "N"
      codeNumStr = "5"
    CASE "R"
      codeNumStr = "6"
  END SELECT

END FN = codeNumStr

LOCAL FN soundex$( codeWord AS STR255 )
  DIM i              AS INTEGER
  DIM codeWordLength AS LONG
  DIM codeStr        AS STR255
  DIM charStr        AS STR255
  DIM lastCodeStr    AS STR255
  DIM outputStr      AS STR255

  outputStr = ""

  // Grab the first letter
  codeStr     = UCASE$( MID$( codeWord, 1, 1 ))
  lastCodeStr = FN getSoundCodeNumber$( codeStr )

  // Store the codeWord length
  codeWordLength = LEN( codeWord )

  // Continue the code, starting at the second letter
  FOR i = 2 TO codeWordLength
    charStr = FN getSoundCodeNumber$( UCASE$( MID$( codeWord, i, 1 )))

    // If adjacent numbers are the same, only count one of them
    LONG IF LEN( charStr ) > 0 AND lastCodeStr <> charStr
      codeStr = codeStr + charStr
    END IF

    lastCodeStr = charStr

  NEXT i

  // Trim it down to a maximum of four characters...
  outputStr = MID$( codeStr, 1, 4 )

  // ... but if it's less than four characters,
  // pad it out with a bunch of zeros...
  LONG IF LEN( codeStr ) < 4
    outputStr = outputStr + STRING$( 4 - LEN( codeStr ), "0")
  END IF

END FN = outputStr

WINDOW 1
DIM myStr AS STR255

// Pop your name in here:

myStr = "jonathan"
PRINT myStr; " in Soundex code = "; FN soundex$( myStr )

PRINT:PRINT "Click button to end."

DO
HANDLEEVENTS
UNTIL FN BUTTON

'---------- ALL DONE --------

Yes, jonathan, you are Soundex J535.

Has a nice ring, eh?

There is only one problem with Soundex:

Like you, it's a little too simple.

There's a bunch of guys out there with funny names-- names like 
Shmidheiser, which everyone knows should be Schmidhauser-- but which 
some immigration official screwed up and which forever stuck with 
Great Great Great Grandpa Gottlob Shmidheiser and all us other little 
Shmidheisers.

So in 1990 a very smart guy named Lawrence Philips takes Soundex and 
modifies it big time.

We're talking heavy duty, industrial strength.

We're talking Metaphone.

Then he comes back and whammies it again.

We're talking Double Metaphone.

You wanna search engine?

You wanna index?

You wanna find "Shmidheiser?"

Then you wanna Double Metaphone.

Everybody grabs it.

<http://aspell.sourceforge.net/metaphone/>

C/C++, Perl, VisualBASIC... and other languages the mention of which 
is banished here.

All but dear old FB^3.

Hence, my original post.

Anyone out there have Metaphone in FB?

I figure if we can translate Quintuple Metaphone into FB^3, slap 
a Jay Reeve XREF compression algorithm on the puppy, put it in a 
Robert P. Fast FN, and let Alain clean up the code, we should be able 
to index the universe in 15 bytes and derive the answer before we 
input the question.

Then Staz can put the whole thing on Release 6, buy mom's Winnebago and retire.

Well, the flux capacitor is running low and it's time to head home.

Hope you enjoyed the trip.


Next episode:  Running the Levenshtein distance.