Responding to my simple query, jonathan asked:

>> I have already created a Soundex function that is smaller than the
>> one posted on Pix Mix and works very well, but I feel the full-blown
>> Metaphone function is what I need for a project I'm exploring.
>>
>> Also, has anyone here experimented with the Levenshtein distance
>> algorithm? Any code on that?
>
>i read this post and then stopped and read again. it looks like english,
>well, a strong resemblance. the words seem to be english, and i have
>vaguely heard of soundex. but this is incomprehensible and so must have
>largely exceeded my personal Levenshtein distance.
>
>ken, for those who will be unable to help you, but would still like to
>understand... could you post a translation... y'know, words of 0.8 syllables
>and all the rest.

My dear gnomish lad, jonathan:

First it was Master Charles (Dickman) who dared descendeth with me to the secret vaults of MacDom in quest of the Holy Hex. Now you come forward as a journeyman in the very essence of the word.

Are you prepared? Have you cleansed your conscience of C++ impurities as we prepare for a flight across the space-time continuum in search of understanding that will amaze and astonish you?

The answer rests in a journey via the SHMIDDYEEQUSALSMC2IMMOBILIZER Time Machine (Model 0.8) that may well leave your pointy noggin shaken-- but not stirred-- in your trek for the crystal key of understanding.

I warn you-- and the stowaways who think they have smuggled themselves aboard unnoticed-- that this trip is not for the faint-hearted. But traversing the Levenshtein distance in search of the Metaphone grail brings rewards of such magnitude as to render a Golden Mallet as valueless as chaff. Even the Mighty FB^3 Lord Staz may discover a nugget that could well find its way etched into the scintillating disks of future CDs, pray he no longer accuse my virgin lips of having caressed any noxious substance.

But enough of that.
The flux capacitor is arcing and time waits for no one.

'------ BOARD HERE FOR THE JOURNEY... (or be smart and move on to the next thread)------

Up, Up And Away as we are whisked northward in our chariot, my black-clad Parisienne! As the towering Eiffel fades in the distance and we head across the channel, there-- closing quickly at 12 o'clock-- are the chalky cliffs of Dover. A snap roll to the East brings us past Brighton Beach... home sweet home for our young expatriated gnome.

But where are those hordes of tourists from London and points north? Wait! Your eyes deceive you not: That is royalty stirring below. And Brighton is still the playground of princes. Commoners have not yet tarnished the sacred soil of privilege.

You see, my young jonnie, we are traveling back in time... To the turn of the previous century. To a kinder, gentler era.

Sweeping across the Atlantic Ocean our Model 0.8 accelerates. In a word, we are boogyin'! And there, rapidly approaching over the cowling, is the gift of your adopted country to ours. In her majestic splendor, torch hand extended in a timeless gesture of welcome, is "The Lady." The Statue of Liberty.

And do you see them swarming below? Of course, dummy. (Apologies to Terrald) On Ellis Island. Yes, we call them "huddled masses." The sound?!?!? Why, that's yearning, my impertinent apprentice! Sure it's a lot of tired, wretched and poor.

In fact, the Ellis Island officials have a problem: How do you count all those huddled masses?!?!? (Of course the IRS was watching back then. Would you let me tell the story without interrupting?) Why, most of the huddled masses don't even speak English. Pity their wretched souls.

As we hover over Ellis, let me tweak the Model 0.8's timespanalyzer throttle. Got it. Here we are: April 2, 1918, and the Great War-era huddled masses are really rollin' in at Ellis, but we have moved south a little to Pennsylvania. (Yes, this involves computers, gimme a minute, will ya?!?!)

See him? That's Robert C.
Russell. (Of course he's headed to the Patent Office. Can't you see the trail of attorneys behind him, you pointy-headed dunce!) Robert C. (no, not our Robert C., stupid, he's back in gWorld. This is Robert C. Russell) has invented a way to count huddled masses. He does it by coding 8 phonetic sound types with a few additional rules.

Even when the Ellis Island bureaucrats screw up some poor Italian guy's name-- like Sylvain Guillemette to Sonny Gilleyman-- Russell's system assists with the indexing based on phonetics, the American English pronunciation of the name.

Well, why else do you think they'd call it Soundex? No, Rolexes haven't even been invented, and even if they were, the huddled masses couldn't afford them, you silly imp. (Tell me: Why do I even bother?)

Of course there's more to the story. In 1922, Robert C. together with Margaret K. Odell obtain a second patent with some variations. This they sell to various commercial and governmental organizations.

Now, let's see what this ole Model 0.8 can do... hang on. The years are flyin' by now. The Depression. Soundex is taken up in a modified form in the 1930s by the Social Security Administration under a work creation scheme to extract certain data from the U.S. Census and to index its records. And of course it's in common use for indexing immigration records.

Inching forward on the timespanalyzer throttle I bring us back safely into our friendly era of mass mutual destruction we call the 21st Century. Robert and Margaret are gone and-- fortunately for us-- so are their patents. Long expired. Soundex is in the public domain! And very alive and happy.

<http://www.nara.gov/genealogy/soundex/soundex.html>

Hard as it is to believe, the algorithm-- in all kinds of modified forms-- is in use everywhere.

<http://www.bluepoof.com/Soundex/info2.html>

We have none other than the Algorithmmeiser Donald E. Knuth himself to thank for that.
In his book "The Art of Computer Programming", Volume 3 "Sorting and Searching" (Addison Wesley) he describes an algorithm to encode names using the Soundex system.

1.) Remember the initial letter.

2.) Convert each letter (including the first) according to the following table. Ignore punctuation such as apostrophes, spaces and hyphens.

    0 = AEIOUWYH
    1 = BPFV
    2 = CSKGJQXZ
    3 = DT
    4 = L
    5 = MN
    6 = R

3.) Collapse each run of consecutive duplicate digits to a single digit, e.g. change 22 to 2.

4.) Replace the first digit by the letter remembered in step 1.

5.) Delete all zeros.

6.) Adjust to four characters by truncating or padding to the right with zeros.

The neat thing about Soundex for the computer world is that it allows "fuzzy" searches. Shades of artificial intelligence! Of course I'll explain. Sit your little gnome butt down on that toadstool.

Several names with similar phonetic sounds may be indexed under the same number with Soundex. For instance, "Smith" and "Smythe" both equate to a Soundex index number of S530.

It's-- as Herbie Glunder would say-- "trivial" to index a list of words with Soundex, and then cross index the index to produce fuzzy searches, i.e., the user inputs "Smith" and "Smythe" pops back as an alternative spelling, along with all the other S530 index words.

Wanna little wider search? Simple: expand your search to a range of Soundex numbers, say S520 through S540. You maka the index, you picka the search parameters.

But Soundex works for more than names-- it also works for words. Ah, now you're beginning to see. You can index an entire dictionary of names or places or words or phrases with Soundex, and then use that index to look up similar sounding words.

Of course that's how many spell checkers and search engines work. You don't think we've come all this way for nothing, do you?!?!?

In other words, this stuff is already all figured out. This may look somewhat familiar (ignore the "Visual", it's still BASIC, like in Future....)
<http://www.developersdomain.com/vb/articles/soundex.htm>

Here's the same thing on Robert P.'s level:

<http://www.codeproject.com/cpp/spellchecker_mg.asp>

And where do you get a dictionary to index? Ah! The secret. (Watch as Staz begins twitching.)

A very accommodating man named Grady Ward has assembled some of the finest commercial-quality dictionaries and thesauri, along with word, name and place lists available... and in several languages. They're so big he's nicknamed them the "Moby" files. Call me Ishmaelheiser!

And, unbelievably, they are free. I mean like totally free. You can download them, you can trade them, you can modify them... You can even sell them. Free.

<http://www.dcs.shef.ac.uk/research/ilash/Moby/>

or here to download the entire 25MB compressed file.

<ftp://ftp.dcs.shef.ac.uk/share/ilash/Moby/moby.tar.Z>

In fact, Grady's dictionaries and thesauri form the heart of many commercial products. But don't tell anyone. This is bigtime programmer secret stuff. You know, the secret coder handshake, etc.

Yes, gimme a minute and I'll let you figure out your Soundex index number.
Here goes:

'------ BEGIN KEN'S WONDERFUL FB^3 SOUNDEX CODE ----------

LOCAL FN getSoundCodeNumber$( character AS STR255 )
  DIM codeNumStr AS STR255

  'Accepts a character and returns the
  'appropriate number from the Soundex table.
  'Vowels, H, W, Y and anything else return an empty string.
  codeNumStr = ""
  SELECT CASE character
    CASE "B", "F", "P", "V"
      codeNumStr = "1"
    CASE "C", "G", "J", "K", "Q", "S", "X", "Z"
      codeNumStr = "2"
    CASE "D", "T"
      codeNumStr = "3"
    CASE "L"
      codeNumStr = "4"
    CASE "M", "N"
      codeNumStr = "5"
    CASE "R"
      codeNumStr = "6"
  END SELECT
END FN = codeNumStr

LOCAL FN soundex$( codeWord AS STR255 )
  DIM i              AS INTEGER
  DIM codeWordLength AS LONG
  DIM codeStr        AS STR255
  DIM charStr        AS STR255
  DIM lastCodeStr    AS STR255
  DIM outputStr      AS STR255

  outputStr = ""

  // Grab the first letter and remember its code
  codeStr     = UCASE$( MID$( codeWord, 1, 1 ))
  lastCodeStr = FN getSoundCodeNumber$( codeStr )

  // Store the codeWord length
  codeWordLength = LEN( codeWord )

  // Continue the code, starting at the second letter
  FOR i = 2 TO codeWordLength
    charStr = FN getSoundCodeNumber$( UCASE$( MID$( codeWord, i, 1 )))
    // If adjacent numbers are the same, only count one of them
    LONG IF LEN( charStr ) > 0 AND lastCodeStr <> charStr
      codeStr = codeStr + charStr
    END IF
    lastCodeStr = charStr
  NEXT i

  // Trim it down to a maximum of four characters...
  outputStr = MID$( codeStr, 1, 4 )

  // ... but if it's less than four characters,
  // pad it out with a bunch of zeros...
  LONG IF LEN( codeStr ) < 4
    outputStr = outputStr + STRING$( 4 - LEN( codeStr ), "0" )
  END IF
END FN = outputStr

WINDOW 1

DIM myStr AS STR255

// Pop your name in here:
myStr = "jonathan"

PRINT myStr; " in Soundex code = "; FN soundex$( myStr )
PRINT:PRINT "Click button to end."

DO
  HANDLEEVENTS
UNTIL FN BUTTON

'---------- ALL DONE --------

Yes, jonathan, you are Soundex J535. Has a nice ring, eh?

There is only one problem with Soundex: Like you, it's a little too simple.
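Before we get to that problem, a quick aside for the stowaways who don't speak FB^3. What follows is only a rough sketch in Python (not FB, and not Knuth's own code) of the six steps quoted earlier, using the letter table exactly as given in this post, so H and W land in the zero group along with the vowels. It also sketches the cross-index idea: file each word under its code, then hand back everything filed under the code of the query. The names soundex, build_index and sounds_like are made up for illustration.

```python
from collections import defaultdict

# Step 2 table as given in this post (H and W sit in the "0" group here).
GROUPS = {
    "0": "AEIOUWYH",
    "1": "BPFV",
    "2": "CSKGJQXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
CODE = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(word):
    """Encode a word following the six steps quoted in the post."""
    # Keep only letters found in the table (drops apostrophes, spaces, hyphens).
    letters = [ch for ch in word.upper() if ch in CODE]
    if not letters:
        return ""
    first = letters[0]                        # step 1: remember the initial
    digits = [CODE[ch] for ch in letters]     # step 2: convert every letter
    collapsed = [digits[0]]
    for d in digits[1:]:                      # step 3: collapse duplicate runs
        if d != collapsed[-1]:
            collapsed.append(d)
    # Step 4 swaps the first digit for the remembered letter;
    # step 5 drops the zeros from the rest.
    tail = [d for d in collapsed[1:] if d != "0"]
    return (first + "".join(tail) + "000")[:4]   # step 6: pad or truncate to 4


def build_index(words):
    """Hypothetical helper: group words by their Soundex code."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index


def sounds_like(index, query):
    """Hypothetical helper: every indexed word sharing the query's code."""
    return index.get(soundex(query), [])


if __name__ == "__main__":
    idx = build_index(["Smith", "Smythe", "Schmidhauser", "Jonathan"])
    print(soundex("jonathan"))          # J535
    print(sounds_like(idx, "smith"))    # ['Smith', 'Smythe']
```

With this table, "jonathan" comes out as J535 and both "Smith" and "Smythe" land on S530, matching the results quoted in this post. A dictionary keyed by code is just one simple way to do the cross index; a sorted file or a database column would serve as well.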
There's a bunch of guys out there with funny names-- names like Shmidheiser, which everyone knows should be Schmidhauser-- but which some immigration official screwed up and which forever stuck with Great Great Great Grandpa Gottlob Shmidheiser and all us other little Shmidheisers.

So in 1990 a very smart guy named Lawrence Philips takes Soundex and modifies it big time. We're talking heavy duty, industrial strength. We're talking Metaphone.

Then he comes back and whammies it again. We're talking Double Metaphone.

You wanna search engine? You wanna index? You wanna find "Shmidheiser"? Then you wanna Double Metaphone.

Everybody grabs it.

<http://aspell.sourceforge.net/metaphone/>

C/C++, Perl, VisualBASIC... and other languages the mention of which is banished here. All but dear old FB^3.

Hence, my original post. Anyone out there have Metaphone in FB?

I figure if we can translate Quintuple Metaphone into FB^3, slap a Jay Reeve XREF compression algorithm on the puppy, put it in a Robert P. Fast FN, and let Alain clean up the code, we should be able to index the universe in 15 bytes and derive the answer before we input the question.

Then Staz can put the whole thing on Release 6, buy mom's Winnebago and retire.

Well, the flux capacitor is running low and it's time to head home. Hope you enjoyed the trip.

Next episode: Running the Levenshtein distance.