
Phosphorus and Lime

A Developer's Broadsheet

This blog has been deprecated. Please visit my new blog at klenwell.com/press.
Greedy Phonotactic Coordination
This was the puzzle I was trying to solve: I have a long list of English words (over 100,000 of them), and each word comes with a phonetic representation. I want to break that phonetic representation up into syllables. Specifically, I want an algorithm that will do this for me.

I'm not the first person to confront this problem:

summary of responses to syllabification algorithm query

Surprisingly, however, I couldn't find a working algorithm to solve this problem. Well, there was this: ale-trale man node 83. But hell if I knew what it meant.

Greedy Phonotactic Coordination is my solution. I'm still testing it. But there were only about 50 words in the 100,000+ word list that it couldn't handle, and the majority of those were either outright foreign appropriations or mistranscriptions.

A couple of examples (word -> phonemes -> syllables):

encourages -> E.n.k.3r.I.J.I.z -> En.k3r.I.JIz
mackler -> m.{.k.l.3r -> m{.kl3r
petunia -> p.V.t.u.n.i.V -> pV.tu.ni.V

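How it works: basically, it finds a vowel phoneme, finds the next one, and then tries every possible coda/onset break in the consonants between the two vowels, looking for a valid combination. It favors onsets: it starts with the break right after the first vowel, so that all the consonants go to the onset of the next syllable, and then moves consonants left into the coda one at a time. In the last case it ends up with an empty onset for the next syllable and all the consonants in the coda of the current syllable.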
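Here is a minimal sketch of that search in Python. It is an illustration, not the original code: the names syllabify, VOWELS, VALID_ONSETS and VALID_CODAS are made up for this example, and the three tables are tiny hand-picked assumptions that only cover the sample words above. It also does not check the onset of the first syllable or the coda of the last one.

```python
# Hypothetical, heavily abbreviated tables: just enough for the sample words.
VOWELS = {"E", "3r", "I", "{", "V", "u", "i"}
VALID_ONSETS = {(), ("k",), ("J",), ("t",), ("n",), ("k", "l")}
VALID_CODAS = {(), ("n",), ("k",), ("z",)}


def syllabify(phonemes):
    """Split a dotted phoneme string (e.g. 'm.{.k.l.3r') into syllables."""
    phones = phonemes.split(".")
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]

    breaks = []  # index in `phones` where each new syllable starts
    for cur, nxt in zip(vowel_idx, vowel_idx[1:]):
        cluster = phones[cur + 1:nxt]  # consonants between the two vowels
        # split == 0 gives everything to the onset; each later step moves
        # one more consonant left into the coda of the current syllable.
        for split in range(len(cluster) + 1):
            coda = tuple(cluster[:split])
            onset = tuple(cluster[split:])
            if coda in VALID_CODAS and onset in VALID_ONSETS:
                breaks.append(cur + 1 + split)
                break
        else:
            raise ValueError("no valid coda/onset break in " + phonemes)

    # Cut the phoneme list at the chosen break points.
    syllables = []
    start = 0
    for b in breaks:
        syllables.append("".join(phones[start:b]))
        start = b
    syllables.append("".join(phones[start:]))
    return ".".join(syllables)


if __name__ == "__main__":
    for word in ("E.n.k.3r.I.J.I.z", "m.{.k.l.3r", "p.V.t.u.n.i.V"):
        print(word, "->", syllabify(word))
```

The first valid break wins, which is what makes it greedy: "k.l" stays together as an onset in mackler, while "n.k" gets split in encourages because, in these tables, "nk" is not a valid onset. A word with no valid break raises an error here, which is roughly how the words it can't handle would show up.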

I created functions for valid onsets and codas using the information on this Wikipedia page.

Not perfect. But pretty good.