This is the challenge I faced: tokenize a string into words. Easy enough, you say: just use one of the many built-in PHP functions, like str_word_count(). But now, do it successfully with a string like this:
I 'said', "This is my test's string of over '8.8' words.About 16 to be ex-act!"
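For comparison, here is a minimal sketch of what the built-in does with that string. With format 1, str_word_count() returns the words as an array. By default it only treats letters (plus ' and -) as word characters, so the numbers never show up in its output. The variable names are just for illustration.

```php
<?php
// The test string from above, single-quoted so the double quotes stay literal.
$str = 'I \'said\', "This is my test\'s string of over \'8.8\' words.About 16 to be ex-act!"';

// Format 1 asks str_word_count() for the list of words instead of a count.
$builtin = str_word_count($str, 1);

print_r($builtin);

// Digits are not word characters by default, so neither number is returned.
var_dump(in_array('8.8', $builtin));
var_dump(in_array('16', $builtin));
```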
I think I've found the solution:
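function str_to_words($str)
{
    // split on runs of characters that cannot be part of a word, or on a
    // "word.word" pair (a missing space after a full stop); the two halves of
    // such a pair are captured and returned via PREG_SPLIT_DELIM_CAPTURE
    $re = '%[^A-Za-z0-9\-\'\.]+|([A-Za-z]+)\.([A-Za-z]+)%';
    $CHR = str_split('abcdefghijklmnopqrstuvwxyz0123456789');
    $W = preg_split($re, $str, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
    $WORDS = array(); // so that an input without words returns an empty array
    // make sure first and last characters are alphanumeric
    // this also drops the trailing ' of a plural possessive (e.g. the girls' toys)
    foreach ( $W as $w0 )
    {
        $w1 = ( !in_array(strtolower(substr($w0,0,1)),$CHR) ) ? substr($w0,1) : $w0;
        $w1 = ( !in_array(strtolower(substr($w1,-1)),$CHR) ) ? substr($w1,0,-1) : $w1;
        // skip tokens that consisted only of punctuation
        if ( (string)$w1 !== '' ) $WORDS[] = $w1;
    }
    return $WORDS;
}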
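To see what the regular expression does on its own, here is a small sketch that runs just the preg_split() call on a shortened input. The pattern and flags are the ones from the function; $sample and $tokens are made-up names for illustration.

```php
<?php
// Alternative 1 matches a run of characters that cannot belong to a word.
// Alternative 2 matches a "word.word" pair and captures both halves.
$re = '%[^A-Za-z0-9\-\'\.]+|([A-Za-z]+)\.([A-Za-z]+)%';

// Hypothetical input, cut down from the test string.
$sample = 'over 8.8 words.About 16';

$tokens = preg_split($re, $sample, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($tokens);
// Array
// (
//     [0] => over
//     [1] => 8.8
//     [2] => words
//     [3] => About
//     [4] => 16
// )
```

"8.8" survives because the second alternative only accepts letters on both sides of the dot. "words.About" is consumed as a delimiter, and PREG_SPLIT_DELIM_CAPTURE hands its two captured groups back as separate tokens.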
Result:
Array
(
[0] => I
[1] => said
[2] => This
[3] => is
[4] => my
[5] => test's
[6] => string
[7] => of
[8] => over
[9] => 8.8
[10] => words
[11] => About
[12] => 16
[13] => to
[14] => be
[15] => ex-act
)
Of course, there are a few fringe cases it won't handle. And bear in mind that it's about 200x slower than str_word_count(). Fortunately, it's still fast enough for my purposes.
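One fringe case worth spelling out is the plural possessive (e.g. the girls' toys): the trailing apostrophe gets trimmed. Extending the list of allowed characters with an expression like $CHR + array("'") does not help, because + on PHP arrays is a key-wise union and key 0 is already taken. A minimal sketch of the difference ($union and $merged are names I made up for this example):

```php
<?php
$CHR = str_split('abcdefghijklmnopqrstuvwxyz0123456789');

// Key-wise union: key 0 already exists in $CHR, so "'" is not added.
$union = $CHR + array("'");

// array_merge() appends values with fresh numeric keys, so "'" is added.
$merged = array_merge($CHR, array("'"));

var_dump(count($union));           // int(36)
var_dump(count($merged));          // int(37)
var_dump(in_array("'", $union));   // bool(false)
var_dump(in_array("'", $merged));  // bool(true)
```

Keep in mind that allowing a trailing apostrophe this way is a trade-off: a quoted token such as 'said' would then come out as said', with the closing quote still attached.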