 * All Rights Reserved.
 * This is a free program for use by anyone for any purpose.
 * If you make extensive use of this code, please let me know.
 * I just like knowing my code is being used.
 ********************************************************************
 * This is a Bayesian-like filter for PHP. It calculates the
 * probability that a given phrase belongs to one of a set of phrase
 * categories. Unlike most implementations, this allows the use of
 * multiple categories.
 ********************************************************************
 * COMPATIBILITY:
 * This uses direct calls to the operating system. By default, these
 * are 'grep', 'wc', and 'echo'. I have tested this on Fedora Linux,
 * FreeBSD, and Debian Linux. I expect it to work on most Unix-like
 * operating systems.
 * I have not tested this on Windows servers. I don't know of any
 * equivalent to 'grep' in Windows, which would be a requirement.
 ********************************************************************
 * INSTALLATION:
 * Copy bayesian.php to your PHP code area.
 * Create a directory that is readable/writable by PHP.
 * Place text files in the directory - one for each category you want
 * to detect - that contain phrases from those categories.
 * Set the variable $BAYES_ROOT to the PATH that contains the
 * directory (not the directory's name), ending with a file separator.
 *   Note: You give this class the directory name in the constructor.
 *   That allows you to have multiple filter directories.
 * Ensure the definitions at the beginning of this script are
 * available to PHP's popen/exec functions.
 *   Note: if you want a case-insensitive filter, ensure the definition
 *   for the grep command is a case-insensitive grep (i.e. grep -i)
 ********************************************************************
 * USAGE:
 * Include/Require once the bayesian.php script in your script.
 * Create the class: $my_filter = new Bayesian_Filter("dir_name");
 * To add a phrase to a category:
 *   $my_filter->add("Phrase to add", "category");
 * Is a phrase probably in a category?
 *   if($my_filter->is_category("Phrase to check", "category"))
 *     print "It is probably in the category.";
 *   else
 *     print "It is probably not in the category.";
 * Get most probable category (null if no match):
 *   $category = $my_filter->probable_category("Phrase to check");
 * Get probabilities for all categories:
 *   $probs_array = $my_filter->probabilities("The phrase to check");
 *   // Returns an array with $probs_array[category]=probability;
 ********************************************************************
 * CATEGORY FILES:
 * In the directory you designate as the filter directory, you will
 * have ONLY text files. Each text file will be named for a category
 * you want to filter phrases into. The contents of the text files
 * will be phrases that belong in the category.
 * DO NOT FILL YOUR CATEGORY FILES WITH KEYWORDS! You'll get bad
 * results if you do that. Use real phrases that you will parse.
 * DO NOT START OUT WITH HUGE CATEGORY FILES! You'll get a very slow
 * filter. Just start out with a handful of phrases and add only the
 * phrases that the filter gets wrong.
 ********************************************************************
 * NOTES:
 * FORMULA SIMPLIFICATION:
 * The Bayesian Formula is:
 *   Probability = Word_Freq_In_Cat * Cat_Freq / Word_Freq.
 * Expanding out the variables...
 *   Word_Freq_In_Cat = words_in_cat / lines_in_cat
 *   Cat_Freq = lines_in_cat / lines_in_all_cats
 * So, it is obvious that lines_in_cat cancels out, leaving:
 *   words_in_cat / lines_in_all_cats / Word_Freq
 * Continuing...
 *   Word_Freq = words_in_all_cats / lines_in_all_cats
 * Again, it is obvious that lines_in_all_cats cancels out, leaving:
 *   words_in_cat / words_in_all_cats
 * So, I use the simplified formula. I commented out the functions
 * that produce the word and category frequencies. If you want those
 * values for some other use, just uncomment them.
 * MULTIPLE CATEGORIES:
 * Bayesian Filters are commonly used to decide between two
 * categories.
 * This class is not limited to two categories. It will
 * calculate the percentage probability for each category.
 * MULTIPLE INSTANCES:
 * The constructor requires a name for the filter. The name is the
 * directory name containing the category text files. This allows
 * you to create multiple filters at the same time - each using a
 * different set of category files.
 * SPEED AND CATEGORY FILES:
 * Keep the category files as small as possible. You *could* pack
 * them full of thousands of examples. It will just take forever to
 * parse that much information. So, start with a handful of entries
 * in each file. Then, only add the phrases that the filter gets
 * wrong so it will get them right later on. It is possible (though
 * not very likely) that it will continue to get a phrase wrong. If
 * this happens, enter the phrase into the correct category file more
 * than once to give it more weight in the filter. The catch is that
 * you'll probably ruin the filter's ability to parse a lot of other
 * phrases correctly.
 ********************************************************************/

/** DEFINED OPERATING SYSTEM COMMANDS ******************************/
// GREP should return a list of matching lines to standard output.
if(!defined("grep")) define("grep", "grep -i");
// LINE_COUNT should return the count of lines in standard input.
if(!defined("line_count")) define("line_count", "wc -l");
// ECHO_LINE should echo a line to standard output.
if(!defined("echo_line")) define("echo_line", "echo");

/** GLOBAL VARIABLES ***********************************************/
// The root must end with a file separator, ie: "Path/To/Bayes/"
// Also, the root directory MUST exist.
// This class will not attempt to create it.
if(!isset($BAYES_ROOT)) $BAYES_ROOT = "./";

/**
 * Bayesian Filter class.
 * A PHP implementation of a multiple category Bayesian Filter.
 */
class Bayesian_Filter
{
    var $name;
    var $categories = array();
    var $catcounts = array();
    var $total_count = -1;

    /**
     * CONSTRUCTOR
     * $name    Name of the category file directory.
     */
    // Named __construct because PHP 8 no longer treats a method
    // named after the class as the constructor.
    function __construct($name)
    {
        global $BAYES_ROOT;
        $this->name = $BAYES_ROOT.$name;
        // Create the filter directory (not the root) if it is missing.
        if(!file_exists($this->name)) mkdir($this->name);
        $this->set_categories();
    }

    /**
     * Perform a self test to ensure that the filter
     * will work properly. Test results are printed
     * to standard-output.
     */
    function self_test()
    {
        global $BAYES_ROOT;
        print "BAYESIAN FILTER SELF TEST\n";
        print "  Testing root directory $BAYES_ROOT\t";
        if(!file_exists($BAYES_ROOT)) print "[fail]\n";
        else print "[ok]\n";
        print "  Testing filter directory ".$this->name."\t";
        if(!file_exists($this->name)) print "[fail]\n";
        else print "[ok]\n";
        print "  Category List:\n";
        $this->set_categories();
        foreach($this->categories as $cat) print "\t$cat\n";
        if(sizeof($this->categories) == 0) print "\tNO CATEGORY FILES!\n";
    }

    /**
     * Is the phrase in the category?
     * $phrase    Phrase to check.
     * $cat       Category to match.
     * [$cutoff]  Probability cutoff between 0 and 1 (0.9 is default)
     * return     true/false
     */
    function is_category($phrase, $cat, $cutoff=0.9)
    {
        $probs = $this->probabilities($phrase);
        if(!isset($probs[$cat])) return false;
        return ($probs[$cat] > $cutoff);
    }

    /**
     * What category is the phrase probably in?
     * $phrase  Phrase to check.
     * return   Category with highest probability.
     *          null if no category matches.
     */
    function probable_category($phrase)
    {
        $probs = $this->probabilities($phrase);
        $highc = null;
        $highp = 0;
        foreach($probs as $c=>$p)
        {
            // Track the best probability seen so far as well as its category;
            // otherwise the last category with any match would win.
            if($p > $highp)
            {
                $highp = $p;
                $highc = $c;
            }
        }
        return $highc;
    }

    /**
     * Get all category probabilities for a phrase.
     * $phrase  Phrase to check.
     * return   Array of probabilities as $cat=>$percentage_probability
     */
    function probabilities($phrase)
    {
        $probs = array();
        // Use preg_split instead of explode to split on all whitespace and punctuation.
        // PREG_SPLIT_NO_EMPTY drops the empty tokens that leading or trailing
        // punctuation would otherwise add to the word count.
        $words = preg_split("/\W+/", $phrase, -1, PREG_SPLIT_NO_EMPTY);
        foreach($words as $word)
        {
            $wprobs = $this->probabilities_word($word);
            foreach($wprobs as $c=>$p)
            {
                if(!isset($probs[$c])) $probs[$c] = $p;
                else $probs[$c] += $p;
            }
        }
        // Average each category's total over the number of words scored.
        foreach($probs as $c=>$p) $probs[$c] = $p/sizeof($words);
        return $probs;
    }

    /** THE FOLLOWING ARE SUPPORT METHODS **********************/
    /** YOU WILL PROBABLY NEVER USE THEM DIRECTLY **************/

    /**
     * Get the probability that a word belongs to each category.
     * $word   Word to check.
     * Return  array of probabilities as $cat=>$probability
     */
    function probabilities_word($word)
    {
        $probs = array();
        foreach($this->categories as $category)
            $probs[$category] = $this->bayes($word, $category);
        return $probs;
    }

    /**
     * Set $this->categories to contain the names of each category.
     */
    function set_categories()
    {
        $this->categories = array();
        $this->catcounts = array();
        $this->total_count = -1;
        if($dh = opendir($this->name))
        {
            while(false !== ($file = readdir($dh)))
            {
                // Skip ".", ".." and hidden files.
                if(substr($file, 0, 1) != ".")
                {
                    array_push($this->categories, $file);
                    $this->catcounts[$file] = -1;
                }
            }
            closedir($dh);
        }
    }

    /**
     * Append a phrase to a category text file.
     * If the category does not exist, it will be created.
     * $phrase  Phrase to add.
     * $cat     Category to add to.
     */
    function add($phrase, $cat)
    {
        // escapeshellarg() quotes the phrase and the path for the shell;
        // addslashes() does not protect against shell metacharacters.
        exec(echo_line." ".escapeshellarg($phrase)." >> ".escapeshellarg($this->name."/".$cat));
        $this->set_categories(); // Just in case this is a new category.
        $this->catcounts[$cat] = -1;
        $this->total_count = -1;
    }

    /**
     * Calculate the Bayesian probability that a word belongs to a category.
     * $word   Word to check.
     * $cat    Category to check.
     * return  Probability that $word belongs to $cat.
     */
    function bayes($word, $cat)
    {
        // Matching lines in this category, then matching lines in all categories.
        $wic = $this->get_count(grep." ".escapeshellarg($word)." ".escapeshellarg($this->name."/".$cat)." | ".line_count);
        $wit = $this->get_count(grep." ".escapeshellarg($word)." ".escapeshellarg($this->name)."/* | ".line_count);
        if($wit == 0) return 0;
        return $wic / $wit;
    }

    /*******************************************************************
     * If you want to have custom probability functions, you can use the
     * following to get common probabilities. They are the proper
     * Bayesian statistics of "Word in Category", "Category in Total",
     * and "Word in Total".
     * I comment them out here because I simplified the Bayesian formula
     * to be simply: Word_Count_In_Category / Word_Count_In_Total
     ********************************************************************/

    /*******************************************************************
    function prob_word_cat($word, $cat)
    {
        if(!file_exists($this->name."/".$cat)) return 0;
        $word_count = $this->get_count(grep." ".escapeshellarg($word)." ".escapeshellarg($this->name."/".$cat)." | ".line_count);
        if(!isset($this->catcounts[$cat]) || $this->catcounts[$cat] < 0)
            $this->catcounts[$cat] = $this->get_count(line_count." ".escapeshellarg($this->name."/".$cat));
        if($this->catcounts[$cat] == 0) return 0;
        return $word_count / $this->catcounts[$cat];
    }

    function prob_word($word)
    {
        $word_count = $this->get_count(grep." ".escapeshellarg($word)." ".escapeshellarg($this->name)."/* | ".line_count);
        if(!isset($this->total_count) || $this->total_count < 0)
            $this->total_count = $this->get_count(line_count." ".escapeshellarg($this->name)."/*");
        if($this->total_count == 0) return 0;
        return $word_count / $this->total_count;
    }

    function prob_cat($cat)
    {
        if(!file_exists($this->name."/".$cat)) return 0;
        if(!isset($this->catcounts[$cat]) || $this->catcounts[$cat] < 0)
            $this->catcounts[$cat] = $this->get_count(line_count." ".escapeshellarg($this->name."/".$cat));
        if(!isset($this->total_count) || $this->total_count < 0)
            $this->total_count = $this->get_count(line_count." ".escapeshellarg($this->name)."/*");
        if($this->total_count == 0) return 0;
        return $this->catcounts[$cat] / $this->total_count;
    }
    ********************************************************************/

    /**
     * Get the integer count value for a system command.
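     * If the command returns multiple lines, the last line is used.
     * $command  Command to execute.
     * return    Int value (0 if the command produces no output).
     */
    function get_count($command)
    {
        // Default when the command prints nothing or cannot be started.
        $count = 0;
        $ph = popen($command, "r");
        if(!$ph) return $count;
        // fgets() returns false at end of input, which ends the loop.
        while(($line = fgets($ph, 1024)) !== false)
        {
            $b = trim($line);
            if($b != "") $count = intval($b);
        }
        pclose($ph);
        return $count;
    }
}
?>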
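<?php
/********************************************************************
 * EXAMPLE (not part of the original filter):
 * A minimal sketch of the FORMULA SIMPLIFICATION note in the header.
 * The function name bayes_simplification_demo() and the counts used
 * below are hypothetical, chosen only for illustration. The function
 * evaluates the full formula
 *   Word_Freq_In_Cat * Cat_Freq / Word_Freq
 * alongside the simplified form
 *   words_in_cat / words_in_all_cats
 * so the two can be compared. It assumes every count is non-zero and
 * does not touch the filter directory or run any system command.
 ********************************************************************/
if(!function_exists("bayes_simplification_demo"))
{
    function bayes_simplification_demo($words_in_cat, $words_in_all_cats,
                                       $lines_in_cat, $lines_in_all_cats)
    {
        // The three frequencies, exactly as expanded in the header note.
        $word_freq_in_cat = $words_in_cat / $lines_in_cat;
        $cat_freq = $lines_in_cat / $lines_in_all_cats;
        $word_freq = $words_in_all_cats / $lines_in_all_cats;
        return array(
            "full" => $word_freq_in_cat * $cat_freq / $word_freq,
            "simplified" => $words_in_cat / $words_in_all_cats
        );
    }
}
// Hypothetical usage: a word on 3 of a category's 10 lines, and on
// 12 of the 40 lines across all categories.
//   $r = bayes_simplification_demo(3, 12, 10, 40);
//   $r["full"] and $r["simplified"] should agree up to floating-point rounding.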