Bayesian Filter

This is a simple Bayesian filter that supports multiple categories. Most filters allow only two (such as Spam and Not Spam). Once trained, it lets you calculate the probability that a phrase belongs to a category.

Download

Click Here for the Source Code. It is saved here as a ".txt" file, but you will want to save it as a ".php" file.

Description

This is intended to be extremely simple to use. Once you set the defines and global variable at the top of the script, you create your Bayesian directory. Inside this directory, you will have a text file for each category. The category files simply contain example phrases that belong to the category.
Warning! Do not overtrain the filter with thousands of example phrases. It will just run extremely slowly. Instead, start with a minimal set of examples and then add only the phrases that it gets wrong.
Once you have a folder set up, create the filter object in your script with: $my_filter=new Bayesian_Filter("bayesian_directory_name");
Once the filter is created, get the category of a phrase with: $cat=$my_filter->probable_category("The phrase to check.");
That is all for common usage. There are examples of advanced usage in the script's comments.
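
As a rough sketch of the setup described above, the following builds a filter directory with one text file per category, one example phrase per line. The temporary directory location, the ".txt" file names, and the helper name write_category_files are assumptions made for illustration; check the comments in the script for the layout it actually expects.

```php
<?php
// Hypothetical helper: writes one text file per category, one phrase per line.
// The "<category>.txt" naming is an assumption, not a documented requirement.
function write_category_files(string $dir, array $categories): void
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true)) {
        throw new RuntimeException("Could not create $dir");
    }
    foreach ($categories as $name => $phrases) {
        file_put_contents("$dir/$name.txt", implode("\n", $phrases) . "\n");
    }
}

// Start small, as the warning above suggests.
$dir = sys_get_temp_dir() . '/bayesian_directory_name';
write_category_files($dir, [
    'Name'    => ['Abraham Lincoln', 'John Lennon'],
    'Address' => ['1600 Pennsylvania Ave.'],
]);

// With the script included, usage would then follow the two calls shown above:
// $my_filter = new Bayesian_Filter($dir);
// $cat = $my_filter->probable_category("The phrase to check.");
```

Keeping the examples in plain text files means that correcting the filter is just a matter of appending a line to the right file.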

Windows Users Note

This script was written for Linux. It also works on Unix. I have not tested it on Windows. The compatibility issue comes from the direct calls to the operating system. By default, they are:

If Windows has equivalents to these commands, this script should work. I do not have a Windows machine, so I cannot test it.

How Does It Work?

Assume you have the following three category files:

Name:
    Abraham Lincoln
    John Lennon
    Douglas Adams
    Shaun Wagner

Address:
    1600 Pennsylvania Ave.
    One Microsoft Way
    29-A Lincoln Center

CSZ:
    Washington, DC 20500
    Redmond, WA 98052-6399
    Kansas City, MO 64154
    29 Palms, CA
    Charleston, SC 29401
    Charleston, SC 29407

Notice that the files do not necessarily contain the same number of lines. This script takes that into account when it calculates probabilities.
Now, we have a filter that can classify a phrase as a person's name, a street address, or a combination of city, state, and zip code. Let's start with a simple example: classify John Wagner.
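
The arithmetic for this example is not shown here, so the following is only a sketch of one common way to score the phrase: a word-level naive Bayes with add-one smoothing and equal category priors. The function names, the tokenizer, and the smoothing are assumptions and may differ from what the script does. Dividing each word count by the category's total word count is what keeps a longer file from winning simply because it is longer.

```php
<?php
// Assumed tokenizer: split a phrase into lowercase alphanumeric words.
function tokenize(string $phrase): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($phrase), -1, PREG_SPLIT_NO_EMPTY);
}

// Count word occurrences per category and measure the shared vocabulary.
function train(array $categories): array
{
    $counts = [];
    $vocab = [];
    foreach ($categories as $name => $phrases) {
        $counts[$name] = [];
        foreach ($phrases as $phrase) {
            foreach (tokenize($phrase) as $word) {
                $counts[$name][$word] = ($counts[$name][$word] ?? 0) + 1;
                $vocab[$word] = true;
            }
        }
    }
    return ['counts' => $counts, 'vocab_size' => count($vocab)];
}

// Score each category as the product of smoothed word probabilities,
// then normalize so the scores sum to 1 (equal priors assumed).
function category_probabilities(array $model, string $phrase): array
{
    $scores = [];
    foreach ($model['counts'] as $name => $wordCounts) {
        $total = array_sum($wordCounts);
        $score = 1.0;
        foreach (tokenize($phrase) as $word) {
            $score *= (($wordCounts[$word] ?? 0) + 1) / ($total + $model['vocab_size']);
        }
        $scores[$name] = $score;
    }
    $sum = array_sum($scores);
    foreach ($scores as $name => $score) {
        $scores[$name] = $score / $sum;
    }
    arsort($scores); // most probable category first
    return $scores;
}

$model = train([
    'Name'    => ['Abraham Lincoln', 'John Lennon', 'Douglas Adams', 'Shaun Wagner'],
    'Address' => ['1600 Pennsylvania Ave.', 'One Microsoft Way', '29-A Lincoln Center'],
    'CSZ'     => ['Washington, DC 20500', 'Redmond, WA 98052-6399', 'Kansas City, MO 64154',
                  '29 Palms, CA', 'Charleston, SC 29401', 'Charleston, SC 29407'],
]);

$probs = category_probabilities($model, 'John Wagner');
echo array_key_first($probs), "\n"; // prints "Name" under these assumptions
```

Under these assumptions, both "John" and "Wagner" appear only in the Name file, so Name receives most of the probability (roughly 0.73), with the remainder split between Address and CSZ.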

Next, try something a little more complicated: 29 Lennon Way. We can quickly see that it is a street address. How about the filter?

You can see that the filter would fail on the name "Charles Washington": it would classify it as a CSZ, with Charleston and Washington being the only matches. Adding Charles Washington to the names list would help, but it may not cure the problem completely. That is why this is all about PROBABILITIES, not ABSOLUTES.
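
To illustrate the advice about adding only the phrases the filter gets wrong, here is a hedged sketch that classifies a phrase and, only when the answer is wrong, adds the phrase to the correct category. It uses an assumed word-level naive Bayes scoring (exact word matches, add-one smoothing, equal priors) rather than the script's own code, so in this sketch only "Washington" matches a CSZ word; "Charles" does not match "Charleston". The names words, classify, and correct_if_wrong are hypothetical.

```php
<?php
// Assumed tokenizer: lowercase alphanumeric words.
function words(string $phrase): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($phrase), -1, PREG_SPLIT_NO_EMPTY);
}

// Return the category with the highest summed log-probability.
function classify(array $categories, string $phrase): string
{
    $counts = [];
    $vocab = [];
    foreach ($categories as $name => $phrases) {
        $counts[$name] = array_count_values(words(implode(' ', $phrases)));
        $vocab += $counts[$name]; // union by key collects distinct words
    }
    $best = '';
    $bestScore = -INF;
    foreach ($counts as $name => $wordCounts) {
        $denominator = array_sum($wordCounts) + count($vocab);
        $score = 0.0;
        foreach (words($phrase) as $word) {
            $score += log((($wordCounts[$word] ?? 0) + 1) / $denominator);
        }
        if ($score > $bestScore) {
            $best = $name;
            $bestScore = $score;
        }
    }
    return $best;
}

// Add the phrase to the expected category only when the filter gets it wrong.
function correct_if_wrong(array $categories, string $phrase, string $expected): array
{
    if (classify($categories, $phrase) !== $expected) {
        $categories[$expected][] = $phrase;
    }
    return $categories;
}

$categories = [
    'Name'    => ['Abraham Lincoln', 'John Lennon', 'Douglas Adams', 'Shaun Wagner'],
    'Address' => ['1600 Pennsylvania Ave.', 'One Microsoft Way', '29-A Lincoln Center'],
    'CSZ'     => ['Washington, DC 20500', 'Redmond, WA 98052-6399', 'Kansas City, MO 64154',
                  '29 Palms, CA', 'Charleston, SC 29401', 'Charleston, SC 29407'],
];

$before = classify($categories, 'Charles Washington');
$categories = correct_if_wrong($categories, 'Charles Washington', 'Name');
$after = classify($categories, 'Charles Washington');
echo "$before -> $after\n";
```

In this sketch the phrase is first classified as CSZ and, after the single correction, as Name. With other phrases or a different scoring scheme one added example may not be enough, which matches the point about probabilities.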