Bayesian Filter
This is a simple Bayesian filter that allows you to have multiple categories. Most filters only allow two categories (such as Spam and Not Spam). Once trained, this will allow you to calculate the probability that a phrase belongs to a category.
Download
Click Here for the Source Code. It is saved here as a ".txt" file, but you will want to save it as a ".php" file.
Description
This is intended to be extremely simple to use.
Once you set the defines and global variable at the top of the script, you create your Bayesian directory.
Inside this directory, you will have a text file for each category.
The category files simply contain example phrases that belong to the category.
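As a rough illustration of the layout described above, the following sketch builds a small training directory. The directory location, the category names, and the ".txt" extension are assumptions made for this example; check the comments in the script for the naming it actually expects.

```php
<?php
// Hypothetical location and names, chosen only for this sketch.
$dir = sys_get_temp_dir() . '/bayesian_directory_name';
if (!is_dir($dir)) {
    mkdir($dir, 0755, true);
}

// One entry per category; each value is a list of example phrases.
$examples = [
    'Name'    => ['Abraham Lincoln', 'John Lennon'],
    'Address' => ['1600 Pennsylvania Ave.', 'One Microsoft Way'],
];

foreach ($examples as $category => $phrases) {
    // Write one example phrase per line into the category's text file.
    file_put_contents("$dir/$category.txt", implode("\n", $phrases) . "\n");
}
```

Keeping one phrase per line matters because the filter counts lines when it calculates frequencies.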
Warning!
Do not overtrain the filter with thousands of example phrases; it will just run extremely slowly.
Instead, start with a minimal set of examples and then add only the phrases that it gets wrong.
Once you have a folder set up, create the filter object in your script with:
$my_filter=new Bayesian_Filter("bayesian_directory_name");
With the filter created, get the most probable category of a phrase with:
$cat=$my_filter->probable_category("The phrase to check.");
That is all for common usage. There are examples of advanced usage in the script's comments.
Windows Users Note
This script was written for Linux. It also works on Unix. I have not tested it on Windows. The compatibility issue comes from the script's direct calls to operating system commands. By default, they are:
- grep: Return the lines of a file/stdin that contain a supplied keyword.
- wc -l: Return the number of lines in a file/stdin.
- echo: Print a value to stdout.
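If those commands are not available, the same counts can be approximated in plain PHP. This is only a sketch: the function names below are hypothetical and are not part of the script, and the exact flags the script passes to grep are not shown here, so a plain substring match is assumed.

```php
<?php
// Hypothetical helpers, not taken from the original script.

// Rough stand-in for `wc -l`. Unlike wc, it skips empty lines.
function count_lines(string $path): int
{
    return count(file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
}

// Rough stand-in for `grep keyword file | wc -l`:
// the number of lines that contain the keyword as a substring.
function count_lines_containing(string $path, string $keyword): int
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $hits = array_filter($lines, fn($line) => strpos($line, $keyword) !== false);
    return count($hits);
}

// Small demonstration on a temporary category file.
$path = tempnam(sys_get_temp_dir(), 'cat');
file_put_contents($path, "Abraham Lincoln\nJohn Lennon\nDouglas Adams\nShaun Wagner\n");
echo count_lines($path), "\n";                     // 4
echo count_lines_containing($path, 'John'), "\n";  // 1
unlink($path);
```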
How Does It Work?
Assume you have the following three category files:
| Name | Address | CSZ |
|---|---|---|
| Abraham Lincoln | 1600 Pennsylvania Ave. | Washington, DC 20500 |
| John Lennon | One Microsoft Way | Redmond, WA 98052-6399 |
| Douglas Adams | 29-A Lincoln Center | Kansas City, MO 64154 |
| Shaun Wagner | | 29 Palms, CA |
| | | Charleston, SC 29401 |
| | | Charleston, SC 29407 |
Now, we have a filter that can classify a phrase as a person's name, a street address, or a combination of city, state, and zip code. Let's start with a simple example: classify John Wagner.
- Break the phrase into words: "John" and "Wagner"
- Get the Bayesian Probability for "John"
  - Calculate Frequencies: Frequency is the number of lines in which the word occurs divided by the total number of lines.
    - Is in 1/4 of Names.
    - Is in 0/3 of Addresses.
    - Is in 0/6 of CSZ.
    - Is in 1/13 of all samples.
  - Calculate Bayesian Probability: Bayesian Probability is the frequency for the category divided by the frequency for all lines.
    - Names: (1/4)/(1/13) = 3.25
    - Addresses: (0/3)/(1/13) = 0
    - CSZ: (0/6)/(1/13) = 0
- Get the Bayesian Probability for "Wagner"
  - It appears only once, in Names, so its values are identical to those for "John".
- Average the Bayesian probabilities for each word: Since both words had the same values, the average is identical to the probabilities for either word.
- Return the category with the highest probability: Names has the highest value, so it is returned.
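The arithmetic above can be summarized in one formula. The symbols are notation introduced for this summary, not names from the script: $n_c(w)$ is the number of lines in category $c$ that contain word $w$, $N_c$ is the number of lines in $c$, $n(w)$ and $N$ are the same counts over all categories together, and $w_1, \dots, w_k$ are the words of the phrase.

$$
\text{score}(w, c) = \frac{n_c(w) / N_c}{n(w) / N},
\qquad
\text{score}(\text{phrase}, c) = \frac{1}{k} \sum_{i=1}^{k} \text{score}(w_i, c)
$$

For "John" and Names this gives $(1/4)/(1/13) = 13/4 = 3.25$. Because the score is a ratio of two frequencies, it is not bounded by 1. A value above 1 means the word is more common in that category than in the samples overall.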
- Check the words "29", "Lennon", and "Way".
  - 29:
    - Names: (0/4)/(2/13) = 0
    - Addresses: (1/3)/(2/13) = 2.17
    - CSZ: (1/6)/(2/13) = 1.08
  - Lennon:
    - Names: (1/4)/(1/13) = 3.25
    - Addresses: (0/3)/(1/13) = 0
    - CSZ: (0/6)/(1/13) = 0
  - Way:
    - Names: (0/4)/(1/13) = 0
    - Addresses: (1/3)/(1/13) = 4.33
    - CSZ: (0/6)/(1/13) = 0
- Average the probabilities:
  - Names: (0+3.25+0)/3 = 1.08
  - Addresses: (2.17+0+4.33)/3 = 2.17
  - CSZ: (1.08+0+0)/3 = 0.36
- Return the highest category: Addresses
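The two walkthroughs can be reproduced with a short, self-contained sketch. It is not the script's own code: the function names are hypothetical, the training data is held in an array instead of category files, and whole-word matching is an assumption chosen because it reproduces the counts above (for example, "29" matches "29-A" and "29 Palms" but not "29401"). The phrase "29 Lennon Way" is simply the three checked words joined together.

```php
<?php
// Training data copied from the three category files in the example.
$categories = [
    'Name'    => ['Abraham Lincoln', 'John Lennon', 'Douglas Adams', 'Shaun Wagner'],
    'Address' => ['1600 Pennsylvania Ave.', 'One Microsoft Way', '29-A Lincoln Center'],
    'CSZ'     => ['Washington, DC 20500', 'Redmond, WA 98052-6399', 'Kansas City, MO 64154',
                  '29 Palms, CA', 'Charleston, SC 29401', 'Charleston, SC 29407'],
];

// Count the lines that contain $word as a whole word (assumed matching rule).
function lines_containing(array $lines, string $word): int
{
    $pattern = '/(?<!\w)' . preg_quote($word, '/') . '(?!\w)/';
    return count(preg_grep($pattern, $lines));
}

// Score of one word per category: category frequency / overall frequency.
function word_scores(array $categories, string $word): array
{
    $all = array_merge(...array_values($categories));
    $overall = lines_containing($all, $word) / count($all);
    $scores = [];
    foreach ($categories as $name => $lines) {
        $freq = lines_containing($lines, $word) / count($lines);
        // A word that never occurs in the samples scores 0 everywhere.
        $scores[$name] = $overall > 0 ? $freq / $overall : 0.0;
    }
    return $scores;
}

// Average the per-word scores over all words of the phrase.
function phrase_scores(array $categories, string $phrase): array
{
    $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    $totals = array_fill_keys(array_keys($categories), 0.0);
    foreach ($words as $word) {
        foreach (word_scores($categories, $word) as $name => $score) {
            $totals[$name] += $score;
        }
    }
    $n = max(count($words), 1);
    return array_map(fn($total) => $total / $n, $totals);
}

// Return the category with the highest average score.
function best_category(array $categories, string $phrase): string
{
    $scores = phrase_scores($categories, $phrase);
    arsort($scores);
    return array_key_first($scores);
}

echo best_category($categories, 'John Wagner'), "\n";    // Name
echo best_category($categories, '29 Lennon Way'), "\n";  // Address
```

Holding the samples in memory keeps the sketch independent of grep and wc, at the cost of loading every category into the PHP process.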











