The Top Ten utility is a tool for use with POPFile that lists the top ten words (or however many you choose) in each bucket's corpus, ranked from high to low by probability and word count.
This version has been tested in a Windows environment with POPFile versions 0.19.x, 0.20.x, and 0.21.0. It is not compatible with earlier versions of POPFile. The author believes that the utility is platform independent and will work properly on non-Windows POPFile installs, but has not tested it on those platforms.
POPFile is an automatic email classification tool authored by John Graham-Cumming, available from SourceForge.
Download the correct version of the script to your POPFile install directory, normally c:\Program Files\Popfile, by clicking below.
Open a DOS Command box (click the DOS icon on your desktop, or choose Start/Run, type command in the Open box, and click OK).
Change to your POPFile installation directory, e.g.,
cd "\program files\popfile"
Run topten.pl using Perl.
perl topten.pl > topten.htm
The resulting report will be in the file named 'topten.htm'. Open it with your browser to view it.
start topten.htm
Or browse to your POPFile install directory and open it from there with Explorer.
Note: to list more (or fewer) words, add the command-line option -topten_count with an integer value when you execute topten.pl, e.g.,
perl topten.pl -topten_count 50 > topten.htm
The above would list the top 50 words in each bucket.
The following is a sample of the output from topten run against the author's corpus on June 22, 2003, with the command-line option -topten_count 50 to show the top 50 words.
Users can easily create a batch file (see below) and schedule it (also below) in the Task Scheduler to run periodically. If you bookmark the output file in your favorites, the latest run will be available to you at any time from that bookmark.
Create a batch file as follows:
perl topten.pl -topten_count 50 > topten.htm
cls
@exit
Save the batch file in your POPFile directory and name it topten.bat
Use the Task Scheduler wizard to browse to your POPFile installation directory, usually "c:\Program Files\Popfile", and select the batch file topten.bat
Change the name of the task to top ten
Select the frequency to run it
Select the time and day(s) to run it
Click finish
Close the Task Scheduler (or test the task by right-clicking the new entry you made and selecting Run)
Alternatively, Windows users who have Tim Charron's Blat utility can easily set up topten to run automatically and email the results.
Obtain and install Blat from Tim Charron's page here.
Install Blat in a directory in your path, or in the POPFile directory
Run Blat -install <server address> <senders address> to get Blat configured correctly. Make sure that <server address> points to an SMTP server that you are permitted to relay mail through. Usually this will be the same SMTP server you set up in your mail client.
Create a batch file as follows:
perl topten.pl -topten_count 50 | blat - -t youremail@address.here -s "POPFile Top Ten Report" -html
cls
@exit
Save the batch file in your POPFile directory and name it topten.bat
Use the Task Scheduler wizard to browse to your POPFile installation directory, usually "c:\Program Files\Popfile", and select the batch file topten.bat
Change the name of the task to top ten
Select the frequency to run it
Select the time and day(s) to run it
Click finish
Close the Task Scheduler (or test the task by right-clicking the new entry you made and selecting Run)
You're done. The Task Scheduler will run the batch file at the time(s) you scheduled. The batch file will run the Top Ten report and email it to you. No muss, no fuss <g>
Can you add links to POPFile's word Lookup function so I can just click on a corpus word and see POPFile's current statistics for that word?
No. At present, POPFile requires the session key to display the lookup results. Since this program runs as a separate process, it has no way to access or create a valid session key. If that changes in a future version of POPFile, we will add this feature.
I noticed a topten subdirectory was created in my POPFile folder. Why is this?
This occurs only with versions 0.19.x and 0.20.x of POPFile. The topten program uses the POPFile API to gather all of the corpus data. The API calls automatically create a couple of files, popfile.pid and a popfile#.log file. To ensure that running this program does not interfere with your running POPFile installation, we divert the copies of those files created by this program to a safe place, the topten subdirectory, where they will be harmless. You can delete the subdirectory and its contents at will.
What do the columns mean on the report?
Here's my layman's understanding:
- Word Count: The simple count of the number of times this word appears in the corpus.
- % Bucket: The word count of this word divided by the total word count for the bucket, times 100, i.e., (wc/wcbuckettotal)*100. POPFile often refers to this as frequency.
- % Total: The word count of this word divided by the total global word count for all buckets in the corpus, times 100, i.e., (wc/globalwc)*100.
- Score: The probability for the word's "independent" most probable bucket.
- Probability: The probability for the word appearing in this bucket.
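As a rough illustration of the Word Count, % Bucket, and % Total columns described above, here is a minimal Python sketch. It is not taken from topten.pl. The corpus data, the bucket names, and the function name column_values are all made up for the example. Score and Probability are left out because their exact calculation is not described here.

```python
# Hypothetical word counts per bucket; NOT taken from a real POPFile corpus.
corpus = {
    "spam": {"viagra": 30, "free": 50, "meeting": 20},
    "work": {"meeting": 80, "free": 20},
}


def column_values(corpus, bucket, word):
    """Return (Word Count, % Bucket, % Total) for one word in one bucket.

    Assumes the formulas given in the column descriptions:
    (wc / wcbuckettotal) * 100 and (wc / globalwc) * 100.
    """
    wc = corpus[bucket].get(word, 0)
    bucket_total = sum(corpus[bucket].values())
    global_total = sum(sum(words.values()) for words in corpus.values())
    pct_bucket = wc / bucket_total * 100
    pct_total = wc / global_total * 100
    return wc, pct_bucket, pct_total


wc, pct_bucket, pct_total = column_values(corpus, "spam", "free")
print(f"{wc} {pct_bucket:.2f} {pct_total:.2f}")  # prints: 50 50.00 25.00
```

In this made-up corpus each bucket holds 100 words, so "free" accounts for 50% of the spam bucket but only 25% of the 200 words across all buckets.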
Copyright (C) 2003 - 2004 Scott W. Leighton
Licensed under the terms of the GNU General Public License.
Contributed to the POPFile project under the terms of the POPFile License Agreement.