AB's Blog: An Implementation of a Binary Bayes Classifier in F# - Part 1

Tuesday, May 24, 2011

An Implementation of a Binary Bayes Classifier in F# - Part 1

Over the past few months, in my limited spare time, I have started learning F#. The reason for learning F# is for quite a time now I have found the idea of functional programming and the arguments for it fairly compelling, and so wanted to see for myself. About 9 months ago I started looking at Clojure, but found there were too many peripheral 'things' (like learning Emacs) to deal with, and I really only wanted to focus on the language and functional programming. So being already so familiar with the .NET Framework, Visual Studio and the Microsoft ecosystem as a whole, I figured F# would be the best way to go for now.

onwards and upwards..

For my first F# program I decided to write a binary Bayes classifier. This is a fairly well defined and documented approach to text classification and seemed a complex enough problem to help me learn some of the fundamentals. For this reason I intend to do this as a multi-part post. Being a first attempt at F#, I'm pretty certain that there are even better ways to express the code than I have here, so hopefully I can improve the code over time.

Part 1: Is a high level view of what I intend to implement.
Part 2: Details implementation of the functions for manipulating the classifications.
Part 3: The functions for calculating probabilities, classifying text and summing up.

the classifier..

I'm not going to go into the details of a Bayes classifier since it is such a well known classification strategy and there is plenty written about it 'out there'. But a starting point is of course wikipedia. There are also many different implementations for the algorithm and it has found its most common use in spam filtering. The core of this F# code is derived from quite an old, but well known article by Gary Robinson titled, "A Statistical Approach to the Spam Problem" from Linux Journal.


the how..

I've simply written this code in an F# script file (.fsx) and used F# interactive in Visual Studio to execute statements as I needed. This is a really easy way to incrementally develop the solution.

Before getting to the implementation, the goal of this classifier is to simply determine the probability a given text belongs to a particular classification relative to another. For this post, I'll stick with the canonical example of  the spam or ham problem. To achieve this we are going to have to represent the corpus of each classification somehow. For a text based Bayesian classifier we need to represent a word and an occurrence count. I've chosen to use the simplest way to represent this using tuples. e.g. ("offer", 4832). So the corpus of a classification is represented by a list of tuples. ie. (string * int) list

A very basic spam corpus and ham corpus representation might be expressed in F# as:

let spamCorpusSample = [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)]
let hamCorpusSample = [("NAME", 16); ("DATE", 8); ("REGARDS", 27)]


The next step would be to take some kind of unknown text, pass it through some kind of parser and classify it as either spam or ham. e.g.

let unknownText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."

unknownText |> classifyText spamCorpusSample hamCorpusSample

The result of this will be a value between 0 and 1. A spam classification will be > 0.5 and a ham classification would be < 0.5. Clearly for practical purposes you'd likely have a small uncertainty range around 0.5 to indicate an unknown classification, but for the purposes of this exercise I have ignored that.

Aside from actually being able to classify text, we're also going to want to train the system. So we might have a well known set of spam documents and a well know set of ham documents used to train the system with. Alternatively it might even be self training over time by having users identify spam documents. Whichever approach is taken we will need a function something like:

let knownSpamText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."
let newSpamCorpus = knownSpamText |> trainCorpus spamCorpus


The trainCorpus function will integrate the new knownSpamText document with the existing spamCorpus. But being a functional language, the spamCorpus list is immutable, so the function returns a new list representing the integrated result of spamCorpus and knownSpamText.

That's pretty much the functionality of the classifier at a high level. In the next parts I'll deal with the implementation of the F# code to get the results we want.

No comments:

Post a Comment