AB's Blog: May 2011

Tuesday, May 24, 2011

An Implementation of a Binary Bayes Classifier in F# - Part 3

0 comments
In the previous posts (part 1 & part 2) I showed the implementation of two functions (addWordTo and trainCorpus) which deal with managing the tuple lists that represent the classifications.

This part will deal with the functions needed to calculate probabilities and finally classify a document into spam or ham.

There is essentially one 'utility' function required, called countFrom, which takes a corpus list and a word, in the form of a simple string, as arguments. It attempts to match the word in in the corpus list and returns the integer part of the tuple, otherwise it returns 0.

let countFrom corpus word =  
    let result = corpus |> List.tryFind (fun (w, c) -> w = word)
    match result with
        | None -> 0
        | Some(v) -> snd v


The next 3 functions are the probability functions required to determine what the classification of the text is.
The prCategory first calculates the rank probability of the category a word belongs to. This comes directly from the Gary Robinson article - "A Statistical Approach to the Spam Problem", so there is no need to go into detail about it here.

let prCategory (count0, total0) (count1, total1) = 
    let pword =  (count0 / total0) / ((count0 / total0) + (count1 / total1))
    let ctotal = count0 + count1
    ((0.5 + (ctotal * pword) ) / (1.0 + ctotal))

let prWord fstCorpus sndCorpus word =
    let fstCount = double (word |> countFrom fstCorpus)
    let sndCount = double (word |> countFrom sndCorpus)
    let sum corpus = double (corpus |> List.map (fun (_, c:int) -> c) |> List.sum)
    let pr = prCategory (fstCount, sum fstCorpus) (sndCount, sum sndCorpus)
    if Double.IsNaN(pr) then None else Some(pr)

let prBayesian prOptionList =
    let pList = prOptionList |> List.choose (fun v -> v) // removes all None's from the list
    let prod = pList |> List.reduce ( * )
    let inv = pList |> List.map (fun v -> 1.0 - v) |> List.reduce ( * )
    prod / (prod + inv)


The power of these functions is in the use of the List map and reduce functions.


classifying (at last)..

The last two functions are what actually classify a text. The first function, parse, is just a very simple parser used to return a list of words from a text. Typically this function might also include additional features such as a stemming algorithm or frequent/common word filter. For this example it applies a basic regular expression, converts to uppercase and returns a list of word-frequency tuples with the frequency set to 1.

The second function,  classifyText, takes two corpus lists and a parsed text. It then calculates the probability that the text belongs to the first of the two corpus relative to the second and returns the result as a value between 0 and 1.

let parse src = 
    // could add additional data cleansing functionality here
    Regex.Replace(src, @"[^\w\'@-]", " ").Split(' ')
    |> Array.map (fun w -> w.ToUpper()) 
    |> Array.filter (fun w -> w <> System.String.Empty) 
    |> Set.ofArray  |> Set.map (fun w -> (w, 1)) |> Set.toList

let classifyText fstCorpus sndCorpus text =
     prBayesian (text |> List.map (fun (w, _) -> prWord fstCorpus sndCorpus



using it..

// Setup the spam and ham corpora
let spamCorpusSample = [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)]
let hamCorpusSample = [("NAME", 16); ("DATE", 8); ("REGARDS", 27)]

// Classify as Spam
let unknownText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."
classifyText spamCorpusSample hamCorpusSample unknownText


The fsi output

> 

val spamCorpusSample : (string * int) list =
  [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)]

> 

val hamCorpusSample : (string * int) list =
  [("NAME", 16); ("DATE", 8); ("REGARDS", 27)]

> 

val unknownText : (string * int) list =
  [("A", 1); ("ADDRESS", 1); ("AND", 1); ("DRAW", 1); ("ENTER", 1); ("FOR", 1);
   ("FREE", 1); ("LIMITED", 1); ("NAME", 1); ("OFFER", 1); ("PRIZE", 1);
   ("THE", 1); ("YOUR", 1)]

> 
val it : float = 0.9969589807
> 

the code..

The full F# script is available at https://gist.github.com/987997

final thoughts..

So far I really like where F# and functional programming is taking me. The more I learn about it the more the concepts from functional programming influence my approach development, and the more I want to keep learning and using functional languages.

An Implementation of a Binary Bayes Classifier in F# - Part 2

0 comments
In Part 1 of this series I outlined what high level functions the binary Bayes classifier should have and how I was going to represent the classifications as lists of (string * int) tuples.

In this part, I will step through the F# implementation of the part of the classifier which deals with managing the corpus lists. As I noted in part 1, this is my initial foray into F# and functional programming. So I expect there are possibly going to be much better ways of expressing this code. If by chance I have nailed it on the first go, then I'll just have to call that beginners luck.  :)


into the guts..

The first function is a recursive function, addWordTo. It takes as arguments a newCorpus, the current corpus (a list of tuples) to be added to and the word to add to it. The value returned from this function is the new corpus list.

let rec addWordTo newCorpus corpus word =
    match corpus with
        | [] -> 
            match word with
                | None -> newCorpus
                | Some(value) -> newCorpus @ [value]
        | h::t -> 
            match word with
                | None -> addWordTo (newCorpus @ [h]) t None
                | Some(value) ->
                    if (fst h = fst value) then
                        addWordTo (newCorpus @ [(fst h, snd h + snd value)]) t None
                    else
                        addWordTo (newCorpus @ [h]) t word 


Next is the training function outlined in part 1. This is also a recursive function which takes a corpus and a text argument to add or "train" that corpus with. This adds each word tuple in the text to the corpus list by traversing the list and returns the new corpus list result. In other words:
('a * int) list -> ('a * int) list -> ('a * int) list when 'a : equality

let rec trainCorpus corpus text =
    match text with
        | [] -> corpus
        | h::t ->
            trainCorpus (addWordTo [] corpus (Some h)) t


a word on tail recursion..

Most functional programming languages, including F# guarantee tail recursion optimisation on specially structured recursive functions. A tail recursive function is one where the last instruction called must be the recursive call (with the exception of a return). The result of this is it allows the compiler to eliminate the need for the nested calls of the recursive function being pushed on to the stack, effectively converting the recursion into a loop. The result of which is to eliminate any possibility of the recursive function causing a stack overflow.

Going back to the function above, the structure of the trainCorpus function allows it to take advantage of tail recursion since the recursive call back into the function is the very last statement in the function declaration.

I'd like to come back to the addWordTo function in a future post and focus on eliminating the newCorpus argument through the use of an accumulator as well as possibly optimising for tail recursion.

That takes care of managing the corpus lists. Two rather terse functions is all that is required here. Part 3 will be dedicated to the functions used for calculating probabilities and performing the actual classification.

An Implementation of a Binary Bayes Classifier in F# - Part 1

0 comments
Over the past few months, in my limited spare time, I have started learning F#. The reason for learning F# is for quite a time now I have found the idea of functional programming and the arguments for it fairly compelling, and so wanted to see for myself. About 9 months ago I started looking at Clojure, but found there were too many peripheral 'things' (like learning Emacs) to deal with, and I really only wanted to focus on the language and functional programming. So being already so familiar with the .NET Framework, Visual Studio and the Microsoft ecosystem as a whole, I figured F# would be the best way to go for now.

onwards and upwards..

For my first F# program I decided to write a binary Bayes classifier. This is a fairly well defined and documented approach to text classification and seemed a complex enough problem to help me learn some of the fundamentals. For this reason I intend to do this as a multi-part post. Being a first attempt at F#, I'm pretty certain that there are even better ways to express the code than I have here, so hopefully I can improve the code over time.

Part 1: Is a high level view of what I intend to implement.
Part 2: Details implementation of the functions for manipulating the classifications.
Part 3: The functions for calculating probabilities, classifying text and summing up.

the classifier..

I'm not going to go into the details of a Bayes classifier since it is such a well known classification strategy and there is plenty written about it 'out there'. But a starting point is of course wikipedia. There are also many different implementations for the algorithm and it has found its most common use in spam filtering. The core of this F# code is derived from quite an old, but well known article by Gary Robinson titled, "A Statistical Approach to the Spam Problem" from Linux Journal.


the how..

I've simply written this code in an F# script file (.fsx) and used F# interactive in Visual Studio to execute statements as I needed. This is a really easy way to incrementally develop the solution.

Before getting to the implementation, the goal of this classifier is to simply determine the probability a given text belongs to a particular classification relative to another. For this post, I'll stick with the canonical example of  the spam or ham problem. To achieve this we are going to have to represent the corpus of each classification somehow. For a text based Bayesian classifier we need to represent a word and an occurrence count. I've chosen to use the simplest way to represent this using tuples. e.g. ("offer", 4832). So the corpus of a classification is represented by a list of tuples. ie. (string * int) list

A very basic spam corpus and ham corpus representation might be expressed in F# as:

let spamCorpusSample = [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)]
let hamCorpusSample = [("NAME", 16); ("DATE", 8); ("REGARDS", 27)]


The next step would be to take some kind of unknown text, pass it through some kind of parser and classify it as either spam or ham. e.g.

let unknownText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."

unknownText |> classifyText spamCorpusSample hamCorpusSample

The result of this will be a value between 0 and 1. A spam classification will be > 0.5 and a ham classification would be < 0.5. Clearly for practical purposes you'd likely have a small uncertainty range around 0.5 to indicate an unknown classification, but for the purposes of this exercise I have ignored that.

Aside from actually being able to classify text, we're also going to want to train the system. So we might have a well known set of spam documents and a well know set of ham documents used to train the system with. Alternatively it might even be self training over time by having users identify spam documents. Whichever approach is taken we will need a function something like:

let knownSpamText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."
let newSpamCorpus = knownSpamText |> trainCorpus spamCorpus


The trainCorpus function will integrate the new knownSpamText document with the existing spamCorpus. But being a functional language, the spamCorpus list is immutable, so the function returns a new list representing the integrated result of spamCorpus and knownSpamText.

That's pretty much the functionality of the classifier at a high level. In the next parts I'll deal with the implementation of the F# code to get the results we want.