AB's Blog: An Implementation of a Binary Bayes Classifier in F# - Part 2

Tuesday, May 24, 2011

An Implementation of a Binary Bayes Classifier in F# - Part 2

In Part 1 of this series I outlined what high level functions the binary Bayes classifier should have and how I was going to represent the classifications as lists of (string * int) tuples.

In this part, I will step through the F# implementation of the part of the classifier which deals with managing the corpus lists. As I noted in part 1, this is my initial foray into F# and functional programming. So I expect there are possibly going to be much better ways of expressing this code. If by chance I have nailed it on the first go, then I'll just have to call that beginners luck.  :)


into the guts..

The first function is a recursive function, addWordTo. It takes as arguments a newCorpus, the current corpus (a list of tuples) to be added to and the word to add to it. The value returned from this function is the new corpus list.

let rec addWordTo newCorpus corpus word =
    match corpus with
        | [] -> 
            match word with
                | None -> newCorpus
                | Some(value) -> newCorpus @ [value]
        | h::t -> 
            match word with
                | None -> addWordTo (newCorpus @ [h]) t None
                | Some(value) ->
                    if (fst h = fst value) then
                        addWordTo (newCorpus @ [(fst h, snd h + snd value)]) t None
                    else
                        addWordTo (newCorpus @ [h]) t word 


Next is the training function outlined in part 1. This is also a recursive function which takes a corpus and a text argument to add or "train" that corpus with. This adds each word tuple in the text to the corpus list by traversing the list and returns the new corpus list result. In other words:
('a * int) list -> ('a * int) list -> ('a * int) list when 'a : equality

let rec trainCorpus corpus text =
    match text with
        | [] -> corpus
        | h::t ->
            trainCorpus (addWordTo [] corpus (Some h)) t


a word on tail recursion..

Most functional programming languages, including F# guarantee tail recursion optimisation on specially structured recursive functions. A tail recursive function is one where the last instruction called must be the recursive call (with the exception of a return). The result of this is it allows the compiler to eliminate the need for the nested calls of the recursive function being pushed on to the stack, effectively converting the recursion into a loop. The result of which is to eliminate any possibility of the recursive function causing a stack overflow.

Going back to the function above, the structure of the trainCorpus function allows it to take advantage of tail recursion since the recursive call back into the function is the very last statement in the function declaration.

I'd like to come back to the addWordTo function in a future post and focus on eliminating the newCorpus argument through the use of an accumulator as well as possibly optimising for tail recursion.

That takes care of managing the corpus lists. Two rather terse functions is all that is required here. Part 3 will be dedicated to the functions used for calculating probabilities and performing the actual classification.

No comments:

Post a Comment