AB's Blog

Tuesday, May 24, 2011

An Implementation of a Binary Bayes Classifier in F# - Part 3

In the previous posts (part 1 & part 2) I showed the implementation of two functions (addWordTo and trainCorpus) which deal with managing the tuple lists that represent the classifications.

This part will deal with the functions needed to calculate probabilities and finally classify a document into spam or ham.

There is essentially one 'utility' function required, called countFrom, which takes a corpus list and a word, in the form of a simple string, as arguments. It attempts to match the word in in the corpus list and returns the integer part of the tuple, otherwise it returns 0.

let countFrom corpus word =  
    let result = corpus |> List.tryFind (fun (w, c) -> w = word)
    match result with
        | None -> 0
        | Some(v) -> snd v

The next 3 functions are the probability functions required to determine what the classification of the text is.
The prCategory first calculates the rank probability of the category a word belongs to. This comes directly from the Gary Robinson article - "A Statistical Approach to the Spam Problem", so there is no need to go into detail about it here.

let prCategory (count0, total0) (count1, total1) = 
    let pword =  (count0 / total0) / ((count0 / total0) + (count1 / total1))
    let ctotal = count0 + count1
    ((0.5 + (ctotal * pword) ) / (1.0 + ctotal))

let prWord fstCorpus sndCorpus word =
    let fstCount = double (word |> countFrom fstCorpus)
    let sndCount = double (word |> countFrom sndCorpus)
    let sum corpus = double (corpus |> List.map (fun (_, c:int) -> c) |> List.sum)
    let pr = prCategory (fstCount, sum fstCorpus) (sndCount, sum sndCorpus)
    if Double.IsNaN(pr) then None else Some(pr)

let prBayesian prOptionList =
    let pList = prOptionList |> List.choose (fun v -> v) // removes all None's from the list
    let prod = pList |> List.reduce ( * )
    let inv = pList |> List.map (fun v -> 1.0 - v) |> List.reduce ( * )
    prod / (prod + inv)

The power of these functions is in the use of the List map and reduce functions.

classifying (at last)..

The last two functions are what actually classify a text. The first function, parse, is just a very simple parser used to return a list of words from a text. Typically this function might also include additional features such as a stemming algorithm or frequent/common word filter. For this example it applies a basic regular expression, converts to uppercase and returns a list of word-frequency tuples with the frequency set to 1.

The second function,  classifyText, takes two corpus lists and a parsed text. It then calculates the probability that the text belongs to the first of the two corpus relative to the second and returns the result as a value between 0 and 1.

let parse src = 
    // could add additional data cleansing functionality here
    Regex.Replace(src, @"[^\w\'@-]", " ").Split(' ')
    |> Array.map (fun w -> w.ToUpper()) 
    |> Array.filter (fun w -> w <> System.String.Empty) 
    |> Set.ofArray  |> Set.map (fun w -> (w, 1)) |> Set.toList

let classifyText fstCorpus sndCorpus text =
     prBayesian (text |> List.map (fun (w, _) -> prWord fstCorpus sndCorpus

using it..

// Setup the spam and ham corpora
let spamCorpusSample = [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)]
let hamCorpusSample = [("NAME", 16); ("DATE", 8); ("REGARDS", 27)]

// Classify as Spam
let unknownText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."
classifyText spamCorpusSample hamCorpusSample unknownText

The fsi output


val spamCorpusSample : (string * int) list =
  [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)]


val hamCorpusSample : (string * int) list =
  [("NAME", 16); ("DATE", 8); ("REGARDS", 27)]


val unknownText : (string * int) list =
  [("A", 1); ("ADDRESS", 1); ("AND", 1); ("DRAW", 1); ("ENTER", 1); ("FOR", 1);
   ("FREE", 1); ("LIMITED", 1); ("NAME", 1); ("OFFER", 1); ("PRIZE", 1);
   ("THE", 1); ("YOUR", 1)]

val it : float = 0.9969589807

the code..

The full F# script is available at https://gist.github.com/987997

final thoughts..

So far I really like where F# and functional programming is taking me. The more I learn about it the more the concepts from functional programming influence my approach development, and the more I want to keep learning and using functional languages.

An Implementation of a Binary Bayes Classifier in F# - Part 2

In Part 1 of this series I outlined what high level functions the binary Bayes classifier should have and how I was going to represent the classifications as lists of (string * int) tuples.

In this part, I will step through the F# implementation of the part of the classifier which deals with managing the corpus lists. As I noted in part 1, this is my initial foray into F# and functional programming. So I expect there are possibly going to be much better ways of expressing this code. If by chance I have nailed it on the first go, then I'll just have to call that beginners luck.  :)

into the guts..

The first function is a recursive function, addWordTo. It takes as arguments a newCorpus, the current corpus (a list of tuples) to be added to and the word to add to it. The value returned from this function is the new corpus list.

let rec addWordTo newCorpus corpus word =
    match corpus with
        | [] -> 
            match word with
                | None -> newCorpus
                | Some(value) -> newCorpus @ [value]
        | h::t -> 
            match word with
                | None -> addWordTo (newCorpus @ [h]) t None
                | Some(value) ->
                    if (fst h = fst value) then
                        addWordTo (newCorpus @ [(fst h, snd h + snd value)]) t None
                        addWordTo (newCorpus @ [h]) t word 

Next is the training function outlined in part 1. This is also a recursive function which takes a corpus and a text argument to add or "train" that corpus with. This adds each word tuple in the text to the corpus list by traversing the list and returns the new corpus list result. In other words:
('a * int) list -> ('a * int) list -> ('a * int) list when 'a : equality

let rec trainCorpus corpus text =
    match text with
        | [] -> corpus
        | h::t ->
            trainCorpus (addWordTo [] corpus (Some h)) t

a word on tail recursion..

Most functional programming languages, including F# guarantee tail recursion optimisation on specially structured recursive functions. A tail recursive function is one where the last instruction called must be the recursive call (with the exception of a return). The result of this is it allows the compiler to eliminate the need for the nested calls of the recursive function being pushed on to the stack, effectively converting the recursion into a loop. The result of which is to eliminate any possibility of the recursive function causing a stack overflow.

Going back to the function above, the structure of the trainCorpus function allows it to take advantage of tail recursion since the recursive call back into the function is the very last statement in the function declaration.

I'd like to come back to the addWordTo function in a future post and focus on eliminating the newCorpus argument through the use of an accumulator as well as possibly optimising for tail recursion.

That takes care of managing the corpus lists. Two rather terse functions is all that is required here. Part 3 will be dedicated to the functions used for calculating probabilities and performing the actual classification.

An Implementation of a Binary Bayes Classifier in F# - Part 1

Over the past few months, in my limited spare time, I have started learning F#. The reason for learning F# is for quite a time now I have found the idea of functional programming and the arguments for it fairly compelling, and so wanted to see for myself. About 9 months ago I started looking at Clojure, but found there were too many peripheral 'things' (like learning Emacs) to deal with, and I really only wanted to focus on the language and functional programming. So being already so familiar with the .NET Framework, Visual Studio and the Microsoft ecosystem as a whole, I figured F# would be the best way to go for now.

onwards and upwards..

For my first F# program I decided to write a binary Bayes classifier. This is a fairly well defined and documented approach to text classification and seemed a complex enough problem to help me learn some of the fundamentals. For this reason I intend to do this as a multi-part post. Being a first attempt at F#, I'm pretty certain that there are even better ways to express the code than I have here, so hopefully I can improve the code over time.

Part 1: Is a high level view of what I intend to implement.
Part 2: Details implementation of the functions for manipulating the classifications.
Part 3: The functions for calculating probabilities, classifying text and summing up.

the classifier..

I'm not going to go into the details of a Bayes classifier since it is such a well known classification strategy and there is plenty written about it 'out there'. But a starting point is of course wikipedia. There are also many different implementations for the algorithm and it has found its most common use in spam filtering. The core of this F# code is derived from quite an old, but well known article by Gary Robinson titled, "A Statistical Approach to the Spam Problem" from Linux Journal.

the how..

I've simply written this code in an F# script file (.fsx) and used F# interactive in Visual Studio to execute statements as I needed. This is a really easy way to incrementally develop the solution.

Before getting to the implementation, the goal of this classifier is to simply determine the probability a given text belongs to a particular classification relative to another. For this post, I'll stick with the canonical example of  the spam or ham problem. To achieve this we are going to have to represent the corpus of each classification somehow. For a text based Bayesian classifier we need to represent a word and an occurrence count. I've chosen to use the simplest way to represent this using tuples. e.g. ("offer", 4832). So the corpus of a classification is represented by a list of tuples. ie. (string * int) list

A very basic spam corpus and ham corpus representation might be expressed in F# as:

let spamCorpusSample = [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)]
let hamCorpusSample = [("NAME", 16); ("DATE", 8); ("REGARDS", 27)]

The next step would be to take some kind of unknown text, pass it through some kind of parser and classify it as either spam or ham. e.g.

let unknownText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."

unknownText |> classifyText spamCorpusSample hamCorpusSample

The result of this will be a value between 0 and 1. A spam classification will be > 0.5 and a ham classification would be < 0.5. Clearly for practical purposes you'd likely have a small uncertainty range around 0.5 to indicate an unknown classification, but for the purposes of this exercise I have ignored that.

Aside from actually being able to classify text, we're also going to want to train the system. So we might have a well known set of spam documents and a well know set of ham documents used to train the system with. Alternatively it might even be self training over time by having users identify spam documents. Whichever approach is taken we will need a function something like:

let knownSpamText = parse "Enter the prize draw for free. A limited offer. Enter your name and address."
let newSpamCorpus = knownSpamText |> trainCorpus spamCorpus

The trainCorpus function will integrate the new knownSpamText document with the existing spamCorpus. But being a functional language, the spamCorpus list is immutable, so the function returns a new list representing the integrated result of spamCorpus and knownSpamText.

That's pretty much the functionality of the classifier at a high level. In the next parts I'll deal with the implementation of the F# code to get the results we want.

Thursday, May 27, 2010

Australian Coding Kata Championships

I don't know if this idea has legs, but I've been contemplating over the past few weeks whether it might be worth raising the idea of holding a bit of a fun event in Australia in the form of  a "Coding Kata Championships". At this very embryonic stage I'm thinking something along the lines of:

  1. Involve as many of the technical developer user groups as might be interested. eg. .Net, Java, Ruby, Python etc. This would need to be a totally open event.
  2. Hold state based playoffs with a final national championship. Any prizes to be provided through sponsorship.
  3. A set of rules and judging to be developed: One aspect of judging might be based around how "cleanly" a competitor performed their kata based off mistakes/typos.
  4. All competitors would have to choose from a limited set of kata (several months in advance) from which they would then register and would perform on the day of competition.
  5. Competitions might need to be broken into divisions, maybe by programming paradigms to facilitate judging. Although there isn't exactly a clean separation with regards to that, so I'm not quite sure how that would be managed.

Of course the judging and rules might end up being the thing which ultimately makes or breaks the viability of the concept, so please...

Discuss. :)

Monday, April 5, 2010

Programming for kids

So how much programming can a six year old learn? OK, it's not exactly a scientific question nor do I intend to conduct this scientifically, but I think it would be interesting to see just how much of a basic understanding of programming a six year old can learn and be interested in. I know there are plenty of programming languages for kids, but I'm more interested in seeing how easy or hard the concepts of a mature mainstream language are.

So what I'd like to do is try this with Python and/or Ruby. Of course there are some extra benefits over and above actually learning to program, in that essential skills are also learning to read, type and spell. Compilers and interpreters are somewhat less forgiving than us humans in reading a typo. (hmm now there's an idea - I wonder if anyone has ever implemented a Levenshtein distance formula into a compiler/interpreter to cope with minor typos.)

So I hope to be able to make this a multi-part series of posts tracking our progress. Naturally if he shows zero interest it could be the shortest series of blog posts in the history of the internet. And I wouldn't want to be accused of experimenting on my children either. :)

Monday, March 15, 2010

Using Microsoft's Natal as a tool for children with Autistic Spectrum Disorder.

Last week I stumbled across an article about Microsoft's Project Natal for the Xbox. I had seen something about Natal a while back, but until last week, had totally forgotten about it, which got me thinking again. For quite some time now I have had a real interest in NUI design, both as a personal interest and also something I have repeatedly tried to raise as a genuinely interesting possibility for various multi-touch client applications for the company I develop software for. But so far this hasn't been embraced with much enthusiasm. However Natal takes things one step further, and I quipped in a comment on Richard Banks blog on Rethinking the IDE, that perhaps Natal could be used as part of a coding IDE in response to his question at the end. So after spending a little time looking at the videos from the Project Natal site, I came across the Project Natal: Meet Milo video, from Lionhead Studios.

Now it's a little bit hard to tell exactly how much of what's going on in that video is smoke and mirrors and how much is real AI and "non-scripted", I'm guessing there is a reasonable amount of smoke and mirrors, but I don't think this takes away from the concept, at least in terms of what I was thinking. My initial reaction to this was not that this takes interactive gaming to a whole new level, which it certainly has the potential to, rather, a variation of this could possibly be an amazingly useful tool for kids with autism and in particular Asperger's Syndrome.

Firstly a little background. I assume most people have at least heard of autism, and being in the software/engineering/science field, we seem to be over represented in the statistics for autism, especially Asperger's Syndrome (which is just part of autism), and many of us certainly have a tendency towards that end of the spectrum or certain aspects of that personality type (see the now famous, although unscientific article from Wired titled "The Geek Syndrome"). But for anyone not entirely familiar, autism is a classified as a developmental spectrum disorder primarily affecting social interaction and communication (far more profoundly than "just shy") as well as a range of other behavioural and cognitive differences. That's the one simple liner. For a better, but by no means definitive, overview I'd recommend wikipedia's entries on Autism Spectrum and Asperger's Syndrome. But just to be clear though Geekiness != Asperger's, although we know that Asperger's Syndrome is often associated with Geekiness.

The reality for people with ASD , however, is they are required to live mostly in a world dominated by complex social interactions which can confuse, frustrate and even baffle them at times. Equally, "normal people", or as the autistic community likes to call them, "neurotypicals" (NT's), have a difficult time understanding the differences of people with autism with respect to their own "normal" behaviour. Fortunately there are a number of interesting as well as possibly clinically valid therapies and approaches to assisting people with ASD cope and understand the social world. And there are some pretty dedicated psychologists, researchers and therapist as well. But there are no magic bullets and no "cures". (I use the word cure very loosely here as some would argue; "what is there to be cured of?").

So how does this all fit with Natal? For those who have not really had anything to do with ASD before it might not be immediately obvious, but as I outline above, one of the core characteristics of ASD is a different/impaired (in a relative sense) understanding of social interaction and communication. Current mainstream therapies revolve around teaching social interactions and conventions using role play and story cards known as Social Stories(tm). And this is where the "Meet Milo" concept created by Lionhead Studios triggered off one of those idea connection moments. Perhaps it could be adapted to be something like a very fluent and natural personal trainer for people with ASD to assist with teaching social interaction and communication, especially children. I think anyone who has had any close experience with ASD and who has watched the Meet Milo video would immediately see the possibilities for a tool along the lines of a social personal training application. Many of these kids have a keen interest and attraction for computers, especially computer games, as it's all about predictability. A computer isn't vague and nuanced, and doesn't suddenly change mood for no apparent reason. But software might just be able to simulate aspects of that.

During the video it is mentioned that Milo exhibits body language and facial expression, which are usually very difficult aspects of human behaviour for people with ASD to pick up on. So it probably wouldn't be too difficult to create some kind of interactivity where a character behaves in a particular way, similar say to how Milo behaved, and where the person interacting is required to respond in the appropriate way. I don't think there would really be a requirement for the software to try and read or gauge the users body language or emotion (like supposedly shown in the video - not convinced about that by the way). But the main thing would be to have the user pick up on the appropriate social communication being displayed by the character and to choose or perform an appropriate response. The idea would be for the user to learn and practice particular social scenarios which in theory can be applied to real life when needed.

I could also envisage a scenario whereby the characters might start out being quite abstract and not very human like and gradually become more realistic as the user become more comfortable and progressed. Or it could even be presented somewhat like an adventure where there are challenges to complete interwoven making the learning experience both more fun as well as subtle.
And just to add my two cents worth...

With the prevalence of ASD now looking to be about 1% and consistent across all age groups and possibly cultures (see CDC Survey and NHS Study), it seems the evidence is currently leaning towards the disorder being an underlying state of the general population. Although a real rise in numbers can't be completely ruled out it is looking increasingly unlikely a real rise is the case, but rather an artefact in better and broader diagnosis .

OK. I've already made a slight detour from my intended aim of this blog, but this idea and topic just seemed too interesting and compelling to not blog about.

Normal transmission shall now resume..

Thursday, March 11, 2010

Setup Moq to return different values on multiple calls

I recently came across a problem, using the Moq mocking framework, in which a few tests, required a method on one of my mocked objects to be called multiple times but return a different value depending on when the call was made. Something you can do with Rhino Mocks. But nevertheless.

The sequence may have looked something like:
Call 1: mockedObject.IsValid() returns true
Call 2: mockedObject.IsValid() returns false

My solution to the problem, which is sort of outlined in the Moq Quick Start was:

public void MultipleReturnValue_Test()
   bool firstCall = true;
   var mockedObject = new Mock<ISomeInterface>();

   mockedObject.Setup(o => o.IsValid()).Returns(() =>
         if (firstCall == true)
            firstCall = false;
            return true;
         return false;

   var sut = new MySystemUnderTest(mockedObject.Object);


Not beautiful, but It works fine. The first call to IsValid() returns true, while any subsequent calls return false.

This raises the question, how much logic in a test is too much?