This part will deal with the functions needed to calculate probabilities and finally classify a document into spam or ham.
There is essentially one 'utility' function required, called countFrom, which takes a corpus list and a word, in the form of a simple string, as arguments. It attempts to match the word in in the corpus list and returns the integer part of the tuple, otherwise it returns 0.
let countFrom corpus word = let result = corpus |> List.tryFind (fun (w, c) -> w = word) match result with | None -> 0 | Some(v) -> snd v
The next 3 functions are the probability functions required to determine what the classification of the text is.
The prCategory first calculates the rank probability of the category a word belongs to. This comes directly from the Gary Robinson article - "A Statistical Approach to the Spam Problem", so there is no need to go into detail about it here.
let prCategory (count0, total0) (count1, total1) = let pword = (count0 / total0) / ((count0 / total0) + (count1 / total1)) let ctotal = count0 + count1 ((0.5 + (ctotal * pword) ) / (1.0 + ctotal)) let prWord fstCorpus sndCorpus word = let fstCount = double (word |> countFrom fstCorpus) let sndCount = double (word |> countFrom sndCorpus) let sum corpus = double (corpus |> List.map (fun (_, c:int) -> c) |> List.sum) let pr = prCategory (fstCount, sum fstCorpus) (sndCount, sum sndCorpus) if Double.IsNaN(pr) then None else Some(pr) let prBayesian prOptionList = let pList = prOptionList |> List.choose (fun v -> v) // removes all None's from the list let prod = pList |> List.reduce ( * ) let inv = pList |> List.map (fun v -> 1.0 - v) |> List.reduce ( * ) prod / (prod + inv)
The power of these functions is in the use of the List map and reduce functions.
classifying (at last)..
The last two functions are what actually classify a text. The first function, parse, is just a very simple parser used to return a list of words from a text. Typically this function might also include additional features such as a stemming algorithm or frequent/common word filter. For this example it applies a basic regular expression, converts to uppercase and returns a list of word-frequency tuples with the frequency set to 1.
The second function, classifyText, takes two corpus lists and a parsed text. It then calculates the probability that the text belongs to the first of the two corpus relative to the second and returns the result as a value between 0 and 1.
let parse src = // could add additional data cleansing functionality here Regex.Replace(src, @"[^\w\'@-]", " ").Split(' ') |> Array.map (fun w -> w.ToUpper()) |> Array.filter (fun w -> w <> System.String.Empty) |> Set.ofArray |> Set.map (fun w -> (w, 1)) |> Set.toList let classifyText fstCorpus sndCorpus text = prBayesian (text |> List.map (fun (w, _) -> prWord fstCorpus sndCorpus
using it..
// Setup the spam and ham corpora let spamCorpusSample = [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)] let hamCorpusSample = [("NAME", 16); ("DATE", 8); ("REGARDS", 27)] // Classify as Spam let unknownText = parse "Enter the prize draw for free. A limited offer. Enter your name and address." classifyText spamCorpusSample hamCorpusSample unknownText
The fsi output
> val spamCorpusSample : (string * int) list = [("FREE", 10); ("SELL", 5); ("OFFER", 24); ("NAME", 4)] > val hamCorpusSample : (string * int) list = [("NAME", 16); ("DATE", 8); ("REGARDS", 27)] > val unknownText : (string * int) list = [("A", 1); ("ADDRESS", 1); ("AND", 1); ("DRAW", 1); ("ENTER", 1); ("FOR", 1); ("FREE", 1); ("LIMITED", 1); ("NAME", 1); ("OFFER", 1); ("PRIZE", 1); ("THE", 1); ("YOUR", 1)] > val it : float = 0.9969589807 >
the code..
The full F# script is available at https://gist.github.com/987997
final thoughts..
So far I really like where F# and functional programming is taking me. The more I learn about it the more the concepts from functional programming influence my approach development, and the more I want to keep learning and using functional languages.