Part-of-speech tagging with Clojure
One of the classical problems in Natural Language Processing is part-of-speech tagging: given a sentence of length $n$, find the sequence of part-of-speech tags, one for each word in the sentence. For example,

$$\text{POS(“the”,“man”,“walks”) = (“article”,“noun”,“verb”)}$$
There are several approaches to part-of-speech tagging, and I’ve managed to build a fairly simple tagger based on something called Hidden Markov Models. There are several resources explaining HMMs out there already so I won’t try explaining that in detail here.

Basically, what you need to know is that we model the tag sequence as a first-order Markov process. This means that the likelihood of a tag occurring is determined by what the previous tag was. So we’re going to have to obtain a map that looks like this:
```clojure
{:noun    {:verb 0.7 :article 0.1 :noun 0.2}
 :article {:noun 0.8 :verb 0.1 :article 0.1}
 :verb    {:noun 0.4 :verb 0 :article 0.6}}
```
The probability of a verb following a noun is 0.7, an article following a noun is 0.1, and so on. These are called the transition probabilities.
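With such a map in hand, looking up a transition probability is just nested map access. A quick sketch using the toy numbers above (the keywordized tags are purely illustrative):

```clojure
;; toy transition probabilities, copied from the example above
(def transitions {:noun    {:verb 0.7 :article 0.1 :noun 0.2}
                  :article {:noun 0.8 :verb 0.1 :article 0.1}
                  :verb    {:noun 0.4 :verb 0 :article 0.6}})

;; probability of a verb following a noun
(get-in transitions [:noun :verb])
;; => 0.7
```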
The next thing we need to obtain is called the emission probabilities, which is the probability of a tag emitting a word. When represented in a Clojure map, this looks something like:
```clojure
{:noun {"man" 0.04 "samrat" 0.001 ...}
 :verb {"walks" 0.002}
 ...} ;; and so on
```
The Brown tagged corpus
To obtain these two maps, we will use the Brown corpus which is a pretty substantial collection of sentences tagged painstakingly by humans way back in the 60s. A sentence in the Brown corpus looks like this:
The/at three/cd men/nns stepped/vbd out/rp to/in the/at side/nn to/to wait/vb for/in Captain/nn-tl Clemens'/np$ signal/nn ./.
Here, “The” is tagged with “at”, “three” with “cd”, and so on. To understand what these tags mean, you could open up this reference but that is unnecessary for now.
Instead of dealing with the raw Brown corpus, we will use a file containing the processed information we need. The contents of this file looks like this:
```
2 WORDTAG nns 1880s
9 WORDTAG in underneath
1 WORDTAG vbn agitated
...
13510 1-GRAM cd
3282 1-GRAM bed
49 1-GRAM rb-hl
...
265 2-GRAM vb wrb
20 2-GRAM jj-tl jj
...
5 3-GRAM in jj-tl jj
6 3-GRAM cc rbr vbn
1 3-GRAM beg pp$ nn
```
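Each line is a whitespace-separated record: a count, a record type (`WORDTAG` or an n-GRAM), and then the tag/word pair or the n-gram itself. Splitting one line shows the shape the parser below will see; a minimal sketch:

```clojure
(require '[clojure.string :as str])

;; "underneath" was tagged in (preposition) 9 times in the corpus
(str/split "9 WORDTAG in underneath" #"\s")
;; => ["9" "WORDTAG" "in" "underneath"]
```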
The code used to obtain these counts is here. I won’t discuss that here either. You can find the counts file here.
The code
The namespace declaration:
```clojure
(ns pos-tagger.hmm
  (:require [clojure.string :as str]
            [clojure.core.match :refer [match]]))
```
Next is the code to read the brown.counts file mentioned above:
```clojure
(defn read-brown-counts
  [counts-file]
  (with-open [rdr (clojure.java.io/reader counts-file)]
    (reduce (fn [[ngram-counts word-tag-counts] line]
              (let [split-line (str/split line #"\s")]
                (match split-line
                  [c "3-GRAM" t u v]   [(assoc ngram-counts [t u v] (Integer/parseInt c))
                                        word-tag-counts]
                  [c "2-GRAM" t u]     [(assoc ngram-counts [t u] (Integer/parseInt c))
                                        word-tag-counts]
                  [c "1-GRAM" t]       [(assoc ngram-counts [t] (Integer/parseInt c))
                                        word-tag-counts]
                  [c "WORDTAG" t word] [ngram-counts
                                        (assoc-in word-tag-counts [t word] (Integer/parseInt c))])))
            [{} {}]
            (line-seq rdr))))
```
This is simply loading what was in the counts file into memory. This gives us a vector of two maps which we’ll use next:
```clojure
(let [counts (read-brown-counts "brown.counts")]
  (def ngram-count (first counts))
  (def word-tag-count (second counts))
  (def tag-space (conj (keys word-tag-count) "START")))
```
tag-space is the set of all possible tags. The "START" tag is something the code producing brown.counts adds to the beginning of sentences. It is useful to have because it makes it easier to calculate the transition probability of the first tag, $t(\text{START}\Rightarrow{t_1})$, which is really called the initial state probability.
With this we can obtain the transition and emission probabilities we wanted:
```clojure
(defn transition-probability
  [ngram]
  (try (/ (ngram-count ngram)
          (ngram-count (butlast ngram)))
       (catch Exception _ 0)))

(defn emission-probability
  "Probability of tag emitting word."
  [[tag word]]
  (try (/ (get-in word-tag-count [tag word])
          (apply + (vals (word-tag-count tag))))
       (catch Exception _ 0)))
```
Now, using these two functions it is possible to calculate the probability of any tag sequence $t_1, t_2, \ldots, t_n$ for a given sentence $x_1, x_2, \ldots, x_n$. For a two-word sentence:

$$P(t_1,t_2\vert{x_1,x_2}) = t(\text{START}\Rightarrow{t_1})\, e({t_1}\Rightarrow{x_1})\, t({t_1}\Rightarrow{t_2})\, e({t_2}\Rightarrow{x_2})$$
However, what we want to find is the most likely tag sequence, which is simply the tag sequence for which $P(t_1,t_2,\ldots,t_n\vert{x_1,x_2,\ldots,x_n})$ is the maximum. Now, given a sentence, an index (posn) and the tag at posn, the following function gives the probability of the most likely tag sequence up to posn:
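In symbols, if $\pi(k, t)$ denotes the probability of the most likely tag sequence ending with tag $t$ at position $k$, this is the standard Viterbi recurrence (written here with the same $t(\cdot)$ and $e(\cdot)$ notation as above):

$$\pi(1, t) = e(t\Rightarrow{x_1})\, t(\text{START}\Rightarrow{t})$$

$$\pi(k, t) = e(t\Rightarrow{x_k})\, \max_{t'}\big[\, t(t'\Rightarrow{t})\, \pi(k-1, t') \,\big]$$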
```clojure
(def viterbi
  "Returns the probability of the most likely tag sequence up to posn."
  (memoize
   (fn [sentence tag posn]
     (if (= posn 1)
       (* (emission-probability [tag (first sentence)])
          (transition-probability ["START" tag]))
       (* (emission-probability [tag (nth sentence (dec posn))])
          (apply max
                 (map (fn [prev-tag]
                        (* (transition-probability [prev-tag tag])
                           (viterbi sentence prev-tag (dec posn))))
                      tag-space)))))))
```
The function is memoized to avoid redoing the repeated computations that arise from the overlapping recursive calls.
To find the most likely tag for a posn, we find the $\arg\max$ of viterbi:
```clojure
(defn viterbi-tag
  "Return the most likely tag for word at posn."
  [sentence posn]
  (apply max-key
         (fn [tag] (viterbi sentence tag posn))
         tag-space))
```
Finally, to get the tag sequence:
```clojure
(defn tag-sequence
  [sentence]
  (map (partial viterbi-tag sentence)
       (range 1 (inc (count sentence)))))
```
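To see the whole pipeline work end to end without the real corpus, here is a self-contained sketch with made-up counts for three Brown tags ("at" = article, "nn" = noun, "vbz" = 3rd-person verb). The numbers are invented purely for illustration; with the real brown.counts maps loaded, the definitions above are used unchanged:

```clojure
;; toy counts, invented for illustration; normally these come from brown.counts
(def ngram-count
  {["START"] 10 ["at"] 10 ["nn"] 10 ["vbz"] 10
   ["START" "at"] 8 ["START" "nn"] 2
   ["at" "nn"] 9 ["at" "vbz"] 1
   ["nn" "vbz"] 7 ["nn" "nn"] 2 ["nn" "at"] 1
   ["vbz" "at"] 5 ["vbz" "nn"] 5})

(def word-tag-count
  {"at"  {"the" 10}
   "nn"  {"man" 6 "walks" 1}
   "vbz" {"walks" 5}})

(def tag-space (conj (keys word-tag-count) "START"))

(defn transition-probability [ngram]
  (try (/ (ngram-count ngram) (ngram-count (butlast ngram)))
       (catch Exception _ 0)))

(defn emission-probability [[tag word]]
  (try (/ (get-in word-tag-count [tag word])
          (apply + (vals (word-tag-count tag))))
       (catch Exception _ 0)))

(def viterbi
  (memoize
   (fn [sentence tag posn]
     (if (= posn 1)
       (* (emission-probability [tag (first sentence)])
          (transition-probability ["START" tag]))
       (* (emission-probability [tag (nth sentence (dec posn))])
          (apply max (map (fn [prev-tag]
                            (* (transition-probability [prev-tag tag])
                               (viterbi sentence prev-tag (dec posn))))
                          tag-space)))))))

(defn viterbi-tag [sentence posn]
  (apply max-key (fn [tag] (viterbi sentence tag posn)) tag-space))

(defn tag-sequence [sentence]
  (map (partial viterbi-tag sentence)
       (range 1 (inc (count sentence)))))

(tag-sequence ["the" "man" "walks"])
;; => ("at" "nn" "vbz")
```

Note how the ambiguous word "walks" (seen as both "nn" and "vbz" in the toy counts) is resolved to "vbz" because the noun-to-verb transition dominates.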