Which simplifying assumptions are made by the Naive Bayes model? Why can these assumptions result in less accurate classifiers compared to other learning algorithms?
The Naive Bayes model assumes that the features $x_i$ are conditionally independent given the label $y$, which is what justifies the third “=” in the formula below.
$$ \hat{y} = \arg\max_y P(y)\,P(x \mid y) = \arg\max_y P(y)\,P(x_1, x_2, \dots, x_n \mid y) \\= \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y) $$
However, in general $P(x_1, x_2, \dots, x_n \mid y)$ can differ substantially from $\prod_{i=1}^{n} P(x_i \mid y)$, so this assumption is often unrealistic. For example, the words ‘Barack’ and ‘Obama’ tend to co-occur, so they are not independent features. This is why Naive Bayes can produce less accurate classifiers than learning algorithms that do not rely on the conditional independence assumption.
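As a minimal sketch of the decision rule above (not part of the original answer), here is the factorized prediction $\hat{y} = \arg\max_y P(y)\prod_i P(x_i \mid y)$ in Python for binary features; the priors and conditional probability tables are toy values chosen for illustration.

```python
import numpy as np

# Hypothetical toy parameters: P(y) and P(x_i = 1 | y) for 3 binary features.
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": np.array([0.8, 0.7, 0.1]),
    "ham":  np.array([0.2, 0.3, 0.6]),
}

def predict(x):
    """Return the label maximizing P(y) * prod_i P(x_i | y)."""
    x = np.asarray(x)
    scores = {}
    for y, p in cond.items():
        # Naive Bayes factorization: prod_i P(x_i | y)
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p))
        scores[y] = priors[y] * likelihood
    return max(scores, key=scores.get)

print(predict([1, 1, 0]))  # -> "spam" with these toy numbers
```

Note that the product treats each feature independently given $y$; if two features (like ‘Barack’ and ‘Obama’) are strongly correlated, their evidence is effectively double-counted, which is exactly the weakness described above.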
Consider the two sentences