
Real-life applications of mathematics

$$ \int { } $$

I actually used advanced maths once in my life. I dropped my keys in the toilet, so I had to use an integral-shaped wire to get them out.

I need your attention

An Intuitive Explanation of

Bayes' Theorem

Bayes' Theorem

An Intuitive Explanation

“Ye shall know them by their fruits. Do men gather grapes of thorns, or figs of thistles?”

Matthew 7:16



[Venn diagram: event A inside the universe U]

$$P\left( A \right) =\frac { \left| A \right| }{ \left| U \right| }$$ $$ P\left( \neg A \right) =1-P\left( A \right) $$

[Venn diagram: event B inside the universe U]

$$P\left( B \right) =\frac { \left| B \right| }{ \left| U \right| }$$ $$ P\left( \neg B \right) =1-P\left( B \right) $$

[Venn diagram: events A and B inside U, overlapping in AB]

$$P\left( AB \right) =\frac { \left| AB \right| }{ \left| U \right| }$$

[Venn diagram: A, B and their intersection AB]

$$P\left( { A }|{ B } \right) =\frac { \left| AB \right| }{ \left| B \right| } =\frac { \frac { \left| AB \right| }{ \left| U \right| } }{ \frac { \left| B \right| }{ \left| U \right| } } =\frac { P\left( AB \right) }{ P\left( B \right) }$$ $$P\left( { B }|{ A } \right) =\frac { \left| AB \right| }{ \left| A \right| } =\frac { \frac { \left| AB \right| }{ \left| U \right| } }{ \frac { \left| A \right| }{ \left| U \right| } } =\frac { P\left( AB \right) }{ P\left( A \right) }$$

Independence

A,B independent:

$$ P\left( { A }|{ B } \right) =P\left( A \right), \quad P\left( { B }|A \right) =P\left( B \right)$$

$$ P\left( { A }|{ B } \right) =\frac { P\left( AB \right) }{ P\left( B \right) } =P\left( A \right) $$

$$ P\left( AB \right) =P\left( A \right) \cdot P\left( B \right) $$

Example of Independence

[Venn diagram: P(A only) = 0.3, P(AB) = 0.3, P(B only) = 0.2, P(neither) = 0.2]

$$P\left( A \right) =0.6,\quad P\left( B \right) =0.5\\ P(AB)=0.6\cdot 0.5=0.3$$

Now we have everything we need to derive Bayes’ theorem $$ P\left( { B }|{ A } \right) =\frac { P\left( AB \right) }{ P\left( A \right) } \quad \Rightarrow \quad P\left( AB \right) =P\left( { B }|{ A } \right) \cdot P\left( A \right) $$ $$ P\left( { A }|{ B } \right) =\frac { P\left( AB \right) }{ P\left( B \right) } \quad \Rightarrow \quad P\left( AB \right) =P\left( { A }|{ B } \right) \cdot P\left( B \right) $$ $$ P\left( { A }|{ B } \right) \cdot P\left( B \right) =P\left( { B }|{ A } \right) \cdot P\left( A \right) $$ $$ \boxed { P\left( { A }|{ B } \right) =\frac { P\left( { B }|{ A } \right) \cdot P\left( A \right) }{ P\left( B \right) } } $$

Bayes Rule

$$\underbrace { P\left( { A }|{ B } \right) }_{ posterior } =\frac { \overbrace { P\left( { B }|{ A } \right) }^{ likelihood } \cdot \overbrace { P\left( A \right) }^{ prior } }{ \underbrace { P\left( B \right) }_{ marginal\quad likelihood } } $$ $$ P\left( { \neg A }|{ B } \right) =\frac { P\left( { { B } }|\neg A \right) \cdot P\left( \neg A \right) }{ P\left( B \right) } $$ $$ P\left( { A }|{ B } \right)+P\left( { \neg A }|{ B } \right)=1 $$ $$ P\left( { B } \right) ={ P\left( { B }|{ A } \right) \cdot P\left( A \right) +P\left( { B }|{ \neg A } \right) \cdot P\left( \neg A \right) } $$

Example 1

Here's a story problem about a situation that doctors often encounter:

“1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?”

A = women with cancer, B = women with a positive mammogram

Solution 1


$P\left( B \right) =P\left( { B }|{ A } \right) \cdot P\left( A \right) +P\left( { B }|{ \neg A } \right) \cdot P\left( \neg A \right)$ $=P\left( { B }|{ A } \right) \cdot P\left( A \right) +P\left( { B }|{ \neg A } \right) \cdot \left( 1-P\left( A \right) \right)$ $=0.8\cdot 0.01+0.096\cdot \left( 1-0.01 \right) $ $=0.10304$


$P\left( { A }|{ B } \right) =\frac { P\left( { B }|{ A } \right) \cdot P\left( A \right) }{ P\left( B \right) } =\frac { 0.8\cdot 0.01 }{ 0.10304 } \approx 0.07763975$

The chance of actually having breast cancer given a positive mammogram is 7.76%.
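As a sanity check, here is a minimal Python sketch of the same calculation (the variable names are mine; the numbers come from the problem statement):

```python
# Example 1 as plain arithmetic.
p_cancer = 0.01              # P(A): prior probability of breast cancer
p_pos_given_cancer = 0.8     # P(B|A): positive mammogram given cancer
p_pos_given_healthy = 0.096  # P(B|not A): positive mammogram without cancer

# Marginal likelihood P(B) by the law of total probability
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' theorem: P(A|B)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(p_pos)                 # ≈ 0.10304
print(p_cancer_given_pos)    # ≈ 0.0776
```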

Example 2

We have two coins, one fair with ${ P }_{ 1 }\left( heads \right) =0.5$ and one loaded with ${ P }_{ 2 }\left( heads \right) =1$. We pick one of the two at random (each with probability 0.5) and flip it. If it comes up heads, what is the probability that we picked the loaded coin? And if it comes up heads twice in a row?

Solution 2.1

$P\left( { loaded }|{ heads } \right) =\frac { P\left( { heads }|{ loaded } \right) \cdot P\left( loaded \right) }{ P\left( heads \right) } =$

$=\frac { 1\cdot 0.5 }{ 0.75 } \approx 0.6667$


$P(heads)=P\left( { heads }|{ loaded } \right) \cdot P\left( loaded \right) +$ $+P\left( { heads }|fair \right) \cdot P\left( fair \right) =$ $=1\cdot 0.5+0.5\cdot 0.5=0.75$

Solution 2.2

$P\left( { loaded }|{ { heads }_{ 1 }{ heads }_{ 2 } } \right) =$ $=\frac { P\left( { { { heads }_{ 1 }{ heads }_{ 2 } } }|{ loaded } \right) \cdot P\left( loaded \right) }{ P\left( { { heads }_{ 1 }{ heads }_{ 2 } } \right) } =$ $=\frac { P\left( { { heads }_{ 1 } }|{ loaded } \right) \cdot P\left( { { heads }_{ 2 } }|{ loaded } \right) \cdot P\left( loaded \right) }{ P\left( { { heads }_{ 1 }{ heads }_{ 2 } }|{ loaded } \right) \cdot P\left( loaded \right) +P\left( { { heads }_{ 1 }{ heads }_{ 2 } }|{ fair } \right) \cdot P\left( fair \right) } =$ $=\frac { 1\cdot 1\cdot 0.5 }{ 1\cdot 0.5+0.25\cdot 0.5 } =\frac { 0.5 }{ 0.625 } =0.8$

Note that the two flips are conditionally independent given the coin, so the likelihood factorizes; but they are not independent marginally, so the denominator $P\left( { heads }_{ 1 }{ heads }_{ 2 } \right)$ has to be expanded by total probability rather than as $P\left( { heads }_{ 1 } \right) \cdot P\left( { heads }_{ 2 } \right)$.
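The same answer can be reached by updating the posterior one flip at a time, using the previous posterior as the new prior. A minimal sketch (variable names are mine):

```python
# Sequential Bayesian update for the two-coin example.
p_loaded = 0.5                            # prior P(loaded)
p_heads = {"loaded": 1.0, "fair": 0.5}    # P(heads | coin)

for flip in (1, 2):                       # we observe heads twice
    numer = p_heads["loaded"] * p_loaded
    denom = numer + p_heads["fair"] * (1 - p_loaded)
    p_loaded = numer / denom              # posterior becomes the new prior
    print(flip, p_loaded)                 # flip 1 -> ≈ 0.667, flip 2 -> ≈ 0.8
```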

Text classification

using Naive Bayes

Document: $ d=\left< { { w }_{ 1 },{ w }_{ 2 },\dots ,{ w }_{ { n }_{ d } } } \right> $

Document space: $ d\in D $

Categories: $ C=\left\{ { c }_{ 1 },{ c }_{ 2 },\dots ,{ c }_{ j } \right\} $

Classification function: $$\gamma : D\rightarrow C$$

MAP: maximum a posteriori

$$ { c }_{ map }=\underset { c\in C }{ argmax } \quad P\left( { { c } }|{ d } \right) $$

The direct approach

$$ P\left( { { c } }|{ d } \right) =\frac { P\left( { d }|{ c } \right) \cdot P\left( c \right) }{ P\left( d \right) } =\frac { P\left( { { w }_{ 1 },{ w }_{ 2 },\dots ,{ w }_{ { n }_{ d } } }|{ c } \right) \cdot P\left( c \right) }{ P\left( { w }_{ 1 },{ w }_{ 2 },\dots ,{ w }_{ { n }_{ d } } \right) } =\\ =\frac { P\left( { { w }_{ 1 } }|{ c } \right) \cdot P\left( { { w }_{ 2 } }|{ c,{ w }_{ 1 } } \right) \cdot \dots \cdot P\left( { { w }_{ { n }_{ d } } }|{ c,{ w }_{ 1 },{ w }_{ 2 },\dots ,{ w }_{ { n }_{ d }-1 } } \right) \cdot P\left( c \right) }{ P\left( { { w }_{ 1 } } \right) \cdot P\left( { { w }_{ 2 } }|{ { w }_{ 1 } } \right) \cdot \dots \cdot P\left( { { w }_{ { n }_{ d } } }|{ { w }_{ 1 },{ w }_{ 2 },\dots ,{ w }_{ { n }_{ d }-1 } } \right) } $$

$$ P\left( { e }_{ 1 },e_{ 2 },\dots ,{ e }_{ N } \right) =P\left( { { e }_{ 1 } } \right) \cdot P\left( { { e }_{ 2 } }|{ { e }_{ 1 } } \right) \cdot \dots \cdot P\left( { e_{ N } }|{ e_{ 1 },{ e }_{ 2 },\dots ,{ e }_{ N-1 } } \right) $$

Normalization constant

$$ P\left( { { c } }|{ d } \right) =\frac { P\left( { d }|{ c } \right) \cdot P\left( c \right) }{ P\left( d \right) } =\alpha \cdot P\left( { d }|{ c } \right) \cdot P\left( c \right) \\ P\left( { { c } }|{ d } \right) \propto P\left( { d }|{ c } \right) \cdot P\left( c \right) $$

Conditional independence assumption

$$ P\left( { { w }_{ 1 },{ w }_{ 2 },\dots ,{ w }_{ { n }_{ d } } }|{ c } \right) =\\ =P\left( { { w }_{ 1 } }|{ c } \right) \cdot P\left( { { w }_{ 2 } }|{ c } \right) \cdot \dots \cdot P\left( { { w }_{ { n }_{ d } } }|{ c } \right) =\\ =\prod _{ i=1 }^{ { n }_{ d } }{ P\left( { { w }_{ i } }|{ c } \right) } $$

Naive Bayes

$$ P\left( { { c } }|{ d } \right) \propto P\left( { d }|{ c } \right) \cdot P\left( c \right) =\\ =P\left( { { w }_{ 1 },{ w }_{ 2 },\dots ,{ w }_{ { n }_{ d } } }|{ c } \right) \cdot P\left( c \right) =\\ =P\left( c \right) \cdot \prod _{ i=1 }^{ { n }_{ d } }{ P\left( { { w }_{ i } }|{ c } \right) } $$

$$ P\left( { { c } }|{ d } \right) \propto P\left( c \right) \cdot \prod _{ i=1 }^{ { n }_{ d } }{ P\left( { { w }_{ i } }|{ c } \right) } \\ { c }_{ map }=\underset { c\in C }{ argmax } \quad P\left( c \right) \cdot \prod _{ i=1 }^{ { n }_{ d } }{ P\left( { { w }_{ i } }|{ c } \right) } $$

Example 3

SPAM                HAM
offer is secret     play sport today
click secret link   went play sport
secret sport link   secret sport event
                    sport is today
                    sport costs money

$$ P\left( spam \right) =\frac { 3 }{ 3+5 } =\frac { 3 }{ 8 } ,\quad P\left( ham \right) =\frac { 5 }{ 3+5 } =\frac { 5 }{ 8 } $$

$$ P\left( { secret }|{ spam } \right) =\frac { 3 }{ 9 } =\frac { 1 }{ 3 } ,\quad P\left( { secret }|{ ham } \right) =\frac { 1 }{ 15 } $$

$$ P\left( { sport }|{ spam } \right) =\frac { 1 }{ 9 } ,\quad P\left( { sport }|{ ham } \right) =\frac { 5 }{ 15 } =\frac { 1 }{ 3 } $$

Example 3.1

$$ P\left( { spam }|secret,sport \right) \propto \\ \propto P\left( { secret }|spam \right) \cdot P\left( sport|spam \right) \cdot P\left( spam \right) $$

$$ P\left( { secret }|spam \right) \cdot P\left( sport|spam \right) \cdot P\left( spam \right) = \\=\frac { 1 }{ 3 } \cdot \frac { 1 }{ 9 } \cdot \frac { 3 }{ 8 } =\frac { 1 }{ 72 } $$

$$ P\left( { secret }|ham \right) \cdot P\left( sport|ham \right) \cdot P\left( ham \right) = \\=\frac { 1 }{ 15 } \cdot \frac { 1 }{ 3 } \cdot \frac { 5 }{ 8 } =\frac { 1 }{ 72 } $$

$$ P\left( spam|secret,sport \right) =\frac { \frac { 1 }{ 72 } }{ \frac { 1 }{ 72 } +\frac { 1 }{ 72 } } =\frac { 1 }{ 2 } $$
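The whole example can be reproduced by counting words in the two classes. A minimal sketch, assuming the eight training messages above (the helper `score` is mine and uses unsmoothed maximum-likelihood estimates):

```python
from collections import Counter

spam = ["offer is secret", "click secret link", "secret sport link"]
ham = ["play sport today", "went play sport", "secret sport event",
       "sport is today", "sport costs money"]

spam_words = Counter(" ".join(spam).split())   # 9 tokens
ham_words = Counter(" ".join(ham).split())     # 15 tokens

p_spam = len(spam) / (len(spam) + len(ham))    # 3/8
p_ham = len(ham) / (len(spam) + len(ham))      # 5/8

def score(message, word_counts, prior):
    """Unnormalized P(c) * prod_i P(w_i | c), maximum-likelihood estimates."""
    total = sum(word_counts.values())
    p = prior
    for w in message.split():
        p *= word_counts[w] / total
    return p

s = score("secret sport", spam_words, p_spam)  # 1/72
h = score("secret sport", ham_words, p_ham)    # 1/72
print(s / (s + h))                             # ≈ 0.5
```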

Houston, we have a problem

$$ P\left( { spam }|today,is,secret \right) =? $$

$$ P\left( today|spam \right) \cdot P\left( is|spam \right) \cdot P\left( secret|spam \right) \cdot P\left( spam \right) =\\ =\frac { 0 }{ 9 } \cdot \frac { 1 }{ 9 } \cdot \frac { 3 }{ 9 } \cdot \frac { 3 }{ 8 } =0 $$

$$ P\left( today|ham \right) \cdot P\left( is|ham \right) \cdot P\left( secret|ham \right) \cdot P\left( ham \right) =\\ =\frac { 2 }{ 15 } \cdot \frac { 1 }{ 15 } \cdot \frac { 1 }{ 15 } \cdot \frac { 5 }{ 8 } =\frac { 1 }{ 2700 } $$

$$ P\left( spam|today,is,secret \right) =\frac { 0 }{ 0 +\frac { 1 }{ 2700 } } = {0} $$

Use the Laplace Smoothing, Luke

$ LS\left( k \right) $:

$$ P\left( x \right) =\frac { { N }_{ x }+k }{ N+k\cdot \left| X \right| } $$
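As a sketch, the smoothed estimate can be wrapped in a small helper (the function name is mine):

```python
def laplace(count_x, total, vocab_size, k=1):
    """Laplace-smoothed estimate (N_x + k) / (N + k * |X|)."""
    return (count_x + k) / (total + k * vocab_size)

# P(secret | spam) with k=1: 3 occurrences, 9 spam tokens, 12 distinct words
print(laplace(3, 9, 12))   # 4/21 ≈ 0.19
```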

LS(1)

SPAM                HAM
offer is secret     play sport today
click secret link   went play sport
secret sport link   secret sport event
                    sport is today
                    sport costs money

$$ P\left( spam \right) =\frac { 3+1 }{ 8+2 } =\frac { 4 }{ 10 } ,\quad P\left( ham \right) =\frac { 5+1 }{ 8+2 } =\frac { 6 }{ 10 } $$

$$ P\left( { secret }|{ spam } \right) =\frac { 3 + 1 }{ 9 + 12 } =\frac { 4 }{ 21 } ,\quad P\left( { secret }|{ ham } \right) =\frac { 1 + 1 }{ 15 + 12 } =\frac { 2 }{ 27 }$$

$$ P\left( { sport }|{ spam } \right) =\frac { 1 + 1 }{ 9 + 12} = \frac { 2 }{ 21 },\quad P\left( { sport }|{ ham } \right) =\frac { 5 + 1 }{ 15 + 12 } =\frac { 2 }{ 9 } $$

Houston, we have no problem

$$ P\left( { spam }|today,is,secret \right) =? $$

$$ P\left( today|spam \right) \cdot P\left( is|spam \right) \cdot P\left( secret|spam \right) \cdot P\left( spam \right) =\\ =\frac { 1 }{ 21 } \cdot \frac { 2 }{ 21 } \cdot \frac { 4 }{ 21 } \cdot \frac { 2 }{ 5 } = \frac { 16 }{ 46305 } $$

$$ P\left( today|ham \right) \cdot P\left( is|ham \right) \cdot P\left( secret|ham \right) \cdot P\left( ham \right) =\\ =\frac { 1 }{ 9 } \cdot \frac { 2 }{ 27 } \cdot \frac { 2 }{ 27 } \cdot \frac { 3 }{ 5 } =\frac { 4 }{ 10935 } $$

$$ P\left( spam|today,is,secret \right) =\frac { \frac { 16 }{ 46305 } }{ \frac { 16 }{ 46305 } +\frac { 4 }{ 10935 } } = \frac { 324 }{ 667 } $$
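Checking the arithmetic with exact fractions (a small sketch that reuses the smoothed estimates above):

```python
from fractions import Fraction as F

# Smoothed estimates from the slides above (k = 1, |vocabulary| = 12)
spam_score = F(1, 21) * F(2, 21) * F(4, 21) * F(2, 5)   # 16/46305
ham_score = F(1, 9) * F(2, 27) * F(2, 27) * F(3, 5)     # 4/10935

posterior = spam_score / (spam_score + ham_score)
print(posterior, float(posterior))   # 324/667 ≈ 0.486
```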

Demo #01

Solving the problem of floating point underflow

$$ { c }_{ map }=\underset { c\in C }{ argmax } \quad P\left( c \right) \cdot \prod _{ i=1 }^{ { n }_{ d } }{ P\left( { { w }_{ i } }|{ c } \right) } $$

It's better to perform the computation by adding logarithms of probabilities instead of multiplying probabilities.

$$ \log { \left( x\cdot y \right) =\log { \left( x \right) } } +\log { \left( y \right) } $$

$$ { c }_{ map }=\underset { c\in C }{ argmax } \quad \left[ \log { \left( P\left( c \right) \right) +\sum _{ i=1 }^{ { n }_{ d } }{ \log { \left( P\left( { { w }_{ i } }|{ c } \right) \right) } } } \right] $$
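A short sketch of how the log-space score looks in code, and of the underflow it avoids (the function and the toy numbers are mine):

```python
import math

def log_score(log_prior, word_log_probs, words):
    """log P(c) + sum_i log P(w_i | c) -- the quantity maximized above."""
    return log_prior + sum(word_log_probs[w] for w in words)

# The underflow the logs avoid: a long product of small likelihoods
p = 1.0
for _ in range(400):
    p *= 1e-3
print(p)                       # 0.0 -- underflowed to zero
print(400 * math.log(1e-3))    # ≈ -2763.1 -- still perfectly representable
```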

Improving text classification

using feature selection

Confusion matrix
for a binary classifier

                              Actual positive (TP+FN)    Actual negative (FP+TN)
Predicted positive (TP+FP)    True positive (TP)         False positive (FP)
Predicted negative (FN+TN)    False negative (FN)        True negative (TN)

Precision

$$ Precision=P\left( { actual\quad positive }|{ predicted\quad positive } \right) $$ $$ =\frac { TP }{ TP+FP } $$

Recall/Sensitivity

$$ Recall=P\left( { predicted\quad positive }|{ actual\quad positive } \right) $$ $$ =\frac { TP }{ TP+FN } $$

Precision vs. Recall

F-measure

$$ F=\frac { 1 }{ \alpha \frac { 1 }{ P } +(1-\alpha )\frac { 1 }{ R } } =\frac { ({ \beta }^{ 2 }+1)PR }{ { \beta }^{ 2 }P+R } $$

where

$ \quad { \beta }^{ 2 }=\frac { 1-\alpha }{ \alpha } $

$ \quad \alpha \in [0,1] $ and thus $ \beta \in [0,\infty ] $

Balanced F-measure

$$ { F }_{ 1 }={ F }_{ \beta =1 }=\frac { 2PR }{ P+R } $$
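These three quantities follow directly from the confusion-matrix counts; a small sketch (the counts in the example are made up):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and balanced F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 80 true positives, 20 false positives, 40 false negatives
print(precision_recall_f1(80, 20, 40))   # ≈ (0.8, 0.667, 0.727)
```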

Combinatorial Entropy


Multinomial coefficient: $$ W=\frac { N! }{ \prod { { N }_{ i }! } } =\frac { 10! }{ 5!\cdot 3!\cdot 2! } $$

Combinatorial entropy: $${ S }_{ comb }=\frac { \log _{ 2 }{ W } }{ N } =\frac { 1 }{ N } \log _{ 2 }{ \left( \frac { N! }{ \prod { { N }_{ i }! } } \right) } $$

Shannon entropy

$ { S }_{ comb }=\frac { 1 }{ N } \log _{ 2 }{ \left( \frac { N! }{ \prod { { N }_{ i }! } } \right) } =\frac { 1 }{ N } \left( \log _{ 2 }{ \left( N! \right) } -\log _{ 2 }{ \left( \prod { { N }_{ i }! } \right) } \right) $

Stirling's approximation: $$ \ln { \left( N! \right) } =N\ln { \left( N \right) } -N+O\left( \ln { \left( N \right) } \right) \approx N\ln { \left( N \right) } -N $$

$$ { S }_{ comb }\approx -\sum { \left( \frac { { N }_{ i } }{ N } \log _{ 2 }{ \left( \frac { { N }_{ i } }{ N } \right) } \right) } =-\sum { { p }_{ i }\log _{ 2 }{ { p }_{ i } } } $$
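Both entropies can be computed directly for the 10-symbol example above (counts 5, 3, 2); a quick sketch:

```python
import math

counts = [5, 3, 2]
N = sum(counts)

# Combinatorial entropy: log2(W) / N with W = N! / prod(N_i!)
W = math.factorial(N)
for n_i in counts:
    W //= math.factorial(n_i)
s_comb = math.log2(W) / N

# Shannon entropy of the same frequencies p_i = N_i / N
s_shannon = -sum((n / N) * math.log2(n / N) for n in counts)

print(s_comb, s_shannon)   # ≈ 1.13 vs ≈ 1.49; they converge as N grows
```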

Entropy and
Mutual Information

$$ H(X)=-\sum _{ x\in X }^{ }{ p(x)\log _{ 2 }{ p(x) } } $$

The conditional entropy of X given Y is defined as: $$ H\left( { X }|{ Y } \right) =-\sum _{ y\in Y }^{ }{ p(y) } \sum _{ x\in X }^{ }{ p\left( { x }|{ y } \right) \log _{ 2 }{ p\left( { x }|{ y } \right) } } $$

Mutual Information between X and Y: $$ I(X;Y)=H(X)-H\left( { X }|{ Y } \right) $$

$$ =\sum _{ x\in X }^{ }{ \sum _{ y\in Y }^{ }{ p(xy)\log _{ 2 }{ \frac { p(xy) }{ p(x)p(y) } } } } $$

Mutual Information of
term and class

$$ I(U,C)=\sum _{ { e }_{ t }\in \left\{ 1,0 \right\} }^{ }{ \sum _{ { e }_{ c }\in \left\{ 1,0 \right\} }^{ }{ p(U={ e }_{ t },C={ e }_{ c })\log _{ 2 }{ \frac { p(U={ e }_{ t },C={ e }_{ c }) }{ p(U={ e }_{ t })p(C={ e }_{ c }) } } } } $$

$ { e }_{ t }=1 $ - the document contains term t

$ { e }_{ t }=0 $ - the document does not contain t

$ { e }_{ c }=1 $ - the document is in class c

$ { e }_{ c }=0 $ - the document is not in class c
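In terms of document counts, the sum above has four terms, one per combination of $ { e }_{ t } $ and $ { e }_{ c } $. A sketch (the function and the example counts are mine, with maximum-likelihood estimates for the probabilities):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) from document counts: n11 = docs containing the term and in the
    class, n10 = containing the term but outside the class, and so on."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),   # e_t=1, e_c=1
        (n10, n11 + n10, n10 + n00),   # e_t=1, e_c=0
        (n01, n01 + n00, n11 + n01),   # e_t=0, e_c=1
        (n00, n01 + n00, n10 + n00),   # e_t=0, e_c=0
    ]:
        if n_tc:                       # treat 0 * log 0 as 0
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# Made-up counts: the term occurs in 60 of 100 in-class documents
# and in 20 of 900 out-of-class documents.
print(mutual_information(60, 20, 40, 880))   # ≈ 0.167 bits
```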

Features with high mutual information scores for six Reuters-RCV1 classes

Effect of feature set size on accuracy for multinomial and Bernoulli models

Spelling correction

using Bayes Theorem*

* P. Norvig: How to Write a Spelling Corrector

We are trying to find the correction c, out of all possible corrections, that maximizes the probability of c given the original word w:

$$ { argmax }_{ c }\quad P\left( { c }|{ w } \right) $$

By Bayes' Theorem:

$$ { argmax }_{ c }\quad P\left( { c }|{ w } \right) =\\ ={ argmax }_{ c }\quad P\left( { w }|c \right) \cdot P\left( c \right) $$

The possible corrections c of a given word w



For a word of length $n$, there will be $n$ deletions, $n-1$ transpositions, $26\cdot n$ alterations, and $26\cdot \left( n+1 \right)$ insertions, for a total of $54\cdot n+25$ (of which a few are typically duplicates).
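A sketch of the candidate generator, along the lines of Norvig's article (assuming a lowercase a-z alphabet):

```python
import string

def edits1(word):
    """All strings at edit distance 1 from `word`: deletions, transpositions,
    alterations and insertions over a lowercase a-z alphabet."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    alterations = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + alterations + inserts)

# A 4-letter word yields 54*4 + 25 = 241 candidates before removing duplicates.
print(len(edits1("word")))
```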

The literature on spelling correction claims that 80% to 95% of spelling errors are an edit distance of 1 from the target.

Model assumptions

All known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0.

$$ P\left( { { c }_{ 0 } }|{ w } \right) \gg P\left( { { c }_{ 1 } }|{ w } \right) \gg P\left( { { c }_{ 2 } }|{ w } \right) $$

$$ P\left( { 'word' }|{ 'word' } \right) \gg \\ \gg P\left( { 'world' }|{ 'word' } \right) \gg \\ \gg P\left( 'bored'|{ 'word' } \right) $$

What would Bayes say?

$$ P\left( { { c }_{ 0 } }|{ w } \right) \gg P\left( { { c }_{ 1 } }|{ w } \right) \gg P\left( { { c }_{ 2 } }|{ w } \right) $$

$$ \frac { P\left( { w }|{ { c }_{ 0 } } \right) \cdot P\left( { c }_{ 0 } \right) }{ P\left( w \right) } \gg \frac { P\left( { w }|{ { c }_{ 1 } } \right) \cdot P\left( { c }_{ 1 } \right) }{ P\left( w \right) } \gg \frac { P\left( { w }|{ { c }_{ 2 } } \right) \cdot P\left( { c }_{ 2 } \right) }{ P\left( w \right) } $$

$$ P\left( { w }|{ { c }_{ 0 } } \right) \cdot P\left( { c }_{ 0 } \right) \gg P\left( { w }|{ { c }_{ 1 } } \right) \cdot P\left( { c }_{ 1 } \right) \gg P\left( { w }|{ { c }_{ 2 } } \right) \cdot P\left( { c }_{ 2 } \right) $$

Correction

$$ \underset { { c }_{ i }\in { E }_{ i }\left( w \right) }{ argmax } \quad P\left( { { c }_{ i } }|{ w } \right) =\\ =\underset { { c }_{ i }\in { E }_{ i }\left( w \right) }{ argmax } \quad P\left( { w }|{ c }_{ i } \right) \cdot P\left( { c }_{ i } \right) =\\ =\underset { { c }_{ i }\in { E }_{ i }\left( w \right) }{ argmax } \quad P\left( { c }_{ i } \right) =\\ =\underset { { c }_{ i }\in { E }_{ i }\left( w \right) }{ argmax } \quad { N }_{ { c }_{ i } } $$

${ E }_{ i }\left( w \right)$ - all known words of edit distance $i$
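Putting the pieces together, the selection rule can be sketched as follows. For brevity it only goes down the priority list as far as $ { E }_{ 1 }\left( w \right) $; the tiny word-count corpus is purely illustrative, and `edits1` is a compact form of the generator sketched earlier:

```python
from collections import Counter
import string

def edits1(word):
    """Compact edit-distance-1 generator (see the earlier sketch)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    letters = string.ascii_lowercase
    return set([L + R[1:] for L, R in splits if R] +
               [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
               [L + c + R[1:] for L, R in splits if R for c in letters] +
               [L + c + R for L, R in splits for c in letters])

# WORDS stands in for word counts N_c from a real training corpus.
WORDS = Counter("the quick brown fox jumps over the lazy dog the dog barks".split())

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    # E_0 before E_1 before the word itself; within a set, pick the largest N_c.
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=WORDS.get)

print(correct("teh"))    # -> 'the'
print(correct("dogs"))   # -> 'dog'
```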

Demo #02

  • S. Russell, P. Norvig. Artificial Intelligence: A Modern Approach (3rd Edition)
  • T. Segaran. Programming Collective Intelligence
  • C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval
  • B. Croft, D. Metzler, T. Strohman. Search Engines: Information Retrieval in Practice
