The Master Algorithm







No free lunch theorem.

Curses of machine learning:

1. Overfitting.

2. High dimensionality – Often you need to learn from data with a very large number of dimensions: the pixels in an image, the clicks or tastes that identify your preferences, the words on a page, etc. Most of these dimensions (attributes) are irrelevant to what you are currently trying to learn, and taking them into account can lead to bad predictions. We don’t have an automated way of deciding which attributes are relevant to a particular learning task. This is a particular problem for nearest neighbor.

As the number of features/dimensions grows, the amount of data we need to generalize accurately grows exponentially.
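A rough way to see the exponential growth: split each axis into a fixed number of bins and count the grid cells you would need at least one example in. This is only a sketch; the function name and the choice of 10 bins per axis are assumptions made for illustration.

```python
# Illustrative only: cells in a grid covering the unit hypercube.
# Having at least one example per cell needs at least this many data points.

def cells_needed(dimensions: int, bins_per_axis: int = 10) -> int:
    """Number of grid cells when every axis is split into bins_per_axis bins."""
    return bins_per_axis ** dimensions


if __name__ == "__main__":
    for d in (1, 2, 3, 10, 100):
        print(f"{d} dimensions -> {cells_needed(d)} cells")
```

Each extra dimension multiplies the count by the number of bins, so the data needed to keep the same coverage grows exponentially with the number of dimensions.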

Rationalism vs empiricism

How can we ever be justified in generalizing from what we have seen to what we haven’t?


Credit assignment problem

Boltzmann machine.

Backpropagation algorithm. It is not guaranteed to find the global optimum.

Gradient ascent and descent.
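A minimal sketch of gradient descent on a one-dimensional function. The function name, learning rate, and the example f(x) = (x - 3)^2 are assumptions chosen for illustration; gradient ascent is the same loop with the sign of the step flipped.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # use "+" here for gradient ascent
    return x


# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and a single minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

On a function with several valleys the same loop simply stops in whichever valley it started in, which is why it cannot tell a local optimum from the global one.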

Autoencoder: a multilayer perceptron whose output is the same as its input. The trick is to make the hidden layer much smaller than the input and output layers.

Stacked sparse autoencoders learn high-level concepts like faces from low-level concepts like edges and shades, hierarchically.

Human intelligence boils down to a single algorithm – Andrew Ng

Convolutional neural networks are inspired by the visual cortex.

Optimal learning is the Bayesians’ central goal.

Laplace’s Rule of Succession:

The probability that an event will occur again after it has occurred n times in succession

= \frac { n+1 }{ n+2 }
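A small sketch of the rule using exact fractions. The function names are made up for illustration; the general form (s + 1) / (n + 2) for s successes in n trials reduces to the formula above when every trial so far was a success.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of the chance of success on the next trial."""
    return Fraction(successes + 1, trials + 2)


def after_unbroken_run(n: int) -> Fraction:
    """Special case from the notes: the event occurred n times out of n."""
    return rule_of_succession(n, n)


if __name__ == "__main__":
    for n in (0, 1, 9, 99):
        print(n, after_unbroken_run(n))
```

With no observations at all the estimate is 1/2, and it approaches 1 as the unbroken run gets longer without ever reaching it.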

A controversy in the definition of probability:

  1. Bayesian: a (prior) probability is a subjective degree of belief.
  2. Frequentist: probability is the frequency with which an event (a subset of the sample space) occurs.

Bayesian learning is computationally costly.

A learner that assumes the effects are independent given the cause is called a Naive Bayes classifier.

PageRank uses the idea of a Markov chain. Web pages with many incoming links are probably more important than pages with few.
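A toy sketch of the idea as a power iteration on a tiny link graph. The graph, the damping factor of 0.85, and the function name are assumptions for illustration, not the production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Every linked-to page must also appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Random-jump part: every page gets a small equal share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and D all link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

The ranks are the long-run fraction of time a random surfer spends on each page, so the page with the most incoming links (C here) ends up on top.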

Hidden Markov models are used for inference in speech recognition.

The Kalman filter is the continuous version of the HMM.

Bayesian networks

Markov chain Monte Carlo (MCMC) is used to approximate the distributions of a Bayesian network.

MCMC is a random walk over the states of a Markov chain; in the long run, the number of times each state is visited is proportional to its probability.
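A minimal sketch of that visit-count property, using a Metropolis sampler over three made-up states. The weights, step count, and function name are assumptions for illustration; real Bayesian-network inference would walk over variable assignments instead.

```python
import random
from collections import Counter


def metropolis_visits(weights, steps=100_000, seed=0):
    """Random walk over the keys of weights (all weights must be positive).

    Returns the fraction of steps spent in each state.
    """
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(states)  # symmetric proposal: any state, uniformly
        accept_probability = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_probability:
            current = proposal
        visits[current] += 1
    return {state: visits[state] / steps for state in states}


# Unnormalized target: the walk should spend about 10%, 20% and 70% of its time here.
target = {"a": 1.0, "b": 2.0, "c": 7.0}
frequencies = metropolis_visits(target)
```

Only ratios of weights are ever used, so the target never has to be normalized, which is what makes the approach usable when the full distribution is too expensive to compute.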

Applying probability to medical diagnosis.


Markov networks


Analogizers can work with less data.

They don’t form a model.

Analogy is behind many scientific advances.

If two things are similar, the thought of one will tend to trigger the thought of the other – Aristotle

Algorithms in the analogy domain are:

1. K-Nearest neighbor (weighted)

Doesn’t work well with lots of dimensions (hyperspace).

Can’t identify the relevant attributes.

Discovering blanket space in hyper space.

2. Support vector machines

Weights have a single optimum instead of many local ones – advantage over multi-layer perceptron.

Only one layer.

Extending to new dimensions.

3. Full-blown analogical reasoning.
Structure mapping.
Learning across problem domains.
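A minimal sketch of the first algorithm in the list, weighted k-nearest neighbor. The toy points, the inverse-distance weighting, and the function name are assumptions for illustration. Note that every attribute counts equally in the distance, which is why irrelevant attributes hurt it.

```python
import math
from collections import defaultdict


def weighted_knn(train, query, k=3):
    """train is a list of (point, label) pairs; closer neighbors get bigger votes."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        # Inverse-distance weight; the small constant avoids division by zero.
        votes[label] += 1.0 / (math.dist(point, query) + 1e-9)
    return max(votes, key=votes.get)


# Hypothetical two-cluster training set.
train = [
    ((0.0, 0.0), "blue"), ((0.0, 1.0), "blue"), ((1.0, 0.0), "blue"),
    ((5.0, 5.0), "red"), ((5.0, 6.0), "red"), ((6.0, 5.0), "red"),
]
near_blue = weighted_knn(train, (0.5, 0.5))
near_red = weighted_knn(train, (5.5, 5.5))
```

There is no training step and no model: the stored examples are the classifier, and all the work happens at query time.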


An algorithm that could spontaneously group together similar objects or different images of the same object – clustering problem.

K-means, naive Bayes, EM algorithm.


Dimensionality reduction


Isomap – for nonlinear dimensionality reduction.


Reinforcement learning – optimal control in an unknown environment.

DeepMind sits at the intersection of reinforcement learning and multilayer perceptrons.


Meta learning: combining all learning algorithms.

Stacking – a metalearner

Bagging – generate multiple random variations of the training set by resampling, apply the learning algorithm to each one, and combine the results by voting. This decreases the variance and increases accuracy.
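A small sketch of bagging with a deliberately simple base learner. The 1-D threshold "stump", the toy data, and the number of models are assumptions for illustration; any learner could take the stump's place.

```python
import random
from collections import Counter


def train_stump(sample):
    """Pick the threshold on a 1-D feature that misclassifies the fewest points.

    The stump predicts True when x > threshold.
    """
    best_errors, best_threshold = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != label for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold


def bagging(data, n_models=25, seed=0):
    """Train one stump per bootstrap resample and return a majority-vote predictor."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        thresholds.append(train_stump(resample))

    def predict(x):
        votes = Counter(x > threshold for threshold in thresholds)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical data: small values are False, large values are True.
data = [(0.5, False), (1.0, False), (1.5, False), (2.0, False),
        (3.0, True), (3.5, True), (4.0, True), (4.5, True)]
predict = bagging(data)
```

Each resample yields a slightly different threshold; voting averages out those differences, which is where the reduction in variance comes from.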

Boosting – a metalearner.

The Master Algorithm resides in the circles of optimization town, the towers of representation, and evaluation.

Optimization techniques in different tribes:

Inverse deduction – symbolists

Gradient descent – connectionists

Genetic search (crossover and mutation) – evolutionaries

Constrained optimization – analogizers

What are to be combined?

Decision trees, multilayer perceptrons, classifier systems, naive Bayes, SVMs

Popular lines.

The most important thing in an equation is all the quantities that don’t appear in it.

The universe maximises entropy subject to keeping energy constant.

PCA is to unsupervised learning what linear regression is to supervised learning.

The five most important personality traits to look for: extroversion, agreeableness, conscientiousness, neuroticism, and openness to experience.

The law of effect

Children explore and adults exploit.

Snippets of reinforcement learning also known as habits make up most of what you do.

You don’t try to outrun a horse, you ride it. It’s not computers vs humans, it’s humans with computers vs humans without computers.

Whatever is true of everything we have seen is true of everything in the universe – Newton

Time is the principal component of memory.
