Concept Learning
- Assume a set X of possible examples
- A concept is a subset C of X containing all examples for which a given boolean statement is true
- A supervised learning algorithm seeks to learn a function f that correctly classifies any example drawn from X (see the sketch after this list):
- f(x) = 1 if x is in C
- f(x) = 0 if x is not in C
- So what is f?
- A perceptron with a single output
- Anything else that can take an input and return a value
- Multi-label classifiers
- Can be thought of as learning one concept per label
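A minimal sketch of this terminology, under invented assumptions (a toy example space of 4-bit tuples and a made-up concept; none of the names below come from the notes):

```python
# Minimal sketch of concept-learning terminology (all names are illustrative).
# X: the set of possible examples -- here, all 4-bit tuples.
# C: a concept, i.e. the subset of X where some boolean statement is true.
from itertools import product

X = list(product([0, 1], repeat=4))      # the example space

def in_concept(x):
    """Boolean statement defining the concept C, e.g. 'first two bits agree'."""
    return x[0] == x[1]

C = {x for x in X if in_concept(x)}      # the concept as a subset of X

def f(x):
    """Ideal target function: 1 if x is in C, 0 otherwise."""
    return 1 if x in C else 0
```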
Probable Approximate Correctness (PAC)
- Big Question:
- How many examples must we use for training?
- Alternative formulation: Is a learned function any good?
- If it is bad, we will find out with high probability after a small number of examples.
- If it is consistent with a large set of training examples, it is unlikely to be wrong.
- Important assumption:
- The test set and training set are drawn randomly from X with the same probability distribution.
- Approximate correctness of concept learning:
- Let Pr(X) be the probability distribution
- Let D = {x | (f(x) = 0 and x in C) or (f(x) = 1 and not (x in C))}
- In other words, D is the set of all incorrect classifications
- Let Error(f) be the sum, for all x in D, of Pr(x)
- We say that f is approximately correct with accuracy ε if and only if Error(f) ≤ ε (see the sketch below)
- Probable approximate correctness
- f is Probably Approximately Correct (PAC) with probability 1-δ if and only if Pr(Error(f) > ε) < δ
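For a finite X these definitions can be computed directly. The sketch below uses an invented uniform Pr(X), a made-up concept C, and a deliberately imperfect learned f to show D, Error(f), and the ε-accuracy check; the probabilistic (δ) part of the definition, which ranges over random training sets, is illustrated in the final sketch of this section:

```python
# Sketch of D, Error(f), and the approximate-correctness check for a finite X.
# All concrete choices (X, Pr, C, and the learned f) are illustrative.
from itertools import product

X = list(product([0, 1], repeat=4))
Pr = {x: 1 / len(X) for x in X}          # a uniform Pr(X), for illustration

C = {x for x in X if x[0] == x[1]}       # the true concept

def f(x):                                # a learned function that is
    return 1 if x[0] == 1 and x[1] == 1 else 0   # (deliberately) imperfect

# D: the set of all examples that f misclassifies
D = {x for x in X
     if (f(x) == 0 and x in C) or (f(x) == 1 and x not in C)}

error = sum(Pr[x] for x in D)            # Error(f): probability mass of D

eps = 0.25
print(f"Error(f) = {error:.3f}; approximately correct with accuracy "
      f"{eps}? {error <= eps}")
```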
- So, how many examples do we need?
- Depends upon the size of the hypothesis space
- Let |H| be the size of the hypothesis space
- If f is a "bad" hypothesis (i.e., Error(f) > ε), then:
- Probability that the "bad" f is consistent with m independently drawn examples is ≤ (1 - ε)^m
- In other words, this is the chance that the "bad" f fools us (a quick numerical check follows this group of bullets)
- Probability that some "bad" f in H is consistent with all m examples is ≤ |H| (1 - ε)^m (union bound)
- Requiring this to be at most δ gives the δ we referred to earlier
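As a sanity check on this step, here is a quick Monte Carlo sketch (the values of ε, m, and the trial count are arbitrary) showing that a fixed hypothesis with error exactly ε survives m independent examples with probability about (1 - ε)^m:

```python
import random

# Each independent example exposes the "bad" hypothesis with probability eps,
# so it stays consistent with all m examples with probability (1 - eps)**m.
eps, m, trials = 0.1, 20, 100_000
survived = sum(all(random.random() > eps for _ in range(m))
               for _ in range(trials))
print(survived / trials, (1 - eps) ** m)   # the two numbers should be close
```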
- Let's do some algebra:
- |H| (1 - ε)^m ≤ δ
- (1 - ε)^m ≤ δ / |H|
- m ≥ log(δ / |H|) / log(1 - ε)   (note that log(1 - ε) is negative, which flips the inequality; worked numerically in the sketch below)
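Worked numerically, the bound is straightforward to evaluate; the function name and the example values of |H|, ε, and δ below are arbitrary choices for illustration:

```python
import math

def min_examples(h_size, eps, delta):
    """Smallest integer m with |H| * (1 - eps)**m <= delta,
    i.e. m >= log(delta / |H|) / log(1 - eps)."""
    return math.ceil(math.log(delta / h_size) / math.log(1 - eps))

# Example: |H| = 2**16 boolean functions on 4 bits, eps = 0.1, delta = 0.05
print(min_examples(2 ** 16, 0.1, 0.05))   # -> 134 for these illustrative numbers
```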
- This result is a lower bound on the number of training examples (m) needed to guarantee PAC for any learning algorithm that returns a hypothesis consistent with the training data
- This result is optimistic:
- For a specific learning algorithm, m could well be larger
- Does not really take noisy data into account
- This result is pessimistic:
- Training and test examples drawn from the same Pr(X) can be highly similar, or even duplicated, which this distribution-free bound ignores
- Using a distribution-dependent PAC model gets complicated very quickly
- A concept C is PAC-Learnable by a hypothesis space H if:
- There exists an algorithm A that terminates in time polynomial in the number of training examples
- The number of training examples required is polynomial in 1/ε and 1/δ
- A returns, with probability at least 1 - δ, a learned function f such that Error(f) ≤ ε (an end-to-end sketch follows this list)
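A minimal end-to-end sketch of these conditions, under invented assumptions (a toy example space of 3-bit tuples, H taken to be every possible labeling of X, a uniform Pr(X), and a brute-force learner that returns any hypothesis consistent with the training data):

```python
import itertools
import math
import random

# Illustrative setup (not from the notes): X = all 3-bit tuples, and H = every
# possible labeling of X, i.e. |H| = 2**8 = 256 hypotheses, each represented as
# the set of examples it labels 1.
X = list(itertools.product([0, 1], repeat=3))
H = [frozenset(pos) for r in range(len(X) + 1)
     for pos in itertools.combinations(X, r)]

C = frozenset(x for x in X if x[0] == 1 or x[2] == 1)   # the target concept

def sample_size(h_size, eps, delta):
    """m >= log(delta / |H|) / log(1 - eps), from the derivation above."""
    return math.ceil(math.log(delta / h_size) / math.log(1 - eps))

def consistent_learner(training_data):
    """Return any hypothesis in H that agrees with every training example."""
    for h in H:
        if all((x in h) == label for x, label in training_data):
            return h
    return None   # cannot happen here, since C itself is in H

def true_error(h):
    """Error(h) under the uniform distribution Pr(X)."""
    return sum(1 for x in X if (x in h) != (x in C)) / len(X)

eps, delta = 0.2, 0.1
m = sample_size(len(H), eps, delta)

trials = 500
bad = 0
for _ in range(trials):
    data = [(x, x in C) for x in random.choices(X, k=m)]   # i.i.d. labeled sample
    h = consistent_learner(data)
    if true_error(h) > eps:
        bad += 1

# The PAC guarantee says the fraction of "bad" runs should stay below delta.
print(f"m = {m}; fraction of runs with Error > {eps}: {bad / trials:.3f}")
```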
Calculating Example Sizes