Concept Learning
- Assume a set X of possible examples
- A concept is a subset C of X containing all examples for which a given boolean statement is true
- A supervised learning algorithm seeks to learn a function f that correctly classifies any example drawn from X (see the sketch after this list):
- f(x) = 1 if x is in C
- f(x) = 0 if x is not in C
- So what is f?
- A perceptron with a single output
- Anything else that can take an input and return a value
- Multi-label classifiers
- Can be thought of as learning one concept per label
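A minimal sketch of this terminology, under invented assumptions (a toy example space of 4-bit tuples and a made-up concept; none of the names below come from the notes):

```python
# Minimal sketch of concept-learning terminology (all names are illustrative).
# X: the set of possible examples -- here, all 4-bit tuples.
# C: a concept, i.e. the subset of X where some boolean statement is true.
from itertools import product

X = list(product([0, 1], repeat=4))      # the example space

def in_concept(x):
    """Boolean statement defining the concept C, e.g. 'first two bits agree'."""
    return x[0] == x[1]

C = {x for x in X if in_concept(x)}      # the concept as a subset of X

def f(x):
    """Ideal target function: 1 if x is in C, 0 otherwise."""
    return 1 if x in C else 0
```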
Probable Approximate Correctness (PAC)
- Big Question:
- How many examples must we use for training?
- Alternative formulation: Is a learned function any good?
- If it is bad, we will find out with high probability after a small number of examples.
- If it is consistent with a large set of training examples, it is unlikely to be wrong.
- Important assumption:
- The test set and training set are drawn randomly from X with the same probability distribution.
- Approximate correctness of concept learning:
- Let Pr(X) be the probability distribution
- Let D = {x | (f(x) = 0 and x in C) or (f(x) = 1 and not (x in C))}
- In other words, D is the set of all incorrect classifications
- Let Error(f) be the sum, for all x in D, of Pr(x)
- We say that f is approximately correct with accuracy ε if and only if Error(f) ≤ ε (see the sketch below)
- Probable approximate correctness
- f is Probably Approximately Correct (PAC) with probability 1-δ if and only if Pr(Error(f) > ε) < δ
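For a finite X these definitions can be computed directly. The sketch below uses an invented uniform Pr(X), a made-up concept C, and a deliberately imperfect learned f to show D, Error(f), and the ε-accuracy check; the probabilistic (δ) part of the definition, which ranges over random training sets, is illustrated in the final sketch of this section:

```python
# Sketch of D, Error(f), and the approximate-correctness check for a finite X.
# All concrete choices (X, Pr, C, and the learned f) are illustrative.
from itertools import product

X = list(product([0, 1], repeat=4))
Pr = {x: 1 / len(X) for x in X}          # a uniform Pr(X), for illustration

C = {x for x in X if x[0] == x[1]}       # the true concept

def f(x):                                # a learned function that is
    return 1 if x[0] == 1 and x[1] == 1 else 0   # (deliberately) imperfect

# D: the set of all examples that f misclassifies
D = {x for x in X
     if (f(x) == 0 and x in C) or (f(x) == 1 and x not in C)}

error = sum(Pr[x] for x in D)            # Error(f): probability mass of D

eps = 0.25
print(f"Error(f) = {error:.3f}; approximately correct with accuracy "
      f"{eps}? {error <= eps}")
```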
- So, how many examples do we need?
- Depends upon the size of the hypothesis space
- Let |H| be the size of the hypothesis space
- If f is a "bad" hypothesis (i.e., Error(f) > ε), then:
- Probability that the "bad" f is consistent with m independently drawn examples is ≤ (1 - ε)^m
- In other words, this is the chance that the "bad" f fools us (a quick numerical check follows this group of bullets)
- Probability that some "bad" f in H is consistent with all m examples is ≤ |H| (1 - ε)^m (union bound)
- Requiring this to be at most δ gives the δ we referred to earlier
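As a sanity check on this step, here is a quick Monte Carlo sketch (the values of ε, m, and the trial count are arbitrary) showing that a fixed hypothesis with error exactly ε survives m independent examples with probability about (1 - ε)^m:

```python
import random

# Each independent example exposes the "bad" hypothesis with probability eps,
# so it stays consistent with all m examples with probability (1 - eps)**m.
eps, m, trials = 0.1, 20, 100_000
survived = sum(all(random.random() > eps for _ in range(m))
               for _ in range(trials))
print(survived / trials, (1 - eps) ** m)   # the two numbers should be close
```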
- Let's do some algebra:
- |H| (1 - ε)^m ≤ δ
- (1 - ε)^m ≤ δ / |H|
- m ≥ log(δ / |H|) / log(1 - ε)   (note that log(1 - ε) is negative, which flips the inequality; worked numerically in the sketch below)
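Worked numerically, the bound is straightforward to evaluate; the function name and the example values of |H|, ε, and δ below are arbitrary choices for illustration:

```python
import math

def min_examples(h_size, eps, delta):
    """Smallest integer m with |H| * (1 - eps)**m <= delta,
    i.e. m >= log(delta / |H|) / log(1 - eps)."""
    return math.ceil(math.log(delta / h_size) / math.log(1 - eps))

# Example: |H| = 2**16 boolean functions on 4 bits, eps = 0.1, delta = 0.05
print(min_examples(2 ** 16, 0.1, 0.05))   # -> 134 for these illustrative numbers
```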
- This result is a lower bound on the number of training examples (m) needed to guarantee PAC for any learning algorithm that returns a hypothesis consistent with the training data
- This result is optimistic:
- For a specific learning algorithm, m could well be larger
- Does not really take noisy data into account
- This result is pessimistic:
- Training and test examples drawn from the same Pr(X) can be highly similar, or even duplicated, which this distribution-free bound ignores
- Using a distribution-dependent PAC model gets complicated very quickly
- A concept C is PAC-Learnable by a hypothesis space H if:
- There exists an algorithm A that terminates in time polynomial in the number of training examples
- The number of training examples required is polynomial in 1/ε and 1/δ
- A returns, with probability at least 1 - δ, a learned function f such that Error(f) ≤ ε (an end-to-end sketch follows this list)
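A minimal end-to-end sketch of these conditions, under invented assumptions (a toy example space of 3-bit tuples, H taken to be every possible labeling of X, a uniform Pr(X), and a brute-force learner that returns any hypothesis consistent with the training data):

```python
import itertools
import math
import random

# Illustrative setup (not from the notes): X = all 3-bit tuples, and H = every
# possible labeling of X, i.e. |H| = 2**8 = 256 hypotheses, each represented as
# the set of examples it labels 1.
X = list(itertools.product([0, 1], repeat=3))
H = [frozenset(pos) for r in range(len(X) + 1)
     for pos in itertools.combinations(X, r)]

C = frozenset(x for x in X if x[0] == 1 or x[2] == 1)   # the target concept

def sample_size(h_size, eps, delta):
    """m >= log(delta / |H|) / log(1 - eps), from the derivation above."""
    return math.ceil(math.log(delta / h_size) / math.log(1 - eps))

def consistent_learner(training_data):
    """Return any hypothesis in H that agrees with every training example."""
    for h in H:
        if all((x in h) == label for x, label in training_data):
            return h
    return None   # cannot happen here, since C itself is in H

def true_error(h):
    """Error(h) under the uniform distribution Pr(X)."""
    return sum(1 for x in X if (x in h) != (x in C)) / len(X)

eps, delta = 0.2, 0.1
m = sample_size(len(H), eps, delta)

trials = 500
bad = 0
for _ in range(trials):
    data = [(x, x in C) for x in random.choices(X, k=m)]   # i.i.d. labeled sample
    h = consistent_learner(data)
    if true_error(h) > eps:
        bad += 1

# The PAC guarantee says the fraction of "bad" runs should stay below delta.
print(f"m = {m}; fraction of runs with Error > {eps}: {bad / trials:.3f}")
```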
Calculating Example Sizes