Probability - The Science of Uncertainty and Data
by Fabián Kozynski
PROBABILITY
Probability models and axioms
Definition (Sample space) A sample space Ω is the set of all possible outcomes. The set’s elements must be mutually exclusive, collectively exhaustive and at the right granularity.
Definition (Event) An event is a subset of the sample space. Probability is assigned to events.
Definition (Probability axioms) A probability law \mathbb{P} assigns probabilities to events and satisfies the following axioms:
a. Nonnegativity: \mathbb{P}\left(A\right) \ge 0 for all events A.
b. Normalization: \mathbb{P}\left(\Omega\right) = 1.
c. (Countable) additivity: For every sequence of events A_1, A_2, \ldots with A_i \cap A_j = \varnothing for i \neq j\colon \mathbb{P}\left(\bigcup\limits_{i} A_i\right) = \sum\limits_i \mathbb{P}(A_i).
Corollaries (Consequences of the axioms):
1. \mathbb{P}\left(\varnothing\right) = 0.
2. For any finite collection of disjoint events A_1, \ldots, A_n,
\mathbb{P}\left(\bigcup\limits_{i=1}^n A_i\right) = \sum\limits_{i=1}^n \mathbb{P}(A_i).
3. \mathbb{P}\left(A\right) + \mathbb{P}\left(A^c\right) = 1.
4. \mathbb{P}\left(A\right) \le 1.
5. \textrm{If } A \subset B, \textrm{ then } \mathbb{P}(A) \leq \mathbb{P}(B).
6. \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B).
7. \mathbb{P}(A \cup B) \leq \mathbb{P}(A) + \mathbb{P}(B).
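A quick illustration of corollary 6, assuming a fair six-sided die (equally likely outcomes): with A = \{2, 4, 6\} (even roll) and B = \{1, 2, 3\}, we have A \cap B = \{2\}, so \mathbb{P}(A \cup B) = \tfrac{1}{2} + \tfrac{1}{2} - \tfrac{1}{6} = \tfrac{5}{6}.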
Example (Discrete uniform law) Assume Ω is finite and consists of n equally likely elements, and let A \subset \Omega be an event with k elements. Then \mathbb{P}(A) = \tfrac{k}{n}.
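For instance, drawing one card uniformly at random from a standard 52-card deck: if A is the event that the card is a heart, then \mathbb{P}(A) = \tfrac{13}{52} = \tfrac{1}{4}.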
Conditioning and Bayes’ rule
Definition (Conditional probability) Given that event B has occurred and that \mathbb{P}(B) \gt 0, the probability that A occurs is
\mathbb{P}(A|B) \triangleq \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}.
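For example, roll a fair six-sided die and let B = \{2, 4, 6\} (even outcome) and A = \{2\}. Then \mathbb{P}(A|B) = \frac{1/6}{1/2} = \tfrac{1}{3}.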
Remark (Properties of conditional probabilities) Conditional probabilities satisfy the same axioms as ordinary probabilities. Assuming \mathbb{P}(B) > 0:
a. \mathbb{P}(A|B) \ge 0.
b. \mathbb{P}(\Omega|B) = 1.
c. \mathbb{P}(B|B) = 1.
d. \textrm{If } A \cap C = \varnothing, \textrm{ then } \mathbb{P}(A \cup C |B) = \mathbb{P}(A | B) + \mathbb{P}(C|B).
Proposition (Multiplication rule)
\mathbb{P}(A_1 \cap A_2 \cap \cdots \cap A_n) = \mathbb{P}(A_1) \cdot \mathbb{P}(A_2|A_1) \cdots \mathbb{P}(A_n|A_1 \cap A_2 \cap \cdots \cap A_{n-1}).
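For example, the probability of drawing two aces in the first two cards from a well-shuffled 52-card deck is \mathbb{P}(A_1)\mathbb{P}(A_2|A_1) = \tfrac{4}{52} \cdot \tfrac{3}{51} = \tfrac{1}{221}, where A_i is the event that the i-th card is an ace.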
Theorem (Total probability theorem) Given a partition \{A_1, A_2, \ldots\} of the sample space, meaning that \bigcup\limits_i A_i = \Omega and the events are disjoint, with \mathbb{P}(A_i) > 0 for all i, then for every event B we have
\mathbb{P}(B) = \sum\limits_i{\mathbb{P}(A_i)\mathbb{P}(B|A_i)}.
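Example (Total probability) In a hypothetical setup, a part is produced by machine A_1 with probability 0.6 or by machine A_2 with probability 0.4, and the defect probabilities are \mathbb{P}(B|A_1) = 0.01 and \mathbb{P}(B|A_2) = 0.05, where B is the event that the part is defective. Then \mathbb{P}(B) = 0.6 \cdot 0.01 + 0.4 \cdot 0.05 = 0.026.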
Theorem (Bayes’ rule) Given a partition \{A_1, A_2, \ldots\} of the sample space, meaning that \bigcup \limits_{i} A_i = \Omega and the events are disjoint, and if \mathbb{P}(A_i) > 0 for all i, then for every event B, the conditional probabilities \mathbb{P}(A_i | B) can be obtained from the conditional probabilities \mathbb{P}(B|A_i) and the initial probabilities \mathbb{P}(A_i) as follows:
\mathbb{P}(A_i | B) = \frac{\mathbb{P}(A_i)\mathbb{P}(B|A_i)}{\sum_j \mathbb{P}(A_j)\mathbb{P}(B|A_j)}.
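Continuing the machine example above, the probability that a defective part came from machine A_2 is \mathbb{P}(A_2|B) = \frac{0.4 \cdot 0.05}{0.026} = \frac{0.020}{0.026} \approx 0.77.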
Independence
Definition (Independence of events) Two events are independent if occurrence of one provides no information about the other. We say that A and B are independent if
\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B).
Equivalently, as long as \mathbb{P}(A) > 0 and \mathbb{P}(B)>0,
\mathbb{P}(B|A) = \mathbb{P}(B)\qquad\mathbb{P}(A|B) = \mathbb{P}(A).
Remarks
a. The definition of independence is symmetric with respect to A and B.
b. The product definition applies even if \mathbb{P}(A) = 0 or \mathbb{P}(B) = 0.
Corollaries If A and B are independent, then A and B^c are independent. Similarly for A^c and B, and for A^c and B^c.
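For example, in two independent tosses of a fair coin, let A be the event that the first toss is heads and B the event that the second toss is heads. Then \mathbb{P}(A \cap B) = \tfrac{1}{4} = \mathbb{P}(A)\mathbb{P}(B), so A and B are independent.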
Definition (Conditional independence) We say that A and B are independent conditioned on C, where \mathbb{P}(C) > 0, if
\mathbb{P}(A \cap B | C) = \mathbb{P}(A|C)\mathbb{P}(B|C).
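Note that independence does not imply conditional independence. With A and B as in the coin example above, let C be the event that exactly one toss is heads. Then \mathbb{P}(A|C) = \mathbb{P}(B|C) = \tfrac{1}{2}, but \mathbb{P}(A \cap B | C) = 0.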
Definition (Independence of a collection of events) We say that events A_1, A_2, \ldots, A_n are independent if for every collection of distinct indices i_1, i_2, \ldots, i_k, we have
\mathbb{P}(A_{i_1} \cap \cdots \cap A_{i_k}) = \mathbb{P}(A_{i_1}) \cdot \mathbb{P}(A_{i_2}) \cdots \mathbb{P}(A_{i_k}).
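Remark Pairwise independence does not imply independence: with A, B, C as in the coin examples above, each pair of events is independent (each pairwise intersection has probability \tfrac{1}{4} = \tfrac{1}{2} \cdot \tfrac{1}{2}), yet \mathbb{P}(A \cap B \cap C) = 0 \neq \mathbb{P}(A)\mathbb{P}(B)\mathbb{P}(C) = \tfrac{1}{8}.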