
I have been brushing up on my probability and measure theory lately. I understand how measure theory defines a sigma algebra over a sample space, and how a "measure" then assigns a numerical value to the sets in the sigma algebra. However, I am struggling to understand the connection between the sigma algebra in a probability space and a regression model with its regressor variables in statistics. Looking at measure theory texts, like Axler or Billingsley, the authors stop at the assignment of probability and do not discuss the connection between measure theory and statistics.

I am trying to remember how measure theory and probability formally connect to statistical models. I recall from grad school the CLT, which says that a suitably normalized sum of random variables is approximately normal, and the idea of sampling. But I am looking for a more mathematical, formal understanding of how probability connects back to specific statistical models.

Let me be a little more precise. A sigma algebra is a collection of subsets of the sample space (the events) that contains the sample space itself and is closed under complements and countable unions and intersections. In a linear regression we have regressor variables. I am trying to understand, in a precise mathematical sense, how the elements of the sigma algebra relate to the regressor variables in a regression, if there is any relationship at all.

In a linear regression our model looks like:

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \boldsymbol{\epsilon},$$

where $\mathbf{Y}$ is a vector of outcomes, $\mathbf{X}$ is a design matrix containing the regressor variables, $\mathbf{B}$ is a vector of coefficients, and $\boldsymbol{\epsilon}$ is a vector of observation-level errors.
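To make the setup concrete, here is a minimal simulation of this model (just a sketch; the dimensions and parameter values are made up for illustration):

```python
import numpy as np

# Simulate Y = X B + epsilon with made-up dimensions: each run of this
# script corresponds to drawing one outcome omega from the sample space,
# i.e., one realization epsilon(omega) of the error vector.
rng = np.random.default_rng(42)
n, p = 100, 3
X = rng.normal(size=(n, p))              # design matrix of regressors
B = np.array([2.0, -1.0, 0.5])           # coefficient vector
epsilon = rng.normal(scale=1.0, size=n)  # observation-level errors
Y = X @ B + epsilon                      # vector of outcomes
```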

So in a mathematical sense, is the sigma algebra from measure theory related to the variables in $\mathbf{X}$, or is it related only to the error term $\boldsymbol{\epsilon}$?

  • The short answer is that you do not need a rigorous understanding of probability theory to do statistics. The purpose of measure theory is to provide a foundation for probability theory. I would argue that if you care primarily about statistics, you should prioritize that and learn measure theory only on a need-to-know basis. If you start from the ground up, you will probably never reach statistics.
  • For example, what is a random variable? The intuitive definition is that it is a random number generator following some distribution. The rigorous definition is that it is a measurable mapping from some probability space to the real line. If you need to prove something (for instance, the CLT), then you need the foundational understanding of a random variable. But if you are doing statistics, you often do not need it. Of course, knowing the foundations is helpful, but it is not necessary.
  • Measure theory is the framework underlying formal probability. Statistical inference calculations are based on probability calculations and so ultimately rely on that framework. The formality can sometimes come up in practical problems, normally when trying to do inference on less typical objects than the more common analyses. – Glen_b
  • @NicolasBourbaki Thanks for the insight; I understand what you mean. The problem I am running into is reading articles on stochastic processes, particularly the stochastic finance/Ito calculus material, and also Durrett's book on random graph dynamics, where I keep hitting the language of measurability and measurable functions. I know enough measure theory to understand what they mean; I was trying to figure out how much effort to invest in really understanding that measure-theoretic language deeply, in the sense of the end-to-end connection with statistics. – krishnab
  • You might be interested in stats.stackexchange.com/a/199328/11887

4 Answers


Measure theory relates to probability theory because "probability" is defined as a normed measure (i.e., a probability measure, a measure with total mass one). In practical statistical work, we do not usually work directly with these measures and the sigma-fields that form their domains; instead, we work with density functions that relate to these probability measures. We return to the probability measure whenever we compute probabilities for events of interest in the model.

In regression analysis, the model form defines a conditional density for the response variable given the explanatory variables, so a typical Gaussian regression model uses the conditional density function:

$$p_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}| \mathbf{x}) = \text{N}(\mathbf{y} | \mathbf{x} \boldsymbol{\beta}, \sigma \mathbf{I}).$$

The resulting probability measure is related to this conditional density by:

$$\mathbb{P}(\mathbf{Y} \in \mathscr{Y} | \mathbf{x}) = \int \limits_\mathscr{Y} p_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}| \mathbf{x}) \ d \mathbf{y} = \int \limits_\mathscr{Y} \text{N}(\mathbf{y} | \mathbf{x} \boldsymbol{\beta}, \sigma \mathbf{I}) \ d \mathbf{y}.$$

Whereas the density function takes a single real input argument $\mathbf{y} \in \mathbb{R}^n$, the probability measure takes an input argument $\mathscr{Y} \subseteq \mathbb{R}^n$, a set of values that is an element of a corresponding sigma-algebra. Typically we work with real numbers, so the sample space is $\mathbb{R}^n$ and the sigma-algebra is the corresponding class of Borel sets on $\mathbb{R}^n$. When we do statistical analysis we usually do not mention measure theory or the sigma-algebra on which the probability measure is defined, but it is there in the background.
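To illustrate the distinction numerically, here is a minimal sketch (Python with scipy; the setup and names are invented for illustration, and I read $\sigma$ as the error standard deviation, so the coordinates of $\mathbf{Y}$ are independent). The density takes a point $\mathbf{y}$; the measure takes a set, here a box event from the Borel sigma-algebra:

```python
import numpy as np
from scipy.stats import norm

# Invented setup: n = 3 observations, p = 2 regressors.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))       # design matrix
beta = np.array([1.0, -0.5])      # coefficient vector
sigma = 2.0                       # error standard deviation
mu = X @ beta                     # conditional mean X beta

# Density: the input is a single point y in R^n.  With covariance
# sigma^2 I the joint density is the product of the marginals.
y = np.array([0.2, -1.0, 0.5])
density = np.prod(norm.pdf(y, loc=mu, scale=sigma))

# Probability measure: the input is a *set*, here the box event
# {a_i <= Y_i <= b_i for all i}, an element of the Borel sigma-algebra.
a = np.array([-1.0, -2.0, -1.0])
b = np.array([1.0, 0.0, 1.0])
prob = np.prod(norm.cdf(b, loc=mu, scale=sigma)
               - norm.cdf(a, loc=mu, scale=sigma))

print(density, prob)
```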

  • I would add that in 99.9% of all real-world statistics, either the sample space is $\mathbb{R}^n$ and the sigma-algebra is the Borel sets, or the sample space is finite, the sigma-algebra is the entire power set, and you don't need any measure theory. – quarague
  • What is $\text{N}(\mathbf{y} | \mathbf{x} \boldsymbol{\beta}, \sigma \mathbf{I})$? I am more used to $\text{N}(\mathbf{x} \boldsymbol{\beta}, \sigma \mathbf{I})$.
  • @RichardHardy: The former is the density function evaluated at the input $\mathbf{y}$, whereas the latter is the entire function (i.e., the distribution). (At least, that is how I use the notation.) – Ben
  • So we have the argument before the vertical bar and the parameters after it? I was confused by the use of a vertical bar in this context; a semicolon would have been less confusing to me.
  • Fair enough. Some people use the semicolon for that; I prefer the bar. – Ben
---

So in a mathematical sense, is the sigma algebra from measure theory related to the variables in $\mathbf{X}$, or is it related only to the error term $\boldsymbol{\epsilon}$?

It may become easier to see if you write the regression as a multivariate normally distributed variable:

$$\mathbf{Y} \sim \mathcal{N}(\mathbf{X}\mathbf{B}, \mathbf{\Sigma})$$

Perhaps you can now see more clearly that this is the definition of a probability measure.

The relevant sigma algebra contains all rectangles $$\prod_{i=1}^n [a_i, b_i] = \{\mathbf{y} : a_i \leq y_i \leq b_i \text{ for all } i\}$$ and is closed under complements and countable unions. Such sets describe the events to which we can assign probabilities.
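As a worked instance (assuming for simplicity independent errors, $\mathbf{\Sigma} = \sigma^2 \mathbf{I}$, and writing $\boldsymbol{\mu} = \mathbf{X}\mathbf{B}$), the probability assigned to such a rectangle factorizes into one-dimensional normal probabilities:

$$\mathbb{P}\left(\mathbf{Y} \in \prod_{i=1}^n [a_i, b_i]\right) = \prod_{i=1}^n \left[\Phi\left(\frac{b_i - \mu_i}{\sigma}\right) - \Phi\left(\frac{a_i - \mu_i}{\sigma}\right)\right],$$

where $\Phi$ is the standard normal CDF.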

  • If $\mathbf{X}$ is a random variable as well, then you can define the probability space to include it.
  • Okay, this makes a lot more sense to me. It is really helpful because it lays out how the sigma algebra is defined over the regression model. So each rectangle $(a_i \leq y_i \leq b_i)$ is an event, and if $\mathbf{X}$ is random, I would just take the product over the whole collection of combined $X, Y$ intervals. – krishnab
  • This makes sense. I don't know why I have not found any book that lays it out so simply. Would you happen to know a good text or article that actually discusses or describes this? Thanks so much again. – krishnab
  • @krishnab Possibly any text that does not use measure theory is a good text to start with. For example, Fisher's "On the mathematical foundations of theoretical statistics" does not talk about probability spaces and sigma algebras; instead it talks about 'the probability $df$ of an observation in the range $dx$', an approach that lets you do most computations. The situation is more complex when the events are not so simply described by regions in finite-dimensional Euclidean space and are instead described in an infinite space...
  • ... so a book or article that describes the infinite spaces you might be dealing with (if you are dealing with them) could be a good start. Or possibly there is something else non-trivial where you want to apply probability; then reading about that would be helpful.
---

A general, rigorous framework for regression problems can indeed be constructed.

Consider a family of stochastic kernels $\{P_\eta\}_{\eta \in (a,b)}$ on $(\mathscr Y, \boldsymbol{\mathfrak B})$; that is, for each fixed $B \in \boldsymbol{\mathfrak B}$, the map $\eta \mapsto P_\eta(B)$ is measurable. Let $Y$ be a random variable with values in $\mathscr Y$ that is influenced by a regressor $X$ taking values in the measurable space $(\mathscr X, \boldsymbol{\mathfrak A})$.

Denote by $\mathsf C(\Delta, \mathscr X)$ the class of functions that are continuous in $\theta$ and measurable in $x$ (it is assumed that $\Delta$ is a separable metric space). The regression function $g: \Delta \times \mathscr X \to (a, b)$, with $g \in \mathsf C(\Delta, \mathscr X)$, determines the distribution of the observation $Y$ given the value of the regressor: if $X = x$, the distribution of $Y$ is $P_{g_\theta(x)}$.

Regression models cover both cases of regressors: one where $X$ is random and one where it is non-random.

When $X$ is random, the conditional distribution of $Y$ given $X = x$, as discussed above, is $\mathsf K_\theta(\,\cdot \mid x) = P_{g_\theta(x)}$. If the marginal distribution of $X$ on $(\mathscr X, \boldsymbol{\mathfrak A})$ is denoted by $\mu$, then the independent pairs $(Y_i, X_i)$, $i = 1, \ldots, n$, have the common joint distribution $$\mathcal L(Y_i, X_i) = \mathsf K_\theta \otimes \mu.$$

In the case of non-random regressors, an experimental design fixes the values $x_i \in \mathscr X$; the observations $Y_i$ are independent with corresponding distributions $P_{g_\theta(x_i)}$.

When $\theta, x \in \mathbb{R}^d$ and $g_\theta(x) = \theta^\top x$, the resulting regression problem is linear. When $g_\theta(x) = \varphi(\theta^\top x)$, where $\varphi$ is continuous, the resulting regression problem is generalized linear.
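To see the framework in action, here is a minimal sketch (Python; all names and values are invented for illustration, not taken from the reference) of the generalized linear case with $\varphi$ the logistic sigmoid and $P_\eta = \text{Bernoulli}(\eta)$ on $\mathscr Y = \{0, 1\}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(t):
    # Logistic sigmoid: maps theta^T x into the parameter interval (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

def g(theta, x):
    # Regression function g_theta(x) = phi(theta^T x):
    # continuous in theta, measurable in x.
    return phi(theta @ x)

def sample_kernel(eta):
    # Stochastic kernel P_eta on {0, 1}: here a Bernoulli(eta) draw.
    return rng.binomial(1, eta)

theta = np.array([0.8, -1.2])

# Random-regressor case: draw X_i ~ mu, then Y_i | X_i = x ~ P_{g_theta(x)},
# so each pair (Y_i, X_i) has joint law K_theta ⊗ mu.
X = rng.normal(size=(5, 2))   # X_i ~ mu (here, mu is standard normal)
Y = np.array([sample_kernel(g(theta, x)) for x in X])
print(Y)
```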

--

Reference:

Liese, Friedrich, and Klaus-J. Miescke, Statistical Decision Theory: Estimation, Testing and Selection, Springer Science+Business Media, 2008, sec. 7.5, pp. 343-344.

---

It is a bit hard to fit in a comment, so I am posting this as an answer, but we would be remiss not to mention Gelman's take on this (a selected excerpt from his blog):

Stephen Senn quips: “A theoretical statistician knows all about measure theory but has never seen a measurement whereas the actual use of measure theory by the applied statistician is a set of measure zero.”

Which reminds me of Lucien Le Cam’s reply when I asked him once whether he could think of any examples where the distinction between the strong law of large numbers (convergence with probability 1) and the weak law (convergence in probability) made any difference. Le Cam replied, No, he did not know of any examples. Le Cam was the theoretical statistician’s theoretical statistician, so there’s your answer.

  • Haha, I get the joke, and I have enough measure theory to understand what you mean :). I read through Gelman's post and I understand where he is coming from: there is a well-developed structure for statistical inference that does not require recourse to measure theory. But at the same time, I wonder whether there is a case for a more applied measure theory in some instances. Sets of measure zero are of course not the issue, but the time-dependent evolution of the sigma algebra and the corresponding probabilities seems very relevant to applied problems. But I have not thought deeply about this topic. – krishnab
