Estimation of Time Varying Partial Correlation Networks

Estimation of Time Varying Partial Correlation Networks

Lately, many applications in social, natural and information sciences led to an increased interest in the statistical analysis of networks or, in other words, graph models. Examples within the fields of economics and finance are given amongst others, in a series of articles authored by Francis X. Diebold and Kamil Yilmaz with various collaborators; cf. www.financialconnectedness.org for further details. Kolar et al. [1], inter alia, provide examples from the biology (genome networks) and sociology (U.S. Senate voting behavior) literature. A common feature of these examples is the underlying time dependence of such systems.

Basic Terminology for Networks

Before digging into the estimation of networks, a short overview of what networks actually are and some basic terminology are in order. A network, or graph model, is defined as the tuple \mathcal{N} = (\mathcal{V},\mathcal{E}) where \mathcal{V} = \{1, ...,N\} is called the set of vertices or nodes (think of individuals, firms, countries, genomes, etc.) and  \mathcal{E} \subseteq \mathcal{V} \times \mathcal{V} the set of edges (connections between the vertices). Depending on the direction of these connections, we can distinguish two basic kinds of networks: directed and undirected networks. The latter are graphs in which all edges are bidirectional (unordered pairs of elements of \mathcal{N}) whereas in the former all edges point into a direction (ordered pairs of elements of \mathcal{N}). Another major distinction between graphs is that of a weighted vs. an unweighted graph. In unweighted graphs we only display whether two vertices share a connection or not, whereas in weighted graphs we associate weights with each edge displaying the strength of the connection between vertices.

Left panel illustrates an undirected and right panel a directed unweighted network. A weighted network could be illustrated by using different edge thickness or coloring. Taken from [14, 15].

Figure 1: Left panel illustrates an undirected and right panel a directed unweighted network. A weighted network could be illustrated by using different edge thickness or coloring. Taken from [14, 15].

Having presented the basic terminology for networks we can turn our focus to the statistical part of networks. That is, we will now discuss different approaches on how to estimate the set of edges, depending on the underlying data structure, for (weighted) undirected networks. But first, we need to define when an edge between two vertices exists in a statistical meaning. Following the literature, cf. [2], this can be done by means of partial correlation networks. In this context, an edge between two vertices exists if and only if the partial correlation,  \rho_{ij} , between the two vertices i and j is non-zero. That is

 \displaystyle \mathcal{E}= \{(i,j) \in \mathcal{V}\times\mathcal{V}: \rho_{ij} \neq 0 \}

Note that, for weighted networks, ij also defines the weight of the edge between vertices i and j. To formalize this idea a bit more, suppose that we have a N-dimensional random vector x with generic i-th element  x_i , with mean zero and covariance matrix  \Sigma . Moreover, define the precision or concentration matrix  \Omega \equiv \Sigma^{-1} . A classical result states that the (i, j)-th element of  \Omega ],  w_{ij} , is proportional to the partial correlation coefficient between  x_i and  x_j , cf. [3]. In other words, an edge between vertices i and j exists only if  w_{ij} \neq 0 . Hence, estimating  \mathcal{E} turns out to be the same problem as estimating  \Omega . Thus, a partial correlation network can also be seen as a graphical representation of the covariance structure of the high-dimensional random vector x. Due to the seminal work of [4], arguing that most networks are sparse (i.e. the majority of entries in  \Omega are equal to zero) and supported by subsequent empirical analyses, estimating  \mathcal{E} , respectively  \Omega , constitutes a covariance selection problem as introduced in the seminal work of [5]. Since  \mathcal{O}(N^2) parameters determine the network structure, the case of small N and relatively large T, i.e. number of observations, can be tackled with classical estimation methods as, for example, described in [6]. However, in the examples mentioned in the introduction the opposite is usually the case. That is, N is big relative to T and thus, classical methods of estimating  \Omega stop working and we are in need for so called big data methods which are capable of handling cases with a huge amount of unknown parameters relative to the number of samples T. A first proposal to tackle this problem in context of network estimation can be dated to [2] who apply a neighborhood selection (i.e. selecting the edges in a neighborhhood of vertix i) approach based on the Least Absolute Shrinkage and Selection Operator (LASSO). However, this approach only works well for unweighted graphs due to its sequential structure of first estimating the set of edges and then determining the weights based on the previously found restrictions on the concentration matrix. In case of weighted networks the LASSO approach of [7] seems more promising in also correctly determining the weights associated with each edge. However, a major drawback of these methods is that they are only suitable for i.i.d. data and thus, neglect major aspects, such as the long-run or dynamic behavior, of time series.

Illustration of sparsity for a fission yeast protein interaction network. One can clearly see that, by far, not all vertices are connected. Taken from [13].

Figure 2: Illustration of sparsity for a fission yeast protein interaction network. One can clearly see that, by far, not all vertices are connected. Taken from [13].

Moving to Time Dependent Systems

One of the first attempts to relax the i.i.d. assumption on the data was brought forward by [8] who consider smooth variations in the network structure. That is, these authors allowed for smooth changes in the concentration matrix, still maintaining the independence assumption of the data. Mladen Kolar and Eric P. Xing, together with various collaborators, build on the proposal of [8] to further refine methods which relax the assumption of identically distributed data. However, these authors do not yet allow for time dependencies of the data. A first proposal to overcome this hurdle was brought forward by [9] who consider general processes \mathbf x_t which possess a VAR(\infty) -representation (and to be approximated by a VAR(p)-process for estimation) and thus, allow them to redefine partial correlation networks for time-dependent processes. We will now quickly review their approach to gain some intuition before moving to time varying networks based on  dependent data. Therefore, let \mathbf x_t follow a stationary VAR(p)-process

(1)   \begin{equation*} \mathbf x_t=\sum\limits_{k=1}^p\mathbf A_k\mathbf x_{t-k}+\boldsymbol\epsilon_t,\quad\boldsymbol\epsilon_t\sim w.n.(0,\boldsymbol\Sigma_\epsilon) \end{equation*}

where \mathbf A_k are N\times N-matrices of parameters. Now, recall that the precision matrix of \mathbf x_t is defined to be the inverse of the covariance matrix of \mathbf x_t. It is easy to show, under suitable conditions, that the long-run covariance, \boldsymbol\Sigma_{\mathbf x}, of \mathbf x_t is given by

\boldsymbol\Sigma_{\mathbf x}=\left(\mathbf I_N-\sum\limits_{k=1}^p\mathbf A_k\right)^{-1}\boldsymbol\Sigma_\epsilon\left(\mathbf I_N-\sum\limits_{k=1}^p\mathbf A_k^\top\right)^{-1}

and thus, the long-run precision matrix by

(2)    \begin{align*} \boldsymbol\Omega_{\mathbf x}\equiv\boldsymbol\Sigma_{\mathbf x}^{-1} & =\left(\mathbf I_N-\sum\limits_{k=1}^p\mathbf A_k^\top\right)\boldsymbol\Sigma_\epsilon^{-1}\left(\mathbf I_N-\sum\limits_{k=1}^p\mathbf A_k\right)^{-1}\\ & \equiv(\mathbf I_N-\mathbf G)^\top\mathbf C(\mathbf I_N-\mathbf G) \end{align*}

Thus, the long-run network of the process \mathbf x_t can be characterized via the set of edges

(3)    \begin{equation*} \mathcal E_{\mathbf x}=\Big\{(i,j)\in\mathcal V\times\mathcal V:\,\,-\omega_{\mathbf x}^{ij}/\sqrt{\omega_{\mathbf x}^{ii}\omega_{\mathbf x}^{jj}}\neq0\Big\} \end{equation*}

where \omega_{\mathbf x}^{ij} denotes the (i,j)-th entry of \boldsymbol\Omega_{\mathbf x}.

From Equation (3) it is straightforward to see that estimating the long-run network of \mathbf x_t entails estimating the parameters of the VAR(p)-process and the inverse of the covariance matrix of the white noise innovations \boldsymbol\epsilon_t. Recalling that the network structure usually is sparse this can be done, for example, by applying LASSO techniques to both problems separately. In particular, [9] propose to estimate G by estimating each of the N VAR(p)-equations by the adaptive LASSO separately (due to dimensionality issues) and to estimate C using the iterative LASSO approach of [7].

A final remark is now in order. Equation (2) allows us to disentangle two different effects influencing the long-run network structure. That is, we can model effects to the long-run network due to a) Granger causality within \mathbf x_t via the G-component (also referred to as the Granger network, a weighted directed graph) and b) correlations within the innovations through C (also referred to as the contemporaneous network, a weighted undirected graph). This has the nice feature that we  can gain some more insights into the structures of the long-run network underlying \mathbf x_t and their origins. However, this approach still does not allow us to model time evolving networks.

Including Time Variations Into the Network Structure

Based on Equations (1)–(3) it seems natural to model time variation via varying VAR-parameters and time varying covariance matrices for the innovations. For simplicity, we will only focus on VAR-processes with structural breaks:

(4)    \begin{align*} \mathbf x_t=\sum\limits_{k=1}^{p_l}\mathbf A_{k,l}\mathbf x_{t-k}+\boldsymbol\epsilon_{t,l},\quad&\boldsymbol\epsilon_{t,l}\sim w.n.(0,\boldsymbol\Sigma_{\epsilon,l}),\,\, \\ & t\in[t_{l-1};t_l)\end{align*}

and we assume that there is a structural break at t=t_l when either the lag order, an element of the parameter matrices or of the covariance matrix changes at t_l. If this process is stationary on each segment l, t\in[t_{l-1};t_l), we shall call it a piece-wise stationary process. On each such segment where the parameters do not change, the long-run network can then be retrieved from

(5)    \begin{equation*} \Omega_{\mathbf x,l}=[\mathbf I_N-\mathbf G_l]^\top\mathbf C_l[\mathbf I_N-\mathbf G_l]. \end{equation*}

The estimation of \mathbf G_l is more involving, compared to time stable networks, since we also have to estimate the number and location of break points and we shall now briefly outline a possible approach to tackle this problem elegantly. In what follows, we will estimate each VAR-equation separately. Thus, take x_t to be a generic element of  \mathbf x_t (we suppress the subscript i to ease notational burden) and consider its representation derived from (4)

(6)   \begin{equation*} x_t=\sum\limits_{k=1}^{p_l}\mathbf a_{k,l}\mathbf x_{t-k}+\epsilon_{t,l}\equiv\mathbf g_l\tilde{\mathbf x}_{t-1}+\epsilon_{t,l},\quad t\in[t_{l-1};t_l) \end{equation*}

where \mathbf a_{k,l} denotes the i-th row of \mathbf A_{k,l}, \mathbf g_l the i-th row of \mathbf G_l and \tilde{\mathbf x}_{t-1}=\operatorname{vec}(\mathbf x_{t-1},...,\mathbf x_{t-p}). Moreover, let \mathbf x^0=(x_1,..,x_T)^\top, \boldsymbol\epsilon=(\epsilon_1,...,\epsilon_T)^\top and

 \mathbf X=\begin{pmatrix} \tilde{\mathbf x}_0^\top & 0 & \ldots & 0\\ \tilde{\mathbf x}_1^\top & \tilde{\mathbf x}_1^\top & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ \tilde{\mathbf x}_{T-1}^\top & \tilde{\mathbf x}_{T-1}^\top & \ldots & \tilde{\mathbf x}_{T-1}^\top\\ \end{pmatrix}.

Finally, define \boldsymbol\beta_1=\mathbf g_1^\top and

 \boldsymbol\beta_t=\begin{cases} \mathbf g_l^\top-\mathbf g_{l-1}^\top & \mbox{whenever } t=t_l\\ \mathbf 0 & \mbox{otherwise} \end{cases}

and let \boldsymbol\beta=\operatorname{vec}(\boldsymbol\beta_1,...,\boldsymbol\beta_T). Thus, (6) can be rewritten in matrix form as

(7)    \begin{equation*} \mathbf x^0=\mathbf X\boldsymbol\beta+\boldsymbol\epsilon \end{equation*}

By construction of \boldsymbol\beta, estimation of (7) constitutes a sparse estimation problem and thus LASSO techniques render themselves a suitable choice. For example, we can utilize, due to the model specific sparsity structure, the (sparse) group LASSO approach:

(8)    \begin{align*} \hat{\boldsymbol\beta}=\arg\min\limits_{\boldsymbol\beta}& \|\mathbf x^0-\mathbf X\boldsymbol\beta\|_2^2+(1-\alpha)\lambda_x\sum\limits_{k=1}^T\|\boldsymbol\beta_k\|_2 \\ & +\alpha\lambda_x\|\boldsymbol\beta\|_1, \end{align*}

where \lambda_x is a regularization parameter and \alpha\in[0;1] controls between (number of structural breaks) and within (number of zero parameters on each segment) sparsity of the constructed groups. To restore the missing oracle property of the LASSO and its extensions, we can modify Equation (8) to an adaptive approach using some pre-estimators. Of course, there are also other methods available, such as the parsimoniously time varying VAR-approach of [12].

To estimate the time varying contemporaneous network \mathbf C_l we follow Kolar and Xing (2012) and employ their temporal-difference LASSO (TD-LASSO) which utilizes a neighborhood selection set up similar to [2]. As mentioned earlier, due to the sequential procedure of the neighborhood selection procedure, other methods might be preferred but, to the best of my knowledge, another proposal to solve this problem does not yet exist (except for solutions with stronger assumptions, e.g., imposing certain probability structures on the edge formation). Due to heavy notational burden of the TD-LASSO in addition to space constraints, we refer the interested reader to [10] for a presentation of this approach. Furthermore, we note that the TD-LASSO is only capable of determining whether there exists an edge between vertices i and j or not. That is, we cannot determine the weight associated with this edge via the TD-LASSO. However, we can combine it with the iterative procedure of [7] to estimate the weights of the edges on each segment.

Concluding, we reviewed the most important terminology for networks in a statistical setting. Afterwards, we outlined estimation of partial correlation networks based on i.i.d. data, time-varying independent data and extensions to networks based on dependent data and time varying structures.

References

[1] M. Kolar, L Song, A. Ahmed and E.P. Xing, 2010, Estimating Time-Varying Networks, The Annals of Applied Statistics, 6, 2069–2106
[2] N. Meinshausen and P. Bühlmann, 2006, High Dimensional Graphs and Variable Selection with the LASSO, Annals of Statistics, 34, 1436–1462
[3] S.L. Lauritzen, 1996, Graphical Models (Oxford Statistical Science Series), Oxford University Press, USA
[4] S. Milgram, 1967, The Small-World Problem, Psychology Today, 1, 61–67
[5] A.P. Dempster, 1972, Covariance Selection, Biometrics, 28, 157–175
[6] D. Edwards, 2000, Introduction to Graphical Modelling (Springer Texts in Statistics), Springer-Verlag New York, USA
[7] J. Peng, P. Wang, N. Zhou and J. Zhu, 2009, Partial Correlation Estimation by Joint Sparse Regression Models, Journal of the American Statistical Association, 104, 735–746
[8] S. Zhou, J. Lafferty and L. Wasserman, 2008, Time Varying Undirected Graphs, In Proc. Conf. Learning Theory, 455–466
[9] M. Barigozzi and C. Brownlees, 2014, NETS: Network Estimation for Time Series, Working Paper
[10] M. Kolar and E.P. Xing, 2012, Estimating Networks with Jumps, Electronic Journal of Statistics, 6, 2069–2106
[11] M. Kolar and E.P. Xing, 2011, On Time Varying Undirected Graphs, Proc. of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)
[12] L. Callot and J.T. Kristensen, 2014, Vector Autoregressions with Parsimoniously Time Varying Parameters and an Application to Monetary Policy, Working Paper
[13] V. Pancaldi, Ö.S. Sarac, C. Rallis, J.R. McLean, M. Prevorovsky, K. Gould, A. Beyer and J. Bähler, 2012, Predicting the Fission Yeast Protein Interaction Network, G3, 2, 453–467
[14] http://mathinsight.org/definition/directed_graph
[15] http://mathinsight.org/definition/undirected_graph

Text by: Mario Rothfelder

Comments