SymmetricalDataSecurity: Structural studies of the global networks exposed in the Panama papers

Although the metrics computed on the global Panama network are instructive, they have limited interpretive validity, due to considerable heterogeneity both in the types of nodes and relations used in the network model. To obtain useful insights, we derive three other directed, unlabeled networks from the original set of triples. We also consider, by way of a baseline, the unlabeled multi-network in the previous section as a fourth unlabeled, directed multi-network designated as G. These three other networks have specific link semantics, as we describe below, and hence, are more amenable to tractable and interpretable structural analysis.

1. G_o: This network is constructed by only retaining links in G that are typed as officer_of in the full graph G.
2. G_i: This network is constructed by only retaining links in G that are typed as intermediary_of in the full graph G.
3. G_io: This network is the union of the two networks above. It is also the same as G except with registered_address links removed.
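The construction of the three derived networks can be sketched as a filter over the triples. This is a minimal illustration only: the toy triples, the node identifiers and the helper name `derive_network` are hypothetical, and the real data would be read from the Panama Papers edge lists.

```python
from collections import Counter

# Hypothetical toy triples (start_id, relation, end_id).
triples = [
    ("o1", "officer_of", "e1"),
    ("o2", "officer_of", "e1"),
    ("i1", "intermediary_of", "e1"),
    ("i1", "intermediary_of", "e2"),
    ("e1", "registered_address", "a1"),
]


def derive_network(triples, keep_types):
    """Keep only edges whose relation is in keep_types, then drop the label."""
    return [(s, t) for s, rel, t in triples if rel in keep_types]


# G keeps every triple; the other three networks keep selected link types.
G = [(s, t) for s, _, t in triples]
G_o = derive_network(triples, {"officer_of"})
G_i = derive_network(triples, {"intermediary_of"})
G_io = derive_network(triples, {"officer_of", "intermediary_of"})

# G_io is the union of G_o and G_i, compared here as multisets of edges.
union_matches = Counter(G_io) == Counter(G_o) + Counter(G_i)
```

Edges are kept as a list rather than a set so that repeated links survive, which matches the multi-network reading of G.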

One caveat from the description above is that, while we do control for links in the networks below, the nodes are still heterogeneously typed i.e. the officer_of network (G_o) contains both officers and non-officers, since the end_id of a triple that has an officer_of type relation is a non-officer. In a later section, we also consider ‘higher-order’ networks where we perform a systematic closure to obtain networks with homogeneously typed nodes.

Once constructed, we do not distinguish between nodes of different types; nor are link labels considered. Hence, the network becomes unlabeled (at the edge level) and untyped (at the node level). This is necessary for computing standard structural metrics. Specifically, we study and compare these networks from various perspectives (using G as a baseline, where appropriate), first by plotting the degree distributions in Fig. 5. We find once again that the distributions are power-law, with only one exception (the in-degree distribution of G_i). At first glance, this would seem to suggest that these networks are not very different from ordinary social or organizational networks, which tend to have similar power law distributions. However, some other metrics, subsequently described, paint a different picture.

Fig. 5

In-degree and out-degree distributions (on a log-log scale) of G_i, G_o and G_io. G_i has a trivial in-degree distribution as nodes have either degree 1 or 0 with frequencies 213,634 and 14,074 respectively. For this reason, we do not reproduce it as a plot below. The y-axis represents the empirical frequency of the degree in all cases
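As a rough sketch of how such distributions can be tabulated, the snippet below counts, for a directed edge list, how many nodes have each in-degree and out-degree. The edge list is a hypothetical intermediary-style example in which every target has exactly one incoming link, which reproduces in miniature the trivial 0-or-1 in-degree pattern described for G_i.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_hist, out_hist), each mapping degree -> number of nodes."""
    nodes = {n for edge in edges for n in edge}
    in_deg = Counter(t for _, t in edges)
    out_deg = Counter(s for s, _ in edges)
    # Nodes absent from a counter have degree 0 in that direction.
    in_hist = Counter(in_deg[n] for n in nodes)
    out_hist = Counter(out_deg[n] for n in nodes)
    return dict(in_hist), dict(out_hist)


# Hypothetical edges: two intermediaries serving four entities.
edges = [("i1", "e1"), ("i1", "e2"), ("i1", "e3"), ("i2", "e4")]
in_hist, out_hist = degree_distributions(edges)
```

Plotting the histogram keys against their values on log-log axes would give plots in the style of Fig. 5.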

To assess connectivity, we consider the undirected equivalents of these networks, and calculate the distribution of connected components as a function of size (in terms of number of nodes within the connected component). The connected component size distributions also obey a power-law distribution (Fig. 6), showing that, although the networks are disconnected, there is high connectivity in large portions of the networks. Lack of connectivity, and a systematic distribution of component size, is one indication that the networks are dissimilar from social networks, which tend to be connected (or almost connected), and where the diameter of the network has been shown to decrease over time due to densification phenomena (Leskovec et al. 2007). In other words, to the extent supported by available data, the Panama Papers do not seem to exhibit any ‘small-world’ phenomenon (Watts 2004).
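The component size distribution can be sketched with a small union-find structure that ignores edge direction. The edge list below is hypothetical and the function name is ours; the approach is a standard one and not necessarily the implementation used for the results above.

```python
from collections import Counter


def component_sizes(edges):
    """Sizes of connected components, treating directed edges as undirected."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    roots = Counter(find(n) for n in list(parent))
    return sorted(roots.values(), reverse=True)


# Hypothetical edges forming one 3-node and two 2-node components.
edges = [("a", "b"), ("c", "b"), ("d", "e"), ("f", "g")]
sizes = component_sizes(edges)
size_hist = Counter(sizes)  # component size -> number of components
```

The `size_hist` mapping is the quantity plotted in a component size distribution: size on the x-axis, frequency on the y-axis.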

Fig. 6

Connected component size distributions of G, G_i, G_o, and G_io. The x-axis is the number of nodes in a connected component and the y-axis is the frequency of that size in the dataset

The specific power-law exponents are tabulated^{Footnote 11} in Table 3. Excluding the in-degree distribution of G_i, whose exponent is not interpretable due to the trivial nature of the distribution, the in-degree distributions of the other networks exhibit higher exponents than the corresponding out-degree distributions. While all the values lie between 1.5 and 3.5, the usual range in ordinary social and information networks is between 2 and 3, which holds only for the in-degree distribution of G_o and the out-degree distribution of G in the table. The connected component size distribution exponent is close to 1, except for G_o, where it reaches a higher value of 1.568.
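The text available here does not say how the exponents in Table 3 were fitted. Purely as an illustration, and as an assumption on our part, the sketch below uses the standard continuous maximum-likelihood estimator alpha = 1 + n / Σ ln(x_i / x_min), applied to values at or above a chosen cutoff x_min. For small integer degrees this continuous form is only an approximation.

```python
import math


def powerlaw_alpha_mle(values, x_min=1.0):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha).

    Only values >= x_min are used. This is an illustrative estimator,
    not necessarily the fitting procedure behind Table 3.
    """
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical sample: ln(1) = 0 and ln(e) = 1, so alpha = 1 + 3/2.
alpha = powerlaw_alpha_mle([1.0, math.e, math.e], x_min=1.0)
```

The estimate is sensitive to the choice of x_min, so any comparison of exponents across networks should use a consistently chosen cutoff.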

Table 3 Exponents for the various power-law distributions characterizing the structure of G, G_i, G_o, and G_io. Because of the trivial nature of the G_i in-degree graph, its in-degree exponent does not have the usual interpretive validity (as it does not remotely resemble a power law distribution); hence, we denote it with an x

Many of the network measures computed in Table 2 are also computed for the selectively constructed networks in Table 4. In comparing the statistics in Table 4 with those of the original network in Table 2, we observe that officer_of nodes are more heavily represented in the non-singleton portion of the network than intermediaries (almost 359k vs. 228k), an increase of 57%, whereas the increase in the number of edges is relatively smaller (37% increase for the simple network and 45% increase for the multi-network). Transitivity in all networks is very low, either zero or near zero. Density is very low in all networks as well, and the assortativity coefficient (small negative values throughout) shows consistent but small disassortativity. Similar to the original network (though not tabulated here), all networks were once again found to be ‘almost’ simple. The densities of the multi-network equivalents of G, G_i, G_o and G_io were respectively found to be 4.308e-06, 8.240e-06, 4.803e-06 and 3.342e-06 (nearly identical to the simple graph densities). A similar observation was made for the number of edges in the multi-network equivalents.
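To make the ‘almost simple’ comparison concrete, the sketch below computes density for a directed edge list both with and without repeated links. It assumes the common definition m / (n(n-1)) for a directed graph on n nodes with m edges; the text does not state which formula was used, and the edge list is hypothetical.

```python
def directed_density(edges, simple=True):
    """Density m / (n * (n - 1)) of a directed graph given as an edge list.

    With simple=True, repeated (u, v) links are collapsed to one edge;
    with simple=False, every link in the multi-network is counted.
    """
    nodes = {n for edge in edges for n in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    m = len(set(edges)) if simple else len(edges)
    return m / (n * (n - 1))


# Hypothetical multi-network with one repeated link.
edges = [("a", "b"), ("a", "b"), ("b", "c")]
simple_density = directed_density(edges, simple=True)   # 2 edges over 6 pairs
multi_density = directed_density(edges, simple=False)   # 3 edges over 6 pairs
```

When the two values are nearly equal, as reported above, repeated links are rare and the multi-network behaves almost like a simple graph.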

Table 4 Network measures computed for G (reproduced as a baseline from Table 2), G_i, G_o, G_io

We also compute and tabulate the local and ‘meso-scale’ metrics of the four networks in Table 5. As a first step, we tabulate the maximum core number of both the whole network and the largest component; the two values turned out to be the same in each of the four networks. To define the maximum core number, we note first that a k-core is a maximal subgraph in which every node has degree k or more within that subgraph. The core number of a node is the largest value k of a k-core containing that node. The maximum core number of a graph or sub-graph is then simply the largest core number observed in that graph, with the maximum taken over all nodes. The maximum core numbers show that there are ‘tight clusters’ where all nodes have degree at least 6 or 7. G_i is an exception, which shows that the intermediary network behaves differently from the other networks in this regard.
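The core-number definition above can be illustrated with the usual peeling procedure: repeatedly remove a node of minimum remaining degree and record the largest such minimum seen so far. The version below is a simple quadratic-time sketch on a hypothetical graph, not an efficient implementation for networks of this size.

```python
def core_numbers(edges):
    """Core number of every node in the undirected simple graph of edges."""
    adj = {}
    for u, v in edges:
        if u != v:  # self-loops are ignored
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        # Peel a node of minimum remaining degree.
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nb in adj[node]:
            if nb in remaining:
                degree[nb] -= 1
    return core


# Hypothetical graph: a triangle (a, b, c) with a pendant node d.
core = core_numbers([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")])
max_core = max(core.values())
```

In this toy graph the triangle forms a 2-core, while the pendant node only belongs to the 1-core.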

Table 5 Local and ‘meso-scale’ metrics characterizing the structure of the undirected equivalents of G, G_i, G_o, and G_io. Details on the metrics, as well as the core-periphery algorithm used for computing the partition of nodes in each network into cores and peripheries are provided in the text

A bridge in a connected component is an edge that, were it to be removed, would lead to the connected component ‘breaking’ into two connected components. The results in Table 5 show that the percentage of bridges (as a percentage of the number of edges) is high in all networks, and is extreme (100%) in the largest component in G_i. In other words, every edge in the largest component in G_i is a bridge, which means the component contains no cycles, as is the case for graphs that are star-like or linear chains (or a combination of the two).
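The bridge definition translates directly into a brute-force check: remove each edge in turn and test whether the number of connected components grows. The sketch below does exactly that on hypothetical toy graphs. It is far too slow for the full networks, where a linear-time depth-first approach would be needed, but it makes the definition concrete.

```python
def count_bridges(edges):
    """Return (number of bridges, number of edges) for an undirected simple graph."""
    und = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = {n for e in und for n in e}

    def n_components(edge_set):
        adj = {n: set() for n in nodes}  # keep nodes even if isolated
        for e in edge_set:
            u, v = tuple(e)
            adj[u].add(v)
            adj[v].add(u)
        seen, count = set(), 0
        for start in adj:
            if start in seen:
                continue
            count += 1
            seen.add(start)
            stack = [start]
            while stack:
                x = stack.pop()
                for y in adj[x] - seen:
                    seen.add(y)
                    stack.append(y)
        return count

    base = n_components(und)
    bridges = sum(1 for e in und if n_components(und - {e}) > base)
    return bridges, len(und)


# A star: every edge is a bridge (100%), as reported for G_i.
star_result = count_bridges([("c", "x"), ("c", "y"), ("c", "z")])
# A triangle with a pendant node: only the pendant edge is a bridge.
mixed_result = count_bridges([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")])
```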

We also studied the core-periphery structure of each network by using the Kojaku-Masuda algorithm with the configuration model^{Footnote 12} (Kojaku et al. 2019). This algorithm partitions the nodes in the component into a set of cores and a set of peripheries. For full details, including theoretical justifications on core-periphery models, we refer the reader to the seminal references in Baldwin et al. (2011); Forslid and Ottaviano (2003); Krugman (1991). Herein, we note that the intuitive idea is to detect nodes where the network seems to be concentrated (the cores), with the peripheries representing the ‘outlying’ nodes. While a network is not inherently spatial (which is the context in which the modern core-periphery model was introduced, at least in Krugman’s work on New Economic Geography (Krugman 1991)), there is some intuitive semblance of concentration in most networks. In counting the number of cores in all networks, we see that only G_i seems to have a single core, lending further credence to the earlier hypothesis that the network is star-like. The other networks have roughly equal numbers of cores and peripheries. This suggests that either (i) the algorithm is not appropriate for this kind of network, and a novel approach may be required to determine what the cores and peripheries are (if it is theoretically unjustifiable to have so many cores in the network), or (ii) there truly are many cores in the networks (which we claim is plausible, due to the transactional, global but highly decentralized nature of the entities in the Panama Papers^{Footnote 13}), suggesting a degree of robustness that we comment on in a later section. Since there are no clear cores, the problem of disrupting such a network becomes much harder for federal and transnational agencies tasked with minimizing the impacts of money laundering and organized crime.

We also computed the number of undirected connected triples in each graph.^{Footnote 14} A connected triple in an undirected graph is a 3-tuple of nodes (a,b,c) such that there is an edge between a and b and an edge between b and c. Since the connected triple is undirected, triples (a,b,c) and (c,b,a) are treated the same (and only counted once). For example, a triangle would contribute three connected triples. On this measure, the difference emerges in G_o, which has far fewer connected triples (almost an order of magnitude less) than the other graphs. We believe that the reason lies in the network construction itself; namely, if A is an officer of B, then by definition, A is an officer-type node, and B is either an intermediary or an offshore entity. However, since G_o only contains officer-type relations, the only way that B could participate in a connected triple is (i) if A is also the officer of other organizations, in which case B would be the third (or equivalently, first) element of all connected triples where one of the organizations is at the ‘opposite’ end of the triple (either third or first element) with A in the middle, or (ii) if B has other officers, in which case B would be the middle element of (one or more) connected triples with the two officers (one of whom is A) serving as the ‘ends’ of the triples. Since both of these possibilities are likely, we do not observe a value of 0 for G_o; however, the observed value is still very low compared to other networks. This may, in turn, suggest that organizations in the Panama Papers simply do not share many officers, or that organizations do not have many officers to begin with. Since many companies in the Panama Papers are suspected to be shell companies, rather than real businesses, this result can be interpreted in the sociological context of the papers.
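Under the definition above, every connected triple is determined by its middle node b and an unordered pair of b's neighbors, so the count is the sum over nodes of C(d, 2), where d is the node's degree in the undirected simple graph. The sketch below applies this identity to hypothetical edge lists; the function name is ours.

```python
from math import comb


def connected_triples(edges):
    """Number of undirected connected triples: sum over nodes of C(degree, 2)."""
    adj = {}
    for u, v in edges:
        if u != v:  # self-loops cannot form a triple
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return sum(comb(len(nbrs), 2) for nbrs in adj.values())


# A triangle contributes three connected triples, one per middle node.
triangle = connected_triples([("a", "b"), ("b", "c"), ("c", "a")])

# Hypothetical officer_of links: A is an officer of B and C, D is an officer of B.
# The triples are (B, A, C) with A in the middle and (A, B, D) with B in the middle.
officers = connected_triples([("A", "B"), ("A", "C"), ("D", "B")])
```

The officer example mirrors the two cases discussed above: an officer with several organizations, and an organization with several officers.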

The raw s-metric of a graph is defined as the sum (over all edges (u,v)) of the quantity d(u)∗d(v) where d is the degree function. The metric is best interpreted comparatively, across the four networks. Once again, we find that G_o has a smaller s-metric than the other networks. In general, this means that the degrees of nodes in participating edges are simply not high compared to the other networks. However, because the s-metric grows very quickly with even a few reasonably high-degree nodes, the difference may not be as stark as the numbers suggest.
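The raw s-metric can be computed directly from its definition. The sketch below works on the undirected simple graph of a hypothetical edge list and compares a star with a path that has the same number of nodes and edges.

```python
def s_metric(edges):
    """Raw s-metric: sum over undirected edges (u, v) of d(u) * d(v)."""
    und = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = {}
    for e in und:
        for n in e:
            degree[n] = degree.get(n, 0) + 1
    total = 0
    for e in und:
        u, v = tuple(e)
        total += degree[u] * degree[v]
    return total


# Star with three leaves: each edge contributes 3 * 1.
star_s = s_metric([("c", "x"), ("c", "y"), ("c", "z")])
# Path on four nodes: contributions 1*2, 2*2 and 2*1.
path_s = s_metric([("a", "b"), ("b", "c"), ("c", "d")])
```

Even at this size the star scores higher than the path, which shows how a single high-degree node inflates the metric.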

Finally, in considering the numbers of unique 3-clique and 4-clique motifs in the networks, we find that there are only 2 triangle motifs (or 3-cliques) in the overall network, corresponding to the two instances of triad 030T in Fig. 3, and no triangles in the other networks. This is in conformance with the extremely low transitivities and densities observed earlier for these networks. The number of 4-motifs was found to be 0 in all networks, which means that the maximum sized clique in all graphs (except G where it is 3) is 2. The graphs are sparse, although they do have interesting structural properties. Taken together, our findings strengthen the claim that, while the Panama Papers embody a complex system with many actors and players spread across the globe, they do not seem to follow the same kinds of laws that other complex systems with social players seem to follow (such as friendship and follower networks of social media platforms).

Country assortativity analysis

A key value proposition in conducting a study of this nature is to analyze the country dependencies in the various networks. We conduct such an analysis using two mechanisms. First, we compute country assortativity for all four networks (Table 6) after removing nodes that (i) either have no country associated with the node, or (ii) have more than one country associated with the node. Country assortativity is a special instance of the broader notion of attribute assortativity (where the attribute is the country, following the two pre-processing steps above), which is defined as:

$$ r = \frac{\operatorname{tr}(M) - \lVert M^{2} \rVert}{1 - \lVert M^{2} \rVert} \tag{2} $$

Table 6 Country attribute assortativity for G, G_i, G_o, and G_io. Nodes in these networks that were associated with more than one country (or with no country at all) were not included in the analysis

Here, r is the assortativity coefficient, tr is the trace, M is the mixing matrix (the joint probability distribution of the specified attribute over the two ends of an edge), and ||M^{2}|| denotes the sum of all elements of M^{2}. Full details and analysis of attribute assortativity may be found in the seminal paper by Newman (2003). The intuition is that, if linked nodes in the network tend to be affiliated with the same country, the country assortativity would be positive and high; otherwise, it would be low (or even negative).
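A minimal sketch of the coefficient follows, assuming that M is estimated from directed edge counts and that the sum of all elements of M^2 is obtained as the sum over countries of (row sum) times (column sum), which is an equivalent way of writing it. The node-to-country mapping is hypothetical, and edges touching a node without exactly one country are skipped, mirroring the pre-processing described above.

```python
from collections import Counter


def attribute_assortativity(edges, attr, symmetric=False):
    """Newman's attribute assortativity r computed from an edge list.

    attr maps node -> single attribute value; nodes missing from attr
    (no country or several countries) are excluded along with their edges.
    With symmetric=True each edge is counted in both directions.
    """
    pairs = Counter()
    for u, v in edges:
        if u in attr and v in attr:
            pairs[(attr[u], attr[v])] += 1
            if symmetric:
                pairs[(attr[v], attr[u])] += 1
    total = sum(pairs.values())
    trace = sum(c for (a, b), c in pairs.items() if a == b) / total
    row, col = Counter(), Counter()
    for (a, b), c in pairs.items():
        row[a] += c / total
        col[b] += c / total
    sq_norm = sum(row[k] * col[k] for k in row)  # sum of all entries of M^2
    return (trace - sq_norm) / (1 - sq_norm)


# Hypothetical countries for four nodes.
country = {"a": "CH", "b": "CH", "c": "PA", "d": "PA"}
r_same = attribute_assortativity([("a", "b"), ("c", "d")], country)
r_cross = attribute_assortativity([("a", "c"), ("d", "b")], country)
```

The sketch is undefined when there are no usable edges or when every edge joins nodes of one single country, since the denominator is then zero.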

While nodes that have no country associated with the node, or that have more than one country associated with the node, may potentially be of interest to investigators (or are interesting elements of study in their own right), they constitute a relatively small fraction of the overall network; for each of G,G_i,G_o and G_io, the percentage of such nodes (i.e. associated with no country or multiple countries) was found to be 17.4%, 1.2%, 26.2% and 17.4% respectively. The officer network is especially interesting because it seems to suggest that we either do not know the country associated with an officer, or the officer is associated with multiple countries, which may be a noisy artifact of the data. After removing such nodes, we are guaranteed to only have one country associated with each node, leading to a well-defined and interpretable analysis of country assortativity in the various networks.

We find that the country assortativity is generally high across all four networks, but is especially high in the intermediary network G_i. In other words, when A is an intermediary of B, they are very likely to belong to the same, rather than different, country. This is what we would expect of ordinary networks, rather than of highly illicit ones. However, one caveat that should be noted is that the companies and intermediaries in the network are not ‘ordinary’ in the usual sense, i.e. many of the organizations in the Panama Papers are ‘offshore entities’ that may themselves be associated with a bigger company or individual. Offshore entities in a given country would, for obvious reasons, prefer to partner with an intermediary in that country (e.g. a Swiss offshore entity would intuitively prefer to work with a Swiss law firm or accountant). In fact, the lower values of country assortativity for the other networks may suggest that offshore entities are set up in a country precisely to transact with intermediaries in that country.

Since the networks studied thus far have not been connected, we also conducted experiments via a second mechanism to understand how the country distributions are reflected in connected components of various sizes. The data in Table 6 seem to suggest that nodes from the same country generally tend to be linked to each other. However, this analysis is inherently limited because it only considers information at the edge level. To gain a better sense of country representations and mixtures, we computed the information-theoretic entropy of the empirical probability distribution of countries within each connected component, again ignoring nodes that have no country or multiple countries. Specifically, the information-theoretic entropy may be defined as follows. Given an empirical probability distribution P, the information-theoretic entropy is given by −Σ_x P(x) log_b(P(x)), where the sum is taken over all discrete observations x. The base b could be the natural exponent e, 2 or 10, but these do not give comparable or normalized values when there are many 0 entries, as is the case when computing the country entropy for a given connected component (i.e. most countries do not occur in the component and have probability 0). Therefore, we set b to be the number of unique countries observed across all nodes in the connected component. Since this number can vary across components, b can also vary with the component. An advantage of using a varying b, however, is that the entropy is always between 0 and 1, where a value skewing towards 0 indicates that very few countries (and in the case of 0, only one country) are represented in the component, while a value that skews towards 1 implies high diversity. As a result, the entropies across components become comparable, as they are on a normalized [0,1] scale.
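The normalized entropy can be sketched as follows for the list of countries attached to the nodes of one connected component. One detail is an assumption on our part: when only one country is present the base b would be 1, for which the logarithm is undefined, so the sketch returns 0 in that case, consistent with the interpretation given above. The country codes are hypothetical.

```python
import math
from collections import Counter


def normalized_country_entropy(countries):
    """Entropy of the country distribution, with log base = number of countries."""
    counts = Counter(countries)
    b = len(counts)
    if b <= 1:
        return 0.0  # a single country: entropy taken to be 0
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n, b) for c in counts.values())


# Evenly mixed component: maximum diversity for the countries present.
h_even = normalized_country_entropy(["CH", "CH", "PA", "PA"])
# Single-country component.
h_single = normalized_country_entropy(["CH", "CH", "CH"])
# Skewed component: mostly one country.
h_skewed = normalized_country_entropy(["CH", "CH", "CH", "PA"])
```

Because the base equals the number of countries present, a uniform mixture always scores 1 regardless of how many countries are involved, which is what makes components of different sizes comparable.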

A plot of entropy vs. connected component size (in terms of number of nodes in the component) is reproduced in Fig. 7. We find, intriguingly, that while there is a negative relationship between entropy and component size (implying that, on average, smaller components tend to be much more diverse while still retaining connectivity, though many small components also have entropy 0 or close to 0), the very largest component (in G, G_o and G_io) has an entropy that is reasonably high (in the range of 0.6), which is unusual considering the size of the component. It suggests coordinated activity across companies and intermediaries in different (but not too many) countries, and may only be amenable to investigation (for illegal or nebulous activity) if national agencies agree to cooperate. By way of contrast, when we plot the density of the network instead of entropy (Fig. 8), components with more nodes are found to be less dense, exactly as would be expected for ordinary networks. The density of the largest component is only slightly above 0, suggesting high sparsity, even though the component (by definition) comprises a set of nodes that are connected to one another via at least one path. The difference between the sizes of the largest and second-largest components is also significant in all networks except G_i. This suggests that the largest component may be an interesting subject of study in its own right.

Fig. 7

Entropy of connected components in G, G_i, G_o, and G_io vs. size of connected component measured using number of nodes. The text provides an explanation of the base b used in the logarithm of the entropy formula. The plots are on a semi-log scale, where the x-axis is on the log scale and the y-axis is on a linear scale in the [0,1] range

Fig. 8

Density (expressed as a percentage) of connected components in G, G_i, G_o, and G_io vs. size of connected component measured using number of nodes. The plots are on a semi-log scale (where the x-axis is on the log scale and the y-axis is on a linear percentage scale in the [0,100] range)

Higher-order homogeneously typed networks

An issue with the networks constructed thus far is that they do not model (and in fact, completely ignore) potential relationships that are implied between nodes of the same type through nodes of different types. For example, two individuals may be officers at the same organization, but are not directly linked in the network (in fact, there is no relationship in the Panama Papers data, as currently available, that models any connection between two officer-type nodes). To model a connection between these officers, we have to take into account the fact that they share a neighbor (in the overall graph) that belongs to an organization-type node (i.e. the node is either an offshore entity or intermediary). By adding an edge between two nodes (of the same type) in the ‘first-order network’ (of which we saw four examples in the previous sections) if they indeed share such a (differently typed) neighbor, we obtain a more complete network that we denote as a higher-order homogeneously typed network. We consider three such networks in this section:

1. $G_{o}^{*}$: This is a network of officers, where we declare a link between two officers A and B iff either (i) there is a direct link between them in the original triples set, or (ii) there are two triples in the triples set of the form (A,officer_of,C), (B,officer_of,C), i.e. the two officers share a common organization. The second type of link (which is ‘indirect’) constitutes almost all links in the network as there are virtually no direct links between officers. We believe that this lack of direct links has preempted a structural study of an ‘officer’ social network implied by the Panama Papers.
2. $G_{o}^{o}$: The nodes in this network are offshore entity nodes, and an edge exists between two offshore entities iff either (i) there is a direct link between the offshore entities (which is never the case empirically), or (ii) the two offshore entities share an officer.
3. $G_{o}^{i}$: This network is similar to the above (nodes are offshore entity nodes) but we create a link between two offshore entity nodes if they share an intermediary.
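The ‘indirect’ links in all three higher-order networks follow the same pattern: two nodes of one type are linked when they share a neighbor of another type. A minimal sketch of that closure is given below. The function name, the `project_on` parameter and the toy pairs are hypothetical, and direct links (case (i) in the definitions) are not handled since they are reported to be virtually absent.

```python
from itertools import combinations


def shared_neighbor_projection(pairs, project_on="source"):
    """Link same-type nodes that share a differently typed neighbor.

    pairs: iterable of (source, target), e.g. (officer, organization).
    project_on="source" links sources that share a target;
    project_on="target" links targets that share a source.
    """
    groups = {}
    for s, t in pairs:
        key, member = (t, s) if project_on == "source" else (s, t)
        groups.setdefault(key, set()).add(member)
    links = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            links.add((a, b))
    return links


# Hypothetical officer_of and intermediary_of pairs.
officer_of = [("A", "C1"), ("B", "C1"), ("A", "C2"), ("D", "C2")]
intermediary_of = [("I1", "C1"), ("I1", "C2"), ("I1", "C3")]

officer_net = shared_neighbor_projection(officer_of, "source")       # officers sharing an entity
entity_by_officer = shared_neighbor_projection(officer_of, "target")  # entities sharing an officer
entity_by_interm = shared_neighbor_projection(intermediary_of, "target")
```

Note that a shared neighbor with n members produces a clique of n(n-1)/2 links in the projection, so a single large intermediary can give many entities the same high degree.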

Basic network statistics are reproduced in Table 7. Similar to what we observed for the first-order networks, the higher-order networks were also found to be ‘almost’ simple. The edges and density of the multi-networks were both found to be nearly identical to those of the underlying simple networks, as noted in Table 7. A major difference that we start to observe, at least for $G_{o}^{*}$, is higher assortativity and transitivity. This implies that $G_{o}^{*}$ resembles a social network much more than the first-order networks did. While densities are still low, they are still higher by an order of magnitude compared to the densities tabulated earlier in Table 4.

Table 7 Network measures computed for the higher-order networks ($G_{o}^{*}$, $G_{o}^{o}$, and $G_{o}^{i}$), described in the text. NA means that the metric could not be computed for that network due to computational complexity or resource limitation issues

Most important, unlike the earlier networks, the degree distributions of the higher-order networks in Fig. 9 are unusual in that they do not obey a power-law distribution. There is a significant deviation, in particular, in the mid-degree range; in the $G_{o}^{*}$ network, a set of nodes with degree 100 or higher exhibit deviation from the power law and have higher frequency than the law would indicate. These nodes may be worth investigating from a structural and country standpoint, though we do not conduct this analysis within the work. There is a definite curvature in the $G_{o}^{o}$ network, but towards the right of the curve, we see a spread (across frequencies) of nodes with very high degrees. Perhaps the most interesting curve with respect to degree distribution is $G_{o}^{i}$, which initially seems to express the usual power-law distribution, but then reverses and exhibits another power-law distribution with a positive exponent, a rare occurrence that we could not find an explanation or theory for in the existing network science literature.

Fig. 9

Degree distributions of higher-order networks. a, b and c respectively refer to networks $G_{o}^{*}$, $G_{o}^{o}$, and $G_{o}^{i}$. The x-axis is the degree, and the y-axis is the frequency of the degree

One hypothesis (for explaining these unusual observations) that we are currently investigating through further analysis is that the offshore entities in Fig. 9c can be sub-divided into different markets. Smaller offshore-entities may be set up, for example, to fulfill a formal requirement in the country in which they hope to transact or do business with another entity (possibly through an intermediary). Larger offshore-entities may be set up for acquisitions or other purposes. If we are able to systematically segment offshore entities into such ‘markets’, we may be able to observe a standard power-law distribution for some of the markets, but others may yield interesting insights obeying a separate set of laws.

A similar theory may apply to the other higher-order networks. For example, in both of the other higher-order networks illustrated in Fig. 9 there is deviation in some sections of the networks, but overall, the networks seem to follow an approximate power-law distribution. The deviation may also be suggestive of a separate ‘market’ (e.g., there are different categories of officers, such as shareholders and beneficiaries, and separate networks for these different categories of officers may result in a correction of the observed deviation), which may be corrected by isolating those nodes (and their relationships) and treating them as a separate system. However, this raises the question of how to segment nodes into markets in a systematic manner. We leave this issue for future research to investigate.

Another piece of evidence that the unexpected degree distributions are likely not due to data artifacts or random noise is the power-law distributions of the connected component sizes (Fig. 10). The plots in the figure show no such unexpected behavior; they are structurally analogous to what we found for the four selectively constructed networks earlier. The evidence lends further credence to the hypothesis that there is a fundamental difference between the phenomena expressed in the Panama Papers and in typical social and organizational networks that exhibit power-law distributions.

Fig. 10

Connected component size distributions (where size is measured using number of nodes) of higher-order networks. a, b and c respectively refer to networks $G_{o}^{*}$, $G_{o}^{o}$, and $G_{o}^{i}$. The x-axis is the size of the component in terms of the number of nodes, and the y-axis is the frequency of the size

from Hacker News https://ift.tt/33tebG0

Saturday, October 3, 2020
