A problem that arise a lot when you play with data is to figure out how things are connected. It could be for example to determine from all your friends, and your friends connection, and your friends friends connections, … to whom you are directly or indirectly connected, or how many degrees of separation you have with such and such connection. Luckily there are some tools at your disposal to perform such analysis. Those tools comes under the umbrella of Network Theory and I will cover some basic tricks in this post.
First let’s go with some terminology. Nodes are the things we are connecting e.g. you, your friends, your friends friends. Vertex are how those nodes are connected. For example, below, node 0 is connecting to node 1 and 2 using two vertices to describe those connections. Node 1 is connecting to node 3 via one vertex, etc. For this first example, we use uni-directional vertex, but nothing prevents us from using bi-directional vertex. In general, if all vertex are bi-directional we will talk of non-directed graph, which would be the case of friends (usually) since you know your friend, and he knows you as well!
A first important concept to introduce in network analysis is that of an Adjacency matrix. The adjacency matrix is a matrix representing the vertex connections between the nodes. The first row in the Adjacency matrix represent connections of node 0. Thus, node 0 is connecting to node 1 and 2, but not to itself or to node 3. So the first row is 0, 1, 1, 0. Second row represent connection of node 1, which is only connecting to node 3. So the second row is 0, 0, 0, 1. Note that we could have bi-directional connections, in such a case the connection would appear on both the row and the column, but this is not the case in this example.
By inspecting the Adjacency matrix, we can reconstruct the Node/Vertex graph. It informs us on the first hop connections: who are your friends. But how can we know about the second hop connections e.g. node 0 is connected to node 3 via node 1 and node 2? A really simple way is to multiply the adjacency matrix by itself (A*A). The result of this multiplication are the second hop connection. Here, we see that node 0 is connecting through 2 hops to node 1 (via node 2), and is connecting through 2 hops to node 3. We can even see that there is 2 such connection in two hops from node 0 to node 3. Lastly we see that node 2 is connecting through 2 hops to node 3 (via node 1).
If we were to multiply again A*A by A itself, we would get the three hop connections, which in this case is limited to node 0 being connected to node 3.
In general, the network that will interest us are way bigger than this simple 4 nodes diagram. Also in general, all nodes are not connected to each other node. Well, they say everyone is connected to everyone by six degrees of separations (six hops), but for most other practical applications, not all nodes are connected to each others. Let’s take a look at a bigger example to see how the principles illustrated above can apply at scale. Let’s assume the following non-directional network graph. Here since we have a non-directional network graph, you will see the connection values appears in both the rows and the columns. This special case shows a symmetry about the diagonal.
As before, if we compute A*A, we get the second hops connections. Notice that nodes becomes connected to themselves via a second hop. For example, node 1 is connected 3 times to itself through a second hop via node 0, 7 and 8.
If you are interested in all the first hop connections and the second hop connections, you could add together A*A and A, thus leading to the following matrix. You could proceed forward to find the third hops onward, but in this example nothing else is connected, so although that the numbers you see here would grow, the pattern of zeros would not change. We have found all connections of this graph. We found that node 0 is connected to nodes 1, 7 and 8. Nodes 2, 3 and 4 are connected. Nodes 5, 6 and 9 are connected. Finally we see that node 10 is not connected to any other nodes.
In practice the matrix multiplication works well to find the next hops neighbours. If it happens also that for your problem (as in the one above) most connections are non existent i.e. 0, then you could use sparse matrices to store (and potentially compute with) your Adjacency matrix. However, those becomes quickly really huge matrices which requires a lot of operations to compute. A nice trick if you are using SQL or spark could be to use joins on tables.
To do so, you need to reverse the problem on its head. Instead of creating an Adjacency matrix of how the nodes are connected, you will create a table of the connections. So to keep with our second example, you could have something like the following network graph being turned into a node/connection table.
Now that we have that node/connection table, our goal will be to reduce the number of connections to the minimum possible and in the end get something like the following as a way to see everything connected (we won’t care about how many hops leads us there).
To get there we will iterate through a two step process. First we will perform connection reduction and then update the node/connection table. Then we rinse and repeat until we can no longer reduce the number of connections.
Assuming the above node/connection table (node_connections), we can reduce the number of connections via a the following SQL query and store it as the new_connections table:
SELECT A.connection, MIN(B.connection) AS new_connection
FROM node_connections AS A
JOIN node_connections AS B
ON A.node = B.node
GROUP BY A.connection
Then you can update the node_connection table with the following SQL query:
SELECT DISTINCT B.new_connection AS connection, A.node
FROM node_connections AS A
JOIN new_connections AS B
WHERE A.connection = B.connection
You iterate those two steps until the node_connections table change no more et voilà, you have a map of all nodes connected through distincts connections.
This is only one of the possible use case, but for large scale application it is probably easier and quicker to join tables than to create and multiply Adjacency matrices. I showed the logic with SQL, but obviously you could achieve similar results using spark (for my specific application, I use pySpark).
If you have questions or interesting ideas of application of the network theory to some problem, feel free to jump in the conversation!
Cover photo by Michael Gaida at Pixabay.