This post is a follow-up to Who am I connected to? As stated in the previous post, a problem that arise a lot is figuring out how things are connected. Is this server directly or indirectly connected to that pool? Or who am connected to through a chain of friends. If you ever have to implement such an algorithm (and about that you can refer to my previous post), one thing you might encounter are superstars, false friends or black holes. Name them the way you want 😉 . Those are “nodes” which are connected to an abnormally high number of other nodes. Well, when someone has 50k friends, you should be suspicious that those are not all real friends! The problem with reporting fake friends is many fold.
First, if you go through the process described last time, you get a number of very high density groups which normally wouldn’t be grouped together if it was not because of those black hole nodes. This may well make any conclusion pointless, so you should take care of removing (or not considering) those superstar nodes to start with.
Second, assuming you start with big data, joining a number of those superstars on themselves will lead to an exponential growth of your data set (at least temporarily) and it will take forever to complete the associated spark tasks (if successful at all). Ok, those might be legit friends, in that case you might not have a choice and maybe Fighting the Skew in Spark can help you solve that issue. But otherwise, if those are indeed false friends, you should take a step of removing those black hole nodes before hand.
In an ever changing world of data, it may not be easy to spot those black holes, but a good first filter may be as simple as (using PySpark notation this time, just to keep you on your toes):
filter_out = node_table .groupBy('node') .count() .filter(F.col('count') > black_hole_threshold)
The nodes captured by that filter-out “rule” can then be automatically removed from your node table, or examined and added to black lists if needs be. To automatically remove the filter_out nodes from your node_table, the join_anti is your friend!
output = node_table .join( filter_out.select('node'), on='node', how='left_anti')
You still need to perform the connection finding algorithm on this “output”, but at least you would have removed all nodes which have an above black_hole_threshold abnormal number of connections from your inputs.
What else can go wrong? Again, if you have big data, this process as a whole (especially since it is iterative) can take some serious time to execute. Moreover, even with the black holes removed, the join on itself part may consume a lot of resource from your cluster. The interesting part is that if you keep your “node” definition constant, you could run the algorithm in an online additive fashion which would run faster because most of the data wouldn’t change and already be reduced to find who’s who friend, so only the additional delta would in fact “move”. I know it is not that simple and quick, but it is still quicker than doing the process on the initial input data again an again…
Again, I hope this can be of help. If you apply this method or another equivalent one, let me know and let’s discuss about our experience!