Mapping out the probabilistic social graph

Update (26 Sept 2017): Turns out this idea is pretty well-studied! Just a little poking around reveals such papers as Chen et. al. (aka, LinkProbe) from 2013 - among many others. Fascinating stuff! I will have to dig in to see how easily those concepts can be brought to bear on the problems of political organizing.

A couple of years ago, Facebook shut down a critical piece of the Graph API, which made it easy to map out a deterministic social graph. Now, it’s time to think about replacing that API with something even better.

The good old days of the deterministic graph

Before the shut-down, the developer of an app could grab the friend list of the app’s users, and thereby efficiently map out their social graph. This feature made it relatively easy for apps to reach out from their base of users, and to get a hold of new prospects. In terms of progressive politics, it was something of a gold mine, because it made it easy for volunteers to contact their friends on behalf of a campaign. Conventional wisdom among organizers is that friend-to-friend outreach is the most effective form of contact, since people tend to trust their friends much more than they do strangers.

It’s been a couple of years since Facebook disabled this feature, and since then we don’t really have a good replacement. That may be just as well: the Facebook Graph API created the perception that we could capture what I’d call the deterministic social graph: a social graph in which two people are connected only if we have some form of factual proof that those two people are friends. But it turns out that gathering all those facts is really difficult work; we’ll never get a complete social graph that way. What if we didn’t need to do so much work, but had a much more complete graph?

What I have in mind is something I call the probabilistic social graph. In this graph, the nodes are people, and edges between nodes are probabilistically defined. For two people a and b, the edge between their nodes is weighted by the probability that a and b are friends: P(a, b).

If you like you can vary this definition slightly, and think about the different kinds of relationships between two people: they could be related by family ties, co-workers, classmates, etc. Supposing there are N different relationship types we are interested in, we can define N different social graphs, where social graph i has edges whose weights are defined by the probability that a and b have relationship i: P(a, b, i), where i goes from 1 to N.

The result is N different matrices, with dimensions equal to the universe of people whose social graph you care to investigate. Right away it should become apparent that the matrices are quite sparse (in the sense that most entries are very close to 0) for any large population, and for small, cohesive social groups the matrices are dense.

Using the graph

How could we use these matrices? To be sure, we could do the sorts of things that the Facebook API used to let us do: for person a, find all people b such that P(a, b) > x, where x is some threshold value that indicates the degree of certainty we want that a and b are friends: let’s say, 0.95.

We could of course flip the problem on its head. Suppose you are managing voter outreach for a progressive campaign, and you have some set of people who you want to recruit as supporters, and some other set of people who are already volunteering to help out your campaign. Let R be the list of people you want to recruit, and S the list of supporters you already have. Now you want to find, for each desired recruitee r, the supporter s who maximizes P(r, s, i), where i is the type of relationship which you think most likely to be productive. To make this concrete, let’s suppose that your campaign is a ballot initiative to support free tuition at in-state colleges; you may set R to be the list of people who have graduated from in-state colleges, and i to be a relationship like “class-mate”. Now, you are getting your supporters to contact their class-mates from in-state colleges and to recruit those class-mates into your campaign. You could even take this approach one step further, and find the relationship i which maximizes the number of supporters you have with relationships to your recruiting universe.

There are other, still more interesting things that you can try. In certain contexts, first-order friendships of the kind that P defines are actually not that useful. For example, when job-hunting, it’s usually best not to bother drumming up prospects from the people you are closest to. Those people already know that you’re job-hunting, and if they know of any good leads, they’d have told you long ago. What you’re more interested are second-order contacts, which is to say friends-of-friends. I suspect that in some political contexts, similar kinds of effects are in place. If you’re a supporter of a progressive campaign, especially one which has a lot of publicity, then probably many of your friends are already on board, and the ones who aren’t will probably not be swayed by whatever arguments you care to re-tread with them. It’s perhaps more fruitful to be able to strike up conversations with your friends-of-friends, in which you can intermingle political philosophy with idle gossip about your shared friends. Well, your friends-of-friends are easily determined using the probabilistic social graph; if you’re a, then your friends-of-friends are just those people c for whom there exists some b for which P(a, b, i) > x and P(b, c, j) > x, for some threshold x and some relationships i and j.

Computing the graph

All of this math is very nice, but of course it’s predicated on some rather challenging abstractions, in particular the value of P(a, b, i). How are we to calculate this powerful little function?

Let’s step back a bit and consider perhaps a simpler version of the social graph, where edge weights are determined by the function f, where f(a, b) = 1 if a and b are acquainted in some way, and f(a, b) = 0 if not: f describes the deterministic social graph. It should be immediately obvious that f is poorly defined, because social relationships are themselves so ambiguous: suppose that you, dear reader, are the a, and consider what the value of f would be, if we plugged in some of the following bs: a distant cousin whom you’ve never met; the principal of your grade school; Nelson Mandela. We could of course nail down f with a little more precision: f(a, b) = 1 if a and b have ever had a conversation lasting more than 30 minutes, let’s say, and again we would run into trouble. We get f(a, b) = 0 if we set b to be that distant cousin (who you could well become acquainted with, if you so chose), and f(a, b) = 1 if we set b to be a customer service rep for the cable company who once helped you figure out a tricky billing issue, but whose name you’ve long since forgotten.

The point here is that the deterministic social graph is impossible to define accurately, because social relationships themselves defy accurate mathematical definition. Consequently, if we approach P through the most obvious techniques - machine learning of some kind, perhaps a classification problem on the space of all |R| * |S| possible relationships - then we will immediately be stymied by the fact that any imaginable training set is biased in some way.

Finding good training data

That doesn’t mean the problem is outright impossible, just very difficult. It is certainly possible to prepare training sets through a variety of methods. For example, one could collate a variety of birth certificates, marriage and death certificates, and similar documents, to get a very good picture of immediate family relationships (in fact this piece of the training set puzzle is perhaps the easiest to find.) Certain kinds of institutional relationships can be acquired as well, perhaps with some difficulty: directories from schools, labor unions, churches, and similar organizations can provide a lot of training data about some of the weaker forms of social ties, such as “a was in the same graduating class as b”.

Then there are various forms of digital relationships: email address books, mobile phone contact lists, Twitter follower lists. There’s a lot of very exciting work underway to mine these databases. The data is extremely useful in some ways, because it captures actual use patterns, telling us something about whether a actually communicates with b, but that is both more and less useful than a school directory. We tend to communicate digitally both with people who we consider very close friends, and those who we quickly and easily forget. There are any number of attempts to determine which precise communications are the ones we care most about, Google’s Priority Inbox being perhaps the most interesting and successful, but all of them suffer from the simple fact that digital communication is an imperfect window into our social worlds.

At the end of the day, we can collate all sorts of interesting training sets; but whatever we do, they are likely to be emphatically biased in some way or the other, because there does not exist any sort of representative slice of the hypothetical “answer sheet”, since the deterministic graph determined by f is itself impossible to define.

What are the features?

But the problem is not just about the lack of a good training set. A classification problem on the space of possible social relationship requires what is probably a very rich set of features. Let’s forget the math for a second and ask ourselves this question: suppose we pick a random name from the voter file. How likely is it that you, dear reader, are friends with this person? To make things interesting, let’s attach a wager to the question, with a $1,000 bonus if you correctly identify a friend, and a $100 fine for every false positive.

Before you venture an answer, there are a great many questions you probably want to ask: How old is this voter? What gender? Where does the voter live? Where did the voter grow up? Socioeconomic status? Race? Religion? Veteran? Job? And so forth; eventually your questions would come to resemble the list of suspect qualifications for various non-discrimination statutes. Reliable answers to perhaps a dozen of these sorts of questions should boost your chances of answering correctly by quite a bit. (We’ll suppose, for simplicity’s sake, that a question like “where did this person grow up?” even has a well-defined answer in the first place.)

To return to machine learning, we want a whole lot of features on our training set before we venture to learn the classifications for prospective relationships. With reference to most voter files, some of these features, like age and address, are readily at hand. For many voters, there are a handful of features, such as birthplace, which could be deduced by matching the voter file to other data sets, like a birth registry. There are also some features, like socioeconomic status and race, which are guessable to varying degrees of uncertainty, based on the voter’s address.

It should be clear by now, hopefully, that the challenge here is tremendous. Collating a training set, gathering sufficient features on the training set or any new data - each of these require a prodigious amount of effort, and commensurate resources.

The problem of time

Should you not be convinced of the magnitude of the task just yet, let me also posit that P may or may not be fixed in time, as we’ve implicitly assumed above. As we all know, relationships evolve over time: people gain new friends, classmates fall out of touch, and so forth. In terms of our deterministic function f above, clearly it is a four-dimensional function, f(a, b, i, t), where t represents a moment of time. It’s not altogether clear to me whether P requires similar treatment, given that the probability of a relationship, given relatively coarse-grain features as we’ve discussed above, probably does not change nearly as frequently as the deterministic social graph: for a randomly-chosen firefighter in Schenectady, New York, the probability that the firefighter is friends with a teacher in Albany is probably about the same today as it was five or ten years ago. Surely there is a role for time, given what we know about changing patterns of inequality, race and gender relations, etc., but its effect on P may be so negligible as to be inconsequential for any practical purpose; by the time a change in t has had a noticeable impact on the probabilities, we may have developed technology sufficient to make P obsolete in the first place. Of course, this assessment of P is entirely conjecture: it may well be that the function changes on the granularity of months or even weeks. The larger point is that this function will be devilishly tricky to pin down.

My view, however, is that the investment is well worth the reward. We are just now at the point in political technology where we are learning to take advantage of peer-to-peer contacts that leverage social ties effectively. Currently we’re leveraging these approaches for the most obvious purposes, such as voter contact. That is all to the good, but I suspect that much more interesting, and more powerful, uses are sure to come.

If you’re tempted to comment that I’m talking about replacing a difficult problem - capturing the deterministic social graph - with a nearly impossible one, then your point is well-taken. The impossibility of the task is in this case commensurate to its value: P covers the entire universe of people we want to contact, and we simply can’t do the same with f.

In a forthcoming post, I hope to break down the problem of the probabilistic social graph a bit further, to see if there are different angles of attack to the problem. My hope is to unearth simpler versions of the graph which might be a good deal easier to calculate. Stay tuned, and certainly let me know if you have ideas to contribute!

Image courtesy of Annie Spratt