WEBVTT
00:00.000 --> 00:10.880
How many friends do you have?
00:10.880 --> 00:13.480
At least you have more friends than I do.
00:13.480 --> 00:15.960
Well, on average.
00:15.960 --> 00:18.280
Don't get me wrong, I am not saying this out of self-pity.
00:18.280 --> 00:23.920
This is a mathematical fact known as the friendship paradox.
00:23.920 --> 00:30.800
Suppose we have two persons: A, who has one friend, and B, who has three friends.
00:30.800 --> 00:36.520
Now let me ask: in whose friend list am I more likely to appear?
00:36.520 --> 00:42.400
Because B has three times more friends, I am three times more likely to appear in
00:42.400 --> 00:45.600
B's friend list.
00:45.600 --> 00:52.140
The friendship paradox dictates that, on average, your friends have more friends than you do.
00:52.140 --> 00:58.280
The more friends someone has, the more likely they are to appear in your friend list.
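The paradox described above can be checked with a quick simulation; the toy graph below is my own illustration, not from the talk.

```python
# Friendship paradox on a toy graph: compare the average degree of a
# random node with the average degree of a random node's random friend.
import random

random.seed(0)

# Star-like toy graph: node 0 has three friends; nodes 1, 2, 3 have one each,
# matching the A-and-B example in the talk.
friends = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# Average number of friends of a uniformly random node.
avg_degree = sum(len(f) for f in friends.values()) / len(friends)  # 1.5

# Average number of friends of a randomly sampled friend.
samples = []
for _ in range(10000):
    node = random.choice(list(friends))
    friend = random.choice(friends[node])
    samples.append(len(friends[friend]))
avg_friend_degree = sum(samples) / len(samples)

# On average, your friends have more friends than you do.
print(avg_degree, avg_friend_degree)
```

Here the expected degree of a sampled friend is 2.5, well above the average degree of 1.5, because the high-degree node 0 appears in three friend lists.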
00:58.280 --> 01:04.120
Beyond being an interesting piece of trivia, the friendship paradox has substantial importance
01:04.120 --> 01:10.040
because it may introduce biases in graph embeddings.
01:10.040 --> 01:15.680
Hello everyone, my name is Sadamori Kojaku, and we will walk you through a new insight
01:15.680 --> 01:21.340
into biases in graph embedding arising from the friendship paradox.
01:21.340 --> 01:26.160
Graph embedding is a technique to map a graph into a vector space that reflects
01:26.160 --> 01:28.360
the structure of the graph.
01:28.360 --> 01:34.040
A widespread paradigm is the approach based on Word2Vec.
01:34.040 --> 01:39.480
In this approach, one somehow generates sequences of nodes from the graph.
01:39.480 --> 01:45.600
The nodes in these sentences are then mapped to a vector space by Word2Vec.
01:45.600 --> 01:52.360
Now the key is that Word2Vec does not learn from the graph directly, but through the sentences
01:52.360 --> 01:55.360
generated from the graph.
01:55.360 --> 02:01.120
Unlike word embedding, where the input sentences are the actual data, for graph embedding
02:01.120 --> 02:07.640
the input sentences are artificially generated, and how to generate them is a critical modeling
02:07.640 --> 02:08.640
decision.
02:08.640 --> 02:15.280
This leads us to the question of how to generate the sentences from the graph.
02:15.280 --> 02:20.160
A common way is to use random walks.
02:20.160 --> 02:28.560
The walker starts from a node in the graph, and this node is the first node in the sentence.
02:28.560 --> 02:32.840
Then the walker moves to one of the neighbors, selected at random.
02:32.840 --> 02:35.920
This new node is added to the sentence.
02:35.920 --> 02:43.320
By repeating this process, we can generate a sentence of nodes from the graph.
02:43.320 --> 02:48.940
The friendship paradox comes into play when the walker follows an edge.
02:48.940 --> 02:53.320
It is more likely to visit a node with many neighbors.
02:53.320 --> 02:58.800
In other words, following edges is a biased sampling process that preferentially leads random
02:58.800 --> 03:02.400
walkers to nodes with many neighbors.
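The sentence-generation process and the degree bias it introduces can be sketched as follows; the toy graph and function names are my own, not from the talk.

```python
# Generate "sentences" of nodes by simple random walks, then count how
# often each node is visited: visit frequency grows with node degree.
import random
from collections import Counter

random.seed(0)

# Toy graph as an adjacency list; node 0 is a hub with degree 4.
graph = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}

def random_walk(graph, start, length):
    """One sentence: a walk that records each visited node in order."""
    sentence = [start]
    for _ in range(length - 1):
        sentence.append(random.choice(graph[sentence[-1]]))
    return sentence

sentences = [random_walk(graph, start, 20) for start in graph for _ in range(200)]
visits = Counter(node for s in sentences for node in s)
total = sum(visits.values())

# The hub is one of five nodes (20%) but claims a far larger visit share.
print({node: round(visits[node] / total, 2) for node in sorted(visits)})
```

This reflects a standard fact about random walks on undirected graphs: the stationary visit probability of a node is proportional to its degree, which is exactly the friendship-paradox bias described above.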
03:02.400 --> 03:07.640
To see this effect, let us consider a graph with core-periphery structure, where core nodes
03:07.640 --> 03:10.600
have more neighbors than peripheral nodes.
03:10.600 --> 03:15.560
A sentence can be generated from this graph by running a random walk.
03:15.560 --> 03:21.200
Now, the core nodes are about 20% of the nodes in the graph.
03:21.200 --> 03:26.200
But when looking at the generated sentence, the core nodes are overrepresented, which is
03:26.200 --> 03:30.160
because of the bias due to the friendship paradox.
03:30.160 --> 03:38.160
The fact that the sentence is biased by the friendship paradox leads us to our main question.
03:38.160 --> 03:41.760
Does the sampling bias have a negative impact?
03:41.760 --> 03:44.360
If so, how can we fix it?
03:44.360 --> 03:50.920
Surprisingly, it has no effect, because Word2Vec itself has an overlooked built-in debiasing
03:50.920 --> 03:56.440
feature that happens to negate the bias due to the friendship paradox.
03:56.440 --> 04:03.040
This built-in debiasing feature can be easily utilized to negate other types of biases,
04:03.040 --> 04:06.640
and we demonstrate how to do this.
04:06.640 --> 04:10.280
Our starting point is a sentence of words.
04:10.280 --> 04:17.480
Word2Vec picks a word, called the center, and its surrounding words, called the context, and then models the
04:17.480 --> 04:24.400
conditional probability using a softmax function, where the conditional probability is reflected
04:24.400 --> 04:28.880
as the dot similarity of the two word vectors.
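Written out, the softmax model described here is the standard skip-gram form (my notation, not from the talk: $u_i$ is the vector of the center word $i$, $v_j$ the vector of a context word $j$):

```latex
P(j \mid i) = \frac{\exp(u_i^\top v_j)}{\sum_{j'} \exp(u_i^\top v_{j'})}
```

The sum in the denominator runs over all unique words in the corpus, which is the costly normalization constant.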
04:28.880 --> 04:35.240
We want to fit this model to the data, but it is computationally challenging due to the
04:35.240 --> 04:42.080
normalization constant, which extends over all unique words in the corpus.
04:42.080 --> 04:46.160
A common way to reduce this burden is negative sampling.
04:46.160 --> 04:53.720
Now, it is often underappreciated that negative sampling is actually a simplified version
04:53.720 --> 04:56.800
of noise contrastive estimation.
04:56.800 --> 05:04.920
And it is this simplification that gives rise to an interesting feature of Word2Vec.
05:04.920 --> 05:09.880
How does noise contrastive estimation, or NCE, work?
05:09.880 --> 05:16.840
NCE samples k random contexts from a so-called noise distribution.
05:16.840 --> 05:23.400
This noise distribution is roughly proportional to the frequency of a word in the corpus.
05:23.400 --> 05:30.240
The random contexts are labeled as 0, and the actual context is labeled as 1.
05:30.240 --> 05:37.160
Then NCE calculates the probability that a word comes from the actual data using a Bayesian
05:37.160 --> 05:39.040
framework.
05:39.040 --> 05:46.040
By putting the prior and the likelihood together, we get a posterior like this.
05:46.040 --> 05:52.320
This function is a sigmoid function and takes the dot similarity and the noise distribution
05:52.320 --> 05:54.400
as arguments.
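In symbols, this posterior is the standard NCE sigmoid, whose argument combines the dot similarity with the noise distribution $p_0$ (my notation; $k$ is the number of noise samples):

```latex
P(y = 1 \mid i, j) = \sigma\bigl(u_i^\top v_j - \ln k\, p_0(j)\bigr),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```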
05:54.400 --> 06:01.160
Now the key feature of NCE is that it is asymptotically unbiased for the Word2Vec
06:01.160 --> 06:03.000
model.
06:03.000 --> 06:08.160
Meaning, if the data is actually generated from this model, and we increase the number
06:08.160 --> 06:14.240
of training samples, then the embedding vectors converge to the true vectors.
06:14.240 --> 06:20.560
Beyond Word2Vec, noise contrastive estimation is also an unbiased estimator for a more general
06:20.560 --> 06:27.680
model that takes a real-valued function f instead of the dot similarity.
06:27.680 --> 06:33.400
Now, negative sampling simplifies noise contrastive estimation.
06:33.400 --> 06:40.640
It estimates the same probability, but simply drops the term involving the noise distribution.
06:40.640 --> 06:45.000
You might be wondering what happens without this term.
06:45.000 --> 06:50.880
To see this, we rewrite it in the form of noise contrastive estimation, where we define
06:50.880 --> 06:59.480
a new function f', which combines the original function f with the noise distribution.
06:59.480 --> 07:08.640
This is asymptotically unbiased for a probability model which now includes the noise distribution.
07:08.640 --> 07:16.160
So, all in all, Word2Vec trained with skip-gram negative sampling is asymptotically unbiased for
07:16.160 --> 07:23.440
this probability model, or, more specifically for Word2Vec, this function.
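Concretely, dropping the noise term means negative sampling fits $\sigma(f)$ rather than $\sigma(f - \ln k\, p_0(j))$, which is equivalent to running NCE with $f'(i,j) = f(i,j) + \ln k\, p_0(j)$. For the Word2Vec case $f(i,j) = u_i^\top v_j$, SGNS is therefore asymptotically unbiased for (my notation, following the standard analysis of negative sampling):

```latex
P(j \mid i) \;\propto\; p_0(j)\, \exp(u_i^\top v_j)
```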
07:23.440 --> 07:30.840
In this model, the noise distribution offsets the modeled probability, serving as a baseline.
07:30.840 --> 07:35.720
The embedding vectors capture the residual from the baseline.
07:35.720 --> 07:41.760
Now, recall that the baseline probability is roughly proportional to the frequency.
07:41.760 --> 07:48.160
Therefore, the embedding vectors capture information other than the frequency.
07:48.160 --> 07:57.320
In other words, SGNS Word2Vec has a built-in debiasing feature for frequency bias.
07:57.320 --> 08:01.800
Now let us revisit the friendship paradox.
08:01.800 --> 08:08.280
The sampling bias due to the friendship paradox is that the frequency of a node is determined
08:08.280 --> 08:12.000
largely by the degree of the node.
08:12.000 --> 08:17.480
Notice that this frequency is exactly what the baseline probability accounts for.
08:17.480 --> 08:24.840
Therefore, the friendship paradox has no effect, thanks to the built-in debiasing feature of
08:24.840 --> 08:28.400
SGNS Word2Vec.
08:28.400 --> 08:33.840
This realization leads us to Residual2Vec.
08:33.840 --> 08:41.120
The key idea is to model the baseline probability explicitly, to control which bias to remove
08:41.120 --> 08:43.160
in the embedding.
08:43.160 --> 08:47.880
So how, specifically, can we model the baseline?
08:47.880 --> 08:54.880
We start from the given graph and randomize its structure, then generate sentences using
08:54.880 --> 09:01.600
random walks, then calculate the conditional probability as the baseline, which is based
09:01.600 --> 09:08.080
on the idea that we should remove biases arising from trivial structure.
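As a minimal sketch of such a baseline: under the simplest randomization, a degree-preserving configuration model, a random walk visits node j with probability proportional to its degree, independent of the current node. The toy graph and names below are my own illustration; the paper's baseline models are more general.

```python
# Baseline probability under a degree-preserving randomization:
# a random walk on the randomized graph visits node j with probability
# deg(j) / 2m, regardless of the walker's current position.

graph = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}

degree = {node: len(neigh) for node, neigh in graph.items()}
two_m = sum(degree.values())  # twice the number of edges

# Baseline conditional probability P0(j | i) = deg(j) / 2m for every i.
baseline = {node: deg / two_m for node, deg in degree.items()}
print(baseline)  # the hub, node 0, gets the largest baseline probability
```

Embedding the residual relative to this baseline then removes exactly the degree-driven (friendship-paradox) part of the visit frequencies.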
09:08.080 --> 09:12.640
This debiasing feature is useful for predicting links in the graph.
09:12.640 --> 09:20.320
Residual2Vec performs the best or nearly the best on all six graphs from different domains.
09:20.320 --> 09:27.360
Furthermore, Residual2Vec is the best or the second-best performer on a community detection
09:27.360 --> 09:29.360
benchmark.
09:29.360 --> 09:36.280
To showcase the debiasing feature, we constructed a citation graph of journals using the
09:36.280 --> 09:43.800
Web of Science, where the nodes are journals connected by undirected and weighted
09:43.800 --> 09:45.800
citation edges.
09:45.800 --> 09:51.640
When applying the GloVe embedding, old journals are concentrated in the center, reflecting
09:51.640 --> 09:55.040
temporal aspects of the journals.
09:55.040 --> 10:01.440
This is because old journals have had time to accumulate many citations, and are therefore well
10:01.440 --> 10:04.840
connected to many different journals.
10:04.840 --> 10:10.720
Subject-wise, GloVe separates different fields to some extent.
10:10.720 --> 10:15.560
With Residual2Vec, we can remove the biases due to time.
10:15.560 --> 10:21.720
In effect, the old journals now spread out, and the disciplinary separations are more
10:21.720 --> 10:23.920
clearly visible.
10:23.920 --> 10:29.480
Beyond eyeballing the embeddings, we test them quantitatively by predicting
10:29.480 --> 10:35.160
the journal impact factor as well as the subject categories.
10:35.160 --> 10:41.280
We find that the impact factor and the subject of journals can be well predicted by removing
10:41.280 --> 10:46.560
the temporal biases as well as the friendship paradox effect.
10:46.560 --> 10:54.600
In summary, we show that Word2Vec has a built-in debiasing feature attributed to negative sampling.
10:54.600 --> 11:00.320
Inspired by this finding, we propose Residual2Vec, which can negate other types of structural
11:00.320 --> 11:02.320
biases.
11:02.320 --> 11:08.360
We demonstrate that removing biases not only improves the performance, but also enables
11:08.360 --> 11:13.400
us to control the biases in the final representation.
11:13.400 --> 11:19.480
Our results highlight a new potential of negative sampling as a way to mitigate biases
11:19.480 --> 11:27.200
in representations, which may be useful for addressing the problem of biases in AI.
11:27.200 --> 11:33.320
Although we have not studied the biases in AI, given the wide usage of negative sampling
11:33.320 --> 11:39.880
to train AI, our approach may lead to methods and studies that expose and mitigate biases
11:39.880 --> 11:41.920
in AI.
11:41.920 --> 11:47.720
We believe that our approach contributes to the effort to create transparent and accountable
11:47.720 --> 11:53.520
machine learning methods, especially because our method enables us to explicitly control
11:53.520 --> 11:57.160
the biases in the graph representation.
11:57.160 --> 12:03.520
That's all for the presentation. Finally, I'd like to acknowledge Jisung Yoon, Isabel
12:03.520 --> 12:11.280
Constantino, and Yong-Yeol Ahn for creating and adding momentum to this project for years,
12:11.280 --> 12:15.160
and to thank all of you who watched this video.
12:15.160 --> 12:19.640
If you want to know more details, please check out our paper.
12:19.640 --> 12:27.880
Thanks!