ytseg_demo / demo_data /nips-2021 /25965 /transcript_whisper_large-v2.vtt
WEBVTT
00:00.000 --> 00:10.880
How many friends do you have?
00:10.880 --> 00:13.480
At least you have more friends than I do.
00:13.480 --> 00:15.960
Well, on average.
00:15.960 --> 00:18.280
Don't get me wrong, I am not a pity person.
00:18.280 --> 00:23.920
This is a mathematical fact known as the friendship paradox.
00:23.920 --> 00:30.800
Suppose we have two persons, A who has one friend and B who has three friends.
00:30.800 --> 00:36.520
Now let me ask in which friend list am I likely to appear?
00:36.520 --> 00:42.400
Because B has three times as many friends, I am three times more likely to appear in
00:42.400 --> 00:45.600
B's friend list.
00:45.600 --> 00:52.140
The friendship paradox dictates that on average, your friends have more friends than you do.
00:52.140 --> 00:58.280
The more friends someone has, the more likely they are to appear in your friend list.
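The paradox can be checked numerically. A minimal Python sketch, extending the two-person example above to a small four-node friendship graph (the graph itself is illustrative, not from the talk):

```python
# Friendship paradox on a tiny undirected friendship graph.
# Compare the average number of friends a person has with the average
# number of friends that a randomly chosen *friend* has.
friends = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B"],
}

# Average degree: (1 + 3 + 1 + 1) / 4 = 1.5
avg_degree = sum(len(fs) for fs in friends.values()) / len(friends)

# Average degree of a friend: enumerate every (person, friend) pair and
# average the friend's degree. High-degree people like B appear in many
# friend lists, so they are counted more often -- the friendship paradox.
pairs = [(p, f) for p, fs in friends.items() for f in fs]
avg_friend_degree = sum(len(friends[f]) for _, f in pairs) / len(pairs)

print(avg_degree)         # 1.5
print(avg_friend_degree)  # (3 + 1 + 1 + 1 + 3 + 3) / 6 = 2.0
```

On average, your friends (2.0) have more friends than you do (1.5), exactly because the popular node B is overrepresented in the friend lists.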
00:58.280 --> 01:04.120
Beyond an interesting piece of trivia, the friendship paradox has substantial importance
01:04.120 --> 01:10.040
because it may introduce biases in graph embeddings.
01:10.040 --> 01:15.680
Hello everyone, my name is Sadamori Kojaku, and we will walk you through a new insight
01:15.680 --> 01:21.340
into biases in graph embedding arising from the friendship paradox.
01:21.340 --> 01:26.160
Graph embedding is a technique to map a graph into a vector space that reflects
01:26.160 --> 01:28.360
the structure of the graph.
01:28.360 --> 01:34.040
A widespread paradigm is the approach based on Word2Vec.
01:34.040 --> 01:39.480
In this approach, one somehow generates a sequence of nodes from the graph.
01:39.480 --> 01:45.600
The nodes in the sentences are then mapped to a vector space by Word2Vec.
01:45.600 --> 01:52.360
Now the key is that Word2Vec does not directly learn the graph, but through the sentences
01:52.360 --> 01:55.360
generated from the graph.
01:55.360 --> 02:01.120
Unlike word embedding, where the input sentences are the actual data, for graph embedding,
02:01.120 --> 02:07.640
the input sentence is artificially generated, and how to generate it is a critical modeling
02:07.640 --> 02:08.640
decision.
02:08.640 --> 02:15.280
This leads us to the question of how to generate the sentences from the graph.
02:15.280 --> 02:20.160
A common way is to use random walks.
02:20.160 --> 02:28.560
The walker starts from a node in the graph, and this node is the first node in the sentence.
02:28.560 --> 02:32.840
Then the walker moves to one of the neighbors selected randomly.
02:32.840 --> 02:35.920
This new node is added to the sentence.
02:35.920 --> 02:43.320
By repeating this process, we can generate a sentence of nodes from this graph.
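The walk just described can be sketched as follows; a minimal illustration in which the adjacency list, function name, and walk length are assumed for the example, not taken from the talk:

```python
import random

def random_walk_sentence(adj, start, length, rng=None):
    """Generate one 'sentence' of nodes by a simple random walk:
    start from a node, then repeatedly move to a uniformly random
    neighbor, appending each visited node to the sentence."""
    rng = rng or random.Random()
    sentence = [start]
    for _ in range(length - 1):
        sentence.append(rng.choice(adj[sentence[-1]]))
    return sentence

# A toy undirected graph as an adjacency list.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = random_walk_sentence(adj, start=0, length=10, rng=random.Random(0))
print(walk)  # every consecutive pair in the walk is an edge of the graph
```

Running many such walks from different start nodes yields the corpus of "sentences" that is then fed to Word2Vec.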
02:43.320 --> 02:48.940
The friendship paradox comes into play when the walker follows an edge.
02:48.940 --> 02:53.320
It is more likely to visit a node with many neighbors.
02:53.320 --> 02:58.800
In other words, following edges is a biased sampling that preferentially leads random
02:58.800 --> 03:02.400
walkers to nodes with many neighbors.
03:02.400 --> 03:07.640
To see this effect, let us consider a graph with core-periphery structure, where core nodes
03:07.640 --> 03:10.600
have more neighbors than peripheral nodes.
03:10.600 --> 03:15.560
A sentence can be generated from this graph by running a random walk.
03:15.560 --> 03:21.200
Now, the core nodes are about 20% of the nodes in the graph.
03:21.200 --> 03:26.200
But when looking at the generated sentence, the core nodes are overrepresented, which is
03:26.200 --> 03:30.160
because of the bias due to the friendship paradox.
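This overrepresentation can be verified directly: on an undirected graph, a long random walk visits node i with stationary probability deg(i) divided by the total degree, so high-degree core nodes dominate the sentence. A sketch with an illustrative toy graph (not the graph from the talk):

```python
import random
from collections import Counter

# A small core-periphery toy graph: node 0 is the hub ("core"),
# nodes 1..5 form the periphery with few connections each.
adj = {
    0: [1, 2, 3, 4, 5],
    1: [0, 2],
    2: [0, 1],
    3: [0, 4],
    4: [0, 3],
    5: [0],
}

rng = random.Random(42)
node, visits = 0, Counter()
n_steps = 100_000
for _ in range(n_steps):
    node = rng.choice(adj[node])
    visits[node] += 1

# On an undirected graph, the stationary visit probability of node i is
# deg(i) / (sum of all degrees). Here the hub receives 5/14 of the
# visits even though it is only 1 of the 6 nodes.
total_degree = sum(len(neighbors) for neighbors in adj.values())  # 14
for i in sorted(adj):
    print(i, visits[i] / n_steps, len(adj[i]) / total_degree)
```

The empirical visit frequencies closely match the degree-proportional prediction, which is the friendship-paradox bias in action.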
03:30.160 --> 03:38.160
The fact that the sentence is biased by the friendship paradox leads us to our main question.
03:38.160 --> 03:41.760
Does the sampling bias have a negative impact?
03:41.760 --> 03:44.360
If so, how can we fix it?
03:44.360 --> 03:50.920
Surprisingly, it has no effect, because Word2Vec itself has an overlooked built-in debiasing
03:50.920 --> 03:56.440
feature that happens to negate the bias due to the friendship paradox.
03:56.440 --> 04:03.040
This built-in debiasing feature can be easily utilized to negate other types of biases,
04:03.040 --> 04:06.640
and we demonstrate how to do this.
04:06.640 --> 04:10.280
Our starting point is a sentence of words.
04:10.280 --> 04:17.480
Word2Vec picks a word, called the center, and its surrounding words, called the context, and then models the
04:17.480 --> 04:24.400
conditional probability using a softmax function, where the conditional probability is reflected
04:24.400 --> 04:28.880
as a dot similarity of the two vectors of the words.
04:28.880 --> 04:35.240
We want to fit this model to the data, but it is computationally challenging due to the
04:35.240 --> 04:42.080
normalization constant, which extends over all unique words in the corpus.
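In symbols (the notation below is assumed for exposition; the talk states the model only verbally), with center word w, context word c, and their embedding vectors v_w and u_c, the softmax model is:

```latex
P(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)}
```

The denominator sums over every unique word c' in the vocabulary V, which is exactly the normalization constant that makes direct fitting expensive.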
04:42.080 --> 04:46.160
A common way to reduce this burden is negative sampling.
04:46.160 --> 04:53.720
Now, it is often underappreciated that negative sampling is actually a simplified version
04:53.720 --> 04:56.800
of noise contrastive estimation.
04:56.800 --> 05:04.920
And it is this simplification that gives rise to an interesting feature of Word2Vec.
05:04.920 --> 05:09.880
How does noise contrastive estimation, or NCE, work?
05:09.880 --> 05:16.840
NCE samples k random contexts from a so-called noise distribution.
05:16.840 --> 05:23.400
This noise distribution is roughly proportional to the frequency of a word in the corpus.
05:23.400 --> 05:30.240
The random contexts are labeled as 0, and the actual context is labeled as 1.
05:30.240 --> 05:37.160
Then NCE calculates the probability that a word comes from actual data using a Bayesian
05:37.160 --> 05:39.040
framework.
05:39.040 --> 05:46.040
By putting the prior and the likelihood together, we have a posterior like this.
05:46.040 --> 05:52.320
This function is a sigmoid function and takes the dot similarity and the noise distribution
05:52.320 --> 05:54.400
as the arguments.
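Spelled out with assumed notation (u_c and v_w for the context and center embedding vectors, p_0 for the noise distribution, k for the number of noise samples), the posterior probability that the pair (w, c) came from the actual data is:

```latex
P(y = 1 \mid w, c) = \sigma\!\left(u_c^\top v_w - \log k p_0(c)\right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

a sigmoid whose arguments are the dot similarity and the noise distribution, as the talk describes.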
05:54.400 --> 06:01.160
Now the key feature of NCE is that it is asymptotically unbiased for the
06:01.160 --> 06:03.000
Word2Vec model.
06:03.000 --> 06:08.160
Meaning, if the data is actually generated from this model, and we increase the amount
06:08.160 --> 06:14.240
of training data, then the embedding vectors converge to the true vectors.
06:14.240 --> 06:20.560
Beyond Word2Vec, the noise contrastive estimation is also an unbiased estimator for a more general
06:20.560 --> 06:27.680
model that takes a real-valued function f instead of the dot similarity.
06:27.680 --> 06:33.400
Now the negative sampling simplifies the noise contrastive estimation.
06:33.400 --> 06:40.640
It estimates the same probability, but simply drops the noise distribution term.
06:40.640 --> 06:45.000
You might be wondering what happens without this term.
06:45.000 --> 06:50.880
To see this, we rewrite it in the form of noise contrastive estimation, where we define
06:50.880 --> 06:59.480
a new function f' which consists of the original function f as well as the noise distribution.
06:59.480 --> 07:08.640
This is asymptotically unbiased for a probability model which now includes the noise distribution.
07:08.640 --> 07:16.160
So all in all, Word2Vec trained with skip-gram negative sampling is asymptotically unbiased for
07:16.160 --> 07:23.440
this probability model, or more specifically for Word2Vec, this function.
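Written out with assumed notation (u_c, v_w for the embedding vectors, p_0 for the noise distribution, k for the number of noise samples): negative sampling scores a pair with sigma(u_c . v_w), which matches the NCE posterior sigma(f'(w,c) - log k p_0(c)) when f'(w,c) = u_c . v_w + log k p_0(c). Since NCE with score f' is asymptotically unbiased for the model proportional to exp f'(w,c), SGNS is unbiased for:

```latex
P(c \mid w) \;\propto\; p_0(c)\, \exp\!\left(u_c^\top v_w\right)
```

so the noise distribution enters the model as a multiplicative baseline.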
07:23.440 --> 07:30.840
In this model, the noise distribution offsets the modeled probability, serving as a baseline.
07:30.840 --> 07:35.720
The embedding vectors capture the residual from the baseline.
07:35.720 --> 07:41.760
Now, recall that the baseline probability is roughly proportional to the frequency.
07:41.760 --> 07:48.160
Therefore, the embedding vectors capture the information other than the frequency.
07:48.160 --> 07:57.320
In other words, SGNS Word2Vec has a built-in debiasing feature for frequency bias.
07:57.320 --> 08:01.800
Now let us revisit the friendship paradox.
08:01.800 --> 08:08.280
The sampling bias due to the friendship paradox is that the frequency of a node is determined
08:08.280 --> 08:12.000
solely by the degree of the node.
08:12.000 --> 08:17.480
Notice that this frequency is actually accounted for by the baseline probability.
08:17.480 --> 08:24.840
Therefore, the friendship paradox has no effect thanks to the built-in debiasing feature of
08:24.840 --> 08:28.400
SGNS Word2Vec.
08:28.400 --> 08:33.840
This realization leads us to Residual2Vec.
08:33.840 --> 08:41.120
The key idea is to model the baseline probability explicitly to control what bias to remove
08:41.120 --> 08:43.160
in the embedding.
08:43.160 --> 08:47.880
So how can we model the baseline more specifically?
08:47.880 --> 08:54.880
We start from the given graph and randomize the structure, then generate a sequence using
08:54.880 --> 09:01.600
random walks, then calculate the conditional probability as the baseline, which is based
09:01.600 --> 09:08.080
on the idea that we should remove biases arising from the trivial structure.
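These steps can be sketched in code. The sketch below is a simplified illustration, not the authors' implementation: it randomizes the graph with degree-preserving double edge swaps (a configuration-model-style null) and estimates the baseline visit probabilities by running a random walk on the rewired graph. All names are illustrative.

```python
import random
from collections import Counter

def degree_preserving_rewire(edges, n_swaps, rng):
    """Randomize an undirected edge list with double edge swaps:
    (a,b),(c,d) -> (a,d),(c,b). Every node keeps its degree, so the
    rewiring destroys only the structure beyond the degrees.
    (This sketch tolerates multi-edges for brevity.)"""
    edges = [tuple(e) for e in edges]
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) == 4:  # avoid creating self-loops
            edges[i], edges[j] = (a, d), (c, b)
    return edges

def baseline_visit_prob(edges, n_steps, rng):
    """Estimate how often a random walk on the (rewired) graph visits
    each node; these frequencies serve as the removable baseline."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    node = next(iter(adj))
    visits = Counter()
    for _ in range(n_steps):
        node = rng.choice(adj[node])
        visits[node] += 1
    return {v: c / n_steps for v, c in visits.items()}

rng = random.Random(0)
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
null_edges = degree_preserving_rewire(edges, n_swaps=50, rng=rng)
baseline = baseline_visit_prob(null_edges, n_steps=10_000, rng=rng)
```

The conditional probabilities estimated from such randomized walks form the baseline that Residual2Vec subtracts, so only the residual structure beyond the trivial (degree-induced) part is embedded.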
09:08.080 --> 09:12.640
This debiasing feature is useful for predicting links in the graph.
09:12.640 --> 09:20.320
Residual2Vec performs the best or nearly the best on all six graphs from different domains.
09:20.320 --> 09:27.360
Furthermore, Residual2Vec is the best or the second best performer for a community detection
09:27.360 --> 09:29.360
benchmark.
09:29.360 --> 09:36.280
To showcase the debiasing feature, we constructed a citation graph of journals using the
09:36.280 --> 09:43.800
Web of Science, where the nodes are journals connected by undirected and weighted
09:43.800 --> 09:45.800
citations.
09:45.800 --> 09:51.640
When applying GloVe embedding, the old journals are concentrated in the center, reflecting
09:51.640 --> 09:55.040
temporal aspects of the journals.
09:55.040 --> 10:01.440
This is because the old journals have had time to accumulate many citations, and are therefore
10:01.440 --> 10:04.840
well connected to many different journals.
10:04.840 --> 10:10.720
Subject-wise, GloVe separates different fields to some extent.
10:10.720 --> 10:15.560
With Residual2Vec, we can remove the biases due to time.
10:15.560 --> 10:21.720
In effect, the old journals now spread out, and the disciplinary separations are more
10:21.720 --> 10:23.920
clearly visible.
10:23.920 --> 10:29.480
Beyond eyeballing the embeddings, we test the embeddings quantitatively by predicting
10:29.480 --> 10:35.160
the journal impact factor as well as the subject categories.
10:35.160 --> 10:41.280
We find that the impact factor and the subjects of journals can be well predicted by removing
10:41.280 --> 10:46.560
the temporal biases as well as the friendship paradox effect.
10:46.560 --> 10:54.600
In summary, we show that Word2Vec has a built-in debiasing feature attributed to negative sampling.
10:54.600 --> 11:00.320
Inspired by this finding, we propose Residual2Vec that can negate other types of structural
11:00.320 --> 11:02.320
biases.
11:02.320 --> 11:08.360
We demonstrate that removing biases not only improves the performance, but also enables
11:08.360 --> 11:13.400
us to control the biases in the final representation.
11:13.400 --> 11:19.480
Our results highlight a new potential of negative sampling as a way to mitigate biases
11:19.480 --> 11:27.200
in representations, which may be useful to address the problem of the biases in AI.
11:27.200 --> 11:33.320
Although we have not studied the biases in AI, given the wide usage of negative sampling
11:33.320 --> 11:39.880
to train AI, our approach may lead to methods and studies that expose and mitigate biases
11:39.880 --> 11:41.920
in AI.
11:41.920 --> 11:47.720
We believe that our approach contributes to the effort to create transparent and accountable
11:47.720 --> 11:53.520
machine learning methods, especially because our method enables us to explicitly control
11:53.520 --> 11:57.160
the biases in the graph representation.
11:57.160 --> 12:03.520
That's all for the presentation, and finally I'd like to acknowledge Jisung Yoon, Isabel
12:03.520 --> 12:11.280
Constantino, and Yong-Yeol Ahn for creating and adding momentum to this project for years,
12:11.280 --> 12:15.160
and to thank all of you who watched this video.
12:15.160 --> 12:19.640
If you want to know more in detail, please check out our paper.
12:19.640 --> 12:27.880
Thanks!