WEBVTT
00:00.000 --> 00:10.880
How many friends do you have?
00:10.880 --> 00:13.480
At least you have more friends than I do.
00:13.480 --> 00:15.960
Well, on average.
00:15.960 --> 00:18.280
Don't get me wrong, I am not saying this out of self-pity.
00:18.280 --> 00:23.920
This is a mathematical fact known as the friendship paradox.
00:23.920 --> 00:30.800
Suppose we have two persons: A, who has one friend, and B, who has three friends.
00:30.800 --> 00:36.520
Now let me ask: in whose friend list am I more likely to appear?
00:36.520 --> 00:42.400
Because B has three times more friends, I am three times more likely to appear in
00:42.400 --> 00:45.600
B's friend list.
00:45.600 --> 00:52.140
The friendship paradox dictates that, on average, your friends have more friends than you do.
00:52.140 --> 00:58.280
The more friends someone has, the more likely they are to appear in your friend list.
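The paradox described above can be checked with a quick simulation; the toy graph below is my own illustration, not from the talk.

```python
# Friendship paradox on a toy graph: compare the average degree of a
# random node with the average degree of a random node's random friend.
import random

random.seed(0)

# Star-like toy graph: node 0 has three friends; nodes 1, 2, 3 have one each,
# matching the A-and-B example in the talk.
friends = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# Average number of friends of a uniformly random node.
avg_degree = sum(len(f) for f in friends.values()) / len(friends)  # 1.5

# Average number of friends of a randomly sampled friend.
samples = []
for _ in range(10000):
    node = random.choice(list(friends))
    friend = random.choice(friends[node])
    samples.append(len(friends[friend]))
avg_friend_degree = sum(samples) / len(samples)

# On average, your friends have more friends than you do.
print(avg_degree, avg_friend_degree)
```

Here the expected degree of a sampled friend is 2.5, well above the average degree of 1.5, because the high-degree node 0 appears in three friend lists.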
00:58.280 --> 01:04.120
Beyond being an interesting piece of trivia, the friendship paradox has substantial importance
01:04.120 --> 01:10.040
because it may introduce biases in graph embeddings.
01:10.040 --> 01:15.680
Hello everyone, my name is Sadamori Kojaku, and we will walk you through a new insight
01:15.680 --> 01:21.340
into biases in graph embedding arising from the friendship paradox.
01:21.340 --> 01:26.160
Graph embedding is a technique to map a graph into a vector space that reflects
01:26.160 --> 01:28.360
the structure of the graph.
01:28.360 --> 01:34.040
A widespread paradigm is the approach based on Word2Vec.
01:34.040 --> 01:39.480
In this approach, one somehow generates sequences of nodes from the graph.
01:39.480 --> 01:45.600
The nodes in these sentences are then mapped to a vector space by Word2Vec.
01:45.600 --> 01:52.360
Now the key is that Word2Vec does not learn from the graph directly, but through the sentences
01:52.360 --> 01:55.360
generated from the graph.
01:55.360 --> 02:01.120
Unlike word embedding, where the input sentences are the actual data, for graph embedding
02:01.120 --> 02:07.640
the input sentences are artificially generated, and how to generate them is a critical modeling
02:07.640 --> 02:08.640
decision.
02:08.640 --> 02:15.280
This leads us to the question of how to generate the sentences from the graph.
02:15.280 --> 02:20.160
A common way is to use random walks.
02:20.160 --> 02:28.560
The walker starts from a node in the graph, and this node is the first node in the sentence.
02:28.560 --> 02:32.840
Then the walker moves to one of the neighbors, selected at random.
02:32.840 --> 02:35.920
This new node is added to the sentence.
02:35.920 --> 02:43.320
By repeating this process, we can generate a sentence of nodes from the graph.
02:43.320 --> 02:48.940
The friendship paradox comes into play when the walker follows an edge.
02:48.940 --> 02:53.320
It is more likely to visit a node with many neighbors.
02:53.320 --> 02:58.800
In other words, following edges is a biased sampling process that preferentially leads random
02:58.800 --> 03:02.400
walkers to nodes with many neighbors.
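The sentence-generation process and the degree bias it introduces can be sketched as follows; the toy graph and function names are my own, not from the talk.

```python
# Generate "sentences" of nodes by simple random walks, then count how
# often each node is visited: visit frequency grows with node degree.
import random
from collections import Counter

random.seed(0)

# Toy graph as an adjacency list; node 0 is a hub with degree 4.
graph = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}

def random_walk(graph, start, length):
    """One sentence: a walk that records each visited node in order."""
    sentence = [start]
    for _ in range(length - 1):
        sentence.append(random.choice(graph[sentence[-1]]))
    return sentence

sentences = [random_walk(graph, start, 20) for start in graph for _ in range(200)]
visits = Counter(node for s in sentences for node in s)
total = sum(visits.values())

# The hub is one of five nodes (20%) but claims a far larger visit share.
print({node: round(visits[node] / total, 2) for node in sorted(visits)})
```

This reflects a standard fact about random walks on undirected graphs: the stationary visit probability of a node is proportional to its degree, which is exactly the friendship-paradox bias described above.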
03:02.400 --> 03:07.640
To see this effect, let us consider a graph with core-periphery structure, where core nodes
03:07.640 --> 03:10.600
have more neighbors than peripheral nodes.
03:10.600 --> 03:15.560
A sentence can be generated from this graph by running a random walk.
03:15.560 --> 03:21.200
Now, the core nodes are about 20% of the nodes in the graph.
03:21.200 --> 03:26.200
But when looking at the generated sentence, the core nodes are overrepresented, which is
03:26.200 --> 03:30.160
because of the bias due to the friendship paradox.
03:30.160 --> 03:38.160
The fact that the sentence is biased by the friendship paradox leads us to our main question.
03:38.160 --> 03:41.760
Does the sampling bias have a negative impact?
03:41.760 --> 03:44.360
If so, how can we fix it?
03:44.360 --> 03:50.920
Surprisingly, it has no effect, because Word2Vec itself has an overlooked built-in debiasing
03:50.920 --> 03:56.440
feature that happens to negate the bias due to the friendship paradox.
03:56.440 --> 04:03.040
This built-in debiasing feature can be easily utilized to negate other types of biases,
04:03.040 --> 04:06.640
and we demonstrate how to do this.
04:06.640 --> 04:10.280
Our starting point is a sentence of words.
04:10.280 --> 04:17.480
Word2Vec picks a word, called the center, and its surrounding words, called the context, and then models the
04:17.480 --> 04:24.400
conditional probability using a softmax function, where the conditional probability is reflected
04:24.400 --> 04:28.880
as the dot similarity of the two word vectors.
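Written out, the softmax model described here is the standard skip-gram form (my notation, not from the talk: $u_i$ is the vector of the center word $i$, $v_j$ the vector of a context word $j$):

```latex
P(j \mid i) = \frac{\exp(u_i^\top v_j)}{\sum_{j'} \exp(u_i^\top v_{j'})}
```

The sum in the denominator runs over all unique words in the corpus, which is the costly normalization constant.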
04:28.880 --> 04:35.240
We want to fit this model to the data, but it is computationally challenging due to the
04:35.240 --> 04:42.080
normalization constant, which extends over all unique words in the corpus.
04:42.080 --> 04:46.160
A common way to reduce this burden is negative sampling.
04:46.160 --> 04:53.720
Now, it is often underappreciated that negative sampling is actually a simplified version
04:53.720 --> 04:56.800
of noise contrastive estimation.
04:56.800 --> 05:04.920
And it is this simplification that gives rise to an interesting feature of Word2Vec.
05:04.920 --> 05:09.880
How does noise contrastive estimation, or NCE, work?
05:09.880 --> 05:16.840
NCE samples k random contexts from a so-called noise distribution.
05:16.840 --> 05:23.400
This noise distribution is roughly proportional to the frequency of a word in the corpus.
05:23.400 --> 05:30.240
The random contexts are labeled as 0, and the actual context is labeled as 1.
05:30.240 --> 05:37.160
Then NCE calculates the probability that a word comes from the actual data using a Bayesian
05:37.160 --> 05:39.040
framework.
05:39.040 --> 05:46.040
By putting the prior and the likelihood together, we get a posterior like this.
05:46.040 --> 05:52.320
This function is a sigmoid function and takes the dot similarity and the noise distribution
05:52.320 --> 05:54.400
as arguments.
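In symbols, this posterior is the standard NCE sigmoid, whose argument combines the dot similarity with the noise distribution $p_0$ (my notation; $k$ is the number of noise samples):

```latex
P(y = 1 \mid i, j) = \sigma\bigl(u_i^\top v_j - \ln k\, p_0(j)\bigr),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```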
05:54.400 --> 06:01.160
Now the key feature of NCE is that it is asymptotically unbiased for the Word2Vec
06:01.160 --> 06:03.000
model.
06:03.000 --> 06:08.160
Meaning, if the data is actually generated from this model, and we increase the number
06:08.160 --> 06:14.240
of training samples, then the embedding vectors converge to the true vectors.
06:14.240 --> 06:20.560
Beyond Word2Vec, noise contrastive estimation is also an unbiased estimator for a more general
06:20.560 --> 06:27.680
model that takes a real-valued function f instead of the dot similarity.
06:27.680 --> 06:33.400
Now, negative sampling simplifies noise contrastive estimation.
06:33.400 --> 06:40.640
It estimates the same probability, but simply drops the term involving the noise distribution.
06:40.640 --> 06:45.000
You might be wondering what happens without this term.
06:45.000 --> 06:50.880
To see this, we rewrite it in the form of noise contrastive estimation, where we define
06:50.880 --> 06:59.480
a new function f', which combines the original function f with the noise distribution.
06:59.480 --> 07:08.640
This is asymptotically unbiased for a probability model which now includes the noise distribution.
07:08.640 --> 07:16.160
So, all in all, Word2Vec trained with skip-gram negative sampling is asymptotically unbiased for
07:16.160 --> 07:23.440
this probability model, or, more specifically for Word2Vec, this function.
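Concretely, dropping the noise term means negative sampling fits $\sigma(f)$ rather than $\sigma(f - \ln k\, p_0(j))$, which is equivalent to running NCE with $f'(i,j) = f(i,j) + \ln k\, p_0(j)$. For the Word2Vec case $f(i,j) = u_i^\top v_j$, SGNS is therefore asymptotically unbiased for (my notation, following the standard analysis of negative sampling):

```latex
P(j \mid i) \;\propto\; p_0(j)\, \exp(u_i^\top v_j)
```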
07:23.440 --> 07:30.840
In this model, the noise distribution offsets the modeled probability, serving as a baseline.
07:30.840 --> 07:35.720
The embedding vectors capture the residual from the baseline.
07:35.720 --> 07:41.760
Now, recall that the baseline probability is roughly proportional to the frequency.
07:41.760 --> 07:48.160
Therefore, the embedding vectors capture information other than the frequency.
07:48.160 --> 07:57.320
In other words, SGNS Word2Vec has a built-in debiasing feature for frequency bias.
07:57.320 --> 08:01.800
Now let us revisit the friendship paradox.
08:01.800 --> 08:08.280
The sampling bias due to the friendship paradox is that the frequency of a node is determined
08:08.280 --> 08:12.000
largely by the degree of the node.
08:12.000 --> 08:17.480
Notice that this frequency is exactly what the baseline probability accounts for.
08:17.480 --> 08:24.840
Therefore, the friendship paradox has no effect, thanks to the built-in debiasing feature of
08:24.840 --> 08:28.400
SGNS Word2Vec.
08:28.400 --> 08:33.840
This realization leads us to Residual2Vec.
08:33.840 --> 08:41.120
The key idea is to model the baseline probability explicitly, to control which bias to remove
08:41.120 --> 08:43.160
in the embedding.
08:43.160 --> 08:47.880
So how, specifically, can we model the baseline?
08:47.880 --> 08:54.880
We start from the given graph and randomize its structure, then generate sentences using
08:54.880 --> 09:01.600
random walks, then calculate the conditional probability as the baseline, which is based
09:01.600 --> 09:08.080
on the idea that we should remove biases arising from trivial structure.
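As a minimal sketch of such a baseline: under the simplest randomization, a degree-preserving configuration model, a random walk visits node j with probability proportional to its degree, independent of the current node. The toy graph and names below are my own illustration; the paper's baseline models are more general.

```python
# Baseline probability under a degree-preserving randomization:
# a random walk on the randomized graph visits node j with probability
# deg(j) / 2m, regardless of the walker's current position.

graph = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}

degree = {node: len(neigh) for node, neigh in graph.items()}
two_m = sum(degree.values())  # twice the number of edges

# Baseline conditional probability P0(j | i) = deg(j) / 2m for every i.
baseline = {node: deg / two_m for node, deg in degree.items()}
print(baseline)  # the hub, node 0, gets the largest baseline probability
```

Embedding the residual relative to this baseline then removes exactly the degree-driven (friendship-paradox) part of the visit frequencies.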
09:08.080 --> 09:12.640
This debiasing feature is useful for predicting links in the graph.
09:12.640 --> 09:20.320
Residual2Vec performs the best or nearly the best on all six graphs from different domains.
09:20.320 --> 09:27.360
Furthermore, Residual2Vec is the best or the second-best performer on a community detection
09:27.360 --> 09:29.360
benchmark.
09:29.360 --> 09:36.280
To showcase the debiasing feature, we constructed a citation graph of journals using the
09:36.280 --> 09:43.800
Web of Science, where the nodes are journals connected by undirected and weighted
09:43.800 --> 09:45.800
citation edges.
09:45.800 --> 09:51.640
When applying the GloVe embedding, old journals are concentrated in the center, reflecting
09:51.640 --> 09:55.040
temporal aspects of the journals.
09:55.040 --> 10:01.440
This is because old journals have had time to accumulate many citations, and are therefore well
10:01.440 --> 10:04.840
connected to many different journals.
10:04.840 --> 10:10.720
Subject-wise, GloVe separates different fields to some extent.
10:10.720 --> 10:15.560
With Residual2Vec, we can remove the biases due to time.
10:15.560 --> 10:21.720
In effect, the old journals now spread out, and the disciplinary separations are more
10:21.720 --> 10:23.920
clearly visible.
10:23.920 --> 10:29.480
Beyond eyeballing the embeddings, we test them quantitatively by predicting
10:29.480 --> 10:35.160
the journal impact factor as well as the subject categories.
10:35.160 --> 10:41.280
We find that the impact factor and the subject of journals can be well predicted by removing
10:41.280 --> 10:46.560
the temporal biases as well as the friendship paradox effect.
10:46.560 --> 10:54.600
In summary, we show that Word2Vec has a built-in debiasing feature attributed to negative sampling.
10:54.600 --> 11:00.320
Inspired by this finding, we propose Residual2Vec, which can negate other types of structural
11:00.320 --> 11:02.320
biases.
11:02.320 --> 11:08.360
We demonstrate that removing biases not only improves the performance, but also enables
11:08.360 --> 11:13.400
us to control the biases in the final representation.
11:13.400 --> 11:19.480
Our results highlight a new potential of negative sampling as a way to mitigate biases
11:19.480 --> 11:27.200
in representations, which may be useful for addressing the problem of biases in AI.
11:27.200 --> 11:33.320
Although we have not studied the biases in AI, given the wide usage of negative sampling
11:33.320 --> 11:39.880
to train AI, our approach may lead to methods and studies that expose and mitigate biases
11:39.880 --> 11:41.920
in AI.
11:41.920 --> 11:47.720
We believe that our approach contributes to the effort to create transparent and accountable
11:47.720 --> 11:53.520
machine learning methods, especially because our method enables us to explicitly control
11:53.520 --> 11:57.160
the biases in the graph representation.
11:57.160 --> 12:03.520
That's all for the presentation. Finally, I'd like to acknowledge Jisung Yoon, Isabel
12:03.520 --> 12:11.280
Constantino, and Yong-Yeol Ahn for creating and adding momentum to this project for years,
12:11.280 --> 12:15.160
and to thank all of you who watched this video.
12:15.160 --> 12:19.640
If you want to know more details, please check out our paper.
12:19.640 --> 12:27.880
Thanks!