So for each query q in Q, we can consider only | |
the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is | |
modified to mask the current token (except at the first position), because it will give a query and a key equal (so | |
very similar to each other). |