Okay, so welcome back to today's lecture. What we want to talk about is speech translation; we will have two lectures this week on speech translation, and then in the last week some exercises and repetition. We want to look at what there is to do when we want to translate speech, so we want to address the specific challenges that occur when we switch from translating text to translating speech.

Today we will look at the more general picture: what the challenges are and how we can build such systems. We will cover first the cascaded approach, and then secondly the end-to-end approach, where we put in audio and directly generate the translation. These are the two main dominant system types used in research and in commercial systems.

More generally, what is the task of speech translation? The idea is that we have speech in one language, and we want a system that takes this audio and translates it into another language.

With speech, the output modality is no longer as clear as for text. You can either produce a more textual translation, as with subtitles, or you may want the output as audio, as is done in human interpretation. There is not the one best solution where one option is always better; it heavily depends on the use case and on what people prefer. For example, if you know the source language a bit but are unsure and don't understand everything, text output may be preferable, because you can direct your attention to what was said and only check the translation when you are unsure. In other situations it might be preferable to have fully spoken output.

Both exist, but for a long time automatic systems focused mainly on text output. Of course, you can always hand the text to a text-to-speech system, which generates audio from it.

Why should we care about this? Why should we do it? The nice thing is that in a globalized world we are now able to interact with a lot more people: we can attend conferences around the world, we can travel around the world, and via the Internet we can watch movies and TV from all over the world. However, there is still a barrier: videos are mostly available either in English or in the original language. So what currently happens, in order to reach a large audience, is that everybody speaks English. If we go, for example, to conferences, these are international conferences;
however, everybody will then speak English, since that is the common language that everybody understands. On the other hand, we cannot have human interpreters everywhere; you have that maybe in the European Parliament or in important business meetings, but it is relatively expensive. So the question is: can we enable communication in your mother tongue without having to rely on human interpretation? Speech translation can be helpful to bridge this gap.

There are different scenarios in which you can apply speech translation. Speech is typically more interactive than the settings in which text translation is most commonly used; of course, nowadays there are things like chat where text translation can also be interactive. Speech translation, in contrast, is less static, and there are different ways of organizing it.

One scenario is what is called consecutive translation, where you first get an input, then you translate this fixed input, and then the conversation continues. This means you always have fixed chunks to translate; you don't need to puzzle out where the boundaries are or where a sentence ends. Also, there is no overlapping: there is always one person's sentence being translated at a time. Of course, this has the disadvantage that it makes the conversation a lot longer, because speech and translation always alternate.

For example, if you used this for a presentation, it would get quite long: imagine me standing here in the lecture, saying three sentences, waiting for the interpreter to translate them, then saying the next two sentences, and so on. That is why this mode is used in situations like a direct conversation with a patient, where turns are short anyway; but even there it can end up taking very long.

That is why there is also research on simultaneous translation, where the idea is to translate in parallel. That is what is done in human interpretation: think of the European Parliament, where speakers do not speak one sentence at a time but simply give their speech, and in parallel human interpreters translate it into another language. The same setup is interesting for automatic speech translation, where we generate the translation in parallel.

The challenge then, of course, is that we need to segment the speech into some kind of chunks.
In text, we essentially just looked for the dots; we saw that there are some pitfalls — a dot does not always mark a sentence end — but in general, finding sentence boundaries in text is not really a research question. In speech translation, this is not that easy. Detecting boundaries in the audio is difficult, because we do not reliably pause where a sentence ends. And even if you have the transcript and need to add the punctuation, this is not as simple as it sounds.

Another question is how many speakers we have. In presentations you have more like a single speaker; that is normally easier from the audio-processing side. So in general, in speech translation you can face challenges in different components: in addition to the translation, you have the audio processing. And if you don't have, for example, a single speaker, there are significant additional challenges. We humans are very good at filtering out noise, or, if two people speak in parallel, at separating the two speakers and listening to one of them. However, doing that with automatic systems — separating the speakers so each can be transcribed — is very challenging.

Furthermore, in a multi-speaker scenario the speech is typically also less well prepared, so you get what we will call spontaneous effects: people stop in the middle of a sentence, change their sentence, and so on, and filtering these disfluencies out of the text and working with them is often very challenging.

So these are all additional challenges when you have multiple speakers. Then there is the question of an online or offline system. Text translation is mostly offline: you can take the whole text and translate it in a batch. For speech translation there are also several scenarios where this is the case; for example, when you are translating a movie, it's not only that you don't have to do it live — you can take the whole movie as input. However, there are also a lot of situations where you don't have this opportunity, like lectures or sports. You don't want to first record a sports event and then show the game three hours later with translations; by then there is not really any interest. So you have to do it live, and then you have the additional challenge of building an online system.

There are several aspects to this. On the one hand, it needs to be real-time translation:
if processing takes longer than the speech itself, you are getting more and more delayed. That may seem simple, but there have been research systems that run several times slower than real time, because they want to show what is possible with the best current models.

But even that is not enough. You can have a system that runs several times faster than real time — processing a second of audio in less than a second — and it might still not be useful. The other question is latency: how much time passes before you can produce an output. It might be that on average you can keep up, but you cannot produce output directly; maybe you need the full context of thirty seconds before you can output anything, and then you have a large latency. So it can happen that you process the audio as fast as it is produced, but still have to wait until the full context is available. We will look on Thursday at how we can generate translations with low latency. You can imagine that in German, for example, this is quite challenging, since the verb often comes at the end — if you use the perfect tense, as in "habe ... gesehen" — while in English you have to produce the verb right away. So if you really want the correct translation, you might need to wait until the end of the sentence.

Besides that, an offline system gives you additional help. I think last week you talked about context-based systems, which typically use context from the past but maybe also from the future. In the online case you cannot use anything from the future, but you can still use the past.

Finally, there is the question of how you want to present the output to the audience. If you run a speech synthesis system on top of the translation output, further questions arise: how should it be spoken? You can do things like voice cloning, so that the output even has the same voice as the original speaker.

And whether you do subtitles or dubbing, there may be additional constraints. Think about subtitles: they should be readable, and we often speak faster than people can read. So you might need to shorten your text. People say that a subtitle can have two lines, and each line can have a certain number of characters. So if you have too long a text, you may need to shorten it to fit.
Similarly, if you think about dubbing: if you want to produce a dubbing voice, then the translation has to fit the timing of the original speech.

There is another issue: we have different settings, like more formal and less formal ones, and these require different styles. If you think about the United Nations, you may want more formal language, while between friends it is less formal — and there are languages that mark this distinction explicitly. [Student question.] That is certainly an important research question, but I would think of it more generally: it is important in text translation too. If you translate a letter to your boss, it should sound different from a message to a friend.

So there is the question of how you can do this style transfer and how you can control it — for example, whether you can specify the style you want. You can tag the sentence to generate, say, an informal style, because, as you correctly said, this is especially challenging in these situations. Of course, there are ways of being more or less formal in English, but it's not as clear-cut as in German, where you have the du/Sie distinction. So there is no one-to-one mapping. If you want to handle that, you can build a system that generates different styles in the output; so yes, that is definitely also a challenge — it's just not listed here because it's not specific to speech. Generally, these are all challenges of how to customize and adapt systems to use cases with specific requirements.

Speech translation has been worked on for quite a while, and it is maybe not surprising that it started with simpler use cases. People first looked into, for example, limited-domain translation; the tourist domain was a typical application — the phrases you need when going to a new city. Then there were several efforts at open-domain translation, especially for settings like parliamentary speeches, where there is a lot of data, so you could build systems that are more open-domain — though it is still a bit restricted: in the European Parliament people talk about many things, but not about everything, so such systems are not usable for everything.

Nowadays we see this technology in a lot of different situations — I guess you have used it yourselves. There are basic technologies you can already use. But there are still a lot of open questions, for instance when you go to really spontaneous meetings. These systems typically work well for languages where we have a lot of training data, but if you want to go to really low-resource languages, things are often challenging.
Last week, for example, there was the workshop on spoken language translation (IWSLT), which has a low-resource track that includes dialects and similar varieties, and all these languages can still have significantly lower performance than high-resource ones.

So how does this work? If we want to do speech translation, there are three basic technologies. On the one hand there is automatic speech recognition (ASR), which transcribes audio into text. Then there is what we talked about in this course: machine translation (MT), which takes text input and translates it into the target language. And finally there is speech synthesis, if we want spoken output.

The very simple model, if you think about it, is just the combination of these. We have solutions for all these parts in isolation — people are working on all these problems — so if we want to do speech translation, maybe we can just put the components together. Then you get what is called a cascaded system: you take your audio, the ASR takes it as input and generates the transcript, and then you take this text output and put it into the MT system.

In this way you have a solution for speech translation, and this type of system is called cascaded. It still often reaches state-of-the-art performance; however, it has benefits and disadvantages. One big benefit is that we have independent components, and some of that is nice: if the ASR community puts great ideas into recognition, you can use them in your ASR component; and if people develop a new, better way of doing machine translation, you can also take that model. So you can leverage improvements from all the different communities.

Furthermore — and since all of this is machine learning, this may be the biggest advantage — we have training data for each individual component. There is a lot less training data where you have, say, English audio paired with German text, so it is much easier to train the components separately.

The one aspect we will focus on when talking about the cascaded approach is that the components often do not fit together perfectly. You need to adapt each component a bit so that it handles the kind of input it actually receives. So we will focus especially on how to combine them: if you directly feed ASR output into an MT system, it might not work as well as you would hope. So a major challenge when building a cascaded speech translation system is: how can we adapt these components, and how can we make them work well together?
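As a toy illustration of the cascaded idea, here is a minimal sketch in Python that chains an ASR model and an MT model via the Hugging Face pipelines. The concrete checkpoint names are just examples from the public model hub, not the systems discussed in this lecture, and the naive dot-based splitting stands in for the re-segmentation component discussed below:

```python
# Minimal cascaded speech translation sketch: ASR -> (segmentation) -> MT.
# Model names are illustrative; any ASR and MT checkpoints usable with
# these pipeline types would work. Assumes: pip install transformers torch
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def speech_translate(wav_path: str) -> str:
    transcript = asr(wav_path)["text"]  # source-language text
    # In a real system a re-casing/re-punctuation/segmentation step goes
    # here, producing sentence-like units the MT system was trained on.
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    translations = [mt(s)[0]["translation_text"] for s in sentences]
    return " ".join(translations)

print(speech_translate("talk.wav"))
```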
So why is this tricky? At first glance it looks quite nice and seems very reasonable: you have some audio, you put it into your ASR system, and you feed the transcript to your MT system. However, this needs a bit more thought, because what you speak is not the same as written text. In particular, ASR output rarely has punctuation in it, while the MT system assumes it gets a full sentence, without disfluencies. So we need to bridge the gap between the ASR output and the MT input, and we might need an additional component for that.

That is typically what is referred to as a re-casing and re-punctuation system. The idea is that it is good to have something like an adapter in between, which tries to adapt the speech transcript to what the MT system expects. That can happen at different levels; it might even involve rephrasing. If you think of a sentence with a false start — when speaking, you sometimes notice mid-sentence that you want to say it differently and restart — then you might want to delete the false start, because in the written translation you don't want to read it.

Why are casing and punctuation important? One important aspect, directly relevant for the segmentation challenge, is that speech is just a continuous stream of words: when speaking, punctuation marks and casing are simply not there in the natural signal. However, they are of course important. First, they matter for readability: if you have ever read a text without punctuation marks, you know you need more time to process it. They are sometimes even semantically important — "Let's eat, Grandpa" versus "Let's eat Grandpa" is a big difference. For humans such cases are usually easy to disambiguate, but doing it automatically is harder. And finally, in our case: if we want to do machine translation, we normally operate sentence-wise — we always feed the system one sentence after the next. If you want to do speech translation of a continuous stream, you first have to decide what your units are.

The easiest and most straightforward solution is: if you have a system that inserts punctuation marks into the continuous transcript, it is easy to split the text into sentences. Then we can reuse our MT system and run a normal, sentence-based system on this continuous input.

These are somewhat older numbers, but they show you how important this is. The best case is translating the human transcript, which gives the highest BLEU score. If you instead translate the ASR output with a simple pause- and length-based segmentation, you lose some of that. If you then use the correct segmentation, as taken from the reference, you gain roughly one BLEU point, and with the reference punctuation another point on top.
So you see that in total you gain nearly two BLEU points just by having the correct segmentation. This shows that it is important to estimate the segmentation as well as possible: even with exactly the same ASR errors in the transcript, better segmentation alone yields a notably better translation.

Note that this is an oracle experiment, done by looking at the reference — which is not unusual as an analysis in machine translation. You take the ASR transcript and segment and punctuate it according to the reference, just to quantify how important these factors are: it tells you what would be possible if our punctuation model were optimal. Of course, this is not how we can do it in reality, because we don't have access to the reference. You might ask: why do it at all? Because it shows the upper bound of what is achievable.

And that is why a typical system does not only consist of the ASR and MT components, but has this segmentation step in between, in order to give the MT system well-formed input; often you would even prefer a system with better segmentation over one with a slightly better average recognition quality.

So the task of segmentation is to re-segment the ASR text into what are called sentence-like units, and also to assign case information. The casing part is more of a traditional concern, because for a long time ASR systems did not provide case information; nowadays a good ASR system may directly provide it, so that part may no longer be necessary.

How can that be done? There are different approaches. One option — long the most common one — is to keep the MT training data as it is and handle casing and punctuation in a separate step before translation. Alternatively, you can train the MT system itself to handle such input: you can easily remove case and punctuation information from your training data and then train a system that translates from non-cased, non-punctuated input — or even combine the two steps into one, so that you directly translate from raw ASR output into cased, punctuated target text. On the other hand, that is more challenging, and you still need some kind of segmentation into translation units. What happens more and more by now is that the ASR system itself directly provides segmentation and case information, or that ASR and segmentation are combined into one system.

[Student question.] Yes — what we come to later today is going directly from audio to text in the target language; that is what is referred to as an end-to-end system. This is still more often done for text output, but there are also end-to-end systems that directly generate audio.
With audio output you have additional challenges, even in measuring whether the output is correct: for text you can compare against a reference word by word, but evaluating an audio signal is even harder. That's why it is currently mostly speech-to-text — but still one single system, of course.

So, how can you add this punctuation information? We will look into three approaches: you can treat it as a language-modeling problem, as a sequence-labeling problem, or as a monolingual translation problem.

Let's start with a little bit of history. One of the first ideas was to do it mainly based on a language model: how probable is it that there is a punctuation mark at a given position? That was done with old-style n-gram language models. For example, with an n-gram language model you can calculate the score of "Hello, how are you" versus "Hello how are you", compare the probabilities, and take the variant with the highest probability. You might add heuristics on top, like: if there is a very long pause, you insert a boundary anyway. So this is a very simple model, which only computes language-model probabilities, and yet it has clear advantages.

In general — and this may be interesting — most of the systems, even the advanced ones, really focus purely on the text. If you think about how to insert punctuation marks, your first idea might have been to use pause information. But interestingly, most systems focus on the text. There are several reasons; one is that it is easier to get training data, because you only need pure text data.

The next way to do it is to treat it as a sequence-labeling task. Then for every token you predict a label: nothing, comma, period, question mark, and so on. So the number of labels is the number of punctuation symbols, plus "none", for the basic version. Nowadays you would typically use something like BERT and train a classifier on top.

Any questions on that? [Student: wouldn't the labels be imbalanced?] Yes, you definitely have a label imbalance. But I think it works relatively well and I haven't seen that be a problem; it's not a completely hopeless imbalance — maybe twenty times more "no punctuation" labels. It can matter, especially for the rarer classes; the rarest are question marks. But at least for question marks you typically have very strong indicator words.
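To make the sequence-labeling formulation concrete, here is a minimal sketch using a BERT-style token classifier from the transformers library. The label inventory and the base checkpoint are illustrative assumptions, and the model below is untrained, so it would need fine-tuning on punctuated text before the outputs mean anything:

```python
# Punctuation restoration as token classification: for every input word,
# predict which punctuation mark (if any) should follow it.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["NONE", "COMMA", "PERIOD", "QUESTION"]  # assumed label set
MARKS = {"NONE": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

def restore_punctuation(words: list[str]) -> str:
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]      # (num_subwords, num_labels)
    pred = logits.argmax(-1).tolist()
    out = []
    for i, w in enumerate(words):
        sub = enc.word_ids().index(i)        # first subword of word i
        out.append(w + MARKS[LABELS[pred[sub]]])
    return " ".join(out)

print(restore_punctuation("hello how are you".split()))
```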
Then there is the third approach, which was used for quite a long time: we know how to do machine translation, so can we use it here? The idea is: can we just translate non-punctuated English into punctuated English, as a monolingual translation task, and do it correctly?

What you need is training data of this type, where the source side has no punctuation and the target side does. Of course, one half is already there — the target is just normal text; you have to make the source side realistically challenging. [Student remark.] Yes, that is true: compared to normal training data you have to do one thing more. Look at the example — it already looks different from normal training data. Why? Because at test time the input is a transcript of speech, with no sentence boundaries.

That's the point, and the first and easiest step: you have to randomly cut your sentences. Normally we have one sentence per line; if you took that directly as training data, the model would only ever see well-formed single sentences, which is not very helpful. So in order to build the training corpus for punctuation restoration, you randomly cut your sentences and then remove all the punctuation marks from the source side — because at test time there is no guarantee that a segment corresponds to a sentence; you get some random chunk out of the audio.

Then, once you have generated your punctuation marks, you can re-segment before going into the MT system. And that is an important point, which we will see is more challenging for end-to-end systems: we can change the segmentation. While you might have ten segments coming out of the ASR, you might only have five sentences going into the MT system. That can be more useful, because the MT system has to reorder words, and with the wrong segmentation you cannot reorder things from the beginning to the end of a sentence.
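A minimal sketch of this corpus construction — randomly re-chunking punctuated text so that the source side mimics arbitrary ASR segments. The chunk-length range is an arbitrary choice for illustration:

```python
# Build (unpunctuated chunk -> punctuated chunk) training pairs for a
# monolingual "translation" model that restores punctuation and casing.
import random
import re

def make_pairs(sentences: list[str], min_len: int = 5, max_len: int = 20):
    words = " ".join(sentences).split()   # forget the sentence boundaries
    pairs, i = [], 0
    while i < len(words):
        n = random.randint(min_len, max_len)       # random chunk length
        target = " ".join(words[i:i + n])          # punctuated, cased target
        source = re.sub(r"[.,!?;:]", "", target).lower()  # strip punct/case
        pairs.append((source, target))
        i += n
    return pairs

corpus = ["Hello, how are you?", "I am fine.", "See you tomorrow at noon."]
for src, tgt in make_pairs(corpus):
    print(src, "->", tgt)
```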
Okay, so much about segmentation — do you have any more questions about that?

Then there is one additional thing you can do, and that is dealing with recognition errors. When you get ASR input, there might be errors in there; it might not be perfect. So the question is: can we adapt to that? Can the MT system be improved by making it aware that its input comes from an ASR system and might not be correct? There are different ways of dealing with this. One is to use not just the best hypothesis but an n-best list: the idea is that you don't only tell the MT system "this is the transcript", but also "here I am not so sure". Or you can try to make the MT system more robust towards errors from the ASR system.

Interestingly — and I hope I can convince you that it might be a good idea to deal with ASR errors — if you look into a lot of systems, this is often ignored: they do not adapt their MT system to the ASR system at all. So there is no real error handling, and the interesting thing is that this often works just as well. One reason is this: if the ASR system makes an error, it is usually in a challenging situation, and then it is also really hard for the MT system to detect the error. If it were easy for the downstream system to detect the error, you would integrate that information into the ASR system itself and avoid it in the first place. That is not always the case, but it makes explicit error handling difficult, and that's why there are a lot of systems where it is not explicitly handled.

But of course it might help. One thing you can do is take an n-best list and translate every entry. Then you have two scores for each entry — the ASR probability and the MT probability — and you combine them and output the translation with the best combined score. The final choice might then no longer be the ASR 1-best: just as in beam search, one hypothesis may have the best ASR score while another has a better combined score. This sometimes works, but the problem is that the MT system may then tend to translate not the correct sentence but the one that is easier to translate.

You can also generate a more compact representation of this n-best list by using graphs, so-called lattices. Then you can try to do graph-to-text translation: you translate the lattice with all the alternatives the ASR system considered, each with its probability. So the highest-probability path might be "the conference is being recorded", but there are other paths where alternative words compete, and you can pass all of this information, with the probabilities, to the MT system.

Still, we will see this type of error propagation — an ASR error propagating into an MT error — and that is one of the main reasons why people looked into other ways of doing speech translation.
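A tiny sketch of the n-best rescoring described above: each ASR hypothesis is translated, and the output is chosen by a weighted combination of ASR and MT log-probabilities. The interpolation weight and the `mt_translate` interface are assumptions for illustration:

```python
# Rescoring an ASR n-best list with a combined ASR + MT score.
# `mt_translate(text) -> (translation, mt_logprob)` stands in for any MT
# system that returns its model score alongside the output.
import math

def translate_nbest(nbest, mt_translate, lam: float = 0.5):
    """nbest: list of (transcript, asr_logprob) pairs."""
    best, best_score = None, -math.inf
    for transcript, asr_lp in nbest:
        translation, mt_lp = mt_translate(transcript)
        score = lam * asr_lp + (1.0 - lam) * mt_lp  # combined log score
        if score > best_score:
            best, best_score = translation, score
    return best
```

Note that the winning translation need not come from the ASR 1-best hypothesis, which is exactly the point of the combination.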
Generally, a cascaded combination, as we have seen, has several advantages. The biggest is perhaps data availability: you can train the individual components on relatively large datasets. It is also a modular system, where you can improve each individual model, and if there are new developments you can swap better models in.

So there are several advantages, but of course there are also disadvantages. The most commonly named one is what is referred to as error propagation: if the ASR makes an error, the translation output will most probably contain an error too. And typically, an error in the transcript is easier for a human to ignore than the error it causes in the MT output. What does that mean? If the German ASR gets one letter of a word wrong, you will most probably still know what was said — maybe you don't even notice, because you read over it quickly and don't see that one letter is wrong. However, if that word is then translated, your English sentence about speeches suddenly says something about wines. So it is a lot easier to read over errors in the transcript than to read over the errors they cause in the translation.

But there are additional challenges in cascaded systems. Secondly, as we have seen, we optimize each component individually; this separate optimization does not mean that the overall performance is the best possible at the end. We have tried to address that by adapting the components to work well together, but it remains an issue.

Thirdly, there is computational complexity: you always need to run an ASR system and an MT system, and especially if it should be fast and real-time, running two systems is more challenging than running one.

And one final point that you might not think of immediately: most of the world's languages do not have a writing system. If a language has no script, then of course you cannot first transcribe and then translate; so to translate it at all, you need — as was mentioned before — to somehow build a system that takes the audio and directly generates text in the target language.

And there is quite a big opportunity for that now, because it used to require very different technologies. However, since we use neural machine translation with encoder-decoder models, the interesting thing is that ASR and MT now use very similar technology — very similar architectures in both cases. The main difference is the input.
But generally, how it's done is very similar, and therefore we might put everything together — and that is what is referred to as end-to-end speech translation. So we have one large neural encoder-decoder network; we put in audio in one language, and we get out text in the other language. We then have a single system that does the full process, and we don't have to care about the interfaces anymore.

If you think of it as before: in the cascade we had two separate encoder-decoder models, going through the discrete text representation in the source language. Instead, we can go through a continuous representation. The hope is that by not making discrete decisions in between, we don't commit to errors that we can only discover later; we can encode the uncertainty in the continuous representation and make the final decision only at the end.

So what we're doing is using a very similar technique: we still have the encoder-decoder model. But instead of getting discrete tokens as input, as we have with subwords, we get audio. The problem is that this input is continuous, so we have to see how we can work with continuous signals.

[Question to the audience:] What is the first thing your system does when it gets discrete input and encodes it? In neural machine translation you get a word as a one-hot vector, and the first layer of the machine translation model — yes, the word embedding — maps it to a continuous representation. So after the first layer everything is continuous anyway; if we now get continuous input directly, we can deal with it in much the same way. So that is not a big challenge. What is more challenging is the sequence length: the audio signal is ten times longer or so — you have many more time steps. So the challenge is how we deal with this kind of long sequence. The advantage is that the long sequence is only at the input and not at the output: remember from the efficiency lecture that long sequences are especially challenging in the decoder, but they matter for the encoder too.

So how can we process audio as input to a speech translation system? You can mainly follow what is done in an ASR system. You have the audio signal; then you measure the amplitude at every time step, with a sampling rate typically in the kilohertz range — sixteen kilohertz is common for speech.
And then you do windowing: you take windows of twenty to thirty milliseconds, overlapping so that you get one measurement every ten milliseconds, and you look at the signal inside each window. So in the end, for every ten milliseconds you have some type of representation — which type of representation you can choose. So instead of a letter or a word, you have a representation for every ten milliseconds of your signal.

How do we encode such a twenty-to-thirty-millisecond window? There are different ways. The traditional way is signal processing: from the audio signal you extract which frequencies are present. To do that you can compute mel-frequency cepstral coefficients, using Fourier transforms to determine which frequencies are there — you know that speech sounds differ by their frequencies. You apply this to each of the windows we had before: for each window you calculate which frequencies are in it and get features for this window, then features for the next window, and so on. These are the frequencies that occur there, and they help you model which sounds are being spoken.

More recently, instead of doing the traditional signal processing, you can also replace this with deep learning, using a self-supervised approach, as in language models, to generate features that describe the audio. So you take the raw signal, run convolutional neural networks over each chunk to get a first representation, then a transformer network on top, and in the end the training is similar to a masked language model: you try to predict what was masked (this is the idea behind models like wav2vec 2.0). So, in a similar way, you learn a good representation of the audio signal by prediction, and you get learned features instead of signal-processing-based ones.

Whichever you use, what is most important to remember for an end-to-end system is that, in the end, for every ten milliseconds you get a representation of the audio signal, which is again a vector. Then you can use your normal encoder-decoder model on top. So that is all that directly has to be changed, and then you can build your first baseline system.
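As a sketch of the traditional feature pipeline — 25 ms windows with a 10 ms shift and 80 mel bins are standard values, used here as assumptions — computing log-mel features with librosa:

```python
# Turn a waveform into one feature vector per 10 ms: log-mel filterbank
# features over 25 ms windows. These frames are what the speech encoder
# consumes in place of word embeddings.
import librosa
import numpy as np

wav, sr = librosa.load("talk.wav", sr=16000)  # resample to 16 kHz

mel = librosa.feature.melspectrogram(
    y=wav, sr=sr,
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms shift -> one frame per 10 ms
    n_mels=80,                   # 80 mel bins, a common choice
)
features = np.log(mel + 1e-10).T  # (num_frames, 80), ready for the encoder

print(features.shape)             # ~100 frames per second of audio
```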
Beyond the audio processing, you of course need data — audio in English paired with text in German, say — and then you can train.

And interestingly, it works. At the beginning the systems were maybe a bit worse, but we saw real progress. This is from the biggest workshop where people compared different systems, with a special challenge comparing cascaded to end-to-end systems. In 2018 there was quite a large gap between the cascaded and end-to-end systems; then it got narrower and narrower, and starting in 2020 the performance was essentially the same — there was no clear difference anymore. This gave a bit of hope: if we learn to build these end-to-end systems better, they might really become superior.

However, somewhat unsatisfyingly, this is how it has continued: not only in 2021, but even nowadays we can say there is no clear performance difference. It's not that one model type is better than the other; we see very similar performance.

So the question is, what makes the difference? Of course, this parity could only be achieved with new tricks. Yes and no — that's what we will mainly look into now: how can we make use of other types of data? You can gain performance through different types of training, and by making use of additional data sources. If you train the end-to-end system only on the very small corpora where you have direct speech-translation data — far less than you have for the individual tasks — it will not be competitive. That is the biggest challenge of an end-to-end system: the corpora are small, and therefore you need these techniques.

Of course, end-to-end systems also have several advantages. They have direct access to the audio information. That is interesting because you might not have everything modeled in the text: remember when we talked about biases — whether the speaker is male or female is not in the text any more, but in the audio signal it's still there.

It also helps with latency — we'll talk about that on Thursday. You have a somewhat better chance of achieving low latency with an end-to-end system, because you have only one system, and you don't have two systems that might have to wait for each other. And having one system might also be a bit easier to manage than making sure two systems work together.

The biggest challenge of end-to-end systems is the data: as was correctly pointed out, there is typically much less direct data. There is some data — for TED talks, for example, people did exactly this:
they took the English audio together with the existing translations of the talks. But in general there is a lot less direct data, so we'll look into how you can use other data sources.

And secondly, the other challenge is that we have to deal with audio — for example with the input length — and therefore it's also important to handle this in the network, maybe with dedicated solutions.

So in general we have this situation: we have a lot of text-translation data and a lot of audio-transcript data, but quite little direct speech-translation data. So what can we do? One trick you already know a bit from other lectures... Exactly: you can, for example, use text-to-speech to take a parallel text corpus, generate audio for the source language, and then train on that. This is motivated by what we have seen in back-translation, which was very successful.

However, it's a bit more challenging here, because synthetic speech is often very different from real audio. So if you build a system trained only on TTS output, generalizing to real audio data is quite challenging. Therefore, synthetic data generation is significantly more challenging here than in text-to-text translation: a machine-translated text may be a bad translation, but it is still real text, or at least text-like. But it's a valid solution, and we use it, for example, in our current systems.

Of course you can also do a bit of forward translation: you take ASR data and machine-translate the transcripts. But then the problem is that your reference is not always correct — and remember, when we talked about back-translation, having the synthetic side on the input is a bit of an advantage. Both can be done, and both have been done. So think of the picture again: you can take text-translation data and generate the audio for it, or take ASR data and generate the translations.

However, the audio is still only synthetic. [Student question about voice cloning.] Right — you get text-to-speech output, and voice cloning would need reference voices; you can use those, of course, and then it's nothing other than normal TTS with more voices. But I still think there are characteristics of synthetic speech that are quite different from real speech. It is getting better, though — that is definitely true — and then this approach may become more and more useful.

[Student question about training on our own systems' output.] You have to make sure it's good data, because with our own systems it can become a feedback loop if we train on their output. You of course need a decent amount of real data.
But as I said, there is always an advantage if you have the synthetic part only on the input side and not on the output side: then you at least always train towards correct outputs. That is different from the forward-translation case, where the synthetic part is the reference.

The other idea is to integrate additional data sources through model sharing. You have an ASR model and an MT model, and you can reuse their components in the end-to-end system — typically the text decoder from MT and the speech encoder from ASR. So the other way of leveraging data is to jointly train, or somehow train, all these tasks together.

The first and easy thing is multi-task training: the idea is that you take these components, train them on the ASR and MT tasks, and train the speech translation task at the same time. Then, for example, the encoder used by the speech translation system can also gain from the large ASR data. So every part can gain a bit from the additional supervision.

The other idea is to do it in a pre-training phase: you pre-train the speech encoder and the text decoder on ASR and MT data, and then train your combined model on the speech translation data.

Finally, there is also what is referred to as knowledge distillation. Remember: there you learn from a probability distribution rather than from a single label. So what you can do is take your MT system as a teacher: when you have audio together with its source-language transcript, you run the MT system on the transcript and use its output distribution to train the end-to-end model. Then you get a richer signal: you don't only know "this is the correct word", but you have a complete distribution over words. This is typically possible because speech translation data often doesn't only contain source-language audio and target-language text, but also the source-language transcript.

[Student question about the vocabularies.] Good point — the teacher's decoder and the student's decoder have to be aligned; otherwise you cannot match their distributions. What you do in knowledge distillation is run your MT system, get the probability distribution over all the words, and use that for training — which is more helpful than only having the single reference. You can of course use the same decoder vocabulary so that they are comparable; otherwise the distributions don't line up exactly.
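A minimal sketch of the distillation loss just described, assuming teacher (MT) and student (speech translation) share the same target vocabulary; the softmax temperature is a common but assumed detail:

```python
# Knowledge distillation: the speech translation student is trained to
# match the MT teacher's full output distribution, not just the reference.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 1.0):
    """Both logits: (batch, seq_len, vocab) over a shared target vocab."""
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)    # soft targets
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between teacher and student distributions
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * T * T
```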
1:18:52.832 --> 1:19:03.515 That is a good point, and generally in all these cases it's good to have more similar 1:19:03.515 --> 1:19:05.331 representations. 1:19:05.331 --> 1:19:07.253 Then you can transfer more. 1:19:07.607 --> 1:19:23.743 If the representations you get from the audio encoder and the text encoder are 1:19:23.743 --> 1:19:27.410 more similar, then the shared components work better. 1:19:30.130 --> 1:19:39.980 So here you have your text encoder, and you can train it on large text 1:19:39.980 --> 1:19:40.652 data. 1:19:41.341 --> 1:19:45.994 But of course you also want to benefit on this task, because that's what you're most interested in. 1:19:46.846 --> 1:19:59.665 And of course, the most benefit for this task comes if these two representations 1:19:59.665 --> 1:20:01.728 are more similar. 1:20:02.222 --> 1:20:10.583 Therefore, it's interesting to look into how we can make these two representations as similar 1:20:10.583 --> 1:20:20.929 as possible. The hope is that in the end you can even do something like zero-shot transfer, where 1:20:20.929 --> 1:20:25.950 you only learn on one modality but can also deal with the other. 1:20:30.830 --> 1:20:40.257 So what you can do is look at these two representations, 1:20:40.257 --> 1:20:42.867 once from the text and once from the audio. 1:20:43.003 --> 1:20:51.184 And you can either put them into the text decoder or into the encoder. 1:20:51.184 --> 1:20:53.539 We have seen both. 1:20:53.539 --> 1:21:03.738 If you want to build an end-to-end system, you can think about taking the audio 1:21:03.738 --> 1:21:06.575 encoder and deciding how deep the shared part should be. 1:21:08.748 --> 1:21:21.915 In any case, you have these two representations and you want to make them more similar. 1:21:21.915 --> 1:21:23.640 One thing is problematic here. 1:21:23.863 --> 1:21:32.797 For the audio we have, as said, a representation for every ten milliseconds, while for text we have one per token. 1:21:35.335 --> 1:21:46.085 So what people have done, for example, is to remove redundant information. 1:21:46.366 --> 1:21:56.403 You can use your system to predict boundaries of letters or words, and then average over the 1:21:56.403 --> 1:21:58.388 words or letters. 1:21:59.179 --> 1:22:07.965 So then the number of representations from the audio encoder is the same as you would get from the text. 1:22:12.692 --> 1:22:20.919 Okay, that was the part about data. Do you have any more questions about that first? 1:22:27.207 --> 1:22:36.787 Then we'll finish with the audio processing and highlight a bit why this is challenging. 1:22:36.787 --> 1:22:52.891 So here is an example: one test set has one thousand eight hundred sentences, with a corresponding number of words or characters. 1:22:53.954 --> 1:22:59.336 If you look at how many audio features, so how many samples, there are, it's like one point five 1:22:59.336 --> 1:22:59.880 million. 1:23:00.200 --> 1:23:10.681 So you have ten times more features than you have characters, and then again five times 1:23:10.681 --> 1:23:11.413 more than words. 1:23:11.811 --> 1:23:23.934 So the sequence length of the audio is many times what you have for words, and that is 1:23:23.934 --> 1:23:25.788 a challenge. 1:23:26.086 --> 1:23:34.935 So the question is what you can do to make the sequence a bit shorter and not have these very long inputs. 1:23:38.458 --> 1:23:48.466 One thing is you can try to reduce the sequence length in your encoder. 1:23:48.466 --> 1:23:50.814 There are different ways. 1:23:50.991 --> 1:24:04.302 So, for example, you can always just sum over some consecutive frames, or you can do an aggregation like averaging. 1:24:04.804 --> 1:24:12.045 Or you do a linear projection, or you even take not every feature vector but only every fifth or something. 1:24:12.492 --> 1:24:23.660 So this way you can very easily reduce the number of features, and there have 1:24:23.660 --> 1:24:25.713 been different approaches. 1:24:26.306 --> 1:24:38.310 There's also what you can do with things like a convolutional layer 1:24:38.310 --> 1:24:43.877 with a stride, so that you skip over some positions.
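A minimal NumPy sketch of two of the length-reduction options just listed, averaging over consecutive frames and simple subsampling. The feature matrix here is made up (80-dimensional vectors every 10 ms) and the reduction factors are illustrative; real encoders often achieve the same effect with strided convolutions inside the network.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 80))  # e.g. 10 s of audio: one 80-dim
                                      # feature vector every 10 ms

def average_pool(x, k):
    """Average every k consecutive frames -> sequence k times shorter."""
    t = (x.shape[0] // k) * k                 # drop the ragged tail
    return x[:t].reshape(-1, k, x.shape[1]).mean(axis=1)

def subsample(x, k):
    """Simply keep only every k-th frame."""
    return x[::k]

print(average_pool(frames, 4).shape)  # (250, 80)
print(subsample(frames, 5).shape)     # (200, 80)
```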
1:24:47.327 --> 1:24:55.539 And then, in addition to the length, the other problem with audio is the higher variability. 1:24:55.539 --> 1:25:04.957 If you have a text, it's written in exactly one way; but there are very different ways of saying it: you can 1:25:04.957 --> 1:25:09.867 distinguish whether I say a sentence or you say it in your voice. 1:25:10.510 --> 1:25:21.224 That of course makes it more challenging, because now you get different inputs for the same content, while 1:25:21.224 --> 1:25:22.837 in text they were identical. 1:25:23.263 --> 1:25:32.360 So that makes things more challenging, especially with limited data, and you want the model to somehow 1:25:32.360 --> 1:25:35.796 learn that this variation is not important. 1:25:36.076 --> 1:25:39.944 So there is the idea again: okay, 1:25:39.944 --> 1:25:47.564 can we do some type of data augmentation to better deal with this? 1:25:48.908 --> 1:25:55.735 And again, people mainly use what has been done in speech recognition and try to do the same things. 1:25:56.276 --> 1:26:02.937 You can try to add a bit of noise, or do speed perturbation, so playing the audio a bit slower 1:26:02.937 --> 1:26:08.563 and a bit faster to get more samples, and then you can train on all of them. 1:26:08.563 --> 1:26:14.928 What has been very important and very successful recently is what is called SpecAugment. 1:26:15.235 --> 1:26:25.882 The idea is that you work directly on the audio features, and you mask parts of them, 1:26:25.882 --> 1:26:29.014 and that gives you more robustness. 1:26:29.469 --> 1:26:41.717 What do we mean by masking? So this is your audio feature matrix, and then there are different ways. 1:26:41.962 --> 1:26:47.252 You can do what is referred to as time masking. 1:26:47.252 --> 1:26:50.480 That means you just mask out some time steps. 1:26:50.730 --> 1:26:58.003 And the system should still be able to deal with it, because you can normally infer the missing part from the context. 1:26:57.937 --> 1:27:05.840 Also, you are getting more robust, and you can handle variation better, because then 1:27:05.840 --> 1:27:10.877 many samples which have a different timing look more similar. 1:27:11.931 --> 1:27:22.719 You do that not only as time masking but also as frequency masking, so that if you 1:27:22.719 --> 1:27:30.188 have here the frequency channels, you mask out a frequency channel, 1:27:30.090 --> 1:27:33.089 thereby being able to better recognize these things.
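A small sketch of these two masking operations on a (time, frequency) feature matrix; the mask counts and widths below are illustrative choices, not the exact values from the SpecAugment paper, which samples several masks per utterance.

```python
import numpy as np

def spec_augment(features, max_t=20, max_f=10, rng=None):
    """Apply one time mask and one frequency mask to a copy of a
    (time, freq) feature matrix, SpecAugment-style."""
    if rng is None:
        rng = np.random.default_rng()
    x = features.copy()
    T, F = x.shape

    t_width = rng.integers(0, max_t + 1)        # time mask: zero a span of frames
    t0 = rng.integers(0, max(1, T - t_width))
    x[t0:t0 + t_width, :] = 0.0

    f_width = rng.integers(0, max_f + 1)        # frequency mask: zero a band of channels
    f0 = rng.integers(0, max(1, F - f_width))
    x[:, f0:f0 + f_width] = 0.0
    return x

mel = np.random.default_rng(1).normal(size=(1000, 80))  # toy log-Mel features
augmented = spec_augment(mel)
print((augmented == 0).sum(), "masked entries")
```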
1:27:35.695 --> 1:27:43.698 With this we have had an overview of the two main approaches for speech translation: that is, on 1:27:43.698 --> 1:27:51.523 the one hand cascaded speech translation, and on the other hand end-to-end 1:27:51.523 --> 1:27:53.302 speech translation. 1:27:53.273 --> 1:28:02.080 For the cascade it's about how to combine components and make them work together, and for end-to-end speech translation 1:28:02.362 --> 1:28:06.581 it was the data challenges and a bit about the long sequences. 1:28:07.747 --> 1:28:09.304 Do we have any more questions? 1:28:11.451 --> 1:28:19.974 Can you briefly describe the challenges in the cascade when going from translation to text-to-speech? Because 1:28:19.974 --> 1:28:22.315 I thought the translation. 1:28:25.745 --> 1:28:30.201 Yes, so I mean that works; again, that is the easiest thing. 1:28:30.201 --> 1:28:33.021 What of course is challenging, 1:28:33.021 --> 1:28:40.751 what can be challenging, is how to make it more lively, things like the pronunciation. 1:28:40.680 --> 1:28:47.369 And yeah, which parts are more important, how to put emphasis, things like that. 1:28:47.627 --> 1:28:53.866 That is not in the plain text; otherwise it would sound very monotone, 1:28:53.866 --> 1:28:57.401 so you want to add this information. 1:28:58.498 --> 1:29:02.656 That is maybe one thing, to make it a bit more emotional. 1:29:02.656 --> 1:29:04.917 That is maybe one thing which is challenging. 1:29:05.305 --> 1:29:13.448 But you are right that out of the box, 1:29:13.448 --> 1:29:20.665 if you have a good text-to-speech system, everything works decently. 1:29:20.800 --> 1:29:30.507 Still, especially if you end up with a very monotone voice, I think these are quite some open challenges. 1:29:30.750 --> 1:29:35.898 Maybe another open challenge, not so much for the end product, but for the 1:29:35.898 --> 1:29:37.732 development it is very important: 1:29:37.732 --> 1:29:40.099 it's very hard to evaluate the quality. 1:29:40.740 --> 1:29:48.143 There is no good way around that yet; most systems are currently evaluated by human 1:29:48.143 --> 1:29:49.109 evaluation. 1:29:49.589 --> 1:29:54.474 So you cannot just try hundreds of things, run your BLEU score, and compare the scores. 1:29:54.975 --> 1:30:00.609 So therefore it would be very important to have some type of automatic evaluation metric, and that is 1:30:00.609 --> 1:30:01.825 quite challenging. 1:30:08.768 --> 1:30:15.550 And thanks for listening; we'll have the second part on speech translation in the next lecture.