Okay, so welcome back to today's lecture. What we want to talk about is speech translation; we will have two lectures this week on speech translation, and then in the last week some exercises and repetition. We want to look at what there is to do when we want to translate speech, so we want to address the specific challenges that occur when we switch from translating text to translating speech.

Today we will look at the more general picture: what the challenges are and how we can build such systems. We will cover first the cascaded approach, and then secondly the end-to-end approach, where we put in audio and directly generate the translation. These are the two main dominant system types used in research and in commercial systems.

More generally, what is the task of speech translation? The idea is that we have speech in one language, and we want a system that takes this audio and translates it into another language.

With speech, the output modality is no longer as clear as for text. You can either produce a more textual translation, as with subtitles, or you may want the output as audio, as is done in human interpretation. There is not the one best solution where one option is always better; it heavily depends on the use case and on what people prefer. For example, if you know the source language a bit but are unsure and don't understand everything, text output may be preferable, because you can direct your attention to what was said and only check the translation when you are unsure. In other situations it might be preferable to have fully spoken output.

Both exist, but for a long time automatic systems focused mainly on text output. Of course, you can always hand the text to a text-to-speech system, which generates audio from it.

Why should we care about this? Why should we do it? The nice thing is that in a globalized world we are now able to interact with a lot more people: we can attend conferences around the world, we can travel around the world, and via the Internet we can watch movies and TV from all over the world. However, there is still a barrier: videos are mostly available either in English or in the original language. So what currently happens, in order to reach a large audience, is that everybody speaks English. If we go, for example, to conferences, these are international conferences;
however, everybody will then speak English, since that is the common language that everybody understands. On the other hand, we cannot have human interpreters everywhere; you have that maybe in the European Parliament or in important business meetings, but it is relatively expensive. So the question is: can we enable communication in your mother tongue without having to rely on human interpretation? Speech translation can be helpful to bridge this gap.

There are different scenarios in which you can apply speech translation. Speech is typically more interactive than the settings in which text translation is most commonly used; of course, nowadays there are things like chat where text translation can also be interactive. Speech translation, in contrast, is less static, and there are different ways of organizing it.

One scenario is what is called consecutive translation, where you first get an input, then you translate this fixed input, and then the conversation continues. This means you always have fixed chunks to translate; you don't need to puzzle out where the boundaries are or where a sentence ends. Also, there is no overlapping: there is always one person's sentence being translated at a time. Of course, this has the disadvantage that it makes the conversation a lot longer, because speech and translation always alternate.

For example, if you used this for a presentation, it would get quite long: imagine me standing here in the lecture, saying three sentences, waiting for the interpreter to translate them, then saying the next two sentences, and so on. That is why this mode is used in situations like a direct conversation with a patient, where turns are short anyway; but even there it can end up taking very long.

That is why there is also research on simultaneous translation, where the idea is to translate in parallel. That is what is done in human interpretation: think of the European Parliament, where speakers do not speak one sentence at a time but simply give their speech, and in parallel human interpreters translate it into another language. The same setup is interesting for automatic speech translation, where we generate the translation in parallel.

The challenge then, of course, is that we need to segment the speech into some kind of chunks.
In text, we essentially just looked for the dots; we saw that there are some pitfalls — a dot does not always mark a sentence end — but in general, finding sentence boundaries in text is not really a research question. In speech translation, this is not that easy. Detecting boundaries in the audio is difficult, because we do not reliably pause where a sentence ends. And even if you have the transcript and need to add the punctuation, this is not as simple as it sounds.

Another question is how many speakers we have. In presentations you have more like a single speaker; that is normally easier from the audio-processing side. So in general, in speech translation you can face challenges in different components: in addition to the translation, you have the audio processing. And if you don't have, for example, a single speaker, there are significant additional challenges. We humans are very good at filtering out noise, or, if two people speak in parallel, at separating the two speakers and listening to one of them. However, doing that with automatic systems — separating the speakers so each can be transcribed — is very challenging.

Furthermore, in a multi-speaker scenario the speech is typically also less well prepared, so you get what we will call spontaneous effects: people stop in the middle of a sentence, change their sentence, and so on, and filtering these disfluencies out of the text and working with them is often very challenging.

So these are all additional challenges when you have multiple speakers. Then there is the question of an online or offline system. Text translation is mostly offline: you can take the whole text and translate it in a batch. For speech translation there are also several scenarios where this is the case; for example, when you are translating a movie, it's not only that you don't have to do it live — you can take the whole movie as input. However, there are also a lot of situations where you don't have this opportunity, like lectures or sports. You don't want to first record a sports event and then show the game three hours later with translations; by then there is not really any interest. So you have to do it live, and then you have the additional challenge of building an online system.

There are several aspects to this. On the one hand, it needs to be real-time translation:
if processing takes longer than the speech itself, you are getting more and more delayed. That may seem simple, but there have been research systems that run several times slower than real time, because they want to show what is possible with the best current models.

But even that is not enough. You can have a system that runs several times faster than real time — processing a second of audio in less than a second — and it might still not be useful. The other question is latency: how much time passes before you can produce an output. It might be that on average you can keep up, but you cannot produce output directly; maybe you need the full context of thirty seconds before you can output anything, and then you have a large latency. So it can happen that you process the audio as fast as it is produced, but still have to wait until the full context is available. We will look on Thursday at how we can generate translations with low latency. You can imagine that in German, for example, this is quite challenging, since the verb often comes at the end — if you use the perfect tense, as in "habe ... gesehen" — while in English you have to produce the verb right away. So if you really want the correct translation, you might need to wait until the end of the sentence.

Besides that, an offline system gives you additional help. I think last week you talked about context-based systems, which typically use context from the past but maybe also from the future. In the online case you cannot use anything from the future, but you can still use the past.

Finally, there is the question of how you want to present the output to the audience. If you run a speech synthesis system on top of the translation output, further questions arise: how should it be spoken? You can do things like voice cloning, so that the output even has the same voice as the original speaker.

And whether you do subtitles or dubbing, there may be additional constraints. Think about subtitles: they should be readable, and we often speak faster than people can read. So you might need to shorten your text. People say that a subtitle can have two lines, and each line can have a certain number of characters. So if you have too long a text, you may need to shorten it to fit.
Similarly, if you think about dubbing: if you want to produce a dubbing voice, then the translation has to fit the timing of the original speech.

There is another issue: we have different settings, like more formal and less formal ones, and these require different styles. If you think about the United Nations, you may want more formal language, while between friends it is less formal — and there are languages that mark this distinction explicitly. [Student question.] That is certainly an important research question, but I would think of it more generally: it is important in text translation too. If you translate a letter to your boss, it should sound different from a message to a friend.

So there is the question of how you can do this style transfer and how you can control it — for example, whether you can specify the style you want. You can tag the sentence to generate, say, an informal style, because, as you correctly said, this is especially challenging in these situations. Of course, there are ways of being more or less formal in English, but it's not as clear-cut as in German, where you have the du/Sie distinction. So there is no one-to-one mapping. If you want to handle that, you can build a system that generates different styles in the output; so yes, that is definitely also a challenge — it's just not listed here because it's not specific to speech. Generally, these are all challenges of how to customize and adapt systems to use cases with specific requirements.

Speech translation has been worked on for quite a while, and it is maybe not surprising that it started with simpler use cases. People first looked into, for example, limited-domain translation; the tourist domain was a typical application — the phrases you need when going to a new city. Then there were several efforts at open-domain translation, especially for settings like parliamentary speeches, where there is a lot of data, so you could build systems that are more open-domain — though it is still a bit restricted: in the European Parliament people talk about many things, but not about everything, so such systems are not usable for everything.

Nowadays we see this technology in a lot of different situations — I guess you have used it yourselves. There are basic technologies you can already use. But there are still a lot of open questions, for instance when you go to really spontaneous meetings. These systems typically work well for languages where we have a lot of training data, but if you want to go to really low-resource languages, things are often challenging.
Last week, for example, there was the workshop on spoken language translation (IWSLT), which has a low-resource track that includes dialects and similar varieties, and all these languages can still have significantly lower performance than high-resource ones.

So how does this work? If we want to do speech translation, there are three basic technologies. On the one hand there is automatic speech recognition (ASR), which transcribes audio into text. Then there is what we talked about in this course: machine translation (MT), which takes text input and translates it into the target language. And finally there is speech synthesis, if we want spoken output.

The very simple model, if you think about it, is just the combination of these. We have solutions for all these parts in isolation — people are working on all these problems — so if we want to do speech translation, maybe we can just put the components together. Then you get what is called a cascaded system: you take your audio, the ASR takes it as input and generates the transcript, and then you take this text output and put it into the MT system.

In this way you have a solution for speech translation, and this type of system is called cascaded. It still often reaches state-of-the-art performance; however, it has benefits and disadvantages. One big benefit is that we have independent components, and some of that is nice: if the ASR community puts great ideas into recognition, you can use them in your ASR component; and if people develop a new, better way of doing machine translation, you can also take that model. So you can leverage improvements from all the different communities.

Furthermore — and since all of this is machine learning, this may be the biggest advantage — we have training data for each individual component. There is a lot less training data where you have, say, English audio paired with German text, so it is much easier to train the components separately.

The one aspect we will focus on when talking about the cascaded approach is that the components often do not fit together perfectly. You need to adapt each component a bit so that it handles the kind of input it actually receives. So we will focus especially on how to combine them: if you directly feed ASR output into an MT system, it might not work as well as you would hope. So a major challenge when building a cascaded speech translation system is: how can we adapt these components, and how can we make them work well together?
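As a toy illustration of the cascaded idea, here is a minimal sketch in Python that chains an ASR model and an MT model via the Hugging Face pipelines. The concrete checkpoint names are just examples from the public model hub, not the systems discussed in this lecture, and the naive dot-based splitting stands in for the re-segmentation component discussed below:

```python
# Minimal cascaded speech translation sketch: ASR -> (segmentation) -> MT.
# Model names are illustrative; any ASR and MT checkpoints usable with
# these pipeline types would work. Assumes: pip install transformers torch
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def speech_translate(wav_path: str) -> str:
    transcript = asr(wav_path)["text"]  # source-language text
    # In a real system a re-casing/re-punctuation/segmentation step goes
    # here, producing sentence-like units the MT system was trained on.
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    translations = [mt(s)[0]["translation_text"] for s in sentences]
    return " ".join(translations)

print(speech_translate("talk.wav"))
```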
So why is this tricky? At first glance it looks quite nice and seems very reasonable: you have some audio, you put it into your ASR system, and you feed the transcript to your MT system. However, this needs a bit more thought, because what you speak is not the same as written text. In particular, ASR output rarely has punctuation in it, while the MT system assumes it gets a full sentence, without disfluencies. So we need to bridge the gap between the ASR output and the MT input, and we might need an additional component for that.

That is typically what is referred to as a re-casing and re-punctuation system. The idea is that it is good to have something like an adapter in between, which tries to adapt the speech transcript to what the MT system expects. That can happen at different levels; it might even involve rephrasing. If you think of a sentence with a false start — when speaking, you sometimes notice mid-sentence that you want to say it differently and restart — then you might want to delete the false start, because in the written translation you don't want to read it.

Why are casing and punctuation important? One important aspect, directly relevant for the segmentation challenge, is that speech is just a continuous stream of words: when speaking, punctuation marks and casing are simply not there in the natural signal. However, they are of course important. First, they matter for readability: if you have ever read a text without punctuation marks, you know you need more time to process it. They are sometimes even semantically important — "Let's eat, Grandpa" versus "Let's eat Grandpa" is a big difference. For humans such cases are usually easy to disambiguate, but doing it automatically is harder. And finally, in our case: if we want to do machine translation, we normally operate sentence-wise — we always feed the system one sentence after the next. If you want to do speech translation of a continuous stream, you first have to decide what your units are.

The easiest and most straightforward solution is: if you have a system that inserts punctuation marks into the continuous transcript, it is easy to split the text into sentences. Then we can reuse our MT system and run a normal, sentence-based system on this continuous input.

These are somewhat older numbers, but they show you how important this is. The best case is translating the human transcript, which gives the highest BLEU score. If you instead translate the ASR output with a simple pause- and length-based segmentation, you lose some of that. If you then use the correct segmentation, as taken from the reference, you gain roughly one BLEU point, and with the reference punctuation another point on top.
So you see that in total you gain nearly two BLEU points just by having the correct segmentation. This shows that it is important to estimate the segmentation as well as possible: even with exactly the same ASR errors in the transcript, better segmentation alone yields a notably better translation.

Note that this is an oracle experiment, done by looking at the reference — which is not unusual as an analysis in machine translation. You take the ASR transcript and segment and punctuate it according to the reference, just to quantify how important these factors are: it tells you what would be possible if our punctuation model were optimal. Of course, this is not how we can do it in reality, because we don't have access to the reference. You might ask: why do it at all? Because it shows the upper bound of what is achievable.

And that is why a typical system does not only consist of the ASR and MT components, but has this segmentation step in between, in order to give the MT system well-formed input; often you would even prefer a system with better segmentation over one with a slightly better average recognition quality.

So the task of segmentation is to re-segment the ASR text into what are called sentence-like units, and also to assign case information. The casing part is more of a traditional concern, because for a long time ASR systems did not provide case information; nowadays a good ASR system may directly provide it, so that part may no longer be necessary.

How can that be done? There are different approaches. One option — long the most common one — is to keep the MT training data as it is and handle casing and punctuation in a separate step before translation. Alternatively, you can train the MT system itself to handle such input: you can easily remove case and punctuation information from your training data and then train a system that translates from non-cased, non-punctuated input — or even combine the two steps into one, so that you directly translate from raw ASR output into cased, punctuated target text. On the other hand, that is more challenging, and you still need some kind of segmentation into translation units. What happens more and more by now is that the ASR system itself directly provides segmentation and case information, or that ASR and segmentation are combined into one system.

[Student question.] Yes — what we come to later today is going directly from audio to text in the target language; that is what is referred to as an end-to-end system. This is still more often done for text output, but there are also end-to-end systems that directly generate audio.
With audio output you have additional challenges, even in measuring whether the output is correct: for text you can compare against a reference word by word, but evaluating an audio signal is even harder. That's why it is currently mostly speech-to-text — but still one single system, of course.

So, how can you add this punctuation information? We will look into three approaches: you can treat it as a language-modeling problem, as a sequence-labeling problem, or as a monolingual translation problem.

Let's start with a little bit of history. One of the first ideas was to do it mainly based on a language model: how probable is it that there is a punctuation mark at a given position? That was done with old-style n-gram language models. For example, with an n-gram language model you can calculate the score of "Hello, how are you" versus "Hello how are you", compare the probabilities, and take the variant with the highest probability. You might add heuristics on top, like: if there is a very long pause, you insert a boundary anyway. So this is a very simple model, which only computes language-model probabilities, and yet it has clear advantages.

In general — and this may be interesting — most of the systems, even the advanced ones, really focus purely on the text. If you think about how to insert punctuation marks, your first idea might have been to use pause information. But interestingly, most systems focus on the text. There are several reasons; one is that it is easier to get training data, because you only need pure text data.

The next way to do it is to treat it as a sequence-labeling task. Then for every token you predict a label: nothing, comma, period, question mark, and so on. So the number of labels is the number of punctuation symbols, plus "none", for the basic version. Nowadays you would typically use something like BERT and train a classifier on top.

Any questions on that? [Student: wouldn't the labels be imbalanced?] Yes, you definitely have a label imbalance. But I think it works relatively well and I haven't seen that be a problem; it's not a completely hopeless imbalance — maybe twenty times more "no punctuation" labels. It can matter, especially for the rarer classes; the rarest are question marks. But at least for question marks you typically have very strong indicator words.
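To make the sequence-labeling formulation concrete, here is a minimal sketch using a BERT-style token classifier from the transformers library. The label inventory and the base checkpoint are illustrative assumptions, and the model below is untrained, so it would need fine-tuning on punctuated text before the outputs mean anything:

```python
# Punctuation restoration as token classification: for every input word,
# predict which punctuation mark (if any) should follow it.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["NONE", "COMMA", "PERIOD", "QUESTION"]  # assumed label set
MARKS = {"NONE": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

def restore_punctuation(words: list[str]) -> str:
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]      # (num_subwords, num_labels)
    pred = logits.argmax(-1).tolist()
    out = []
    for i, w in enumerate(words):
        sub = enc.word_ids().index(i)        # first subword of word i
        out.append(w + MARKS[LABELS[pred[sub]]])
    return " ".join(out)

print(restore_punctuation("hello how are you".split()))
```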
Then there is the third approach, which was used for quite a long time: we know how to do machine translation, so can we use it here? The idea is: can we just translate non-punctuated English into punctuated English, as a monolingual translation task, and do it correctly?

What you need is training data of this type, where the source side has no punctuation and the target side does. Of course, one half is already there — the target is just normal text; you have to make the source side realistically challenging. [Student remark.] Yes, that is true: compared to normal training data you have to do one thing more. Look at the example — it already looks different from normal training data. Why? Because at test time the input is a transcript of speech, with no sentence boundaries.

That's the point, and the first and easiest step: you have to randomly cut your sentences. Normally we have one sentence per line; if you took that directly as training data, the model would only ever see well-formed single sentences, which is not very helpful. So in order to build the training corpus for punctuation restoration, you randomly cut your sentences and then remove all the punctuation marks from the source side — because at test time there is no guarantee that a segment corresponds to a sentence; you get some random chunk out of the audio.

Then, once you have generated your punctuation marks, you can re-segment before going into the MT system. And that is an important point, which we will see is more challenging for end-to-end systems: we can change the segmentation. While you might have ten segments coming out of the ASR, you might only have five sentences going into the MT system. That can be more useful, because the MT system has to reorder words, and with the wrong segmentation you cannot reorder things from the beginning to the end of a sentence.
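A minimal sketch of this corpus construction — randomly re-chunking punctuated text so that the source side mimics arbitrary ASR segments. The chunk-length range is an arbitrary choice for illustration:

```python
# Build (unpunctuated chunk -> punctuated chunk) training pairs for a
# monolingual "translation" model that restores punctuation and casing.
import random
import re

def make_pairs(sentences: list[str], min_len: int = 5, max_len: int = 20):
    words = " ".join(sentences).split()   # forget the sentence boundaries
    pairs, i = [], 0
    while i < len(words):
        n = random.randint(min_len, max_len)       # random chunk length
        target = " ".join(words[i:i + n])          # punctuated, cased target
        source = re.sub(r"[.,!?;:]", "", target).lower()  # strip punct/case
        pairs.append((source, target))
        i += n
    return pairs

corpus = ["Hello, how are you?", "I am fine.", "See you tomorrow at noon."]
for src, tgt in make_pairs(corpus):
    print(src, "->", tgt)
```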
Okay, so much about segmentation — do you have any more questions about that?

Then there is one additional thing you can do, and that is dealing with recognition errors. When you get ASR input, there might be errors in there; it might not be perfect. So the question is: can we adapt to that? Can the MT system be improved by making it aware that its input comes from an ASR system and might not be correct? There are different ways of dealing with this. One is to use not just the best hypothesis but an n-best list: the idea is that you don't only tell the MT system "this is the transcript", but also "here I am not so sure". Or you can try to make the MT system more robust towards errors from the ASR system.

Interestingly — and I hope I can convince you that it might be a good idea to deal with ASR errors — if you look into a lot of systems, this is often ignored: they do not adapt their MT system to the ASR system at all. So there is no real error handling, and the interesting thing is that this often works just as well. One reason is this: if the ASR system makes an error, it is usually in a challenging situation, and then it is also really hard for the MT system to detect the error. If it were easy for the downstream system to detect the error, you would integrate that information into the ASR system itself and avoid it in the first place. That is not always the case, but it makes explicit error handling difficult, and that's why there are a lot of systems where it is not explicitly handled.

But of course it might help. One thing you can do is take an n-best list and translate every entry. Then you have two scores for each entry — the ASR probability and the MT probability — and you combine them and output the translation with the best combined score. The final choice might then no longer be the ASR 1-best: just as in beam search, one hypothesis may have the best ASR score while another has a better combined score. This sometimes works, but the problem is that the MT system may then tend to translate not the correct sentence but the one that is easier to translate.

You can also generate a more compact representation of this n-best list by using graphs, so-called lattices. Then you can try to do graph-to-text translation: you translate the lattice with all the alternatives the ASR system considered, each with its probability. So the highest-probability path might be "the conference is being recorded", but there are other paths where alternative words compete, and you can pass all of this information, with the probabilities, to the MT system.

Still, we will see this type of error propagation — an ASR error propagating into an MT error — and that is one of the main reasons why people looked into other ways of doing speech translation.
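A tiny sketch of the n-best rescoring described above: each ASR hypothesis is translated, and the output is chosen by a weighted combination of ASR and MT log-probabilities. The interpolation weight and the `mt_translate` interface are assumptions for illustration:

```python
# Rescoring an ASR n-best list with a combined ASR + MT score.
# `mt_translate(text) -> (translation, mt_logprob)` stands in for any MT
# system that returns its model score alongside the output.
import math

def translate_nbest(nbest, mt_translate, lam: float = 0.5):
    """nbest: list of (transcript, asr_logprob) pairs."""
    best, best_score = None, -math.inf
    for transcript, asr_lp in nbest:
        translation, mt_lp = mt_translate(transcript)
        score = lam * asr_lp + (1.0 - lam) * mt_lp  # combined log score
        if score > best_score:
            best, best_score = translation, score
    return best
```

Note that the winning translation need not come from the ASR 1-best hypothesis, which is exactly the point of the combination.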
Generally, a cascaded combination, as we have seen, has several advantages. The biggest is perhaps data availability: you can train the individual components on relatively large datasets. It is also a modular system, where you can improve each individual model, and if there are new developments you can swap better models in.

So there are several advantages, but of course there are also disadvantages. The most commonly named one is what is referred to as error propagation: if the ASR makes an error, the translation output will most probably contain an error too. And typically, an error in the transcript is easier for a human to ignore than the error it causes in the MT output. What does that mean? If the German ASR gets one letter of a word wrong, you will most probably still know what was said — maybe you don't even notice, because you read over it quickly and don't see that one letter is wrong. However, if that word is then translated, your English sentence about speeches suddenly says something about wines. So it is a lot easier to read over errors in the transcript than to read over the errors they cause in the translation.

But there are additional challenges in cascaded systems. Secondly, as we have seen, we optimize each component individually; this separate optimization does not mean that the overall performance is the best possible at the end. We have tried to address that by adapting the components to work well together, but it remains an issue.

Thirdly, there is computational complexity: you always need to run an ASR system and an MT system, and especially if it should be fast and real-time, running two systems is more challenging than running one.

And one final point that you might not think of immediately: most of the world's languages do not have a writing system. If a language has no script, then of course you cannot first transcribe and then translate; so to translate it at all, you need — as was mentioned before — to somehow build a system that takes the audio and directly generates text in the target language.

And there is quite a big opportunity for that now, because it used to require very different technologies. However, since we use neural machine translation with encoder-decoder models, the interesting thing is that ASR and MT now use very similar technology — very similar architectures in both cases. The main difference is the input.
But generally, how it's done is very similar, and therefore we might put everything together — and that is what is referred to as end-to-end speech translation. So we have one large neural encoder-decoder network; we put in audio in one language, and we get out text in the other language. We then have a single system that does the full process, and we don't have to care about the interfaces anymore.

If you think of it as before: in the cascade we had two separate encoder-decoder models, going through the discrete text representation in the source language. Instead, we can go through a continuous representation. The hope is that by not making discrete decisions in between, we don't commit to errors that we can only discover later; we can encode the uncertainty in the continuous representation and make the final decision only at the end.

So what we're doing is using a very similar technique: we still have the encoder-decoder model. But instead of getting discrete tokens as input, as we have with subwords, we get audio. The problem is that this input is continuous, so we have to see how we can work with continuous signals.

[Question to the audience:] What is the first thing your system does when it gets discrete input and encodes it? In neural machine translation you get a word as a one-hot vector, and the first layer of the machine translation model — yes, the word embedding — maps it to a continuous representation. So after the first layer everything is continuous anyway; if we now get continuous input directly, we can deal with it in much the same way. So that is not a big challenge. What is more challenging is the sequence length: the audio signal is ten times longer or so — you have many more time steps. So the challenge is how we deal with this kind of long sequence. The advantage is that the long sequence is only at the input and not at the output: remember from the efficiency lecture that long sequences are especially challenging in the decoder, but they matter for the encoder too.

So how can we process audio as input to a speech translation system? You can mainly follow what is done in an ASR system. You have the audio signal; then you measure the amplitude at every time step, with a sampling rate typically in the kilohertz range — sixteen kilohertz is common for speech.
And then you do windowing: you take windows of twenty to thirty milliseconds, overlapping so that you get one measurement every ten milliseconds, and you look at the signal inside each window. So in the end, for every ten milliseconds you have some type of representation — which type of representation you can choose. So instead of a letter or a word, you have a representation for every ten milliseconds of your signal.

How do we encode such a twenty-to-thirty-millisecond window? There are different ways. The traditional way is signal processing: from the audio signal you extract which frequencies are present. To do that you can compute mel-frequency cepstral coefficients, using Fourier transforms to determine which frequencies are there — you know that speech sounds differ by their frequencies. You apply this to each of the windows we had before: for each window you calculate which frequencies are in it and get features for this window, then features for the next window, and so on. These are the frequencies that occur there, and they help you model which sounds are being spoken.

More recently, instead of doing the traditional signal processing, you can also replace this with deep learning, using a self-supervised approach, as in language models, to generate features that describe the audio. So you take the raw signal, run convolutional neural networks over each chunk to get a first representation, then a transformer network on top, and in the end the training is similar to a masked language model: you try to predict what was masked (this is the idea behind models like wav2vec 2.0). So, in a similar way, you learn a good representation of the audio signal by prediction, and you get learned features instead of signal-processing-based ones.

Whichever you use, what is most important to remember for an end-to-end system is that, in the end, for every ten milliseconds you get a representation of the audio signal, which is again a vector. Then you can use your normal encoder-decoder model on top. So that is all that directly has to be changed, and then you can build your first baseline system.
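As a sketch of the traditional feature pipeline — 25 ms windows with a 10 ms shift and 80 mel bins are standard values, used here as assumptions — computing log-mel features with librosa:

```python
# Turn a waveform into one feature vector per 10 ms: log-mel filterbank
# features over 25 ms windows. These frames are what the speech encoder
# consumes in place of word embeddings.
import librosa
import numpy as np

wav, sr = librosa.load("talk.wav", sr=16000)  # resample to 16 kHz

mel = librosa.feature.melspectrogram(
    y=wav, sr=sr,
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms shift -> one frame per 10 ms
    n_mels=80,                   # 80 mel bins, a common choice
)
features = np.log(mel + 1e-10).T  # (num_frames, 80), ready for the encoder

print(features.shape)             # ~100 frames per second of audio
```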
Beyond the audio processing, you of course need data — audio in English paired with text in German, say — and then you can train.

And interestingly, it works. At the beginning the systems were maybe a bit worse, but we saw real progress. This is from the biggest workshop where people compared different systems, with a special challenge comparing cascaded to end-to-end systems. In 2018 there was quite a large gap between the cascaded and end-to-end systems; then it got narrower and narrower, and starting in 2020 the performance was essentially the same — there was no clear difference anymore. This gave a bit of hope: if we learn to build these end-to-end systems better, they might really become superior.

However, somewhat unsatisfyingly, this is how it has continued: not only in 2021, but even nowadays we can say there is no clear performance difference. It's not that one model type is better than the other; we see very similar performance.

So the question is, what makes the difference? Of course, this parity could only be achieved with new tricks. Yes and no — that's what we will mainly look into now: how can we make use of other types of data? You can gain performance through different types of training, and by making use of additional data sources. If you train the end-to-end system only on the very small corpora where you have direct speech-translation data — far less than you have for the individual tasks — it will not be competitive. That is the biggest challenge of an end-to-end system: the corpora are small, and therefore you need these techniques.

Of course, end-to-end systems also have several advantages. They have direct access to the audio information. That is interesting because you might not have everything modeled in the text: remember when we talked about biases — whether the speaker is male or female is not in the text any more, but in the audio signal it's still there.

It also helps with latency — we'll talk about that on Thursday. You have a somewhat better chance of achieving low latency with an end-to-end system, because you have only one system, and you don't have two systems that might have to wait for each other. And having one system might also be a bit easier to manage than making sure two systems work together.

The biggest challenge of end-to-end systems is the data: as was correctly pointed out, there is typically much less direct data. There is some data — for TED talks, for example, people did exactly this:
they took the English audio together with the existing translations of the talks. But in general there is a lot less direct data, so we'll look into how you can use other data sources.

And secondly, the other challenge is that we have to deal with audio — for example with the input length — and therefore it's also important to handle this in the network, maybe with dedicated solutions.

So in general we have this situation: we have a lot of text-translation data and a lot of audio-transcript data, but quite little direct speech-translation data. So what can we do? One trick you already know a bit from other lectures... Exactly: you can, for example, use text-to-speech to take a parallel text corpus, generate audio for the source language, and then train on that. This is motivated by what we have seen in back-translation, which was very successful.

However, it's a bit more challenging here, because synthetic speech is often very different from real audio. So if you build a system trained only on TTS output, generalizing to real audio data is quite challenging. Therefore, synthetic data generation is significantly more challenging here than in text-to-text translation: a machine-translated text may be a bad translation, but it is still real text, or at least text-like. But it's a valid solution, and we use it, for example, in our current systems.

Of course you can also do a bit of forward translation: you take ASR data and machine-translate the transcripts. But then the problem is that your reference is not always correct — and remember, when we talked about back-translation, having the synthetic side on the input is a bit of an advantage. Both can be done, and both have been done. So think of the picture again: you can take text-translation data and generate the audio for it, or take ASR data and generate the translations.

However, the audio is still only synthetic. [Student question about voice cloning.] Right — you get text-to-speech output, and voice cloning would need reference voices; you can use those, of course, and then it's nothing other than normal TTS with more voices. But I still think there are characteristics of synthetic speech that are quite different from real speech. It is getting better, though — that is definitely true — and then this approach may become more and more useful.

[Student question about training on our own systems' output.] You have to make sure it's good data, because with our own systems it can become a feedback loop if we train on their output. You of course need a decent amount of real data.
But as I said, there is always an advantage if you have the synthetic part only on the input side and not on the output side: then you at least always train towards correct outputs. That is different from the forward-translation case, where the synthetic part is the reference.

The other idea is to integrate additional data sources through model sharing. You have an ASR model and an MT model, and you can reuse their components in the end-to-end system — typically the text decoder from MT and the speech encoder from ASR. So the other way of leveraging data is to jointly train, or somehow train, all these tasks together.

The first and easy thing is multi-task training: the idea is that you take these components, train them on the ASR and MT tasks, and train the speech translation task at the same time. Then, for example, the encoder used by the speech translation system can also gain from the large ASR data. So every part can gain a bit from the additional supervision.

The other idea is to do it in a pre-training phase: you pre-train the speech encoder and the text decoder on ASR and MT data, and then train your combined model on the speech translation data.

Finally, there is also what is referred to as knowledge distillation. Remember: there you learn from a probability distribution rather than from a single label. So what you can do is take your MT system as a teacher: when you have audio together with its source-language transcript, you run the MT system on the transcript and use its output distribution to train the end-to-end model. Then you get a richer signal: you don't only know "this is the correct word", but you have a complete distribution over words. This is typically possible because speech translation data often doesn't only contain source-language audio and target-language text, but also the source-language transcript.

[Student question about the vocabularies.] Good point — the teacher's decoder and the student's decoder have to be aligned; otherwise you cannot match their distributions. What you do in knowledge distillation is run your MT system, get the probability distribution over all the words, and use that for training — which is more helpful than only having the single reference. You can of course use the same decoder vocabulary so that they are comparable; otherwise the distributions don't line up exactly.
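A minimal sketch of the distillation loss just described, assuming teacher (MT) and student (speech translation) share the same target vocabulary; the softmax temperature is a common but assumed detail:

```python
# Knowledge distillation: the speech translation student is trained to
# match the MT teacher's full output distribution, not just the reference.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 1.0):
    """Both logits: (batch, seq_len, vocab) over a shared target vocab."""
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)    # soft targets
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between teacher and student distributions
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * T * T
```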
1:18:52.832 --> 1:19:03.515 That is a good point, and generally in all these cases it's good to have more similar 1:19:03.515 --> 1:19:05.331 representations. 1:19:05.331 --> 1:19:07.253 Then you can transfer more. 1:19:07.607 --> 1:19:23.743 If the representations you get from the audio encoder and the text encoder are 1:19:23.743 --> 1:19:27.410 more similar, then the shared components work better. 1:19:30.130 --> 1:19:39.980 So here you have your text encoder, and you can train it on large text 1:19:39.980 --> 1:19:40.652 data. 1:19:41.341 --> 1:19:45.994 But of course you also want to benefit on this task, because that's what you're most interested in. 1:19:46.846 --> 1:19:59.665 And of course, the most benefit for this task comes if these two representations 1:19:59.665 --> 1:20:01.728 are more similar. 1:20:02.222 --> 1:20:10.583 Therefore, it's interesting to look into how we can make these two representations as similar 1:20:10.583 --> 1:20:20.929 as possible. The hope is that in the end you can even do something like zero-shot transfer, where 1:20:20.929 --> 1:20:25.950 you only learn on one modality but can also deal with the other. 1:20:30.830 --> 1:20:40.257 So what you can do is look at these two representations, 1:20:40.257 --> 1:20:42.867 once from the text and once from the audio. 1:20:43.003 --> 1:20:51.184 And you can either put them into the text decoder or into the encoder. 1:20:51.184 --> 1:20:53.539 We have seen both. 1:20:53.539 --> 1:21:03.738 If you want to build an end-to-end system, you can think about taking the audio 1:21:03.738 --> 1:21:06.575 encoder and deciding how deep the shared part should be. 1:21:08.748 --> 1:21:21.915 In any case, you have these two representations and you want to make them more similar. 1:21:21.915 --> 1:21:23.640 One thing is problematic here. 1:21:23.863 --> 1:21:32.797 For the audio we have, as said, a representation for every ten milliseconds, while for text we have one per token. 1:21:35.335 --> 1:21:46.085 So what people have done, for example, is to remove redundant information. 1:21:46.366 --> 1:21:56.403 You can use your system to predict boundaries of letters or words, and then average over the 1:21:56.403 --> 1:21:58.388 words or letters. 1:21:59.179 --> 1:22:07.965 So then the number of representations from the audio encoder is the same as you would get from the text. 1:22:12.692 --> 1:22:20.919 Okay, that was the part about data. Do you have any more questions about that first? 1:22:27.207 --> 1:22:36.787 Then we'll finish with the audio processing and highlight a bit why this is challenging. 1:22:36.787 --> 1:22:52.891 So here is an example: one test set has one thousand eight hundred sentences, with a corresponding number of words or characters. 1:22:53.954 --> 1:22:59.336 If you look at how many audio features, so how many samples, there are, it's like one point five 1:22:59.336 --> 1:22:59.880 million. 1:23:00.200 --> 1:23:10.681 So you have ten times more features than you have characters, and then again five times 1:23:10.681 --> 1:23:11.413 more than words. 1:23:11.811 --> 1:23:23.934 So the sequence length of the audio is many times what you have for words, and that is 1:23:23.934 --> 1:23:25.788 a challenge. 1:23:26.086 --> 1:23:34.935 So the question is what you can do to make the sequence a bit shorter and not have these very long inputs. 1:23:38.458 --> 1:23:48.466 One thing is you can try to reduce the sequence length in your encoder. 1:23:48.466 --> 1:23:50.814 There are different ways. 1:23:50.991 --> 1:24:04.302 So, for example, you can always just sum over some consecutive frames, or you can do an aggregation like averaging. 1:24:04.804 --> 1:24:12.045 Or you do a linear projection, or you even take not every feature vector but only every fifth or something. 1:24:12.492 --> 1:24:23.660 So this way you can very easily reduce the number of features, and there have 1:24:23.660 --> 1:24:25.713 been different approaches. 1:24:26.306 --> 1:24:38.310 There's also what you can do with things like a convolutional layer 1:24:38.310 --> 1:24:43.877 with a stride, so that you skip over some positions.
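A minimal NumPy sketch of two of the length-reduction options just listed, averaging over consecutive frames and simple subsampling. The feature matrix here is made up (80-dimensional vectors every 10 ms) and the reduction factors are illustrative; real encoders often achieve the same effect with strided convolutions inside the network.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 80))  # e.g. 10 s of audio: one 80-dim
                                      # feature vector every 10 ms

def average_pool(x, k):
    """Average every k consecutive frames -> sequence k times shorter."""
    t = (x.shape[0] // k) * k                 # drop the ragged tail
    return x[:t].reshape(-1, k, x.shape[1]).mean(axis=1)

def subsample(x, k):
    """Simply keep only every k-th frame."""
    return x[::k]

print(average_pool(frames, 4).shape)  # (250, 80)
print(subsample(frames, 5).shape)     # (200, 80)
```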
1:24:47.327 --> 1:24:55.539 And then, in addition to the length, the other problem with audio is the higher variability. 1:24:55.539 --> 1:25:04.957 If you have a text, it's written in exactly one way; but there are very different ways of saying it: you can 1:25:04.957 --> 1:25:09.867 distinguish whether I say a sentence or you say it in your voice. 1:25:10.510 --> 1:25:21.224 That of course makes it more challenging, because now you get different inputs for the same content, while 1:25:21.224 --> 1:25:22.837 in text they were identical. 1:25:23.263 --> 1:25:32.360 So that makes things more challenging, especially with limited data, and you want the model to somehow 1:25:32.360 --> 1:25:35.796 learn that this variation is not important. 1:25:36.076 --> 1:25:39.944 So there is the idea again: okay, 1:25:39.944 --> 1:25:47.564 can we do some type of data augmentation to better deal with this? 1:25:48.908 --> 1:25:55.735 And again, people mainly use what has been done in speech recognition and try to do the same things. 1:25:56.276 --> 1:26:02.937 You can try to add a bit of noise, or do speed perturbation, so playing the audio a bit slower 1:26:02.937 --> 1:26:08.563 and a bit faster to get more samples, and then you can train on all of them. 1:26:08.563 --> 1:26:14.928 What has been very important and very successful recently is what is called SpecAugment. 1:26:15.235 --> 1:26:25.882 The idea is that you work directly on the audio features, and you mask parts of them, 1:26:25.882 --> 1:26:29.014 and that gives you more robustness. 1:26:29.469 --> 1:26:41.717 What do we mean by masking? So this is your audio feature matrix, and then there are different ways. 1:26:41.962 --> 1:26:47.252 You can do what is referred to as time masking. 1:26:47.252 --> 1:26:50.480 That means you just mask out some time steps. 1:26:50.730 --> 1:26:58.003 And the system should still be able to deal with it, because you can normally infer the missing part from the context. 1:26:57.937 --> 1:27:05.840 Also, you are getting more robust, and you can handle variation better, because then 1:27:05.840 --> 1:27:10.877 many samples which have a different timing look more similar. 1:27:11.931 --> 1:27:22.719 You do that not only as time masking but also as frequency masking, so that if you 1:27:22.719 --> 1:27:30.188 have here the frequency channels, you mask out a frequency channel, 1:27:30.090 --> 1:27:33.089 thereby being able to better recognize these things.
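A small sketch of these two masking operations on a (time, frequency) feature matrix; the mask counts and widths below are illustrative choices, not the exact values from the SpecAugment paper, which samples several masks per utterance.

```python
import numpy as np

def spec_augment(features, max_t=20, max_f=10, rng=None):
    """Apply one time mask and one frequency mask to a copy of a
    (time, freq) feature matrix, SpecAugment-style."""
    if rng is None:
        rng = np.random.default_rng()
    x = features.copy()
    T, F = x.shape

    t_width = rng.integers(0, max_t + 1)        # time mask: zero a span of frames
    t0 = rng.integers(0, max(1, T - t_width))
    x[t0:t0 + t_width, :] = 0.0

    f_width = rng.integers(0, max_f + 1)        # frequency mask: zero a band of channels
    f0 = rng.integers(0, max(1, F - f_width))
    x[:, f0:f0 + f_width] = 0.0
    return x

mel = np.random.default_rng(1).normal(size=(1000, 80))  # toy log-Mel features
augmented = spec_augment(mel)
print((augmented == 0).sum(), "masked entries")
```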
1:27:35.695 --> 1:27:43.698 With this we have had an overview of the two main approaches for speech translation: that is, on 1:27:43.698 --> 1:27:51.523 the one hand cascaded speech translation, and on the other hand end-to-end 1:27:51.523 --> 1:27:53.302 speech translation. 1:27:53.273 --> 1:28:02.080 For the cascade it's about how to combine components and make them work together, and for end-to-end speech translation 1:28:02.362 --> 1:28:06.581 it was the data challenges and a bit about the long sequences. 1:28:07.747 --> 1:28:09.304 Do we have any more questions? 1:28:11.451 --> 1:28:19.974 Can you briefly describe the challenges in the cascade when going from translation to text-to-speech? Because 1:28:19.974 --> 1:28:22.315 I thought the translation. 1:28:25.745 --> 1:28:30.201 Yes, so I mean that works; again, that is the easiest thing. 1:28:30.201 --> 1:28:33.021 What of course is challenging, 1:28:33.021 --> 1:28:40.751 what can be challenging, is how to make it more lively, things like the pronunciation. 1:28:40.680 --> 1:28:47.369 And yeah, which parts are more important, how to put emphasis, things like that. 1:28:47.627 --> 1:28:53.866 That is not in the plain text; otherwise it would sound very monotone, 1:28:53.866 --> 1:28:57.401 so you want to add this information. 1:28:58.498 --> 1:29:02.656 That is maybe one thing, to make it a bit more emotional. 1:29:02.656 --> 1:29:04.917 That is maybe one thing which is challenging. 1:29:05.305 --> 1:29:13.448 But you are right that out of the box, 1:29:13.448 --> 1:29:20.665 if you have a good text-to-speech system, everything works decently. 1:29:20.800 --> 1:29:30.507 Still, especially if you end up with a very monotone voice, I think these are quite some open challenges. 1:29:30.750 --> 1:29:35.898 Maybe another open challenge, not so much for the end product, but for the 1:29:35.898 --> 1:29:37.732 development it is very important: 1:29:37.732 --> 1:29:40.099 it's very hard to evaluate the quality. 1:29:40.740 --> 1:29:48.143 There is no good way around that yet; most systems are currently evaluated by human 1:29:48.143 --> 1:29:49.109 evaluation. 1:29:49.589 --> 1:29:54.474 So you cannot just try hundreds of things, run your BLEU score, and compare the scores. 1:29:54.975 --> 1:30:00.609 So therefore it would be very important to have some type of automatic evaluation metric, and that is 1:30:00.609 --> 1:30:01.825 quite challenging. 1:30:08.768 --> 1:30:15.550 And thanks for listening; we'll have the second part on speech translation in the next lecture.