---
license: cc-by-nc-nd-4.0
tags:
- Image
- Captioning
- RESNET-152
- LSTM
---

## Introduction

This model follows the architecture proposed in the book "Mastering PyTorch". It combines a CNN encoder with an LSTM decoder.

The CNN encoder is based on a pretrained ResNet-152. The last layer of the ResNet is replaced by an embedding layer that outputs a 256-element vector.

The LSTM decoder takes 256-dimensional inputs, uses a hidden layer of size 512, and produces outputs of the vocabulary size.

The model was trained purely as a learning exercise, so its performance remains fairly average.

## Training procedure

For the sake of the exercise, the model was trained for only 5 epochs.

It was trained on the COCO dataset.