## Preparing a dataset

We download the dataset and prepare it for training. In this example we use toxic-teenage-relationships, a collection of sentences that each describe a behavior as either toxic or healthy. Each example has a text field and a label field, which is 1 if the behavior is toxic and 0 if it is not. The dataset contains 267 training examples and 66 test examples.

In [1]:
from datasets import load_dataset
data_files = {"train": "train.csv", "test": "test.csv"}
dataset = load_dataset("toxic-teenage-relationships", data_files=data_files, sep=";")
dataset['train'][100]

{'label': 1,
 'text': 'Mi amiga no puede subir videos a tik tok porque su pareja no le deja'}

Once the dataset is loaded, we create a tokenizer to process the text, with a strategy for padding and truncation. To preprocess the whole dataset in a single step, we use the dataset.map method.

In [2]:
# in this example we use AutoTokenizer
from transformers import AutoTokenizer
#from transformers import RobertaTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
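
To make the effect of padding="max_length" and truncation=True more concrete, here is a minimal sketch in plain Python. It does not call the real tokenizer: the helper pad_or_truncate, the token ids, max_length=8 and pad_id=1 are all hypothetical values chosen for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return fixed-length input_ids and the matching attention_mask.

    Toy illustration only: a real tokenizer also preserves special tokens
    (such as the end-of-sequence marker) when it truncates.
    """
    # Truncation: keep at most max_length tokens.
    kept = token_ids[:max_length]
    # Padding: fill the remaining positions with pad_id.
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A short sequence is padded, a long one is cut to max_length (made-up ids).
short = pad_or_truncate([0, 713, 16, 2], max_length=8, pad_id=1)
long = pad_or_truncate(list(range(100, 112)), max_length=8, pad_id=1)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the collator stack them into one rectangular batch later on.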

Now we convert the dataset to TensorFlow format. For that we use DefaultDataCollator, which stacks the tensors into a batch for the model to train on. We must specify the argument return_tensors="tf".


In [3]:
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator(return_tensors="tf")

We store the train and test datasets.


In [4]:
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]



Next, we create the model.



In [5]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification
# RoBERTa also has its own class for the classification head; here we use the generic one
#from transformers import TFRobertaForSequenceClassification
# There are two categories, so we set 2 labels (0 healthy, 1 toxic)
# from_pt takes a boolean; it loads the PyTorch checkpoint into the TensorFlow model
model = TFAutoModelForSequenceClassification.from_pretrained("PlanTL-GOB-ES/roberta-base-bne", num_labels=2, from_pt=True)

2023-08-29 20:57:53.276031: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 154404864 exceeds 10% of free system memory.
2023-08-29 20:57:53.624006: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 154404864 exceeds 10% of free system memory.
2023-08-29 20:57:53.683150: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 154404864 exceeds 10% of free system memory.
2023-08-29 20:58:02.251496: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 154404864 exceeds 10% of free system memory.
2023-08-29 20:58:02.566086: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 154404864 exceeds 10% of free system memory.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another archit

Now we convert the tokenized datasets to TensorFlow datasets with the .to_tf_dataset method. The inputs are listed in columns and the label in label_cols. The dataset column is called label, but DefaultDataCollator renames it to labels when it builds each batch, which is why label_cols is set to "labels". The batch size is the number of examples fed to the network at each training step.

In [6]:
tf_train_dataset = train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols="labels",
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
tf_validation_dataset = eval_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols="labels",
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)
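
As a rough illustration of what batch_size controls, the sketch below splits a list of example indices into consecutive batches. The helper make_batches and the count of 20 examples are hypothetical. We also assume the final, smaller batch is kept rather than dropped.

```python
import math


def make_batches(n_examples, batch_size):
    """Split the indices 0..n_examples-1 into consecutive batches."""
    indices = list(range(n_examples))
    return [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]


# 20 made-up examples with batch_size=8: two full batches and one partial batch.
batches = make_batches(20, 8)
sizes = [len(batch) for batch in batches]
print(sizes)
print(len(batches) == math.ceil(20 / 8))
```

One epoch then corresponds to one pass over all of these batches, with one weight update per batch.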


We compile the model.

In [7]:
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=tf.metrics.SparseCategoricalAccuracy(),
)
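
To show what the loss chosen above measures, here is a small worked example in plain Python. It is a sketch of the standard formula (softmax followed by the negative log-probability of the true class), not the Keras implementation. The function name and the logit values are made up for illustration.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example, given raw logits and an integer label."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log-probability of the true class under the softmax.
    return log_sum_exp - logits[label]


# Two classes (0 healthy, 1 toxic); the logits are hypothetical model outputs.
loss_confident = sparse_categorical_crossentropy_from_logits([-2.0, 3.0], label=1)
loss_uncertain = sparse_categorical_crossentropy_from_logits([0.0, 0.0], label=1)
loss_wrong = sparse_categorical_crossentropy_from_logits([-2.0, 3.0], label=0)
print(loss_confident, loss_uncertain, loss_wrong)
```

The loss is small when the logit of the true class dominates, equals log(2) when both classes are equally likely, and grows when the model is confidently wrong. from_logits=True tells Keras that the model outputs raw scores like these rather than probabilities.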

## Cross-validation
First we define the K-fold cross-validation parameters. Because the dataset is small, the number of splits is 3.

In [8]:
from sklearn.model_selection import KFold
from keras.callbacks import EarlyStopping
num_splits = 3
kf = KFold(num_splits, shuffle= True, random_state=42)
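
For intuition about how K-fold splitting works, the following plain-Python sketch builds shuffled folds by hand. It illustrates the idea and will not reproduce the exact indices that scikit-learn generates. The helper k_fold_indices and the sample count of 10 are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_splits, seed):
    """Return a list of (train_indices, val_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # When n_samples is not divisible by n_splits, the first folds get one extra example.
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, n_splits=3, seed=42)
val_sizes = [len(val) for _, val in folds]
print(val_sizes)
```

Each example appears in exactly one validation fold and in the training part of the other folds, so every example is used for validation once.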



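Now we define the cross-validation loop. Note that the same model object is trained in every fold without being re-initialized, so weights learned in earlier folds carry over and the validation scores of later folds may be optimistic.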

In [9]:
# lists to store the metrics for each fold
train_losses=[]
train_accuracies=[]
val_losses = []
val_accuracies=[]

for fold, (train_index, val_index) in enumerate(kf.split(train_dataset)):
    print(f"Fold {fold + 1}")

    # create the training and validation sets for this fold
    train_fold_dataset = train_dataset.select(train_index)
    val_fold_dataset = train_dataset.select(val_index)

    # convert the datasets to TensorFlow
    tf_train_fold_dataset = train_fold_dataset.to_tf_dataset(
        columns=["attention_mask", "input_ids"],
        label_cols="labels",
        shuffle=True,
        collate_fn=data_collator,
        batch_size=8,
    )

    tf_val_fold_dataset = val_fold_dataset.to_tf_dataset(
        columns=["attention_mask", "input_ids"],
        label_cols="labels",
        shuffle=False,
        collate_fn=data_collator,
        batch_size=8,
    )

    # early stopping: stop when val_loss has not improved for 2 epochs and restore the best weights
    early_stop = EarlyStopping(monitor="val_loss", patience=2, mode="auto", restore_best_weights=True)
    
    # train the model
    model.fit(tf_train_fold_dataset, validation_data=tf_val_fold_dataset, epochs=10, callbacks=[early_stop])

    # evaluate the model
    train_scores = model.evaluate(tf_train_fold_dataset, verbose=0)
    val_scores = model.evaluate(tf_val_fold_dataset, verbose=0)
    print("Train")
    print(f"Fold {fold + 1} - Loss: {train_scores[0]}, Accuracy: {train_scores[1]}")
    print("Val")
    print(f"Fold {fold + 1} - Loss: {val_scores[0]}, Accuracy: {val_scores[1]}")
    
    # store the figures so we can average them afterwards
    train_losses.append(train_scores[0])
    train_accuracies.append(train_scores[1])
    val_losses.append(val_scores[0])
    val_accuracies.append(val_scores[1])
    



Fold 1
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Train
Fold 1 - Loss: 0.224392369389534, Accuracy: 0.9269663095474243
Val
Fold 1 - Loss: 0.6009274125099182, Accuracy: 0.7222222089767456
Fold 2
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Train
Fold 2 - Loss: 0.01458701491355896, Accuracy: 0.994413435459137
Val
Fold 2 - Loss: 0.15508417785167694, Accuracy: 0.932584285736084
Fold 3
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Train
Fold 3 - Loss: 0.07668939232826233, Accuracy: 0.9776536226272583
Val
Fold 3 - Loss: 0.010059510357677937, Accuracy: 1.0
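
The folds above stop well before the 10 epochs requested because of the EarlyStopping callback. The sketch below simulates the patience rule in plain Python. It is a simplified model of the behavior (no min_delta, no weight handling), and the validation losses are made-up numbers, not values from this run.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: stop once `patience` such epochs have accumulated.
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical losses: the best value is at epoch 2, then two epochs without improvement.
stop_epoch, best_epoch = early_stopping_epoch([0.70, 0.60, 0.65, 0.68, 0.50], patience=2)
print(stop_epoch, best_epoch)
```

With restore_best_weights=True, the weights kept after stopping are those of the best epoch rather than those of the last epoch trained.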



In [10]:
import numpy as np
# Compute the means of the metrics
mean_train_loss = np.mean(train_losses)
mean_train_accuracy = np.mean(train_accuracies)
mean_val_loss = np.mean(val_losses)
mean_val_accuracy = np.mean(val_accuracies)

# Print the means of the metrics
print(f"Mean Train Loss: {mean_train_loss}, Mean Train Accuracy: {mean_train_accuracy}")
print(f"Mean Val Loss: {mean_val_loss}, Mean Val Accuracy: {mean_val_accuracy}")

Mean Train Loss: 0.1052229255437851, Mean Train Accuracy: 0.9663444558779398
Mean Val Loss: 0.25535703357309103, Mean Val Accuracy: 0.8849354982376099
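
Because only three folds contribute to each average, it can be useful to report the spread across folds next to the mean. The sketch below does this with the standard library, using the per-fold validation accuracies printed above. Using the sample standard deviation is our choice here, not part of the original notebook.

```python
import statistics

# Per-fold validation accuracies copied from the output above.
val_accuracies = [0.7222222089767456, 0.932584285736084, 1.0]

mean_val_accuracy = statistics.mean(val_accuracies)
# Sample standard deviation (n - 1 in the denominator) across the three folds.
std_val_accuracy = statistics.stdev(val_accuracies)

print(f"Val accuracy: {mean_val_accuracy:.3f} +/- {std_val_accuracy:.3f}")
```

A large spread relative to the mean suggests that the estimate depends strongly on which examples end up in each fold.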
