Clarification on Recent Changes to Loss and Gradient Accumulation

#15
by jiosephlee - opened

Hi!

I've been loosely following the recent conversations about gradient-accumulation bugs, such as https://github.com/huggingface/transformers/pull/34198 and https://github.com/huggingface/transformers/pull/34191. As a lay user, it's not entirely clear to me what the underlying issue is.

To home in on specifics: as someone who wants to use Trainer with a custom loss function, I'm concerned that there are several factors I need to account for, which leads me to the following questions:

  1. When providing a custom compute_loss_func, is the expectation that I divide the loss by num_items_in_batch (the total number of loss-contributing items summed over all micro-batches in one accumulation step)? To confirm my understanding: gradient accumulation works by breaking each "step" over the effective batch into smaller micro-batches, so the outputs and labels I receive are already per-micro-batch, and all I need to do is return a summed loss divided by num_items_in_batch (a minimal sketch of what I mean follows this list).
  2. I'm seeing open issues, such as https://github.com/huggingface/transformers/issues/38837, which appear to be specific to the final step, when the remaining batches don't divide evenly into the gradient-accumulation steps. Is this still an issue?
  3. Do any of these dynamics change in a multi-GPU setup?
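
For concreteness, here is a minimal sketch of what I have in mind for question 1. The signature is just my reading of how Trainer calls compute_loss_func (outputs, labels, num_items_in_batch), and the causal-LM cross-entropy details are only an illustrative assumption, not necessarily the right recipe:

```python
import torch
import torch.nn.functional as F


def compute_loss_func(outputs, labels, num_items_in_batch=None):
    """Illustrative custom loss: sum the per-token losses for this
    micro-batch, then divide by num_items_in_batch so that the gradient
    accumulated over all micro-batches matches a single large-batch update."""
    logits = outputs.logits
    # Shift so that position i predicts token i+1 (standard causal-LM alignment).
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="sum",  # sum (not mean) so the division below is well-defined
    )
    if num_items_in_batch is not None:
        loss = loss / num_items_in_batch
    else:
        # Fallback assumption if the count isn't passed: normalize by the
        # number of non-masked label tokens in this micro-batch only.
        loss = loss / (shift_labels != -100).sum()
    return loss
```

I would then pass this via `Trainer(..., compute_loss_func=compute_loss_func)`. Is that the intended usage, or am I misreading how the division is supposed to interact with accumulation?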
