# Market2Vec

**Trademark-Based Product Timeline Embeddings (Forecasting MLM)**
This repo builds and trains Market2Vec from trademark data using a sequence-of-products view:
- Firms (owners) are treated as entities with a timeline of products
- Each product is a sequential event
- Each product has an item set (goods/services descriptors) treated as a basket (set, not order)
## Two training versions (two objectives)

We support two versions of the MLM objective:
### Version A — Forecasting (last-event prediction)

Purpose: learn to forecast the last product’s items from the firm’s earlier product history.

- We identify the last product event in the sequence: the segment between the last `[APP]` and the next `[APP_END]` (or `[SEP]`)
- We force-mask `ITEM_*` tokens inside that last event (mask probability = 1.0), as sketched below
- (Optional) random masking elsewhere can be turned off for “clean forecasting” evaluation
This version is best when your downstream use case is “given past trademark products, predict items in the most recent/next product”.
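A minimal sketch of that force-masking step, assuming single-sequence tensors and known special-token IDs (`app_id`, `app_end_id`, `item_ids`, and `mask_id` are stand-ins for whatever the repo's tokenizer actually assigns; this is illustrative, not the repo's API):

```python
import torch

def force_mask_last_event(input_ids, labels, app_id, app_end_id, item_ids, mask_id):
    """Mask every ITEM_* token in the last [APP]..[APP_END] segment (prob = 1.0).

    labels is assumed pre-filled with -100 (the ignore index for the MLM loss).
    """
    app_positions = (input_ids == app_id).nonzero(as_tuple=True)[0]
    if len(app_positions) == 0:
        return input_ids, labels                  # no product event in this sequence
    start = app_positions[-1].item()              # last [APP]
    after = (input_ids[start:] == app_end_id).nonzero(as_tuple=True)[0]
    end = start + after[0].item() if len(after) > 0 else len(input_ids)  # else up to [SEP]
    for i in range(start + 1, end):
        if input_ids[i].item() in item_ids:       # item_ids: set of ITEM_* vocab IDs
            labels[i] = input_ids[i]              # this position becomes a prediction target
            input_ids[i] = mask_id                # replace with [MASK]
    return input_ids, labels
```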
### Version B — Random MLM over the full product sequence

Purpose: learn general co-occurrence/semantic structure of items in firm timelines (classic MLM).
- We mask tokens randomly across the whole sequence with probability `p` (e.g., 15%)
- This includes items across all product events, not only the last one
- This version behaves like standard BERT MLM, but applied to your product timeline format
This version is best when you want broad embeddings capturing item relationships and temporal context without specifically focusing on forecasting the last event.
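For reference, a condensed sketch of that random masking (the full BERT recipe also uses 80/10/10 mask/random/keep replacement, omitted here; `special_mask` is an assumed boolean tensor marking tokens that must never be masked, at minimum `[CLS]` and `[SEP]`):

```python
import torch

def random_mlm_mask(input_ids, special_mask, mask_id, p=0.15):
    """BERT-style random masking over the whole packed timeline."""
    labels = input_ids.clone()
    prob = torch.full(input_ids.shape, p)
    prob[special_mask] = 0.0                  # never mask protected special tokens
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                    # loss is computed on masked positions only
    input_ids = input_ids.clone()
    input_ids[masked] = mask_id               # simplest variant: always replace with [MASK]
    return input_ids, labels
```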
## How to enable each version in code

The behavior is controlled by the masking probabilities used in the collator:

- `TRAIN_RANDOM_MLM_PROB`
- `EVAL_RANDOM_MLM_PROB`

and by whether you “force-mask last-event items” (enabled in the forecasting collator logic).
## Recommended settings

### Forecasting-only (Version A)

- Train: `TRAIN_RANDOM_MLM_PROB = 0.0` (no random MLM noise)
- Eval: `EVAL_RANDOM_MLM_PROB = 0.0`
- Force-masking last-event `ITEM_*` stays ON
This focuses learning and evaluation on last-event item prediction.
### Forecasting + regularization (Version A + random noise)

- Train: `TRAIN_RANDOM_MLM_PROB = 0.15`
- Eval: `EVAL_RANDOM_MLM_PROB = 0.0`
- Force-masking last-event `ITEM_*` stays ON
This is the default “forecasting twist” setup: train with extra random MLM, evaluate cleanly on forecasting.
### Random MLM across full sequence (Version B)

- Train: `TRAIN_RANDOM_MLM_PROB = 0.15`
- Eval: `EVAL_RANDOM_MLM_PROB = 0.15` (or any non-zero value)
- (Optional) disable force-masking last-event items if you want pure standard MLM
Note: in the current `ForecastingCollator`, force-masking last-event items is always applied.
If you want pure random MLM (no forecasting), add a flag like `force_last_event=False` and skip the `prob[force_mask] = 1.0` step, as sketched below.
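A sketch of how that flag could be wired in (only the `prob[force_mask] = 1.0` line reflects the existing collator; the surrounding function and its signature are illustrative):

```python
import torch

def build_mlm_mask(input_ids, special_mask, force_mask, mlm_prob, force_last_event=True):
    """Combine random MLM with optional last-event force-masking.

    force_mask marks the ITEM_* positions in the last [APP]..[APP_END] segment;
    force_last_event is the hypothetical flag suggested in the note above.
    """
    prob = torch.full(input_ids.shape, mlm_prob)
    prob[special_mask] = 0.0       # protected tokens are never masked
    if force_last_event:
        prob[force_mask] = 1.0     # the step to skip for pure random MLM
    return torch.bernoulli(prob).bool()
```

With `mlm_prob=0.15` and `force_last_event=True` this corresponds to the default “forecasting twist” setup; `force_last_event=False` with a non-zero probability corresponds to Version B.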
## What the forecasting masking means (in practice)

A packed firm sequence looks like:

```
[CLS] DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
      DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]
      ...
      DATE_YYYY_MM [APP] NICE_* ... ITEM_* ... [APP_END]   <-- last event
[SEP]
```
- Version A: masks `ITEM_*` in the last `[APP]`..`[APP_END]` segment (forecasting target)
- Version B: masks tokens randomly across the entire sequence (classic MLM)
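Concretely, for a hypothetical two-event timeline (the `ITEM_A`..`ITEM_D` tokens are invented for illustration; real item tokens come from the goods/services vocabulary):

```
[CLS] DATE_2019_05 [APP] NICE_25 ITEM_A ITEM_B [APP_END]
      DATE_2021_11 [APP] NICE_25 ITEM_C ITEM_D [APP_END] [SEP]

Version A: ITEM_C and ITEM_D are always masked (the forecasting targets)
Version B: any maskable token, in either event, is masked with probability p
```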
## Metrics

Validation reports:

- AccAll: accuracy over all masked tokens
- Item@K: top-K accuracy restricted to masked positions where the true label is an `ITEM_*` token
For forecasting, Item@K is the main metric because it directly measures how well the model predicts items in the last product basket.
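A sketch of how Item@K can be computed from model logits, assuming `item_token_mask` is a boolean vector over the vocabulary flagging `ITEM_*` IDs and `labels` uses `-100` on unmasked positions (names and shapes here are assumptions, not the repo's actual evaluation code):

```python
import torch

def item_at_k(logits, labels, item_token_mask, k=10):
    """Top-K accuracy on masked positions whose gold label is an ITEM_* token.

    logits: (seq_len, vocab_size); labels: (seq_len,), -100 where unmasked.
    """
    masked = labels != -100
    is_item = masked & item_token_mask[labels.clamp(min=0)]
    if is_item.sum() == 0:
        return float("nan")                                   # no item targets here
    topk = logits[is_item].topk(k, dim=-1).indices            # (n_positions, k)
    hits = (topk == labels[is_item].unsqueeze(-1)).any(dim=-1)  # gold label in top-K?
    return hits.float().mean().item()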
## Results — MarketBERT (pretrained Market2Vec checkpoint)

### Training Summary

- Model: `A4_full_fixed_alpha_optionA_h512_h32`
- Best validation loss: `3.6433`
### Validation (HARD)

- Acc@1: `0.5996`
- Acc@5: `0.6651`
- Acc@10: `0.6944`
“HARD” refers to the stricter evaluation setting used in our validation protocol (forecasting-focused metrics on masked targets).
## Usage (Hugging Face)

```python
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("HamidBekam/MarketBERT")
model = AutoModel.from_pretrained("HamidBekam/MarketBERT")
```
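To turn a packed timeline into a firm-level embedding, one option (an assumption, not a documented recipe for this checkpoint) is the `[CLS]` hidden state:

```python
import torch

# Placeholder input; real inputs follow the packed format shown above.
packed = "[CLS] DATE_2020_01 [APP] ... [APP_END] [SEP]"
inputs = tok(packed, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    out = model(**inputs)
firm_embedding = out.last_hidden_state[:, 0]   # [CLS] vector, shape (1, hidden_size)
```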