geo-moe-mae: Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation
Model Description
geo-moe-mae is a compact, metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) designed for Earth Observation (EO) imagery. It aims to bring self-supervised representation learning to remote sensing with a lightweight architecture.
Key features:
- Uses sparse expert routing (a mixture of experts) to increase model capacity without a matching increase in the parameters active per input.
- Incorporates geospatial metadata (latitude, longitude) and temporal input encodings (seasonal / daily cycles) alongside image data.
- Pretrained on the BigEarthNet-Landsat dataset.
- Evaluated via linear probing (frozen encoder) on downstream tasks including BigEarthNet and EuroSAT, achieving competitive performance relative to much larger models.
- Very lightweight: only ~2.5 million parameters.
Model Architecture
- The base is a masked autoencoder (vision transformer style) with a MoE mechanism that routes tokens to different expert submodules (a generic sketch of such routing follows this list).
- Metadata (geographic and temporal) is fused into encoding layers to inform representation learning.
- For embedding extraction, the encoder is typically frozen and its output embeddings are fed to downstream linear classifiers.
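To make the routing concrete, here is a minimal, generic sketch of top-k expert routing in PyTorch. It illustrates the general MoE technique only; the class and parameter names (TopKMoE, num_experts, k) are placeholders, not the repository's implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative only)."""
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)  # router scores each expert per token
        self.k = k

    def forward(self, x):  # x: (num_tokens, dim)
        scores = self.gate(x)                       # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)        # normalize the kept gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # dispatch each token to its experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out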
Intended Use & Applications
Primary Use Cases
- Representation learning for Earth Observation imagery.
- Downstream classification or regression tasks (e.g. land cover classification, change detection), via linear probes or fine-tuning.
- Transfer learning across EO datasets, especially where metadata is available (e.g. lat/lon, time).
- Lightweight deployment scenarios (constrained compute), due to small model size.
Out-of-scope / Misuse
- Not designed for dense segmentation out-of-the-box (unless adapted).
- May underperform on highly detailed tasks needing large capacity or very fine resolution.
- Metadata may bias predictions if geographic or temporal distributions differ between the training and inference domains.
- Do not use the model for safety-critical applications (e.g. disaster response, agricultural guarantees) without thorough validation.
Training Data & Pretraining
- Pretrained on the BigEarthNet-Landsat dataset (a large EO dataset).
- Geospatial metadata (latitude, longitude) and seasonal/temporal cycle encodings were included as additional inputs.
- Trained with a masked autoencoding objective (reconstruction loss plus a MoE balancing loss) on image patches; a sketch of one common balancing formulation follows this list.
- Evaluated by freezing the encoder and training linear probes/classifiers on downstream tasks, including classification on BigEarthNet and EuroSAT.
- Compared with baseline models of much larger capacity, the model shows competitive performance.
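For reference, the MoE balancing term mentioned above is commonly implemented as the auxiliary load-balancing loss from Switch Transformer. The sketch below shows that standard formulation as an assumption; the exact loss used in the repository may differ:

import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_index, num_experts):
    # router_probs: (num_tokens, num_experts) softmax gate probabilities
    # expert_index: (num_tokens,) hard top-1 expert assignment per token
    one_hot = F.one_hot(expert_index, num_experts).float()
    density = one_hot.mean(dim=0)          # fraction of tokens routed to each expert
    mean_probs = router_probs.mean(dim=0)  # mean gate probability per expert
    # Minimized when both distributions are uniform across experts
    return num_experts * torch.sum(density * mean_probs)

# Combined pretraining objective (aux_weight is a tunable coefficient):
# loss = reconstruction_loss + aux_weight * load_balancing_loss(...)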
Limitations & Biases
- The model’s incorporation of metadata means that it may rely too much on geospatial or temporal priors, which can lead to overfitting to regions with distinct metadata distributions.
- The linear-probing evaluation does not reflect full fine-tuning scenarios (i.e. adapting the full model).
- The original training data (BigEarthNet) may have class imbalance, geographic coverage gaps, or biases in sensors or times.
How to Use
Here’s a sketch of how you might load and use the model (you may need to adjust paths/configs):
import torch

from models.moe_mae import MOEMAE, build_model

# Select a device and build the encoder (configs match the pretrained checkpoint)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_size = "S"
img_size = 40
patch_size = 4
in_channels = 7
checkpoint_path = "./weights/moe_mae_bigearthnet_ls/pretrained_S_best.pth"

encoder = build_model(
    size=model_size,
    img_size=img_size,
    patch_size=patch_size,
    in_chans=in_channels,
)
model = MOEMAE(encoder).to(device)

# Load the pretrained weights (the checkpoint's key layout may differ; adjust as needed)
state = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(state.get("model", state))

# Use the frozen encoder for embedding extraction
encoder = model.encoder
encoder.eval()
# Preprocess an example image + metadata
img = ... # preprocessed and normalized image tensor
lat = ... # normalized lat as sine, cosine pair
lon = ... # normalized lon as sine, cosine pair
week = ... # normalized week as sine, cosine pair
hour = ... # normalized hour as sine, cosine pair
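# A minimal sketch (an assumption, not necessarily the repository's exact
# preprocessing) of how such sine/cosine pairs can be computed:
import math

def cyclic_encode(value, period):
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# e.g. week = cyclic_encode(week_of_year, 52), hour = cyclic_encode(hour_of_day, 24);
# latitude/longitude can be encoded similarly over their degree ranges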
# Forward to get embeddings
out = model(img, week, hour, lat, lon)  # argument order may differ; check the repository's forward signature
embed = out[-1]
# Downstream: feed the embeddings to a linear classifier (see the sketch below)
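As a concrete example of the downstream step, here is a minimal linear-probe sketch on precomputed embeddings. The names embed_dim, num_classes, and train_loader are placeholders for your own setup, and the multi-label loss is an assumption based on BigEarthNet's label structure; this is not the repository's evaluation code:

import torch.nn as nn

probe = nn.Linear(embed_dim, num_classes).to(device)  # embed_dim/num_classes: your setup
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()                    # multi-label, as in BigEarthNet

for features, labels in train_loader:                 # precomputed frozen-encoder embeddings
    logits = probe(features.to(device))
    loss = criterion(logits, labels.to(device).float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()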
Citation
If you use this model in your work, please cite:
@misc{albughdadi2025lightweightmetadataawaremixtureofexpertsmasked,
title = {Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation},
author = {Mohanad Albughdadi},
year = {2025},
eprint = {2509.10919},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2509.10919},
}