geo-moe-mae: Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation

GitHub - geo-moe-mae

Model Description

geo-moe-mae is a compact, metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) designed for Earth Observation (EO) imagery. It aims to bring self-supervised representation learning to remote sensing with a lightweight architecture.

Key features:

  • Uses sparse expert routing (a mixture of experts) to increase model capacity without a matching increase in the compute applied to each input.
  • Incorporates geospatial metadata (latitude, longitude) and temporal encodings (seasonal/daily cycles) alongside image data; see the encoding sketch after this list.
  • Pretrained on the BigEarthNet-Landsat dataset.
  • Evaluated via linear probing (frozen encoder) on downstream tasks including BigEarthNet and EuroSAT, achieving competitive performance relative to much larger models.
  • Very lightweight: roughly 2.5 million parameters in total.
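
For the cyclic metadata encodings mentioned above, a minimal sketch of the usual (sin, cos) construction follows. The repo's exact normalization may differ; cyclic_encode is a hypothetical helper:

import math
import torch

def cyclic_encode(value, period):
    # Map a periodic quantity to a (sin, cos) pair so the encoding is
    # continuous across the wrap-around point (e.g. week 52 -> week 1).
    angle = 2 * math.pi * value / period
    return torch.tensor([math.sin(angle), math.cos(angle)])

week = cyclic_encode(23, 52)        # week of year
hour = cyclic_encode(10, 24)        # hour of day
lon  = cyclic_encode(2.35, 360.0)   # longitude in degrees
# Latitude is not periodic; (sin, cos) of the angle in radians is one option:
lat_rad = math.radians(48.85)
lat = torch.tensor([math.sin(lat_rad), math.cos(lat_rad)])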

Model Architecture

  • Base is a masked autoencoder (vision-transformer style) with a MoE mechanism that routes tokens to different expert submodules; a generic gating sketch follows this list.
  • Metadata (geographic and temporal) is fused into encoding layers to inform representation learning.
  • To extract embeddings, the encoder is typically frozen and its outputs are used as features for downstream linear classifiers.
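
For intuition, the sparse routing in the first bullet typically looks like the following top-k gated feed-forward block. This is a generic illustration, not the repo's exact implementation; all sizes and names here are made up:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    # Generic top-k gated mixture-of-experts feed-forward block.
    def __init__(self, dim, num_experts=4, k=1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                        # x: (num_tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)  # routing probabilities
        topv, topi = probs.topk(self.k, dim=-1)  # keep only k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = (topi == e).any(dim=-1)        # tokens routed to expert e
            if hit.any():
                w = topv[hit][topi[hit] == e].unsqueeze(-1)
                out[hit] = out[hit] + w * expert(x[hit])
        return out

Only the selected experts run for each token, which is how capacity grows without a matching increase in per-token compute.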

Intended Use & Applications

Primary Use Cases

  • Representation learning for Earth Observation imagery.
  • Downstream classification or regression tasks (e.g. land-cover classification, change detection) via linear probes or fine-tuning.
  • Transfer learning across EO datasets, especially where metadata is available (e.g. lat/lon, time).
  • Lightweight deployment scenarios (constrained compute), due to small model size.

Out-of-scope / Misuse

  • Not designed for dense segmentation out of the box (unless adapted).
  • May underperform on highly detailed tasks needing large capacity or very fine resolution.
  • Metadata may bias predictions when the geographic or temporal distribution at inference time differs from the training distribution.
  • Do not use the model for safety-critical applications (e.g. disaster response, agricultural guarantees) without thorough validation.

Training Data & Pretraining

  • Pretrained on the BigEarthNet-Landsat dataset (a large EO dataset).

  • Seasonal / temporal cycles and geospatial metadata were included as additional inputs (latitude, longitude, etc.).

  • Trained via a masked-autoencoding objective (reconstruction loss plus a MoE balancing loss) on image patches; see the loss sketch after this list.

  • Evaluated by freezing the encoder and training linear probes/classifiers on downstream tasks.

  • Tasks include classification on BigEarthNet and EuroSAT datasets.

  • Compared with baseline models of much larger capacity, showing competitive performance.
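
In code, the pretraining objective described above corresponds roughly to the following sketch. The balancing term and the coefficient alpha are illustrative assumptions, not the repo's exact values:

import torch
import torch.nn.functional as F

def pretraining_loss(pred, target, mask, gate_probs, alpha=0.01):
    # pred, target: (batch, num_patches, patch_dim); mask: bool (batch, num_patches)
    # MAE reconstruction loss, computed on masked patches only.
    recon = F.mse_loss(pred[mask], target[mask])
    # Load-balancing penalty: minimized when the average routing probability
    # is uniform across experts (one common formulation; the repo may differ).
    mean_prob = gate_probs.mean(dim=0)            # (num_experts,)
    balance = (mean_prob ** 2).sum() * mean_prob.numel()
    return recon + alpha * balance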

Limitations & Biases

  • Because metadata is fused into the representations, the model may lean heavily on geospatial or temporal priors and overfit to regions with distinctive metadata distributions.
  • The linear-probing evaluation does not reflect full fine-tuning scenarios (i.e. adapting the full model).
  • The original training data (BigEarthNet) may have class imbalance, geographic coverage gaps, or biases in sensors or times.

How to Use

Here’s a sketch of how you might load and use the model (you may need to adjust paths/configs):

import torch
from models.moe_mae import MOEMAE, build_model
# Assumption: load_model is the repo's checkpoint-loading helper;
# adjust this import to wherever it lives in the codebase.
from utils import load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the encoder and wrap it in the MoE-MAE model
model_size = "S"
img_size = 40
patch_size = 4
in_channels = 7
checkpoint_path = "./weights/moe_mae_bigearthnet_ls/pretrained_S_best.pth"

encoder = build_model(
    size=model_size,
    img_size=img_size,
    patch_size=patch_size,
    in_chans=in_channels,
)
model = MOEMAE(encoder).to(device)
model = load_model(model, checkpoint_path, device)
encoder = model.encoder
encoder.eval()

# Preprocess an example image + metadata
# (cyclic (sin, cos) encodings as in the sketch under "Key features")
img = ...   # preprocessed, normalized image tensor, shape (1, in_channels, img_size, img_size)
lat = ...   # latitude encoded as a (sin, cos) pair
lon = ...   # longitude encoded as a (sin, cos) pair
week = ...  # week of year encoded as a (sin, cos) pair
hour = ...  # hour of day encoded as a (sin, cos) pair

# Forward pass to get embeddings (the exact call signature may differ;
# check the forward definition in the repo)
out = model(img, week, hour, lat, lon)
embed = out[-1]  # last element of the output is taken as the embedding here
# Downstream: train a linear classifier on the embeddings (see sketch below)
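
With embeddings extracted for a labeled dataset, a frozen-encoder linear probe can be trained, for example with scikit-learn. A minimal sketch, assuming X_train/X_val and y_train/y_val are precomputed embedding and label arrays:

from sklearn.linear_model import LogisticRegression

# Suitable for a single-label task such as EuroSAT; BigEarthNet is
# multi-label and would need e.g. a one-vs-rest setup instead.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_val, y_val))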

Citation

If you use this model in your work, please cite:

@misc{albughdadi2025lightweightmetadataawaremixtureofexpertsmasked,
  title        = {Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation},
  author       = {Mohanad Albughdadi},
  year         = {2025},
  eprint       = {2509.10919},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2509.10919},
}