This heatmap shows deduplication across a family of fine-tuned models based on XLM-RoBERTa large, a multilingual transformer introduced in 2019 and trained on 100 languages. Each row represents a model repository (which often contains multiple formats, e.g., Safetensors, Keras, PyTorch) derived from the original research. Repository data is chunked into blocks of up to 64MB in Xet's storage layer, and this heatmap visualizes those blocks across models.
The base model is xlm-roberta-large, while the others are fine-tuned for specific languages on the CoNLL NER datasets (Dutch, Spanish, English, German). Darker blue regions highlight content shared across models: the more overlap, the more efficient storage and transfer becomes. This level of deduplication leads to faster uploads, quicker iterations, and less friction when scaling experimentation.
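To get a rough feel for what the heatmap is measuring, here is a minimal sketch of block-level overlap between two model files. It assumes fixed-size 64MB blocks hashed with SHA-256 purely for illustration, not Xet's actual chunk-boundary strategy, and the file paths and function names are hypothetical.

```python
import hashlib
from pathlib import Path

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, matching the block ceiling described above


def block_hashes(path: Path, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash a file in fixed-size blocks (a stand-in for Xet's real chunking)."""
    hashes = []
    with path.open("rb") as f:
        while block := f.read(block_size):
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes


def overlap_ratio(base: Path, fine_tuned: Path) -> float:
    """Fraction of the fine-tuned model's blocks already present in the base model."""
    base_blocks = set(block_hashes(base))
    ft_blocks = block_hashes(fine_tuned)
    if not ft_blocks:
        return 0.0
    shared = sum(1 for h in ft_blocks if h in base_blocks)
    return shared / len(ft_blocks)


# Hypothetical local copies of two of the repositories shown in the heatmap
print(overlap_ratio(
    Path("xlm-roberta-large/model.safetensors"),
    Path("xlm-roberta-large-finetuned-conll03-english/model.safetensors"),
))
```

A real measurement would aggregate this kind of overlap across every file and every pair of repositories in the family, which is what the darker regions of the heatmap represent.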
XLM-RoBERTa large currently has 396 fine-tunes on the Hub. The fine-tunes from the original CoNLL research deduplicate at ~17%, representing substantial time savings for builders repeatedly pushing new checkpoints and variants.
To explore the visualization: