Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

National Yang Ming Chiao Tung University
*Indicates Equal Contribution

Overview of YearGuessr. (a) Global distribution of the 55k Wikipedia-sourced building images. (b) Log-scale histogram of construction years spanning 1001–2024 CE. (c) A detailed example entry from our dataset, including an image, descriptive text, and building attributes such as year, coordinates, and page views. (d) Given an image and optional GPS coordinates, YearCLIP returns the estimated construction year together with architectural rationales.

Abstract

Building age matters for sustainability, heritage, and safety, yet no public benchmark is both global and ordinal. YearGuessr fills this gap with 55,546 Wikipedia building facades from 157 countries, continuously labeled across 1001–2024 CE and paired with GPS coordinates, captions, and page-view counts. We frame age prediction as ordinal regression and introduce popularity-weighted MAE and interval accuracy (±5/20/50/100 years). We benchmark 30+ models, spanning CNN-based, Transformer-based, and CLIP-based models as well as VLMs. Our dataset is released under CC BY-SA 4.0 here!
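The two evaluation metrics can be sketched as follows. This is a minimal illustration: interval accuracy is the fraction of predictions within a given tolerance of the true year, and the inverse-log-page-view weighting shown here is an assumption for demonstration, not necessarily the paper's exact popularity-weighting formula.

```python
import numpy as np

def interval_accuracy(y_true, y_pred, tol):
    # Fraction of predictions within ±tol years of the ground-truth year.
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= tol))

def popularity_weighted_mae(y_true, y_pred, views):
    # Hypothetical weighting: down-weight popular (high page-view) buildings
    # so memorized famous landmarks do not dominate the score (assumed scheme).
    w = 1.0 / np.log10(np.asarray(views, dtype=float) + 10.0)
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return float(np.sum(w * err) / np.sum(w))

y_true = np.array([1850, 1920, 1600])
y_pred = np.array([1845, 1980, 1700])
views = np.array([100, 100_000, 500])
acc20 = interval_accuracy(y_true, y_pred, tol=20)
pw_mae = popularity_weighted_mae(y_true, y_pred, views)
```

With these toy values, only the first prediction falls within ±20 years, so `acc20` is 1/3, while the popularity weighting pulls the MAE below the unweighted error of the heavily viewed second building.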

YearGuessr Dataset

Data collection for YearGuessr is a two-stage process. The first stage uses the Wikipedia website and related tools to acquire around 90,000 data entries, each including an image, construction year, building name, GPS coordinates, the textual description from the Wikipedia page, and a one-year page-view count. In the second stage, the data is passed through deduplication, a model-based filter, and human auditing, yielding a final dataset of 55,546 entries.
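The deduplication step can be sketched as below. This toy version removes exact byte-level duplicates via hashing; the actual pipeline may instead use perceptual hashing or embedding similarity, which this sketch does not attempt to reproduce.

```python
import hashlib

def dedup(entries):
    # Keep the first occurrence of each unique image; later exact
    # duplicates (same image bytes) are dropped.
    seen, kept = set(), []
    for entry in entries:
        digest = hashlib.sha256(entry["image_bytes"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(entry)
    return kept

entries = [
    {"name": "A", "image_bytes": b"\x00\x01"},
    {"name": "B", "image_bytes": b"\x00\x01"},  # exact duplicate of A
    {"name": "C", "image_bytes": b"\x02"},
]
kept = dedup(entries)  # keeps A and C, drops B
```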

Statistical Analysis

We analyzed the data distribution across geographical location, time period, page views (popularity), local population density (rural vs. urban), and building updates. Geographically, the data is concentrated in the Americas and Europe (63.31% and 22.48%, respectively). Temporally, it is concentrated in the modern era (1800–2024). Page views mostly fall between 100 and 10,000 (83.89%). Building locations are predominantly urban (46.37%), and most buildings (52.99%) have undergone updates that did not change their recorded construction year.



YearCLIP Architecture.

YearCLIP Model

YearCLIP uses NumCLIP as its backbone and improves ordinal regression (construction-year estimation) through language priors and a coarse-to-fine strategy. On the input side, YearCLIP fuses the image with optional coordinate information. On the text side, it pre-defines a set of coarse year classes and corresponding architectural reasoning prompts. At inference, YearCLIP computes the similarities between the fused input embedding and these pre-defined text embeddings, then feeds the similarity scores into a regressor that produces a more precise year.
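The coarse-to-fine flow can be sketched as follows. Everything here is a stand-in: the random "encoders", the five year bins, the prompt templates, and the expected-value refinement are illustrative assumptions, not the actual NumCLIP/YearCLIP components or learned regressor.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image, coords=None):
    # Placeholder for the fused image(+GPS) embedding from the CLIP image side.
    feat = rng.normal(size=512)
    return feat / np.linalg.norm(feat)

def encode_text(prompt):
    # Placeholder for the CLIP text embedding of a reasoning prompt.
    feat = rng.normal(size=512)
    return feat / np.linalg.norm(feat)

# Hypothetical coarse year classes paired with prompt templates.
coarse_bins = [(1001, 1500), (1500, 1800), (1800, 1900), (1900, 1970), (1970, 2024)]
prompts = [f"a building constructed between {a} and {b}" for a, b in coarse_bins]
text_feats = np.stack([encode_text(p) for p in prompts])

def predict_year(image, coords=None):
    img_feat = encode_image(image, coords)
    sims = text_feats @ img_feat                  # similarity to each coarse class
    probs = np.exp(sims) / np.exp(sims).sum()     # softmax over coarse bins
    centers = np.array([(a + b) / 2 for a, b in coarse_bins])
    # Stand-in for the learned regressor: refine the coarse classification
    # into a continuous year via the probability-weighted bin centers.
    return float(probs @ centers)

year = predict_year(image=None)
```

Because the final estimate is a probability-weighted average of bin centers, it always lands inside the 1001–2024 label range, mirroring how the coarse classification constrains the fine prediction.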

Explainability

A key advantage of YearCLIP is its ability to provide interpretable predictions through architectural reasoning. By leveraging language priors and similarity scores across predefined year ranges and reasoning templates, YearCLIP can explain its predictions with human-readable rationales such as architectural styles, materials, and design elements.

We demonstrate YearCLIP's explainability on three datasets: our YearGuessr benchmark, FI-London (a Street View Imagery, or SVI, dataset), and MapYourCity (another SVI dataset). The model identifies architectural features and provides coherent explanations across diverse building types and geographical locations, showcasing its robustness and interpretability.

Experimental Results

We compared over 40 models (details in the supplementary material) on YearGuessr, spanning the major categories of CNN, Transformer, CLIP-based, closed VLM, and open VLM. We visualized some of YearCLIP's predictions and evaluated performance using MAE and Interval Accuracy (IA). Among CLIP-based methods, YearCLIP achieves the lowest MAE at 39.52. We found that VLMs generally perform better overall, but when results are broken down by popularity, they exhibit a clear popularity bias. Other attributes that degrade performance include continents with less building data, very old construction years, remote areas, and buildings that have been rebuilt.

BibTeX


@misc{szutu2025memorizationmultimodalordinalregression,
      title={Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models}, 
      author={Li-Zhong Szu-Tu and Ting-Lin Wu and Chia-Jui Chang and He Syu and Yu-Lun Liu},
      year={2025},
      eprint={2512.21337},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.21337}, 
}