EnDive

A Cross-Dialect Benchmark for Fairness & Reasoning in LLMs

About EnDive

EnDive (English Diversity) is a large-scale evaluation suite for measuring language model fairness and performance across underrepresented English dialects. It includes over 40,000 dialect-specific examples spanning 12 tasks in language understanding, algorithmic reasoning, mathematics, and logic, across five diverse dialects: AAVE, ChcE, JamE, IndE, and CollSgE. The benchmark reveals real-world disparities in how models perform on dialectal versus standard English inputs, helping developers build more inclusive, robust, and trustworthy language systems.

Key Findings & Impact

EnDive evaluates seven state-of-the-art large language models across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our benchmark reveals significant performance disparities—models consistently underperform on dialectal inputs compared to Standard American English.

Benchmark Features
  • 40,000+ dialect-specific examples across 12 reasoning tasks
  • Five diverse dialects: AAVE, ChcE, JamE, IndE, and CollSgE
  • Human-validated translations with high faithfulness scores
  • Comprehensive evaluation of seven leading LLMs

Model Performance

Even top-performing models like o1 and Gemini 2.5 Pro show performance gaps of 3-5 percentage points between dialectal and SAE inputs. Smaller models like GPT-4o-mini exhibit even wider disparities, with drops of up to 12 points on certain dialects.

Methodology

Our approach combines few-shot prompting with verified examples from native speakers to create dialect-specific translations. We apply BLEU-based filtering to ensure only substantive linguistic variations are included, creating a challenging benchmark that reveals true model biases.

Human Validation

Native speakers of each dialect assessed our translations on faithfulness, fluency, formality, and information retention. Average scores exceeded 6.0/7 across all metrics, confirming the linguistic authenticity of our benchmark.

Contact

For questions about the benchmark, collaboration opportunities, or to report issues, please contact us at: abhaygupta1266@gmail.com

We welcome contributions and feedback from the research community to help improve model fairness across diverse English dialects.

Citation

@misc{gupta2025endivecrossdialectbenchmarkfairness,
  title={EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models},
  author={Abhay Gupta and Jacob Cheung and Philip Meng and Shayan Sayyed and Austen Liao and Kevin Zhu and Sean O'Brien},
  year={2025},
  eprint={2504.07100},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.07100}
}

This table compares how accurately different language models perform on Standard American English (SAE) versus various English dialects. Each dialect column pairs a model's accuracy on dialect-translated inputs with its accuracy on the matched SAE inputs, both under chain-of-thought (CoT) prompting. Significant drops on dialects such as African American Vernacular English (AAVE) or Jamaican English (JamE), relative to SAE, indicate potential biases and reduced effectiveness. These disparities highlight the need to evaluate and improve AI fairness across diverse language varieties.

Model                 AAVE           ChcE           CollSgE        IndE           JamE
                      CoT    SAE     CoT    SAE     CoT    SAE     CoT    SAE     CoT    SAE
🥇 o1                 89.13  93.15   88.54  93.39   89.14  93.50   90.34  94.07   89.40  93.14
🥈 Gemini 2.5 Pro     88.89  92.06   88.70  92.31   89.02  92.14   89.72  92.24   89.19  92.18
🥉 GPT-4o             82.20  87.36   80.37  87.35   82.43  87.31   83.30  87.34   82.53  87.44
DeepSeek-v3           82.06  87.36   81.55  87.27   81.65  87.37   82.90  87.44   81.40  87.38
Claude 3.5 Sonnet     79.78  83.10   81.15  88.78   81.15  88.83   79.61  88.82   80.18  88.79
LLaMa-3-8B Instruct   82.69  87.49   78.08  82.94   78.41  83.00   81.52  86.12   79.14  83.20
GPT-4o-mini           74.53  78.27   75.01  77.70   80.59  86.61   74.26  86.63   80.56  86.60
Average               82.75  86.97   81.91  87.11   83.20  88.39   83.09  88.95   83.20  88.39

(CoT = accuracy on dialect input; SAE = accuracy on the matched SAE input; both use chain-of-thought prompting.)
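
For readers who want to reproduce the headline gap numbers, the short Python sketch below recomputes each model's average SAE-minus-dialect gap from the leaderboard; only three rows are transcribed here, and the variable names are ours.

```python
# Recompute average SAE-minus-dialect gaps from the leaderboard above.
# Each tuple is (dialect CoT accuracy, SAE CoT accuracy), in the order
# AAVE, ChcE, CollSgE, IndE, JamE.
SCORES = {
    "o1":             [(89.13, 93.15), (88.54, 93.39), (89.14, 93.50),
                       (90.34, 94.07), (89.40, 93.14)],
    "Gemini 2.5 Pro": [(88.89, 92.06), (88.70, 92.31), (89.02, 92.14),
                       (89.72, 92.24), (89.19, 92.18)],
    "GPT-4o-mini":    [(74.53, 78.27), (75.01, 77.70), (80.59, 86.61),
                       (74.26, 86.63), (80.56, 86.60)],
}

for model, pairs in SCORES.items():
    # Average gap in percentage points across the five dialects.
    gap = sum(sae - dialect for dialect, sae in pairs) / len(pairs)
    print(f"{model:16s} average gap: {gap:.2f} points")
# o1 ~4.14, Gemini 2.5 Pro ~3.08, GPT-4o-mini ~6.17 points
```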

EnDive: A Cross-Dialect Benchmark for Fairness and Performance in LLMs

Purpose & Motivation

The diversity of human language presents significant challenges for NLP systems. While English serves as a global lingua franca, its dialects exhibit substantial variation that often goes unaddressed in language technologies (Chambers and Trudgill, 1998). This oversight perpetuates discrimination against dialect speakers in critical domains like education and employment (Purnell et al., 1999; Hofmann et al., 2024).

Recent studies reveal systemic biases in LLM processing of non-standard dialects (Fleisig et al., 2024; Resende et al., 2024)—from toxic speech misclassification of African American Vernacular English tweets (Sap et al., 2019) to parsing errors in Chicano and Jamaican English (Fought, 2003; Patrick, 1999).

EnDive addresses these gaps by providing a comprehensive benchmark that evaluates seven state-of-the-art LLMs across reasoning tasks in five underrepresented English dialects:

  • African American Vernacular English (AAVE): 33M speakers with distinct syntax/phonology (Lippi-Green, 1997)
  • Indian English (IndE): 250M speakers blending local/colonial influences (Kachru, 1983)
  • Jamaican English (JamE): Diaspora language with mesolectal variation (Patrick, 1999)
  • Chicano English (ChcE): Spanish-influenced variety in US Hispanic communities (Fought, 2003)
  • Colloquial Singaporean English (CollSgE): Multicultural creole with Asian substrates (Platt and Weber, 1980)

Methodology

Dataset Construction: EnDive curates challenges from 12 established datasets spanning four core reasoning categories (a loading sketch follows the list):

  • Language Understanding: BoolQ, MultiRC, WSC, SST-2, COPA
  • Algorithmic Understanding: HumanEval, MBPP
  • Mathematics: GSM8K, SVAMP
  • Logic: LogicBench, FOLIO
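
As a rough illustration of this curation step, the sketch below pulls a few of the 12 source datasets from the Hugging Face Hub; the exact configs and splits EnDive used are assumptions here.

```python
# Sketch: load a few of the 12 SAE source datasets from the Hugging Face
# Hub. The configs and splits shown are assumptions, not necessarily the
# ones EnDive used.
from datasets import load_dataset

sources = {
    "BoolQ": load_dataset("boolq", split="validation"),
    "GSM8K": load_dataset("gsm8k", "main", split="test"),
    "MBPP":  load_dataset("mbpp", split="test"),
}

for name, ds in sources.items():
    print(f"{name}: {len(ds)} examples, columns: {ds.column_names}")
```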

Dialect Translation: Using few-shot prompting with GPT-4o and verified examples from eWAVE (Kortmann et al., 2020), tasks are translated from SAE to target dialects while preserving sociolinguistic nuance. Example prompts used during translations can be found in our GitHub.
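
A minimal sketch of what this few-shot translation step can look like, assuming the OpenAI Python client; the prompt wording and the single AAVE in-context pair below are illustrative placeholders, since the real prompts built from eWAVE-verified examples live in the GitHub repository.

```python
# Sketch of the SAE -> dialect translation step via few-shot prompting.
# The prompt wording and the single in-context pair are illustrative;
# the real prompts, built from eWAVE-verified examples, are in the repo.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [  # (SAE, AAVE) pairs; hypothetical placeholder
    ("He is usually working late.", "He be workin' late."),
]

def translate(sae_text: str, dialect: str = "AAVE") -> str:
    examples = "\n\n".join(f"SAE: {s}\n{dialect}: {d}" for s, d in FEW_SHOT)
    prompt = (
        f"Translate the SAE text into {dialect}, preserving meaning, "
        f"tone, and level of formality.\n\n{examples}\n\n"
        f"SAE: {sae_text}\n{dialect}:"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```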

Quality Filtering: To eliminate superficial transformations, BLEU-based filtering removes translations with scores ≥0.7 against their SAE sources—retaining only substantive linguistic variations that challenge LLMs' dialect understanding.
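
A minimal sketch of that filter, using NLTK's sentence-level BLEU, which scores on a 0-1 scale and so matches the 0.7 cutoff; the whitespace tokenization and smoothing method are our assumptions.

```python
# Sketch: discard translations that stay too close to their SAE source.
# NLTK's sentence_bleu returns scores in [0, 1], matching the 0.7 cutoff;
# whitespace tokenization and the smoothing method are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

SMOOTH = SmoothingFunction().method1
THRESHOLD = 0.7

def is_substantive(sae: str, translation: str) -> bool:
    """Keep a pair only if the translation diverges enough from SAE."""
    score = sentence_bleu(
        [sae.split()], translation.split(), smoothing_function=SMOOTH
    )
    return score < THRESHOLD

pairs = [
    ("She has been working all day.", "She been workin' all day."),
    ("She has been working all day.", "She has been working all day"),
]
kept = [p for p in pairs if is_substantive(*p)]  # drops the near-copy
```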

Human Validation: Native speakers of each dialect assessed 120 randomly sampled translations on four dimensions using 7-point Likert scales:

Dialect    Faithfulness   Fluency   Formality   Info Retention
AAVE       6.28           6.28      6.28        6.63
ChcE       6.40           6.33      6.26        6.71
IndE       6.45           6.62      6.59        6.91
JamE       6.37           6.28      6.33        6.66
CollSgE    6.19           6.11      6.02        6.52

Table 1: Native speaker evaluation scores (1-7 scale), showing high quality across all metrics.

Key Findings

Performance Gaps: All seven models (including top-tier systems like o1 and Gemini 2.5 Pro) demonstrate consistent performance drops when evaluated on dialectal inputs compared to SAE prompts. Across models and dialects, the gap ranges from 2.69 to 12.37 percentage points.

Model-Specific Disparities:

  • Top-Tier Models: o1 and Gemini 2.5 Pro deliver the strongest performance across all dialects but still exhibit performance gaps of 3-5 points between dialectal and SAE inputs.
  • Mid-Tier Models: GPT-4o, DeepSeek-v3, and Claude 3.5 Sonnet show gaps exceeding 9 points in multiple dialects.
  • Smaller Models: GPT-4o-mini and LLaMa-3-8B Instruct consistently yield lower accuracies and exhibit wider gaps between dialectal and SAE inputs.

Dialect-Induced Errors: Common failure modes include semantic misalignment where models misinterpret polarity in dialectal inputs. Constructions like double negatives ("ain't no one"), habitual aspect ("don't be"), or markers like "been had" often cause models to flip the correct answer.

Implications & Impact

EnDive reveals that current language models, regardless of scale, exhibit dialectal bias. Even the best-performing models show measurable degradation across dialects, with performance gaps persisting across model tiers.

These findings highlight the need for more inclusive language technologies that serve all linguistic communities equitably, especially in high-stakes domains like education, healthcare, and legal services where dialect disparities can lead to real-world discrimination.

Related Work

EnDive builds upon and extends several key research directions in dialect-aware NLP: