About EnDive
EnDive (English Diversity) is a large-scale evaluation suite for measuring language model fairness and performance across underrepresented English dialects. We include over 40,000 dialect-specific examples spanning 12 tasks in language understanding, algorithmic reasoning, mathematics, and logic, and five diverse dialects: AAVE, ChcE, JamE, IndE, and CollSgE. The benchmark reveals real-world disparities in how models perform on dialectal versus standard English data, helping developers build more inclusive, robust, and trustworthy language systems.
Key Findings & Impact
EnDive evaluates seven state-of-the-art large language models across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our benchmark reveals significant performance disparities—models consistently underperform on dialectal inputs compared to Standard American English.
Benchmark Features
- 40,000+ dialect-specific examples across 12 tasks in language understanding, algorithmic reasoning, mathematics, and logic
- Five diverse dialects: AAVE, ChcE, JamE, IndE, and CollSgE
- Human-validated translations with high faithfulness scores
- Comprehensive evaluation of seven leading LLMs
Model Performance
Even top-performing models like o1 and Gemini 2.5 Pro show performance gaps of 3-5 percentage points between dialectal and SAE inputs. Smaller models like GPT-4o-mini exhibit even wider disparities, with drops of up to 12 points on certain dialects.
Methodology
Our approach combines few-shot prompting with verified examples from native speakers to create dialect-specific translations. We apply BLEU-based filtering to ensure only substantive linguistic variations are included, creating a challenging benchmark that reveals true model biases.
Human Validation
Native speakers of each dialect assessed our translations on faithfulness, fluency, formality, and information retention. Average scores exceeded 6.0/7 across all metrics, confirming the linguistic authenticity of our benchmark.
EnDive Team
EnDive is developed by researchers at Algoverse AI Research.
Contact
For questions about the benchmark, collaboration opportunities, or to report issues, please contact us at:
abhaygupta1266@gmail.com
We welcome contributions and feedback from the research community to help improve model fairness across diverse English dialects.
Citation
    @misc{gupta2025endivecrossdialectbenchmarkfairness,
      title={EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models},
      author={Abhay Gupta and Jacob Cheung and Philip Meng and Shayan Sayyed and Austen Liao and Kevin Zhu and Sean O'Brien},
      year={2025},
      eprint={2504.07100},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.07100}
    }
This table compares how accurately different language models perform on Standard American English (SAE) versus various English dialects. For each dialect there are two columns: the model's score when prompted in that dialect with chain-of-thought (CoT) reasoning, and its score on the SAE version of the same items (SAE CoT); all scores are percentages, and higher is better. Significant drops on dialects like African American Vernacular English (AAVE) or Jamaican English (JamE), relative to SAE, indicate potential biases and reduced effectiveness. These disparities highlight the need to evaluate and improve AI fairness across diverse language varieties.
Model | AAVE CoT | AAVE SAE CoT | ChcE CoT | ChcE SAE CoT | CollSgE CoT | CollSgE SAE CoT | IndE CoT | IndE SAE CoT | JamE CoT | JamE SAE CoT |
---|---|---|---|---|---|---|---|---|---|---|
🥇 o1 | 89.13 | 93.15 | 88.54 | 93.39 | 89.14 | 93.50 | 90.34 | 94.07 | 89.40 | 93.14 |
🥈 Gemini 2.5 Pro | 88.89 | 92.06 | 88.70 | 92.31 | 89.02 | 92.14 | 89.72 | 92.24 | 89.19 | 92.18 |
🥉 GPT-4o | 82.20 | 87.36 | 80.37 | 87.35 | 82.43 | 87.31 | 83.30 | 87.34 | 82.53 | 87.44 |
DeepSeek-v3 | 82.06 | 87.36 | 81.55 | 87.27 | 81.65 | 87.37 | 82.90 | 87.44 | 81.40 | 87.38 |
Claude 3.5 Sonnet | 79.78 | 83.10 | 81.15 | 88.78 | 81.15 | 88.83 | 79.61 | 88.82 | 80.18 | 88.79 |
LLaMa-3-8B Instruct | 82.69 | 87.49 | 78.08 | 82.94 | 78.41 | 83.00 | 81.52 | 86.12 | 79.14 | 83.20 |
GPT-4o-mini | 74.53 | 78.27 | 75.01 | 77.70 | 80.59 | 86.61 | 74.26 | 86.63 | 80.56 | 86.60 |
Average | 82.75 | 86.97 | 81.91 | 87.11 | 83.20 | 88.39 | 83.09 | 88.95 | 83.20 | 88.39 |
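As a reading aid, the short sketch below recomputes a few dialect-vs-SAE CoT gaps directly from the rows above. The values are copied from the AAVE and IndE columns of the table; the dictionary layout is just an illustrative way to hold them, not part of the benchmark code.

```python
# Recompute dialect-vs-SAE CoT gaps (in accuracy percentage points) for a few
# rows of the leaderboard above. Values are copied directly from the table.
scores = {
    "o1":          {"AAVE": (89.13, 93.15), "IndE": (90.34, 94.07)},
    "GPT-4o-mini": {"AAVE": (74.53, 78.27), "IndE": (74.26, 86.63)},
}

for model, dialects in scores.items():
    for dialect, (dialect_cot, sae_cot) in dialects.items():
        print(f"{model:<12} {dialect}: gap = {sae_cot - dialect_cot:.2f} points")
# o1           AAVE: gap = 4.02 points
# o1           IndE: gap = 3.73 points
# GPT-4o-mini  AAVE: gap = 3.74 points
# GPT-4o-mini  IndE: gap = 12.37 points
```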
EnDive: A Cross-Dialect Benchmark for Fairness and Performance in LLMs
Purpose & Motivation
The diversity of human language presents significant challenges for NLP systems. While English serves as a global lingua franca, its dialects exhibit substantial variation that often goes unaddressed in language technologies (Chambers and Trudgill, 1998). This oversight perpetuates discrimination against dialect speakers in critical domains like education and employment (Purnell et al., 1999; Hofmann et al., 2024).
Recent studies reveal systemic biases in LLM processing of non-standard dialects (Fleisig et al., 2024; Resende et al., 2024)—from toxic speech misclassification of African American Vernacular English tweets (Sap et al., 2019) to parsing errors in Chicano and Jamaican English (Fought, 2003; Patrick, 1999).
EnDive addresses these gaps by providing a comprehensive benchmark that evaluates seven state-of-the-art LLMs across reasoning tasks in five underrepresented English dialects:
- African American Vernacular English (AAVE): 33M speakers with distinct syntax/phonology (Lippi-Green, 1997)
- Indian English (IndE): 250M speakers blending local/colonial influences (Kachru, 1983)
- Jamaican English (JamE): Diaspora language with mesolectal variation (Patrick, 1999)
- Chicano English (ChcE): Spanish-influenced variety in US Hispanic communities (Fought, 2003)
- Colloquial Singaporean English (CollSgE): Multicultural creole with Asian substrates (Platt and Weber, 1980)
Methodology
Dataset Construction: EnDive curates challenges from 12 established datasets spanning four core reasoning categories:
- Language Understanding: BoolQ, MultiRC, WSC, SST-2, COPA
- Algorithmic Understanding: HumanEval, MBPP
- Mathematics: GSM8K, SVAMP
- Logic: LogicBench, FOLIO
Dialect Translation: Using few-shot prompting with GPT-4o and verified examples from eWAVE (Kortmann et al., 2020), tasks are translated from SAE to target dialects while preserving sociolinguistic nuance. Example prompts used during translations can be found in our GitHub.
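The snippet below is a minimal sketch of how such a few-shot translation call could be assembled with the OpenAI chat completions API. The system instruction, the two SAE/AAVE exemplar pairs, and the decoding settings are illustrative assumptions, not the exact prompts released in our GitHub repository.

```python
# Minimal sketch of a few-shot SAE -> dialect translation request. The exemplar
# pairs, system instruction, and temperature are illustrative placeholders, not
# the exact prompts used to build EnDive.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical verified SAE -> AAVE pairs standing in for the eWAVE-grounded,
# native-speaker-verified exemplars described above.
FEW_SHOT_PAIRS = [
    ("He is always working late.", "He be working late."),
    ("There is not anything left.", "It ain't nothing left."),
]

def translate_to_dialect(sae_text: str, dialect: str = "AAVE") -> str:
    """Translate an SAE task prompt into the target dialect with few-shot guidance."""
    messages = [{
        "role": "system",
        "content": (f"Translate Standard American English into {dialect}. "
                    "Preserve meaning, tone, and every task-relevant detail."),
    }]
    for sae, dialectal in FEW_SHOT_PAIRS:
        messages.append({"role": "user", "content": sae})
        messages.append({"role": "assistant", "content": dialectal})
    messages.append({"role": "user", "content": sae_text})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()
```

A low temperature keeps the output close to the source content while still allowing dialectal rewording; the verified exemplars anchor the model to attested dialect features rather than stereotyped ones.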
Quality Filtering: To eliminate superficial transformations, BLEU-based filtering removes translations with scores ≥0.7 against their SAE sources—retaining only substantive linguistic variations that challenge LLMs' dialect understanding.
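As a rough illustration of this filter, the sketch below scores each dialect translation against its SAE source with NLTK's sentence-level BLEU and keeps only pairs below the 0.7 threshold. The whitespace tokenizer and smoothing choice are simplifying assumptions rather than EnDive's exact implementation.

```python
# Sketch of the BLEU-based filter: drop translations scoring >= 0.7 BLEU against
# their SAE source. Tokenization and smoothing are simplified for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

BLEU_THRESHOLD = 0.7
_smooth = SmoothingFunction().method1

def is_substantive(sae_text: str, dialect_text: str) -> bool:
    """True if the dialect translation differs enough from its SAE source to keep."""
    reference = [sae_text.lower().split()]
    hypothesis = dialect_text.lower().split()
    score = sentence_bleu(reference, hypothesis, smoothing_function=_smooth)
    return score < BLEU_THRESHOLD

pairs = [
    ("She is going to the store.", "She is going to the store."),   # near-verbatim copy
    ("She is going to the store.", "She finna go to the store."),   # substantive rewrite
]
kept = [(sae, dia) for sae, dia in pairs if is_substantive(sae, dia)]
print(len(kept))  # 1: only the substantive rewrite survives the filter
```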
Human Validation: Native speakers of each dialect assessed 120 randomly sampled translations on four dimensions using 7-point Likert scales:
Dialect | Faithfulness | Fluency | Formality | Info Retention |
---|---|---|---|---|
AAVE | 6.28 | 6.28 | 6.28 | 6.63 |
ChcE | 6.40 | 6.33 | 6.26 | 6.71 |
IndE | 6.45 | 6.62 | 6.59 | 6.91 |
JamE | 6.37 | 6.28 | 6.33 | 6.66 |
CollSgE | 6.19 | 6.11 | 6.02 | 6.52 |
Table 1: Native speaker evaluation scores (1-7 scale), showing high quality across all metrics.
Key Findings
Performance Gaps: All seven models (including top-tier systems like o1 and Gemini 2.5 Pro) demonstrate consistent performance drops when evaluated on dialectal inputs compared to SAE prompts. The average gap ranges from 2.69 to over 12.37 percentage points.
Model-Specific Disparities:
- Top-Tier Models: o1 and Gemini 2.5 Pro deliver the strongest performance across all dialects but still exhibit performance gaps of 3-5 points between dialectal and SAE inputs.
- Mid-Tier Models: GPT-4o, DeepSeek-v3, and Claude 3.5 Sonnet show gaps exceeding 9 points in multiple dialects.
- Smaller Models: GPT-4o-mini and LLaMa-3-8B Instruct consistently yield lower accuracies and exhibit wider gaps between dialectal and SAE inputs.
Dialect-Induced Errors: Common failure modes include semantic misalignment where models misinterpret polarity in dialectal inputs. Constructions like double negatives ("ain't no one"), habitual aspect ("don't be"), or markers like "been had" often cause models to flip the correct answer.
Implications & Impact
EnDive reveals that current language models, regardless of scale, exhibit dialectal bias. Even the best-performing models show measurable degradation across dialects, with performance gaps persisting across model tiers.
These findings highlight the need for more inclusive language technologies that serve all linguistic communities equitably, especially in high-stakes domains like education, healthcare, and legal services where dialect disparities can lead to real-world discrimination.
Related Work
EnDive builds upon and extends several key research directions in dialect-aware NLP:
- VALUE: Understanding Dialect Disparity in NLU (Ziems et al., 2022): Early work establishing the need for dialect-aware evaluation in NLP.
- Multi-VALUE: A Framework for Cross-Dialectal English NLP (Ziems et al., 2023): Rule-based framework for transforming SAE into target dialects using lexical substitutions.
- AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE (Gupta et al., 2024): Human-validated benchmark specifically for AAVE.
- CulturePark: Boosting Cross-Cultural Understanding in LLMs (Li et al., 2024): Hybrid methodology for cross-cultural dialogue evaluation.
- AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs (Mousi et al., 2024): Similar benchmark focused on Arabic dialects.
- Dialect Prejudice Predicts AI Decisions About People (Hofmann et al., 2024): Links LLM dialect biases to real-world discrimination in employment, criminality, and medical diagnoses.
- A Comprehensive View of the Biases of Toxicity and Sentiment Analysis Methods Towards Utterances with African American English Expressions (Resende et al., 2024): Examines biases in toxicity detection systems against AAVE speakers.