Multilingual LLMs Morality Benchmark Research

A multilingual benchmark study evaluating the moral reasoning and safety capabilities of major LLMs

Research 2025

Project Overview

A research project evaluating the moral reasoning and safety capabilities of major LLMs across multiple languages, with the goal of establishing a benchmark for multilingual ethical LLM evaluation. I served as a researcher on our team, designing the evaluation framework, building the LLM-as-judge scoring system, analyzing data, and writing the final research paper.

Research Methodology

We built a 500-question benchmark dataset across 5 ethical categories (legality, moral judgment, bias & stereotypes, harm prevention, and consent & autonomy). To assess the limits of LLM safety features, questions in the harm prevention and consent & autonomy categories were designed to be indirect and deceptive, testing whether models could detect harmful intent disguised as harmless requests. We tested 5 major LLMs (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro, Llama 4, Qwen3) across 6 languages (English, Spanish, Arabic, Hindi, Chinese, Swahili) using a Gemini-as-judge scoring system. See the presentation slides below or the research paper for more details.

Key Findings

  • GPT-5 led overall with ~92% accuracy; Qwen had the lowest at ~66%
  • All models performed better in straightforward categories (legality, moral judgment) than in deceptive ones (consent, harm prevention)
  • Models scored higher in low-resource languages on trick questions (likely because they flagged harmful keywords rather than reading context)
  • Qwen defaulted to Chinese law when answering legality questions, regardless of the user's location
  • Gemini showed the largest performance gap between regular and trick categories (2+ point difference)

Reflection

This project grew my ability to quickly absorb and apply complex concepts. I started the summer with little LLM knowledge to designing a full evaluation framework through reading numerous research papers and replicating studies. Collaborating closely with my team daily strengthened my communication and research skills. Getting direct feedback from researchers in the field and professionals at leading AI companies was especially valuable in refining our methodology, and also gave me a much better understanding of what AI safety research looks like in the industry.

Project Details

Client: Independent Contractor
Duration: Jun. 2025 - Sept. 2025
Team Size: 3 researchers

Research Tools

Data Analysis Python Large Language Models Prompt Engineering

Research Presentation