Question 1

What format is the data in?

Accepted Answer

Each row holds a prompt, two AI answers, and the human's preferred choice between them, along with topic, language, and nested per-row provenance (source dataset, license, source row ID). We ship the same data as JSONL, JSON, and CSV.
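As a rough illustration of that row shape, here is a minimal Python sketch that parses one JSONL line and checks that the described fields are present. The key names (`prompt`, `response_a`, `response_b`, `preferred`, `topic`, `language`, `provenance`, `source_dataset`, `license`, `source_row_id`), the `"a"`/`"b"` encoding of the preference, and the example values are all assumptions made for this sketch; the actual files may spell them differently.

```python
import json

# Assumed key names, for illustration only; the answer above names the
# fields but not their exact spelling in the files.
REQUIRED_KEYS = {"prompt", "response_a", "response_b", "preferred",
                 "topic", "language", "provenance"}
PROVENANCE_KEYS = {"source_dataset", "license", "source_row_id"}


def parse_row(line: str) -> dict:
    """Parse one JSONL line and check that the expected fields are present."""
    row = json.loads(line)
    missing = REQUIRED_KEYS - set(row)
    if missing:
        raise ValueError(f"row is missing fields: {sorted(missing)}")
    missing_prov = PROVENANCE_KEYS - set(row["provenance"])
    if missing_prov:
        raise ValueError(f"provenance is missing fields: {sorted(missing_prov)}")
    # Assumption: the preference is stored as "a" or "b".
    if row["preferred"] not in ("a", "b"):
        raise ValueError("preferred must be 'a' or 'b'")
    return row


# Placeholder row with made-up values, not taken from the real sample.
example_line = json.dumps({
    "prompt": "Example prompt",
    "response_a": "First example answer",
    "response_b": "Second example answer",
    "preferred": "a",
    "topic": "example-topic",
    "language": "en",
    "provenance": {
        "source_dataset": "oasst2",
        "license": "Apache-2.0",
        "source_row_id": "example-0001",
    },
})
example_row = parse_row(example_line)
```

In the CSV version the nested provenance would have to be flattened somehow (for example into separate columns); how the real files do that is not stated here.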

Question 2

What languages does the data cover?

Accepted Answer

The public sample is English and proves the format and provenance discipline. The mission is human-feedback data for low-resource and underserved languages; per-language sets are prospective and built on commission, never claimed before they exist.

Question 3

Is the data license-clean?

Accepted Answer

Yes. The sample is built only from public, openly licensed sources, with the license recorded per row (Apache-2.0 for oasst2 rows, MIT for hh-rlhf helpful-base rows) and a NOTICE that must be retained on redistribution.
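Because the license is recorded per row, a consumer could in principle audit it before redistributing. The Python sketch below checks each row's recorded license against the two pairings named above. The key names (`provenance`, `source_dataset`, `license`, `source_row_id`), the dataset identifier strings, and the demo rows are assumptions for illustration, not the confirmed schema.

```python
# License pairings as stated in the answer above. The exact identifier
# strings used in the files are an assumption of this sketch.
EXPECTED_LICENSE = {
    "oasst2": "Apache-2.0",
    "hh-rlhf/helpful-base": "MIT",
}


def license_mismatches(rows):
    """Return the source row ids whose recorded license is unexpected."""
    bad = []
    for row in rows:
        prov = row["provenance"]
        expected = EXPECTED_LICENSE.get(prov["source_dataset"])
        # An unknown source dataset also counts as a mismatch.
        if expected is None or prov["license"] != expected:
            bad.append(prov["source_row_id"])
    return bad


# Made-up demo rows; only the provenance part matters for this check.
demo_rows = [
    {"provenance": {"source_dataset": "oasst2",
                    "license": "Apache-2.0",
                    "source_row_id": "demo-1"}},
    {"provenance": {"source_dataset": "hh-rlhf/helpful-base",
                    "license": "MIT",
                    "source_row_id": "demo-2"}},
    {"provenance": {"source_dataset": "oasst2",
                    "license": "MIT",
                    "source_row_id": "demo-3"}},
]
mismatched = license_mismatches(demo_rows)
```

A check like this only confirms that the recorded fields are internally consistent; it does not replace retaining the NOTICE file on redistribution.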

Open Human-Feedback Datasets for Underserved Languages

Dataset FAQ