A cleaned version is available on Hugging Face that removes potentially infringing subsets such as Books3 and OpenSubtitles.
Do not decompress with gzip or unzip. The files use Zstandard, not gzip. Always use zstd -d.
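To guard against picking the wrong tool, one option is to check a file's leading magic bytes before decompressing it. The sketch below is only an illustration: the helper names detect_format and decompress_command are hypothetical, and it assumes each file starts with a regular Zstandard frame (magic bytes 28 B5 2F FD) and that the zstd command-line tool is installed.

```python
# Leading magic bytes for each container format.
MAGIC = {
    "zstd": b"\x28\xb5\x2f\xfd",  # regular Zstandard frame
    "gzip": b"\x1f\x8b",
    "zip": b"PK\x03\x04",
}


def detect_format(path):
    """Return 'zstd', 'gzip', 'zip', or 'unknown' from the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    for name, magic in MAGIC.items():
        if head.startswith(magic):
            return name
    return "unknown"


def decompress_command(path):
    """Build the zstd command line for a file (hypothetical helper).

    Raises ValueError if the file does not look like Zstandard, so a
    mislabeled or truncated download is caught before decompression.
    """
    fmt = detect_format(path)
    if fmt != "zstd":
        raise ValueError(f"{path} looks like {fmt}, not Zstandard")
    return ["zstd", "-d", str(path)]
```

The returned list can be passed to subprocess.run(..., check=True). For example, decompress_command("00.jsonl.zst") yields ["zstd", "-d", "00.jsonl.zst"] when the file is a Zstandard archive; the file name here is a placeholder, not a guaranteed path.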