020 Hugging Face Datasets Overview (PyTorch)

The Hugging Face Datasets library: a quick overview.

The Hugging Face Datasets library provides an API to quickly download many public datasets and preprocess them. In this video we will explore how to do that.

The downloading part is easy: with the load_dataset function, you can directly download and cache a dataset from its identifier on the Dataset Hub. Here we fetch the MRPC dataset from the GLUE benchmark, a dataset containing pairs of sentences where the task is to determine whether they are paraphrases.

The object returned by the load_dataset function is a DatasetDict, which is a sort of dictionary containing each split of our dataset. We can access each split by indexing with its name. Each split is then an instance of the Dataset class, with columns (here sentence1, sentence2, label and idx) and rows. We can access a given element by its index.

The amazing thing about the Hugging Face Datasets library is that everything is saved to disk using Apache Arrow, which means that even if your dataset is huge you won't run out of RAM: only the elements you request are loaded in memory. Accessing a slice of your dataset is as easy as accessing one element. The result is then a dictionary with a list of values for each key (here the list of labels, the list of first sentences and the list of second sentences).

The features attribute of a Dataset gives us more information about its columns. In particular, it gives us the correspondence between the integers and the names of the labels: 0 stands for "not equivalent" and 1 for "equivalent".

To preprocess all the elements of our dataset, we need to tokenize them. Have a look at the video "Preprocess sentence pairs" for a refresher; in short, you just send the two sentences to the tokenizer with some additional keyword arguments. Here we indicate a maximum length of 128, pad inputs shorter than this length, and truncate inputs that are longer. We put all of this in a tokenize_function that we can apply directly to all the splits in our dataset with the map method. As long as the function returns a dictionary-like object, the map method will add new columns as needed or update existing ones.

To speed up preprocessing and take advantage of the fact that our tokenizer is backed by Rust, thanks to the Hugging Face Tokenizers library, we can pass several elements at a time to our tokenize function by using the batched=True argument, as in the sketches below.
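The loading and inspection steps described so far can be sketched in a few lines. This is a minimal, illustrative version, not the exact code shown in the video; the printed output depends on the dataset version you download:

```python
from datasets import load_dataset

# Download and cache MRPC from the GLUE benchmark.
raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)  # DatasetDict with train, validation and test splits

train_ds = raw_datasets["train"]  # a Dataset with columns sentence1, sentence2, label, idx
print(train_ds[0])    # one element, fetched by index from the Arrow file on disk
print(train_ds[:5])   # a slice: a dict with a list of values for each column

# The features attribute describes the columns; for label it gives the
# integer-to-name mapping (0 = not_equivalent, 1 = equivalent).
print(train_ds.features["label"].names)
```

The tokenization step, again as a hedged sketch: the checkpoint name bert-base-cased is an assumption for illustration (the transcript does not name one), while the padding and truncation settings follow the description above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

def tokenize_function(examples):
    # Pad inputs shorter than 128 tokens and truncate longer ones.
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

# With batched=True, map feeds the function batches of examples, so
# examples["sentence1"] is a list of strings; the fast Rust-backed
# tokenizer handles such lists directly.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```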
Since the tokenizer can handle lists of first and second sentences, the tokenize_function does not need to change for this. You can also use multiprocessing with the map method; check out its documentation!

Once this is done, we are almost ready for training: we just remove the columns we don't need anymore with the remove_columns method, rename label to labels (since the models from Hugging Face Transformers expect that), and set the output format to our desired backend: torch, tensorflow or numpy. If needed, we can also generate a short sample of a dataset using the select method.
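Those final steps, continuing from the tokenized_datasets of the previous sketch, might look like this. The column names follow the MRPC example; num_proc is the map argument that enables multiprocessing, and the worker count of 4 is an arbitrary choice:

```python
# Multiprocessing variant of the earlier map call (optional).
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)

# Drop the raw text columns the model does not consume.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])

# Models from Hugging Face Transformers expect the target column to be "labels".
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Return tensors in the desired backend: "torch", "tensorflow" or "numpy".
tokenized_datasets.set_format("torch")

# If needed, take a short sample of a split with select.
small_train = tokenized_datasets["train"].select(range(100))
```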
