020 Hugging Face Datasets Overview (PyTorch)

The Hugging Face Datasets library: a quick overview.

The Hugging Face Datasets library provides an API to quickly download many public datasets and preprocess them. In this video we will explore how to do that.

The downloading part is easy: with the load_dataset function, you can directly download and cache a dataset from its identifier on the Dataset Hub. Here we fetch the MRPC dataset from the GLUE benchmark, a dataset containing pairs of sentences where the task is to determine whether they are paraphrases.

The object returned by the load_dataset function is a DatasetDict, which is a sort of dictionary containing each split of our dataset. We can access each split by indexing with its name. Each split is then an instance of the Dataset class, with columns (here sentence1, sentence2, label and idx) and rows. We can access a given element by its index.

The amazing thing about the Hugging Face Datasets library is that everything is saved to disk using Apache Arrow, which means that even if your dataset is huge you won't run out of RAM: only the elements you request are loaded in memory. Accessing a slice of your dataset is as easy as accessing one element. The result is then a dictionary with a list of values for each key (here the list of labels, the list of first sentences and the list of second sentences).

The features attribute of a Dataset gives us more information about its columns. In particular, it gives us the correspondence between the integers and the names of the labels: 0 stands for "not equivalent" and 1 for "equivalent".

To preprocess all the elements of our dataset, we need to tokenize them. Have a look at the video "Preprocess sentence pairs" for a refresher; in short, you just send the two sentences to the tokenizer with some additional keyword arguments. Here we indicate a maximum length of 128, pad inputs shorter than this length, and truncate inputs that are longer. We put all of this in a tokenize_function that we can apply directly to all the splits in our dataset with the map method. As long as the function returns a dictionary-like object, the map method will add new columns as needed or update existing ones.

To speed up preprocessing and take advantage of the fact that our tokenizer is backed by Rust, thanks to the Hugging Face Tokenizers library, we can pass several elements at a time to our tokenize function by using the batched=True argument, as in the sketches below.
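The loading and inspection steps described so far can be sketched in a few lines. This is a minimal, illustrative version, not the exact code shown in the video; the printed output depends on the dataset version you download:

```python
from datasets import load_dataset

# Download and cache MRPC from the GLUE benchmark.
raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)  # DatasetDict with train, validation and test splits

train_ds = raw_datasets["train"]  # a Dataset with columns sentence1, sentence2, label, idx
print(train_ds[0])    # one element, fetched by index from the Arrow file on disk
print(train_ds[:5])   # a slice: a dict with a list of values for each column

# The features attribute describes the columns; for label it gives the
# integer-to-name mapping (0 = not_equivalent, 1 = equivalent).
print(train_ds.features["label"].names)
```

The tokenization step, again as a hedged sketch: the checkpoint name bert-base-cased is an assumption for illustration (the transcript does not name one), while the padding and truncation settings follow the description above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

def tokenize_function(examples):
    # Pad inputs shorter than 128 tokens and truncate longer ones.
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

# With batched=True, map feeds the function batches of examples, so
# examples["sentence1"] is a list of strings; the fast Rust-backed
# tokenizer handles such lists directly.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```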
Since the tokenizer can handle lists of first and second sentences, the tokenize_function does not need to change for this. You can also use multiprocessing with the map method; check out its documentation!

Once this is done, we are almost ready for training: we just remove the columns we don't need anymore with the remove_columns method, rename label to labels (since the models from Hugging Face Transformers expect that), and set the output format to our desired backend: torch, tensorflow or numpy. If needed, we can also generate a short sample of a dataset using the select method.
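Those final steps, continuing from the tokenized_datasets of the previous sketch, might look like this. The column names follow the MRPC example; num_proc is the map argument that enables multiprocessing, and the worker count of 4 is an arbitrary choice:

```python
# Multiprocessing variant of the earlier map call (optional).
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)

# Drop the raw text columns the model does not consume.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])

# Models from Hugging Face Transformers expect the target column to be "labels".
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Return tensors in the desired backend: "torch", "tensorflow" or "numpy".
tokenized_datasets.set_format("torch")

# If needed, take a short sample of a split with select.
small_train = tokenized_datasets["train"].select(range(100))
```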
