Subtitles for 010 What happens inside the pipeline function? (TensorFlow)

What happens inside the pipeline function?

In this video, we will look at what actually happens when we use the pipeline function of the Transformers library. More specifically, we will look at the sentiment analysis pipeline, and how it goes from the following two sentences to the positive labels with their respective scores.

As we have seen in the pipeline presentation, there are three stages in the pipeline. First, we convert the raw texts to numbers the model can make sense of, using a tokenizer. Then, those numbers go through the model, which outputs logits. Finally, the post-processing step transforms those logits into labels and scores.

Let's look in detail at those three steps, and how to replicate them using the Transformers library, beginning with the first stage, tokenization. The tokenization process has several steps. First, the text is split into small chunks called tokens. They can be words, parts of words, or punctuation symbols. Then the tokenizer will add some special tokens (if the model expects them). Here the model expects a CLS token at the beginning and a SEP token at the end of the sentence to classify. Lastly, the tokenizer matches each token to its unique ID in the vocabulary of the pretrained model. To load such a tokenizer, the Transformers library provides the AutoTokenizer API.

The most important method of this class is from_pretrained, which will download and cache the configuration and the vocabulary associated with a given checkpoint. Here, the checkpoint used by default for the sentiment analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english.

We instantiate a tokenizer associated with that checkpoint, then feed it the two sentences. Since those two sentences are not the same size, we will need to pad the shorter one to be able to build an array. This is done by the tokenizer with the option padding=True. With truncation=True, we ensure that any sentence longer than the maximum the model can handle is truncated. Lastly, the return_tensors option tells the tokenizer to return a TensorFlow tensor.

Looking at the result, we see we have a dictionary with two keys. The first, input_ids, contains the IDs of both sentences, with 0s where padding is applied. The second key, attention_mask, indicates where padding has been applied, so the model does not pay attention to it. This is everything that happens inside the tokenization step.
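A minimal sketch of this tokenization step, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint named above; the two example sentences are placeholders, not necessarily the ones used in the video:

    from transformers import AutoTokenizer

    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    # Placeholder sentences standing in for the two sentences shown in the video.
    raw_inputs = [
        "I've been waiting for this course my whole life.",
        "I hate this so much!",
    ]

    # padding=True pads the shorter sentence, truncation=True cuts anything longer
    # than the model's maximum length, return_tensors="tf" gives TensorFlow tensors.
    inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
    print(inputs["input_ids"])       # token IDs, with 0s where padding was applied
    print(inputs["attention_mask"])  # 1 for real tokens, 0 for padding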
Now let's have a look at the second step, the model. As with the tokenizer, there is a TFAutoModel API, with a from_pretrained method. It will download and cache the configuration of the model as well as the pretrained weights. However, the TFAutoModel API will only instantiate the body of the model, that is, the part of the model that is left once the pretraining head is removed. It will output a high-dimensional tensor that is a representation of the sentences passed, but which is not directly useful for our classification problem. Here the tensor has two sentences, each of sixteen tokens, and the last dimension is the hidden size of our model, 768.

To get an output linked to our classification problem, we need to use the TFAutoModelForSequenceClassification class. It works exactly like the TFAutoModel class, except that it will build a model with a classification head. There is one auto class for each common NLP task in the Transformers library. Here, after giving our model the two sentences, we get a tensor of size two by two: one result for each sentence and for each possible label. Those outputs are not probabilities yet (we can see they don't sum to 1). This is because each model of the Transformers library returns logits.

To make sense of those logits, we need to dig into the third and last step of the pipeline: post-processing. To convert logits into probabilities, we need to apply a SoftMax layer to them. As we can see, this transforms them into positive numbers that sum up to 1. The last step is to know which of those corresponds to the positive or the negative label. This is given by the id2label field of the model config. The first probabilities (index 0) correspond to the negative label, and the second ones (index 1) correspond to the positive label. This is how our classifier built with the pipeline function picked those labels and computed those scores. Now that you know how each step works, you can easily tweak them to your needs.
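To recap the model and post-processing steps in code, here is a minimal sketch that reuses the inputs dictionary from the tokenizer sketch above; the printed shapes and label names depend on the checkpoint and sentences and are only indicative:

    import tensorflow as tf
    from transformers import TFAutoModel, TFAutoModelForSequenceClassification

    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

    # Body only: outputs hidden states of shape (batch size, sequence length, hidden size).
    base_model = TFAutoModel.from_pretrained(checkpoint)
    print(base_model(inputs).last_hidden_state.shape)

    # Body + classification head: one logit per sentence and per label.
    model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
    logits = model(inputs).logits
    print(logits)  # raw scores, do not sum to 1

    # Post-processing: softmax turns logits into probabilities,
    # id2label maps each index to its label name.
    probabilities = tf.math.softmax(logits, axis=-1)
    print(probabilities)
    print(model.config.id2label)  # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'} for this checkpoint

Under these assumptions, the value at index 1 of each row is the probability of the positive label, which is the score the pipeline reports when it picks that label.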
