007 Term Frequency Inverse Document Frequency (TF-IDF)

Now, of course, indices aren't quite that simple. An index is actually what's called an inverted index, and this is basically the mechanism by which pretty much all search engines work.

As an example, imagine I have a couple of documents in my index that contain text data. Let's say I have one document that contains "Space, the final frontier, these are the voyages", and maybe I have another document that says "He's bad, he's number one, he's a space cowboy with a laser gun". If you understand what both of those are references to, then you and I have a lot in common.

Now, an inverted index wouldn't store those strings directly. Instead, it sort of flips things on their head. A search engine such as Elasticsearch actually splits each document up into its individual search terms. In this example, we'll just split on each word and convert everything to lowercase to normalize things. Then it maps each search term to the documents that term occurs within.
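The splitting and mapping steps just described can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements it. The function name `build_inverted_index`, the document IDs 1 and 2, and the simple regular-expression tokenizer are all assumptions made for the example. The document text is typed as transcribed, so the result for any other term depends on that exact wording.

```python
import re
from collections import defaultdict

# The two example documents, keyed by a made-up document ID.
documents = {
    1: "Space, the final frontier, these are the voyages",
    2: "He's bad, he's number one, he's a space cowboy with a laser gun",
}

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Lowercase to normalize, then pull out runs of letters, digits and apostrophes.
        for term in re.findall(r"[a-z0-9']+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

inverted_index = build_inverted_index(documents)
print(inverted_index["space"])  # [1, 2]
print(inverted_index["final"])  # [1]
```

Looking up a term is then a single dictionary access, which is why this structure suits search: the engine never has to scan the document text at query time.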
So in this example, the word "space" occurs in both documents, meaning the inverted index would indicate that "space" occurs in documents one and two. The word "the" also appears in both documents, so that will also map to documents one and two. The word "final" only appears in the first document, so the inverted index would map "final" as a search term to document one.

Now, it's a little more complicated than that in practice. In reality it stores not only which documents a term is in, but also the position within each document where it occurs. But at a high conceptual level, this is the basic idea. An inverted index is what you're actually getting with a search index: it maps the things you're searching for to the documents those things live within. And of course it's not even quite that simple.

So how do I actually deal with the concept of relevance? Take, for example, the word "the". How do I deal with that? The word "the" is going to be a very common word in every single document, so how do I make sure that only documents where "the" is a special word are the ones I get back if I actually search for "the"? Well, that's where TF-IDF comes in. That stands for term frequency times inverse document frequency. It's a very fancy-sounding term, but it's actually a very simple concept. So let's break it down.
Term frequency is just how often a given search term appears within a given document. So if the word "space" occurs very frequently in a given document, it would have a high term frequency. The same applies to the word "the": if it appears frequently in the document, it would also have a high term frequency.

Now, document frequency is just how often a term appears across all of the documents in your entire index. Here's where things get interesting. The word "space" probably doesn't occur very often across the entire index, so it would have a low document frequency. However, the word "the" does appear in all documents pretty frequently, so it would have a very high document frequency.

So if we divide term frequency by document frequency, which is mathematically the same as multiplying by the inverse document frequency, we get a measure of relevance. We see how special this term is to the document. It measures not only how often the term occurs within the document, but how that compares to how often the term occurs in documents across the entire index.

So with that example, the word "space" in an article about space would rank very highly. However, the word "the" wouldn't necessarily rank very highly at all, since it's a common term found in every other document as well. And this is the basic idea of how search engines work.
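To make the division concrete, here is a minimal sketch of term frequency divided by document frequency. The three-document corpus and the function names are invented for illustration. Document frequency is assumed here to mean the number of documents that contain the term, which is one common reading of "how often a term appears across the index". Real engines typically dampen these raw counts, for example with a logarithm, so treat this as the bare concept only.

```python
# A made-up corpus of three short documents, already split into lowercase terms.
corpus = [
    "space is the final frontier and space is vast".split(),
    "the cat sat on the mat".split(),
    "the stock market fell on the news".split(),
]

def term_frequency(term, tokens):
    """How many times the term occurs in one document."""
    return tokens.count(term)

def document_frequency(term, all_docs):
    """How many documents in the corpus contain the term (assumed definition)."""
    return sum(1 for tokens in all_docs if term in tokens)

def relevance(term, tokens, all_docs):
    """Term frequency divided by document frequency; 0.0 for unseen terms."""
    df = document_frequency(term, all_docs)
    if df == 0:
        return 0.0
    return term_frequency(term, tokens) / df

article = corpus[0]
print(relevance("space", article, corpus))  # 2.0
print(relevance("the", article, corpus))    # roughly 0.33
```

In the first document "space" occurs twice and appears in only one document, so it scores 2 / 1. "The" occurs once there but appears in all three documents, so it scores 1 / 3, which matches the intuition that a common word says little about any particular document.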
If you're searching for a given term, the search engine will try to give you back results in order of their relevance. Relevance is based, at least loosely, on the concept of TF-IDF. It's not really that complicated.
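Returning results in order of relevance can be sketched as scoring every document for the query term and sorting by that score. The `search` function, the document IDs, and the sample texts below are hypothetical. The score is the plain term frequency divided by document frequency, not the formula any particular engine uses.

```python
# Made-up documents keyed by made-up IDs.
docs = {
    "a": "space probes explore deep space beyond the planets",
    "b": "the station orbits in space",
    "c": "the recipe needs the flour and the sugar",
}

def search(term, documents):
    """Return (doc_id, score) pairs for matching documents, most relevant first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    # Document frequency: number of documents containing the term (assumed definition).
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    if df == 0:
        return []
    scored = [(doc_id, tokens.count(term) / df) for doc_id, tokens in tokenized.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

print(search("space", docs))  # [('a', 1.0), ('b', 0.5)]
```

Document "a" mentions "space" twice and "b" once, so "a" comes back first. Document "c" never mentions it and is left out of the results entirely.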