1
00:00:06,810 --> 00:00:09,980
Now, of course, indices aren't quite that simple.
2
00:00:10,140 --> 00:00:15,060
An index is actually what's called an inverted index and this is basically the mechanism by which pretty
3
00:00:15,060 --> 00:00:17,310
much all search engines work.
4
00:00:17,520 --> 00:00:21,970
As an example, imagine I have a couple of documents in my index that contain text data.
5
00:00:22,650 --> 00:00:26,800
Let's say I have one document that contains: Space the final frontier,
6
00:00:26,820 --> 00:00:31,410
these are the voyages, and maybe I have another document that says: he's bad,
7
00:00:31,440 --> 00:00:35,190
he's number one, he's a space cowboy with a laser gun,
8
00:00:35,190 --> 00:00:40,500
and if you understand what both of those are references to, then you and I have a lot in common. Now an
9
00:00:40,500 --> 00:00:43,100
inverted index wouldn't store those strings directly,
10
00:00:43,140 --> 00:00:48,690
instead, it sort of flips it on its head. A search engine, such as Elasticsearch, actually splits each
11
00:00:48,690 --> 00:00:51,770
document up into its individual search terms,
12
00:00:51,870 --> 00:00:56,520
and in this example, we'll just split it up for each word and we'll convert them to lowercase just to
13
00:00:56,520 --> 00:00:58,610
normalize things.
14
00:00:58,620 --> 00:01:03,900
Then what it does is map each search term to the documents that those search terms occur within.
15
00:01:03,900 --> 00:01:09,960
So in this example, the word space actually occurs in both documents, meaning the inverted index would
16
00:01:09,960 --> 00:01:15,090
indicate that the word space occurs in both documents one and two, the word
17
00:01:15,100 --> 00:01:17,710
the also appears in both documents,
18
00:01:17,710 --> 00:01:24,580
so that will also map to both documents one and two, and the word, final, only appears in the first document,
19
00:01:24,580 --> 00:01:29,820
so the inverted index would map the word, final, as a search term to document one.
20
00:01:29,830 --> 00:01:34,120
Now it's a little bit more complicated than that in practice and in reality it actually stores not only
21
00:01:34,120 --> 00:01:37,690
what document a term is in, but also the position within the document that it's in.
22
00:01:38,260 --> 00:01:44,440
But at a high conceptual level, this is the basic idea. An inverted index is what you're actually getting
23
00:01:44,440 --> 00:01:48,790
with a search index, where it's mapping things that you're searching for to the documents that those things
24
00:01:48,790 --> 00:01:51,850
live within, and of course it's not even quite that simple.
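To make the inverted index idea above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how Elasticsearch is actually implemented: the function names `tokenize` and `build_inverted_index` are made up for this example, the two documents are the quotes exactly as transcribed, and the tokenizer simply splits on whitespace, strips punctuation, and lowercases.

```python
from collections import defaultdict

# The two example documents from the transcript, keyed by document ID.
documents = {
    1: "Space the final frontier, these are the voyages",
    2: "He's bad, he's number one, he's a space cowboy with a laser gun",
}

PUNCTUATION = ".,:;!?"


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    stripped = (word.strip(PUNCTUATION) for word in text.split())
    return [word.lower() for word in stripped if word]


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


index = build_inverted_index(documents)

print(sorted(index["space"]))  # [1, 2] -> "space" occurs in both documents
print(index["final"])          # {1: [2]} -> only document 1, at word position 2
```

Looking up a term is now a single dictionary access instead of a scan over every document, and the stored positions are what make things like phrase matching possible.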
25
00:01:54,850 --> 00:01:58,530
So how do I actually deal with the concept of relevance?
26
00:01:58,600 --> 00:02:00,420
Let's take - for example - the word the,
27
00:02:00,460 --> 00:02:02,230
how do I deal with that?
28
00:02:02,230 --> 00:02:06,170
The word the is going to be a very common word in every single document,
29
00:02:06,190 --> 00:02:10,600
so how do I make sure that only documents where the is a special word are the ones that I get back,
30
00:02:10,630 --> 00:02:18,010
if I actually search for the word the? Well that's where TF-IDF comes in, that stands for term frequency
31
00:02:18,010 --> 00:02:24,590
times inverse document frequency, it's a very fancy-sounding term but it's actually a very simple concept.
32
00:02:24,790 --> 00:02:31,540
So let's break it down. Term frequency is just how often a given search term appears within a given document.
33
00:02:31,540 --> 00:02:37,420
So if the word space occurs very frequently in a given document, it would have a high term frequency.
34
00:02:37,510 --> 00:02:40,630
The same applies if the word "the" appears frequently in the document,
35
00:02:40,660 --> 00:02:46,880
it would also have a high term frequency. Now document frequency is just how often a term appears in
36
00:02:46,910 --> 00:02:49,640
all of the documents in your entire index.
37
00:02:49,640 --> 00:02:51,900
So here's where things get interesting.
38
00:02:52,010 --> 00:02:57,710
So the word space probably doesn't occur very often across the entire index, so it would have a low document
39
00:02:57,710 --> 00:02:59,150
frequency.
40
00:02:59,180 --> 00:03:02,610
However, the word "the" does appear in all documents pretty frequently,
41
00:03:02,750 --> 00:03:09,000
so it would have a very high document frequency. So if we divide term frequency by document frequency,
42
00:03:09,360 --> 00:03:12,440
which is mathematically the same as multiplying by the inverse document frequency,
43
00:03:12,480 --> 00:03:14,470
we get a measure of relevance.
44
00:03:14,490 --> 00:03:17,300
So we see how special this term is to the document.
45
00:03:17,740 --> 00:03:21,990
It measures not only how often does this term occur within the document, but how does that compare to
46
00:03:21,990 --> 00:03:26,520
how often this term occurs in documents across the entire index?
47
00:03:26,520 --> 00:03:31,650
So with that example, the word, space, in an article about space would rank very highly.
48
00:03:31,650 --> 00:03:35,390
However, the word the wouldn't necessarily rank very highly at all,
49
00:03:35,430 --> 00:03:38,330
that's a common term found in every other document as well,
50
00:03:38,460 --> 00:03:43,170
and this is the basic idea of how search engines work. If you're searching for a given term, it will try
51
00:03:43,170 --> 00:03:48,480
to give you back results in the order of their relevancy. Relevancy is loosely based, at least, on the
52
00:03:48,480 --> 00:03:50,460
concept of TF-IDF,
53
00:03:50,460 --> 00:03:52,260
it's not really that complicated.
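As a rough illustration of the TF-IDF idea described above, here is a small Python sketch that scores a term by dividing its term frequency by its document frequency. Everything in it is an assumption made for the example: the three-document corpus is invented, and the function names `term_frequency`, `document_frequency`, and `relevance` are hypothetical. Real search engines typically dampen these raw counts (for example with logarithms) rather than using a plain ratio.

```python
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
corpus = {
    "space_article": "space probes explore space and the planets beyond the moon",
    "cooking_article": "the chef seasons the soup and the bread",
    "sports_article": "the team won the match in the final minute",
}


def term_frequency(term, text):
    """How many times the term occurs within one document."""
    return Counter(text.lower().split())[term]


def document_frequency(term, docs):
    """How many documents in the corpus contain the term at least once."""
    return sum(1 for text in docs.values() if term in text.lower().split())


def relevance(term, doc_id, docs):
    """Term frequency divided by document frequency (a simplified TF-IDF)."""
    df = document_frequency(term, docs)
    if df == 0:
        return 0.0  # the term appears nowhere in the corpus
    return term_frequency(term, docs[doc_id]) / df


print(relevance("space", "space_article", corpus))  # 2.0 (tf=2, df=1)
print(relevance("the", "space_article", corpus))    # tf=2, df=3, so well below 2.0
```

Both words occur twice in the space article, but "the" shows up in every document, so its score is pulled down while "space" stays high.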