Nice explanation of search. Max covered inverted index, TF-IDF, stop word removal, stemming, ranking, proximity, conceptual search etc. in only 10 minutes. Well done.
@wlfbck209 жыл бұрын
>We could do whole videos about those two topics
And you should, imho. Knowing how search engines work (roughly) greatly helps in finding stuff on the internet, which at least I think is incredibly important regardless of profession (even for hobbyist stuff this is pretty important).
@M4rtingale9 жыл бұрын
"Libraries were a big place full of books you wanted to find [...]" Nice ...
@syedali12176 жыл бұрын
This guy is awesome. Simple, clear and straight to the point. Well done mate.
@mikejohnstonbob9359 жыл бұрын
6:19 How do you pre-index relative word locations for the nearness approach? If you assign bonuses based on combinations, then the number of combinations for most documents would make the index metadocument hard to read. Likewise, if the word locations are recorded in the index, the metadocument would be HUGE!
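One common way to handle this is a positional inverted index: instead of storing bonuses for every word combination, each word's entry records the positions where it occurs in each document, and nearness is worked out at query time. That costs roughly one extra integer per word occurrence, so the index grows with the size of the text rather than with the number of combinations. A minimal sketch, assuming whitespace tokenisation and a toy corpus; the names `build_positional_index` and `min_distance` are made up for illustration and are not from the video:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def min_distance(index, doc_id, a, b):
    """Smallest gap between any occurrence of a and any occurrence of b
    in one document, or None if either word is missing from it."""
    pa = index.get(a, {}).get(doc_id, [])
    pb = index.get(b, {}).get(doc_id, [])
    if not pa or not pb:
        return None
    return min(abs(x - y) for x in pa for y in pb)

# Toy corpus (hypothetical data).
docs = {
    1: "my horse is amazing",
    2: "my lovely horse runs through my field",
}
index = build_positional_index(docs)
print(min_distance(index, 1, "my", "horse"))  # 1: the words are adjacent
print(min_distance(index, 2, "my", "horse"))  # 2: one word in between
```

A ranking function could then give a larger bonus to documents where the distance is smaller, without ever storing the combinations themselves.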
@Slithy9 жыл бұрын
look at my horse, my horse is amazing. I couldn't stop thinking about this song.
@KhalilEstell9 жыл бұрын
Slithereenn You commented this before I could!
@TriantalexАй бұрын
ok?
@SlithyАй бұрын
@Triantalex bruh, 9 years, I have to rewatch the video to remember what this was all about
@Tony2dH9 жыл бұрын
I would personally love to see an in-depth video on the language models used for search engines, and more about e.g. what Google calls 'neural networks'.
@JsbWalker9 жыл бұрын
Tony2dH It's not just what Google calls them, Computer scientists call them that too.
@BrettonAuerbach Жыл бұрын
indeed time for part 2 with a machine learning vs algorithmic approaches chat
@harounhajem79729 жыл бұрын
Cool topic, and awesome video production.
@tubeworm3399 жыл бұрын
At 5:58 Dr. Wilson mentions stemming as a way to find documents based on the root word. When programming this, do large search engines use a specific set of rules based on the English language, or do they pick these things up through sequences they see often, using machine learning? Sorry if this comment didn't really make sense, I'm just trying to figure out how that would be programmed.
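For a rough idea of the rule-based route: classic stemmers such as Porter's algorithm are hand-written lists of suffix rules for one language. What any particular large search engine does today is not something this thread establishes. The sketch below is a deliberately crude stand-in, assuming just three English suffixes; the name `crude_stem` and the rule list are hypothetical, and real stemmers use many more rules and conditions:

```python
# Suffixes are tried in this order; the list is illustrative only.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("jumping", "jumped", "jumps", "horses", "running"):
    print(w, "->", crude_stem(w))
# jumping -> jump, jumped -> jump, jumps -> jump, horses -> horse,
# running -> runn (a fuller rule set would also undo the doubled consonant)
```

The last case shows why real rule sets are longer: each extra rule patches a class of words the simpler rules get wrong.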
@minihjalte9 жыл бұрын
Interesting video. Thanks for creating it.
@veggiet20099 жыл бұрын
8:40 The first problem that came to mind in your, granted simplistic, explanation is that a document with "my field" 40 times would rank way higher than a page with "my pony" 6 times. What I am curious about is how the index still factors in. What I understand is that the index indexes all words in order to improve search speed, but to make the search better you have to look for clumps of words. Do you have to reindex every page with two-word groups and three-word groups? That seems inefficient. "my - 6, horse - 3, my horse - 2, horse is - 2, my horse is - 1" etc... In other words, how do you catalog the relationship between words?
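On the "my field" 40 times worry: the TF-IDF weighting covered in the video is one way to blunt it, because a word that appears in nearly every document gets a weight near zero however often it is repeated. A minimal sketch, assuming raw counts for term frequency and log(N / df) for inverse document frequency (one of several common variants); the function name `tf_idf_scores` and the corpus are hypothetical:

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document as the sum over query words of tf * idf."""
    tokenised = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenised)
    scores = {}
    for doc_id, words in tokenised.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for w in tokenised.values() if term in w)
            if df == 0:
                continue
            idf = math.log(n_docs / df)  # 0 when the term is in every document
            score += counts[term] * idf
        scores[doc_id] = score
    return scores

docs = {
    "a": "my field my field my field my field",
    "b": "my horse is in my field",
    "c": "my lovely horse",
}
scores = tf_idf_scores("my horse", docs)
print(scores["a"])  # 0.0: "my" is in every document, so repeating it earns nothing
print(scores["b"] > scores["a"])  # True: "horse" is rarer and carries the score
```

This only handles individual words; the relationship between words is a separate matter of recording positions.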
@veggiet20099 жыл бұрын
***** yes, but I'm wondering about the speed of these methods in conjunction with basic index. You could pull articles through the index based on the words alone, and then use your other algorithms to sort the resulting list. but that seems inefficient similar to the pre index search. i.e. going through each document and counting which words are close together based on a third resource which has all the keywords logically organised based on concepts or what have you.
@markderosa9 жыл бұрын
Where can I find a copy of that 50 lines of Python that he mentions at 4:20?
@soviut9 жыл бұрын
Great video. I'd love to see some followups on the probabilistic and language approaches.
@Robertlavigne19 жыл бұрын
Awesome video!! I would love to have more Computerphile videos on Semantic Web related topics. I am doing a research project on ontology alignment and mapping at the moment and the topics of this video were very relevant to what I have been looking at. Thanks for making this!!
@rich10514149 жыл бұрын
Google doesn't really have a 'secret' formula. It is a product of having a LOT of information which has been applied towards their optimizations. PageRank perhaps can be seen as the magic formula which allowed them to be good enough to get to where they are today, but that is only a small piece of the puzzle. The real star of the show is relevant equality of search terms, which requires a lot of data to achieve accurately. Beyond what is discussed in this video, with how each word is given an importance value, each word is also put into a 'group of equality'. Any group of equality for search terms is the combination of a variety of things we would likely consider fundamentally different, but in respect to what is important to a search, that fundamental difference is worthless; what is valuable is whether it is relevant to the search. When someone searches for boats, for instance, the engine may find that returning results for fishing rods gives positive relevancy, so boats and fishing rods could then be seen as 80% the same thing. So when someone searches for boats, results for fishing rods could be returned, but with 80% of the importance factor given for boats, and return better results. This leaves the 'relevancy synonyms' up to the engine to assign autonomously in the most statistically optimized way. In this, Google has a search algorithm which is exponentially more efficient than anything humans could write themselves, because of how it automatically self-optimizes its results without human intervention, with no care for what 'should' or 'shouldn't' be technically categorized together. Beyond the initial sorting of a site into its group of completely different but relevantly equal things, it will then rate the site on how good it is at being loyal to its predicted relevancy, by tracking if people found what they needed there, or immediately jumped back to their search to try again.
If a site is deemed, say, 30% relevant, but in practice is actually _more_ or less, it is a sign that the site or search terms are poorly defined, and the rating is improved. More would improve the specificity of its relevancy (or add a relevancy synonym); less would simply push it further down the list until people stop wasting their time clicking on it. This makes exploiting it a challenging, if not impossible, task for spam sites to maintain, because those results will lead to a poor relevancy rating, making them less and less likely to be anywhere near the top of any string of text you search for without some very specific searching, in which case, you were likely looking for it. Edit: I see at 8:00 you went over latent semantic analysis, so never mind xD I should remember to finish watching a video before commenting. Ah well.
@mustafaadam96979 жыл бұрын
Richard Smith Still, your comment was more interesting and informative than 99.9% of the YT comments. As a beginner in the world of data and machine learning, I enjoyed reading it very much ^_^
@stok3si39 жыл бұрын
You missed a trick here by not having the "like" mug in the background show a link when you hover over it that takes you to the computerphile facebook page.
@eSZett_9 жыл бұрын
I've got a question. How does one search through an index quickly to assign these scores? Do you sort it roughly somehow and just assign scores to the first few elements? It seems like when the index gets large, that would become the speed limiter.
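A rough answer, under the assumption that the index is stored as a hash map from word to the set of documents containing it: looking up a word does not scan the index at all, and only the documents in the query words' lists are ever scored. The sketch below is illustrative only; `build_index` and `candidates` are hypothetical names, and production engines add compression, sharding and much more:

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def candidates(index, query):
    """Documents containing every query word; all others are never visited."""
    postings = [index.get(w, set()) for w in query.lower().split()]
    if not postings:
        return set()
    # Start from the shortest list so the intersection stays small.
    postings.sort(key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

docs = {1: "my horse", 2: "my field", 3: "a lovely horse"}
index = build_index(docs)
print(candidates(index, "my horse"))  # {1}
print(candidates(index, "pony"))      # set(): a word missing from the index matches nothing
```

Scoring then runs over the candidate set only, which is usually a tiny fraction of the whole collection.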
@samuelvidal34379 жыл бұрын
What about the stationary distribution of Markov chain, page rank ?
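For reference, PageRank can indeed be read as the stationary distribution of a "random surfer" Markov chain: with probability d the surfer follows a random outgoing link, otherwise they jump to a random page. A minimal power-iteration sketch, assuming the commonly quoted damping factor d = 0.85, a tiny hypothetical link graph, and that every linked-to page also appears as a key in `links`:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on the random-surfer Markov chain.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # "c" comes first: it is linked from both "a" and "b"
```

The ranks always sum to 1, so each value can be read as the long-run fraction of time the surfer spends on that page.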
@rngwrldngnr9 жыл бұрын
What about measuring correlation between words with predictability? Like, if you have horse, there's a 20% chance the result also contains pet, and if you have pet, there's a 6% chance the document mentions horse. I don't think you could explicitly group words, because it's non-abelian, but you could have some kind of minimum threshold of probable connection that was required for the word to be considered a real associate.
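The asymmetry described here (the relation between two words is not symmetric) falls straight out of conditional probabilities estimated from document co-occurrence. A small sketch, assuming each document counts a word at most once and using made-up data; `cond_prob` is a hypothetical helper, not something from the video:

```python
def cond_prob(docs, given, other):
    """Estimate P(other appears in a document | given appears in it)."""
    word_sets = [set(text.lower().split()) for text in docs]
    with_given = [s for s in word_sets if given in s]
    if not with_given:
        return 0.0
    return sum(1 for s in with_given if other in s) / len(with_given)

docs = [
    "my horse is my pet",
    "my pet cat",
    "my pet dog",
    "pet food",
    "horse racing",
]
print(cond_prob(docs, "horse", "pet"))  # 0.5: one of the two horse documents mentions pet
print(cond_prob(docs, "pet", "horse"))  # 0.25: one of the four pet documents mentions horse
```

A minimum threshold like the one suggested above would then be a simple comparison against these values, applied separately in each direction.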
@FoxDren9 жыл бұрын
rngwrldngnr if you watched the whole video you'd hear him say that it is much more complex than that and measures probability
@woobmonkey9 жыл бұрын
***** What would, IMHO, be a pertinent side-topic to explore is: by what algorithms do search engines decide which results are relevant to you, personally. There seems to be a somewhat disturbing trend toward an echo-chamber effect; two people, using the exact same search terms, are likely to find variant results in what pages they're shown, as well as the order in which they appear. It makes it difficult, or at least more difficult than in a brick-and-mortar library, to find contrary points of view and/or conflicting information on a given topic. For anyone interested in elevating discussion on fora such as KZbin comments, this may well be more than a trivial matter.
@SyntheticFuture9 жыл бұрын
And all that in mere seconds... it's an amazing world we take for granted.
@JonHurlock9 жыл бұрын
Nice little intro to TF-IDF Max :)
@foxdash9 жыл бұрын
I remember that the big thing about Google was how fast it was compared to others. Now it seems that any increase in the speed of a search engine is pretty trivial when they all return results very fast. I guess Google still had the edge on returning the most relevant results though.
@JimmyWirsborg9 жыл бұрын
Very basic stuff but awesomely explained =)
@LazyMasterGamer9 жыл бұрын
I've got the same cup as the "Coffee" like button one except it's written "Tea" :p
@TheNefari9 жыл бұрын
So you always need an index ... Then what do you do if your word is not in the index ?
@o0julek0o9 жыл бұрын
That's when Google says there's nothing to be found.
@Nilguiri9 жыл бұрын
TheNefari If the word is out there but not in the index, then the index needs updating. In the meantime it will tell you that it's not found.
@beeflon9 жыл бұрын
TheNefari Look who I found here. Didn't expect to see you somewhere in the comment section.
@aakksshhaayy9 жыл бұрын
Albert Hofmann were you guys lovers or something?
@beeflon9 жыл бұрын
aakksshhaayy Nah, he has a small channel and I saw some vid. a while ago. Was just surprised.
@musicalsimon9 жыл бұрын
More videos on this topic please! I'm curious to know in what ways google search is superior to other search engines
@RageForSeven9 жыл бұрын
the oscillating chair is really interesting...
@hrnekbezucha9 жыл бұрын
Are you trying out the technique each time you say _pony_ in a video, it doubles the view count?
@NyanSten9 жыл бұрын
When you do stemming, you need to run the words through a dictionary. In that case, you can also read whether it is an adjective or determiner and you can treat all adjectives and determiners as having the same distance from their word so that ‘my horse’ and ‘my lovely horse’ (and ‘horse of mine’) would be treated as equally relevant.
@SuperdoggyMusic9 жыл бұрын
At 8:34 I was almost expecting him to mention google bombs. :P
@bkky99 жыл бұрын
Has Google indexed every file on the internet? How does it have space for that?
@Nilguiri9 жыл бұрын
bkky9 They have a joke size hard disk on their PC.
@xponen9 жыл бұрын
bkky9 they have supercomputers
@Cr42yguy9 жыл бұрын
he seems to really like bouncing up and down THE WHOLE VIDEO!
@brandonthesteele9 жыл бұрын
I think we all have our tics. I used to literally tremble when giving explanations on things I was interested in and studying.
@Cr42yguy9 жыл бұрын
I am totally aware of that fact. Nonetheless, it was very annoying once I noticed it.
@JeaneAdix9 жыл бұрын
+Cr42yguy I have the same problem, it bothers a lot of people. They always get annoyed, some people tell me to stop bouncing my legs, but I can't help it. The moment you stop forcing yourself to remain still it starts to happen again.
@chameleonedm9 жыл бұрын
+Cr42yguy You should see him in a lecture xD
@RandomNUser9 жыл бұрын
+Cr42yguy Noticed there are two coffee cups? That might be a good reason for bouncing, as well as a good amount of interest in the topic.
@privettotheworld9 жыл бұрын
i like the "banana" screensaver going on in the background
@captainnintendo9 жыл бұрын
Came for the topic, stayed for the ponies :D
@Niki_00019 жыл бұрын
I wonder how search engines deal with languages that conjugate words a lot, like Finnish? There are dozens if not hundreds of ways to conjugate a single noun, which can affect or be affected by how you conjugate other words in the sentence.
@hnnnnnghhh9 жыл бұрын
Rented Mule By "word stemming", conjugations and mutations of words are cut off and stored as the "stem" or root of the word. (Running, Ran, Runs)->Run. Each language has their own unique stemming rules that search engines can use.
@Niki_00019 жыл бұрын
hnnnnnghhh I guess it makes sense that search engines would have access to dictionaries. I did a little poking on Google and found a research paper that claims that search engines like Google, Yahoo and Bing don't perform very well with non-English languages. Granted, the paper also says that there are smaller, localized search engines that perform well on morphologically complex languages.
@ValleysOfRain9 жыл бұрын
I take it that web crawlers will be mentioned in the next video?
@Destro70009 жыл бұрын
My lovely horse, running through the field Where are you going, with your fetlocks blowing in the wind?
@Reavenk9 жыл бұрын
He does an excellent job explaining stuff, but sometimes he just mumbles to where I can't understand him.
@peterr62056 жыл бұрын
I disagree. He does tend to mumble stuff, but he doesn't do a great job explaining stuff. I know tfidf quite well, but even I found his explanation to be quite weak.
@michael1026h19 жыл бұрын
Couple of questions: Why wouldn't a document with nothing but the word "horse" listed a thousand times show up high in the rankings? Also, if this is how indexes work, how does Google search for strings with quotes? IE: "My horse" wouldn't show documents with "my lovely horse".
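On the quoted-phrase question: one plausible mechanism, assuming the index records word positions, is to require the words to sit at consecutive positions, which rules out "my lovely horse" for the query "my horse". How Google actually implements quoted search is not stated in the video or this thread; the sketch below works on a single document and its names (`positions_by_word`, `contains_phrase`) are hypothetical:

```python
from collections import defaultdict

def positions_by_word(text):
    """word -> list of positions of that word in one document."""
    positions = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        positions[word].append(pos)
    return positions

def contains_phrase(text, phrase):
    """True if the words of phrase occur consecutively and in order."""
    words = phrase.lower().split()
    if not words:
        return False
    positions = positions_by_word(text)
    # Position sets for the second, third, ... words of the phrase.
    following = [set(positions.get(w, ())) for w in words[1:]]
    return any(
        all(start + i + 1 in s for i, s in enumerate(following))
        for start in positions.get(words[0], ())
    )

print(contains_phrase("look at my horse", "my horse"))  # True
print(contains_phrase("my lovely horse", "my horse"))   # False: "lovely" sits in between
```

The same position lists serve both exact phrases and looser nearness scoring, so no separate two-word or three-word index is needed.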
@Jorissoris9 жыл бұрын
50 lines of PYTHON are not really fast, you say? Well, that's what you get for using Python.
@DariushMJ9 жыл бұрын
***** True, but the biggest problem in this case is the algorithm, not the language. Changing the language may make it run twice as fast, while changing the algorithm may make it run a billion times faster when there is a lot of data.
@bookdream9 жыл бұрын
Dariush MJ Exactly, you can use the fastest language on the fastest machine, but if it's an inefficient algorithm it could take a ridiculously long time in comparison to a faster algorithm using Python.
@BGBTech9 жыл бұрын
Dariush MJ Python in general is sort of a bane to programming though, producing lots of very slow and unreliable code, with performance often a bit worse even vs a lot of other scripting languages (such as Lua or JS). A lot of times with Python code though, it is the combined problem of both a slow language and poorly written code. If speed is relevant, a person is probably better off using C or C++ or similar.
@Folopolis9 жыл бұрын
***** The problem is that given a long document with enough unique words, you could be running through a loop hundreds of thousands of times, that's not going to be fast in any language. Efficient algorithm production is as much of an art as a science. This is why Google only has 3 or 4 competitors that are even trying any more.
@BGBTech9 жыл бұрын
Alexandru Gheorghe I know about algorithmic complexity, but Python is often around 40x-100x slower than C, if your code actually *does* anything (vs just calling into library functions or doing database queries or similar). If the same algorithm is used in either language, that speed difference may amount to a fairly big difference overall. C in no way prevents using O(log n) or O(1) algorithms, and optimizing algorithms is still a pretty big deal in C land as well.
@Seppes949 жыл бұрын
Look at my horse, my horse is amazing....
@TriantalexАй бұрын
ok?
@Seppes94Ай бұрын
@Triantalex I'm sure there is a point in this video where this weebl reference makes sense. Might rewatch later. It's been 9 years.
@Nulono9 жыл бұрын
6:48 11.5?
@fadouarasmouki7259 жыл бұрын
Dear Computerphile, can you please add English subtitles for non-native speakers. Thank you.
@Adamantium90019 жыл бұрын
So Google just has a MASSIVE index which shows how many times EVERY POSSIBLE WORD occurs in EVERY SINGLE WEB PAGE in their search space?
@DataCab1e9 жыл бұрын
The second episode of Star Trek TNG appears horribly dated to anyone who's used Google, because Data's search for an incident in which someone had showered in his or her clothing was treated as an un-indexed paper document search, assisted only by the android's ability to read every file relatively quickly.
@dupirechristophe77035 жыл бұрын
What we search here is "my horse" as a block, and not two separate words, but let's do some maths here it will surely resolve the problem x'D
@CraftySalvager9 жыл бұрын
That's a lot of pre-computation. Pre-computation that you might only use 10% of the final result.
@thetommantom9 жыл бұрын
these remind me of fractals, and then making it 3d or 4d connecting them
@danidanae69059 жыл бұрын
Hi your info is really interesting🙉
@owhs9 жыл бұрын
was he sat on an exercise ball?
@SparkysBarelyMusic9 жыл бұрын
I once wanted to find out how the Japanese calendar worked, i.e. 2015 = 27 in Heisei. Anyway, I googled "japanese dates". Moral of the story: do not google Japanese dates.
@ScornMuffins9 жыл бұрын
Jeez, will someone just get this guy a horse already!?
@Tharkz9 жыл бұрын
OK, 1/3 through and I just can't hold it in any longer... You're wearing sunglasses indoors with the blinds down and closed, why? :-)
@chappie__4 жыл бұрын
You should rename the video "How search engine indexing works"... So that your video gets a higher index lol
@iyaanazeez89895 жыл бұрын
Quick question, Will i become each time i watch a computerphile video? Agree oR Not
@Mad_Elf_09 жыл бұрын
Yeah... all this 'intelligence' that search engine providers are putting into their products is really neat, but when over 75% of the searches you make as part of your job require looking for *exact* words or *exact* phrases, and the search engines 'intelligently' turn "process halted with error" into "process stopping by mistake", *even* if you use double quotes, it starts getting **REALLY** **ANNOYING**. I really wish Google would add an "I mean this literally" option to their search options.
@SimbaKing79 жыл бұрын
more!
@michaelkruger44219 жыл бұрын
And the obvious thing to do next is Google "my horse"
@4pThorpy9 жыл бұрын
what a fidget!
@khaledtareq14728 жыл бұрын
He said we can do this in 50 lines of Python. Please, I want this 50-line code.
@lafeo00776 жыл бұрын
Could you go into my complexity?
@VladVladislav7909 жыл бұрын
I miss Sixty Symbols :(
@thetommantom9 жыл бұрын
or trees
@DJDavid989 жыл бұрын
"my pony" I c what u did thar
@ArnoldsKtm9 жыл бұрын
What's with this guy and his horses? :D
@BillyBob-ik4pn9 жыл бұрын
7:42 My Little Pony... Half Life 3 confirmed!
@zebraforceone8 жыл бұрын
(blazin saddles) HORSES??!?!??!?!??!
@Flagen5799 жыл бұрын
BANANA
@trefod9 жыл бұрын
I lost my concentration a couple of times because the presenter kept bobbing up and down. It is a subtle but effective way of making me lose my calm because I can't reach out to steady him.
@Zishy9 жыл бұрын
are you riding a horse?
@goeiecool99999 жыл бұрын
This guy sounds super tired lol
@kosojmshj55642 ай бұрын
Nobody added anything
@arminhrnjic87069 жыл бұрын
Like if you googled "my horse"
@rdoetjes9 жыл бұрын
Very interesting subject, but as a director I was getting so annoyed by the guy trembling up and down in his chair, as if he was wiggling his feet from nerves. It really took me out of the story.
@aliaydogdu58109 жыл бұрын
domato
@7177YT5 жыл бұрын
Cute, he explains 'what libraries were' for the average millennial barbarian. lol
@grimreefer43669 жыл бұрын
It's time I sling the baskets off this overburdened HORSE Sink MY toes into the ground and set a different course Cause if I were here and you were there I'd meet you in between And not until MY dying day, confess what I have seen
@KhalilEstell9 жыл бұрын
He is very bouncy.
@poteb9 жыл бұрын
Great explanation, but please stop jumping in your chair, I'm getting a bit of motion sickness.
@ariebrons79769 жыл бұрын
first
@ariebrons79769 жыл бұрын
arie brons 9th you idiot
@ariebrons79769 жыл бұрын
arie brons buy a mirror and then we 'll see who is the idiot here
@ariebrons79769 жыл бұрын
arie brons guys, guys calm down we don't need to fight i mean we are all human, in fact we are all the same person
@ariebrons79769 жыл бұрын
arie brons what do you mean, same person
@ariebrons79769 жыл бұрын
arie brons i mean that we are literally just letters expressing the opinion of some dude with a weird hobby