How Search Engines Treat Data - Computerphile

132,836 views

Computerphile


Comments: 128
@UsamaNada 9 years ago
Nice Explanation for Search. Max covered Inverted Index, TF-IDF, stop word removal, stemming, Ranking, Proximity, Conceptual Search etc. in 10 minutes only. Well Done.
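A minimal sketch of how several of the pieces listed in this comment (inverted index, stop word removal, TF-IDF ranking) can fit together. This is not the code from the video; the stop-word list, the document IDs, and the function names `tokenize`, `build_index`, and `search` are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not a real engine's list).
STOP_WORDS = {"is", "the", "a", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Inverted index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term not in the index: it contributes nothing
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "my horse is amazing",
    "d2": "my lovely horse running in the field",
    "d3": "my field is green",
}
index = build_index(docs)
results = search(index, len(docs), "my horse")
```

Because "my" occurs in every document, its IDF is zero here and only "horse" separates the documents. Only documents that appear in the posting lists of the query terms are ever scored, which is why the lookup stays fast as the collection grows.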
@wlfbck20 9 years ago
"We could do whole videos about those two topics" And you should, IMHO. Knowing how search engines work (roughly) greatly helps in finding stuff on the internet, which, at least I think, is incredibly important regardless of profession (even for hobbyist stuff this is pretty important).
@M4rtingale 9 years ago
"Libraries were a big place full of books you wanted to find [...]" Nice ...
@syedali1217 6 years ago
This guy is awesome. Simple, clear and straight to the point. Well done mate.
@mikejohnstonbob935 9 years ago
6:19 How do you pre-index the relative word locations for the nearness approach? If you assign bonuses based on combinations, then the number of combinations for most documents would make the index metadocument hard to read. Likewise, if the word locations are recorded in the index, the metadocument would be HUGE!
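One common answer to this question is a positional index: instead of storing word combinations, each term's entry records the positions at which it occurs, and nearness is computed at query time. The sketch below assumes whitespace tokenization and uses made-up names (`build_positional_index`, `min_distance`); it does make the index larger, as the comment suspects, and real engines typically compress the position lists.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def min_distance(index, doc_id, term_a, term_b):
    """Smallest gap (in words) between the two terms in a document, or None."""
    pos_a = index.get(term_a, {}).get(doc_id, [])
    pos_b = index.get(term_b, {}).get(doc_id, [])
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

docs = {
    "d1": "my horse is amazing",
    "d2": "my lovely horse",
    "d3": "my field is next to the horse",
}
pos_index = build_positional_index(docs)
distances = {d: min_distance(pos_index, d, "my", "horse") for d in docs}
```

A ranking function could then add a bonus that shrinks as the distance grows, so "my horse" beats "my lovely horse", which in turn beats a page where the two words are far apart.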
@Slithy 9 years ago
look at my horse, my horse is amazing. I couldn't stop thinking about this song.
@KhalilEstell 9 years ago
Slithereenn You commented this before I could!
@Triantalex 1 month ago
ok?
@Slithy 1 month ago
@Triantalex bruh, 9 years, I have to rewatch the video to remember what this was all about
@Tony2dH 9 years ago
I would personally love to see an in-depth video on the language models used for search engines, and more about e.g. what Google calls 'neural networks'.
@JsbWalker 9 years ago
Tony2dH It's not just what Google calls them, Computer scientists call them that too.
@BrettonAuerbach 1 year ago
indeed time for part 2 with a machine learning vs algorithmic approaches chat
@harounhajem7972 9 years ago
Cool topic, and awesome video production.
@tubeworm339 9 years ago
At 5:58 Dr. Wilson mentions stemming as a way to find documents based on the root word. When programming this, do large search engines use a specific set of rules based on the English language, or do they pick these things up from sequences they see often, using machine learning? Sorry if this comment didn't really make sense, I'm just trying to figure out how that would be programmed.
@minihjalte 9 years ago
Interesting video. Thanks for creating it.
@veggiet2009 9 years ago
8:40 The first problem that came to mind with your (granted, simplistic) explanation is that a document with "my field" 40 times would rank way higher than a page with "my pony" 6 times. What I am curious about is how the index still factors in. As I understand it, the index covers all words in order to improve search speed, but to make the search better you have to look for clumps of words. Do you have to reindex every page with two-word groups and three-word groups? That seems inefficient: "my - 6, horse - 3, my horse - 2, horse is - 2, my horse is - 1", etc. In other words, how do you catalog the relationship between words?
@veggiet2009 9 years ago
Yes, but I'm wondering about the speed of these methods in conjunction with the basic index. You could pull articles through the index based on the words alone, and then use your other algorithms to sort the resulting list. But that seems inefficient, similar to the pre-index search, i.e. going through each document and counting which words are close together based on a third resource which has all the keywords logically organised by concept or what have you.
@markderosa 9 years ago
Where can I find a copy of that 50 lines of Python that he mentions at 4:20?
@soviut 9 years ago
Great video. I'd love to see some followups on the probabilistic and language approaches.
@Robertlavigne1 9 years ago
Awesome video!! I would love to have more Computerphile videos on Semantic Web related topics. I am doing a research project on ontology alignment and mapping at the moment, and the topics of this video were very relevant to what I have been looking at. Thanks for making this!!
@rich1051414 9 years ago
Google isn't really a 'secret' formula. It is the product of having a LOT of information that has been applied to their optimizations. PageRank can perhaps be seen as the magic formula that made them good enough to get where they are today, but that is only a small piece of the puzzle. The real star of the show is relevance equality of search terms, which requires a lot of data to achieve accurately.
Beyond what is discussed in this video, where each word is given an importance value, each word is also put into a 'group of equality'. A group of equality for search terms combines a variety of things we would likely consider fundamentally different, but for a search that fundamental difference is worthless; what matters is whether something is relevant to the search. A search for boats, for instance, may reveal that returning results for fishing rods gets positive relevancy, so boats and fishing rods could then be seen as 80% the same thing. So when someone searches for boats, results for fishing rods could be returned with 80% of the importance factor given to boats, giving better results. This leaves the 'relevancy synonyms' up to the engine to assign autonomously in the most statistically optimized way. In this way, Google has a search algorithm that is exponentially more efficient than anything humans could write themselves, because it automatically optimizes its own results without human intervention, with no care for what 'should' or 'shouldn't' technically be categorized together.
Beyond the initial sorting of a site into its group of completely different but relevantly equal things, it then rates the site on how loyal it is to its predicted relevancy, by tracking whether people found what they needed there or immediately jumped back to their search to try again. If a site is deemed, say, 30% relevant but in practice is actually _more_ or less, it is a sign that the site or search terms are poorly defined, and this is improved. More would improve the specificity of its relevancy (or add a relevancy synonym); less would simply push it further down the list until people stop wasting their time clicking on it. This makes exploiting it a challenging, if not impossible, task for spam sites to maintain, because those results will lead to a poor relevancy rating, making them less and less likely to be anywhere near the top for any string of text you search for without some very specific searching, in which case you were likely looking for it.
Edit: I see at 8:00 you went over latent semantic analysis, so never mind xD I should remember to finish watching a video before commenting. Ah well.
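The 'relevancy synonym' idea in this comment can be sketched as weighted query expansion. This is only an illustration of the commenter's boats and fishing rods example, not a description of how Google works; the `RELATED` table, its 0.8 weight, and the function names are hypothetical, and a real system would learn such weights from data rather than hard-code them.

```python
# Hypothetical related-term weights, taken from the comment's example:
# fishing rods count as "80% the same thing" as boats.
RELATED = {"boats": {"fishing rods": 0.8}}

def expand_query(term):
    """Return {term: weight} for the query term plus its related terms."""
    weights = {term: 1.0}
    weights.update(RELATED.get(term, {}))
    return weights

def score(doc_term_counts, query_weights):
    """Weighted sum of how often each (expanded) query term occurs in a document."""
    return sum(doc_term_counts.get(t, 0) * w for t, w in query_weights.items())

query = expand_query("boats")
boat_page = {"boats": 3}
rod_page = {"fishing rods": 2}
unrelated_page = {"horse": 5}
```

With these numbers the boat page scores 3.0, the fishing rod page still scores through its 0.8 weight, and the unrelated page scores zero.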
@mustafaadam9697 9 years ago
Richard Smith Still, your comment was more interesting and informative than 99.9% of the YT comments. As a beginner in the world of data and machine learning, I enjoyed reading it very much ^_^
@stok3si3 9 years ago
You missed a trick here by not having the "like" mug in the background show a link when you hover over it that takes you to the computerphile facebook page.
@eSZett_ 9 years ago
I've got a question. How does one search through an index quickly to assign these scores? Do you sort it roughly somehow and just assign scores to the first few elements? It seems like when the index gets large, that would become the speed limiter.
@samuelvidal3437 9 years ago
What about the stationary distribution of a Markov chain, i.e. PageRank?
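For readers unfamiliar with the reference: PageRank can be read as the stationary distribution of a random surfer who follows links with some probability and otherwise jumps to a random page. Below is a minimal power-iteration sketch under stated assumptions: the three-page link graph is made up, the damping factor 0.85 is the commonly quoted value, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}, summing to 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In this toy graph page "c" ends up with the highest rank because both other pages link to it, and "b" the lowest because it only receives half of "a"'s vote.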
@rngwrldngnr 9 years ago
What about measuring correlation between words with predictability? Like, if you have horse, there's a 20% chance the result also contains pet, and if you have pet, there's a 6% chance the document mentions horse. I don't think you could explicitly group words, because it's non-abelian, but you could have some kind of minimum threshold of probable connection that was required for the word to be considered a real associate.
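The asymmetry this comment describes falls out of estimating conditional co-occurrence probabilities from documents. A small sketch, assuming each document is reduced to a set of words; the five toy documents and the function name `cooccurrence_probability` are invented for illustration, so the percentages differ from the ones in the comment.

```python
def cooccurrence_probability(docs, given, other):
    """Estimate P(`other` appears in a document | `given` appears in it)."""
    with_given = [d for d in docs if given in d]
    if not with_given:
        return 0.0
    return sum(1 for d in with_given if other in d) / len(with_given)

# Each document is represented as the set of words it contains.
docs = [
    {"horse", "pet", "stable"},
    {"horse", "race", "jockey"},
    {"pet", "cat"},
    {"pet", "dog"},
    {"pet", "dog", "cat"},
]
p_pet_given_horse = cooccurrence_probability(docs, "horse", "pet")
p_horse_given_pet = cooccurrence_probability(docs, "pet", "horse")
```

Here half of the horse documents mention pets, but only a quarter of the pet documents mention horses, so a threshold on the association would have to be applied per direction.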
@FoxDren 9 years ago
rngwrldngnr if you watched the whole video you'd hear him say that it is much more complex than that and measures probability
@woobmonkey 9 years ago
What would, IMHO, be a pertinent side-topic to explore is: by what algorithms do search engines decide which results are relevant to you, personally? There seems to be a somewhat disturbing trend toward an echo-chamber effect; two people using the exact same search terms are likely to find different results in what pages they're shown, as well as in the order in which they appear. It makes it difficult, or at least more difficult than in a brick-and-mortar library, to find contrary points of view and/or conflicting information on a given topic. For anyone interested in elevating discussion on fora such as KZbin comments, this may well be more than a trivial matter.
@SyntheticFuture 9 years ago
And all that in mere seconds... it's an amazing world we take for granted.
@JonHurlock 9 years ago
Nice little intro to TF-IDF Max :)
@foxdash 9 years ago
I remember that the big thing about Google was how fast it was compared to others; now it seems that any increase in the speed of a search engine is pretty trivial when they all return results very fast. I guess Google still had the edge on returning the most relevant results though.
@JimmyWirsborg 9 years ago
Very basic stuff but awesomely explained =)
@LazyMasterGamer 9 years ago
I've got the same cup as the "Coffee" like button one except it's written "Tea" :p
@TheNefari 9 years ago
So you always need an index... Then what do you do if your word is not in the index?
@o0julek0o 9 years ago
That's when Google says there's nothing to be found.
@Nilguiri 9 years ago
TheNefari If the word is out there but not in the index, then the index needs updating. In the meantime it will tell you that it's not found.
@beeflon 9 years ago
TheNefari Look who I found here. Didn't expect to see you somewhere in the comment section.
@aakksshhaayy 9 years ago
Albert Hofmann were you guys lovers or something?
@beeflon 9 years ago
aakksshhaayy Nah, he has a small channel and I saw some vid. a while ago. Was just surprised.
@musicalsimon 9 years ago
More videos on this topic please! I'm curious to know in what ways google search is superior to other search engines
@RageForSeven 9 years ago
the oscillating chair is really interesting...
@hrnekbezucha 9 years ago
Are you trying out the technique where each time you say _pony_ in a video, the view count doubles?
@NyanSten 9 years ago
When you do stemming, you need to run the words through a dictionary. In that case, you can also read whether it is an adjective or determiner and you can treat all adjectives and determiners as having the same distance from their word so that ‘my horse’ and ‘my lovely horse’ (and ‘horse of mine’) would be treated as equally relevant.
@SuperdoggyMusic 9 years ago
At 8:34 I was almost expecting him to mention google bombs. :P
@bkky9 9 years ago
Has Google indexed every file on the internet? How does it have space for that?
@Nilguiri 9 years ago
bkky9 They have a joke size hard disk on their PC.
@xponen 9 years ago
bkky9 they have supercomputers
@Cr42yguy 9 years ago
he seems to really like bouncing up and down THE WHOLE VIDEO!
@brandonthesteele 9 years ago
I think we all have our tics. I used to literally tremble when giving explanations on things I was interested in and studying.
@Cr42yguy 9 years ago
I am totally aware of that fact. Nonetheless, it was very annoying once I noticed it.
@JeaneAdix 9 years ago
+Cr42yguy I have the same problem, and it bothers a lot of people. They always get annoyed; some people tell me to stop bouncing my legs, but I can't help it. The moment you stop forcing yourself to remain still, it starts to happen again.
@chameleonedm 9 years ago
+Cr42yguy You should see him in a lecture xD
@RandomNUser 9 years ago
+Cr42yguy Noticed there are two coffee cups? That might be a good reason for bouncing, as well as a good amount of interest in the topic.
@privettotheworld 9 years ago
i like the "banana" screensaver going on in the background
@captainnintendo 9 years ago
Came for the topic, stayed for the ponies :D
@Niki_0001 9 years ago
I wonder how search engines deal with languages that conjugate words a lot, like Finnish? There are dozens if not hundreds of ways to conjugate a single noun, which can affect or be affected by how you conjugate other words in the sentence.
@hnnnnnghhh 9 years ago
Rented Mule With "word stemming", conjugations and mutations of words are cut off and the word is stored as its "stem" or root: (Running, Ran, Runs) -> Run. Each language has its own unique stemming rules that search engines can use.
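A toy illustration of the rule-based approach described in this reply, using its own (Running, Ran, Runs) -> Run example. The suffix list and the irregular-form table are assumptions chosen to make that one example work; established rule-based stemmers such as the Porter stemmer follow the same principle with many more rules, and heavily inflected languages need their own rule sets.

```python
# Illustrative English-only rules (assumptions, far from complete).
SUFFIXES = ("ing", "ed", "es", "s")
IRREGULAR = {"ran": "run"}  # irregular forms need a lookup table

def stem(word):
    """Strip one common suffix so related word forms share a stem."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stemmed = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if stemmed[-1] == stemmed[-2] and stemmed[-1] not in "aeiouls":
                stemmed = stemmed[:-1]
            return stemmed
    return word

stems = {w: stem(w) for w in ["Running", "Ran", "Runs", "run"]}
```

Indexing the stem instead of the surface form means a query for "running" also finds documents that only contain "runs". Stems do not have to be real words; they only have to be consistent between indexing and querying.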
@Niki_0001 9 years ago
hnnnnnghhh I guess it makes sense that search engines would have access to dictionaries. I did a little poking on Google and found a research paper that claims that search engines like Google, Yahoo and Bing don't perform very well with non-English languages. Granted, the paper also says that there are smaller, localized search engines that perform well on morphologically complex languages.
@ValleysOfRain 9 years ago
I take it that web crawlers will be mentioned in the next video?
@Destro7000 9 years ago
My lovely horse, running through the field Where are you going, with your fetlocks blowing in the wind?
@Reavenk 9 years ago
He does an excellent job explaining stuff, but sometimes he just mumbles to where I can't understand him.
@peterr6205 6 years ago
I disagree. He does tend to mumble stuff, but he doesn't do a great job explaining stuff. I know TF-IDF quite well, but even I found his explanation to be quite weak.
@michael1026h1 9 years ago
Couple of questions: Why wouldn't a document with nothing but the word "horse" listed a thousand times show up high in the rankings? Also, if this is how indexes work, how does Google search for strings in quotes? E.g. "My horse" wouldn't show documents with "my lovely horse".
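Both questions have standard textbook answers, sketched below under stated assumptions; neither is claimed to be what Google actually does. For quoted phrases, word positions let the engine require that the query words occur consecutively. For the repeated-word page, one common mitigation is to dampen the term frequency (here `1 + log(tf)`) so that a thousand repeats count only a little more than a handful. The function names are made up for illustration.

```python
import math

def positions(text):
    """Map each word to the list of positions where it occurs."""
    table = {}
    for pos, word in enumerate(text.lower().split()):
        table.setdefault(word, []).append(pos)
    return table

def contains_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively in `text`."""
    table = positions(text)
    words = phrase.lower().split()
    if not words:
        return False
    # Try every occurrence of the first word as a possible phrase start.
    return any(
        all(start + i in table.get(w, []) for i, w in enumerate(words))
        for start in table.get(words[0], [])
    )

def dampened_tf(count):
    """Sublinear term-frequency scaling: repeats give diminishing returns."""
    return 1.0 + math.log(count) if count > 0 else 0.0

exact = contains_phrase("look at my horse", "my horse")
separated = contains_phrase("my lovely horse", "my horse")
```

With this scaling, a page that says "horse" a thousand times gets a term weight below 8 instead of 1000, so other signals can still outweigh it.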
@Jorissoris 9 years ago
50 lines of PYTHON are not really fast, you say? Well, that's what you get for using Python.
@DariushMJ 9 years ago
True, but the biggest problem in this case is the algorithm, not the language. Changing the language may make it run twice as fast, while changing the algorithm may make it run a billion times faster when there is a lot of data.
@bookdream 9 years ago
Dariush MJ Exactly. You can use the fastest language on the fastest machine, but if it's an inefficient algorithm it could take a ridiculously long time compared to a faster algorithm written in Python.
@BGBTech 9 years ago
Dariush MJ Python in general is sort of a bane to programming though, producing lots of very slow and unreliable code, with performance often a bit worse even than a lot of other scripting languages (such as Lua or JS). A lot of the time with Python code, though, it is the combined problem of both a slow language and poorly written code. If speed is relevant, a person is probably better off using C or C++ or similar.
@Folopolis 9 years ago
The problem is that given a long document with enough unique words, you could be running through a loop hundreds of thousands of times; that's not going to be fast in any language. Efficient algorithm design is as much an art as a science. This is why Google only has 3 or 4 competitors that are even trying any more.
@BGBTech 9 years ago
Alexandru Gheorghe I know about algorithmic complexity, but Python is often around 40x-100x slower than C if your code actually *does* anything (vs just calling into library functions or doing database queries or similar). If the same algorithm is used in either language, that speed difference may amount to a fairly big difference overall. C in no way prevents using O(log n) or O(1) algorithms, and optimizing algorithms is still a pretty big deal in C land as well.
@Seppes94 9 years ago
Look at my horse, my horse is amazing....
@Triantalex 1 month ago
ok?
@Seppes94 1 month ago
@Triantalex I'm sure there is a point in this video where this weebl reference makes sense. Might rewatch later. It's been 9 years.
@Nulono 9 years ago
6:48 11.5?
@fadouarasmouki725 9 years ago
Dear Computerphile, can you please add English subtitles for non-native speakers. Thank you.
@Adamantium9001 9 years ago
So Google just has a MASSIVE index which shows how many times EVERY POSSIBLE WORD occurs in EVERY SINGLE WEB PAGE in their search space?
@DataCab1e 9 years ago
The second episode of Star Trek TNG appears horribly dated to anyone who's used Google, because Data's search for an incident in which someone had showered in his or her clothing was treated as an un-indexed paper document search, assisted only by the android's ability to read every file relatively quickly.
@dupirechristophe7703 5 years ago
What we search here is "my horse" as a block, and not two separate words, but let's do some maths here it will surely resolve the problem x'D
@CraftySalvager 9 years ago
That's a lot of pre-computation. Pre-computation that you might only use 10% of the final result.
@thetommantom 9 years ago
these remind me of fractals, and then making it 3d or 4d connecting them
@danidanae6905 9 years ago
Hi your info is really interesting🙉
@owhs 9 years ago
was he sat on an exercise ball?
@SparkysBarelyMusic 9 years ago
I once wanted to find out how the Japanese calendar worked, i.e. 2015 = 27 in Heisei. Anyway, I googled "japanese dates". Moral of the story: do not google Japanese dates.
@ScornMuffins 9 years ago
Jeez, will someone just get this guy a horse already!?
@Tharkz 9 years ago
OK, 1/3 through and I just can't hold it in any longer... You're wearing sunglasses indoors with the blinds down and closed, why? :-)
@chappie__ 4 years ago
You should rename the video "How search engine indexing works"... So that your video gets a higher index lol
@iyaanazeez8989 5 years ago
Quick question, Will i become each time i watch a computerphile video? Agree oR Not
@Mad_Elf_0 9 years ago
Yeah... all this 'intelligence' that search engine providers are putting into their products are really neat, but when over 75% of searches you make as part of your job require looking for *exact* words or *exact* phrases, and the search engines 'intelligently' turn "process halted with error" into "process stopping by mistake", *even* if you use double quotes, it starts getting **REALLY** **ANNOYING**. I really wish Google would add a "I mean this literally" option to their search options
@SimbaKing7 9 years ago
more!
@michaelkruger4421 9 years ago
And the obvious thing to do next is Google "my horse"
@4pThorpy 9 years ago
what a fidget!
@khaledtareq1472 8 years ago
He said we can do this in 50 lines of Python. Please, I want this 50-line code.
@lafeo0077 6 years ago
Could you go into my complexity?
@VladVladislav790 9 years ago
I miss Sixty Symbols :(
@thetommantom 9 years ago
or trees
@DJDavid98 9 years ago
"my pony" I c what u did thar
@ArnoldsKtm 9 years ago
What's with this guy and his horses? :D
@BillyBob-ik4pn 9 years ago
7:42 My Little Pony... Half Life 3 confirmed!
@zebraforceone 8 years ago
(blazin saddles) HORSES??!?!??!?!??!
@Flagen579 9 years ago
BANANA
@trefod 9 years ago
I lost my concentration a couple of times because the presenter kept bobbing up and down. It is a subtle but effective way of making me lose my calm, because I can't reach out to steady him.
@Zishy 9 years ago
are you riding a horse?
@goeiecool9999 9 years ago
This guy sounds super tired lol
@kosojmshj5564 2 months ago
No one added anything.
@arminhrnjic8706 9 years ago
Like if you googled "my horse"
@rdoetjes 9 years ago
Very interesting subject, but as a director I was getting so annoyed by the guy bobbing up and down in his chair as if he was nervously wiggling his feet. It really took me out of the story.
@aliaydogdu5810 9 years ago
domato
@7177YT 5 years ago
Cute, he explains 'what libraries were' for the average millennial barbarian. lol
@grimreefer4366 9 years ago
It's time I sling the baskets off this overburdened HORSE Sink MY toes into the ground and set a different course Cause if I were here and you were there I'd meet you in between And not until MY dying day, confess what I have seen
@KhalilEstell 9 years ago
He is very bouncy.
@poteb 9 years ago
Great explanation, but please stop jumping in your chair, I'm getting a bit of motion sickness.
@ariebrons7976 9 years ago
first
@ariebrons7976 9 years ago
arie brons 9th you idiot
@ariebrons7976 9 years ago
arie brons buy a mirror and then we'll see who is the idiot here
@ariebrons7976 9 years ago
arie brons guys, guys calm down we don't need to fight i mean we are all human, in fact we are all the same person
@ariebrons7976 9 years ago
arie brons what do you mean, same person
@ariebrons7976 9 years ago
arie brons I mean that we are literally just letters expressing the opinion of some dude with a weird hobby
@Goodtimes4100 9 years ago
First comment! Love the vid thanks