Python Hash Sets Explained & Demonstrated - Computerphile

  101,898 views

Computerphile

3 months ago

Featuring Mike Pound. Jane Street skyscraper puzzle (and info on the AMP program) at bit.ly/computerphile-amp --- More below ↓↓↓
Hash Sets in Python work a little bit like the index of a book, giving you a shortcut to looking for a value in a list. Dr Mike Pound explains how they work and demos with some code.
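As a quick illustration of the "index of a book" idea (a sketch, not code from the video): Python's built-in `set` is a hash set, so a membership test jumps straight to the hashed bucket instead of scanning the whole collection the way a list does.

```python
# Membership testing: a list scans element by element,
# while a set hashes the value and checks one bucket.
haystack_list = list(range(1_000_000))
haystack_set = set(haystack_list)

print(999_999 in haystack_list)  # True, found via an O(n) scan
print(999_999 in haystack_set)   # True, found via an O(1) average-case lookup
```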
#Python #HashSet #Code #Computerphile
Jane Street’s Academy of Math and Programming is now accepting applications for their summer 2024 program, which will run from June 29th-August 2nd in NYC... Or you can just check out the puzzle for fun too - bit.ly/computerphile-amp (episode sponsor)
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com
Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com

Comments: 159
@Computerphile 3 months ago
Jane Street skyscraper puzzle (and info on the AMP program) at bit.ly/computerphile-amp
@olivier2553 3 months ago
I get the skyscraper, but I don't get the route thing. Oh, I get it now. The font is too small, I was misreading.
@Davourflave 3 months ago
"O(1) means that there is no relationship between the speed of the lookup and the number of elements in your collection". Couldn't have said it better, as often with big O, the devil is in the constant you dropped during notation :D
@guinea_horn 3 months ago
+ c
@styleisaweapon 3 months ago
in modern times, the constants can frequently matter most (when you can perform more than several hundred operations (3+ ipc) during that cache miss, for example)
@mina86 3 months ago
@@styleisaweapon, yeah, which is why you should virtually never use linked lists even though they offer O(1) operations.
@antonf.9278 3 months ago
@@mina86 Linked lists are only O(1) if you have a pointer to the position you want to access. Unless you are already traversing the list, you only have that for the first and last element. Appending to an array-based list is already amortized constant, and prepending is as well in array-based queues. In conclusion, linked lists are even worse.
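The point about array-based structures is easy to see in Python (a sketch; absolute timings are machine-dependent): `list.insert(0, x)` shifts every existing element, so repeated prepends are quadratic overall, while `collections.deque` gives O(1) prepends.

```python
from collections import deque
import timeit

N = 20_000

def list_prepend():
    xs = []
    for i in range(N):
        xs.insert(0, i)   # shifts all existing elements: O(n) per call

def deque_prepend():
    d = deque()
    for i in range(N):
        d.appendleft(i)   # block-allocated deque: O(1) per call

print("list.insert(0, ...):", timeit.timeit(list_prepend, number=1))
print("deque.appendleft:   ", timeit.timeit(deque_prepend, number=1))
```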
@styleisaweapon 3 months ago
@@mina86 I have never written code that used a linked list! And for trees I most often use stop-bit encoding (heap-style implicit tree addressing), and when sparse I back it with a hash table instead of an array.
@ejc4684 3 months ago
Always love a Mike Pound video!!!
@programorprogrammed 3 months ago
You've been Pounded!
@ejc4684 3 months ago
@@programorprogrammed Nice!!!
@Michael-sh1fb 3 months ago
I see Mike Pound, I click
@klyanadkmorr 17 days ago
It's .
@Richardincancale 3 months ago
Implemented exactly the simple form of this in a commercial compiler around 1980, to store the symbol table (the list of all identifiers defined in the program being compiled, their types, sizes etc.). Chosen for lookup speed, as the symbol table is accessed frequently in the compilation process.
@MichaelKingsfordGray 3 months ago
I invented and implemented this very scheme in 1978, on an HP9845A in HP Basic with a 20 MB hard disk, and discovered a few things: 1) Hash collisions are best stored in a sorted list, so that a binary search can be done, reducing the search time dramatically. 2) Hashing integers as themselves is a disaster in the real world, where initial keys of 0 proliferate. (Amongst other common integers, such as -1.)
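A sketch of idea 1) in Python (class and names invented for illustration; items within a bucket must be mutually comparable): keep each collision bucket sorted, so lookups inside a bucket can use binary search via `bisect`.

```python
import bisect

class SortedBucketHashSet:
    """Hash set whose collision buckets stay sorted, enabling binary search."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def add(self, item):
        bucket = self.buckets[hash(item) % self.capacity]
        i = bisect.bisect_left(bucket, item)      # binary search for the slot
        if i == len(bucket) or bucket[i] != item:
            bucket.insert(i, item)                # insertion keeps bucket sorted

    def __contains__(self, item):
        bucket = self.buckets[hash(item) % self.capacity]
        i = bisect.bisect_left(bucket, item)      # O(log k) within the bucket
        return i < len(bucket) and bucket[i] == item

s = SortedBucketHashSet()
for n in (1, 9, 17, 25):   # with capacity 8 these all collide in bucket 1
    s.add(n)
print(17 in s, 4 in s)     # True False
```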
@anon_y_mousse 2 months ago
Keys are unique in a hash table, and generally immutable. If you're thinking of a multiway hash table, you still only use one instance of a key. In such a case, a key/value pair would be best, with the value side of that equation being a balanced tree structure instead of the usual array.
@DanieleCapellini 2 months ago
1) is debatable and depends on your number of insertions vs lookups, since if you have a lot of insertions, sorting every time there's an insertion collision can be more expensive than searching over an unsorted list every time there's a lookup collision
@anon_y_mousse 2 months ago
@@DanieleCapellini If you have a lot of collisions, then it being sorted or not will be the least of your problems. However, in a tiny array, say with 16 or fewer elements, then sorting can be done fastest with an insertion sort algorithm and it'll be faster to lookup and eat less memory as an array. Of course, if you don't resize your table adequately and collision chains back up to an insane degree, then you'd be better off with a tree data structure all around.
@bensmith9253 3 months ago
I saw an interview question video yesterday about these - really good timing for me this video. 😁😁😁
@gustafsundberg 3 months ago
I implemented this in ADA back in 1997 during a class in computer science. I think 73 was a pretty good prime to use while hashing to minimize collisions.
@halfsourlizard9319 3 months ago
Extremely surprised that people were still using ADA that late in an educational context -- would have suspected that 'trendy' and 'new' Java would have been on offer.
@tracyrreed 3 months ago
@@halfsourlizard9319 Java was released, brand new and buggy with no support or books yet, in 1995. Two years later is barely enough time for books to be written and schools to re-orient their curriculum to a new language. I used Java back then and it was very rough and didn't work well at all.
@gustafsundberg 3 months ago
@@halfsourlizard9319 ADA was used for algorithms; C was used for OS, networking etc. We were using Sun workstations, and I believe Java became available in 1996. Universities are probably not so fast in switching languages.
@centerfield6339 3 months ago
@@halfsourlizard9319 I went to uni in 2000 and they were just introducing Java.
@exchange4918 3 months ago
Thanks for your videos Mike keep it rolling 🎉
@JaapVersteegh 3 months ago
Not really interested in the topic, because I already know this, but still watched because Mike's presentations are always engaging
@Loki- 3 months ago
I'm not new to programming, but I'm new to Python and I was just literally looking into what uses hash tables in Python. Thanks. Lol
@jonny__b 3 months ago
Oftentimes on modern hardware, particularly on small datasets, a linear scan can be faster than a hashmap lookup because the hashing function is slow.
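This is easy to measure (a sketch; the crossover point varies a lot by machine and Python version, so no winner is asserted here). Note that in CPython small ints hash to themselves, so the effect the comment describes is more visible with long strings or objects with expensive `__hash__` methods.

```python
import timeit

small = list(range(8))
small_set = set(small)

t_list = timeit.timeit(lambda: 7 in small, number=100_000)
t_set = timeit.timeit(lambda: 7 in small_set, number=100_000)
print(f"linear scan: {t_list:.4f}s, hash lookup: {t_set:.4f}s")
```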
@ibrahimmahrir 3 months ago
12:02 In the __contains__ function there shouldn't be an else: before the return False at the end; otherwise, if the list self._data[idx] is not None and the item is not in that list, the return value won't be a boolean.
@williamli55 3 months ago
Good point. It should allow both of the above if statements to fall through, and simply return False as a default. It will still work for __contains__ though because None is falsey. It will evaluate to false when using the "in" operator. It's not a consistent return type though, so yeah, he would probably be better off just removing the else statement.
@vadrif-draco 3 months ago
While that is correct, it seems that the __contains__ special method effectively returns False by default if you don't return anything yourself (I tried it on an online interpreter). So there shouldn't be an else: statement or a final return statement at all. If you go to 15:52, it seems to have worked even though it was the exact scenario you described: the list at the specified hash value does indeed exist (is not None), namely the list of items whose hash value is 999999; the item itself (999999), however, is not in that list, as it had just been removed.
@Ceelvain 3 months ago
Thanks pep8. :D
@jamesarthurkimbell 3 months ago
@@vadrif-draco Any Python function returns None by default, and None is falsey (i.e. it works with conditionals the same way that False does)
@michaelpound9891 3 months ago
Great spot! You’re absolutely right, I messed up the implementation :) even if technically it works this time, the code isn’t super clear due to the bug, correcting it would also make it more readable I think.
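For readers following along, a corrected sketch of the two methods (reconstructed from the discussion above, not the video's exact code): dropping the `else:` lets both ifs fall through to a single default, so every path returns a real boolean.

```python
class HashSet:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self._data = [None] * capacity   # buckets are created lazily

    def add(self, item):
        idx = hash(item) % self.capacity
        if self._data[idx] is None:
            self._data[idx] = []
        if item not in self._data[idx]:
            self._data[idx].append(item)

    def __contains__(self, item):
        idx = hash(item) % self.capacity
        if self._data[idx] is not None and item in self._data[idx]:
            return True
        return False   # no `else:` needed; every path returns a bool

s = HashSet()
s.add(999_999)
print(999_999 in s, 5 in s)  # True False
```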
@prkhrsrvstv1 3 months ago
Bloom filters could be a good follow-up to this.
@Checker8763 3 months ago
And the HyperLogLog
@styleisaweapon 3 months ago
yes bloom filters are good, but first you gotta cover how to do universal hash functions (and therefore can create an arbitrary number of them) .. I think Erik Demaine covers it well enough but his stuff is lecture format
@styleisaweapon 3 months ago
@@Checker8763 any probabilistic counting .. no need for the added complexity of HLL
@cinderwolf32 3 months ago
The topics discussed on this channel have been the ones that really specifically interest me as of late. This is cool, thank you!
@trevinbeattie4888 3 months ago
Side note: if your list of values is static and known in advance, the Gnu “gperf” tool can come up with a “perfect” hash function that gives you the minimum array size with no collisions. It generates C/C++ code, but the output should be portable to most other languages with a small amount of effort.
@jfftck 3 months ago
I watched a talk on Python dictionaries; the guy who worked on the new implementation went into detail about how they are more closely related to databases than hash maps. It was done to increase performance, and since almost everything in Python has a backing dictionary, it made a large difference in runtime.
@PhilippeCarphin 3 months ago
That sounds really interesting, might I trouble you for a link to this talk?
@rngwrldngnr 1 month ago
@@PhilippeCarphin Probably "Modern Dictionaries" by Raymond Hettinger. It's on YouTube. It's from 2016, so it's possible it no longer represents the state of the art.
@princevijay871 3 months ago
Just what i wanted now. thanks a lot :)
@olivergurka2456 3 months ago
Hi, could you do a video on characteristics of a good hash function used in hashtables and their evaluation as a followup video?
@barneylaurance1865 3 months ago
The built in `array` structure in PHP is mostly a hashmap, and is extremely widely used. Arguably a bit too widely used sometimes, since programmers often use it with strings that they have chosen in advance as the keys, and data supplied at runtime only as the values. In that situation replacing it with a class, with a predefined set of properties known to the compiler, both improves performance and can make the program easier to understand.
@vlc-cosplayer 7 days ago
"Arguably a bit too widely used sometimes" - There's a name for code that overuses primitive types, "stringly typed" 😆
@LAGHRISSIAdnane 1 month ago
Thank you mike
@martinparidon9056 2 months ago
Very interesting, thank you!
@skittles6949 2 months ago
I would love to see a follow up video for dictionaries/hashmaps :)
@Ceelvain 3 months ago
A follow-up on the amortized complexity would be nice. Because it's a bit disingenuous to call the hashmap insertion O(1) if the underlying table doesn't grow. ^^
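A sketch of what growing could look like (load factor and growth strategy invented for illustration; CPython's real thresholds differ): the table doubles when it gets too full, and the occasional O(n) rehash averages out so that add() is O(1) amortized.

```python
class ResizingHashSet:
    LOAD_FACTOR = 0.75  # a common threshold, chosen here for illustration

    def __init__(self):
        self.capacity = 8
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    def add(self, item):
        if item in self:
            return
        if (self.size + 1) / self.capacity > self.LOAD_FACTOR:
            self._grow()
        self.buckets[hash(item) % self.capacity].append(item)
        self.size += 1

    def _grow(self):
        old = self.buckets
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        for bucket in old:            # full rehash: O(n), but rare enough
            for item in bucket:       # that add() is O(1) amortized
                self.buckets[hash(item) % self.capacity].append(item)

    def __contains__(self, item):
        return item in self.buckets[hash(item) % self.capacity]
```

Each doubling costs O(n), but it happens only after O(n) cheap inserts, which is where the "amortized" in amortized O(1) comes from.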
@casperes0912 3 months ago
I did that skyscraper puzzle as my first algorithm challenge at university
@KipIngram 1 month ago
It's important to point out here that probability of collision can be reduced by increasing table size, but then your "utilization" of your table space will be lower. It's absolutely a trade-off and you can't improve one without degrading the other. As your table fills up, collisions will become more and more common. For example, if your table is 80% full, then almost by definition there's an 80% chance of a new item colliding with an existing one. The uniformity of the hash function pretty much guarantees it. There's a lot of nice probability theory analysis you can do around these things. Of course, that 80% full table gives you a 64% chance of colliding on your first two tries, a 51.2% chance of failing on the third probe, a 41% chance of failing on the fourth, and so on. The average length of the chains you wind up with goes up sharply as you push up utilization.
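The 80% figure is easy to check empirically (a sketch; with a uniform hash, the chance that a fresh item's first probe lands on an occupied slot is just the load factor):

```python
import random

def first_probe_collision_rate(load, capacity=10_000, trials=20_000):
    # Mark load*capacity random slots as occupied, then probe random slots.
    occupied = set(random.sample(range(capacity), int(load * capacity)))
    hits = sum(random.randrange(capacity) in occupied for _ in range(trials))
    return hits / trials

print(first_probe_collision_rate(0.8))  # close to 0.8
```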
@cacheman 3 months ago
13:30 I get that it's easier to generate numbers, but I think pedagogically it makes much more sense to use strings.

import random

with open('/usr/share/dict/words') as f:
    words = [w.rstrip() for w in f]
print(random.choices(words, k=10))
@billywoods1688 3 months ago
Hi Mike, Sean and all, thanks for the video! I have a question about the claim that lookup in a hash table is O(1), at least as implemented in the video. Suppose the capacity of your hash set is c, and the number of data points you have is n. Then, to look up x, it takes O(log(c)) operations to find the list corresponding to the hash h(x), and O(n/c) operations to search through that (perhaps unsorted) list for x, plus maybe some small overhead for the hashing function -- that is, O(log(c) + n/c) in total, right? If so, then when n is much larger than c, the resulting complexity is still O(n). When n is smaller than log(c), we have O(log(c)), but that isn't actually any better than O(n). So it seems that the useful case is when n and c scale *together* such that log(c) < n < c. In this case, it takes O(log(c)) operations - but in this case O(log(c)) is no better than O(log(n)). Am I misunderstanding something? 😅 Thanks!
@Alchemetica 3 months ago
Oh, I must not have watched an episode for a while as that is the first time, I have seen a sponsor acknowledgement. Is Mike building an AI supercomputer in his garage?
@SangaPommike 3 months ago
There's an issue with lines 26/27 on the HashSet. If there's something in the bucket for the object other than the object that's queried, the method will return None. Now, this will of course evaluate to False rather than True, but it's not really the expected value.
@KipIngram 1 month ago
You do also need to compare the object that's sitting there in your hash table with your search spec - if a lookup of say, foo hash collides with a prior insertion bar, well, those don't match. You need to report that foo isn't in the table - not report bar as the successful search result.
@harold2718 3 months ago
Please do a video on LLL, maybe show Coppersmith's attack on RSA, maybe knapsack
@cannaroe1213 3 months ago
At 7:00, I used to explain it this way, but I came to the understanding it's simpler to say: imagine there's a second hashing function, more random than the last lol, that gets its 32-bit hash stored at the location of the first hash instead of Obj. Now it's a fixed-length structure, and you get counting/fingerprinting of fingerprints, and you don't need to make an Obj. Well, in Python you do.
@cannaroe1213 3 months ago
Or just a 64-bit hash and you split it in half, quarters, etc.
@seapearl3175 27 days ago
If the hash of 2 objects is the same and they end up as a list of that hash, how would I obtain the correct object from it later?
@DaniErik 3 months ago
If you use a naive hashing algorithm, I suppose a nefarious actor could engineer a stream of data where each entry collides. This could presumably be used in a DOS attack.
@eveleynce 3 months ago
hash collisions are used in quite a few different attacks, there is no need to correctly guess someone's password (for example) when you can just guess any value whose hash matches that password, increasing the chances of success for the hacker
@jpopelish 3 months ago
As a schoolboy I heard about casting out 9s, but never learned what it was. I'm now 74 and have finally learned about casting out 9s. I now have a pastime of casting out 9s from every license plate I stop behind. So I could use the remainder produced by casting out 9s as a hash to store all these license codes into 9 sub-lists, instead of putting all the codes into one big list. Have I got this right?
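That is exactly the idea. A sketch (the plate strings are invented): casting out nines amounts to "digit sum mod 9", which works as a hash because it spreads plates across nine sub-lists.

```python
def cast_out_nines(plate):
    # The digit sum mod 9 equals the number itself mod 9; letters are ignored.
    return sum(int(c) for c in plate if c.isdigit()) % 9

buckets = [[] for _ in range(9)]
for plate in ["ABC 1234", "XYZ 981", "DEF 4504"]:
    buckets[cast_out_nines(plate)].append(plate)

print(cast_out_nines("ABC 1234"))  # 1+2+3+4 = 10 -> 1
```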
@sledgex9 3 months ago
Both the add() and contains() functions have a code branch that doesn't return a value if reached, i.e. when the first if is true but the nested if is false. However, Python has implicit returns. In this particular case contains() will return None, which is evaluated as False in a boolean context and works by accident. Not sure how this affects the add() function.
@gidi1899 3 months ago
Is there a "recursive hash" method?!
@jsomhorst 3 months ago
I'm wondering, when using objects instead of integers, how would you handle collisions?
@Grimlock1979 3 months ago
A hash is just an index in an array. In the end, it has to be an int.
@DjoumyDjoums 3 months ago
To add to the video, a few key points about hash sets and maps : - Because of the hash and modulo, the contained items cannot be ordered in any way (sorted hash sets and maps exist, but are more costly). - Looping through the items is a lot slower than looping on a simple array, because of data locality. - The target load threshold is often 70% for optimal performance, meaning the capacity is about 50% greater than the number of contained items. This impacts memory consumption in large sets/maps, or when you have a lot of them. - To maintain the load threshold, the hash set may resize itself (increase its capacity), which requires a full rehash of the items, because the capacity on which we use a modulo has changed. This can be very costly, and is why adding an item to a hash set is said to be O(1) amortized (= is O(1) unless you hit a resize threshold) - Hence, specifying the initial capacity when you know it is a very good practice, and avoids any rehashing - From what I've seen, the best values for capacities are either prime numbers or powers of 2. .Net uses prime numbers for example.
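The resizing behaviour described above is observable from the outside (a sketch; the exact byte counts and thresholds are CPython implementation details and vary by version):

```python
import sys

s = set()
prev = sys.getsizeof(s)
for i in range(1_000):
    s.add(i)
    cur = sys.getsizeof(s)
    if cur != prev:   # the backing table just grew and was rehashed
        print(f"after {len(s)} items: {cur} bytes")
        prev = cur
```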
@gdclemo 3 months ago
You can store the original un-modulo'd hash value with the key. This speeds up resizing at the expense of some RAM.
@kristik432 3 months ago
Yeah, a Python dict maintains insertion order by also keeping a list of keys behind the scenes.
@DjoumyDjoums 3 months ago
@@kristik432 Since 3.7 yes, they're now ordered, but as I said this is more costly than keeping it unordered. Probably not a big deal here anyway, because when you care about performance and memory consumption, Python isn't exactly the first candidate 😅
@adambomb1337 3 months ago
if two strings get hashed to the same key, and each key gets an .add(obj), how does it know which obj to pick from when I .get(oneOfThoseStrings)? edit: nvm i just realized this is a set, hah. Im curious then how hashmaps resolve this. looking forward fo that one!
@digama0 3 months ago
After identifying the right bucket, you just compare the key to every key in the bucket using ==. In the hash set code this is handled by delegating to the contains function on lists.
@RonJohn63 3 months ago
Big question: how do Dictionaries auto-expand?
@goaserer 3 months ago
Usually the whole internal data structure gets rehashed into a new hash table of increased size. Since capacity plays a role in where entries get stored, copying everything into a new hash table under the hood is usually the only consistent option. Doing this can cost some time, especially on bigger structures, which is why pre-sizing dictionaries and sets is recommended in most languages.
@RonJohn63 3 months ago
@@goaserer Does presizing dictionaries exist in Python? I don't remember seeing it, last time I looked.
@goaserer 3 months ago
@@RonJohn63 For Python's standard dicts especially, I think you are right. An option (or workaround) would be to use dict.fromkeys, which prevents rehashing but probably only makes sense from a performance standpoint if you have all the keys in a separate list anyway. Well, since a dict only makes sense if data is more often read than written, it's probably not worth losing too much sleep over the rehashing. While it is not free, it's still an in-memory operation.
@steveHoweisno1 3 months ago
Why is there not the following bug? In the __contains__ method, suppose the list is nonempty for a given idx but the element is not there. Then the function returns None, since no return keyword is encountered. I suppose when you call "a in b" for some object b with a __contains__ method, if it returns None that is interpreted as False. Still seems sloppy. I would prefer if you wrote "return i in list".
@aikumaDK 3 months ago
I understood some of the words Mike used
@Little-bird-told-me 1 month ago
It's sunny in the UK.
@winstonsmith1977 3 months ago
I am quite new to Python. One issue with hash() is that 'salting' is added for strings and some other types. Can someone elaborate on this? Thanks in advance!
@antonf.9278 3 months ago
Salting means putting known extra data into the hash function. This is normally done with random data when hashing passwords, so that multiple users with the same password have different hashes. (for a better explanation look up the "how not to store passwords" video with Tom Scott) I don't know about python, but I could see reasons for salting data in a dynamically typed language with the type. This would avoid collision between objects of different type but same representation.
@HTH565 3 months ago
Python generates a random value on startup and combines it with your string when you call hash(). This is intended as a security measure to prevent attackers from forcing collisions. For instance, if you write a web application that uses a hashmap somewhere, and the keys of the hashmap can be influenced by the web clients, then without salting an attacker could perhaps send specific requests that cause lots of collisions, and suddenly your web process would consume 100% CPU walking those collision chains. If the attacker doesn't know the salt, however, it is much more difficult to do that.
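You can see the salt at work by pinning it (a sketch): the `PYTHONHASHSEED` environment variable controls the per-process salt, so two interpreters with the same seed agree on string hashes and two with different seeds almost surely don't.

```python
import os
import subprocess
import sys

def string_hash(seed):
    # Spawn a fresh interpreter so its hash salt is controlled by the seed.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run([sys.executable, "-c", 'print(hash("hello"))'],
                         capture_output=True, text=True, env=env)
    return int(out.stdout)

print(string_hash("1") == string_hash("1"))  # True: same salt, same hash
print(string_hash("1") != string_hash("2"))  # almost surely True: salts differ
```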
@gameeverything816 3 months ago
Neat
@the7Cofficial 3 months ago
@betterinbooks 3 months ago
It's not just Python hash sets but hash sets in general.
@xv0047 3 months ago
I wish I could explain concepts like this.
@Vixikats 3 months ago
Constant time does not mean "fastest time". The overhead from most constant time operations is usually significantly slower than even checking an array 1 element at a time unless the data structure gets beyond some breakpoint in size. You might need a million elements in an array before a hash set starts to be the fastest solution. This is why profiling your code is always necessary.
@mikefochtman7164 3 months ago
Always fascinated by hash-related things. Perhaps discuss actual 'dictionaries' that use actual strings? Perhaps you've already done this in another video; sorry if you have and I just missed it. We had a system that tracked all the variables in a system of over 250,000 lines of code. The 'variable database' used a hash on each variable name (the code had on the order of 50,000 variables) and stored in each hash location a short list of all the source-code modules that contained the variable. I know today most folks would just 'grep' for things like this, but at the time (early 80s) this was a far superior way to track variable usage, as direct searches were slower.
@williamli55 3 months ago
That sounds difficult to maintain. How did you get the data into the database? Did you have a process that routinely scanned the source and updated the database?
@mikefochtman7164 3 months ago
@@williamli55 It was only half of the database. The other half maintained a 'reverse list' of all variables used in a module. So when recompiling a module, the first step used the second set to 'remove' the module name from each variable's list of places it was referenced. Then, after the module compiled successfully, a 'post compiler' step would parse the source file looking for all variables now referenced by it and add the module name to each of those variables' entries. Because it used its own data to keep itself up to date (the reverse list to know what to delete), it did occasionally get corrupted. We'd live with it a bit, but at some point we'd just 'recompile the world' overnight to restore everything. (Sorry, I've left out a lot of the details, but hopefully you get the gist of it.)
@williamli55 3 months ago
@@mikefochtman7164 Okay I see. Parsing after compilation makes sense, plus it sounds like the 'post compiler' was only scanning what it needed. Thanks for the explanation!
@dgo4490 3 months ago
The downside is it requires around 50% data overhead to retain good performance characteristics.
@4115steve 3 months ago
Why aren't hash sets used inside sorting algorithms to reduce insertion time?
@phiefer3 3 months ago
Probably because hash sets sidestep the concept of 'sorting' altogether. They don't care about what 'order' things are arranged in; instead, they just have a designated place for any possible object. So when you give them an object, either for insertion or searching, they don't need to worry about searching for that object: they just check its proper location, and it's either there or it isn't. Hashes aren't really useful for sorting because they're not sorted in any meaningful way (unless you are just working with integers). Instead, objects will be arranged in what appears to be a random distribution according to the hash function that you're using. There's not really a straightforward way of extracting the contents of a hash set into a sorted structure; you'd probably have to literally search the hash set for every possible object and then append any that you find to your list. For example, if it was a list of strings, you'd need to do a search for every possible string that could be made (you can't just search for the strings from your list, because you'd need to search for them in order, which would require that you already have them sorted). Basically, it'd be like going through the dictionary one entry at a time in order to alphabetize a list. Hash sets are primarily used for datasets that either don't have a natural way to order them, or where the order itself doesn't matter, but you still want a fast way to access each element. As mentioned, a software library is likely implemented with a hash set or something similar. Sure, you could store the contents of the library alphabetically, but there's nothing really gained by knowing where each element is in relation to the others; the order does not matter. So by using something like a hash you reduce the time it takes to find what you need from O(log n) to O(1) without losing anything useful. If you actually care about the 'order', then chances are that a hash set isn't right for the job.
@4115steve 3 months ago
@@phiefer3 So you can't use an insertion sort algorithm that also includes a hash for things that have already been sorted? So if I were sorting things alphabetically with insertion sort, it wouldn't make the sorting algorithm faster to keep what has already been sorted alongside a hash table. I was thinking I could use both, a hash table of ABCs and the sorting algorithm, so the sorting algorithm didn't have to go through each letter; it could just use the hash table to go to the starting letter and then continue sorting.
@phiefer3 3 months ago
@@4115steve That's not how hash sets work. The objects in the hash set are not in any meaningful order. There is no concept of the "start" of the hash set, or the "start of a letter" in the hash set. Nor is there a concept of what comes "next". Objects in the hash table are effectively randomly distributed according to the hash function. If you were to store the words 'dog' and 'dogs' in a hash table, they would not be anywhere near each other (well, they could be, depending on the hash function, but even then it would be coincidental). You could do something like take your already sorted list and then make a hash table where you store the indexes, keyed by the strings themselves, i.e. you search the hash table for 'dog' and it gives you the index where 'dog' is stored in the list. This would in theory allow you to search the list with O(1) speed, but it would not really help much with insertion or with sorting. The act of actually moving things around in the list in order to add, remove or sort would still have the same complexity as normal; the hash table wouldn't help with that, and every time you altered the list you would then need to rehash the ENTIRE list to update it with the new positions of all the objects. What it sounds like you're actually talking about is a list of lists, or a 2-dimensional array, where you use the first letter of a word to identify which sublist to search. However, this would still have the same big-O complexity as a regular list, since in general the size of each sublist would be roughly proportional to the total number of objects. And I suppose, if you squinted really hard, you might be able to call this a very simplistic form of hashing, but in practice it's not really.
@trevinbeattie4888 3 months ago
@@4115steve What you're describing kind of sounds like a bucket sort, but there's no need for hashing in that case; you would just define the ranges of values for each bucket so the buckets are sorted in advance, then walk through the buckets in order to get the sorted items. Most hashing functions (other than binning) are inherently unsorted, so you would not use them in any sorting algorithm.
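A minimal bucket sort sketch for numeric data (names invented): value ranges, not hashes, decide the bucket, so walking the buckets in order yields a sorted result.

```python
def bucket_sort(xs, nbuckets=10):
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / nbuckets or 1   # avoid zero width if all values equal
    buckets = [[] for _ in range(nbuckets)]
    for x in xs:
        i = min(int((x - lo) / width), nbuckets - 1)
        buckets[i].append(x)            # bucket index preserves ordering
    return [x for bucket in buckets for x in sorted(bucket)]

print(bucket_sort([29, 3, 71, 42, 8, 50]))  # [3, 8, 29, 42, 50, 71]
```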
@aidanthompson5053 3 months ago
Tomrocksmaths | hash map 8:26
@williamli55 3 months ago
Recently, I needed ("wanted" would actually be a more fitting word) to create a class where I would be able to map multiple keys to the same values (i.e. multiple keys with different hashes reference the same memory address). It was an absolute nightmare to get right, and it likely wasn't worth the effort, but it was a fun experiment which produced something which was actually more useful than I anticipated. I'm really fascinated by hash maps. Dictionaries are by far my favorite data structure in Python, especially now that they're ordered and support set operations on their views.
@idogendel 3 months ago
...but how do you implement a true O(1) "in" operator, unless you allocate enough memory in advance for all possible hash values? That's a serious price to pay for the performance.
@digama0 3 months ago
You are supposed to resize the array when the filling factor becomes larger than a fixed constant like 0.75.
@idogendel 3 months ago
@@digama0 ok, so it's still a speed/memory tradeoff. I come from Embedded background, where such shenanigans are not welcomed 🙂
@digama0 3 months ago
@@idogendel I guess the embedded version of that is to just crash or start dropping items once the filling factor gets too large
@idogendel 3 months ago
@@digama0 My guess would be a crash 🙂 I wonder how many python programmers add explicit safeguards against out-of-memory errors, instead of just hoping the computer has enough RAM.
@tonywakuu 3 months ago
👏👏👏👏👏
@mrlithium69 3 months ago
The differences between Hash Set and Hash Table in different languages can be confusing
@JoQeZzZ
@JoQeZzZ 3 ай бұрын
A hash set is just that: a set, so it only stores the items themselves. A hash map then maps each item onto a value. This is also called a hash table, because a mathematical table is another way to display a mapping.
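In Python terms, that's the difference between the built-in `set` and `dict`:

```python
# A hash set stores only items; a hash map/table stores key -> value pairs.
s = {"apple", "banana"}           # hash set: membership only
d = {"apple": 3, "banana": 5}     # hash map: each key maps to a value

assert "apple" in s               # both support O(1) average-case lookup
assert d["apple"] == 3
```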
@Grimlock1979
@Grimlock1979 3 ай бұрын
You could also say that a Hash Set is a Hash Table where a key is the same as the value.
@halfsourlizard9319
@halfsourlizard9319 3 ай бұрын
The difference between 'O(1)' and 'doing precisely a single operation' seems to be a subtlety that evades people ... I can't say how many dev candidates I've interviewed who've said that their 2-pass algorithm had 'O(2n)' perf -- that's not a thing, and it suggests that you've completely misunderstood asymptotes / limits.
@styleisaweapon
@styleisaweapon 3 ай бұрын
What do you think of dev candidates who give an estimated expected latency instead? People who specialize in optimization barely consider big-O, because they're coding for a known target architecture; what they care about is the latency, not metrics that don't translate well to real-world performance.
@halfsourlizard9319
@halfsourlizard9319 3 ай бұрын
@@styleisaweapon Depends very much on the use case -- absolutely that'd matter way more if it were some low-level hard-real-time system or a high-volume trading system or something ... and you're quite right -- even for say a frontend thing (and this is way outside of things I'm familiar with), I know that good devs grab flame graphs and are trying to cut out needlessly expensive ops in their JavaScript -- which is probably 5 levels of abstraction away from bare metal. But, I'd still expect someone with a comp sci background to be conversant in asymptotic complexity -- even if just to discuss why it misses things ... and why writing a quadratic solution to a thing that could be linear is probably (although not always) not a great idea. All abstractions ignore things; some abstractions are useful.
@hanifarroisimukhlis5989
@hanifarroisimukhlis5989 3 ай бұрын
Add difference between actual O(1) and amortized O(1) (and how that amortization can lie to you).
@halfsourlizard9319
@halfsourlizard9319 3 ай бұрын
@@hanifarroisimukhlis5989 Oh, yeah, that's super important for a lot of things that come up in practice with resizable data structures 👍👍
@halfsourlizard9319
@halfsourlizard9319 3 ай бұрын
(Fav case that I see a lot is: People doing mid-sequence insertion into something that's backed by an array ... and they just assume that it's efficient ... With good people, asking 'so, how would you implement insertion in a contiguous-mem-backed container?' will get them to realise it ... but I've had candidates refuse to believe that it's not const time.)
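For anyone following along, a sketch of why that insertion can't be constant time in an array-backed container: every element after the insertion point has to shift.

```python
# Sketch: mid-sequence insertion into a contiguous array must shift every
# later element one position right, so it costs O(n - i), not O(1).
def insert_at(arr, i, value):
    arr.append(None)               # make room at the end
    for j in range(len(arr) - 1, i, -1):
        arr[j] = arr[j - 1]        # shift elements right, one by one
    arr[i] = value
    return arr
```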
@_aullik
@_aullik 3 ай бұрын
"if (boolean expression) return True else return False" is really strange code.
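For comparison, the pattern and its simplification (hypothetical function names): the expression is already a bool, so it can be returned directly.

```python
# The pattern from the comment, spelled out:
def contains_verbose(x, items):
    if x in items:
        return True
    else:
        return False

# The boolean expression is already True or False, so just return it:
def contains(x, items):
    return x in items
```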
@mikelwrnc
@mikelwrnc 3 ай бұрын
Ease up on the skepticism that AI is AI guys. I’m a PhD cognitive scientist & statistician with moderate experience with the latest AI approaches and I’ve come to the assessment that what’s out now warrants the moniker “AI”. Previously we could look at a shallow net or forest or linear regression and assert confidently the limits thereof that made them feel qualitatively different from human information processing. Not so with the latest approaches.
@chaoslab
@chaoslab 3 ай бұрын
"a few moments later..."
@pavelnoryk3823
@pavelnoryk3823 3 ай бұрын
The main thing you didn't explain is why getting an element from a set is O(1); you just said it's looking up the element's hash in your collection. That lookup could be done in different ways with different complexity: e.g. `el in [...]` also looks for an element, but it's O(n). It's O(1) because you know: - the structure you created stores only ints, because of the hash function's implementation - the fixed byte size of an int on your machine, and that the ints are stored sequentially in a single block of memory. If you have the address of the first element and know that all elements are the same size, you can get the address of the n-th element by adding n times the fixed int size to that address. For example (super simplified), take an array of 4 int elements, where each int takes 4 bytes (32 bits) to store. When you create it, memory is allocated for 4 elements of 4 bytes each. If the 1st is stored at address 0x0000, the 2nd element is at 0x0000 + 4 = 0x0004.
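That address arithmetic can be sketched directly (the base address and element size below are made-up illustration values, not real machine addresses):

```python
# Sketch of why fixed-size elements make indexing O(1):
#   address(i) = base_address + i * element_size
base_address = 0x0000   # hypothetical start of the array
element_size = 4        # assumed bytes per int

def address_of(i):
    # one multiplication and one addition, regardless of array length
    return base_address + i * element_size

assert address_of(0) == 0x0000
assert address_of(1) == 0x0004
```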
@anon_y_mousse
@anon_y_mousse 2 ай бұрын
When implementing a hash table, you should use a table size that's a power of two so that you can do a `bitwise and` on the hash code to index the table with, but also you should store the unnormalized hash with the corresponding keys. Everyone always recommends using a prime number for the table size, but that's the wrong approach. You can always make the hashing function more complicated, but the indexing should be as simple as possible. In case you're wondering why you should save the unnormalized hash code, that's so that you can resize the table as necessary when too many or too few collisions occur and rebuild the table without further querying the hashing function. There's a lot of balancing to be done in implementing a hash table, regardless of whether it's a set or map, but these two things should always be a part of the implementation.
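A rough sketch of those two recommendations together (a power-of-two table indexed with a bitwise AND, and the full hash stored alongside each item so resizing never re-queries the hash function). This is illustrative only, not CPython's actual implementation:

```python
# Sketch: open-addressed hash set with power-of-two table size.
# Each occupied slot stores (full_hash, item) so _resize can rebuild
# the table without calling hash() again.
class SimpleHashSet:
    def __init__(self):
        self.size = 8                      # always a power of two
        self.slots = [None] * self.size
        self.count = 0

    def _index(self, h):
        return h & (self.size - 1)         # cheap alternative to h % size

    def add(self, item):
        if self.count / self.size > 0.75:  # keep probes short
            self._resize()
        h = hash(item)
        i = self._index(h)
        while self.slots[i] is not None:   # linear probing
            if self.slots[i] == (h, item):
                return                     # already present
            i = (i + 1) & (self.size - 1)
        self.slots[i] = (h, item)
        self.count += 1

    def __contains__(self, item):
        h = hash(item)
        i = self._index(h)
        while self.slots[i] is not None:
            if self.slots[i] == (h, item):
                return True
            i = (i + 1) & (self.size - 1)
        return False

    def _resize(self):
        old = [s for s in self.slots if s is not None]
        self.size *= 2
        self.slots = [None] * self.size
        for h, item in old:                # reuse the stored hashes
            i = self._index(h)
            while self.slots[i] is not None:
                i = (i + 1) & (self.size - 1)
            self.slots[i] = (h, item)
```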
@lasdernas
@lasdernas 3 ай бұрын
corrected title: Hash Sets Explained & Demonstrated Using Python
@nHans
@nHans 3 ай бұрын
_"... once the dater is sautéed ..."_ From _1,000 Tasty Ways to Enjoy Your Date_ by Dr. Lecter. (It's okay. As an Indian, I'm allowed to make fun of other people's English accents. It's on a reciprocating basis.)
@robertnabholz2062
@robertnabholz2062 3 ай бұрын
dark mode please!
@guanche011
@guanche011 3 ай бұрын
I'm probably going to get hate for this, but: computer scientists are so bad at choosing variable names, and I've seen this so many times. In an interpreted language it doesn't make a difference, and still they insist on max 3 letters 😅 (I'm saying this with a wink, btw)
@nowster
@nowster 3 ай бұрын
All this without mentioning the standard Python "set" type which provides this functionality (and more). [Edit: as mentioned below, I was wrong about this]
@Jester01
@Jester01 3 ай бұрын
All this without watching the video? It is mentioned at 9:00
@nowster
@nowster 3 ай бұрын
@@Jester01 True. I was wrong. Something must have distracted me at that point.