Shazam Audio Recognition Design Deep Dive with Google SWE! | Systems Design Interview Question 23

Рет қаралды 5,208

Jordan has no life

Күн бұрын

Пікірлер: 22

@Ms452123 2 жыл бұрын

Probably my most favorite design thus far. Was thinking about Shazam yesterday

@almogkurtser4557 2 жыл бұрын

Great stuff! I think it's good you avoided the whole DFT thing that just diverts the discussion into very specialized algorithmic land. As for partitioning, while it's clearly a viable solution, I think this specific use case has two properties that might lend themself well to relying solely on single leader replication instead: 1. Immutable data 2. Writes (insertions of new songs) do not need to be reflected immediately (= we should be okay with gigantic replication lag caused by a batch job) 3. Even if all reads were in the order of 10ms, song identification might still take as much as 10 seconds. Slow DB reads that hit the disk will be negligible I think. So I guess both strategies are viable (we'll need some sort of replication on top of partitioning in any case to have high availability), so maybe the discussion can be diverted to pros and cons in terms of costs, which can lead to another nest of vipers, namely capacity estimation for how many nodes we'll need with each approach and their cost. :(

@jordanhasnolife5163 2 жыл бұрын

You make a good point - I'd imagine the matching phase is a bottleneck and as a result those index reads don't have to be as fast! Though based on another paper I had read it seemed it was best to do everything in memory - in which case I'd think partitioning was necessary.

@almogkurtser4557 2 жыл бұрын

@@jordanhasnolife5163 Could you point me to the paper that mention the significance of keeping everything in memory?

@jordanhasnolife5163 2 жыл бұрын

@@almogkurtser4557 Just faster than on disk. By keeping things in memory you can use a hash table index as opposed to a B-Tree or LSM-tree + SSTables

@almogkurtser4557 2 жыл бұрын

@@jordanhasnolife5163 That's true, B-Tree is how I imagined it working in a replication only setup. I Probably just misunderstood your prev comment thinking you were alluding to additional challenges that would be inherent to storing on disk. The main disadvantage is potentially using more nodes to serve the same amount of request due to higher latency per request with the advantage being simpler routing since we only need to know which nodes exist rather than using consistent hashing. Availability would also be better since every key is contained in every node. I guess it comes down to complexity vs cost. I have no idea if say 5x B-Tree node would be cheaper or more expensive than R*X servers with more memory (R being the replication factor for each partition). We can make the partition solution cheaper ( = more complex) by saying that we don't need RX servers but rather just X, each being a partition follower but keeping that partition as a cold storage and only warming it into memory when the partition leader fails.

@jh0720 2 жыл бұрын

Another banger 🤷‍♂️

@jordanhasnolife5163 2 жыл бұрын

Thanks for all the kind words Jonathan, I appreciate it :)

@theinfamous4231 Жыл бұрын

thank you so much for the Explanation man!!

@jordanhasnolife5163 Жыл бұрын

Np! It's a cool Algo!

@cd-ux9ot 2 жыл бұрын

What about an ML approach? Train an autoencoder model where the input data is all the songs plus random/background noise and the output is the actual song id. Then use the model weights as embeddings and store them in a vector database. During search, create embedding from the query audio, lookup candidates in the vector database using similarity search. Then another model to score the candidates an select the candidate with the highest score.

@jordanhasnolife5163 2 жыл бұрын

This could perhaps work, the biggest problem imo is that the clip of the song can be from any starting point. I think something like an RNN would work best here since they are good for sequences, and you could make each fingerprint an element of the sequence that you're trying to classify.

@cd-ux9ot 2 жыл бұрын

@@jordanhasnolife5163 An RNN is a great choice. Transformers have recently shown great performance with sequence data. Convolution layers can even work with audio data because a spectrogram is image-like.

@jordanhasnolife5163 2 жыл бұрын

@@cd-ux9ot Cool point! I will say I don't know too much about machine learning but I'd be interested to see if anyone has taken an inference based approach for this problem - it may be a bit overkill I guess since shazam was already working pretty well without it :)

@jayantprakash6425 2 жыл бұрын

good explanation

@user-se9zv8hq9r 2 жыл бұрын

based and jewfro-pilled

@buildleadinnovate24 2 жыл бұрын

How singing song app created???

@AyushPanditofficial 2 жыл бұрын

I m from india i love your videos i watched all.your videos we need more videos i m learning many thing from your video implementing in my web app

@Jashan77114 2 жыл бұрын

What is sharding the indexes?

@jordanhasnolife5163 2 жыл бұрын

We use consistent hashing on the 64 bit fingerprint which is comprised of the start note, end note, and time delta. If you mean what is responsible for upholding the consistent hashing, then we could use a coordination service like zookeeper to do that for us.

@UjjwalBhardwaj96 10 ай бұрын

@@jordanhasnolife5163 You have mentioned in the video that we can shard the data such that all the generated hash (fingerprint) with the same anchor point goes in the same node. Can we achieve this using consitent hashing?