Mel Spectrograms Explained Easily

92,377 views

Valerio Velardo - The Sound of AI

Mel spectrograms are often the feature of choice to train Deep Learning Audio algorithms. In this video, you can learn what Mel spectrograms are, how they differ from “vanilla” spectrograms, and their applications in AI audio. To explain Mel spectrograms, I also discuss the Mel scale and Mel filter banks.
Slides:
github.com/musikalkemist/Audi...
Join The Sound Of AI Slack community:
valeriovelardo.com/the-sound-...
Interested in hiring me as a consultant/freelancer?
valeriovelardo.com/
Follow Valerio on Facebook:
/ thesoundofai
Connect with Valerio on Linkedin:
/ valeriovelardo
Follow Valerio on Twitter:
/ musikalkemist

Comments: 127
@zhouzhou3785 3 years ago
Thank god for your videos: they make my learning curve in speech processing much flatter, just like the Mel scale does.
@ValerioVelardoTheSoundofAI 3 years ago
It's nice to be your Mel scale :D
@jennifer6278 3 years ago
I was struggling so much trying to understand this for my speech recognition class, and I can't believe I understood everything in only 30 minutes! Thank you so much! :) This is incredibly well explained. Now on to MFCCs...
@aoliveira_ 1 year ago
Don't forget that you already had some prior knowledge when you came here. I also find these videos very good at explaining things I've struggled to understand elsewhere, but I didn't begin here. Most likely you are complementing these explanations with your previous knowledge.
@aussieronnied 3 years ago
Thanks Valerio! The triangular filter bank visualisation helped me connect the dots in understanding what is happening behind the scenes. Keep up the great work :)
@ValerioVelardoTheSoundofAI 3 years ago
Nice to hear that, Ronald!
@romainpattyn4528 3 years ago
Really nice video, thank you; I like the way you explain things. I just wanted to mention that there is an error at 13:47: in the formula to go from Hz to Mel, the frequency should be divided by 700, not by 500. 😉
@ValerioVelardoTheSoundofAI 3 years ago
Thanks a lot! Yep, that is a mistake. Thank you for pointing that out :)
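For reference, here is the Hz-to-mel conversion with the corrected constant (700, as confirmed in the reply above) together with its inverse. This is a minimal sketch of the common 2595·log10(1 + f/700) form; the function names are my own, not taken from the video's slides.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel: f = 700 * (10 ** (m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000.0), 1))             # about 1000 mel
print(round(mel_to_hz(hz_to_mel(440.0)), 6))   # round trip back to 440 Hz
```

Note that the inverse only round-trips correctly when both directions use the same constant, which is how the 500/700 mismatch on the slide can be spotted.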
@magnuspierrau2466 2 years ago
This was just awesome! Thank you so much for explaining this concept so clearly, intuitively and passionately! Great stuff! :)
@ash3844 2 years ago
Amazing!!! Loving the whole series of your videos. Thanks a ton!!!
@oscarwjy5084 3 years ago
Man, you really helped me a lot with my thesis on auditory filter banks.
@yuu_808 1 month ago
What a good explanation. It helped me understand mel spectrograms. Thank you so much.
@kirdiekirdie 1 month ago
Fantastic explanation! I needed this as a prerequisite to understand the OpenAI Whisper paper.
@erkangjing2124 3 years ago
Thank you for sharing. It's really useful for my learning of audio signal processing. Other things, such as mel bands, mel filter banks, frequency resolution, and the frequency range that can be perceived by human beings, are sometimes hard to distinguish and pin down. I hope I can find the answers in the discussion board or in your other material. Finally, many thanks for sharing.
@Mattews1119 3 years ago
Thank you Valerio for the amazing content! I'm really grateful for the time and work you're putting into these videos. The way you teach is very clear and simple; I like that a lot :D Also, if you don't mind, I have a question. Would extracting frequency-domain features (spectral centroid, rolloff, ...) from a mel spectrogram, instead of a regular spectrogram, be more beneficial for an MIR application?
@ValerioVelardoTheSoundofAI 3 years ago
Thank you Mateus :)
@nedzadhadziosmanovic3785 3 years ago
In this video, and in the next one ("Extracting Mel Spectrograms with Python"), you explain what a mel band, the mel scale, a mel filter bank, etc. are, but in my opinion a single step is missing for understanding what is really done when mel filter banks are used to construct a mel spectrogram. The process you describe is:
1. Find the smallest and largest frequency in Hz, which we get from the output of the STFT.
2. Convert these two values from Hz to the mel scale.
3. Choose the number of mel bands we want to use.
4. According to the chosen number of mel bands, construct a mel filter bank.
Now comes the part that is not clear to me: applying the mel filter bank to the outputs of the STFT to get some other kind of output, which is then used to construct the mel spectrogram. Let's go back and look at a single output of the STFT (equivalent to performing a DFT on one frame of the audio wave). As a result we get a set of complex numbers, and by taking their magnitudes we can construct an amplitude-vs-frequency graph (also called a "frequency-domain graph") by plotting each magnitude as the amplitude for a certain frequency. In other words, each magnitude (one per complex number) determines the height of one bin in the amplitude-vs-frequency graph.
Now we have this single amplitude-vs-frequency graph, and we want to combine it with the mel filter bank to produce some kind of output. My first question: how do we apply a mel filter bank to a single output of the STFT (i.e., to one amplitude-vs-frequency graph)? I know it is basically a multiplication of two vectors, but how would you represent this visually, using a mel filter bank and a single amplitude-vs-frequency graph? Second, what does this output represent: the amplitude for a single mel band?
Lastly, I think it would be much clearer to use mel bands on the y-axis with the mel as the unit (though I don't know whether that would be correct). In my opinion, putting frequency in Hz on the y-axis of a mel spectrogram is misleading, and it makes me think I didn't understand anything. Would you be so kind as to make a single graph showing the output of combining one amplitude-vs-frequency graph (from the STFT) with the mel filter bank, also expressed visually (I suppose, though I'm not sure, it would again be an amplitude-vs-frequency graph, but with mel frequencies on the x-axis)? I think it could help both me and a lot of your viewers.
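One way to picture the step asked about: each row of a mel filter bank is a triangle of weights over the linear-frequency bins. Multiply one frame's power spectrum by that triangle bin by bin and add everything up, and you get one number per mel band for that frame. The sketch below uses a tiny hand-made filter bank (3 bands over 9 bins) and a made-up spectrum purely for illustration; in a real filter bank the triangle edges sit at mel-spaced frequencies, and the names here are hypothetical.

```python
def apply_filter_bank(filter_bank, power_spectrum):
    """Return one energy per band: the weighted sum of the spectrum under each filter."""
    if any(len(row) != len(power_spectrum) for row in filter_bank):
        raise ValueError("each filter must have one weight per frequency bin")
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in filter_bank]

# Toy filter bank: 3 overlapping triangles over 9 linear-frequency bins.
# Real mel filters get wider toward high frequencies; these widths are illustrative only.
toy_filter_bank = [
    [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0],
]

# Made-up power spectrum of a single STFT frame (|X[k]|**2 for k = 0..8).
frame_power = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 0.0, 3.0, 0.0]

mel_frame = apply_filter_bank(toy_filter_bank, frame_power)
print(mel_frame)  # [5.5, 1.0, 1.5]
```

Repeating this for every frame (column) of the STFT turns a spectrogram with 9 rows into one with 3 rows, and each output value is the weighted energy that falls inside one band for that frame.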
@nezardasan5015 3 years ago
DANKE (thanks) Valerio, always shining
@ValerioVelardoTheSoundofAI 3 years ago
Thank you Nezar!
@sailfromsurigao 8 months ago
I greatly appreciate the content you've been sharing on audio processing for machine learning; it's incredibly insightful. I am particularly interested in the intersection of audio and image data. Would it be possible to discuss methods for transforming an image into a Mel spectrogram or a standard spectrogram?
@canernm 2 years ago
Hi Valerio, thanks for the videos. I have one question: in the previous video of the playlist, we took a vanilla spectrogram and transformed it into one with both log amplitude and log frequency. Is the only difference between the Mel spectrogram and that transformed one that the latter uses a simple log2 scale?
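On the log2 question: the Mel scale in the form m = 2595·log10(1 + f/700) is close to linear well below 700 Hz and close to logarithmic well above it, whereas a log2 axis gives every octave the same size. (A mel spectrogram also pools frequency bins into bands, which a log-frequency plot does not.) Measuring how many mels one octave spans at different starting points shows the difference; the chosen frequencies are arbitrary.

```python
import math

def hz_to_mel(f):
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

starts = (50.0, 100.0, 500.0, 1000.0, 4000.0, 8000.0)
# Mel distance covered by one octave (a doubling of frequency) from each start.
steps = [hz_to_mel(2 * f) - hz_to_mel(f) for f in starts]
for f, step in zip(starts, steps):
    print(f"{f:6.0f} -> {2 * f:6.0f} Hz: {step:6.1f} mel")

# On a pure log2 axis every octave would be equal; the mel step only approaches
# 2595 * log10(2), about 781 mel, at frequencies far above 700 Hz.
print(round(2595 * math.log10(2), 1))
```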
@user-up3gx6nf5b 10 months ago
At 2:20, you mention that the higher frequencies sound similar, but I hear the opposite: I can't distinguish the lower frequencies, but I can distinguish the higher ones. Edit: I had to wear headphones to hear the difference x_x
@luandesouzasilva565 2 years ago
Thank you so much for these videos!
@burak4799 1 year ago
You are a lifesaver! Thank you very much for the detailed lecture :)
@alfredoalarconyanez4896 2 years ago
Thank you Valerio for this super nice video
@minired4611 2 years ago
Thanks for your clear explanation. It helped me a lot.
@mukundsrinivas8426 2 years ago
Amazing series of videos. Did you cover how to deal with audio of varying lengths in any video?
@DavidKalinex 3 years ago
Very useful video! No doubt I will be revisiting it for the rest of the year to finish my thesis.
@ValerioVelardoTheSoundofAI 3 years ago
Thank you David :)
@arvindramanathan329 3 years ago
Clear and intuitive explanation, thanks!
@ValerioVelardoTheSoundofAI 3 years ago
Thanks!
@preethamgali3023 3 years ago
Great explanation. 🔥🔥
@StefaanHimpe 3 years ago
8:15 is it 500 or 700?
@ValerioVelardoTheSoundofAI 3 years ago
Great catch Stefan! It's supposed to be '700' not '500'. Thank you for pointing this out!
@kazmzengin5176 3 years ago
@ValerioVelardoTheSoundofAI Hi Valerio, I would have asked the same question if I hadn't read your correction. Maybe you should add the correction to the video too. Thank you very much for your video series.
@damdidum2601 2 years ago
Excellent video, you're really good at explaining this stuff!
@Beatitat 11 months ago
Could we use these features to identify keys and chords? I'm watching your videos so I can learn how to make a simple program that detects the chord progressions of songs. Thanks for the videos!
@kenand330 2 years ago
Sir, there is something I don't understand here. We do not perceive the pitch difference between the first two notes you play, but we can perceive the pitch difference between the second pair of notes. Shouldn't it be the other way around? Am I the only one hearing this?
@qin7280 3 years ago
Hi Valerio, thanks so much for your effort in making these videos! I keep learning by watching all of them. May I ask a simple question about Mel spectrograms? Are they also useful if I want to detect the sound of a heartbeat? That's actually what I'm working on right now, but I'm a total beginner. I'd really appreciate it if you could share your ideas or any other good material on heartbeat detection!
@ValerioVelardoTheSoundofAI 3 years ago
Yes, Mel spectrograms (usually!) work well for most audio classification problems.
@ashinkajay 1 year ago
Thank you so much!
@muntazirmehdi503 3 years ago
You mentioned that for the piano we can use 40 mel bands because the notes are similar, but if we are working on speech data with the voices of many different people, how can we determine the number of mel bands in that case? TIA
@lenam317 1 month ago
Thanks for the great video. I am also trying to implement a kind of ASR for my project, but I can't find any C/C++ libraries that compute MFCC features from a live audio source. It'd be great if you could give me some pointers here.
@antonnaumov4889 3 years ago
Hi, Valerio! Thanks a lot for your videos! Can you please explain why we still use Hz units on the mel spectrogram (at 26:47)?
@ValerioVelardoTheSoundofAI 3 years ago
That's just a convention to indicate which frequencies the different Mel bands map to.
@manjulakumari953 2 years ago
Great video. A must-watch.
@Deathlydave 2 years ago
Great video and great series. I really learned a lot from these videos. One thing I'm a little unclear about is why the shape of the mel filter bank is (# bands, frame size / 2 + 1). Are the values of the mel filter bank simply the weights of the triangular filters? If so, since the triangular filters cover an increasing range of frequencies in Hz, how do we keep the fixed size of frame size / 2 + 1?
@ehtashamulhaque5002 1 year ago
Edit: Okay, I also had this confusion, but remember we are doing an STFT: our frame_size dictates how many frequency bins the spectrogram has. It is easy to get confused when there is so much to keep track of.
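A sketch that makes the shape question concrete: build the filter bank as a matrix with one row per mel band and one column per STFT bin. Edge frequencies are spaced evenly in mel and converted back to Hz, and each triangle ramps linearly in Hz between them. The function name `mel_filter_bank` is hypothetical; `librosa.filters.mel` returns an array of the same (n_mels, 1 + n_fft // 2) shape, but by default it also applies an area normalisation and uses the Slaney mel formula rather than the 2595/700 one used here.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels, f_min=0.0, f_max=None):
    """Return an n_mels x (n_fft // 2 + 1) list of triangular weights (peak = 1)."""
    f_max = sr / 2.0 if f_max is None else f_max
    n_bins = n_fft // 2 + 1
    # Centre frequency of every linear STFT bin, in Hz.
    bin_freqs = [k * sr / n_fft for k in range(n_bins)]
    # n_mels + 2 edge points, equally spaced on the mel scale, converted back to Hz.
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    edges = [mel_to_hz(lo + i * (hi - lo) / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_freqs:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))  # 0 outside the triangle
        bank.append(row)
    return bank

fb = mel_filter_bank(sr=22050, n_fft=2048, n_mels=10)
print(len(fb), len(fb[0]))  # 10 1025
```

Every row has one weight per STFT bin, however wide its triangle is; bins outside the triangle simply get weight 0. That is how the triangles can widen in Hz while the row length stays fixed at frame size / 2 + 1.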
@superhorstful 3 years ago
Isn't there an error in the formula for the mel frequency? I mean, it should be f/700 and not f/500?
@tiagobeltraolacerda5034 3 years ago
I noticed that too. Using 700, 1 kHz maps to 1000 mel, but using 500, 1 kHz maps to 1238.1 mel. I didn't understand why.
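The numbers in the last comment are easy to reproduce. The 2595 factor is chosen together with 700 so that 1000 Hz lands on (almost exactly) 1000 mel, since 1000 / log10(1 + 1000/700) ≈ 2595. Keeping 2595 but swapping in 500 breaks that anchor. A quick check, where the `break_hz` parameter name is my own:

```python
import math

def hz_to_mel(f_hz, break_hz=700.0):
    """m = 2595 * log10(1 + f / break_hz); 700 is the usual constant, 500 the slide typo."""
    return 2595.0 * math.log10(1.0 + f_hz / break_hz)

for constant in (700.0, 500.0):
    print(constant, round(hz_to_mel(1000.0, constant), 1))
# 700.0 1000.0
# 500.0 1238.1
```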
@andres-ab 3 years ago
I have one question. Given that we want the NN to learn or catch a pattern the human ear may not recognize (e.g., classifying coughs by disease, or positive/negative cases of one disease), why feed the NN a spectrogram with "humanly perceived coherence"? Could we skip the frequency and amplitude corrections? Does it make sense to do so? Thanks a lot. I really love this series.
@avidreader100 3 years ago
I guess any number of features can be defined based on our objective and current insight. Mel is one such feature, based on human perception, and it can be a great fit for applications where human perception is relevant. There is no compulsion to use it for classifying coughs; I would imagine a differently defined scale could be used just as well.
@bashhad2633 2 years ago
This is a great video
@Underscore_1234 23 days ago
Hi, nice stuff (I didn't know anything about mels), but I wonder: I guess you apply triangular filters in the mel domain. If so, the filter is not triangular in the (linear) frequency domain, right? I believe the shape shouldn't be a triangle anymore in the linear frequency domain (in other words, you apply the mel transformation before applying a filter, right?)
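In the construction many libraries use (librosa's `librosa.filters.mel`, for example, as far as I can tell from its source), the edge frequencies are placed evenly on the mel scale, but the weights ramp linearly in Hz between those edges. So each filter is a triangle in Hz, just a lopsided one whose falling side is wider than its rising side; it is on a mel axis that the shape stops being exactly triangular. A small check of the lopsidedness, assuming 6 filters between 0 and 8 kHz:

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Edges of 6 filters spaced evenly in mel between 0 Hz and 8000 Hz (8 edge points).
n_mels = 6
top = hz_to_mel(8000.0)
edges_hz = [mel_to_hz(i * top / (n_mels + 1)) for i in range(n_mels + 2)]

for m in range(n_mels):
    left, center, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
    # Weights rise linearly from left to center and fall linearly from center to right, in Hz.
    print(f"filter {m}: rises over {center - left:7.1f} Hz, falls over {right - center:7.1f} Hz")
```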
@mahathibodela 7 months ago
As usual, a really informative, easy-to-understand video. But I have a doubt: the spectrogram you showed in the last video had a log frequency axis, and this mel spectrogram has the same. Why can't we just do it the way you showed in the last video?
@Sam-jk5dw 3 years ago
I wish there were a frequency-conversion example for the Mel filter bank. Just one example where you take a frequency (one that doesn't have a weight of 0 or 1) and convert it to mels. I felt like I didn't quite understand what you were trying to say.
@shahnaz1981fat 2 years ago
Hi Valerio. Nice explanation of Mel spectrograms, but I couldn't understand the triangular filter banks. They visualise the transformation from Hz to mels, but since the triangles overlap, is it a one-to-many transformation? I am preparing for a PhD interview, and unless this is clear to me I can't be confident. Please clarify…
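On the "one-to-many" question: yes. Because the triangles overlap, a single linear frequency usually feeds two neighbouring mel bands, with weights between 0 and 1. With unnormalised triangles (peak 1, each ending where the next one peaks) those two weights sum to 1 for any frequency between the first and last peak. Going the other way, each band is a weighted sum of many bins. This also serves as a worked example of the kind asked for above: take one frequency and see which bands it contributes to. The sketch assumes 8 bands up to 8 kHz, and the function name is hypothetical.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def weights_for_frequency(f_hz, n_mels=8, f_max=8000.0):
    """Weight that each triangular filter assigns to a single frequency f_hz."""
    top = hz_to_mel(f_max)
    edges = [mel_to_hz(i * top / (n_mels + 1)) for i in range(n_mels + 2)]
    weights = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (f_hz - left) / (center - left)
        falling = (right - f_hz) / (right - center)
        weights.append(max(0.0, min(rising, falling)))
    return weights

w = weights_for_frequency(1000.0)
print([round(x, 3) for x in w])  # two neighbouring bands share this frequency
```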
@tetlleyplus 5 months ago
Is filtering with the mel banks just an (algebraic) multiplication because convolution in the time domain is equivalent to multiplication in the frequency domain?
@andreeamadalina8509 3 years ago
Is it just me, or do I hear only one sound for the first pair and two sounds for the second? Shouldn't it have been the other way around? Lol
@ebrukeklek3237 3 years ago
Incredibly good work, sir. Sometimes it was hard to understand you because you talk really fast and with an accent 🤣🙈 but your devotion is fantastic ❤️
@ValerioVelardoTheSoundofAI 3 years ago
Thanks!
@markusbuchholz3518 3 years ago
Perfect! Thanks Valerio for this interesting video. I wouldn't be myself if I didn't ask... There is a "long pipeline" in signal processing for deep learning. We "lose" information during sampling, quantisation, the STFT, and now the triangular filters. Afterwards we perform convolutions, and again some important information is lost. Do you think this process is "smart" enough and energy efficient? I assume the question relates directly to how we want to apply deep learning, i.e., what we want to do with the signals: classification, generation, filtering, prediction and so on. Great channel and community!
@ValerioVelardoTheSoundofAI 3 years ago
You're spot on! The audio pre-processing pipeline can be quite convoluted. That's why some researchers are experimenting with raw audio signals. The problem is that audio is highly dimensional. The preprocessing steps we usually take with spectrograms trade "perfect" information for lower data dimensionality.
@markusbuchholz3518 3 years ago
@ValerioVelardoTheSoundofAI Thanks for the feedback and clarification. Anyway, to improve something it's great to be familiar with the principles. Thanks!
@armanz.9182 1 year ago
How well is rhythm represented in mel spectrograms? I imagine 'pure' rhythm information is stored in the low frequencies, but these are compromised in these spectrograms, right? My idea was that rhythm information might be found between 0.55 Hz (33 bpm, the lowest perceivable tempo) and 20 Hz (the lowest perceivable tone). I have no idea how valid this is, though. I would love to hear if anyone knows a valid way to analyse just rhythm, thanks!
@user-sx4ew3sm5u 2 years ago
Thank you for the excellent explanation. One quick question: is the mel spectrogram always good for deep learning? I mean, regardless of the sound classes (speech, ambient sound, ...), is a mel spectrogram always better than a plain spectrogram?
@ValerioVelardoTheSoundofAI 2 years ago
That depends on the particular problem. For that reason, it's always advisable to try out different audio representations.
@rprantoine 2 years ago
Hi Valerio, first of all, thank you for your content! One thing I struggle to understand, though, is the need for mel bands and then the use of filters. Intuitively, to convert frequencies to mels, I would have just applied the Mel = f(frequency) formula to my discrete frequency vector and used the resulting discrete mel vector as my y-axis. Why is that not correct? Why do we need bands? Thanks in advance, Antoine
@antonselitskii8351 2 years ago
Don't forget, we work with discretised data. Notice that the number of mels (0, m_1, ..., m_63, 64 in total) is smaller than the number of frequencies (0, f_1, ..., f_512, 513 = 1024/2 + 1 in total). The intervals [0, f_1), [f_1, f_2), ..., [f_511, f_512), ..., [f_1023, f_1024) are called linear frequency bins; each interval is associated with its left boundary. Because the DFT of a real signal is symmetric, we use only half of them: 0, f_1, ..., f_512. We want mel frequency bins [0, m_1), [m_1, m_2), ..., [m_63, m_64). Obviously, several linear frequency bins will collapse into one mel bin; that is why we need to weight and sum them with filters (a matrix multiplication). In TorchAudio, this is done with a 513x64 matrix (frequency bins x mel bands). Use ms = torchaudio.transforms.MelScale(n_mels=64, sample_rate=sr, n_stft=1024//2+1); the matrix is stored in ms.fb.
@Waffano 1 year ago
@antonselitskii8351 Great answer. It made me wonder: why don't we have # mel frequency bins = # frequency bins? Then we could just apply the mel function to all the frequency bins, like Antoine suggests, right?
@antonselitskii8351 1 year ago
@Waffano You can think of this as dimensionality reduction: you have a vector f (say, of size 1024), a vector m (say, 80 mels), and a transformation matrix T of size 80x1024. Then m = Tf. Yes, it transforms all linear frequencies. We can do the inverse transformation, but it will not be exact, because we go from a vector of size 80 to a vector of size 1024.
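A rough way to see both answers: if you apply the mel formula directly to the bin frequencies, as suggested at the top of this thread, you still have 513 points; they are only relabelled, and they end up crowded together at high mels. Pooling them into bands is what reduces 513 values to 64. The sketch assumes sr = 16000 and n_fft = 1024 (my choices, not from the video) and uses plain rectangular grouping only to count bins; real mel filters use overlapping triangles.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

sr, n_fft, n_mels = 16000, 1024, 64
bin_hz = [k * sr / n_fft for k in range(n_fft // 2 + 1)]  # 513 linear bins
bin_mel = [hz_to_mel(f) for f in bin_hz]                   # same 513 points, relabelled in mel

# Mel distance between neighbouring bins: large at low frequencies, small at high ones.
low_step = bin_mel[1] - bin_mel[0]
high_step = bin_mel[-1] - bin_mel[-2]
print(f"mel step near 0 Hz: {low_step:.2f}, near {sr // 2} Hz: {high_step:.2f}")

# Group bins into n_mels equal-width mel intervals to count how many bins each would pool.
top = bin_mel[-1]
counts = [0] * n_mels
for m in bin_mel:
    band = min(int(m / top * n_mels), n_mels - 1)
    counts[band] += 1
print("bins in first band:", counts[0], "bins in last band:", counts[-1])
```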
@zzhou4621 1 year ago
Oh, why do we need the triangular filters? It seems we could also get a Mel spectrogram by applying the formula directly. Does anybody know?
@SonGoku-rl9qf 4 months ago
At 27:40 the Mel spectrogram has Hz on its axis. I thought it should be Mel?
@matthewsmalatji5994 3 years ago
Hey man, I love the series. I need some help. I want to obtain AUDIO FRAMES and generate SPECTROGRAMS for each frame... so I can feed the spectrograms to a CNN for music transcription. Please help. I can generate spectrograms using the VQT; the issue comes with generating frames and a spectrogram for each frame.
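Framing itself is just slicing the signal with a hop. Each frame would then be windowed and passed through whatever transform you use to get one spectrogram column. The sketch below is a minimal version with hypothetical names and no padding; library STFTs usually also pad ("center") the signal, so their frame counts can differ slightly.

```python
def frame_signal(samples, frame_size, hop_length):
    """Slice a 1-D signal into overlapping frames (no padding; a trailing partial frame is dropped)."""
    if len(samples) < frame_size:
        return []
    n_frames = 1 + (len(samples) - frame_size) // hop_length
    return [samples[i * hop_length: i * hop_length + frame_size] for i in range(n_frames)]

signal = list(range(10))
frames = frame_signal(signal, frame_size=4, hop_length=2)
print(frames)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With frame_size > hop_length the frames overlap, which is the usual choice for spectrograms.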
@Saitomar 2 years ago
Hi Valerio. How is a mel spectrogram better than a vanilla spectrogram for deep learning? I understand that it's better in terms of how we perceive audio as humans. But in deep learning, models pick up the features that are most relevant to them, like how for images we just provide the image as a 3D array and the model identifies the underlying pattern. Is there any paper comparing mel spectrograms and vanilla spectrograms for deep learning? Thank you for the video
@ValerioVelardoTheSoundofAI 2 years ago
In general, people tend to use MelSpecs over vanilla specs. I'm not aware of any paper that compares the two across the board. The performance of the two representations depends on the task. The empirical approach is the best way to check which works best for you: try both representations for your use case on the same architecture.
@Saitomar 2 years ago
@ValerioVelardoTheSoundofAI How does it depend on the given task? I assumed the performance of time-domain representations in DL to be task-agnostic.
@ValerioVelardoTheSoundofAI 2 years ago
@Saitomar Unfortunately, it's not agnostic; it depends on the task.
@Saitomar 2 years ago
Thanks for the reply. I am working with a model that was used for image classification and trying to use it for audio classification, which is why I was curious. Hopefully the results will be good.
@AbhishekMishra-fr7po 3 years ago
At 18:55, I think the x-axis frequency is in kHz, not Hz, because 1000 kHz = 1000 mel. I'm not sure though, but I think it is.
@razvandumitrugrecea9388 3 years ago
Nice one :) somebody who shares :)
@jamalseyedmohammadi6681 3 years ago
Hi. Great video. I have one question: what is the difference between a log-frequency spectrogram and a mel spectrogram? Thanks
@ValerioVelardoTheSoundofAI 3 years ago
I suggest you check out the previous videos on the STFT, where I introduce the concept of the (log) spectrogram. In a nutshell, the Mel spectrogram is a normal spectrogram to which we apply Mel filter banks.
@pranavsingh1081 3 years ago
@ValerioVelardoTheSoundofAI It is not clear. Please explain the difference between a log spectrogram and a mel spectrogram.
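To make the distinction concrete: a log spectrogram keeps every linear frequency bin (for example 1025 rows for a 2048-sample frame) and only puts the values on a logarithmic (dB) scale, and a log-frequency plot additionally stretches the axis when drawing. A mel spectrogram first pools those bins into a few dozen mel bands with the filter bank, and is then usually converted to dB the same way. The dB step on its own might look like this; the `ref` and `amin` arguments mirror the idea behind librosa's `power_to_db`, but this is a simplified stand-in, not that function.

```python
import math

def power_to_db(power, ref=1.0, amin=1e-10):
    """Convert power values to decibels: 10 * log10(max(p, amin) / ref)."""
    return [10.0 * math.log10(max(p, amin) / ref) for p in power]

print([round(x, 6) for x in power_to_db([1.0, 0.1, 0.01, 0.0])])  # [0.0, -10.0, -20.0, -100.0]
```

The `amin` floor keeps silent bins (power 0) from producing log(0).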
@melverys 10 months ago
This is how I found your video: I recently started learning Japanese and thought it would be cool to see how my name is spelled in Japanese. It seems Mel translates to Meru, and the definition of my name in Japanese is a logarithmic transformation of a signal's frequency. Kind of an interesting rabbit hole to go down, since I'm a math geek and a musician too lol
@ValerioVelardoTheSoundofAI 10 months ago
Fantastic story, thank you for sharing :)
@mangomonkey7830 3 years ago
Hi, what if my audio files are an hour long? When I use librosa to load them, I only get the first 3 minutes. What's the standard practice for generating mel spectrograms of hour-long recordings?
@ValerioVelardoTheSoundofAI 3 years ago
I would suggest segmenting the audio files, if possible.
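Segmenting can be as simple as cutting the sample array into fixed-length clips and zero-padding the last one. That also gives equal-length inputs when recordings vary in length. For very long files, `librosa.load` accepts `offset` and `duration` arguments (in seconds), so segments can also be read one at a time. The sketch below works on a plain list of samples; the function name and the toy sample rate are assumptions for illustration.

```python
def segment_signal(samples, sr, segment_seconds, pad_last=True):
    """Split a 1-D list of samples into consecutive fixed-length segments."""
    seg_len = int(round(segment_seconds * sr))
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    segments = []
    for start in range(0, len(samples), seg_len):
        chunk = list(samples[start:start + seg_len])
        if len(chunk) < seg_len:
            if not pad_last:
                break  # drop the final, shorter piece
            chunk += [0.0] * (seg_len - len(chunk))  # zero-pad the final, shorter piece
        segments.append(chunk)
    return segments

sr = 100                    # toy sample rate so the numbers stay small
signal = [0.1] * (sr * 25)  # 25 s of dummy audio
clips = segment_signal(signal, sr, segment_seconds=10)
print(len(clips), [len(c) for c in clips])  # 3 [1000, 1000, 1000]
```

Each clip can then be turned into its own mel spectrogram of identical shape.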
@aayushchheda8689 1 year ago
I don't really understand the psychoacoustic experiment. Can you explain it here? I do not perceive the pitch difference between the first two notes you play, but I can perceive the pitch difference between the second pair of notes. So shouldn't it be the other way around, or am I getting something wrong?
@jaydeepchauhan2737 3 years ago
What is the difference between the filter bank feature and the Mel spectrogram feature? Are they the same?
@kirdiekirdie 1 month ago
I tried to listen to the C2 note several times until I figured out that my Lenovo laptop speakers apparently don't go that low, but my cheap headphones do :-)
@disturbedeyebrow5977 3 years ago
Thanks dude. You didn't mention the optimal number of MFCCs to use for image processing. In one of your previous videos you said that 13 MFCCs is the best choice for audio processing. Why 13, and how do we determine the optimal number?
@ValerioVelardoTheSoundofAI 3 years ago
I'll post a couple of videos (theory + implementation) about MFCCs in the coming weeks (stay tuned!). The short answer to your question is that 13 is a number traditionally used in earlier AI music research. Sometimes it goes up to 48 or even 90. As I mentioned for the number of Mel bands in this video, these numbers are somewhat arbitrary and should be treated as hyperparameters to optimise.
@disturbedeyebrow5977 3 years ago
Thank you for answering so fast! I'll be patient for the upcoming videos!
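For context on where the "13" lives: in the usual MFCC recipe you take one frame of log mel energies, apply a type-II DCT, and keep the first n coefficients; that n is the hyperparameter discussed above. A minimal sketch of just that last step, with an orthonormal DCT-II written out by hand (SciPy and similar libraries provide this ready-made); the 40-band input values are made up and the function names are my own.

```python
import math

def dct_ii_ortho(x):
    """Orthonormal DCT-II of a list of floats."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        total = sum(x[n] * math.cos(math.pi / n_points * (n + 0.5) * k) for n in range(n_points))
        scale = math.sqrt(1.0 / n_points) if k == 0 else math.sqrt(2.0 / n_points)
        out.append(scale * total)
    return out

def mfcc_from_log_mel(log_mel_frame, n_mfcc=13):
    """Keep the first n_mfcc DCT coefficients of one frame of log mel energies."""
    if n_mfcc > len(log_mel_frame):
        raise ValueError("n_mfcc cannot exceed the number of mel bands")
    return dct_ii_ortho(log_mel_frame)[:n_mfcc]

log_mel = [math.log(1.0 + b) for b in range(40)]  # made-up 40-band frame
coeffs = mfcc_from_log_mel(log_mel, n_mfcc=13)
print(len(coeffs))  # 13
```

Changing `n_mfcc` (13, 20, 40, ...) only changes how many of these coefficients are kept, which is why it can be tuned like any other hyperparameter.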
@iioiggtrt9085 3 years ago
How do I save it as a CSV file for ML?
@sarathanurahiyarehewage4642 1 year ago
I have a question. If m = 2595·log(1 + f/500), then f should be 500(10^(m/2595) − 1). Where does the 700 in f = 700(10^(m/2595) − 1) come from? Is it a mistake? In your video, 700 appears in two places. Or am I missing something?
@Waffano 1 year ago
Valerio wrote in a comment above that the first formula had a typo. It should be 700 instead of 500.
@maddai1764 3 years ago
Me again: why not just use the frequency-to-mel equation to convert Hz to mel, just as you did in the previous videos to convert Hz to log frequency? Why go through all this hassle? I know there must be a reason, but I don't grasp it.
@zzhou4621 1 year ago
Me toooo!
@henoknigatu7121 9 months ago
Can you show us how to convert a mel spectrogram back to audio using Python, like a vocoder?
@andreiplatonov7689 2 years ago
Thank you for your videos! However, if you plug f = 1000 into the frequency-to-mel formula, you do not get 1000 mel...
@pjmmccann 1 year ago
* It should be 700, not 500, in the formula (see the inverse function, for example)
@zzhou4621 1 year ago
Formula: mel = 1/log(2) * (log(1 + (Hz/1000))) * 1000 [Reference: Traunmueller, H. (1990) "Analytical expressions for the tonotopic sensory scale" J. Acoust. Soc. Am. 88: 97-100]
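Several mel formulas are in circulation. The one quoted in the last comment equals 1000·log2(1 + f/1000). Like the corrected 2595·log10(1 + f/700) form, it maps 1000 Hz to 1000 mel, but the two drift apart at higher frequencies. A quick comparison (the function names are mine):

```python
import math

def mel_2595(f_hz):
    """Common form: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_1000_log2(f_hz):
    """Form quoted in the comment: (1000 / log 2) * log(1 + f / 1000) = 1000 * log2(1 + f / 1000)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

for f in (250.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:6.0f} Hz -> {mel_2595(f):7.1f} vs {mel_1000_log2(f):7.1f} mel")
```

Which variant a library uses matters when comparing features across tools, since the band edges shift.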
@pranavsingh1081 3 years ago
Could you please tell us the difference between a log spectrogram and a mel spectrogram?
@chrischang1980 2 years ago
I think the difference is that the mel spectrogram applies the mel filters: the value for a specific mel frequency is a weighted sum of the original frequencies. A log spectrogram only changes the scale from linear to log.
@deepikasingh3122 6 months ago
But what are filter banks?
@uthsingi 4 months ago
I'd like to politely check: at 2:20, it seems like the note played as C2 might actually be C1. I'm not very familiar with musical notes, but the C2 played in your video sounds lower.
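To check which octave a note belongs to, it helps to know the expected frequencies. In scientific pitch notation with A4 = 440 Hz and equal temperament (the convention assumed here; the video may label octaves differently), C2 is about 65.4 Hz and C1 about 32.7 Hz. Both are low enough that small laptop speakers may barely reproduce them, as another commenter found. A sketch using MIDI note numbers:

```python
def note_frequency(midi_note, a4_hz=440.0):
    """Equal-temperament frequency of a MIDI note number (A4 = MIDI 69)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)

# Scientific pitch notation: C4 (middle C) is MIDI 60, so C2 is 36 and C1 is 24.
for name, midi in (("C1", 24), ("C2", 36), ("C4", 60)):
    print(name, round(note_frequency(midi), 2))
```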
@user-ul2gm5np3i 3 years ago
Thank you, you're a genius, and everyone can understand the concept of the Mel spectrogram by watching your video. However, it takes too long to understand a single concept, because you seem to repeat certain words or sentences several times and to offer too much extra information from time to time. If you can deal with that, I'm sure you will get way more subscribers. Anyway, thank you so much.
@LewisWolstanholme 3 years ago
Your formula for converting frequency to mel (m = ...) is wrong. Your formula for mel to Hz (f = ...), however, is correct.
@ratfuk9340 1 year ago
Why is f = 700(10^(m/2595) − 1)? Shouldn't it be f = 500(10^(m/2595) − 1) if m = 2595·log(1 + f/500)?
@eurethia233 1 month ago
Very good video, love from China.
@shreyaskulkarni5823 1 year ago
It should actually have been 2052 to get the 263 difference constant, when you showed the graph of mel vs. frequency.
@seohopa 10 months ago
kzbin.info/www/bejne/aXndmIiubs-Xr5o This is about making spectrograms with the ChatGPT interpreter.
@pranavsingh1081 3 years ago
What is this vanilla spectrogram?
@ValerioVelardoTheSoundofAI 3 years ago
It's just the "basic" spectrogram without any manipulation (e.g., applying a log, or transforming amplitude to dB).
@pranavsingh1081 3 years ago
@ValerioVelardoTheSoundofAI Thank you so much
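Concretely, one column of the "vanilla" spectrogram is the magnitude of the DFT of one frame. For a real-valued frame of N samples you keep N / 2 + 1 values, because the remaining bins mirror them. The sketch below is deliberately naive (no window, no FFT) for clarity only; in practice you would use an FFT routine or a library STFT.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT of one real-valued frame; returns |X[k]| for k = 0 .. N//2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):  # real input: the other half mirrors these bins
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(x_k))
    return mags

n = 32
frame = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]  # sine completing 4 cycles
mags = magnitude_spectrum(frame)
print(len(mags), mags.index(max(mags)))  # 17 4
```

Stacking these columns for successive frames gives the vanilla spectrogram; taking dB and applying mel filters are the later manipulations.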
@harshitjuneja9462 10 months ago
If we use a CNN model (let's say), shouldn't it automatically learn any such mathematical transformations?
@berankilic 2 years ago
Watching you is like watching chess videos. And I like chess xd
@ValerioVelardoTheSoundofAI 2 years ago
I love it too!
@oguzynx 2 years ago
What the f are mel bands..... dude, don't confuse us..
@mdevelde 3 years ago
Wrong explanation with many errors. You clearly have no real understanding of what you're talking about. First of all, everybody has known since ancient times that we perceive frequency mostly logarithmically. For instance, octaves, musical intervals, musical instrument tuning, etc. are based on this. So the question is not how the Mel scale (a recent invention) differs from linear frequency, but how it differs from logarithmic frequency. So your whole video is nonsense and fails to explain the actual difference between the Mel scale and the logarithmic scale. And there are many other errors in explaining things, in the choice of filter bank type, etc.
@ValerioVelardoTheSoundofAI 3 years ago
I'll wait for your explanation to learn more.
@mdevelde 3 years ago
@ValerioVelardoTheSoundofAI Too long a list to respond to here. But a simple look at the Mel scale Wikipedia page should inform you. As for musical intervals, they are based on a division of octaves. Octaves are a 2/1 ratio, so 100 Hz - 200 Hz - 400 Hz - 800 Hz - etc. A logarithmic scale. Again, as I already said, one should compare the Mel scale to a logarithmic scale, not to a linear scale. Further, the number of filter bands is not just randomly chosen; there is good reason for it. It has to do with ringing of the filters, or in other words, you cannot zoom in on a narrow frequency band without introducing errors elsewhere, namely in amplitude and time. It always works like this; it is a law of nature and there's no getting around it. And the choice of triangular filters is a particularly poor and naïve one, but understandable, as many examples have been written using them. One more thing about the Mel scale: it's likely not a great model for equidistance hearing. Errors were made in the studies when it was invented over 50 years ago. But again, it's understandable to use it. And apologies for the unfriendly tone of my previous message. I just read it back and could have written it differently. I was a bit tired and grumpy.
@ValerioVelardoTheSoundofAI 3 years ago
@mdevelde I'll avoid commenting on your smug attitude. It speaks volumes by itself. I don't see how the "arguments" you raise clash with the content of the video. What superior power ordered that we should "compare the Mel scale to a logarithmic scale not to a linear scale"? Also, what does this mean? The Mel scale IS a logarithmic scale. Or do you think that applying a few scaling factors to a logarithm (as in the case of the Mel scale) modifies the nature of the logarithm? If you're referring to the difference between the Mel scale and a log2 function, of course I could have shown that. However, people are usually familiar with linear scales, and they probably have an easier time appreciating the difference between a linear scale and the Mel scale than between the latter and a log2 function. BTW, thank you very much for letting me know about the 2/1 octave ratio. In my 25+ years of study in music and my PhD I never encountered this information. Have you thought of publishing this revolutionary result? Oh wait... I mentioned this revolutionary property in a previous video in the series. Your comment regarding the number of filter bands makes little sense in the context of this video. I'm not sure what your background is, but in AI audio we use a wide array of filter bands (from as few as 40 to as many as 128+), depending on what works best for the problem at hand. I've read papers that suggest that errors were made while working on the experiments for the Mel scale. I'm also aware that triangular filters are not ideal. Nonetheless, Mel spectrograms are used in Machine Learning these days and achieve state-of-the-art results in several audio classification problems. This is why I introduced this feature in this series (Audio Signal Processing for ML). I'm not sure if this is clear, but this video approaches the Mel scale from the perspective of machine learning and audio processing, not music cognition.