Hi Jon. Great presentation. I am absolutely new to machine learning and found your talk really clear and useful. Thanks for sharing.
@Captura22 · 8 months ago
Hi Jon, I am doing a final year undergraduate project on bioacoustics, I am new to signal processing as well as your channel! I was just wondering - do you have a paper covering some of the stuff you've talked about, which I could reference?
@Jononor · 8 months ago
Hi! Yes, this work is mostly in my master's thesis. If you search Google Scholar for "Environmental sound classification on microcontrollers using Convolutional Neural Networks" you should find it. I would give you a link, but YouTube tends to shadowblock messages with links...
@michaelwirtzfeld7847 · 4 years ago
Thank you. A very good presentation. Is the Keras model code you showed (i.e. "block_1", "block_2", etc.) on a couple of your slides available in one of your GitHub repositories?
@Jononor · 4 years ago
Thank you Michael. Yes, all the Keras models I tested in my thesis are in the following repo/folder. The one in question is probably in "strided.py" or "sbcnn.py": github.com/jonnor/ESC-CNN-microcontroller/tree/0d3a1231831d3ee61c22a4f8b461a7511fae3de7/microesc/models
@GadisaGemechu-j2u · 8 months ago
Perfect! Could we exchange ideas on how to prepare a dataset?
@jayshaligram4474 · 4 years ago
Hi... great work! Thank you for uploading this video. If you had the exact frequency-vs-time data for a particular sample in text or CSV format, how could it be used to improve the accuracy of a CNN? Can image data be correlated with the corresponding frequency data to get more accurate predictions?
@jayshaligram4474 · 4 years ago
Also.. is data augmentation (time shift, pitch shift, etc.) manual, or is there an automated process for achieving this?
@Jononor · 4 years ago
Hi Jay. The spectrograms contain basically all the time-versus-frequency data. But if you have some additional information available, there are ways to incorporate it. If the data is always available (both at training time and prediction time), then you can use it as an additional input to the neural network.
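A minimal sketch of such a two-input network using the Keras functional API (all shapes, layer sizes, and names here are hypothetical, just to show the wiring, not the model from the talk):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical shapes: 32 mel bands x 64 frames, plus 4 extra scalar features
spec_in = keras.Input(shape=(32, 64, 1), name="spectrogram")
x = layers.Conv2D(8, 3, activation="relu")(spec_in)
x = layers.GlobalAveragePooling2D()(x)

extra_in = keras.Input(shape=(4,), name="extra_features")

# Concatenate the CNN embedding with the additional features
h = layers.Concatenate()([x, extra_in])
h = layers.Dense(16, activation="relu")(h)
out = layers.Dense(10, activation="softmax")(h)

model = keras.Model(inputs=[spec_in, extra_in], outputs=out)
probs = model.predict([np.zeros((2, 32, 64, 1)), np.zeros((2, 4))])
```

The key point is that both inputs must be provided at prediction time as well, so this only works when the extra data is always available.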
@Jononor · 4 years ago
Data augmentation is basically always automated. Either as a pre-processing batch job, or done on-the-fly while training the neural network. This post shows the code for common audio augmentations: medium.com/@makcedward/data-augmentation-for-audio-76912b01fdf6
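As a small illustration of the on-the-fly style, here is a sketch of one common augmentation (a random circular time shift) in plain NumPy; the function name and parameters are my own, not from the talk:

```python
import numpy as np

def time_shift(samples, max_shift_fraction=0.2, rng=None):
    """Randomly roll the waveform left or right (wrap-around time shift)."""
    rng = rng or np.random.default_rng()
    max_shift = int(len(samples) * max_shift_fraction)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(samples, shift)
```

In an on-the-fly setup, a function like this would be applied to each training example as it is loaded, so every epoch sees slightly different data.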
@Woofawoof_wwooaaf · 1 year ago
Hi, can you please explain how we can convert an MP3 audio file into a WAV file?
@Jononor · 1 year ago
For a single file, use Audacity. For multiple files you can use ffmpeg and a shell script. To do it from Python, use librosa.load and soundfile.write.
@sidalibourenane5377 · 2 years ago
Hey, hope you are doing well! Can you help me? How can we use speech recognition to detect falls in elderly people? And one more question: how can we combine audio with images to implement fall detection? Thank you
@sigitpriyohartanto2129 · 3 years ago
Thank you for the great presentation. I have a question: how can one compare one person's voice with another's?
@Jononor · 3 years ago
Search for "speaker recognition". I recommend looking into pretrained models based on x-vectors or i-vectors.
@sigitpriyohartanto2129 · 3 years ago
@@Jononor ok thanks
@a2sirmotivationdoses782 · 2 years ago
Respected Sir, my project is to remove noise from audio. How can I train an ML model for this, and how should I proceed? Please help me.
@cookingcriss · 4 years ago
Thank you so much for sharing the presentation with us! I'm new to machine learning and I have some questions. Where could I download or use audio datasets for my project? Thank you in advance!
@Jononor · 4 years ago
A good overview of environmental audio datasets can be found at www.cs.tut.fi/~heittolt/datasets
@weirjwerijrweurhuewhr588 · 4 years ago
Interesting talk! In the example you showed, lots of the sounds are quite different from each other, e.g. the children playing, a siren, and a jackhammer. Does it also work for sounds that are very similar? For example different crow calls or different types of chimpanzee sounds?
@Jononor · 4 years ago
Hi Ramon. Yes, the same basic approach can be used in such a case. Whether good results can be achieved depends on how hard the task is and how good the data is.
@girishraghunathan2221 · 4 years ago
Interesting Presentation !
@chacmool2581 · 3 years ago
Great stuff. How's the job market for this type of knowledge and skills? I am an old EE just starting a DS masters and I've turned my attention to audio classification.
@Jononor · 3 years ago
Hi Chac. For audio, image, video and similar processing, the kind of companies that would previously hire for Digital Signal Processing skills are today hiring for Machine Learning. If you have an EE background with embedded systems skills, that is a very good complement for many such companies. At the moment the demand for ML engineers is high - many are trying to build new ML-based products and functionality - and there is a lack of skilled people. So pretty good, I would say - but you need to go for the places that match your skill profile. A master's degree will set you apart from the large number of self-learners, in terms of demonstrated qualifications.
@chacmool2581 · 3 years ago
@@Jononor Thank you very much for that. Much appreciated.
@chacmool2581 · 3 years ago
Jon, hate to bug you again, but I am actually kind of serious about this. My DS program is not geared towards or focused on 'TinyML', so I need to supplement it with other learning. What online program or set of courses would you recommend to get into 'TinyML'?
@Jononor · 3 years ago
@@chacmool2581 There is a TinyML book. I have not read it, but it is probably a good start. The TinyML YouTube channel has many good talks, but they are on bleeding-edge research - not a pedagogical resource. Apart from the usual embedded/DSP topics, the main part of TinyML is computationally efficient and small models, so focus on understanding how to choose and optimize such models. For CNNs my master's thesis has some pointers on that
@Jononor · 3 years ago
@@chacmool2581 Also, do a few practical projects. Get an ESP32 board and build something fun (does not have to be useful)
@sadeghmohammadi5567 · 3 years ago
Thank you very much for your very informative presentation. However, I have a question regarding one of your slides, specifically on aggregation of analysis windows. Could you please explain further (possibly with an example)? For instance, is windows = 6 the number of segments extracted from the audio signal, or the length of a window (6 * sampling_rate)? And bands = 32? Moreover, regarding the base model: is it the model presented on the previous slide (3-layer CNN)? So the logic is that we take the audio signal, convert it into a sequence of windows, pass them through the SB-CNN, propagate over time, compute the average pooling, and use the output of the average pooling in the softmax to make the prediction. Is this logic correct? Thank you in advance for your consideration.
@idrisseahamadiabdallah7669 · 3 years ago
Hello Jon, you did a great presentation. Thanks for sharing. I am working on my master's thesis, specifically on lung sound classification using a CNN. I am using MFCC features and getting about 88% accuracy. Do you think a mel-spectrogram could give higher accuracy than 88%?
@Jononor · 3 years ago
Hi Idrisse! Thank you. Yes, I think that using the mel-spectrogram instead of MFCC might give you a slight increase in performance for your use case; at least it is worth trying out!
@idrisseahamadiabdallah7669 · 3 years ago
@@Jononor thank you
@idrisseahamadiabdallah7669 · 3 years ago
@@Jononor Thanks sir, I would like to ask something, please bear with me. Step 1: original dataset of 177 samples (3 classes, each class has 59 audio files). Because of the small size of the data, I did data augmentation. Step 2: after data augmentation, I extracted MFCC features of the audio files with their respective labels in order to create a useful dataset. Step 3: I split the new dataset into training, validation, and testing sets. Step 4: fed the CNN with the training and validation sets for the training process. Step 5: evaluated the CNN with the testing set; we are able to reach an accuracy around 90-93%. Is it correct (logical) to test the model with the testing data I got in step 3? Or should I split the data into training and testing sets before doing the data augmentation? Doing so, I got an accuracy of around 40-43%. Thanks a lot for replying to me.
@Jononor · 3 years ago
@@idrisseahamadiabdallah7669 The testing set should be kept unmodified. Data augmentation should only be applied to the training data. It sounds like your data augmentation may have introduced bigger changes than planned. Check the statistics of the data; they should still be very similar between the augmented train set and the original train/test sets, otherwise you will get into trouble
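To illustrate the ordering, here is a sketch (with a placeholder noise augmentation and made-up array shapes, matching the 177-sample / 3-class setup described above) of splitting first and then augmenting only the training set:

```python
import numpy as np

def augment(x, rng):
    # Placeholder augmentation: add a little Gaussian noise to the features
    return x + 0.01 * rng.standard_normal(x.shape)

rng = np.random.default_rng(42)
X = rng.standard_normal((177, 40))   # e.g. 177 clips x 40 MFCC features
y = np.repeat([0, 1, 2], 59)

# 1) Split FIRST, on the original samples only
idx = rng.permutation(len(X))
test_idx, train_idx = idx[:30], idx[30:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Augment ONLY the training set; the test set stays untouched
X_aug = np.concatenate([X_train] + [augment(X_train, rng) for _ in range(3)])
y_aug = np.concatenate([y_train] * 4)
```

Because no augmented copy of a test sample ever enters training, the test accuracy remains an honest estimate of generalization.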
@idrisseahamadiabdallah7669 · 3 years ago
@@Jononor Okay, I understood, thanks a lot. One other question: do you think that 177 WAV files may be enough to train a CNN model efficiently?
@xXDarQXx · 3 years ago
I was quite surprised that for classification you didn't feed the feature embeddings of the windows to an RNN and instead just used a post-processing trick. Wouldn't an RNN work better? What about a transformer? Also, I know that mel spectrograms work better than feeding raw audio, but how much better? Is it like +5% accuracy or is it game-changing? nvm 😅 both of these questions were answered at the end. Another question that came to mind though: what about speech recognition models or something similar - are spectrogram-based models still dominating, or is it a different story?
@Jononor · 3 years ago
Temporal aggregation using mean or majority voting is simple and works pretty well. It can be done with an RNN, or AutoPool, or an attention function - and it can increase performance a bit
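A minimal sketch of the two simple aggregation strategies over per-window softmax outputs (the probability values are made up for illustration):

```python
import numpy as np

def aggregate_mean(window_probs):
    """Average per-window class probabilities, then pick the argmax."""
    return int(np.mean(window_probs, axis=0).argmax())

def aggregate_majority(window_probs):
    """Each window votes for its argmax class; the most-voted class wins."""
    votes = np.argmax(window_probs, axis=1)
    return int(np.bincount(votes).argmax())

# 4 analysis windows x 3 classes of hypothetical softmax outputs
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
])
```

Mean aggregation uses the full probability mass per window, so it is usually a bit more robust than majority voting when individual windows are uncertain.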
@Jononor · 3 years ago
Whether mel-spectrogram or raw audio works best depends on the task and dataset. It is much more challenging, and more data-intensive, to make a system that learns from raw audio - but it sometimes performs better once it works. Combining both tends to work best, though it is not always worth the complexity
@xXDarQXx · 3 years ago
@@Jononor Jesus, that was quick XD thank you so much for the reply! I really appreciate it. And that was a great presentation btw, it was very easy to follow. I hope you have a nice day man, cheers :D.
@Jononor · 3 years ago
@@xXDarQXx Thank you :) Happy learning, have a nice day!
@peterm.4026 · 3 years ago
I'm new to machine learning, and I feel like I have watched so many audio machine learning videos; the tips & tricks section at the end of this is the most practical and unique stuff I've seen. Thanks! Does the simple audio recognition tutorial by TensorFlow still exist? I can't seem to find it. Also, in the audio augmentation slide you talk about adding noise to your data for the benefit of the model, but in the Q&A you talk about how de-noising is helpful. Could you clarify the different cases where you use each?
@Jononor · 3 years ago
Hi Peter. The TensorFlow simple audio tutorial still exists, but they keep moving it around and renaming it. Currently it is called "Simple audio recognition: Recognizing keywords" at www.tensorflow.org/tutorials/audio/simple_audio
@Jononor · 3 years ago
Training with noise via data augmentation is almost always beneficial (a possible exception: if one of your classes is very noise-like). Given sufficient data this works well, and it is the simplest solution. However, if one (1) has a small amount of data and (2) there are well-known denoising methods that work well for the case, denoising may be worth a try. An example use case where I have seen a denoising step work well is bird audio spotting in remote monitoring settings (forests etc.) - here it is often very quiet and the noise floor can be significant. It may be that the noise is that of the microphones and electronics themselves, which is near constant and relatively simple to denoise
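As a sketch of the "training with noise" idea, here is one common way to add white noise at a chosen signal-to-noise ratio; the function name and defaults are my own, not from the talk:

```python
import numpy as np

def add_noise_snr(signal, snr_db, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10*log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.standard_normal(len(signal)) * np.sqrt(noise_power)
    return signal + noise
```

During training one would typically draw the SNR randomly per example (say between 10 and 40 dB), so the model sees a range of noise conditions.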
@tranthanh3060 · 4 years ago
I really like your presentation. Thank you very much. Since I'm trying to classify sound for my project now, could I ask you some more questions?
@Jononor · 4 years ago
Just ask here, or create Stack Overflow questions and link them here. Then I can respond :)
@tranthanh3060 · 4 years ago
Could you explain the mel spectrogram in more detail, more mathematically?
@Jononor · 4 years ago
@@tranthanh3060 Here is a good intro: haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
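For readers who want the math without following the link, here is a rough from-scratch NumPy sketch of the usual pipeline (STFT power spectrum, then a triangular mel filterbank, then log compression); the parameter defaults are illustrative, not the ones used in the talk:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, center):
            fbank[i, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i, j] = (right - j) / max(right - center, 1)
    return fbank

def mel_spectrogram(samples, sr=16000, n_fft=512, hop=256, n_mels=32):
    # 1) Short-time Fourier transform: windowed FFT power per frame
    window = np.hanning(n_fft)
    frames = [samples[i:i + n_fft] * window
              for i in range(0, len(samples) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 2) Project onto the mel filterbank, 3) compress with log
    mels = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mels + 1e-10).T  # shape: (n_mels, n_frames)
```

In practice one would use librosa.feature.melspectrogram instead, but the computation underneath is essentially this.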
@tranthanh3060 · 4 years ago
@@Jononor Thank you so much for your prompt response, this is exactly what I need. Hope you have a nice day!
@saleemjamali3521 · 3 years ago
Sir, can you share the code of your model?
@Jononor · 3 years ago
Hi Saleem. You can find the code here: github.com/jonnor/ESC-CNN-microcontroller
@saleemjamali3521 · 3 years ago
@@Jononor thank you so much sir
@doyourealise · 2 years ago
I am here again, with one question: why don't you upload audio processing videos weekly? Thanks!
@Jononor · 2 years ago
Several reasons. But the main one is that I do not have the time right now. It takes around 10 hours to make a 10 minute lecture with solid content.
@doyourealise · 2 years ago
@@Jononor You are right! It's hard and sometimes a headache haha. Anyway, loved the old content!