Malware and Machine Learning

Malware and Machine Learning - Computerphile

74,686 views

Do anti virus programs use machine learning? Dr Fabio Pierazzi looks at the trends and challenges.
Fabio's website: fabio.pierazzi.com
Main paper: Arp et al., “Dos and Don’ts of Machine Learning for Computer Security”, USENIX Security 2022 - Distinguished Paper Award - Project website: dodo-mlsec.org/
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments: 82

@ClifBratcher 1 year ago

I've spent many years in the industry and the biggest hurdle I've seen to having more dynamic identification is false positives. More specifically stopping users from their day-to-day activities because it has been determined to be malicious. Users are MUCH more forgiving of false negatives (actual infections) than false positives.

@elidrissii 1 year ago

To be fair, false positives are really annoying to get as an end user. I don't want to go through hoops to recover my file that I know is safe after dismissing all the warnings.

@RealCyberCrime 1 year ago

This. Half my day as an analyst is going through false positives

@c1ph3rpunk 1 year ago

@@RealCyberCrime only half?

@TheStevenWhiting 1 year ago

@@elidrissii Yep, SentinelOne is notorious for it.

@prashantd6252 1 year ago

The "false positive" has been an issue since the internet went "public"

@samcooke343 1 year ago

Can we talk about those flawless freehand bell curves?!

@SystemBD 1 year ago

No. Arcane magic is not computable (as of yet).

@zzzaphod8507 1 year ago

Was trying to remember how to call those curves, thanks - the name rings a

@ilyaSyntax 1 year ago

@@zzzaphod8507 RINGS A WHAT??

@cosmic5934 1 year ago

@@ilyaSyntax bell

@KamiK4ze 1 year ago

I did machine learning for ransomware detection as part of my thesis. The problem I had was trying to obtain data for the newest variants. The model needed constant retraining to keep up with the new malware.

@mmm-me4kk 1 year ago

Sir, which ML algorithms did you use, and did you use ROC/AUC and k-fold cross-validation?

@kenbobcorn 1 year ago

If you are an academic, VirusTotal has repositories of large malware samples from the last quarter year or so. VirusShare also has large torrents of recent samples.

@mmm-me4kk 1 year ago

@@kenbobcorn @Kamik4ze Indeed, you could also use the ISOT dataset, although I think that one is outdated.

@shouldb.studying4670 1 year ago

Surely there is a consistency in what outcomes the malware is trying to achieve that could be used as the basis for detection???

@tommasomorandini1982 1 year ago

Wouldn't the point of machine learning be for the program to learn how the malware generally behaves in order to accomplish its goal, and so not rely on the latest samples to identify malware? Because if you always need the latest samples, you might as well just have your program check the files directly against your database, no? Or am I missing something?

@PHF28 1 year ago

I think there might be a mistake in the diagram at 17:10. The red slice should be test data and the remaining slices should be used for training. In any case, great video once again.

@saasthavasan 1 year ago

In my experience, the biggest hurdle I faced while using ML for malware detection or behavior detection was choosing and extracting the features. Often the selected features overlap between malicious and benign software (e.g. sequences of API calls). Unlike static and dynamic detection, which work on heuristics written by an experienced analyst, ML models learn these heuristics on their own during training. And most of the time the heuristics learned by the ML model do not actually make sense. At the end of the day, ML models work on pattern detection. It is really difficult to make the model learn the actual features that are responsible for behavior rather than some random recurring features in the dataset. As a result, we end up getting a high FP rate.

@goldnutter412 1 year ago

Sounds like fun, things were simpler back in the day..

@DumbledoreMcCracken 1 year ago

Hence, ML is not what it is sold to be

@lapisFarm 1 year ago

Really interesting, thanks

@kenbobcorn 1 year ago

I would argue Machine Learning is already very prevalent in industry. As someone who has worked in Malware detection for both Microsoft and Amazon, we leverage large tree models and even large language models for detection.

@ZandarKoad 1 year ago

It depends. Massive organizations that don't focus on tech as a core competency can sometimes be very, very slow at adopting the best tools because it is hard for them to even understand what the best tools are, or come up with a framework for comparing tools. Especially governments.

@ttos3093 1 year ago

Look at cybersecurity vendors (Fortinet, Palo Alto) - they apply that. It’s natural that MS or Amazon as infrastructure companies have no tradition in this.

@kenbobcorn 1 year ago

@@ZandarKoad I can say, at least for the companies I've consulted for, they use some form of SIEM or other IDS system. A lot of the ML side of things is handled by the SIEM vendors, while they just have to comb through alerts and identify FPs.

@GordonjSmith1 1 year ago

Yup! Totally agree that you can detect previously detected 'models' of current threats, but you are still unable to detect an emerging threat using ML. It is an 'informational problem' that this professor clearly discusses.

@trymoto 1 year ago

I could talk to that guy over a pint for like three hours. He's oversimplifying here for a general viewer but this topic is fascinating. Thanks for the video.

@GaryParris 1 year ago

There is another way to think about this issue, one that is not talked about as much: the separation of data systems, and of the data itself, into public and private data. Because of the increase in online usability and transparency, much of the data is exposed to all these forms of attack. The monetisation of data and proprietary IP also creates a reason to profit from it on both sides of the data fence. If you cannot access it directly, it is less likely to be stolen; if the stored information is not valuable, it becomes pointless to steal it. If the identity requirements are removed or reduced, the identity is less valuable. Everything is a trade-off. Pattern-matching machine algorithms (ML & AI) are limited by the algorithm's parameters.

@andrewharrison8436 1 year ago

Well said - all in the name of convenience for the user and exploitation by anyone who handles the data.

@christersmith5470 1 year ago

Using ML to group different types of malicious applications into different families makes the process of malware detection more adaptive, yet we are still getting zero days where a malicious application succeeds by appearing benign. In the medical sciences, there have been many problems, discovered later, where the features used by ML did not accurately predict on new data. This is because researchers let the ML program determine its own features, and the ML program lacked domain expertise. This has resulted in many new companies heavily investing in PhD researchers to prepare the data and relevant features to then run in the model. In cybersecurity, we will still need the human element for similar reasons.

@stavsherman6632 1 year ago

Can you give some examples of this? I am curious to read about it.

@GenaTrius 1 year ago

I assumed this was going to be about malware that uses machine learning. Terrifying.

@HebaruSan 1 year ago

I'm on a team that releases a free open source app. For a while, every time we released a new version we would get a handful of false positive reports from users whose virus scanners tripped on it. Seems like some of the companies just give up and flag everything that isn't in their whitelist when faced with an essentially unsolvable task.

@ewookiis 1 year ago

Nope, they don't give up. They have FPs that they sadly don't handle, and this is part of the "lazy" way that was described in the signature approach. I.e., they use too badly written indicators and leave the detection engine with too much weight on that portion. Sometimes it's the odd coding from the program as well.

@graog123 1 year ago

Thanks for uploading in 4K

@ewookiis 1 year ago

Actually, there are ways to safely implement this: using it as a trigger value and not as the decision engine. Drilling down into the actual detection tree, there are many different ways of compromise, but they can be handled and they are still limited. In short, keeping track of execution, persistence and escalation is the first step, with this as a possible helper. "EDR/XDR" can be quite sufficient in spanning into a larger chain of "observant" behaviour, i.e., the detection engine itself does not have to utilize it, but acting on and piecing data together does benefit from this field. I do however agree that when taking on the whole chain of compromise, things get really tricky. Static and/or dynamic binary analysis is such a small portion of the whole indicator chain, but for training something on the actual portions, be it a buffer overflow etc., it can be used, in my opinion.

@titaniumdiveknife2 1 year ago

Very fun to learn about.

@IceMetalPunk 1 year ago

I feel like many areas of modern ML, including this one, either do or could benefit greatly from continual learning (which, from my understanding, is synonymous with iterative online learning; if they're different, I'd appreciate an explanation of how!). Now, if only we could make that practically efficient on the massive networks of hundreds of billions of parameters or more 😁

@prashantd6252 1 year ago

I'd recommend reading more on ML and the scale being worked on right now. From your comment I felt like you think a billion "parameters" is too much of a challenge, which it isn't. I'd recommend you check out *huggingface

@romanemul1 1 year ago

@@prashantd6252 Well, training a billion params is not a problem. Spending $5k on AWS/Azure/Google processing power is a problem.

@cernejr 1 year ago

I like those markers/pens. :)

@delusionnnnn 1 year ago

It doesn't help that a lot of false positives are generated by detectors actively equating software piracy with malware. In many cases the techniques are similar, so the issue cannot entirely be dismissed, but even when the techniques are exclusive to piracy, detectors often have a high motivating factor to keep identifying piracy techniques as false positives for "malware", particularly those companies which write both detectors and high-profile commercial software such as Microsoft itself, or who are incentivized by them.

@celivalg 1 year ago

It's not quite over-fitting, it's just trained for different threats. The problem is that the patterns would change, as if a panda suddenly didn't mean panda but dog, and the ML system cannot adapt to that. Maybe a more fitting image would be if you had a few images of pandas in your training data, and the ML system recognized them as pandas very well, but then the context changed and dogs are now also pandas. So it should recognize dogs as pandas but it doesn't, as it has either been trained to recognize dogs as dogs, or not trained on them at all, and the images look so different that it has no way of linking the dog to the panda.

@Veptis 7 months ago

So machine learning models, such as classifiers, require a labeled dataset for supervised training. So are there datasets of malware? Maybe like the vx underground vault?

@Syntax753 1 year ago

Fantastic!

@FrancescoBazzani 1 year ago

Heard 20 seconds of the video, and… yes, he's Italian like me. Stepping aside from this inside joke, great content!

@DumbledoreMcCracken 1 year ago

It seems more interesting to write infections with ML that create detection nets

@shiladityasircar9814 8 months ago

Prevalence data and diversity of behaviour are two important criteria. It's difficult to mount an adversarial attack on models that are behaviour dependent. These modern ML approaches to cyber security use static and dynamic behaviour encoding to stop malware. Cylance ML models are an example of it.

@CodingTrades 1 year ago

ML evaluates malware as an adversarial code execution that's malicious.identity. That detection relies on behavior that is itself a signature representation, unique for recognizing that it has been deployed. How is a behavior signature not like a fingerprint?

@ewookiis 1 year ago

I agree, it is like fingerprints. However, every iteration, just like fingerprints, is different to such an extent that you can't rely on it alone.

@goldnutter412 1 year ago

It would be one facet of detection, like the MO (modus operandi) in a crime. Fingerprinting is "specific". I like the MALICIOUS.IDENTITY object! Very handy; you could call it a signature but that wouldn't really be accurate. A specific code execution "process" occurring on the CPU is what is being detected, right?

@thaihocnguyen7113 1 year ago

I have a question. Cross-validation is a method that lets a machine learning model see all of the data (with n folds you can split it into train or validate). Still, I'm confused about accuracy: we need a "test set" to check the model again, right? Because the model you trained with cross-validation can overfit. I think cross-validation is used when you have little data, and we still need a test set to check again. If you have enough data you don't need cross-validation, right? Sorry for my English.

@SuperCaptain4 1 year ago

Normally cross-validation is used for setting the hyperparameters of a machine learning model. First you would split your data set into a training and a test set, say 70/30. Thereafter, you use k-fold cross-validation on the training set. What will happen is that a model will be trained k times. (k is a number you choose; the higher k, the better the estimates you get for your hyperparameters, but the more time you spend cross-validating, as the model needs to be retrained.) Each time the model is trained during k-fold cross-validation, the training dataset, the 70% of all the data you had from the beginning, will be split again. Let's say it's split 90/10. The model will then be trained on this 90% and evaluated on the remaining 10%, the validation data. After repeating this k times, we select the hyperparameter value which scored the highest on the 10% validation data. Now, to check for overfitting, we run the model again on the completely unseen test data, the 30% of the original data that we had kept away during training.
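
A minimal sketch of the procedure described above, in plain Python with no external libraries. The synthetic 1-D data, the toy nearest-neighbour classifier, and every name (`k_fold_splits`, `knn_predict`, `cross_validate`, `candidates`) are assumptions made for illustration; none of them come from the video or the comment. It holds out 30% as a test set, runs 10-fold cross-validation (a 90/10 split each round) on the remaining 70% to pick a hyperparameter, then scores the chosen setting once on the untouched test set.

```python
import random


def k_fold_splits(indices, k):
    """Yield (train, validation) index lists; each fold is the validation set once."""
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


def knn_predict(x, train, xs, ys, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda j: abs(xs[j] - x))[:n_neighbors]
    votes = sum(ys[j] for j in nearest)
    return votes * 2 > len(nearest)


def accuracy(train, evaluate, xs, ys, n_neighbors):
    """Fraction of points in `evaluate` classified correctly using `train`."""
    hits = sum(knn_predict(xs[i], train, xs, ys, n_neighbors) == ys[i]
               for i in evaluate)
    return hits / len(evaluate)


def cross_validate(train_pool, xs, ys, n_neighbors, k=10):
    """Mean validation accuracy over k folds of the training pool only."""
    scores = [accuracy(tr, va, xs, ys, n_neighbors)
              for tr, va in k_fold_splits(train_pool, k)]
    return sum(scores) / len(scores)


# Hypothetical data: one feature per sample, label flips with 10% noise.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [(x > 0.5) != (rng.random() < 0.1) for x in xs]

# Step 1: 70/30 split; the test set is not touched until the very end.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.7 * len(indices))
train_pool, test_set = indices[:cut], indices[cut:]

# Step 2: choose the hyperparameter by k-fold cross-validation on the 70%.
candidates = [1, 5, 15]  # odd values so the vote cannot tie
best = max(candidates, key=lambda n: cross_validate(train_pool, xs, ys, n))

# Step 3: a single final evaluation on the unseen 30%.
test_accuracy = accuracy(train_pool, test_set, xs, ys, best)
print(best, round(test_accuracy, 3))
```

The random shuffle here is only reasonable for data without a time dimension; for malware samples that evolve over time, a split that keeps the test samples later than the training samples would likely be the safer assumption.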

@thaihocnguyen7113 1 year ago

@@SuperCaptain4 Thank you so much, I got it.

@andrewharrison8436 1 year ago

It seems to me that the hunt for bells, whistles and bling in applications leads to an enhanced attack surface which allows malware. I wrote a secure interface (a long time ago), it was doable because the range of API calls I had to intercept was very limited and I could parse all possible legit parameters and reject the rest. The code was documented and could be checked by my peers. Move to a GUI based environment with more levels of abstraction and the operating system being invoked the whole time for sound or video or malice - no chance. Security starts from the operating system (disclaimer - Windows user - I do hope the antivirus people know their stuff).

@prashantd6252 1 year ago

Were you bragging dude? 😂

@GNARGNARHEAD 1 year ago

oh right, check out Christopher Domas talk "The future of RE Dynamic Binary Visualization", I'd bet you'd have much better luck feeding the data in with various transformations, like a Hilbert curve, giving it a semantic structure to deal with.. just might even work with an image recognition algorithm then too.. maybe

@katjejoek 1 year ago

It has been a while since I've seen BASIC code! 😂

@RealCyberCrime 1 year ago

Just wait until chatgpt can write better malicious software

@bytefu 1 year ago

If only it understood what it's writing...

@timothygalvin3021 1 year ago

I can't express in words how much all the empty shelves in this video bother me. Why have all these shelves if you're not going to use them!?

@raicyceprine8953 1 year ago

I don't know why I watched it in full even though I don't understand it.

@artiem5262 1 year ago

It's heuristics -- educated guessing -- as the halting problem is still out there, so you can guess but you'll never be able to prove if a target is malware or not.

@JorgetePanete 1 year ago

I guess that's only when you treat it as a black box; in a white box you could know what it is.

@MoxxMix 1 year ago

Is there a point in talking about this when Windows 11 has become malware?

@barreiros5077 1 year ago

What API said...

@RealCyberCrime 1 year ago

Just waiting on chatgpt to write some good malware

@happygimp0 10 months ago

You cannot use a computer to detect malware. It is mathematically impossible to do that reliably, since it requires the halting problem to be solvable on a PC, which it isn't.

@cytroyd 1 year ago

We need MLware that uses ML to penetrate and replicate across systems. Imagine a GPT-powered worm. Self-generating zero days. I recommend open source LLMs like BLOOM to get started.

@adia.413 1 year ago

The computation requirements to run GPT would have to be much lower than today, as not all servers have enough computation power to run a model. On the other hand, I can imagine a trained AI model that could analyze the binaries / source code and create zero-day approaches based on the input.

@-FFFridge 1 year ago

You could use the same method as actual viruses and randomly mutate the code a million times on all already infected systems, until some variant actually works, which is then sent outward to penetrate new hosts. It's incredibly slow, but requires less computing than GPT.