Using Machine Learning to Name Malware

By Elevate posted 06-09-2016 00:33


Originally written in February 2015.


The current situation with malware naming conventions is in disarray. Different antivirus vendors use different naming conventions and sometimes they don’t follow their own standards.


Let’s look at a few results for a random virus. These are the results from VirusTotal, a meta-antivirus scanning service.

We can see that it is a Trojan. Some vendors (Dr.Web and TrendMicro) set the family as StartPage, some say it is in the Agent family, some say it is in the FakeAV family, and some call it Generic “KR” malware. After running it in an online version of the Cuckoo sandbox, the StartPage variation seems to be the most descriptive.


There are a number of virus naming conventions and no unanimous industry standard. For example, Caro, Microsoft and Symantec follow different formats.


Some of these conventions are a little strange. According to the Caro platform naming page, permitted platform names include both programming languages such as Javascript and operating systems such as BeOS and OSX. Yet a virus written in Javascript that exploits an Internet Explorer vulnerability may work on Windows but not on Linux or OSX, and all AVs would still report its platform as Javascript.


But even if the conventions were agreed upon, correct naming would still be confusing. For example, if the malware is an HTML file with embedded Javascript that exploits some browser vulnerability and drops and executes a binary file, some antiviruses name the whole procedure after the file being dropped, some name it after the type of vulnerability that the Javascript exploits, and some name it after the tag in the HTML file where the Javascript is located. To make the situation worse, all of these components can be modularized (e.g. the Javascript can exploit another vulnerability, or use a different tag in the HTML while still dropping the same binary).


However, we will not attempt to create a new standard, since there are enough already. Instead, we will try to generate a decent name from the names we already have.


Generating a “good” name


Where do we start? First we can run a bunch of malware through VirusTotal to get some names for it and then use those names somehow to help us generate a name for malware we haven’t seen before.


Since every antivirus has its own name format that is not always documented or consistently adhered to, we cannot simply come up with an easy set of rules.


Luckily, we are primarily interested in the so-called “family name” for malware, because that is the information that categorizes it into a specific group. Other information provided by AV engines is borderline redundant. For example, platform/language can be determined from running filemagic on the malware. “Malware type” such as whether the malware is a Trojan or a Dropper is fairly subjective in many cases, because almost any Dropper is also a Trojan, and many Trojans are Droppers.


To figure out the name, we can try a simple statistical strategy called Term Frequency x Inverse Document Frequency, or tf-idf. It works by counting the number of times each word occurs in a document and multiplying that count by the log of the total number of documents divided by the number of documents that contain that word. That is,


$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{N}{|\{d' \in D : t \in d'\}|}$$

Here, t is the term, d is one document, D is all documents, N is the number of documents.


We pick the most common name but penalize it if it occurs too frequently. In other words we want a consensus name that isn’t too common.
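
As a rough illustration of the formula, the sketch below computes the score by hand for a toy corpus. The three token lists are invented for this example, and the plain log(N / df) weighting follows the description above; scikit-learn's TfidfVectorizer applies smoothing and normalization by default, so its numbers will differ.

```python
import math

# Toy corpus: each "document" is the token list for one sample (made-up data).
docs = [
    ["Win32", "Trojan", "LoadMoney", "LoadMoney"],
    ["Win32", "Trojan", "Agent"],
    ["Win32", "Adware", "LoadMoney"],
]

def tf_idf(term, doc, all_docs):
    """Raw term count in `doc` times log(N / number of docs containing `term`)."""
    tf = doc.count(term)
    df = sum(1 for d in all_docs if term in d)
    return tf * math.log(len(all_docs) / df)

# Score every distinct token of the first document.
scores = {t: tf_idf(t, docs[0], docs) for t in set(docs[0])}
best = max(scores, key=scores.get)
print(best, scores)
```

“Win32” appears in every document, so its score is zero, while “LoadMoney” is both repeated and relatively rare, so it comes out on top.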


Let’s say we have a file called “50kresults.json” with all the antivirus results in the following format:

[{"AV1": "Win32.Agent", "AV2": ""}, {"AV1": "Win32.BadVirus", "AV2": "JS.Iframe"}]


To figure out the family based on tf-idf we can do the following:

import json
import re
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer

def get_list_of_token_lists(list_of_dicts):
    """Convert [{"AV": "Some.Malware"}, {"AV":"Another.Malware"}] => [["Some", "Malware"], ["Another", "Malware"]]"""
    big_list = []
    for _dict in list_of_dicts:
        inner_list = []
        for v in _dict.values():
            if v is not None:
                # split on non-word characters and drop empty strings
                inner_list.extend([x for x in re.split(r"\W", v) if x])
        big_list.append(inner_list)  # one token list per scan result
    return big_list

def make_tfidf(list_of_dicts):
    # the documents are already tokenized, so the analyzer passes them through
    tfidf = TfidfVectorizer(analyzer=lambda x: x)
    tfidf.fit(get_list_of_token_lists(list_of_dicts))
    return tfidf

with open("50kresults.json") as f:
    j = json.load(f)[1:1000]
tfidf = make_tfidf(j)
to_guess = {k: v for k, v in j[0].items() if v}

print("We have to guess the family name in the following result:\n")
pprint(to_guess)

l_of_l = get_list_of_token_lists([to_guess])
m = tfidf.transform(l_of_l)
# map each token to its column in the tf-idf matrix, then to its score
els_to_pos = {e: tfidf.vocabulary_[e] for e in l_of_l[0]}
els_to_scores = {k: m[:, v].toarray()[0][0] for k, v in els_to_pos.items()}

print("\nTop 3 results for families:")
# ignore very short tokens, which are rarely family names
print(sorted([(token, score) for (token, score) in els_to_scores.items() if len(token) > 3], key=lambda x: x[1], reverse=True)[:3])


and we get the following output:

"We have to guess the family name in the following result:"

{'AVG': 'MLoader',
 'Ad-Aware': 'Gen:Application.LoadMoney.1',
 'AhnLab-V3': 'PUP/Win32.Downloader',
 'AntiVir': 'APPL/Downloader.Gen7',
 'BitDefender': 'Gen:Application.LoadMoney.1',
 'Commtouch': 'W32/LoadMoney.K.gen!Eldorado',
 'Comodo': 'TrojWare.Win32.Kryptik.AXJX',
 'DrWeb': 'Trojan.LoadMoney.1',
 'ESET-NOD32': 'a variant of Win32/LoadMoney.AU',
 'F-Prot': 'W32/LoadMoney.K.gen!Eldorado',
 'F-Secure': 'Gen:Application.LoadMoney.1',
 'Fortinet': 'Adware/LoadMoney',
 'GData': 'Gen:Application.LoadMoney.1',
 'Ikarus': 'not-a-virus:Downloader.Win32',
 'Jiangmin': 'Trojan/Generic.bedbi',
 'K7AntiVirus': 'Trojan ( 0040f53f1 )',
 'K7GW': 'Trojan ( 0040f53f1 )',
 'Kaspersky': 'not-a-virus:HEUR:Downloader.Win32.LMN.a',
 'Kingsoft': 'Win32.Troj.Generic.a.(kcloud)',
 'Malwarebytes': 'PUP.Optional.LoadMoney',
 'McAfee': 'Adware-FUI!21AE44F69544',
 'McAfee-GW-Edition': 'Heuristic.BehavesLike.Win32.Suspicious.D',
 'MicroWorld-eScan': 'Gen:Application.LoadMoney.1',
 'NANO-Antivirus': 'Riskware.Win32.Lmn.cgadbh',
 'Panda': 'Trj/Genetic.gen',
 'Rising': 'PE:Trojan.Agent!1.6956',
 'SUPERAntiSpyware': 'Trojan.Agent/Gen-LoadMoney',
 'Sophos': 'Troj/LdMon-A',
 'VBA32': 'Downware.LMN.gen',
 'VIPRE': 'Trojan-Downloader.Win32.LoadMoney.u (v)'}

"Top 3 results for families:"

[('LoadMoney', 0.61993592188127444), 
 ('Win32', 0.2548979272879856), 
 ('Application', 0.24128378400001674)]


So this worked pretty well to guess the “LoadMoney” family for that malware.


However, using the same strategy on the StartPage virus we started with, we get:

"Top 10 results for families:"
[('209736', 0.49923733235769835),
 ('Trojan', 0.31205076914299817),
 ('Agent', 0.2897505833486374),
 ('hmqy', 0.24961866617884917),
 ('Win32', 0.17014484832431584),
 ('FBEG', 0.16641244411923278),
 ('C06FF460EAA3', 0.16641244411923278),
 ('Agent2', 0.12921257845494616),
 ('Generic', 0.10607853677450814),
 ('StartPage', 0.096756039224138807)]

Not very good, unfortunately. The identifier dominates because it appears more than once and is probably unique to that virus.


There are other issues with simple tf-idf. What if none of the AVs list a family name and only list the platforms? What if we want to know the consensus language in which the malware is written or the platform for which it is written? What if one family name is misspelled consistently by one antivirus?


Using CRFs to generate names


If we re-frame the problem of figuring out a good name into the problem of labeling parts of virus names and then combining the labels from different antiviruses, we can see it as a text segmentation problem.


Conditional Random Fields or CRFs are excellent at dealing with such problems.


In our case, text segmentation can be used to infer what each part of a virus name means. That is, a detection name can be split into tokens such as

 ["Win32", "Trojan", "Sality", "A"] 

and then tagged as

 ["platform", "_type", "family", "ident"] 

After that, based on all the platform names that we get from AVs, we can figure out what the consensus is. We want to use CRFs instead of something like Naive Bayes because the order of tokens within each AV’s name for a virus is very important and because the tokens are not independent (e.g. an Iframe family malware is probably written in Javascript).
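
The consensus step can be sketched as a simple majority vote over the tagged tokens. The `consensus` helper and the sample tagged results below are hypothetical and only illustrate the idea; real post-processing may weigh vendors differently or merge synonymous spellings first.

```python
from collections import Counter

def consensus(tagged_results, label):
    """Return the most common token tagged with `label`, or 'unknown' if none."""
    counts = Counter(
        token
        for tokens, labels in tagged_results
        for token, tag in zip(tokens, labels)
        if tag == label
    )
    return counts.most_common(1)[0][0] if counts else "unknown"

# Made-up (tokens, labels) pairs, one per antivirus engine.
tagged = [
    (["Win32", "Trojan", "Sality", "A"], ["platform", "_type", "family", "ident"]),
    (["W32", "Sality", "B"], ["platform", "family", "ident"]),
    (["Trojan", "Win32", "Sality"], ["_type", "platform", "family"]),
]

print(consensus(tagged, "family"))
print(consensus(tagged, "platform"))
print(consensus(tagged, "language"))
```

Note that “W32” and “Win32” are counted as different platforms here, which is exactly the kind of synonym problem that later post-processing has to deal with.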


How CRFs work


Although we won’t be able to go into details here about exactly how CRFs work because it would take about 100 pages, there are a number of texts that do go into the details.


Here we will try to get a high-level understanding. Instead of jumping straight into CRFs, we will first go into somewhat similar Hidden Markov Models or HMMs. But instead of jumping straight into HMMs, we will first discuss dinosaurs and how they lay eggs.


Image by  Moyan Brenn

Imagine you are on a dinosaur island, exploring the jungle. You stumble upon a nice resting area covered with leaves. After clearing some of the leaves, you see a medium-sized dinosaur egg. This could either be a T.Rex egg or a P.Walkeri egg. You know that P.Walkeri eggs are not as large and the dinosaurs themselves are not dangerous, while T.Rex are very dangerous. You just are not sure if you should run. After seeing just one egg, you aren’t sure if it came from T.Rex or P.Walkeri, because it could be some small T.Rex. You clear more leaves and see a smaller egg. Then another medium-sized egg. You decide that these are very likely to be P.Walkeri eggs and keep going. T.Rex probably don’t lay eggs right next to P.Walkeri and no T.Rex could lay that many medium-sized eggs.


In this scenario, HMMs could be used to predict which dinosaur the eggs came from. The eggs here are the observations and the dinosaurs are the hidden states. We can use the Viterbi algorithm to predict these hidden states. Let’s say there is a 10% chance of a T.Rex producing small eggs, a 20% chance of medium eggs and a 70% chance of large eggs. P.Walkeri’s probabilities are 50%, 40% and 10% respectively. Also, there is a 2% chance that a T.Rex laid an egg close to P.Walkeri eggs and a 1% chance that a P.Walkeri laid an egg near T.Rex eggs.


Egg-laying probabilities.



If we just see the eggs and we know the probabilities ahead of time, we can figure out which egg belongs to which dinosaur using the Viterbi algorithm. To figure out the probabilities in the image above, we would need some experience looking at dinosaurs and the eggs they lay, and we would use the Forward-Backward algorithm. In our case, the eggs are the virus name tokens and the dinosaurs are the tags, such as type, family, etc. Instead of using HMMs, we are using CRFs, which can be viewed as a generalization of HMMs that turns the constant transition probabilities in the image above into arbitrary functions that vary across the positions in the sequence of hidden states (dinosaurs), depending on the input sequence (eggs).
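
The egg example is small enough to run through Viterbi directly. The sketch below uses the emission probabilities given above; the uniform starting probabilities are an assumption (the text does not give any), and the 2% and 1% figures are read as the chance of switching from one dinosaur to the other between neighbouring eggs, which is one plausible interpretation.

```python
states = ["T.Rex", "P.Walkeri"]
start = {"T.Rex": 0.5, "P.Walkeri": 0.5}  # assumption: no prior preference
emit = {
    "T.Rex": {"small": 0.1, "medium": 0.2, "large": 0.7},
    "P.Walkeri": {"small": 0.5, "medium": 0.4, "large": 0.1},
}
# Assumed reading: 2% chance a T.Rex egg follows a P.Walkeri egg, 1% the reverse.
trans = {
    "T.Rex": {"T.Rex": 0.99, "P.Walkeri": 0.01},
    "P.Walkeri": {"T.Rex": 0.02, "P.Walkeri": 0.98},
}

def viterbi(observations):
    """Return the most likely sequence of hidden states for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # pick the previous state that maximizes the path probability into s
            prob, path = max(
                ((best[p][0] * trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

eggs = ["medium", "small", "medium"]
print(viterbi(eggs))  # ['P.Walkeri', 'P.Walkeri', 'P.Walkeri']
```

With the three eggs from the story (medium, small, medium), the most likely explanation is that all of them came from P.Walkeri, which matches the intuition above.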


Using CRFsuite library


There’s an excellent CRF C library that we’ll use here called CRFSuite.


We’ll create some training data, convert it to features, train a model and run it on some results we haven’t seen before.


First, let’s create some training data to label parts of virus names with their corresponding tags. We would want to convert antivirus results such as

[{"AntiVir": "TR/Crypt.XPACK.Gen2"}, {"AntiVir":"DR/Delphi.Gen"}]

to the following:

TR 0 AntiVir _type
/ / AntiVir delim
Crypt 1 AntiVir family
. . AntiVir delim
XPACK 2 AntiVir group
. . AntiVir delim
Gen2 3 AntiVir ident

DR 0 AntiVir _type
/ / AntiVir delim
Delphi 1 AntiVir family
. . AntiVir delim
Gen 2 AntiVir ident

With the last column being the labels that will later be guessed. To create the above format, we can use the following:

import itertools
import json
import re
REGEX_NONWORD = re.compile(r"\W")
REGEX_NONWORD_SAVED = re.compile(r"(\W)")

def preprocess_av_result(av_result, av):
    """Split an av result into a list of maps for word, pos, av and label

    EG. take something like 'win32.malware.group' and convert to
        [{'av': 'someav', 'w': 'win32', 'pos': '0', 'label': 'skip'},
         {'av': 'someav', 'w': '.', 'pos': '.', 'label': 'delim'},
         {'av': 'someav', 'w': 'malware', 'pos': '1', 'label': 'skip'},
         {'av': 'someav', 'w': '.', 'pos': '.', 'label': 'delim'},
         {'av': 'someav', 'w': 'group', 'pos': '2', 'label': 'skip'}]
    """
    # split while keeping the delimiters; a space is written as '_'
    split_delim = [el if el != ' ' else '_' for el in
                   REGEX_NONWORD_SAVED.split(av_result)]
    split_no_delim = REGEX_NONWORD.split(av_result)
    delims = set(split_delim) - set(split_no_delim)

    counter = 0
    tags = []
    labels = []
    for el in split_delim:
        if el in delims:
            # a delimiter is its own position and always gets the 'delim' label
            tags.append(el)
            labels.append('delim')
        else:
            # a word gets a running index and a placeholder label to fix by hand
            tags.append(str(counter))
            labels.append('skip')
            if el:  # empty strings are dropped below, so they use no index
                counter += 1

    return [{'w': i, 'pos': j, 'av': k, 'label': l} for i, j, k, l in
            zip(split_delim, tags, itertools.repeat(av), labels) if i != '']

with open("50kresults.json") as f:
    j = json.load(f)[:1000]  # contains the results
with open("all_train.txt", 'w') as f:  # name of the training file
    for d in j:
        for av, res in d.items():
            if not res:  # skip engines that returned nothing
                continue
            features = preprocess_av_result(res, av)
            for fd in features:
                f.write('\t'.join([fd['w'], fd['pos'], fd['av'], fd['label']]) + "\n")
            f.write("\n")  # a blank line separates one AV result from the next

After creating and manually labeling the tokens (we label the last column in “all_train.txt” according to what we think the token actually corresponds to), we want to create a feature file that crfsuite understands. To convert the result to features, we can use a slightly modified version of a script that ships with crfsuite and converts a labeled file into a feature file. To let the CRF take advantage of the fact that each antivirus uses a slightly different naming convention, all we have to do is modify the template (included below).

templates = (
    (('w', -2), ),
    (('w', -1), ),
    (('w',  0), ),
    (('w',  1), ),
    (('w',  2), ),
    (('w', -1), ('w',  0)),
    (('w',  0), ('w',  1)),
    (('pos', -2), ),
    (('pos', -1), ),
    (('pos',  0), ),
    (('pos',  1), ),
    (('pos',  2), ),
    (('pos', -2), ('pos', -1)),
    (('pos', -1), ('pos',  0)),
    (('pos',  0), ('pos',  1)),
    (('pos',  1), ('pos',  2)),
    (('pos', -2), ('pos', -1), ('pos',  0)),
    (('pos', -1), ('pos',  0), ('pos',  1)),
    (('pos',  0), ('pos',  1), ('pos',  2)),
    (('av', 0), ),
)

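To make the template notation concrete: each entry lists one or more (field, offset) pairs relative to the current token, and every entry yields one feature string. The snippet below is a simplified, self-contained sketch of that expansion on a hand-made token sequence; the `expand` helper and the two-entry `demo_templates` are illustrative names, not part of crfsuite.

```python
# Hand-made token rows in the same shape as the training data.
tokens = [
    {"w": "TR", "pos": "0", "av": "AntiVir"},
    {"w": "/", "pos": "/", "av": "AntiVir"},
    {"w": "Crypt", "pos": "1", "av": "AntiVir"},
]
demo_templates = (
    (("w", 0),),             # the current word
    (("w", -1), ("w", 0)),   # the previous word together with the current word
)

def expand(seq, i, template):
    """Return the feature string for one template at position i, or None if it runs off the sequence."""
    values = []
    for field, offset in template:
        p = i + offset
        if p < 0 or p >= len(seq):
            return None
        values.append(seq[p][field])
    name = "|".join("%s[%d]" % (f, o) for f, o in template)
    return "%s=%s" % (name, "|".join(values))

features = [expand(tokens, 2, t) for t in demo_templates]
print(features)  # ['w[0]=Crypt', 'w[-1]|w[0]=/|Crypt']
```

The `('av', 0)` entry in the real template works the same way and simply emits the engine name as a feature, which is what lets the model learn vendor-specific conventions.
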
We save the modified script and run it with the following command:

cat all_train.txt | ./ > all_train.crfsuit.txt

After that, we have to train the model using the features file:

crfsuite learn -m all_train.model all_train.crfsuit.txt

After that, we can check how the model performs with some testing data:

cat all_test.txt | ./ > all_test.crfsuite.txt
crfsuite tag -m all_train.model -t all_test.crfsuite.txt


Performance by label (#match, #model, #ref) (precision, recall, F1):
    _type: (80, 80, 80) (1.0000, 1.0000, 1.0000)
    delim: (558, 558, 558) (1.0000, 1.0000, 1.0000)
    family: (159, 162, 159) (0.9815, 1.0000, 0.9907)
    group: (19, 19, 19) (1.0000, 1.0000, 1.0000)
    ident: (152, 152, 156) (1.0000, 0.9744, 0.9870)
    skip: (95, 97, 95) (0.9794, 1.0000, 0.9896)
    platform: (109, 109, 112) (1.0000, 0.9732, 0.9864)
    language: (12, 15, 13) (0.8000, 0.9231, 0.8571)
    method: (7, 7, 7) (1.0000, 1.0000, 1.0000)
    compiler: (0, 0, 0) (******, ******, ******)
    _test: (0, 0, 0) (******, ******, ******)
    malic: (27, 27, 27) (1.0000, 1.0000, 1.0000)

Macro-average precision, recall, F1: (0.697204, 0.705046, 0.700773)
Item accuracy: 1218 / 1226 (0.9935)
Instance accuracy: 152 / 160 (0.9500)

We are primarily interested in the family label, which reaches ~98% precision with 100% recall. Good enough.
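
The per-label numbers in the report follow from the three counts in the first set of parentheses: #match is the number of tokens where the predicted and reference labels agree, #model is the number of tokens the model assigned that label, and #ref is the number of tokens that carry it in the reference data. A small sketch (the `prf` helper is just for illustration) reproduces the figures for two of the labels:

```python
def prf(match, model, ref):
    """Precision, recall and F1 from crfsuite-style (#match, #model, #ref) counts."""
    precision = match / model
    recall = match / ref
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

print(prf(159, 162, 159))  # family -> (0.9815, 1.0, 0.9907)
print(prf(152, 152, 156))  # ident  -> (1.0, 0.9744, 0.987)
```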

Now we can use the “all_train.model” file to tag tokens in new malware names.
First, create a function for extracting features from labeled text:

def extract_features(X):
    all_features = []
    for i, _ in enumerate(X):
        el_features = [X[i]['label']]
        for template in templates:
            features_i = []
            name = '|'.join(['%s[%d]' % (f, o) for f, o in template])
            for field, offset in template:
                p = i + offset
                if p < 0 or p >= len(X):
                    # the template reaches past the sequence, so skip it
                    features_i = []
                    break
                features_i.append(X[p][field])
            if features_i:
                el_features.append('%s=%s' % (name, '|'.join(features_i)))
        all_features.append(el_features)  # one feature list per token

    return all_features

Then use the Tagger class from python-crfsuite library to label the malware:

from pycrfsuite import Tagger
tagger = Tagger()
tagger.open("all_train.model")  # our model file we created in previous step.
k, v = 'F-Prot', 'W32/LoadMoney.K.gen!Eldorado'  # av result from previous section
result = tagger.tag(extract_features(preprocess_av_result(v, k)))
print("Antivirus:", k)
print("Antivirus result:", v)
print("Tokenized result:", [res['w'] for res in preprocess_av_result(v, k)])
print("Labeled result", result)

We get the following output:

Antivirus: F-Prot
Antivirus result: W32/LoadMoney.K.gen!Eldorado
Tokenized result: ['W32', '/', 'LoadMoney', '.', 'K', '.', 'gen', '!', 'Eldorado']
Labeled result ['platform', 'delim', 'family', 'delim', 'ident', 'delim', 'skip', 'delim', 'ident']

It worked! We now know what each token corresponds to. We can further improve the results by modifying the template, including additional features, and doing further post-processing such as picking one name among synonymous names, grouping similarly spelled labels, etc.

But this seems to be good enough for now.


Once we have all the post-processing in place, we can even guess our original StartPage virus.

In [1]: import name_generator
In [2]: g = name_generator.Guesser()
In [3]: d= {"AVG":     "Agent2.CHWB",
"Ad-Aware":    "Trojan.Generic.KD.209736",
"Agnitum":     "Trojan.Agent!imlOtEtsC6M",
"AhnLab-V3":   "Trojan/Win32.Agent",
"AntiVir":     "TR/Agent.hmqy",
"Antiy-AVL":   "Trojan[Downloader]/Win32.Agent",
"Avast":   "Win32:Malware-gen",
"Baidu-International":     "Trojan.Win32.Downloader.AAAp",
"BitDefender":     "Trojan.Generic.KD.209736",
"Bkav":    "W32.OnGameXERPPAB.Trojan",
"ClamAV":  "Trojan.Agent-245698",
"Commtouch":   "W32/Trojan.DFCR-0837",
"Comodo":  "TrojWare.Win32.Agent.hmqy",
"DrWeb":   "Trojan.StartPage.34283",
"ESET-NOD32":  "Win32/TrojanDownloader.Agent.QXS",
"Emsisoft":    "Trojan.Generic.KD.209736 (B)",
"F-Prot":  "W32/Trojan2.NNTI",
"F-Secure":    "Trojan.Generic.KD.209736",
"Fortinet":    "W32/BanLoader.AAAK!tr",
"GData":   "Trojan.Generic.KD.209736",
"Ikarus":  "Trojan.Agent2",
"Jiangmin":    "Trojan/Agent.etuj",
"K7AntiVirus":     "Trojan ( 66cd75180 )",
"K7GW":    "Riskware ( 0015e4f01 )",
"Kaspersky":   "Trojan-Downloader.Win32.Agent.gzlz",
"Kingsoft":    "Win32.Troj.Agentgd.kf.(kcloud",
"Malwarebytes":    "Trojan.Agent",
"McAfee":  "Trojan-FBEG!C06FF460EAA3",
"McAfee-GW-Edition":   "Trojan-FBEG!C06FF460EAA3",
"MicroWorld-eScan":    "Trojan.Generic.KD.209736",
"Microsoft":   "Trojan:Win32/Agent.IM",
"NANO-Antivirus":  "Trojan.Win32.Agent.csgiq",
"Norman":  "DLoader.AQGOV",
"Panda":   "Trj/Genetic.gen",
"Qihoo-360":   "Trojan.Win32.StartPage.I",
"Rising":  "PE:Trojan.Win32.FakeAV.bqz!1075345744",
"SUPERAntiSpyware":    "Trojan.Agent/Gen-StartPage",
"Sophos":  "Troj/Agent-RSQ",
"Symantec":    "Trojan.Gen",
"TheHacker":   "Trojan/Agent.hmqy",
"TotalDefense":    "Win32/Agent.BPS",
"TrendMicro":  "TROJ_STARTP.SMHT",
"TrendMicro-HouseCall":    "TROJ_GEN.F47V0115",
"VBA32":   "Trojan.Agent",
"VIPRE":   "Trojan.Win32.Generic!BT",
"nProtect":    "Trojan/W32.Small.24576.QZ"}
In [4]: g.guess_everything(d)
{'platform': 'Win32',
 'family': 'BanLoader',
 'language': 'unknown',
 '_type': 'Trojan',
 'group': 'unknown',
 'ident': 'tr',
 'compiler': 'unknown'}

So we didn’t end up getting StartPage, partially because of the smallish size of training data (some of the AVs that labeled it as StartPage were not trained on) but BanLoader fits that malware alright as well.


All of the post-processing to settle on one common name has already been done and you can find the library that can guess the virus names at this github repo.


Article is also available on the Juniper SecIntel blog.