Feature Selection for Malware Classification

Mitchell Mays, Noah Drabinsky, Dr. Stefan Brandle
Fri, Apr 28, 2017

In applying machine learning to malware identification, several types of features have proven successful. These features have been tested with different classification methods, with varying degrees of success. Each new machine learning method introduced for classifying malware has the potential to raise the overall quality of malware classification in the field. Even a new classifier that is no more accurate than its predecessors can be combined with them, using one of several ensemble techniques, to sharpen the classification and raise overall accuracy.

For our purposes, we have attempted to create a coalition of classifiers, each of which uses different features. Once trained, these classifiers provide multiple angles on the same problem and can be used to test ensemble techniques. Eventually, such an ensemble of individual malware classifiers could provide a highly precise means of distinguishing malware from other software.
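As a rough illustration of what combining classifiers can look like, the sketch below averages the class probabilities reported by several classifiers (often called soft voting). The source does not say which ensemble technique was used, so this is only one common option; the function names, the example probabilities, and the [benign, malware] class order are all assumptions made for the example.

```python
from typing import List, Optional, Sequence


def soft_vote(
    probabilities: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return the weighted average of per-class probabilities.

    `probabilities` holds one probability vector per classifier,
    all for the same sample and in the same class order.
    """
    if not probabilities:
        raise ValueError("need at least one classifier output")
    if weights is None:
        weights = [1.0] * len(probabilities)  # equal say for every classifier
    if len(weights) != len(probabilities):
        raise ValueError("one weight per classifier is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probabilities[0])
    if any(len(p) != n_classes for p in probabilities):
        raise ValueError("all classifiers must report the same classes")
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


def predict(
    probabilities: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> int:
    """Return the index of the class with the highest combined probability."""
    combined = soft_vote(probabilities, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Hypothetical outputs of two classifiers for one sample,
# in the assumed class order [benign, malware].
classifier_a = [0.40, 0.60]
classifier_b = [0.70, 0.30]

combined = soft_vote([classifier_a, classifier_b])
label = predict([classifier_a, classifier_b])
```

Here the two classifiers disagree, and the averaged probabilities favor the class that the more confident classifier chose. Passing `weights` would let a classifier with a better validation record count for more.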

Specifically, we have created a convolutional neural network that processes byte data as an image, and a deep feedforward neural network that uses opcode N-gram features. Both classifiers, while not perfect, classify malware with a significant level of accuracy. They achieve this independently of one another, and when combined, each contributes enough to improve the final accuracy.
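The two feature types named above can be sketched as follows. This is a minimal illustration, not the pipeline used in this research: the image width, the zero padding, the N-gram length, and the sample bytes and opcodes are all assumptions chosen for the example.

```python
from collections import Counter
from typing import List, Sequence


def bytes_to_image(data: bytes, width: int = 16) -> List[List[int]]:
    """Lay raw bytes out row by row as a grayscale grid (values 0-255).

    The last row is zero-padded so every row has `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def opcode_ngrams(opcodes: Sequence[str], n: int = 2) -> Counter:
    """Count every run of `n` consecutive opcodes in the sequence."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )


# Hypothetical inputs: a few raw bytes and a short disassembled opcode list.
sample_bytes = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03])
sample_opcodes = ["push", "mov", "push", "mov", "call"]

image = bytes_to_image(sample_bytes, width=4)
bigrams = opcode_ngrams(sample_opcodes, n=2)
```

A grid like `image` is the kind of input a convolutional network can consume, while counts like `bigrams` would typically be mapped onto a fixed vocabulary of N-grams to form the input vector of a feedforward network.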

The majority of the effort in this research went into gathering the N-gram features, a time- and resource-intensive process. Further tuning of the classifiers' parameters or structure could yield additional improvements to the system.


Copyright © 2016 Taylor University Computer Science and Engineering