Abstract

Perimeter-based detection is no longer sufficient for mitigating the threat posed by malicious software. This is evident as antivirus (AV) products are replaced by endpoint detection and response (EDR) products, the latter allowing visibility into live machine activity rather than relying on the AV to filter out malicious artefacts. This paper argues that detecting malware in real-time on an endpoint necessitates an automated response due to the rapid and destructive nature of some malware. The proposed model uses statistical filtering on top of a machine learning dynamic behavioural malware detection model in order to detect individual malicious processes on the fly and kill those which are deemed malicious. In an experiment to measure the tangible impact of this system, we find that fast-acting ransomware is prevented from corrupting 92% of files with a false positive rate of 14%. Whilst the false-positive rate currently remains too high to adopt this approach as-is, these initial results demonstrate the need for a detection model that is able to act within seconds of the malware execution beginning; a timescale that has not been addressed by previous work.

1. Introduction

Our increasingly digitised world broadens both the opportunities and motivations for cyberattacks, which can have devastating social and financial consequences [1]. Malicious software (malware) is one of the most commonly used vectors to propagate malicious activity and exploit code vulnerabilities.

Due to the huge number of new malware samples appearing each day, the detection of malware needs to be automated [2]. Signature-matching methods are not resilient enough to handle obfuscation techniques or to catch unseen malware types and, as such, automated methods of generating detection rules, such as machine learning, have been widely proposed [3–6]. These approaches typically analyse samples when the file is first ingested, either using static code-based methods or by observing dynamic behaviours in a virtual environment.

This paper argues that both of these approaches are vulnerable to evasion by the attacker. Static methods may be thwarted by simple code-obfuscation techniques, whether rules are hand-generated [7] or created using machine learning [8]. Dynamic detection in a sandboxed environment cannot continue forever: either it is time-limited [9] or it ends after some period of inactivity [10]. This fixed period allows attackers to inject benign activity during analysis and wait to carry out malicious activity once the sample has been deemed harmless and passed on to the victim’s environment. The pre-execution filtering of malware is the model used by antivirus, but this is insufficient to keep up with the ever-evolving malware landscape and has led to the creation of endpoint detection and response (EDR) products, which allow security professionals to monitor and respond to malicious activity on the victim machine. Real-time malware detection likewise monitors software live on the machine, thus capturing any malicious activity on the victim machine even if it was not evident during initial analysis. This paper proposes that, due to the fast-acting nature of some destructive malware, it is vital to have automated actions to support these detections once a threat is identified. In this paper, we investigate automated detection and killing of malicious processes for endpoint protection.

There are several key challenges to address in detecting malware on-the-fly on a machine in use, by comparison with detecting malicious applications that are detonated in isolation in a virtual machine. These are summarised below:
(1) Signal Separation: Detection in real time requires that malicious and benign activities are separated so that automated actions can be taken on only the malicious processes.
(2) Use of Partial Traces: In order to mitigate damage, malware needs to be detected as early as possible but, as shown in previous work [11], there is a trade-off between the amount of data collected and classification accuracy in the first few seconds of an application launching, and the same may be true for individual processes.
(3) Quick Classification: The inference itself should be as fast as possible in order to further limit the chance of malicious damage once the process is deemed malicious.
(4) Impact of Automated Killing in Supervised Learning: Supervised learning averages the error rate across the entire training set but, when the classification results in an action, this smoothing out of errors across the temporal dataset is not possible.

This paper seeks to address these key challenges and provides preliminary results, including a measure of “damage prevented” in a live environment for fast-acting destructiveware. As well as the results from these experiments, this paper contributes an analysis of computational resource consumption against detection accuracy for many of the most popular machine-learning algorithms used for malware detection.

The key contributions of this paper are as follows:
(i) The first general malware detection model to demonstrate damage mitigation in real-time using process detection and killing
(ii) Benchmarking of commonly used ML algorithm implementations with respect to computational resource consumption
(iii) Presentation of real-time malware detection against more user background applications than have previously been investigated, increasing from 5 to 36 (up to 95 simultaneous processes)

The next section outlines related work. Section 3 reports the three methodologies tested to address these challenges, and Section 4 explains the method used to evaluate these models. The experimental setup is described in Section 5, followed by the results in Section 6.

2. Related Work

2.1. Malware Detection with Static or Post-Collection Behavioural Traces
2.1.1. Static Sources

Machine learning models trained on static data have shown good detection accuracy. Chen et al. [5] achieved 96% detection accuracy using statically extracted sequences of API calls to train a Random Forest model. However, static data have been demonstrated to be quite vulnerable to concept drift [12, 13]. Adversarial samples present an additional emerging concern; Grosse et al. [14] and Kolosnjaji et al. [8] demonstrated that static malware detection models achieving over 90% detection accuracy could be thwarted by injecting code or simply altering the padded code at the end of a compiled binary, respectively.

2.1.2. Post-Collection Dynamic Data

Dynamic behavioural data are generated by the malware carrying out its functionality, and machine learning models have again been used to draw out patterns that distinguish malicious from benign software. Various dynamic data can be collected to describe malware behaviour. The most commonly used data are API calls made to the operating system, typically recorded in short sequences or by frequency of occurrence. Huang and Stokes’s research [3] reports the highest accuracy in recent malware detection literature, using a very large dataset of more than 6 million samples to achieve a detection rate of 99.64% with a neural network trained on the input parameters passed to API calls, their return values, and the co-occurrence of API calls. Other dynamic data sources include dynamic opcode sequences (e.g., Carlin et al. [9] achieved 99% using a Random Forest), hardware performance counters (e.g., Sayadi et al. [15] achieved 94% on Linux/Ubuntu malware using a decision tree), network and file system activity (e.g., Usman et al. [16] achieved 93% using a decision tree in combination with threat intelligence feeds and these data sources), and machine activity metrics (e.g., Burnap et al. [17] achieved 94% using a self-organising map). Previous work [18] demonstrated the robustness of machine activity metrics over API calls in detecting malware collected from different sources.

Dynamic detection is more difficult to obfuscate, but the time taken to collect data is typically several minutes, making it less attractive for endpoint detection systems. Some progress has been made on early detection of malware. Previous work [11] was able to detect malware with 94% accuracy within 5 seconds of execution beginning. However, as a sandbox-based method, malware which is inactive for the first 5 seconds is unlikely to be detected with this approach. Moreover, the majority of dynamic malware detection papers use virtualised environments to collect data.

2.2. Real-Time Malware Detection with Partial Behavioural Traces

Table 1 abbreviations: OS = operating system; HPCs = hardware performance counters; DT = decision tree; MLP = multilayer perceptron; NN = neural network; RF = random forest.

Previous work has begun to address the four challenges set out in the introduction. Table 1 summarises the related literature and the problems considered by the researchers.

To the best of our knowledge, challenge (1), signal separation, has only previously been addressed by Sun et al. [23] using sequential API call data. The authors execute up to 5 benign and malicious programs simultaneously, achieving 87% detection accuracy after 5 minutes of execution and 91% accuracy after 10 minutes of execution.

Challenge (2), detecting malware using partial traces as early as possible, has not been directly addressed. Some work has looked at early run-time detection; Das et al. [20] used an FPGA as part of a hybrid hardware-software approach to detect malicious Linux applications using system API calls, which are then classified using a multilayer perceptron. Their model was able to detect 46% of malware within the first 30% of its execution with a false-positive rate of 2% in offline testing. These findings, however, were not tested with multiple benign and malicious programs running simultaneously, and the authors do not explain the impact of detecting 46% of malware within 30% of its execution trace in terms of benefits to a user or the endpoint being protected: how long does it take for 30% of the malware to execute, and what has occurred in that time?

Greater attention has been paid to challenge (3), quick classification, insofar as this problem also encompasses the need for lightweight detection. Some previous work has proposed hardware-based detection for lightweight monitoring. Sayadi et al. [15] use hardware performance counters (HPCs) as features to train ensemble learning algorithms and scored 0.94 AUC using a dataset of 100 malicious and 100 benign Linux software samples. Ozsoy et al. [21] use low-level architectural events to train a multilayer perceptron on the more widely used [25] (and attacked) Windows operating system. The model was able to detect 94% of malware with a false-positive rate of 7% using partial execution traces of 10,000 committed instructions. Hardware-based detection models, however, are less portable than software-based systems, because the same operating system can run on a variety of hardware configurations.

Both Sun et al. [23] and Yuan [22] propose two-stage models to address the need for lightweight computation. The first stage comprises a lightweight ML model such as a Random Forest to alert suspicious processes, the second being a deep learning model which is more accurate but more computationally intensive to run. Two-stage models, as Sun et al. [23] note, can get stuck in an infinite loop of analysis in which the first model flags a process as suspicious but the second model deems it benign and this labelling cycle continues repeatedly. Furthermore, if the first model is prone to false negatives, malware will never be passed to the second model for deeper analysis.

Challenge (4), the impact of automated actions, has been discussed by Sun et al. [23], who also propose the two-stage approach as a solution to this problem. The authors apply restrictions to the process whilst the deeper NN analysis takes place, followed by the killing of processes labelled malicious. They found that the delaying strategy impacted benignware more than malware and used this two-stage process to account for the irreversibility of the decision to kill a process. The authors did not assess the impact on the endpoint with respect to the time at which the correctly classified malware was terminated.

3. Methodology-Three Approaches

As noted above, supervised learning models average errors across the training set, but in the case of real-time detection and process killing, a single false positive on a benign process amongst 300 true negatives would cause disruption to the user. The time at which malware is detected is also important: the earlier, the better. Therefore, the supervised learning model needs to be adapted to take account of these new requirements.

Tackling this issue was attempted in three different ways, and all three are reported here in the interests of reporting negative results as well as the approach which performed best. These were:
(1) Statistical methods to smooth the alert surface and filter out single false positives
(2) Reinforcement learning, which is capable of incorporating the consequences of model actions into learning
(3) A regression model based on the feedback of a reinforcement learning model, made possible by having the ground-truth labels

Figure 1 gives a high-level depiction of the three approaches tested in this paper.

3.1. Statistical Approach: Alert Filtering

It is expected that transitioning from a supervised learning model to a real-time model will see a rise in false positives, since a single alert means that a benign process (and all of its child processes) is terminated, which effectively renders all of that process's future data points as false positives. Filtering the output of the models, just as the human brain filters out transient electrical impulses in order to separate background noise from relevant data [26], may be sufficient to make supervised models into suitable agents. This is attractive because supervised learning models are already known to perform well for malware detection, as confirmed by previous work [11, 20, 27, 28]. A disadvantage of this approach is that it introduces additional memory and computational requirements, both to calculate the filtered results and to track current and historic scores; therefore, a model which integrates the expected consequences of an action into learning is also tested: reinforcement learning.

3.2. Reinforcement Learning: Q-Learning with Deep Q Networks

The proposed automated killing model may be better suited to a reinforcement learning strategy than to supervised learning. Reinforcement learning uses rewards and penalties from the model’s environment. The problem that this paper is seeking to solve is essentially a supervised learning problem, but one for which it is not possible to average predictions. There are no opportunities to classify the latter stages of a process if the agent kills the process, and this can be reflected by the reward mechanism of the reinforcement learning model (see Figure 1). Therefore, reinforcement learning seems like a good candidate for this problem space.

Two limitations of this approach are that (1) reinforcement learning models can struggle to converge on a balanced solution, and the models must learn to balance the exploration of new actions with the re-use of known high-reward actions, commonly known as the exploration-exploitation trade-off [29]; and (2) in these experiments, the reward is based on the malware/benignware label at the application level rather than being linked to the actual damage being caused, so the signal is a proxy for what the model should be learning. This is used because, as discussed above, the damage caused by different malware is subjective.

For reinforcement learning, loss functions are replaced by reward functions which update the neural network weights to reinforce actions (in context) that lead to higher rewards and discourage actions (in context) that lead to lower rewards; these contexts and actions are known as state-action pairs. Typically, the reward is calculated from the perceived value of the new state that the action leads to, e.g., points scored in a game. Often this cannot be pre-labelled by a researcher since there are so many (maybe infinite) state-action pairs. However, in this case, all possible state-action pairs can be enumerated, which is the third approach tested (regression model, outlined in the next section).

The reinforcement model was still tested. Here the reward is $+n$ for a correct prediction and $-n$ for an incorrect prediction, where $n$ is the total number of processes impacted by the prediction. For example, if there is only one process in a process tree but 5 more will appear over the course of execution, a correct prediction gives a reward of $+6$ and an incorrect prediction gives a reward of $-6$.
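As a minimal illustration of this reward scheme (the function and variable names below are ours, not the paper's implementation), the reward for a single prediction could be computed as:

```python
def prediction_reward(correct: bool, n_processes_impacted: int) -> int:
    """Reward is +n for a correct prediction and -n for an incorrect one,
    where n is the total number of processes impacted by the prediction."""
    return n_processes_impacted if correct else -n_processes_impacted

# Example from the text: one process now, five more appear later (n = 6).
assert prediction_reward(True, 6) == 6
assert prediction_reward(False, 6) == -6
```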

There are a number of reinforcement learning algorithms to choose from. This paper explores q-learning [30–33] to approximate the value or “quality” (q) of a given action in a given situation. Q-learning approximates q-tables, which are look-up tables of every state-action pair and their associated rewards. A state-action pair is a particular state in the environment coupled with a particular action, e.g., the machine metrics of the process at a given point in time coupled with the action to leave the process running. When the number of state-action pairs becomes large, it is easier to approximate the value using an algorithm. Deep Q networks (DQNs) are neural networks that implement q-learning and have been used in state-of-the-art reinforcement learning arcade game playing; see Mnih et al. [34]. A DQN was the reinforcement algorithm trialled here; although it did not perform well by comparison with the other methods, a different RL algorithm may perform better [35], and the results are included in the interests of future work. The following paragraphs explain some of the key features of the DQN.

The DQN tries out some actions; stores the states, actions, resulting states, and rewards in memory; and uses these to learn the expected rewards of each available action, with the action of highest expected reward being the one chosen. Neural networks are well-suited to this problem since their parameters can easily be updated incrementally; tree-based algorithms like random forests and decision trees can be adapted to this end, but not as easily. Future rewards can be built into the reward function and are usually discounted according to a tuned parameter, usually signified by $\gamma$.

In Mnih et al.’s [34] formulation, in order to address the exploration-exploitation trade-off, DQNs either exploit a known action or explore a new one, with the chance of choosing exploration falling over time. When retraining the model based on new experiences, there is a risk that previously learned useful behaviours are lost; this problem is known as catastrophic forgetting [36]. Mnih et al.’s [34] DQNs use two tools to combat this problem. First, experience replay, by which past state-action pairs are shuffled before being used for retraining, so that the model does not catastrophically forget. Second, DQNs utilise a second (target) network, which is updated at infrequent intervals in order to stabilise learning.
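The sketch below illustrates these two stabilisation mechanisms, experience replay and a separate, infrequently updated target network, in a generic PyTorch DQN training loop. It is a simplified illustration only: the network size, hyperparameters, feature count, and helper names are assumptions and do not reflect the exact configuration used in these experiments.

```python
import random
from collections import deque

import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS, GAMMA = 26, 2, 0.99     # actions: 0 = leave running, 1 = kill

def make_net() -> nn.Sequential:
    return nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

policy_net = make_net()                         # network being trained
target_net = make_net()                         # infrequently updated copy used for stable targets
target_net.load_state_dict(policy_net.state_dict())
optimiser = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                   # experience replay buffer of (s, a, r, s', done)

def select_action(state: torch.Tensor, epsilon: float) -> int:
    """Epsilon-greedy selection: explore a random action with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(policy_net(state).argmax())

def train_step(batch_size: int = 32) -> None:
    """Sample shuffled past experiences (experience replay) and update the policy network."""
    if len(replay) < batch_size:
        return
    states, actions, rewards, next_states, dones = zip(*random.sample(replay, batch_size))
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions, rewards = torch.tensor(actions).long(), torch.tensor(rewards).float()
    dones = torch.tensor(dones).float()         # killed processes are terminal states
    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                       # targets come from the slowly updated target network
        target = rewards + GAMMA * target_net(next_states).max(1).values * (1 - dones)
    loss = nn.functional.mse_loss(q, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

# Periodically (e.g., every few thousand steps) copy the policy weights into the target network:
# target_net.load_state_dict(policy_net.state_dict())
```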

Q-learning may enable a model to learn when it is confident enough to kill a process, using the discounted future rewards. For example, choosing not to kill some malware at time $t$ may have some benefit, as it allows the model to see more behaviour at $t+1$, which gives the model greater confidence that the process is in fact malicious.

Q-learning approximates rewards from experience, but in this case, all rewards from state-action pairs can actually be pre-calculated. Since one of the actions will kill the process and thus end the “experience” of the DQN, it could be difficult for this model to gain enough experience. Thus pre-calculation of rewards may improve the breadth of experience of the model. For this reason, a regression model is proposed to predict the Q-value of a given action.

3.3. Regression Using Q-Values

Unlike classification problems, regression problems can predict a continuous value rather than discrete (or probabilistic) values relating to a set of output classes. Regression algorithms are proposed here to predict the q-value of killing a process. If this value is positive, the process is killed.

Q-values estimate the value of a particular action based on the “experience” of the agent. Since the optimal action for the agent is always known, it is possible to precompute the “(q-) value” of killing a process and train various ML models to learn this value. It is typically quicker to train a regression model which tries to learn the value of killing a process than to train a DQN, which must explore the state-action space and calculate rewards between learning steps, since that interaction and reward calculation are no longer necessary. The regression approach can be used with any machine learning algorithm capable of learning a regression problem, regardless of whether it is capable of partial training.

There are two primary differences between this regression approach and the reinforcement learning DQN approach detailed in the previous section. Firstly, the training data are likely to differ. Since the DQN generates training data through interacting with its environment, it may never see certain parts of the state-action space; for example, if a particular process is always killed during training before time $t$, the model is not able to learn from the process data after $t$.

Secondly, only the expected value of killing is modelled by the regressor, whereas the DQN tries to predict the value of both killing and of not killing the process. This means that the equation used to model the value of process killing is only an approximation of the reward function used by the DQN.

The equation used to calculate the value of killing is positive for malware and negative for benignware; in both cases, it is scaled by the number of child processes impacted and, in the case of malware, early detection increases the value of process killing (with an exponential decay). Let $y$ be the true label of the process (0 = benign, 1 = malicious), $c$ the number of child processes impacted, and $t$ the time in seconds at which the process is killed; then, the value of killing a process is:

$$v(y, c, t) = (2y - 1)(1 + c) + y\,e^{-t}.$$

The equation above negatively scores the killing of benignware in proportion to the number of subprocesses and scores the killing of malware positively in proportion to the number of subprocesses. A bonus reward is scored for killing malware early, with an exponential decay over time.
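A small sketch of a kill-value function consistent with the description above (the function name is ours and the exact scaling constants used in the paper's implementation may differ):

```python
import math

def kill_value(is_malware: bool, n_child_processes: int, t_seconds: float) -> float:
    """Value of killing a process: negative for benignware, positive for malware,
    scaled by the number of child processes impacted, with an exponentially
    decaying bonus for killing malware early."""
    scale = 1 + n_child_processes
    if is_malware:
        return scale + math.exp(-t_seconds)    # earlier kills earn a larger bonus
    return -scale

# Killing malware with 2 child processes 1 second in is worth more than at 10 seconds.
assert kill_value(True, 2, 1.0) > kill_value(True, 2, 10.0)
```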

4. Evaluation Methodology: Ransomware Detection

Previous research has not addressed the extent to which damage is mitigated by process killing: Sun et al. [23] presented the only previous work to test process killing, and damage with and without process killing was not assessed there. To this end, this paper uses ransomware as a proxy for malicious damage, inspired by Scaife et al.’s approach [24]. A brief overview of Scaife et al.’s damage measurement is given below:

Early detection is particularly useful for types of malware from which recovery is difficult and/or costly. Cryptographic ransomware encrypts user files and withholds the decryption key until a ransom is paid to the attackers. This type of attack is typically costly to remedy, even if the victim is able to carry out data recovery [37]. Scaife et al.’s work [24] on ransomware detection uses features from file system data, such as whether the contents appear to have been encrypted and the number of changes made to the file type. The authors were able to detect and block all of the 492 ransomware samples tested with less than 33% of user data being lost in each instance. Continella et al. [38] propose a self-healing system, which detects malware using file system machine activity (such as read/write file counts); the authors were able to detect all 305 ransomware samples tested, with a very low false-positive rate. These two approaches use features selected specifically for their ability to detect ransomware, but this requires knowledge of how the malware operates, whereas the approach taken here seeks to use features which can be used to detect malware in general. The key purpose of this final experiment (Section 6.5) is to show that our general model of malware detection is able to detect general types of malware as well as time-critical samples such as ransomware.

5. Experimental Setup

This section outlines the data capture process and dataset statistics.

5.1. Features

The same features as were used in previous work [11] are used here for process detection, with some additional features to measure process-specific data. Despite the popularity of API calls as a data source, they were not used as features to train the model, owing to the findings in Ref. [18] on the relative robustness of machine activity metrics and to Sun et al.’s [23] difficulties hooking API call data in real time.

At the process level, 26 machine metric features are collected; these were dictated by the attributes available through the Psutil [39] Python library. It is also possible to include the “global” machine activity metrics that were used in previous work. Although global metrics will not provide process-level granularity, they may give muffled indications of the activity of a wider process tree. The 9 global metrics are: system-level CPU use, user-level CPU use, memory use, swap memory use, number of packets received and sent, number of bytes received and sent, and the total number of processes running.

The process-level machine activity metrics collected are: CPU use at the user level, CPU use at the system level, physical memory use, swap memory use, total memory use, number of child processes, number of threads, maximum process ID from a child process, disk read, write, and other I/O counts, bytes read, written, and used in other I/O operations, process priority, I/O process priority, number of command line arguments passed to the process, number of handles being used by the process, time since the process began, TCP packet count, UDP packet count, number of connections currently open, and 4 port statuses of those opened by the process (see Table 2).
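For illustration, a subset of these process-level metrics can be polled with Psutil roughly as shown below. This is a sketch only: the full feature set, polling interval, and error handling used in the experiments are not shown, the helper name is ours, and Windows-only calls such as num_handles() are omitted for portability.

```python
import time
import psutil

def poll_process(pid: int) -> dict:
    """Collect a subset of the per-process machine activity metrics."""
    p = psutil.Process(pid)
    with p.oneshot():                           # batch the underlying system calls
        cpu = p.cpu_times()
        mem = p.memory_info()
        io = p.io_counters()
        return {
            "cpu_user": cpu.user,
            "cpu_system": cpu.system,
            "memory_rss": mem.rss,
            "num_threads": p.num_threads(),
            "num_children": len(p.children(recursive=True)),
            "io_read_count": io.read_count,
            "io_write_count": io.write_count,
            "io_read_bytes": io.read_bytes,
            "io_write_bytes": io.write_bytes,
            "num_connections": len(p.connections()),
            "num_cmdline_args": len(p.cmdline()),
            "seconds_running": time.time() - p.create_time(),
        }
```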

5.1.1. Preprocessing

Feature normalisation is necessary for NNs to avoid over-weighting features with higher absolute values. The test, train, and validation sets ($X$) are all normalised by subtracting the mean ($\mu$) and dividing by the standard deviation ($\sigma$) of each feature in the training set: $X' = (X - \mu)/\sigma$. This sets the range of input values largely between −1 and 1 for all input features, avoiding the potential for some features to be weighted as more important than others during training purely due to the scalar values of those features. This requires additional computational resources and is not necessary for all ML algorithms; this is another reason why the supervised RNN used in Ref. [11] may not be well-suited to real-time detection.
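A minimal sketch of this standardisation, with the statistics fitted on the training set only (function names are ours):

```python
import numpy as np

def fit_scaler(X_train: np.ndarray):
    """Compute per-feature mean and standard deviation on the training set only."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    return mu, np.where(sigma == 0, 1.0, sigma)   # guard against constant features

def transform(X: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Apply the training-set statistics to train, validation, and test data alike."""
    return (X - mu) / sigma
```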

5.2. Data Capture

During data capture, this research sought to improve upon previous work and emulate real machine use to a greater extent than has previously been trialled. The implementation details of the VM, simultaneous process execution, and RL simulation are outlined below:

5.2.1. Environment: Machine Setup

The following experiments were conducted using a virtual machine (VM) running with Cuckoo Sandbox [40] for ease of collecting data and restarting between experiments and because the Cuckoo Sandbox emulates human interaction with programs to some extent to promote software activity. In order to emulate the capabilities of a typical machine, the modal hardware attributes of the top 10 “best seller” laptops according to a popular Internet vendor [41] were used, and these attributes were the basis of the VM configuration. This resulted in a VM with 4GB RAM, 128GB storage, and dual-core processing running Windows 7 64 bit. Windows 7 was the most prevalent computer operating system (OS) globally at the time of designing the experiment [25]. Although Windows 10 is now the most popular OS, the findings in this research should still be relevant.

5.2.2. Simultaneous Applications

In typical machine use, multiple applications run simultaneously. This is not reflected by behavioural malware analysis research, in which samples are injected individually into a virtual machine for observation. The environment used for the following experiments launches multiple applications on the same machine at slightly staggered intervals, as if a user were opening them. Each malware sample is launched both with a small number (1–3) and with a larger number (3–35) of applications. It was not possible to find up-to-date user data on the number of simultaneous applications running on a typical desktop, so here it was elected to launch up to 36 applications (35 benign + 1 malicious) at once, which is the largest number of simultaneous applications for real-time data collection to date. From the existing real-time analysis literature, only Sun et al. [23] run multiple applications at the same time, with a maximum of 5 running simultaneously.

Each application may in turn launch multiple processes, causing more than 35 processes to run at once; 95 is the largest number of simultaneous processes recorded; this excludes background OS processes.

5.2.3. Reinforcement Learning Simulation

For reinforcement learning, the DQN requires an observation of the resulting state following an action. To train the model, a simulated environment is created from the pre-collected training data, whereby the impact of killing or not killing a process is returned as the next state. For process-level features, killing reduces all values to zero. A caveat here is that, in reality, killing the process may not take effect immediately, and therefore memory, processing power, etc., may still be being consumed at the next data observation. For global metrics, the process-level values for the killed processes (including child processes of the killed process) are subtracted from the global metrics. There is again a risk that this calculation may not correlate perfectly with what would be observed in a live machine environment.
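A sketch of the simulated effect of a kill action on the pre-collected traces, under the simplifying assumptions described above (function and variable names are ours):

```python
import numpy as np

def simulate_kill(process_features: np.ndarray, global_features: np.ndarray,
                  tree_contribution: np.ndarray):
    """Return the next simulated state after killing a process tree.

    process_features : per-process metrics of the killed process (dropped to zero)
    global_features  : machine-wide metrics at the next observation
    tree_contribution: the killed tree's share of those global metrics
    """
    next_process = np.zeros_like(process_features)
    # Assumes the kill takes effect immediately, which may not hold on a live machine.
    next_global = np.maximum(global_features - tree_contribution, 0)
    return next_process, next_global
```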

In order to observe the model performance, a visualisation was developed to accompany the simulated environment. Figures 2 and 3 show screenshots of the environment visualisation for one malicious and one benign process.

5.3. Dataset

The dataset comprises 3,604 benign executables and 2,792 malicious applications (each containing at least one executable), with 2,877 for training and validation and 3,519 for testing. These dataset sizes are consistent with previous real-time detection dataset sizes (Das et al. [20] use 168 malicious, 370 benign; Sayadi et al. [15] use over 100 each benign and malicious; Ozsoy et al. [21] use 1,087 malicious and 467 benign; Sun et al. [23] use 9,115 malicious, 877 benign). With multiple samples running concurrently to simulate real endpoint use, there are 24K processes in the training set and 34K in the test set. Overall, there are 58K behavioural traces of processes in the training and testing datasets. The benign samples comprise files from VirusTotal [42], from free software websites (later verified as benign with VirusTotal), and from a fresh Microsoft Windows 7 installation. The malicious samples were collected from two different VirusShare [43] repositories.

In Pendlebury et al.’s analysis [13], the authors estimate that in the wild between 6% and 22% of applications are malicious, normalising to 10% for their experiments. Following this estimation (made for Android malware), a similar ratio was used in the test set, in which 13.5% of samples were malicious.

5.3.1. Malware Families

Table 3 abbreviations: PUA = potentially unwanted application; RAT = remote access trojan.

This paper is not concerned with distinguishing particular malware families, but rather with identifying malware in general. However, a dataset consisting of just one malware family would present an unrealistic and easier problem than is found in the real world. The malware families included in this dataset are reported in Table 3. The malware family labels are derived from the output of around 60 antivirus engines used by VirusTotal [42].

Ascribing family labels to malware is nontrivial since antivirus vendors do not follow standardised naming conventions and many malware families have multiple aliases. Sebastián et al. [44] have developed an open source tool, AVClass, to extract meaningful labels and correlate aliases between different antivirus outputs. AVClass was used to label the malware in this dataset. Sometimes there is no consensus amongst the antivirus engines’ outputs, or the sample is not recognised as a member of an existing family. AVClass also excludes malware that belongs to very broad classes (e.g., “agent,” “eldorado,” and “artemis”), as these are likely to comprise a wide range of behaviours and may be applied as a default label in cases for which antivirus engines are unsure. In the dataset established in this research, 2,121 of the 2,792 samples were assigned to a malware family. Table 3 gives the number of samples in each family for which more than 10 instances were found in the dataset. 315 families were detected overall, with 27 families being represented more than 10 times. These better-represented families persist in both the train and test sets, but the other families have little overlap: 104 of the 154 other families seen in the test set are not identified by AVClass as being present in the training set.

5.3.2. Malicious Vs. Benign Behaviour

Statistical inspection of the training set reveals that benign applications have fewer subprocesses than malicious applications, with 1.17 processes in the average benign process tree and 2.33 in the average malicious process tree. Malware was also more likely to spawn processes outside of the process tree of the root process, often using the names of legitimate Windows processes. In some cases, malware launches legitimate applications, such as Microsoft Excel, in order to carry out a macro-based exploit. Although Excel is not a malicious application in itself, it is malicious in this context, which is why a process is labelled malicious if a malware sample caused it to come into being. It is therefore possible to argue that some processes launched by malware are not themselves malicious, because they do not individually cause harm to the endpoint or user; but without the malware they would not be running, and so they can be considered at least undesirable, even if only in the interests of conserving computational resources.

5.3.3. Train-Test Split

The dataset is split in half with the malicious samples in the test set coming from the more recent VirusShare repository, and those in the training set from the earlier repository. This is to increase the chances of simulating a real deployment scenario in which the malware tested contains new functionality by comparison with those in the training set.

Ideally, the benignware should also be split by date across the training and test sets; however, it is not a trivial task to establish the date at which benignware was compiled. It is possible to extract the compile time from the PE header, but the PE author can manually set this field, which had clearly happened in some instances where the compile date was 1970-01-01 or, in one case, 1970-01-16. In the latter case, the file is first mentioned online in 2016, perhaps indicating a typographic error [45]. Internet sources such as VirusTotal [42] can give an indication of when software was first seen, but if the file is not very suspicious, i.e., it comes from a reputable source, it may not have been uploaded until years after it was first seen “in the wild.” Due to the difficulty of dating the benignware in the dataset collected for this research, samples were assigned to the training or test set randomly.

For training, an equal number of benign and malicious processes are selected, so that the model does not bias towards one class. 10% of these are held out for validation. In most ML model evaluations, the validation set would be drawn from the same distribution as the test set. However, because it is important not to leak any information about the malware in the test set, since it is split by date, the validation set here is drawn from the training distribution.

5.3.4. Implementation Tools

Data collection used the Psutil [39] Python library to collect machine activity data for running processes and to kill those processes deemed malicious. The RNN and Random Forests were implemented using the Pytorch [46] and Scikit-Learn [47] Python libraries, respectively. The model runs with high priority and administrator rights to make sure the polling is maintained when compute resources are scarce.
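As an illustration of the response mechanism, a process flagged as malicious can be terminated together with its children using Psutil roughly as follows (a sketch only; the function name is ours and the deployed system's error handling may differ):

```python
import psutil

def kill_process_tree(pid: int) -> None:
    """Kill a process deemed malicious along with all of its child processes."""
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return                                   # process already exited
    for proc in parent.children(recursive=True) + [parent]:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass                                 # child exited before we reached it
```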

6. Experimental Results

6.1. Supervised Learning for Process Killing

First, we demonstrate the unsuitability of a full-trace supervised learning malware detection model, which achieved more than 96% detection accuracy in Ref. [11]. The model used is a gated recurrent unit recurrent neural network, since this algorithm is designed to process time-series data. The hyperparameter configuration of this model was selected using a random search (see Table 4 in the Appendix for details).

It is expected that supervised malware detection models will not adapt well to process-killing due to the averaging of loss metrics as described earlier. Initially, this is verified by using supervised learning models to kill processes that are deemed malicious. For supervised classification, the model makes a prediction every time a data measurement is taken from a process. This approach is compared with one taking average predictions across all measurements for a process and for a process tree as well as the result of process killing. The models with the highest validation accuracy for classification and killing are compared.

Figure 4 illustrates the difference in validation set and test set F1-score, true-positive rate, and false-positive rate for these 4 levels of classification: each measurement, each process, each process tree, and finally showing process killing; see Figure 5 for diagrammatic representation of these first 3 levels. Table 5 reports the F1, TPR, and TNR for classification (each measurement of each process) and for process killing.

The highest F1-score on the validation set is achieved by an RNN using process data only. When process killing is applied, there is a drop of less than 5 percentage points in the F1-score, but more than 15 percentage points are lost from the TNR.

On the unseen test set, the highest F1-score is achieved by an RNN using process data + global metrics, but the improvement over the process data + total number of processes is negligible. Overall, there is a reduction in F1-score from (97.44, 94.61) to (74.91, 77.66), highlighting the initial challenge of learning to classify individual processes rather than entire applications, especially when accounting for concept drift. Despite the low accuracy, these initial results indicate that the model is discriminating some of the samples correctly and may form a baseline from which to improve.

The test set TNR and TPR for classification on the best-performing model (process data only) are 79.70 and 82.91, respectively, but when process killing is applied, although the F1-score drops by 10 percentage points, the TNR and TPR move in opposite directions with the TNR falling to 59.63 and TPR increasing to 90.24. This is not surprising since a single malicious classification results in a process being classed as malicious. This is true for the best-performing models using either of the two feature sets (see Figure 4).

6.2. Accuracy Vs. Resource Consumption

Previous work on real-time detection has highlighted the requirement for a lightweight model (speed and computational resources). In the previous paper, RNNs were the best performing algorithm in classifying malware/benignware, but RNNs have many parameters and therefore may consume significant RAM and/or CPU. They also require preprocessing of the data to scale the values, which other ML algorithms such as tree-based algorithms do not. Whilst RAM and CPU should be minimised, taking model accuracy into account, inference duration is also an important metric.

Although the models in this paper have not been coded for performance and use common Python libraries, comparing these metrics helps to decide whether certain models are vastly preferable to others with respect to computational resource consumption. The PyRAPL library [49] is used to measure the CPU, RAM, and duration used by each model. This library uses Intel processor “Running Average Power Limit” (RAPL) metrics. Only data preprocessing and inference are measured, as training may be conducted centrally in a resource-rich environment. Batch sizes of 1, 10, 100, and 1000 samples are tested with both 26 and 37 features, since there are 26 process-level features and 37 when global metrics are included. Each model is run 100 times for each of the different batch sizes.
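As a sketch of the measurement harness, each preprocessing-plus-inference batch can be wrapped in a pyRAPL measurement window roughly as follows (based on pyRAPL's documented begin/end interface; the helper name is ours and result attributes may differ between library versions):

```python
import pyRAPL

pyRAPL.setup()                                   # requires an Intel CPU exposing RAPL counters

def measure_inference(model, batch, label="inference"):
    """Measure energy and duration of preprocessing plus inference for one batch."""
    meter = pyRAPL.Measurement(label)
    meter.begin()
    model.predict(batch)                         # any preprocessing would also sit inside this window
    meter.end()
    return meter.result                          # duration and per-domain energy readings
```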

For the RNN, a “large” and a “small” model are included. The large models have the highest number of parameters tested in the random search (981 hidden neurons, 3 hidden layers, sequence length of 17) and the smallest (41 neurons, 1 hidden layer, sequence length of 13). These two RNN configurations are compared against other machine learning models which have been used for malware detection: Multi-Layer Perceptron (feed-forward neural network), Support Vector Machine, Naive Bayes Classifier, Decision Tree Classifier, Gradient Boosted Decision Tree Classifier (GBDTs), Random Forest, and AdaBoost.

Table 6 note: 26 features = process-level features only; 37 features = process-level and machine-level (global) features.

Table 6 reports the computational resource consumption and accuracy metrics together. The decision tree with 37 features is the lowest cost to run. The RNN performs best at supervised learning classification on the validation set, but only just outperforms the decision tree with 26 features, which is the best-performing model at process killing on the validation set with a 92.97 F1-score. The highest F1-score for process killing uses a Random Forest with 37 features, scoring 77.85, which is 2 percentage points higher than the RF with 26 features (75.97). The models all perform at least 10 percentage points better on the validation set than on the test set, indicating the importance of taking concept drift into account when validating models.

6.3. How to Solve a Problem like Process Killing?

From the results above, it is clear that supervised learning models see a significant drop in classification accuracy when processes are killed as the result of a malicious label. This confirmation of the initial hypothesis justifies the need to examine alternative methods. In the interests of future work and negative-result reporting, this paper reports all of the methods attempted and finds that simple statistical manipulations of the supervised learning model outputs perform better than alternative training methods. This section briefly describes the logic of each method and provides a textual summary of the results, with a formula where appropriate, followed by a table of the numerical results for each method. In the following, let $P$ be the set of processes in a process tree, let $t$ be the time at which a prediction is made, and let $\hat{y}_{p,t}$ be the prediction for process $p \in P$ at time $t$, where a prediction equal to or greater than 1 classifies the process as malware.

6.3.1. Mean Predictions

Reasoning: Taking the average prediction across the whole process trace will smooth out single false positives that would otherwise trigger process killing.

Not tested. This was not attempted for two reasons: (1) taking the mean at the end of the process means the damage is already done; (2) this method can easily be manipulated by an attacker, since 50 seconds of injected benign activity would require a further 50 seconds of malicious activity before a true positive could be raised.

6.3.2. Rolling Mean Predictions

Reasoning: Taking the average over a few measurements will eliminate false positives that are caused by a single erroneous prediction over a subset of the execution trace. Window sizes of 2 to 5 are tested. Let $w$ be the window size; the filtered prediction for process $p$ at time $t$ is the rolling mean $\frac{1}{w}\sum_{i=t-w+1}^{t}\hat{y}_{p,i}$, and the process is killed when this mean meets the malware threshold defined above (i.e., is equal to or greater than 1).

Summary of results: A small but across-the-board increase in F1-score using a rolling window over 2 measurements on the validation set. Using a rolling window of size 2 on the test set saw a 10 to 20 percentage point increase in true-negative rate (to a maximum of 80.77), with 3 percentage points lost from the true-positive rate. This was one of the most promising approaches.
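A sketch of the rolling-mean filter over per-measurement predictions (the class and method names are ours; binary per-measurement predictions and the kill threshold defined above are assumed):

```python
from collections import deque

class RollingMeanFilter:
    """Kill only when every prediction in the last w measurements flags malware,
    i.e., the rolling mean of binary predictions reaches the threshold of 1."""

    def __init__(self, window: int = 2):
        self.window = window
        self.history = {}                        # process id -> recent predictions

    def should_kill(self, pid: int, prediction: int) -> bool:
        buf = self.history.setdefault(pid, deque(maxlen=self.window))
        buf.append(prediction)                   # prediction is 0 (benign) or 1 (malicious)
        return len(buf) == self.window and sum(buf) / self.window >= 1
```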

6.3.3. Alert Threshold

Reasoning: Like the rolling mean, single false positives will be eliminated, but unlike the rolling mean, the alerts are cumulative over the entire trace, such that an alert at the start of the process and another 30 seconds in will cause the process to be killed, rather than requiring that both alerts fall within a window of time. Minimum alert thresholds of between 2 and 5 are tested.

Summary of results: Again, a small increase across all models, with an optimal minimum number of alerts being 2 for maximum F1-score, competitive with the rolling mean approach.
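A sketch of the alert-threshold filter, which keeps a cumulative count of malicious predictions per process rather than a sliding window (again, the class and method names are ours and binary per-measurement predictions are assumed):

```python
from collections import defaultdict

class AlertThresholdFilter:
    """Kill once a process has accumulated at least min_alerts malicious
    predictions anywhere in its trace, however far apart they occur."""

    def __init__(self, min_alerts: int = 2):
        self.min_alerts = min_alerts
        self.alert_counts = defaultdict(int)

    def should_kill(self, pid: int, prediction: int) -> bool:
        if prediction == 1:
            self.alert_counts[pid] += 1
        return self.alert_counts[pid] >= self.min_alerts
```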

6.3.4. Process-Tree Averaging

Reasoning: The data are labelled at the application level; therefore, the average predictions across the process tree should be considered for classification

Summary of results: Negligible performance increase on validation and test set data (less than 1 percentage point). This is likely because few samples have more than one process executing simultaneously.

6.3.5. Process-Tree Training

Reasoning: The data are labelled at the application level; therefore, the sum of resources of each process tree should be classified at each measurement, not the individual processes.

Summary of results: Somewhat surprisingly, there was a slight reduction in classification accuracy when using process tree data. One explanation for this may be that the process tree creates noise around the differentiating characteristics that are visible at the process level.

6.3.6. DQN

Reasoning: Reinforcement learning is designed for state-action space learning. Both pre-training the model with a supervised learning approach and not pre-training the model were tested.

Summary of results: Poor performance, typically converging to either killing everything or killing nothing; of the few models that did not converge to a single dominant action, none distinguished malware from benignware well, indicating that they may not have learned anything useful. Reinforcement learning may yet help the problem of real-time malware detection and process killing, but this initial implementation of a DQN did not converge to a better, or even competitive, solution compared with supervised learning. Perhaps a better formulation of rewards (e.g., damage prevented) would help the agent learn.

6.3.7. Regression on Predicted Kill Value

Reasoning: The DQN explores and exploits different state-action pairs and their associated rewards; but when the reward from each action is known in advance and the training set is limited, as it is here, Q-learning can be framed as a regression problem in which the model tries to learn the return (reward plus discounted future rewards). Training is then faster and can be carried out by any regression-capable algorithm. Let $c_{p,t}$ be the number of current and future child processes of process $p$ at time $t$; the regression target for each measurement is the kill value defined in Section 3.3, computed from $c_{p,t}$ and the measurement time.

Summary of results: Improved performance on the true-negative rate, although this is not perceptible for the highest-scoring F1 models: since the F1-score rewards true positives more than true negatives, this metric can fail to reflect a balance between the true-positive and true-negative rates. The models with the highest true-negative rates are all regression models.
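As a sketch of this regression formulation, each per-measurement feature vector is paired with its precomputed kill value and any regression-capable algorithm is trained to predict it; a process is killed when the predicted value is positive. A Random Forest is shown for illustration; the hyperparameters and helper names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_kill_value_regressor(X_train: np.ndarray, kill_values: np.ndarray):
    """X_train: one row of machine metrics per process measurement.
    kill_values: precomputed value of killing at that measurement (positive for
    malware, negative for benignware, as defined in Section 3.3)."""
    model = RandomForestRegressor(n_estimators=100)
    model.fit(X_train, kill_values)
    return model

def should_kill(model, features: np.ndarray) -> bool:
    # Kill the process only when the predicted value of killing is positive.
    return model.predict(features.reshape(1, -1))[0] > 0
```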

Table 7 lists the F1, TPR, and TNR on the validation and test sets for each of the methods described above. The best-performing model on the test and validation sets is reported, and the full results can be found in Appendix Tables 8–10. Small improvements are made by some models on the validation F1-score, but the test set F1-score improves by 4 percentage points in the best instance.

In most cases, the models with the highest F1-score on the validation and test sets are not the same. The highest F1-score is 81.50 from an RF using a minimum alert threshold of 2 and both process-level and global process metrics.

6.4. Further Experiment: Favouring High TNR

Although the proposed model is motivated by the desire to prevent malware from executing, the best TNR reported amongst the models above is 81.50%. 20% of benign processes being killed would not be acceptable to a user. Whilst this research is a novel attempt at very early-stage real-time malware detection and process killing, one might consider the usability and prefer a model with a very high TNR, even if this is at the expense of the TPR.

Considering this, the AdaBoost regression algorithm achieves a 100% TNR with a 39.50% TPR on the validation set. The high TNR is retained in the test set, standing at 97.92%, but the TPR drops even further to just 8.40%. The GBDT also uses regression to estimate the value of process killing and, coupled with a minimum of 4 alerts, performs well on the test set but does not stand out on the validation set; see Table 11.

Although less than 10% of the malicious processes in the test set are killed by the AdaBoost regressor, this model may be the most viable despite the low TPR. Future work may examine the precise behaviour and harm caused by the malware that is/is not detected. To summarise the results, the most-detected families were Ekstak (180 processes), Mikey (80), Prepscram (53), and Zusy (49) of the 745 total samples.

6.5. Measuring Damage Prevention in Real Time

Although a high percentage of processes are correctly identified as malicious by the best performing model (RF with 2 alerts and 37 features), it may be that the model detects the malware after it has already caused damage to the endpoint. Therefore, instead of looking at the time at which the malware is correctly detected, a live test was carried out with ransomware to measure the percentage of files corrupted with and without the process killing model working. This real-time test also assesses whether malware can indeed be detected in the early stages of execution or whether the data recording, model inference, and process killing is too slow in practice to prevent damage.

Ransomware is the broad term given to malware that prevents access to user data (often by encrypting files) and withholds the means of restoring the data (usually a decryption key) until a ransom is paid. It is possible to quantify the damage caused by ransomware using the proportion of modified files, as Scaife et al. [24] have done in developing a real-time ransomware (only) detection system. The damage caused by some other malware types is more difficult to quantify owing to its dependence on factors outside the control of the malware. For example, the damage caused by spyware will depend on what information it is able to obtain, so it is difficult to quantify the benefit of killing spyware 5 seconds after execution compared with 5 minutes into execution. Ransomware offers a clear metric for the benefits of early detection and process killing.

Although the RF with a minimum of 2 alerts using both process and global data gave the highest F1-score on the test set (81.50), earlier experiments showed that RFs are not among the most computationally efficient of the models tested. Therefore, a decision tree trained on process-only data (26 features) is also used in this test, in case time-to-classification is important for damage reduction despite the lower F1-score. The DT also has a very slightly higher TPR (see Table 12), so a higher damage prevention rate may be partially due to the model itself rather than just the smaller number of features being collected and the faster model classification.

22 fast-acting ransomware files were identified from a separate VirusShare [43] repository which (i) do not require an Internet connection and (ii) begin encrypting files within the first few seconds of execution. The former condition is set because the malicious server may no longer exist and, for safety, it is not desirable to connect to it if it does. Some malware is able to cause significant damage within seconds, a timeframe in which it is impossible for a human to see, process, react to, and alert on the activity.

The 22 samples were executed for 30 seconds each without the process killing model, and the number of files modified was recorded. The process was repeated with 4 process killing models: the DT with min. 2 alerts and 26 features, the RF with min. 2 alerts and 37 features, the AdaBoost regressor with 26 features, and the GBDT regressor with min. 4 alerts and 26 features.

It was necessary to run the killing model with administrator privileges and to write an exception for the Cuckoo sandbox agent process which enables the host machine to read data from the guest machine since the models killed this process. The need for this exception highlights that there are benign applications with malicious-like behaviours, perhaps especially those used for networking and security.

Figure 6 and Table 13 give the total number of corrupted files across the 22 samples. The damage prevention column is a proxy metric denoting how many files were not corrupted using a given process killing model by comparison with no model being in place. The 22 samples on average each corrupt 910 files within 30 seconds.

The DT model almost entirely eliminates any file corruption with only three being corrupted. The RF saves 92.68% of files. The ordinal ranking of “damage prevention” is the same as the TPR on the test set, but the relationship is not proportional. The same ordinal relationship indicates that the simulated impact of process killing on the collected test set was perhaps a reasonable approximation of measuring at least fast-acting ransomware damage, despite the TPR test set metrics being based on other malware families, too.

The DT demonstrates that this architecture is capable of preventing damage, but the TNR on the test set of the DT model is so low (66.19) that this model cannot be preferred to the RF (81.53 TNR), which still prevents over 90% of file damage.

The GBDT prevents some damage and detects a comparable number of ransomware samples (1 in 5). The AdaBoost regressor detected 2 of the 22 ransomware samples, and in these two cases more than 64% and 45% of files were saved, respectively. Perhaps with more execution time the remaining samples would eventually be detected, but the key benefit of process killing is to stop damaging software like these ransomware samples quickly; indeed, this algorithm actually saw more files encrypted than when no killing model was used, because there is slight variance in ransomware behaviour and execution time on each run. The Random Forest is the most plausible model, balancing damage prevention and TNR; however, its delay in classification may be a result of the requirement to collect more features and/or the run time of the model itself.

7. Discussion: Measuring Execution Time in a Live Environment

Although algorithm execution duration was measured above, due to the batch processing used by the models, the number of processes being classified can be increased by an order of magnitude with a negligible impact on execution time. The data collection and process killing both have linear, $O(n)$, complexity, where $n$ is the number of processes; it is therefore expected that the number of processes impacts the overall processing time. The RF with statistical filters additionally scales with the number of trees in the forest and the number of alerts considered by the filter, but efficient library implementations of matrix operations mean that the inference time does not scale linearly with $n$. Given this, a further experiment was carried out with the RF to measure, in a live environment, how long the data collection, model inference, and process killing take as the number of processes increases. This was tested by executing more than 1000 processes in the virtual machine whilst the process killing RF runs.

Some processes demand more computational resources than others, and some malware in our test set locked pages in memory [50], which prevented the model from having sufficient resources to collect data, leading to tens of seconds during which no data were captured and many processes were launched. With better software engineering practices, the model may be more robust against this kind of malicious activity.

These differences in behaviour can cause the evaluation time to lag as demonstrated by the outlier points visible in Figure 7. The data show a broadly linear positive correlation between the number of processes (being monitored or killed) and the time taken for the data collection and process killing; this confirms the hypothesis that more processes equates to slower processing time. The slowest total processing time was 0.81 seconds (seen with both 17 and 40 simultaneous processes running), but the mean processing time is just under 0.3 seconds with 65 simultaneous processes, fitting comfortably within the 1-second goal time. Additional code optimisation could greatly improve on these initial results which indicate that the processing, even using standard libraries and a high-level programming language, can execute reasonably quickly.

8. Implications and Analysis

The experiments in this paper address a largely unexplored area of malware detection by comparison with post-trace classification. Real-time processing and response have a number of benefits, as outlined above, and the results presented here give tentative indications of the advantages and challenges of such an approach.

The initial experiments (Section 6.1) demonstrate that a high-accuracy RNN (as used in [11]) does not maintain high accuracy when used in real time with an automated response to classify individual processes rather than full application traces, since a single false positive classification of sequential data cannot be outweighed by later correct predictions.

The next set of experiments (Section 6.1) showed that whilst the RNN achieves one of the highest classification accuracies of the algorithms tested, it is not one of the best in terms of computational resource consumption or latency. However, no clear best algorithm was evident either, since the low-resource algorithms (such as the decision tree) did not always achieve high accuracy. Furthermore, all of the supervised learning algorithms were clearly unsuited to process killing, with the highest F1 score from any algorithm being 77.85 on the test set, compared with 85.55 for process-level classification alone. This 85.55 F1 score is lower than is seen in many dynamic malware detection publications that use full-application behavioural traces, indicating the challenges of classification at the process level, where malware and benignware may share functionality.

To improve detection accuracy, three approaches were tested: statistical filtering, reinforcement learning, and a regression model estimating the utility (q-value) of killing a process. Statistical filters using rolling means or alert thresholds were the only approach to improve on the supervised learning model F1 score. Reinforcement learning tended to kill processes too early and therefore did not explore enough scenarios (and thus receive the requisite reinforcement) to learn to allow benign processes to continue; this does not mean that future models could not improve on this result. This interpretation may be supported by the success of the regression models in maintaining a high true-negative rate, given that these models ascribed a similar utility to killing processes as the reinforcement learning models.
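A minimal sketch of such a statistical filter is given below; the parameter names and default values (for example min_alerts, window, and mean_threshold) are illustrative assumptions rather than the exact thresholds used in the experiments.

from collections import defaultdict, deque

class AlertFilter:
    """Kill a process only after repeated 'malicious' predictions,
    rather than on a single alert."""

    def __init__(self, min_alerts=2, window=5, mean_threshold=0.5):
        self.min_alerts = min_alerts
        self.mean_threshold = mean_threshold
        self.alert_counts = defaultdict(int)                      # pid -> total alerts
        self.history = defaultdict(lambda: deque(maxlen=window))  # pid -> recent scores

    def should_kill(self, pid, score):
        """score: model-estimated probability that this process snapshot is malicious."""
        self.history[pid].append(score)
        if score >= 0.5:
            self.alert_counts[pid] += 1

        # Alert-threshold filter: require at least min_alerts malicious predictions.
        if self.alert_counts[pid] >= self.min_alerts:
            return True

        # Rolling-mean filter: require a sustained high mean score over a full window.
        recent = self.history[pid]
        if len(recent) == recent.maxlen and sum(recent) / len(recent) >= self.mean_threshold:
            return True

        return False

In the experiments, the rolling-mean and alert-threshold mechanisms were evaluated as alternative filters; they are combined in one class here purely for brevity.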

The accuracy metrics considered thus far simply indicate whether a process was ever killed, but do not address whether damage was actually prevented by the killing. If damage is not prevented, there is little point to process killing, and a database of alerts for analysis would be a better solution since it eliminates the risk of killing benignware. For this reason, the final set of experiments in Section 6.5 tested the detection models in real time to see whether damage could be prevented, by counting the number of files corrupted by ransomware with and without the killing model running. Here, we found that it is possible to prevent 92% of files from being encrypted whilst maintaining a true negative rate of 82%. This result does not indicate that the system is ready for real-world deployment, but suggests that further model development, perhaps incorporating anomaly detection, could raise the true negative rate to a usable level. This work also demonstrates the damage that certain malware can carry out in a short space of time and reinforces the need for further research in this area, since previous work has either focused solely on ransomware [24] or waited minutes before beginning classification [23], by which time it is too late.
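For context, this kind of damage measurement can be implemented along the lines of the sketch below, which records a hash of every file in a sentinel corpus before detonation and counts how many have changed or disappeared afterwards; the directory layout and helper names are illustrative assumptions rather than the authors' exact setup.

import hashlib
from pathlib import Path

def snapshot(corpus_dir):
    """Map each file in the corpus to its SHA-256 digest."""
    return {
        path: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in Path(corpus_dir).rglob("*")
        if path.is_file()
    }

def count_corrupted(before, corpus_dir):
    """Count files whose contents changed or which were deleted/renamed."""
    after = snapshot(corpus_dir)
    corrupted = sum(1 for path, digest in before.items() if after.get(path) != digest)
    return corrupted, len(before)

# Usage: snapshot the corpus, detonate the sample with the killing model active,
# then compare:
#   before = snapshot("sentinel_corpus")
#   ... run the ransomware sample ...
#   corrupted, total = count_corrupted(before, "sentinel_corpus")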

9. Future Work

Real-time attack detection has wider applications than endpoint detection; Alazab et al. [51], for example, argue that Internet of Things networks in particular could benefit from real-time attack detection using heterogeneous data feeds from different sensors, combined using federated learning approaches.

However, some challenges remain to be solved. Behavioural malware analysis research using machine learning regularly reports classification accuracies around 95%, but although such models are useful for analysts, behavioural detection should be deployed as part of endpoint defensive systems to realise its full benefits. Dynamic analysis is not typically used for endpoint protection, perhaps because data collection takes too long to deliver the quick verdicts required for a good user experience. Real-time detection on the endpoint allows the full trace to be observed without the user having to wait. However, real-time detection also introduces the risk that the malware will damage the endpoint, which requires that processes detected as malicious are automatically killed as early as possible to avoid harm.

There are some key challenges to implementation, which have been outlined in this paper:
(i) The need for signal separation drives the use of individual processes, so only partial traces can be used.
(ii) The significant drop in accuracy on the unseen test set, even without process killing, demonstrates that additional features may be necessary to improve detection accuracy.
(iii) With the introduction of process killing, the poor performance of the models on either benignware classification (RF, min. 2 alerts: 81% TNR with an 88% TPR on the test set) or malware classification (GBDT regressor, min. 4 alerts: 56% TPR with a 94% TNR on the test set) means that considerable further work is needed before very early-stage real-time detection can be considered for real-world use.
(iv) Real-time detection using full execution traces of processes, however, may be viable. This would be useful for handling VM-aware malware, which may only reveal its true behaviour in the target environment. Although the more complex approach using DQN algorithms did not outperform the supervised models with additional statistical thresholds, the regression models were better at correctly classifying benignware. Reinforcement learning could still be useful for real-time detection and automated cyber defense models, but the DQN in these experiments did not perform well.
(v) Despite the theoretical unsuitability of supervised learning models for state-action problems, these experiments demonstrate how powerful supervised learning can be for classification problems, even if the problem is not quite the one that the model is attempting to solve.
(vi) Future work may require a more comprehensive manual labelling effort at the process level, perhaps labelling subsections of processes as malicious or benign.

An additional consideration for real-time detection with automated actions is whether it introduces a new denial-of-service vector: an attacker could, for example, use process injection to deliberately trigger the killing of a benign process. However, such activity would itself indicate that an attacker is present and may therefore still aid the user.

10. Conclusions

This paper has built on previous work in real-time detection to address some of the key challenges: signal separation, detection with partial execution traces, and computational resource consumption with a focus on preventing harm to the user, since real-time detection introduces this risk.

Behavioural malware detection using virtual machines is a well-established research field yielding high detection accuracy in recent literature [3, 6, 11, 20]. However, as is shown here, fixed-time execution in a sandbox may not reveal malicious functionality. Real-time malware analysis addresses this issue but risks executing malware on the endpoint and requires detection to take place at the process level, which is more challenging as the definition of a malicious process can be unclear. These two reasons may account for the limited literature on real-time detection. Looking forward, real-time detection may become more popular if static data manipulation and VM evasion continue to be used and the costs of malicious execution continue to rise. Real-time detection does not need to be an alternative to these approaches, but could hold complementary value as part of a defense-in-depth approach to endpoint security.

To the best of our knowledge, previous real-time detection work [23] has monitored at most 5 simultaneous applications, whereas real users may run far more. This paper has demonstrated that up to 35 simultaneous applications (and nearly 100 simultaneous processes) can be constantly monitored. Moreover, these results demonstrated that data collection is a greater limiting factor than the machine-learning algorithms, which can easily process 1000 samples with negligible impact on performance. This result is not too surprising, since batch processing allows the algorithms to achieve effectively O(1) complexity in the number of processes, by comparison with O(n) for data collection.

Automatic actions are necessary in response to detection if the goal is to prevent harm; otherwise, this is equivalent to letting the malware fully execute and simply monitoring its behaviour, since human response times are unlikely to be quick enough for fast-acting malware. From a user perspective, the question is not “What percentage of malware was executed?” or “Was the malware detected in 5 or 10 minutes?” but “How much damage has been done?”.

This paper found that by using simple statistical filters on top of supervised learning models, it was possible to prevent 92% of files from being corrupted by fast-acting ransomware, thus reducing the remediation burden on the user or organisation, since the damage was prevented in the first instance (although the rest of the attack vector would remain a concern).

This approach does not achieve the detection accuracies of state-of-the-art offline behavioural analysis models but, as stated in the introduction, those models typically use the full post-execution trace of malicious behaviour, and delaying classification until post-execution negates the principal advantages of real-time detection. However, the proposed model presents an initial step towards a fully automated endpoint protection model, which becomes increasingly necessary as adversaries grow more motivated to evade offline automated detection tools.

Data Availability

Information on the data underpinning the results presented here, including how to access them, can be found in the Cardiff University data catalogue at 10.17035/d.2021.0148229014.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was partly funded by the Engineering and Physical Sciences Research Council (EPSRC), grant references EP/P510452/1 and EP/S035362/1. The research was also partly funded by Airbus Operations Ltd.