
Scaling Anomaly Detection Pipelines For Security Telemetry

This blog post is co-authored by Senior Software Engineer Mina Botros, Principal Data Scientist Dmitry Kit, and Senior Software Engineer, Platform Architecture, Jackson Westeen.

Threat Stack collects tens of billions of events per day, which helps customers understand their environment, identify undesirable activity, and enhance their security posture. We collect telemetry from across the full stack of the cloud – covering operating systems, containers, cloud provider APIs, and other critical sources. Processing security telemetry efficiently and quickly enhances our ability to deliver value to our customers through real-time alerting, ThreatML, and analytics.

ThreatML enhances the Threat Stack platform’s rules engine and our Oversight and Insight services by intelligently surfacing anomalous activities across cloud workloads. ThreatML learns and baselines user behavior with proven investigation techniques used by Threat Stack’s security services teams to uncover suspicious trends that could otherwise go undetected.

ThreatML anomaly detection complements Threat Stack’s continuous security telemetry collection, robust rules, alerting capabilities, and professional services, all of which enable security teams to improve their risk visibility, context, and response to both known and unknown threats.

Identifying abnormal behavior in billions of events is no easy task: it requires defining a training data set, building a system process name similarity algorithm, and carefully selecting the telemetry criteria used for assessment.

Parallel Pipelines

Threat Stack phased its ThreatML releases within an early access program before its general availability (GA). As a result, with each release, we were able to add more customers to the anomaly detection pipeline for assessment. This allowed us to iterate and enhance our Machine Learning (ML) offering based on direct customer usage and feedback prior to GA.

Although the base algorithms can be developed in the development environment, only the production environment exposes the real diversity and volume of data to which we need to scale.

At each milestone of development, we had to test the scalability of the approach against a larger set of customers. This required us to design deployment strategies that let us deploy multiple versions of the solution, each with its own outputs and metrics for evaluation. This ensured that anything customer-facing would not change until we were certain the change would positively impact the customer’s experience. The scale testing and production pipelines are similar to those shown in Figure 1.

Figure 1

Utilizing the scale testing pipelines allowed us to find bottlenecks early on, remediate, iterate, and be prepared to handle the event load from additional customers with each new release.

Although we have successfully and efficiently scaled our ML pipeline to process the entirety of our customer data, we still use the scale pipelines to test changes to the ML models, verifying their quality, runtime, and resource utilization before they go live. This is an industry-standard practice known as blue/green or A/B testing.

Moreover, we have integrated the deployment of validation ML models into our source code deployment pipelines. This makes repeated deployment of validation pipelines, execution of tests, and evaluation of efficiency by the engineering teams safe and controlled. Once validation is complete, code can safely be promoted to production via the same source code deployment pipelines. The validation and production deployment steps are shown in Figure 2.

Figure 2

Security Telemetry Selection

Threat Stack collects a variety of events from customer infrastructure. Collected events represent the various ways a particular system is used; events can come from interactive sessions, from activity by background processes, from servers such as web or database servers, or from automated build systems.

ThreatML compares collected telemetry from one window of data to multiple other windows to determine a given behavior’s anomaly score. With tens of billions of data points collected per day across our customer base, the volume of historical telemetry that must be compared against the telemetry points in question grows quickly. This is where smart data selection is needed to strike a reasonable balance between runtime and exhaustiveness.

Finding anomalies in the entire data set proved to be unscalable early on. The amount of time and resources needed to run the anomaly detection process was significant and misaligned with external and internal expectations. Most importantly, our end customers are better informed when they are notified about anomalies earlier rather than later. Moreover, we aimed to achieve high efficiency in our anomaly detection process.

Due to the variety of collected data, it’s important to extract the most useful telemetry for the machine learning application’s business case. To find the right data set, we went through the various categories of data analyzed during this process and identified which were essential to the analysis and which could be skipped without the risk of missing key telemetry that informs our ML models. We zeroed in on a single category of data that would be most informative to our ML models.

Choosing the right set of telemetry allowed us to achieve a 25x improvement in cost efficiency and a 16x improvement in runtime. Prior to this narrowing of scope, the telemetry used for anomaly detection allowed us to process only 15% of our customer base in 4 hours. With the refined data set, we were able to cover 100% of our customer base in just 2 hours, using 50% less hardware than before.

Security Telemetry Redistribution 

Threat Stack supports hundreds of customers. This means that ThreatML has to support a variety of cloud infrastructure usage patterns and event volumes.

Achieving data load symmetry through redistribution is a common pattern we use across multiple services to improve efficiency and resiliency. Threat Stack engineering designs systems with scale in mind to better serve customers interested in growing their Threat Stack deployment as well as new customers with high event volumes and large-scale deployments.

Data skew, similar to that shown in Figure 3, is to be expected given the wide variety of security telemetry Threat Stack collects.

In the context of an Apache Spark application running on EMR, this skew leads to slower execution of EMR steps and wasted resources, since not all EMR workers participate in processing the data. This is illustrated by the unevenly wide green bars in Figure 3, each of which represents the chunk of data assigned to a Spark worker.

Figure 3

A few steps are taken to smooth out the skew and achieve better runtime and performance. The main technique we discuss here for redistributing data across all the workers is called “salting.”

The first step in applying this technique is to determine the dimensions (i.e., columns) along which we want to split our data. The goal is to split the data into groups such that spreading these groups across the cluster results in an even amount of work per worker. The major variability we observed in data volume is across customers, which means we want to divide the data for customers with many data points into more groups and the data for customers with fewer data points into fewer groups. Likewise, we wanted to make sure daily seasonality didn’t skew the data either.

To reduce the data volume skew due to these factors, we first introduced a new composite field that concatenates the organizational identifier with the date key separated by a special character. This new field is added as a key column to the Spark DataFrame, as shown in the code snippet below and in Table 1. The following steps will use this column to create a more uniform distribution of data across the cluster.
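The original snippet is not reproduced here; the following is a minimal PySpark sketch of this step, assuming an events DataFrame with hypothetical column names org_id and date_key (the production schema may differ).

from pyspark.sql import functions as F

# Assumed: `events` is a Spark DataFrame with illustrative columns
# org_id, date_key, timestamp, and value.
# Concatenate the organization ID and the date key with an underscore,
# e.g. "1_02/02/2020", and add the result as a new key column.
events = events.withColumn(
    "composite_key",
    F.concat_ws("_", F.col("org_id"), F.col("date_key"))
)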

Organization ID | Date key   | Timestamp        | Value | Composite key
1               | 02/02/2020 | 02/02/2020 12:20 | 10    | 1_02/02/2020
1               | 02/02/2020 | 02/02/2020 12:21 | 12    | 1_02/02/2020
1               | 02/02/2020 | 02/02/2020 12:22 | 13    | 1_02/02/2020
2               | 02/02/2020 | 02/02/2020 12:30 | 15    | 2_02/02/2020

Table 1

Secondly, we calculate the daily population statistics on each organization to determine the data distribution prior to repartitioning, as shown in the code snippet below and in Table 2.
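Continuing the illustrative sketch above (not the original snippet), the per-key counts can be computed with a simple aggregation:

# Count events per composite key to see how the volume is distributed
# across organizations and days before repartitioning.
key_counts = events.groupBy("composite_key").count()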

Composite key | Count
1_02/02/2020  | 15,000
2_02/02/2020  | 250,000

Table 2

Then we create a dictionary data structure that maps each composite identifier to an event ratio for that organization, as shown in the code snippet below. The ratio is the number of events for each composite key divided by the total number of events in the batch, multiplied by the number of shuffle partitions available to the EMR cluster in order to maximize parallelism.
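A hedged sketch of this calculation, assuming the events DataFrame and key_counts aggregation from above plus a SparkSession named spark, might look like this:

# Total number of events in the batch and the cluster's shuffle parallelism.
total_events = events.count()
num_shuffle_partitions = int(spark.conf.get("spark.sql.shuffle.partitions"))

# Map each composite key to its salt range: the key's share of the batch
# multiplied by the available shuffle partitions, with a floor of one.
salt_ranges = {
    row["composite_key"]: max(
        1, int(round(row["count"] / total_events * num_shuffle_partitions))
    )
    for row in key_counts.collect()
}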

The shuffle partition count is exposed by the EMR cluster and represents the number of workers available to that cluster. When the “maximize resource allocation” configuration flag is set to true, EMR sets the default parallelism to twice the number of CPU cores available to YARN containers. This process gives us a salt range for each customer before we apply the salting step. Table 3 illustrates the number of partitions the various data sets will be split into.

Composite key | Salt range
1_02/02/2020  | 5
2_02/02/2020  | 94

Table 3

Now we define a user-defined function (UDF) that generates a salted partition key within [0, range) by drawing a random value between zero and the previously calculated salt range for each composite key. This UDF computes the partition key used when we reshuffle in a later step. Note that the maximum value of the range is equivalent to the number of shuffle partitions available to the EMR cluster.
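A sketch of such a UDF in PySpark, assuming the salt_ranges dictionary and spark session from the previous steps, might look like the following (the production implementation may differ):

import random

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Broadcast the salt ranges so every executor can look them up cheaply.
salt_ranges_bc = spark.sparkContext.broadcast(salt_ranges)

@F.udf(returnType=StringType())
def salt_key(composite_key):
    # Draw a random salt in [0, range) for this key and append it,
    # e.g. "1_02/02/2020" becomes "1_02/02/2020_3".
    salt = random.randrange(salt_ranges_bc.value.get(composite_key, 1))
    return f"{composite_key}_{salt}"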

The next step is to add a new column to the DataFrame by applying the UDF to the composite key column, as shown in Table 4.
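Continuing the illustrative sketch, this step is a single column addition:

# Apply the UDF to derive the salted composite key for every row.
events = events.withColumn(
    "salted_composite_key", salt_key(F.col("composite_key"))
)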

Organization ID | Date key   | Timestamp        | Value | Composite key | Salted composite key
1               | 02/02/2020 | 02/02/2020 12:20 | 10    | 1_02/02/2020  | 1_02/02/2020_2
1               | 02/02/2020 | 02/02/2020 12:21 | 12    | 1_02/02/2020  | 1_02/02/2020_3
1               | 02/02/2020 | 02/02/2020 12:22 | 13    | 1_02/02/2020  | 1_02/02/2020_1
2               | 02/02/2020 | 02/02/2020 12:30 | 15    | 2_02/02/2020  | 2_02/02/2020_65

Table 4

Lastly, we repartition the DataFrame using the available number of partitions and the newly added salted column, as shown in the code snippet below.
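The original snippet is not reproduced; an illustrative equivalent in PySpark, using the names assumed above, would be:

# Repartition on the salted key so each organization's data is spread
# evenly across the available shuffle partitions.
events = events.repartition(
    num_shuffle_partitions, F.col("salted_composite_key")
)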

Smoothing out the data by means of salting is a common practice that results in a better distribution of workload over the available workers. The results of this redistribution, shown in Figure 4, are much more uniform than those in Figure 3. This distribution avoids overloading any single worker and prevents out-of-memory exceptions.

This strategy also enables us to scale the EMR clusters horizontally and handle increased data volumes in the future. It ensures that resources are well utilized, the overall runtime and cost are reduced, data from additional customers can be processed regardless of their event volume, and we can adapt to changing data volumes from our current customer base.

Figure 4

More ML At Scale

ThreatML allowed us to define ML strategies for experimental and production deployment and to measure the efficacy of various models while achieving high performance. These strategies serve as a general framework for developing additional anomaly detection models that will be deployed at scale to analyze security telemetry data and provide more valuable insights to our customers.