Whether you intend to use managed services to handle your organization’s cloud security or have decided to create and manage your own security program, pulling together all the information you need can be a complex task to say the least. To help out, we want to share some of the insights and best practices we’ve gathered from Threat Stack’s managed security service.
To structure your needs analysis, you can use the main steps in this methodology as a series of best practices to develop clear insights into your environment, workloads, and behaviors, and to define a robust cloud security strategy and policies. The issues we focus on and the main questions we ask are outlined below to help you build your strategy.
The process starts with questions that help define the behaviors that are expected of DevOps users related to Access, Privileges, and Activities. These behaviors are probably the biggest unknown for most companies and especially for smaller, fast-moving organizations that are just starting to implement more defined security policies.
User Access
User access issues focus on how users remotely access the servers in each environment, and the key questions you need to ask include the following:
- Is there a VPN and/or jumphost(s) for access?
- Is access open to the world or should connections only be received from specific IP addresses or ranges?
- Is access limited to specific user IDs? Are these shared or unique IDs?
- Are root logins used? (If they are, get rid of them!)
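Answering these questions usually translates directly into a few lines of configuration. The sketch below is a hypothetical example of locking down the access patterns above; the user names and the jumphost address (taken from the 203.0.113.0/24 documentation range) are illustrative assumptions, not a prescription.

```shell
# Hypothetical SSH hardening sketch.

# /etc/ssh/sshd_config directives:
#   PermitRootLogin no       <- eliminate direct root logins
#   AllowUsers alice bob     <- unique, named user IDs only (no shared accounts)

# Firewall: accept SSH only from the jumphost, drop everything else
iptables -A INPUT -p tcp --dport 22 -s 203.0.113.10 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j DROP
```

The same intent can be expressed in a cloud security group instead of iptables; what matters is that the answers to the questions above are written down as enforceable rules.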
We ask many more questions about user behavior, especially as related to production environments, because production tends to be immutable with no direct remote access by anything other than config management and deployment tools. Identifying what your users should and shouldn’t be doing within the protected environment(s) is key to good security and operational hygiene.
Privileges & Activities
Next, we want to know who is expected to carry out specific actions, so once the access questions have been answered, we move on to issues related to privileges and activities. The most common among these are privilege escalations and file transfers. These should be monitored closely because they're the most likely way an insider with permitted access can cause damage, for example by transferring files off a server or downloading binaries that can be used for unapproved access. Of course, being able to perform these actions as root via sudo only increases the level of access and the ability to hide these actions.
Consider the following when you’re looking at user privileges and activities:
- Who is able to sudo?
- What tools or services will use sudo?
- Which users run scp and wget?
- What other processes should generate alerts when executed?
  - For example: find -exec, vi, tcpdump, netcat
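One common way to monitor for these executions on Linux is the audit framework. The rules below are a hypothetical sketch (the key names are made up for illustration) of watching file transfer tools and root-level command execution; they require root and an auditd installation.

```shell
# Alert-worthy executions: watch file transfer tools
auditctl -w /usr/bin/scp  -p x -k file-transfer
auditctl -w /usr/bin/wget -p x -k file-transfer

# Record every command executed as root by a real (logged-in) user,
# which captures sudo-escalated activity
auditctl -a always,exit -F arch=b64 -S execve -F euid=0 -F auid>=1000 -k root-exec
```

Matching events then show up in the audit log tagged with the `-k` key, which makes them easy to route into alerts.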
It’s important to note that these activities should never occur on production servers. On non-production servers, it’s critical to monitor and control them to ensure that malicious or inadvertent actions do not make it to production.
Processes
After completing the onboarding user survey, we start identifying the processes we expect to see running on each workload. This is probably one of the easier areas to benchmark, since what's expected is generally quite well defined based on the server role, at least for production workloads. On the other hand, process behaviors for development servers are much less defined. The questions that need to be answered for production center on the different application stacks in use and matching these to the server hostname and/or tags.
Some of the key questions you need to answer for production are as follows:
- What application stacks are running in production?
- What service processes are running on each host?
- What service/user accounts are these running under?
- What should be running as root (if anything)?
- How are these deployed and/or updated?
- How often are these deployed or updated?
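A simple way to start answering these questions is to capture what is actually running and diff it against what you expect for the role. The sketch below assumes a hypothetical "web" role and an invented baseline file; anything observed that isn't in the baseline is a candidate for an alert (or for updating the baseline).

```shell
#!/bin/sh
# Sketch: compare running processes against a per-role baseline.
# The role name and baseline contents are assumptions for illustration.

# Expected command names for a hypothetical "web" role (sorted)
cat > /tmp/baseline-web.txt <<'EOF'
nginx
php-fpm
sshd
EOF

# Capture what is actually running: unique command names, sorted
ps -eo comm= | sort -u > /tmp/observed.txt

# Print anything observed that is NOT in the baseline
comm -13 /tmp/baseline-web.txt /tmp/observed.txt
```

Run periodically (or on deploy), this turns the "what should be running?" questions into a concrete, reviewable diff.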
Other issues definitely need to be considered here, but the processes and services that are running are generally known and understood. Processes and activities you were not aware of will surface quickly once you have the deep visibility into your workload that you need to answer these questions. This step is as much a part of developing an operational understanding as it is a security exercise.
Network Activity
Network activity is another area that is generally quite easy to define, especially for production workloads. Key issues center on identifying where these servers are expected to be communicating. It is important to identify both inbound and outbound services, as well as which workloads should have public Internet-facing access and which should not.
When identifying and defining network behavior, answer the following:
- What servers have publicly accessible services?
- For non-public services, such as database workloads, what are the expected network connections for each?
- From specific IPs or subnets?
- Outbound? (Most servers don’t make outbound connections, but ones that do are usually quite well defined.)
- What ports are open on each workload?
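The open-ports question lends itself to a quick allowlist check. This sketch inlines a sample listener list so it is self-contained; on a real host you would feed it from `ss -tln` (or `netstat -tln`) instead. The allowlist and the sample addresses are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: flag listening ports that are not on an expected allowlist.

allowlist="22 443"   # ports this workload is expected to expose

# Sample listener addresses standing in for real `ss -tln` output
printf '%s\n' '0.0.0.0:22' '0.0.0.0:443' '0.0.0.0:6379' |
awk -F: '{print $NF}' |            # keep only the port number
while read -r port; do
  case " $allowlist " in
    *" $port "*) ;;                # expected listener, no alert
    *) echo "unexpected listener on port $port" ;;
  esac
done > /tmp/unexpected-ports.txt

cat /tmp/unexpected-ports.txt
```

With the sample data, the unexpected Redis-style listener on 6379 is the only line reported, which is exactly the kind of surprise this step tends to surface.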
Again, once alerts start coming in, additional network connections that were not previously known will be identified.
Adding IP reputation information also raises the question of what actions to take when a server communicates with a known bad IP address. For most public-facing services, we expect a barrage of connections from known bad IPs: scanning, a variety of exploit attempts against your servers, and brute force dictionary attacks against your VPN and jumphosts. This is the risk you assume when you open up public Internet-facing services, and closing them off is not usually an option. Once these connections are identified, you can block the offending IP addresses using tools like iptables or Fail2ban.
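Blocking an identified bad source can be as simple as one firewall rule, or it can be automated. The commands below are a hypothetical sketch (the address is from the 198.51.100.0/24 documentation range); they require root, and the Fail2ban settings shown are illustrative, not tuned recommendations.

```shell
# Drop all traffic from a known-bad source identified by your alerts
iptables -A INPUT -s 198.51.100.23 -j DROP

# Or automate brute-force banning with a Fail2ban jail,
# e.g. in /etc/fail2ban/jail.local:
#   [sshd]
#   enabled  = true
#   maxretry = 5
#   bantime  = 3600
```

The manual rule handles a one-off investigation; Fail2ban covers the steady background noise of dictionary attacks against SSH, your VPN, and jumphosts.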
For outbound connections, a communication from a workload to a known bad IP is another story entirely. This is a Severity 1 event: if your server is establishing an outbound connection to a known bad IP, it needs to be investigated immediately. Not all IP reputation sources are equal, so it is important to do additional investigation to determine both the reputation and the possible risk. One of the things we check is the IP's location, and then we ask:
- Does this come from a Geo with known bad activities?
- Should my server be communicating with an IP in that region?
- What is the identified reputation (spamming, scanning, exploitation, etc.)?
- What do other IP reputation sources say about this IP?
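The triage steps above can be sketched as a simple lookup against a local reputation list. Everything here is an illustrative assumption: the blocklist file, its entries (documentation-range addresses), and the destination IP; a real check would query one or more external reputation feeds as well.

```shell
#!/bin/sh
# Sketch: triage an outbound destination against a local reputation list.

# Hypothetical local blocklist: IP, then reputation category
cat > /tmp/bad-ips.txt <<'EOF'
198.51.100.23 scanning
203.0.113.77 exploitation
EOF

dest="203.0.113.77"   # destination taken from a connection alert

category=$(awk -v ip="$dest" '$1 == ip {print $2}' /tmp/bad-ips.txt)
if [ -n "$category" ]; then
  echo "SEV1: $dest is flagged for $category, investigate immediately"
fi
```

A hit here answers the first two questions (reputation and category); cross-checking other reputation sources and the geo of the address is still manual follow-up.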
File Monitoring
File monitoring activity is generally the easiest, and sometimes the most important, area to define. Having a pre-defined set of files and directories to start with certainly helps, but that may just be the start. Files that typically need to be monitored include those that contain private keys, certificates, and other key material, as well as server configuration files. When creating file monitoring rules, think about the actions you want to alert on: an advanced file monitoring solution can alert you not only when a file is modified, but also when it is opened, deleted, created, or even copied, to name a few.
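On Linux, the audit framework can express this kind of rule natively. The watches below are a hypothetical sketch; the paths and key names are assumptions, and they require root and auditd.

```shell
# Watch key material for reads, writes, and attribute changes
auditctl -w /etc/ssl/private/ -p rwa -k key-material

# Watch a server configuration file for modifications only
auditctl -w /etc/ssh/sshd_config -p wa -k ssh-config
```

Note the distinction: for secrets, even a read (`-p r`) is alert-worthy; for config files, modification is usually the interesting event.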
Some Final Thoughts
By completing a baselining process and implementing rules and alerts for our customers, Threat Stack has learned that many things happen in customer environments that are not expected. While these are not always security related, they can definitely have security implications.
By following the analytical process we use in our onboarding methodology, you can create a better picture of your environment. This, in turn, will enable you to develop a more complete understanding of the behaviors and activities that are happening within that environment.