Reducing your alert fatigue with AskJeevesSecBot

In incident response, there is a disconnect between a security alert being generated and the user confirming whether the activity behind that alert was theirs. For example, generating an alert every time a user runs “curl” on a production system would produce many false positives, which leads to what is called “alert fatigue”. But if we extend our incident response capabilities to include the user as part of the triage process, we can reduce the number of alerts that reach analysts. This blog post demonstrates AskJeevesSecBot, an open-source proof of concept (PoC) of how to integrate Slack and user responses into your security pipeline, specifically during the triage phase of the incident response process. In addition to the PoC, this blog post provides a deep dive into the architecture of the project, design decisions, and lessons learned as an evolving threat detection engineer.

Goals

  • Threat detection engineering project
  • Open-source a project that will prompt a user via Slack about security events
  • Improve Golang skills
  • Reduce alert fatigue

Problem statement

All SOCs and incident response teams are affected by alert fatigue. The goal of this project is to reduce alert fatigue by introducing the user as part of the triage phase of the incident response process. AskJeevesSecBot is a PoC to demonstrate how users can be integrated into the process. Slack maintains a “no no list” of commands: if a user runs a command on that list, the user is sent a Slack message asking them to confirm or deny that they ran that command. If the user confirms they ran the command then nothing happens, but if the user did not run the command a security investigation is created. This same concept can be applied to user VPN logins. Our threat is going to be a publicly exposed VPN service that can be brute-forced by attackers.

docker-compose up -d – -no-wait

If you want to skip ahead to a working implementation instead of reading this blog post please go here.

Background

What is triage?

Jared Atkinson from SpecterOps stated that triage is answering the question of whether an alert is malicious or benign. This project targets exactly that question: is a new VPN login malicious or benign?

What is Kafka?

Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open-sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged event streaming platform.

What is Heroku?

Heroku is a cloud platform as a service supporting several programming languages. One of the first cloud platforms, Heroku has been in development since June 2007, when it supported only the Ruby programming language, but now supports Java, Node.js, Scala, Clojure, Python, PHP, and Go.

What is a VPN hash?

VPN hash is built on the premise of the JA3 hash. A JA3 hash is an MD5 hash of fields taken from a TLS ClientHello (TLS version, cipher suites, extensions, elliptic curves, and elliptic curve point formats), which condenses several values into a single fingerprint that is easy to look up. In the same spirit, a VPN hash is a SHA3 hash of the tuple (VPN username, public IP address, geographic ISO code). The hash produced is unique to these three values, and if any value changes then a new hash is produced. This hash allows for simple lookups in our table and allows us to track unique logins vs. logins seen before. For example, if a user keeps logging in from home with the same values then we only store this entry once and update the last used field. If the user logs in from a location that has never been seen before, we send a Slack notification to the user.
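To make the idea concrete, here is a minimal sketch of computing a VPN hash in Python. It is not the AskJeeves implementation (which is written in Golang); the function name, the field order, the “|” separator, and the choice of the 256-bit SHA3 variant are assumptions made for illustration.

```python
import hashlib


def vpn_hash(username: str, ip_address: str, location_code: str) -> str:
    """Return a SHA3-256 hex digest over (username, IP address, location code).

    The "|" separator is an assumption for this sketch; it keeps adjacent
    fields from running together before hashing.
    """
    material = "|".join((username, ip_address, location_code))
    return hashlib.sha3_256(material.encode("utf-8")).hexdigest()


# Same inputs always give the same hash; changing any one field changes it.
home = vpn_hash("alice", "203.0.113.7", "5037779")
cafe = vpn_hash("alice", "198.51.100.23", "5037779")
print(home == vpn_hash("alice", "203.0.113.7", "5037779"))
print(home == cafe)
```

The first comparison is between two hashes of identical inputs, the second between hashes that differ only in the IP address.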

Math time

VPN entries

MySQL storage space for VPN logins =
  * each entry in MySQL is ~ 200 characters
  * 1 byte per UTF-8 character (only US-ASCII)
  * assuming each user will have 15 entries (on average within a 90 day timeframe)
  * 100,000 employees
  = 300,000,000 bytes ~ 0.3 GB
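The storage estimate above can be checked with a few lines of Python. The constant names are mine; the values are the ones listed above.

```python
# Values taken from the estimate above.
BYTES_PER_ENTRY = 200       # ~200 US-ASCII characters at 1 byte each
ENTRIES_PER_USER = 15       # average per user within a 90 day timeframe
EMPLOYEES = 100_000

total_bytes = BYTES_PER_ENTRY * ENTRIES_PER_USER * EMPLOYEES
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_bytes:,} bytes ~ {total_gb} GB")
```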

VPN hash

At the time of this writing, World Atlas lists the largest organization in the world as the United States Department of Defense with 3.2 million employees. Our implementation of VPN hash uses SHA3, which allows for up to 2^256 ~ 1.157 x 10^77 possible hash values. The number of possible SHA3 values is far larger than the number of possible VPN hash inputs (calculated below), which means we can support the largest org and a collision is statistically unlikely.

How many VPN hash combinations = 
  * 3,000,000 users in an org
  * 3,706,452,992 possible IPv4 addresses (excluding reserved IP addresses)
  * 120,000 possible locations from geo MaxMind database
  ~ 1.334 * 10 ^ 21 combinations
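A quick sketch of the same comparison in Python, using the numbers above. Python integers are arbitrary precision, so 2^256 can be computed exactly; the variable names are mine.

```python
USERS = 3_000_000
IPV4_ADDRESSES = 3_706_452_992   # excluding reserved addresses
LOCATIONS = 120_000

vpn_combinations = USERS * IPV4_ADDRESSES * LOCATIONS
hash_space = 2 ** 256            # number of distinct 256-bit digests

print(f"VPN hash inputs: {vpn_combinations:.3e}")
print(f"SHA3 outputs:    {hash_space:.3e}")
print(f"Outputs per possible input: {hash_space // vpn_combinations:.3e}")
```

The last line shows how many possible hash values exist for every possible input, which is the intuition behind collisions being statistically unlikely.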

Architecture

Network diagram

Overview

Phase 1: Ingesting VPN logs

This section will review the flow of the application in the network diagram above. First, a user’s machine initiates a connection to the VPN server. In this blog post, I use OpenVPN as my VPN provider but any provider will do if it can log the user’s IP address and username at a minimum. Additional metadata such as operating system, device type (mobile, laptop, desktop), and hostname of the device can be used by AskJeeves. Once the user has established a successful VPN connection, the VPN server will send the log entry off to Rsyslog.

In the case of OpenVPN the data being passed was in Syslog format which is not very friendly to key-value pair extraction like JSON. Rsyslog can be used to extract/transform data such as Syslog into JSON format. It should be noted that after the extraction/transformation the JSON being sent to Kafka from Rsyslog must contain an IP address and username at a minimum but can include additional metadata such as operating system, device type (mobile, laptop, desktop) and hostname of the device. If your VPN platform supports Kafka you could also skip Rsyslog entirely and have the logs ingested straight into Kafka.
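To illustrate the kind of extraction described above, here is a minimal Python sketch that pulls a username and IP address out of a Syslog-style line and emits JSON. In this project that job is done by Rsyslog, not Python; the sample log line, the regular expression, and the JSON key names are all assumptions for illustration and will need adjusting to your VPN server’s actual log format.

```python
import json
import re
from typing import Optional

# Hypothetical pattern for a connection line; adjust to your VPN's real format.
LOGIN_PATTERN = re.compile(
    r"\[(?P<username>[^\]]+)\] Peer Connection Initiated with "
    r"\[AF_INET\](?P<ip_address>\d{1,3}(?:\.\d{1,3}){3}):\d+"
)


def parse_vpn_log(line: str) -> Optional[dict]:
    """Return {'username', 'ip_address'} for a login line, else None."""
    match = LOGIN_PATTERN.search(line)
    if match is None:
        return None
    return {
        "username": match.group("username"),
        "ip_address": match.group("ip_address"),
    }


sample = (
    "openvpn[1234]: 203.0.113.7:51234 [alice] Peer Connection Initiated "
    "with [AF_INET]203.0.113.7:51234"
)
print(json.dumps(parse_vpn_log(sample)))
```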

The reason Kafka was used in this architecture is because it is a common platform being used in enterprise logging pipelines. In addition, AskJeeves is not the only application that might need the VPN logs. The security team and the IT team may want those logs as well for different reasons. Kafka provides the ability for a single producer of data to be ingested by multiple consumers. In addition, Kafka also provides streaming capabilities that are ideal for this architecture. Once a VPN log has been consumed we no longer need it but of course, a good security team should store these logs in a SIEM.

Phase 2: Generating VPN hash

AskJeeves will listen to the Kafka topic containing VPN logins for new entries and will pull them out one-by-one for processing. First, AskJeeves will start by extracting the JSON entry from Kafka into a struct called “userVPNlog”. Second, AskJeeves will compute the human-readable location (Minnesota, US, NA) and ISO code for the location (5037779) based on the IP address using the MaxMind GeoIP database. Third, AskJeeves will compute the VPN hash of the tuple (username, IP address, ISO code). Fourth, AskJeeves will check the MySQL database to see if the VPN hash exists. If it does, this VPN login has been seen before and needs no notification; if it does NOT, AskJeeves adds it to the database.
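The fourth step, the “seen before?” lookup, can be sketched as follows. This is not the AskJeeves code (which uses Golang, GORM, and MySQL); an in-memory SQLite database stands in for MySQL so the example runs anywhere, and the table name, column names, and function names are assumptions.

```python
import sqlite3
from datetime import datetime, timezone


def open_store() -> sqlite3.Connection:
    """Create an in-memory table of known VPN hashes (stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vpn_logins ("
        "vpn_hash TEXT PRIMARY KEY, username TEXT, "
        "first_seen TEXT, last_seen TEXT)"
    )
    return conn


def record_login(conn: sqlite3.Connection, vpn_hash: str, username: str) -> bool:
    """Return True if this VPN hash is new and the user should be prompted."""
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        "SELECT 1 FROM vpn_logins WHERE vpn_hash = ?", (vpn_hash,)
    ).fetchone()
    if row is None:
        # Never seen before: store it and signal that a prompt is needed.
        conn.execute(
            "INSERT INTO vpn_logins VALUES (?, ?, ?, ?)",
            (vpn_hash, username, now, now),
        )
        conn.commit()
        return True
    # Seen before: only refresh the last-used timestamp.
    conn.execute(
        "UPDATE vpn_logins SET last_seen = ? WHERE vpn_hash = ?", (now, vpn_hash)
    )
    conn.commit()
    return False
```

A caller would send the Slack notification only when record_login returns True.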

It should be noted that the MySQL functionality of AskJeeves has a configurable value for how long it will keep entries (the default is 90 days). There is a Go background task that checks the database on an interval (default 6 hours) for entries older than the configured expiration value. If such an entry is discovered, it is purged from the database. This functionality is for when a user goes to a Starbucks and logs in to the company VPN but never returns to that location. Now you might be wondering, does this mean users will be prompted every 90 days if they log in from the same location with the same IP address? The answer is no. The database also keeps track of the last login time for that VPN hash. If the hash is consistently being used, that value is updated accordingly and the user will never receive a Slack notification after the first attempt.
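The expiry rule can be sketched in a few lines. This is a simplified stand-in for the Go background task: a plain dictionary replaces the MySQL table, and the function and parameter names are assumptions. The key point is that expiry is measured from the last login, not the first, which is why a regularly used location is never purged.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(entries: dict, now: datetime,
                  max_age: timedelta = timedelta(days=90)) -> list:
    """Remove entries whose last login is older than max_age.

    `entries` maps a VPN hash to the datetime of its last login.
    Returns the list of purged hashes.
    """
    purged = []
    # Iterate over a copy because the dictionary is modified in the loop.
    for vpn_hash, last_seen in list(entries.items()):
        if now - last_seen > max_age:
            del entries[vpn_hash]
            purged.append(vpn_hash)
    return purged


utc = timezone.utc
entries = {
    "coffee-shop-hash": datetime(2020, 1, 1, tzinfo=utc),   # visited once
    "home-hash": datetime(2020, 3, 25, tzinfo=utc),         # used regularly
}
print(purge_expired(entries, now=datetime(2020, 4, 15, tzinfo=utc)))
```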

Phase 3: Generate Slack message

Fifth, AskJeeves will generate a Slack message to the user containing an EventID, Username, Location, Timestamp, IP address, VPN hash, Device, Hostname, and user selection (screenshot below). An EventID (UUID4) is generated to distinguish multiple new logins that share the same VPN hash, for example, when a user attempts to log in from a new location via a mobile device and a laptop at the same time. The EventID can be used to debug issues but also by the security team to track multiple logins with the same VPN hash. Next, the Slack message will ask the user to confirm (“This was me”) or deny (“This was NOT me”) that they performed the VPN login based on location, IP address, and username. When the user selects “This was me” or “This was NOT me” the response is sent from Slack to ButlingButler.
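As a rough illustration of what such a message looks like, here is a sketch that builds a Block Kit payload as a Python dictionary. AskJeeves builds its message in Golang; the function name, the action_id values, and the exact wording here are assumptions, and only a subset of the fields listed above is included to keep it short.

```python
import uuid


def build_login_prompt(username: str, location: str, ip_address: str,
                       vpn_hash: str, timestamp: str) -> dict:
    """Build a Slack Block Kit message asking the user about a VPN login."""
    event_id = str(uuid.uuid4())  # UUID4 EventID, one per login event
    details = (
        f"*New VPN login detected*\n"
        f"EventID: {event_id}\nUsername: {username}\n"
        f"Location: {location}\nTimestamp: {timestamp}\n"
        f"IP address: {ip_address}\nVPN hash: {vpn_hash}"
    )
    return {
        "text": "New VPN login detected",  # fallback text for notifications
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": details}},
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "This was me"},
                        "style": "primary",
                        "action_id": "vpn_login_confirm",  # hypothetical id
                        "value": event_id,
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "This was NOT me"},
                        "style": "danger",
                        "action_id": "vpn_login_deny",  # hypothetical id
                        "value": event_id,
                    },
                ],
            },
        ],
    }
```

Carrying the EventID in each button’s value means the response that comes back from Slack can be matched to the login that triggered it.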

Phase 4: Receiving user response

First, ButlingButler will extract the Slack signature from the request to verify the message was sent from Slack. Any messages that can’t be verified are dropped. Second, ButlingButler extracts the payload and stores it in MySQL. Third, ButlingButler will acknowledge the Slack notification by responding with an updated message. If the user selects “This was me”, the updated message is a party parrot to let the user know we acknowledge their selection and no further action is required. If the user selects “This was NOT me”, the updated message acknowledges the user’s selection by stating “Alerting security team”.
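The verification step can be sketched as below. Slack signs each request by computing an HMAC-SHA256 over the string “v0:” + timestamp + “:” + raw request body using the app’s signing secret, and sends the result in the X-Slack-Signature header alongside X-Slack-Request-Timestamp. This is a standalone sketch, not ButlingButler’s code; the function name and the five minute replay window are assumptions.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: float = None,
                           max_skew: int = 300) -> bool:
    """Return True only if the request was signed with our signing secret.

    timestamp: value of the X-Slack-Request-Timestamp header
    signature: value of the X-Slack-Signature header ("v0=<hex digest>")
    body:      the raw, unparsed request body
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except ValueError:
        return False
    # Reject old requests to limit replay attacks.
    if abs(now - request_time) > max_skew:
        return False
    base_string = b"v0:" + timestamp.encode("utf-8") + b":" + body
    expected = "v0=" + hmac.new(
        signing_secret.encode("utf-8"), base_string, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected, signature)
```

Any request for which this returns False would be dropped before its payload is parsed or stored.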

Phase 5: Recording user response

On a configured interval, another AskJeeves background task polls ButlingButler for new user responses. If ButlingButler has new user responses, it queries its MySQL database for all entries, deletes them from the database, and returns them to AskJeeves. AskJeeves iterates through each entry, updating its MySQL database with the user responses. If the user selected “This was NOT me”, a ticket is generated in “theHive” for the security team to investigate.
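The handling of a batch of responses can be sketched like this. It is a simplified Python illustration rather than the Golang implementation: a dictionary stands in for the AskJeeves database, the response field names and the “confirmed”/“denied” values are assumptions, and create_ticket is a hypothetical callable standing in for the theHive integration.

```python
from typing import Callable, Iterable


def process_responses(responses: Iterable[dict], login_store: dict,
                      create_ticket: Callable[[dict], None]) -> int:
    """Record each user response and open a ticket for every denial.

    Returns the number of tickets created.
    """
    tickets = 0
    for response in responses:
        # Record the user's answer against the login event it refers to.
        login_store[response["event_id"]] = response["selection"]
        if response["selection"] == "denied":
            # "This was NOT me": hand the event to the security team.
            create_ticket(response)
            tickets += 1
    return tickets


store = {}
opened = []
batch = [
    {"event_id": "e1", "username": "alice", "selection": "confirmed"},
    {"event_id": "e2", "username": "bob", "selection": "denied"},
]
print(process_responses(batch, store, opened.append))
```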

Design decisions

You might be wondering why I have two applications (AskJeeves and ButlingButler) for this system. This was due to a limitation of how the Slack API is designed. The Slack API does not have an endpoint that can be polled on an interval for user responses. Instead, Slack sends the user response in real time to a URL endpoint. From my perspective, this is incredibly frustrating because it means some part of my system HAS TO BE publicly facing. From Slack’s perspective, I understand this decision, and it probably consumes fewer resources. One could argue that I could punch a hole in NAT and combine the two applications into one. While that is a possibility, it carries the risk of having to protect a publicly facing application that contains sensitive information.

At first, I contemplated writing the ButlingButler functionality into AskJeeves but I realized that would create a risk I didn’t want to accept. AskJeeves has access to VPN logs and a vulnerability in AskJeeves could allow an attacker to access these logs or modify actions performed. I decided most organizations wouldn’t want to punch through NAT for this application due to the sensitive information it contains. Secondly, most organizations have a public presence in a cloud platform so this made it an easy choice.

The design decision to separate the architecture into two applications allowed me to eliminate the risk of the VPN logs being compromised through the public-facing application. It also allowed me to reduce the risk of the public-facing application by limiting its attack surface. The attack surface can be greatly reduced because firewall rules can be implemented to only allow traffic from AskJeeves and Slack. In addition, ButlingButler only accepts authenticated requests from AskJeeves and verified requests from Slack. If these security controls are put in place, the attack surface and risk of this system are virtually non-existent. However, this architecture became increasingly difficult to build because all the cogs in the machine had to work together.

Design assumptions

  • Your organization uses Slack, Kafka, and theHive
  • The user/username exists in Slack
  • The username for the VPN is the same in Slack
  • Your organization has access to a cloud to publicly expose ButlingButler

Discussion

Future of this project

This blog post demonstrated a POC with VPN logins but I would like to expand the functionality to user commands, access to critical systems, or Slack notification to instruct the user to go to the IT service desk to get malware removed. The possibilities are endless and I hope the community contributes to this project to make some of these ideas come to life.

Lessons learned

I am currently reading a book called “Cracking the Coding Interview” and it is a great book. One interesting part of the book is its matrix for describing projects you have worked on, which contains the following sections: challenges, mistakes/failures, enjoyed, leadership, conflicts, and what you’d do differently. I am going to try to use this model at the end of my blog posts to summarize and reflect on the things I learn. I don’t blog to post things that I know; I blog to learn new things and to share the knowledge of my security research.

New skills/knowledge

  • Learned Google’s API for Google Static Maps
  • Learned how to use Slack’s new Block Kit for UI design
  • Learned how to implement Slack’s Block Kit in Golang
  • Learned what digital signing is and implemented it with Slack
  • Learned how to implement Golang tickers to create background tasks
  • Learned how to use the Golang library GORM to interact with MySQL
  • Learned how to use Flask-JWT for authentication
  • Learned how to use Slack’s signing key to verify HMAC
  • Learned how to use Heroku
  • Learned how to setup theHive and basic functionality
  • Learned about software licenses and applied the Apache 2.0 license to this project

Challenges

  • Working with the Golang Kafka library
  • Marshaling JSON
  • Extracting values from Syslog messages
  • A project that relies on multiple independent components
  • Transitioning ButlingButler’s config.py to work with environment variables for Heroku
  • Heroku has some nuances, for example, it expects a Docker container to be called web in order to route HTTP traffic to it

Mistakes and failures

I grossly underestimated the development time it would take to turn this idea into a working PoC. I assumed the Slack API would have an endpoint I could query on an interval for a user’s response, but this was not the case. When a user submits a response, Slack will PUSH the user’s response to a URL that has to be publicly accessible. Hence the creation of ButlingButler to listen for Slack notifications. This revelation completely changed the entire architecture of my application.

As a side note, I always wondered why a developer would write one part of an app in one language and the second part in a different language. This project helped me understand that reality because when I had to write ButlingButler I knew I could develop that component faster in Python over Golang. I now understand the developer’s dilemma.

Enjoyed

  • Designing a Golang project to utilize Kafka
  • Calculating the math to determine if MD5 or SHA3 would be a good fit for a VPN hash
  • Enjoyed setting up theHive in my homelab – will def use it in future projects
  • In past projects, I have used SQLAlchemy to target one database system. During the development of ButlingButler, I used MySQL but the free tier of Heroku only supported Postgres. It was exciting to actually utilize the full potential of SQLAlchemy and all I had to do was change the database URL.

What I’d do differently

  • Add a mechanism to pull down a new GeoIP database on an interval
  • Lastly, my intention at first was to do this entire project in Golang to become more proficient. However, having to have a publicly visible instance threw a wrench into my original plan and I decided to write ButlingButler in Python because it would be quicker. This decision allowed me to rapidly implement a server with proper security controls.
  • Create automation to create API keys, set up env variables, etc – shorter install/setup doc.
  • Use Docker secrets or Vault to store secrets instead of config files with hard values
  • I assume Heroku has an internal secrets manager
  • Optimize the AskJeeves function that does the MySQL lookup for old entries
  • If an IP address has a GeoIP location for its country but not a city it will crash the application. This has been added to the “to do” list
