My logging pipeline: Splunk, Logstash, and Kafka

Over the years I have built several logging pipelines in my homelab, each using different technologies and methodologies, but I have finally built a pipeline that suits my needs. When I tell people about my pipeline they usually ask if I have a blog post on it because they want to know more or replicate it. This blog post is my attempt to share my logging pipeline as a framework for newcomers. The hope is that the explanation of the architecture, the design decisions, the working infrastructure-as-code, and the knowledge I have accumulated over the years will be beneficial to the community.

Background

What is Splunk?

Splunk is an advanced, scalable, and effective technology that indexes and searches log files stored in a system. It analyzes machine-generated data to provide operational intelligence. A main advantage of Splunk is that it does not need a separate database to store its data; it makes extensive use of its indexes instead. Splunk is software mainly used for searching, monitoring, and examining machine-generated big data through a web-style interface.

Splunk captures, indexes, and correlates real-time data in a searchable container from which it can produce graphs, reports, alerts, dashboards, and visualizations. It aims to make machine-generated data available across an organization and is able to recognize data patterns, produce metrics, diagnose problems, and provide intelligence for business operations. Splunk is used for application management, security, and compliance, as well as business and web analytics.

Obtaining a Splunk license

Splunk’s common information model

Splunk’s Common Information Model (CIM) is a model for Splunk administrators to follow (but not enforced) so all data sets have the same structure. For example, Suricata uses src_ip for the source IP address but Zeek uses id.orig_h, which at first glance is not a friendly convention. The Splunk Network Traffic CIM states this field should be named src_ip for all network logging sources. This makes a huge impact because instead of having to remember the naming convention for each logging source you only have to know the convention for each CIM data model. Furthermore, cross-correlation between indexes becomes much easier when the indexes share the same field names.

What is Logstash?

Logstash is an open source data collection engine with real-time pipelining capabilities. Logstash can dynamically unify data from disparate sources and normalize the data into destinations of your choice. Cleanse and democratize all your data for diverse advanced downstream analytics and visualization use cases. While Logstash originally drove innovation in log collection, its capabilities extend well beyond that use case. Any type of event can be enriched and transformed with a broad array of input, filter, and output plugins, with many native codecs further simplifying the ingestion process. Logstash accelerates your insights by harnessing a greater volume and variety of data.

What is Kafka?

Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from messaging queue to a full-fledged event streaming platform.

What is Filebeat?

Filebeat is a lightweight shipper for forwarding and centralizing log data. Installed as an agent on your servers, Filebeat monitors the log files or locations that you specify, collects log events, and forwards them to Logstash for indexing.

The architecture of the logging pipeline

Network diagram

Recommended system requirements

  • 4 cores
  • 16 GB of RAM
  • 500 GB of HDD

Why Filebeat?

Typically, in the past I have preferred the Rsyslog client because in my experience it has been a very simple and lightweight logging client. However, I ran into the following issue: shipping multiple Zeek logs while preserving the original log filename. I don’t want to collect all the Zeek logs (dns.log, conn.log, x509.log, ssl.log, etc.) into a single Kafka topic or log file; I want the ability to keep each log source separate. Rsyslog does provide a way to do this but it is NOT clean and NOT easy to debug, so I decided to switch to Filebeat.

Filebeat has numerous capabilities: modules for popular tools (like Zeek and Osquery), the ability to preserve the original filename, TLS-encrypted communications, and the backpressure-sensitive protocol built into Beats. The backpressure-sensitive protocol allows Filebeat and Logstash to communicate; when Logstash is overwhelmed it can ask Filebeat to reduce how much data is being sent. The TLS communication is a huge win because I have boxes in the cloud (screenshot above) and I need to securely ship logs from the cloud to my homelab. Furthermore, the TLS communication supports client certificates, which is something I intend to set up in the future.
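As a rough sketch of what this looks like on the client, a minimal Filebeat configuration for this use case might be the following. This is illustrative, not the filebeat.yml from my repo; the path glob, Logstash hostname, and certificate location are assumptions:

filebeat.inputs:
  - type: log
    paths:
      - /opt/zeek/logs/current/json_streaming_*.log
    json.keys_under_root: true   # Zeek events are already JSON
    tags: ["zeek"]               # consumed later by the Logstash filter/output

output.logstash:
  hosts: ["logstash.mylab.example:5044"]
  ssl.certificate_authorities: ["/etc/filebeat/logstash.crt"]   # trust the self-signed Logstash cert

Each event also carries the originating file in log.file.path, which is how the original filename is preserved end to end.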

Why Logstash?

Again, in the past I have typically preferred the Rsyslog server because in my experience it has been a very simple and lightweight logging server. As stated above, preserving the original log source filename is possible but not clean. While the Rsyslog server has been very faithful for several years, it lacks some modern-day features: it does not scale across multiple instances for high availability, it does NOT handle JSON cleanly, and to implement TLS for TCP and UDP inputs you NEED client certificates OR must use RELP, a protocol built specifically for Rsyslog that is not widely supported. I understand client certificates are more secure and can thwart unauthorized loggers from pushing logs, but my setup has strict firewall rules which mitigate that risk.

Logstash, on the other hand, is an application built in the modern era: it supports a high-availability setup, its configuration files are like code, it works very well with JSON, and it provides TLS without the need for client certificates. There are multiple methods for high availability which can be reviewed here. Configuration as code is very important because it provides simple programming concepts such as if statements, regex capabilities, and the ability to extract and transform data. Lastly, the Beats protocol used by Logstash provides TLS with or without client certificates, which is convenient for my environment. In addition to the Beats protocol, Logstash has a plugin store for inputs, filters, and outputs to support all kinds of services and logs.
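For reference, a minimal Beats input with TLS in Logstash looks roughly like this (a sketch with placeholder certificate paths; note the Beats input plugin expects the private key in PKCS#8 format):

input {
  beats {
    port            => 5044
    ssl             => true
    ssl_certificate => "/usr/share/logstash/config/logstash.crt"
    ssl_key         => "/usr/share/logstash/config/logstash.key"   # must be PKCS#8
  }
}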

Why Kafka?

I’ve told a bunch of people about this blog post and a common question I got was “why Kafka?”. There are numerous reasons for using Kafka in your pipeline, but the reason I chose Kafka was that it can support multiple consumers for a single data source. In this blog post I only cover shipping Zeek logs from my TOR node to Splunk. However, I have other tools and services that would benefit from Zeek’s data, such as Neo4j. Neo4j is a graph database that I can use to map relationships between data sources, but more on that in a future blog post.

In the future, I also plan on implementing an enrichment process for my logs and using these data sources for future projects. That means I would have Zeek DNS data coming in, enrich the log event with VirusTotal, and ship the enriched event to Splunk. This process allows me to write the data ONCE to Splunk and not have to query, enrich, and re-write the data back as some implementations do. I’m also aware that Splunk supports enriching data on ingestion, but other consumers of that data wouldn’t benefit from it.

If you haven’t read this LinkedIn article about the architecture of Kafka, you really should because it’s super fascinating. LinkedIn designed Kafka because they needed a solution that was fault-tolerant, extremely fast, and supported multiple producers and consumers for a single topic. LinkedIn attributes the speed of Kafka to its storage design: “Kafka logs are write-only structures, meaning, data gets appended to the end of the log. When using HDD, sequential reads and writes are fast, predictable, and heavily optimized by operating systems. Using HDD, sequential disk access can be faster than random memory access and SSD”. These design decisions ensure that Kafka can handle any homelab load without losing data. Furthermore, if I need to bring Splunk down for maintenance I can do so without loss of data because Kafka will act as a cache until Splunk comes back up.

Why Splunk?

I have tried ELK, Graylog, and Splunk and I will be honest: I just like Splunk better. I know it’s not free, but it is just better in all the ways that matter to me, and being a student allows me to get a license for free. I know there are some die-hard ELK and Graylog users who say with a little magic these platforms can be as good as Splunk. Not gonna lie, I don’t wanna spend the time to make that magic. For example, ELK requires you to run and configure a separate tool (Elasticsearch Curator) to clean old data out of your indexes. In Splunk, during the creation of an index you select the data-rotation options you want and it just does it. Also, let’s be honest, ELK and Graylog CANNOT compete with Splunk’s powerful query language.

Why Docker?

I actually debated this for a while, but Docker has some really nice features which can be used to enforce good practices. For example, if properly configured, everything is pinned to a known working version with working configurations. I don’t have to manage a Java install on a machine for Logstash; I just use a container. Second, I am able to control how many resources (CPUs and memory) each container can consume. If you allow Kafka to consume 64 GB of memory it will, but if you confine it to 1 GB it will not exceed that threshold.
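As a sketch of that last point, resource limits can be declared right in the compose file (Compose v2.x syntax shown; v3 files use deploy.resources under swarm, and the service/image names here are placeholders):

services:
  kafka:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}   # hypothetical version variable
    mem_limit: 1g   # Kafka is capped at 1 GB of memory
    cpus: 2.0       # and at most two CPUs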

Thirdly, Docker restarts services if a container crashes, which is really nice. Fourth, Docker provides the ability to load balance and dynamically scale the number of instances up and down. Fifth, Docker provides the ability to have infrastructure-as-code. Sixth, Docker networks are SOOOOOOOO nice! Both Kafka and Splunk need to use port 8088 but only one of those services needs to be publicly exposed to the world. Since Splunk is contained on the splunk-backend network, I can still use port 8088 without interrupting the publicly exposed Kafka KSQLDB-server.

The flow of data explained

This section will cover in detail the journey a log entry makes from the client to Splunk. For this explanation I am going to use my TOR node as the starting point. On my TOR node I have Zeek monitoring all network connections, which are logged to disk (/opt/zeek/logs/current/json_streaming_[conn,dns,x509,ssl,etc].log). I installed the Corelight plugin that writes all logs to disk in JSON format, because by default Zeek uses TSV. I could have configured Zeek to write data to a Unix socket so the logs don’t eat disk space, but that is NOT a good idea: if for some reason my logging pipeline goes down or the internet doesn’t work, those Zeek logs are still persisted to disk. Just as a typical sysadmin wants backups of their data, in cybersecurity the endpoint can act as a backup for logs.
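For context, a json_streaming_dns.log entry looks roughly like this (a representative, abridged event, not a capture from my node) – note the _path field added by the streaming-logs package and the id.orig_h/id.resp_h names that Logstash later renames:

{"_path":"dns","ts":"2020-03-08T12:00:00.000000Z","id.orig_h":"10.0.0.5","id.orig_p":51234,"id.resp_h":"1.1.1.1","id.resp_p":53,"proto":"udp","query":"example.com","qtype_name":"A","answers":["93.184.216.34"]}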

Next, Filebeat is on the machine monitoring the following directory for changes: /opt/zeek/logs/current. Filebeat is configured to send data to Logstash using the Beats protocol over TLS. In addition to sending log entries to Logstash, Filebeat tags each log entry with the tag zeek, which will be utilized by Logstash.

Logstash is configured with one input for Beats, but it can support more than one input of varying types. Next, the Zeek log is run through the various configured filters. The Logstash pipeline provided has a filter for all logs containing the tag zeek. This filter strips off any metadata added by Filebeat, drops any Zeek logs that don’t contain the field _path, and mutates the Zeek field names to the field names specified by the Splunk CIM (id.orig_h -> src_ip, id.resp_h -> dest_ip). Once the Zeek log entry has passed through all the matching filters it is run through all matching outputs.
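Condensed, the filter looks something like this – a sketch of the idea rather than a verbatim copy of the repo’s pipeline config, and the exact metadata fields removed are assumptions:

filter {
  if "zeek" in [tags] {
    # Drop events that don't carry the Zeek log type
    if ![_path] {
      drop { }
    }
    mutate {
      # Strip Filebeat/ECS metadata (field list assumed)
      remove_field => ["agent", "ecs", "input", "host"]
      # Rename Zeek fields to Splunk CIM field names
      rename => { "id.orig_h" => "src_ip" "id.resp_h" => "dest_ip" }
    }
  }
}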

My pipeline is configured to send all logs with the zeek tag to Kafka; more specifically, each Zeek log type is sent to its own topic via topic_id => "zeek_%{[_path]}". This configuration line extracts _path (the Zeek log type: dns, conn, x509, ssl, etc.) and sends the event to the corresponding topic. In addition to sending all Zeek logs to Kafka, Logstash ensures delivery by instructing Kafka to send back an ACK when it receives a message, much like TCP.
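The corresponding output stanza is roughly the following (again a sketch; the broker address is a placeholder, and acks => "1" is one way to request the acknowledgement described above):

output {
  if "zeek" in [tags] {
    kafka {
      bootstrap_servers => "kafka:9092"        # placeholder broker address
      topic_id          => "zeek_%{[_path]}"   # one topic per Zeek log type
      codec             => json
      acks              => "1"                 # wait for an ACK from the topic leader
    }
  }
}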

Thanks to the people over at Confluent, the Kafka stack is actually pretty awesome – seriously, shout out to all their hard work! For this pipeline, Kafka is just being used to stream log entries from one point to another and it has been configured to do so. For example, Kafka has been configured to keep log entries for at most 72 hours or up to 100 GB of logs. The hope with my homelab is that I never generate more than 100 GB before the data can be consumed, and that I never generate data that isn’t consumed within 72 hours. In addition to data retention, multiple consumers can hook into Kafka topics and pull the latest events.
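With the Confluent images, that retention policy boils down to two broker settings, which can be supplied as environment variables in the compose file (values match the limits above; the KAFKA_* naming is the Confluent image convention for broker properties):

environment:
  KAFKA_LOG_RETENTION_HOURS: 72             # log.retention.hours
  KAFKA_LOG_RETENTION_BYTES: 107374182400   # log.retention.bytes (100 GB)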

My homelab setup contains two consumers: one is a Kafka Connect consumer (more on this later) and the other is a Python script that analyzes the DNS queries recorded by Zeek and adds malicious domains to MISP. However, this blog post is going to focus on the Kafka Connect connector for Splunk, which is instructed to ingest all logs from the zeek_* Kafka topics and forward that data to Splunk using an HTTP Event Collector (HEC). I should mention that this blog post does not utilize all the Kafka stack components, but the hope is that in future blog posts I will implement the HELK logging project but with Splunk.

A Splunk HTTP Event Collector (HEC) is an HTTP(S) endpoint that a producer can use to send data to Splunk. Each HEC input is assigned a token which is used as the “API key” to authenticate sending data. Kafka Connect uses this mechanism to pull data from Kafka topics and push the data into Splunk. Splunk ingests the data from the HEC input and stores the data in the tor-zeek index to be searched! In addition to sending data, Kafka Connect is also configured to ensure delivery by requesting that Splunk send back an acknowledgement (ACK) that the data has been received, again like TCP.
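You can also exercise an HEC input by hand, which comes in handy for troubleshooting later. A minimal test event looks like this (the hostname, port, and token are placeholders for my setup; -k skips validation of Splunk’s self-signed certificate):

curl -k https://splunk:8088/services/collector/event \
  -H "Authorization: Splunk <HEC token>" \
  -d '{"event": {"message": "hello HEC"}, "index": "tor-zeek", "sourcetype": "_json"}'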

Install/Setup Docker on Ubuntu 18.04

Install Docker

  1. SSH into Ubuntu 18.04 VM
  2. apt install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
  3. curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
  4. sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  5. sudo apt update -y
  6. sudo apt install docker-ce -y
  7. systemctl enable docker
  8. docker swarm init --advertise-addr <IP addr of Ubuntu VM>

Setup firewall

  1. ufw allow OpenSSH
  2. ufw allow 2376/tcp
  3. ufw allow 2377/tcp
  4. ufw allow 7946/tcp

Optional: Connect to Docker swarm from macOS

  1. Install Docker on macOS
  2. docker-machine create --driver generic --generic-ip-address=<IP addr of Ubuntu VM> --generic-ssh-key ~/.ssh/id_rsa --generic-ssh-user=<username of Ubuntu VM> splunk
  3. docker-machine ls
  4. docker-machine env splunk
  5. eval $(docker-machine env splunk)

Install/Setup Logging Pipeline with Docker

Get repo

  1. git clone https://github.com/CptOfEvilMinions/MyLoggingPipeline
  2. cd MyLoggingPipeline

Generate TLS certificates

NGINX

  1. openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout conf/nginx/ssl/nginx.key -out conf/nginx/ssl/nginx.crt
    1. Enter a country
    2. Enter a state
    3. Enter a city
    4. Enter an organization
    5. Enter an org unit
    6. Enter a common name
    7. Enter an e-mail
  2. openssl dhparam -out conf/nginx/ssl/dhparam.pem 2048

Logstash

  1. openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout conf/logstash/ssl/logstash.key -out conf/logstash/ssl/logstash.crt
    1. Enter a country
    2. Enter a state
    3. Enter a city
    4. Enter an organization
    5. Enter an org unit
    6. Enter a common name
    7. Enter an e-mail

Spin up logging pipeline

.env

The .env included in this repo pins all the Docker images to specific versions (screenshot below) to ensure operability. These values can be modified to support new releases or to perform rolling upgrades of this logging pipeline. It is important to note that the Confluent Kafka stack is HIGHLY dependent on every component being the same version, which is why all Confluent components are currently pinned to the same version.
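For illustration only, such a file takes this shape (the variable names and versions below are hypothetical placeholders – check the repo’s .env for the real ones):

# Hypothetical example – see the repo's .env for actual names/values
SPLUNK_VERSION=8.0.2
LOGSTASH_VERSION=7.6.0
CONFLUENT_VERSION=5.4.1   # all Confluent images pinned to the same release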

To roll your own Splunk or to not roll your own Splunk

The official Splunk container does not allow the admin password to be rotated (more on this in the discussion section) because doing so breaks its pre-flight checks on start. Since I wanted the ability to rotate the Splunk admin credentials, I decided to install Splunk on the actual host. The biggest changes with this setup are that Splunk runs on the host and that the NGINX responsibilities for Splunk are handled by an NGINX instance on the host. This was the SIMPLEST method I could engineer to ensure the configuration of this stack has minimal friction between the two setups. If you are just setting up a development pipeline or don’t care that you can’t change the admin password, skip ahead to the next section (Docker-compose build).

If you’re curious how this works: we connect Kafka Connect to the Docker default network, which has access to network services on the host. To keep all the configs consistent I used the extra_hosts docker-compose setting to map the hostname splunk to 172.17.0.1, which is the docker0 interface and can talk to the host.
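In docker-compose terms that mapping is just the following (the service name is a placeholder for the Kafka Connect service):

services:
  connect:
    extra_hosts:
      - "splunk:172.17.0.1"   # docker0 bridge IP on the host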

Below are instructions to set up NGINX on the host to route HTTP(S) traffic for Splunk. Basically, any time a Docker container makes an HTTP call to the hostname splunk, the call goes through the NGINX instance on the host, which forwards it to Splunk. In theory, if you have Splunk running on a remote box that the Docker stack is not running on, you can simply set the extra_hosts setting in the Docker compose file to the IP address of your remote Splunk instance. While this setup is not pretty nor desirable, it provides interoperability between running Splunk via Docker or elsewhere.

Setup Splunk on host

  1. Install Splunk on your Ubuntu 18.04 host by following this guide
  2. Login into the Splunk VM
  3. sudo sed -r -i 's/SPLUNK_BINDIP=(\b[0-9]{1,3}\.){3}[0-9]{1,3}\b/SPLUNK_BINDIP=127.0.0.1/g' /opt/splunk/etc/splunk-launch.conf
    1. Forces Splunk to listen on 127.0.0.1
  4. sudo systemctl restart splunk && sudo systemctl status splunk
  5. sudo apt-get update -y && sudo apt-get install nginx curl -y
  6. curl https://raw.githubusercontent.com/CptOfEvilMinions/MyLoggingPipeline/master/conf/nginx/nginx.conf --output /etc/nginx/nginx.conf
  7. curl https://raw.githubusercontent.com/CptOfEvilMinions/MyLoggingPipeline/master/conf/nginx/splunk_host_web.conf --output /etc/nginx/conf.d/splunk_web.conf
  8. curl https://raw.githubusercontent.com/CptOfEvilMinions/MyLoggingPipeline/master/conf/nginx/splunk_host_hec_input.conf --output /etc/nginx/conf.d/splunk_hec.conf
  9. curl https://raw.githubusercontent.com/CptOfEvilMinions/MyLoggingPipeline/master/conf/nginx/splunk_host_api.conf --output /etc/nginx/conf.d/splunk_api.conf
  10. sed -i 's/192.168.34.80/<IP addr of host>/g' /etc/nginx/conf.d/splunk_*.conf
  11. sudo systemctl restart nginx && sudo systemctl status nginx
  12. sudo netstat -tnlp | grep 'nginx\|splunkd'
  13. Below instead of running docker-compose ... you need to run docker-compose -f docker-compose-no-splunk.yml ...

Docker-compose build

  1. docker-compose build

Docker-compose up

  1. docker-compose up

Create Kafka Splunk connector

Manual method

I often get feedback from the community that I automate too much and it makes it hard to learn how the internals work. This section attempts to rectify that by showing the manual process to create a Kafka Connect consumer.

Create a Splunk index

  1. Settings > Indexes
  2. Select “New index” in the top right
  3. Enter tor-zeek into index name
  4. Select “Save” at the bottom

Enable/Create Splunk HEC input

  1. Settings > Data inputs > HTTP Event collector
  2. Select “Global Settings” in the top right
    1. Select “Enabled” for “All tokens”
    2. Enter 8090 for port
    3. Select “save”
  3. Select “New token” in the top right
    1. Select source
      1. Enter kafka-splunk-connector for name
      2. Check Enable indexer acknowledgement
      3. Select Next
    2. Input settings
      1. Select _json for “source type”
      2. Select Search & Reporting (search) for “app context”
      3. Move tor-zeek from the left “available items” to “selected items” under the index section
      4. Select review at the top
    3. Review
      1. Select Submit
  4. Settings > Data inputs > HTTP Event collector
  5. Copy token value for kafka-splunk-connector

Create Kafka Connect consumer with CURL

This section demonstrates how to manually create a Kafka Connect consumer with curl. The curl command makes an API call to Kafka Connect to create a consumer that will ingest log entries from the specified Kafka topics and forward them to a Splunk HEC input at a specified URL with the specified token. Thanks to Docker networks we can just specify the hostname of Splunk as “splunk” and the Docker engine will take care of resolving it to the container. Also, note that Splunk generates a self-signed certificate for the HEC inputs, which is why splunk.hec.ssl.validate.certs is set to false.

curl http[s]://<Kafka Connect IP addr/FQDN>:8083/connectors -X POST -H "Content-Type: application/json" -d'{
  "name": "<Name of Kafka Connector consumer for Splunk>",
    "config": {
     "connector.class": "com.splunk.kafka.connect.SplunkSinkConnector",
     "tasks.max": "10",
     "topics": "<List of Kafka topics separated by commas: zeek_conn, zeek_dns, zeek_ssl, etc>",
     "splunk.hec.uri": "https://splunk:8088",
     "splunk.hec.token": "<Splunk HEC token>",
     "splunk.hec.ack.enabled": "true",
     "splunk.hec.raw": "false",
     "splunk.hec.track.data": "true",
     "splunk.hec.ssl.validate.certs": "false"
    }
}'

Ensure Kafka Connect consumer was created

  1. Open a browser to https://<Docker IP addr>:9021
  2. Select “Cluster 1” on the left
  3. Select “Consumers” on the left

The automated method with my Python script

As a true sysadmin at heart, I HATE manual processes, especially when I know I will have to repeat those tasks. I assume that in the future I will be onboarding new logging sources, so having a script is a must. This Python script can perform the entire manual process demonstrated above – create a Splunk index, enable HEC, create an HEC token, and create a Kafka Connect consumer for Splunk – or perform select operations; for more options run python3 splunk-kafka-connector.py --help.
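To give a feel for what the script does under the hood, here is a stripped-down sketch of just the Kafka Connect step. This is not the actual splunk-kafka-connector.py – the URLs, token, and connector name are illustrative assumptions:

import requests  # pip install requests

# Illustrative values – the real script reads these from conf/python/config.yml
CONNECT_URL = "https://connect.mylab.example:8083"   # Kafka Connect REST API
HEC_TOKEN = "<token created via the Splunk API>"

connector = {
    "name": "kafka-splunk-connector",   # hypothetical connector name
    "config": {
        "connector.class": "com.splunk.kafka.connect.SplunkSinkConnector",
        "topics": "zeek_conn,zeek_dns,zeek_ssl",
        "splunk.hec.uri": "https://splunk:8088",
        "splunk.hec.token": HEC_TOKEN,
        "splunk.hec.ack.enabled": "true",
        "splunk.hec.ssl.validate.certs": "false",
    },
}

# Same API call the manual curl command makes
resp = requests.post(f"{CONNECT_URL}/connectors", json=connector, verify=False)
resp.raise_for_status()
print(resp.json())

The steps to run the real script are below.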

  1. virtualenv -p python3 venv
  2. source venv/bin/activate
  3. pip3 install -r requirements.txt
  4. mv conf/python/config.yml.example conf/python/config.yml
  5. open conf/python/config.yml
    1. Splunk
      1. Set external_url to a URL that can be used to reach Splunk externally
      2. Set username to an admin user for Splunk
      3. Set password to a password for an admin user
      4. Set index_name to the name you want the index to have in Splunk
    2. Kafka
      1. Set connect_extenral_url to a URL that can be used to reach Kafka Connect externally
      2. Set topics to a list of Kafka topics you want to be consumed and ingested into the index specified above
  6. python3 splunk-kafka-connector.py --all

Search Splunk index

  1. Select “Search & Reporting”
  2. Enter index="tor-zeek"

Install/Setup Filebeat on Ubuntu 18.04

Install Filebeat

  1. wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
  2. sudo apt-get install apt-transport-https -y
  3. echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
  4. sudo apt-get update -y && sudo apt-get install filebeat -y
  5. sudo systemctl enable filebeat

Setup Zeek logging

  1. sudo curl https://raw.githubusercontent.com/CptOfEvilMinions/MyLoggingPipeline/master/conf/filebeat/filebeat.yml -o /etc/filebeat/filebeat.yml
  2. sudo curl https://raw.githubusercontent.com/CptOfEvilMinions/MyLoggingPipeline/master/conf/filebeat/filebeat_zeek.yml -o /etc/filebeat/conf.d/zeek.yml
  3. sudo sed -i 's/logstash_ip_addr/<Logstash IP addr or FQDN>/g' /etc/filebeat/filebeat.yml
  4. sudo sed -i 's/logstash_port/<Logstash port>/g' /etc/filebeat/filebeat.yml
  5. sudo systemctl restart filebeat

Testing data flow/Troubleshooting

Ensure Filebeat is transmitting data

  1. SSH into the machine with Filebeat
  2. apt-get update -y && apt-get install tcpdump -y
  3. sudo tcpdump -i any port 5044
    1. Ensure traffic is being sent to your server AND you are not receiving resets ([R])
    2. If data is not being sent, ensure that data is being written to the log file
    3. If data is not being sent, ensure Filebeat loaded correctly and there are no errors – systemctl status filebeat

Ensure Logstash is receiving data

  1. SSH into the machine hosting Logstash (the Docker host)
  2. apt-get update -y && apt-get install tcpdump -y
  3. sudo tcpdump -i any port 5044
    1. If data is not being received, ensure traffic is being received from your logging client AND you are not receiving resets ([R])
    2. If data is not being received, check firewall rules
    3. If data is not being received, ensure Logstash loaded correctly with no errors – docker logs logstash

Ensure Kafka is receiving data from Logstash

  1. brew install kafkacat
  2. kafkacat -b <IP addr of Docker>:9092 -t zeek_conn -C
  3. If data is not being received, check Logstash logs for errors
  4. If data is not being received, check Kafka logs for errors – docker logs kafka

Ensure Kafka connector exists

  1. Open a web browser to https://<IP addr of Docker>:9021
  2. Select “Cluster 1” on the left
  3. Select “Consumers” on the right
  4. If Kafka Connect consumer does not exist, check Kafka Connect logs for errors – docker logs connect
  5. If data is not being received, check Kafka logs for errors – docker logs kafka

Ensure Kafka Connect can talk to Splunk

  1. docker ps
  2. docker exec -it <Docker container ID for Kafka Connect> bash
  3. apt update -y && apt install nmap net-tools iputils-ping -y --force-yes
  4. ping splunk
  5. nmap -p80,443,8089,8090 splunk

Ensure Splunk is receiving data from Kafka Connect

  1. Open a web browser to https://<IP addr of Docker>:443
  2. Login into Splunk
  3. Select “Search and reporting” on the left
  4. Enter index="tor-zeek"
  5. If no data exists in the index, check Kafka Connect logs for errors – docker logs connect

Discussion

Frustration: Not enterprise-ready

Splunk provides an official Docker container with Splunk Enterprise, yet you can’t change the admin password without breaking the container. If you spin up the official Splunk Docker container, log into the web UI, change the admin password, and restart the container, it will fail to boot, as documented via this Github issue here. I DO NOT blame the engineers at Splunk nor the maintainers of the Splunk Docker image because they have done good work. I blame bad marketing because it insinuates that Splunk on Docker is fully capable when it is not. As much as I wanted to run Splunk via Docker in my homelab environment, I instead did a typical install, but the rest of the stack is dockerized.

System load

Lessons learned

I am currently reading the book “Cracking the Coding Interview” and it is a great book. One interesting part of the book is its matrix for describing projects you worked on, which contains the following sections: challenges, mistakes/failures, enjoyed, leadership, conflicts, and what you’d do differently. I am going to try to use this model at the end of my blog posts to summarize and reflect on the things I learned. I don’t blog to post things that I know; I blog to learn new things and to share the knowledge gained from my security research.

New skills/knowledge

  • Learned how to use Splunk
  • Learned how to securely transmit logs from client to server
  • Learned how to use and configure the new version of Logstash
  • Learned how to use a Docker .env file to pin versions for Docker images inside Dockerfiles
  • Learned how to use the Splunk SDK for Python
  • Learned how to use the Kafka Connect API
  • Learned the Confluent Kafka stack and how all the services work and integrate together
  • Learned how to troubleshoot the Confluent Kafka stack
  • Learned about Splunk’s Common Information Model (CIM) and applied it
  • Learned about how Docker networks work

Challenges

  • The Confluent cp-all-in-one-community/docker-compose.yml uses a cp-server Docker image which is specially configured for a demo. Once I replaced that Docker image with images for each service I was able to get a working PoC.
  • For some odd reason, the Splunk HEC input kept switching from Non-SSL to SSL input when the container was restarted
  • Splunk container doesn’t handle password change after a restart –  Github issue – Splunk fails to start after changing admin password
  • Passing Docker container network traffic to services running on the host

Mistakes and failures

  • Learning how to get the Kafka Cluster ID to be persistent – the Kafka broker stores its logs at /var/lib/kafka/logs but ZooKeeper stores them at /var/lib/zookeeper/log, and I assumed it would be “logs”, which led to the Cluster ID mismatch.

What I’d do differently

  • Enforced mutual TLS on all endpoints that ingest traffic and placed client certificates on all logging clients
