
Intro

What started off as a very simple goal turned into a rather large project that I’m quite happy with. I wanted my homelab Splunk instance to be able to send me emails about various things happening in my environment, but Spectrum doesn’t allow SMTP traffic to originate from their residential IP addresses, so anything sourcing directly from one of my servers was out.

I could have set up Splunk to work with Gmail quite easily, but that only solves my problem of getting Splunk to send emails. What if I wanted my pfSense box to send email out, or my Ansible server? Those might not be the most likely use cases, but the more I’ve been working in my homelab, the more I’ve appreciated reusable solutions, which the Email Server Settings in Splunk are not… This led me down the rabbit hole of designing a full alert framework for my homelab using Splunk and Ansible.

Quick note, my use of Ansible and Python is mostly the result of self-teaching, so I’m very open to comments, suggestions, and criticisms.

Design

In designing the alert framework, I had five major goals in mind:

  1. A single entry-point script that could handle every permutation of actions required, based on an alert’s specific needs
  2. As many alert actions as possible wrapped up into Ansible playbooks, both for reusability and for integration into the broader automation approach I’m working on
  3. Creation of alerts and their actions would be done completely through the Splunk UI
  4. Python as my primary scripting language
  5. As many of the components as possible would record their successes and failures to logs, which would then be ingested into Splunk for error monitoring

Flow

The workflow I settled on starts with Splunk searches finding results that need to be alerted on. Those results are passed to a Python script, which opens and parses them. The script then opens a lookup kept locally on the Splunk Search Head, which contains a unique entry for each Splunk search.

Based on the flags set in the lookup, the script then calls Ansible playbooks for each necessary action. These Ansible playbooks serve as wrappers for additional Python scripts that handle all of the heavy lifting of input validation and manipulation, contacting APIs, and so on, but more on that in the next section…
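Before diving into the details, here’s roughly the shape the base script takes, condensed from the pieces covered in the next section. Exactly how search_name and search_results arrive from Splunk isn’t shown in this write-up, so treat those as placeholders:

def main():
	global logger
	logger = configure_logging()
	# search_name and search_results come from Splunk's alert invocation (placeholders here)
	lookup_entry = open_lookup(search_name)
	send_email, create_ticket, ticket_urgency = split_lookup_into_params(lookup_entry)
	if send_email:
		pass  # call the email playbook via ansible-playbook
	if create_ticket:
		pass  # call servicenow_createincident.yml (covered below)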

Technical Functionality

Lookup File – alert_actions_config.csv

So how does it all work? The first thing to cover is the alert_actions_config.csv lookup file that serves as a configuration base for all of the actions that the different searches should execute. This lookup contains one entry per alert running in the environment, each on a separate row. The structure of a lookup entry is to give the search name, a pipe to serve as a delimiter, and then the actions to perform, like so:

Auth – Failed Login Attempts|send_email=true, create_ticket=true, ticket_urgency=1
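For illustration, a lookup covering a few alerts might look like this (the second and third entries are made up, but follow the same pattern):

Auth – Failed Login Attempts|send_email=true, create_ticket=true, ticket_urgency=1
Backups – Nightly Job Failed|create_ticket=true, ticket_urgency=2
Patching – Reboot Required|send_email=true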

Base Script – alert_actions.py

Instead of dropping a wall of code on this page, I’ll go over some of the highlights as to how the scripts work. If you’d like to see the source code, please let me know and I can get it over to you.

I’m going to be following through the process of opening a ServiceNow incident here, but the process for sending an email is rather similar.

Number 5 on my priority list was to make sure every component in this workflow wrote adequate logging. To that end, every Python script is configured to log:

import logging
import sys
from logging.handlers import RotatingFileHandler

def configure_logging():
	logger = logging.getLogger(str(sys.argv[0]))
	logger.setLevel(logging.INFO)
	handler = RotatingFileHandler('/var/log/python.log', maxBytes=2000000, backupCount=5)
	handler.setLevel(logging.INFO)
	formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
	handler.setFormatter(formatter)
	logger.addHandler(handler)
	return logger

def main():
	global logger
	logger = configure_logging()
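With that in place, any script in the chain can write to the shared log file, and the formatter keeps the entries consistent. A quick illustration (the timestamp and logger name below are just examples of what the formatter produces, since the name is whatever sys.argv[0] held when the script launched):

logger.info('alert_actions.py started')
# Lands in /var/log/python.log looking roughly like:
# 2021-03-14 09:26:53,589 - ./alert_actions.py - INFO - alert_actions.py started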

The search results are opened by the script, and then the lookup is opened from the search head. You can see here an example of the error logging I’m implementing throughout the scripts: if the script were to break for some reason, we end up with logs written that help us determine what’s going on:

def open_lookup(search_name):
	path = '/opt/splunk/etc/apps/search/lookups/alert_actions_config.csv'
	try:
		lookup = open(path, 'r')
	except OSError:
		logger.error('Cannot open ' + str(path))
		sys.exit(1)
	lookup_entry = 'not found'
	for line in lookup:
		# Sometimes Lookup Editor surrounds strings in double quotes.  If so, get rid of the starting double quote.
		if str(line).startswith('"'):
			line = line[1:]
		if str(line).startswith(search_name):
			lookup_entry = line
	lookup.close()
	if lookup_entry == 'not found':
		logger.error('Error in alert_actions.py - unable to find a matching search_name in alert_actions_config.csv - search_name looked for:  ' + search_name)
		sys.exit(1)
	else:
		return lookup_entry
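The piece that opens the search results themselves isn’t shown here, so as a minimal sketch: assuming the results arrive as a path to the gzipped CSV that Splunk writes for alert scripts (exactly how that path reaches the script depends on which Splunk alert interface you’re using), reading them into the list-of-dictionaries shape the later scripts expect could look like this:

import csv
import gzip

def open_search_results(results_path):
	# Sketch only: read Splunk's gzipped results CSV into a list of dictionaries.
	# 'logger' here is the global logger set up in configure_logging().
	try:
		with gzip.open(results_path, 'rt') as results_file:
			return list(csv.DictReader(results_file))
	except OSError:
		logger.error('Cannot open search results file ' + str(results_path))
		sys.exit(1)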

Once we have the lookup entry retrieved from the search head, it’s fairly simple to break it down and determine which actions we’d like to execute:

def split_lookup_into_params(lookup_entry):
	alert_actions = lookup_entry.split('|')[1]
	# action flags
	send_email, create_ticket = False, False
	# Specific Action Flags - ServiceNow
	ticket_urgency = 0
	if 'send_email=true' in alert_actions:
		send_email = True
	if 'create_ticket=true' in alert_actions:
		create_ticket = True
	if create_ticket:
		if 'ticket_urgency=' in alert_actions:
			m = re.search(r'ticket_urgency=(?P<urgency>\d+)', alert_actions)
			if m:
				ticket_urgency = m.group('urgency')
	return send_email, create_ticket, ticket_urgency
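Running the example lookup entry from earlier through that function gives back the flags the rest of main() keys off of:

entry = 'Auth – Failed Login Attempts|send_email=true, create_ticket=true, ticket_urgency=1'
send_email, create_ticket, ticket_urgency = split_lookup_into_params(entry)
# send_email     -> True
# create_ticket  -> True
# ticket_urgency -> '1' (a string when matched by the regex; main() casts it with int() before use)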

Lastly, within main, we handle the calling of the Ansible playbook to send the email or create the incident:

	send_email, create_ticket, ticket_urgency = split_lookup_into_params(lookup_entry)
	if create_ticket:
		# Ansible doesn't like spaces in the variables that get passed to it, so replace those with '%20'
		search_results = str(search_results).replace(' ', '%20')
		search_name = search_name.replace(' ', '%20')
		# Ansible is also very picky about quotes in variables, so replace all single quotes with the string 'singlequote'.
		# It's hinky, but it works
		search_results = search_results.replace("'", 'singlequote')
		# Input validation
		if int(ticket_urgency) < 1:
			ticket_urgency = 1
		elif int(ticket_urgency) > 3:
			ticket_urgency = 3
		# Create a command to run on the local box, with formatting variables
		command = 'ansible-playbook /etc/ansible/playbooks/servicenow_createincident.yml --extra-vars "search_name=' + search_name + ' search_results=' + search_results + ' ticket_urgency=' + str(ticket_urgency) + '"'
		# Execute local command to call ansible-playbook
		completed_process = subprocess.run(command, shell=True, check=True, stdout=subprocess.PIPE)
		# Log result of the local command
		logger.info(str(completed_process.stdout))
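As an aside, the %20 and singlequote substitutions are only needed because the whole command is built as a single shell string. An alternative worth considering (a sketch, not the code that’s actually running here) is to pass the command as a list and hand ansible-playbook its extra vars as JSON, which copes with spaces and quotes on its own:

import json
import subprocess

# Sketch of an alternative invocation; playbook path and variable names match the ones above
extra_vars = json.dumps({
	'search_name': search_name,
	'search_results': str(search_results),
	'ticket_urgency': int(ticket_urgency),
})
command = ['ansible-playbook', '/etc/ansible/playbooks/servicenow_createincident.yml',
           '--extra-vars', extra_vars]
completed_process = subprocess.run(command, check=True, stdout=subprocess.PIPE)
logger.info(str(completed_process.stdout))

The playbook side would also need adjusting to cope with spaces in the templated variables, which is part of why the substitution approach is what’s in place today.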

At this point, things hop over to Ansible playbook land…

Ansible Playbook – servicenow_createincident.yml

This section is quite simple. At this point, all we’re doing is running an additional Python script and using Ansible as the wrapper around it. The upside is that we can call this playbook from other places and for other purposes in the future, and it will always act the way we expect:

---
- hosts: localhost
  remote_user: root
  tasks:
    - name: Creates a ServiceNow Incident with specified parameters
      command: /usr/bin/python3.6 /opt/splunk/bin/scripts/servicenow_createincident.py {{search_name}} {{search_results}} {{ticket_urgency}}
...

ServiceNow Script – servicenow_createincident.py

So far, we’ve successfully delivered the name of the search, the results from the search, and the ticket’s urgency over to this new script. First things first: open and format the arguments into a form that’s a bit easier to work with:

def open_and_format_arguments():
	search_name = str(sys.argv[1]).replace('%20', ' ')
	search_results = str(sys.argv[2]).replace('%20', ' ')
	search_results = search_results.replace('singlequote', "'")
	search_results = ast.literal_eval(search_results)
	ticket_urgency = int(sys.argv[3])
	if ticket_urgency > 3:
		ticket_urgency = 3
	elif ticket_urgency < 1:
		ticket_urgency = 1
	search_results_keys = []
	# note, the syntax highlighting in wordpress bugs out if you use the word "code" in brackets, so replace the 0 with an o
	search_results_string = '[c0de]<h3>Alert Details:</h3>'
	for key in search_results[0].keys():
		search_results_keys.append(key)
	for i in range(len(search_results)):
		for j in range(len(search_results_keys)):
			search_results_string += '<p>' + str(search_results_keys[j]) + ':  ' + str(search_results[i][search_results_keys[j]]) + '</p>'
	search_results_string += '[/c0de]'
	return search_name, search_results_string, ticket_urgency

Breaking down what’s going on there: first, the script strips out the encoded characters that were added to get the values through Ansible’s parsing. Then we evaluate search_results literally to get it back into its original list-of-dictionaries form. Next, we format the search results into an HTML string that can be passed as a comment to ServiceNow: we prepend a bit to the string, loop through the search results adding each key and value wrapped in HTML tags, append a little more markup, and we’re ready to pass back the values for opening the ServiceNow incident.
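As a concrete, made-up example, a single-row result would come out the other side looking like this (with the same 0-for-o substitution mentioned in the code comment above):

# Hypothetical single-row result:
search_results = [{'Event_Details': 'The house is on fire'}]
# The formatting loop above would build:
# '[c0de]<h3>Alert Details:</h3><p>Event_Details:  The house is on fire</p>[/c0de]'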

Now for where everything comes together:

def create_incident(search_name, search_results_string, ticket_urgency):
	url = 'https://redacted.service-now.com/api/now/table/incident?sysparm_limit=10'
	user = ''
	pwd = ''

	headers = {"Accept":"application/json"}

	form_data='{"comments":"' + str(search_results_string) + '","short_description":"' + str(search_name) + '","caller_id":"cmbusse","urgency":"' + str(ticket_urgency) + '"}'

	response = requests.post(url, auth=(user, pwd), headers=headers, data=str(form_data))

	if response.status_code != 201: 
		logger.error('Status:  '+ str(response.status_code) + ' Headers:  ' + str(response.headers) + ' Error Response:  ' + str(response.json()))
		sys.exit(1)

	# Log the successful response details
	logger.info('Status:  '+ str(response.status_code) + ' Headers:  ' + str(response.headers) + ' Response:  ' + str(response.json()))

Opening the ServiceNow incident is as simple as sending a POST to my dev instance of ServiceNow. For this I’ll be using the Python requests module and closely following the ServiceNow Table API Python documentation. The ServiceNow fields I’m making use of to display all of the search results are:

  • Short Description – This will contain the name of the alert
  • Caller_id – This will contain my user account on ServiceNow
  • Urgency – The urgency of the incident we set from the lookup
  • Comments – This will contain the search results we formatted previously
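One thing worth flagging about form_data: because it’s concatenated by hand, a stray double quote in the search results would produce invalid JSON. A safer variant (again, a sketch rather than what’s shown above) is to build a dict and let requests serialize it:

payload = {
	'comments': search_results_string,
	'short_description': search_name,
	'caller_id': 'cmbusse',
	'urgency': str(ticket_urgency),
}
# requests' json= parameter serializes the dict and sets the Content-Type header automatically
response = requests.post(url, auth=(user, pwd), headers=headers, json=payload)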

Put it all together, and we get a properly opened ServiceNow incident:

We also see Splunk being aware of all the activity of the scripts and playbooks:

Closing Thoughts

Overall I’m rather happy with the way this project turned out. At some future point I plan on adding the ability to override the static configurations in the lookup through fields created in the search results. I can see a future where a different urgency than what’s in the lookup might be required based on the results of the search, which is something the current architecture can’t handle. My general plan for implementing something like this would be to create fields in the search results like so:

| eval override_ticket_urgency=case(activity_level=="warn", 3,
activity_level=="error", 2, 
activity_level=="critical", 1, true(), 3)

Which would then be present in the search results:

Event_Details            override_ticket_urgency
The house is on fire     1

The action specific python scripts would then look for override fields in the search results, and use those instead of the lookup’s values. There would be one function that pulled out all of the required overrides (and removed the override fields from the results as they wouldn’t be needed):

def check_for_overrides(search_results):
	override_keys = []
	override_values = []
	for key in search_results[0].keys():
		if key.startswith('override_'):
			override_keys.append(key)
			override_values.append(search_results[0][key])
	# Strip the override fields back out of every result row, since they've served their purpose
	for result in search_results:
		for key in override_keys:
			del result[key]
	return search_results, override_keys, override_values

And then when it comes to parsing the lookup, it would also consider the overrides that have been found:

if 'override_ticket_urgency' in override_keys:
	ticket_urgency = override_values[override_keys.index('override_ticket_urgency')]
return send_email, create_ticket, ticket_urgency
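Walking the example row from the table above through both pieces (field values come out of Splunk results as strings, so the urgency still gets cast with int() later, just like the lookup value):

search_results = [{'Event_Details': 'The house is on fire', 'override_ticket_urgency': '1'}]
search_results, override_keys, override_values = check_for_overrides(search_results)
# search_results  -> [{'Event_Details': 'The house is on fire'}]
# override_keys   -> ['override_ticket_urgency']
# override_values -> ['1']
# ...and the snippet above would then set ticket_urgency to '1', overriding the lookup's value.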

This way, when the search results and configuration parameters arrive at the code blocks that handle the API calls, they’ll have the pertinent search results, and then properly adjusted parameters to use. If the need ever arises for something like that, I’ll be sure to write up another blog post and include the link here.

Thanks for reading. If you have any questions or comments, feel free to leave them below.
