CyberSecurity Data Acquisition: Collecting and Processing Log Data with Python

DataScience CyberSecurity - #0001

Jul 31, 2024

A good way to understand what a cybersecurity specialist does is to look at their daily work. Collecting and analyzing log data is central to cybersecurity operations: it helps teams spot potential security threats and improves incident handling. A single log entry has little significance on its own, but once it is consolidated and analyzed together with other records from other sources, it becomes a very effective tool in any investigation. Logs can help answer critical questions about an event, such as:

What happened?
When did it happen?
Where did it happen?
Who is responsible?
Were their actions successful?
What was the impact of their actions?

Why are logs important?

There are several reasons why collecting logs and adopting an effective log analysis strategy is vital for an organization's ongoing operations. Some of the most common uses include:

System Troubleshooting: Analyzing system errors and warning logs helps IT teams understand and quickly respond to system failures, minimizing downtime and improving overall system reliability.
Cyber Security Incidents: In the security context, logs are crucial in detecting and responding to security incidents. Firewall logs, intrusion detection system (IDS) logs, and system authentication logs, for example, contain vital information about potential threats and suspicious activities.
Threat Hunting: On the proactive side, cyber security teams can use collected logs to actively search for advanced threats that may have evaded traditional security measures.
Compliance: Organizations must often maintain detailed records of their systems' activities for regulatory and compliance purposes.

Types of Logs

There are several types of logs, each providing unique insights into system activities:

Application Logs: Messages from specific applications, including their status, errors, warnings, and other operational details.
Audit Logs: Records of activities and events in a system or application, describing what users and the system have done.
Security Logs: Security-related events such as logins, changes to access permissions, changes to firewalls or similar security features, and changes to user accounts.
Server Logs: System logs, event logs, error logs, and access logs, each providing a different kind of server information.
System Logs: Kernel activity, system faults, boot operations, and hardware conditions, which help isolate problems in the system.
Network Logs: Records of events, communications, and exchanged data within a network.
Database Logs: Records of activity within a database system, including queries made, operations carried out, and changes made.
Web Server Logs: Sequential records of the requests received by web servers, including URLs, source IP addresses, request types, response codes, and so on.
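To make the web server log fields above concrete, here is a minimal sketch that parses a single access-log line with a regular expression. It assumes the line follows the Common Log Format; the pattern name, the function name, and the sample line are made up for illustration, and real server configurations often add extra fields.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Made-up sample line, for illustration only.
sample = '203.0.113.7 - - [31/Jul/2024:10:15:32 +0000] "GET /login HTTP/1.1" 401 512'
entry = parse_access_line(sample)
print(entry["ip"], entry["method"], entry["url"], entry["status"])
```

Returning None for lines that do not match lets the caller count or set aside malformed entries instead of stopping on the first one.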

Log Analysis Techniques

Log analysis techniques are the methods used to examine log data. They range from simple to complex and are essential for recognizing patterns, spotting outliers, and identifying key variables. Some common techniques include:

Pattern Recognition: Identifying recurring patterns, cycles, or issues in log data.
Anomaly Detection: Identifying events that fall outside the normal range observed in the collected data.
Correlation Analysis: Comparing entries with one another to connect different events and reveal their relationships and dependencies.
Timeline Analysis: Examining logs over a period of time to reveal trends and seasonal or periodic patterns.
Machine Learning and AI: Machine learning models can augment many of the methods used in log analysis.
Visualization: Graphs and charts represent log data meaningfully and provide information at a glance.
Statistical Analysis: Applying statistical methods to log data yields quantitative results.
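As one illustration of anomaly detection combined with basic statistical analysis, the sketch below flags hours whose event count lies far from the mean, measured in standard deviations (a z-score). This is only one simple approach among many. The function name, the threshold of 2.0, and the hourly counts are assumptions chosen for the example, not real data.

```python
from statistics import mean, pstdev


def find_anomalies(counts, threshold=2.0):
    """Return (index, count, z_score) for counts whose z-score exceeds the threshold."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    anomalies = []
    for index, count in enumerate(counts):
        z = (count - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((index, count, round(z, 2)))
    return anomalies


# Hypothetical failed-login counts per hour (illustrative values, not real data).
hourly_failed_logins = [3, 4, 2, 5, 3, 4, 48, 3, 2, 4, 3, 5]

for hour, count, z in find_anomalies(hourly_failed_logins):
    print(f"hour {hour}: {count} failed logins (z={z})")
```

A z-score is sensitive to the outliers it is trying to find, so on real data a more robust baseline (for example a median-based one) may be a better choice.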

Collecting Log Data with Python

There are several ways to collect log data using Python, including:

Syslog: A widely adopted standard protocol for collecting log messages from network devices and systems.
Logstash: One of the most widely used log collection and processing tools, which can be integrated with Python.
Python logging modules: Python modules such as the built-in logging module and the third-party logbook library can be used to gather log information from Python applications.
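As a small illustration of the last option, the following sketch configures Python's built-in logging module to write timestamped key=value lines. The logger name, the message fields, and the IP address are assumptions made for the example. An in-memory buffer stands in for the destination so the snippet runs anywhere.

```python
import io
import logging


def build_logger(stream, name="nids_demo"):
    """Create a logger that writes timestamped lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for a log file or a syslog socket
log = build_logger(buffer)
log.info("action=login user_id=%s source_ip=%s success=%s", 42, "198.51.100.23", False)
log.warning("action=port_scan source_ip=%s", "198.51.100.23")

print(buffer.getvalue(), end="")
```

Replacing the StreamHandler with logging.FileHandler or logging.handlers.SysLogHandler would send the same records to a file or to a syslog daemon instead.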

Example:

Simulating a Network Intrusion Detection System (NIDS) is a practical way to explore how traffic coming from the internet can be tracked.

In this example, a simulated Network Intrusion Detection System (NIDS) will be implemented using Python. The focus will be on generating log data with Python’s built-in logging library, using structlog for structured logging, and handling and analyzing data. This practical exercise aims to provide intermediate cybersecurity professionals with insights into managing log data for NIDS.

This Python script generates simulated log data for a network intrusion detection system (NIDS) and saves it to a CSV file. It uses the structlog library for structured logging and pandas for data manipulation. The script generates 1000 log entries with random values for timestamp, user ID, source IP address, destination IP address, action, and success value. The log entries are then saved to a CSV file named network_logs.csv.
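The script itself is not reproduced here, so what follows is a minimal sketch of the generation step rather than the original code. To keep it runnable without extra installs, it uses only the standard library (random and csv) in place of structlog and pandas. The helper names, the list of actions, the user ID range, and the 30-day time window are all assumptions.

```python
import csv
import random
from datetime import datetime, timedelta

FIELDS = ["timestamp", "user_id", "source_ip", "destination_ip", "action", "success"]
ACTIONS = ["login", "logout", "file_access", "port_scan"]  # assumed action names


def random_ip(rng):
    """Return a random dotted-quad string (not guaranteed to be routable)."""
    return ".".join(str(rng.randint(1, 254)) for _ in range(4))


def generate_logs(n=1000, n_users=100, seed=None):
    """Build n simulated log entries as a list of dicts."""
    rng = random.Random(seed)
    start = datetime(2024, 7, 1)  # assumed start of the simulated window
    rows = []
    for _ in range(n):
        offset = timedelta(seconds=rng.randint(0, 30 * 24 * 3600))
        rows.append({
            "timestamp": (start + offset).isoformat(),
            "user_id": rng.randint(1, n_users),
            "source_ip": random_ip(rng),
            "destination_ip": random_ip(rng),
            "action": rng.choice(ACTIONS),
            "success": rng.choice([True, False]),
        })
    return rows


def save_logs(rows, path="network_logs.csv"):
    """Write the entries to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    logs = generate_logs(1000, seed=0)
    save_logs(logs)
    print(f"wrote {len(logs)} entries to network_logs.csv")
```

Passing a seed makes the simulated data reproducible, which helps when comparing analysis results between runs.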

This code loads the log data into a pandas DataFrame, filters the data to only include login attempts, and aggregates the data by user. The resulting aggregated data is printed to the console.
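The filter-and-aggregate step could look roughly like the sketch below. It is not the original code: it swaps the pandas DataFrame for the standard library's csv module and a dictionary so that it runs without dependencies, and it reads a tiny inline sample instead of network_logs.csv. The function name, the "login" action value, and the sample rows are assumptions.

```python
import csv
import io
from collections import defaultdict


def aggregate_logins(rows):
    """Keep only login rows and return {user_id: {"attempts": n, "successes": m}}."""
    stats = defaultdict(lambda: {"attempts": 0, "successes": 0})
    for row in rows:
        if row["action"] != "login":
            continue
        user = stats[row["user_id"]]
        user["attempts"] += 1
        # CSV values are read as strings, so compare against the text "True".
        if row["success"] == "True":
            user["successes"] += 1
    return dict(stats)


# Tiny inline sample standing in for network_logs.csv (illustrative values only).
SAMPLE_CSV = """timestamp,user_id,source_ip,destination_ip,action,success
2024-07-01T08:00:00,7,203.0.113.5,198.51.100.9,login,False
2024-07-01T08:00:30,7,203.0.113.5,198.51.100.9,login,True
2024-07-01T08:05:00,7,203.0.113.5,198.51.100.9,file_access,True
2024-07-01T09:10:00,12,192.0.2.44,198.51.100.9,login,False
2024-07-01T09:12:00,12,192.0.2.44,198.51.100.9,logout,True
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = aggregate_logins(rows)
for user_id, stats in sorted(summary.items(), key=lambda item: int(item[0])):
    print(f"user {user_id}: {stats['attempts']} attempts, {stats['successes']} successful")
```

To run it against a real file, the StringIO object can be replaced with an open file handle, for example open("network_logs.csv", newline="").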

This code snippet visualizes login attempt data. It first counts the number of login attempts per user and selects the top N users. Then, it creates a figure with two subplots: a bar chart showing the top N users by login attempts and a histogram displaying the distribution of login attempts. The plot is customized with titles, labels, and rotated x-axis tick labels, and the layout is adjusted for a clean display.
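The plotting code is not reproduced here. The sketch below covers only the counting behind the two charts: the top N users by login attempts, and how many users fall into each attempt count. It prints a plain-text rendering instead of drawing a figure, so the plotting calls themselves are left out. The function names and the per-user counts are hypothetical.

```python
from collections import Counter


def top_users(attempts_by_user, n=20):
    """Return the n (user_id, attempts) pairs with the most attempts."""
    return Counter(attempts_by_user).most_common(n)


def attempt_distribution(attempts_by_user):
    """Map each attempt count to the number of users with that count."""
    return dict(sorted(Counter(attempts_by_user.values()).items()))


# Hypothetical per-user login attempt counts (illustrative values only).
attempts = {"u1": 1, "u2": 3, "u3": 1, "u4": 6, "u5": 3, "u6": 1}

print("Top users by login attempts")
for user_id, count in top_users(attempts, n=3):
    print(f"{user_id:>4} {'#' * count} {count}")

print("Number of users per attempt count")
for count, users in attempt_distribution(attempts).items():
    print(f"{count:>4} {'#' * users} {users}")
```

The first result supplies the bars of the bar chart and the second supplies the bins of the histogram, so either can be handed to a plotting library as-is.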

Interpretation:

The bar chart on the left shows the top 20 users by login attempts. User 75 has the most, with 6 attempts, followed by several users with 5 attempts each. The number of login attempts decreases slightly further down the list, and the top 20 users range from 4 to 6 attempts.

The histogram on the right shows the distribution of login attempts across all users. The chart reveals that most users (around 25) have only attempted to log in once, indicating infrequent login activity. There are smaller groups of users with 2, 3, 4, and 5 login attempts, suggesting moderate login activity. Notably, only one user has attempted to log in 6 times, which is the maximum number of attempts.

Common Log Locations, Details, and Access Methods

Top Important Logs for Threat Detection & Analysis

Windows Security Log: Critical for identifying unauthorized access attempts and tracking user activities.
Linux Auth.log: Essential for monitoring authentication attempts and potential unauthorized access.
Web Server Access Logs (IIS/Apache): Useful for detecting web-based attacks, such as brute force attempts or SQL injection.
System Logs (Windows System Log / Linux Syslog): Important for identifying system-related issues and potential indicators of compromise.
Kernel Logs (Linux): Helpful for diagnosing hardware-related issues and kernel vulnerabilities.
Firewall Logs (Linux UFW / Windows Firewall): Crucial for detecting and analyzing network intrusions and unauthorized access attempts.
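As an example of putting one of these logs to work, the sketch below counts failed SSH password attempts per source IP from lines in the style of a Linux auth.log (commonly /var/log/auth.log on Debian-based systems). The sample lines, the host name, and the threshold are made up for illustration. The exact message format varies between OpenSSH versions and distributions, so the pattern may need adjusting.

```python
import re
from collections import Counter

# Matches sshd "Failed password" messages, with or without "invalid user".
FAILED_LOGIN = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)


def failed_logins_by_ip(lines):
    """Count failed SSH password attempts per source IP address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# Made-up lines in the style of auth.log (not real data).
SAMPLE_LINES = [
    "Jul 31 10:15:01 web01 sshd[2211]: Failed password for invalid user admin from 203.0.113.50 port 51122 ssh2",
    "Jul 31 10:15:03 web01 sshd[2211]: Failed password for root from 203.0.113.50 port 51124 ssh2",
    "Jul 31 10:15:09 web01 sshd[2230]: Accepted password for alice from 198.51.100.4 port 40022 ssh2",
    "Jul 31 10:16:40 web01 sshd[2244]: Failed password for bob from 192.0.2.77 port 39001 ssh2",
]

THRESHOLD = 2  # hypothetical cut-off for flagging an address

counts = failed_logins_by_ip(SAMPLE_LINES)
for ip, failures in counts.items():
    if failures >= THRESHOLD:
        print(f"{ip}: {failures} failed logins (possible brute-force attempt)")
```

On a real system the list of sample lines would be replaced by iterating over the open log file, which usually requires elevated privileges to read.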