Factor Graphs and the Matter of Security

In 2017, Equifax suffered a security breach that compromised the personal information of millions of users. In 2018, in my data science class, we applied data science, factor graphs, and a bit of networking to detect this kind of attack.

The Equifax Attack

To understand this project, we need to look back at the attack on Equifax in 2017. In essence, Equifax, one of the big three consumer credit reporting agencies, was breached, compromising the data of 145.5 million people. The data included names, Social Security numbers, birth dates, addresses, and credit card numbers.

Equifax’s competency was called into question following the attack. If you look into it, it’s basically a case study in “how not to have secure systems” and “how not to handle your customers following a data breach”. To briefly go over the latter: they took six weeks after the breach to notify their customers, execs sold off almost $2 million in stock, and they offered free credit monitoring and identity theft protection, but those who took advantage might have waived their right to join a class action lawsuit. Read more about that here and here.

As for their inability to keep systems secure: basically, they didn’t update their Apache Struts software. If there’s one lesson to draw here, it’s don’t be like Equifax: update your software.

The Project

The goal at hand is to analyze network traffic and use a factor graph to detect (and ideally stop) a similar attack. The data was provided as HTTP and DNS network packet captures and OSQuery activity logs from the recreated attack.

Using Python to Analyze Networks

Before we get started, how do we use Python to analyze networks? Wireshark is a popular network protocol analyzer. With it, you can observe incoming and outgoing traffic. PyShark is a Python wrapper around tshark, Wireshark’s command-line tool, that lets you analyze the same data from Python. Within the project, the data came as packet capture files (pcaps), which PyShark can parse into Python objects.

Right after that, I converted the parsed packets into pandas DataFrames.

Among the elements available in each packet, I grabbed the source IP, the timestamp, the packet length, the destination IP, the destination port, and the highest layer the data came from.
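As a rough illustration of that extraction step, here is a minimal sketch. It is not the project’s actual code: the helper names packet_to_row and load_pcap and the column names are my own, and the stand-in packet at the bottom is made up so the row logic can be tried without a capture file. The PyShark attributes used (sniff_time, length, highest_layer, transport_layer, and the ip and tcp/udp layers) are the ones PyShark exposes on parsed packets.

```python
from datetime import datetime
from types import SimpleNamespace


def packet_to_row(pkt):
    """Flatten one PyShark-style packet into a plain dict (one DataFrame row)."""
    ip = getattr(pkt, "ip", None)  # missing for non-IP traffic such as ARP
    transport_name = getattr(pkt, "transport_layer", None)  # "TCP", "UDP", or None
    transport = getattr(pkt, transport_name.lower(), None) if transport_name else None
    return {
        "time": pkt.sniff_time,
        "src_ip": getattr(ip, "src", None),
        "dst_ip": getattr(ip, "dst", None),
        "dst_port": int(transport.dstport) if transport is not None else None,
        "length": int(pkt.length),
        "highest_layer": pkt.highest_layer,
    }


def load_pcap(path):
    """Read a pcap with PyShark into a pandas DataFrame (requires tshark)."""
    import pandas as pd
    import pyshark

    capture = pyshark.FileCapture(path)
    try:
        return pd.DataFrame([packet_to_row(pkt) for pkt in capture])
    finally:
        capture.close()


# Hypothetical stand-in packet, shaped like what PyShark returns.
demo_packet = SimpleNamespace(
    sniff_time=datetime(2018, 3, 3, 12, 41),
    length="74",
    highest_layer="HTTP",
    transport_layer="TCP",
    ip=SimpleNamespace(src="172.17.0.1", dst="172.17.0.2"),
    tcp=SimpleNamespace(dstport="8080"),
)
demo_row = packet_to_row(demo_packet)
```

Keeping the per-packet logic separate from the file reading means the flattening can be checked on its own, and packets without an IP or transport layer simply get None in those columns.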

On the Attack

So how did the attack work? The attacker started by scanning for /showcase.action. Going through the HTTP packet capture, we can log every request for /showcase.action along with its timestamp, source IP, destination IP, and port.

We can answer a few questions from this: who’s attacking, where are they attacking from, when did they commence the attack, and where are the vulnerabilities? Here, 172.17.0.1 is the attacking IP, while 172.17.0.2 on port 8080 is the vulnerable server. The attack in this simulation began at around 12:41 pm on March 3, 2018.
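The filtering of /showcase.action requests can be sketched roughly as follows. This is an illustration rather than the original notebook: the rows are hypothetical (only the two IPs and the port come from the capture, and 172.17.0.3 and the exact timestamps are invented), and it assumes each HTTP request has already been flattened into a dict with a "uri" field.

```python
from collections import Counter

SCAN_URI = "/showcase.action"


def find_scan_requests(http_rows, uri=SCAN_URI):
    """Return the rows whose request URI contains the scanned path."""
    return [row for row in http_rows if uri in (row.get("uri") or "")]


# Hypothetical rows standing in for the flattened HTTP capture.
http_rows = [
    {"time": "2018-03-03 12:41:03", "src_ip": "172.17.0.1",
     "dst_ip": "172.17.0.2", "dst_port": 8080, "uri": "/showcase.action"},
    {"time": "2018-03-03 12:41:05", "src_ip": "172.17.0.3",
     "dst_ip": "172.17.0.2", "dst_port": 8080, "uri": "/index.html"},
    {"time": "2018-03-03 12:41:09", "src_ip": "172.17.0.1",
     "dst_ip": "172.17.0.2", "dst_port": 8080, "uri": "/showcase.action"},
]

hits = find_scan_requests(http_rows)

# Who is scanning, what are they scanning, and when did it start?
sources = Counter(row["src_ip"] for row in hits)
targets = Counter((row["dst_ip"], row["dst_port"]) for row in hits)
first_seen = min(row["time"] for row in hits)
```

Counting sources and targets separately answers the "who" and "where" questions directly, and the earliest timestamp among the hits gives the "when".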

When we look at the Content-Type headers sent from that IP, it becomes apparent that 172.17.0.1 is the attacker. From the packet capture, we looked at the length of the Content-Type header and at its actual content.
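One way to make the header-length check concrete is sketched below. The threshold of 100 characters is an assumption of mine, not a value from the project, and the sample requests are hypothetical: ordinary Content-Type values are short, so the sketch simply records the longest value seen per source IP and flags sources that exceed the threshold.

```python
from collections import defaultdict

# Assumed cut-off: typical values such as "text/html" or
# "application/x-www-form-urlencoded" are far shorter than this.
MAX_NORMAL_LENGTH = 100


def longest_content_type_by_source(requests):
    """Map each source IP to the length of its longest Content-Type header."""
    longest = defaultdict(int)
    for src_ip, content_type in requests:
        longest[src_ip] = max(longest[src_ip], len(content_type or ""))
    return dict(longest)


def suspicious_sources(requests, max_len=MAX_NORMAL_LENGTH):
    """Return the source IPs that sent an unusually long Content-Type header."""
    longest = longest_content_type_by_source(requests)
    return sorted(ip for ip, length in longest.items() if length > max_len)


# Hypothetical (source IP, Content-Type) pairs. The long value only mimics
# the shape of an injected expression; it is not a real payload.
requests = [
    ("172.17.0.3", "text/html"),
    ("172.17.0.3", "application/x-www-form-urlencoded"),
    ("172.17.0.1", "%{" + "(#placeholder=1)." * 20 + "}"),
]

suspects = suspicious_sources(requests)
```

Length alone is a crude signal, so in practice it would be combined with a look at the header content itself.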

The attacker tested the waters by modifying some data (packets 413, 434, and 449) before beginning a hijack (packets 524 onward). For brevity, I’ll give a high-level overview of what the attacker did. The attacker was able to execute terminal commands remotely (look for any #cmd=). Now with root privilege, the attacker was able to remotely download a module and use it to replace the legitimate module in the system. We can look at the OSQuery logs to verify that this happened.
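Searching for #cmd= can be automated with a small regular expression. The sketch below assumes the command appears in the header as #cmd='...' with single quotes, which is an assumption about the payload’s shape; the sample headers are hypothetical and deliberately harmless.

```python
import re

# Assumed payload shape: #cmd='<shell command>' somewhere inside the header.
CMD_PATTERN = re.compile(r"#cmd='([^']*)'")


def extract_commands(header_values):
    """Collect every command string found after #cmd= in the given headers."""
    commands = []
    for value in header_values:
        commands.extend(CMD_PATTERN.findall(value or ""))
    return commands


# Hypothetical header values; the real capture contains the attacker's own.
sample_headers = [
    "text/html",
    "%{(#cmd='whoami').(#other=1)}",
    "%{(#cmd='ls -la /tmp').(#other=1)}",
]

commands = extract_commands(sample_headers)
```

Listing the extracted commands in packet order gives a readable timeline of what the attacker ran, which can then be cross-checked against the OSQuery logs.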

Additionally, the attacker was able to extract both the public and private RSA keys, noted in this section of the packet capture:

We would also like to know who was involved in the DNS traffic. Looking at DNS responses, we find that 162.212.156.148 is the malicious server: not only is it tied to dodgy sites, but it is also sending unusually long responses.

Let’s do some actual data analysis. Instead of looking at DNS responses, let’s look at DNS queries, specifically query lengths of legitimate versus malicious queries:

It’s clear that considerably more malicious queries are being made, and that these queries are far longer than legitimate ones.
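A minimal sketch of that length comparison is shown below. The query names are hypothetical (built on reserved example domains), and the 50-character threshold is an assumption chosen for illustration; the real cut-off would come from the length distribution in the capture.

```python
from statistics import mean


def summarize_lengths(query_names):
    """Return count, mean length, and max length for a list of DNS query names."""
    lengths = [len(name) for name in query_names]
    return {"count": len(lengths), "mean": mean(lengths), "max": max(lengths)}


def is_long_query(name, threshold=50):
    """Flag a query name whose length exceeds the assumed threshold."""
    return len(name) > threshold


# Hypothetical queries: short ordinary lookups versus long names that carry
# hex-encoded data in the first label, as DNS tunneling tends to do.
legitimate = ["example.com", "www.example.org"]
malicious = [
    "hypothetical-payload-chunk-001".encode().hex() + ".tunnel.example",
    "hypothetical-payload-chunk-002".encode().hex() + ".tunnel.example",
    "hypothetical-payload-chunk-003".encode().hex() + ".tunnel.example",
]

legit_stats = summarize_lengths(legitimate)
malicious_stats = summarize_lengths(malicious)
flagged = [name for name in legitimate + malicious if is_long_query(name)]
```

A fixed threshold is the simplest possible indicator; it is enough to turn "unusually long query" into a yes/no event that a detector can consume.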

So far, we’ve identified the attacker, what the attacker did, and we have an indicator for malicious queries. So, how do we detect an attack?

The Factor Factor

Let’s go over what a factor graph is. A factor graph is a bipartite graph that represents the factorization of a function. In probability theory, we use these to represent the factorization of a probability distribution. Each vertex is either a factor or a variable, and an edge connects a variable vertex to a factor vertex only if the variable is an argument of that factor. What’s nice about these graphs is that they can be used to quickly calculate marginal distributions, and in this case, they can be used for inference.
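To make the definition concrete, here is a toy factor graph with two variables and two factors. Everything in it is invented for illustration: the variable names, the domains, and all of the numbers. It computes marginals by brute-force enumeration, which is only practical for tiny graphs; real factor-graph inference uses message passing to avoid enumerating every assignment.

```python
from itertools import product

# Two variables: a hidden state S and an observed event E (toy domains).
domains = {"S": ["benign", "attack"], "E": ["no_scan", "scan"]}

# Each factor is (scope, table). The scope lists the variables the factor
# touches, which is exactly the set of edges that factor has in the graph.
factors = [
    (("S",), {("benign",): 0.9, ("attack",): 0.1}),
    (("S", "E"), {
        ("benign", "no_scan"): 0.95, ("benign", "scan"): 0.05,
        ("attack", "no_scan"): 0.30, ("attack", "scan"): 0.70,
    }),
]


def joint_weight(assignment, factors):
    """Multiply all factor values for one full assignment of the variables."""
    weight = 1.0
    for scope, table in factors:
        weight *= table[tuple(assignment[var] for var in scope)]
    return weight


def marginal(var, domains, factors, evidence=None):
    """Marginal of one variable, optionally conditioned on observed values."""
    evidence = evidence or {}
    names = list(domains)
    totals = {value: 0.0 for value in domains[var]}
    for values in product(*(domains[name] for name in names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with what was observed
        totals[assignment[var]] += joint_weight(assignment, factors)
    z = sum(totals.values())  # normalizing constant
    return {value: weight / z for value, weight in totals.items()}


prior_state = marginal("S", domains, factors)
state_given_scan = marginal("S", domains, factors, evidence={"E": "scan"})
```

With these made-up tables, observing a scan moves the belief in "attack" from 0.1 to 0.07 / 0.115, or roughly 0.61, which is the kind of update the detector relies on.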

Above is the factor graph used, where we keep track of events (E) over time and try to deduce what state (S) the system is in with respect to an attack. Events include Scan, Login, Sensitive URI, New Kernel Module, and DNS Tunneling, while the states, which correspond to attack stages, include Benign, Discovery, Access, Lateral Movement, Privilege Escalation, Persistence, Defense Evasion, Collection, Exfiltration, Command and Control, and Vulnerable Code Execution.

At this point, all that was left to do was build the factor graph and run inference at each time step. With our model, the recommended action became “stop” at time step 7.
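The step-by-step inference can be sketched as forward filtering on a chain-shaped factor graph, where each state S_t is linked to the previous state and to the event observed at that step. This is not our actual model: it uses only three of the states and three of the events, every probability in the tables is invented, and the stopping rule (stop once the belief in Benign falls below 0.5) is an assumption. It will not reproduce the time-step 7 result.

```python
STATES = ["Benign", "Discovery", "Exfiltration"]

# All numbers below are made up for illustration.
PRIOR = {"Benign": 0.98, "Discovery": 0.01, "Exfiltration": 0.01}
TRANSITION = {  # TRANSITION[previous][current]
    "Benign": {"Benign": 0.80, "Discovery": 0.15, "Exfiltration": 0.05},
    "Discovery": {"Benign": 0.10, "Discovery": 0.60, "Exfiltration": 0.30},
    "Exfiltration": {"Benign": 0.05, "Discovery": 0.15, "Exfiltration": 0.80},
}
EMISSION = {  # EMISSION[state][event]
    "Benign": {"Login": 0.90, "Scan": 0.08, "DNS Tunneling": 0.02},
    "Discovery": {"Login": 0.20, "Scan": 0.70, "DNS Tunneling": 0.10},
    "Exfiltration": {"Login": 0.10, "Scan": 0.10, "DNS Tunneling": 0.80},
}


def filter_states(events):
    """Return the belief over states after each observed event."""
    belief = dict(PRIOR)
    history = []
    for t, event in enumerate(events):
        if t > 0:
            # Message through the transition factor from the previous step.
            belief = {
                s: sum(belief[p] * TRANSITION[p][s] for p in STATES)
                for s in STATES
            }
        # Multiply in the event factor, then normalize.
        belief = {s: belief[s] * EMISSION[s][event] for s in STATES}
        z = sum(belief.values())
        belief = {s: weight / z for s, weight in belief.items()}
        history.append(belief)
    return history


def first_stop_step(history, threshold=0.5):
    """First 1-based time step where belief in Benign drops below threshold."""
    for step, belief in enumerate(history, start=1):
        if belief["Benign"] < threshold:
            return step
    return None


events = ["Login", "Login", "Scan", "Scan", "DNS Tunneling"]
history = filter_states(events)
stop_step = first_stop_step(history)
```

The useful property is that each new event only requires one update of the current belief, so the detector can recommend an action as events arrive rather than after the fact.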