Security without Compromise: How Cisco Engineers Used Machine Learning to Solve an Impossible Problem
In 2015, Rich West, a systems architect with Cisco’s infosec team, approached an engineer on Cisco’s Advanced Security Research team with a novel problem. The infosec team was looking for a way to protect Cisco employees from malware in encrypted traffic without sacrificing their privacy. At the time, there was really only one viable option: to proxy all SSL and TLS traffic and inspect it by decrypting it.
When done maliciously, this kind of interception is called a man-in-the-middle attack. Even when done as a defensive measure, it can still be viewed as a breach of privacy, since it essentially breaks the encryption trust chain of any end user sending traffic to a secure site like a bank or an encrypted e-mail service. It’s also computationally expensive, enough to cause a substantial degradation in network performance, not to mention the burden of managing the extra SSL certificates required to re-sign traffic after it is inspected.
Rich West and his team decided ultimately that the privacy trade-off was not worth it. They wanted a new approach, one that didn’t involve sending Cisco’s internal traffic through a bottleneck inspection point. To help, West contacted Cisco Engineering Fellow David McGrew.
A Complex, Unsolved Problem
McGrew had been working in Cisco’s Advanced Security Research group on new ways of algorithmically finding malware using NetFlows. When West made his team’s case for needing a method of finding attacks in encrypted data, McGrew decided to see if he could blend the two efforts. What followed was a two-year project that is now nearing completion.
It is part of Cisco’s launch announced this week, a host of new networking products and software aimed to fundamentally change the blueprint of modern networks to one that is powered by intent and informed by context.
The data model McGrew and his team developed is called Encrypted Traffic Analytics (ETA), and it represents a huge step forward in Cisco’s goal to use its massive network and data set, combined with automation and machine learning, to apply security everywhere.
Encryption is most often viewed as a good thing. It keeps private Internet transactions and conversations private, free from man-in-the-middle attackers looking to glean private info or alter data in transit.
With the growing use of cloud services in enterprise environments, and the gentle pushes from companies like Google and Mozilla that force sites to use TLS, companies are accepting and routing a lot more encrypted traffic.
For a client to trust a TLS connection, the server must first present a certificate signed by a trusted certificate authority (CA). New authorities are spurring the growth in TLS traffic by making that process much easier and more cost-effective. A recent CSO article quoted a vice president at Venafi who described a “dangerous scenario” in which the cost of encryption has effectively dropped to zero, making it a cheap tool for hackers to employ. And they are doing so.
Cyber attackers are hiding their command and control activity and data exfiltration efforts by passing through common ports just like normal TLS or SSL traffic.
For McGrew, detecting malware mixed in with that normal traffic was a rich, complex use case involving massive data sets, which is just the type of problem he enjoys tackling.
A NetFlow, he said in an interview, contains valuable information, but also has its contextual limitations. “It tells you what two devices talked, how long, how many bytes they sent, and things like that.” But it’s by no means a complete picture.
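To make McGrew’s description concrete, here is a minimal sketch of the kind of per-flow summary he is describing. The class and field names are hypothetical, chosen for illustration; they are not the field names of any real NetFlow export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Illustrative flow summary; field names are assumptions, not NetFlow's own."""
    src_ip: str        # which two devices talked
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int      # IP protocol number, e.g. 6 for TCP
    start_time: float  # seconds
    end_time: float    # seconds
    bytes_sent: int    # how many bytes were sent

    @property
    def duration(self) -> float:
        """How long the two devices talked, in seconds."""
        return self.end_time - self.start_time


# Example: a 12.5-second TCP flow to port 443 (addresses are documentation ranges).
record = FlowRecord("10.0.0.5", "203.0.113.7", 51544, 443, 6, 100.0, 112.5, 5120)
```

Nothing in a record like this says what was inside the conversation, which is the contextual limitation McGrew points to.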
McGrew believed, however, the debate of privacy versus security should not always be a duality—a one-or-the-other choice. He also knew that in order to find a solution for a problem this complex, he’d need to create it from scratch, which would involve a lot of code writing and intensive data modeling.
‘OK, What Do You Need?’ — ‘Data. Lots of data.’
McGrew needed resources to get started, so he tapped Cisco’s Technology Investment Fund.
‘Tech Fund’ projects at Cisco are typically those that develop new products or technologies with the goal of disrupting the status quo. These projects can often take years to develop.
Even with funding secured, before a project could be formed and code written, McGrew also needed help obtaining and analyzing data samples from Cisco’s enormous network, including malware samples.
To leap this hurdle, McGrew enlisted the help of Blake Anderson in March 2015. Anderson is a data scientist who obtained his Ph.D. studying the application of machine learning to cybersecurity. At the time, he was working with the Los Alamos National Laboratory in New Mexico, applying machine learning methods to malware analysis.
When Anderson joined Cisco, McGrew’s small team had been developing analysis tools and had some success identifying specific applications using only NetFlow data. For instance, they were able to spot when a NetFlow was coming from a user’s Chrome browser or Microsoft’s update service. But the team had not yet applied any malware data.
To get the samples they needed, Anderson and the team worked with practically every product team at Cisco, including the internal infosec team, the Talos threat intelligence group and the recently acquired ThreatGRID team, which joined the Cisco portfolio in 2014.
After spending months writing more than 10,000 lines of code, McGrew and Anderson had a practical test for their data models. Using millions of packet captures and known malware samples, Anderson began sifting through it all and finding the “most descriptive characteristics” that would differentiate what was malware and what was benign traffic without decrypting anything.
“I think [gathering the right data] was the most important thing,” Anderson said. “Whereas a lot of times you see people saying, ‘We have this interesting data, what can we do with it?’ We took the opposite approach.” Anderson and McGrew began with a wish list of what data they would need, and then shopped it around Cisco’s product teams to help make it possible.
Fingerprinting Hidden Malware
Since at least 2009, attackers have been finding ways to abuse the trust system of the Internet by using forged, stolen or even legitimately signed SSL certificates.
TLS certificates signed by a valid CA give users reassurance that the site they’re visiting is legitimate. But it can also prove a false reassurance, as this sense of trust can play right into the hands of attackers. They will use that false sense of safety to lure victims into handing over their login credentials or downloading a malware payload.
In the last few months, attacks using legitimate TLS certs appear to be rising. Part of the reason may be that obtaining a valid TLS certificate has become essentially free, and incredibly easy, through CAs like Let’s Encrypt. As a result, phishing authors have capitalized on the opportunity and have recently been flooding the Internet with phishing sites spoofing legitimate sites like PayPal or bitcoin wallet providers.
Easily obtainable crypto keys can prove to be a sort of double-edged sword.
According to Rich West, security departments are, in a way, a victim of their own success. “They’ve pushed IT departments, vendors and app developers to better secure data in motion, but that creates new challenges for how to handle these encrypted flows,” he said.
Fortunately, by analyzing millions of TLS flows, malware samples and packet captures, Anderson and McGrew found that the unencrypted metadata in a TLS flow contains fingerprints that attackers cannot hide, even with encryption. TLS is really good at obscuring plain text, but by doing so it also creates a “complex set of observable parameters” that engineers like McGrew and Anderson can use to train their data model.
For instance, when a TLS flow begins, it starts with a handshake. The client (like your Chrome browser) sends a ClientHello message to the server it’s trying to reach (like Facebook). The “hello” message includes a list of parameters, like what cipher suite to use, what versions are acceptable and a list of optional extensions.
TLS metadata like the ClientHello are not encrypted, because they transfer back and forth before the encrypted messages begin. This means Anderson’s model can analyze the unencrypted data with no knowledge of what is actually inside the message. And the model will then accurately categorize what traffic is malware and what is benign.
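As a rough illustration of how much a ClientHello reveals in the clear, the sketch below pulls the offered version, cipher suites and extension types out of a raw handshake record. It is a simplified parser written for this example, not ETA’s or Joy’s code: the function names are hypothetical, the sample message is synthetic, and it assumes the whole ClientHello fits in a single TLS record.

```python
import struct


def parse_client_hello(record: bytes) -> dict:
    """Extract unencrypted ClientHello fields from one TLS record (simplified)."""
    content_type, _record_version, _record_len = struct.unpack_from("!BHH", record, 0)
    if content_type != 0x16:              # 0x16 = handshake record
        raise ValueError("not a TLS handshake record")
    pos = 5
    if record[pos] != 0x01:               # 0x01 = ClientHello
        raise ValueError("not a ClientHello")
    pos += 4                              # handshake type (1) + length (3)
    (client_version,) = struct.unpack_from("!H", record, pos)
    pos += 2 + 32                         # version + 32-byte random
    pos += 1 + record[pos]                # session ID length + session ID
    (suites_len,) = struct.unpack_from("!H", record, pos)
    pos += 2
    cipher_suites = list(struct.unpack_from(f"!{suites_len // 2}H", record, pos))
    pos += suites_len
    pos += 1 + record[pos]                # compression methods
    extensions = []
    if pos < len(record):                 # extensions block is optional
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            extensions.append(ext_type)
            pos += 4 + ext_len
    return {"version": client_version,
            "cipher_suites": cipher_suites,
            "extensions": extensions}


def build_sample_client_hello() -> bytes:
    """Build a small synthetic ClientHello so the parser can be exercised."""
    suites = [0xC02F, 0xC030, 0x009C]
    body = struct.pack("!H", 0x0303) + bytes(32) + b"\x00"   # version, random, empty session ID
    body += struct.pack("!H", 2 * len(suites)) + struct.pack(f"!{len(suites)}H", *suites)
    body += b"\x01\x00"                                      # one compression method: null
    exts = struct.pack("!HH", 0x000A, 4) + b"\x00\x02\x00\x17"  # supported_groups
    exts += struct.pack("!HH", 0x0023, 0)                       # session_ticket, empty
    body += struct.pack("!H", len(exts)) + exts
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return struct.pack("!BHH", 0x16, 0x0301, len(handshake)) + handshake


features = parse_client_hello(build_sample_client_hello())
```

The ordered list of cipher suites and extensions varies between client implementations, which is why these fields are useful as a fingerprint even though the payload that follows is encrypted.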
According to Anderson’s latest testing, this approach not only preserves user privacy by leaving encryption intact, but also shows promising accuracy in tests against large samples of network data and malware. Using only NetFlow features, ETA catches malware about 67 percent of the time. When ETA is fed those NetFlow features along with additional feature sets such as Sequence of Packet Lengths and Times (SPLT), DNS, TLS metadata, HTTP and others, the accuracy jumps to more than 99 percent.
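One way to picture the difference between the two test configurations is as a choice of which feature groups go into the vector a model sees. The sketch below is an assumption-laden illustration: the group names, the values and the `build_feature_vector` helper are all hypothetical, and the article does not describe how ETA actually encodes or combines its features.

```python
from typing import Dict, List, Sequence

# Hypothetical per-flow feature groups; names and values are illustrative only.
FlowFeatures = Dict[str, List[float]]


def build_feature_vector(flow: FlowFeatures, groups: Sequence[str]) -> List[float]:
    """Concatenate the selected feature groups in the order given."""
    vector: List[float] = []
    for name in groups:
        vector.extend(flow[name])
    return vector


flow: FlowFeatures = {
    "netflow": [443.0, 12.5, 5120.0],  # destination port, duration (s), bytes
    "tls": [3.0, 2.0],                 # number of offered cipher suites, extensions
    "dns": [1.0],                      # e.g. count of related DNS lookups
}

netflow_only = build_feature_vector(flow, ["netflow"])
combined = build_feature_vector(flow, ["netflow", "tls", "dns"])
```

A classifier trained on `combined` has more signal to separate malware from benign traffic than one trained on `netflow_only`, which is the effect the reported accuracy figures describe.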
“The position Cisco is in,” Anderson said, “gave us all this data, and the Tech Fund Project allowed us to do this rapid prototype approach. It’s really invaluable, and allowed us to get some pretty powerful results quickly.”
With all the right resources, and as a result of their inter-departmental work, McGrew and Anderson have in two short years created a promising solution for a dire problem in cybersecurity.
McGrew said it’s a solution that likely could only have come from a company like Cisco, with both the resources and, just as important, the data to do it.
“For an engineer, for a scientist, being able to really focus on the technology is a fantastic privilege,” McGrew said. “Being a product engineer you don’t get to do that as often.”
Anderson’s hope is that ETA can be applied nearly anywhere through a software code update, and provide enforcement for encrypted malware at any location on the network. Any appliance handling network traffic could then be converted into a security appliance, even if its function is not security.
“I think it makes a lot of sense to push more of the decision making and enforcement to routers and switches,” Anderson said. “We can still train the models in the cloud, but we can push a lot of the intelligence to the network level. Me personally, I would like to see integration of this type of detection integrated into the actual data path, and [done] in an efficient way.”
For more information about ETA and how it works, you can read McGrew and Anderson’s academic write-up here.
You can also check out the open source version of ETA (called JOY) here.