Introducing Open Threat Partner eXchange - OpenTPX

Posted by LookingGlass TPX Team on October 9th, 2015

OpenTPX is a contribution by LookingGlass Cyber Solutions to the open source community. It defines a comprehensive model of threats associated with the global Internet, enabling interconnected systems to exchange threat intelligence, threat context, collections, networks, and threat mitigation information. OpenTPX is based on practical experience building highly scalable threat intelligence analysis & management systems deployed in real-world scenarios.

The OpenTPX specification, data model, data schema and supporting tools are freely available from www.opentpx.org.

Why did we create a new model for human and machine threat exchange?

Necessity

We collect millions of observables a day from many different sources, each with its own data format and often multiple representations of similar data. We needed a format that made data entering our system consistent every time. It had to be concise, easy to parse, and easy to troubleshoot. Human readability wasn't a top priority, but it's handy when someone needs to open a file and inspect it. We didn't simply decide we needed to build yet another format. Ours evolved from hard-learned lessons ingesting all those formats and millions of records, and from learning which elements of that data really matter.

Our previous format was binary and carried schema information inside each file, but it had a problem: it required serious effort to extend. It solved the problem it was designed for, but as our products evolved, the format became a pain point. So we made a list of the requirements a new format had to meet. Number one: it needed to be extensible with no changes to existing parsers. If the format changed and an old parser didn't understand the changes, it wouldn't break; the parser would simply ignore the parts it didn't recognize. This helps during upgrades: in theory, we can update all data producers before we update data consumers, and if we can't update a data consumer in a timely manner, it won't be severely impacted by a schema change. If someone needs to share data that we haven't already built support for, they can add it. If they're feeling ambitious, they can submit those changes back to the repo and we can integrate them. And if we're not fast enough incorporating changes, the format shouldn't be broken by those extensions.
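To make that tolerance concrete, here is a minimal sketch of a forward-compatible consumer. The key names follow the spec's suffix style but are illustrative, and the hypothetical sandbox_verdict_s stands in for an extension some producer added later:

    import json

    # A sketch of forward-compatible parsing, not the OpenTPX reference parser.
    # Key names are illustrative; "sandbox_verdict_s" is a hypothetical
    # extension added by a producer after this parser was written.
    document = json.loads("""
    {
      "provider_s": "ExampleProvider",
      "element_observable_c_array": [
        {"subject_ipv4_s": "198.51.100.7",
         "score_i": 85,
         "sandbox_verdict_s": "suspicious"}
      ]
    }
    """)

    KNOWN_KEYS = {"subject_ipv4_s", "score_i"}  # what an older parser understands

    for element in document["element_observable_c_array"]:
        # Unknown keys are simply ignored -- the file still parses cleanly.
        print({k: v for k, v in element.items() if k in KNOWN_KEYS})

The old parser never sees sandbox_verdict_s, but nothing breaks; a newer consumer can pick it up without any schema negotiation.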

Data size played a part

We need to be able to read tiny files with one or two elements as well as huge files with millions of elements. An aspect of this that a lot of developers overlook is that some formats force the parser to read the entire file into memory in order to process all elements. If you've ever tried to read a file into memory that's bigger than available RAM, you'll appreciate that the format avoids deep levels of nested elements. To avoid scenarios where file processing is limited by system resources, file size is capped at 1 GB and the specification includes the ability to reference additional files. If you're dealing with a single event that spans multiple files, that's supported. The format is JSON, so it compresses well, which makes transport and storage more efficient.
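That shallow, flat layout is what makes incremental parsing practical. Here is a sketch using the third-party ijson library; the file name and key name are illustrative, and process is a hypothetical handler:

    import ijson  # third-party incremental JSON parser: pip install ijson

    def process(element):
        # Stand-in handler; real code would score, index, or store the element.
        print(element.get("subject_ipv4_s"))

    # Because elements sit in one shallow top-level array rather than a deep
    # nest, a file far larger than RAM can be walked one element at a time.
    with open("huge_feed.tpx.json", "rb") as f:  # hypothetical feed file
        for element in ijson.items(f, "element_observable_c_array.item"):
            process(element)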

Simplicity

If you're trying to send a list of IP addresses, when you look at the file, you know it's a list of IP addresses. If you're sending a set of IOCs and a report about those IOCs, you can associate those IOCs with the report in the same file. If your data is complex and you need multiple files to organize it, you can send those same IOCs across multiple files. We can't think of every scenario, but we've made an attempt to address all the situations we've encountered handling terabytes of data in every format imaginable, representing an ever-growing list of data types.

One of the key aspects of dealing with threat intelligence is the rich set of data that conveys context. Full context requires representation of all aspects of a threat, including IOCs, network topology & ownership, network behaviors, and protocol captures. We focused on representing this full context in a simplified manner, without losing important information, by keeping the representation close to the raw data with as few additional annotations or wrappers as possible.

A common technique in modeling real-world threats is to capture a set of characteristics or behaviors shared across a set of threats as a base definition, or base class, for that threat. OpenTPX allows the provider to capture those common behaviors in a base class in the threat observable dictionary. The real advantage, though, is that the provider can leverage the base definition while overriding the characteristics of the threat, or the mitigation recommendations, for a specific instance associated with an element. This is effectively an inheritance model for definitions of Internet and threat intelligence, and it makes defining both common shared attributes and overridden behaviors extremely efficient.
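A minimal sketch of the pattern, assuming illustrative key names (the authoritative dictionary structure is in the spec at www.opentpx.org):

    import json

    feed = {
        # Base class: behaviors shared by every instance of this threat.
        "observable_dictionary_c_array": [
            {"observable_id_s": "ExampleBotnetC2",   # hypothetical threat name
             "criticality_i": 70,
             "description_s": "Known botnet command-and-control infrastructure"}
        ],
        # Instance: inherits the base definition, overrides its criticality.
        "element_observable_c_array": [
            {"subject_ipv4_s": "203.0.113.9",
             "observable_id_s": "ExampleBotnetC2",   # link back to the base class
             "criticality_i": 95}                    # override for this element
        ],
    }
    print(json.dumps(feed, indent=2))

The shared description lives in one place, and only the delta travels with each element.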

Speed

On the surface, ingesting millions of elements sounds like an easy task. In reality, many data formats in use today are built with a relational mindset: each element is tied to another element, and without one, the other is broken. OpenTPX is built on a key/value storage model. Keys can be related, but you don't need all the relations to generate a data file for ingestion. That may seem trivial, but it's core to having a system that can process millions or even billions of items quickly. If you've ever done ETL at scale, you already know the pain of feeding a relational datastore a file with millions of lines, and that's assuming the format of the file matches your schema. A side effect of our move away from a relational schema is the ability to easily load OpenTPX data into a graph; the node-edge relationship is easy to express in the format. If you're not using graphs to visualize, model, and query your data, you will hit the limitations of your platform very soon.
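A sketch of that mapping, with illustrative key names and plain dicts standing in for a real graph store:

    # Each record is self-contained, so ingestion never blocks on a missing
    # foreign key; relations become edges only where both ends happen to exist.
    records = [
        {"subject_ipv4_s": "203.0.113.9", "score_i": 95},
        {"subject_fqdn_s": "bad.example.com", "resolves_to_s": "203.0.113.9"},
    ]

    nodes, edges = {}, []
    for record in records:
        subject = record.get("subject_ipv4_s") or record.get("subject_fqdn_s")
        nodes[subject] = record
        if "resolves_to_s" in record:
            edges.append((subject, record["resolves_to_s"], "resolves_to"))

    print(len(nodes), "nodes,", len(edges), "edges")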

What excites me most about the format? There are a lot of elements involved in storing data that most users never consider, and those behind-the-scenes features are what set systems apart and let them expand beyond their original design. One of the most important is time. Temporal analysis is often overlooked in network and threat analysis tools, because a system optimized to quickly analyze large numbers of IPs or malware samples is not optimized for dealing with the timestamps on all of those elements. The key to threat remediation and defense is knowing when an event took place, and being able to describe every aspect of time is immensely important for linking events together. There is no such thing as a single time aspect to an event. In threat intelligence, multiple aspects of time must be considered to establish full context (a sketch of these aspects follows the list):

  • When was the threat or security data captured?
  • When was the threat data last modified?
  • When was the threat event first seen?
  • When was the threat event last seen?
  • When was threat analysis performed?
  • When does this threat data expire?
  • How quickly should the data's value decay when considering its impact?
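A sketch of how those aspects might sit on a single observable, assuming epoch-second timestamps with a _t suffix; the exact key names are illustrative, not quoted from the spec:

    import time

    now = int(time.time())
    observable = {
        "subject_ipv4_s": "198.51.100.7",
        "first_seen_t": now - 86400,      # when the threat event was first seen
        "last_seen_t": now - 3600,        # when it was last seen
        "analyzed_at_t": now - 1800,      # when threat analysis was performed
        "last_updated_t": now,            # when the threat data was last modified
        "expires_at_t": now + 7 * 86400,  # when this threat data expires
        "score_decay_f": 0.1,             # assumed per-day decay of the data's value
    }
    print(observable)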

Simplified Typing

In XML and JSON, there are typically two ways to convey type information for documents.

The first is implicit typing, where the object defined in the document has named attributes but the types are known a priori, either via a previously shared schema file or via predefined types for each named attribute. This has the advantage that no type information needs to be conveyed in the document itself, but the provider and consumer of the document must have agreed on the schema types beforehand.

The second is explicit typing, where the document includes a type attribute alongside each value attribute. The advantage of explicit typing is that no predetermined sharing of type information is necessary, which is one of the primary drivers for threat intelligence sharing. It allows new keys to be defined along with their types, but at the cost of carrying both a value attribute and a type attribute for every data element.

To get the advantages of explicit type conveyance without the penalty of introducing additional type attributes into the documents, we introduced a simple naming convention that combines a key name and its type into a single attribute representing both. By doing this, providers can create new attribute keys with their associated types and convey that information to a consumer that has no a priori knowledge of the type for that key.
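A sketch of the convention: the suffix carries the type, so no separate type attribute travels with the value. The suffix set shown here is an assumption based on the pattern (_s string, _i integer, _f float, _b boolean, _t epoch timestamp); the spec at www.opentpx.org is authoritative:

    # Assumed suffix-to-type mapping; consult the spec for the real list.
    SUFFIX_TYPES = {"_s": str, "_i": int, "_f": float, "_b": bool, "_t": int}

    def type_of(key):
        """Derive a value's expected type from the key's suffix alone."""
        for suffix, py_type in SUFFIX_TYPES.items():
            if key.endswith(suffix):
                return py_type
        return None  # unknown suffix: a consumer can still carry the value opaquely

    print(type_of("provider_s"))      # <class 'str'>
    print(type_of("score_i"))         # <class 'int'>
    print(type_of("last_updated_t"))  # <class 'int'> (epoch timestamp)

A brand-new key like malware_family_s is fully self-describing: any consumer knows it holds a string without ever having seen that key before.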

Extensible

We use this format ourselves. As a result, it needs to handle large amounts of data as well as elements we plan to add and features we don't even have on our radar yet.

We use the format to ease data processing: the process that actually ingests data reads OpenTPX, so we aren't constantly rewriting that code. Our feeds come in various formats, and they all get converted to OpenTPX. There is work involved in converting, but that work is necessary no matter what format you choose. We publish our format, including a normalized dictionary of terms we've defined, for partners, and if they send us data in OpenTPX format, there is no conversion work on our end. If customers want to insert data into their system, they can use our API directly or feed their system OpenTPX-formatted JSON and set up their own feeds. We find that customers would rather focus on the research, data, and intelligence in the system and let us worry about maintaining feeds. If someone tells you there is no feed maintenance, you're being lied to. Maintaining a TIP is enough work for at least one person; imagine freeing up that person to perform analysis instead of sysadmin work.

For all elements, we support a timestamp; we describe an element and a timestamp together as an observable (sketched after the list below). Part of the move to OpenTPX is to avoid confusion over terminology, so everything is defined and terms avoid dual meanings. Elements include:

  • IPs
  • CIDRs
  • ASNs
  • File Hashes
  • FQDN
  • Domains
  • TIC scores
  • Classifications
  • Categories
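A sketch of an observable in that sense, an element plus a timestamp, with illustrative key names; tic_score_i stands in for the TIC score from the list:

    import time

    observable = {
        "subject_fqdn_s": "bad.example.com",  # the element
        "last_seen_t": int(time.time()),      # the timestamp that makes it an observable
        "tic_score_i": 88,
        "classification_s": "Malicious Host", # hypothetical classification value
    }
    print(observable)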

Mitigation

Threat intelligence information is not only input to a threat analyst's view of the world; it can also convey recommendations on how to deal with those threats. We felt that much of the threat intelligence we dealt with did not easily convey mitigation steps.

OpenTPX focused on making sure that threat mitigation recommendations, leveraging the inheritance technique described earlier, could be sequenced and tied to both a class of threats as well as specific mitigation recommendations for a specific instance of threat.

We took a minimalist approach to how much of the mitigation language required standardization. Many vendors have their own syntax and languages for their products, so we felt it important to support mitigation rules that are vendor specific.

OpenTPX defines how mitigation steps are connected to threats and their associated elements. If a vendor wishes to define their own mitigation recommendations, the language allows that, or they can use the terms defined in OpenTPX.
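A sketch of the idea, with hypothetical key names illustrating the pattern rather than quoting the OpenTPX schema:

    import json

    threat = {
        "observable_id_s": "ExampleBotnetC2",
        "mitigation_c_array": [                       # hypothetical container name
            {"sequence_i": 1,
             "vendor_s": "ExampleFirewall",
             # vendor-native rule text carried verbatim, per the minimalist approach
             "rule_s": "deny ip from 203.0.113.9 to any"},
            {"sequence_i": 2,
             "vendor_s": "ExampleDNS",
             "rule_s": "sinkhole bad.example.com"},
        ],
    }
    print(json.dumps(threat, indent=2))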

Why something new?

We looked at XML and realized that it would introduce, conservatively, double the amount of data we needed to process and transfer. When you're dealing with terabytes of raw data, doubling the size to conform to a format doesn't make a lot of sense; if we can convey the same information at half the size, it's an easy decision. JSON accomplishes the same goal with less overhead. XML formats like STIX also lacked support for data types we needed, and we already had experience trying to extend formats that didn't meet our current needs or weren't flexible enough for future ones. We looked at binary solutions like protobufs and realized that most producers of data were not going to spend time converting their processes to a format that is complicated for humans to evaluate quickly. A lot of data feeds are plain text, often compressed, and the work involved in moving from a list of IPs or domains to a JSON format is minimal, so the burden on the data producer is small. And to be honest, we're commonly the ones doing the conversion, so a common language was our goal.

Summary

We've released this because it's the easiest way to get data into our systems. We know there are folks out there who could use and expand the format, so we're open to pull requests. We can't promise a PR will be implemented exactly as requested; as managers of the format, we assume responsibility for making sure additions work without breaking the system. We'd like to avoid forks, so more than likely, extensions that are submitted will make it into the system. We also believe that if someone has a data feed and wants it shared widely, OpenTPX is a great way to make it available to all of our customers quickly.