Data Sets

Here is a list of potentially useful data sets for the VizSec research and development community. If you have any additions or if you find a mistake, please email us or, even better, clone the source and send us a pull request.

  • Stanford Large Network Dataset Collection (SNAP): Not specific to security, but there are several relevant graph data sets.
  • APTnotes: APTnotes is a repository of publicly-available papers and blogs (sorted by year) related to malicious campaigns/activity/software that have been associated with vendor-defined APT (Advanced Persistent Threat) groups and/or tool-sets.
  • Open Malware: A database of live malware.
  • Shadow Server Malware Data site
  • DARPA CGC (known vulnerabilities)
  • DNS data (publication link and data link): More than a terabyte of unprocessed DNS PCAPs, along with tens of gigabytes of de-duplicated DNS records, per day. The active DNS datasets thus represent a significant portion of the world’s daily DNS delegation hierarchy.
  • SecRepo is a curated list of security data. It includes malware, NIDS, Modbus, and system logs. It also contains many of the links below.
  • malware-traffic-analysis provides samples and PCAPs, along with a day-by-day listing of which campaigns are active.
  • NETRESEC Data: A list of public packet capture repositories that are freely available on the Internet. Most of the listed sites share Full Packet Capture (FPC) files, but some unfortunately have only truncated frames. The list includes SCADA/ICS network captures.
  • CTU Data: The CTU-13 dataset consists of a group of 13 different malware captures made in a real network environment. The captures include Botnet, Normal, and Background traffic. The Botnet traffic comes from the infected hosts, the Normal traffic comes from the verified normal hosts, and the Background traffic is all remaining traffic whose nature is not known for sure. The dataset is labeled on a flow-by-flow basis, making it one of the largest and most thoroughly labeled botnet datasets available.
  • Digital Corpora: DigitalCorpora.org is a website of digital corpora for use in computer forensics education research. All of the disk images, memory dumps, and network packet captures available on this website are freely available and may be used without prior authorization or IRB approval. We also have available a research corpus of real data acquired from around the world. Use of that dataset is possible under special arrangement.
  • Impact: Previously known as PREDICT (the Protected Repository for the Defense of Infrastructure Against Cyber Threats), Impact is a community of producers of security-relevant network operations data and researchers in networking and information security. The repository provides developers and evaluators with regularly updated network operations data relevant to cyber defense technology development. The Dataset Catalog is publicly accessible, and you can browse dataset details without logging in. Current users can log in to request datasets.
  • Kyoto: Traffic Data from Kyoto University’s Honeypots.
  • The Honeynet Project: Many different types of data for each of their challenges, including pcap, malware, logs.
  • VAST Challenge 2013: Mini-challenge 3 is related to cybersecurity and includes network flow data, network status data (via Big Brother), and intrusion prevention system data.
  • VAST Challenge 2012: This challenge has two mini-challenges, one related to situation awareness (metadata and periodic status reports from all computing equipment) and one related to forensics (firewall and IDS logs).
  • VAST Challenge 2011: Mini-challenge 2 is related to cybersecurity, specifically situational awareness in computer networks (firewall and IDS logs).
  • DARPA Intrusion Detection Data: This data set has numerous issues that have been documented in the literature.
  • ORNL Auto-labeled corpus: A corpus of automatically labeled text data in the cyber security domain.
  • Industrial Control System (ICS) Cyber Attack Data Set: Data from MSU. The dataset is made up of tuples of timestamp, network protocol (MODBUS), system information (measurements and settings), and attack attributes.
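
The CTU Data entry above describes per-flow labels that fall into three classes: Botnet, Normal, and Background. As a minimal sketch of how such labels might be tallied before visualization, the Python below maps each raw label string to one of those classes. The assumption that the class name appears as a substring of the raw label, the function names, and the sample label strings are all hypothetical and chosen for illustration; check the dataset's own documentation for the actual label format.

```python
from collections import Counter


def traffic_class(label: str) -> str:
    """Map a raw per-flow label to Botnet, Normal, or Background.

    Assumption (not taken from the dataset documentation): the class
    name appears somewhere in the raw label string, in any letter case.
    """
    lowered = label.lower()
    if "botnet" in lowered:
        return "Botnet"
    if "normal" in lowered:
        return "Normal"
    # Everything that is not verified as botnet or normal counts as background.
    return "Background"


def count_classes(labels):
    """Count how many flows fall into each of the three classes."""
    return Counter(traffic_class(label) for label in labels)


if __name__ == "__main__":
    # Hypothetical label strings, made up for this example.
    sample_labels = [
        "example-Botnet-flow",
        "example-Normal-flow",
        "example-Background-flow",
        "unrecognized-flow",
    ]
    for name, count in sorted(count_classes(sample_labels).items()):
        print(name, count)
```

Treating any unrecognized label as Background mirrors the description above, where Background is all traffic that is neither from infected hosts nor from verified normal hosts.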