Data Sets

Here is a list of potentially useful data sets for the VizSec research and development community. If you have any additions or if you find a mistake, please email us, or even better, clone the source and send us a pull request.

  • USB-IDS Datasets: USB-IDS-1 consists of 17 (compressed) csv files providing ready-to-use labeled network flows.
  • Mordor Project: The Mordor project provides pre-recorded security events generated by simulated adversarial techniques in the form of JavaScript Object Notation (JSON) files for easy consumption. The pre-recorded data is categorized by platforms, adversary groups, tactics and techniques defined by the MITRE ATT&CK Framework. The pre-recorded data represents not only specific known malicious events but also the additional context and events that occur around them.
  • Advanced Research in Cyber System Data Sets: ARCS provides multiple data sets collected from the Los Alamos National Laboratory enterprise network.
  • UNSW-NB15 Dataset: The raw network packets of the UNSW-NB15 dataset were created by the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The tcpdump tool was used to capture 100 GB of raw traffic (e.g., pcap files). The dataset has nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools were used, and twelve algorithms were developed, to generate a total of 49 features with the class label.
  • Canadian Institute for Cybersecurity’s Datasets: Canadian Institute for Cybersecurity datasets are used around the world by universities, private industry and independent researchers. There are multiple data sets available.
  • UGR’16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs: The dataset presented here is built with real traffic and up-to-date attacks. The data come from several NetFlow v9 collectors strategically located in the network of a Spanish ISP. The main advantage of this dataset over previous ones is its usefulness for evaluating IDSs that consider long-term evolution and traffic periodicity. Models that consider differences between day and night or between working days and weekends can also be trained and evaluated with it.
  • Stanford Large Network Dataset Collection (SNAP): Not specific to security, but there are several relevant graph data sets.
  • APTnotes: APTnotes is a repository of publicly-available papers and blogs (sorted by year) related to malicious campaigns/activity/software that have been associated with vendor-defined APT (Advanced Persistent Threat) groups and/or tool-sets.
  • DARPA CGC (known vulnerabilities)
  • SoReL-20M Windows Portable Executable files
  • DNS data (publication link and data link): More than a terabyte of unprocessed DNS PCAPs along with tens of gigabytes of de-duplicated DNS records per day. The active DNS datasets thus represent a significant portion of the world’s daily DNS delegation hierarchy.
  • SecRepo is a curated list of security data. It includes malware, NIDS, Modbus, and system logs. It also contains many of the links below.
  • malware-traffic-analysis provides samples and PCAPs. It gives a day-by-day listing of which campaigns are active.
  • NETRESEC Data: A list of public packet capture repositories that are freely available on the Internet. Most of the listed sites share Full Packet Capture (FPC) files, but some unfortunately only have truncated frames. The list includes SCADA/ICS network captures.
  • CTU Data: The CTU-13 dataset consists of a group of 13 different malware captures made in a real network environment. The captures include Botnet, Normal and Background traffic. The Botnet traffic comes from the infected hosts, the Normal traffic comes from the verified normal hosts, and the Background traffic is all remaining traffic whose nature is not known for sure. The dataset is labeled on a flow-by-flow basis, making it one of the largest and most thoroughly labeled botnet datasets available.
  • Digital Corpora: DigitalCorpora.org is a website of digital corpora for use in computer forensics education research. All of the disk images, memory dumps, and network packet captures available on this website are freely available and may be used without prior authorization or IRB approval. We also have available a research corpus of real data acquired from around the world. Use of that dataset is possible under special arrangement.
  • Impact: Previously known as PREDICT (the Protected Repository for the Defense of Infrastructure Against Cyber Threats), Impact is a community of producers of security-relevant network operations data and researchers in networking and information security. The repository provides developers and evaluators with regularly updated network operations data relevant to cyber defense technology development. The Dataset Catalog is publicly accessible and you can browse dataset details without logging in. Current users can log in to request datasets.
  • Kyoto: Traffic Data from Kyoto University’s Honeypots.
  • The Honeynet Project: Many different types of data for each of their challenges, including pcap, malware, logs.
  • VAST Challenge 2013: Mini-challenge 3 is related to cybersecurity and includes network flow data, network status data (via Big Brother), and intrusion prevention system data.
  • VAST Challenge 2012: This challenge has two mini-challenges, one related to situation awareness (metadata and periodic status reports from all computing equipment) and one to forensics (Firewall and IDS logs).
  • ORNL Auto-labeled corpus: A corpus of automatically labeled text data in the cyber security domain.
  • Operationally Transparent Cyber (OpTC) Data: Endpoint activity from about 500 endpoints with Zeek data from the enterprise egress point. Includes benign record generation and malware injection from a red team.
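
Some of the data sets above, such as USB-IDS-1, are distributed as CSV files of labeled network flows. A useful first step with such a file is to check how the class labels are distributed. The following is a minimal sketch of that check, using only the Python standard library. The column name `Label` and the sample rows are assumptions made for illustration; they are not taken from any of the listed data sets, so check the header of the file you actually download.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="Label"):
    """Count rows per class label in a labeled flow CSV.

    `csv_file` is any text-mode file object.
    `label_column` is an assumed column name, not a documented one.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Strip stray whitespace so "DoS " and "DoS" count as one class.
        counts[row[label_column].strip()] += 1
    return counts


# Tiny in-memory stand-in for a real flow file (made-up values).
sample = io.StringIO(
    "src_ip,dst_ip,bytes,Label\n"
    "10.0.0.1,10.0.0.2,120,BENIGN\n"
    "10.0.0.3,10.0.0.2,9000,DoS\n"
    "10.0.0.4,10.0.0.2,80,BENIGN\n"
)
label_counts = count_labels(sample)
print(label_counts.most_common())
```

For a real file, pass a file object opened in text mode instead of the in-memory sample. A heavily skewed label count is worth knowing about before training or evaluating a detector on the data.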