CTU-13 Dataset Preprocessing

Source : New dataset, CTU-13-Extended, now includes pcap files of normal traffic — Stratosphere IPS

Contents of Dataset File :
  • CTU-Malware-Capture-Botnet-42
  • CTU-Malware-Capture-Botnet-43
  • CTU-Malware-Capture-Botnet-44
  • CTU-Malware-Capture-Botnet-45
  • CTU-Malware-Capture-Botnet-46
  • CTU-Malware-Capture-Botnet-47
  • CTU-Malware-Capture-Botnet-48
  • CTU-Malware-Capture-Botnet-49
  • CTU-Malware-Capture-Botnet-50
  • CTU-Malware-Capture-Botnet-51
  • CTU-Malware-Capture-Botnet-52
  • CTU-Malware-Capture-Botnet-53
  • CTU-Malware-Capture-Botnet-54

Preprocessing

The (truncated) PCAP files in the extended dataset were processed using Zeek (The Zeek Network Security Monitor) to extract logs.

To prepare the data for training, the files will be converted as follows :

PCAP > ZEEK LOGS > CSV > Structured CSV > ML TRAINING

Extracted Files :

  • analyzer.log
  • capture_loss.log
  • conn.log
  • loaded_scripts.log
  • notice.log
  • packet_filter.log
  • stats.log
  • telemetry.log
  • weird.log
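
The PCAP > ZEEK LOGS step can be sketched as a small shell function. This is a minimal sketch under assumptions, not necessarily the exact command used here: it assumes zeek is on the PATH and that each capture gets its own output directory. The names pcap_to_zeek_logs and ZEEK_BIN are hypothetical. The -r option reads packets from a file, and -C tells Zeek to ignore invalid checksums, which is often needed for offline captures.

```shell
#!/usr/bin/env bash

# pcap_to_zeek_logs PCAP OUT_DIR
# Runs Zeek on one capture and keeps the resulting logs in OUT_DIR.
# ZEEK_BIN is a hypothetical override for the Zeek binary (defaults to "zeek").
pcap_to_zeek_logs() {
    local pcap="$1" out_dir="$2"
    local pcap_abs

    mkdir -p "$out_dir" || return 1

    # Resolve the capture to an absolute path before changing directory.
    pcap_abs="$(cd "$(dirname "$pcap")" && pwd)/$(basename "$pcap")" || return 1

    # Zeek writes its logs into the current directory, so run it from OUT_DIR.
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$out_dir" && "${ZEEK_BIN:-zeek}" -C -r "$pcap_abs" )
}

# Example call (both paths are placeholders):
#   pcap_to_zeek_logs capture.pcap logs/scenario
```

Keeping one output directory per scenario avoids later captures overwriting conn.log from earlier ones.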


conn.log fields

  • ts : Timestamp of the first packet seen
  • uid : Unique connection ID
  • id.orig_h : Originator’s IP address
  • id.orig_p : Originator’s port
  • id.resp_h : Responder’s IP address
  • id.resp_p : Responder’s port
  • proto : Transport protocol (TCP/UDP/ICMP)
  • service* : Application service (http, dns, ssl) if identified
  • duration : Connection’s total duration in seconds
  • orig_bytes : Payload bytes from originator
  • resp_bytes : Payload bytes from responder
  • conn_state : Overall state of the connection (e.g., SF, S0, REJ)
  • local_orig : Whether the originator is a local host
  • local_resp : Whether the responder is a local host
  • missed_bytes : Bytes not captured due to packet loss
  • history : Packet-level flags indicating handshake/data flow
  • orig_pkts : Number of packets sent by the originator
  • orig_ip_bytes : Total IP-layer bytes from originator (including headers)
  • resp_pkts : Number of packets sent by the responder
  • resp_ip_bytes : Total IP-layer bytes from responder (including headers)
  • tunnel_parents : Reference to any parent tunnel connection (UID)

# Replace tabs (\t) with commas to produce comma-delimited data
zeek-cut ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents < conn.log | tr '\t' ',' > conn.csv
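
A plain tab-to-comma swap has two limitations: the CSV gets no header row, and any field that itself contains a comma shifts the columns (Zeek joins set-valued fields such as tunnel_parents with commas). Below is a minimal awk-based sketch that takes the column names from the log's #fields line and quotes such values. It assumes the default tab-separated Zeek log format and writes all columns of the log. The function name zeek_log_to_csv is hypothetical.

```shell
#!/usr/bin/env bash

# zeek_log_to_csv LOG
# Prints LOG as CSV on stdout: one header row, then one row per connection.
zeek_log_to_csv() {
    awk -F'\t' '
        # The "#fields" line lists the column names after its first token.
        /^#fields/ {
            line = ""
            for (i = 2; i <= NF; i++) line = line (i > 2 ? "," : "") $i
            print line
            next
        }
        # Skip the remaining Zeek metadata lines (#separator, #types, #close, ...).
        /^#/ { next }
        {
            line = ""
            for (i = 1; i <= NF; i++) {
                f = $i
                # Quote values containing a comma or a double quote, CSV style.
                if (f ~ /[",]/) { gsub(/"/, "\"\"", f); f = "\"" f "\"" }
                line = line (i > 1 ? "," : "") f
            }
            print line
        }
    ' "$1"
}

# Example call:
#   zeek_log_to_csv conn.log > conn.csv
```

Unset values stay as "-", exactly as Zeek writes them, so the training step still has to decide how to treat them.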


 

  1. Random Forest
    1. Bot IP to normal IP ratio : 7% of total

  2. When training
    1. Class imbalance problem
    2. Feature selection algorithms : CFS (correlation-based feature selection), chi-squared
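
Both training notes can be checked with quick counts before any model is fitted. The sketch below assumes a labelled CSV whose last column is the class label (for example 1 for bot traffic and 0 for normal traffic) and whose fields contain no embedded commas. The helper names class_share and chi2_score are hypothetical. class_share reports how skewed the classes are per row, and chi2_score computes the Pearson chi-squared statistic between one categorical column (such as proto) and the label, which is the quantity a chi-squared feature ranking is built on. It does not implement CFS.

```shell
#!/usr/bin/env bash

# class_share LABELLED_CSV
# Prints "<label> <count> <percent>" for each value in the last column.
class_share() {
    awk -F',' '
        NR > 1 { n[$NF]++; total++ }                      # skip the header row
        END { for (k in n) printf "%s %d %.1f%%\n", k, n[k], 100 * n[k] / total }
    ' "$1" | sort
}

# chi2_score LABELLED_CSV COLUMN_NAME
# Prints the chi-squared statistic of COLUMN_NAME against the label column.
chi2_score() {
    awk -F',' -v name="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; bad = 1; exit 1 }
            next
        }
        { obs[$col, $NF]++; row[$col]++; cls[$NF]++; total++ }
        END {
            if (bad) exit 1
            for (r in row) for (c in cls) {
                e = row[r] * cls[c] / total    # expected count under independence
                d = obs[r, c] - e              # unseen pairs count as 0
                chi2 += d * d / e
            }
            printf "%.4f\n", chi2
        }
    ' "$1"
}

# Example calls (labelled.csv is a placeholder file name):
#   class_share labelled.csv
#   chi2_score labelled.csv proto
```

A larger statistic means the column's values are distributed more unevenly across the classes. Scores are only comparable between columns after accounting for how many distinct values each column has.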


Merged all scenarios into one :

The feature selection algorithms showed that, even across different malware, the most useful features were the same. Since the botnets all aim to infect hosts and exhibit similar malicious behavior, the scenarios should pose no problem when merged together.
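
Merging the per-scenario files can be sketched as follows, assuming each scenario already has a CSV with an identical header row. The name merge_scenarios is hypothetical, and the added scenario column (filled with the input file path) is an assumption that keeps the origin of each row available after the merge.

```shell
#!/usr/bin/env bash

# merge_scenarios OUT_CSV CSV...
# Concatenates the input CSVs into OUT_CSV with a single header row and a
# leading "scenario" column. OUT_CSV must not be one of the inputs.
merge_scenarios() {
    local out="$1"
    shift
    awk -v OFS=',' '
        FNR == 1 {
            # Keep the header of the first file; later headers must match it.
            if (NR == 1) { header = $0; print "scenario", $0 }
            else if ($0 != header) {
                print "header mismatch in " FILENAME > "/dev/stderr"
                exit 1
            }
            next
        }
        { print FILENAME, $0 }
    ' "$@" > "$out"
}

# Example call (file names are placeholders):
#   merge_scenarios all.csv scenario_a/conn.csv scenario_b/conn.csv
```

Keeping the scenario column makes it possible to split the merged data by scenario later, for example to hold out whole scenarios for testing.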