CTU-13 Dataset Preprocessing
Source : New dataset, CTU-13-Extended, now includes pcap files of normal traffic — Stratosphere IPS
Contents of Dataset File :
- CTU-Malware-Capture-Botnet-42
- CTU-Malware-Capture-Botnet-43
- CTU-Malware-Capture-Botnet-44
- CTU-Malware-Capture-Botnet-45
- CTU-Malware-Capture-Botnet-46
- CTU-Malware-Capture-Botnet-47
- CTU-Malware-Capture-Botnet-48
- CTU-Malware-Capture-Botnet-49
- CTU-Malware-Capture-Botnet-50
- CTU-Malware-Capture-Botnet-51
- CTU-Malware-Capture-Botnet-52
- CTU-Malware-Capture-Botnet-53
- CTU-Malware-Capture-Botnet-54
Preprocessing
The (truncated) PCAP files in the extended dataset were processed with Zeek (The Zeek Network Security Monitor).
To prepare the data for training, the files are converted in the following stages:
PCAP > ZEEK LOGS > CSV > Structured CSV > ML TRAINING
Extracted Files :
- analyzer.log
- capture_loss.log
- conn.log
- loaded_scripts.log
- notice.log
- packet_filter.log
- stats.log
- telemetry.log
- weird.log
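The logs listed above are the kind of output Zeek writes when it reads a capture offline. The following is a minimal sketch of the PCAP > ZEEK LOGS step, assuming `zeek` is installed and on PATH. The function names and the output directory are hypothetical, and the `-C` flag (ignore checksums) is an assumption rather than something these notes state.

```python
import shutil
import subprocess
from pathlib import Path


def zeek_command(pcap_path: Path) -> list[str]:
    """Build the Zeek command that reads one PCAP file offline."""
    # -r reads packets from a file instead of a live interface.
    # -C tells Zeek to ignore invalid checksums (an assumption here).
    return ["zeek", "-C", "-r", str(pcap_path.resolve())]


def pcap_to_zeek_logs(pcap_path: Path, out_dir: Path) -> bool:
    """Run Zeek on pcap_path so that its logs land in out_dir.

    Returns False without running anything if zeek is not on PATH.
    """
    if shutil.which("zeek") is None:
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    # Zeek writes its *.log files into the current working directory,
    # so the process is started inside out_dir.
    subprocess.run(zeek_command(pcap_path), cwd=out_dir, check=True)
    return True
```

Running each capture in its own output directory keeps the `conn.log` of one scenario from overwriting another.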
conn.log fields
| ts | Timestamp of first packet seen |
| uid | Unique connection ID |
| id.orig_h | Originator’s IP address |
| id.orig_p | Originator’s port |
| id.resp_h | Responder’s IP address |
| id.resp_p | Responder’s port |
| proto | Transport protocol (TCP/UDP/ICMP) |
| service* | Application service (http, dns, ssl) if identified |
| duration | Connection’s total duration in seconds |
| orig_bytes | Payload bytes from originator |
| resp_bytes | Payload bytes from responder |
| conn_state | Overall state of connection (e.g., SF, S0, REJ) |
| local_orig | Whether the originator is a local host |
| local_resp | Whether the responder is a local host |
| missed_bytes | Bytes not captured due to packet loss |
| history | Packet-level flags indicating handshake/data flow |
| orig_pkts | Number of packets sent by the originator |
| orig_ip_bytes | Total IP-layer bytes from originator (including headers) |
| resp_pkts | Number of packets sent by the responder |
| resp_ip_bytes | Total IP-layer bytes from responder (including headers) |
| tunnel_parents | Reference to any parent tunnel connection (UID) |
# Replace the tab delimiters with commas to get comma-delimited data
cat conn.log | zeek-cut ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents | tr '\t' ',' > conn.csv
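The command above produces a CSV without a header row, and a set-valued field such as `tunnel_parents` can itself contain commas. As an alternative sketch, assuming the log is in Zeek's default tab-separated layout with a `#fields` line, the conversion can be done with Python's `csv` module, which quotes such values. The function name and the sample log are made up for illustration.

```python
import csv
import io
from typing import Iterable, TextIO


def zeek_log_to_csv(lines: Iterable[str], out: TextIO) -> int:
    """Convert a tab-separated Zeek log to CSV with a header row.

    Returns the number of data rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    rows = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields\t"):
            # Column names follow the "#fields" tag, tab-separated.
            writer.writerow(line.split("\t")[1:])
        elif line and not line.startswith("#"):
            # csv.writer quotes any value that contains a comma.
            writer.writerow(line.split("\t"))
            rows += 1
    return rows


# Tiny made-up log in Zeek's default layout (not real CTU-13 data).
SAMPLE_LOG = [
    "#path\tconn\n",
    "#fields\tts\tuid\tproto\ttunnel_parents\n",
    "#types\ttime\tstring\tenum\tset[string]\n",
    "1.0\tCabc\ttcp\tCp1,Cp2\n",
    "#close\t2024-01-01-00-00-00\n",
]

buffer = io.StringIO()
row_count = zeek_log_to_csv(SAMPLE_LOG, buffer)
print(buffer.getvalue())
```

For a real file, the same function can be called with an open `conn.log` as `lines` and an output file opened with `newline=""` as `out`.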
- Random Forest
- Bot IP to normal IP ratio : 7% of total
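With bots at roughly 7%, the bot class is heavily outnumbered. One common way to compensate is to weight each class inversely to its frequency. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`; the label names and the 7/93 split are illustrative assumptions, not figures taken from the dataset files.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label list with a 7% bot share, mirroring the ratio above.
labels = ["bot"] * 7 + ["normal"] * 93
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives the larger weight, so misclassified bot samples cost more during training. scikit-learn documents the same formula for its `class_weight="balanced"` option, which `RandomForestClassifier` accepts.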
2. When training
- Class imbalance problem
- Feature selection algorithms: CFS, chi-squared
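To illustrate the chi-squared part of feature selection, the sketch below computes the Pearson chi-squared statistic between one categorical field (for example `proto` or `conn_state`) and the class label. It is a plain contingency-table version written for illustration. These notes do not say which implementation was used, and CFS is not covered here. The sample values are made up.

```python
from collections import Counter


def chi_squared(feature: list[str], labels: list[str]) -> float:
    """Pearson chi-squared statistic of a categorical feature vs. the label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_total in feature_totals.items():
        for c, c_total in label_totals.items():
            # Expected count if feature and label were independent.
            expected = f_total * c_total / n
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


# Made-up example: proto separates the classes, flag carries no signal.
labels = ["bot", "bot", "normal", "normal"]
proto = ["udp", "udp", "tcp", "tcp"]
flag = ["a", "b", "a", "b"]

scores = {"proto": chi_squared(proto, labels), "flag": chi_squared(flag, labels)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A higher score means the field's values are distributed more differently between the two classes, so fields can be ranked by score and the top ones kept.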
Merged all scenarios into one:
Feature selection showed that the most useful features were the same even across different viruses. Since the botnets all aim at infection and similar malicious behavior, merging the scenarios should cause no problems.
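A minimal sketch of the merge step, assuming each scenario has already been converted to a CSV with the same header row. The function name, the added `scenario` column, and the sample rows are hypothetical; keeping the scenario name in each row makes it possible to split the data per capture again later.

```python
import csv
import io
from typing import Mapping, TextIO


def merge_scenarios(sources: Mapping[str, TextIO], out: TextIO) -> int:
    """Concatenate per-scenario CSVs that share one header.

    Prepends a "scenario" column and returns the number of data rows.
    """
    writer = csv.writer(out, lineterminator="\n")
    header = None
    rows = 0
    for name, src in sources.items():
        reader = csv.reader(src)
        this_header = next(reader)
        if header is None:
            header = this_header
            writer.writerow(["scenario", *header])
        elif this_header != header:
            raise ValueError(f"scenario {name!r} has a different header")
        for row in reader:
            writer.writerow([name, *row])
            rows += 1
    return rows


# Made-up rows for two scenarios; real inputs would be the per-capture CSVs.
scenarios = {
    "Botnet-42": io.StringIO("proto,duration\ntcp,1.5\n"),
    "Botnet-43": io.StringIO("proto,duration\nudp,0.2\n"),
}
merged = io.StringIO()
total_rows = merge_scenarios(scenarios, merged)
print(merged.getvalue())
```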