At Microsoft’s twenty-fifth anniversary event, Steve Ballmer shouted "Developers, Developers, Developers…" with the energy of a Duracell bunny. In machine learning, you hear the same chant from everyone: "Data, Data, Data…"
Data Preparation
Toward the end of the last decade, I was busy developing a TCP packet sorter application to collect data for classification. To write this bespoke tool, I had to dig into several RFCs and read the TCP implementation in the 4.4BSD-Lite source code.
I don’t claim to be a TCP expert now, but I did find a serious bug in the TCP sequence number comparison in the PcapPlusPlus library. If you are a math geek, I think you will enjoy reading my bug report.
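For readers who have not met the problem before: TCP sequence numbers are 32-bit counters that wrap around, so a plain `<` on the raw values gives the wrong answer near the wrap point. The following is a minimal sketch of the usual wraparound-aware comparison, the same idea as the `SEQ_LT`/`SEQ_GT` macros in BSD's `tcp_seq.h`. The function names are hypothetical, and this is not necessarily the fix that the bug report proposes.

```cpp
#include <cstdint>

// Wraparound-aware comparison of 32-bit TCP sequence numbers.
// The difference is taken modulo 2^32 and then read as a signed value,
// so "a is before b" holds whenever b is less than 2^31 ahead of a.
// Note: the unsigned-to-signed cast assumes two's complement, which is
// what every mainstream compiler does (and what C++20 guarantees).
inline bool seqLessThan(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) < 0;
}

inline bool seqGreaterThan(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a - b) > 0;
}
```

With this definition, `seqLessThan(0xFFFFFFF0, 0x10)` is true: the second number is 32 bytes ahead of the first even though its raw value is smaller.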
Because the PcapPlusPlus maintainer insists that the library stay free of C++11, he couldn’t accept my pull request for the packet sorting feature upstream. I therefore maintain the new TCP packet sorting feature in my fork of the PcapPlusPlus repository. The PacketSorter application, which is built on the sorting feature, is hosted in another repository of mine.
PacketSorter is a time-critical, server-side application. It ran 24×7 on my pfSense router for more than a month to collect all network traffic at home. It has to be free of resource leaks, whether memory or file handles. C++11 smart pointers and Valgrind helped me a lot in achieving this goal.
Interestingly, even though my home Internet connection runs at only 100 Mbps, the live capture engine libpcap still misses some TCP packets from time to time. I tried different approaches to reduce the loss. For example, I created two threads: one, invoked by the libpcap packet arrival callback, puts each packet into a lock-free ring buffer, and the other consumes packets from the ring buffer and saves them to disk as a pcap file. None of these approaches eliminated the missed packets or even reduced them noticeably. You can find my trials in several development branches of my PacketSorter repository.
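The two-thread layout can be sketched roughly as follows. This is a minimal single-producer/single-consumer ring buffer using C++11 atomics, not the code from the PacketSorter branches; the class name `SpscRing` and its interface are hypothetical. The capture callback would call `push` and the disk-writer thread would call `pop`.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>

// Lock-free ring buffer for exactly one producer thread and one consumer
// thread. Capacity must be a power of two so that index masking works.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false when the ring is full (the caller
    // decides whether to drop the item or retry).
    bool push(T item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) {
            return false;  // full
        }
        slots_[head & (Capacity - 1)] = std::move(item);
        // Release: the slot write above becomes visible before the new head.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the ring is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) {
            return false;  // empty
        }
        out = std::move(slots_[tail & (Capacity - 1)]);
        // Release: the slot is free for reuse only after it has been moved out.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> head_{0};  // next slot to write (producer only)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer only)
};
```

The point of the design is that the capture callback never blocks on disk I/O: when the ring is full, `push` returns immediately and the caller can count the drop.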
I collected some numbers over the first 15 days: among 329,841 TCP connections, there were 2,630 in which libpcap missed at least one TCP packet. The ratio is relatively low, about 0.80%. Perhaps this is the limit of libpcap. That is why the cool kids today work on DPDK to accelerate packet processing.
Data Cleaning
Data cleaning is done in different stages, depending on computation and storage requirements.
During live capture, only TCP packets with a non-empty payload are saved to the pcap files, and only the first 16 packets of each TCP connection are retained.
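As an illustration of this capture-time rule, here is a small sketch of a per-connection filter. It makes several assumptions that the text does not confirm: connections are IPv4, both directions of a flow count toward the same limit, and only packets with a payload are counted. The names `ConnKey`, `makeConnKey` and `FirstPacketsFilter` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>

// Connection key: the two (IPv4 address, port) endpoints in a fixed order.
using ConnKey =
    std::tuple<std::uint32_t, std::uint16_t, std::uint32_t, std::uint16_t>;

// Order the endpoints so that both directions of a flow map to one key.
inline ConnKey makeConnKey(std::uint32_t srcIp, std::uint16_t srcPort,
                           std::uint32_t dstIp, std::uint16_t dstPort) {
    std::pair<std::uint32_t, std::uint16_t> a(srcIp, srcPort);
    std::pair<std::uint32_t, std::uint16_t> b(dstIp, dstPort);
    if (b < a) {
        std::swap(a, b);
    }
    return ConnKey(a.first, a.second, b.first, b.second);
}

// Keeps at most `limit` payload-carrying packets per connection.
class FirstPacketsFilter {
public:
    explicit FirstPacketsFilter(std::size_t limit) : limit_(limit) {}

    // Returns true if the packet should be written to the pcap file.
    bool shouldKeep(const ConnKey& key, std::size_t payloadLen) {
        if (payloadLen == 0) {
            return false;  // pure ACK, SYN, FIN and so on: no payload
        }
        std::size_t& count = kept_[key];
        if (count >= limit_) {
            return false;  // this connection already has its quota
        }
        ++count;
        return true;
    }

private:
    std::size_t limit_;
    std::map<ConnKey, std::size_t> kept_;
};
```

A long-running capture would also need to evict entries when a connection closes or goes idle; otherwise the map grows without bound.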
In the offline stage, process only those pcap files that contain exactly 16 packets. Write a program in C/C++ (not in Python) to post-process the packets:
- Remove the link layer, a.k.a. the Ethernet frame header.
- In the IP and TCP layers, mask the source and destination IP addresses and ports with zeros, then recompute each layer’s checksum after the modification.
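The masking step above can be sketched as follows, assuming IPv4 and untagged Ethernet II frames, so that the IP header starts 14 bytes into the frame and stripping the link layer is just a pointer offset. The helper names (`sumWords`, `foldChecksum`, `maskIpv4Tcp`) are hypothetical; the checksum is the standard Internet checksum from RFC 1071, and the TCP checksum also covers a pseudo-header built from the (now zeroed) addresses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

const std::size_t kEthernetHeaderLen = 14;  // untagged Ethernet II

// Sum of 16-bit big-endian words (RFC 1071), before folding and inversion.
inline std::uint32_t sumWords(const std::uint8_t* data, std::size_t len,
                              std::uint32_t sum = 0) {
    for (std::size_t i = 0; i + 1 < len; i += 2) {
        sum += (static_cast<std::uint32_t>(data[i]) << 8) | data[i + 1];
    }
    if (len & 1) {
        sum += static_cast<std::uint32_t>(data[len - 1]) << 8;  // pad odd byte
    }
    return sum;
}

// Fold the carries back in and take the one's complement.
inline std::uint16_t foldChecksum(std::uint32_t sum) {
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return static_cast<std::uint16_t>(~sum & 0xFFFF);
}

// Zero the addresses and ports of an IPv4/TCP packet in place and recompute
// both checksums. `ip` points at the IPv4 header (link layer already
// stripped); `len` is the number of bytes available from there.
// Returns false if the buffer does not look like a complete IPv4/TCP packet.
inline bool maskIpv4Tcp(std::uint8_t* ip, std::size_t len) {
    if (len < 20 || (ip[0] >> 4) != 4) return false;
    const std::size_t ihl = (ip[0] & 0x0F) * 4u;
    const std::size_t total = (static_cast<std::size_t>(ip[2]) << 8) | ip[3];
    if (ihl < 20 || ip[9] != 6 || total < ihl + 20 || total > len) return false;

    std::memset(ip + 12, 0, 8);  // source and destination addresses
    ip[10] = 0;
    ip[11] = 0;                  // checksum field must be zero while summing
    const std::uint16_t ipSum = foldChecksum(sumWords(ip, ihl));
    ip[10] = static_cast<std::uint8_t>(ipSum >> 8);
    ip[11] = static_cast<std::uint8_t>(ipSum & 0xFF);

    std::uint8_t* tcp = ip + ihl;
    const std::size_t tcpLen = total - ihl;
    std::memset(tcp, 0, 4);      // source and destination ports
    tcp[16] = 0;
    tcp[17] = 0;
    // Pseudo-header: zeroed addresses, protocol number 6, TCP length.
    const std::uint32_t pseudo = 6u + static_cast<std::uint32_t>(tcpLen);
    const std::uint16_t tcpSum = foldChecksum(sumWords(tcp, tcpLen, pseudo));
    tcp[16] = static_cast<std::uint8_t>(tcpSum >> 8);
    tcp[17] = static_cast<std::uint8_t>(tcpSum & 0xFF);
    return true;
}
```

For a captured frame, the call would be `maskIpv4Tcp(frame + kEthernetHeaderLen, frameLen - kEthernetHeaderLen)`. A quick sanity check is that re-summing a header together with its new checksum folds to zero.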
Carefully planning where to do what is the key to success. After all, you won’t find it fun to write a Python script that messes around with raw packet data in memory. Python is great for data management, but it is neither low-level nor efficient enough to hand you a pointer for this kind of plumbing work.