A Journey to Collect Training Data

In the last post, I mentioned that I got my hardware ready within a day to collect training data from my home network. Getting the software ready, it turns out, has cost me more than a month so far.

In my jobs it is common for data management tasks, such as data preparation, data cleaning and data transformation, to take far more effort and time than writing the analytic code that draws an informed conclusion.

The claim "V2ray traffic identification method based on long short-term memory neural network" works on link-layer packets captured from a network switch. It proposes to collect the data as follows:

  1. Label each packet with one of two categories: V2ray traffic or any other legitimate traffic.
  2. Exclude redundant packets.
  3. Exclude DNS packets.
  4. Exclude TCP three-way handshake packets.
  5. Keep the first 16 packets of each TCP/IP communication and aggregate them into a single data entry.

and perform the following data cleaning:

  1. Remove the link-layer header and keep the network-layer packet.
  2. Pad the UDP header with zeros so that it is the same size as the TCP header.
  3. Set the network-layer IP addresses and the ports to zero.
  4. Pad each packet with zeros to a uniform size.
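
The four cleaning steps can be sketched in a few lines of Python. This is only a rough illustration, not the code from the claim: it assumes untagged Ethernet II frames carrying IPv4, treats the 20-byte minimum TCP header as the target size for the UDP header, and uses a made-up uniform size of 1500 bytes. The name `clean_frame` and the constants are hypothetical.

```python
ETH_HEADER_LEN = 14       # assumption: untagged Ethernet II frames
TCP_MIN_HEADER_LEN = 20   # assumption: TCP header without options
UDP_HEADER_LEN = 8
PACKET_SIZE = 1500        # hypothetical uniform size, not from the claim


def clean_frame(frame: bytes, size: int = PACKET_SIZE) -> bytes:
    """Apply the four cleaning steps to one captured IPv4 frame."""
    # Step 1: remove the link-layer header, keep the network-layer packet.
    packet = bytearray(frame[ETH_HEADER_LEN:])
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")

    ihl = (packet[0] & 0x0F) * 4   # IPv4 header length in bytes
    protocol = packet[9]           # 6 = TCP, 17 = UDP

    # Step 3: zero the source and destination IP addresses ...
    packet[12:20] = bytes(8)
    # ... and the source and destination ports (first 4 transport bytes).
    if protocol in (6, 17) and len(packet) >= ihl + 4:
        packet[ihl:ihl + 4] = bytes(4)

    # Step 2: insert zeros after the UDP header so it is as long as TCP's.
    if protocol == 17:
        pad_at = ihl + UDP_HEADER_LEN
        packet[pad_at:pad_at] = bytes(TCP_MIN_HEADER_LEN - UDP_HEADER_LEN)

    # Step 4: truncate or zero-pad to the uniform size.
    return bytes(packet[:size]).ljust(size, b"\x00")
```

Note that the IP and transport checksums are left as captured, so they no longer match the zeroed fields. Whether they should be zeroed as well is not something the steps above specify.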

The temporal data set built from the steps above is then used to train a recurrent neural network as a binary classifier.
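
To make the shape of one training entry concrete, here is a small sketch of how the cleaned packets of one communication might be stacked into a fixed-length sequence. It rests on assumptions: packets are already cleaned to the same size, communications with fewer than 16 packets are filled up with all-zero packets (the steps above do not say what to do with them), and the label is 1 for V2ray and 0 for everything else. `build_sample` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

MAX_PACKETS = 16  # first 16 packets of each communication


def build_sample(
    cleaned_packets: Sequence[bytes], label: int, packet_size: int
) -> Tuple[List[List[int]], int]:
    """Turn cleaned packets into a (16 x packet_size) sequence plus label."""
    rows = [list(p) for p in cleaned_packets[:MAX_PACKETS]]
    for row in rows:
        if len(row) != packet_size:
            raise ValueError("packets must be cleaned to a uniform size first")
    # Assumption: short communications are padded with all-zero packets.
    rows += [[0] * packet_size for _ in range(MAX_PACKETS - len(rows))]
    return rows, label
```

Each row is one time step, so a recurrent network would read the entry packet by packet in capture order.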

I started with tcpdump on my pfSense router. With a proper BPF (Berkeley Packet Filter) expression, the targeted packets were captured successfully, but the captured packets were not grouped by TCP connection. I then looked into existing tools that reassemble TCP connections, such as tcpflow and PcapPlusPlus. Those tools are built on top of the packet capture library libpcap and focus on rebuilding the TCP payload in order rather than keeping the raw packets untouched.
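
As a rough illustration of the grouping that was missing (this is not the PcapPlusPlus-based tool, just a standard-library Python sketch), raw frames can be bucketed by a direction-independent connection key while the frames themselves stay untouched. It assumes untagged Ethernet II frames carrying IPv4/TCP. `connection_key` and `group_by_connection` are hypothetical names, and the sketch does not track connection state, so a reused address and port pair would land in the same bucket.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Endpoint = Tuple[bytes, bytes]        # (IP address, port) as raw bytes
ConnKey = Tuple[Endpoint, Endpoint]


def connection_key(frame: bytes) -> Optional[ConnKey]:
    """Return the same key for both directions of an IPv4/TCP connection."""
    # Assumption: untagged Ethernet II, EtherType 0x0800 means IPv4.
    if len(frame) < 14 + 20 or frame[12:14] != b"\x08\x00":
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4                  # IPv4 header length in bytes
    if ip[9] != 6 or len(ip) < ihl + 4:       # protocol 6 = TCP
        return None
    src = (ip[12:16], ip[ihl:ihl + 2])
    dst = (ip[16:20], ip[ihl + 2:ihl + 4])
    # Sort the endpoints so A->B and B->A map to the same key.
    return (src, dst) if src <= dst else (dst, src)


def group_by_connection(
    frames: Iterable[bytes], limit: int = 16
) -> Dict[ConnKey, List[bytes]]:
    """Keep the first `limit` raw frames of each connection, in capture order."""
    flows: Dict[ConnKey, List[bytes]] = {}
    for frame in frames:
        key = connection_key(frame)
        if key is None:
            continue  # not IPv4/TCP, skip
        bucket = flows.setdefault(key, [])
        if len(bucket) < limit:
            bucket.append(frame)
    return flows
```

Filtering out handshake packets and duplicates would have to happen before the frames are counted against the limit, which is where knowledge of the TCP state machine comes in.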

After some investigation, I decided to write my own packet capture tool on top of the open-source PcapPlusPlus repository. The C++ source code in the repository is well written and well documented. In my experience, its user guide and developer guide are far superior to those of commercial software libraries. I have already sent a PR that ports the PcapPlusPlus build to FreeBSD.

Now I am diving deep into the TCP protocol specification, RFC 793, and two different TCP implementations, Minix 2.0.0 and 4.4BSD-Lite. I don't know whether an emulator exists that can run 4.4BSD-Lite, but I spent a night forking and patching Bochs 2.1.1 to run Minix 2.0.0 with networking working again. It was absolute fun to play with Minix back in my college years. You never know: a term project from an undergraduate operating systems class 15 years ago is still helping me now.

Happy Hacking!
