After 3 months of data preparation work, it took me only one night to write a Keras model that classifies V2Ray traffic. What is the result? Well, as my title says: "We are Fucked". The patent claim is valid.
Before publishing this post, I tried to contact the V2Ray development team, but its members are hard to reach. On GitHub, they even forbid people from openly discussing this subject. It is ironic: V2Ray is an anti-censorship tool, yet its developers prefer to censor others in a technical discussion.
In any case, I posted my classification model in a GitHub repository. If you are interested in machine learning, it is a good techie Jupyter notebook and you should check it out.
Unlike the patent claim, I don’t use a Recurrent Neural Network (RNN), because I am concerned about its slow training and inference speed. Instead, I treat the stream of packet data as a sequence of 1D feature images. The input shape is (16, 1500): one TCP connection, represented by its first 16 packets with 1500 bytes per packet. In other words, it is a sequence of 16 time steps with 1500 features each. A 1D Convolutional Neural Network (CNN) is a good fit for processing this kind of time series.
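To make the (16, 1500) input shape concrete, here is a minimal sketch of how one TCP connection could be turned into that shape. It is not the code from my notebook: the function name `connection_to_features`, the zero padding of short packets and short connections, and the scaling of bytes to [0, 1] are all assumptions made for illustration.

```python
from typing import List

MAX_PACKETS = 16     # first 16 packets of a TCP connection (from the post)
PACKET_BYTES = 1500  # bytes kept per packet (from the post)


def connection_to_features(packets: List[bytes]) -> List[List[float]]:
    """Turn raw packets into a 16 x 1500 nested list of floats.

    Assumptions of this sketch: packets longer than 1500 bytes are
    truncated, shorter ones are zero padded, missing packets become
    all-zero rows, and each byte is scaled to the range [0, 1].
    """
    rows: List[List[float]] = []
    for payload in packets[:MAX_PACKETS]:
        row = [b / 255.0 for b in payload[:PACKET_BYTES]]
        row.extend([0.0] * (PACKET_BYTES - len(row)))  # pad short packets
        rows.append(row)
    while len(rows) < MAX_PACKETS:  # pad short connections
        rows.append([0.0] * PACKET_BYTES)
    return rows


features = connection_to_features([b"\xff" * 2000, b"\x00\x01"])
print(len(features), len(features[0]))  # 16 1500
```

An array built this way has 16 time steps of 1500 features each, which is the layout a Keras `Conv1D` layer expects when it is given `input_shape=(16, 1500)`.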
To speed up I/O, I also wrote a data generator that loads the gigabytes of packet data in parallel. Keras provides convenient interfaces such as fit_generator, evaluate_generator and predict_generator, which let you split the data processing work and the training work between CPU and GPU. All you have to do is RTFM from Keras.
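The idea behind such a generator can be sketched in plain Python. This is only an illustration, not the generator from my notebook: `batch_generator`, its parameters and the `load_sample` callback (which is supposed to read one sample from disk and return a features/label pair) are hypothetical names, and a thread pool is just one assumed way of loading files in parallel.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, List, Sequence, Tuple


def batch_generator(
    paths: Sequence[str],
    load_sample: Callable[[str], Tuple[Any, int]],  # hypothetical loader
    batch_size: int = 32,
    workers: int = 4,
    seed: int = 0,
) -> Iterator[Tuple[List[Any], List[int]]]:
    """Yield (features, labels) batches forever, loading files in parallel."""
    rng = random.Random(seed)
    order = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:  # loop over the data set epoch after epoch
            rng.shuffle(order)
            for start in range(0, len(order), batch_size):
                chunk = order[start:start + batch_size]
                # pool.map loads the files of one batch concurrently
                # and returns the results in input order
                samples = list(pool.map(load_sample, chunk))
                features = [x for x, _ in samples]
                labels = [y for _, y in samples]
                yield features, labels
```

For real training, each batch would have to be converted to NumPy arrays before it is handed to Keras. Also note that newer Keras releases accept generators directly in `fit`, `evaluate` and `predict`, so check which interface your version expects.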
The classification result is impressive. The accuracy can reach 99%, and the ROC curve looks excellent. To be honest, I have mixed feelings about the outcome. On one hand, I am happy: after 3 months of work, I can see something works. On the other hand, I am concerned, because Big Brother now has unprecedented AI surveillance tools to suppress freedom.
The CNN inference speed is blazingly fast on my rig. During the evaluation, with 32 samples/batch at 7 ms/step, the model classifies one sample in 0.21875 ms. In other words, it can classify about 4571 TCP connections per second.
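The throughput figure follows directly from the batch size and the step time. A quick check of the arithmetic (the variable names are only for illustration):

```python
batch_size = 32  # samples per batch during evaluation
step_ms = 7.0    # milliseconds per step (one batch)

ms_per_sample = step_ms / batch_size         # time to classify one sample
samples_per_second = 1000.0 / ms_per_sample  # one sample = one TCP connection

print(f"{ms_per_sample} ms per sample")                      # 0.21875 ms per sample
print(f"{samples_per_second:.0f} connections per second")    # 4571 connections per second
```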
For a country with a population of a billion, can the evil regime deploy this deep packet inspection AI classifier in practice? If it can, how can we bypass it? That is the million-dollar question.