A Brief, Incomplete, and Mostly Subjective History of Chinese Internet censorship and its countermeasures

The title was inspired by the priceless A Brief, Incomplete, and Mostly Wrong History of Programming Languages.

Foreword

"翻墙," or "going over the wall," has been an open secret for Chinese Internet users for quite some time. Accessing Google in China takes much more than typing google.com in one's browser bar -- in fact, so much that it becomes somewhat a shared experience that few outsiders could understand. The walls are becoming higher and higher while the ladders are also becoming better and better. It's a constant struggle that's been going on for at least two decades.

Personally, my struggle with censorship is almost as long as the total time I've been using the Internet. In retrospect, both the technological complexity and the evolution of both parties in the struggle have been truly fascinating, if not epic. But so far, there seems to be no complete recorded history on this subject -- what a pity! I suppose I can make an attempt here. I will tell the story mainly in chronological order. That said, the exact dates or even years could be inaccurate, and I can't give references to everything I say. I am not writing an academic paper here, and too many things have been lost in time.

There is a Wikipedia entry that covers many things beyond the technical stuff. In contrast, in this post I would like to try my best to just focus on the technical part which I consider fascinating enough on its own.

Without further ado, what did it take for people in China to access the true Internet?

The Story

Long Long Ago

In 1987, China sent its first email to Germany, marking its first step into the Internet.

Across the Great Wall, we can reach every corner in the world.

People were excited, but probably no one then expected this email to be a prophecy.


For roughly a decade after that, there were no technical restrictions on Internet access in China, though the cost of PCs and the knowledge required to operate them kept Internet users a small minority.

The Early Days of GFW and VPNs

In 1998 (some say 2002), nationwide DNS spoofing emerged. Domestic DNS servers were configured to return bogus IPs when queried for specific domains, so those sites appeared to be inaccessible. There were two simple workarounds: explicitly switch to a foreign DNS server (whose responses could still be hijacked in transit) or use a Hosts file containing the correct resolutions. Various Hosts files were passed around Chinese forums for the next decade, and this workaround still partially worked until the early 2010s, if I recall correctly. DoT and DoH would also have been good workarounds, had they been invented by then.

(The DNS spoofing is still active today, as shown above, where facebook.com resolves to an internal Dropbox IP. 114.114.114.114 is a widely used DNS server in China. IP information source)
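The Hosts-file workaround described earlier boils down to consulting a locally pinned table before asking any DNS server. Here is a minimal sketch of that idea; the function names `parse_hosts` and `resolve` and the sample entries (taken from documentation-only address ranges) are my own illustrations, not anything a real Hosts file of the era contained.

```python
import socket


def parse_hosts(text):
    """Parse hosts-file text into a {hostname: ip} mapping."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)  # keep the first entry for a name
    return table


def resolve(hostname, table, fallback=socket.gethostbyname):
    """Return the pinned address if present, otherwise ask the system resolver."""
    pinned = table.get(hostname.lower())
    return pinned if pinned is not None else fallback(hostname)


# Hypothetical entries; 203.0.113.0/24 is reserved for documentation.
SAMPLE = """
# pinned resolutions
203.0.113.10  blocked.example  www.blocked.example
203.0.113.20  other.example   # trailing comment
"""

if __name__ == "__main__":
    table = parse_hosts(SAMPLE)
    print(resolve("www.blocked.example", table))
```

Because pinned names never reach a DNS server, a spoofed answer has nothing to replace. The obvious cost is that the table goes stale whenever a site changes its IPs, which is why these files had to be passed around and refreshed constantly.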

The censors soon realized that DNS spoofing was insufficient, so they started installing special hardware devices at backbone network nodes, such as routers at provincial boundaries or submarine cable landing stations. These devices are connected to the network channel via beam splitters, so in most cases they can tap the traffic without interfering with it. In their simplest form, these devices maintain a blacklist of IPs, and when one of them detects an established TCP connection where either end is on the list, it sends each end an RST packet forged to look as if it came from the other, resetting the connection. Neither end would know the reset packet had been faked because it looks identical to a legitimate one on the wire. Browsers would then raise a "Connection Reset" error. These devices became the initial hardware foundation for what would be referred to as "The Great Firewall," or GFW for short (although neither this name nor the censorship itself has ever been officially acknowledged).

(Source)
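To make the reset trick concrete, here is a toy model of the decision such a device makes for each observed segment. It does not touch the network, and everything in it (the `Segment` fields, the `forge_resets` name, the blacklisted address) is an assumption made for illustration. A real injector would also have to pick sequence numbers that fall inside each receiver's window, which this sketch only approximates.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class Segment:
    src: str
    sport: int
    dst: str
    dport: int
    seq: int
    ack: int
    flags: str = "A"      # "A" = ACK, "R" = RST
    payload_len: int = 0


BLACKLIST = {ip_address("203.0.113.7")}  # hypothetical blocked address


def forge_resets(seg, blacklist=BLACKLIST):
    """Return two forged RSTs for an observed segment, or [] if neither end is listed."""
    if ip_address(seg.src) not in blacklist and ip_address(seg.dst) not in blacklist:
        return []
    # To the receiver: impersonate the sender, continuing right after its payload.
    to_dst = Segment(seg.src, seg.sport, seg.dst, seg.dport,
                     seq=seg.seq + seg.payload_len, ack=seg.ack, flags="R")
    # To the sender: impersonate the receiver, using the number the sender expects next.
    to_src = Segment(seg.dst, seg.dport, seg.src, seg.sport,
                     seq=seg.ack, ack=seg.seq + seg.payload_len, flags="R")
    return [to_dst, to_src]
```

Note that the device only reads a copy of the traffic and adds packets of its own. It never has to sit in the forwarding path, which matches the beam-splitter setup described above.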

In addition to IP blacklists, GFW had a keyword-based filter around the same time that reset connections whenever it spotted sensitive content (e.g., content containing names of top government officials or scandals). This was possible because HTTPS had not yet gained traction and most sites operated over unencrypted HTTP. It was indeed the golden age for censors. In the age of HTTPS today, GFW would need to pressure site operators into policing sensitive content themselves, which is technically uninteresting and tedious (though still quite effective for big sites).

At that point, GFW was primarily focused on blocking direct access to "inappropriate content" and had not yet turned to people's countermeasures. Consequently, such censorship could often be circumvented with simple HTTP/SOCKS proxies or VPNs based on standard protocols like PPTP or L2TP. These can usually be set up without installing additional software because popular OSes support them out of the box. People started passing credentials to these services around the Chinese Internet. These simple solutions still had a slight chance of working in 2014 (personal experience). Some big tech companies in China also released early versions of their "game accelerators," which are VPNs in disguise, and some clever people could use those to "cross the wall" as well.

Early 2010s

That didn't last long. The first move GFW made was to blacklist the IPs of the VPNs it knew about and order the popular Chinese forums to promptly delete any advertising of such techniques. A VPN would "die" in 2-3 days, and the same went for proxies. A solution was to keep a long list of VPNs so that at least one of them would always be off the blacklist. Better still, this long list should be actively maintained so that it doesn't die out altogether one day. This was when VPNGate, led by the University of Tsukuba, became popular. VPNGate has a directory of VPN nodes contributed by volunteers around the world, which one can try one by one. The success rate wasn't very high (in my personal experience), but it was acceptable. Tor was also an option, but its bridges and entry nodes were updated infrequently, so blacklisting them was easy.

(VPNGate seems to be operational today but it is not usable at all in China now)
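The "try them one by one" routine is easy to automate. Below is a minimal sketch, assuming that a successful TCP connect is a good enough signal that a node is alive (it is only a rough proxy for "usable"). The names `tcp_probe` and `first_reachable` are hypothetical, and the node list would come from whatever directory one trusts.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(nodes, probe=tcp_probe):
    """Try each (host, port) in order and return the first one the probe accepts."""
    for host, port in nodes:
        if probe(host, port):
            return host, port
    return None  # every node on the list is dead or blocked
```

Passing the probe in as a parameter keeps the selection logic testable without network access, and it leaves room for a stricter check, such as completing the VPN handshake.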

It might be worth mentioning that in 2009 the government briefly promoted the installation of Green Dam Youth Escort, a piece of access control software, on every PC. It didn't take long for the software to become a laughing stock for its disastrous implementation. Never mind the BSODs it caused; all blacklisted URLs were stored in a plaintext file under the installation directory. Many curious teens back then later said they wouldn't have known about so many porn sites but for Green Dam. A true "blessing in disguise"?

(Green Dam ended up as such a joke that people made cartoons out of it. Image taken from Wikipedia entry.)

Around this time (the late 2000s to early 2010s), software such as FreeGate or Wujie was also popular. Behind it were organizations politically hostile to the Chinese government. Compared with the volunteers behind VPNGate, the developers of this software tended to be more motivated and better funded, so its reliability was a bit higher. The downside was that upon launching, it would always direct one's browser to the developers' homepages, which contained political propaganda that could be misleading or disturbing to some.

(Image taken from Wikipedia entry)

GoAgent

Yet another solution for the geeks, which appeared a little later, was to use private foreign nodes. The idea was that if one set up a "ladder" (the term Chinese Internet users use for any means of GFW circumvention) that was used privately and never advertised, it would be much less noticeable to the censors. This is the basic idea that has underpinned most ladders to this day. An early example that emerged around that time was GoAgent, which used Google App Engine (GAE) as the proxy server. Everyone could create a unique GAE instance and get a personal proxy node.

I recall a lot of tutorials back then teaching newbies how to go through the entire process of setting up GoAgent. Indeed, this seemed to be the moment when ladders became harder to configure. In the past, one pasted a link into the browser proxy settings or the OS VPN setup wizard, and that was it -- all easy Ctrl+C/V and GUI. Starting with GoAgent, one needed to do some remote configuration on the GAE dashboard, download some stuff, edit some config files, and open some scary terminal windows. Kudos to the GoAgent devs for writing a script for one-click deployment to GAE, for many of its successors would require SSH-ing into a Linux VPS for deployment, which I imagine was an even greater nightmare for beginners. There have since been numerous attempts to make bypassing GFW easy again, but IMO the problem hasn't truly been solved.

GoAgent was shut down in late 2014 under government pressure. RIP GoAgent. Its direct successor, XXnet, claims to be usable to this day, though I haven't checked.

The Death of VPNs

As various VPNs flourished, GFW was also receiving an upgrade, which soon led to the downfall of all simple proxies and VPNs. The weakness of these solutions is that they all have highly distinctive traffic patterns. These protocols have a handshake whose byte pattern is fixed (regardless of whatever encryption may follow). GFW can therefore recognize the use of these protocols and shut down the connection. From a technical perspective, this is considerably harder than GFW's previous filters. Previously, GFW only needed to scan individual packets (TCP segments) for addresses and specific keywords. Now it has to reconstruct TCP streams from packets and do the filtering at a higher level. This need for context awareness raises the complexity of GFW's algorithms, which is probably why such filters had not been implemented in the first place -- this is my educated guess, at least.
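The difference between per-packet scanning and stream-level filtering can be sketched in a few lines. The toy classifier below reassembles the start of a TCP stream from out-of-order segments and then checks it against fixed opening bytes. The three signatures are real protocol openings (the SSH identification string, an HTTP proxy `CONNECT` request, and a common SOCKS5 greeting), but the function names and the whole structure are my own assumptions, not a description of what GFW actually runs. Sequence-number wraparound is ignored for brevity.

```python
SIGNATURES = {
    "ssh": b"SSH-2.0-",          # SSH identification string
    "http-proxy": b"CONNECT ",   # HTTP proxy tunnel request
    "socks5": b"\x05\x01\x00",   # version 5, one method offered, "no auth"
}


def reassemble(segments, initial_seq):
    """Rebuild the contiguous start of a stream from (seq, payload) pairs in any order."""
    stream = bytearray()
    expected = initial_seq
    for seq, payload in sorted(segments):
        if seq > expected:            # gap: an earlier segment has not been seen yet
            break
        overlap = expected - seq      # retransmitted bytes already in the stream
        if overlap < len(payload):
            stream += payload[overlap:]
            expected = seq + len(payload)
    return bytes(stream)


def classify(segments, initial_seq):
    """Return the name of the first matching signature, or None."""
    head = reassemble(segments, initial_seq)
    for name, magic in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    # The signature straddles two segments, so no single packet contains it.
    print(classify([(104, b"2.0-OpenSSH\r\n"), (100, b"SSH-")], initial_seq=100))
```

Keeping per-connection state like `expected` and `stream` for every flow on a backbone link is the cost hinted at above: a packet-level keyword scan needs no memory between packets, while this does.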

Protocol-based detection was indeed a powerful technique that killed many protocols that had not been designed to face such adversaries. Apart from SOCKS/HTTP proxies and PPTP/L2TP VPNs, the death list also included SSH tunnels, OpenVPN, and some other once-usable but unpopular ladders I don't have time to cover.

It was also around the same time that multiple blog posts appeared attempting to probe the inner architecture of GFW. People figured out which backbone routers had GFW devices on the side using TTLs and reverse IP lookups. There was even a script called GFW looking glass that could trigger a buffer overflow in GFW with a deliberately crafted DNS query so that the response would contain parts of GFW's memory. Such knowledge proved not particularly helpful in developing ladders, but it is really cool. (Pity, by the time I learned of the script, the buffer overflow bug had been patched :( )

Shadowsocks

Among the few survivors of the GFW upgrade was Shadowsocks. First released in 2012, Shadowsocks, in its simplest variant, is just SOCKS5 encrypted with algorithms like AES or ChaCha20 using a pre-shared key. The encryption makes its traffic characteristics much harder to model and identify, though it was still possible.

To counter Shadowsocks, GFW looked at the length distribution of the TCP packets, which encryption changes far less than it changes the packets' content. In response, later versions of Shadowsocks started adding random-length padding to the messages, thus altering the length distribution. GFW's next move was to use replay attacks, or to pose as a client, to actively probe what it considered a possible Shadowsocks server (such servers are not that hard to spot because Shadowsocks typically uses non-standard ports). If it received a response typical of Shadowsocks servers, GFW could be sure that the target was a proxy and blacklist the IP. The Shadowsocks developers managed to cope by updating the protocols and implementing the server more carefully. I would say Shadowsocks is the first family of protocols designed from the very start with GFW-like adversaries in mind.
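The padding countermeasure is simple to illustrate. The sketch below is not the Shadowsocks wire format; it is a made-up framing (a 2-byte padding length, the payload, then filler) that only shows how random-length padding breaks the one-to-one link between plaintext size and message size. In a real protocol this would be applied before encryption so that the length field itself is hidden. The names `pad` and `unpad` and the 255-byte cap are assumptions.

```python
import random
import struct


def pad(payload, rng, max_pad=255):
    """Prefix a 2-byte padding length, then append that many random filler bytes."""
    n = rng.randrange(max_pad + 1)
    filler = bytes(rng.randrange(256) for _ in range(n))
    return struct.pack(">H", n) + payload + filler


def unpad(message):
    """Inverse of pad(): strip the length prefix and the trailing filler."""
    (n,) = struct.unpack(">H", message[:2])
    return message[2:len(message) - n]


if __name__ == "__main__":
    rng = random.Random()  # a real implementation would want a cryptographic RNG
    request = b"GET / HTTP/1.1\r\n\r\n"
    lengths = sorted({len(pad(request, rng)) for _ in range(1000)})
    print(f"one plaintext length, {len(lengths)} distinct message lengths")
```

Padding costs bandwidth, and it only blurs the distribution rather than removing it, which fits the back-and-forth described above: each fix raised the bar without ending the contest.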

In addition to encryption, people also started experimenting with various obfuscation techniques on top of Shadowsocks. Multiple Shadowsocks implementations can disguise the proxy traffic to mimic "innocent" protocols such as HTTP, Skype, and WeChat video calls. The effect of such obfuscation has been controversial. Later, people found the paper The Parrot is Dead from UT Austin, confirming that such obfuscation did not improve undetectability (and actually quite the opposite).

In 2017, a Shadowsocks user found an intriguing patent application on the website of the Chinese Patent Office, which documented classifying Shadowsocks traffic with a deep random forest. The discovery caused quite a stir (and panic) in the community, with rumors that the patent came from "the wall builders" and that Shadowsocks was going to die. Though there have never been official documents on GFW's inner workings, this might be the first evidence that GFW had started deploying machine-learning-based technologies to classify proxy traffic (which, from a purely technical perspective, is an engineering feat given the immense volume of traffic going through GFW every second). Since then, there have been increasing anecdotal reports of Shadowsocks servers being blocked, although anecdotal evidence also suggests that the protocol is still usable today under some configurations.

Shadowsocks has been more than influential in the ladder community for the past decade. When people say Shadowsocks, they are referring to not just one protocol or implementation but a family of protocols and implementations in different languages and on various platforms. The original Shadowsocks repo was once deleted under government pressure, and people moved to other forks (the most famous one being ShadowsocksR). Then somehow, the original Shadowsocks repo was revived, but some are still using some fork, or fork of fork of fork... The full Shadowsocks pedigree is daunting to draw. Almost all Shadowsocks apps contain a paper plane in their icons, and thus many call Shadowsocks servers "airstrips." The name was soon extended to other proxy servers.

A Short Detour, and WireGuard

I suppose that's enough Shadowsocks. Before we move on to the next milestone of the GFW or ladders, I would like to take a detour to talk about developments that took place on a more fundamental level -- VPS, routes, and TCP stacks.

None of these had been problems before 2010, when most of us just used VPNs set up by some random but generous geek. Nor are they problems for the many today who buy commercial ladders and have everything taken care of. But the last decade witnessed a leap forward in the computer skills of Chinese Internet users and a boom in cloud computing platforms, which made hosting one's own server node (VPS) a tempting option for more and more people (myself included). Those people started comparing the service quality and uptime of different VPS providers and learned about different tiers of routes (mainly between China and the US). For example, the China Telecom 163 backbone route is usually the default but becomes intolerably laggy at night due to high traffic; the CN2 premium route is good but more expensive. People started tracerouting back and forth and wrote more and more benchmarking scripts. Some attacked the speed problem from another angle by replacing the default TCP CUBIC congestion control algorithm with a much more aggressive one. The LiteSpeed kernel module was created to this end, and when Google's BBR came out, there were tutorials everywhere on enabling BBR on a VPS, along with a lot of convenience scripts. Such hacking inside the kernel requires support from the virtualization technology, so people started comparing KVM, Xen, and the like. Some also wrote iptables scripts to control access to their servers and make them more resilient to GFW's active probing. These techniques are orthogonal to most proxy applications and could improve the experience significantly. I personally found messing around with these technologies great fun and learned a ton about Linux and networking, so I think they deserve a place in this article.

People also noticed WireGuard as a rising star around that time. To install it on the server side, one had to dig deep and upgrade the Linux kernel. This is closer to the low-level stuff I just covered, so I'll take the opportunity to talk about it here. Its simplicity and UDP-based nature were very appealing at the time, and indeed it worked like a charm. The hype, however, quickly cooled down upon the realization that while the protocol was designed to be secure, it wasn't intended to be undetectable. The patterns of WireGuard packets are obvious. Thus, it was no surprise that within several months GFW could accurately detect WireGuard usage and block it. Another drawback is that, as a VPN, WireGuard is not very flexible (i.e., either all app traffic goes through it or none) relative to proxies, especially compared with the one I am about to introduce...

V2Ray

In 2016, the first version of V2Ray was released. It was a game-changer.

The killer feature of V2Ray is that it offers highly flexible and fine-grained control of the proxy architecture via its JSON-based config. V2Ray models all proxied traffic as coming in through some "inbound," being filtered or dispatched by routing rules, and being passed to some "outbound." All three parts can be configured. Inbounds and outbounds can be assigned tags that can be referenced in routing rules. Notable inbound protocols include SOCKS5 and HTTP proxy. Outbound protocols include SOCKS5, Shadowsocks, "freedom" (which simply sends the traffic to the free Internet), "blackhole" (which silently consumes all traffic), and VMess, V2Ray's own proxy protocol, built upon the lessons learned from its predecessors over the years.

Under this model, a ladder's client and server sides can use the same V2Ray binary with different configurations. A simple client could be a SOCKS5 inbound chained to a VMess outbound; a simple server could be a VMess inbound chained to a "freedom" outbound. This also allows complicated setups involving multiple inbounds and outbounds, relay nodes, transparent proxies, port forwarding (both built on the "dokodemo-door" inbound), and more. In addition, the routing rules can be very flexible. IPs and domains can be matched against CIDR ranges, prefixes/suffixes, and a geographic database that comes with the program. Want local traffic to not go through the proxy? Add a rule directing local IP ranges to "freedom" in the client. Want to block some ad domains? Add a rule to divert such requests to a "blackhole" outbound. Compared with the configurations of previous ladders -- which usually contained a local address, a server address, and credentials -- what V2Ray offered was quite an eye-opener.
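To show what the simple client and server described above look like in practice, here is a sketch that builds both configs as Python dicts and dumps them as JSON. The field names follow the V2Ray JSON config format as I remember it (`inbounds`, `outbounds`, `routing.rules`, `vnext`, `clients`, `outboundTag`), so do check them against the official documentation before use. The server address, port, UUID, and ad domain are placeholders I made up.

```python
import json


def client_config(server, port, user_id):
    """SOCKS5 inbound -> routing rules -> VMess / freedom / blackhole outbounds."""
    return {
        "inbounds": [{
            "tag": "socks-in", "listen": "127.0.0.1", "port": 1080,
            "protocol": "socks", "settings": {"auth": "noauth"},
        }],
        "outbounds": [
            # The first outbound is the default when no rule matches.
            {"tag": "proxy", "protocol": "vmess", "settings": {"vnext": [
                {"address": server, "port": port, "users": [{"id": user_id}]},
            ]}},
            {"tag": "direct", "protocol": "freedom"},
            {"tag": "block", "protocol": "blackhole"},
        ],
        "routing": {"rules": [
            {"type": "field", "ip": ["geoip:private"], "outboundTag": "direct"},
            {"type": "field", "domain": ["domain:ads.example"], "outboundTag": "block"},
        ]},
    }


def server_config(port, user_id):
    """VMess inbound chained to a freedom outbound."""
    return {
        "inbounds": [{"tag": "vmess-in", "port": port, "protocol": "vmess",
                      "settings": {"clients": [{"id": user_id}]}}],
        "outbounds": [{"tag": "direct", "protocol": "freedom"}],
    }


if __name__ == "__main__":
    uid = "00000000-0000-0000-0000-000000000000"  # placeholder UUID
    print(json.dumps(client_config("server.example", 10086, uid), indent=2))
    print(json.dumps(server_config(10086, uid), indent=2))
```

The two configs share nothing except the port and the user ID, which is the point: the same binary becomes a client or a server purely through which inbounds and outbounds it is given.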

Since V2Ray, other proxy applications have emerged with similar high configurability. Most call themselves "rule-based proxies" or "proxy frameworks." Notable examples include Clash and, more recently, Leaf. These alternatives offer some quality-of-life features, such as

  • A more user-friendly, YAML-based configuration format (V2Ray uses JSON).
  • A "true global proxy" with Tun2socks, usually referred to as a "TUN inbound." On activation, this special inbound augments the system network stack and modifies the routing table so it can capture traffic from applications that aren't proxy-aware.
  • Runnable on OpenWRT routers, so any devices connected do not need extra setups to be able to circumvent GFW.
  • GUI or WebUI front ends for graphical configuration.
  • More complicated routing with Python scripts.
  • Subscription -- a mechanism that allows the proxy frameworks to auto-update configurations from a URL. It makes it easy for users of commercial proxy services to set up their clients and keep their node directory and protocol configurations up-to-date with the service provider. V2Ray offers a URL scheme that encodes simple, static configs as short URLs (e.g., vmess://...), which can be more easily shared as QR codes or messages, but this feature does not come close to the flexibility subscription offers.

Notwithstanding these fancy features, these frameworks adhere to V2Ray's model of inbounds, routing rules, and outbounds. All of them aim to be something of an all-in-one solution and have popular protocols such as SOCKS5, Shadowsocks, and VMess built in. Because these protocols don't change, mixing these frameworks or chaining them up is possible. For instance, one can use V2Ray on the server side (because it's lightweight) and Clash for the client. Personally, I've used Clash to redirect traffic to a local V2Ray instance so I can have both a TUN inbound (Clash exclusive) and a VLESS outbound (a successor to VMess, V2Ray exclusive). Such mix-and-match adds another level of flexibility.

In general, V2Ray and its friends immensely diversified the proxy "ecosystem," improved user experience, and lowered the barrier to entry for new users. The massive number of possible combinations of proxy schemes posed a considerable challenge to GFW. The advanced routing features can make a technique from GFW's arsenal useless: If one accidentally visits a Chinese website through the proxy, the traffic goes abroad and returns, passing GFW twice. GFW can then easily correlate the two streams and identify the use of a proxy. However, if one configures V2Ray to route domestic traffic away from the proxy, this would not be a problem.
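The split-routing idea can be modeled as a first-match rule table that maps a destination to an outbound tag. This is only a toy: the rule kinds, the `pick_outbound` name, and the use of the `.cn` suffix as a stand-in for "domestic" are my assumptions, whereas V2Ray itself relies on its bundled geographic databases.

```python
from ipaddress import ip_address, ip_network

# (kind, pattern, outbound tag); all values are illustrative
RULES = [
    ("cidr", "192.168.0.0/16", "direct"),
    ("cidr", "10.0.0.0/8", "direct"),
    ("suffix", "cn", "direct"),          # crude stand-in for a domestic-site list
    ("suffix", "ads.example", "block"),
]
DEFAULT_OUTBOUND = "proxy"


def is_ip(text):
    """Return True if the text parses as an IPv4 or IPv6 address."""
    try:
        ip_address(text)
        return True
    except ValueError:
        return False


def pick_outbound(destination, rules=RULES, default=DEFAULT_OUTBOUND):
    """Return the tag of the first rule matching the destination, else the default."""
    dest = destination.lower().rstrip(".")
    for kind, pattern, tag in rules:
        if kind == "cidr" and is_ip(dest):
            if ip_address(dest) in ip_network(pattern):
                return tag
        elif kind == "suffix" and not is_ip(dest):
            # Match the domain itself or any subdomain, but not "notcn" for "cn".
            if dest == pattern or dest.endswith("." + pattern):
                return tag
    return default
```

With rules like these, a request to a domestic site leaves through "direct" and never crosses the border, so there are no outbound and returning streams for GFW to correlate.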

But still, the outbound/inbound proxy protocols themselves remain the weakness of the whole system. If GFW can identify every protocol you use and block it, then V2Ray will not take you very far, no matter how many hops of relay you use or how many different protocols are used in the process. Fortunately, the next breakthrough in proxy protocol was around the corner.

Back to HTTP?

work in progress, to be continued...