The title was inspired by the priceless A Brief, Incomplete, and Mostly Wrong History of Programming Languages.

Foreword

“翻墙,” or “going over the wall,” has been an open secret among Chinese Internet users for a long time. Accessing Google in China takes much more than typing google.com into one's browser bar—so much so that it has become a shared, identity-building experience little known to outsiders. The walls keep growing higher while the “ladders” keep getting better. This constant struggle has been going on for at least two decades.

My struggle with censorship spans almost my entire time on the Internet. In retrospect, the technological complexity and the evolution of both sides of this struggle have been truly fascinating. But so far, there seems to be no complete recorded history of the subject—what a pity! Allow me to make an attempt here. I will tell the story mainly in chronological order. That said, the exact dates or even years could be inaccurate, and I can't give references for everything I say. This is not an academic paper, and too many things have been lost to time.

There is a Wikipedia entry that covers much of the non-technical side of this topic. In contrast, this post zooms in on the technical part, which is no less intriguing.

Without further ado, what did it take for people in China to browse the “World-Wide Web?”

The Story

Long Long Ago

In 1987, China sent its first email to Germany, marking its first step into the Internet.

Across the Great Wall, we can reach every corner in the world.

People were excited, but no one then expected this email to be a prophecy.


For the next 12 years, there were no technical restrictions on Internet access in China. Internet users in China were still a minority, though, because computers were expensive and not user-friendly by today's standards.

The Early Days of GFW and VPNs

In 1998 (some say 2002), nation-wide DNS spoofing emerged. Domestic DNS servers were configured to return bogus IPs when asked to resolve specific domains, so those sites appeared inaccessible. There were two workarounds: explicitly switch to a foreign DNS server (still vulnerable to DNS hijacking) or use a Hosts file containing the correct resolutions. People passed various Hosts files around Chinese forums for the next decade, and this workaround would still partially work until the early 2010s. DoT and DoH would also have been good workarounds, had they been invented by then.

DNS spoofing in action

(The DNS spoofing is still active today, as shown above, where facebook.com resolves to an IP that actually belongs to Dropbox. 114.114.114.114 is a widely used DNS server in China. IP information source)
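If you want to reproduce this observation yourself, one rough way (my own sketch using the dnspython library, not part of any proxy tool) is to ask a domestic resolver and a DoH resolver for the same name and compare the answers; the resolver addresses below are just well-known public ones.

```python
# pip install "dnspython[doh]"  (the DoH query needs the httpx extra)
import dns.message
import dns.query
import dns.resolver

# Plain UDP query through a domestic resolver (subject to on-path injection).
domestic = dns.resolver.Resolver(configure=False)
domestic.nameservers = ["114.114.114.114"]
print("114.114.114.114:", [r.address for r in domestic.resolve("facebook.com", "A")])

# The same question over DoH, which an on-path injector cannot tamper with.
query = dns.message.make_query("facebook.com", "A")
reply = dns.query.https(query, "https://1.1.1.1/dns-query")
print("DoH (1.1.1.1):  ", [r.address for rrset in reply.answer for r in rrset])
```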

The censors soon realized that DNS spoofing was insufficient, so special hardware devices were installed at backbone network nodes, such as routers at provincial boundaries or submarine cable landing stations. Connected to the network via beam splitters, these devices can passively tap the traffic without interfering with it. In the simplest form, each device maintains a blacklist of IPs, and when it detects an established TCP connection with either end on the list, it sends an RST packet to both ends on behalf of the other to reset the connection. Neither end would know the reset had been forged, because the packet looks indistinguishable from a genuine one on the wire. Browsers would then raise a “Connection Reset” error. These devices became the initial hardware foundation for what would come to be called “The Great Firewall,” or GFW for short (although this name, and indeed the censorship itself, has never been officially acknowledged).
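As an aside, forged resets are not entirely invisible: injected RST packets often arrive with IP TTL values inconsistent with packets genuinely sent by the remote server. Below is a tiny scapy sketch of my own, just to show what watching for suspicious RSTs looks like; it is not a detector anyone actually ships.

```python
# Sniff TCP RSTs and print who they claim to be from, plus their TTL.
# A reset whose TTL differs markedly from other packets of the same
# connection is a strong hint that a middlebox injected it. Needs root.
from scapy.all import IP, TCP, sniff

def report_rst(pkt):
    if pkt.haslayer(IP) and pkt.haslayer(TCP):
        print(f"RST {pkt[IP].src}:{pkt[TCP].sport} -> "
              f"{pkt[IP].dst}:{pkt[TCP].dport}  ttl={pkt[IP].ttl}")

sniff(filter="tcp[tcpflags] & tcp-rst != 0", prn=report_rst, store=False)
```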

The GFW topology

(Source)

In addition to IP blacklists, GFW gained a keyword-based filter around the same time that reset connections whenever it spotted sensitive content (e.g., names of top government officials or scandals). This was possible because HTTPS hadn't gained traction yet and most sites operated over unencrypted HTTP. It was the golden age for censors indeed. In today's age of HTTPS, GFW instead has to pressure site operators into policing sensitive content themselves, which is technically uninteresting and tedious (though still quite effective for big sites).

Up to this point, GFW had focused on blocking direct access to “inappropriate content” and was not engineered against people's countermeasures. Consequently, GFW back then could often be circumvented with simple HTTP(S)/SOCKS proxies or VPNs based on standard protocols like PPTP or L2TP. These could be set up without installing additional software because popular OSes supported them out of the box. People started passing credentials for such services around the Chinese Internet. These simple solutions still had a slight chance of working as late as 2014 (personal experience). Some big techs in China also released early versions of their “game accelerators,” which were VPNs in disguise, and clever people could reconfigure them to “cross the wall” as well.

Early 2010s

That didn't last long. The first move GFW took was to blacklist the IPs of the VPNs it knew about and to order popular Chinese forums to promptly delete any advertisement of such services. A VPN would “die” within 2-3 days of going public, and the same went for proxies. Naturally, one solution was to keep a long list of VPNs so that at least one of them would not yet be blacklisted. Better still, this long list should be actively maintained so that it wouldn't die out altogether one day. People looked around and found VPNGate to be a good realization of this idea. Led by the University of Tsukuba, VPNGate maintains a directory of VPN nodes contributed by volunteers around the world, which one can try one by one. The success rate wasn't high, but it was acceptable. Tor was also an option, but its bridges and entry nodes were updated infrequently and were thus blocked most of the time.

VPNGate

(VPNGate seems to be operational today but it is not usable at all in China)

A tangent: in 2008 the government briefly promoted the installation of Green Dam Youth Escort, a “client-side GFW,” on every PC. It didn't take long for the software to become a joke because of its disastrous implementation. Never mind the BSODs it caused: all blacklisted URLs were stored in a plain-text file under the installation directory. Many curious teens back then said they would never have known so many porn sites but for Green Dam. Truly ironic. What a pity that I missed that :(

Green dam cartoon girl

(Green Dam ended up as such a joke that people made cartoons out of it. Image taken from Wikipedia entry.)

Around this time (the late 2000s to early 2010s), proxies such as FreeGate and Wujie were also popular. They were backed by organizations politically hostile to the Chinese government. Compared with the volunteers behind VPNGate, FreeGate's and Wujie's developers were more motivated and better funded, so these proxies ended up being more reliable. The downside was that, upon launch, they would always direct one's browser to their homepages full of political propaganda.

Freegate GUI

(Image taken from Wikipedia entry)

GoAgent

Another solution the tech-savvy folks popularized a little later was to use private foreign nodes. The idea was that if one set up a private proxy server and kept it secret, it would be much less noticeable to the censors. This basic idea has underpinned most proxy solutions to this day. An early pioneer was GoAgent, which used Google App Engine (GAE) as its hosting platform. Anyone could create a unique GAE instance and get a personal proxy node.

I remember seeing a lot of tutorials back then teaching beginners how to go through the entire process of setting up GoAgent. Indeed, this seemed to be the turning point after which proxies became hard to configure. In the past, one pasted an IP into the browser proxy settings or the OS VPN setup wizard, and that was it—all easy Ctrl+C/V and GUIs. Starting with GoAgent, it took remote configuration on the GAE dashboard, downloading third-party executables, editing config files, and some scary terminal windows to get everything up and running. Kudos to the GoAgent devs for writing a one-click deployment script for GAE; many of its successors would require SSH-ing into a Linux VPS for deployment, an even greater nightmare for beginners. There have since been numerous attempts to make bypassing GFW easy again, but the problem hasn't truly been solved.

GoAgent was shut down in late 2014 under government pressure. RIP GoAgent. Its direct successor, XXnet, claims to be usable to this day, though I haven't checked.

The Death of VPNs

VPNs did not last long either. GFW soon received an upgrade that led to the eventual downfall of all naive proxies and VPNs. Their critical weakness lay in their highly distinctive traffic signatures: these protocols have handshake procedures whose byte patterns are fixed (regardless of whatever encryption might follow). GFW can thus recognize the use of these protocols and reset the connection.

Sounds easy, doesn't it? Not really. Compared with previous techniques, this was technically harder for GFW to pull off because it required more context. Previously, GFW only needed to scan individual packets (TCP segments) for addresses and keywords. But proxy traffic signatures span multiple packets, so GFW now had to reconstruct whole TCP streams. This need for longer context made GFW stateful and its algorithms more complicated, which is probably why such filters had not been implemented earlier—this is my educated guess.
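To make “traffic signature” concrete, here is a toy sketch of my own (certainly not GFW's actual rules) of why fixed handshakes are trivial to flag once the censor has reassembled the stream:

```python
# Each signature inspects only the first few bytes of a reassembled stream.
SIGNATURES = {
    # SOCKS5 greeting: version 0x05, a method count, then that many method bytes.
    "socks5": lambda s: len(s) >= 3 and s[0] == 0x05 and s[1] == len(s) - 2,
    # Plaintext HTTP proxying starts with a CONNECT request line.
    "http-connect": lambda s: s.startswith(b"CONNECT "),
}

def classify(stream_prefix: bytes):
    """Return the name of the first matching signature, or None."""
    for name, matches in SIGNATURES.items():
        if matches(stream_prefix):
            return name
    return None

assert classify(b"\x05\x01\x00") == "socks5"
assert classify(b"CONNECT example.com:443 HTTP/1.1\r\n") == "http-connect"
```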

Signature-based detection was so powerful that it killed almost all protocols that had not been designed to face such adversaries. In addition to SOCKS/HTTP proxies and PPTP/L2TP VPNs, the death list also included SSH tunnels, OpenVPN, and some other once-usable but unpopular proxies I don’t have time to cover.

Another fun tangent: it was also around this time that multiple blog posts appeared attempting to probe the inner architecture of GFW. People figured out which backbone routers had GFW devices attached by using TTLs and reverse IP lookups. There was even a script called GFW looking glass that could trigger a buffer overflow in GFW with a deliberately crafted DNS query, so that the response would contain parts of GFW's memory. Such knowledge proved not particularly helpful for developing proxies, but it is really cool. (A pity that by the time I learned about the script, the buffer overflow had been patched :( )

Shadowsocks

Among the few survivors of the GFW upgrade was Shadowsocks. First appearing in 2012, Shadowsocks is, simply put, SOCKS5 encrypted with algorithms like AES or ChaCha20 using a pre-shared key. The encryption makes Shadowsocks' traffic pattern much harder to model and identify, though still not impossible.
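As a rough sketch of the core idea (emphatically not the real Shadowsocks wire format, which also involves an address header, per-connection salts, key derivation, and so on): both ends share a key derived from a password, and everything on the wire becomes ciphertext with no fixed plaintext pattern left to match.

```python
# Conceptual sketch only: a pre-shared key turns the telltale SOCKS5 bytes
# into something that looks like random noise on the wire.
import os
from hashlib import sha256
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

key = sha256(b"my-shared-password").digest()   # 32-byte pre-shared key
aead = ChaCha20Poly1305(key)

def seal(plaintext: bytes) -> bytes:
    nonce = os.urandom(12)
    return nonce + aead.encrypt(nonce, plaintext, None)

def unseal(blob: bytes) -> bytes:
    return aead.decrypt(blob[:12], blob[12:], None)

wire = seal(b"\x05\x01\x00")      # even a SOCKS5 greeting becomes opaque bytes
assert unseal(wire) == b"\x05\x01\x00"
```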

To counter Shadowsocks, GFW looked at the length distribution of the TCP packets, which encryption does not change nearly as much as it changes the packets' content. In response, later versions of Shadowsocks added random-length padding to their messages. GFW's next move was to employ replay attacks, or to disguise itself as a potential client, and actively probe suspected Shadowsocks servers (identifying suspects is not hard because Shadowsocks typically uses non-standard ports). If the probe returned a Shadowsocks-y response, GFW could be sure the target was a proxy and blacklist its IP. The Shadowsocks developers managed to cope by updating the protocol and meticulously implementing the server so that it does not disclose its existence under probing.
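The padding trick is simple in spirit; a toy version (my own illustration, not any specific Shadowsocks plugin) might look like this:

```python
# Append a random-length filler so packet sizes stop mirroring the sizes of
# the pages being fetched; a 2-byte length prefix lets the peer strip it.
import os
import random
import struct

def pad(payload: bytes, max_pad: int = 255) -> bytes:
    filler = os.urandom(random.randint(0, max_pad))
    return struct.pack("!H", len(payload)) + payload + filler

def unpad(frame: bytes) -> bytes:
    (length,) = struct.unpack("!H", frame[:2])
    return frame[2:2 + length]
```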

Shadowsocks is an open-source project, and what you just saw was a series of back-and-forths between the Shadowsocks devs and GFW, the first of its kind in history. I would say Shadowsocks is the first family of protocols that actively considered a GFW-like adversary in its design.

In addition to encryption, people also started experimenting with various obfuscation techniques on top of Shadowsocks. Multiple Shadowsocks implementations can obfuscate the proxy traffic to mimic popular protocols such as HTTP, Skype, and WeChat video calls. Whether such obfuscation actually works, however, has been subject to debate. Notably, the paper The Parrot is Dead from UT Austin argued that such obfuscation did not improve undetectability (quite the opposite, actually).

In 2017, a Shadowsocks user found an intriguing patent application on the website of the Chinese Patent Office describing a technique to identify Shadowsocks traffic with a deep random forest. The discovery caused quite a stir (and panic) in the community, with rumors that GFW was behind the patent and that Shadowsocks was going to die. Though no official documents of GFW's inner workings have ever been published, this might be the first evidence that GFW had started deploying machine-learning-based techniques to classify proxy traffic. From a purely technical perspective, this is also an engineering marvel given the immense nation-wide throughput GFW has to handle every second. Since then, there have been increasing anecdotal reports of Shadowsocks servers being blocked, although other anecdotes suggest that the protocol is still usable today under some configurations.

Shadowsocks has been more than influential in the Chinese proxy community over the past decade. When people say Shadowsocks, they are referring not just to one protocol or implementation but to a lineage of protocols and implementations in different languages and on various platforms. The original Shadowsocks repo was once deleted under government pressure, and people moved to other forks (the most famous being ShadowsocksR). Then, somehow, the original repo was revived, but many still use some fork, or a fork of a fork of a fork… Drawing the full Shadowsocks pedigree is an interesting exercise left to the reader. Almost all Shadowsocks apps feature a paper plane in their icons, and thus many call Shadowsocks servers “airstrips.” The name was soon extended to other proxy servers.

Shadowsocks’ iconic paper plane (Shadowsocks’ Android App icon, featuring the paper plane)

A Short Detour, and Wireguard

I suppose that's enough Shadowsocks. Before we move on to the next milestone of GFW or proxies, I want to take a detour to talk about developments that took place at a lower level—VPSes, routes, and TCP stacks.

None of these had been problems before 2010, when most of us just used VPNs set up by some random dudes on the Internet. Nor are they problems for many today who buy commercial proxies and have everything taken care of. But the last decade witnessed a leap forward in the computer skills of Chinese Internet users and a boom in cloud computing platforms, which made hosting one's own server (VPS) a tempting option for more and more people (myself included). Those people started comparing the service quality and uptime of different VPS providers and learned about the different tiers of routes (mainly between China and the US). For example, the China Telecom 163 backbone route (AS4134) is the default but becomes intolerably laggy at night due to high traffic; the CN2 premium route (AS4809) is better but more expensive. People trace-routed back and forth and created many benchmarking scripts. Forums emerged where people could show off their VPSes, just like how people flex about their PC builds today.

Some attacked the speed problem from another angle: congestion control algorithms. In case you are not familiar with TCP, congestion control algorithms govern how many packets can be in flight at once. From a 5000-foot view, you let more packets be in flight while none of them are dropped, and you cut the quota once you observe drops (congestion)—congestion hurts everyone, and you obviously don't want your neighbors to be mad at you for making their Internet unusable. As simple as that.
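If you prefer code to prose, the classic additive-increase/multiplicative-decrease rule boils down to a toy update like the one below (real algorithms such as CUBIC and BBR are far more sophisticated):

```python
def next_cwnd(cwnd: int, saw_loss: bool, mss: int = 1460) -> int:
    """Toy AIMD step: grow gently while all packets are ACKed, halve on loss."""
    if saw_loss:
        return max(mss, cwnd // 2)            # congestion: back off hard
    return cwnd + max(1, mss * mss // cwnd)   # no loss: add roughly one MSS per RTT
```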

The default TCP congestion control algorithm, CUBIC, is very conservative. What about a more aggressive one? Enter the LotServer kernel module. You also shouldn't be surprised to hear that big tech companies are interested in this topic, because they have servers worldwide that talk to each other. Google released the famous BBR algorithm in 2016. The proxy community was quick to react, and there were soon tutorials everywhere on enabling BBR on one's VPS, plus a lot of convenience scripts. Congestion control is part of the networking stack and deeply integrated with the kernel. Kernel-level hacking requires support from the virtualization technology, so people started comparing KVM, Xen, and the like. On a related note, some also wrote iptables scripts to control access to their servers, making them more resilient to GFW's active probing.
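For reference, those BBR tutorials mostly boiled down to two sysctl settings on a kernel new enough (4.9+) to ship the module; exact steps vary by distro:

```
# Append to /etc/sysctl.conf, then apply with `sysctl -p`
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# Verify: `sysctl net.ipv4.tcp_congestion_control` should now print "bbr"
```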

All these techniques are orthogonal to most proxy applications and could improve the experience significantly. I, for one, found messing around with these technologies great fun and learned a ton of Linux and networking knowledge, so I think they deserve a place in this article.

People also noticed WireGuard as a rising star around this time. WireGuard is a UDP-based VPN protocol built into newer Linux kernels that is much simpler than OpenVPN and offers better security. Setting it up required one to dig deep and upgrade the Linux kernel, so it is closer to the low-level stuff I just covered, and I'll take the opportunity to talk about it here. Its simplicity and UDP-based nature attracted a lot of interest. The hype, however, quickly cooled down upon the realization that while the protocol was designed to be secure, it was never intended to be undetectable. WireGuard's traffic signature is apparent, so it was no surprise that within several months GFW could accurately detect WireGuard usage and block it. Another drawback is that, as a VPN, WireGuard is inflexible (either all app traffic goes through it or none) compared with proxies. This is especially true compared with the one I am about to introduce…
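For the curious, WireGuard's entire client-side setup fits in a tiny INI-style file like the hypothetical one below (the keys, addresses, and endpoint are placeholders). The AllowedIPs = 0.0.0.0/0 line is exactly the all-or-nothing behavior mentioned above.

```ini
# /etc/wireguard/wg0.conf  (bring the tunnel up with `wg-quick up wg0`)
[Interface]
PrivateKey = <client-private-key>
Address = 10.0.0.2/32

[Peer]
PublicKey = <server-public-key>
Endpoint = vpn.example.com:51820
# 0.0.0.0/0 routes everything through the tunnel: all or nothing
AllowedIPs = 0.0.0.0/0
PersistentKeepalive = 25
```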

V2Ray

In 2016, the first version of V2Ray was released. It was a game-changer.

The killer feature of V2Ray is flexible, fine-grained control of the proxy architecture via its JSON-based configuration system. Under V2Ray's model, proxy traffic on both the client and server sides

  1. enters the system through some “inbound protocol,”
  2. gets filtered or dispatched by some routing rules, and
  3. exits to some “outbound protocol.”

V2Ray made all three parts configurable. Inbounds and outbounds can be assigned tags that routing rules can reference. Notable inbound protocols include SOCKS5 and HTTP proxy. Outbound protocols include SOCKS5, Shadowsocks, “freedom” (which simply sends the traffic to the free Internet), “blackhole” (which silently consumes all traffic), and VMess, V2Ray's own proxy protocol, built on the lessons learned from its predecessors over the years.

Under this model, a proxy's client and server sides can use the same V2Ray binary with different configurations. In a simple example, the client uses a SOCKS5 inbound chained to a VMess outbound; the server uses a VMess inbound chained to a “freedom” outbound. The model also allows complicated setups involving multiple inbounds and outbounds, relay nodes, transparent proxies (using the “TProxy” inbound), port forwarding (using the “dokodemo-door” inbound), and more. In addition, the routing rules can be very flexible. IPs and domains can be matched against CIDR ranges, prefixes/suffixes, and a geographic database that ships with the program. Want local traffic to bypass the proxy? Add a rule on the client directing local IP ranges to “freedom.” Want to block some ad domains? Add a rule diverting such requests to a “blackhole” outbound. Compared with the configuration systems of previous proxies—usually no more than a local address, a server address, and credentials—what V2Ray offered was an eye-opener.
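To make this concrete, below is a stripped-down sketch of a client-side config. The field names follow V2Ray's documented JSON schema as I remember it, but the server address, port, and UUID are placeholders; treat it as an illustration rather than a copy-paste config.

```json
{
  "inbounds": [
    { "tag": "socks-in", "port": 1080, "protocol": "socks" }
  ],
  "outbounds": [
    { "tag": "proxy", "protocol": "vmess",
      "settings": { "vnext": [ { "address": "my.server.example", "port": 443,
                                 "users": [ { "id": "<uuid>" } ] } ] } },
    { "tag": "direct", "protocol": "freedom" },
    { "tag": "block", "protocol": "blackhole" }
  ],
  "routing": {
    "rules": [
      { "type": "field", "ip": [ "geoip:private", "geoip:cn" ], "outboundTag": "direct" },
      { "type": "field", "domain": [ "geosite:category-ads" ], "outboundTag": "block" }
    ]
  }
}
```

If I recall correctly, traffic matching no rule falls through to the first outbound in the list (the “proxy” tag here), while the two rules divert private/domestic and ad traffic to “direct” and “block” instead.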

Many proxy applications since V2Ray have adopted its inbound → routing rules → outbound model. Most call themselves “rule-based proxies” or “proxy frameworks.” Notable examples include Clash and, more recently, Leaf. These successors distinguish themselves from V2Ray with quality-of-life features, such as

  • A more user-friendly, YAML-based configuration format (V2Ray uses JSON; a short sketch of a Clash config follows this list);
  • A “true system-wide proxy” with Tun2socks, usually referred to as a “TUN inbound.” When enabled, this special inbound hooks into the OS network stack and modifies the routing table so it can proxy traffic from applications that aren’t proxy-aware.
  • Support for OpenWrt routers, so any connected device can circumvent GFW without extra setup.
  • GUI or WebUI front ends for graphical configuration.
  • Programmable routing rules with Python.
  • Subscription—a mechanism that allows the proxy frameworks to auto-update configurations from a URL. It makes it easy for users of commercial proxy services to set up their clients and keep their node directory and protocol configurations up-to-date with the service provider. V2Ray offers a URL scheme that encodes simple, static configuration as short URLs (e.g., vmess://...), which can be shared as QR codes or messages, but this feature does not come close to the flexibility subscription offers.
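Here is the Clash sketch promised above, in its YAML format; the node name, server address, and UUID are placeholders, and the field names follow Clash's documented schema as far as I recall:

```yaml
mode: rule
socks-port: 7891
proxies:
  - { name: my-node, type: vmess, server: my.server.example, port: 443,
      uuid: "<uuid>", alterId: 0, cipher: auto, tls: true }
rules:
  - DOMAIN-SUFFIX,ads.example.com,REJECT   # blackhole-style blocking
  - GEOIP,CN,DIRECT                        # domestic traffic skips the proxy
  - MATCH,my-node                          # everything else goes through the node
```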

All these frameworks aim to be an all-in-one solution and have common protocols such as SOCKS5, Shadowsocks, and VMess built in. The protocol implementations are interoperable, so mixing or chaining these frameworks is possible. For instance, one can use V2Ray on the server side (because it's lightweight) and Clash for the client (for better UX). Personally, I've used Clash to redirect traffic to a local V2Ray instance so I could have both a TUN inbound (Clash-exclusive) and a VLESS outbound (a successor to VMess, V2Ray-exclusive). Such mix-and-match adds another level of flexibility.

In general, V2Ray and its friends immensely diversified the proxy “ecosystem,” improved user experience, and lowered the barrier to entry for new users. The massive number of possible combinations of proxy schemes posed a considerable challenge to GFW. In particular, customizable routing rules made round-trip traffic correlation, a technique in GFW's arsenal, obsolete: when proxying was all-or-nothing, traffic to Chinese websites went through the proxy as well, traveling abroad and coming back, passing GFW twice. This allowed GFW to correlate the two streams and identify the use of a proxy. However, if one configures V2Ray to route domestic traffic away from the proxy, such correlation attacks become much harder to pull off.

But still, the inbound and outbound protocols themselves remain the weakness of the whole system. If GFW can identify every protocol you use and block it, then V2Ray will not take you very far, no matter how many relay hops you use or how many different protocols are involved in the process. Fortunately, the next breakthrough in proxy protocols was just around the corner.

Obfuscation and the Return of HTTP(S)

As GFW moved from keyword-based detection to more heuristic, signature-driven detection, proxies went a long way to make sure the traffic they handle is both secure and undetectable. As I mentioned earlier, the Shadowsocks family contributed to this series of efforts by varying packet lengths with randomized padding and by offering a wide variety of encryption schemes (each of which hopefully has a somewhat different statistical signature). Another idea the Shadowsocks devs played with was obfuscation: for example, the proxy client adds a fake HTTP header to its request, and the server prepends a similar HTTP response header to its reply. Without scrutiny, the traffic looks just like… HTTP! Besides HTTP, common protocols in Shadowsocks' obfuscation toolkit include Skype, WeChat, BitTorrent, and more.
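Concretely, the HTTP trick amounts to something like the toy sketch below; the header contents are made up, and real plugins negotiate and strip these wrappers far more carefully:

```python
# Prepend an innocuous-looking HTTP request header to the (already encrypted)
# payload, so a shallow look at the stream sees ordinary HTTP traffic.
FAKE_HEADER = (b"POST /upload HTTP/1.1\r\n"
               b"Host: cdn.example.com\r\n"   # hypothetical cover hostname
               b"Content-Type: application/octet-stream\r\n\r\n")

def obfuscate_first_message(encrypted_payload: bytes) -> bytes:
    return FAKE_HEADER + encrypted_payload
```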

Sounds cool, doesn't it? As I said earlier, the effectiveness of such obfuscation techniques has always been controversial. A not-so-obvious caveat is that mimicking just the traffic stream is not enough: one must also mimic the traffic metadata—especially ports.

What should one do about this? The natural next step is to have the server not only encapsulate data in a common protocol but also actually listen on that protocol's intended port. Returning to the HTTP example, our proxy software should not only wrap its data in HTTP headers; its server must also run on port 80. Actually, HTTP is not a very good choice, because HTTP payloads are typically plain text (HTML and the like), whereas the proxied traffic is an arbitrary binary stream. GFW could quickly tell the two apart. What's an alternative that's just as popular and carries binary payloads?

Enter HTTPS. A defining feature of any encryption scheme is its ability to turn structured content into bytes that are statistically indistinguishable from random ones. If we wrap arbitrary proxied Internet traffic in HTTPS, no censor can tell just by looking at the encrypted byte stream. Problem solved!

This idea sparked many new proxies, starting with Trojan. V2Ray followed by introducing a “TLS transport,” which is essentially HTTPS, since HTTPS is built on TLS. These proxies' server ends can be set up to be as indistinguishable from a legitimate HTTPS server as possible:

  1. They require a certificate, just as a regular HTTPS site would. A self-signed one works, but Let's Encrypt has also made obtaining a real one easy.
  2. Some of the proxies implemented a full-fledged HTTPS server (think about the technical complexity of doing that for a second!).
  3. Other proxies took the easier path of implementing an HTTP server and instructing users to place it behind an HTTPS reverse proxy like Apache, Nginx, or Caddy. This also lets users easily piggyback their proxies onto legitimate websites they already host overseas (like a personal blog).
  4. Almost all these proxies were engineered to counter GFW's active probing. If the server detects a non-proxy client (an actual browser or a GFW probe), it redirects the traffic to a real HTTPS site (like one's blog); see the sketch after this list.
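Here is the sketch promised in item 4: a heavily simplified, hypothetical illustration of the “authenticate or fall back” idea. Real implementations such as Trojan or V2Ray's fallbacks are far more careful, and the password, ports, and certificate paths below are placeholders.

```python
import asyncio
import hashlib
import ssl

SECRET = hashlib.sha224(b"my-password").hexdigest().encode()  # Trojan-style token
FALLBACK = ("127.0.0.1", 8080)   # a genuine website served locally (the decoy)

async def pipe(reader, writer):
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()

async def handle(reader, writer):
    head = await reader.read(len(SECRET))
    if head == SECRET:
        # A real proxy would parse the destination here and start relaying.
        writer.close()
        return
    # Anyone else (a browser, a crawler, or a GFW probe) is handed over to the
    # decoy site, bytes already read included, so the server looks like a
    # perfectly ordinary HTTPS host.
    up_reader, up_writer = await asyncio.open_connection(*FALLBACK)
    up_writer.write(head)
    await asyncio.gather(pipe(reader, up_writer), pipe(up_reader, writer))

async def main():
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain("fullchain.pem", "privkey.pem")  # e.g. from Let's Encrypt
    server = await asyncio.start_server(handle, "0.0.0.0", 443, ssl=ctx)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```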

HTTPS encryption can be computationally expensive, so many attempts followed to improve the performance of these HTTPS/TLS-based protocols:

  1. V2Ray modeled TLS as a “transport layer” decoupled from the proxy protocol that runs through it. While conceptually clean, this means the proxy protocol typically applies another layer of encryption, so V2Ray must encrypt and decrypt twice. XTLS, a V2Ray fork, ingeniously modified the protocol to “zip” the two encryptions into one.
  2. People looked into more efficient Linux system APIs that reduce the number of copies involved in handling messages.
  3. Rewrite it in Rust! V2Ray is written in Go; there are successors of the concept written in Rust. See Leaf.

Thanks to the incredible amount of effort people have put into optimizing this stuff, CPU performance is rarely the bottleneck for proxy throughput these days, even on some of the lowest-end hardware people can find (e.g., OpenWrt routers or potato VPSes with ≤512 MB of RAM).

So much for the server side. Another thing people looked into was the client: it turns out one can fingerprint the TLS parameters a particular client uses by observing the handshake between it and the server. The naive HTTPS client many proxies use (often built on top of Go/Rust TLS libraries) has a TLS fingerprint distinguishable from those of mainstream browsers. Encryption does not help here, because the TLS handshake happens before encryption kicks in. If GFW tries hard enough (and it certainly has the resources and incentive to do so), it can recognize proxy traffic hiding behind HTTPS this way! As always, people figured out a mitigation: Naiveproxy solves this issue by building on Chromium's network stack, taken straight from Chromium's repository, so its TLS fingerprint is identical to Chrome's.

Is proxying through HTTPS a flawless solution? No. The weakness still originates from traffic metadata. This time it's not the port but the host. Many people connect to only one proxy server over a long period. From GFW's perspective, such a person appears to visit a niche HTTPS website for hours on end with constant traffic, while that website has very few other visitors throughout the day—looks very suspicious indeed. Unfortunately, I am unaware of any workaround besides having a pool of proxy servers and many clients (commercial proxies get this naturally). But it's fair to say that people have done about as much as they can. HTTPS-based proxies came out in 2019, yet they still work and remain the go-to solution in 2024. People are still actively contributing to these solutions, but I have not followed the developments since I came to MIT. This seems like a good place to end the article.

Afterword

Thanks for staying around till here! It really took me a while to finish this article. I started writing it in late 2022 and finished most of it quickly. However, I got distracted by coursework at the last minute, and the final part remained unfinished for about two years: ugh, there are just so many things to do every day, and finishing a 95%-written blog post was never high on my to-do list. The opportunity finally came in late 2024, when I was trying to push out the rewritten version of my blog (another thing I had procrastinated on for almost infinitely long!). This was the most recent article to show up, and it would have made my rewrite a joke if I had spent so much time optimizing the design of the website but left its content unfinished. So I spent a Sunday evening finishing the last part, and here we are.

Content-wise, I am happy with what I have managed to cover. Much of this article is based on my knowledge of the area as of 2022—when I could still confidently call myself an expert. After I entered college in the US, setting up these proxies was no longer a necessity, so I gradually stopped following new developments—but luckily there do not seem to have been any substantial ones :). All my previous setups worked very well on the few occasions I needed them. People have largely converged on a few key ideas that stood the test of time: V2Ray's configuration model, HTTPS-based proxies, etc. But it is still good to know how we ended up here. Rome was not built in a day. Hopefully you've enjoyed the read so far.