Chengyuan Ma's Blog



The title was inspired by the priceless A Brief, Incomplete, and Mostly Wrong History of Programming Languages.


"翻墙," or "going over the wall," has been an open secret among Chinese Internet users for quite some time. Accessing Google in China takes much more than typing the address into one's browser bar -- in fact, so much more that it has become something of a shared experience that few outsiders could understand. The walls keep growing higher while the ladders keep getting better. It's a constant struggle that has been going on for at least two decades.

Personally, my struggle with censorship spans almost the entire time I've been using the Internet. In retrospect, both the technological complexity of the struggle and the evolution of the two parties have been truly fascinating, if not epic. But so far, there seems to be no complete recorded history of this subject -- what a pity! I suppose I can make an attempt here. I will tell the story mainly in chronological order. That said, the exact dates or even years could be inaccurate, and I can't give references for everything I say. I am not writing an academic paper here, and too many things have been lost to time.

There is a Wikipedia entry that covers many aspects beyond the technical ones. In contrast, in this post I will try my best to focus only on the technical part, which I consider fascinating enough on its own.

Without further ado, what did it take for people in China to access the true Internet?

The Story

Long Long Ago

In 1987, China sent its first email to Germany, marking its first step into the Internet.

Across the Great Wall, we can reach every corner in the world.

People were excited, but probably no one then expected this email to be a prophecy.

For 12 years after that, there were no technical restrictions on Internet access in China, though the cost of PCs and the knowledge required to operate them did keep Internet users a minority.

The Early Days of GFW and VPNs

In 1998 (some say 2002), nation-wide DNS spoofing emerged. Domestic DNS servers were configured to return bogus IPs when asked for the actual addresses of specific domains, so those sites appeared to be inaccessible. There were two simple workarounds: explicitly switching to a foreign DNS server (still vulnerable to DNS hijacking) or using a Hosts file containing the correct resolutions. Various Hosts files were passed around Chinese forums over the next decade, and this workaround would still partially work until the early 2010s, if I recall correctly. DoT and DoH would also have been good workarounds, if only they had been invented by then.
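For illustration, a Hosts-file workaround looked something like this (the domain and address below are hypothetical placeholders, not entries from any real list that circulated):

```
# /etc/hosts (or C:\Windows\System32\drivers\etc\hosts on Windows)
# Pin a correct IP locally so the system resolver never asks the poisoned DNS.
203.0.113.10    www.blocked-example.com
```

The system resolver consults this file before any DNS query, which is exactly why the trick sidesteps spoofed responses, and also why the lists went stale whenever sites changed IPs.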

(The DNS spoofing is still active today, as shown above, where the queried domain resolves to one of Dropbox's internal IPs. The server queried is a widely used DNS server in China.)

The censors soon realized that DNS spoofing was insufficient, so they began installing special hardware at backbone network nodes, such as routers at provincial boundaries and submarine cable landing stations. These devices are attached to the network via beam splitters, so they can tap the traffic without, in most cases, interfering with it. In its simplest form, such a device maintains a blacklist of IPs; when it detects an established TCP connection with either end on the list, it sends RST packets to both ends, each forged on behalf of the other, to reset the connection. Neither end would know the reset had been faked, because it looks identical to a legitimate one on the wire. Browsers would then raise a "Connection Reset" error. These devices became the initial hardware foundation of what would come to be called "The Great Firewall," or GFW for short (although this name, and even the censorship itself, has never been officially acknowledged).
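As a toy model (my own sketch, not GFW's actual implementation), the blacklist-plus-RST logic can be expressed in a few lines: inspect a connection's endpoints, and if either is blacklisted, emit forged resets that impersonate each side to the other. The IP addresses here are made up for illustration.

```python
BLACKLIST = {"93.46.8.89"}  # hypothetical blocked IP, not a real GFW entry

def forge_resets(src, dst, seq, ack):
    """If either endpoint is blacklisted, forge one RST per direction.

    Each forged packet claims to come from the peer, so to the receiver
    it is indistinguishable from a genuine reset on the wire.
    """
    if src not in BLACKLIST and dst not in BLACKLIST:
        return []
    return [
        {"from": dst, "to": src, "flags": "RST", "seq": ack},  # resets the client
        {"from": src, "to": dst, "flags": "RST", "seq": seq},  # resets the server
    ]
```

A real on-path device must of course pick in-window sequence numbers for the RST to be accepted; the `seq`/`ack` swap above is the simplest approximation of that.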


In addition to IP blacklists, GFW gained a keyword-based filter around the same time, resetting connections whenever it spotted sensitive content (e.g., names of top government officials or scandals). This was possible because HTTPS hadn't gained traction yet and most sites operated over unencrypted HTTP. A golden age for censors indeed. In today's age of HTTPS, GFW instead has to pressure site operators into policing sensitive content themselves, which is technically uninteresting and tedious (though still quite effective for big sites).

Until then, GFW had focused primarily on blocking direct access to "inappropriate content" and had not turned its attention to people's countermeasures. Consequently, the censorship could often be circumvented with simple HTTPS/SOCKS proxies or VPNs based on standard protocols like PPTP or L2TP. These could usually be set up without installing additional software, because popular OSes supported them out of the box. People passed credentials for such services around the Chinese Internet. These simple solutions still had a slight chance of working in 2014 (personal experience). Some big Chinese tech companies also released early versions of their "game accelerators," which were VPNs in disguise, and some clever people used those to "cross the wall" as well.

Early 2010s

That didn't last long. GFW's first move was to blacklist the IPs of the VPNs it knew about and to order popular Chinese forums to promptly delete any posts advertising such techniques. A VPN would "die" within 2-3 days, and the same went for proxies. One solution was to keep a long list of VPNs, so that at least one would always remain unblacklisted. Better still, the list should be actively maintained so that it wouldn't die out altogether one day. This was when VPNGate, led by the University of Tsukuba, became popular. VPNGate maintains a directory of VPN nodes contributed by volunteers around the world, which users could try one by one. The success rate wasn't very high (in my experience), but acceptable. Tor was also an option, but its bridges and entry nodes were updated infrequently, so blacklisting them was easy.

(VPNGate seems to be operational today but it is not usable at all in China now)

It might be worth mentioning that in 2008 the government briefly promoted the installation of Green Dam Youth Escort, an access-control software, on every PC. It didn't take long for the software to become a laughing stock for its disastrous implementation. Quite apart from the BSODs it caused, all blacklisted URLs were stored in a plaintext file under the installation directory. Many curious teens of that era later admitted they would never have known so many porn sites but for Green Dam. A true "blessing in disguise"?

(Green Dam ended up as such a joke that people made cartoons out of it. Image taken from Wikipedia entry.)

Around this time (the late 2000s to early 2010s), software such as FreeGate and Wujie was also popular. Behind them were organizations politically hostile to the Chinese government. Compared with the volunteers behind VPNGate, the developers of this software tended to be more motivated and better funded, so reliability was a bit higher. The downside was that upon launch, they would always direct one's browser to their homepages, which contained political propaganda that could be misleading or disturbing to some.

(Image taken from Wikipedia entry)


Yet another solution, for the geeks, appeared a little later: private foreign nodes. The idea was that if one set up a "ladder" (the term Chinese Internet users use for any means of circumventing GFW) that was used privately and never advertised, it would be much less noticeable to the censors. This basic idea has underpinned most ladders to this day. An early example from that era was GoAgent, which used Google App Engine (GAE) as the proxy server. Anyone could create a unique GAE instance and get a personal proxy node.

I recall many tutorials back then teaching newbies how to walk through the entire process of setting up GoAgent. Indeed, this seems to be the moment when ladders became harder to configure. Previously, one pasted a link into the browser proxy settings or the OS VPN setup wizard, and that was it -- all easy Ctrl+C/V and GUI. Starting with GoAgent, one needed to do some remote configuration on the GAE dashboard, download some stuff, edit some config files, and open some scary terminal windows. Kudos to the GoAgent devs for writing a one-click deployment script for GAE, for many of its successors would require SSH-ing into a Linux VPS for deployment, which I imagine would have been an even greater nightmare for beginners. There have since been numerous attempts to make bypassing GFW easy again, but IMO the problem has never truly been solved.

GoAgent was shut down in late 2014 under government pressure. RIP GoAgent. Its direct successor, XXnet, claims to be usable to this day, though I haven't checked.

The Death of VPNs

As various VPNs flourished, GFW was also receiving upgrades, which soon led to the downfall of all simple proxies and VPNs. The weakness of these solutions is that they all have highly distinctive traffic patterns: their handshake procedures contain fixed byte patterns (regardless of whatever encryption follows). GFW can therefore recognize the use of these protocols and shut down the connection. Technically, this is easier said than done compared with GFW's previous filters. Before, GFW only needed to scan individual packets (TCP segments) for addresses and specific keywords; now it had to reconstruct TCP streams from packets and do the filtering at a higher level. This requirement of context awareness raises the complexity of GFW's algorithms, which is probably why such filters had not been implemented earlier -- my educated guess, at least.
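To make the idea concrete, here is a toy fingerprint matcher of my own (the signatures are drastically simplified; real DPI models are far more elaborate): several of these protocols begin a connection with fixed bytes, so matching the first bytes of a reassembled stream already goes a long way.

```python
# First-bytes signatures of some handshakes (simplified illustrations):
#  - a SOCKS5 client greeting starts with the version byte 0x05
#  - an HTTP proxy tunnel request starts with the ASCII verb "CONNECT "
SIGNATURES = {
    "socks5": b"\x05",
    "http-connect": b"CONNECT ",
}

def classify(stream: bytes) -> str:
    """Return the name of the first matching protocol signature."""
    for name, prefix in SIGNATURES.items():
        if stream.startswith(prefix):
            return name
    return "unknown"
```

This also shows why stream reassembly matters: a handshake split across two TCP segments defeats any filter that looks at packets in isolation.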

Protocol-based detection was indeed a powerful technique that killed many protocols that had not been designed to face such adversaries. Apart from SOCKS/HTTP proxies and PPTP/L2TP VPNs, the death list also included SSH tunnels, OpenVPN, and some other once-usable but unpopular ladders I don't have time to cover.

It was also around the same time that multiple blog posts appeared attempting to probe the inner architecture of GFW. People figured out which backbone routers had GFW devices on the side using TTLs and IP reverse lookup. There was even a script called GFW looking glass that could trigger a buffer overflow in GFW with a deliberately crafted DNS query, so that the response would contain parts of GFW's memory. Such knowledge proved not particularly helpful for developing ladders, but it was really cool. (A pity that by the time I learned of the script, the buffer overflow had been patched :( )


Among the few survivors of the GFW upgrade was Shadowsocks. First appearing in 2012, Shadowsocks, in its simplest variant, is just SOCKS5 encrypted with algorithms like AES or ChaCha using a predetermined key. The encryption makes its traffic characteristics much harder to model and identify, though still not impossible.

To counter Shadowsocks, GFW looked at the length distribution of the TCP packets, which encryption, unlike the packets' content, does not change much. In response, later versions of Shadowsocks added random-length padding to messages, thus altering the length distribution. GFW's next move was to use replay attacks, or to disguise itself as a potential client, to actively probe anything it considered a possible Shadowsocks server (not hard to guess, because Shadowsocks typically runs on non-standard ports). If it received a response typical of Shadowsocks servers, GFW could be sure the target was a proxy and blacklist the IP. The Shadowsocks developers coped by updating the protocol and implementing the server with great care. I would say Shadowsocks is the first family of protocols designed with GFW-like adversaries in mind from the very start.
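The padding countermeasure is simple to sketch (my own illustration, not the actual Shadowsocks wire format): append a random number of filler bytes, and record the real length so the receiver can strip them. Identical payloads then produce frames of varying sizes, flattening the length distribution an observer sees.

```python
import os
import random
import struct

def pad(payload: bytes, max_pad: int = 64) -> bytes:
    """Prefix the real length, then append 0..max_pad random filler bytes."""
    filler = os.urandom(random.randint(0, max_pad))
    return struct.pack(">H", len(payload)) + payload + filler

def unpad(frame: bytes) -> bytes:
    """Recover the original payload using the length prefix."""
    (n,) = struct.unpack(">H", frame[:2])
    return frame[2 : 2 + n]
```

In a real protocol the length field would itself be encrypted, otherwise the padding is trivially stripped by the censor too.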

In addition to encryption, people also experimented with various obfuscation techniques on top of Shadowsocks. Multiple Shadowsocks implementations can disguise proxy traffic to mimic "innocent" protocols such as HTTP, Skype, and WeChat video calls. The effectiveness of such obfuscation has been controversial. Later, people found the paper The Parrot Is Dead from UT Austin, confirming that such mimicry did not improve undetectability (quite the opposite, in fact).

In 2017, a Shadowsocks user found an intriguing patent application on the website of the Chinese Patent Office documenting the classification of Shadowsocks traffic with deep random forests. The discovery caused quite a stir (and panic) in the community, with rumors that the patent came from "the wall builders" and that Shadowsocks was going to die. Although there have never been official documents on GFW's inner workings, this might be the first evidence that GFW had started deploying machine-learning-based classifiers against proxy traffic (which, viewed from a purely technical perspective, is an engineering feat given the immense volume of traffic passing through GFW every second). Since then, there have been increasing anecdotal reports of Shadowsocks servers being blocked, although anecdotal evidence also suggests the protocol is still usable today under some configurations.

Shadowsocks has been more than influential in the ladder community over the past decade. When people say Shadowsocks, they refer not to a single protocol or implementation but to a family of protocols and implementations in different languages and on various platforms. The original Shadowsocks repo was once deleted under government pressure, and people moved to forks (the most famous being ShadowsocksR). Then, somehow, the original repo was revived, but some still use a fork, or a fork of a fork of a fork... The full Shadowsocks pedigree is daunting to draw. Almost all Shadowsocks apps feature a paper plane in their icons, which is why many call Shadowsocks servers "airstrips." The name was soon extended to other proxy servers.

A Short Detour, and Wireguard

I suppose that's enough Shadowsocks. Before we move on to the next milestone of the GFW or ladders, I would like to take a detour to talk about developments that took place on a more fundamental level -- VPS, routes, and TCP stacks.

None of these had been a concern before 2010, when most of us simply used VPNs set up by some random but generous geek. Nor are they a concern for the many who buy commercial ladders today and have everything taken care of. But the last decade witnessed a leap in the computer skills of Chinese Internet users and a boom in cloud computing platforms, which made hosting one's own server node (VPS) a tempting option for more and more people (myself included).

Those people started comparing the service quality and uptime of different VPS providers and learned about the different tiers of routes (mainly between China and the US). For example, the China Telecom 163 backbone is usually the default but becomes intolerably laggy at night due to high traffic, while the premium CN2 route was good but more expensive. People tracerouted back and forth and wrote ever more benchmarking scripts.

Some attacked the speed problem from another angle by replacing the default TCP CUBIC congestion control algorithm with a much more aggressive one. The LiteSpeed kernel module was created to this end, and when Google's BBR came out, tutorials on enabling BBR on a VPS appeared everywhere, along with plenty of convenience scripts. Such hacking inside the kernel requires support from the virtualization technology, so people began comparing KVM, Xen, and the like. Some also wrote iptables scripts to control access to their servers, making them more resilient to GFW's active probing.

These techniques are orthogonal to most proxy applications and could improve the experience significantly. I personally found messing with them great fun and learned a ton about Linux and networking, so I think they deserve a place in this article.

People also noticed Wireguard as a rising star around that time. To install it on the server side back then, one had to dig deep and upgrade the Linux kernel. That is close to the low-level stuff I just covered, so I'll take the opportunity to talk about it here. Its simplicity and UDP-based nature were very appealing at the time, and indeed it worked like a charm. The hype, however, quickly cooled upon the realization that while the protocol was designed to be secure, it was never intended to be undetectable: the patterns of Wireguard packets are obvious. It was thus no surprise that within several months GFW could accurately detect and block Wireguard usage. Another caveat is that, as a VPN, Wireguard is not very flexible (either all application traffic goes through it or none) compared with proxies, especially the one I am about to introduce...


In 2016, the first version of V2Ray was released. It was a game-changer.

The killer feature of V2Ray is that it offers highly flexible, fine-grained control of the proxy architecture via its JSON-based config. V2Ray models all proxied traffic as entering through some "inbound," getting filtered or dispatched by routing rules, and leaving through some "outbound." All three parts are configurable. Inbounds and outbounds can be assigned tags, which routing rules can reference. Notable inbound protocols include SOCKS5 and HTTP proxy. Outbound protocols include SOCKS5, Shadowsocks, "freedom" (which simply sends the traffic to the free Internet), "blackhole" (which silently swallows all traffic), and VMess, V2Ray's own proxy protocol, built upon the lessons its predecessors learned over the years.

Under this model, a ladder's client and server sides can run the same V2Ray binary with different configurations. A simple client could be a SOCKS5 inbound chained to a VMess outbound; a simple server could be a VMess inbound chained to a "freedom" outbound. The model also allows complicated setups involving multiple inbounds and outbounds, relay nodes, transparent proxies (using the "TProxy" inbound), port forwarding (using the "Dokodemo-door" inbound), and more. In addition, the routing rules are very flexible: IPs and domains can be matched against CIDRs, prefixes/suffixes, and a geographic database that ships with the program. Want local traffic to bypass the proxy? Add a rule directing local IP ranges to "freedom" in the client. Want to block some ad domains? Add a rule diverting such requests to a "blackhole" outbound. Compared with the configurations of previous ladders -- usually just a local address, a server address, and credentials -- what V2Ray offered was an eye-opener.
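A client config in the spirit described above might look like this (a hedged sketch: the field names follow V2Ray's documented inbound/outbound/routing schema as I understand it, while the address and user ID are placeholders):

```json
{
  "inbounds": [
    { "tag": "socks-in", "port": 1080, "listen": "127.0.0.1", "protocol": "socks" }
  ],
  "outbounds": [
    { "tag": "proxy", "protocol": "vmess",
      "settings": { "vnext": [ { "address": "example.org", "port": 443,
        "users": [ { "id": "00000000-0000-0000-0000-000000000000" } ] } ] } },
    { "tag": "direct", "protocol": "freedom", "settings": {} },
    { "tag": "block", "protocol": "blackhole", "settings": {} }
  ],
  "routing": {
    "rules": [
      { "type": "field", "ip": ["geoip:cn", "geoip:private"], "outboundTag": "direct" },
      { "type": "field", "domain": ["geosite:category-ads"], "outboundTag": "block" }
    ]
  }
}
```

The two routing rules implement exactly the tricks mentioned above: domestic and local traffic skips the proxy, and ad domains fall into the blackhole; everything else defaults to the first outbound, the VMess proxy.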

Since V2Ray, other proxy applications have emerged with similar high configurability. Most call themselves "rule-based proxies" or "proxy frameworks." Notable examples include Clash and, more recently, Leaf. These alternatives offer some quality-of-life features, such as

  • A more user-friendly, YAML-based configuration format (V2Ray uses JSON);
  • A "true global proxy" via Tun2socks, usually referred to as a "TUN inbound." On activation, this special inbound augments the system network stack and modifies the routing table so it can capture traffic even from applications that aren't proxy-aware;
  • The ability to run on OpenWRT routers, so connected devices need no extra setup to circumvent GFW;
  • GUI or WebUI front ends for graphical configuration;
  • More complicated routing driven by Python scripts;
  • Subscriptions -- a mechanism that lets the proxy framework auto-update its configuration from a URL. This makes it easy for users of commercial proxy services to set up their clients and keep their node directories and protocol configurations in sync with the provider. V2Ray offers a URL scheme that encodes simple, static configs as short URLs (e.g., vmess://...), which are easily shared as QR codes or messages, but this feature does not come close to the flexibility of subscriptions.
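The vmess:// sharing scheme mentioned above is, to my understanding, just Base64-encoded JSON after the scheme prefix; here is a sketch of a decoder (field names like `ps` and `add` follow the commonly shared format and should be treated as assumptions, as should the node details in the example):

```python
import base64
import json

def decode_vmess_url(url: str) -> dict:
    """Strip the scheme, Base64-decode the remainder, parse the JSON inside."""
    assert url.startswith("vmess://")
    b64 = url[len("vmess://"):]
    b64 += "=" * (-len(b64) % 4)  # restore any stripped Base64 padding
    return json.loads(base64.urlsafe_b64decode(b64))

# Round-trip with a made-up node description:
node = {"v": "2", "ps": "my-node", "add": "example.org", "port": "443", "id": "uuid-here"}
url = "vmess://" + base64.urlsafe_b64encode(json.dumps(node).encode()).decode()
```

Since the whole config rides in one opaque string, it fits in a QR code, but it stays static; a subscription URL, by contrast, can serve a fresh node list on every fetch.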

Notwithstanding these fancy features, all of these frameworks adhere to V2Ray's inbound/routing-rules/outbound model. Each aims to be something of an all-in-one solution, with popular protocols such as SOCKS5, Shadowsocks, and VMess built in. Because the protocols themselves don't change, mixing these frameworks or chaining them is possible. For instance, one can run V2Ray on the server (because it's lightweight) and Clash on the client. Personally, I've used Clash to redirect traffic to a local V2Ray instance so I can have both a TUN inbound (Clash-exclusive) and a VLESS outbound (a successor to VMess, V2Ray-exclusive). Such mix-and-match adds yet another level of flexibility.

In general, V2Ray and its friends immensely diversified the proxy "ecosystem," improved user experience, and lowered the barrier to entry for new users. The massive number of possible combinations of proxy schemes posed a considerable challenge to GFW. The advanced routing features can make a technique from GFW's arsenal useless: If one accidentally visits a Chinese website through the proxy, the traffic goes abroad and returns, passing GFW twice. GFW can then easily correlate the two streams and identify the use of a proxy. However, if one configures V2Ray to route domestic traffic away from the proxy, this would not be a problem.

But still, the outbound/inbound proxy protocols themselves remain the weakness of the whole system. If GFW can identify every protocol you use and block it, then V2Ray will not take you very far, no matter how many hops of relay you use or how many different protocols are used in the process. Fortunately, the next breakthrough in proxy protocol was around the corner.

Back to HTTP?

work in progress, to be continued...

The Problem

Is it just me, or is Chrome really not as fast and smooth as it claims?

Most people switch to Firefox for its privacy promises. That wasn't my reason (although it's indeed a plus). I have been using Firefox because the UI animations of nearly all Chromium-based browsers are intolerably laggy.

By “UI,” I mean all the components surrounding the web view: the tab strip, the toolbar, the download shelf, the Omnibox. The problem is most noticeable when you close a tab and watch the neighboring tabs adjust their bounds to fill the gap. You observe uncomfortable stutters, and that discomfort lures you into repeating the whole thing to confirm that it's not an illusion. It's not, and the pain only intensifies with repetition until it completely ruins your day.

Don't get me wrong. The animations within web pages are impeccable (Chrome couldn't have taken over half the market if it couldn't even do that right). But the contrast just makes the whole problem more annoying. Worse still, the lag is passed down to nearly all Chromium derivatives, like Edge and Brave, making the entire Chromium world visually unacceptable. Vivaldi is an exception, because it builds its UI on its own framework, but its homemade UI doesn't blend well with the remaining Chromium UI (a fact the Vivaldi devs don't bother to hide). That inconsistency is a problem in itself.

So I started using Firefox. It was great, until all the incompatibility issues reminded me that I'd still need a Chromium-based browser just in case.

Of course, you would expect someone to have filed a bug report somewhere for such a glaring user experience problem. Someone did indeed, but the devs don't seem to care.

Tfw you run across this thread while trying to find out if anyone else's scrolling stutters sometimes.

Blows my mind how stuff like this remains unsolved for months. I've sorta gotten used to laggy tabs by now but it's like they're fine with ignoring reasons for people to trash talk their browser's performance.

Enough complaints. If the devs aren't going to fix it, I will do it myself.

Checking Out the Code

The first step in fixing the bug is checking out the code. I followed this guide on Chromium's project homepage and encountered basically no problems, except for two:

  • The guide asks you to run everything in a Windows command prompt instead of other shells. This also implies that you shouldn't start the command prompt from another shell (e.g., by typing cmd.exe in PowerShell), because the launched process then inherits environment variables from the parent shell. I have Conda installed, and this inheritance caused problems: the Conda shell integration dynamically prepends its Python executable to Path on PowerShell startup. Consequently, no matter how far up front I put depot_tools in Path, it would always be preceded by the Conda Python whenever I launched cmd.exe from PowerShell.

  • Use NTFS. The guide says

    At least 100GB of free disk space on an NTFS-formatted hard drive. FAT32 will not work, as some of the Git packfiles are larger than 4GB.

    I did not take this very seriously and figured I would be fine as long as I avoided FAT32, so I initially used my external SSD formatted as exFAT. The checkout did complete with no warnings, and I was able to build. However, I faced massive one-hour-plus rebuilds even when I modified nothing (and more than 6000 targets had to be rebuilt when I changed a single .cc file). For a while I thought this was the norm and formed a very bad impression of the efficiency of Chromium's build system. ExFAT is also very inefficient at storing small files, such that ~30G of source files ended up occupying ~180G of disk space, which is complete nonsense. Reformatting the drive as NTFS solved both problems.

First Attempt

I had never dealt with Chromium code before. Fortunately, Chromium's excellent documentation saved me the trouble of diving into the tens of millions of lines of code completely clueless. There is also an online code search engine that makes navigating the code orders of magnitude easier. I was able to find this piece of code, a good start:

A further look into BoundsAnimator reveals that it has the member function SetAnimationDuration.

My first hypothesis was that if the animation is somehow laggy, then shortening its duration would make the lag less noticeable.

A search in tab****.cc reveals that SetAnimationDuration is not called by the tab strip except with an argument of 0 to disable rich animation, so the default duration is nearly always used, which is defined here:

So I changed 200 down to 100 and then to 60. Now the animations are blazingly fast (literally), but are they smoother? Lags became much less likely to occur and be noticed, but I could still find some from time to time. This fix was not satisfactory. Time to go deeper.

Second Attempt

A closer look at BoundsAnimator shows that it calls SlideAnimation under the hood:

SlideAnimation, in turn, inherits from LinearAnimation, where an interesting constant is defined:

The above screenshot basically captures everything I want to say. The LinearAnimation class uses the framerate passed to its constructor to calculate the timer interval at which it updates its internal state and notifies observers (that's the role of BoundsAnimator). The observers then propagate the update to redraw the interface. Most calls to the LinearAnimation constructor omit the frame_rate argument, so kDefaultFrameRate is used. 60 appears to be a reasonable value at first glance, but it no longer is given that I am using a 165Hz monitor.

There is a common argument that human eyes cannot perceive motion beyond 60Hz and that high-refresh-rate monitors are just hype. I believed this for years, until I switched to a 90Hz phone. When everything else animates at 90Hz, a 60Hz animation stands out as clunky, not to mention at 120Hz or 165Hz. So here lies the problem: the Chromium UI animation appears to lag because somehow the devs decided to hardcode a framerate ceiling into the rendering process. They took VSync to a whole new level.

(This realization might also explain an interesting phenomenon from when I attempted to file a bug report myself a while ago. I used OBS to record the tab strip animations. The lag was obvious at the time of recording, but when I watched the recording later it seemed less so. Perhaps that was because the screen capture was 60FPS too? Or perhaps some magic in the video decoder makes 60FPS video content appear smoother?)

If this is indeed the root of the problem, then the solution is simple: just turn up kDefaultFrameRate. Don't forget to turn down the minimum timer interval in CalculateInterval accordingly, though, because as it stands (10000) any framerate greater than 100 is effectively capped at 100.
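The cap is easy to verify with arithmetic. Below is a sketch reproducing the logic described above (not Chromium's exact code): the timer interval is derived from the frame rate but floored at 10000 microseconds, so any requested rate above 100 fps silently degrades to 100 fps.

```python
MIN_INTERVAL_US = 10_000  # the floor in CalculateInterval, as described above

def calculate_interval_us(frame_rate: int) -> int:
    """Microseconds between animation timer ticks for a requested frame rate."""
    return max(1_000_000 // frame_rate, MIN_INTERVAL_US)

def effective_fps(frame_rate: int) -> float:
    """The frame rate actually delivered after the interval floor applies."""
    return 1_000_000 / calculate_interval_us(frame_rate)
```

At 60 fps the floor is irrelevant (interval 16666 µs), but asking for 165 fps yields an interval of 6060 µs, which the floor bumps back up to 10000 µs, i.e., exactly 100 fps. Hence both constants must change together.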

This is literally a change to 3 lines of code, and yes, the lags are gone!

(So in some sense the “3x as smooth” in the title is a clickbait. 165 / 60 = 2.75, so the framerate does approximately triple. The question is: is framerate a good measure of “smoothness”?)

Some Thoughts

I have held a very strong respect for the Chromium Project for years, so strong that I may as well call it worship. After all, it is a codebase of over 10M LoC (and growing), while I was still struggling to manage my personal projects, all under 10k LoC. I've read many posts analyzing the elaborate architecture of Chrome, and many others that call Chromium an engineering masterpiece. It was with this mindset that, when I first observed this bug a year ago, I assumed there must be some seriously complicated regression deep inside Chromium's rendering engine, and that any fix would be highly nontrivial and beyond the understanding of us mortals.

I consider a great takeaway from this bug-fixing experience to be the demystification of the project. I downloaded the code; I found something I could fix; I fixed it. Locating the root of the problem wasn't hard. So I suppose the only justification for the devs not to have fixed it is that the browser UI has a lower priority than the kernel.

I wouldn't say my fix is elegant in any sense. All I can say is that it works well on my machine. It now seems that hardcoding the default framerate as a device-independent constant is fundamentally flawed. But how should it be improved? Should we query the monitor's maximum framerate whenever we play an animation? Should we cache the result somewhere? Should we make it configurable in chrome://flags? A general and sustainable fix is much more complicated than changing the constants. I might come up with one someday, but IAP is drawing to an end, and there are too many things for me to do. So, for the time being, I chose to file a more technical issue on Chromium's issue tracker. Leave it to the devs. Until they officially fix the bug, I will be using my own fork of Chromium.

Speaking of building Chromium on my own: a side benefit is that I can now enjoy the latest features of Chrome (I am already on M100) without the privacy controversies that surround Chrome. I am pretty sure some telemetry modules still exist in Chromium (that's why ungoogled Chromium exists), but it's arguably much better than Chrome. I also left the API keys empty, so nothing sensitive should be sent to Google -- perhaps? My guess is that it's now on par with unhardened Firefox.

By the way, I would like to briefly mention Firefox's approach to the UI problem. While investigating the bug, I was surprised to learn that Firefox's UI is web-based, i.e., it uses the same kernel to render both the web pages and the UI. Thus some very interesting things happen if you type chrome://browser/content/browser.xhtml into the Firefox address bar! I feel this may be the better way to go, since building web UIs is a mature practice, and by optimizing your kernel you are also optimizing your UI. But I am not sure.

Having recently returned to China, I found myself once again preoccupied with circumventing the GFW. Many circumvention tools use SOCKS5 for their inbounds. SOCKS5 is common and convenient, but because it sits at the transport layer it requires each application's cooperation, which makes a global proxy difficult. The mature solution today is tun2socks. There are many tun2socks implementations, all built on the same principle: create a virtual network interface (a TUN device), identify the TCP and UDP flows among the IP packets sent to that interface, and forward them to the proxy. For this, tun2socks also needs the operating system's cooperation: we must modify the routing table so that the TUN device created by tun2socks becomes the default gateway. This is the tricky part of using tun2socks, and many tutorials fail to explain it clearly; most people simply use applications like Clash that configure tun2socks automatically. As an Ubuntu user whose proxy protocol is somewhat unusual, I found Clash insufficient, and setting up a raw tun2socks was one pitfall after another. Fortunately, after a week of tinkering I ended up with a few automation scripts, and this article records what I learned.


Linux's routing table can be inspected with the ip route show command, for example:

default via 192.168.100.1 dev wlp4s0 proto dhcp metric 600
<subnet> dev wlp4s0 scope link metric 1000
<subnet> dev docker0 proto kernel scope link src <IP> linkdown
<subnet> dev wlp4s0 proto kernel scope link src <IP> metric 600
<subnet> dev tun0 proto kernel scope link src <IP>




  • dev xxx means the IP packet should leave through the network interface named xxx.
  • via xxx means the IP packet should be sent to the gateway/router at address xxx, which will route it further. This is most common on the default rule; for example, the 192.168.100.1 in my via entry is the gateway of my home WiFi.
  • scope link means the destination address is directly reachable at the link layer, i.e., no further forwarding through other devices is needed. This one doesn't matter much for configuring tun2socks.
  • src xxx means the IP packet's source address should preferably be xxx. This attribute only takes effect while a packet has not been assigned a source address yet (i.e., it was just generated); if a packet already has a source address, this rule will not change it.
  • linkdown means the interface on this line is currently offline.
  • proto xxx records where the routing entry came from. For example, proto kernel means the kernel added this line, and proto dhcp means it came from DHCP.
  • metric xxx is the entry's priority: if several lines in the table match the destination address, the system forwards along the line with the smaller metric.

With this, we can basically read a routing table from the output of ip route.

However, Linux has more than one routing table (alas). The table shown above is only the main routing table (table main). There are two more by default: the local table, automatically maintained by the kernel and usually left alone, and a default table that is empty by default. With several tables around, how does Linux know which one to consult? That is decided by routing rules, which can be shown with ip rule show. The default rules are:

0:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default

The number before the colon is a rule's priority, an integer from 0 to 32767, with smaller values meaning higher priority. from all means the rule matches IP packets from any source address (all can be replaced with a specific IP or subnet; there is an analogous to all condition), and lookup local means "consult the local routing table." So the ip rule output above translates into plain language as:

  1. First consult the local table; if some line matches, follow it;
  2. If nothing in local matches, consult the main table next;
  3. If main has no match either, consult the default table;
  4. If even default has no match, the packet is doomed.



Giving it a try: configuring the routing table for tun2socks



ip route add default via


ip route del default via dev wlp4s0


  1. If the SOCKS 5 inbound is not on this machine, the proxy requests leaving through tun0 will now be routed back into tun0 — an infinite loop.
  2. Even if the SOCKS 5 inbound is on this machine, its next hop is overseas, and the proxy requests towards that node will also be routed back into tun0. Still an infinite loop.


ip route add default via dev wlp4s0 table default

Note the table default at the end, which specifies the routing table being modified; without it, the default is table main.


# <LAN IP> is my LAN IP; this rule covers the case where the SOCKS 5
# inbound is not on localhost.
# (In that case, use tun2socks' -interface flag to bind the egress device;
# once bound, tun2socks sends everything destined for the SOCKS 5 server
# through that device, so the packets' source IP is naturally the LAN IP.)
ip rule add from <LAN IP> lookup default
# Alternatively, if the SOCKS 5 inbound is on localhost but you know the
# exact IP of the remote node, this works too, and it guarantees that
# connections to the remote node always go direct:
ip rule add to <proxy server IP> lookup default


0:  from all lookup local
32764:  to <proxy server IP> lookup default
32765:  from <LAN IP> lookup default
32766:  from all lookup main
32767:  from all lookup default





  1. If I have more than one remote node, or the nodes change dynamically with a subscription, adding to <remote node IP> lookup default rules one by one is a hassle. What then?
  2. If I run a local v2ray configured to connect directly to domestic addresses, and route all local traffic to v2ray's SOCKS 5 inbound via tun2socks, how do I make sure v2ray's outbound traffic doesn't run into tun2socks again and loop? Note that because of the direct-connect rule for domestic addresses, v2ray's outbound destinations can be the remote node's address plus any domestic IP, which makes the routing rules hard to write.
  3. What if I want domestic traffic splitting without helper software like v2ray?




  1. Packets arriving at the host first pass through the PREROUTING chain for transformation/filtering; only then does the routing table decide whether they are delivered locally or forwarded onward.
  2. Packets to be forwarded pass through the FORWARD chain.
  3. Packets destined for local processes then pass through the INPUT chain.
  4. When a local process is about to send a packet, a preliminary route is first chosen from the routing table; the constructed IP packet then passes through the OUTPUT chain.
  5. The outputs of FORWARD and OUTPUT then merge, and Linux consults the routing table again to determine the final outbound route. For locally generated packets, if the final route differs from the one determined before OUTPUT, the new route wins (but the source IP chosen according to the original route is not updated — another pitfall).
  6. After routing completes, all packets pass through the POSTROUTING chain one more time before actually leaving.

As you can see, if we want to filter the packets emitted by the proxy application, the logic belongs on the OUTPUT chain. The way iptables cooperates with ip rule is to have iptables tag packets with a specific fwmark, and then match on that fwmark in ip rule to split the traffic. So we have (I chose fwmark 23333 for packets that should go direct; any number works):

iptables -t mangle -A OUTPUT <condition> -j MARK --set-mark 23333
ip rule add fwmark 23333 lookup default

Here -t mangle places the rule on the mangle table's OUTPUT chain. iptables rules are organized into several tables (each with its own copy of the chains) with different duties: the nat table handles address translation, the filter table handles packet filtering, and the mangle table used here handles transformations such as tagging packets.




iptables -t mangle -A OUTPUT -m owner --uid-owner nobody -j MARK --set-mark 23333





# cn4 is the name of this IP set; hash:net is the underlying data structure.
# hash:net creates one hash table per distinct prefix length, so lookup time
# grows linearly with the number of distinct prefix lengths — a good fit for
# our case (IPv4 only has 32 possible prefix lengths anyway).
ipset create cn4 hash:net
ipset add cn4 <subnet>
ipset add cn4 <subnet>



iptables -t mangle -A OUTPUT -m set --match-set cn4 dst -j MARK --set-mark 23333

--match-set cn4 dst matches every packet whose destination address is in the set cn4.


My UROP lately involves generative models. As a complete newcomer to machine learning I knew next to nothing, only vaguely remembering that generative models seem to come in three major families: VAEs, GANs, and the recently popular diffusion models. GANs are the most intuitive of the three and don't demand much math if you only need an implementation, but VAEs and DMs require real mathematical background even for a first understanding. As someone who won't take his first probability course until next semester, I found the reading heavy going. After a week of grinding through papers I finally got the two models more or less figured out. This post records my understanding of VAEs, mainly following Tutorial on Variational Autoencoders.



VAE (like GAN) rests on one assumption: if the complexity of the distribution over sample space comes from intricate correlations between dimensions, then the number of truly meaningful dimensions should be smaller than the nominal dimensionality of the sample space. Complex, high-dimensional samples can therefore be viewed as determined by relatively simple, low-dimensional latent variables. We may as well let these latent variables \boldsymbol z follow a standard normal distribution. Generating the latents then becomes trivial, and the problem turns into finding a mapping from latents to target samples, \boldsymbol x=f(\boldsymbol z). More generally, f need not be a deterministic process, so what we are really after is P(\boldsymbol x|\boldsymbol z). If we assume P(\boldsymbol x|\boldsymbol z) has a fixed form (say, normal), we can equivalently say we are looking for a function p(\boldsymbol z) that maps \boldsymbol z to the parameters of the distribution P(\boldsymbol x|\boldsymbol z). In a machine-learning setting the form of p is fixed, and what we actually search for, or learn, is a set of parameters \boldsymbol \theta. So what makes a p_{\boldsymbol \theta} good? We want our model to generate data resembling the training set, so mathematically, for data \boldsymbol x in the training set, we want to maximize our model's likelihood, P(\boldsymbol x|\boldsymbol \theta).


So how do we compute P(\boldsymbol x|\boldsymbol \theta)? By definition it can be viewed as a marginalization. Let p_{\boldsymbol \theta, \boldsymbol z}(\boldsymbol x) denote the PDF of the distribution given by p_{\boldsymbol \theta}(\boldsymbol z); then

P(\boldsymbol x | \boldsymbol \theta)=\int P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta)P(\boldsymbol z)\,d\boldsymbol z = \mathbb{E}_{\boldsymbol z}\bigl[P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta)\bigr] = \mathbb{E}_{\boldsymbol z}\bigl[p_{\boldsymbol \theta, \boldsymbol z}(\boldsymbol x)\bigr].
(Here we assume \boldsymbol z is independent of our parameters, since, as mentioned above, we can treat it as sampled from the standard normal distribution.)

Unfortunately, the identity above is correct but useless. Note that \boldsymbol x and \boldsymbol z are tightly coupled, so P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta) is sparse. If, in practice, we estimated \mathbb{E}_{\boldsymbol z}\bigl[P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta)\bigr] by random sampling, the sampled \boldsymbol z would most likely contribute nothing meaningful to the estimate, making it completely unreliable.

To solve this, we must ensure that the \boldsymbol z we sample are representative, which means we can no longer sample blindly from the standard normal distribution. What should we do instead?




  • An event that happens with certainty is not surprising at all, so P(X)=1\rightarrow s(X)=0.
  • The less probable an outcome, the more surprising it is, so s(X) is decreasing in P(X).
  • For the outcomes of two independent events, the total surprise should be the sum of the two, i.e., s(X\cap Y)=s(X)+s(Y).

Shannon showed that the only s(X) satisfying these three constraints is s(X)=-\log P(X) (up to a positive constant factor), which matches our intuition. Note also that "surprise" is, in a sense, information: an utterly unsurprising event conveys no information, while the rarer an event, the more information it carries. Shannon therefore defined the information content of a random event as the expected information/surprise over all its outcomes, i.e.,

\int -P(X)\log P(X)\,dX.
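For a discrete distribution the integral becomes a sum, and a quick numeric check (my own illustration, not from the original text) confirms the intuition — a fair coin is maximally surprising and carries exactly 1 bit of entropy, while a biased coin is more predictable:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(P) = -sum P(x) log P(x) of a discrete distribution."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair = entropy([0.5, 0.5])    # a fair coin: 1 bit per toss
biased = entropy([0.9, 0.1])  # a biased coin: lower entropy
```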


Suppose someone believes theory Q. To that person, the information content of an event X is -\log Q(X). Objectively, events occur according to the distribution described by P, so the average information this person receives per event is

\mathbb E_{X\sim P}\bigl[-\log Q(x)\bigr]=\int -P(X)\log Q(X)\,dX.
The difference between the above and the event's actual entropy can be viewed as a kind of "distance between theory and reality," called the relative entropy. Its more common name is the Kullback-Leibler divergence:
\int -P(X)\log Q(X)\,dX-\int -P(X)\log P(X)\,dX = \int -P(X)\log \left(\frac{Q(X)}{P(X)}\right)\,dX = D_{\text{KL}}(P\|Q).
The paragraph above hides an assumption: any entropy computed under Q is never smaller than the event's actual entropy — otherwise this difference would not make sense as a distance. Is that really true? In other words, is the KL divergence non-negative in all cases? Note that \log y\le y-1 \Rightarrow -\log y\ge 1-y:
D_{\text{KL}}(P\|Q) &= \int -P(X)\log \left(\frac{Q(X)}{P(X)}\right)\,dX\\
&\ge\int -P(X)\left(\frac{Q(X)}{P(X)}-1\right)\,dX\\
&=\int P(X)\,dX-\int Q(X)\,dX\\
&=1-1=0.
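The non-negativity (and the asymmetry discussed below) is easy to verify numerically for discrete distributions. A small sketch of my own, with made-up distributions p and q:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(x) log(P(x)/Q(x)) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.5, 0.3]
q = [0.1, 0.6, 0.3]
# Non-negative, zero iff P == Q, and asymmetric in general:
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```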

Before reading about VAEs I had seen the entropy and KL divergence formulas more than once. I never dug into them, figuring I would just take them on faith — they aren't that hard to memorize anyway. Reading about VAEs, though, I finally felt uneasy about trusting the formulas blindly, so I went looking for explanations and found that, if one doesn't insist on full rigor, both are quite understandable; hence this note. This explanation also shows why the KL divergence is asymmetric: P and Q simply play different roles in it.


To sample \boldsymbol z more accurately, we consider sampling from a distribution Q that yields "representative \boldsymbol z." "Representative" can be understood as Q approximating P(\boldsymbol z|\boldsymbol x, \boldsymbol \theta); in other words, the KL divergence between the two should be as small as possible. So consider D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr):

D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) = \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr]
We can expand P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta) with Bayes' rule:
D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) &= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr] \\
&= \mathbb E_{\boldsymbol z\sim Q}\!\left[\log Q(\boldsymbol z)-\log \frac{P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)P(\boldsymbol z)}{P(\boldsymbol x|\boldsymbol \theta)}\right] \\
&= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)-\log P(\boldsymbol z)+\log {P(\boldsymbol x|\boldsymbol \theta)}\bigr].
Since the term \log {P(\boldsymbol x|\boldsymbol \theta)} does not depend on \boldsymbol z, it can be pulled out of the expectation; splitting the remaining terms,
D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) &= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)-\log P(\boldsymbol z)+\log {P(\boldsymbol x|\boldsymbol \theta)}\bigr] \\
&= \log P(\boldsymbol x|\boldsymbol \theta) + \underbrace{\mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol z)\bigr]}_{D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr)} - \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr].
Rearranging and using the non-negativity of the KL divergence, we obtain a lower bound on the log likelihood:
\log P(\boldsymbol x|\boldsymbol \theta) &= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] - D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr) + D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) \\
&\ge \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] - D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr).
That is, as long as D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) is small enough, i.e., Q approximates P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta) well enough, optimizing \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] - D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr) is approximately equivalent to optimizing \log P(\boldsymbol x|\boldsymbol \theta), hence to optimizing P(\boldsymbol x|\boldsymbol \theta). This bound is called the variational lower bound and is the VAE's loss. The "variational" refers to our use of Bayes' rule above — that approach is apparently known as the variational Bayesian method.

One problem remains: we don't know what such a Q actually is, only that it should depend on \boldsymbol x. So we resort to ML's time-honored trick: when in doubt, slap a neural network on it. Assume Q(\boldsymbol z)=\mathcal N\bigl(\boldsymbol\mu(\boldsymbol x),\operatorname{diag}\boldsymbol\sigma^2(\boldsymbol x)\bigr), and let a neural network produce \boldsymbol \mu and \boldsymbol \sigma! Note that when Q(\boldsymbol z) and P(\boldsymbol z) are both normal (and we assume P(\boldsymbol z)=\mathcal N(0, I)), their KL divergence has a closed form:

D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr)=\frac{1}{2}\left(\sum \boldsymbol\sigma^2(\boldsymbol x) + \bigl\|\boldsymbol \mu(\boldsymbol x)\bigr\|^2-\log \prod \boldsymbol \sigma^2(\boldsymbol x)-\dim \boldsymbol z\right)
This is a function of \boldsymbol\mu and \boldsymbol \sigma, so we can optimize \boldsymbol \mu and \boldsymbol \sigma alongside \boldsymbol \theta.

One technical problem remains: computing the objective requires sampling from Q, but sampling is a discrete operation that breaks the backpropagation of gradients, so Q's parameters would never be updated correctly. The solution is known as the reparameterization trick: sample from the standard normal distribution instead, then linearly scale and shift the sample into Q's distribution. Now \boldsymbol \mu and \boldsymbol \sigma receive gradients. Mathematically, this replaces \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] in the loss with \mathbb E_{\boldsymbol \epsilon\sim \mathcal N(0,I)}\bigl[\log P(\boldsymbol x|\boldsymbol z=\boldsymbol \sigma (\boldsymbol x)\odot \boldsymbol\epsilon +\boldsymbol \mu(\boldsymbol x),\boldsymbol \theta)\bigr].
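Stripped of tensors and autograd, the trick amounts to the following sketch (my own illustration of the idea only; in a real VAE these would be framework tensor ops so that gradients flow through mu and sigma):

```python
import random

def sample_q(mu, sigma):
    """Reparameterized sample from N(mu, diag(sigma^2)):
    z = sigma ⊙ eps + mu with eps ~ N(0, I). The randomness is isolated
    in eps, so mu and sigma enter only through differentiable arithmetic."""
    eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [s * e + m for m, s, e in zip(mu, sigma, eps)]

z = sample_q(mu=[1.0, -2.0], sigma=[0.5, 0.1])
```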



VAE's mathematical theory is elegant, but it has one limitation: it can only model samples of a fixed dimensionality. Modeling and generating variable-length sequences is beyond the original VAE. To address this, Chung et al. published the Variational Recurrent Neural Network architecture in 2015. The paper is interesting, but at bottom, a VRNN embeds a VAE into every step's RNN cell, while conditioning all of our earlier derivations on the previous cell's hidden state.

Assume two relatively simple networks \varphi_{\boldsymbol x}(\boldsymbol x) and \varphi_{\boldsymbol z}(\boldsymbol z) perform a preliminary feature extraction on the input and the latent variable. Writing the RNN's previous hidden state as \boldsymbol h_{t-1}, the network's own inference is:

\boldsymbol h_t = f\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_{t}),\varphi_{\boldsymbol z}(\boldsymbol z_{t}),\boldsymbol h_{t-1}\bigr).
Besides the most naive RNN cell, f can also be a variant such as LSTM or GRU; for LSTM, \boldsymbol h collectively denotes the hidden state and the cell state.

"Embedding a VAE" means that each step has three networks in addition to f: [\boldsymbol \mu_{\boldsymbol z, t}, \boldsymbol \sigma_{\boldsymbol z, t}]=q\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_t),\boldsymbol h_{t-1}\bigr) is the Q network from before, the "encoder"; [\boldsymbol \mu_{\boldsymbol x, t}, \boldsymbol \sigma_{\boldsymbol x, t}]=p\bigl(\varphi_{\boldsymbol z}(\boldsymbol z_t),\boldsymbol h_{t-1}\bigr) is the P network, the "decoder"; and [\boldsymbol \mu_{0, t}, \boldsymbol \sigma_{0, t}]=\varphi_{\text{prior}}(\boldsymbol h_{t-1}) gives the prior distribution of \boldsymbol z — we can no longer assume \boldsymbol z can always be sampled from the standard normal distribution. As you can see, the only difference from before is that all three networks additionally take the previous hidden state as a parameter. The VRNN's loss is the sum of the per-step VAE losses:

\sum_{t=1}^T \biggl[\mathbb E_{\boldsymbol z \sim q\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_t),\boldsymbol h_{t-1}\bigr)}\!\Bigl[\log P\Bigl(\boldsymbol x_t\Big |\boldsymbol x_{t}\sim p\bigl(\varphi_{\boldsymbol z}(\boldsymbol z_t),\boldsymbol h_{t-1}\bigr)\Bigr)\Bigr] - D_{\text{KL}}\Bigl(q\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_t),\boldsymbol h_{t-1}\bigr)\|\varphi_{\text{prior}}(\boldsymbol h_{t-1})\Bigr)\biggr].

A few months ago, when I switched to Ubuntu, I found an amusing bug: at my screen resolution the app grid's layout is unstable! When the grid is brought up, it is initialized as 3\times 8, but once the mouse hovers over it or you swipe left and right, another code path decides it should be 4\times 6. As a result, the whole grid always "hitches" while browsing; the transition is jarring and ugly. After some digging, this appears to be a problem in the gnome-shell package, but I couldn't find a matching bug report online — maybe my situation is unusual? My fix is quick and dirty. The app grid's layouts are defined in /js/ui/iconGrid.js:

const defaultGridModes = [
    {
        rows: 8,
        columns: 3,
    },
    {
        rows: 6,
        columns: 4,
    },
    {
        rows: 4,
        columns: 6,
    },
    {
        rows: 3,
        columns: 8,
    },
];

Deleting the last object, the one defining the 3\times 8 layout, fixes the problem at its root!



# Delete every file except this script itself
find . \! -path './' -delete
# Download the gnome-shell source code
apt source gnome-shell
# cd into the source root; the directory name contains the ever-changing
# version number, so it has to be located dynamically
cd $(find . -maxdepth 1 -path "./gnome-shell-*" -type d -print)
# Make the 3x8 layout identical to the 4x6 one (deleting it also works,
# but with sed this edit is simpler)
sed -i -e "s/rows: 3/rows: 4/" -e "s/columns: 8/columns: 6/" js/ui/iconGrid.js
# Add a changelog entry; this asks for one manual press of enter
dch -i "Remove 3x8 grid"
# Build and install
dpkg-buildpackage -rfakeroot -b
sudo apt install ../gnome-shell*.deb


The main takeaway from writing this script was getting familiar with the find and sed commands; baking the Debian package build commands into a script also saves me from searching the relevant wikis every time I patch.

Recently, a personal project got me digging deep into TeX. Although The TeXbook and TeX by Topic both describe TeX's mechanisms in great detail, reading them still left me seeing flowers through fog: knowing what, but not why. To really understand a program's behavior, nothing beats reading its source code, and fortunately Knuth published TeX's source from the very beginning, with superb annotations. Still, forty-year-old vintage code is no easy read, so I decided to start this post. It follows the chapter order of TeX: The Program, briefly summarizing each chapter while skipping the details, mainly to jog my own memory. And since it's on my blog, I hope these notes also help fellow readers curious about TeX's internals.

This post assumes solid familiarity with TeX's basic mechanisms (read: you have been through The TeXbook). In other words, I won't spend much time on topics such as how to define a macro, the difference between \edef and \def, what horizontal mode is, or what glue is. Knuth spent an entire book explaining these concepts; you expect a dabbler like me to cover them in one blog post? Not a chance.


TeX is written in the WEB language. WEB follows Knuth's paradigm of "literate programming," interleaving documentation and code; you could see it as a forerunner of the Jupyter notebooks common in academia today (one big difference being that a notebook nowadays rarely exceeds 20 cells, while TeX has over 1300...). The documentation parts are based on TeX, the code on a Pascal-like language ("-like" because Pascal's standardization was not yet finished at the time). From a single tex.web source file, tangle extracts the code to produce a Pascal source for building the program itself, and weave extracts the documentation to produce a TeX source that compiles into the book TeX: The Program. Below I use notation like TTP100 to refer to a section of TeX: The Program.


Viewed coarsely, TeX's source code is very modular; in practice, not quite. A macro defined in a later chapter may just be an alias of some macro from earlier chapters, so modifying one module can have unexpected consequences. Or, although some reusable functions are abstracted out, the shape of their interfaces heavily constrains the implementation, so flexibility is lost in practice. Modular design necessarily involves abstraction, and abstraction always carries a cost. In TeX's development era, system resources were scarce, language standards were missing, and infrastructure of all kinds was immature, so efficiency and portability came first; the design is understandable — it's just hard on latecomers like me, spoiled by modern development paradigms.


TeX implements its own highly distinctive scheme for dynamically storing strings. In short, TeX has one huge statically allocated char array str_pool, plus an array of increasing indices str_start. Every string gets a number: string s occupies the str_pool indices from str_start[s] up to str_start[s + 1]. The following C pseudocode converts a TeX string to a C-style string:

char *texstr_to_cstr(uint16_t s) {
    size_t len = str_start[s + 1] - str_start[s];
    char *ret = malloc(len + 1);
    memcpy(ret, str_pool + str_start[s], len);
    ret[len] = 0;
    return ret;
}

uint16_t cstr_to_texstr(char *s) {
    size_t len = strlen(s);
    assert(str_start[str_ptr] == pool_ptr);
    memcpy(str_pool + str_start[str_ptr], s, len);
    str_start[++str_ptr] = pool_ptr += len;
    return str_ptr - 1;
}

Moreover, all string literals in TeX's source code are extracted at WEB tangle time into a compiled string pool that seeds str_pool, leaving only the corresponding string numbers in place. So the string literals in the source are actually of integer type.


Again, blame the era: when TeX appeared, IEEE 754 did not yet exist, and floating-point arithmetic followed different conventions on every machine and compiler. To sidestep that uncertainty, TeX bases most of its computation on its own fixed-point numbers, which Knuth calls scaled numbers. A scaled number is 4 bytes with 2^{-16} as the smallest unit; in other words, reading a scaled number directly as an int32_t yields 2^{16} times the value it represents. In addition, platforms of the day disagreed on integer division and modulo of negative operands, so apart from a few non-critical algorithms, TeX checks the signs of the operands before every division and modulo.
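To make the representation concrete, here is a small conversion sketch (my own illustration, not Knuth's code; the name UNITY is borrowed from TeX's constant for 1.0):

```python
UNITY = 1 << 16  # 2^16, i.e. 1.0 as a scaled number

def to_scaled(x: float) -> int:
    """Represent x as a TeX scaled number (nearest multiple of 2^-16)."""
    return round(x * UNITY)

def from_scaled(s: int) -> float:
    """Read a raw integer back as the value it represents."""
    return s / UNITY

half = to_scaled(0.5)    # 32768: reading the raw int gives 2^16 times 0.5
one = to_scaled(1.0)     # 65536
```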

TeX's fixed-point routines are full of jaw-dropping tricks. For example, the following function approximates 100(t/s)^3 (the comments are mine; the original does not explain where the magic numbers 7230584 and 1663497 come from).

typedef int32_t scaled;
const int16_t inf_bad = 10000;
int16_t badness(scaled t, scaled s) {
    if (t == 0) return 0;
    if (s <= 0) return inf_bad;
    int r; // approximately 2⁶ * ³√100 * t / s
    // 297³ ≈ 100 * 2¹⁸
    // if t > 7230584 then t * 297 >= 2147483647 would overflow
    if (t <= 7230584) r = (t * 297) / s;
    // here t >= 7230585; if s / 297 < 1663497 / 297 = 5601, then
    // t / (s / 297) >= 1291 and (see below) we would surely return inf_bad.
    // Division is slow, so skip it when the exact result doesn't matter.
    else if (s >= 1663497) r = t / (s / 297);
    else r = t;
    // if r > 1290 then r * r * r + (1 << 17) would overflow
    if (r > 1290) return inf_bad;
    return (r * r * r + (1 << 17)) >> 18;
}
As you can see, the code scrupulously examines every possibility of overflow, uses integer arithmetic throughout to dodge floating-point indeterminacy, and the s >= 1663497 condition even avoids a costly division in certain cases — very much Knuth's style.

TeX's internal unit of length is the pt, stored, naturally, as a scaled number. 1 inch is 72.27 pt, so the smallest unit of any TeX length is \frac1{72.27}\cdot2^{-16}=\frac{1}{4736286.72} inch.


typedef float glue_ratio;



union memory_word { // assuming no memory alignment padding
    int32_t int_;
    scaled sc;
    glue_ratio gr;
    // h: halfword, q: quarterword
    struct { int16_t lh; int16_t rh; } hh;
    struct { int8_t b0; int8_t b1; int16_t rh; } qqh;
    struct { int8_t b0; int8_t b1; int8_t b2; int8_t b3; } qqqq;
};


typedef uint16_t pointer;

mem is divided into two halves. The high-address region grows downward and stores single-word data structures; the low-address region grows upward (usually 1000 words at a time) and stores data structures of at least two words. Their boundaries are marked by the globals hi_mem_min and lo_mem_max respectively; if the two overlap, memory is declared exhausted.


#define info(p) mem[p].hh.lh
#define link(p) mem[p].hh.rh

The global avail always points to the head of the list, so avail, link(avail), link(link(avail)), and so on enumerate all addresses awaiting allocation. Allocation pops from the head, freeing pushes back at the head, and when nothing is free, avail = --hi_mem_min.

The low-address region is maintained with a doubly linked list, which is fairly standard practice, covered in TAOCP, so I won't elaborate (original: TTP124). One interesting point: the maintenance of the low-address region does not track the sizes of allocated structures; the size is instead a parameter of the freeing function free_node(ptr, size).


As everyone knows, everything in TeX eventually becomes a box. Internally, TeX distinguishes the different kinds of boxes with a pair of type tags:

#define type(x) mem[x].qqh.b0
#define subtype(x) mem[x].qqh.b1

That information already takes up half a word. Noting that a box lives in exactly one list at any moment (Rust's ownership mechanism comes to mind), Knuth cleverly uses the other half word to store the box's successor: link(x) is the address of the box to be processed after box x. Linked lists really are played to the hilt in TeX.


Not quite: Knuth considered the most common box of all in TeX — the char box representing a single character. He realized that if type(x) stores a char box's font number and subtype(x) its character code, then a char box fits into a single word!

#define font type
#define character subtype

Since type has been repurposed, how do we tell whether a box is a char box or something else? Note that char boxes are always allocated in the high-address region, while all more complex boxes live in the low-address region, so it suffices to check whether the pointer is >= hi_mem_min.



struct hbox_node { // sizeof(hbox_node) == 7
    memory_word meta;    // type tag = 0
    scaled width;
    scaled depth;
    scaled height;
    scaled shift_amount; // drop for an hbox, rightward shift for a vbox
    uint8_t glue_sign;   // stretch/shrink state of all glue in the box
    uint8_t glue_order;  // order of the glue — related to fil, fill, filll; not yet clear to me
    pointer list;        // the contents of the box
    glue_ratio glue_set; // stretch/shrink ratio of all glue in the box
};
enum { // values of glue_sign
    NORMAL = 0,
    STRETCHING = 1,
    SHRINKING = 2,
};

One more box type closely resembles hbox and vbox: the unset box, a box inside an \halign or \valign environment whose size is not yet determined but which will definitely be aligned with other boxes (see TTP159).

Besides char boxes and the hbox/vbox above, TeX has many more box types, but the layout ideas are largely the same, so I'll skip them too (see around TTP140). Lastly, the glue box: to save memory, TeX's glue box is essentially a reference-counted pointer to a separately allocated glue specification.

struct glue_node { // sizeof(glue_node) == 2
    memory_word meta;   // type tag = 10; the subtype is interesting, see TTP149
    pointer glue_ptr;   // points to the glue specification
    pointer leader_ptr; // if this glue is a leader, points to the leader's contents
};
struct glue_spec { // sizeof(glue_spec) == 4
    uint8_t stretch_order; // order of the stretch
    uint8_t shrink_order;
    uint16_t ref_count;    // reference count
    scaled width;
    scaled stretch;
    scaled shrink;
};
enum { // values of xxx_order
    // NORMAL = 0, already defined above
    FIL = 1,
    FILL = 2,
    FILLL = 3,
};


TeX's curious command codes

TeX assigns each class of primitives a fixed command code. I won't go into the details; see TTP207.

Personally, I think centrally and explicitly assigning every primitive an integer command code hurts the system's extensibility — even if primitives are internally compared and matched via integer command codes, the allocation of those codes should be done by the program. Of course, that's easy for me to say. Whether these command codes are good in isolation is hard to judge; let me revisit after reading the later chapters.


The TeXbook mentions that TeX has six modes in total: (restricted) horizontal, (internal) vertical, and (inline) math. In the program, these six modes correspond to six integer constants. The constants for horizontal, vertical, and display math are positive, and each restricted counterpart's constant is the negation of the former — so taking the absolute value reveals a mode's basic kind. Moreover, the three mode codes are not simply 1, 2, 3 but are spaced max_command apart (the largest command code from the previous section). This way, the sum of a mode code and a command code maps each (mode, primitive) pair one-to-one to a distinct integer, which is put to great use in TeX's "big switch."


struct list_state_record {
    int16_t mode;    // not sure about int16_t: the original Pascal specifies types by value range, which C cannot
    pointer head;    // head of the list this mode is building
    pointer tail;    // tail of the list this mode is building
    int prev_graf;   // corresponds to \prevgraf, the number of lines of the current paragraph already broken
    int mode_line;   // the source line on which the current mode began
    memory_word aux; // auxiliary information
};


We finally arrive at a giant data structure of central importance to TeX — the eqtb table. eqtb is short for "equivalents table"; it stores the meaning of every symbol (control sequences, token registers, counters, and so on). eqtb is one huge array divided into six parts, each occupying a contiguous index range; from low to high:

  1. definitions of active characters (cat code 13);
  2. definitions of multi-letter control sequences (underpinned by a hash table);
  3. glue definitions, e.g., \baselineskip;
  4. local variables, including catcodes, box registers, fonts, etc.;
  5. integer parameters, e.g., the hyphenation penalty;
  6. dimensions, e.g., \hsize.


struct equivalent { // same layout as memory_word
    int8_t type;  // kind of definition — a macro? a primitive? a counter? a token list?
    int8_t level; // nesting depth of TeX's local scopes at definition time, based on these constants:
    // level_zero means undefined, level_one means global, level_one + 1 means one group below global, and so on
    int16_t equiv; // the definition; may be a pointer or, say, a font number, depending on type
};

其中equivtype的意义都不难理解,但是level字段却出乎我的意料。按照当下的套路,如果需要存储或解析一个符号在多重嵌套作用域中的绑定,常见的方法是存一个帧栈并在解析符号时自顶向下查找。虽然这种思路很符合直觉,其弊端是需要在栈中为每一个作用域存储一个符号表。这在一切数据结构都要造轮子的时代显然增加了代码的复杂性。TeX的level字段则更像是把这些表拍扁成一块,在一定程度上节约了内存。但是现在想来,这样的设计无法独立存储或引用一个局部作用域——这对于lexical scoping是必须的。TeX的dynamic scoping究竟是喂屎还是特性一直有争论,在语言设计层面也有很多种解释。现在看来,除了语言设计的考量,内部的代码实现也不由得TeX不搞dynamic scoping。




eqtb is an array indexed by integers in a certain range; a control sequence is a string. Mapping strings into an integer interval naturally calls for a hash table. TeX's hash table implementation is based on the "coalescing list" algorithm from TAOCP, with the following auxiliary data structure:

struct {
    uint16_t text; // number of the stored string, or 0 if this slot is empty
    uint16_t next; // next entry of the chain, or 0 if this slot is empty or ends a chain
} hash[/* indices from hash_base to undefined_control_sequence - 1 */];
uint16_t hash_used; // a decreasing pointer: every slot with index >= hash_used is occupied

Making each hash entry a linked list is routine, but a coalescing list embeds the lists themselves inside the hash table, saving yet more memory — which is quite interesting. Lookup and insertion in this hash table proceed as follows:

uint16_t query_or_insert(char *s, bool allow_insert) {
    uint16_t p = compute_hash(s) + hash_base;
    size_t len = strlen(s);
    while (true) {
        if (hash[p].text > 0
            && length(hash[p].text) == len
            && str_eq_buf(hash[p].text, s)) // found: return directly
            return p;
        if (!hash[p].next) { // reached the end of the chain without a match
            if (!allow_insert)
                return undefined_control_sequence;
            // insert s into the table, chained after p:
            // first find an unoccupied slot, recorded in hash_used
            do { hash_used--; } while (hash[hash_used].text);
            // make hash_used the successor of p
            hash[p].next = hash_used;
            p = hash_used;
            hash[p].text = cstr_to_texstr(s);
            return p;
        }
        p = hash[p].next;
    }
}

uint16_t compute_hash(char *s) {
    uint16_t h = *s++;
    while (*s) {
        h = ((h << 1) + (*s++)) % HASH_PRIME;
    }
    return h;
}





struct {
    uint8_t type;
    uint8_t level;
    uint16_t index;
} save_stack[...];
uint16_t save_ptr; // save_stack[save_ptr - 1] is the top of the stack
enum { // values of type
    RESTORE_OLD_VALUE = 0,
    RESTORE_ZERO = 1,
    INSERT_TOKEN = 2,
    LEVEL_BOUNDARY = 3,
};


  1. RESTORE_OLD_VALUE covers an inner scope assigning over an object already defined outside. index is the eqtb index of the overwritten object, and during unwinding, the stack element just below this one is treated as the overwritten original value — note that eqtb entries and save_stack elements share the memory_word layout, hence the odd business of pushing eqtb contents onto the stack.
  2. RESTORE_ZERO covers an inner scope assigning to an object undefined outside. index is the eqtb index of the locally created object; on unwinding, eqtb[index] is reset to the same contents as eqtb[undefined_control_sequence] (in plain words, back to the undefined state).
  3. INSERT_TOKEN covers appending a token after a grouping ends — i.e., \aftergroup. index is the token to append (though I haven't covered it yet, TeX encodes a token in a 16-bit integer).
  4. LEVEL_BOUNDARY, as the name suggests, marks the "boundary" of each level. Note that a scope pushes an unpredictable number of entries onto save_stack; how do we unwind only the modifications of the scope just exited? That is this type's job. A LEVEL_BOUNDARY is pushed first upon entering a scope and serves as the stop marker while entries are popped one by one on leaving — in other words, everything above the LEVEL_BOUNDARY is the current scope's set of modifications. A LEVEL_BOUNDARY element's level field records the enclosing scope's "type," and its next field holds the stack index of the enclosing scope's LEVEL_BOUNDARY.
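The save/restore discipline above can be modeled in a few lines of Python. This is my own greatly simplified toy (no group types, no \global, no \aftergroup, no reference counting), just to show how boundary markers let one stack serve arbitrarily nested scopes:

```python
# Toy model of TeX's save_stack: entering a scope pushes a boundary,
# assignments push the old binding, and leaving a scope unwinds to the boundary.
LEVEL_BOUNDARY = "level_boundary"

class Eqtb:
    def __init__(self):
        self.table = {}       # symbol -> current meaning (absent = undefined)
        self.save_stack = []  # (tag, symbol, saved old value)

    def push_scope(self):
        self.save_stack.append((LEVEL_BOUNDARY, None, None))

    def assign(self, sym, value):
        # Save the old binding (or the fact that there was none) before overwriting.
        tag = "restore_old_value" if sym in self.table else "restore_zero"
        self.save_stack.append((tag, sym, self.table.get(sym)))
        self.table[sym] = value

    def pop_scope(self):
        while self.save_stack:
            tag, sym, old = self.save_stack.pop()
            if tag == LEVEL_BOUNDARY:
                return
            if tag == "restore_old_value":
                self.table[sym] = old
            else:  # restore_zero: back to the undefined state
                del self.table[sym]
```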

What is a scope type? Beyond the simplest {..}, many syntactic constructs create scopes, e.g., \hbox and \begingroup/\endgroup. Different kinds of scopes terminate on different conditions and require different extra work after exiting, so the kind must be remembered. TeX keeps the current scope's group code in the global variable cur_group; the group codes of the enclosing levels live, as described above, in the LEVEL_BOUNDARY nodes on save_stack, forming a de facto linked list.

Here comes another stunt: if the scope-creating command takes an argument, that argument is also pushed onto the stack, sitting just below the new scope's LEVEL_BOUNDARY. For example, in \hbox to 100pt { ... }, the 100pt is pushed right before the LEVEL_BOUNDARY, so after unwinding, the saved argument is immediately at hand. Knuth, please stop shoving everything onto the stack.


Moreover, TeX has many internally reference-counted structures, such as token lists, glue specifications, and par shapes. These need extra handling on assignment and on unwinding — old values must be reclaimed when due — which makes the logic on save_stack slightly more involved.

TeX's curious token lists

The token list is another of TeX's core data structures. Before explaining token lists, we naturally need to see how TeX represents a token. TeX divides tokens into two kinds:

  1. A "character" token is the combination of a catcode and a single character, which TeX represents as 2^8 \cdot\text{catcode}+\text{char code}.
  2. A control-sequence token corresponds, in essence, to a position in the eqtb table: the control sequence at eqtb index p is represented as \left(2^{12}-1\right) + p, where \left(2^{12} - 1\right) is also called cs_token_flag.

Thus every token maps to a 16-bit positive integer. What follows is routine for TeX: stuff the token into a memory word, and the spare 16 bits point exactly to the next token's address. At bottom, a token list is yet another singly linked list.
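The encoding can be sketched directly from the constants in the text (my own illustrative code, not TeX's):

```python
CS_TOKEN_FLAG = (1 << 12) - 1  # 2^12 - 1 = 4095

def char_token(catcode: int, char: int) -> int:
    """A character token packs catcode and character code into one integer."""
    return (catcode << 8) + char

def cs_token(eqtb_index: int) -> int:
    """A control-sequence token is its eqtb index offset by cs_token_flag."""
    return CS_TOKEN_FLAG + eqtb_index

def is_cs_token(token: int) -> bool:
    return token > CS_TOKEN_FLAG

tok = char_token(11, ord("a"))  # catcode 11 (letter), character 'a'
```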

Besides tokens from source files, TeX has two special tokens: match and end_match, used when storing macro definitions. Remarkably, TeX stores a macro's parameter text and its replacement text in one and the same token list: end_match separates the parameter part from the body, while match stands for a parameter inside the parameter text (since parameter numbers in the parameter text always increase, storing match alone suffices — no parameter number needed). Confusing? Let's take an example and define a macro like:

\def\a#1#2

Picking up where the last post left off: using Python + Selenium, we freeloaded on 图档清理大师 ("Doc Cleaner Master"), a normally paid service, to excellent effect. Our code, however, still leaves plenty of room for optimization, which is what this post is about.














import base64
import io
from typing import Generator, Tuple

import requests
from PIL import Image

def clean_doc_requests(images: Generator[Tuple[bytes, str], None, None]) \
        -> Generator[Image.Image, None, None]:
    """
    Cleans the scanned document pages using docleaner's online service.

    :param images: A generator yielding (an image as raw bytes, its extension as string).
    """
    for (image, ext) in images:
        # noinspection HttpUrlsUsage
        req ="",
                            files={"file": (f"image.{ext}", image)})
        data = {
            # Weird typo in the API.
            "paramers": "降噪,去斑点,去黑边,去背景,自动纠斜",
            "type": "image",
            "storePath": req.json()["data"]["storePath"],
            "userId": ""
        }
        # noinspection HttpUrlsUsage
        req ="", data=data)
        result = base64.b64decode(req.json()["data"]["outFileStr"])
        yield






But Python has one great virtue — batteries included. A stroll through the official docs turned up the multiprocessing module in the standard library, which fully covers my needs here. Concretely, multiprocessing provides the process pool Pool: the usual

output_iterator = map(function, input_iterator)


from multiprocessing import Pool
with Pool(process_count) as p:
    output_iterator = p.imap(function, input_iterator)



def clean_single_page(args: Tuple[StrPath, int, int, bool, bool]) \
        -> Union[Image.Image, bytes]:
    """
    Cleans a single page.

    :param args: A tuple consisting of (in order):
        1. Path to the page (pdf or image),
        2. Index (image index or page index in PDF),
        3. DPI (-1 if an image is directly supplied),
        4. Whether to perform OCR,
        5. Whether to actually clean the page.
    :return: If OCR is enabled, an OCR-ed PDF in raw bytes, otherwise a PIL
        Image object representing the cleaned page.
    """
    page, idx, dpi, ocr, clean = args




def merge_to_pdf(pages: Iterable[Union[Image.Image, bytes]], output: StrPath):
    """
    Converts and merges images into one pdf file, performing optional
    OCR in the process.

    :param pages: A generator yielding PIL image objects (or, when OCR is on,
        raw bytes of single-page PDFs).
    :param output: Path to the result pdf.
    """
    doc = fitz.Document()
    for page in pages:
        if isinstance(page, Image.Image):
            # noinspection PyUnresolvedReferences
            doc_page = doc.new_page(width=page.width, height=page.height)
            buffer = io.BytesIO()
  , format="jpeg")
            doc_page.insert_image(fitz.Rect(0, 0, page.width, page.height),
                                  stream=buffer.getvalue())
        else:
            page = fitz.Document(stream=page, filetype="pdf")
            doc.insert_pdf(page)


@click.argument("input", type=click.Path())
@click.argument("output", type=click.Path())
@click.option("-d", "--dpi", default=300, help="DPI for rasterization.")
@click.option("--first-page", type=int, help="First page to convert/clean.")
@click.option("--last-page", type=int, help="Last page to convert/clean.")
@click.option("--ocr/--no-ocr", default=True,
              help="Whether to perform OCR during the conversion.")
@click.option("--clean/--dont-clean", default=True,
              help="Whether to clean pdf using docleaner's online service.")
def main(input: str, output: str, dpi: int,
         first_page: Optional[int], last_page: Optional[int], ocr: bool,
         clean: bool):
    if os.path.splitext(input)[1].lower() == ".pdf":
        # PDF mode
        assert os.path.exists(input)
        page_count = fitz.Document(input).page_count
        first_page = 0 if first_page is None else first_page - 1
        last_page = page_count if last_page is None else last_page
        args = zip(repeat(input), range(first_page, last_page),
                   repeat(dpi), repeat(ocr), repeat(clean))
    else:
        # Glob mode
        files = sorted(glob.glob(input, recursive=True))
        first_page = 0 if first_page is None else first_page - 1
        last_page = len(files) if last_page is None else last_page
        args = zip(files[first_page:last_page], repeat(0), repeat(-1),
                   repeat(ocr), repeat(clean))
    total = last_page - first_page
    with Pool() as p:
        results = tqdm(p.imap(clean_single_page, args), total=total)
        if os.path.splitext(output)[1].lower() == ".pdf":
            merge_to_pdf(results, output)
        elif not os.path.exists(output) or os.path.isdir(output):
            if ocr:
                raise RuntimeError("the OCR flag is useless because we are "
                                   "writing images (not PDF) to the output "
                                   "directory.")
            if not os.path.exists(output):
                os.makedirs(output)
            for (index, page) in enumerate(results):
                file_path = os.path.join(output, f"{index}.jpg")
                assert isinstance(page, Image.Image)
        else:
            raise RuntimeError("invalid output format.")







  1. The real fix: use slightly more sophisticated logic to extract the images embedded in the PDF losslessly. The resulting file would then be roughly the same size as the original, possibly even smaller (considering that cleaning turns many regions into pure white).
  2. The workaround: use a better image compression algorithm.






  1. Reflowable formats typeset from the original content: usually EPUB, MOBI, or TXT. Dynamic layout, comfortable to read at any screen size; indexed, with text and images rendered locally, crisp and legible. This is the ideal ebook format, but most books in it are novels; I have almost never seen an academic textbook typeset this way.
  2. PDFs typeset from the original content: rendered directly from the book's source files. Fixed layout, requiring a large-screen reader (absent reflow software) for comfortable reading, plus some margin cropping. Usually indexed, with text and images rendered locally, crisp and legible. Such PDFs are a matter of luck. Some professors put books in this form on their homepages for students to download for free (for example, this one) — saints, I call them.
  3. Polished scanned PDFs: scanned from a physical book, with the text OCRed and contour-smoothed for clarity; the formulas may be rewritten and are clear too. Conscientious makers throw in a table of contents as well. These (sometimes DJVU) are fairly common — six or seven out of ten on libgen. My Introduction to Linear Algebra is of this kind.
  4. Bare-bones scanned PDFs: the maker did nothing beyond basic scanning. The text is skewed, the pages have shadows, the page edges show up as black borders, and some even carry pencil notes. No OCR, and don't even dream of a table of contents. These are the bottom tier of ebooks; with my mild OCD, they pain me.







  1. Convert the input PDF page by page into high-resolution images.
  2. Invoke the free online-trial feature programmatically, one way or another.
  3. Download the resulting images and stitch them back into a PDF.
  4. Optionally perform OCR along the way with pytesseract.






import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Optional, Generator, Union

import fitz

StrPath = Union[str, os.PathLike]

def convert_pdf_to_images(pdf: StrPath, fmt: str, dpi: int,
                          output: Optional[StrPath] = None,
                          first_page: Optional[int] = None,
                          last_page: Optional[int] = None) \
        -> Generator[StrPath, None, None]:
    """
    Converts a pdf file to images. This is a necessary pre-processing step
    because docleaner online only accepts images as inputs.

    :param pdf: The path to the pdf file.
    :param fmt: Image file format. jpg is the fastest but not lossless; png is
        lossless but slow; tiff is theoretically the best but occupies a lot of
        disk space.
    :param dpi: Pixel density of the output images.
    :param output: The output directory of intermediate images.
    :param first_page: First page to convert (starting from 1, inclusive).
    :param last_page: Last page to convert (starting from 1, inclusive).
    :return: A generator yielding the paths to the images.
    """
    doc = fitz.Document(pdf)

    @contextmanager
    def normal_dir(dir_path):
        Path(dir_path).mkdir(parents=True, exist_ok=True)
        yield dir_path

    matrix = fitz.Matrix(dpi / 72, dpi / 72)
    first_page = 0 if first_page is None else first_page - 1
    last_page = doc.page_count if last_page is None else last_page
    with tempfile.TemporaryDirectory() if output is None else normal_dir(
            output) as path:
        for i in range(first_page, last_page):
            filename = os.path.join(path, f"{i}.{fmt}")
            # noinspection PyUnresolvedReferences
            doc[i].get_pixmap(matrix=matrix).save(filename)
            yield filename
        if output is None:
            # Yield an empty string if we are using a temporary directory,
            # because without this, the temporary directory will be cleaned
            # up the moment the last filename is yielded, when the caller
            # hasn't done anything to the yielded temp file yet. Yielding an
            # empty string keeps the TemporaryDirectory object in memory
            # longer so the problem is solved.
            yield ""


  1. The code is written as a generator, which makes it easier to compose with the later stages. Personally I find generators a rather nice piece of syntactic sugar.
  2. The method signature is annotated with types to make static checking easier.
  3. To handle both cases -- writing images to a temporary directory and to a normal directory (useful for other tools) -- a context manager named normal_dir is defined to mirror the shape of TemporaryDirectory, which keeps the code shorter. The reason to use TemporaryDirectory() rather than mkdtemp() is that the former guarantees the temporary folder is eventually deleted.
  4. But TemporaryDirectory() deleting the folder on exit brings a problem: the moment the path of the last page's image is yielded, the temporary folder is deleted, so the just-returned path is already invalid, and the caller hits a file-not-found error the moment it tries to use it. A blunt fix is to make the with block last a little longer -- say, until the next call into the generator. That is what the final yield "" is for.
  5. PyMuPDF's get_pixmap method by default renders an image whose resolution equals the page's bound rectangle. That rectangle is measured in points (pts), with 72 points to an inch, so to reach a given DPI, get_pixmap must scale up by a factor of dpi / 72. Hence matrix = fitz.Matrix(dpi / 72, dpi / 72).
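The dpi / 72 arithmetic in point 5 is easy to sanity-check on its own. The helper below is a standalone sketch of that math, not part of the original script (the name raster_size is made up):

```python
# 72 points to an inch: a page bound measured in points maps to pixels by
# the same dpi / 72 zoom factor that is passed to fitz.Matrix.
def raster_size(width_pts: float, height_pts: float, dpi: int):
    zoom = dpi / 72
    return round(width_pts * zoom), round(height_pts * zoom)

# A US-letter page is 612 x 792 pts; at 300 DPI it rasterizes to 2550 x 3300.
print(raster_size(612, 792, 300))
```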






from selenium import webdriver

browser = webdriver.Chrome()
# browser = webdriver.Firefox()
# browser = webdriver.Edge()
# browser = webdriver.Safari()



  1. Wait for the page to finish loading (obviously), and
  2. turn on the background-removal and auto-deskew switches, which are off by default but very useful.


from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(browser, timeout).until(
    expected_conditions.visibility_of_element_located(
        (By.CLASS_NAME, "buttoncheckbox"))).click()


<div style="position: relative;">
    <button type="button" class="layui-btn layui-btn-primary buttoncss buttoncss0" style="padding: 0 10px;">
        <img src="">自动纠斜
    </button>
    <div class="buttoncheckbox">
        <input type="checkbox" name="exeConfig" lay-filter="exeConfig" value="自动纠斜" lay-skin="primary">
        <div class="layui-unselect layui-form-checkbox" lay-skin="primary"><i class="layui-icon layui-icon-ok"></i></div>
    </div>
</div>






uploader = WebDriverWait(browser, timeout).until(
    expected_conditions.presence_of_element_located(
        (By.CLASS_NAME, "layui-upload-file")))

Note that this one uses presence_of_element_located, whereas the earlier wait used visibility_of_element_located. The difference: the latter requires the element to actually be visible in the browser viewport, while the former only requires it to exist in the DOM. Earlier we needed to click the switches, so they had to be visible first (and visibility is a more reliable signal that the page has finished loading anyway), whereas the upload control is an <input> hidden in the page, so merely being present in the DOM is enough.




result = WebDriverWait(browser, timeout).until(
    expected_conditions.visibility_of_element_located(
        (By.ID, "dragImgRight")))


import io
import base64
from PIL import Image

result = result.get_attribute("src")
result = base64.b64decode(
    result.replace("data:image/jpg;base64,", ""))
result = Image.open(io.BytesIO(result))


browser.execute_script(
    "arguments[0].parentNode.classList.add('layui-hide');", result)



def clean_doc_online(images: Generator[StrPath, None, None], browser: str) \
        -> Generator[Image.Image, None, None]:
    """
    Cleans the scanned document pages using docleaner's online service.

    :param images: A generator yielding paths to document pages.
    :param browser: Browser type, can be "chrome", "firefox", "safari", or
        "edge". Requires the browser and its webdriver to be installed.
    """
    if browser == "chrome":
        browser = webdriver.Chrome()
    elif browser == "firefox":
        browser = webdriver.Firefox()
    elif browser == "safari":
        browser = webdriver.Safari()
    elif browser == "edge":
        browser = webdriver.Edge()
    else:
        raise RuntimeError("Unknown browser type")
    # Timeout for web driver waits. 10s is a reasonable value unless you have
    # a very high-res image / terrible network.
    timeout = 10
    # (The browser is assumed to have navigated to docleaner's page here.)

    # Turn on background removal and automatic deskewing. The locators are
    # reconstructed from the page markup quoted earlier; waiting for
    # visibility doubles as a check that the page has finished loading.
    WebDriverWait(browser, timeout).until(
        expected_conditions.visibility_of_element_located(
            (By.XPATH, "//input[@value='去背景']/.."))).click()
    WebDriverWait(browser, timeout).until(
        expected_conditions.visibility_of_element_located(
            (By.XPATH, "//input[@value='自动纠斜']/.."))).click()
    # Wait for a while to ensure the changes take effect.
    time.sleep(1)

    uploader = WebDriverWait(browser, timeout).until(
        expected_conditions.presence_of_element_located(
            (By.CLASS_NAME, "layui-upload-file")))

    image = next(images, "")
    while image:
        uploader.send_keys(os.path.abspath(image))
        # Fetching the next image here, rather than iterating with a plain
        # for loop, lets us convert the next pdf page while the browser &
        # remote server are processing the image just uploaded. Converting
        # a pdf page to an image is slow, so we here save a lot of time :)
        # next() falls back to "" on exhaustion, which matches the sentinel
        # convert_pdf_to_images yields for temporary directories -- see
        # there for the reason behind that weird value.
        next_image = next(images, "")
        # Wait for the result image to be visible.
        result = WebDriverWait(browser, timeout).until(
            expected_conditions.visibility_of_element_located(
                (By.ID, "dragImgRight")))
        # Hide the result image again so the wait condition above can be
        # re-used.
        browser.execute_script(
            "arguments[0].parentNode.classList.add('layui-hide');", result)
        result = result.get_attribute("src")
        result = base64.b64decode(
            result.replace("data:image/jpg;base64,", ""))
        yield Image.open(io.BytesIO(result))
        image = next_image
    browser.quit()


  1. The empty-path trick from the PDF-to-image step needs to be special-cased.
  2. After clicking the switches we need to wait a moment to make sure they really registered.
  3. The loop is best not written as a standard for loop, since then the code would just sit idle after uploading each image. The current form fetches the next image immediately after uploading, and only then waits for the current image to finish. While next(images) runs, the browser operates independently of the Python process, so we lose nothing. Converting one PDF page into a high-resolution image takes 1-2 seconds, and the site takes about as long to process an image, so restructuring the loop is a nice optimization.
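The overlap described in point 3 can be sketched independently of Selenium. Below, the browser's background work is stood in for by a single worker thread; pipelined, process, and convert are illustrative names, not from the script:

```python
import concurrent.futures

def pipelined(inputs, process, convert):
    """Convert the next input while the current one is being processed."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        it = iter(inputs)
        current = convert(next(it))
        for raw in it:
            future = pool.submit(process, current)  # like uploading a page
            current = convert(raw)          # convert the next page meanwhile
            results.append(future.result())  # like waiting for the result
        results.append(process(current))
    return results
```

The submit / convert / result ordering is exactly the upload / next / wait ordering in the loop above.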






from tqdm import tqdm
import pytesseract

def convert_images_to_pdf(images: Generator[Image.Image, None, None],
                          output: StrPath,
                          ocr: bool = True, total: Optional[int] = None):
    """
    Converts and merges images into a pdf file, one page per image,
    performing optional OCR in the process.

    :param images: A generator yielding PIL image objects.
    :param output: Path to the result pdf.
    :param ocr: Whether to perform OCR (Optical Character Recognition).
    :param total: An optional integer hinting the total number of images given.
        If supplied, a progress bar will be displayed during the conversion.
    """
    doc = fitz.Document()
    for image in images if total is None else tqdm(images, total=total):
        if ocr:
            # pytesseract returns a single-page searchable pdf as bytes.
            pdf = pytesseract.image_to_pdf_or_hocr(image, extension="pdf")
            page = fitz.Document(stream=pdf, filetype="pdf")
            doc.insert_pdf(page)
        else:
            # noinspection PyUnresolvedReferences
            page = doc.new_page(width=image.width, height=image.height)
            buffer = io.BytesIO()
            image.save(buffer, format="jpeg")
            page.insert_image(fitz.Rect(0, 0, image.width, image.height),
                              stream=buffer.getvalue())
    doc.save(output)



import click

@click.command()
@click.argument("input", type=click.Path(exists=True))
@click.argument("output", type=click.Path())
@click.option("-f", "--format", default="png",
              help="Intermediate image format.")
@click.option("-d", "--dpi", default=300, help="DPI for rasterization.")
@click.option("-b", "--browser", default="chrome",
              help="The browser selenium uses.")
@click.option("--first-page", type=int, help="First page to convert/clean.")
@click.option("--last-page", type=int, help="Last page to convert/clean.")
@click.option("--ocr/--no-ocr", default=True,
              help="Whether to perform OCR during the conversion.")
@click.option("--clean/--dont-clean", default=True,
              help="Whether to clean pdf using docleaner's online service.")
def main(input: str, output: str, format: str, dpi: int, browser: str,
         first_page: Optional[int], last_page: Optional[int], ocr: bool,
         clean: bool):
    images = convert_pdf_to_images(input, fmt=format, dpi=dpi,
                                   first_page=first_page, last_page=last_page)
    if clean:
        images = clean_doc_online(images, browser)
    doc = fitz.Document(input)
    total = (doc.page_count if last_page is None else last_page) \
        - (0 if first_page is None else first_page - 1)
    convert_images_to_pdf(images, output, ocr=ocr, total=total)

if __name__ == "__main__":
    main()




  1. An overly aggressive rasterization resolution noticeably inflates the size of the resulting document.
  2. The Selenium-based solution occasionally ends up with a switch not turned on, or inexplicably turned off.
  3. The script's runtime efficiency leaves something to be desired.


After reading Introduction to Linear Algebra, determinants finally clicked for me.



  1. The determinant of the identity matrix is 1.

  2. Exchanging two rows changes the sign of the determinant.

  3. The determinant is linear in each single row, i.e.

    \begin{vmatrix} \vdots \\ (\boldsymbol  a + \boldsymbol b)^{\mathrm T} \\ \vdots \end{vmatrix} = \begin{vmatrix} \vdots \\ \boldsymbol a^{\mathrm T} \\ \vdots \end{vmatrix} + \begin{vmatrix} \vdots \\ \boldsymbol b^{\mathrm T} \\ \vdots \end{vmatrix},\quad\quad
\begin{vmatrix} \vdots \\ (t\boldsymbol a)^{\mathrm T} \\ \vdots \end{vmatrix} = t\begin{vmatrix} \vdots \\ \boldsymbol a^{\mathrm T} \\ \vdots \end{vmatrix}


  1. If two rows are equal, the determinant is 0. Proof: exchange the two equal rows; the determinant is unchanged, yet by property 2 it must change sign, so it can only be 0.

  2. Elementary row operations do not change the value of the determinant. Proof: by properties 3 and 4,

\begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 (\boldsymbol a - l\boldsymbol b)^{\mathrm T} \\
 \vdots
\end{vmatrix}
= \begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol a^{\mathrm T} \\
 \vdots
\end{vmatrix}
- l\begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots
\end{vmatrix}
= \begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol a^{\mathrm T} \\
 \vdots
\end{vmatrix}
- l\cdot 0
= \begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol a^{\mathrm T} \\
 \vdots
\end{vmatrix}

  3. If some row is \boldsymbol 0, the determinant is 0. Proof: add another row to the zero row; two equal rows appear, and by property 4 the value is 0. Since elementary row operations do not change the determinant, the original determinant is also 0.

  4. The determinant of a triangular matrix is the product of its main-diagonal entries. Proof: whether the matrix is upper or lower triangular, the off-diagonal entries can simply be eliminated by row operations; then by properties 1 and 3, the determinant is the product of the main-diagonal entries.

  5. If a matrix is invertible, its determinant is nonzero. Proof: consider Gaussian elimination. If the matrix is singular, the resulting triangular matrix must contain a zero row, so the determinant is 0. Otherwise every row is a pivot row, and the triangular matrix's determinant is nonzero. (The row operations during elimination do not change the determinant; row exchanges only flip its sign.)

  6. The determinant of a product equals the product of the determinants. Proof: for matrices A, B, consider the function of A given by D(A) = |AB|/|B|. It has three properties:

    1. D(I) = |B|/|B| = 1.
    2. Exchanging two rows of A exchanges the corresponding rows of AB, so |AB| changes sign, and so does D(A).
    3. A linear change in a single row of A is a linear change in the same row of AB, so |AB| is linear in each single row of A, i.e. D(A) is linear in each single row of A.

    Therefore D(A) satisfies the core properties of the determinant, and by the definition of the determinant, D(A) = |A|, i.e. |AB| = |A||B|.

  7. Transposition does not change the determinant. If the matrix is singular, so is its transpose, and both determinants are 0. If it is invertible, consider the LU decomposition PA = LU. Transposing both sides gives A^{\mathrm T}P^{\mathrm T} = U^{\mathrm T}L^{\mathrm T}. By property 9,

    |P||A|=|L||U|,\quad \left|A^{\mathrm T}\right|\left|P^{\mathrm T}\right| = \left|U^{\mathrm T}\right|\left|L^{\mathrm T}\right|
    Comparing the two: L and U are triangular, and since transposition does not change the diagonal, it does not change their determinants. P is a permutation of row exchanges with determinant \pm 1, and P^{\mathrm T}P=I, so |P| and \left|P^{\mathrm T}\right| have the same sign, i.e. they are equal. Combining the two equalities, |A| = \left|A^{\mathrm T}\right|.





\begin{aligned}
    \begin{vmatrix}
        a &b\\
        c &d
    \end{vmatrix} &= \begin{vmatrix}
        a &0\\
        c &d
    \end{vmatrix} + \begin{vmatrix}
        0 &b\\
        c &d
    \end{vmatrix} \\
    &= \begin{vmatrix}
        a &0\\
        c &0
    \end{vmatrix} + \begin{vmatrix}
        a &0\\
        0 &d
    \end{vmatrix} + \begin{vmatrix}
        0 &b\\
        c &0
    \end{vmatrix} + \begin{vmatrix}
        0 &b\\
        0 &d
    \end{vmatrix}
\end{aligned}
Continuing in this fashion, an n-row determinant can be split into n^n determinants, each with at most one nonzero entry per row. Notice that any of these with a zero column must be 0 and can be dropped. So the potentially nonzero terms are those whose single nonzero entries sit in pairwise distinct columns. Each such determinant can be diagonalized by a permutation matrix, so its value is the product of each row's possible nonzero entry times the determinant of the permutation matrix, which can only be \pm 1 -- let us define this sign to be the "parity" of the permutation. The permutation formula for the determinant then follows:

\left|A_{n\times n}\right| = \sum_{\pi \in S_n} (-1)^{\sigma(\pi)}\prod_{i=1}^na_{i, \pi(i)}


where the parity \sigma(\pi) is the number of inversions of the permutation:

\sigma(\pi)=\sum_{i<j}[\pi(i) > \pi(j)]
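The permutation formula translates directly into (very inefficient) code. This brute-force checker is my own illustration, not from the book:

```python
from itertools import permutations

def det_by_permutations(a):
    """O(n! * n) determinant straight from the permutation formula."""
    n = len(a)
    total = 0
    for perm in permutations(range(n)):
        # sigma(pi): the number of inversions decides the sign.
        inversions = sum(perm[i] > perm[j]
                         for i in range(n) for j in range(i + 1, n))
        product = 1
        for i in range(n):
            product *= a[i][perm[i]]
        total += (-1) ** inversions * product
    return total
```

It agrees with hand computation on small matrices, e.g. det [[1, 2], [3, 4]] = -2.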







Cramer's rule follows from the product rule. Start with

\begin{aligned}
A \boldsymbol x &= \boldsymbol b \\
A I &= A
\end{aligned}

and replace the i-th column of I with \boldsymbol x:

A \cdot \begin{bmatrix}
1 &0 &\cdots &0 &x_1 &0 &\cdots &0 \\
0 &1 &\cdots &0 &x_2  &0&\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 &0 &\cdots &1 &x_{i-1} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i+1} &1 &\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 & 0 &\cdots &0 &x_n &0 &\cdots &1
\end{bmatrix} = \begin{bmatrix}
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
    \boldsymbol{a_1} &\boldsymbol{a_2} &\cdots  & \boldsymbol{a_{i - 1}} &\boldsymbol b & \boldsymbol{a_{i + 1}} &\cdots &\boldsymbol{a_n} \\
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert
\end{bmatrix}

Taking determinants on both sides (the determinant of a product is the product of the determinants):

|A| \cdot \begin{vmatrix}
1 &0 &\cdots &0 &x_1 &0 &\cdots &0 \\
0 &1 &\cdots &0 &x_2  &0&\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 &0 &\cdots &1 &x_{i-1} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i+1} &1 &\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 & 0 &\cdots &0 &x_n &0 &\cdots &1
\end{vmatrix} = \begin{vmatrix}
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
    \boldsymbol{a_1} &\boldsymbol{a_2} &\cdots  & \boldsymbol{a_{i - 1}} &\boldsymbol b & \boldsymbol{a_{i + 1}} &\cdots &\boldsymbol{a_n} \\
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert
\end{vmatrix}

The middle determinant equals x_i: its i-th row has the single entry x_i, and expanding along that row leaves the determinant of an identity matrix, so

\begin{vmatrix}
1 &0 &\cdots &0 &x_1 &0 &\cdots &0 \\
0 &1 &\cdots &0 &x_2  &0&\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 &0 &\cdots &1 &x_{i-1} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i+1} &1 &\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 & 0 &\cdots &0 &x_n &0 &\cdots &1
\end{vmatrix} = x_i

Therefore

x_i = \frac{\begin{vmatrix}
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
    \boldsymbol{a_1} &\boldsymbol{a_2} &\cdots  & \boldsymbol{a_{i - 1}} &\boldsymbol b & \boldsymbol{a_{i + 1}} &\cdots &\boldsymbol{a_n} \\
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert
\end{vmatrix}}{|A|}
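Cramer's rule is a few lines of (again, exponentially slow) code on top of a cofactor-expansion determinant. This is a sketch for illustration only; the names det and cramer_solve are mine:

```python
def det(a):
    # Cofactor expansion along the first row.
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j]
               * det([row[:j] + row[j + 1:] for row in a[1:]])
               for j in range(len(a)))

def cramer_solve(a, b):
    # x_i = det(A with column i replaced by b) / det(A)
    d = det(a)
    return [det([row[:i] + [b[k]] + row[i + 1:]
                 for k, row in enumerate(a)]) / d
            for i in range(len(a))]
```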




  1. The box formed by the rows of the identity matrix is the unit box, whose volume is clearly 1.
  2. Exchanging rows of the determinant does not change the shape of the box, so the absolute value of the signed volume stays the same while the sign flips (take this as the "definition" of signed volume; that its absolute value is the ordinary volume is only natural).
  3. Linearly scaling the vector in one row clearly scales the volume of the box linearly. Splitting one row into the sum of two row vectors... this one is a bit more involved; it is easy to see with a picture in two dimensions, and higher dimensions are analogous.

Therefore, the area of the triangle with one vertex at the origin and the other two at (x_1, y_1), (x_2, y_2) is

\frac{1}{2}\left|\begin{vmatrix}x_1 &y_1 \\ x_2 &y_2\end{vmatrix}\right|

and the area of the triangle with vertices (x_1,y_1),(x_2,y_2),(x_3, y_3) is

\frac{1}{2}\left|\begin{vmatrix}x_1 &y_1 & 1\\ x_2 &y_2 &1 \\ x_3 &y_3 &1\end{vmatrix}\right|
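Expanding the 3 x 3 determinant along its first row gives a one-liner for the second formula (the function name is mine):

```python
def triangle_area(p1, p2, p3):
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Expansion of the 3x3 determinant with the extra column of ones.
    d = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    return abs(d) / 2
```

For instance, the right triangle with legs 4 and 3 comes out to area 6.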


The cross product \boldsymbol u \times \boldsymbol v of two three-dimensional vectors is defined as another three-dimensional vector whose magnitude is |\boldsymbol u||\boldsymbol v||\sin \theta| and whose direction is perpendicular to both \boldsymbol u and \boldsymbol v. Its formula can be memorized through the following pseudo-determinant:

\boldsymbol u \times \boldsymbol v = \begin{vmatrix}
    \boldsymbol i &\boldsymbol j &\boldsymbol k \\
    u_1 &u_2 &u_3 \\
    v_1 &v_2 &v_3
\end{vmatrix}

where \boldsymbol i = (1, 0, 0), \boldsymbol j = (0, 1, 0), \boldsymbol k = (0, 0, 1).

Clearly, this determinant form shows that the cross product is antisymmetric: \boldsymbol u \times \boldsymbol v = -\boldsymbol v \times \boldsymbol u.

The scalar triple product has a determinant form as well:

(\boldsymbol u \times \boldsymbol v)\cdot \boldsymbol w = \begin{vmatrix}
    w_1 &w_2 &w_3 \\
    u_1 &u_2 &u_3 \\
    v_1 &v_2 &v_3
\end{vmatrix}
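Both formulas are easy to check in code. A standalone sketch (cross and triple are my names):

```python
def cross(u, v):
    # Cofactor expansion of the pseudo-determinant along its first row.
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def triple(u, v, w):
    # (u x v) . w -- the determinant with rows w, u, v.
    c = cross(u, v)
    return c[0] * w[0] + c[1] * w[1] + c[2] * w[2]
```

A quick check: cross((1, 0, 0), (0, 1, 0)) gives (0, 0, 1), and swapping the arguments flips every sign.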

The following are notes written from memory after reading Gilbert Strang's Introduction to Linear Algebra, covering roughly chapter 3 and the beginning of chapter 4.


For a matrix A\in \mathbb{R}^{m\times n}, define:

  1. The column space of A is the span of its column vectors. A linear combination of the columns of A is written A\boldsymbol{x}=\boldsymbol b -- so every \boldsymbol b of this form belongs to the column space. It is denoted C(A), with C(A)\subset\mathbb{R}^m.
  2. The nullspace of A is the space of combinations of its columns that equal zero, i.e. all vectors \boldsymbol x satisfying A\boldsymbol{x}=\boldsymbol 0. It is denoted N(A), with N(A)\subset \mathbb{R}^n.
  3. The row space of A is the span of its row vectors. Since the rows of A are the columns of A^{\mathrm{T}}, the row space can be written C\left(A^{\mathrm T}\right), with C\left(A^{\mathrm T}\right) \subset \mathbb{R}^{n}.
  4. The left nullspace of A is the space of combinations of its rows that equal zero, i.e. all vectors \boldsymbol{x} satisfying \boldsymbol{x}^{\mathrm T}A = \boldsymbol 0^{\mathrm T}; since \boldsymbol x now sits on the left, it is called the left nullspace by analogy with the nullspace. It is naturally written N\left(A^{\mathrm T}\right), with N\left(A^{\mathrm T}\right) \subset \mathbb{R}^{m}.





Proof: suppose \boldsymbol{v_1},\boldsymbol{v_2},\cdots,\boldsymbol{v_n} and \boldsymbol{w_1}, \boldsymbol{w_2}, \cdots ,\boldsymbol{w_m} are two bases of the vector space \mathbf V. Arguing by contradiction, assume without loss of generality that n < m. Since \boldsymbol{v_1},\boldsymbol{v_2},\cdots,\boldsymbol{v_n} is a basis, every \boldsymbol{w_k} can be expressed as a linear combination of them:

\boldsymbol{w_k} = \begin{bmatrix}
    \boldsymbol{v_1} &\boldsymbol{v_2} &\cdots &\boldsymbol{v_n}  
\end{bmatrix}\cdot \boldsymbol{a_k}

Collecting all these combinations into a matrix A whose columns are the \boldsymbol{a_k}:

\begin{bmatrix}
    \boldsymbol{w_1} &\boldsymbol{w_2} &\cdots &\boldsymbol{w_m}  
\end{bmatrix} = \begin{bmatrix}
    \boldsymbol{v_1} &\boldsymbol{v_2} &\cdots &\boldsymbol{v_n}  
\end{bmatrix}\cdot A

It is easy to see that A is an n by m matrix, and n < m, so A is wide. Algebraically, the homogeneous system has more unknowns than equations, which means A\boldsymbol{x} = \boldsymbol{0} has a nontrivial solution. Then

\begin{aligned}
    A\boldsymbol{x} &= \boldsymbol{0} \\
    \Rightarrow \begin{bmatrix}
    \boldsymbol{v_1} &\boldsymbol{v_2} &\cdots &\boldsymbol{v_n}  
\end{bmatrix} A\boldsymbol{x} &= \boldsymbol{0} \\
    \Rightarrow \begin{bmatrix}
    \boldsymbol{w_1} &\boldsymbol{w_2} &\cdots &\boldsymbol{w_m}  
\end{bmatrix} \boldsymbol{x} &= \boldsymbol{0}
\end{aligned}

That is, \boldsymbol{w_1}, \boldsymbol{w_2}, \cdots ,\boldsymbol{w_m} are linearly dependent -- a contradiction. QED.


The dimensions of a matrix's four subspaces can be computed by reducing it to reduced row echelon form (RREF). The RREF is defined by:

  1. echelon / upper-triangular form;
  2. every pivot is 1;
  3. every entry directly above a pivot is 0.

For example:

\begin{bmatrix}
    1 &1 &2 &4\\
    1 &2 &2 &5\\
    1 &3 &2 &6
\end{bmatrix} \to \begin{bmatrix}
    1 &0 &2 &3\\
    0 &1 &0 &1\\
    0 &0 &0 &0
\end{bmatrix}
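Gauss-Jordan elimination to RREF fits in a short function. A sketch using exact Fraction arithmetic (names are mine, not from the book):

```python
from fractions import Fraction

def rref(matrix):
    m = [[Fraction(x) for x in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    pivot_row = 0
    for col in range(cols):
        # Find a usable pivot in this column, at or below pivot_row.
        sel = next((r for r in range(pivot_row, rows) if m[r][col] != 0), None)
        if sel is None:
            continue
        m[pivot_row], m[sel] = m[sel], m[pivot_row]
        # Scale the pivot row so the pivot becomes 1.
        pivot = m[pivot_row][col]
        m[pivot_row] = [x / pivot for x in m[pivot_row]]
        # Eliminate the pivot column everywhere else, above and below.
        for r in range(rows):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
        if pivot_row == rows:
            break
    return m
```

Running it on the example matrix reproduces the RREF shown above.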


  1. The dimension of its column space is its rank: the pivot columns are clearly linearly independent, and the non-pivot columns are clearly linear combinations of the pivot columns (the coefficients are written right in the column -- e.g. in the example above, the fourth column is 3 times the first plus 1 times the second).
  2. The dimension of its row space is also its rank: the shape of the RREF makes it easy to see that the nonzero rows are linearly independent, since each nonzero row contains a pivot and is the only row with a 1 in that pivot's column.
  3. The dimension of its nullspace is the number of columns minus the rank, i.e. the number of non-pivot columns: viewed through A\boldsymbol x = \boldsymbol 0, the non-pivot columns correspond to free variables. Setting one free variable to 1 and the rest to 0, then solving for the pivot variables, yields as many "special solutions" as there are free variables. These solutions are linearly independent, all lie in the nullspace, and span it, so they form a basis of the nullspace.
  4. The dimension of its left nullspace is the number of rows minus the rank, i.e. the number of zero rows: the nonzero rows are independent, so for a combination of rows to vanish, the coefficients of all nonzero rows must be zero, while the coefficients of the zero rows are arbitrary; hence the dimension of the left nullspace is exactly the number of zero rows.


  1. Elementary row operations do not change the linear relations among the columns -- a row operation acts on all the columns together, rising and falling as one. If the third column of the original matrix is twice the first, it still is after the row operations; if the second column is the sum of the first and the third, that relation also persists.
  2. Since the linear relations among the columns don't change, the nullspace doesn't change, and neither does the dimension of the column space (though the column space itself does change).
  3. Row operations are linear combinations of the rows, so the row space of the RREF is a subset of the original row space; and since the row operations of Gaussian elimination are invertible, the original row space is also a subset of the RREF's row space. Hence the row space stays the same throughout the reduction.
  4. The RREF only reveals, rather than alters, the linear dependence among the rows (a zero row in the RREF means the original row was a combination of the rows above it). Independent rows do not become dependent through row operations, and vice versa, so the dimension of the left nullspace does not change.
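Point 1 can be checked numerically with the example above: the fourth column equals 3 times the first plus the second, both before and after reduction. A throwaway check, not from the book:

```python
original = [[1, 1, 2, 4], [1, 2, 2, 5], [1, 3, 2, 6]]
reduced = [[1, 0, 2, 3], [0, 1, 0, 1], [0, 0, 0, 0]]

def relation_holds(m):
    # column 4 == 3 * column 1 + 1 * column 2, row by row
    return all(row[3] == 3 * row[0] + row[1] for row in m)
```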




  1. \dim\left(C(A)\right) = r
  2. \dim\left(N(A)\right) = n - r
  3. \dim\left(C\left(A^{\mathrm T}\right)\right) = r
  4. \dim\left(N\left(A^{\mathrm T}\right)\right) = m - r


\begin{aligned}
    \dim\left(C(A)\right) + \dim\left(N(A)\right) &= n = \dim\left(\mathbb{R}^n\right) \\
    \dim\left(C\left(A^{\mathrm T}\right)\right) + \dim\left(N\left(A^{\mathrm T}\right)\right) &= m = \dim\left(\mathbb{R}^m\right)
\end{aligned}

What does this mean? Note that A can be viewed as a linear transformation whose domain lives in \mathbb{R}^n, and likewise for A^{\mathrm T}. Combined with the meaning of the nullspace and the column space, we uncover the following relation:









Define the sum \mathbf{V} + \mathbf{W} of two subspaces \mathbf{V} and \mathbf W as the space formed by combining arbitrary vectors from the two (i.e. the span of the union of their bases).


\dim(\mathbf V) + \dim(\mathbf W) = \dim(\mathbf V + \mathbf W) + \dim(\mathbf V \cap \mathbf W)

The naive approach: take a basis of the intersection of the two spaces, extend it separately to bases of \mathbf V and \mathbf W, and then show that the extension vectors, together with the intersection's basis, are linearly independent. The steps are not particularly complicated.


The proof due to Mike Artin goes like this: consider the binary operation + as a map taking the pair (\boldsymbol v\in \mathbf V, \boldsymbol w\in \mathbf W) to \boldsymbol v + \boldsymbol w. It is easy to check this map is linear. Its input dimension is \dim(\mathbf V) + \dim(\mathbf W); the dimension of its image is \dim(\mathbf V + \mathbf W); its kernel consists of the pairs (\boldsymbol v, -\boldsymbol v) with \boldsymbol v\in \mathbf V \cap \mathbf W, whose dimension is naturally \dim(\mathbf V \cap \mathbf W). Since the input dimension of a linear map equals the dimension of its image plus the dimension of its kernel, the result follows.





  1. Within \mathbb R^n, the nullspace N(A) and the row space C\left(A^{\mathrm T}\right) are orthogonal to each other:

    \forall A^{\mathrm T}\boldsymbol x \in C\left(A^{\mathrm T}\right),\boldsymbol y \in N(A),\quad \left(A^{\mathrm T}\boldsymbol x\right)^{\mathrm T}\boldsymbol y = \boldsymbol{x}^{\mathrm T}A\boldsymbol y = \boldsymbol{x}^{\mathrm T}\cdot \boldsymbol 0 = 0
    And since their dimensions sum to n, the two are complements: N(A) + C\left(A^{\mathrm T}\right) = \mathbb{R}^n.

  2. Within \mathbb R^m, the column space C(A) and the left nullspace N\left(A^{\mathrm T}\right) are orthogonal to each other:

    \forall A\boldsymbol x \in C(A),\boldsymbol y \in N\left(A^{\mathrm T}\right),\quad \left(A\boldsymbol x\right)^{\mathrm T}\boldsymbol y = \boldsymbol{x}^{\mathrm T}A^{\mathrm T}\boldsymbol y = \boldsymbol{x}^{\mathrm T}\cdot \boldsymbol 0 = 0
    And since their dimensions sum to m, the two are complements: C(A)+ N\left(A^{\mathrm T}\right) = \mathbb{R}^m.

With that, we can bring out the famous figure from the cover of Introduction to Linear Algebra (hand-redrawn here as a tribute):