Chengyuan Ma's Blog

Dwelling quietly in seclusion, I watch the myriad things change, trace their natural principles to the end, and judge them in my heart.


The title was inspired by the priceless A Brief, Incomplete, and Mostly Wrong History of Programming Languages.

Foreword

"翻墙," or "going over the wall," has been an open secret for Chinese Internet users for quite some time. Accessing Google in China takes much more than typing google.com in one's browser bar -- in fact, so much that it becomes somewhat a shared experience that few outsiders could understand. The walls are becoming higher and higher while the ladders are also becoming better and better. It's a constant struggle that's been going on for at least two decades.

Personally, my struggle with censorship is almost as long as the total time I've been using the Internet. In retrospect, the technological sophistication of both parties and the way they evolved against each other have been truly fascinating, if not epic. But so far, there seems to be no complete recorded history of this subject -- what a pity! I suppose I can make an attempt here. I will tell the story mainly in chronological order. That said, the exact dates or even years could be inaccurate, and I can't give references for everything I say. I am not writing an academic paper here, and too many things have been lost to time.

There is a Wikipedia entry that covers many aspects beyond the technical ones. In contrast, in this post I would like to focus on just the technical part, which I consider fascinating enough on its own.

Without further ado, what did it take for people in China to access the true Internet?

The Story

Long Long Ago

In 1987, China sent its first email to Germany, marking its first step into the Internet.

Across the Great Wall, we can reach every corner in the world.

People were excited, but probably no one then expected this email to be a prophecy.


For the twelve years that followed, there were no technical restrictions on Internet access in China, though the affordability of PCs and the knowledge required to operate them did keep Internet users a minority.

The Early Days of GFW and VPNs

In 1998 (some say 2002), nationwide DNS spoofing emerged. Domestic DNS servers were configured to return bogus IPs when asked for the addresses of specific domains, so those domains appeared inaccessible. There were two simple workarounds: explicitly switch to a foreign DNS server (still vulnerable to DNS hijacking) or use a Hosts file containing correct resolutions. Various Hosts files were passed around Chinese forums over the next decade, and this workaround still partially worked into the early 2010s, if I recall correctly. DoT and DoH would also have been good workarounds, if only they had been invented by then.
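The Hosts workaround was as low-tech as it sounds: a plain text file mapping blocked domains to addresses believed to be correct at the time, which the OS consults before making any DNS query. A sketch of what one of those shared files looked like (the IPs below are placeholders from the reserved documentation range, not real resolutions):

```text
# /etc/hosts on Unix, C:\Windows\System32\drivers\etc\hosts on Windows
# format: <IP address>  <domain> -- consulted before any DNS lookup
203.0.113.10   www.google.com
203.0.113.11   twitter.com
203.0.113.12   facebook.com
```

The obvious maintenance burden -- sites change IPs, the file goes stale -- is exactly why these files had to be re-shared over and over.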

(The DNS spoofing is still active today, as shown above, where facebook.com resolves to an internal Dropbox IP. 114.114.114.114 is a widely used DNS server in China. IP information source)

The censors soon realized that DNS spoofing was insufficient, so they started installing special hardware devices at backbone network nodes, such as routers at provincial boundaries or submarine cable landing stations. These devices are connected to the network channel via beam splitters, so they can tap the traffic without, in most cases, interfering with it. In the simplest form, these devices maintain an IP blacklist; when they detect an established TCP connection with either end on the list, they send RST packets to both ends, each on behalf of the other, to reset the connection. Neither end can tell the resets are forged, because they look identical to legitimate ones on the wire. Browsers would then raise a "Connection Reset" error. These devices became the initial hardware foundation for what would be referred to as "The Great Firewall," or GFW for short (although this name, and even the censorship itself, has never been officially acknowledged).
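As a toy model of the mechanism just described, the Python sketch below (entirely my own illustration; the class and field names are mine, not GFW internals) decides from an IP blacklist whether a connection should be killed, and crafts the two spoofed RSTs by swapping endpoints and reusing the sequence numbers observed on the wire:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    src: str          # source IP
    dst: str          # destination IP
    seq: int          # sequence number of this segment
    ack: int          # acknowledgment number
    payload: bytes = b""

# Documentation-range IP as a stand-in for a blacklisted site.
BLACKLIST = {"203.0.113.7"}

def forge_rsts(seg: Segment, blacklist=BLACKLIST):
    """If either endpoint is blacklisted, return two spoofed RST
    segments, one toward each end, each impersonating the other peer."""
    if seg.src not in blacklist and seg.dst not in blacklist:
        return []  # connection not censored
    # The RST toward the original sender pretends to come from the
    # receiver, so its seq must match what the sender expects (its ack);
    # the RST toward the receiver continues the sender's sequence space.
    to_src = Segment(src=seg.dst, dst=seg.src, seq=seg.ack, ack=0)
    to_dst = Segment(src=seg.src, dst=seg.dst,
                     seq=seg.seq + len(seg.payload), ack=0)
    return [to_src, to_dst]
```

Because the sequence numbers fall inside each side's receive window, both TCP stacks accept the reset as genuine, which is exactly why the forgery is indistinguishable on the wire.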

(Source)

In addition to IP blacklists, GFW gained a keyword-based filter around the same time that reset connections whenever it spotted sensitive content (e.g., names of top government officials or scandals). This was possible because HTTPS hadn't gained traction yet and most sites operated over unencrypted HTTP. A golden age for censors indeed. In today's age of HTTPS, GFW instead has to pressure site operators into policing sensitive content themselves, which is technically uninteresting and tedious (though still quite effective for big sites).

Until then, GFW had focused primarily on blocking direct access to "inappropriate content" and had not yet turned its attention to people's countermeasures. Consequently, the censorship could often be circumvented with simple HTTP(S)/SOCKS proxies or VPNs based on standard protocols like PPTP or L2TP. These could usually be set up without installing extra software, because popular OSes support them out of the box, and people passed credentials for such services around the Chinese Internet. These simple solutions still had a slight chance of working as late as 2014 (personal experience). Some big Chinese tech companies also released early versions of their "game accelerators," which were VPNs in disguise, and some clever people used those to "cross the wall" as well.

Early 2010s

That didn't last long. GFW's first move was to blacklist the IPs of the VPNs it knew about and to order popular Chinese forums to promptly delete any advertisement of such techniques. A VPN would "die" within 2-3 days, and the same went for proxies. One solution was to keep a long list of VPNs so that at least one of them was always un-blacklisted; better still, the list should be actively maintained so it wouldn't die out altogether one day. This was when VPNGate, led by the University of Tsukuba, became popular. VPNGate maintains a directory of VPN nodes contributed by volunteers around the world, which one could try one by one. The success rate wasn't very high (personal experience), but acceptable. Tor was also an option, but its bridges and entry nodes were updated infrequently, so blacklisting them was easy.

(VPNGate seems to be operational today, but it is no longer usable in China at all)

It might be worth mentioning that in 2009 the government briefly pushed for the installation of Green Dam Youth Escort, an access-control software, on every PC. It didn't take long for the software to become a laughingstock for its disastrous implementation. Never mind the BSODs it caused: all blacklisted URLs were stored in a plaintext file under the installation directory. Many curious teens of that era later said they would never have known so many porn sites but for Green Dam. A true "blessing in disguise?"

(Green Dam ended up as such a joke that people made cartoons out of it. Image taken from Wikipedia entry.)

Around this time (the late 2000s to early 2010s), software such as FreeGate and Wujie was also popular. Behind them were organizations politically hostile to the Chinese government. Compared with the volunteers behind VPNGate, the developers of this software tended to be more motivated and better funded, so reliability was a bit higher. The downside was that upon launching, they would always direct one's browser to their homepages, which contained political propaganda that could be misleading or disturbing to some.

(Image taken from Wikipedia entry)

GoAgent

Yet another solution for the geeks, appearing a little later, was to use private foreign nodes. The idea was that a "ladder" (the term Chinese Internet users use for any means of GFW circumvention) used privately and never advertised would be much less noticeable to the censors. This basic idea underpins most ladders to this day. An early example was GoAgent, which used Google App Engine (GAE) as the proxy server. Anyone could create a unique GAE instance and get a personal proxy node.

I recall a lot of tutorials back then teaching newbies how to go through the entire process of setting up GoAgent. Indeed, this seems to be the moment when ladders became harder to configure. In the past, one pasted a link into the browser proxy settings or the OS VPN setup wizard, and that was it -- all easy Ctrl+C/V and GUI. Starting with GoAgent, one needed to do some remote configuration on the GAE dashboard, download some stuff, edit some config files, and open some scary terminal windows. Kudos to the GoAgent devs for writing a one-click GAE deployment script, for many of GoAgent's successors would require SSH-ing into a Linux VPS for deployment, which I imagine would have been an even greater nightmare for beginners. There have been numerous attempts since to make bypassing GFW easy again, but IMO the problem has never truly been solved.

GoAgent was shut down in late 2014 under government pressure. RIP GoAgent. Its direct successor, XX-Net, claims to be usable to this day, though I haven't checked.

The Death of VPNs

As various VPNs flourished, GFW was also receiving upgrades, which soon led to the downfall of all simple proxies and VPNs. The weakness of these solutions is that they all have highly distinctive traffic patterns: each protocol has some handshake procedure with fixed byte patterns (regardless of whatever encryption there might be). GFW can therefore recognize the use of these protocols and shut the connection down. From a technical perspective, this is easier said than done compared with GFW's previous filters. Previously, GFW only needed to scan individual packets (TCP segments) for addresses and specific keywords. Now it had to reconstruct TCP streams from packets and do the filtering at a higher level. This requirement of context awareness raises the complexity of GFW's algorithms, which is probably why such filters had not been implemented earlier -- my educated guess, at least.
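To see the difference in concrete terms, here is a toy contrast in Python (entirely my own illustration, not GFW code): a stateless per-segment scanner misses a keyword split across two TCP segments, while a filter that first reassembles the stream by sequence number catches it.

```python
def scan_packet(payload: bytes, keyword: bytes) -> bool:
    """Old-style stateless filter: inspects one segment at a time."""
    return keyword in payload

def scan_stream(segments, keyword: bytes) -> bool:
    """Context-aware filter: order segments by sequence number,
    reassemble the byte stream, then scan the whole thing."""
    data = b"".join(payload for _, payload in sorted(segments))
    return keyword in data

# A keyword split across two segments that also arrive out of order:
# (sequence_number, payload) pairs.
segments = [(105, b"word and more"), (100, b"..key")]
```

The stateless scanner sees `..key` and `word and more` separately and matches neither; the reassembler sees `..keyword and more` and matches. The extra cost is state: the filter must buffer and order segments per connection, for every connection it watches.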

Protocol-based detection was indeed a powerful technique that killed many protocols that had not been designed to face such adversaries. Apart from SOCKS/HTTP proxies and PPTP/L2TP VPNs, the death list also included SSH tunnels, OpenVPN, and some other once-usable but unpopular ladders I don't have time to cover.

It was also around this time that multiple blog posts appeared attempting to probe the inner architecture of GFW. People figured out which backbone routers had GFW devices attached, using TTLs and reverse IP lookups. There was even a script called GFW looking glass that could trigger a buffer overflow in GFW with a deliberately crafted DNS query, so that the response would contain parts of GFW's memory. Such knowledge proved not particularly helpful for developing ladders, but it is really cool. (A pity that by the time I learned of the script, the buffer overflow had been patched :( )

Shadowsocks

Among the few survivors of the GFW upgrade was Shadowsocks. First appearing in 2012, Shadowsocks, in its simplest variant, is just SOCKS5 encrypted with an algorithm like AES or ChaCha20 under a pre-shared key. The encryption makes its traffic characteristics much harder to model and identify, though it was still possible.

To counter Shadowsocks, GFW looked at the length distribution of the TCP packets, which encryption, unlike the packets' content, does not change much. In response, later versions of Shadowsocks started adding random-length padding to messages, thus altering the length distribution. GFW's next move was to use replay attacks, or to disguise itself as a potential client, to actively probe hosts it considered likely Shadowsocks servers (not hard to guess, since Shadowsocks typically uses non-standard ports). If it received a response typical of a Shadowsocks server, GFW could be sure the target was a proxy and blacklist its IP. The Shadowsocks developers managed to cope by updating the protocol and implementing the server with great care. I would say Shadowsocks is the first family of protocols designed with GFW-like adversaries in mind from the very start.

In addition to encryption, people also started experimenting with various obfuscation techniques on top of Shadowsocks. Multiple Shadowsocks implementations can disguise proxy traffic to mimic "innocent" protocols such as HTTP, Skype, and WeChat video calls. The effectiveness of such obfuscation has been controversial. Later, people found the paper The Parrot Is Dead from UT Austin, confirming that such mimicry does not improve undetectability (quite the opposite, actually).

In 2017, a Shadowsocks user found an intriguing patent application on the website of the Chinese Patent Office, documenting the classification of Shadowsocks traffic with deep random forests. The discovery caused quite a stir (and panic) in the community, with rumors that the patent came from "the wall builders" and that Shadowsocks was going to die. Though there have never been official documents on GFW's inner workings, this might be the first evidence that GFW had started deploying machine-learning-based technologies to classify proxy traffic (which, from a purely technical perspective, is an engineering feat given the immense volume of traffic going through GFW every second). Since then, there have been increasing anecdotal reports of Shadowsocks servers being blocked, although anecdotal evidence also suggests the protocol is still usable today under some configurations.

Shadowsocks has been more than influential in the ladder community for the past decade. When people say Shadowsocks, they refer not to one protocol or implementation but to a family of protocols and implementations in different languages and on various platforms. The original Shadowsocks repo was once deleted under government pressure, and people moved to other forks (the most famous being ShadowsocksR). Then, somehow, the original repo was revived, but some are still using some fork, or a fork of a fork of a fork... The full Shadowsocks pedigree is daunting to draw. Almost all Shadowsocks apps feature a paper plane in their icons, which is why many call Shadowsocks servers "airstrips." The name was soon extended to other proxy servers.

A Short Detour, and Wireguard

I suppose that's enough Shadowsocks. Before we move on to the next milestone of GFW or the ladders, I would like to take a detour into developments on a more fundamental level -- VPSes, routes, and TCP stacks.

None of these had been problems before 2010, when most of us just used VPNs set up by some random but generous geek, and they aren't problems for the many today who buy commercial ladders and have everything taken care of. But the last decade witnessed a leap forward in the computer skills of Chinese Internet users and a boom in cloud computing platforms, which made hosting one's own server node (VPS) a tempting option for more and more people (myself included).

Those people started comparing the service quality and uptime of different VPS providers and learned about the different tiers of routes (mainly between China and the US). For example, the China Telecom 163 backbone is usually the default but becomes intolerably laggy at night under heavy traffic, while the premium CN2 route was better but more expensive. People tracerouted back and forth and wrote ever more benchmarking scripts.

Some attacked the speed problem from another angle, replacing the default CUBIC TCP congestion control algorithm with a much more aggressive one. The LiteSpeed kernel module was created to this end, and when Google's BBR came out, tutorials on enabling BBR on a VPS popped up everywhere, along with plenty of convenience scripts. Such hacking inside the kernel requires support from the virtualization technology, so people started comparing KVM, Xen, and the like. Some also wrote iptables scripts to control access to their servers, making them more resilient to GFW's active probing. These techniques are orthogonal to most proxy applications and could improve the experience significantly. I personally found messing around with them great fun and learned a ton about Linux and networking, so I think they deserve a place in this article.
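For reference, what those BBR tutorials and convenience scripts boiled down to: on a kernel at 4.9 or newer, enabling BBR is two sysctl settings (this is the well-known generic recipe, not specific to any ladder):

```text
# append to /etc/sysctl.conf, then apply with `sysctl -p`
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```

Afterwards, `sysctl net.ipv4.tcp_congestion_control` should report bbr. On OpenVZ-style containers this fails because guests share the host kernel, which is exactly why people started caring about KVM vs. Xen.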

People also noticed WireGuard, a rising star around that time. To install it on the server side, one had to dig deep and upgrade the Linux kernel. This is close to the low-level stuff I just covered, so I'll take the opportunity to talk about it here. Its simplicity and UDP-based nature were very appealing at the time, and indeed it worked like a charm. The hype, however, quickly cooled upon the realization that while the protocol was designed to be secure, it was never meant to be undetectable: the patterns of WireGuard packets are obvious. It was thus no surprise that within several months GFW could accurately detect and block WireGuard usage. Another caveat is that, as a VPN, WireGuard is not very flexible (either all app traffic goes through it or none) compared with proxies, especially the one I am about to introduce...

V2Ray

In 2016, the first version of V2Ray was released. It was a game-changer.

The killer feature of V2Ray is the highly flexible, fine-grained control over the proxy architecture offered by its JSON-based config. V2Ray models all proxied traffic as coming in through some "inbound," getting filtered or dispatched by routing rules, and leaving through some "outbound." All three parts are configurable. Inbounds and outbounds can be assigned tags that routing rules can reference. Notable inbound protocols include SOCKS5 and HTTP proxy. Outbound protocols include SOCKS5, Shadowsocks, "freedom" (which simply sends the traffic to the free Internet), "blackhole" (which silently consumes all traffic), and VMess, V2Ray's own proxy protocol, built on the lessons its predecessors had learned over the years.

Under this model, a ladder's client and server sides can run the same V2Ray binary with different configurations. A simple client could be a SOCKS5 inbound chained to a VMess outbound; a simple server could be a VMess inbound chained to a "freedom" outbound. The model also allows complicated setups involving multiple inbounds and outbounds, relay nodes, transparent proxies (using the "TProxy" inbound), port forwarding (using the "Dokodemo-door" inbound), and more. The routing rules are just as flexible: IPs and domains can be matched against CIDRs, prefixes/suffixes, and a geographic database that ships with the program. Want local traffic to bypass the proxy? Add a rule directing local IP ranges to "freedom" on the client. Want to block some ad domains? Add a rule diverting such requests to a "blackhole" outbound. Compared with the configurations of previous ladders -- usually just a local address, a server address, and credentials -- what V2Ray offered was an eye-opener.
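To make the model concrete, here is a minimal client-side config sketch in the spirit just described: a SOCKS5 inbound, a VMess outbound, a "freedom" outbound tagged direct, and one rule keeping private and domestic IPs off the proxy. Field names follow V2Ray's config schema as I remember it; the server address, port, and UUID are placeholders:

```json
{
  "inbounds": [
    { "port": 1080, "listen": "127.0.0.1", "protocol": "socks" }
  ],
  "outbounds": [
    {
      "tag": "proxy",
      "protocol": "vmess",
      "settings": {
        "vnext": [{
          "address": "my.example-server.net",
          "port": 443,
          "users": [{ "id": "00000000-0000-0000-0000-000000000000" }]
        }]
      }
    },
    { "tag": "direct", "protocol": "freedom", "settings": {} }
  ],
  "routing": {
    "rules": [
      {
        "type": "field",
        "ip": ["geoip:private", "geoip:cn"],
        "outboundTag": "direct"
      }
    ]
  }
}
```

The first outbound is the default; everything not caught by a rule goes through the VMess tunnel, while private and Chinese IPs (matched against the bundled geo database) go straight out.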

Since V2Ray, other proxy applications have emerged with similar high configurability. Most call themselves "rule-based proxies" or "proxy frameworks." Notable examples include Clash and, more recently, Leaf. These alternatives offer some quality-of-life features, such as

  • A more user-friendly, YAML-based configuration format (V2Ray uses JSON);
  • A "true global proxy" via tun2socks, usually referred to as a "TUN inbound." On activation, this special inbound adds a virtual interface to the system network stack and modifies the routing table so it can capture traffic from applications that aren't proxy-aware;
  • The ability to run on OpenWRT routers, so connected devices need no extra setup to circumvent GFW;
  • GUI or WebUI front ends for graphical configuration;
  • More complicated routing driven by Python scripts;
  • Subscriptions -- a mechanism that lets the proxy framework auto-update its configuration from a URL. It makes it easy for users of commercial proxy services to set up their clients and keep their node directories and protocol configurations in sync with the service provider. V2Ray offers a URL scheme that encodes simple, static configs as short URLs (e.g., vmess://...), which are easily shared as QR codes or messages, but it doesn't come close to the flexibility subscriptions offer.
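The vmess:// scheme mentioned above is, in the commonly seen de facto format, just a small JSON object base64-encoded after the scheme. A toy round-trip in Python (the field names in the example dict are illustrative, not a complete spec):

```python
import base64
import json

def encode_vmess_link(cfg: dict) -> str:
    """Pack a config dict into a shareable vmess:// link."""
    raw = base64.urlsafe_b64encode(json.dumps(cfg).encode()).decode()
    return "vmess://" + raw.rstrip("=")

def decode_vmess_link(link: str) -> dict:
    """Unpack a vmess:// link back into a config dict."""
    assert link.startswith("vmess://")
    payload = link[len("vmess://"):]
    payload += "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
```

This is why such links fit comfortably in a QR code, and also why they can only describe one static node, unlike a subscription URL that serves a whole refreshable directory.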

Notwithstanding these fancy features, these frameworks all adhere to V2Ray's inbound-routing-outbound model, aim to be something of an all-in-one solution, and have popular protocols such as SOCKS5, Shadowsocks, and VMess built in. Because the protocols themselves don't change, mixing these frameworks or chaining them together is possible. For instance, one can use V2Ray on the server side (because it's lightweight) and Clash for the client. Personally, I've used Clash to redirect traffic to a local V2Ray instance so I can have both a TUN inbound (Clash-exclusive) and a VLESS outbound (a successor to VMess, V2Ray-exclusive). Such mix-and-match adds yet another level of flexibility.

In general, V2Ray and its friends immensely diversified the proxy "ecosystem," improved the user experience, and lowered the barrier to entry for new users. The massive number of possible combinations of proxy schemes posed a considerable challenge to GFW. The advanced routing features can even neutralize one technique in GFW's arsenal: if one accidentally visits a Chinese website through the proxy, the traffic goes abroad and comes back, passing GFW twice, and GFW can easily correlate the two streams and infer the use of a proxy. With V2Ray configured to route domestic traffic away from the proxy, this is not a problem.
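Stripped of all the machinery, this kind of split routing is just a first-match rule table. A Python sketch of the dispatch logic (the rule set is mine, purely illustrative):

```python
from ipaddress import ip_address, ip_network

# First-match rules in the spirit of V2Ray's routing: each rule maps
# a predicate on the destination to an outbound tag.
RULES = [
    ("cidr",   "192.168.0.0/16", "direct"),     # LAN: skip the proxy
    ("cidr",   "127.0.0.0/8",    "direct"),
    ("suffix", ".cn",            "direct"),     # domestic: skip the proxy
    ("suffix", "ads.example",    "blackhole"),  # ad domain: drop silently
]

def route(dest: str, default: str = "proxy") -> str:
    """Return the outbound tag for a destination IP or domain."""
    for kind, pattern, tag in RULES:
        if kind == "cidr":
            try:
                if ip_address(dest) in ip_network(pattern):
                    return tag
            except ValueError:
                continue  # dest is a domain name, not an IP
        elif kind == "suffix" and dest.endswith(pattern):
            return tag
    return default  # everything else goes through the tunnel
```

With such a table on the client, domestic destinations never cross GFW through the tunnel, so the two-streams correlation trick has nothing to correlate.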

But still, the inbound/outbound proxy protocols themselves remain the weak point of the whole system. If GFW can identify every protocol you use and block it, then V2Ray will not take you very far, no matter how many relay hops or how many different protocols are involved. Fortunately, the next breakthrough in proxy protocols was just around the corner.

Back to HTTP?

work in progress, to be continued...

The Problem

Is it just me, or is Chrome really not as fast and smooth as it claims to be?

Most people switch to Firefox for its privacy promises. I did not (although that's indeed a plus). I have been using Firefox because the UI animations of nearly all Chromium-based browsers are intolerably laggy.

By "UI," I mean all the components surrounding the web view: the tab strip, the toolbar, the download shelf, the Omnibox. The problem is most noticeable when you close a tab and watch its neighbors adjust their bounds to fill the gap. You observe uncomfortable stutters, and that discomfort lures you into repeating the whole thing to confirm it's not an illusion. It's not, and the pain only intensifies with repetition until it completely ruins your day.

Don't get me wrong. The animations inside web pages are impeccable (Chrome couldn't have taken more than half of the market if it couldn't even do that right). But the contrast just makes the problem more annoying. Worse still, the lag passes down to nearly all Chromium derivatives, like Edge and Brave, making the entire Chromium world visually unacceptable. Vivaldi is an exception because it builds its UI on its own framework, but its homemade UI does not blend well with the remaining Chromium UI (which the Vivaldi devs don't bother to hide), and that inconsistency is a problem in itself.

So I started using Firefox. It was great, until all the incompatibility issues reminded me that I would still need a Chromium-based browser just in case.

Of course, you would expect someone to have filed a bug report somewhere for such a glaring user experience problem. Someone did indeed, but the devs don't seem to care.

Tfw you run across this thread while trying to find out if anyone else's scrolling stutters sometimes.

Blows my mind how stuff like this remains unsolved for months. I've sorta gotten used to laggy tabs by now but it's like they're fine with ignoring reasons for people to trash talk their browser's performance.

Enough complaining. If the devs aren't going to fix it, I will do it myself.

Checking Out the Code

The first step toward fixing the bug is checking out the code. I followed this guide on Chromium's project homepage and encountered basically no problems, except for two:

  • The guide asks you to run everything in a Windows command prompt instead of other shells. This also implies that you shouldn't start the command prompt from another shell (e.g., by typing cmd.exe in PowerShell), because the launched process then inherits the parent shell's environment variables. I have Conda installed, and this inheritance caused problems: the Conda shell integration dynamically prepends its Python executable to Path on PowerShell startup, so no matter how far up front I put depot_tools in Path, it would always be preceded by the Conda Python as long as I launched cmd.exe from PowerShell.

  • Use NTFS. The guide says

    At least 100GB of free disk space on an NTFS-formatted hard drive. FAT32 will not work, as some of the Git packfiles are larger than 4GB.

    I did not take this very seriously and thought I would be fine as long as I avoided FAT32, so I initially used my external SSD formatted as exFAT. The checkout did complete without warnings and I was able to build. However, I faced massive one-hour-plus rebuilds even when I modified nothing (and more than 6000 targets had to be rebuilt when I changed a single .cc file). For a while I thought this was the norm and formed a very bad impression of the efficiency of Chromium's build system. exFAT is also very inefficient at storing small files, such that ~30 GB of source files ended up occupying ~180 GB of disk space, which is complete nonsense. Reformatting the drive as NTFS solved both problems.

First Attempt

I had never dealt with Chromium code before. Fortunately, Chromium's excellent documentation saved me the trouble of diving into the tens of millions of lines of code completely clueless. There is also an online code search engine that makes navigating the code orders of magnitude easier. I was able to find this piece of code, a good start:

A further look into BoundsAnimator reveals that it has the member function SetAnimationDuration.

My first hypothesis was that if the animation is somehow laggy, making it shorter in duration should at least make the lag less noticeable.

A search in tab****.cc reveals that SetAnimationDuration is never called by the tab strip except with an argument of 0 (to disable rich animation), so the default duration is nearly always used, which is defined here:

So I changed 200 down to 100, and then to 60. Now the animations are blazingly fast (literally), but are they smoother? Lags became much less likely to occur and be noticed, but I could still find some from time to time. This fix is not satisfactory. Time to go deeper.

Second Attempt

A closer look at BoundsAnimator shows that it calls SlideAnimation under the hood:

SlideAnimation, in turn, inherits from LinearAnimation, where an interesting constant is defined:

The above screenshot basically captures everything I want to say. The LinearAnimation class uses the framerate passed into the constructor to calculate the timer interval at which it updates its internal state and notifies observers (that's the role BoundsAnimator plays). The observers then propagate the update to redraw the interface. Most calls to the LinearAnimation constructor omit the frame_rate argument, so kDefaultFrameRate is used. 60 appears to be a reasonable value at first glance, but it no longer is, given that I am using a 165 Hz monitor.

There is a common argument that human eyes cannot differentiate motion beyond 60 Hz and that high-refresh-rate monitors are just hype. I believed this for years, until I switched to a 90 Hz phone. When everything else is animating at 90 Hz, a 60 Hz animation stands out as clunky, not to mention at 120 Hz or 165 Hz. So here lies the problem: the Chromium UI animations appear to lag because the devs decided to hardcode a framerate ceiling into the rendering process. They took VSync to a whole other level.

(This realization might also explain an interesting phenomenon I discovered when I attempted to file a bug report myself a while ago. I used OBS to record the tab strip animations. The lag was obvious at recording time, but when I watched the recording it seemed less so. Perhaps that was because the screen capture was 60 FPS too? Or perhaps some magic in the video decoder makes 60 FPS content appear smoother?)

If this is indeed the root of the problem, then the solution is simple: just turn up kDefaultFrameRate. Don't forget to turn down the minimum timer interval in CalculateInterval accordingly, though, because as it stands (10000 microseconds) any framerate greater than 100 will effectively be capped at 100.
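The arithmetic behind that cap is easy to check. Paraphrasing the interval computation in Python (constants as described above; this is my paraphrase for illustration, not actual Chromium code):

```python
def calculate_interval_us(frame_rate: int, floor_us: int = 10_000) -> int:
    """Timer interval in microseconds for a target frame rate, clamped
    to a minimum interval -- i.e., a hidden maximum frame rate."""
    interval = 1_000_000 // frame_rate
    return max(interval, floor_us)

# With the 10 ms floor, 165 fps asks for ~6060 us but gets clamped to
# 10000 us, i.e., 100 fps. Lowering the floor lets the real rate through.
```

So raising kDefaultFrameRate alone isn't enough; without also lowering the floor, anything above 100 fps silently degrades back to 100 fps.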

This is literally a change to 3 lines of code, and yes, the lags are gone!

(So in some sense the "3x as smooth" in the title is clickbait. 165 / 60 = 2.75, so the framerate does approximately triple. The question is: is framerate a good measure of "smoothness"?)

Some Thoughts

I have had deep respect for the Chromium Project for years, so deep that I might as well call it worship. After all, it is a codebase of over 10M LoC (and growing), while I was still struggling to manage my personal projects, none of which exceeds 10k LoC. I've read many posts analyzing the elaborate architecture of Chrome, and many others calling Chromium an engineering masterpiece. It was with this mindset that, when I first observed this bug a year ago, I assumed there must be some seriously complicated regression deep inside Chromium's rendering engine, and that any fix would be highly nontrivial and beyond the understanding of us mortals.

I consider the great takeaway from this bug-fixing experience to be the demystification of the project. I downloaded the code; I found something I could fix; I fixed it. Locating the root of the problem wasn't hard. So I suppose the only justification for the devs not having fixed it is that the browser UI has a lower priority than the engine.

I wouldn't say my fix is elegant in any sense; all I can say is that it works well on my machine. It now seems that hardcoding the default framerate as a device-independent constant is fundamentally flawed. But how to improve on that? Should we query the monitor's maximum refresh rate whenever we play an animation? Should we cache the result somewhere? Should we make it configurable in chrome://flags? A general and sustainable fix is much more complicated than changing a few constants. I may be able to come up with one someday, but IAP is drawing to an end, and there are too many things for me to do. So for the time being, I chose to file a more technical issue on Chromium's issue tracker and leave it to the devs. Until they officially fix the bug, I will be using my own fork of Chromium.

Speaking of building Chromium on my own: a side benefit is that I can now enjoy the latest features of Chrome (I am already on M100) without worrying about Chrome's privacy controversies. I am pretty sure some telemetry modules still exist in Chromium (that's why ungoogled-chromium exists), but that's arguably much better than Chrome. I also leave the API keys empty, so nothing sensitive should be sent to Google -- perhaps? My guess is that it's now on par with unhardened Firefox.

By the way, I would like to briefly mention Firefox's approach to the UI problem. When investigating the bug, I was surprised to learn that Firefox's UI is web-based, i.e., it uses the same engine to render both web pages and the UI. Some very interesting things happen if you type chrome://browser/content/browser.xhtml into the Firefox address bar! I feel this is the better way to go, since building web UI is a mature craft, and optimizing your engine then also optimizes your UI. But I am not sure.

Being back in China recently, my thoughts turned to circumvention again. Most circumvention tools expose a SOCKS 5 inbound. SOCKS 5 is ubiquitous and works well, but since it operates at the transport layer it requires each application's cooperation, which makes a system-wide proxy difficult. The mature solution today is tun2socks. There are many tun2socks implementations, all built on the same principle: create a virtual NIC (a TUN device), recognize the TCP and UDP flows inside the IP packets sent to that NIC, and forward them to the proxy. For this, tun2socks also needs the operating system's cooperation: we have to edit the routing table so that the TUN device created by tun2socks becomes the default gateway. This is the hard part of using tun2socks, and exactly the step most tutorials gloss over; most people simply use applications like Clash that configure tun2socks automatically. As an Ubuntu user with a somewhat unusual proxy protocol, Clash wasn't enough for me, and configuring raw tun2socks was one pitfall after another. After a week of tinkering I finally distilled a few automation scripts; this post records what I learned.

Understanding Linux routing tables and routing rules

A Linux routing table can be inspected with ip route show, for example:

default via 192.168.100.1 dev wlp4s0 proto dhcp metric 600
169.254.0.0/16 dev wlp4s0 scope link metric 1000 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.100.0/24 dev wlp4s0 proto kernel scope link src 192.168.100.13 metric 600 
198.18.0.0/15 dev tun0 proto kernel scope link src 198.18.0.1

These five lines carry quite a bit of information. How do we read them?

Start with the first column. The OS matches an IP packet's destination address against the subnets in the first column: if a row matches, the packet is forwarded according to that row's rule; if nothing matches, the packet is forwarded according to the default row. Generally, the subnets listed in a routing table are the LANs the machine sits on: above, 192.168.100.0/24 is my home Wi-Fi network and 172.17.0.0/16 is the Docker containers' internal network. Only for these local subnets does the OS know precisely how to deliver a packet; in all other cases it hands the packet to the upstream router, which is what the default row expresses.

After the first column, each row of the routing table consists of a series of attributes:

  • dev xxx means the packet should leave through the interface named xxx.
  • via xxx means the packet should be sent to the gateway/router at address xxx, which will route it further. This attribute is most common in the default rule; in via 192.168.100.1, for instance, 192.168.100.1 is my home Wi-Fi's gateway address.
  • scope link means the destination is directly reachable at the link layer, i.e., no further forwarding through other devices is needed. This attribute doesn't matter much for configuring tun2socks.
  • src xxx means the packet's source address should preferably be xxx. This only takes effect when a packet does not yet have a source address (i.e., it has just been created); if a packet already has one, the rule won't change it.
  • linkdown means the interface for this row is offline.
  • proto xxx records where this routing rule came from. For example, proto kernel means the kernel added it, and proto dhcp means it came from DHCP.
  • metric xxx is this rule's priority: if several rows match a destination address, the system forwards according to the row with the smaller metric.

With this, we can basically make sense of a routing table from the output of ip route.
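To make the matching rule concrete, here is a small Python sketch (an illustration of the idea, not how the kernel implements it; metric and linkdown are ignored) that picks a route from the example table above, preferring the most specific matching subnet and falling back to default:

```python
import ipaddress

# The example table from above, as (subnet, interface) pairs.
ROUTES = [
    ("169.254.0.0/16", "wlp4s0"),
    ("172.17.0.0/16", "docker0"),
    ("192.168.100.0/24", "wlp4s0"),
    ("198.18.0.0/15", "tun0"),
]

def pick_route(dst: str) -> str:
    """Return the egress interface for dst: the most specific matching
    subnet wins; if nothing matches, fall back to the default route."""
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(net), dev) for net, dev in ROUTES
               if addr in ipaddress.ip_network(net)]
    if not matches:
        return "default via 192.168.100.1 dev wlp4s0"
    # Longest prefix = most specific subnet.
    net, dev = max(matches, key=lambda m: m[0].prefixlen)
    return dev

print(pick_route("172.17.0.5"))  # docker0
print(pick_route("8.8.8.8"))     # falls through to the default route
```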

Unfortunately, Linux has more than one routing table. The table shown above is only the main table (table main). There are two more by default: the local table, maintained automatically by the kernel and usually left alone, and a default table that is empty by default. With several routing tables, how does Linux know which one to consult? That is decided by routing rules, which can be inspected with ip rule show. The default rules are:

0:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default

The number before the colon is a rule's priority, an integer from 0 to 32767, with smaller values taking precedence. from all means the rule matches packets from any source address (all can be replaced by a specific IP or subnet, and there is an analogous to all condition), and lookup local means "consult the local routing table." So the ip rule output above translates into plain language as:

  1. First consult the local routing table; if a row matches, follow it;
  2. If nothing in local matches, consult the main table;
  3. If nothing in main matches either, consult the default table;
  4. If even default has no match, the packet is toast.

These are Linux's default routing rules. Crystal clear (surely).

Now that you have a first grasp of Linux routing rules, let's do a small exercise!

Give it a try: configuring the routing table for tun2socks

Suppose we have a TUN device called tun0 with IP 198.18.0.1. How do we modify the routing table so that all traffic is forwarded to it?

The first thing that comes to mind is the default row: since default used to forward all non-LAN traffic to the router, it can just as well forward it to tun0, so we run:

ip route add default via 198.18.0.1

To be safe, we can delete the original default (for convenience, the gateway IP and dev here are from my own setup; adjust them to your situation):

ip route del default via 192.168.100.1 dev wlp4s0

But doing only this causes a problem:

  1. If the SOCKS 5 inbound is not on this machine, the proxy requests emitted from tun0 will now loop right back into tun0: an infinite loop.
  2. Even if the SOCKS 5 inbound is local, its next hop is overseas, and the proxy requests toward that node will also be routed back into tun0. Still an infinite loop.

How do we solve this? We need to make sure the packets emitted by the proxy still take the original path. We store the original path in the default table:

ip route add default via 192.168.100.1 dev wlp4s0 table default

Note the table default at the end, which specifies the routing table to modify; without it, table main is assumed.

Next, configure routing rules so that the proxy's packets take the original path.

# 192.168.100.2 is my LAN IP; this rule covers the case where the SOCKS 5
# inbound is not on localhost. (In that case, bind the egress device with
# tun2socks' -interface flag; tun2socks will then send everything destined
# for the SOCKS 5 server through that device, so those packets' source IP
# is naturally the LAN IP.)
ip rule add from 192.168.100.2 lookup default
# Alternatively, if the SOCKS 5 inbound is on localhost but you know the
# remote node's exact IP, this ensures connections to the remote node
# always go direct.
ip rule add to <proxy server IP> lookup default

By default, the rules we just added take higher priority than the existing ones (except those of the local table). All routing rules are now:

0:  from all lookup local
32764:  to <proxy server IP> lookup default
32765:  from 192.168.100.2 lookup default
32766:  from all lookup main
32767:  from all lookup default

Now, for every packet whose source IP is known to be 192.168.100.2, or which is destined for the remote node, Linux first consults the default table, whose single rule makes sure these packets go straight to the original gateway.

At this point, start tun2socks and enjoy the system-wide proxy! Of course, after shutting tun2socks down, remember to restore the routing table with analogous commands (essentially swapping add and del).

Going further: iptables and ipset

The configuration above is serviceable but quite limited:

  1. What if I have more than one remote node, or the nodes change dynamically with a subscription, so that adding to <remote node IP> lookup default one by one becomes a chore?
  2. What if I run a local v2ray configured for direct connections to domestic addresses, and route all local traffic through tun2socks into v2ray's SOCKS 5 inbound? How do I then make sure v2ray's outgoing traffic doesn't run into tun2socks again and loop? Note that because of the direct-connection rules, v2ray's outbound destinations can be the remote nodes plus any domestic IP, which is hard to capture with routing rules.
  3. What if I want domestic-address splitting without auxiliary software like v2ray?

To solve these problems we need iptables. Iptables is nominally Linux's firewall system, but it is in fact very flexible, and like the routing table it is a core part of Linux's network-layer routing machinery. Their relationship is shown by the green part of the figure below:

(Figure: the Netfilter packet-flow diagram)

Iptables has five built-in chains, which can filter packets, translate addresses, and mark packets. The five chains work as follows:

  1. Packets arriving at the machine first pass through the PREROUTING chain for transformation/filtering; only then does the routing table dispatch them either to the local machine or onward.
  2. Packets to be forwarded pass through the FORWARD chain.
  3. Packets destined for local processes then pass through the INPUT chain.
  4. When a local process is about to send a packet, a preliminary route is first determined from the routing table, and the constructed IP packet then passes through the OUTPUT chain.
  5. Packets from FORWARD and OUTPUT then converge, and Linux consults the routing table a second time to determine the final route. For locally generated packets, if the final route differs from the one determined before OUTPUT, the new route wins (but the source IP chosen for the original route is not rewritten -- another pitfall).
  6. After routing is complete, all packets pass through the POSTROUTING chain for a final transformation/filtering before actually leaving.

So if we need to pick out the packets emitted by the proxy application, the logic belongs on the OUTPUT chain. Iptables cooperates with ip rule as follows: iptables stamps matching packets with a specific fwmark, and ip rule matches on that fwmark to steer them. Hence (I mark the packets meant for direct connection with fwmark 23333; the number is arbitrary):

iptables -t mangle -A OUTPUT <条件> -j MARK --set-mark 23333
ip rule add fwmark 23333 lookup default

-t mangle puts the rule into the mangle table's OUTPUT chain. Each iptables chain participates in three or four tables with different duties: the nat table handles address translation, the filter table handles packet filtering, and the mangle table used here handles transformations such as marking packets.

How do we write the iptables match? That depends on the characteristics of the proxy application.

Filtering traffic from a specific user or PID

On my machine, for instance, the proxy is run by systemd as the nobody user, so with iptables' owner module we can write:

iptables -t mangle -A OUTPUT -m owner --uid-owner nobody -j MARK --set-mark 23333

Besides --uid-owner there are also --pid-owner and --gid-owner; adjust to fit the characteristics of your own proxy process.

Filtering traffic to specific destinations (ipset)

If all you want is direct connections to domestic addresses, you can also filter traffic by destination subnet. Domestic address space is extensive (typically 6000-8000 IPv4 subnets), far too many to enter into iptables one rule at a time. This is where the ipset tool comes in.

First create a set with ipset:

# cn4 is the set's name; hash:net is the underlying data structure.
# hash:net creates one hash table per distinct prefix length, so the
# lookup time grows linearly with the number of distinct prefix
# lengths, which suits our case well (IPv4 has at most 32 distinct
# prefix lengths anyway)
ipset create cn4 hash:net
ipset add cn4 1.0.1.0/24
ipset add cn4 1.0.2.0/23
...

This part lends itself to scripting. How to obtain the list of all domestic IP subnets and automate the insertion is beyond the scope of this post.

Then add the following iptables rule:

iptables -t mangle -A OUTPUT -m set --match-set cn4 dst -j MARK --set-mark 23333

--match-set cn4 dst matches every packet whose destination address is in the cn4 set.

A fix: changing the IP address

Lately my UROP has had me working on generative models. As a complete beginner in machine learning I knew next to nothing, only vaguely remembering that generative models come in three major schools: VAE, GAN, and the recently popular diffusion models. GAN is the most intuitive of the three, and implementing one does not demand much mathematics; but even a first understanding of VAEs and diffusion models requires some mathematical background. As someone who won't take his first probability course until next semester, I found the reading rough going. After a week of grinding through the papers, I have mostly figured the two models out. This post records my understanding of VAE, drawing mainly on Tutorial on Variational Autoencoders.

The basic idea

VAE is a generative model. What is a generative model? At its core, a generative model samples from a probability distribution. Such sampling is not hard when we explicitly know the target distribution's PDF or CDF. In many real problems, however, the variable to sample is very high-dimensional and the distribution itself cannot be written down explicitly; these limitations are the fundamental reason we need today's intricate generative models.

VAE (like GAN) rests on one assumption: if the complexity of the sample distribution comes from intricate correlations between dimensions, then the number of truly meaningful dimensions should be smaller than the nominal dimensionality of the sample space. Complex, high-dimensional samples can therefore be seen as determined by relatively simple, low-dimensional latent variables. We may as well let these latent variables \boldsymbol z follow a standard normal distribution. Generating the latent variables is then trivial, and the problem becomes finding a mapping from latent variables to target samples, \boldsymbol x=f(\boldsymbol z). More generally, f need not be deterministic, so what we are really after is P(\boldsymbol x|\boldsymbol z). If we assume P(\boldsymbol x|\boldsymbol z) has a fixed form (say, normal), we can equivalently say we seek a function p(\boldsymbol z) mapping \boldsymbol z to the parameters of the distribution P(\boldsymbol x|\boldsymbol z). In a machine-learning setting the form of p is fixed, and what we actually search for, or learn, is a set of parameters \boldsymbol \theta. What makes a p_{\boldsymbol \theta} good? We want our model to generate data resembling the training set; mathematically, for a training sample \boldsymbol x, we want to maximize the model's likelihood, P(\boldsymbol x|\boldsymbol \theta).

A remark on likelihood versus probability seems in order here. On my first read of the paper, the similar notation misled me into assuming they were the same thing. Probability is a function of outcomes relative to a fixed hypothesis/distribution; likelihood is a function of hypotheses/distributions relative to a fixed outcome. Outcomes are mutually exclusive, so the probabilities of all outcomes must sum to 1; likelihoods need not. In generative modeling the training samples are fixed while the model varies, so "likelihood" is the right word.

So how do we compute P(\boldsymbol x|\boldsymbol \theta)? By definition it is a marginalization. Let p_{\boldsymbol \theta, \boldsymbol z}(\boldsymbol x) denote the PDF of the distribution given by p_{\boldsymbol \theta}(\boldsymbol z); then

P(\boldsymbol x | \boldsymbol \theta)=\int P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta)P(\boldsymbol z)\,d\boldsymbol z = \mathbb{E}_{\boldsymbol z}\bigl[P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta)\bigr] = \mathbb{E}_{\boldsymbol z}\bigl[p_{\boldsymbol \theta, \boldsymbol z}(\boldsymbol x)\bigr].
(Here we assume \boldsymbol z is independent of our parameters, since, as noted above, we may regard it as sampled from the standard normal distribution.)

Unfortunately, the identity above is a correct platitude. Note that \boldsymbol x and \boldsymbol z are strongly coupled, so P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta) is sparse. If, in practice, we estimate \mathbb{E}_{\boldsymbol z}\bigl[P(\boldsymbol x | \boldsymbol z,\boldsymbol \theta)\bigr] by random sampling, the \boldsymbol z we draw will most likely contribute nothing meaningful to the estimate, rendering it utterly unreliable.

To fix this, we must make sure the \boldsymbol z we draw are all representative, so we can no longer sample blindly from the standard normal distribution. What should we do instead?

Information entropy and KL divergence

Before presenting VAE's solution, this section briefly reviews the relevant background.

Consider a question: how do we quantify the "surprise" s(X) of the outcome X of a random event? The surprise should satisfy three requirements:

  • Something certain to happen is not surprising at all: P(X)=1\rightarrow s(X)=0.
  • The less probable an outcome, the more surprising it is, so s(X) is decreasing in P(X).
  • For the outcomes of two independent events, the total surprise should be the sum of the individual surprises: s(X\cap Y)=s(X)+s(Y).

Shannon showed that the only s(X) satisfying these three constraints is s(X)=-\log P(X) (up to a positive constant factor), which matches intuition. Notice also that "surprise" is, in a sense, information content: the totally unsurprising conveys no information, while the rarer an event, the more informative it is. Shannon accordingly defined the information content of a random event as the expectation of the surprise/information over all its outcomes, i.e.

\int -P(X)\log P(X)\,dX.
This expression is the famous information entropy -- it looks the way it does for a reason!
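As a quick numerical check of these definitions (a sketch using base-2 logarithms, so the unit is bits):

```python
import math

def surprise(p: float) -> float:
    """Surprise / information content of an outcome with probability p."""
    return -math.log2(p)

def entropy(dist: list[float]) -> float:
    """Expected surprise over all outcomes of a discrete distribution."""
    return sum(p * surprise(p) for p in dist if p > 0)

# A certain event carries no surprise; a fair coin carries exactly 1 bit;
# a biased coin carries less, since its outcome is less surprising on average.
print(surprise(1.0))        # 0.0
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]) < 1.0)  # True
```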

Next, consider another question: suppose a random event's real-world outcome distribution is P, but we have a theory claiming it is Q. How do we quantify the gap between theory and reality?

To someone who believes theory Q, the information content of outcome X is -\log Q(X). Objectively, outcomes occur with the probabilities described by P, so the average information this event carries for that person is

\mathbb E_{X\sim P}\bigl[-\log Q(x)\bigr]=\int -P(X)\log Q(X)\,dX.
The difference between this and the event's actual entropy can be viewed as a kind of "distance between theory and reality," called the relative entropy; the more common name is the Kullback-Leibler divergence:
\int -P(X)\log Q(X)\,dX-\int -P(X)\log P(X)\,dX = \int -P(X)\log \left(\frac{Q(X)}{P(X)}\right)\,dX = D_{\text{KL}}(P\|Q).
The passage above hides an assumption: the entropy computed under Q is never smaller than the event's actual entropy -- otherwise the difference would not behave like a distance. Is that really so? In other words, is the KL divergence always non-negative? Noting that \log y\le y-1 \Rightarrow -\log y\ge 1-y,
\begin{aligned}
D_{\text{KL}}(P\|Q) &= \int -P(X)\log \left(\frac{Q(X)}{P(X)}\right)\,dX\\
&\ge\int -P(X)\left(\frac{Q(X)}{P(X)}-1\right)\,dX\\
&=\int P(X)\,dX-\int Q(X)\,dX\\
&=1-1=0
\end{aligned}
So KL divergence, as a kind of "distance," is indeed intuitively reasonable.
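A tiny numerical check of both properties just discussed, non-negativity and the asymmetry between P and Q (a sketch over discrete distributions, in nats):

```python
import math

def kl(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) for discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]

print(kl(P, P))                         # 0.0: zero distance to itself
print(kl(P, Q) >= 0 and kl(Q, P) >= 0)  # True: always non-negative
print(kl(P, Q) == kl(Q, P))             # False: not symmetric
```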

Before reading the VAE paper I had seen the entropy and KL divergence formulas more than once; I never dug into them, figuring I would just take them on faith since they weren't hard to memorize. Reading the VAE paper, though, blind faith in the formulas finally felt too shaky, so I went looking for explanations, and it turns out that if one does not insist on full rigor, both are quite understandable; hence this record. The explanation also shows why KL divergence is asymmetric: P and Q simply play different roles in it.

Back to VAE

To approximate the sampling of \boldsymbol z more accurately, we sample from a distribution Q that yields "representative \boldsymbol z." "Representative" can be read as: Q should approximate P(\boldsymbol z|\boldsymbol x, \boldsymbol \theta); in other words, the KL divergence between the two should be as small as possible. So consider D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr):

D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) = \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr]
Expanding P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta) with Bayes' rule:
\begin{aligned}
D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) &= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr] \\
&= \mathbb E_{\boldsymbol z\sim Q}\!\left[\log Q(\boldsymbol z)-\log \frac{P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)P(\boldsymbol z)}{P(\boldsymbol x|\boldsymbol \theta)}\right] \\
&= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)-\log P(\boldsymbol z)+\log {P(\boldsymbol x|\boldsymbol \theta)}\bigr].
\end{aligned}
Since \log {P(\boldsymbol x|\boldsymbol \theta)} does not depend on \boldsymbol z, it can be pulled out of the expectation; splitting the remaining terms,
\begin{aligned}
D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) &= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)-\log P(\boldsymbol z)+\log {P(\boldsymbol x|\boldsymbol \theta)}\bigr] \\
&= \log P(\boldsymbol x|\boldsymbol \theta) + \underbrace{\mathbb E_{\boldsymbol z\sim Q}\bigl[\log Q(\boldsymbol z)-\log P(\boldsymbol z)\bigr]}_{D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr)} - \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr].
\end{aligned}
Rearranging and using the non-negativity of KL divergence, we obtain a lower bound on the log likelihood:
\begin{aligned}
\log P(\boldsymbol x|\boldsymbol \theta) &= \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] - D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr) + D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) \\
&\ge \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] - D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr).
\end{aligned}
That is, as long as D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta)\bigr) is small enough, i.e., Q approximates P(\boldsymbol z|\boldsymbol x,\boldsymbol \theta) well enough, optimizing \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] - D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr) is approximately equivalent to optimizing \log P(\boldsymbol x|\boldsymbol \theta), and hence P(\boldsymbol x|\boldsymbol \theta). This bound is called the variational lower bound, and it is the VAE loss. The "variational" refers to our use of Bayes' rule above -- the technique is apparently known as the variational Bayesian method.

One question remains: we do not know what Q actually is, only that it should depend on \boldsymbol x. So we resort to ML's time-honored craft: when in doubt, slap a neural network on it. Assume Q(\boldsymbol z)=\mathcal N\bigl(\boldsymbol\mu(\boldsymbol x),\operatorname{diag}\boldsymbol\sigma^2(\boldsymbol x)\bigr) and let a neural network produce \boldsymbol \mu and \boldsymbol \sigma. Note that when Q(\boldsymbol z) and P(\boldsymbol z) are both normal (and we assume P(\boldsymbol z)=\mathcal N(0, I)), their KL divergence has a closed form:

D_{\text{KL}}\bigl(Q(\boldsymbol z)\|P(\boldsymbol z)\bigr)=\frac{1}{2}\left(\sum \boldsymbol\sigma^2(\boldsymbol x) + \bigl\|\boldsymbol \mu(\boldsymbol x)\bigr\|^2-\log \prod \boldsymbol \sigma^2(\boldsymbol x)-\dim \boldsymbol z\right)
This is a function of \boldsymbol\mu and \boldsymbol \sigma, so we can optimize them alongside \boldsymbol \theta.
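The closed form can be sanity-checked numerically (a sketch; mu and sigma2 here are plain vectors standing in for the network outputs):

```python
import math

def gaussian_kl(mu: list[float], sigma2: list[float]) -> float:
    """D_KL( N(mu, diag(sigma2)) || N(0, I) ), following the formula above."""
    return 0.5 * (sum(sigma2) + sum(m * m for m in mu)
                  - sum(math.log(s) for s in sigma2) - len(mu))

# Q equal to the prior N(0, I) gives zero divergence...
print(gaussian_kl([0.0, 0.0], [1.0, 1.0]))  # 0.0
# ...and any other Q gives a positive one.
print(gaussian_kl([1.0, -0.5], [0.5, 2.0]) > 0)  # True
```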

One technical problem remains: computing the objective involves sampling from Q, but sampling is a discrete operation that breaks backpropagation, so Q's parameters would not be updated correctly. The solution is called the reparameterization trick: sample from the standard normal distribution instead, then scale and shift the sample linearly into Q's distribution, which gives \boldsymbol \mu and \boldsymbol \sigma gradients. Mathematically, this replaces \mathbb E_{\boldsymbol z\sim Q}\bigl[\log P(\boldsymbol x|\boldsymbol z,\boldsymbol \theta)\bigr] in the loss with \mathbb E_{\boldsymbol \epsilon\sim \mathcal N(0,I)}\bigl[\log P(\boldsymbol x|\boldsymbol z=\boldsymbol \sigma (\boldsymbol x)\odot \boldsymbol\epsilon +\boldsymbol \mu(\boldsymbol x),\boldsymbol \theta)\bigr].
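A minimal sketch of the trick (pure Python, no autograd): instead of drawing z from N(mu, sigma^2) directly, draw eps from N(0, I) and shift/scale it, so that z becomes a plain arithmetic function of mu and sigma through which gradients could flow:

```python
import random

def sample_q_direct(mu, sigma):
    # Non-differentiable w.r.t. mu/sigma: the randomness "absorbs" them.
    return [random.gauss(m, s) for m, s in zip(mu, sigma)]

def sample_q_reparameterized(mu, sigma, eps):
    # z = sigma ⊙ eps + mu: the randomness lives entirely in eps,
    # and z depends on mu and sigma only through ordinary arithmetic.
    return [s * e + m for m, s, e in zip(mu, sigma, eps)]

mu, sigma = [1.0, -2.0], [0.5, 3.0]
eps = [0.0, 1.0]  # a fixed draw from N(0, I), for illustration
print(sample_q_reparameterized(mu, sigma, eps))  # [1.0, 1.0]
```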

A final question: in what sense is VAE an "auto-encoder"? Looking back at the derivation, the Q network maps a sample to a distribution over latent variables, which roughly resembles "encoding"; the P network maps a latent variable to a distribution over samples, which roughly resembles "decoding." The two networks are trained jointly, much like a traditional encoder-decoder model; hence the name. But "auto-encoder" is only a superficial resemblance: the encoder-decoder framing is a post-hoc reading, not the inspiration behind the derivation, and whereas an encoder-decoder is deterministic, both VAE networks actually output distributions.

Extension: VRNN

The mathematics of VAE is elegant, but there is a limitation: it can only model samples of fixed dimensionality. Vanilla VAE is helpless for modeling and generating variable-length sequences. To address this, Chung et al. published the Variational Recurrent Neural Network architecture in 2015. The paper is a fun read, but at bottom, VRNN embeds a VAE into every step of an RNN cell while conditioning all of our earlier derivations on the previous cell's hidden state.

Assume two fairly simple networks \varphi_{\boldsymbol x}(\boldsymbol x) and \varphi_{\boldsymbol z}(\boldsymbol z) that perform preliminary feature extraction on the input and the latent variable. Writing \boldsymbol h_{t-1} for the RNN's hidden state at the previous step, the network's own inference is:

\boldsymbol h_t = f\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_{t}),\varphi_{\boldsymbol z}(\boldsymbol z_{t}),\boldsymbol h_{t-1}\bigr).
Besides the most naive RNN cell, f can also be a variant such as LSTM or GRU; for LSTM, \boldsymbol h collectively denotes the hidden state and the cell state.

"Embedding a VAE" means that each step has three networks besides f: [\boldsymbol \mu_{\boldsymbol z, t}, \boldsymbol \sigma_{\boldsymbol z, t}]=q\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_t),\boldsymbol h_{t-1}\bigr) is the Q network above, the "encoder"; [\boldsymbol \mu_{\boldsymbol x, t}, \boldsymbol \sigma_{\boldsymbol x, t}]=p\bigl(\varphi_{\boldsymbol z}(\boldsymbol z_t),\boldsymbol h_{t-1}\bigr) is the P network above, the "decoder"; and [\boldsymbol \mu_{0, t}, \boldsymbol \sigma_{0, t}]=\varphi_{\text{prior}}(\boldsymbol h_{t-1}) gives the prior of \boldsymbol z -- we can no longer assume \boldsymbol z is always drawn from the standard normal distribution. As you can see, the only difference from before is that all three networks additionally take the previous step's hidden state. The VRNN loss is the sum of the per-step VAE losses:

\sum_{t=1}^T \biggl[\mathbb E_{\boldsymbol z_t \sim q\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_t),\boldsymbol h_{t-1}\bigr)}\bigl[\log P\bigl(\boldsymbol x_t\big |\boldsymbol z_t,\boldsymbol h_{t-1}\bigr)\bigr] - D_{\text{KL}}\Bigl(q\bigl(\varphi_{\boldsymbol x}(\boldsymbol x_t),\boldsymbol h_{t-1}\bigr)\|\varphi_{\text{prior}}(\boldsymbol h_{t-1})\Bigr)\biggr].

A few months after switching to Ubuntu, I discovered an amusing bug: at my screen resolution, the app grid layout is unstable! When the grid is summoned, it is initialized as 3\times 8, but once the mouse hovers over it or you swipe left and right, another code path decides it should be 4\times 6. As a result, browsing the app grid always "hitches": the transition is jarring and unpleasant to watch. After some digging, this appears to be a problem in the gnome-shell package, yet I found no related bug report online -- perhaps my situation is unusual? My fix is simple and crude: the app grid layout is defined in /js/ui/iconGrid.js:

const defaultGridModes = [
    {
        rows: 8,
        columns: 3,
    },
    {
        rows: 6,
        columns: 4,
    },
    {
        rows: 4,
        columns: 6,
    },
    {
        rows: 3,
        columns: 8,
    },
];

Simply deleting the last object, the one defining 3\times 8, solves the problem at its root!

Fixing the bug itself is easy; the trouble is that gnome-shell updates rather frequently, and Debian's package-building machinery is anything but simple. After every gnome-shell update, I have to re-apply the change, rebuild, and reinstall -- tedious by hand. So today, on a whim, I wrote a script to automate the process.

#!/bin/bash

# Delete everything except this script itself
find . \! -path './patch_gnome.sh' -delete
# Fetch the gnome-shell source
apt source gnome-shell
# Enter the source root; its name contains the ever-changing version
# number, so it must be located dynamically
cd $(find . -maxdepth 1 -path "./gnome-shell-*" -type d -print)
# Turn the 3x8 layout into a copy of the 4x6 one (deleting it would work
# too, but this edit is easier to express with sed)
sed -i -e "s/rows: 3/rows: 4/" -e "s/columns: 8/columns: 6/" js/ui/iconGrid.js
# Add a changelog entry; this needs a manual press of Enter
dch -i "Remove 3x8 grid"
# Build and install
dpkg-buildpackage -rfakeroot -b
dpkg-buildpackage -rfakeroot -b
sudo apt install ../gnome-shell*.deb

Name this script patch_gnome.sh, drop it into an empty directory, and that's it: whenever gnome-shell updates, invoke it for a fully automated patch. Quick and painless.

The main gain from writing this script was getting comfortable with find and sed; baking the Debian package-build commands into a script also spares me from searching the relevant wiki every time I patch.

Lately a personal project has led me to study TeX in depth. The TeXbook and TeX by Topic both document TeX's mechanisms thoroughly, yet reading them still left me seeing flowers through fog, knowing the what but not the why. To truly understand a program's behavior, there is no substitute for reading its source, and fortunately Knuth published TeX's source from the very start, with superb commentary. Still, forty-year-old, museum-grade code is anything but easy to digest, so I decided to open a post for notes. It follows the chapter order of TeX: The Program, summarizing each chapter briefly and skipping the details; its main purpose is to jog my own memory. Of course, since this is on my blog, I also hope the notes help fellow enthusiasts curious about TeX's internals.

This post assumes a fairly solid grasp of TeX's basic mechanisms (read: you have been through The TeXbook). In other words, I will not dwell on topics like how to define macros, the difference between \edef and \def, what horizontal mode is, or what glue is. Knuth spent an entire book explaining these concepts; you expect a dabbler like me to cover them in one blog post? Dream on.

TeX's curious basic architecture

TeX is written in the WEB language. WEB follows Knuth's paradigm of "literate programming," interleaving documentation and code; you can see it as a forerunner of the Jupyter notebooks now ubiquitous in academia (one big difference being that a notebook today rarely exceeds 20 cells, while TeX has more than 1300...). The documentation part is based on TeX; the code is Pascal-like ("like" because Pascal's standardization was not yet finished at the time). From one tex.web source, tangle extracts the code into a Pascal source used to build the program itself, while weave extracts the documentation into a TeX source that compiles into the book TeX: The Program. Below, I use notation like TTP100 for section 100 of TeX: The Program.

For readability, each "cell" of TeX's source holds at most a page of code (usually under 20 lines); for longer functions, Knuth moves parts of the logic into other cells and inserts references in place. tangle's code extraction is essentially a reordering of the code blocks according to these references. Knuth also defines some macros through WEB, whose expansion is likewise performed by tangle at code-generation time.

At a glance TeX's source is highly modular, but in practice not quite. A macro defined in a later chapter may merely alias one from an earlier chapter, so modifying one module can have unexpected consequences; or reusable functions are abstracted out, but their interfaces so constrain the implementation that the flexibility is illusory. Modular design implies abstraction, and abstraction always has a cost. In TeX's era, system resources were scarce, language standards absent, and infrastructure of all kinds immature; efficiency and portability came first, so the design is understandable -- it is merely hard on latecomers like me, spoiled by modern development practice.

TeX's curious string handling

TeX implements its own highly distinctive scheme for dynamically stored strings. In short, TeX has one huge statically allocated char array str_pool, plus an array of increasing indices str_start. TeX assigns each string a number; string number s lives in str_pool between indices str_start[s] and str_start[s + 1]. The following C pseudocode converts a TeX string into a C-style string:

char *texstr_to_cstr(uint16_t s) {
    size_t len = str_start[s + 1] - str_start[s];
    char *ret = malloc(len + 1);
    memcpy(ret, str_pool + str_start[s], len);
    ret[len] = 0;
    return ret;
}

At run time, TeX maintains two variables, pool_ptr and str_ptr: the smallest available index into str_pool and the smallest available string number, respectively. Creating a new string looks roughly like this:

uint16_t cstr_to_texstr(char *s) {
    size_t len = strlen(s);
    assert(str_start[str_ptr] == pool_ptr);
    memcpy(str_pool + str_start[str_ptr], s, len);
    str_start[++str_ptr] = pool_ptr += len;
    return str_ptr - 1;
}

In reality TeX builds new strings character by character, with pool_ptr advancing step by step, which involves plenty of tricks (e.g., appended characters can still be flushed back out), omitted here. TeX has two string-comparison routines: str_eq_buf compares an internal string with a char*; str_eq_str compares two internal strings.
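To make the layout concrete, here is a little Python model of the pool (an illustration of the idea, not Knuth's code): all characters live in one flat buffer, and str_start[s] .. str_start[s+1] delimits string number s:

```python
str_pool: list[str] = []    # the flat character buffer
str_start: list[int] = [0]  # str_start[s] .. str_start[s+1] delimits string s

def make_str(s: str) -> int:
    """Append a string to the pool and return its number."""
    str_pool.extend(s)
    str_start.append(len(str_pool))
    return len(str_start) - 2

def get_str(n: int) -> str:
    """Recover string number n from the pool."""
    return "".join(str_pool[str_start[n]:str_start[n + 1]])

a = make_str("hbox")
b = make_str("vbox")
print(get_str(a), get_str(b))  # hbox vbox
print(str_start)               # [0, 4, 8]
```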

Moreover, every string literal in TeX's source is extracted at WEB-tangle time into a compiled string pool that seeds str_pool, leaving behind only the corresponding string's number. String literals in the source are therefore actually of integer type.

TeX's curious arithmetic

Again, blame the era: IEEE 754 did not yet exist when TeX appeared, and floating-point arithmetic behaved differently across machines and compilers. To sidestep this nondeterminism, TeX performs most of its computation on fixed-point numbers of its own definition, which Knuth calls scaled numbers. A scaled number is 4 bytes long with 2^{-16} as its smallest unit; in other words, reading a scaled number directly as an int32_t yields 2^{16} times its value. Platforms then also disagreed on integer division and modulo for negative operands, so outside a few non-core algorithms, TeX checks the operands' signs before dividing or taking a modulus.

TeX's fixed-point routines are full of jaw-dropping tricks. For example, the following function approximates 100(t/s)^3 (the comments are mine; the original does not explain where the magic numbers 7230584 and 1663497 come from).

typedef int32_t scaled;
const int32_t inf_bad = 10000;
int32_t badness(scaled t, scaled s) {
    if (t == 0) return 0;
    if (s <= 0) return inf_bad;
    int r; // approximates 2⁶ * ³√100 * t / s
    // 297³ ≈ 100 * 2¹⁸
    // if t > 7230584 then t * 297 > 2147483647 and overflows
    if (t <= 7230584) r = (t * 297) / s;
    // here t >= 7230585; if s / 297 < 1663497 / 297 = 5601 then
    // t / (s / 297) >= 1291, which (see below) is sure to return inf_bad;
    // division is slow, so skip it when the exact result doesn't matter
    else if (s >= 1663497) r = t / (s / 297);
    else r = t;
    // if r > 1290 then r * r * r + (1 << 17) overflows
    if (r > 1290) return inf_bad;
    return (r * r * r + (1 << 17)) >> 18;
}

As you can see, the code scrutinizes every possibility of overflow, computes everything in integers to avoid floating-point nondeterminism, and the condition s >= 1663497 even skips a costly division when possible -- quintessential Knuth.
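Translated literally into Python (a sketch for probing the function; `//` matches Pascal's div here since all operands on these paths are non-negative), badness(t, s) can be checked against the 100(t/s)^3 it approximates:

```python
INF_BAD = 10000

def badness(t: int, s: int) -> int:
    """Approximate 100 * (t/s)^3 in pure integer arithmetic."""
    if t == 0:
        return 0
    if s <= 0:
        return INF_BAD
    if t <= 7230584:
        r = (t * 297) // s      # 297^3 ≈ 100 * 2^18
    elif s >= 1663497:
        r = t // (s // 297)
    else:
        r = t
    if r > 1290:
        return INF_BAD          # r^3 + 2^17 would overflow 32 bits
    return (r * r * r + (1 << 17)) >> 18

print(badness(100, 100))  # 100: t/s = 1 gives "badness 100"
print(badness(200, 100))  # 800 ≈ 100 * 2^3
print(badness(10**7, 1))  # 10000: clamped at inf_bad
```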

TeX stores lengths internally in pt, as scaled numbers naturally. 1 inch is 72.27 pt, so the smallest unit of any TeX length is \frac1{72.27}\cdot2^{-16}=\frac{1}{4736286.72} inch.
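In concrete numbers (a quick check of the unit arithmetic):

```python
# A scaled length counts in units of 2^-16 pt ("scaled points", sp).
SP_PER_PT = 1 << 16          # 65536 sp in a point
PT_PER_IN = 72.27            # TeX's definition of an inch

one_inch_sp = PT_PER_IN * SP_PER_PT
print(one_inch_sp)           # ≈ 4736286.72 sp per inch

# So the smallest representable length is 1/4736286.72 inch,
# a few nanometers -- far below anything a printer can resolve.
print(0.0254 / one_inch_sp)  # ≈ 5.4e-09 m
```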

TeX has one more numeric type, glue_ratio, equivalent to a 32-bit float. It has exactly one use: after a list is set into a box, it stores the stretch/shrink ratio of the glue inside. Since the box's size has already been decided by then, precision issues here cannot affect TeX's behavior at all.

typedef float glue_ratio;

TeX's curious memory allocation

Unsurprisingly, TeX does not call into the system for memory allocation either: it carries one large array mem on which it performs pseudo-dynamic allocation (in that respect, today's competitive programmers carry on Knuth's legacy). The basic allocation unit is the 4-byte memory_word, which can be viewed as a union with six interpretations, as follows (names changed slightly from the original):

union memory_word { // assuming no memory-alignment padding
    int32_t int_; 
    scaled sc;
    glue_ratio gr;
    // h: halfword, q: quarterword
    struct { int16_t lh; int16_t rh; } hh;
    struct { int8_t b0; int8_t b1; int16_t rh; } qqh;
    struct { int8_t b0; int8_t b1; int8_t b2; int8_t b3; } qqqq;
};

mem is smaller than 2^{16} words, so a TeX pointer is a 16-bit unsigned integer.

typedef uint16_t pointer;

mem is split into two halves. The high-address region grows downward and stores one-word data structures; the low-address region grows upward (usually 1000 words at a time) and stores structures of at least two words. Their boundaries are marked by the globals hi_mem_min and lo_mem_max; if the two meet, memory is declared saturated.

The high region is maintained as a singly linked list: half of each unallocated word stores the address of the next unallocated word:

#define info(p) mem[p].hh.lh
#define link(p) mem[p].hh.rh

The global avail always points to the list head, so avail, link(avail), link(link(avail)), and so on enumerate all the free addresses. Allocation pops from the head, and deallocation pushes back onto the head; if the list runs dry, avail = --hi_mem_min.
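The one-word allocator is easy to model (a Python sketch of the idea, not Knuth's code): free words form a chain through link, avail points at the head, and get/free pop from and push onto that chain:

```python
MEM_SIZE = 8
link = [0] * MEM_SIZE   # link(p) of each free word
avail = 0               # head of the free list (0 = empty in this toy)
hi_mem_min = MEM_SIZE   # next untouched address; grows downward

def get_avail() -> int:
    """Pop a free one-word node, extending downward if the list is empty."""
    global avail, hi_mem_min
    if avail:
        p, avail = avail, link[avail]
    else:
        hi_mem_min -= 1
        p = hi_mem_min
    return p

def free_avail(p: int) -> None:
    """Push a node back onto the free list."""
    global avail
    link[p] = avail
    avail = p

a, b = get_avail(), get_avail()  # 7, 6: fresh words from the top
free_avail(a)
print(get_avail() == a)          # True: freed words are reused first
```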

The low region is maintained as a doubly linked list, a fairly standard technique described in TAOCP, so I will not belabor it (original: TTP124). One interesting point: the low-region allocator does not record the sizes of allocated structures; the size is a parameter of the deallocation function free_node(ptr, size).

TeX's curious data structures

As everyone knows, everything in TeX eventually becomes a box. The program distinguishes box types with a pair of type tags:

#define type(x) mem[x].qqh.b0
#define subtype(x) mem[x].qqh.b1

That information already takes up half a word. Observing that a box lives in exactly one list at any moment (Rust's ownership mechanism comes to mind), Knuth cleverly uses the other half word to store a box's successor: link(x) is the address of the box to process after box x. Linked lists truly get played to the hilt in TeX.

So the first word of every box is its "metadata" -- does that mean every box takes at least two words?

Not quite. Knuth considered TeX's most common box of all, the char box representing a single character, and noticed that if type(x) holds a char box's font number and subtype(x) its character, the whole char box fits into a single word!

#define font type
#define character subtype

With type repurposed, how do we tell whether a box is a char box or something else? Char boxes are always allocated in the high region, while all more complex boxes live in the low region, so testing whether the pointer is >= hi_mem_min suffices.

TeX is, one must say, penny-pinching in its memory use. Picking the smallest adequate type for each field's value range makes the structures compact, but it also forecloses extension. If pointers are only 2 bytes, what happens the day TeX needs to manage more than 65536 memory words? That is the historical limitation of TeX's source.

For an hbox/vbox, TeX defines:

struct hbox_node { // sizeof(hbox_node) == 7;
    memory_word meta; // type tag = 0
    scaled width;
    scaled depth;
    scaled height;
    scaled shift_amount; // drop for an hbox, shift right for a vbox
    uint8_t glue_sign; // stretch/shrink state of all the glue inside
    uint8_t glue_order; // order of the glue -- related to fil, fill, filll; not yet clear to me
    pointer list; // the box's contents
    glue_ratio glue_set; // stretch/shrink ratio of all the glue inside
};
enum { // values of glue_sign
    NORMAL = 0,
    STRETCHING = 1,
    SHRINKING = 2,
}; 

One more box type closely resembles hbox and vbox: the unset box, a box inside an halign or valign environment whose size is undetermined but which is guaranteed to align with other boxes (see TTP159).

Beyond the char box and the hbox/vbox above, TeX has many more box types, but their layouts follow the same basic ideas, so I will skip them too (see around TTP140). Finally, a note on the glue box: to save memory, a TeX glue box is essentially a reference-counted pointer to a separately allocated glue specification.

struct glue_node { // sizeof(glue_node) == 2;
    memory_word meta; // type tag = 10; subtype is interesting, see TTP149
    pointer glue_ptr; // points to the glue specification
    pointer leader_ptr; // if this glue is a leader, points to the leader's contents
};
struct glue_spec { // sizeof(glue_spec) == 4;
    uint8_t stretch_order; // order of the stretch
    uint8_t shrink_order;
    uint16_t ref_count; // reference count
    scaled width;
    scaled stretch;
    scaled shrink;
};
enum { // values of xxx_order
    // NORMAL = 0, already defined above
    FIL = 1,
    FILL = 2,
    FILLL = 3,
}

This is the first appearance of reference counting in TeX; it will return again and again.

TeX's curious command codes

TeX assigns each class of primitives a fixed command code. I won't go into detail; see TTP207.

Personally, I think centrally and explicitly assigning each primitive an integer command code hurts the system's extensibility: even if primitives are compared and matched internally by integer command code, the assignment of the codes should be done by the program. Admittedly, this is easy for me to say from the cheap seats. Whether these command codes are good in isolation is hard to judge; let me revisit after reading the later chapters.

TeX's curious mode nesting

The TeXbook mentions that TeX has six modes in all: (restricted) horizontal, (internal) vertical, and (display) math. In the program, the six modes map to six integer constants: horizontal, vertical, and display math get positive constants, and each corresponding restricted/internal mode gets the negation, so taking the absolute value yields a mode's basic type. Moreover, the three mode codes are not simply 1, 2, 3 but are spaced max_command apart (the largest command code from the previous section). That way, the sum of a mode code and a command code maps each (mode, primitive) pair to a distinct integer, which is of great use in TeX's "big switch."

At any moment, TeX's running state is a nesting of such modes, naturally maintained by a stack whose entries are:

struct list_state_record {
    int16_t mode; // not sure it's int16_t; the original Pascal picks types by value range, which C cannot express
    pointer head; // head of the list this mode is building
    pointer tail; // tail of the list this mode is building
    int prev_graf; // \prevgraf, i.e., the number of lines of the current paragraph already broken
    int mode_line; // the line on which the current mode started
    memory_word aux; // auxiliary information
};

TeX's curious dictionary: eqtb

We finally arrive at a giant data structure of central importance to TeX: the eqtb, short for "equivalents table," which stores the meaning of every symbol (control sequences, token registers, counters, and so on). eqtb is one huge array divided into six regions, each occupying a contiguous index range, from low to high:

  1. Definitions of active characters (cat code 13);
  2. Definitions of multi-letter control sequences (built on a hash table);
  3. Glue definitions, e.g., \baselineskip;
  4. Local variables, including catcodes, box registers, fonts, etc.;
  5. Integer parameters, e.g., the hyphenation penalty;
  6. Dimensions, e.g., \hsize.

Each element of the array is a memory_word, reinterpreted as:

struct equivalent { // layout identical to memory_word
    int8_t type; // kind of definition -- a macro? a primitive? a counter? a token list?
    int8_t level; // TeX's scope-nesting depth at definition time, based on these constants:
    // level_zero is undefined, level_one is global, level_one + 1 is one group below global, and so on
    int16_t equiv; // the definition: a pointer, or a font number, etc., depending on type
}

其中equivtype的意义都不难理解,但是level字段却出乎我的意料。按照当下的套路,如果需要存储或解析一个符号在多重嵌套作用域中的绑定,常见的方法是存一个帧栈并在解析符号时自顶向下查找。虽然这种思路很符合直觉,其弊端是需要在栈中为每一个作用域存储一个符号表。这在一切数据结构都要造轮子的时代显然增加了代码的复杂性。TeX的level字段则更像是把这些表拍扁成一块,在一定程度上节约了内存。但是现在想来,这样的设计无法独立存储或引用一个局部作用域——这对于lexical scoping是必须的。TeX的dynamic scoping究竟是喂屎还是特性一直有争论,在语言设计层面也有很多种解释。现在看来,除了语言设计的考量,内部的代码实现也不由得TeX不搞dynamic scoping。

Oh, one more note: in the last two regions of eqtb the data itself occupies all 4 bytes, squeezing out type and level. type is indeed unnecessary for an integer or a glue, but level has to be kept somewhere. TeX's solution is an auxiliary array xeq_level, indexed in parallel with eqtb's last two regions, dedicated to storing level.

The eqtb chapter of TTP runs a good 20+ pages and covers every internal parameter of TeX (they all have to be registered here); the details are many and miscellaneous. If you want an overview of every primitive type, or of all the internal glues and parameters, this chapter is a fine entry point.

TeX's curious hash table

eqtb is an array indexed by integers in a certain range; a control sequence is a string. Mapping strings into an integer interval naturally calls for a hash table. TeX's hash table is based on the coalesced-chaining ("coalescing list") algorithm from TAOCP, with this auxiliary structure:

struct {
    uint16_t text; // number of the string stored here, or 0 if this slot is empty
    uint16_t next; // next entry of the chain, or 0 if this slot is empty or ends a chain
} hash[/* indices from hash_base to undefined_control_sequence - 1,
    matching eqtb's region 2 */];
uint16_t hash_used; // a decreasing pointer: every slot of hash at index >= hash_used is occupied

Making each hash entry a linked list is routine; what makes coalesced chaining interesting is that the lists themselves are embedded in the table, saving still more memory. Lookup and insertion work as follows:

uint16_t query_or_insert(char *s, bool allow_insert) {
    uint16_t p = compute_hash(s) + hash_base;
    size_t len = strlen(s);
    while (true) {
        if (hash[p].text > 0
            && length(hash[p].text) == len
            && str_eq_buf(hash[p].text, s))  // found: return directly
            return p;
        if (!hash[p].next) { // reached the end of the chain without a match
            if (!allow_insert)
                return undefined_control_sequence;
            // insert s into the table, chained after p:
            // first find an empty slot by decrementing hash_used
            do { hash_used--; } while (hash[hash_used].text);
            // make hash_used the successor of p
            hash[p].next = hash_used;
            p = hash_used;
            hash[p].text = cstr_to_texstr(s);
            return p;
        }
        p = hash[p].next;
    }
}

The hash function is plain:

uint16_t compute_hash(char *s) {
    uint16_t hash = *s++;
    while (*s) {
        hash = ((hash << 1) + (*s++)) % HASH_PRIME;
    }
    return hash;
}

Knuth says theoretical analysis shows HASH_PRIME should be about 85% of the table size, giving fewer than two probes per successful lookup on average. He links the literature, but I have been too busy to read it, so for now I will trust Knuth's hand-waving. (I have long heard that hash tables are the hardest of the basic data structures to analyze; with little time on my hands lately, a truly solid pass will probably wait until I take 6.046.)

TeX's curious save stack

Two sections ago I wrote that TeX has a single global eqtb. How, then, does it give assignments local scope? The answer: when assigning inside a local scope, the old value is stored on a special stack, save_stack, so that when the current local scope ends, popping the stack undoes the assignments.

save_stack is defined as:

struct {
    uint8_t type; 
    uint8_t level; 
    uint16_t index; 
} save_stack[...];
uint16_t save_ptr; // save_stack[save_ptr - 1] is the top of the stack
enum { // values of type
    RESTORE_OLD_VALUE,
    RESTORE_ZERO,
    INSERT_TOKEN,
    LEVEL_BOUNDARY,
}

As you can see, save_stack holds 4 kinds of elements:

  1. RESTORE_OLD_VALUE covers an inner-scope assignment overriding an object already defined outside. index is the eqtb index of the overridden object, and when unwinding, the element beneath this one is interpreted as the overridden old value -- note that eqtb and save_stack elements share the memory_word layout, hence the odd trick of pushing eqtb contents onto the stack.
  2. RESTORE_ZERO covers an inner-scope assignment to an object undefined outside. index is the eqtb index of the locally created object, and on unwinding, eqtb[index] is reset to the same contents as eqtb[undefined_control_sequence] (in plain words, back to undefined).
  3. INSERT_TOKEN covers appending a token after the group ends, i.e., \aftergroup. index is the token to append (TeX encodes a token as a 16-bit integer, though I haven't covered that yet).
  4. LEVEL_BOUNDARY, as the name suggests, marks the "boundary" of each level. A scope can push an unpredictable number of entries onto save_stack; how does unwinding undo only the changes of the scope just exited? That is this type's job. LEVEL_BOUNDARY is pushed first upon entering a scope and serves as the stop marker while popping on exit: everything above the LEVEL_BOUNDARY belongs to the current scope. Its level field records the enclosing scope's "type," and its next field holds the stack index of the enclosing scope's LEVEL_BOUNDARY.

What is a scope type? Besides the plain {..}, many syntactic constructs create scopes, such as \hbox and \begingroup/\endgroup. Different scope types terminate under different conditions and require different cleanup on exit, so the type must be remembered. TeX keeps the current scope's group code in the global cur_group; the group codes of the enclosing levels live, as described above, in linked-list fashion in the LEVEL_BOUNDARY nodes on save_stack.

Another dirty trick: if the scope-creating command takes an argument, that argument is also pushed onto the stack, just below the new scope's LEVEL_BOUNDARY. For example, in \hbox to 100pt { ... }, the 100pt is pushed right before the LEVEL_BOUNDARY, so after the unwinding pops, the saved argument is immediately at hand. Knuth, stop shoving everything onto the stack already.

Note also that global assignments made with \global must not be undone. So before undoing, the level of the corresponding eqtb entry is checked: if it is level_one, the most recent assignment was global and the undo is abandoned.

On top of that, TeX has many reference-counted structures, such as token lists, glue specifications, and par shapes. These need extra handling during assignment and unwinding (old values must be reclaimed when due), which complicates the logic around save_stack a little.

TeX's curious token lists

The token list is another of TeX's core data structures. Before explaining token lists, we should first see how TeX represents a token. TeX divides tokens into two classes:

  1. A "character" token combines a catcode with a single character; TeX represents it as 2^8 \cdot\text{catcode}+\text{char code}.
  2. A control-sequence token is essentially a position in eqtb. TeX represents the control sequence at eqtb index p as \left(2^{12}-1\right) + p; the constant \left(2^{12} - 1\right) is called cs_token_flag.

Every token thus maps to a 16-bit positive integer. What follows is TeX's routine move: stuff the token into a memory word, whose leftover 16 bits point neatly at the next token's address. A token list is, at bottom, yet another singly linked list.
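The encoding is simple enough to model directly (a sketch; the constants follow the text above):

```python
CS_TOKEN_FLAG = (1 << 12) - 1  # 4095

def char_token(catcode: int, char: str) -> int:
    """A character token: 2^8 * catcode + char code."""
    return (catcode << 8) + ord(char)

def cs_token(eqtb_index: int) -> int:
    """A control-sequence token: cs_token_flag + eqtb index."""
    return CS_TOKEN_FLAG + eqtb_index

def is_cs(tok: int) -> bool:
    # Character tokens never exceed cs_token_flag (catcode < 16, char < 256).
    return tok > CS_TOKEN_FLAG

# Catcode 11 ("letter") 'a', and some control sequence at eqtb slot 42:
print(char_token(11, "a"))         # 2913 = 11 * 256 + 97
print(is_cs(cs_token(42)))         # True
print(is_cs(char_token(11, "a")))  # False
```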

Besides tokens from source files, TeX has two special tokens, match and end_match, used when storing macro definitions. Remarkably, TeX stores a macro's parameter text and its replacement text in one token list: end_match separates the parameter part from the body, and match stands for a parameter inside the parameter text (parameter numbers in the parameter text always increase, so storing match alone suffices, without the number). Confusing? Take the following macro definition as an example:

```tex
\def\a#1#2
```

Picking up where we left off: with Python + Selenium we freeloaded the otherwise paid 图档清理大师 (Docleaner) to great effect. But the code leaves plenty to optimize, hence this post.

A better method: skipping the browser

Only after shipping the code and posting about it on WeChat Moments did I start seriously asking myself a question:

Why not just reverse-engineer the online-demo API directly?

I had talked myself into a corner earlier.

Selenium emulation has the advantage of a natural implementation: you only need to know how a human uses the site, not its internal code or network traffic. The drawbacks are just as plain: launching a browser is substantial overhead, and its memory footprint makes parallel operation difficult.

Directly emulating the network requests, on the other hand, wastes no resources, has parallelization potential, and, touching fewer components than a browser, is more stable (provided the API doesn't change; Selenium, by contrast, survives internal API changes as long as the UI stays the same).

Shall we try?

Open the developer tools and locate the image-upload button; Firefox's debugger helpfully shows an event attached, so click through.

The site's developers thoughtfully skipped both minification and obfuscation, and even left comments in.

The main upload logic turns out to live in $('.exeDaqw').click(), so I searched the source for .exeDaqw and found this fragment:

Reading the fragment shows that the .exeDaqw element is rendered via layui.upload. I had never heard of it, but Layui looked like a widget library, and so it is. The url field pinned down the image-upload API endpoint, and the code that follows revealed the image-cleaning endpoint as well.

The network monitor in the developer tools shows the requests in flight:

After some fiddling and comparison, the reverse-engineered API can be written as:

import requests

def clean_doc_requests(images: Generator[Tuple[bytes, str], None, None]) \
        -> Generator[Image.Image, None, None]:
    """
    Cleans the scanned document pages using docleaner's online service.

    :param images: A generator yielding (an image as raw bytes, its extension as string).
    """
    for (image, ext) in images:
        # noinspection HttpUrlsUsage
        req = requests.post("http://service.docleaner.cn/attachCollect/upload",
                            files={"file": (f"image.{ext}", image)})
        data = {
            # Weird typo in the API.
            "paramers": "降噪,去斑点,去黑边,去背景,自动纠斜",
            "type": "image",
            "storePath": req.json()["data"]["storePath"],
            "userId": ""
        }
        # noinspection HttpUrlsUsage
        req = requests.post("http://service.docleaner.cn/exe/daqw", data=data)
        result = base64.b64decode(req.json()["data"]["outFileStr"])
        yield Image.open(io.BytesIO(result))

Compared with the earlier clean_doc_online, both the code and the logic are visibly leaner: that's the payoff of calling the API directly. I was lucky that this site's code and comments were clear enough to make recovering the API easy; for more convoluted sites, Selenium may remain the simpler, blunter tool.

Now for the interesting part: the code above runs slower than Selenium. The reason was roughly explained when introducing Selenium: Selenium spawns a separate browser process, so uploading an image or waiting for results does not inherently block the Python process, and with a carefully written loop, uploading one image can overlap with extracting the next page from the PDF. requests, being a Python library, blocks on each request; hence the slowdown.

What we lost on the swings we gain on the roundabouts: although our "optimization" made the code slower, it paved the way for the next one. With Selenium, running dozens of browsers at once is not only visually absurd but memory-hungry, making parallelism hard; issuing dozens of concurrent requests, though, could not be easier. So the next optimization is to speed things up with Python's multiprocessing library.

Speed: multiprocessing

The landscape of parallel Python is unlike anywhere else. The culprit is the GIL, which flattens multithreaded parallelism into mere concurrency. The only way around the GIL is multiprocessing, which drags in inter-process communication, synchronization, and a chain of other intimidating complexities (to me, at least).

But Python has one virtue: batteries included. A stroll through the official docs turned up the multiprocessing module in the standard library, entirely adequate for my needs here. Concretely, multiprocessing provides a process pool, Pool: the usual

output_iterator = map(function, input_iterator)

only needs to be rewritten as

from multiprocessing import Pool
with Pool(process_count) as p:
    output_iterator = p.imap(function, input_iterator)

and the mapped function is dispatched across multiple worker processes in the pool, with the results merged back in order, ultimately producing the same output as plain map.

To make our code fit this pattern, the earlier generator-wrapping-generator logic needs to be refactored into a single function:

def clean_single_page(args: Tuple[StrPath, int, int, bool, bool]) \
        -> Union[Image.Image, bytes]:
    """
    Cleans a single page.

    :param args: A tuple consisting of (in order):
        1. Path to the page (pdf or image),
        2. Index (image index or page index in PDF),
        3. DPI (-1 if an image is direcly supplied),
        4. Whether to perform OCR,
        5. Whether to actually clean the page.
    :return:  If OCR is enabled, an OCR-ed PDF in raw bytes, otherwise a PIL
        Image object representing the cleaned page.
    """
    page, idx, dpi, ocr, clean = args
    ...

As shown above, extracting a page image from the PDF, uploading it to Docleaner for cleaning, and OCR-ing the result can each be done independently per page, hence the merge into one function.

Merging the logic naturally merges the parameters, and since the function passed to map must take a single argument, all parameters have to be packed into one tuple and unpacked inside the function body -- the `page, idx, dpi, ocr, clean = args` line in the code above.

The function that merges all the pages back into a PDF also needs some simplification and changes:

def merge_to_pdf(pages: Iterable[Union[Image.Image, bytes]], output: StrPath):
    """
    Merges cleaned pages into a single pdf file.

    :param pages: An iterable yielding PIL Image objects (cleaned pages) or
        raw bytes (OCR-ed single-page PDFs).
    :param output: Path to the result pdf.
    """
    doc = fitz.Document()
    for page in pages:
        if isinstance(page, Image.Image):
            # noinspection PyUnresolvedReferences
            doc_page = doc.new_page(width=page.width, height=page.height)
            buffer = io.BytesIO()
            page.save(buffer, format="jpeg")
            doc_page.insert_image(fitz.Rect(0, 0, page.width, page.height),
                                  stream=buffer)
        else:
            page = fitz.Document(stream=page, filetype="pdf")
            doc.insert_pdf(page)
    doc.save(output)

Finally, as the earlier code should suggest, I have kept the logic deliberately decoupled so that the script can be used more flexibly where possible. For instance: could we read images directly via a glob pattern and clean those? Could we write the output straight into a directory, skipping the PDF merge, for consumption by other software? The multiprocessing work was a major refactor anyway, so I took the opportunity to tidy up these features and add them, together with the Pool-invoking code, to the CLI entry point:

@click.command()
@click.argument("input", type=click.Path())
@click.argument("output", type=click.Path())
@click.option("-d", "--dpi", default=300, help="DPI for rasterization.")
@click.option("--first-page", type=int, help="First page to convert/clean.")
@click.option("--last-page", type=int, help="Last page to convert/clean.")
@click.option("--ocr/--no-ocr", default=True,
              help="Whether to perform OCR during the conversion.")
@click.option("--clean/--dont-clean", default=True,
              help="Whether to clean pdf using docleaner's online service.")
def main(input: str, output: str, dpi: int,
         first_page: Optional[int], last_page: Optional[int], ocr: bool,
         clean: bool):
    if os.path.splitext(input)[1].lower() == ".pdf":
        # PDF mode
        assert os.path.exists(input)
        page_count = fitz.Document(input).page_count
        first_page = 0 if first_page is None else first_page - 1
        last_page = page_count if last_page is None else last_page
        args = zip(repeat(input), range(first_page, last_page),
                   repeat(dpi), repeat(ocr), repeat(clean))
    else:
        # Glob mode
        files = sorted(glob.glob(input, recursive=True))
        first_page = 0 if first_page is None else first_page - 1
        last_page = len(files) if last_page is None else last_page
        args = zip(files[first_page:last_page], repeat(0), repeat(-1),
                   repeat(ocr), repeat(clean))
    total = last_page - first_page
    with Pool() as p:
        results = tqdm(p.imap(clean_single_page, args), total=total)
        if os.path.splitext(output)[1].lower() == ".pdf":
            merge_to_pdf(results, output)
        elif not os.path.exists(output) or os.path.isdir(output):
            if ocr:
                raise RuntimeError("the OCR flag is useless because we are "
                                   "writing images (not PDF) to the output "
                                   "directory.")
            if not os.path.exists(output):
                Path(output).mkdir(parents=True)
            for (index, page) in enumerate(results):
                file_path = os.path.join(output, f"{index}.jpg")
                assert isinstance(page, Image.Image)
                page.save(file_path)
        else:
            raise RuntimeError("invalid output format.")

With multiprocessing in place, the script took 11 minutes 21 seconds to clean the entire test book, The Psychology of Attitude Change and Social Influence. Compared with the original 30-40 minutes, the optimization clearly paid off.

Size optimization

To be continued

The only remaining obvious problem: the cleaned output files are far too large.

The root cause is that the script rasterizes the PDF at a fixed resolution and cleans the resulting images, rather than extracting the images originally embedded in the PDF. The upside of the current approach is simplicity: we need not care how images are stored inside the PDF (testing showed that directly extracted images may come out sideways, meaning the PDF stores extra rotation metadata used at display time, which direct extraction loses). The downside is that the output image size is unrelated to the size of the originally embedded images, and the process can introduce artifacts.

Accordingly, there are two ways to attack the problem:

  1. Treat the cause: use somewhat more elaborate logic to faithfully extract the images embedded in the PDF. The result would then be roughly the same size as the original file, possibly even smaller (since cleaning turns many regions into pure white).
  2. Treat the symptom: use a better image compression algorithm.

I have not yet experimented with the first approach. As for the second, I immediately thought of Squoosh, Google's no-brainer image compression app. Squoosh offers a CLI, and its compression ratio can reach 30%-40% -- quite attractive. Digging further, the compression tool Squoosh uses by default is MozJPEG. This Mozilla-made encoder also seems to be the only viable one for our script (the other compressors in Squoosh output formats that cannot be embedded into a PDF, and take too long). However, when I tested the MozJPEG CLI on Windows, it failed to produce valid JPEG files. Still investigating.
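In the meantime, a rough sense of the symptom-treating approach can be had with Pillow alone -- quality and optimize are standard Image.save options for JPEG; the specific quality values below are just illustrative choices, not tuned settings:

```python
import io

from PIL import Image


def reencode_jpeg(image: Image.Image, quality: int = 60) -> bytes:
    """Re-encode a PIL image as an optimized JPEG and return the raw bytes."""
    buffer = io.BytesIO()
    image.save(buffer, format="jpeg", quality=quality, optimize=True)
    return buffer.getvalue()


# A synthetic grayscale gradient page; heavier settings compress it smaller.
page = Image.new("L", (256, 256))
page.putdata([(x + y) % 256 for y in range(256) for x in range(256)])
small = reencode_jpeg(page, quality=40)
large = reencode_jpeg(page, quality=95)
print(len(small) < len(large))
```

This is no match for a dedicated encoder like MozJPEG, but it drops straight into the existing merge_to_pdf buffer logic.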

The code is for learning purposes only; please do not share or redistribute it.

The code is here.

Background

Because physical books are a hassle to buy, heavy, and space-consuming (and, for some imported original-edition textbooks, expensive), I have long been a devoted e-book user. Broadly, the e-books one can find fall into these categories:

  1. Reflowable formats built from the original content: usually EPUB, MOBI, or TXT. Dynamic layout, comfortable to read at any screen size, with a table of contents, and text and images rendered locally and crisply. This is the ideal e-book format, but such books are mostly novels; I have almost never seen an academic textbook typeset this way.
  2. PDFs built from the original content: rendered directly from the book's source files. Fixed layout; without reflow software you need a large-screen reader to read comfortably, plus some margin cropping. Usually has a table of contents, with locally rendered, crisp text and images. These PDFs are a matter of luck. Some professors put books of this kind on their websites for students to download for free (like this one) -- saints, I call them.
  3. Polished scanned PDFs: scanned from the physical book, with the text OCR-ed and contour-smoothed into legibility; formulas may even be retyped and are also clear. Conscientious makers build the table of contents while they are at it. These PDFs (sometimes DJVU) are fairly common -- maybe six or seven out of ten on libgen. My copy of Introduction to Linear Algebra is of this kind.
  4. Bare-bones scanned PDFs: the maker did nothing beyond basic scanning. The text is skewed, pages have shadows, page edges creep into frame leaving black borders, and some even carry pencil notes. No OCR, and don't hope for a table of contents. This is the lowest grade of e-book, and mildly OCD people like me find it hard to bear.

Recently, a book I needed could only be found in the bare-bones version anywhere online, which gave me a real headache. So: is there a way to clean such a version up, nudging it toward a polished PDF?

Docleaner (图档清洁专家) is something I discovered online a few days ago. In testing it works remarkably well. The catch is that a licensed copy costs 60 RMB. 60 RMB isn't much money, but one has to have standards:

Conveniently, Docleaner offers an online trial whose features already cover my needs. But to keep freeloaders like me in check, the trial only accepts image uploads. For the 460-page textbook in my hands, doing this page by page by hand would be torture. What to do?

How about automating it with Python? Sounds like a chance to learn a lot.

Implementation

Our script needs to:

  1. Convert the input PDF into high-resolution images, page by page.
  2. Invoke the online trial's features programmatically in some way.
  3. Download the resulting images and stitch them back into a PDF.
  4. Optionally run OCR along the way via pytesseract.

PDF rasterization

If we only consider "convert a PDF to images", the most on-topic library is pdf2image, which drives the poppler toolchain under the hood. It was my initial choice too.

But on second thought, since we will later need image-to-PDF conversion and PDF merging as well, finding a more general-purpose PDF library is the wiser move.

After some searching -- skipping PyPDF2, which hadn't seen a release in five years -- I settled on the actively maintained PyMuPDF. PyMuPDF wraps MuPDF, a mature PDF library written in C, so both features and performance are assured. MuPDF's license grants free use to open-source projects, so I used it without worry.

Rasterizing a PDF with PyMuPDF looks like this:

import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Optional, Generator, Union

import fitz

StrPath = Union[str, os.PathLike]

def convert_pdf_to_images(pdf: StrPath, fmt: str, dpi: int,
                          output: Optional[StrPath] = None,
                          first_page: Optional[int] = None,
                          last_page: Optional[int] = None) \
        -> Generator[StrPath, None, None]:
    """
    Converts a pdf file to images. This is a necessary pre-processing step
    because docleaner online only accepts images as inputs.
    
    :param pdf: The path to the pdf file.
    :param fmt: Image file format. jpg is the fastest but not lossless; png is
        lossless but slow; tiff is theoretically the best but occupies a lot of
        disk space.
    :param dpi: Pixel density of the output images.
    :param output: The output directory of intermediate images.
    :param first_page: First page to convert (starting from 1, inclusive).
    :param last_page: Last page to convert (starting from 1, inclusive).
    :return: A generator yielding the paths to the images.
    """
    doc = fitz.Document(pdf)

    @contextmanager
    def normal_dir(dir_path):
        Path(dir_path).mkdir(parents=True, exist_ok=True)
        yield dir_path

    matrix = fitz.Matrix(dpi / 72, dpi / 72)
    first_page = 0 if first_page is None else first_page - 1
    last_page = doc.page_count if last_page is None else last_page
    with tempfile.TemporaryDirectory() if output is None else normal_dir(
            output) as path:
        for i in range(first_page, last_page):
            filename = os.path.join(path, f"{i}.{fmt}")
            # noinspection PyUnresolvedReferences
            doc[i].get_pixmap(matrix=matrix).save(filename)
            yield filename
        if output is None:
            # Yield an empty string if we are using a temporary directory;
            # without this, the temporary directory would be cleaned up the
            # moment the last filename is yielded, before the caller has done
            # anything with that file. Yielding an empty string keeps the
            # TemporaryDirectory object alive a little longer, solving the
            # problem.
            yield ""

At first glance this code is rather involved; the key points:

  1. The code is written as a generator, which makes it compose better with the later stages. I consider generators a rather nice piece of syntactic sugar.
  2. To aid static checking, the function signature is annotated with types.
  3. To handle both cases -- writing images to a temporary directory, or to a normal directory (usable by other tools) -- a context manager named normal_dir is defined to mirror the shape of TemporaryDirectory, which keeps the code shorter. TemporaryDirectory() is used instead of mkdtemp() because the former guarantees the temporary folder is eventually deleted.
  4. But TemporaryDirectory() deleting the folder on exit causes a problem: right after the path of the last page's image is yielded, the temporary folder is deleted and the just-returned path becomes invalid, so the caller hits file-not-found errors. A crude fix is to keep the with block alive a bit longer -- say, until the next call into the generator. That is the point of the trailing yield "".
  5. By default, PyMuPDF's get_pixmap outputs images at the resolution of the page's bound rectangle. That rectangle is measured in points (pts), with 72 points to the inch, so to reach a given DPI, get_pixmap must scale by a factor of dpi / 72. Hence matrix = fitz.Matrix(dpi / 72, dpi / 72).

The rest is just reading the docs and writing code.

Automating the online trial with Selenium

Uploading page after page by hand until your fingers cramp? A natural idea is to let code drive the browser directly. Thanks to the demands of automated website testing, someone has already built the wheel for us -- Selenium.

Judging by the description on its official site, Selenium matches our needs perfectly, and it offers a Python API.

The first step with Selenium is to open a browser and visit the target page:

from selenium import webdriver

browser = webdriver.Chrome()
# browser = webdriver.Firefox()
# browser = webdriver.Edge()
# browser = webdriver.Safari()
browser.get("http://www.docleaner.cn/experience.html")

Which browser to open is a matter of taste. Note that besides installing the selenium package in your Python environment, you also need to install the matching browser WebDriver yourself; for what and how to install, consult Selenium's documentation. In my experience, Chrome's WebDriver starts quickly but by default writes its log to the console, garbling the progress bar; Firefox's WebDriver takes a second or two to start, but by default logs to geckodriver.log.

After running the code above, two things need to happen:

  1. Wait for the page to finish loading (obviously), and
  2. Turn on the background-removal and auto-deskew switches, which are very useful but off by default.

Selenium's API lets us do both at once, provided I know the "path" to the two switches -- either a CSS selector or an XPath will do.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(browser, timeout).until(
    expected_conditions.visibility_of_element_located(
        (By.XPATH,
        "XPath of the switch")
    )
).click()

Take the auto-deskew (自动纠斜) switch as an example; the surrounding HTML is:

<div style="position: relative;">
    <button type="button" class="layui-btn layui-btn-primary buttoncss buttoncss0" style="padding: 0 10px;">
        <img src="http://www.docleaner.cn/templets/1/qwsoft//resource/images/jiuxie.png">自动纠斜
    </button>
    <div class="buttoncheckbox">
        <input type="checkbox" name="exeConfig" lay-filter="exeConfig" value="自动纠斜" lay-skin="primary">
        <div class="layui-unselect layui-form-checkbox" lay-skin="primary"><i class="layui-icon layui-icon-ok"></i></div>
    </div>
</div>

Our target is the <button> element. The easier one to locate is the <input> element, whose XPath is simply //input[@value='自动纠斜'].

Next we use XPath axes to locate the <button> indirectly:

//input[@value='自动纠斜']/parent::div/preceding-sibling::button

The semantics are clear enough to need no explanation. The full list of axes is here.

Similarly, we can locate the upload button:

uploader = WebDriverWait(browser, timeout).until(
    expected_conditions.presence_of_element_located(
        (By.CLASS_NAME, "layui-upload-file")))

Note that this uses presence_of_element_located, whereas earlier we used visibility_of_element_located. The difference: the latter requires the element to be visible in the browser viewport, while the former only requires it to exist in the DOM. We needed to click the switches earlier, so they had to be visible first (and visibility is also a more reliable signal that the page has finished loading), whereas the upload control is a hidden <input> on the page, so being present in the DOM suffices.

Uploading to the upload control is done, amusingly enough, by calling send_keys:

uploader.send_keys(file_path)

After a short wait, the processed image appears on the right side of the page. The result <img> is actually always there -- it only becomes visible once processing finishes -- which lets us both find the control and use its visibility as the completion signal:

result = WebDriverWait(browser, timeout).until(
    expected_conditions.visibility_of_element_located(
        (By.ID, "dragImgRight")))

The developers kindly (?) put the result image straight into the src attribute as base64, saving us even the download code; we get the result image directly:

import io
import base64
from PIL import Image

result = result.get_attribute("src")
result = base64.b64decode(
    result.replace("data:image/jpg;base64,", ""))
yield Image.open(io.BytesIO(result))

Finally, remember to hide the result <img> again, so that the visibility-based completion check remains valid when the next image is uploaded. Inspection shows that the result's visibility is governed by the presence of layui-hide in the class list of the <img>'s parent element. For an operation with this much fiddliness, we have Selenium make the browser execute JavaScript:

browser.execute_script(
    "arguments[0].parentNode.classList.add('layui-hide');", result)

arguments is, naturally, the array of parameters passed after the script.

The complete code:

def clean_doc_online(images: Generator[StrPath, None, None], browser: str) \
        -> Generator[Image.Image, None, None]:
    """
    Cleans the scanned document pages using docleaner's online service.
    
    :param images: A generator yielding paths to document pages.
    :param browser: Browser type, can be "chrome", "firefox", "safari", or
        "edge". Requires the browser and its webdriver to be installed.
    """
    if browser == "chrome":
        browser = webdriver.Chrome()
    elif browser == "firefox":
        browser = webdriver.Firefox()
    elif browser == "safari":
        browser = webdriver.Safari()
    elif browser == "edge":
        browser = webdriver.Edge()
    else:
        raise RuntimeError("Unknown browser type")
    # Timeout for web driver waits. 10s is a reasonable value unless you have
    # a very high-res image / terrible network.
    timeout = 10
    browser.get("http://www.docleaner.cn/experience.html")

    # Turn on background removal and automatic deskewing.
    WebDriverWait(browser, timeout).until(
        expected_conditions.visibility_of_element_located(
            (By.XPATH,
             "//input[@value='去背景']/parent::div/preceding-sibling::button")
        )
    ).click()
    WebDriverWait(browser, timeout).until(
        expected_conditions.visibility_of_element_located(
            (By.XPATH,
             "//input[@value='自动纠斜']/parent::div/preceding-sibling::button")
        )
    ).click()
    # Wait for a while to ensure the changes take effect.
    time.sleep(1)

    uploader = WebDriverWait(browser, timeout).until(
        expected_conditions.presence_of_element_located(
            (By.CLASS_NAME, "layui-upload-file")))

    try:
        uploader.send_keys(next(images))
        while True:
            # Write like this instead of a for loop enables us to fetch the
            # next image while the browser & remote server are processing the
            # image just uploaded. Converting a pdf page to an image is slow,
            # so we here save a lot of time :)
            next_image = next(images)
            # Wait for the result image to be visible.
            result = WebDriverWait(browser, timeout).until(
                expected_conditions.visibility_of_element_located(
                    (By.ID, "dragImgRight")))
            # Hide the result image again so the wait condition above can be
            # re-used.
            browser.execute_script(
                "arguments[0].parentNode.classList.add('layui-hide');", result)
            result = result.get_attribute("src")
            result = base64.b64decode(
                result.replace("data:image/jpg;base64,", ""))
            yield Image.open(io.BytesIO(result))
            if next_image == "":
                # See convert_pdf_to_images for the reason behind this weird
                # branch.
                break
            uploader.send_keys(next_image)
    except StopIteration:
        pass

    browser.quit()

Three things to note:

  1. The earlier PDF-to-image design has that trick of yielding an empty path, which needs a special case here.
  2. After clicking the switches, wait a moment to make sure they have really taken effect.
  3. It is best not to write the loop as a standard for loop, because then the code would just idle after each upload. The current form ensures that as soon as an image is uploaded, we immediately fetch the next one, and only then check whether the current image has finished processing. While next(images) runs, the browser works independently of the Python process, so we lose nothing. Converting one PDF page to a high-resolution image takes 1-2 seconds, roughly the same as the site's processing time, so restructuring the loop here is a worthwhile optimization.

Stitching the downloaded images back into a PDF

This part is relatively simple -- just follow PyMuPDF's documentation.

Meanwhile, I noticed pytesseract has a function that takes an image and outputs a PDF directly, so OCR can be folded in here.

We can also show a progress bar via tqdm to keep an eye on progress.

The final code:

from tqdm import tqdm
import pytesseract

def convert_images_to_pdf(images: Generator[Image.Image, None, None],
                          output: StrPath,
                          ocr: bool = True, total: Optional[int] = None):
    """
    Converts and merges images into a single pdf file, performing optional
    OCR in the process.

    :param images: A generator yielding PIL image objects.
    :param output: Path to the result pdf.
    :param ocr: Whether to perform OCR(Optical Character Recognition).
    :param total: An optional integer hinting the total number of images given.
        If supplied, a progress bar will be displayed during the conversion.
    """
    doc = fitz.Document()
    for image in images if total is None else tqdm(images, total=total):
        if ocr:
            pdf = pytesseract.image_to_pdf_or_hocr(image)
            page = fitz.Document(stream=pdf, filetype="pdf")
            doc.insert_pdf(page)
        else:
            # noinspection PyUnresolvedReferences
            page = doc.new_page(width=image.width, height=image.height)
            buffer = io.BytesIO()
            image.save(buffer, format="jpeg")
            page.insert_image(fitz.Rect(0, 0, image.width, image.height),
                              stream=buffer)
    doc.save(output)

Wrapping up with a CLI

Last of all, the Click library wraps the script in a CLI:

import click

@click.command()
@click.argument("input", type=click.Path(exists=True))
@click.argument("output", type=click.Path())
@click.option("-f", "--format", default="png",
              help="Intermediate image format.")
@click.option("-d", "--dpi", default=300, help="DPI for rasterization.")
@click.option("-b", "--browser", default="chrome",
              help="The browser selenium uses.")
@click.option("--first-page", type=int, help="First page to convert/clean.")
@click.option("--last-page", type=int, help="Last page to convert/clean.")
@click.option("--ocr/--no-ocr", default=True,
              help="Whether to perform OCR during the conversion.")
@click.option("--clean/--dont-clean", default=True,
              help="Whether to clean pdf using docleaner's online service.")
def main(input: str, output: str, format: str, dpi: int, browser: str,
         first_page: Optional[int], last_page: Optional[int], ocr: bool,
         clean: bool):
    images = convert_pdf_to_images(input, fmt=format, dpi=dpi,
                                   first_page=first_page, last_page=last_page)
    if clean:
        images = clean_doc_online(images, browser)
    doc = fitz.Document(input)
    total = (doc.page_count if last_page is None else last_page) \
        - (0 if first_page is None else first_page - 1)
    convert_images_to_pdf(images, output, ocr=ocr, total=total)


if __name__ == "__main__":
    main()

Done!

Testing

My test book was Zimbardo's The Psychology of Attitude Change and Social Influence. The scanned original is 460 pages and 98.4 MB; the automated script finished in about half an hour, producing a 325 MB result. A typical page, before and after:

The script clearly makes a dramatic improvement, but running it also surfaced a few things worth fixing:

  1. The overly aggressive rasterization resolution significantly inflates the output file size.
  2. The Selenium-based approach occasionally leaves a switch unset, or a switch inexplicably turns itself off.
  3. The script's runtime efficiency leaves room for improvement.

We will address all three in future posts.

After reading Introduction to Linear Algebra, determinants finally clicked for me.

Properties of the determinant

The determinant is defined as the value, associated with a square matrix, that satisfies the following three core properties:

  1. The determinant of the identity matrix is 1.

  2. Exchanging two rows flips the sign of the determinant.

  3. The determinant is linear in each single row, i.e.

    \begin{vmatrix} \vdots \\ (\boldsymbol  a + \boldsymbol b)^{\mathrm T} \\ \vdots \end{vmatrix} = \begin{vmatrix} \vdots \\ \boldsymbol a^{\mathrm T} \\ \vdots \end{vmatrix} + \begin{vmatrix} \vdots \\ \boldsymbol b^{\mathrm T} \\ \vdots \end{vmatrix},\quad\quad
\begin{vmatrix} \vdots \\ (t\boldsymbol a)^{\mathrm T} \\ \vdots \end{vmatrix} = t\begin{vmatrix} \vdots \\ \boldsymbol a^{\mathrm T} \\ \vdots \end{vmatrix}

From these three properties, all of the determinant's other important properties can be derived:

  1. If two rows are equal, the determinant is 0. Proof: exchanging the two equal rows leaves the matrix, and hence the determinant, unchanged; but by property 2 the determinant must flip sign, so it can only be 0.

  2. Elementary row operations do not change the determinant. Proof: by single-row linearity (3) and the equal-rows property (1):

    \begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 (\boldsymbol a -l \boldsymbol b)^{\mathrm T} \\
 \vdots
\end{vmatrix}
 = \begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol a^{\mathrm T} \\
 \vdots
\end{vmatrix}
-
l\begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots
\end{vmatrix}
= \begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol a^{\mathrm T} \\
 \vdots
\end{vmatrix}
-
l\cdot 0
= \begin{vmatrix}
 \vdots \\
 \boldsymbol b^{\mathrm T} \\
 \vdots \\
 \boldsymbol a^{\mathrm T} \\
 \vdots
\end{vmatrix}

  3. If some row is \boldsymbol 0, the determinant is 0. Proof: add another row to the zero row, producing two equal rows, so by the equal-rows property the determinant is 0. And since elementary row operations do not change the determinant, the original determinant is also 0.

  4. The determinant of a triangular matrix is the product of its main-diagonal entries. Proof: whether the matrix is upper or lower triangular, row operations easily eliminate the off-diagonal entries; then by single-row linearity and the identity property, the determinant is the product of the diagonal entries.

  5. If a matrix is invertible, its determinant is nonzero. Proof: consider Gaussian elimination. If the matrix is singular, the triangular form necessarily has a zero row, so the determinant is 0. Otherwise every row is a pivot row, and the triangular form's determinant is nonzero. (The row operations in elimination do not change the determinant; row exchanges only change its sign.)

  6. The determinant of a product is the product of the determinants. Proof: for matrices A,B, consider the function of A given by D(A) = |AB|/|B|. It has three properties:

    1. D(I) = |B|/|B| = 1.
    2. Exchanging two rows of A exchanges two rows of AB, so |AB| flips sign, so D(A) flips sign.
    3. A single-row linear change in A is a single-row linear change in AB, so |AB| is linear in each single row of A, i.e. D(A) is linear in each single row of A.

    Hence D(A) satisfies the core properties of the determinant, and by the definition of the determinant, D(A) = |A|.

  7. Transposition does not change the determinant. If the matrix is singular, so is its transpose, and both determinants are 0. If the matrix is invertible, consider the LU decomposition PA=LU. Transposing both sides gives A^{\mathrm T}P^{\mathrm T} = U^{\mathrm T}L^{\mathrm T}. By the product property (6),

    |P||A|=|L||U|,\quad \left|A^{\mathrm T}\right|\left|P^{\mathrm T}\right| = \left|U^{\mathrm T}\right|\left|L^{\mathrm T}\right|
    Comparing the two equations: L,U are triangular, and since transposition preserves the diagonal, it preserves their determinants. P is a permutation matrix built from row exchanges, with determinant \pm 1; and since P^{\mathrm T}P=I, |P| and \left|P^{\mathrm T}\right| share the same sign, i.e. are equal. Combining the two identities, |A| = \left|A^{\mathrm T}\right|.

    Corollary: properties 2, 3, 4, 5, and 6 apply to columns as well.

Computing the determinant

The permutation formula

By single-row linearity, a determinant can be split up row by row. For example:

\begin{aligned}
    \begin{vmatrix}
        a &b\\
        c &d
    \end{vmatrix} &= \begin{vmatrix}
        a &0\\
        c &d
    \end{vmatrix} + \begin{vmatrix}
        0 &b\\
        c &d
    \end{vmatrix} \\
    &= \begin{vmatrix}
        a &0\\
        c &0
    \end{vmatrix} + \begin{vmatrix}
        a &0\\
        0 &c
    \end{vmatrix} + \begin{vmatrix}
        0 &b\\
        c &0
    \end{vmatrix} + \begin{vmatrix}
        0 &b\\
        0 &d
    \end{vmatrix}
\end{aligned}
Continuing in this fashion, an n-row determinant can be split into n^n determinants, each with at most one nonzero entry per row. Note that some of these contain a zero column; their value is necessarily 0 and they can be dropped. So the possibly nonzero ones are exactly those with at most one nonzero entry per row whose nonzero entries sit in pairwise distinct columns. Each such determinant can be diagonalized by a permutation matrix, so its value is the product of the one possible nonzero entry in each row, times the determinant of the permutation matrix -- which can only be \pm 1; call this sign the "parity" of the permutation. The permutation formula for the determinant then presents itself:

\det A_{n\times n} = \sum_{\pi \in S_n} (-1)^{\sigma(\pi)}\prod_{i=1}^na_{i, \pi(i)}

where S_n denotes the set of all permutations of n elements, \pi denotes a permutation, and \sigma(\pi) denotes the number of inversions of \pi, i.e.

\sigma(\pi)=\sum_{i<j}[\pi(i) > \pi(j)]
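As a quick sanity check, applying the formula to the 2-by-2 case recovers the familiar expression:

\begin{vmatrix} a & b \\ c & d \end{vmatrix}
  = (-1)^{\sigma(1,2)}\, ad + (-1)^{\sigma(2,1)}\, bc
  = ad - bc,

since the identity permutation has 0 inversions and the swap has 1.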

Cofactor expansion

Again by single-row linearity, one row of the determinant can be split into n determinants, each with at most one nonzero entry in that row. Apply the permutation formula to each such determinant: clearly the only contributing permutations are those that pick, in that row, exactly the column of the nonzero entry (every other permutation picks a 0 there and contributes nothing). Dropping the part of each permutation that selects this entry, these permutations become exactly the permutations of length n-1, ranging over all entries of the matrix outside the nonzero entry's row and column -- and isn't that just a smaller determinant of order n-1? That is the principle behind cofactor expansion along a row or column.

Since these notes are mainly for my own review, I will not write out the exact formula.
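A minimal sketch of the idea in code -- a recursive determinant via first-row cofactor expansion (naive O(n!) behavior, for illustration only):

```python
def det(m):
    """Determinant by cofactor expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # Minor: delete row 0 and column j, then recurse on the
        # (n-1)-by-(n-1) determinant, with alternating signs.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


print(det([[1, 2], [3, 4]]))                    # -2
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))   # 24
print(det([[1, 2], [1, 2]]))                    # 0, equal rows
```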

I remember writing, two years ago, a proof of the equivalence between the cofactor-expansion definition and the permutation definition of the determinant. Back then I thought the big pile of writing was impressive; looking at it now, it was rigorous enough, but small-minded.

Cramer's rule

Cramer's rule gives a formal tool for solving linear systems. Its proof goes as follows:

\left.
\begin{aligned}
A \boldsymbol x &= \boldsymbol b \\
A I &= A
\end{aligned}
\right\}
\Rightarrow 
A \cdot \begin{bmatrix}
1 &0 &\cdots &0 &x_1 &0 &\cdots &0 \\
0 &1 &\cdots &0 &x_2  &0&\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 &0 &\cdots &1 &x_{i-1} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i+1} &1 &\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 & 0 &\cdots &0 &x_n &0 &\cdots &1
\end{bmatrix} = \begin{bmatrix}
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
    \boldsymbol{a_1} &\boldsymbol{a_2} &\cdots  & \boldsymbol{a_{i - 1}} &\boldsymbol b & \boldsymbol{a_{i + 1}} &\cdots &\boldsymbol{a_n} \\
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
\end{bmatrix}
Hence, since the determinant of a product is the product of the determinants,
|A| \cdot \begin{vmatrix}
1 &0 &\cdots &0 &x_1 &0 &\cdots &0 \\
0 &1 &\cdots &0 &x_2  &0&\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 &0 &\cdots &1 &x_{i-1} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i+1} &1 &\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 & 0 &\cdots &0 &x_n &0 &\cdots &1
\end{vmatrix} = \begin{vmatrix}
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
    \boldsymbol{a_1} &\boldsymbol{a_2} &\cdots  & \boldsymbol{a_{i - 1}} &\boldsymbol b & \boldsymbol{a_{i + 1}} &\cdots &\boldsymbol{a_n} \\
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
\end{vmatrix}
Noting that
\begin{vmatrix}
1 &0 &\cdots &0 &x_1 &0 &\cdots &0 \\
0 &1 &\cdots &0 &x_2  &0&\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 &0 &\cdots &1 &x_{i-1} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i} &0 &\cdots &0 \\
0 &0 &\cdots &0 &x_{i+1} &1 &\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\vdots &\ddots  &\vdots \\
0 & 0 &\cdots &0 &x_n &0 &\cdots &1
\end{vmatrix} = x_i
we obtain
x_i = \frac{\begin{vmatrix}
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
    \boldsymbol{a_1} &\boldsymbol{a_2} &\cdots  & \boldsymbol{a_{i - 1}} &\boldsymbol b & \boldsymbol{a_{i + 1}} &\cdots &\boldsymbol{a_n} \\
    \vert &\vert &\cdots &\vert &\vert &\vert &\cdots &\vert\\
\end{vmatrix}}{|A|}
which proves Cramer's rule.
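A tiny numeric sketch of the rule for a 2-by-2 system Ax = b, using the ad - bc formula for the 2-by-2 determinants:

```python
def det2(m):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def cramer2(a, b):
    """Solve a 2x2 system A x = b by Cramer's rule."""
    d = det2(a)
    # Replace column i of A with b to form the i-th numerator determinant.
    x1 = det2([[b[0], a[0][1]], [b[1], a[1][1]]]) / d
    x2 = det2([[a[0][0], b[0]], [a[1][0], b[1]]]) / d
    return x1, x2


# 2x + y = 5, x + 3y = 10  ->  x = 1, y = 3
print(cramer2([[2, 1], [1, 3]], [5, 10]))  # (1.0, 3.0)
```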

Determinants as volumes

The geometric meaning of the determinant: the "signed volume" of the higher-dimensional box (rectangle, parallelepiped, and so on) whose edges are the determinant's row vectors.

Why? Because this geometric quantity satisfies the determinant's three core properties:

  1. The box formed by the rows of the identity matrix is the unit box, whose volume is obviously 1.
  2. Exchanging rows of the determinant does not change the box's shape, so the absolute value of the signed volume is unchanged, but the sign flips (take this as the "definition" of signed volume; its absolute value is volume in the ordinary sense, naturally).
  3. Linearly stretching one row vector clearly stretches the box's volume linearly. Splitting one row vector into a sum of two row vectors... this one is a bit subtler; it is easy to see with a picture in two dimensions, and the higher-dimensional cases are analogous.

Therefore, the area of the triangle with one vertex at the origin and the other two at (x_1, y_1), (x_2, y_2) is

\frac{1}{2}\left|\begin{vmatrix}x_1 &y_1 \\ x_2 &y_2\end{vmatrix}\right|
and the area of the triangle with vertices (x_1,y_1),(x_2,y_2),(x_3, y_3) is
\frac{1}{2}\left|\begin{vmatrix}x_1 &y_1 & 1\\ x_2 &y_2 &1 \\ x_3 &y_3 &1\end{vmatrix}\right|
The latter formula can be understood intuitively by imagining three 3-dimensional vectors (with a bit of cut-and-paste and Cavalieri's principle), or reduced to the former by expanding along the column of 1s.

The cross product of 3-vectors

The cross product \boldsymbol u \times \boldsymbol v of two 3-vectors is defined as another 3-vector, with magnitude |\boldsymbol u||\boldsymbol v||\sin \theta| and direction perpendicular to both \boldsymbol u and \boldsymbol v. Its formula can be memorized via the following pseudo-determinant:

\boldsymbol u \times \boldsymbol v = \begin{vmatrix}
    \boldsymbol i &\boldsymbol j &\boldsymbol k \\
    \boldsymbol u_1 &\boldsymbol u_2 &\boldsymbol u_3 \\
    \boldsymbol v_1 &\boldsymbol v_2 &\boldsymbol v_3 \\    
\end{vmatrix}
where \boldsymbol i = (1, 0, 0), \boldsymbol j = (0, 1, 0), \boldsymbol k = (0, 0, 1).

Clearly, the determinant form shows that the cross product is antisymmetric: \boldsymbol u \times \boldsymbol v = -\boldsymbol v \times \boldsymbol u.

The cross product has some rather important applications in physics.

Finally, there is an interesting identity: the cross product followed by a dot product gives

(\boldsymbol u \times \boldsymbol v)\cdot \boldsymbol w = \begin{vmatrix}
    \boldsymbol w_1 &\boldsymbol w_2 &\boldsymbol w_3 \\
    \boldsymbol u_1 &\boldsymbol u_2 &\boldsymbol u_3 \\
    \boldsymbol v_1 &\boldsymbol v_2 &\boldsymbol v_3 \\    
\end{vmatrix}
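A quick numeric check of both formulas in plain Python, expanding the pseudo-determinant along its first row:

```python
def cross(u, v):
    """Cross product of two 3-vectors, via the pseudo-determinant expansion."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


u, v, w = (1, 0, 0), (0, 1, 0), (0, 0, 1)
print(cross(u, v))          # (0, 0, 1), i.e. i x j = k
print(cross(v, u))          # (0, 0, -1): antisymmetry
print(dot(cross(u, v), w))  # 1: the signed volume of the unit box
```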

Below are notes written up from memory after reading Gilbert Strang's Introduction to Linear Algebra, covering roughly chapter 3 and the beginning of chapter 4.

The four fundamental subspaces

For a matrix A\in \mathbb{R}^{m\times n}, define:

  1. The column space of A is the span of its column vectors. A linear combination of the columns of A is written A\boldsymbol{x}=\boldsymbol b -- so every \boldsymbol b of this form belongs to the column space. It is written C(A); C(A)\subset\mathbb{R}^m.
  2. The null space of A is the space of combinations of its columns that give zero, i.e. all vectors \boldsymbol x satisfying A\boldsymbol{x}=\boldsymbol 0. It is written N(A); N(A)\subset \mathbb{R}^n.
  3. The row space of A is the span of its row vectors. Since the rows of A are the columns of A^{\mathrm{T}}, the row space can be written C\left(A^{\mathrm T}\right); C\left(A^{\mathrm T}\right) \subset \mathbb{R}^{n}.
  4. The left null space of A is the space of combinations of its rows that give zero, i.e. all vectors \boldsymbol{x} satisfying \boldsymbol{x}^{\mathrm T}A = \boldsymbol 0^{\mathrm T}; since \boldsymbol x now sits on the left, it is called the left null space by analogy with the null space. It is naturally written N\left(A^{\mathrm T}\right); N\left(A^{\mathrm T}\right) \subset \mathbb{R}^{m}.

These are the four important subspaces of linear algebra. Though there are four, in essence there are only two kinds: "column spaces" and "null spaces". Defining them via matrices is rather dry; a matrix is merely the most common and typical form of a linear transformation, and the definitions of column space and null space are not confined to it. At bottom, the column space is a linear transformation's "range", and the null space is its set of "zeros".

Note: the column space is elsewhere also called the image, and the null space the kernel. Personally I find that, for a general linear transformation, "image" is a more apt name than "column space", but the word "kernel" feels rather baffling -- I have not looked into its background, but "null space" strikes me as more intuitive.

Dimension of a vector space

The dimension of a vector space is defined as the number of vectors in any basis of that space. This definition implicitly asserts a fact: for a given vector space, every basis has the same size, however it is chosen. Here is the book's proof:

Proof: suppose \boldsymbol{v_1},\boldsymbol{v_2},\cdots,\boldsymbol{v_n} and \boldsymbol{w_1}, \boldsymbol{w_2}, \cdots ,\boldsymbol{w_m} are two bases of the vector space \mathbf V. Arguing by contradiction, assume without loss of generality that n < m. Since \boldsymbol{v_1},\boldsymbol{v_2},\cdots,\boldsymbol{v_n} is a basis, every \boldsymbol{w_k} can be written as a linear combination of them, i.e.

\boldsymbol{w_k} = \begin{bmatrix}
    \boldsymbol{v_1} &\boldsymbol{v_2} &\cdots &\boldsymbol{v_n}  
\end{bmatrix}\cdot \boldsymbol{a}
(recall: a matrix times a vector is a linear combination of the matrix's columns, with the vector's entries as coefficients)

Going further,

\begin{bmatrix}
    \boldsymbol{w_1} &\boldsymbol{w_2} &\cdots &\boldsymbol{w_m}  
\end{bmatrix}= 
\begin{bmatrix}
    \boldsymbol{v_1} &\boldsymbol{v_2} &\cdots &\boldsymbol{v_n}  
\end{bmatrix}\cdot A
It is easy to see that A is an n-by-m matrix with n < m, so A is wide. Algebraically, this homogeneous system has more unknowns than equations, which means A\boldsymbol{x} = \boldsymbol{0} has a nontrivial solution.
\begin{aligned}
    A\boldsymbol{x} &= \boldsymbol{0} \\
    \Rightarrow \begin{bmatrix}
    \boldsymbol{v_1} &\boldsymbol{v_2} &\cdots &\boldsymbol{v_n}  
\end{bmatrix} A\boldsymbol{x} &= \boldsymbol{0} \\
    \Rightarrow \begin{bmatrix}
    \boldsymbol{w_1} &\boldsymbol{w_2} &\cdots &\boldsymbol{w_m}  
\end{bmatrix} \boldsymbol{x} &= \boldsymbol{0}
\end{aligned}
That is, \boldsymbol{w_1}, \boldsymbol{w_2}, \cdots ,\boldsymbol{w_m} are linearly dependent -- a contradiction. QED.

Dimensions of the four subspaces

The dimensions of a matrix's four subspaces can be computed by reducing it to reduced row echelon form (RREF). The definition of RREF:

  1. An echelon / upper-triangular matrix;
  2. Every pivot is 1;
  3. Everything directly above a pivot is 0.

For example,

\operatorname{rref}\left(
\begin{bmatrix}
    1 &1 &2 &4\\
    1 &2 &2 &5\\
    1 &3 &2 &6
\end{bmatrix}
\right)
=
\begin{bmatrix}
    1 &0 &2 &3\\
    0 &1 &0 &1\\
    0 &0 &0 &0
\end{bmatrix}
The RREF of a matrix is obtained from it by elementary row operations (Gaussian elimination).

The number of pivots in the RREF is called the matrix's rank. First observe that, for a matrix A already in reduced row echelon form:

  1. The dimension of its column space is its rank: the pivot columns are clearly linearly independent, while each non-pivot column is clearly a linear combination of the pivot columns (the coefficients are written right there in the column -- in the example above, the fourth column is 3 times the first plus 1 times the second).
  2. The dimension of its row space is also its rank: the shape of the RREF makes it easy to see that the nonzero rows are linearly independent, because each nonzero row contains a pivot, and only that row has a 1 in the pivot's column.
  3. The dimension of its null space is the number of columns minus the rank, i.e. the number of non-pivot columns: viewed through the equation A\boldsymbol x = \boldsymbol 0, the non-pivot columns correspond to free variables. Setting one free variable to 1 and the rest to 0, then solving for the pivot variables, yields as many "special solutions" as there are free variables. These solutions are linearly independent, lie in the null space, and span it, so they form a basis of the null space.
  4. The dimension of its left null space is the number of rows minus the rank, i.e. the number of zero rows: the nonzero rows are linearly independent, so for a combination of rows to vanish, every nonzero row's coefficient must be zero, while the zero rows' coefficients are arbitrary; hence the left null space's dimension is the number of zero rows.

Next, observe that reducing a general matrix to RREF does not change the dimensions of its four subspaces:

  1. Elementary row operations do not change the linear relations among the columns -- under row operations, all columns rise and fall together. If the third column of the original matrix is twice the first, it still is after any row operation; if the second column is the sum of the first and the third, that relation also survives.
  2. Since the linear relations among the columns are unchanged, the null space is unchanged, and the dimension of the column space is unchanged (though the column space itself changes).
  3. Row operations are linear combinations of rows, so the RREF's row space is a subset of the original row space; and since Gaussian elimination's row operations are invertible, the original row space is also a subset of the RREF's. Hence the row space is unchanged throughout the reduction.
  4. The RREF merely reveals, rather than alters, the linear dependence relations among the rows (a zero row in the RREF means the corresponding original row was linearly dependent on the rows above it). Rows that were independent do not become dependent under row operations, and vice versa, so the dimension of the left null space is unchanged.

So the rank of a general matrix can be defined as the rank of its RREF, and the dimensions of its four subspaces can be computed from its rank.
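The bookkeeping above can be sketched in a few lines of plain Python -- a toy rref helper (my own illustration, not from the book) over exact Fractions, applied to the 3-by-4 example matrix from earlier:

```python
from fractions import Fraction


def rref(m):
    """Reduce a matrix (list of rows) to reduced row echelon form.
    Returns (rref_matrix, pivot_column_indices)."""
    m = [[Fraction(x) for x in row] for row in m]
    rows, cols = len(m), len(m[0])
    pivots, r = [], 0
    for c in range(cols):
        # Find a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        m[r] = [x / m[r][c] for x in m[r]]         # scale the pivot to 1
        for i in range(rows):                      # clear the rest of column c
            if i != r and m[i][c] != 0:
                m[i] = [a - m[i][c] * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    return m, pivots


a = [[1, 1, 2, 4], [1, 2, 2, 5], [1, 3, 2, 6]]
reduced, pivot_cols = rref(a)
rank = len(pivot_cols)
print(rank)             # 2
print(len(a[0]) - rank)  # dim N(A)   = 2
print(len(a) - rank)     # dim N(A^T) = 1
```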

A wonderful relation among the four dimensions

To summarize briefly: for a matrix A with m rows, n columns, and rank r,

  1. \dim\left(C(A)\right) = r
  2. \dim\left(N(A)\right) = n - r
  3. \dim\left(C\left(A^{\mathrm T}\right)\right) = r
  4. \dim\left(N\left(A^{\mathrm T}\right)\right) = m - r

Rearranging, we find

\begin{aligned}
    \dim\left(C(A)\right) + \dim\left(N(A)\right) &= n = \dim\left(\mathbb{R}^n\right) \\
    \dim\left(C\left(A^{\mathrm T}\right)\right) + \dim\left(N\left(A^{\mathrm T}\right)\right) &= m = \dim\left(\mathbb{R}^m\right)
\end{aligned}
What does this mean? Note that A can be viewed as a linear transformation whose domain lives in \mathbb{R}^n, and likewise A^{\mathrm T} with domain in \mathbb{R}^m. Combining this with the meanings of the null space and the column space, we uncover the following relation:

The dimension of a linear transformation's domain (input) equals the dimension of its null space plus the dimension of its range (output).

Intuitive? Some of the input's dimensions get "swallowed into the null space" and vanish under the transformation; only the remaining dimensions show up in the output.

Grand and elegant, almost beyond words. This relation has a stunning power. Reading this line in the school reading room a few days ago, I nearly slapped the table in delight.

Linear algebra is concise.

Linear algebra is beautiful.

Indeed it is.

What is it good for?

There is an exercise in the book:

Define the sum \mathbf{V} + \mathbf{W} of two subspaces \mathbf{V} and \mathbf W as the space formed by combining arbitrary vectors from the two (i.e. the span of the union of their bases).

Prove:

\dim(\mathbf V) + \dim(\mathbf W) = \dim(\mathbf V + \mathbf W) + \dim(\mathbf V \cap \mathbf W)

The naive approach: take a basis of the intersection of the two subspaces, extend it separately into bases of \mathbf V and of \mathbf W, then show that these separate extensions together with the intersection's basis are linearly independent. The steps are not particularly complicated.

But that is thinking small.

The proof Mike Artin gives goes like this: consider the binary operation + as a map taking each pair (\boldsymbol v\in \mathbf V, \boldsymbol w\in \mathbf W) to \boldsymbol v + \boldsymbol w. This map is easily shown to be linear. Its input dimension is \dim(\mathbf V) + \dim(\mathbf W); the dimension of its range is \dim(\mathbf V + \mathbf W); its null space consists of the pairs (\boldsymbol v, -\boldsymbol v) with \boldsymbol v\in \mathbf V \cap \mathbf W, whose dimension is naturally \dim(\mathbf V \cap \mathbf W). Since a linear map's input dimension equals the dimension of its range plus that of its null space, the identity follows.

So high up that my fear of heights is kicking in.

Orthogonality of the four subspaces

One more beautiful fact: the four subspaces form two orthogonal pairs.

What is orthogonality? Two subspaces are orthogonal iff any two vectors, one from each, are perpendicular. (We say "subspaces" because talking about perpendicularity presupposes the two vectors have the same dimension, i.e. live in a common larger vector space.)

  1. Inside \mathbb R^n, the null space N(A) and the row space C\left(A^{\mathrm T}\right) are orthogonal:

    \forall A^{\mathrm T}\boldsymbol x \in C\left(A^{\mathrm T}\right),\boldsymbol y \in N(A),\quad \left(A^{\mathrm T}\boldsymbol x\right)^{\mathrm T}\boldsymbol y = \boldsymbol{x}^{\mathrm T}A\boldsymbol y = \boldsymbol{x}^{\mathrm T}\cdot \boldsymbol 0 = 0
    Moreover, since their dimensions sum to n, they are orthogonal complements: N(A) + C\left(A^{\mathrm T}\right) = \mathbb{R}^n.

  2. \mathbb R^m内,列空间C(A)和左零空间N\left(A^{\mathrm T}\right)互相正交

    \forall A\boldsymbol x \in C(A),\boldsymbol y \in N\left(A^{\mathrm T}\right),\quad \left(A\boldsymbol x\right)^{\mathrm T}\boldsymbol y = \boldsymbol{x}^{\mathrm T}A^{\mathrm T}\boldsymbol y = \boldsymbol{x}^{\mathrm T}\cdot \boldsymbol 0 = 0
    Moreover, since their dimensions sum to m, they are orthogonal complements: C(A)+ N\left(A^{\mathrm T}\right) = \mathbb{R}^m.

And with that, we can bring out the celebrated figure from the cover of Introduction to Linear Algebra, redrawn by hand as a tribute.