Fast Walsh-Hadamard Transform in Competitive Programming
This is the very first English post on my blog, so I expect some minor errors. FWHT is a popular technique in the Chinese competitive programming community, but there doesn't seem to be much documentation about its application in the English CP community, and the posts I found on Codeforces weren't very clear to me.
Prerequisites
- A decent proficiency in competitive programming.
- A basic understanding of the Cooley-Tukey FFT and its application in competitive programming.
- A decent understanding of bitwise operations.
Why do we need FWHT?
Recall what we would do if we are to quickly calculate the following convolution of two sequences $a$ and $b$, each of length $n$:

$$c_k = \sum_{i + j = k} a_i b_j$$
Now we try to generalize our findings to a more general case, replacing $+$ with some other binary operation $\oplus$:

$$c_k = \sum_{i \oplus j = k} a_i b_j$$
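To make the target concrete, here is a brute-force reference for one instance of such a convolution (bitwise OR). The function name `conv_or_naive` is mine, and this $O(n^2)$ sketch exists only to pin down the definition that the faster transforms below must match:

```cpp
#include <vector>

// Brute-force bitwise-OR convolution straight from the definition:
// c[k] = sum of a[i] * b[j] over all pairs with i | j == k.
// O(n^2); only a reference point for the O(n log n) FWHT versions below.
std::vector<long long> conv_or_naive(const std::vector<long long>& a,
                                     const std::vector<long long>& b) {
    int n = a.size();
    std::vector<long long> c(n, 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[i | j] += a[i] * b[j];
    return c;
}
```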
FWHT is an algorithm that borrows similar notions from FFT and is able to compute the convolution in $O(n \log n)$ time for $\oplus$ being bitwise OR, bitwise AND, or bitwise XOR. Why do the convolutions of these bitwise operations matter? Observe that binary representation is a way of encoding sets, and these three operations correspond to set union, set intersection and set symmetric difference respectively; therefore, FWHT can be used to accelerate set-based DPs.
Bitwise OR convolution
Let's start with the convolution with respect to bitwise OR:

$$c_k = \sum_{i \mid j = k} a_i b_j$$
Recall what we did in FFT: we divide $a$ into two subsequences based on the parity of indices, a.k.a. the last bit of the indices. We did this because the root of unity has such an amazing property as $\omega_n^{2k} = \omega_{n/2}^{k}$.
We could do that here as well, but a limitation of dividing based on the last bit is that the order of elements changes in the process, so an efficient in-place implementation has to do a pre-shuffle to cope with that. Since OR is a bitwise operation, which bit we divide on doesn't really matter much. Why not simply divide based on the first, or the most significant, bit, so that the order of elements is preserved in the process? Dividing based on the highest bit of indices, simply put, is to split $a$ into the first half, $(a_0, a_1, \ldots, a_{n/2 - 1})$, and the second half, $(a_{n/2}, a_{n/2 + 1}, \ldots, a_{n - 1})$, in their natural order.
Here I introduce a notation, $a^{(0)}$ or $a^{(1)}$. In the context where the length of the sequence $a$ is $n$ (and $n$ is a power of $2$), $a^{(0)} = (a_0, a_1, \ldots, a_{n/2 - 1})$, where the indices all share the same highest bit, and $a^{(1)}$ is just $(a_{n/2}, a_{n/2 + 1}, \ldots, a_{n - 1})$. In other words, every index in $a^{(1)}$ has $1$ as the highest bit and every index in $a^{(0)}$ has $0$ as the highest bit. (Note using this notation, $a = (a^{(0)}, a^{(1)})$, and the notation can be nested, e.g. $a^{(10)} = (a^{(1)})^{(0)}$.)
To make our writing clearer, denote

$$\hat{a}_k = \sum_{i \mid k = k} a_i,$$

i.e. $\hat{a}_k$ sums $a_i$ over all "subsets" $i$ of $k$. Since $i \mid k = k$ and $j \mid k = k$ together are equivalent to $(i \mid j) \mid k = k$, this transform satisfies $\hat{c}_k = \hat{a}_k \hat{b}_k$, so all we need is to compute it (and invert it) quickly.

We first look at the first half of $\hat{a}$. Using the notation I defined above, these elements can be expressed as $\widehat{a^{(0)}}$, because an index $k$ with highest bit $0$ only has subsets with highest bit $0$. In the second half, where $k$ has highest bit $1$, the subsets $i$ of $k$ come from both halves, which gives $\widehat{a^{(0)}} + \widehat{a^{(1)}}$. Altogether,

$$\hat{a} = (\widehat{a^{(0)}},\ \widehat{a^{(0)}} + \widehat{a^{(1)}})$$

(here I use the tuple notation to denote concatenation, and $+$ to denote element-wise addition).
This is something we can write an in-place implementation for with ease:
void fwht_or(int n, int *a, int *A) {
    std::copy(a, a + n, A);
    for (int s = 2, h = 1; s <= n; s <<= 1, h <<= 1)
        for (int l = 0; l < n; l += s)
            for (int i = 0; i < h; i++)
                A[l + h + i] += A[l + i];
}
Its time complexity is obviously $O(n \log n)$, with a really small constant factor.
Its reverse transform turns out to be simple as well. Suppose we know $\hat{a}$, and let $(x, y) = \hat{a}$ be its two halves, so that $x = \widehat{a^{(0)}}$ and $y = \widehat{a^{(0)}} + \widehat{a^{(1)}}$. Then we can recover $\widehat{a^{(0)}} = x$ and $\widehat{a^{(1)}} = y - x$, and continue recursively:
void ifwht_or(int n, int *a, int *A) {
    std::copy(A, A + n, a);
    // If n is guaranteed to be a power of 2 then we don't need n_ and the std::min(...) in the inner loop.
    int n_ = 1; while (n_ < n) n_ <<= 1;
    // n_ = 1 << (32 - __builtin_clz(n - 1));
    for (int s = n_, h = n_ / 2; h; s >>= 1, h >>= 1)
        for (int l = 0; l < n; l += s)
            for (int i = 0; i < std::min(h, n - l - h); i++)
                a[l + h + i] -= a[l + i];
}
And an amazing thing about this, which I haven't quite figured out why, is that the order of the outermost loop above can be reversed, and then both functions can be merged into one:
void fwht_or(int n, int *a, int *A, int dir = 1) {
    std::copy(a, a + n, A);
    for (int s = 2, h = 1; s <= n; s <<= 1, h <<= 1)
        for (int l = 0; l < n; l += s)
            for (int i = 0; i < h; i++)
                A[l + h + i] += dir * A[l + i];
}
(Fast bitwise OR / set union convolution is sometimes called the "Fast Möbius Transform" in the Chinese CP community. They are essentially the same thing.)
Bitwise AND convolution
The bitwise AND convolution is

$$c_k = \sum_{i \,\&\, j = k} a_i b_j.$$

(Actually, by de Morgan's law we can always reduce an AND convolution to an OR convolution.)

Note that AND also has an interesting property analogous to OR's: $i \,\&\, k = k$ and $j \,\&\, k = k$ together are equivalent to $(i \,\&\, j) \,\&\, k = k$. So if we define

$$\hat{a}_k = \sum_{i \,\&\, k = k} a_i$$

(a sum over "supersets" $i$ of $k$), we again get $\hat{c}_k = \hat{a}_k \hat{b}_k$.
We still adopt the same divide-and-conquer approach and continue to use the notations $a^{(0)}$ and $a^{(1)}$. Consider the first half of $\hat{a}$: an index $k$ with highest bit $0$ has supersets in both halves, while an index $k$ with highest bit $1$ only has supersets with highest bit $1$. So the first half of $\hat{a}$ can be expressed as $\widehat{a^{(0)}} + \widehat{a^{(1)}}$, and overall

$$\hat{a} = (\widehat{a^{(0)}} + \widehat{a^{(1)}},\ \widehat{a^{(1)}}).$$
This gives an efficient implementation very similar to
fwht_or
above:
void fwht_and(int n, int *a, int *A) {
    std::copy(a, a + n, A);
    for (int s = 2, h = 1; s <= n; s <<= 1, h <<= 1)
        for (int l = 0; l < n; l += s)
            for (int i = 0; i < h; i++)
                A[l + i] += A[l + h + i];
}
The inverse transform is simple as well. Let $(x, y) = \hat{a}$, so that $x = \widehat{a^{(0)}} + \widehat{a^{(1)}}$ and $y = \widehat{a^{(1)}}$; then $\widehat{a^{(1)}} = y$ and $\widehat{a^{(0)}} = x - y$:
void ifwht_and(int n, int *a, int *A) {
    std::copy(A, A + n, a);
    // If n is guaranteed to be a power of 2 then we don't need n_ and the std::min(...) in the inner loop.
    int n_ = 1; while (n_ < n) n_ <<= 1;
    for (int s = n_, h = n_ / 2; h; s >>= 1, h >>= 1)
        for (int l = 0; l < n; l += s)
            for (int i = 0; i < std::min(h, n - l - h); i++)
                a[l + i] -= a[l + h + i];
}
The order of the outermost loop can be reversed and we can also merge the functions above together.
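As an aside, the de Morgan reduction mentioned earlier can also be sketched in code: since `~(i & j) == ~i | ~j`, complementing every index (within the mask width) turns an AND convolution into an OR convolution. The sketch below uses a naive OR convolution as a helper just to stay self-contained; all names are mine:

```cpp
#include <vector>

// Naive OR convolution, used here only to keep the sketch self-contained;
// any correct OR convolution (e.g. the FWHT-based one) works in its place.
static std::vector<long long> conv_or_ref(const std::vector<long long>& a,
                                          const std::vector<long long>& b) {
    int n = a.size();
    std::vector<long long> c(n, 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[i | j] += a[i] * b[j];
    return c;
}

// AND convolution via de Morgan: i & j == k  <=>  ~i | ~j == ~k on fixed-width masks,
// so complement all indices, OR-convolve, and complement back. n must be a power of 2.
std::vector<long long> conv_and_via_or(const std::vector<long long>& a,
                                       const std::vector<long long>& b) {
    int n = a.size(), m = n - 1;  // m = all-ones mask of the index width
    std::vector<long long> ra(n), rb(n);
    for (int i = 0; i < n; i++) ra[m ^ i] = a[i], rb[m ^ i] = b[i];
    std::vector<long long> rc = conv_or_ref(ra, rb), c(n);
    for (int k = 0; k < n; k++) c[k] = rc[m ^ k];
    return c;
}
```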
Bitwise XOR convolution
The XOR operation does not have such a nice property as OR's $(i \mid k = k) \wedge (j \mid k = k) \iff (i \mid j) \mid k = k$. We first introduce an auxiliary operation: define

$$i \circ j = \operatorname{popcount}(i \,\&\, j),$$

where $\operatorname{popcount}(x)$ denotes the number of $1$s in the binary representation of $x$.
Claim: The transformation below turns XOR convolutions into element-wise multiplications, i.e. $\hat{c}_k = \hat{a}_k \hat{b}_k$:

$$\hat{a}_k = \sum_i (-1)^{i \circ k} a_i.$$

To see why, expand the product:

$$\hat{a}_k \hat{b}_k = \Big( \sum_i (-1)^{i \circ k} a_i \Big) \Big( \sum_j (-1)^{j \circ k} b_j \Big) = \sum_{i, j} (-1)^{i \circ k + j \circ k} a_i b_j.$$

Observe that by the definition of XOR, each bit of $i \oplus j$ is the sum of the corresponding bits of $i$ and $j$ modulo $2$; summing over the bits selected by $k$, we finally get

$$i \circ k + j \circ k \equiv (i \oplus j) \circ k \pmod 2.$$

With this conclusion we can simplify the terms above:

$$\hat{a}_k \hat{b}_k = \sum_{i, j} (-1)^{(i \oplus j) \circ k} a_i b_j = \sum_m (-1)^{m \circ k} \sum_{i \oplus j = m} a_i b_j = \sum_m (-1)^{m \circ k} c_m = \hat{c}_k.$$
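The claim is also easy to sanity-check by brute force: compute the transform directly from its definition on a small input and compare $\hat{a}_k \hat{b}_k$ with $\hat{c}_k$. A throwaway $O(n^2)$ sketch (the name `wht_def` is mine):

```cpp
#include <vector>

// Walsh-Hadamard transform computed directly from its definition:
// A[k] = sum over i of (-1)^popcount(i & k) * a[i].
// O(n^2), for sanity-checking only; __builtin_popcount is GCC/Clang-specific.
std::vector<long long> wht_def(const std::vector<long long>& a) {
    int n = a.size();
    std::vector<long long> A(n, 0);
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            A[k] += (__builtin_popcount(i & k) % 2 ? -1 : 1) * a[i];
    return A;
}
```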
We then explore how to compute $\hat{a}$ efficiently. Divide and conquer is still our friend, and dividing based on the highest bit works here, so we continue to use those notations. Consider the first half of $\hat{a}$, where $k$ has highest bit $0$: the highest bit of $i$ then never contributes to $i \circ k$, so this half is $\widehat{a^{(0)}} + \widehat{a^{(1)}}$. In the second half, where $k$ has highest bit $1$, every index $i$ from the second half picks up an extra $(-1)$ from the highest bit, so that half is $\widehat{a^{(0)}} - \widehat{a^{(1)}}$. Altogether,

$$\hat{a} = (\widehat{a^{(0)}} + \widehat{a^{(1)}},\ \widehat{a^{(0)}} - \widehat{a^{(1)}}).$$
void fwht_xor(int n, int *a, int *A) {
    std::copy(a, a + n, A);
    for (int s = 2, h = 1; s <= n; s <<= 1, h <<= 1) {
        for (int l = 0; l < n; l += s) {
            for (int i = 0; i < h; i++) {
                int t = A[l + h + i];
                A[l + h + i] = A[l + i] - t;
                A[l + i] += t;
            }
        }
    }
}
void ifwht_xor(int n, int *a, int *A) {
    std::copy(A, A + n, a);
    // If n is guaranteed to be a power of 2 then we don't need n_ and the std::min(...) in the inner loop.
    int n_ = 1; while (n_ < n) n_ <<= 1;
    for (int s = 2, h = 1; s <= n_; s <<= 1, h <<= 1) {
        for (int l = 0; l < n; l += s) {
            for (int i = 0; i < std::min(h, n - l - h); i++) {
                int t = a[l + h + i];
                a[l + h + i] = (a[l + i] - t) / 2;
                a[l + i] = (a[l + i] + t) / 2;
            }
        }
    }
}
They can be merged as well:
void fwht_xor(int n, int *a, int *A, bool inv = false) {
    std::copy(a, a + n, A);
    for (int s = 2, h = 1; s <= n; s <<= 1, h <<= 1) {
        for (int l = 0; l < n; l += s) {
            for (int i = 0; i < h; i++) {
                int t = A[l + h + i];
                A[l + h + i] = A[l + i] - t;
                A[l + i] += t;
                if (inv) A[l + h + i] /= 2, A[l + i] /= 2;
            }
        }
    }
}
This last transform is the one Wikipedia actually refers to as the Fast Walsh-Hadamard Transform.
Some sidenotes
Note that though FFT and FWHT share the same idea of divide and conquer, FWHT does not require $n$ to be a power of $2$ whereas FFT does. (Well, actually neither of them "requires" $n$ to be a power of $2$, but to apply FFT when $n$ is not a power of $2$ you either need to pad with $0$s or you have to make your implementation really complicated.)
Also, I just came to know that if we express WHT in the language of matrices and vectors, the matrix is called a Hadamard Matrix.
Another fact that I didn't quite understand is why the order of the inverse FWHT can be reversed. For instance, when $n = 8$, after fully dividing the sequence into individual elements, we first merge subsequences of length $1$ into subsequences of length $2$, then we merge those into subsequences of length $4$, and finally into the full sequence of length $8$. Naturally, when we do the inverse transform we have to start with the length-$8$ sequence, recover the two halves of length $4$, then recover the subsequences of length $2$, and then recover the individual elements. But the popular implementation seems to suggest that the inverse transformation algorithm works in the other direction as well. I am puzzled why this is true and currently I'm just taking it for granted. Perhaps I derived the inversion in a different way than others did? If you have a simple explanation please leave a comment :)