Vasilii Vadimov

Seally physicist

14 January 2026

Tall and skinny SVD

Recently, in my work I needed to construct a rational approximation of a matrix-valued function. The best algorithm I know for this is AAA [1], and it requires the singular value decomposition of \(m \times n\) matrices with \(m \gg n\). For scalar functions this poses no problem, since in most cases the matrix is reasonably sized for the svd call from Julia’s standard LinearAlgebra. But to apply the AAA algorithm to an \(r \times c\) matrix-valued function, we have to deal with matrices \(r c\) times taller, which is already painful. The continuous version of the AAA algorithm [2] allows one to increase the number of rows adaptively, but even with that I had a hard time just storing the full matrix in memory: a typical matrix size was millions by tens.

Luckily, the matrix in the AAA algorithm is quite specific, and I don’t need to keep all of it in memory. Also, I don’t need the full SVD, only the one right singular vector which corresponds to the smallest singular value. Of course, the right singular vectors of a matrix \(A\) coincide with the eigenvectors of \(A^\dagger A\). But the latter is extremely ill-conditioned (its condition number is the square of that of \(A\)), so diagonalizing it is pretty much a dead end.
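
A quick numerical illustration (the sizes here are arbitrary): the eigenvectors of \(A^\dagger A\) do reproduce the right singular vectors, but the conditioning squares, which is what kills this route for nearly singular problems.

```julia
using LinearAlgebra

A = randn(1_000, 20)
v_svd = svd(A).V[:, end]                        # right vector for the smallest σ
v_eig = eigen(Hermitian(A' * A)).vectors[:, 1]  # same vector, in exact arithmetic

@show cond(A) cond(A' * A)       # cond(A'A) ≈ cond(A)^2
@show abs(dot(v_svd, v_eig))     # ≈ 1 up to sign
```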

Since we don’t need the left singular vectors, we can mess with the matrix a bit. Let’s make a QR-decomposition: \[ A = Q R, \] where \(Q\) is an \(m \times n\) matrix (let’s incorporate \(r\) and \(c\) into \(m\)) with orthonormal columns, \(Q^\dagger Q = I\), and \(R\) is an upper triangular \(n \times n\) matrix. The point is that the matrices \(A\) and \(R\) have the same singular values and right singular vectors. The matrix \(R\) is small, so one can find its SVD straightforwardly. The problem is that while constructing the QR-decomposition we store all the Householder/Givens rotations which constitute the matrix \(Q\), and that takes memory.
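
Here is a minimal sketch of that equivalence, on a size that still fits in memory so the claim can be checked directly against the full SVD:

```julia
using LinearAlgebra

A = randn(100_000, 30)
R = qr(A).R                    # 30 × 30 upper triangular factor
F = svd(R)                     # cheap SVD of the small factor

@show norm(svdvals(A) - F.S)   # same singular values, up to roundoff
v = F.V[:, end]
@show norm(A * v) - F.S[end]   # ‖Av‖ equals the smallest singular value of A
```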

But we don’t really need \(Q\). And the idea of replacing the matrix \(A\) with some simpler matrix while preserving the singular values and right singular vectors is really nice. So let’s split the matrix \(A\) into \(k\) blocks: \[ A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_k \end{bmatrix}, \] where \(A_j\) is an \(m_j \times n\) matrix for \(j = 1, \ldots, k\). Each of the blocks is still “tall”, \(m_j > n\), but fits in memory nicely. Let’s factorize the first block as \[ A_1 = Q_1 R_1. \] We notice that the block matrix \[ \begin{bmatrix} R_1 \\ A_2 \\ \vdots \\ A_k \end{bmatrix} \] has the same singular values and right singular vectors. And it’s a bit shorter, since \(R_1\) is a square \(n \times n\) matrix and \(m_1 > n\). Let’s go further: \[ \begin{bmatrix} R_1 \\ A_2 \end{bmatrix} = Q_2 R_2. \] Then the original matrix transforms to \[ \begin{bmatrix} R_2 \\ A_3 \\
\vdots \\ A_k \end{bmatrix}. \] Continuing this way, we construct \(R_k\), which coincides with the factor \(R\) of the QR-factorization of the full matrix. And this one is easy to deal with.
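
A minimal sketch of this sequential scheme, assuming the blocks arrive from some iterator (in my AAA use case they would be generated on the fly rather than read from storage); only an \(n \times n\) triangular factor is ever kept between steps:

```julia
using LinearAlgebra

function streaming_r(blocks)
    R = nothing
    for A_j in blocks
        S = R === nothing ? A_j : [R; A_j]   # put the current R on top of the block
        R = qr(S).R                          # re-triangularize; only n × n survives
    end
    return R
end

blocks = (randn(10_000, 20) for _ in 1:100)  # a 10^6 × 20 matrix, never fully formed
v = svd(streaming_r(blocks)).V[:, end]       # right singular vector, smallest σ
```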

Parallel version

In principle, we could factorize each of the blocks separately, \(A_j = \tilde Q_j \tilde R_j\), and then make a stack \[ \begin{bmatrix} \tilde R_1 \\ \tilde R_2 \\ \vdots \\ \tilde R_k \end{bmatrix}. \] This would shorten the matrix significantly while preserving its singular values and right singular vectors. Since the factorizations are independent, this operation parallelizes nicely. If the resulting stack is short enough, one can do a brute-force SVD. Otherwise, nothing stops us from decimating this stack in the same fashion again and again. This algorithm is known as tall and skinny QR (TSQR) and can be used in distributed calculations. But for my purposes it’s overkill.
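
For completeness, a rough TSQR sketch under a few assumptions of mine: Float64 blocks, a pairwise reduction tree, and Julia threads standing in for a genuinely distributed setup.

```julia
using LinearAlgebra
using Base.Threads: @threads

function tsqr_r(blocks::Vector{Matrix{Float64}})
    Rs = Vector{Matrix{Float64}}(undef, length(blocks))
    @threads for j in eachindex(blocks)          # independent, parallel QRs
        Rs[j] = qr(blocks[j]).R
    end
    length(Rs) == 1 && return Rs[1]              # nothing left to decimate
    # Stack neighbouring R factors pairwise and decimate again.
    tsqr_r([reduce(vcat, Rs[i:min(i + 1, end)]) for i in 1:2:length(Rs)])
end

blocks = [randn(10_000, 20) for _ in 1:64]       # stand-in for the real data
v = svd(tsqr_r(blocks)).V[:, end]                # the vector AAA needs
```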

References

[1] Y. Nakatsukasa, O. Sète, and L. N. Trefethen, “The AAA algorithm for rational approximation”, SIAM J. Sci. Comput. 40(3), A1494–A1522 (2018).

[2] T. A. Driscoll, Y. Nakatsukasa, and L. N. Trefethen, “AAA rational approximation on a continuum”, SIAM J. Sci. Comput. 46(2), A929–A952 (2024).

tags: linear algebra, numerical methods, singular value decomposition