\documentclass[12pt, letterpaper]{article}
\usepackage[utf8]{inputenc}
\usepackage{array}
\usepackage{amsmath, amssymb, amsthm}
\usepackage{enumerate}
\usepackage{newunicodechar}
\usepackage{mathtools}
\usepackage{bm}
\usepackage{multicol}
\DeclareUnicodeCharacter{2212}{-}
\theoremstyle{remark}
\newtheorem{problem}{Problem}
\theoremstyle{remark}
\newtheorem*{solution}{\textbf{Solution}}
\newenvironment{amatrix}[1]{%
\left[\begin{array}{@{}*{#1}{c}|c@{}}
}{ %
\end{array}\right]
}
\title{Subcubic Parallelized Matrix Inversion \\
\large Through Strassen's Algorithm}
\author{Andres Valdes, Alejandro Perez, Dewey Mowris, Andy Herrera}
\date{April, 2018}
\begin{document}
\begin{titlepage}
\maketitle
\thispagestyle{empty}
\end{titlepage}
\begin{abstract}
Matrix inversion is one of the most computationally expensive
matrix operations. This is unfortunate, as matrix inversion
has many practical applications, including encryption, where the
contents of a matrix are obfuscated by multiplying by a square,
invertible cipher matrix and can only be retrieved by multiplying
the result by the inverse of that cipher.
\end{abstract}
\pagebreak
\section{Introduction}
\noindent
A matrix is a rectangular array of numbers or expressions, for example:
\[\begin{bmatrix*}
a & b & c & d \\
e & f & g & h
\end{bmatrix*}\]
This is referred to as a \(2\times{4}\) matrix, as it
has two rows and four columns. In general, we speak of matrices as
\(m\times{n}\), where \(m\) is the number of rows and \(n\) is the number
of columns. The matrices we will deal with are called ``square''
matrices: \(m\times{n}\) matrices for which \(m = n\), also referred
to as \(n\times{n}\) matrices. Only square,
non-singular\footnote{A matrix whose determinant is nonzero.}
matrices are invertible.
The inverse of a matrix \(A\) is a matrix \(A^{-1}\) that, when multiplied by \(A\),
yields the identity matrix, \(I\). \(I_n\) is the square \(n\times{n}\)
matrix with ones along the diagonal from the top-left element to the
bottom-right, and zeros everywhere else.
\[I_n = \begin{bmatrix*}
1 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix*}_{n\times n}\]
So \(AA^{-1} = I\); this is the linear-algebra analogue of \(a \cdot \frac{1}{a} = 1\).
\bigskip
Traditionally, the inversion of a matrix involves the formula
\[A^{-1} = \frac{1}{\det(A)}\operatorname{adj}(A),\qquad \operatorname{adj}(A) = (C_{A})^T\]
where \(C_A\) is the cofactor matrix of \(A\), that is, a matrix of
the same dimensions as \(A\) whose entries are signed determinants of
submatrices of \(A\). This is a particularly heavy computation: the naive
algorithm for computing a determinant has complexity \(O(n!)\), and even the
standard elimination-based algorithm is \(O(n^3)\), forcing us to do \(O(n^5)\)
work for the cofactor matrix alone, as there are \(n^2\) elements in \(A\).
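For instance, applying this formula to a small matrix:
\[A = \begin{bmatrix*} 1 & 2 \\ 3 & 4 \end{bmatrix*},\quad
\det(A) = 1\cdot4 - 2\cdot3 = -2,\quad
\operatorname{adj}(A) = \begin{bmatrix*} 4 & -2 \\ -3 & 1 \end{bmatrix*}\]
\[A^{-1} = -\frac{1}{2}\begin{bmatrix*} 4 & -2 \\ -3 & 1 \end{bmatrix*}
= \begin{bmatrix*} -2 & 1 \\ 3/2 & -1/2 \end{bmatrix*},\quad
AA^{-1} = I_2\]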
\bigskip
Thankfully, we do have a way of reducing the cost of computing a matrix
inverse: an analytic inversion formula known as blockwise inversion\cite{zhang}.
In blockwise inversion, we split our matrix \(M\) into matrix blocks, such that
\[M^{-1} = \begin{bmatrix*}
A & B \\
C & D
\end{bmatrix*}^{-1} =
\begin{bmatrix*}
A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\
-(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1}
\end{bmatrix*}\]
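Writing \(S \coloneqq D - CA^{-1}B\) for the Schur complement of \(A\) in \(M\)
makes the structure easier to see:
\[M^{-1} = \begin{bmatrix*}
A^{-1} + A^{-1}BS^{-1}CA^{-1} & -A^{-1}BS^{-1} \\
-S^{-1}CA^{-1} & S^{-1}
\end{bmatrix*}\]
so inverting \(M\) reduces to two recursive inversions (of \(A\) and of \(S\))
plus a handful of block multiplications and additions.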
The advantage of this divide-and-conquer algorithm is that it can be shown
that the computational complexity of the entire inversion matches the
computational complexity of the multiplication algorithm used\cite{algo}. This is where
Strassen's algorithm, with a complexity of \(O(n^{\log_2 7})\), comes into play.
\bigskip
Strassen describes an algorithm to multiply two matrices by splitting
each into four matrix blocks of equal dimensions and performing seven multiplications
between those blocks as opposed to eight\cite{strassen}.
\[A = \begin{bmatrix*}
A_{11} & A_{12} \\
A_{21} & A_{22}
\end{bmatrix*},\ B =
\begin{bmatrix*}
B_{11} & B_{12} \\
B_{21} & B_{22}
\end{bmatrix*},\ AB = C =
\begin{bmatrix*}
C_{11} & C_{12} \\
C_{21} & C_{22}
\end{bmatrix*}\]
We define seven new matrices
\begin{align*}
M_1 &\coloneqq (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &\coloneqq (A_{21} + A_{22})B_{11} \\
M_3 &\coloneqq A_{11}(B_{12} - B_{22}) \\
M_4 &\coloneqq A_{22}(B_{21} - B_{11}) \\
M_5 &\coloneqq (A_{11} + A_{12})B_{22} \\
M_6 &\coloneqq (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &\coloneqq (A_{12} - A_{22})(B_{21} + B_{22})
\end{align*}
Note that forming each \(M_i\) requires exactly one block multiplication,
for seven multiplications in total rather than eight. We can
now compose these matrices into \(C\), or \(AB\).
\begin{align*}
C_{11} &= M_1 + M_4 - M_5 + M_7 \\
C_{12} &= M_3 + M_5 \\
C_{21} &= M_2 + M_4 \\
C_{22} &= M_1 - M_2 + M_3 + M_6
\end{align*}
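As a quick check, expanding \(C_{12}\) recovers the usual product formula:
\begin{align*}
C_{12} = M_3 + M_5 &= A_{11}(B_{12} - B_{22}) + (A_{11} + A_{12})B_{22} \\
&= A_{11}B_{12} + A_{12}B_{22}
\end{align*}
and the other three entries expand similarly.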
And thus, we've achieved matrix multiplication in subcubic time,
and by extension, we can achieve matrix inversion in subcubic time.
\section{Previous Works}
There exist many algorithms for inverting matrices in cubic time via
parallelized Gaussian elimination, similar to the routines implemented in popular linear
algebra packages such as BLAS and LAPACK and in their parallelized variants.
Parallelized implementations of Strassen's algorithm also exist, in BLAS and elsewhere, but they
do not directly address the matrix inversion problem considered in this article. The literature is particularly sparse
on parallelized inversion algorithms in Go, where parallelized Strassen implementations are
scarce to begin with.
This article aims to implement an algorithm that, while not new, has not been parallelized for
the application of inverting matrices, especially not in Go.
\pagebreak
\section{Process}
To implement the algorithm described above, we use Go, Google's
language designed for concurrency through a lightweight construct called a ``goroutine.''
Our algorithm breaks into two parts, both of which have aspects that can be
parallelized. First and foremost, we need to split our matrix into sub-blocks
(a sketch of this step appears below). We then perform
multiplications across those sub-blocks using Strassen's algorithm to obtain the seven intermediate matrices,
and we do so in parallel. The intermediate scalar multiplications, additions between matrices, and copying of
matrices parallelize trivially in and of themselves.
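To make the splitting step concrete, here is a minimal sketch of our own
(the \texttt{Matrix} type and \texttt{Quadrants} helper are illustrative, not
necessarily the package's actual API); the quadrants are sub-slices that share
the original backing arrays, so no element storage is copied:
\begin{verbatim}
package main

import "fmt"

type Matrix [][]float64

// Quadrants returns the four h-by-h blocks of a 2h-by-2h matrix as
// views into the original rows, so splitting allocates no new elements.
func Quadrants(m Matrix) (a11, a12, a21, a22 Matrix) {
    h := len(m) / 2
    sub := func(r0, c0 int) Matrix {
        b := make(Matrix, h)
        for i := range b {
            b[i] = m[r0+i][c0 : c0+h] // a view, not a copy
        }
        return b
    }
    return sub(0, 0), sub(0, h), sub(h, 0), sub(h, h)
}

func main() {
    m := Matrix{{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}}
    a11, a12, a21, a22 := Quadrants(m)
    fmt.Println(a11, a12, a21, a22)
}
\end{verbatim}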
The code on the following page computes, in parallel, the seven products restated below:
\begin{align*}
M_1 &\coloneqq (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &\coloneqq (A_{21} + A_{22})B_{11} \\
M_3 &\coloneqq A_{11}(B_{12} - B_{22}) \\
M_4 &\coloneqq A_{22}(B_{21} - B_{11}) \\
M_5 &\coloneqq (A_{11} + A_{12})B_{22} \\
M_6 &\coloneqq (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &\coloneqq (A_{12} - A_{22})(B_{21} + B_{22})
\end{align*}
\pagebreak
\begin{verbatim}
wg.Add(7) // seven independent block products to wait for

// M1 = (A11 + A22)(B11 + B22)
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(m1, Mult(Add(a11, a22), Add(b11, b22)))
}(&wg)

// M2 = (A21 + A22) B11
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(m2, Mult(Add(a21, a22), b11))
}(&wg)

// M3 = A11 (B12 - B22)
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(m3, Mult(a11, Add(b12, Scale(-1, b22))))
}(&wg)

// M4 = A22 (B21 - B11)
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(m4, Mult(a22, Add(b21, Scale(-1, b11))))
}(&wg)

// M5 = (A11 + A12) B22
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(m5, Mult(Add(a11, a12), b22))
}(&wg)

// M6 = (A21 - A11)(B11 + B12)
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(m6, Mult(Add(a21, Scale(-1, a11)), Add(b11, b12)))
}(&wg)

// M7 = (A12 - A22)(B21 + B22)
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(m7, Mult(Add(a12, Scale(-1, a22)), Add(b21, b22)))
}(&wg)

wg.Wait() // block until all seven products are complete
\end{verbatim}
\textit{You may notice the use of a ``WaitGroup'': a WaitGroup is a counter
of outstanding goroutines. Each goroutine calls \texttt{Done} when it
finishes, and the main goroutine blocks at \texttt{wg.Wait()} until the
counter reaches zero.}
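As a minimal, standalone illustration of this pattern (separate from the
repository's code):
\begin{verbatim}
package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    results := make([]int, 3)
    wg.Add(3) // three tasks to wait for
    for i := 0; i < 3; i++ {
        go func(i int) {
            defer wg.Done()    // signal completion even on early return
            results[i] = i * i // stand-in for one of the block products
        }(i)
    }
    wg.Wait() // main goroutine blocks here until all three call Done
    fmt.Println(results) // prints [0 1 4]
}
\end{verbatim}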
\pagebreak
After we've acquired these matrices, we can compose our result matrix, \(C\),
in the following manner:
\begin{align*}
C_{11} &= M_1 + M_4 - M_5 + M_7 \\
C_{12} &= M_3 + M_5 \\
C_{21} &= M_2 + M_4 \\
C_{22} &= M_1 - M_2 + M_3 + M_6
\end{align*}
Via the following Go implementation:
\begin{verbatim}
wg.Add(4) // four quadrants of C to assemble

// C11 = M1 + M4 - M5 + M7
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(c11, Add(Add(Add(m1, m4), Scale(-1, m5)), m7))
}(&wg)

// C12 = M3 + M5
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(c12, Add(m3, m5))
}(&wg)

// C21 = M2 + M4
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(c21, Add(m2, m4))
}(&wg)

// C22 = M1 - M2 + M3 + M6
go func(wg *sync.WaitGroup) {
    defer wg.Done()
    utils.Copy(c22, Add(Add(Add(m1, Scale(-1, m2)), m3), m6))
}(&wg)

wg.Wait() // block until all four quadrants are written
\end{verbatim}
Strassen's multiplication, in our implementation, recurses down to a base case of a trivial \(1\times{1}\)
multiplication.
This multiplication is used for every step of our blockwise inversion that involves a product;
the implementation of the inversion itself is far too lengthy and complex to include in this article
and is better understood within the context of the entire package, found in the GitHub repository.\footnote{Repository: https://github.com/SubcubicInversion/implementation}
Once the blockwise inversion has recursed down to its own base case, the \(3\times{3}\) case, the function calls
return, fully inverting the matrix.
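For reference, the recursion can be sketched as the following simplified,
sequential program (our own illustration, assuming a dimension that is a
power of two; the repository's version parallelizes the seven products as
shown above):
\begin{verbatim}
package main

import "fmt"

type Matrix [][]float64

func zeros(n int) Matrix {
    m := make(Matrix, n)
    for i := range m {
        m[i] = make([]float64, n)
    }
    return m
}

// Add and Sub combine two matrices elementwise.
func Add(a, b Matrix) Matrix {
    c := zeros(len(a))
    for i := range a {
        for j := range a[i] {
            c[i][j] = a[i][j] + b[i][j]
        }
    }
    return c
}

func Sub(a, b Matrix) Matrix {
    c := zeros(len(a))
    for i := range a {
        for j := range a[i] {
            c[i][j] = a[i][j] - b[i][j]
        }
    }
    return c
}

// block copies out the h-by-h quadrant whose top-left corner is (r0, c0).
func block(m Matrix, r0, c0 int) Matrix {
    h := len(m) / 2
    b := zeros(h)
    for i := 0; i < h; i++ {
        copy(b[i], m[r0+i][c0:c0+h])
    }
    return b
}

// paste writes quadrant q into m at top-left corner (r0, c0).
func paste(m, q Matrix, r0, c0 int) {
    for i := range q {
        copy(m[r0+i][c0:c0+len(q)], q[i])
    }
}

// Mult multiplies two n-by-n matrices (n a power of two) with Strassen's
// seven-product recursion, bottoming out at a scalar product.
func Mult(a, b Matrix) Matrix {
    n := len(a)
    if n == 1 {
        return Matrix{{a[0][0] * b[0][0]}} // base case: 1x1 blocks
    }
    h := n / 2
    a11, a12 := block(a, 0, 0), block(a, 0, h)
    a21, a22 := block(a, h, 0), block(a, h, h)
    b11, b12 := block(b, 0, 0), block(b, 0, h)
    b21, b22 := block(b, h, 0), block(b, h, h)
    m1 := Mult(Add(a11, a22), Add(b11, b22))
    m2 := Mult(Add(a21, a22), b11)
    m3 := Mult(a11, Sub(b12, b22))
    m4 := Mult(a22, Sub(b21, b11))
    m5 := Mult(Add(a11, a12), b22)
    m6 := Mult(Sub(a21, a11), Add(b11, b12))
    m7 := Mult(Sub(a12, a22), Add(b21, b22))
    c := zeros(n)
    paste(c, Add(Sub(Add(m1, m4), m5), m7), 0, 0) // C11
    paste(c, Add(m3, m5), 0, h)                   // C12
    paste(c, Add(m2, m4), h, 0)                   // C21
    paste(c, Add(Sub(m1, m2), Add(m3, m6)), h, h) // C22
    return c
}

func main() {
    a := Matrix{{1, 2}, {3, 4}}
    b := Matrix{{5, 6}, {7, 8}}
    fmt.Println(Mult(a, b)) // prints [[19 22] [43 50]]
}
\end{verbatim}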
\section{Results}
The following results compare our runtime against that of BLAS, the low-level library
that underlies NumPy's matrix inversion.
\begin{center}
\begin{tabular}{|c|l|l|}
\hline
Size & Strassen & BLAS \\
\hline\hline
\(2\times2\) & 44.55ns & 367.89ns \\
\hline
\(4\times4\) & 648.49ns & 503.06ns \\
\hline
\(8\times8\) & 2.41ms & 1.05ms \\
\hline
\(512\times512\) & 2min, 2s & 12.49ms \\
\hline
\end{tabular}
\end{center}
Clearly, BLAS' matrix inversion implementation, despite its \(O(n^3)\) time complexity, is far faster for matrices of reasonable size.
The reason for this is that the assembly-level implementation of BLAS uses cache-aware tricks that significantly reduce the time required to
compute the inverse.
This advantage of caching over lower asymptotic complexity does dwindle as \(n\) approaches infinity; however, matrices of that magnitude are far
too large for our tests to terminate within a reasonable amount of time.
Another reason our algorithm may take longer to execute is that Go is a garbage-collected language: through recursive calls, many
intermediate structures are created and then simply unreferenced, forcing Go to garbage-collect a large amount of memory.
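One possible mitigation, sketched below as our own illustration rather than
something the current implementation does, is to recycle intermediate buffers
with \texttt{sync.Pool} so that recursive calls reuse memory instead of
allocating fresh intermediaries:
\begin{verbatim}
package main

import (
    "fmt"
    "sync"
)

const blockSize = 4 // dimension of the scratch blocks in this sketch

// pool hands out reusable blockSize*blockSize scratch slices; borrowing
// and returning them keeps allocations (and GC work) down.
var pool = sync.Pool{
    New: func() interface{} { return make([]float64, blockSize*blockSize) },
}

func main() {
    buf := pool.Get().([]float64) // borrow a buffer (allocates only if empty)
    for i := range buf {
        buf[i] = float64(i) // stand-in for computing an intermediate block
    }
    fmt.Println(buf[:3])
    pool.Put(buf) // hand the buffer back for the next caller
}
\end{verbatim}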
\pagebreak
\section{Conclusions}
While in our tests the algorithm we developed to invert a matrix fell short of the algorithms commonly used,
as implemented by BLAS and LAPACK, it was the first parallelized algorithm to invert a matrix via Strassen's multiplication
implemented in Go. It is interesting to see how, despite parallelization and the lower asymptotic complexity
of our algorithm, it is still beaten out by clever caching tricks rather than by avoiding any of the arithmetic.
This does not mean that we cannot improve on the algorithm: it would be helpful to continue improving the space complexity of the implementation and perhaps rewrite the algorithm in a lower-level
language, one that is not garbage-collected, such as C or assembly, in which BLAS is written.
Perhaps in the future we can run this algorithm on a computer fast enough to terminate within a passable amount of time for
very large \(n\), so that we may see the asymptotic relationship between the two algorithms.
\begin{thebibliography}{9}
\bibitem{algo}
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,
\textit{Introduction to Algorithms}, 3rd ed., MIT Press, Cambridge, MA, 2009, \S28.2.
\bibitem{strassen}
V. Strassen,
\textit{Gaussian Elimination is not Optimal}, Numer. Math., vol. 13, pp. 354--356, 1969.
\bibitem{zhang}
F. Zhang (Ed.),
\textit{The Schur Complement and Its Applications}, Numerical Methods and Algorithms, vol. 4, Springer, Boston, MA, 2005.
\end{thebibliography}
\end{document}