Matlab has a reputation for running slowly. Here are some pointers on how to speed up computations, often to an unexpected degree. Subjects currently covered:
Matrix Coding
Implicit Multithreading on a Multicore Machine
Sparse Matrices
Sub-Block Computation to Avoid Memory Overflow
Matrix Coding - 1
Matlab documentation notes that efficient computation depends on using the matrix facilities, and that mathematically identical algorithms can have very different runtimes, but they are a bit coy about just what these differences are. A simple but telling example:
The following is the core of the GD-CLS algorithm of Berry et al., copied from Fig. 1 of Shahnaz et al., 2006, "Document clustering using nonnegative matrix factorization":
for jj = 1:maxiter
    A = W'*W + lambda*eye(k);
    for ii = 1:n                  % update H one column at a time
        b = W'*V(:,ii);
        H(:,ii) = A \ b;
    end
    H = H .* (H>0);               % zero out negative entries
    W = W .* (V*H') ./ (W*(H*H') + 1e-9);
end
Replacing the column-wise update of H with a single matrix update gives:
for jj = 1:maxiter
    A = W'*W + lambda*eye(k);
    B = W'*V;
    H = A \ B;                    % all columns of H solved in one call
    H = H .* (H>0);
    W = W .* (V*H') ./ (W*(H*H') + 1e-9);
end
These were tested on an 8049 x 8660 sparse bag-of-words matrix V (fraction of nonzeros 0.0083), with W of size 8049 x 50, H of size 50 x 8660, maxiter = 50, lambda = 0.1, and identical initial W. They were run consecutively, multithreaded, on an 8-processor Sun server, starting at ~7:30 PM. Tic-toc timing was recorded.
Runtimes were respectively 6586.2 and 70.5 seconds, a 93:1 difference. The maximum absolute pairwise difference between W matrix values was 6.6e-14.
Similar speedups have been consistently observed in other cases. In one algorithm, combining matrix operations with efficient use of the sparse matrix facilities gave a 3600:1 speedup.
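As a minimal sketch of the kind of sparse-matrix usage referred to above (the dimensions and density are taken from the test case; the indices and counts are made up for illustration, and this is not the algorithm that produced the 3600:1 figure), building V with sparse() keeps both its storage and the cost of multiplying by it roughly proportional to the number of nonzeros:

nw = 8049; nd = 8660;                    % words x documents, as in the test matrix
nnzV = round(0.0083*nw*nd);              % target number of nonzeros
rows = randi(nw, nnzV, 1);               % hypothetical word indices
cols = randi(nd, nnzV, 1);               % hypothetical document indices
vals = ones(nnzV, 1);                    % hypothetical counts (duplicate pairs are summed)
V = sparse(rows, cols, vals, nw, nd);    % a few MB, versus ~0.5 GB for full(V)
W = rand(nw, 50);
B = W'*V;                                % mixed full*sparse product; work scales with nnz(V)

Converting V to full before the multiply would give the same answer but lose both the memory and speed advantages.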
For speed alone, C-style iterative programming should be avoided wherever possible. In addition, when a couple of lines of matrix code can substitute for an entire C-style function, program clarity is much improved.
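As a small illustration (a made-up example, not taken from the algorithm above): normalizing each row of a matrix to sum to 1 takes a loop in C-style code but a single line in matrix form.

X = rand(1000, 200);

% C-style: explicit loop over rows
Y1 = zeros(size(X));
for ii = 1:size(X,1)
    Y1(ii,:) = X(ii,:) / sum(X(ii,:));
end

% matrix form: one line, same result to rounding error
Y2 = bsxfun(@rdivide, X, sum(X,2));

On newer releases (R2016b and later), implicit expansion lets the last line be written simply as Y2 = X ./ sum(X,2).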
Matrix Coding - 2
Applied to integration, the speed gains are not so great, largely due to the time taken to set up the grid and deal with the boundaries. The anonymous function setup time is negligible. I demonstrate on a simple uniform-step, linearly interpolated 1-D integration of cos() from 0 to pi, which should yield zero:
tic;
step = .00001;
fun = @cos;
start = 0;
endit = pi;
enda = floor((endit - start)/step)*step + start;   % last regular grid point <= endit
delta = (endit - enda)/step;                       % fractional width of the final interval
intF = fun(start)/2;                               % trapezoidal half-weight at the start
intF = intF + fun(endit)*delta/2;                  % half-weight at the true endpoint
intF = intF + fun(enda)*(delta+1)/2;               % last grid point borders a full and a partial interval
for ii = start+step:step:enda-step                 % interior points, weight 1
    intF = intF + fun(ii);
end
intF = intF*step
toc;
intF = -2.910164109692914e-14
Elapsed time is 4.091038 seconds.
Replacing the inner summation loop with the matrix equivalent speeds things up a bit:
tic;
step = .00001;
fun = @cos;
start = 0;
endit = pi;
enda = floor((endit - start)/step)*step + start;
delta = (endit - enda)/step;
intF = fun(start)/2;
intF = intF + fun(endit)*delta/2;
intF = intF + fun(enda)*(delta+1)/2;
intF = intF + sum(fun(start+step:step:enda-step));
intF = intF*step
toc;
intF = -2.868419946011613e-14
Elapsed time is 0.141564 seconds.
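For comparison, the whole trapezoidal integral can also be written with the built-in trapz, which accepts a non-uniform grid, so the short final interval is handled by simply appending the true endpoint (a minimal sketch; timings will vary by machine):

tic;
step = .00001;
x = 0:step:pi;                       % uniform grid; pi is not a multiple of step
if x(end) < pi, x = [x pi]; end      % append the true endpoint
intF = trapz(x, cos(x))
toc;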
The core computation take