Non-negative Matrix Factorization

In this post, I will be discussing Non-negative Matrix Factorization (NMF). NMF is a low-rank approximation algorithm that discovers latent features in your data. It is similar to PCA in the sense that both reduce high-dimensional data to a lower-dimensional representation for easier interpretation. The major difference is that PCA projects data onto a subspace that maximizes variability along the discovered components, while NMF discovers non-negative features that are additive in nature. NMF is formally defined as:

\[V \approx W H\]

where \(V\) is a non-negative matrix and both \(W\) and \(H\) are non-negative matrices (the factorization is, in general, not unique). In other words, the matrix \(V\) is factorized into two matrices \(W\) and \(H\), where \(W\) is the features matrix or the basis and \(H\) is the coefficient matrix. Typically, this means that \(H\) represents a coordinate system that uses \(W\) to reconstruct \(V\). We can consider each column of \(V\) to be a linear combination of the column vectors of \(W\), with the coefficients given by the corresponding column of \(H\): \(v_{i} = Wh_{i}\).
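
As a quick illustration of the shapes involved, the sketch below uses a small made-up example with a rank of 2 and checks that the first column of \(V\) is rebuilt as \(W\) times the first column of \(H\):

W <- matrix(c(1, 0,
              2, 1,
              0, 3,
              1, 1), nrow = 4, byrow = TRUE) #4 x 2 basis matrix
H <- matrix(c(0.5, 1, 0,
              1,   0, 2), nrow = 2, byrow = TRUE) #2 x 3 coefficient matrix
V <- W %*% H                               #4 x 3 data matrix
all.equal(V[, 1], as.vector(W %*% H[, 1])) #First column of V equals W h_1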

Here, I will describe two algorithms that solve NMF using iterative updates of \(W\) and \(H\). First, we will consider the cost function. A cost function quantifies or measures the error between the predicted values and the expected values. The Mean Squared Error (MSE), or L2 loss, is one of the most popular cost functions in linear regression. Given a linear equation \(y = mx + b\), MSE is:

\[MSE = \frac{1}{N}\Sigma_{i=1}^{N}(y_{i}-(mx_{i}+b))^{2} \]
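
As a quick made-up numerical example, MSE can be computed directly in R for a guessed line:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8) #Hypothetical observations
m <- 2; b <- 0                  #A guessed slope and intercept
mean((y - (m * x + b))^2)       #MSE of the guess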

For the cost function in NMF, we use a similar function called the Frobenius Norm. It is defined as:

\[||A||_{F} = \sqrt{\Sigma_{i=1}^{m}\Sigma_{j=1}^{n} |a_{ij}|^{2}}\]

In the case of NMF, we use the square of the Frobenius norm to measure how good an approximation \(WH\) is for \(V\).

\[||V - WH||_{F}^{2} = \Sigma_{i,j}(V - WH)^{2}_{ij}\]
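
In R, this cost can be computed in a single line; a minimal sketch (frob_sq is just an illustrative name, not a function from the NMF package):

frob_sq <- function(V, W, H){
  sum((V - W %*% H)^2) #Sum of squared entries of the residual matrix
}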

We can see that as \(WH\) approaches \(V\), the cost slowly converges to zero. Therefore, the optimization problem can be defined as follows:

Minimize \(||V - WH||_{F}^{2}\) with respect to \(W\) and \(H\), subject to the constraints \(W,H \ge 0\)

In their paper, Lee & Seung introduced multiplicative update rules to solve NMF. Please see the original paper for details of the proof. Essentially, the updates guarantee that the cost function is non-increasing at every iteration.

\[H_{kj} \leftarrow H_{kj}\frac{(W^{T}V)_{kj}}{(W^{T}WH)_{kj}}\]

\[W_{ik} \leftarrow W_{ik}\frac{(VH^{T})_{ik}}{(WHH^{T})_{ik}}\]

Here we will implement NMF using the multiplicative update rules. To get started, make sure you have installed both R and RStudio. In addition, we will be using the NMF package to benchmark our work.
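
If the NMF package is not yet installed, it can be installed from CRAN first:

install.packages('NMF') #Only needs to be run once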

Here I write a function to solve NMF using the multiplicative update rules. I added a delta variable to the denominators of the update rules to prevent division by zero. K specifies the number of columns of \(W\) and the number of rows of \(H\); it can be thought of as the number of hidden features we are discovering in \(V\), and for an \(n \times m\) matrix it is typically chosen to be no larger than \(\min(n, m)\) (here it defaults to \(n\)). After iterating for the specified number of steps, the function returns a list containing the factor matrices \(W\) and \(H\).

nmf_mu <- function(R, K = dim(R)[1], delta = 0.001, steps = 5000){
  N = dim(R)[1]
  M = dim(R)[2]
  W <- rmatrix(N,K) #Random initialization of W (rmatrix comes from the NMF package)
  H <- rmatrix(K,M) #Random initialization of H
  for (step in 1:steps){
    W_TR = t(W) %*% R
    W_TWH = t(W) %*% W %*% H + delta #delta prevents division by zero
    for (i in 1:dim(H)[1]){
      for (j in 1:dim(H)[2]){
        H[i, j] = H[i, j] * W_TR[i, j] / W_TWH[i, j] #Updating H
      }
    }
    RH_T = R %*% t(H)
    WHH_T = W %*% H %*% t(H) + delta
    for (i in 1:dim(W)[1]){
      for (j in 1:dim(W)[2]){
        W[i, j] = W[i, j] * RH_T[i, j] / WHH_T[i, j] #Updating W
      }
    }
  }
  list <- list('W'=W, 'H'=H)
  return(list)
}
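
As a side note, the element-wise loops above can be replaced with vectorized element-wise matrix operations, since R applies * and / entry by entry. Here is a minimal sketch of a single update step under the same delta convention (nmf_mu_step is just an illustrative helper, not part of the NMF package):

nmf_mu_step <- function(R, W, H, delta = 0.001){
  H <- H * (t(W) %*% R) / (t(W) %*% W %*% H + delta) #Multiplicative update of H
  W <- W * (R %*% t(H)) / (W %*% H %*% t(H) + delta) #Multiplicative update of W
  list('W' = W, 'H' = H)
}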

Let’s initialize a random \(n \times m\) matrix and test our function.

require(NMF)
## Loading required package: NMF
## Loading required package: pkgmaker
## Loading required package: registry
## Loading required package: rngtools
## Loading required package: cluster
## NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 3/4
##   To enable shared memory capabilities, try: install.extras('
## NMF
## ')
R <- rmatrix(5,6)
R
##           [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
## [1,] 0.4007137 0.48988199 0.11730123 0.90561032 0.8786088 0.86731883
## [2,] 0.2958185 0.06199266 0.20295819 0.12190276 0.2675324 0.60205900
## [3,] 0.8963035 0.10200525 0.06888566 0.76200470 0.7317623 0.34809631
## [4,] 0.8569394 0.69137974 0.40093261 0.16341560 0.4682005 0.03118922
## [5,] 0.9269652 0.63180347 0.35229504 0.01099477 0.1182396 0.19406648
nmf_mu_results <- nmf_mu(R)
cat('Matrix W is:\n')
## Matrix W is:
print(nmf_mu_results$W)
##              [,1]         [,2]         [,3]         [,4]         [,5]
## [1,] 0.2945120085 1.177823e-01 1.407215e+00 0.0076274588 9.419760e-03
## [2,] 0.0348530558 1.102322e-01 1.578286e-01 0.0001414884 8.664569e-01
## [3,] 0.0009698899 2.809684e-02 5.126590e-01 1.3105861587 5.523500e-02
## [4,] 0.9099011251 7.185633e-01 2.546607e-03 0.0021807525 3.153967e-06
## [5,] 0.8909743283 5.861403e-11 1.484582e-06 0.0130366441 2.806236e-01
cat('Matrix H is:\n')
## Matrix H is:
print(nmf_mu_results$H)
##             [,1]         [,2]         [,3]         [,4]      [,5]         [,6]
## [1,] 0.938981765 7.094122e-01 3.315199e-01 7.624112e-03 0.0861642 0.0328688319
## [2,] 0.001690588 6.152914e-02 1.367573e-01 2.128332e-01 0.5380754 0.0001024716
## [3,] 0.082623509 1.940418e-01 9.823391e-23 6.219701e-01 0.5583590 0.6051258017
## [4,] 0.638131482 1.004283e-30 4.027833e-02 3.330218e-01 0.3221636 0.0041240851
## [5,] 0.287527049 1.124161e-10 2.021005e-01 1.717841e-08 0.1336130 0.5825066726

Let’s see if we can reconstruct our original matrix and compare it to the nmf function.

R
##           [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
## [1,] 0.4007137 0.48988199 0.11730123 0.90561032 0.8786088 0.86731883
## [2,] 0.2958185 0.06199266 0.20295819 0.12190276 0.2675324 0.60205900
## [3,] 0.8963035 0.10200525 0.06888566 0.76200470 0.7317623 0.34809631
## [4,] 0.8569394 0.69137974 0.40093261 0.16341560 0.4682005 0.03118922
## [5,] 0.9269652 0.63180347 0.35229504 0.01099477 0.1182396 0.19406648
nmf_mu_results$W %*% nmf_mu_results$H
##           [,1]       [,2]       [,3]      [,4]      [,5]       [,6]
## [1,] 0.4005853 0.48923593 0.11595514 0.9050991 0.8781992 0.86675295
## [2,] 0.2951732 0.06213302 0.20174658 0.1219386 0.2662568 0.60138055
## [3,] 0.8955237 0.10189408 0.06811523 0.7612996 0.7310528 0.34783766
## [4,] 0.8571983 0.69020172 0.40000758 0.1621815 0.4671671 0.03153287
## [5,] 0.9256147 0.63206838 0.35261495 0.0111353 0.1184658 0.19280505

We get very similar results using the nmf function with the 'lee' method.

nmf <- nmf(R, dim(R)[1], method = 'lee')
basis(nmf) %*% coefficients(nmf)
##           [,1]      [,2]       [,3]       [,4]      [,5]       [,6]
## [1,] 0.3992716 0.4851242 0.13451318 0.91289882 0.8719353 0.86698917
## [2,] 0.2961278 0.0636697 0.19479112 0.11394434 0.2736916 0.60313470
## [3,] 0.8956540 0.1081984 0.07613699 0.74973301 0.7425820 0.34928051
## [4,] 0.8553733 0.6867311 0.42085554 0.17957373 0.4533049 0.03152422
## [5,] 0.9295931 0.6391867 0.32236372 0.02893233 0.1384885 0.18994336

Now we will take a look at another method of implementing NMF: Stochastic Gradient Descent (SGD). Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. In fact, the multiplicative update rule above can be viewed as a form of block coordinate descent, since it alternates between updating \(H\) and \(W\). In SGD, we again take the derivative of the cost function, but now we compute the partial derivative with respect to each individual entry of \(W\) and \(H\) and update each feature by taking a small step in the direction of the negative gradient. We also add a regularization term to the cost function to control overfitting.

\[||V - WH||_{F}^{2} = \Sigma_{i,j}(V - WH)^{2}_{ij}\]

\[e_{ij}^{2} = (v_{ij}- \hat v_{ij})^{2} = (v_{ij} - \Sigma_{k=1}^{K}w_{ik}h_{kj})^{2}\]

\[e_{ij}^{2} = (v_{ij} - \Sigma_{k=1}^{K}w_{ik}h_{kj})^{2} + \lambda\Sigma_{k=1}^{K}(w_{ik}^{2} + h_{kj}^{2})\]

\(\lambda\) is used to control the magnitudes of \(w\) and \(h\) so that they provide a good approximation of \(v\) without growing too large. We will update each feature with each sample, choosing a small \(\lambda\) such as 0.01. The updates are given by the equations below:

\[w_{ik} \leftarrow w_{ik} - \eta \frac{\partial}{\partial w_{ik}}e_{ij}^{2}\]

\[h_{kj} \leftarrow h_{kj} - \eta \frac{\partial}{\partial h_{kj}}e_{ij}^{2}\]

\(\eta\) is the learning rate and controls the size of each update step. We first solve for \(\frac{\partial}{\partial h_{kj}}e_{ij}^{2}\).

Using the chain rule, \(\frac{\partial}{\partial h_{kj}}u^{2} = \frac{\partial u^{2}}{\partial u} \frac{\partial u}{\partial h_{kj}}\), where \(u = v_{ij} - \Sigma_{k=1}^{K}w_{ik}h_{kj}\) and \(\frac{\partial u^{2}}{\partial u} = 2u\):

\[ \frac{\partial}{\partial h_{kj}}e_{ij}^{2} = 2(v_{ij} - \Sigma_{k=1}^{K}w_{ik}h_{kj}) \frac{\partial}{\partial h_{kj}}(v_{ij} - \Sigma_{k=1}^{K}w_{ik}h_{kj}) + 2\lambda h_{kj} \]

\[ \frac{\partial}{\partial h_{kj}}e_{ij}^{2} = -2e_{ij}w_{ik} + 2\lambda h_{kj} \]

Similarly, taking the derivative with respect to \(w_{ik}\):

\[ \frac{\partial}{\partial w_{ik}}e_{ij}^{2} = -2e_{ij}h_{kj} + 2\lambda w_{ik} \]

Substituting these gradients into the update equations gives the final update rules for both \(W\) and \(H\):

\[ h_{kj} \leftarrow h_{kj} + \eta(2e_{ij}w_{ik} - 2\lambda h_{kj}) \]

\[ w_{ik} \leftarrow w_{ik} + \eta(2e_{ij}h_{kj} - 2\lambda w_{ik}) \]

Below, I implement NMF with these gradient updates, along with a small helper function that computes the Frobenius norm of a matrix.

frob_norm <- function(M){
  #Frobenius norm: square root of the sum of squared entries
  norm = 0
  for (i in 1:dim(M)[1]){
    for (j in 1:dim(M)[2]){
      norm = norm + M[i,j] ** 2
    }
  }
  return(sqrt(norm))
}
nmf_sgd <- function(A, steps = 50000, lam = 1e-2, lr = 1e-3){
  N = dim(A)[1]
  M = dim(A)[2]
  K = N #Number of latent features
  W <- rmatrix(N,K) #Random initialization of W
  H <- rmatrix(K,M) #Random initialization of H
  for (step in 1:steps){
    E = A - W %*% H #Error (residual) matrix; the gradients below use the full matrix at each step
    dW = E %*% t(H) - W*lam #Negative gradient for W (the factor of 2 is folded into the learning rate)
    dH = t(W) %*% E - H*lam #Negative gradient for H
    W = W + lr * dW #Gradient step for W
    H = H + lr * dH #Gradient step for H
    if (frob_norm(A - W %*% H) < 0.01){ #Stop early once the approximation is close enough
      print(frob_norm(A - W %*% H))
      break
    }
  }
  list <- list(W, t(H)) #Note that the second element is the transpose of H
  return(list)
}
nmf_sgd_results <- nmf_sgd(R)
R
##           [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
## [1,] 0.4007137 0.48988199 0.11730123 0.90561032 0.8786088 0.86731883
## [2,] 0.2958185 0.06199266 0.20295819 0.12190276 0.2675324 0.60205900
## [3,] 0.8963035 0.10200525 0.06888566 0.76200470 0.7317623 0.34809631
## [4,] 0.8569394 0.69137974 0.40093261 0.16341560 0.4682005 0.03118922
## [5,] 0.9269652 0.63180347 0.35229504 0.01099477 0.1182396 0.19406648

The reconstructed matrix using our SGD function looks like this:

nmf_sgd_results[[1]] %*% t(nmf_sgd_results[[2]])
##           [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
## [1,] 0.4031416 0.48552160 0.11813207 0.90027221 0.8751664 0.86220859
## [2,] 0.2947061 0.06484878 0.19910116 0.12492977 0.2659008 0.59531443
## [3,] 0.8898673 0.10631871 0.07049123 0.75713091 0.7285831 0.34892049
## [4,] 0.8546253 0.68720686 0.39597605 0.16538414 0.4627980 0.03481928
## [5,] 0.9205705 0.62716189 0.35232043 0.01176342 0.1231614 0.19201868
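
To put a number on how close each reconstruction is, we can compare the Frobenius norms of the residuals from the two implementations (the exact values will vary with the random initialization, so the output is omitted here):

frob_norm(R - nmf_mu_results$W %*% nmf_mu_results$H)            #Multiplicative update error
frob_norm(R - nmf_sgd_results[[1]] %*% t(nmf_sgd_results[[2]])) #Gradient descent error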

As you can see, it is a close approximation of the original matrix. In another post, I will go into more detail about the differences between these dimensionality reduction techniques. See below for more information on NMF.

Sources and additional information:

https://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

https://github.com/hpark3910/nonnegative-matrix-factorization

https://www.almoststochastic.com/2013/06/nonnegative-matrix-factorization.html

https://github.com/carriexu24/NMF-from-scratch-using-SGD

https://albertauyeung.github.io/2017/04/23/python-matrix-factorization.html

Written on April 3, 2021

[ tutorials  R  machine-learning  ]