Machine Learning by Andrew Ng

Table of Contents

https://class.coursera.org/ml-007

1. definition

T(task),E(experience),P(performance)

2. prerequisite

  • linear algebra
    • matrices / vectors # addition / subtraction/ multiplication / inversion / transposition
    • some matrices are not invertible called singular / degenerate # redundant(linear dependent) or too many features
    • pseudo inverse(pinv, works on matrix non-invertible) and inverse(inv)
  • Octave #note: python-scikit-learn

3. supervised learning

  • http://math.stackexchange.com/questions/141381/regression-vs-classification
  • regression # continuous valued output.
    • squared error function (as cost function) # ordinary-least-square 最小二乘法
    • linear regression # 线性回归
    • polynomial regression # 多项式回归
  • classification # discrete valued output.
    • decision boundary # 决策边界
    • binary-class(negative/positive) vs. multi-class
    • logistic regression # 逻辑回归
      • sigmoid / logistic function
      • g(x) = 1 / (1 + e^-x) [0, 1]
      • interpretation of hypothesis output.
      • p(y=0|x;theta) + p(y=1|x;theta) = 1 (for binary-class)
      • h(x) = 1 / (1 + e^-(theta * x)) [0, 1]
      • cost funciton = [-log(h(x)) if y = 1, -log(1-h(x)) if y = 0]
      • => -(log(h(x)) * y + log(1-h(x)) * (1-y))
    • SVM(support vector machine)
  • training set / historical data set.
    • input variables / features
      • univariate # single input variable
      • multivariate # multiple input variables.
      • feature scaling / mean normalization
    • output variables / targets
    • feature scaling (otherwise more steps to find global minimum), approximately [-1,1]
  • hypothesis parameters(theta) and hypothesis(theta * x)
  • cost function # 代价函数
    • convex and non-convex function
    • "batch" = uses all training set
    • gradient descent algorithm # 梯度下降
      • learning rate, derivative term.
      • if learning rate is too small, converge rate could be low.
      • if learning rate is too large, fail to converge or even diverge.
    • gradient checking # 梯度检查
    • optimization algorithm: conjugate gradient / BFGS / L-BFGS
      • no need to manually peek learning rate
      • faster than gradient descent
      • provided cost function and partial derivatives
      • 'fmincg' or 'fminunc' in Octave
  • overfitting # 过拟合
    • problem
      • if underfit -> high bias
      • if overfit -> high variance
      • not generalize new examples
      • generalization ability # 泛化能力
    • addressing
      • reduce number of features.
        • manually select which features to keep
        • model selection algorithm
      • regularization # 正规化
        • keep all features but reduce magnitude/values of parameters
        • works well when we have a lot of features
        • if regularization parameter is very large -> underfitting.
        • L1 norm, L2 norm. L1范数和L2范数

4. neural networks

  • motivation # 神经网络
    • complex non-linear classification / hypothesis # 针对复杂非线性分类
    • many features -> too many polynomial terms.
      • quantic x, then O(n^x).
      • quadratic : O(n ^ 2)
      • cubic: O(n ^ 3)
  • background
    • origins: algorithms that try to mimic the brain
    • widely used in 80s and early 90s, popularity diminished in late 90s
    • recent resurgence
    • "one learning algorithm" hypothesis = cortex.
  • model representation
    • neuron in the brain # 神经元
      • dendrite = input write # 树突
      • axon = output write # 轴突
      • cell body / nucleus # 核
      • communicated by spike(pulse of electricity) # 电信号传输
    • neuron model: logistic unit
      • sigmoid (logistic) activation function # 激活函数
      • hypothesis parameter = weight
      • 数学之美C30: 神经元函数只能对输入变量线性组合后的结果进行一次非线性变换.
    • layer: input/output/hidden
      • a(i,j) = "activation" of unit i in layer j
      • theta(j) = matrix of weights controlling function mapping from layer j to layer j+1
      • if network has s(j) units in layer j, and s(j+1) units in layer j+1, then theta(j) is M(s(j+1), s(j)+1)
    • forward propagation
    • backward propagation
  • backpropagation algorithm # 反向传播算法来计算参数导数
    • general cost function
    • delta(j,l) = "error" of node j in layer l
    • intuition # use backpropagation algorithm to compute derivatives.
    • implementation
      • unroll parameters
      • gradient checking(inefficient) to verify backprop derivatives
      • initialize parameters randomly[symmetry breaking] (otherwise features are duplicated)
  • putting together
    • network architecture
      • no. of input units: dimension of features
      • no. of output units: number of classes
      • hidden layer
        • reasonable default: 1 hidden layer, or >1 hidden layer have same no. of hidden units in every layer(usually the more the better)
        • no. of hidden units = [2,3,4] * no. input units.
      • network size
        • small # fewer parameters, more prone to underfitting, computationally cheaper.
        • large # more parameters, more prone to overfitting, computationally more expensive.
    • training a neural network
      • randomly initialize weights
      • for-loop to iterate each training samples.
      • forward propagation to compute activation
      • compute cost function
      • backward propagation to compute partial derivatives
      • gradient checking
      • gradient descent algorithm

5. support vector machine

  • alternative view of logistic regression
  • SVM cost function # replace sigmoid function with two simple functions (cost0 and cost1)
    • cost function = -y * cost1(tx) + (1-y) * cost0(tx)
    • hypothesis: y = 1 if tx >=0. y = 0 otherwise.
    • cost0和cost1实际上是合页损失函数(hinge loss function)
  • SVM decision boundary / large margin intuition (if C very large) # SVM决策边界是找到完美划分的超平面
  • kernel / kernel function # 核函数
    • for more features
    • to compute similarity (with landmarks) as more complex, non-linear features.
    • gaussian kernel function.
      • K(x,y,e) = exp ^ (-0.5 / e^2 * |x-y|^2)
      • if e^2 is large, high bias and low variance
      • if e^2 is small, low bias and high variance
    • output range [0,1]
  • how it works
    • choose typical landmarks.
    • compute similarity with landmarks as input [0,1]
    • translate into a typical classifier problem.
    • number of features == number of landmarks.
  • practice
    • liblinear, libsvm
    • specify 1) choice of parameter C 2) kernel function
    • no kernel / linear kernel function # n >> m
    • gaussian kernel function # m >> n
    • polynomial kernel function
    • string kernel / chi-square kernel / histogram intersection kernel

6. advice for applying ML

  • unacceptablely large errors in its predictions
    • don't just use gut feelings and do the following things randomly
    • get more training examples. (but not the more the better) => fix high variance
    • try smaller sets of features. => fix high variance
    • try getting additional features. => fix high bias
    • try polynomial features. => fix high bias
    • try decreasing/increasing lambda. => fix high bias/variance
  • system diagnostics
    • evaluating hypothesis
      • split examples randomly into training set(70%) and test set(30%).
      • see J_test(theta) is overfitting or not.
    • model selection (for choosing polynomial terms and regularization)
      • split examples randomly into training set(60%), cross validation set(20%), and test set(20%)
      • use cross validation set to select model, and get estimate of generalization error.
      • validation curves.
    • high bias vs. variance
      • bias > underfit: J_train(theta) is high, J_cv/test(theta) = J_train(theta)
      • variance > overfit: J_train(theta) is low, but J_cv/test(theta) > J_train(theta)
      • learning curves # J_cv/test(theta) and J_train(theta) over training set size
      • if suffers from high bias, more training data will not help
      • if suffers from high variance, more training data might help

7. advice for designing ML

  • numerical evaluation # a real number tells how well is your system. 使用一个数值来衡量系统
  • error analysis # spot any systematic trend in what type of examples it is making errors on. 误差分析
  • skewed classes.
    • y = 1 in presence of rare class # 如果y_pred=0的话没有任何预测性但是accuracy准确率超高
    • precision = true positive / [no. of predicted positive = (true pos + false pos)] # 精确度
    • recall = true positive / [no. of actual positive = (true pos + false neg)] # 召回率
    • good classifier: precision and recall are both high enough.
      • but there are tradeoffs between both
      • F1 score = 2 * P * R / (P + R)
      • #note: see "anomaly detection select threshold" how to compute P,R, and F1.
  • large data rationale
    • assume features have sufficient information to predicate accurately
    • useful test: give the input x, can a human expert confidently predict y?

8. unsupervised learning

  • cluster algorithm
  • cocktail party problem
  • K-means algorithm
    • cluster centroid
    • K = cluster number, k = cluster index
      • should have K < m
      • choose K manually(most time) or with elbow method
    • objective function = distances between training set and centroids.
      • convex, but risk of local optima
      • randomly choose centroids from training set.
      • multiple random initialization

9. dimensionality reduction

  • motivation # 维度降解
    • data compression
    • data visualization
    • speed up learning algorithm
  • PCA(principal component analysis) # 主成分分析
    • find k vectors onto which to project the data
    • minimize the projection error(different to linear regression)
    • algorithm # reduce n dimensions to k dimensions
      • sigma = 1/m * sum{X(i) * X(i)'}. X(i)~n*1, so sigma~n*n
      • [U,S,V] = svd(sigma) # singular value decomposition
      • U~n*n. use first k columns called U_reduce~(n*k)
      • z = U_reduce' * X(i) ~ (k * n * n * 1) = (k*1)
      • reconstruct: X_approx(i) = U_reduce * z ~ (n * k * k * 1) = (n*1)
    • choose k # n% of variance is retained.
      • n = sum{i=1,k}S_{ii} / sum{i=1,n}S_{ii} (S from svd, diagonal matrix)
      • n = 99 typical value
  • comments
    • don't use PCA to prevent overfitting
    • use raw data first, then consider PCA

10. anomaly detection

  • gaussian distribution # 高斯分布
    • X ~ N(u, e^2) # X distributed as N. where mean = u, variance = e^2
    • p(x, u, e^2) = 1 / ((sqrt(2 * pi) * e)) * exp ^ { - (x-u)^2 / (2 * e^2) } # probability
    • multivariate version # 多变量高斯分布
      • to capture anomalous combination of values. computationally expensive.
      • u~{n*1}, e~{n*n} (covariance matrix) # intuition. contour not axis aligned.
      • p(x, u, e) = 1 / ((2 * pi) ^ (n/2) * sqrt(det(e))) * exp ^ {-0.5 * (x-u)' * e^-1 * (x-u)}
      • u = 1/m * sum{x}, e = 1/m * sum{(x-u) * (x-u)'}
      • #note: m > n, otherwise e is non-invertible.
  • how it works # 我们假设特征数据符合高斯分布,所以异常数据点对应概率会非常低
    • model p(x) from data
    • p(x) < epsilon to decide if anomalous
      • epsilon # p(x) is comparable for normal and anomalous examples.
      • features to distinguish normal and anomalous examples.
      • p(x) = p1(x1, u1, e1^2) * … pj(xj, uj, ej^2).. # j = # of features.
      • if xj is not gaussian feature, transform it to fit into gaussian distribution. # 如果数据不满足高斯分布,那么要对数据做变换符合高斯分布
  • vs. supervised learning
    • anomaly detection
      • # of positive cases is very small, while # of negative cases is very large
      • many different types of "anomaly", hard to learn from positive cases what anomalies looks like
      • future anomalies maybe very different to current ones.
      • fraud detection, manufacturing, monitoring machines.
    • supervised learning
      • # of positive cases and negative cases are both very large
      • enough positive cases to learn what positive cases look like
      • future positive cases are similar to current ones.
      • email spam, weather prediction, cancer classification.

11. recommender system

  • content based recommendation
  • collaborative filtering algorithm
    • low rank matrix factorization
    • random initialization to break symmetry
    • content features to compute similarity between items
    • mean normalization # 对于空值使用属性平均值代替

12. ML in large scale

  • stochastic gradient descent algorithm # 随机梯度下降算法
    • vs. batch gradient descent # 可以增量使用训练数据
    • randomly shuffle dataset
    • repeat for i = 1..m { for j = 0..n { update theta_j only use ith data } }
    • move to global minimum generally, but not always in one iteration.
    • convergence checking
      • use averaged last k(say 1000) examples.
      • the larger k, the smoother cost function curve.
      • can slowly decrease learning rate over time for convergence.
  • mini-batch gradient descent algorithm
    • between batch and stochastic gradient descent
    • use b(say 10) examples in one iteration
    • take advantage of vectorization
  • online learning
  • map-reduce and data parallelism
  • more data
    • collect from multiple sources
    • artificial data synthesis
  • ceiling analysis

13. appendix code

13.1. feature normalization

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);


% ============================================================

end

13.2. linear regression cost function

#note: works for polynomial regression too.

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the
%   cost of using theta as the parameter for linear regression to fit the
%   data points in X and y. Returns the cost in J and the gradient in grad

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost and gradient of regularized linear
%               regression for a particular choice of theta.
%
%               You should set J to the cost and grad to the gradient.
%

diff = X * theta - y;
J = sum(diff .^ 2) * 0.5 / m;
t = theta;
t(1) = 0;
J += sum(t .^ 2) * lambda * 0.5 / m;
grad = ((X' * diff) + lambda * t) / m;

% =========================================================================

grad = grad(:);

end

13.3. neural network cost function

Pasted-Image-20231225104022.png

Pasted-Image-20231225104244.png

function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTON(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the code by working through the
%               following parts.
%
% Part 1: Feedforward the neural network and return the cost in the
%         variable J. After implementing Part 1, you can verify that your
%         cost function computation is correct by verifying the cost
%         computed in ex4.m
%
% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad, respectively. After implementing Part 2, you can check
%         that your implementation is correct by running checkNNGradients
%
%         Note: The vector y passed into the function is a vector of labels
%               containing values from 1..K. You need to map this vector into a
%               binary vector of 1's and 0's to be used with the neural network
%               cost function.
%
%         Hint: We recommend implementing backpropagation using a for-loop
%               over the training examples if you are implementing it for the
%               first time.
%
% Part 3: Implement regularization with the cost function and gradients.
%
%         Hint: You can implement this around the code for
%               backpropagation. That is, you can compute the gradients for
%               the regularization separately and then add them to Theta1_grad
%               and Theta2_grad from Part 2.
%

X2 = [ones(m, 1)  X];
tx2 = X2 * Theta1';
hx2 = sigmoid(tx2);
X3 = [ones(m, 1) hx2];
tx3 = X3 * Theta2';
hx3 = sigmoid(tx3);
hy = zeros(m, num_labels);
for i = [1:m],
    hy(i, y(i)) = 1;
end;
J = sum(sum(log(hx3) .* (-hy) - log(1 - hx3) .* (1 - hy))) / m;

R = 0;
R += sum(sum(Theta1(:, 2:end) .^ 2));
R += sum(sum(Theta2(:, 2:end) .^ 2));
R *= lambda / m * 0.5;

J += R;

% -------------------------------------------------------------

d3 = hx3 - hy; # M * K
d2 = (d3 * Theta2)(:,2:end) .* sigmoidGradient(tx2); # M * H
Theta2_grad = d3' * X3 / m; # K * M * M * (H+1) = K * (H+1)
Theta1_grad = d2' * X2 / m; # H * M * M * (N+1) = H * (N+1)

t2 = Theta2;
t2(:,1) = 0;
t1 = Theta1;
t1(:,1) = 0;
Theta2_grad += t2 * lambda / m;
Theta1_grad += t1 * lambda / m;

% =========================================================================

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];


end

13.4. pca(principal compoenent analysis)

function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S, X] = pca(X) computes eigenvectors of the covariance matrix of X
%   Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%

sigma = 1.0 / m * X' * X;
[U,S,_ ] = svd(sigma);



% =========================================================================

end

projectData

function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%on to the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection on to the k-th
%               eigenvector is given as follows:
%                    x = X(i, :)';
%                    projection_k = x' * U(:, k);
%

U_reduce = U(:, 1:K);

Z = X * U_reduce;



% =============================================================

end

recoverData

function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
%   X_rec = RECOVERDATA(Z, U, K) recovers an approximation the
%   original data that has been reduced to K dimensions. It returns the
%   approximate reconstruction in X_rec.
%

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate)
%               recovered data for dimension j is given as follows:
%                    v = Z(i, :)';
%                    recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%

U_reduce = U(:, 1:K);

X_rec = Z * U_reduce';

% =============================================================

end

13.5. gaussian distribution

compute mean and variance of X

function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a
%Gaussian distribution using the data in X
%   [mu sigma2] = estimateGaussian(X),
%   The input X is the dataset with each n-dimensional data point in one row
%   The output is an n-dimensional vector mu, the mean of the data set
%   and the variances sigma^2, an n x 1 vector
%

% Useful variables
[m, n] = size(X);

% You should return these values correctly
mu = zeros(n, 1);
sigma2 = zeros(n, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the mean of the data and the variances
%               In particular, mu(i) should contain the mean of
%               the data for the i-th feature and sigma2(i)
%               should contain variance of the i-th feature.
%

mu = mean(X)';
# xu = X - mu';
# sigma2 = 1.0 / m * sum(xu .^ 2)';
sigma2 = (m-1) / m * var(X)';

% =============================================================


end

compute probability

function p = multivariateGaussian(X, mu, Sigma2)
%MULTIVARIATEGAUSSIAN Computes the probability density function of the
%multivariate gaussian distribution.
%    p = MULTIVARIATEGAUSSIAN(X, mu, Sigma2) Computes the probability
%    density function of the examples X under the multivariate gaussian
%    distribution with parameters mu and Sigma2. If Sigma2 is a matrix, it is
%    treated as the covariance matrix. If Sigma2 is a vector, it is treated
%    as the \sigma^2 values of the variances in each dimension (a diagonal
%    covariance matrix)
%

k = length(mu);

if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
    Sigma2 = diag(Sigma2);
end

X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(Sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(Sigma2), X), 2));

end

13.6. anomaly detection select threshold

function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
%   [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
%   threshold to use for selecting outliers based on the results from a
%   validation set (pval) and the ground truth (yval).
%

bestEpsilon = 0;
bestF1 = 0;
F1 = 0;

stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)

    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the F1 score of choosing epsilon as the
    %               threshold and place the value in F1. The code at the
    %               end of the loop will compare the F1 score for this
    %               choice of epsilon and set it to be the best epsilon if
    %               it is better than the current choice of epsilon.
    %
    % Note: You can use predictions = (pval < epsilon) to get a binary vector
    %       of 0's and 1's of the outlier predictions

    cv_pred = pval < epsilon;
    tp = sum((cv_pred == 1) & (yval == 1));
    fp = sum((cv_pred == 1) & (yval == 0));
    fn = sum((cv_pred == 0) & (yval == 1));
    prec = tp / (tp + fp);
    recall = tp / (tp + fn);
    F1 = 2 * prec * recall / (prec + recall);

    % =============================================================

    if F1 > bestF1
       bestF1 = F1;
       bestEpsilon = epsilon;
    end
end

end

13.7. collaborative filtering cost function

Pasted-Image-20231225103335.png

Pasted-Image-20231225103856.png

function [J, grad] = cofiCostFunc(params, Y, R, num_users, num_movies, ...
                                  num_features, lambda)
%COFICOSTFUNC Collaborative filtering cost function
%   [J, grad] = COFICOSTFUNC(params, Y, R, num_users, num_movies, ...
%   num_features, lambda) returns the cost and gradient for the
%   collaborative filtering problem.
%

% Unfold the U and W matrices from params
X = reshape(params(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(params(num_movies*num_features+1:end), ...
                num_users, num_features);


% You need to return the following values correctly
J = 0;
X_grad = zeros(size(X));
Theta_grad = zeros(size(Theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost function and gradient for collaborative
%               filtering. Concretely, you should first implement the cost
%               function (without regularization) and make sure it is
%               matches our costs. After that, you should implement the
%               gradient and use the checkCostFunction routine to check
%               that the gradient is correct. Finally, you should implement
%               regularization.
%
% Notes: X - num_movies  x num_features matrix of movie features
%        Theta - num_users  x num_features matrix of user features
%        Y - num_movies x num_users matrix of user ratings of movies
%        R - num_movies x num_users matrix, where R(i, j) = 1 if the
%            i-th movie was rated by the j-th user
%
% You should set the following variables correctly:
%
%        X_grad - num_movies x num_features matrix, containing the
%                 partial derivatives w.r.t. to each element of X
%        Theta_grad - num_users x num_features matrix, containing the
%                     partial derivatives w.r.t. to each element of Theta
%

xt = X * Theta'; % m * n * n * u = m * u
df = (xt- Y) .* R; % m * u
J = 0.5 * sum(sum(df .^ 2));
J += 0.5 * lambda * (sum(sum(Theta .^ 2)) + sum(sum(X .^ 2)));


X_grad = df * Theta; % m * u * u * n = m * n;
X_grad += lambda * X;

Theta_grad = df' * X; % u * m * m * n = u * n;
Theta_grad += lambda * Theta;

% =============================================================

grad = [X_grad(:); Theta_grad(:)];

end

14. octave notes

  • basic syntax
    • ==(equal), ~=(not equal), xor(a,b), PS1('new promot: ') starts comment with %
    • ; to supersede output
    • m(:) expand m to a single column vector.
    • m(i,j) could be used as left operand. And for i,j could be
      • [a b] which means ath and bth elements.
      • or s:i:e/s:e which is shorthand of [a,b] # end as a keyword.
      • or : which means every elements.
  • output control
    • disp(value) to print value
    • sprintf to format string in C syntax.
    • format long/short # change output format of numerical values.
    • hist(matrix, [bucket]) to plot histogram of matrix.
  • generate matrix
    • [a b ; c d; e f ]. 3 * 2 matrix.
    • [start : interval : end] / [ start : end ] as interval = 1.
    • ones/zeros(m,n) m * n matrix with elements are all 0/1.
    • rand(m,n) m * n matrix with uniform distributed random elements in range [0,1]
    • randn Gaussian distribution with mean = 0 and standard deviation = 1
    • eye(n) n * n identity matrix
  • import/export data
    • size, who/whos, clear [<value>],
    • load <file>, save <file> <value> [-ascii % readable form]
  • compute command
    • . means elements wise operation.
    • A .* B , A .^ 2 , 1 ./ A , A .+ 1 , A .- 1
    • log, exp, abs, -(neg), <
    • A * B
    • A' # tranpose
    • max/find/magic/sum/prod/floor/ceil/flipud/pivn
  • drawing command
    • plot(x,y)
    • hold on # plot new figure on old one.
    • xlabel / ylabel / legend / title / print # output to file / close / axis
    • figure # create figure / clf # clear figure
    • subplot(m,n,ind) # divides plot a m*n grid, and access ind th
    • "imagesc(<matrix>) , colorbar , colormap gray;"
  • flow control
    • for i=<range>[1:10], <stmt> end;
    • if <cond>, <stmt> end; / elseif / else
    • while <cond>, <stmt> end;
    • break/continue
    • function definition resides in function_name.m
      • function y = <name>(x)
      • function [y1,y2] = <name>(x)
      • y = x * 2
      • name(x) to call function.
    • addpath to add path in order to find function file.