Machine Learning by Andrew Ng

1. definition
2. prerequisite
3. supervised learning
4. neural networks
5. support vector machine
6. advice for applying ML
7. advice for designing ML
8. unsupervised learning
9. dimensionality reduction
10. anomaly detection
11. recommender system
12. ML in large scale
13. appendix code
14. octave notes

https://class.coursera.org/ml-007

1. definition

T(task), E(experience), P(performance) # a program learns if its performance P on task T improves with experience E

2. prerequisite

linear algebra
- matrices / vectors # addition / subtraction / multiplication / inversion / transposition
- some matrices are not invertible; they are called singular / degenerate # caused by redundant (linearly dependent) features or too many features
- pseudo-inverse (pinv, also works on non-invertible matrices) and inverse (inv)
Octave #note: python-scikit-learn

3. supervised learning

http://math.stackexchange.com/questions/141381/regression-vs-classification
regression # continuous valued output.
- squared error function (as cost function) # ordinary least squares
- linear regression
- polynomial regression
classification # discrete valued output.
- decision boundary
- binary-class(negative/positive) vs. multi-class
- logistic regression
  - sigmoid / logistic function
  - g(x) = 1 / (1 + e^-x), output in (0, 1)
  - interpretation of hypothesis output: h(x) = p(y=1|x;theta)
  - p(y=0|x;theta) + p(y=1|x;theta) = 1 (for binary-class)
  - h(x) = 1 / (1 + e^-(theta * x)), output in (0, 1)
  - cost function = [-log(h(x)) if y = 1, -log(1-h(x)) if y = 0]
  - => -(log(h(x)) * y + log(1-h(x)) * (1-y))
- SVM(support vector machine)
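#note: a minimal Python sketch (not part of the course material) of the sigmoid hypothesis and the logistic cost listed under logistic regression above. The function names and the tiny dataset are made up for illustration; each x is assumed to start with a bias term 1.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h(x) = g(theta * x); x is assumed to include the bias term x[0] = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def logistic_cost(theta, X, y):
    """Average of -(y * log(h) + (1 - y) * log(1 - h)) over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += -(yi * math.log(h) + (1 - yi) * math.log(1 - h))
    return total / len(y)

# Hypothetical toy data: bias term plus one feature.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
y = [0, 1, 0]

print(sigmoid(0.0))                     # 0.5
print(logistic_cost([0.0, 0.0], X, y))  # log(2): h = 0.5 for every example
```

With theta = 0 every prediction is 0.5, so the cost equals log(2) regardless of the labels; any useful theta should push the cost below that.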
training set / historical data set.
- input variables / features
  - univariate # single input variable
  - multivariate # multiple input variables.
  - feature scaling / mean normalization
- output variables / targets
- feature scaling # bring every feature into approximately [-1,1]; otherwise gradient descent needs more steps to reach the minimum
hypothesis parameters(theta) and hypothesis(theta * x)
cost function
- convex and non-convex function
- "batch" = uses all training set
- gradient descent algorithm
  - learning rate, derivative term.
  - if learning rate is too small, convergence can be slow.
  - if learning rate is too large, it may fail to converge or even diverge.
- gradient checking
- optimization algorithm: conjugate gradient / BFGS / L-BFGS
  - no need to manually pick learning rate
  - often faster than gradient descent
  - caller provides the cost function and its partial derivatives
  - 'fmincg' or 'fminunc' in Octave
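#note: a minimal Python sketch (not part of the course material) of batch gradient descent on the squared error cost for linear regression. The function name, the learning rate 0.1, the iteration count and the toy data are assumptions chosen for illustration.

```python
def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Batch gradient descent for linear regression with squared error cost.

    X: list of rows, each starting with the bias term 1.0.
    alpha: learning rate; iterations: number of full passes over the data.
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Residuals h(x) - y use every training example ("batch").
        errors = [sum(t * xi for t, xi in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        # Partial derivative of the cost with respect to each theta_j.
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
                for j in range(n)]
        # Simultaneous update of all parameters.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data generated from y = 1 + 2x, so theta should approach [1, 2].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(X, y)
print([round(t, 3) for t in theta])
```

Raising alpha far enough makes the same loop diverge on this data, which is the "learning rate too large" case from the list above.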
overfitting
- problem
  - if underfit -> high bias
  - if overfit -> high variance
  - fails to generalize to new examples
  - generalization ability
- addressing
  - reduce number of features.
    - manually select which features to keep
    - model selection algorithm
  - regularization
    - keep all features but reduce magnitude/values of parameters
    - works well when we have a lot of features
    - if regularization parameter is very large -> underfitting.
    - L1 norm, L2 norm.

4. neural networks

motivation
- complex non-linear classification / hypothesis
- many features -> too many polynomial terms.
  - terms of degree x: O(n^x).
  - quadratic: O(n^2)
  - cubic: O(n^3)
background
- origins: algorithms that try to mimic the brain
- widely used in the 80s and early 90s; popularity diminished in the late 90s
- recent resurgence
- "one learning algorithm" hypothesis # the cortex appears to use one general learning algorithm for different kinds of input
model representation
- neuron in the brain
  - dendrite = input wire
  - axon = output wire
  - cell body / nucleus
  - neurons communicate by spikes (pulses of electricity)
- neuron model: logistic unit
  - sigmoid (logistic) activation function
  - hypothesis parameter = weight
  - 数学之美 (The Beauty of Mathematics) C30: a neuron function can only apply a single non-linear transformation to the result of a linear combination of its input variables.
- layer: input/output/hidden
  - a(i,j) = "activation" of unit i in layer j
  - theta(j) = matrix of weights controlling function mapping from layer j to layer j+1
  - if network has s(j) units in layer j, and s(j+1) units in layer j+1, then theta(j) is an s(j+1) x (s(j)+1) matrix (the +1 is the bias unit)
- forward propagation
- backward propagation
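#note: a minimal Python sketch (not part of the course code) of forward propagation with the theta(j) shapes described above. The weights are hand-picked so that the hidden units compute AND and NOR and the output unit computes OR, which together give XNOR; the function names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Forward propagation through all layers.

    thetas[j] is a list of rows with shape s(j+1) x (s(j) + 1);
    column 0 of each row multiplies the bias unit.
    """
    a = list(x)
    for theta in thetas:
        a = [1.0] + a  # prepend the bias unit
        a = [sigmoid(sum(w * ai for w, ai in zip(row, a))) for row in theta]
    return a

# Hand-picked weights (illustrative): 2 inputs -> 2 hidden units -> 1 output.
theta1 = [[-30.0, 20.0, 20.0],    # x1 AND x2
          [10.0, -20.0, -20.0]]   # (NOT x1) AND (NOT x2)
theta2 = [[-10.0, 20.0, 20.0]]    # OR of the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        out = forward([theta1, theta2], [x1, x2])[0]
        print(x1, x2, round(out))  # 1 when x1 == x2, else 0
```

theta1 is 2 x 3 and theta2 is 1 x 3, matching s(j+1) x (s(j)+1) for a 2-2-1 network.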
backpropagation algorithm # computes the partial derivatives of the cost with respect to the parameters
- general cost function
- delta(j,l) = "error" of node j in layer l
- intuition # use backpropagation algorithm to compute derivatives.
- implementation
  - unroll parameters
  - gradient checking (slow, so turn it off before training) to verify backprop derivatives
  - initialize parameters randomly [symmetry breaking] (otherwise all hidden units compute the same feature)
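#note: a minimal Python sketch (not part of the course code) of gradient checking: a two-sided finite difference is compared with an analytic gradient. The cost function here is a made-up stand-in for a network cost, and eps = 1e-4 is an assumed step size.

```python
def numerical_gradient(cost, theta, eps=1e-4):
    """Two-sided finite-difference estimate of d cost / d theta_j."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        plus[j] += eps
        minus = list(theta)
        minus[j] -= eps
        grad.append((cost(plus) - cost(minus)) / (2 * eps))
    return grad

# Hypothetical cost with a known analytic gradient.
def cost(theta):
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

def analytic_gradient(theta):
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]

theta = [1.0, 2.0]
approx = numerical_gradient(cost, theta)
exact = analytic_gradient(theta)
# The largest difference should be tiny if the analytic gradient is right.
print(max(abs(a - e) for a, e in zip(approx, exact)))
```

Each parameter costs two extra cost evaluations, which is why the check is only run on a small network before training starts.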
putting together
- network architecture
  - no. of input units: dimension of features
  - no. of output units: number of classes
  - hidden layer
    - reasonable default: 1 hidden layer; with >1 hidden layers, use the same no. of hidden units in every layer (usually the more units the better)
    - no. of hidden units = [2,3,4] * no. input units.
  - network size
    - small # fewer parameters, more prone to underfitting, computationally cheaper.
    - large # more parameters, more prone to overfitting, computationally more expensive.
- training a neural network
  - randomly initialize weights
  - for-loop over the training examples.
  - forward propagation to compute activation
  - compute cost function
  - backward propagation to compute partial derivatives
  - gradient checking
  - gradient descent algorithm

5. support vector machine

alternative view of logistic regression
SVM cost function # replace the sigmoid-based cost terms with two simple functions (cost0 and cost1)
- cost function = y * cost1(tx) + (1-y) * cost0(tx) # tx = theta' * x
- hypothesis: y = 1 if tx >= 0. y = 0 otherwise.
- cost0 and cost1 are in fact hinge loss functions
SVM decision boundary / large margin intuition (if C very large) # the SVM decision boundary is the hyperplane that separates the classes perfectly
kernel / kernel function
- for more features
- to compute similarity (with landmarks) as more complex, non-linear features.
- gaussian kernel function.
  - K(x,y,e) = exp(-|x-y|^2 / (2 * e^2)) # here e is the kernel width (sigma), not Euler's number
  - if e^2 is large, high bias and low variance
  - if e^2 is small, low bias and high variance
- output range (0,1]
how it works
- choose typical landmarks.
- compute similarity with landmarks as input [0,1]
- translate into a typical classifier problem.
- number of features == number of landmarks.
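#note: a minimal Python sketch (not from libsvm or the course code) of the Gaussian kernel and of turning an input into one similarity feature per landmark. Function names, landmarks and sigma are made up for illustration.

```python
import math

def gaussian_kernel(x, landmark, sigma):
    """Similarity exp(-|x - l|^2 / (2 * sigma^2)); equals 1 when x == l."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    """One similarity feature per landmark."""
    return [gaussian_kernel(x, l, sigma) for l in landmarks]

# Hypothetical landmarks; in practice the training examples serve as landmarks.
landmarks = [[0.0, 0.0], [3.0, 4.0]]
f = kernel_features([0.0, 0.0], landmarks, sigma=5.0)
print(f)  # first feature is 1.0, second is exp(-0.5)
```

The new feature vector has as many entries as there are landmarks, and a linear classifier is then trained on f instead of x.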
practice
- liblinear, libsvm
- specify 1) choice of parameter C 2) kernel function
- no kernel / linear kernel function # when n (features) >> m (examples)
- gaussian kernel function # when m (examples) >> n (features)
- polynomial kernel function
- string kernel / chi-square kernel / histogram intersection kernel

6. advice for applying ML

unacceptably large errors in its predictions
- don't go by gut feeling and try the following things at random
- get more training examples. (more is not always better) => fix high variance
- try smaller sets of features. => fix high variance
- try getting additional features. => fix high bias
- try polynomial features. => fix high bias
- try decreasing lambda => fix high bias; try increasing lambda => fix high variance
system diagnostics
- evaluating hypothesis
  - split examples randomly into training set(70%) and test set(30%).
  - compare J_test(theta) with J_train(theta) to see whether the hypothesis overfits.
- model selection (for choosing polynomial terms and regularization)
  - split examples randomly into training set(60%), cross validation set(20%), and test set(20%)
  - use cross validation set to select model, then use test set to estimate generalization error.
  - validation curves.
- high bias vs. variance
  - bias > underfit: J_train(theta) is high, J_cv/test(theta) ~= J_train(theta)
  - variance > overfit: J_train(theta) is low, but J_cv/test(theta) >> J_train(theta)
  - learning curves # J_cv/test(theta) and J_train(theta) over training set size
  - if suffers from high bias, more training data will not help
  - if suffers from high variance, more training data might help
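#note: a minimal Python sketch (not part of the course material) of the 60/20/20 split and of reading bias vs. variance from J_train and J_cv. The function names, the fixed seed and the target_error threshold are assumptions for illustration; the diagnosis is a rough rule of thumb.

```python
import random

def split_dataset(examples, train=0.6, cv=0.2, seed=0):
    """Shuffle, then split into training / cross validation / test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_cv = round(len(shuffled) * cv)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_cv],
            shuffled[n_train + n_cv:])

def diagnose(j_train, j_cv, target_error):
    """target_error is a hypothetical acceptable error level."""
    if j_train > target_error:
        return "high bias"       # underfit: training error itself is high
    if j_cv > target_error:
        return "high variance"   # overfit: low training error, high cv error
    return "ok"

train_set, cv_set, test_set = split_dataset(list(range(100)))
print(len(train_set), len(cv_set), len(test_set))  # 60 20 20
print(diagnose(0.30, 0.32, 0.10))  # high bias
print(diagnose(0.02, 0.40, 0.10))  # high variance
```

The test set is left untouched until a model has been chosen on the cross validation set, so its error remains an honest estimate.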

7. advice for designing ML

numerical evaluation # a single real number that tells how well your system is doing.
error analysis # spot any systematic trend in the types of examples it makes errors on.
skewed classes.
- y = 1 in presence of rare class # always predicting y_pred = 0 has no predictive power, yet its accuracy is very high
- precision = true positive / [no. of predicted positive = (true pos + false pos)]
- recall = true positive / [no. of actual positive = (true pos + false neg)]
- good classifier: precision and recall are both high enough.
  - but there are tradeoffs between both
  - F1 score = 2 * P * R / (P + R)
  - #note: see "anomaly detection select threshold" for how to compute P, R, and F1.
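#note: a minimal Python sketch (not part of the course code) of precision, recall and F1 on a skewed toy dataset. The function name and data are made up; returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Skewed toy labels: only 2 of 10 examples are positive.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 10
some_classifier = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                                      # 0.8
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)
print(precision_recall_f1(y_true, some_classifier))  # (0.5, 0.5, 0.5)
```

Always predicting 0 scores 80% accuracy but an F1 of 0, which is why accuracy alone is misleading on skewed classes.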
large data rationale
- assume features have sufficient information to predict accurately
- useful test: give the input x, can a human expert confidently predict y?

8. unsupervised learning

cluster algorithm
cocktail party problem
K-means algorithm
- cluster centroid
- K = cluster number, k = cluster index
  - should have K < m
  - choose K manually(most time) or with elbow method
- objective function = average squared distance between training examples and their assigned centroids.
  - non-convex, so risk of local optima
  - randomly choose centroids from training set.
  - multiple random initialization
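#note: a minimal Python sketch (not part of the course code) of K-means with centroids initialized from the training set and several random restarts. The function names, iteration count, number of restarts and toy points are assumptions for illustration.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_points(points, centroids):
    """Index of the closest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points]

def kmeans(points, K, iterations=10, seed=0):
    rng = random.Random(seed)
    # Initialize centroids from K distinct training examples (needs K < m).
    centroids = [list(p) for p in rng.sample(points, K)]
    for _ in range(iterations):
        assign = assign_points(points, centroids)   # cluster assignment step
        for k in range(K):                          # move centroid step
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    assign = assign_points(points, centroids)
    cost = sum(dist2(p, centroids[a])
               for p, a in zip(points, assign)) / len(points)
    return centroids, assign, cost

def kmeans_restarts(points, K, tries=5):
    """Run several random initializations and keep the lowest cost."""
    return min((kmeans(points, K, seed=s) for s in range(tries)),
               key=lambda result: result[2])

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assign, cost = kmeans_restarts(points, K=2)
print(sorted(centroids), assign, round(cost, 3))
```

Keeping the run with the lowest cost is what guards against a poor local optimum from an unlucky initialization.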

9. dimensionality reduction

motivation
- data compression
- data visualization
- speed up learning algorithm
PCA(principal component analysis)
- find k vectors onto which to project the data
- minimize the projection error (orthogonal distance; unlike linear regression, which minimizes the vertical error in y)
- algorithm # reduce n dimensions to k dimensions
  - sigma = 1/m * sum{X(i) * X(i)'}. X(i)~(n*1), so sigma~(n*n)
  - [U,S,V] = svd(sigma) # singular value decomposition
  - U~(n*n). use first k columns, called U_reduce~(n*k)
  - z = U_reduce' * X(i) ~ (k*n) * (n*1) = (k*1)
  - reconstruct: X_approx(i) = U_reduce * z ~ (n*k) * (k*1) = (n*1)
- choose k # so that r% of variance is retained.
  - r = 100 * sum{i=1,k}S_{ii} / sum{i=1,n}S_{ii} (S from svd, diagonal matrix)
  - r = 99 is a typical value
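#note: a minimal Python sketch (not part of the course code) of choosing k from the diagonal of S so that a given fraction of variance is retained. The function name and the sample values are made up for illustration; the values are assumed to be sorted in decreasing order, as svd returns them.

```python
def choose_k(singular_values, retain=0.99):
    """Smallest k whose leading values keep at least `retain` of the total."""
    total = sum(singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s
        if running / total >= retain:
            return k
    return len(singular_values)

# Hypothetical diagonal of S for n = 4 features.
S_diag = [6.0, 3.0, 0.95, 0.05]
print(choose_k(S_diag))        # 3: the first three values keep 99.5%
print(choose_k(S_diag, 0.85))  # 2: the first two values keep 90%
```

Because svd is run once, trying different retention levels costs only this loop, not a new decomposition.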
comments
- don't use PCA to prevent overfitting (use regularization instead)
- use raw data first, then consider PCA

10. anomaly detection

gaussian distribution
- X ~ N(u, e^2) # X is distributed as N, with mean = u and variance = e^2
- p(x; u, e^2) = 1 / (sqrt(2 * pi) * e) * exp(-(x-u)^2 / (2 * e^2)) # probability density
- multivariate version
  - captures anomalous combinations of values, but is computationally expensive.
  - u~(n*1), e~(n*n) (covariance matrix) # intuition: contours need not be axis-aligned.
  - p(x; u, e) = 1 / ((2 * pi)^(n/2) * sqrt(det(e))) * exp(-0.5 * (x-u)' * e^-1 * (x-u))
  - u = 1/m * sum{x}, e = 1/m * sum{(x-u) * (x-u)'}
  - #note: need m > n, otherwise e is non-invertible.
how it works # we assume the features follow a Gaussian distribution, so anomalous points get a very low probability
- model p(x) from data
- p(x) < epsilon to decide if anomalous
  - epsilon # hard to choose when p(x) is comparable for normal and anomalous examples.
  - in that case, look for features that distinguish normal from anomalous examples.
  - p(x) = p1(x1; u1, e1^2) * ... * pn(xn; un, en^2) # n = # of features.
  - if xj is not a gaussian feature, transform it so that it fits a gaussian distribution.
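#note: a minimal Python sketch (not part of the course code) of the per-feature Gaussian model above: estimate u and e^2 for each feature, multiply the densities, and flag p(x) < epsilon. The function names, toy data and epsilon = 1e-3 are assumptions for illustration.

```python
import math

def estimate_gaussian(X):
    """Per-feature mean and variance (1/m normalization)."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def density(x, mu, sigma2):
    """p(x) as the product of independent univariate Gaussian densities."""
    p = 1.0
    for xj, uj, s2 in zip(x, mu, sigma2):
        p *= math.exp(-(xj - uj) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p

# Hypothetical normal examples with two features.
X_train = [[1.0, 10.0], [1.2, 9.5], [0.8, 10.5], [1.1, 10.2], [0.9, 9.8]]
mu, sigma2 = estimate_gaussian(X_train)
epsilon = 1e-3  # assumed threshold; normally tuned on a validation set

def is_anomaly(x):
    return density(x, mu, sigma2) < epsilon

print(is_anomaly([1.0, 10.0]))  # False: sits at the mean
print(is_anomaly([3.0, 5.0]))   # True: far from the mean in both features
```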
vs. supervised learning
- anomaly detection
  - # of positive cases is very small, while # of negative cases is very large
  - many different types of "anomaly", hard to learn from positive cases what anomalies look like
  - future anomalies may be very different from current ones.
  - fraud detection, manufacturing, monitoring machines.
- supervised learning
  - # of positive cases and negative cases are both very large
  - enough positive cases to learn what positive cases look like
  - future positive cases are similar to current ones.
  - email spam, weather prediction, cancer classification.

11. recommender system

content based recommendation
collaborative filtering algorithm
- low rank matrix factorization
- random initialization to break symmetry
- content features to compute similarity between items
- mean normalization # for missing values, use the attribute's mean instead
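#note: a minimal Python sketch (not part of the course code) of mean normalization for a ratings matrix: each movie's mean over its rated entries is subtracted before learning and added back at prediction time. The function name and toy matrices are made up; Y is movies x users and R[i][j] = 1 marks a rated entry, as in the appendix.

```python
def mean_normalize(Y, R):
    """Subtract each movie's mean rating, using only rated entries."""
    means, Y_norm = [], []
    for y_row, r_row in zip(Y, R):
        rated = [y for y, r in zip(y_row, r_row) if r == 1]
        mu = sum(rated) / len(rated) if rated else 0.0
        means.append(mu)
        # Unrated entries stay 0 so they do not contribute to the cost.
        Y_norm.append([y - mu if r == 1 else 0.0
                       for y, r in zip(y_row, r_row)])
    return Y_norm, means

# Hypothetical data: 2 movies, 3 users; the third user has rated nothing.
Y = [[5, 4, 0],
     [1, 0, 0]]
R = [[1, 1, 0],
     [1, 0, 0]]
Y_norm, means = mean_normalize(Y, R)
print(means)   # [4.5, 1.0]
print(Y_norm)  # [[0.5, -0.5, 0.0], [0.0, 0.0, 0.0]]
```

A user with no ratings ends up with theta close to 0 under regularization, so the prediction theta' * x + mean falls back to each movie's mean rating.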

12. ML in large scale

stochastic gradient descent algorithm
- vs. batch gradient descent # can consume training data incrementally
- randomly shuffle dataset
- repeat { for i = 1..m { for j = 0..n { update theta_j using only the i-th example } } }
- generally moves toward the global minimum, but not on every iteration.
- convergence checking
  - plot the cost averaged over the last k (say 1000) examples.
  - the larger k, the smoother the cost curve.
  - can slowly decrease learning rate over time to help convergence.
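#note: a minimal Python sketch (not part of the course code) of stochastic gradient descent for linear regression: shuffle, then update theta from one example at a time. The function name, learning rate, epoch count, seed and toy data are assumptions for illustration.

```python
import random

def sgd(X, y, alpha=0.05, epochs=1000, seed=0):
    """Stochastic gradient descent on the squared error cost.

    X: list of rows, each starting with the bias term 1.0.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    data = list(zip(X, y))
    for _ in range(epochs):
        rng.shuffle(data)  # randomly shuffle the dataset
        for row, yi in data:
            # Error of the current hypothesis on this single example.
            error = sum(t * xi for t, xi in zip(theta, row)) - yi
            # Update every theta_j using only this example.
            theta = [t - alpha * error * xi for t, xi in zip(theta, row)]
    return theta

# Toy data generated from y = 1 + 2x without noise.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = sgd(X, y)
print([round(t, 2) for t in theta])
```

On noisy data the parameters keep wandering around the minimum instead of settling, which is where a slowly decreasing learning rate helps.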
mini-batch gradient descent algorithm
- between batch and stochastic gradient descent
- use b(say 10) examples in one iteration
- take advantage of vectorization
online learning
map-reduce and data parallelism
more data
- collect from multiple sources
- artificial data synthesis
ceiling analysis

13. appendix code

13.1. feature normalization

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);


% ============================================================

end

13.2. linear regression cost function

#note: works for polynomial regression too.

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)
%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear
%regression with multiple variables
%   [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the
%   cost of using theta as the parameter for linear regression to fit the
%   data points in X and y. Returns the cost in J and the gradient in grad

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost and gradient of regularized linear
%               regression for a particular choice of theta.
%
%               You should set J to the cost and grad to the gradient.
%

diff = X * theta - y;
J = sum(diff .^ 2) * 0.5 / m;
t = theta;
t(1) = 0;
J += sum(t .^ 2) * lambda * 0.5 / m;
grad = ((X' * diff) + lambda * t) / m;

% =========================================================================

grad = grad(:);

end

13.3. neural network cost function

function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, input_layer_size, hidden_layer_size, ...
%   num_labels, X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the code by working through the
%               following parts.
%
% Part 1: Feedforward the neural network and return the cost in the
%         variable J. After implementing Part 1, you can verify that your
%         cost function computation is correct by verifying the cost
%         computed in ex4.m
%
% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad, respectively. After implementing Part 2, you can check
%         that your implementation is correct by running checkNNGradients
%
%         Note: The vector y passed into the function is a vector of labels
%               containing values from 1..K. You need to map this vector into a
%               binary vector of 1's and 0's to be used with the neural network
%               cost function.
%
%         Hint: We recommend implementing backpropagation using a for-loop
%               over the training examples if you are implementing it for the
%               first time.
%
% Part 3: Implement regularization with the cost function and gradients.
%
%         Hint: You can implement this around the code for
%               backpropagation. That is, you can compute the gradients for
%               the regularization separately and then add them to Theta1_grad
%               and Theta2_grad from Part 2.
%

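% Forward propagation, vectorized over all m examples
X2 = [ones(m, 1)  X];   % input layer with bias column, M * (N+1)
tx2 = X2 * Theta1';     % hidden layer pre-activation, M * H
hx2 = sigmoid(tx2);     % hidden layer activation
X3 = [ones(m, 1) hx2];  % hidden layer with bias column, M * (H+1)
tx3 = X3 * Theta2';     % output layer pre-activation, M * K
hx3 = sigmoid(tx3);     % hypothesis output, M * K

% Map labels 1..K to one-hot rows
hy = zeros(m, num_labels);
for i = 1:m
    hy(i, y(i)) = 1;
end

% Unregularized cost, summed over examples and output units
J = sum(sum(log(hx3) .* (-hy) - log(1 - hx3) .* (1 - hy))) / m;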

R = 0;
R += sum(sum(Theta1(:, 2:end) .^ 2));
R += sum(sum(Theta2(:, 2:end) .^ 2));
R *= lambda / m * 0.5;

J += R;

% -------------------------------------------------------------

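% Backpropagation, vectorized (M examples, K labels, H hidden units, N inputs)
d3 = hx3 - hy; % output layer error, M * K
d2 = (d3 * Theta2)(:,2:end) .* sigmoidGradient(tx2); % hidden layer error without bias column, M * H
Theta2_grad = d3' * X3 / m; % (K * M) * (M * (H+1)) = K * (H+1)
Theta1_grad = d2' * X2 / m; % (H * M) * (M * (N+1)) = H * (N+1)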

t2 = Theta2;
t2(:,1) = 0;
t1 = Theta1;
t1(:,1) = 0;
Theta2_grad += t2 * lambda / m;
Theta1_grad += t1 * lambda / m;

% =========================================================================

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];


end

13.4. pca(principal component analysis)

function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S] = pca(X) computes eigenvectors of the covariance matrix of X
%   Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%

sigma = 1.0 / m * X' * X; % covariance matrix (X assumed mean-normalized), n * n
[U, S, ~] = svd(sigma);   % ~ discards the unused third output



% =========================================================================

end

projectData

function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%on to the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection on to the k-th
%               eigenvector is given as follows:
%                    x = X(i, :)';
%                    projection_k = x' * U(:, k);
%

U_reduce = U(:, 1:K);

Z = X * U_reduce;



% =============================================================

end

recoverData

function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
%   X_rec = RECOVERDATA(Z, U, K) recovers an approximation of the
%   original data that has been reduced to K dimensions. It returns the
%   approximate reconstruction in X_rec.
%

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate)
%               recovered data for dimension j is given as follows:
%                    v = Z(i, :)';
%                    recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%

U_reduce = U(:, 1:K);

X_rec = Z * U_reduce';

% =============================================================

end

13.5. gaussian distribution

compute mean and variance of X

function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a
%Gaussian distribution using the data in X
%   [mu sigma2] = estimateGaussian(X),
%   The input X is the dataset with each n-dimensional data point in one row
%   The output is an n-dimensional vector mu, the mean of the data set
%   and the variances sigma^2, an n x 1 vector
%

% Useful variables
[m, n] = size(X);

% You should return these values correctly
mu = zeros(n, 1);
sigma2 = zeros(n, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the mean of the data and the variances
%               In particular, mu(i) should contain the mean of
%               the data for the i-th feature and sigma2(i)
%               should contain variance of the i-th feature.
%

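mu = mean(X)';
% Equivalent explicit form, kept for reference:
% xu = X - mu';
% sigma2 = 1.0 / m * sum(xu .^ 2)';
% var(X) normalizes by (m-1), so rescale to the 1/m estimate.
sigma2 = (m-1) / m * var(X)';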

% =============================================================


end

compute probability

function p = multivariateGaussian(X, mu, Sigma2)
%MULTIVARIATEGAUSSIAN Computes the probability density function of the
%multivariate gaussian distribution.
%    p = MULTIVARIATEGAUSSIAN(X, mu, Sigma2) Computes the probability
%    density function of the examples X under the multivariate gaussian
%    distribution with parameters mu and Sigma2. If Sigma2 is a matrix, it is
%    treated as the covariance matrix. If Sigma2 is a vector, it is treated
%    as the \sigma^2 values of the variances in each dimension (a diagonal
%    covariance matrix)
%

k = length(mu);

if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
    Sigma2 = diag(Sigma2);
end

X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(Sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(Sigma2), X), 2));

end

13.6. anomaly detection select threshold

function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
%   [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
%   threshold to use for selecting outliers based on the results from a
%   validation set (pval) and the ground truth (yval).
%

bestEpsilon = 0;
bestF1 = 0;
F1 = 0;

stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)

    % ====================== YOUR CODE HERE ======================
    % Instructions: Compute the F1 score of choosing epsilon as the
    %               threshold and place the value in F1. The code at the
    %               end of the loop will compare the F1 score for this
    %               choice of epsilon and set it to be the best epsilon if
    %               it is better than the current choice of epsilon.
    %
    % Note: You can use predictions = (pval < epsilon) to get a binary vector
    %       of 0's and 1's of the outlier predictions

    cv_pred = pval < epsilon;
    tp = sum((cv_pred == 1) & (yval == 1));
    fp = sum((cv_pred == 1) & (yval == 0));
    fn = sum((cv_pred == 0) & (yval == 1));
    prec = tp / (tp + fp);
    recall = tp / (tp + fn);
    F1 = 2 * prec * recall / (prec + recall);

    % =============================================================

    if F1 > bestF1
       bestF1 = F1;
       bestEpsilon = epsilon;
    end
end

end

13.7. collaborative filtering cost function

function [J, grad] = cofiCostFunc(params, Y, R, num_users, num_movies, ...
                                  num_features, lambda)
%COFICOSTFUNC Collaborative filtering cost function
%   [J, grad] = COFICOSTFUNC(params, Y, R, num_users, num_movies, ...
%   num_features, lambda) returns the cost and gradient for the
%   collaborative filtering problem.
%

% Unfold the X and Theta matrices from params
X = reshape(params(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(params(num_movies*num_features+1:end), ...
                num_users, num_features);


% You need to return the following values correctly
J = 0;
X_grad = zeros(size(X));
Theta_grad = zeros(size(Theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost function and gradient for collaborative
%               filtering. Concretely, you should first implement the cost
%               function (without regularization) and make sure it is
%               matches our costs. After that, you should implement the
%               gradient and use the checkCostFunction routine to check
%               that the gradient is correct. Finally, you should implement
%               regularization.
%
% Notes: X - num_movies  x num_features matrix of movie features
%        Theta - num_users  x num_features matrix of user features
%        Y - num_movies x num_users matrix of user ratings of movies
%        R - num_movies x num_users matrix, where R(i, j) = 1 if the
%            i-th movie was rated by the j-th user
%
% You should set the following variables correctly:
%
%        X_grad - num_movies x num_features matrix, containing the
%                 partial derivatives w.r.t. to each element of X
%        Theta_grad - num_users x num_features matrix, containing the
%                     partial derivatives w.r.t. to each element of Theta
%

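% Dimensions below: m = num_movies, n = num_features, u = num_users
xt = X * Theta'; % predicted ratings, (m * n) * (n * u) = m * u
df = (xt - Y) .* R; % errors on rated entries only, m * u
J = 0.5 * sum(sum(df .^ 2)); % squared error cost
J += 0.5 * lambda * (sum(sum(Theta .^ 2)) + sum(sum(X .^ 2))); % regularization


X_grad = df * Theta; % (m * u) * (u * n) = m * n
X_grad += lambda * X;

Theta_grad = df' * X; % (u * m) * (m * n) = u * n
Theta_grad += lambda * Theta;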

% =============================================================

grad = [X_grad(:); Theta_grad(:)];

end

14. octave notes

basic syntax
- ==(equal), ~=(not equal), xor(a,b); PS1('new prompt: ') changes the prompt; comments start with %
- ; at the end of a statement suppresses output
- m(:) expands m into a single column vector.
- m(i,j) can be used as a left operand (assignment target). i and j can each be
  - [a b], which selects the a-th and b-th elements.
  - s:i:e or s:e, a range that is shorthand for an explicit index list # `end` is a keyword for the last index.
  - :, which selects every element.
output control
- disp(value) to print value
- sprintf to format string in C syntax.
- format long/short # change output format of numerical values.
- hist(matrix, [bucket]) to plot histogram of matrix.
generate matrix
- [a b ; c d; e f ]. 3 * 2 matrix.
- [start : interval : end] / [ start : end ] with interval = 1.
- ones/zeros(m,n) m * n matrix with all elements 1/0.
- rand(m,n) m * n matrix with uniformly distributed random elements in range [0,1]
- randn(m,n) m * n matrix with Gaussian distributed random elements, mean = 0 and standard deviation = 1
- eye(n) n * n identity matrix
import/export data
- size, who/whos, clear [<value>],
- load <file>, save <file> <value> [-ascii % readable form]
compute command
- . means element-wise operation.
- A .* B , A .^ 2 , 1 ./ A , A .+ 1 , A .- 1
- log, exp, abs, -(neg), <
- A * B
- A' # transpose
- max/find/magic/sum/prod/floor/ceil/flipud/pinv
drawing command
- plot(x,y)
- hold on # draw subsequent plots on top of the current one.
- xlabel / ylabel / legend / title / print # output to file / close / axis
- figure # create figure / clf # clear figure
- subplot(m,n,ind) # divides the figure into an m*n grid and selects the ind-th cell
- "imagesc(<matrix>) , colorbar , colormap gray;"
flow control
- for i=<range>, <stmt> end; # e.g. <range> = 1:10
- if <cond>, <stmt> end; / elseif / else
- while <cond>, <stmt> end;
- break/continue
- function definition resides in function_name.m
  - function y = <name>(x)
  - function [y1,y2] = <name>(x)
  - y = x * 2
  - name(x) to call function.
- addpath to add path in order to find function file.
