Classification problems: we provide a correctly labeled data set and expect the algorithm to predict a discrete valued output. (The number of features can be more than 2, even infinite.)
Unsupervised learning
Feature:
We just give the algorithm the data and hope it can automatically find structure in the data for us.
Including:
Clustering algorithms
Problems:
In fact, I find that unsupervised learning looks a lot like the classification problems. The only difference is whether or not the correct answers are given in the data set ahead of time.
Summary:
We just give the algorithm a disorderly, unlabeled data set and hope it can automatically find the structure of the data and divide it into sensible clusters.
Octave
Usage:
Prototype the algorithm in Octave; once the algorithm works, port it to other programming environments. This workflow is far more efficient.
Part 2: Linear Regression with One Variable
Algorithm implementation process:
Method:
Training set → Learning algorithm → hypothesis function h. Find the appropriate form to represent the hypothesis function.
In short, we provide a Training Set and use it to train our Learning Algorithm, eventually obtaining a hypothesis function h parameterized by theta. This is the prediction model: we feed it a set of input values and it returns the corresponding predicted value.
Cost function
About the h(x) function
For the housing price problem, we decide to use a linear function to fit the data: let h(x) = theta0 + theta1 * x.
Because what we ultimately want to obtain is h(x), and x keeps changing (each training example corresponds to a different set of input values, so it cannot be fixed), what the algorithm actually needs to determine are the fixed quantities: the thetas. In the later steps, since the thetas are what we solve for, we will more often treat x as a constant and the thetas as the variables, so be careful not to get confused.
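As a minimal Octave sketch (the names x, y, and computeCost are illustrative, not code from the course), assuming the usual squared-error cost: h(x) is evaluated for every training example at once, and J(theta) averages the squared prediction errors.
% x: m-by-1 vector of inputs, y: m-by-1 vector of targets
% theta: 2-by-1 vector [theta0; theta1]
function J = computeCost(x, y, theta)
  m = length(y);                        % number of training examples
  h = theta(1) + theta(2) * x;          % hypothesis h(x) = theta0 + theta1*x
  J = (1 / (2*m)) * sum((h - y).^2);    % squared-error cost
end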
Gradient descent
alpha is the learning rate, which controls how big a step we take when updating the parameters. If alpha is too small, the algorithm runs slowly; but if alpha is too big, a step can overshoot and even move away from the minimum point. Also, the amount theta decreases by is different on every step, because the actual step is determined by the whole term [alpha * derivative term]: alpha is an important factor in the step size, but not the only one. After each update the derivative term gets smaller, so the next step is smaller too; in this way the descent slows down gradually until the derivative term reaches 0.
The rest of the formula is the derivative term (the partial derivative of J with respect to theta).
In this image, we can see that whether the derivative term is positive or negative, the gradient descent algorithm works correctly: theta always moves toward the minimum.
The details of the formula:
Attention: as the parameters change, the derivative term changes too. So: make sure to update the parameters at the same time (simultaneous update). Why? Because each derivative should be computed with the former parameters rather than the new ones. How: before all the thetas have finished updating, we store the newly computed values in a temp set; only after the last theta has been computed do we copy the new values from the temp set back into the original thetas, as sketched below.
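A minimal sketch of one simultaneous-update step in Octave (derivativeTerm0 and derivativeTerm1 are hypothetical helpers standing in for the partial derivatives; alpha is the learning rate):
% compute BOTH derivative terms with the current (old) theta first
grad0 = derivativeTerm0(theta);   % hypothetical helper: dJ/dtheta0 at the old theta
grad1 = derivativeTerm1(theta);   % hypothetical helper: dJ/dtheta1 at the old theta
temp0 = theta(1) - alpha * grad0;
temp1 = theta(2) - alpha * grad1;
theta(1) = temp0;                 % only now overwrite the parameters
theta(2) = temp1;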
Gradient descent for linear regression
“Batch” Gradient descent :
Each step of gradient descent uses all the training examples .
In the gradient descent formula there is a term that sums over all m training examples.
Idea:
Combine gradient descent with the cost function.
The result:
This is the result after taking the derivatives. Comparing the two formulas, the second one has an extra x term at the end. This reminds us that the variables here are theta0 and theta1, not x or y, even though x and y look more like variables.
After differentiation, nothing extra is left in the first formula; in the second, only the constant term x is left.
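Putting the derived derivative terms into the update rule, here is a minimal sketch of batch gradient descent for one-variable linear regression in Octave (the function name and the inputs x, y, alpha, num_iters are illustrative assumptions):
function theta = gradientDescent(x, y, alpha, num_iters)
  m = length(y);
  theta = zeros(2, 1);                  % start from theta0 = theta1 = 0
  for iter = 1:num_iters
    h = theta(1) + theta(2) * x;        % hypothesis with the current theta
    % both derivative terms sum over all m examples ("batch")
    temp0 = theta(1) - alpha * (1/m) * sum(h - y);         % d/dtheta0: no x factor
    temp1 = theta(2) - alpha * (1/m) * sum((h - y) .* x);  % d/dtheta1: the extra x factor
    theta = [temp0; temp1];             % simultaneous update
  end
end
The extra .* x in the temp1 line is exactly the x term that is left over after differentiating with respect to theta1.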
Feature scaling
Purpose: make gradient descent run faster. Especially if we use a polynomial regression model, feature scaling is very necessary before running gradient descent.
Idea: make sure the features are on a similar scale.
Usage:
The range guideline: getting every feature into roughly the range -3 to 3, or -1/3 to 1/3, is fine. In fact there is no problem with rescaling, because each feature value is just a known real number, so scaling it does not change the problem. x is a variable, but it is known: we can read it directly from the data set, so we can apply feature scaling to it. theta is unknown; it is the target we are solving for, so it is not something we scale.
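A minimal sketch of mean normalization for one feature in Octave (the names mu, s, and x_scaled are illustrative): shift each value by the feature's mean and divide by its range so the scaled values land in a small, comparable interval around 0.
mu = mean(x);             % average of the feature over the training set
s  = max(x) - min(x);     % range of the feature (the standard deviation also works)
x_scaled = (x - mu) / s;  % roughly centered at 0, within about [-1, 1]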
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];                    % value of the cost function at theta
  gradient = [...code to compute derivative of J(theta)...];  % vector of partial derivatives
end
options = optimset('GradObj', 'on', 'MaxIter', 100);  % we supply the gradient; cap iterations at 100
initialTheta = zeros(2,1);                            % starting point for the search
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
% Pretty dizzying... who can make sense of this?
Advanced Optimization
Still hazy to me...
Gradient descent is not the best method we have for solving for theta; there are other algorithms that are more advanced and more sophisticated.
The conjugate gradient method, BFGS (a variable metric method), and L-BFGS (a limited-memory variable metric method) are some of these more advanced optimization algorithms.
Advantages: usually no need to manually pick a learning rate, and they typically run far faster than gradient descent.
Octave implementation:
% Define the cost function
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;  % J(theta), minimized at theta = [5; 5]
  gradient = zeros(2,1);                   % column vector of partial derivatives
  gradient(1) = 2*(theta(1)-5);            % dJ/dtheta(1)
  gradient(2) = 2*(theta(2)-5);            % dJ/dtheta(2)
end
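To run it, the same fminunc call shown earlier applies directly; since J(theta) = (theta(1)-5)^2 + (theta(2)-5)^2 is minimized at theta = [5; 5], optTheta should come out very close to that point.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
% optTheta should be approximately [5; 5] and functionVal close to 0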