Advanced Machine Learning
Support Vector Machines
Eric Xing
Lecture 6, August 11, 2009

What is a good Decision Boundary?
- Consider a binary classification task with $y = \pm 1$ labels (not 0/1 as before).
- When the training examples are linearly separable, we can set the parameters of a linear classifier so that all the training examples are classified correctly.
- Many decision boundaries do this!
  - Generative classifiers
  - Logistic regression
  - ...
- Are all decision boundaries equally good?
(Figure: two linearly separable classes, Class 1 and Class 2, with several candidate decision boundaries.)

Not All Decision Boundaries Are Equal!
- Why might we end up with such boundaries?
  - Irregular distribution
  - Imbalanced training sizes
  - Outliers

Classification and Margin
- Parameterizing the decision boundary: let $w$ denote a vector orthogonal to the decision boundary and $b$ denote a scalar "offset" term. Then we can write the decision boundary as
  $w^T x + b = 0$
(Figure: Class 1 and Class 2 on either side of the boundary, at distances $d_-$ and $d_+$.)

Classification and Margin, cont.
- Margin:
  $w^T x_i + b > +c$ for all $x_i$ in class 2
  $w^T x_i + b < -c$ for all $x_i$ in class 1
  or, more compactly, $y_i(w^T x_i + b) > c$.
- The margin between the two closest points (one from each class) is $m = d_- + d_+$.

Maximum Margin Classification
- The margin is
  $m = \frac{w^T}{\|w\|}\left(x_{i^*} - x_{j^*}\right) = \frac{2c}{\|w\|}$
- Here is our maximum margin classification problem:
  $\max_{w,b}\ \frac{2c}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \ge c,\ \forall i$

Maximum Margin Classification, cont'd.
- The optimization problem:
  $\max_{w,b}\ \frac{c}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \ge c,\ \forall i$
- But note that the magnitude of $c$ merely scales $w$ and $b$, and does not change the classification boundary at all! (Why?)
- So we instead work on this cleaner problem:
  $\max_{w,b}\ \frac{1}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \ge 1,\ \forall i$
- The solution to this leads to the famous Support Vector Machine, believed by many to be the best "off-the-shelf" supervised learning algorithm.

Support vector machine
- A convex quadratic programming problem with linear constraints:
  $\max_{w,b}\ \frac{1}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \ge 1,\ \forall i$
(Figure: the separating hyperplane with distances $d_+$ and $d_-$ to the closest points of each class.)
- The attained margin is now given by $\frac{1}{\|w\|}$.
- Only a few of the classification constraints are relevant ⇒ support vectors.
- Constrained optimization:
  - We can directly solve this using commercial quadratic programming (QP) code.
  - But we want to take a more careful look at Lagrange duality, and the solution of the above in its dual form:
    ⇒ deeper insight: support vectors, kernels, ...
    ⇒ more efficient algorithms
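For concreteness, here is a minimal sketch of the max-margin problem above posed as a QP. The cvxpy solver and the toy data X, y are my own additions (the slides only assume generic QP code), and the sketch uses the equivalent form "minimize $\frac{1}{2} w^T w$" that a later slide makes explicit.

```python
# A minimal sketch of the hard-margin primal as a QP.
# cvxpy and the toy data X, y are assumptions of this example, not from the lecture.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])   # toy, linearly separable
y = np.array([1, 1, 1, -1, -1, -1])

w, b = cp.Variable(2), cp.Variable()
# Maximizing 1/||w|| is equivalent to minimizing (1/2) w^T w under the same constraints.
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()

print("w =", w.value, " b =", b.value)
print("attained margin 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```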
Digression to Lagrangian Duality
- The primal problem:
  $\min_w\ f(w)$
  s.t. $g_i(w) \le 0,\ i = 1, \ldots, k$
       $h_i(w) = 0,\ i = 1, \ldots, l$
- The generalized Lagrangian:
  $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
  The $\alpha$'s ($\alpha_i \ge 0$) and $\beta$'s are called the Lagrange multipliers.
- Lemma:
  $\max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}$
- A re-written primal:
  $\min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta)$

Lagrangian Duality, cont.
- Recall the primal problem:
  $\min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta)$
- The dual problem:
  $\max_{\alpha,\beta:\,\alpha_i \ge 0} \min_w L(w, \alpha, \beta)$
- Theorem (weak duality):
  $d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0} \min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta) = p^*$
- Theorem (strong duality):
  Iff there exists a saddle point of $L(w, \alpha, \beta)$, we have $d^* = p^*$.

The KKT conditions
- If there exists some saddle point of $L$, then the saddle point satisfies the following "Karush-Kuhn-Tucker" (KKT) conditions:
  $\frac{\partial}{\partial w_i} L(w, \alpha, \beta) = 0,\ i = 1, \ldots, k$
  $\frac{\partial}{\partial \beta_i} L(w, \alpha, \beta) = 0,\ i = 1, \ldots, l$
  $\alpha_i g_i(w) = 0,\ i = 1, \ldots, m$
  $g_i(w) \le 0,\ i = 1, \ldots, m$
  $\alpha_i \ge 0,\ i = 1, \ldots, m$
- Theorem: If $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and the dual problems.

Solving the optimal margin classifier
- Recall our optimization problem:
  $\max_{w,b}\ \frac{1}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \ge 1,\ \forall i$
- This is equivalent to
  $\min_{w,b}\ \frac{1}{2} w^T w$
  s.t. $1 - y_i(w^T x_i + b) \le 0,\ \forall i$     (*)
- Write the Lagrangian:
  $L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$
- Recall that (*) can be reformulated as $\min_{w,b} \max_{\alpha_i \ge 0} L(w, b, \alpha)$. Now we solve its dual problem: $\max_{\alpha_i \ge 0} \min_{w,b} L(w, b, \alpha)$.

The Dual Problem
  $\max_{\alpha_i \ge 0} \min_{w,b} L(w, b, \alpha)$
- We minimize $L$ with respect to $w$ and $b$ first:
  $\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0$     (*)
  $\nabla_b L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y_i = 0$             (**)
  Note that (*) implies $w = \sum_{i=1}^{m} \alpha_i y_i x_i$             (***)
- Plugging (***) back into $L$, and using (**), we have:
  $L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$

The Dual Problem, cont.
- Now we have the following dual optimization problem:
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
  s.t. $\alpha_i \ge 0,\ i = 1, \ldots, m$
       $\sum_{i=1}^{m} \alpha_i y_i = 0$
- This is, again, a quadratic programming problem. A global maximum of the $\alpha_i$ can always be found.
- But what's the big deal?? Note two things:
  1. $w$ can be recovered by $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ (see next ...)
  2. The "kernel" $x_i^T x_j$ (more later ...)

I. Support vectors
- Note the KKT condition: only a few $\alpha_i$'s can be nonzero!
  $\alpha_i g_i(w) = 0,\ i = 1, \ldots, m$
- Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).
(Figure: training points annotated with their multipliers; only $\alpha_1 = 0.8$, $\alpha_6 = 1.4$ and $\alpha_8 = 0.6$ are nonzero, all other $\alpha_i = 0$.)

Support vector machines
- Once we have the Lagrange multipliers $\{\alpha_i\}$, we can reconstruct the parameter vector $w$ as a weighted combination of the training examples:
  $w = \sum_{i \in SV} \alpha_i y_i x_i$
- For testing with a new data point $z$, compute
  $w^T z + b = \sum_{i \in SV} \alpha_i y_i (x_i^T z) + b$
  and classify $z$ as class 1 if the sum is positive, and class 2 otherwise.
- Note: $w$ need not be formed explicitly.

Interpretation of support vector machines
- The optimal $w$ is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier.
- To compute the weights $\{\alpha_i\}$, and to use support vector machines, we need to specify only the inner products (or kernel) between the examples: $x_i^T x_j$.
- We make decisions by comparing each new example $z$ with only the support vectors:
  $y^* = \mathrm{sign}\left( \sum_{i \in SV} \alpha_i y_i (x_i^T z) + b \right)$

II. The Kernel Trick
- Recall the SVM optimization problem:
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
  s.t. $0 \le \alpha_i \le C,\ i = 1, \ldots, m$
       $\sum_{i=1}^{m} \alpha_i y_i = 0$
- The data points only appear as inner products.
- As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
- Many common geometric operations (angles, distances) can be expressed by inner products.
- Define the kernel function $K$ by
  $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
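Continuing the toy example, the next sketch solves the dual problem derived above and reads off the support vectors and the decision rule from the preceding slides. cvxpy and the data are again my additions, the 1e-6 threshold for calling an $\alpha_i$ nonzero is an arbitrary choice of mine, and replacing $x_i^T x_j$ by $K(x_i, x_j)$ here is precisely the kernel trick just introduced.

```python
# A sketch of the hard-margin dual QP (cvxpy and the toy data are assumptions,
# not from the lecture). Swapping x_i^T x_j for K(x_i, x_j) is the kernel trick.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)
# J(alpha) = sum_i alpha_i - 1/2 ||sum_i alpha_i y_i x_i||^2
#          = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
J = cp.sum(alpha) - 0.5 * cp.sum_squares((y[:, None] * X).T @ alpha)
cp.Problem(cp.Maximize(J), [alpha >= 0, y @ alpha == 0]).solve()

a = alpha.value
sv = np.where(a > 1e-6)[0]            # support vectors: the few nonzero alpha_i
w = (a[sv] * y[sv]) @ X[sv]           # w = sum_{i in SV} alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)        # from y_i (w^T x_i + b) = 1 on the SVs
print("support vector indices:", sv)

z = np.array([0.5, 1.0])              # a made-up test point
score = np.sum(a[sv] * y[sv] * (X[sv] @ z)) + b   # decision uses only the SVs
print("sign(w^T z + b) =", np.sign(score))
```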
Transforming the Data
(Figure: points in the input space mapped by $\phi(\cdot)$ into the feature space.)
- Note: in practice the feature space is of higher dimension than the input space.
- Computation in the feature space can be costly because it is high-dimensional.
- The feature space is typically infinite-dimensional!
- The kernel trick comes to the rescue.

An example of feature mapping and kernels
- Consider an input $x = [x_1, x_2]$.
- Suppose $\phi(\cdot)$ is given as follows:
  $\phi\left( [x_1, x_2]^T \right) = \left( 1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2 \right)$
- An inner product in the feature space is
  $\left\langle \phi([x_1, x_2]^T),\ \phi([x_1', x_2']^T) \right\rangle = (1 + x^T x')^2$
- So, if we define the kernel function as follows, there is no need to carry out $\phi(\cdot)$ explicitly:
  $K(x, x') = (1 + x^T x')^2$

More examples of kernel functions
- Linear kernel (we've seen it):
  $K(x, x') = x^T x'$
- Polynomial kernel (we just saw an example):
  $K(x, x') = (1 + x^T x')^p$, where $p = 2, 3, \ldots$
  To get the feature vectors we concatenate all $p$-th order polynomial terms of the components of $x$ (weighted appropriately).
- Radial basis kernel:
  $K(x, x') = \exp\left( -\tfrac{1}{2} \|x - x'\|^2 \right)$
  In this case the feature space consists of functions and results in a non-parametric classifier.

The essence of the kernel
- Feature mapping, but "without paying a cost":
  - E.g., the polynomial kernel: how many dimensions have we got in the new space? How many operations does it take to compute $K(\cdot,\cdot)$?
- Kernel design: any principle?
  - $K(x, z)$ can be thought of as a similarity function between $x$ and $z$.
  - This intuition can be well reflected in a "Gaussian" function (similarly, one can easily come up with other $K(\cdot,\cdot)$ in the same spirit).
  - Does this necessarily lead to a "legal" kernel? (In this particular case $K(\cdot,\cdot)$ is a legal one; do you know how many dimensions $\phi(x)$ has?)

Kernel matrix
- Suppose for now that $K$ is indeed a valid kernel corresponding to some feature mapping $\phi$. Then for $x_1, \ldots, x_m$ we can compute an $m \times m$ matrix $K$ with $K_{ij} = K(x_i, x_j)$. This is called a kernel matrix!
- Now, if a kernel function is indeed a valid kernel, i.e., its elements are dot products in the transformed feature space, it must satisfy:
  - Symmetry: $K = K^T$ (proof?)
  - Positive semidefiniteness (proof?)
  - Mercer's theorem

SVM examples
(Figure: example SVM decision boundaries.)

Examples for Non-Linear SVMs: Gaussian Kernel
(Figure: non-linear decision boundaries obtained with a Gaussian kernel.)

Non-linearly Separable Problems
(Figure: two overlapping classes, Class 1 and Class 2, that cannot be separated by a hyperplane.)
- We allow "error" $\xi_i$ in classification; it is based on the output of the discriminant function $w^T x + b$.
- $\xi_i$ approximates the number of misclassified samples.

Soft Margin Hyperplane
- Now we have a slightly different optimization problem:
  $\min_{w,b}\ \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i$
  s.t. $y_i(w^T x_i + b) \ge 1 - \xi_i,\ \forall i$
       $\xi_i \ge 0,\ \forall i$
- The $\xi_i$ are "slack variables" in the optimization.
- Note that $\xi_i = 0$ if there is no error for $x_i$.
- $\sum_i \xi_i$ is an upper bound on the number of training errors.
- $C$: tradeoff parameter between error and margin.

The Optimization Problem
- The dual of this new constrained optimization problem is
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
  s.t. $0 \le \alpha_i \le C,\ i = 1, \ldots, m$
       $\sum_{i=1}^{m} \alpha_i y_i = 0$
- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound $C$ on the $\alpha_i$.
- Once again, a QP solver can be used to find the $\alpha_i$.
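Before moving on, here is a quick numerical check (plain NumPy, my own construction) of the feature-map example and the kernel-matrix conditions from the slides above: the degree-2 polynomial kernel really is an inner product in the 6-dimensional feature space, and its Gram matrix on a handful of points is symmetric and positive semidefinite.

```python
# Numerical check of K(x, x') = (1 + x^T x')^2 = <phi(x), phi(x')> with the
# explicit 6-dimensional feature map from the slide; test points are random.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, xp):
    return (1.0 + x @ xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(K(x, xp), phi(x) @ phi(xp)))        # True: same inner product

# Kernel (Gram) matrix on a few points: symmetric and positive semidefinite,
# as a valid (Mercer) kernel must be.
pts = rng.normal(size=(5, 2))
G = np.array([[K(a, c) for c in pts] for a in pts])
print(np.allclose(G, G.T), np.linalg.eigvalsh(G).min() >= -1e-9)
```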
The SMO algorithm
- Consider solving an unconstrained optimization problem.
- We've already seen three optimization algorithms! (Which ones?)
- Coordinate ascent:

Coordinate ascent
(Figure: contours of a quadratic objective with the zig-zag path taken by coordinate ascent.)

Sequential minimal optimization
- Constrained optimization:
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
  s.t. $0 \le \alpha_i \le C,\ i = 1, \ldots, m$
       $\sum_{i=1}^{m} \alpha_i y_i = 0$
- Question: can we do coordinate ascent along one direction at a time (i.e., hold all $\alpha_{[-i]}$ fixed, and update $\alpha_i$)?

The SMO algorithm
Repeat till convergence:
1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
2. Re-optimize $J(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \ne i, j$) fixed.
Will this procedure converge?

Convergence of SMO
  $\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$
  s.t. $0 \le \alpha_i \le C,\ i = 1, \ldots, m$
       $\sum_{i=1}^{m} \alpha_i y_i = 0$
  (together with the KKT conditions)
- Let's hold $\alpha_3, \ldots, \alpha_m$ fixed and re-optimize $J$ with respect to $\alpha_1$ and $\alpha_2$.

Convergence of SMO, cont.
- The constraints:
- The objective:
- Constrained optimization:
(The $\alpha_1, \alpha_2$ sub-problem is worked out on the slides; a code sketch of this pairwise update follows the Summary.)

Cross-validation error of SVM
- The leave-one-out cross-validation error does not depend on the dimensionality of the feature space but only on the number of support vectors!
  Leave-one-out CV error $\le \dfrac{\#\ \text{support vectors}}{\#\ \text{training examples}}$

Summary
- Max-margin decision boundary
- Constrained convex optimization
- Duality
- The KKT conditions and the support vectors
- Non-separable case and slack variables
- The SMO algorithm
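To make the pairwise SMO update concrete, here is a sketch of the "simplified SMO" variant: it picks the second multiplier at random rather than with the maximal-progress heuristic mentioned on the slide, uses a linear kernel by default, and the tolerances and pass limits are conventional choices of mine, not values from the lecture.

```python
# A sketch of simplified SMO: random choice of the second multiplier instead of
# the maximal-progress heuristic on the slide; constants and defaults are my own.
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10):
    """Maximize J(alpha) s.t. 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    m = X.shape[0]
    K = X @ X.T                      # linear kernel; swap in any kernel matrix here
    alpha, b = np.zeros(m), 0.0

    def f(i):                        # f(x_i) = sum_j alpha_j y_j K(x_j, x_i) + b
        return np.sum(alpha * y * K[:, i]) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # pick an alpha_i that violates the KKT conditions (within tol)
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(m) if k != i])
                E_j = f(j) - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # box [L, H] implied by 0 <= alpha <= C and sum_k alpha_k y_k = 0
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]   # curvature along the pair
                if L == H or eta >= 0:
                    continue
                # closed-form re-optimization of J over alpha_j, then clip to [L, H]
                alpha[j] = np.clip(aj_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # update the bias b so the KKT conditions hold for the changed pair
                b1 = b - E_i - y[i] * (alpha[i] - ai_old) * K[i, i] \
                           - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - ai_old) * K[i, j] \
                           - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

On the toy data from the earlier sketches, only a handful of the returned $\alpha_i$ end up (numerically) nonzero; those indices are the support vectors, and their count divided by $m$ is exactly the quantity appearing in the leave-one-out bound above.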