Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada
Abstract
Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
1 Deep Neural Networks

Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures.

Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a "better" basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.

Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.
2 Experimental Setting and Datasets
Code to produce the new datasets introduced in this section is available from: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepGradientsAISTATS2010

2.1 Online Learning on an Infinite Dataset: Shapeset-3×2
Recent work with deep architectures (see Figure 7 in Bengio (2009)) shows that even with very large training sets or online learning, initialization from unsupervised pre-training yields substantial improvement, which does not vanish as the number of training examples increases. The online setting is also interesting because it focuses on the optimization issues rather than on the small-sample regularization effects, so we decided to include in our experiments a synthetic images dataset inspired from Larochelle et al. (2007) and Larochelle et al. (2009), from which as many examples as needed could be sampled, for testing the online learning scenario.

We call this dataset the Shapeset-3×2 dataset, with example images in Figure 1 (top). Shapeset-3×2 contains images of 1 or 2 two-dimensional objects, each taken from 3 shape categories (triangle, parallelogram, ellipse), and placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale. We noticed that with only one shape present in the image the task of recognizing it was too easy. We therefore decided to also sample images with two objects, with the constraint that the second object does not overlap with the first by more than fifty percent of its area, to avoid hiding it entirely. The task is to predict the objects present (e.g. triangle + ellipse, parallelogram + parallelogram, triangle alone, etc.) without having to distinguish between the foreground shape and the background shape when they overlap. This therefore defines nine configuration classes.

The task is fairly difficult because we need to discover invariances over rotation, translation, scaling, object color, occlusion and relative position of the shapes. In parallel we need to extract the factors of variability that predict which object shapes are present.
The size of the images is arbitrary, but we fixed it to 32×32 in order to work with deep dense networks efficiently.

2.2 Finite Datasets
The MNIST digits dataset (LeCun et al., 1998a) has 50,000 training images, 10,000 validation images (for hyper-parameter selection), and 10,000 test images, each showing a 28×28 grey-scale pixel image of one of the 10 digits.
CIFAR-10 (Krizhevsky & Hinton, 2009) is a labelled subset of the tiny-images dataset that contains 50,000 training examples (from which we extracted 10,000 as validation data) and 10,000 test examples. There are 10 classes corresponding to the main object in each image: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck. The classes are balanced. Each image is in color, but is just 32×32 pixels in size, so the input is a vector of 32×32×3 = 3072 real values.

Small-ImageNet is a set of tiny 37×37 gray-level images computed from the higher-resolution and larger ImageNet set, with labels from the WordNet noun hierarchy. We have used 90,000 examples for training, 10,000 for the validation set, and 10,000 for testing. There are 10 balanced classes: reptiles, vehicles, birds, mammals, fish, furniture, instruments, tools, flowers and fruits. Figure 1 (bottom) shows randomly chosen examples.

Figure 1: Top: Shapeset-3×2 images at 64×64 resolution. The examples we used are at 32×32 resolution. The learner tries to predict which objects (parallelogram, triangle, or ellipse) are present, and 1 or 2 objects can be present, yielding 9 possible classifications. Bottom: Small-ImageNet images at full resolution.
2.3 Experimental Setting
We optimized feedforward neural networks with one to five hidden layers, with one thousand hidden units per layer, and with a softmax logistic regression for the output layer. The cost function is the negative log-likelihood −log P(y|x), where (x, y) is the (input image, target class) pair. The neural networks were optimized with stochastic back-propagation on mini-batches of size ten, i.e., the average g of ∂(−log P(y|x))/∂θ was computed over 10 consecutive training pairs (x, y) and used to update parameters θ in that direction, with θ ← θ − εg. The learning rate ε is a hyper-parameter that is optimized based on validation set error after a large number of updates (5 million).
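To make the setup concrete, here is a minimal sketch (ours, not the authors' code) of this optimization procedure for a one-hidden-layer network: softmax outputs, negative log-likelihood cost, and plain stochastic gradient descent on mini-batches of size ten with the update θ ← θ − εg. Helper names such as `sgd_step` are illustrative only.

```python
# Minimal sketch of the optimization setup described above: softmax outputs,
# negative log-likelihood cost, and SGD on mini-batches of size ten.
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)                # one hidden layer with tanh units
    return h, softmax(h @ W2 + b2)

def sgd_step(params, x, y, epsilon=0.1):
    """One update on a mini-batch: average gradient of -log P(y|x)."""
    W1, b1, W2, b2 = params
    m = x.shape[0]
    h, p = forward(params, x)
    delta2 = p.copy()
    delta2[np.arange(m), y] -= 1.0          # gradient of -log P w.r.t. pre-softmax
    delta2 /= m
    delta1 = (delta2 @ W2.T) * (1.0 - h ** 2)   # back-prop through tanh
    grads = (x.T @ delta1, delta1.sum(0), h.T @ delta2, delta2.sum(0))
    return tuple(p_ - epsilon * g for p_, g in zip(params, grads))

# toy usage: 1000 hidden units, 10 classes, one mini-batch of size 10
n_in, n_hid, n_out = 32 * 32, 1000, 10
params = (rng.normal(0, 0.01, (n_in, n_hid)), np.zeros(n_hid),
          rng.normal(0, 0.01, (n_hid, n_out)), np.zeros(n_out))
x_batch = rng.normal(size=(10, n_in))
y_batch = rng.integers(0, n_out, size=10)
params = sgd_step(params, x_batch, y_batch)
```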
We varied the type of non-linear activation function in the hidden layers: the sigmoid 1/(1 + e^{-x}), the hyperbolic tangent tanh(x), and a newly proposed activation function (Bergstra et al., 2009) called the softsign, x/(1 + |x|). The softsign is similar to the hyperbolic tangent (its range is -1 to 1) but its tails are quadratic polynomials rather than exponentials, i.e., it approaches its asymptotes much more slowly.
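For reference, the three non-linearities are one-liners; the short sketch below (ours) simply defines them so they can be probed numerically or plotted.

```python
# The three activation functions compared in the text.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # range (0, 1), non-zero mean

def softsign(x):
    return x / (1.0 + np.abs(x))      # range (-1, 1), polynomial tails

x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x))
print(np.tanh(x))                      # range (-1, 1), exponential tails
print(softsign(x))
```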
In the comparisons, we search for the best hyper-parameters (learning rate and depth) separately for each model. Note that the best depth was always five for Shapeset-3×2, except for the sigmoid, for which it was four.
We initialized the biases to be 0 and the weights W_ij at each layer with the following commonly used heuristic:

W_ij ∼ U[ −1/√n, 1/√n ],    (1)

where U[−a, a] is the uniform distribution in the interval (−a, a) and n is the size of the previous layer (the number of columns of W).
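A minimal sketch of this heuristic, assuming fan-in n equal to the size of the previous layer; the function name `standard_init` is ours. Note that it gives Var[W] = 1/(3n), which reappears as eq. (15) below.

```python
# Sketch of the commonly used heuristic of eq. (1): weights drawn from
# U[-1/sqrt(n), 1/sqrt(n)] with n the fan-in; biases start at 0.
import numpy as np

def standard_init(n_in, n_out, rng=np.random.default_rng(0)):
    bound = 1.0 / np.sqrt(n_in)
    W = rng.uniform(-bound, bound, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

W, b = standard_init(1000, 1000)
print(W.var(), 1.0 / (3 * 1000))       # empirical variance vs 1/(3n)
```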
3 Effect of Activation Functions and Saturation During Training

Two things we want to avoid, and that can be revealed from the evolution of activations, are excessive saturation of activation functions on the one hand (then gradients will not propagate well), and overly linear units (they will not compute something interesting).

3.1 Experiments with the Sigmoid

The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b). In this section we will see another symptomatic behavior due to this activation function in deep feedforward networks.

We want to study possible saturation by looking at the evolution of activations during training. The figures in this section show results on the Shapeset-3×2 data, but similar behavior is observed with the other datasets. Figure 2 shows the evolution of the activation values (after the non-linearity) at each hidden layer during training: layer 1 refers to the output of the first hidden layer, and there are four hidden layers. The graph shows the means and standard deviations of these activations. These statistics, along with histograms, are computed at different times during learning, by looking at activation values for a fixed set of 300 test examples.

Figure 2: Mean and standard deviation (vertical bars) of the activation values (output of the sigmoid) during supervised learning, for the different hidden layers of a deep architecture. The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly desaturates around epoch 100.

We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0. Inversely, the other layers have a mean activation value that is above 0.5, decreasing as we go from the output layer to the input layer. We have found that this kind of saturation can last very long in deeper networks with sigmoid activations, e.g., the depth-five model never escaped this regime during training. The big surprise is that for an intermediate number of hidden layers (here four), the saturation regime may be escaped. At the same time that the top hidden layer moves out of saturation, the first hidden layer begins to saturate and therefore to stabilize.

We hypothesize that this behavior is due to the combination of random initialization and the fact that a hidden unit output of 0 corresponds to a saturated sigmoid. Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior. Our proposed explanation rests on the hypothesis that the transformation that the lower layers of the randomly initialized network compute initially is not useful to the classification task, unlike the transformation obtained from unsupervised pre-training. The logistic layer output softmax(b + Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other and possibly more dominant variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0. In the case of symmetric activation functions like the hyperbolic tangent and the softsign, sitting around 0 is good because it allows gradients to flow backwards. However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients from flowing backward and prevent the lower layers from learning useful features. Eventually, but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime. Note, however, that even after this, the network moves into a solution that is of poorer quality (also in terms of generalization) than those found with symmetric activation functions, as can be seen in Figure 11.

3.2 Experiments with the Hyperbolic Tangent

As discussed above, hyperbolic tangent networks do not suffer from the kind of saturation behavior of the top hidden layer observed with sigmoid networks, because of the symmetry of tanh around 0. However, with our standard weight initialization U[−1/√n, 1/√n], we observe a sequentially occurring saturation phenomenon starting with layer 1 and propagating up in the network, as illustrated in Figure 3. Why this is happening remains to be understood.

Figure 3: Top: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of the activation values for the hyperbolic tangent networks in the course of learning. We see the first hidden layer saturating first, then the second, etc. Bottom: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the softsign during learning. Here the different layers saturate less and do so together.

3.3 Experiments with the Softsign

The softsign x/(1+|x|) is similar to the hyperbolic tangent but might behave differently in terms of saturation because of its smoother asymptotes (polynomial instead of exponential). We see on Figure 3 that the saturation does not occur one layer after the other as for the hyperbolic tangent. It is faster at the beginning and then slow, and all layers move together towards larger weights.

We can also see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4). Whereas the latter yields modes of the activations distribution mostly at the extremes (asymptotes -1 and 1) or around 0, the softsign network has modes of activations around its knees (between the linear regime around 0 and the flat regime around -1 and 1). These are the areas where there is substantial non-linearity but where the gradients would flow well.

Figure 4: Activation values normalized histogram at the end of learning, averaged across units of the same layer and across 300 test examples. Top: activation function is hyperbolic tangent, we see important saturation of the lower layers. Bottom: activation function is softsign, we see many activation values around (-0.6, -0.8) and (0.6, 0.8) where the units do not saturate but are non-linear.

4 Studying Gradients and their Propagation

4.1 Effect of the Cost Function

We have found that the logistic regression or conditional log-likelihood cost function (−log P(y|x) coupled with softmax outputs) worked much better (for classification problems) than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986). This is not a new observation (Solla et al., 1988) but we find it important to stress here. We found that the plateaus in the training criterion (as a function of the parameters) are less present with the log-likelihood cost function. We can see this on Figure 5, which plots the training criterion as a function of two weights for a two-layer network (one hidden layer) with hyperbolic tangent units, and a random input and target signal. There are clearly more severe plateaus with the quadratic cost.
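As an illustration (ours, not the authors' code), the two costs for a single example with softmax outputs can be written as follows; `nll_cost` and `quadratic_cost` are hypothetical helper names.

```python
# The two costs discussed above, for a single example with softmax outputs.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll_cost(a, y):
    """Conditional log-likelihood cost -log P(y|x), a = pre-softmax outputs."""
    return -np.log(softmax(a)[y])

def quadratic_cost(a, y):
    """Squared error between the softmax output and the one-hot target."""
    t = np.zeros_like(a)
    t[y] = 1.0
    return 0.5 * np.sum((softmax(a) - t) ** 2)

a = np.array([2.0, -1.0, 0.5])        # pre-softmax outputs for 3 classes
print(nll_cost(a, 0), quadratic_cost(a, 0))
```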
Figure 5: Cross entropy (black, surface on top) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, W1 respectively on the first layer and W2 on the second, output layer.

4.2 Gradients at Initialization

4.2.1 Theoretical Considerations and a New Normalized Initialization

We study the back-propagated gradients, or equivalently the gradient of the cost function on the input biases at each layer. Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network. We will also start by studying the linear regime.

For a dense artificial neural network using a symmetric activation function f with unit derivative at 0 (i.e. f'(0) = 1), if we write z^i for the activation vector of layer i, and s^i for the argument vector of the activation function at layer i, we have s^i = z^i W^i + b^i and z^{i+1} = f(s^i). From these definitions we obtain the following:

∂Cost/∂s^i_k = f'(s^i_k) W^{i+1}_{k,•} ∂Cost/∂s^{i+1},    (2)

∂Cost/∂w^i_{l,k} = z^i_l ∂Cost/∂s^i_k.    (3)

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at initialization, that the weights are initialized independently, and that the input features have the same variance (= Var[x]). Then we can say that, with n_i the size of layer i and x the network input,

f'(s^i_k) ≈ 1,    (4)

Var[z^i] = Var[x] ∏_{i'=0}^{i-1} n_{i'} Var[W^{i'}].    (5)

We write Var[W^{i'}] for the shared scalar variance of all weights at layer i'. Then for a network with d layers,

Var[∂Cost/∂s^i] = Var[∂Cost/∂s^d] ∏_{i'=i}^{d} n_{i'+1} Var[W^{i'}],    (6)

Var[∂Cost/∂w^i] = ∏_{i'=0}^{i-1} n_{i'} Var[W^{i'}] ∏_{i'=i}^{d-1} n_{i'+1} Var[W^{i'}] × Var[x] Var[∂Cost/∂s^d].    (7)

From a forward-propagation point of view, to keep information flowing we would like that

∀(i, i'), Var[z^i] = Var[z^{i'}].    (8)

From a back-propagation point of view we would similarly like to have

∀(i, i'), Var[∂Cost/∂s^i] = Var[∂Cost/∂s^{i'}].    (9)

These two conditions transform to:

∀i, n_i Var[W^i] = 1,    (10)

∀i, n_{i+1} Var[W^i] = 1.    (11)

As a compromise between these two constraints, we might want to have

∀i, Var[W^i] = 2 / (n_i + n_{i+1}).    (12)

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights we could get the following interesting properties:

∀i, Var[∂Cost/∂s^i] = [n Var[W]]^{d-i} Var[x],    (13)

∀i, Var[∂Cost/∂w^i] = [n Var[W]]^{d} Var[x] Var[∂Cost/∂s^d].    (14)

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

n Var[W] = 1/3,    (15)

where n is the layer size (assuming all layers of the same size). This will cause the variance of the back-propagated gradient to be dependent on the layer (and decreasing). The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradient variances as one moves up or down the network. We call it the normalized initialization:

W ∼ U[ −√6/√(n_j + n_{j+1}),  √6/√(n_j + n_{j+1}) ].    (16)
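A minimal sketch of this normalized initialization, assuming fan-in n_in and fan-out n_out; the function name is ours. The bound √6/√(n_in + n_out) gives Var[W] = 2/(n_in + n_out), matching eq. (12).

```python
# Sketch of the normalized initialization of eq. (16).
import numpy as np

def normalized_init(n_in, n_out, rng=np.random.default_rng(0)):
    """Draw weights from U[-sqrt(6)/sqrt(n_in+n_out), sqrt(6)/sqrt(n_in+n_out)]."""
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

W = normalized_init(1000, 1000)
print(W.var(), 2.0 / (1000 + 1000))    # empirical variance vs 2/(n_in + n_out)
```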
4.2.2 Gradient Propagation Study

To empirically validate the above theoretical ideas, we have plotted some normalized histograms of activation values, weight gradients and of the back-propagated gradients at initialization with the two different initialization methods. The results displayed (Figures 6, 7 and 8) are from experiments on Shapeset-3×2, but qualitatively similar results were obtained with the other datasets.
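The sketch below (ours) mimics this kind of monitoring in a stripped-down form: it propagates a batch through a random deep tanh network under the two initializations and records the per-layer spread of the activations just after initialization, rather than full histograms.

```python
# Per-layer activation spread just after initialization, for the standard
# and normalized initializations, on a random deep tanh network.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 1000, 5
x = rng.normal(size=(300, n))          # a fixed batch of 300 inputs

def init(kind, n_in, n_out):
    bound = 1 / np.sqrt(n_in) if kind == "standard" else np.sqrt(6) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

for kind in ("standard", "normalized"):
    z, stds = x, []
    for _ in range(depth):
        z = np.tanh(z @ init(kind, n, n))
        stds.append(z.std())           # spread of activations at each layer
    print(kind, [round(s, 3) for s in stds])
```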
We monitor the singular values of the Jacobian matrix associated with layer i:

J^i = ∂z^{i+1} / ∂z^i.    (17)

When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from z^i to z^{i+1}, as well as to the ratio of average activation variance going from z^i to z^{i+1}. With our normalized initialization, this ratio is around 0.8, whereas with the standard initialization it drops down to 0.5.
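A small sketch (ours) of this measurement: for a tanh layer the Jacobian of eq. (17) at a point is the weight matrix scaled row-wise by the tanh derivative, and its singular values can be computed directly. The layer size and input scale below are arbitrary choices for illustration.

```python
# Average singular value of the layer-to-layer Jacobian of a tanh layer,
# under the standard and normalized initializations.
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n) * 0.5            # activations of layer i (illustrative scale)

for kind, bound in (("standard", 1 / np.sqrt(n)),
                    ("normalized", np.sqrt(6) / np.sqrt(2 * n))):
    W = rng.uniform(-bound, bound, size=(n, n))
    s = z @ W                            # pre-activations of layer i+1
    J = W.T * (1.0 - np.tanh(s) ** 2)[:, None]   # J[k, l] = d z_{i+1,k} / d z_{i,l}
    sv = np.linalg.svd(J, compute_uv=False)
    print(kind, "average singular value:", round(sv.mean(), 3))
```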
Figure 6: Activation values normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized initialization (bottom). Top: 0-peak increases for higher layers.

Figure 7: Back-propagated gradients normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak decreases for higher layers.

4.3 Back-propagated Gradients During Learning

The dynamic of learning in such networks is complex and we would like to develop better tools to analyze and track it. In particular, we cannot use simple variance calculations in our theoretical analysis because the weight values are no longer independent of the activation values and the linearity hypothesis is also violated.

As first noted by Bradley (2009), we observe (Figure 7) that at the beginning of training, after the standard initialization (eq. 1), the variance of the back-propagated gradients gets smaller as it is propagated downwards. However, with our normalized initialization we do not see such decreasing back-propagated gradients (bottom of Figure 7).

What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weight gradients is roughly constant across layers, as shown on Figure 8. However, this is explained by our theoretical analysis above (eq. 14). Interestingly, as shown in Figure 9, these observations on the weight gradient of standard and normalized initialization change during training (here for a tanh network). Indeed, whereas the gradients initially have roughly the same magnitude, they diverge from each other (with larger gradients in the lower layers) as training progresses, especially with the standard initialization. Note that this might be one of the advantages of the normalized initialization, since having gradients of very different magnitudes at different layers may lead to ill-conditioning and slower training.

Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangent with standard initialization (top) and normalized (bottom) during training. We see that the normalization allows us to keep the same variance of the weight gradients across layers during training (top: smaller variance for higher layers).

Finally, we observe that the softsign networks share similarities with the tanh networks with normalized initialization, as can be seen by comparing the evolution of activations in both cases (resp. Figure 3-bottom and Figure 10).

Figure 10: 98 percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.

5 Error Curves and Conclusions

The final consideration that we care about is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3×2, while Table 1 gives final test error for all the datasets studied (Shapeset-3×2, MNIST, CIFAR-10, and Small-ImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth-five hyperbolic tangent network with normalized initialization.

These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3×2, because of the task difficulty, we observe important saturations during learning; this might explain why the normalized initialization or the softsign effects are more visible.

Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR-10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

TYPE         Shapeset   MNIST   CIFAR-10   ImageNet
Softsign     16.27      1.64    55.78      69.14
Softsign N   16.06      1.72    53.8       68.13
Tanh         27.15      1.76    55.9       70.58
Tanh N       15.60      1.64    52.92      68.57
Sigmoid      82.61      2.21    57.28      70.66

Several conclusions can be drawn from these error curves:

- The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.

- The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

- For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both of those methods have been applied for Shapeset-3×2 with hyperbolic tangent and standard initialization. We observed a gain in performance, but it did not reach the result obtained with normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.

In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

The other conclusions from this study are the following:

- Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets.

- Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.

- Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and allows us to eliminate a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning.

- Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures.
References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1-127. Also published as a book, Now Publishers, 2009.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19 (pp. 153-160). MIT Press.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157-166.

Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.

Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009 (pp. 153-160).

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report).

Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10, 1-40.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278-2324.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag.

Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21 (pp. 1081-1088).

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Solla, S. A., Levin, E., & Fleisher, M. (1988). Complex Systems, 2, 625-639.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.

Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008 (pp. 1168-1175). New York, NY, USA: ACM.

Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114-128.