Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada
Abstract
Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
1 Deep Neural Networks

Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures.

Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a "better" basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.

Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.
2 Experimental Setting and Datasets
Code to produce the new datasets introduced in this section is available from: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepGradientsAISTATS2010

2.1 Online Learning on an Infinite Dataset: Shapeset-3×2
Recent work with deep architectures (see Figure 7 in Bengio (2009)) shows that even with very large training sets or online learning, initialization from unsupervised pre-training yields substantial improvement, which does not vanish as the number of training examples increases. The online setting is also interesting because it focuses on the optimization issues rather than on the small-sample regularization effects, so we decided to include in our experiments a synthetic images dataset inspired from Larochelle et al. (2007) and Larochelle et al. (2009), from which as many examples as needed could be sampled, for testing the online learning scenario.
We call this dataset the Shapeset-3×2 dataset, with example images in Figure 1 (top). Shapeset-3×2 contains images of 1 or 2 two-dimensional objects, each taken from 3 shape categories (triangle, parallelogram, ellipse), and placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale. We noticed that for only one shape present in the image the task of recognizing it was too easy. We therefore decided to sample also images with two objects, with the constraint that the second object does not overlap with the first by more than fifty percent of its area, to avoid hiding it entirely. The task is to predict the objects present (e.g. triangle + ellipse, parallelogram + parallelogram, triangle alone, etc.) without having to distinguish between the foreground shape and the background shape when they overlap. This therefore defines nine configuration classes.
The task is fairly difficult because we need to discover invariances over rotation, translation, scaling, object color, occlusion and relative position of the shapes. In parallel we need to extract the factors of variability that predict which object shapes are present.
The size of the images is arbitrary, but we fixed it to 32×32 in order to work with deep dense networks efficiently.

2.2 Finite Datasets
The MNIST digits dataset (LeCun et al., 1998a) has 50,000 training images, 10,000 validation images (for hyper-parameter selection), and 10,000 test images, each showing a 28×28 grey-scale pixel image of one of the 10 digits.
CIFAR-10 (Krizhevsky & Hinton, 2009) is a labelled subset of the tiny-images dataset that contains 50,000 training examples (from which we extracted 10,000 as validation data) and 10,000 test examples. There are 10 classes corresponding to the main object in each image: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck. The classes are balanced. Each image is in color, but is just 32×32 pixels in size, so the input is a vector of 32×32×3 = 3072 real values.

Figure 1: Top: Shapeset-3×2 images at 64×64 resolution. The examples we used are at 32×32 resolution. The learner tries to predict which objects (parallelogram, triangle, or ellipse) are present, and 1 or 2 objects can be present, yielding 9 possible classifications. Bottom: Small-ImageNet images at full resolution.
Small-ImageNet is a set of tiny 37×37 gray-level images computed from the higher-resolution and larger set, with labels from the WordNet noun hierarchy. We have used 90,000 examples for training, 10,000 for the validation set, and 10,000 for testing. There are 10 balanced classes: reptiles, vehicles, birds, mammals, fish, furniture, instruments, tools, flowers and fruits. Figure 1 (bottom) shows randomly chosen examples.
2.3 Experimental Setting
We optimized feedforward neural networks with one to five hidden layers, with one thousand hidden units per layer, and with a softmax logistic regression for the output layer. The cost function is the negative log-likelihood −log P(y|x), where (x, y) is the (input image, target class) pair. The neural networks were optimized with stochastic back-propagation on mini-batches of size ten, i.e., the average g of −∂ log P(y|x)/∂θ was computed over 10 consecutive training pairs (x, y) and used to update parameters θ in that direction, with θ ← θ − εg. The learning rate ε is a hyper-parameter that is optimized based on validation set error after a large number of updates (5 million).
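As an illustration of the update rule above, here is a minimal NumPy sketch (ours, not the authors' code) of one stochastic back-propagation step on a mini-batch of size ten, using a bare softmax output layer; all variable names are our own.

```python
import numpy as np

# Minimal sketch of one mini-batch update theta <- theta - epsilon * g,
# where g averages -d log P(y|x) / d theta over 10 training pairs.
# The model here is only the softmax output layer, for brevity.
rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_in, n_classes, batch = 784, 10, 10
W = rng.uniform(-1 / np.sqrt(n_in), 1 / np.sqrt(n_in), (n_in, n_classes))
b = np.zeros(n_classes)
epsilon = 0.1  # learning rate: a hyper-parameter tuned on validation error

x = rng.normal(size=(batch, n_in))          # 10 consecutive training inputs
y = rng.integers(0, n_classes, size=batch)  # their target classes

# Forward pass, then the well-known softmax + NLL gradient dCost/ds = p - onehot(y)
p = softmax(x @ W + b)
p[np.arange(batch), y] -= 1.0
g_W = x.T @ p / batch                        # average gradient over the mini-batch
g_b = p.mean(axis=0)

W -= epsilon * g_W                           # theta <- theta - epsilon * g
b -= epsilon * g_b
```

In the paper's experiments this update is applied through all hidden layers by back-propagation; the sketch only shows the output layer to keep the update rule visible.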
We varied the type of non-linear activation function in the hidden layers: the sigmoid 1/(1 + e^(−x)), the hyperbolic tangent tanh(x), and a newly proposed activation function (Bergstra et al., 2009) called the softsign, x/(1 + |x|). The softsign is similar to the hyperbolic tangent (its range is −1 to 1) but its tails are quadratic polynomials rather than exponentials, i.e., it approaches its asymptotes much more slowly.
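The difference in tail behavior is easy to check numerically. A small illustration of ours: at a moderate input such as x = 3, tanh is already within half a percent of its asymptote, while the softsign is still a quarter away.

```python
import numpy as np

# Tail behavior of the three activations discussed in the text (our check).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    return x / (1.0 + np.abs(x))

x = 3.0
gap_tanh = 1.0 - np.tanh(x)        # ~5e-3: exponential tails, near-saturated
gap_softsign = 1.0 - softsign(x)   # 0.25: quadratic-polynomial tails
assert gap_softsign > gap_tanh     # softsign approaches its asymptote much more slowly
```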
In the comparisons, we search for the best hyper-parameters (learning rate and depth) separately for each model. Note that the best depth was always five for Shapeset-3×2, except for the sigmoid, for which it was four.
We initialized the biases to be 0 and the weights W_ij at each layer with the following commonly used heuristic:

W_ij ~ U[−1/√n, 1/√n],    (1)

where U[−a, a] is the uniform distribution in the interval (−a, a) and n is the size of the previous layer (the number of columns of W).
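The heuristic of eq. (1) can be sketched as follows (our code, with our variable names). Since the variance of U[−a, a] is a²/3, this initialization satisfies n·Var[W] = 1/3, a property the paper returns to later (eq. 15).

```python
import numpy as np

# Standard initialization of eq. (1): W_ij ~ U[-1/sqrt(n), 1/sqrt(n)],
# where n is the fan-in (size of the previous layer).
def standard_init(n_in, n_out, rng):
    a = 1.0 / np.sqrt(n_in)
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
n = 1000
W = standard_init(n, n, rng)
# Var of U[-a, a] is a^2 / 3, so n * Var[W] should be close to 1/3
print(n * W.var())
```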
3 Effect of Activation Functions and Saturation During Training

Two things we want to avoid and that can be revealed from the evolution of activations are excessive saturation of activation functions on the one hand (then gradients will not propagate well), and overly linear units on the other (they will not compute something interesting).

3.1 Experiments with the Sigmoid

The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b). In this section we will see another symptomatic behavior due to this activation function in deep feedforward networks.

We want to study possible saturation by looking at the evolution of activations during training. The figures in this section show results on the Shapeset-3×2 data, but similar behavior is observed with the other datasets. Figure 2 shows the evolution of the activation values (after the non-linearity); layer 1 refers to the output of the first hidden layer, and there are four hidden layers. The graph shows the means and standard deviations of these activations. These statistics, along with histograms, are computed at different times during learning, by looking at activation values for a fixed set of 300 test examples.

Figure 2: Mean and standard deviation (vertical bars) of the activation values (output of the sigmoid) during supervised learning, for the different hidden layers of a deep architecture. The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly desaturates around epoch 100.

We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0. Inversely, the other layers have a mean activation value that is above 0.5, and decreasing as we go from the output layer to the input layer. We have found that this kind of saturation can last very long in deeper networks with sigmoid activations, e.g., the depth-five model never escaped this regime during training. The big surprise is that for an intermediate number of hidden layers (here four), the saturation regime may be escaped. At the same time that the top hidden layer moves out of saturation, the first hidden layer begins to saturate and therefore to stabilize.

We hypothesize that this behavior is due to the combination of random initialization and the fact that a hidden unit output of 0 corresponds to a saturated sigmoid. Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior. Our proposed explanation rests on the hypothesis that the transformation that the lower layers of the randomly initialized network compute initially is not useful to the classification task, unlike the transformation obtained from unsupervised pre-training. The logistic layer output softmax(b + Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other and possibly more dominant variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0. In the case of symmetric activation functions like the hyperbolic tangent and the softsign, sitting around 0 is good because it allows gradients to flow backwards. However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients from flowing backward and prevent the lower layers from learning useful features. Eventually but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime. Note however that, even after this, the network moves into a solution that is of poorer quality (also in terms of generalization) than those found with symmetric activation functions, as can be seen in Figure 11.

3.2 Experiments with the Hyperbolic Tangent

As discussed above, the hyperbolic tangent networks do not suffer from the kind of saturation behavior of the top hidden layer observed with sigmoid networks, because of its symmetry around 0. However, with our standard weight initialization U[−1/√n, 1/√n], we observe a sequentially occurring saturation phenomenon starting with layer 1 and propagating up in the network, as illustrated in Figure 3. Why this is happening remains to be understood.

Figure 3: Top: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of the activation values for the hyperbolic tangent networks in the course of learning. We see the first hidden layer saturating first, then the second, etc. Bottom: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the softsign during learning. Here the different layers saturate less and do so together.

3.3 Experiments with the Softsign

The softsign x/(1 + |x|) is similar to the hyperbolic tangent but might behave differently in terms of saturation because of its smoother asymptotes (polynomial instead of exponential). We see on Figure 3 that the saturation does not occur one layer after the other like for the hyperbolic tangent. It is faster at the beginning and then slow, and all layers move together towards larger weights.

We can also see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4). Whereas the latter yields modes of the activations distribution mostly at the extremes (asymptotes −1 and 1) or around 0, the softsign network has modes of activations around its knees (between the linear regime around 0 and the flat regime around −1 and 1). These are the areas where there is substantial non-linearity but where the gradients would flow well.

Figure 4: Activation values normalized histogram at the end of learning, averaged across units of the same layer and across 300 test examples. Top: activation function is hyperbolic tangent, we see important saturation of the lower layers. Bottom: activation function is softsign, we see many activation values around (−0.6, −0.8) and (0.6, 0.8) where the units do not saturate but are non-linear.

4 Studying Gradients and their Propagation

4.1 Effect of the Cost Function
We have found that the logistic regression or conditional log-likelihood cost function (−log P(y|x) coupled with softmax outputs) worked much better (for classification problems) than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986). This is not a new observation (Solla et al., 1988) but we find it important to stress here. We found that the plateaus in the training criterion (as a function of the parameters) are less present with the log-likelihood cost function. We can see this on Figure 5, which plots the training criterion as a function of two weights for a two-layer network (one hidden layer) with hyperbolic tangent units, and a random input and target signal. There are clearly more severe plateaus with the quadratic cost.

4.2 Gradients at Initialization

4.2.1 Theoretical Considerations and a New Normalized Initialization
Figure 5: Cross entropy (black, surface on top) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, W1 on the first layer and W2 on the second (output) layer.

We study the back-propagated gradients, or equivalently the gradient of the cost function on the inputs (biases) at each layer. Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network. We will also start by studying the linear regime.

For a dense artificial neural network using a symmetric activation function f with unit derivative at 0 (i.e. f'(0) = 1), if we write z^i for the activation vector of layer i, and s^i for the argument vector of the activation function at layer i, we have s^i = z^i W^i + b^i and z^{i+1} = f(s^i). From these definitions we obtain the following:

∂Cost/∂s_k^i = f'(s_k^i) W_{k,•}^{i+1} ∂Cost/∂s^{i+1}    (2)

∂Cost/∂w_{l,k}^i = z_l^i ∂Cost/∂s_k^i    (3)

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at the initialization, that the weights are initialized independently and that the input features' variances are the same (= Var[x]). Then we can say that, with n_i the size of layer i and x the network input,

f'(s_k^i) ≈ 1,    (4)

Var[z^i] = Var[x] ∏_{i'=0}^{i−1} n_{i'} Var[W^{i'}].    (5)

We write Var[W^{i'}] for the shared scalar variance of all weights at layer i'. Then for a network with d layers,

Var[∂Cost/∂s^i] = Var[∂Cost/∂s^d] ∏_{i'=i}^{d} n_{i'+1} Var[W^{i'}],    (6)

Var[∂Cost/∂w^i] = ∏_{i'=0}^{i−1} n_{i'} Var[W^{i'}] ∏_{i'=i}^{d−1} n_{i'+1} Var[W^{i'}] × Var[x] Var[∂Cost/∂s^d].    (7)

From a forward-propagation point of view, to keep information flowing we would like that

∀(i, i'), Var[z^i] = Var[z^{i'}].    (8)

From a back-propagation point of view we would similarly like to have

∀(i, i'), Var[∂Cost/∂s^i] = Var[∂Cost/∂s^{i'}].    (9)

These two conditions transform to:

∀i, n_i Var[W^i] = 1    (10)

∀i, n_{i+1} Var[W^i] = 1    (11)

As a compromise between these two constraints, we might want to have

∀i, Var[W^i] = 2 / (n_i + n_{i+1}).    (12)

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights we could get the following interesting properties:

∀i, Var[∂Cost/∂s^i] = [n Var[W]]^{d−i} Var[x]    (13)

∀i, Var[∂Cost/∂w^i] = [n Var[W]]^d Var[x] Var[∂Cost/∂s^d]    (14)

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

n Var[W] = 1/3,    (15)

where n is the layer size (assuming all layers of the same size). This will cause the variance of the back-propagated gradient to be dependent on the layer (and decreasing). The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradient variances as one moves up or down the network. We call it the normalized initialization:

W ~ U[−√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})]    (16)
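The normalized initialization of eq. (16) can be sketched as follows (our code). The interval half-width √6/√(n_j + n_{j+1}) is chosen so that the uniform distribution's variance, a²/3, equals the compromise 2/(n_j + n_{j+1}) of eq. (12).

```python
import numpy as np

# Normalized initialization of eq. (16):
# W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)],
# giving Var[W] = 2 / (n_in + n_out) as in eq. (12).
def normalized_init(n_in, n_out, rng):
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 800, 1200
W = normalized_init(n_in, n_out, rng)
target = 2.0 / (n_in + n_out)  # the compromise variance of eq. (12)
print(W.var(), target)
```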
4.2.2 Gradient Propagation Study
To empirically validate the above theoretical ideas, we have plotted some normalized histograms of activation values, weight gradients and of the back-propagated gradients at initialization with the two different initialization methods. The results displayed (Figures 6, 7 and 8) are from experiments on Shapeset-3×2, but qualitatively similar results were obtained with the other datasets.
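The prediction of eq. (6) is also easy to reproduce numerically. In the sketch below (ours), a gradient is back-propagated through d linear layers: each layer multiplies its variance by roughly n·Var[W], which is 1/3 under the standard initialization (eq. 15) and about 1 under the normalized initialization with equal layer sizes, so the gradient variance decays or is preserved accordingly.

```python
import numpy as np

# Numerical illustration (ours) of eq. (6) in the linear regime: each
# back-propagation step through a layer scales the gradient variance by
# about n * Var[W].
rng = np.random.default_rng(0)
n, depth = 500, 5

def backprop_variances(scale):
    # scale = half-width of the uniform weight initialization
    g = rng.normal(size=n)          # gradient arriving at the top layer
    variances = [g.var()]
    for _ in range(depth):
        W = rng.uniform(-scale, scale, size=(n, n))
        g = W @ g                   # linear back-propagation through one layer
        variances.append(g.var())
    return variances

std = backprop_variances(1.0 / np.sqrt(n))         # standard init: ~(1/3)^d decay
norm = backprop_variances(np.sqrt(6.0 / (2 * n)))  # normalized init: ~constant
print(std[-1] / std[0], norm[-1] / norm[0])
```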
We monitor the singular values of the Jacobian matrix associated with layer i:

J^i = ∂z^{i+1} / ∂z^i    (17)

When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from z^i to z^{i+1}, as well as to the ratio of average activation variance going from z^i to z^{i+1}. With our normalized initialization, this ratio is around 0.8, whereas with the standard initialization, it drops down to 0.5.
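The Jacobian of eq. (17) can be formed explicitly for a single tanh layer and its singular values computed numerically. The sketch below is ours and only illustrates the direction of the effect (the normalized initialization keeps the average singular value closer to 1 than the standard one); the exact 0.8 vs 0.5 figures come from the paper's own experimental setup.

```python
import numpy as np

# Our sketch of the layer Jacobian J^i = dz^{i+1}/dz^i of eq. (17),
# for one tanh layer z^{i+1} = tanh(z^i W + b) at initialization (b = 0).
# Entry-wise: J_{kl} = f'(s_k) * W_{lk}, i.e. J = diag(f'(s)) @ W.T.
rng = np.random.default_rng(0)
n = 400

def avg_singular_value(scale):
    W = rng.uniform(-scale, scale, size=(n, n))
    z = rng.normal(size=n)                        # activations entering the layer
    s = z @ W                                     # pre-activations
    J = (1.0 - np.tanh(s) ** 2)[:, None] * W.T    # tanh'(s) = 1 - tanh(s)^2
    return np.linalg.svd(J, compute_uv=False).mean()

standard = avg_singular_value(1.0 / np.sqrt(n))
normalized = avg_singular_value(np.sqrt(6.0 / (2 * n)))
print(standard, normalized)  # normalized init keeps the ratio closer to 1
```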
Figure 6: Activation values normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized initialization (bottom). Top: 0-peak increases for higher layers.

Figure 7: Back-propagated gradients normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak decreases for higher layers.

4.3 Back-propagated Gradients During Learning

The dynamic of learning in such networks is complex and we would like to develop better tools to analyze and track it. In particular, we cannot use simple variance calculations in our theoretical analysis because the weight values are no longer independent of the activation values and the linearity hypothesis is also violated.

As first noted by Bradley (2009), we observe (Figure 7) that at the beginning of training, after the standard initialization (eq. 1), the variance of the back-propagated gradients gets smaller as it is propagated downwards. However, using our normalized initialization we do not see such decreasing back-propagated gradients (bottom of Figure 7).

What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weight gradients is roughly constant across layers, as shown in Figure 8. However, this is explained by our theoretical analysis above (eq. 14). Interestingly, as shown in Figure 9, these observations on the weight gradients of standard and normalized initialization change during training (here for a tanh network). Indeed, whereas the gradients initially have roughly the same magnitude, they diverge from each other (with larger gradients in the lower layers) as training progresses, especially with the standard initialization. Note that this might be one of the advantages of the normalized initialization, since having gradients of very different magnitudes at different layers may lead to ill-conditioning and slower training. Finally, we observe that the softsign networks share similarities with the tanh networks with normalized initialization, as can be seen by comparing the evolution of activations in both cases (resp. Figure 3, bottom and Figure 10).

5 Error Curves and Conclusions
Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangent with standard initialization (top) and normalized (bottom) during training. We see that the normalization allows keeping the same variance of the weight gradients across layers during training (top: smaller variance for higher layers).

Figure 10: 98 percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.

The final consideration that we care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3×2, while Table 1 gives final test error for all the datasets studied (Shapeset-3×2, MNIST, CIFAR-10, and Small-ImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth-five hyperbolic tangent network with normalized initialization.

These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3×2, because of the task difficulty, we observe important saturations during learning; this might explain why the normalized initialization or the softsign effects are more visible.

Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

TYPE        Shapeset   MNIST   CIFAR-10   ImageNet
Softsign      16.27     1.64     55.78      69.14
Softsign N    16.06     1.72     53.8       68.13
Tanh          27.15     1.76     55.9       70.58
Tanh N        15.60     1.64     52.92      68.57
Sigmoid       82.61     2.21     57.28      70.66

Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Several conclusions can be drawn from these error curves:
- The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.
- The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.
- For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both those methods have been applied for Shapeset-3×2 with hyperbolic tangent and standard initialization. We observed a gain in performance but not reaching the result obtained from normalized initialization. In addition, we observed further gains by combining normalized initialization with second order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers. In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

The other conclusions from this study are the following:
- Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets.
- Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.
- Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and allows eliminating a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning.
- Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures.

References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1–127. Also published as a book, Now Publishers, 2009.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19 (pp. 153–160). MIT Press.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.
Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.
Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.
Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009 (pp. 153–160).
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report).
Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10, 1–40.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural networks, tricks of the trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag.
Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21 (pp. 1081–1088).
Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Solla, S. A., Levin, E., & Fleisher, M. (1988). Complex Systems, 2, 625–639.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.
Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008 (pp. 1168–1175). New York, NY, USA: ACM.
Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.