Understanding the difficulty of training deep feedforward neural networks


Xavier Glorot, Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada

Abstract

Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

1 Deep Neural Networks

Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures.

Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a "better" basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.

Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

2 Experimental Setting and Datasets

Code to produce the new datasets introduced in this section is available from: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepGradientsAISTATS2010

2.1 Online Learning on an Infinite Dataset: Shapeset-3×2

Recent work with deep architectures (see Figure 7 in Bengio (2009)) shows that even with very large training sets or online learning, initialization from unsupervised pre-training yields substantial improvement, which does not vanish as the number of training examples increases. The online setting is also interesting because it focuses on the optimization issues rather than on the small-sample regularization effects, so we decided to include in our experiments a synthetic images dataset inspired from Larochelle et al. (2007) and Larochelle et al. (2009), from which as many examples as needed could be sampled, for testing the online learning scenario.

We call this dataset the Shapeset-3×2 dataset, with example images in Figure 1 (top). Shapeset-3×2 contains images of 1 or 2 two-dimensional objects, each taken from 3 shape categories (triangle, parallelogram, ellipse) and placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale. We noticed that with only one shape present in the image, the task of recognizing it was too easy. We therefore decided to also sample images with two objects, with the constraint that the second object does not overlap with the first by more than fifty percent of its area, to avoid hiding it entirely. The task is to predict the objects present (e.g. triangle + ellipse, parallelogram + parallelogram, triangle alone, etc.) without having to distinguish between the foreground shape and the background shape when they overlap. This therefore defines nine configuration classes.

The task is fairly difficult because we need to discover invariances over rotation, translation, scaling, object color, occlusion and relative position of the shapes. In parallel we need to extract the factors of variability that predict which object shapes are present.

The size of the images is arbitrary, but we fixed it to 32×32 in order to work with deep dense networks efficiently.

2.2 Finite Datasets

The MNIST digits dataset (LeCun et al., 1998a) has 50,000 training images, 10,000 validation images (for hyper-parameter selection), and 10,000 test images, each showing a 28×28 grey-scale pixel image of one of the 10 digits.

CIFAR-10 (Krizhevsky & Hinton, 2009) is a labelled subset of the tiny-images dataset that contains 50,000 training examples (from which we extracted 10,000 as validation data) and 10,000 test examples. There are 10 classes corresponding to the main object in each image: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck. The classes are balanced. Each image is in color, but is just 32×32 pixels in size, so the input is a vector of 32×32×3 = 3072 real values.

Figure 1: Top: Shapeset-3×2 images at 64×64 resolution. The examples we used are at 32×32 resolution. The learner tries to predict which objects (parallelogram, triangle, or ellipse) are present, and 1 or 2 objects can be present, yielding 9 possible classifications. Bottom: Small-ImageNet images at full resolution.

Small-ImageNet is a dataset of tiny 37×37 gray-level images computed from the higher-resolution and larger ImageNet set, with labels from the WordNet noun hierarchy. We have used 90,000 examples for training, 10,000 for the validation set, and 10,000 for testing. There are 10 balanced classes: reptiles, vehicles, birds, mammals, fish, furniture, instruments, tools, flowers and fruits. Figure 1 (bottom) shows randomly chosen examples.

2.3 Experimental Setting

We optimized feedforward neural networks with one to five hidden layers, with one thousand hidden units per layer, and with a softmax logistic regression for the output layer. The cost function is the negative log-likelihood $-\log P(y|x)$, where $(x, y)$ is the (input image, target class) pair. The neural networks were optimized with stochastic back-propagation on mini-batches of size ten, i.e., the average $g$ of $\frac{\partial (-\log P(y|x))}{\partial \theta}$ was computed over 10 consecutive training pairs $(x, y)$ and used to update parameters $\theta$ in that direction, with $\theta \leftarrow \theta - \epsilon g$. The learning rate $\epsilon$ is a hyper-parameter that is optimized based on validation set error after a large number of updates (5 million).
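To make this experimental setting concrete, here is a minimal sketch of such a network and update rule in PyTorch (chosen only for brevity; the original experiments predate it). The layer sizes follow the description above, with tanh standing in for one of the non-linearities compared in the next paragraph, while the function names, the 0.1 learning rate and the random stand-in data are illustrative assumptions rather than the authors' code.

```python
# Sketch (not the authors' code): 1-5 tanh hidden layers of 1000 units,
# softmax output, negative log-likelihood cost, SGD on mini-batches of 10.
import torch
import torch.nn as nn

def make_net(n_in=32 * 32, n_hidden=1000, n_layers=5, n_classes=9):
    """Build a depth-`n_layers` tanh network with a softmax-logistic output."""
    layers, width = [], n_in
    for _ in range(n_layers):
        layers += [nn.Linear(width, n_hidden), nn.Tanh()]
        width = n_hidden
    layers.append(nn.Linear(width, n_classes))  # logits; softmax is inside the loss
    return nn.Sequential(*layers)

net = make_net()
criterion = nn.CrossEntropyLoss()                      # -log P(y|x)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)  # lr is a tuned hyper-parameter

def sgd_step(x, y):
    """One update theta <- theta - epsilon * g on a mini-batch of 10 pairs."""
    optimizer.zero_grad()
    loss = criterion(net(x.view(x.size(0), -1)), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Random data standing in for Shapeset-3x2 images (9 configuration classes):
x = torch.rand(10, 32 * 32)
y = torch.randint(0, 9, (10,))
print(sgd_step(x, y))
```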

We varied the type of non-linear activation function in the hidden layers: the sigmoid $1/(1+e^{-x})$, the hyperbolic tangent $\tanh(x)$, and a newly proposed activation function (Bergstra et al., 2009) called the softsign, $x/(1+|x|)$. The softsign is similar to the hyperbolic tangent (its range is -1 to 1) but its tails are quadratic polynomials rather than exponentials, i.e., it approaches its asymptotes much more slowly.
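For reference, the three non-linearities compared here can be transcribed directly from the formulas above; this NumPy sketch is purely illustrative.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x): range (0, 1), non-zero mean."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: symmetric around 0, range (-1, 1)."""
    return np.tanh(x)

def softsign(x):
    """Softsign x / (1 + |x|): same range as tanh, but approaches its
    asymptotes polynomially rather than exponentially."""
    return x / (1.0 + np.abs(x))

x = np.linspace(-5, 5, 11)
print(softsign(x))  # tails move toward +/-1 much more slowly than tanh(x)
```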

In the comparisons, we search for the best hyper-parameters (learning rate and depth) separately for each model. Note that the best depth was always five for Shapeset-3×2, except for the sigmoid, for which it was four.

We initialized the biases to be 0 and the weights $W_{ij}$ at each layer with the following commonly used heuristic:

$$W_{ij} \sim U\left[-\frac{1}{\sqrt{n}},\; \frac{1}{\sqrt{n}}\right], \qquad (1)$$

where $U[-a, a]$ is the uniform distribution in the interval $(-a, a)$ and $n$ is the size of the previous layer (the number of columns of $W$).
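A direct NumPy transcription of this heuristic might look as follows; the function name and the layer sizes in the usage line are illustrative assumptions.

```python
import numpy as np

def standard_init(n_in, n_out, seed=0):
    """Commonly used heuristic of eq. 1: W_ij ~ U[-1/sqrt(n), 1/sqrt(n)],
    where n = n_in is the size of the previous layer; biases are set to 0."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(n_in)
    W = rng.uniform(-bound, bound, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

W, b = standard_init(1000, 1000)
# Var[W] = (2*bound)^2 / 12 = 1/(3*n_in), i.e. n Var[W] = 1/3 (cf. eq. 15 later).
print(W.var(), 1.0 / (3 * 1000))
```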

3 Effect of Activation Functions and Saturation During Training

Two things we want to avoid, and that can be revealed by the evolution of activations, are excessive saturation of the activation functions on one hand (gradients will then not propagate well), and overly linear units on the other (they will not compute anything interesting).

3.1 Experiments with the Sigmoid

The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b). In this section we will see another symptomatic behavior due to this activation function in deep feedforward networks.

We want to study possible saturation by looking at the evolution of activations during training. The figures in this section show results on the Shapeset-3×2 data, but similar behavior is observed with the other datasets. Figure 2 shows the evolution of the activation values (after the non-linearity) at each hidden layer during training: layer 1 refers to the output of the first hidden layer, and there are four hidden layers. The graph shows the means and standard deviations of these activations. These statistics, along with histograms, are computed at different times during learning, by looking at activation values for a fixed set of 300 test examples.

Figure 2: Mean and standard deviation (vertical bars) of the activation values (output of the sigmoid) during supervised learning, for the different hidden layers of a deep architecture. The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly desaturates around epoch 100.

We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0. Inversely, the other layers have a mean activation value that is above 0.5 and decreasing as we go from the output layer to the input layer. We have found that this kind of saturation can last very long in deeper networks with sigmoid activations; e.g., the depth-five model never escaped this regime during training. The big surprise is that for an intermediate number of hidden layers (here four), the saturation regime may be escaped. At the same time that the top hidden layer moves out of saturation, the first hidden layer begins to saturate and therefore to stabilize.

We hypothesize that this behavior is due to the combination of random initialization and the fact that a hidden unit output of 0 corresponds to a saturated sigmoid. Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior. Our proposed explanation rests on the hypothesis that the transformation that the lower layers of the randomly initialized network compute is initially not useful to the classification task, unlike the transformation obtained from unsupervised pre-training. The logistic layer output softmax(b + Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, perhaps correlated mostly with other and possibly more dominant variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0. In the case of symmetric activation functions like the hyperbolic tangent and the softsign, sitting around 0 is good because it allows gradients to flow backwards. However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients from flowing backwards and prevent the lower layers from learning useful features. Eventually, but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime. Note, however, that even after this the network moves into a solution that is of poorer quality (also in terms of generalization) than those found with symmetric activation functions, as can be seen in Figure 11.

3.2 Experiments with the Hyperbolic Tangent

As discussed above, hyperbolic tangent networks do not suffer from the kind of saturation behavior of the top hidden layer observed with sigmoid networks, because of the symmetry of tanh around 0. However, with our standard weight initialization $U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, we observe a sequentially occurring saturation phenomenon starting with layer 1 and propagating up in the network, as illustrated in Figure 3. Why this is happening remains to be understood.

Figure 3: Top: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of the activation values for the hyperbolic tangent networks in the course of learning. We see the first hidden layer saturating first, then the second, etc. Bottom: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the softsign during learning. Here the different layers saturate less and do so together.

3.3 Experiments with the Softsign

The softsign x/(1+|x|) is similar to the hyperbolic tangent but might behave differently in terms of saturation because of its smoother asymptotes (polynomial instead of exponential). We see in Figure 3 that the saturation does not occur one layer after the other as for the hyperbolic tangent. It is faster at the beginning and then slow, and all layers move together towards larger weights.

We can also see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4). Whereas the latter yields modes of the activation distribution mostly at the extremes (asymptotes -1 and 1) or around 0, the softsign network has modes of activations around its knees (between the linear regime around 0 and the flat regime around -1 and 1). These are the areas where there is substantial non-linearity but where the gradients would flow well.

Figure 4: Activation values normalized histogram at the end of learning, averaged across units of the same layer and across 300 test examples. Top: the activation function is the hyperbolic tangent; we see important saturation of the lower layers. Bottom: the activation function is the softsign; we see many activation values around (-0.6, -0.8) and (0.6, 0.8), where the units do not saturate but are non-linear.
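The analyses in this section rest on recording per-layer activation statistics for a fixed set of test examples at different points during training. The sketch below illustrates this kind of instrumentation for a plain NumPy forward pass; the network parameters and the 300-example monitoring batch are random placeholders, not the trained models studied above.

```python
import numpy as np

def layer_activation_stats(x, weights, biases, act=np.tanh):
    """Forward-propagate a fixed monitoring batch and return the mean and
    standard deviation of the activation values at each hidden layer,
    the quantities plotted in Figures 2-4."""
    stats, z = [], x
    for W, b in zip(weights, biases):
        z = act(z @ W + b)
        stats.append((z.mean(), z.std()))
    return stats

# Placeholder network and monitoring set (e.g. 300 fixed test examples):
rng = np.random.default_rng(0)
sizes = [1024, 1000, 1000, 1000, 1000]
weights = [rng.uniform(-1 / np.sqrt(n), 1 / np.sqrt(n), (n, m))
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x_monitor = rng.standard_normal((300, sizes[0]))

for i, (m, s) in enumerate(layer_activation_stats(x_monitor, weights, biases), 1):
    print(f"layer {i}: mean={m:+.3f} std={s:.3f}")
```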

4 Studying Gradients and their Propagation

4.1 Effect of the Cost Function

We have found that the logistic regression or conditional log-likelihood cost function ($-\log P(y|x)$ coupled with softmax outputs) worked much better (for classification problems) than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986). This is not a new observation (Solla et al., 1988) but we find it important to stress here. We found that the plateaus in the training criterion (as a function of the parameters) are less present with the log-likelihood cost function. We can see this in Figure 5, which plots the training criterion as a function of two weights for a two-layer network (one hidden layer) with hyperbolic tangent units, and a random input and target signal. There are clearly more severe plateaus with the quadratic cost.

Figure 5: Cross entropy (black, surface on top) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, W1 on the first layer and W2 on the second, output layer.
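The comparison of Figure 5 can be reproduced in spirit with a few lines: sweep two weights, one per layer, of a tiny one-hidden-layer tanh network on a random input and target, and evaluate both costs on a grid. The sketch below is an illustrative reconstruction, not the authors' script; it uses a sigmoid output and binary cross-entropy as a simplified stand-in for the softmax output described above, and the sizes and weight ranges are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)          # random input signal
y = float(rng.integers(0, 2))        # random binary target
W1 = rng.standard_normal(20) * 0.1   # first-layer weights into one hidden unit

def costs(w1_free, w2_free):
    """Cross-entropy and quadratic cost as a function of two free weights."""
    W1_var = W1.copy()
    W1_var[0] = w1_free                        # free weight in the first layer
    h = np.tanh(x @ W1_var)                    # hidden activation
    p = 1.0 / (1.0 + np.exp(-(w2_free * h)))   # output probability (free 2nd-layer weight)
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    quad = (p - y) ** 2
    return ce, quad

grid = np.linspace(-4, 4, 81)
ce_surface = np.array([[costs(a, b)[0] for b in grid] for a in grid])
quad_surface = np.array([[costs(a, b)[1] for b in grid] for a in grid])
print(ce_surface.shape, quad_surface.shape)  # plot these surfaces to compare plateaus
```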

4.2 Gradients at Initialization

4.2.1 Theoretical Considerations and a New Normalized Initialization

We study the back-propagated gradients, or equivalently the gradient of the cost function on the inputs biases at each layer. Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network. We will also start by studying the linear regime.

For a dense artificial neural network using a symmetric activation function $f$ with unit derivative at 0 (i.e. $f'(0) = 1$), if we write $\mathbf{z}^i$ for the activation vector of layer $i$ and $\mathbf{s}^i$ for the argument vector of the activation function at layer $i$, we have $\mathbf{s}^i = \mathbf{z}^i W^i + \mathbf{b}^i$ and $\mathbf{z}^{i+1} = f(\mathbf{s}^i)$. From these definitions we obtain the following:

$$\frac{\partial \mathrm{Cost}}{\partial s^i_k} = f'(s^i_k)\, W^{i+1}_{k,\bullet} \frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{i+1}}, \qquad (2)$$

$$\frac{\partial \mathrm{Cost}}{\partial w^i_{l,k}} = z^i_l \frac{\partial \mathrm{Cost}}{\partial s^i_k}. \qquad (3)$$

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at the initialization, that the weights are initialized independently and that the input feature variances are the same ($= \mathrm{Var}[x]$). Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input,

$$f'(s^i_k) \approx 1, \qquad (4)$$

$$\mathrm{Var}[z^i] = \mathrm{Var}[x] \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}[W^{i'}]. \qquad (5)$$

We write $\mathrm{Var}[W^{i'}]$ for the shared scalar variance of all weights at layer $i'$. Then for a network with $d$ layers,

$$\mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^i}\right] = \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1}\, \mathrm{Var}[W^{i'}], \qquad (6)$$

$$\mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial w^i}\right] = \prod_{i'=0}^{i-1} n_{i'}\, \mathrm{Var}[W^{i'}] \prod_{i'=i}^{d-1} n_{i'+1}\, \mathrm{Var}[W^{i'}] \times \mathrm{Var}[x]\, \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right]. \qquad (7)$$

From a forward-propagation point of view, to keep information flowing we would like that

$$\forall (i, i'), \quad \mathrm{Var}[z^i] = \mathrm{Var}[z^{i'}]. \qquad (8)$$

From a back-propagation point of view we would similarly like to have

$$\forall (i, i'), \quad \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^i}\right] = \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^{i'}}\right]. \qquad (9)$$

These two conditions transform to:

$$\forall i, \quad n_i\, \mathrm{Var}[W^i] = 1, \qquad (10)$$

$$\forall i, \quad n_{i+1}\, \mathrm{Var}[W^i] = 1. \qquad (11)$$

As a compromise between these two constraints, we might want to have

$$\forall i, \quad \mathrm{Var}[W^i] = \frac{2}{n_i + n_{i+1}}. \qquad (12)$$

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights we could get the following interesting properties:

$$\forall i, \quad \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^i}\right] = \left[ n\, \mathrm{Var}[W] \right]^{d-i} \mathrm{Var}[x], \qquad (13)$$

$$\forall i, \quad \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial w^i}\right] = \left[ n\, \mathrm{Var}[W] \right]^{d} \mathrm{Var}[x]\, \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^d}\right]. \qquad (14)$$

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

$$n\, \mathrm{Var}[W] = \frac{1}{3}, \qquad (15)$$

where $n$ is the layer size (assuming all layers are of the same size). This will cause the variance of the back-propagated gradient to be dependent on the layer (and decreasing). The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradient variance as one moves up or down the network. We call it the normalized initialization:

$$W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right]. \qquad (16)$$
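A NumPy sketch of the normalized initialization of eq. 16, together with a quick empirical check of the variance-preservation argument behind eqs. 8-12, might look like the following; the depth, layer widths and random input data are arbitrary choices made for illustration.

```python
import numpy as np

def init_normalized(n_in, n_out, rng):
    """Eq. 16: W ~ U[-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})],
    which gives Var[W] = 2 / (n_in + n_out) as in eq. 12."""
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def init_standard(n_in, n_out, rng):
    """Eq. 1: W ~ U[-1/sqrt(n_in), 1/sqrt(n_in)], i.e. n Var[W] = 1/3 (eq. 15)."""
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def activation_variances(init, widths, n_samples=2000, seed=0):
    """Propagate unit-variance inputs through a deep tanh net and report the
    variance of the activations at each layer."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, widths[0]))
    variances = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        z = np.tanh(z @ init(n_in, n_out, rng))
        variances.append(z.var())
    return variances

widths = [1000] * 6  # five weight matrices, all layers of width 1000
print("standard  :", np.round(activation_variances(init_standard, widths), 3))
print("normalized:", np.round(activation_variances(init_normalized, widths), 3))
```

With the standard heuristic the activation variance shrinks layer after layer (roughly by the factor n Var[W] = 1/3 predicted in the linear regime), whereas the normalized scheme keeps it approximately constant across depth.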

4.2.2 Gradient Propagation Study

To empirically validate the above theoretical ideas, we have plotted some normalized histograms of activation values, weight gradients and back-propagated gradients at initialization with the two different initialization methods. The results displayed (Figures 6, 7 and 8) are from experiments on Shapeset-3×2, but qualitatively similar results were obtained with the other datasets.

We monitor the singular values of the Jacobian matrix associated with layer $i$:

$$J^i = \frac{\partial \mathbf{z}^{i+1}}{\partial \mathbf{z}^i}. \qquad (17)$$

When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from $\mathbf{z}^i$ to $\mathbf{z}^{i+1}$, as well as to the ratio of average activation variance going from $\mathbf{z}^i$ to $\mathbf{z}^{i+1}$. With our normalized initialization, this ratio is around 0.8, whereas with the standard initialization it drops down to 0.5.

4.3 Back-propagated Gradients During Learning

The dynamic of learning in such networks is complex and we would like to develop better tools to analyze and track it. In particular, we cannot use simple variance calculations in our theoretical analysis because the weight values are no longer independent of the activation values and the linearity hypothesis is also violated.

As first noted by Bradley (2009), we observe (Figure 7) that at the beginning of training, after the standard initialization (eq. 1), the variance of the back-propagated gradients gets smaller as it is propagated downwards. However, using our normalized initialization we do not see such decreasing back-propagated gradients (bottom of Figure 7).

What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weight gradients is roughly constant across layers, as shown in Figure 8. However, this is explained by our theoretical analysis above (eq. 14). Interestingly, as shown in Figure 9, these observations on the weight gradients of standard and normalized initialization change during training (here for a tanh network). Indeed, whereas the gradients initially have roughly the same magnitude, they diverge from each other (with larger gradients in the lower layers) as training progresses, especially with the standard initialization. Note that this might be one of the advantages of the normalized initialization, since having gradients of very different magnitudes at different layers may lead to ill-conditioning and slower training. Finally, we observe that the softsign networks share similarities with the tanh networks with normalized initialization, as can be seen by comparing the evolution of activations in both cases (resp. Figure 3-bottom and Figure 10).

Figure 6: Activation values normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak increases for higher layers.

Figure 7: Back-propagated gradients normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak decreases for higher layers.

Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangent with standard initialization (top) and normalized initialization (bottom) during training. We see that the normalization allows keeping the same variance of the weight gradients across layers during training (top: smaller variance for higher layers).

Figure 10: 98 percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.

5 Error Curves and Conclusions

The final consideration that we care about is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3×2, while Table 1 gives final test error for all the datasets studied (Shapeset-3×2, MNIST, CIFAR-10, and Small-ImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth-five hyperbolic tangent network with normalized initialization.

Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

TYPE        Softsign   Softsign N   Tanh    Tanh N   Sigmoid
Shapeset    16.27      16.06        27.15   15.60    82.61
MNIST       1.64       1.72         1.76    1.64     2.21
CIFAR-10    55.78      53.8         55.9    52.92    57.28
ImageNet    69.14      68.13        70.58   68.57    70.66

These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3×2, because of the task difficulty, we observe important saturations during learning; this might explain why the normalized initialization or the softsign effects are more visible.

Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR-10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Several conclusions can be drawn from these error curves:

• The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.

• The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

• For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both of these methods have been applied for Shapeset-3×2 with hyperbolic tangent and standard initialization. We observed a gain in performance but did not reach the results obtained with normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers. In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

The other conclusions from this study are the following:

• Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets.

• Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.

• Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and allows one to eliminate a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning.

• Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures.

References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1–127. Also published as a book, Now Publishers, 2009.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19 (pp. 153–160). MIT Press.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.

Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.

Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009 (pp. 153–160).

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report).

Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10, 1–40.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural networks, tricks of the trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag.

Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21 (pp. 1081–1088).

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Solla, S. A., Levin, E., & Fleisher, M. (1988). Complex Systems, 2, 625–639.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.

Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008 (pp. 1168–1175). New York, NY, USA: ACM.

Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.
