A TWO-STAGE ALGORITHM FOR ENHANCEMENT OF REVERBERANT SPEECH



Mingyang Wu and DeLiang Wang

Department of Computer Science and Engineering
and Center for Cognitive Science
The Ohio State University, Columbus, OH 43210-1277, USA

Email: mwu@cse.ohio-state.edu, dwang@cse.ohio-state.edu

ABSTRACT

Room reverberation causes two perceptual distortions on clean speech: coloration and long-term reverberation. These two effects correspond to two physical variables: signal-to-reverberant energy ratio (SRR) and reverberation time, respectively. Based on this observation, we propose a two-stage algorithm that enhances reverberant speech from one-microphone recordings. In the first stage, an inverse filter is estimated to reduce coloration effects, or increase the SRR. The second stage employs spectral subtraction to minimize the influence of long-term reverberation. The proposed algorithm significantly improves the quality of reverberant speech. A comparison with a recent one-microphone enhancement algorithm shows that our system produces significantly better results.

1. INTRODUCTION

A main cause of speech degradation in practically all listening situations is room reverberation. Although a person with normal hearing can tolerate room reverberation to a considerable degree, hearing-impaired listeners suffer disproportionately from reverberation effects [12]. Reverberation also causes significant performance decrements for current automatic speech recognition (ASR) and speaker recognition systems. Consequently, an effective reverberant speech enhancement system can be used to improve the design of intelligent hearing aids and is essential for many speech technology applications.

In this article we study one-microphone reverberant speech enhancement. This is motivated by the following two considerations. First, a one-microphone solution is highly desirable for many real-world applications such as hands-free audio communication and audio information retrieval. Second, moderately reverberant speech is highly intelligible in monaural listening conditions; hence, how to achieve this monaural capability remains a fundamental scientific question.

A number of reverberant speech enhancement algorithms have been designed utilizing more than one microphone. For example, microphone-array based methods [6], such as beamforming techniques, attempt to suppress the sound energy coming from directions other than that of the direct source and thereby enhance the target speech. As pointed out by Koenig et al. [10], the reverberation tails of the impulse responses characterizing the reverberation process in a room with multiple microphones and one speaker are uncorrelated. Several algorithms have been proposed to reduce reverberation effects by removing the incoherent parts of the received signals. Blind deconvolution algorithms aim to reconstruct the inverse filters without prior knowledge of the room impulse responses (for example, see [8]). Brandstein and Griebel [5] utilize the extrema of wavelet coefficients to reconstruct the linear prediction (LP) residual of the original speech.

Reverberant speech enhancement using one microphone is significantly more challenging than that using multiple microphones. Nonetheless, a number of one-microphone algorithms have been proposed. Bees et al. [3] employ a cepstrum-based method to estimate the cepstrum of the reverberation impulse response, whose inverse is then used to dereverberate the signal. Several dereverberation algorithms (for example, see [2]) are motivated by the effects of reverberation on the modulation transfer function (MTF). Yegnanarayana and Murthy [16] observed that the LP residual of voiced clean speech has damped sinusoidal patterns within each glottal cycle, whereas that of reverberant speech is smeared and resembles Gaussian noise. Based on this observation, the LP residual of clean speech is estimated and the enhanced speech is then resynthesized. Nakatani and Miyoshi [13] proposed a system capable of blind dereverberation by employing the harmonic structure of speech. Good results are obtained, but this algorithm requires a large amount of reverberant speech produced with the same room impulse response function. Despite these studies, existing reverberant speech enhancement algorithms do not reach the performance level demanded by many practical applications.

2. BACKGROUND

Reverberation causes a noticeable change in speech quality. Berkley and Allen [4] identified two physical variables, reverberation time T60 and spectral deviation, as important for reverberant speech quality. Consider the impulse response as a combination of three parts: the direct sound and the early and late reflections. While late reflections smear the speech spectra and reduce the intelligibility and quality of speech signals, early reflections cause another distortion of the speech signal called coloration: the non-flat frequency response of the early reflections distorts the speech spectrum. Coloration can be characterized by the spectral deviation, defined as the standard deviation of the room frequency response. Increasing either the spectral deviation or the reverberation time results in decreased reverberant speech quality. Moreover, Jetzt [9] shows that the spectral deviation is determined by the signal-to-reverberant energy ratio (SRR), the ratio between the energy traveling directly from a source to a listener and the energy of all acoustic reflections reaching the listener, which is in turn determined by the talker-to-microphone distance. A shorter talker-to-microphone distance results in a higher SRR and less spectral deviation, hence less coloration.
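To make the SRR concrete, the following is a minimal sketch of how it might be computed from a measured room impulse response. The function name and the 5 ms window used to delimit the direct sound are our own choices, not values taken from the paper.

```python
import numpy as np

def srr_db(h, fs, direct_ms=5.0):
    """Signal-to-reverberant energy ratio (SRR) of an impulse response, in dB.

    Assumes the direct sound is contained within `direct_ms` milliseconds of the
    main peak (an illustrative convention); everything later is treated as
    reverberant energy.
    """
    onset = int(np.argmax(np.abs(h)))            # location of the direct impulse
    split = onset + int(direct_ms * 1e-3 * fs)   # end of the (assumed) direct part
    direct_energy = np.sum(h[:split] ** 2)
    reverb_energy = np.sum(h[split:] ** 2)
    return 10.0 * np.log10(direct_energy / (reverb_energy + 1e-20))
```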

Consequently, we propose a two-stage model to deal with these two types of degradation, coloration and long-term reverberation, in a reverberant environment. In the first stage, our model estimates an inverse filter to reduce coloration effects and thereby increase the SRR. The second stage employs spectral subtraction to minimize the influence of long-term reverberation.
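The two stages can be summarized by a short driver sketch. Here `estimate_inverse_filter` and `remove_late_reverberation` are hypothetical names for the stage implementations sketched later in this article, not functions defined by the paper.

```python
import numpy as np

def enhance_reverberant_speech(reverberant, fs):
    """Sketch of the two-stage pipeline: stage 1 reduces coloration (raises SRR)
    by inverse filtering; stage 2 suppresses long-term reverberation by spectral
    subtraction. Both helpers are hypothetical stand-ins sketched below."""
    g = estimate_inverse_filter(reverberant)               # stage 1: inverse filter
    z = np.convolve(reverberant, g)[: len(reverberant)]    # inverse-filtered speech
    return remove_late_reverberation(z, fs)                # stage 2: spectral subtraction
```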

3. INVERSE FILTERING

In the first stage of our algorithm, we derive an inverse filter to reduce the reverberation effects. This stage is adapted from a multi-microphone inverse filtering algorithm proposed by Gillespie et al. [8]. An FIR inverse filter of the room impulse response is estimated by maximizing the kurtosis of the linear prediction (LP) residual of speech, using a block frequency-domain adaptive filter. The inverse-filtered speech is then obtained by convolving the inverse filter with the reverberant speech.
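The sketch below illustrates the idea of maximizing the kurtosis of the LP residual with an adaptive FIR filter. It is a simplified, sample-by-sample time-domain variant, not the block frequency-domain adaptive filter used by Gillespie et al. [8]; the LP order, filter length, step size, and moment-smoothing constant are our own choices, and the gradient expression is one common form of the kurtosis gradient.

```python
import numpy as np

def lp_residual(x, order=12):
    """LP residual of x via autocorrelation (Yule-Walker) linear prediction."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1 : order + 1])
    pred = np.convolve(x, np.concatenate(([0.0], a)))[: len(x)]
    return x - pred

def estimate_inverse_filter(reverberant, taps=512, mu=1e-6, beta=0.99):
    """Kurtosis-maximizing inverse filter (sketch after Gillespie et al. [8])."""
    r = lp_residual(np.asarray(reverberant, dtype=float))
    g = np.zeros(taps)
    g[0] = 1.0                               # start from an identity filter
    m2, m4 = 1.0, 3.0                        # running 2nd/4th moments of the output
    for n in range(taps, len(r)):
        frame = r[n - taps:n][::-1]          # most recent `taps` residual samples
        y = float(np.dot(g, frame))
        m2 = beta * m2 + (1 - beta) * y * y
        m4 = beta * m4 + (1 - beta) * y ** 4
        grad = 4.0 * (m2 * y ** 3 - m4 * y) / (m2 ** 3 + 1e-12)  # d(kurtosis)/dy
        g += mu * grad * frame               # stochastic ascent on the kurtosis
        g /= np.linalg.norm(g) + 1e-12       # keep the filter norm bounded
    return g
```

The learned filter g would then be convolved with the reverberant speech to obtain the inverse-filtered signal, as described above.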

A typical result from the first stage of our algorithm is shown in Fig. 1. Fig. 1(a) illustrates a room impulse response function (T60 = 0.3 s) generated by the image model of Allen and Berkley [1]. The equalized impulse response, the result of convolving the room impulse response in Fig. 1(a) with the obtained inverse filter, is shown in Fig. 1(b). As can be seen, the equalized impulse response is far more impulse-like than the room impulse response. In fact, the SRR of the room impulse response is -9.8 dB, compared with 2.4 dB for the equalized impulse response.

However, the above inverse filtering method does not improve the tail part of the reverberation. Fig. 1(c) and (d) show the energy decay curves of the room impulse response and the equalized impulse response, respectively. As can be seen, except for the first 50 ms, the energy decay patterns are almost identical, and thus the estimated reverberation times are almost the same, around 0.3 s. While the coloration distortion is reduced due to the increase of SRR, the degradation due to reverberation tails is not alleviated. In other words, the effect of inverse filtering is similar to that of moving the sound source closer to the receiver. In the next section, we introduce the second stage of our algorithm to reduce the effects of long-term reverberation.
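For reference, energy decay curves like those in Fig. 1(c) and (d) can be obtained by Schroeder backward integration, as noted in the caption of Fig. 1. A minimal sketch follows; the -5 to -35 dB fitting range for estimating the reverberation time is a common convention, not a detail stated in the paper.

```python
import numpy as np

def energy_decay_curve_db(h):
    """Schroeder backward integration of an impulse response, in dB."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]      # energy remaining after each sample
    return 10.0 * np.log10(edc / edc[0] + 1e-12)

def reverberation_time(h, fs, fit_range=(-5.0, -35.0)):
    """T60 estimated by a line fit over `fit_range` of the decay curve,
    extrapolated to -60 dB."""
    edc = energy_decay_curve_db(np.asarray(h, dtype=float))
    t = np.arange(len(h)) / fs
    mask = (edc <= fit_range[0]) & (edc >= fit_range[1])
    slope, _ = np.polyfit(t[mask], edc[mask], 1)   # decay rate in dB per second
    return -60.0 / slope
```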

4. SPECTRAL SUBTRACTION

Late reflections in a room impulse response function smear the speech spectrum and degrade speech intelligibility and quality. Likewise, an equalized impulse response can be decomposed into two parts: early and late impulses. Resembling the effects of the late reflections in a room impulse response, the late impulses have deleterious effects on the quality of the inverse-filtered speech; by estimating the effects of the late impulses and subtracting them, we can expect to enhance the speech quality.

In a previous version of this algorithm, Wu and Wang [15] proposed a one-stage method to enhance reverberant speech by estimating and subtracting the effects of late reflections.

Fig. 1. (a) A room impulse response function generated by the image model in an office-size room. (b) The equalized impulse response derived from the reverberant speech generated by the room impulse response in (a), as the result of the first stage of our algorithm. (c) The energy decay curve computed from the room impulse response function in (a). (d) The energy decay curve computed from the equalized impulse response in (b). Each curve is calculated using the Schroeder integration method. The horizontal dotted line represents the -60 dB energy decay level. The left dashed lines indicate the starting times of the impulse responses, and the right dashed lines mark the times at which the decay curves cross -60 dB.

The smearing effects of late impulses lead to the smoothing of the signal spectrum in the time domain. Therefore, we assume that the power spectrum of the late-impulse components is a smoothed and shifted version of the power spectrum of the inverse-filtered speech z:

$$|S_l(k;i)|^{2} = \gamma\, w(i-\rho) * |S_z(k;i)|^{2}, \qquad (1)$$

where $|S_z(k;i)|^{2}$ and $|S_l(k;i)|^{2}$ are, respectively, the short-term power spectra of the inverse-filtered speech and of the late-impulse components. The indices k and i refer to frequency bin and time frame, respectively. The symbol * denotes convolution in the time domain, and w(i) is a smoothing function. The short-term speech spectrum is obtained using Hamming windows of length 16 ms with 8 ms overlap for short-time Fourier analysis.
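The short-time analysis just described can be sketched as follows; the function name and the array layout (frequency bins by time frames) are our own choices.

```python
import numpy as np

def stft_power(z, fs, win_ms=16, hop_ms=8):
    """Short-term power spectra |S_z(k;i)|^2 using 16 ms Hamming windows
    with an 8 ms frame shift, as described above."""
    win = int(win_ms * 1e-3 * fs)
    hop = int(hop_ms * 1e-3 * fs)
    w = np.hamming(win)
    frames = [z[i:i + win] * w for i in range(0, len(z) - win, hop)]
    spec = np.stack([np.fft.rfft(f) for f in frames], axis=1)  # shape: (freq k, frame i)
    return np.abs(spec) ** 2, spec   # power spectra, plus the complex STFT for the phase
```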

The shift delay ρ indicates the relative delay of the late-impulse components. The distinction between early and late reflections for speech is commonly set at a delay of 50 ms in a room impulse response function [11]. This delay reflects the properties of speech and is independent of the reverberation characteristics. Consequently, it translates to approximately 7 frames for a frame shift of 8 ms, and we choose ρ = 7. Finally, the scaling factor γ specifies the relative strength of the late-impulse components after inverse filtering, and we set it to 0.32.

Considering the shape of the equalized impulse response, we choose an asymmetric smoothing function in the form of a Rayleigh distribution:

$$w(i) = \begin{cases} \dfrac{i+a}{a^{2}} \exp\!\left(-\dfrac{(i+a)^{2}}{2a^{2}}\right), & \text{if } i > -a, \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (2)$$

where we choose a = 5; this parameter controls the span of the smoothing function. The smoothing function goes down to zero quickly on the left side but tails off slowly on the right side; the right side of the smoothing function resembles the shape of the reverberation tails in equalized impulse responses.
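A small sketch of the smoothing function of Eq. (2) with a = 5 follows; the truncation length of the right-hand tail is our own choice, since the paper does not specify where w(i) is cut off.

```python
import numpy as np

def smoothing_window(a=5, length=40):
    """Asymmetric smoothing function w(i) of Eq. (2): a Rayleigh shape that
    rises quickly on the left and tails off slowly on the right.
    Returns the frame offsets i and the corresponding weights w(i);
    w(i) = 0 for i <= -a, so those offsets are omitted."""
    i = np.arange(-a + 1, length)              # truncation length is illustrative
    w = (i + a) / a**2 * np.exp(-((i + a) ** 2) / (2.0 * a**2))
    return i, w
```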

Assuming that the early- and late-impulse components are approximately uncorrelated, the power spectrum of the early-impulse components can be estimated by subtracting the power spectrum of the late-impulse components from that of the inverse-filtered speech. The result is further used as an estimate of the power spectrum of the original speech. Specifically, spectral subtraction [7] is employed to estimate the power spectrum of the original speech, $|\tilde{S}_x(k;i)|^{2}$:

$$|\tilde{S}_x(k;i)|^{2} = |S_z(k;i)|^{2}\cdot \max\!\left( \frac{|S_z(k;i)|^{2} - \gamma\, w(i-\rho) * |S_z(k;i)|^{2}}{|S_z(k;i)|^{2}},\ \varepsilon \right), \qquad (3)$$

where ε = 0.001 is the floor and corresponds to a maximum attenuation of 30 dB.
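Putting Eqs. (1) and (3) together, the following sketch estimates the late-impulse power spectrum as a shifted, smoothed combination of past frames and then applies the floored spectral subtraction. It reuses the `smoothing_window` sketch given after Eq. (2), and `power` is assumed to be |S_z(k;i)|^2 arranged as (frequency bins, time frames); the truncation length is again our choice.

```python
import numpy as np

def late_power(power, gamma=0.32, rho=7, a=5, length=60):
    """Eq. (1): |S_l(k;i)|^2 = gamma * (w(. - rho) convolved with |S_z(k;.)|^2)(i)."""
    idx, w = smoothing_window(a, length)       # sketch given after Eq. (2)
    n = power.shape[1]
    late = np.zeros_like(power)
    for i_w, w_i in zip(idx, w):
        m = i_w + rho                          # lag in frames: w evaluated at (m - rho)
        if 1 <= m < n:
            late[:, m:] += gamma * w_i * power[:, :n - m]
    return late

def subtract_late_reverberation(power, eps=1e-3, **kw):
    """Eq. (3): spectral subtraction with floor eps = 0.001 (30 dB max attenuation)."""
    late = late_power(power, **kw)
    gain = np.maximum((power - late) / (power + 1e-20), eps)
    return power * gain
```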

Natural speech utterances contain silent gaps, and reverberation fills some of the gaps immediately after high-intensity speech sections. We identify these silent gaps by examining the energy of the inverse-filtered speech and the energy reduction ratio after spectral subtraction in each time frame. For identified silent frames, all frequency bins are attenuated by 30 dB. Finally, the short-term phase spectrum of the enhanced speech is set to that of the inverse-filtered speech, and the processed speech is reconstructed from the short-term magnitude and phase spectra.
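A sketch of the frame attenuation and resynthesis step follows. The paper does not give numerical thresholds for deciding which frames are silent, so the rule below is only illustrative; the reuse of the inverse-filtered phase and the overlap-add reconstruction follow the description above.

```python
import numpy as np

def resynthesize(enhanced_power, stft, fs, hop_ms=8, silent_db=-30.0):
    """Attenuate frames judged silent by 30 dB, combine the enhanced magnitude
    with the phase of the inverse-filtered speech, and overlap-add."""
    power = np.abs(stft) ** 2
    frame_energy = power.sum(axis=0)
    reduction = enhanced_power.sum(axis=0) / (frame_energy + 1e-20)
    # Illustrative silence rule: low frame energy and strong spectral-subtraction reduction.
    silent = (frame_energy < 0.01 * frame_energy.max()) & (reduction < 0.1)
    enhanced_power = enhanced_power.copy()
    enhanced_power[:, silent] *= 10.0 ** (silent_db / 10.0)
    spec = np.sqrt(enhanced_power) * np.exp(1j * np.angle(stft))  # keep original phase
    hop = int(hop_ms * 1e-3 * fs)
    frames = np.fft.irfft(spec, axis=0)
    out = np.zeros(hop * (frames.shape[1] - 1) + frames.shape[0])
    for i in range(frames.shape[1]):                              # overlap-add synthesis
        out[i * hop:i * hop + frames.shape[0]] += frames[:, i]
    return out

def remove_late_reverberation(z, fs):
    """Stage 2 composed from the sketches above."""
    power, spec = stft_power(z, fs)
    enhanced = subtract_late_reverberation(power)
    return resynthesize(enhanced, spec, fs)
```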

5. RESULTS AND DISCUSSIONS

A corpus of speech utterances from eight speakers, four females and four males, is used for evaluation. Informal listening tests show that the proposed algorithm achieves substantial reduction of reverberation and introduces few audible artifacts. To illustrate typical performance, we show enhancement results in Fig. 2. Fig. 2(a) and (c) show the clean and the reverberant signals, and Fig. 2(b) and (d) the corresponding spectrograms, respectively. The reverberant signal is produced by convolving the clean signal with the room impulse response function in Fig. 1(a) (T60 = 0.3 s). As can be seen, while the clean signal has fine harmonic structure and silence gaps between the words, the reverberant speech is smeared and its harmonic structure is elongated.

To put our performance in perspective, we compare with a recent one-microphone reverberant speech enhancement algorithm proposed by Yegnanarayana and Murthy [16], which we refer to as the YM algorithm. The YM algorithm applies weights to the LP residual so that it more closely resembles the damped sinusoidal patterns of the LP residual of clean speech. Fig. 2(e) and (f) show the speech processed by the YM algorithm and its spectrogram, respectively. As can be seen, the spectral structure is clearer and some silence gaps are attenuated. The speech processed by our algorithm and its spectrogram are shown in Fig. 2(g) and (h). As can be seen, the effects of reverberation have been significantly reduced in the processed speech: the smearing is lessened and many silence gaps are clearer. The figure clearly shows that our algorithm enhances the reverberant speech more than does the YM algorithm. An audio demonstration can also be found at http://www.cse.ohio-state.edu/~dwang/demo/WuReverb.html.

Quantitative comparisons are obtained separately for the speech utterances of the eight speakers using the frequency-weighted segmental SNR [14] and are presented in Table I. SNR_fw^rev, SNR_fw^YM, and SNR_fw^proc represent the frequency-weighted segmental SNR values of the reverberant speech, the speech processed by the YM algorithm, and the speech processed by our algorithm, respectively. The SNR gains obtained by the YM algorithm and by our algorithm are denoted by SNR_fw^YM - SNR_fw^rev and SNR_fw^proc - SNR_fw^rev, respectively. As can be seen, the YM algorithm obtains an average SNR gain of 0.74 dB, compared to 4.15 dB for our algorithm.

Table I. The systematic results of reverberant speech enhancement for speech utterances of four female and four male speakers randomly selected from the TIMIT database. All signals are sampled at 8 kHz. All values are in dB.

Speaker/Gender   SNR_fw^rev   SNR_fw^YM   SNR_fw^proc   SNR_fw^YM - SNR_fw^rev   SNR_fw^proc - SNR_fw^rev
Female #1          -3.64        -3.06         0.92              0.58                     4.56
Female #2          -3.51        -3.05         0.74              0.46                     4.25
Female #3          -3.86        -3.19        -0.20              0.68                     3.66
Female #4          -4.12        -3.29         0.73              0.83                     4.84
Male #1            -3.86        -2.65        -0.92              1.21                     2.94
Male #2            -3.33        -2.68         1.77              0.65                     5.10
Male #3            -3.30        -2.53         1.20              0.76                     4.49
Male #4            -3.50        -2.76        -0.13              0.75                     3.38
Average            -3.64        -2.90         0.51              0.74                     4.15
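For completeness, a generic frequency-weighted segmental SNR is sketched below. It is only a stand-in for the measure of Tribolet et al. [14]: the band layout, weights, and clamping range are our assumptions, and the clean and processed signals are assumed to be time-aligned, so it should not be expected to reproduce the exact values in Table I.

```python
import numpy as np

def fw_segmental_snr(clean, processed, fs, win_ms=16, hop_ms=8, n_bands=25):
    """Generic frequency-weighted segmental SNR: per-band SNR weighted by the
    clean-speech band magnitude and averaged over frames (illustrative only)."""
    win, hop = int(win_ms * 1e-3 * fs), int(hop_ms * 1e-3 * fs)
    w = np.hamming(win)
    edges = np.linspace(0, win // 2 + 1, n_bands + 1, dtype=int)  # uniform bands (assumption)
    scores = []
    for i in range(0, min(len(clean), len(processed)) - win, hop):
        C = np.abs(np.fft.rfft(clean[i:i + win] * w)) ** 2
        P = np.abs(np.fft.rfft(processed[i:i + win] * w)) ** 2
        cb = np.sqrt([C[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
        pb = np.sqrt([P[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
        snr = 10.0 * np.log10(cb ** 2 / ((cb - pb) ** 2 + 1e-20) + 1e-20)
        snr = np.clip(snr, -10.0, 35.0)               # common clamping convention
        scores.append(np.sum(cb * snr) / (np.sum(cb) + 1e-20))
    return float(np.mean(scores))
```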

Although our algorithm is designed for enhancing reverberant speech using one microphone, it is straightforward to extend it to multi-microphone scenarios. Many inverse filtering algorithms, such as the algorithm by Gillespie et al. [8], were originally proposed for multiple microphones. After inverse filtering using multiple microphones, the second stage of our algorithm, the spectral subtraction method, can be utilized to reduce long-term reverberation effects.


To conclude, we have presented a two-stage reverberant speech enhancement algorithm using one microphone; the two stages correspond to inverse filtering and spectral subtraction. The evaluations show that our algorithm enhances the quality of reverberant speech effectively and performs significantly better than a recent reverberant speech enhancement algorithm.

Acknowledgments. This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (FA9550-04-1-0117).

REFERENCES

[1] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, pp. 943-950, 1979.

[2] C. Avendano and H. Hermansky, "Study on the dereverberation of speech based on temporal envelope filtering," in Proc. ICSLP, 1996, pp. 889-892.

[3] D. Bees, M. Blostein, and P. Kabal, "Reverberant speech enhancement using cepstral processing," in Proc. IEEE ICASSP, 1991, pp. 977-980.

[4] D. A. Berkley and J. B. Allen, "Normal listening in typical rooms: The physical and psychophysical correlates of reverberation," in Acoustical factors affecting hearing aid performance, G. A. Studebaker and I. Hochberg, Eds., 2nd ed., Needham Heights, MA: Allyn and Bacon, 1993, pp. 3-14.

[5] M. S. Brandstein and S. Griebel, "Explicit speech modeling for microphone array applications," in Microphone arrays: Signal processing techniques and applications, M. S. Brandstein and D. B. Ward, Eds., New York, NY: Springer Verlag, 2001, pp. 133-153.

[6] M. S. Brandstein and D. B. Ward, "Microphone Arrays: Signal Processing Techniques and Applications." New York, NY: Springer Verlag, 2001.

[7] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-time processing of speech signals, Upper Saddle River, NJ: Prentice-Hall, 1987.

[8] B. W. Gillespie, H. S. Malvar, and D. A. F. Florêncio, "Speech dereverberation via maximum-kurtosis subband adaptive filtering," in Proc. IEEE ICASSP, 2001, pp. 3701-3704.

[9] J. J. Jetzt, "Critical distance measurement of rooms from the sound energy spectral response," J. Acoust. Soc. Amer., vol. 65, pp. 1204-1211, 1979.

[10] A. H. Koenig, J. B. Allen, D. A. Berkley, and T. H. Curtis, "Determination of masking level differences in a reverberant environment," J. Acoust. Soc. Amer., vol. 61, pp. 1374-1376, 1977.

[11] H. Kuttruff, Room Acoustics, 4th ed., New York, NY: Spon Press, 2000.

[12] A. K. Nábelek, "Communication in noisy and reverberant environments," in Acoustical factors affecting hearing aid performance, G. A. Studebaker and I. Hochberg, Eds., 2nd ed., Needham Heights, MA: Allyn and Bacon, 1993.

[13] T. Nakatani and M. Miyoshi, "Blind dereverberation of single channel speech signal based on harmonic structure," in Proc. IEEE ICASSP, 2003, pp. 92-95.

[14] J. M. Tribolet, P. Noll, and B. J. McDermott, "A study of complexity and quality of speech waveform coders," in Proc. IEEE ICASSP, Tulsa, OK, 1978, pp. 586-590.

[15] M. Wu and D. L. Wang, "A one-microphone algorithm for reverberant speech enhancement," in Proc. IEEE ICASSP, 2003, pp. 844-847.

[16] B. Yegnanarayana and P. S. Murthy, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Processing, vol. 8, pp. 267-281, 2000.

Fig. 2. Results of reverberant speech enhancement: (a) clean speech, (b) spectrogram of clean speech, (c) reverberant speech, (d) spectrogram of reverberant speech, (e) speech processed using the YM algorithm, (f) spectrogram of (e), (g) speech processed using our algorithm, and (h) spectrogram of (g). The speech is a female utterance, "She had your dark suit in greasy wash water all year," sampled at 8 kHz.

