Measuring Similarity of Large Software Systems Based on Source Code Correspondence

更新时间:2023-07-25 03:33:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

MeasuringSimilarityofLargeSoftware

SystemsBasedonSourceCode

Correspondence

TetsuoYamamoto,MakotoMatsushita,ToshihiroKamiya,andKatsuro

Inoue

TheauthorsarewiththeSoftwareEngineeringResearchGroup(C/OKatsuroInoue),DivisionofSoftwareScience,GraduateSchoolofEngineeringScience,OsakaUniversity,1-3Machikaneyama-cho,Toyonaka,Osaka560-8531,JAPAN.E-mail:{t-yamamt,matusita}@ics.es.osaka-u.ac.jp,kamiya@is.aist-nara.ac.jp,inoue@ics.es.osaka-u.ac.jp.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

Abstract

Itisanimportantandintriguingissuetoknowthequantitativesimilarityoflargesoftwaresystems.

Inthispaper,asimilaritymetricbetweentwosetsofsourcecode lesbasedonthecorrespondenceofoverallsourcecodelinesisproposed.ASoftwaresimilarityMeAsurementToolSMATwasdevelopedandappliedtovariousversionsofanoperatingsystem(BSDUNIXOS).TheresultingsimilarityvaluationsclearlyrevealedtheevolutionaryhistorycharacteristicsoftheBSDUNIXOperatingSystem.Also,asanextensionofSMAT,asystem-widedi erenceextractiontoolwasdeveloped,whiche ectivelycompressedasetofsourcecode lesrelativetoabaseset.

Keywords

BSDUNIX,CodeClones,Plagiarism,SoftwareMetrics

I.Introduction

Long-lifesoftwaresystemsevolvethroughmultiplemodi cations.Manydi erentver-sionsarecreatedanddelivered.Theevolutionisnotsimpleandstraightforward.Itiscommonthatoneoriginalsystemcreatesseveraldistinctsuccessorbranchesduringevo-lution.Severaldistinctversionsmaybeuni edlaterandmergedintoanotherversion.Tomanagethemanyversionscorrectlyande ciently,itisveryimportanttoknowobjectivelytheirrelationships.Therehasbeenvariouskindsofresearchonsoftwareevolution[1],[2],

[3],[4],mostofwhichfocusedonchangesofmetricvaluesforsize,quality,deliverytimeorprocess,etc.

Closelyrelatedsoftwaresystemsusuallyareidenti edasproductlines,sodevelopmentandmanagementofproductlinesareactivelydiscussed[5].Knowingdevelopmentrelationsandarchitecturalsimilarityamongsuchsystemsisakeytoe cientdevelopmentofnewsystemsandtowell-organizedmaintenanceofexistingsystems[6].

Wehavebeeninterestedinmeasuringthesimilaritybetweentwolargesoftwaresystems.Aquantitativeandobjectivemeasureforsimilarityisanimportantvehicleforknowingtheevolutionofsoftwaresystems,asisdoneintheBioinformatics eld.InBioinformatics,distancemetricsarebasedonthealignmentofDNAsequences.Phylogenetictreesusingthisdistancearebuilttoillustraterelationsamongspecies[7].TherearehugenumbersofsoftwaresystemsalreadydevelopedintheworldanditshouldbepossibletoidentifytheevolutionhistoryofsoftwareassetsinamannerlikethatdoneinBioinformatics.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

Variousresearchon ndingsoftwaresimilaritieshasbeenperformed,mostofwhichfocusedondetectingprogramplagiarism[8],[9],[10].Theusualapproachextractsseveralmetricvalues(orattributes)characterizingthetargetprogramsandthencomparesthosevalues,usuallyintheeducationalenvironmentwithlimitedapplicabilityelsewhere.

Therealsohasbeensomeresearchonidentifyingsimilarityinlargecollectionsofplain-textorHTMLdocuments[11],[12].Theseworksusesampledinformationsuchaskeywordsequencesor“ ngerprints”.Similarityisdeterminedbycomparingthesampledinforma-tion.

Itisimportantthatthesoftwaresimilaritymetricisnotbasedonsampledinformationastheattributevalue(or ngerprint),butratherre ecttheoverallsystemcharacteris-tics.Acollectionofallsourcecode lesusedtobuildasystemcontainsalltheessentialinformationofthesystem.Thus,weanalyzeandcompareoverallsourcecode lesofthesystem.Thisapproachrequiresmorecomputationpowerandmemoryspacethanusingsampledinformation,butthecurrentcomputinghardwareenvironmentallowsthisoverallsourcecodecomparisonapproach.

Inthispaper,asimilaritymetriccalledSline,isused,whichisde nedastheratioofsharedsourcecodelinestothetotalsourcecodelinesoftwosoftwaresystemsbeingevaluated.

Slinerequirescomputingmatchesbetweensourcecodelinesinthetwosystems,beyondtheboundariesof lesanddirectories.Anaiveapproachforthiswouldbetocompareallsource lepairsinbothsystems,witha lematchingprogramsuchasdi [13],butthecomparisonofall lepairswouldbeimpracticaltoapplytolargesystemswiththousandsof les.

Insted,anapproachisproposedthatimprovese ciencyandprecision.First,afast,codeclone(duplicatedcodeportion)detectionalgorithmisappliedtoall lesinthetwosystemsandthendi isappliedtothe lepairswherecodeclonesarefound.

Usingthisconcept,asimilaritymetricevaluationtoolcalledSMAT(Softwaresimilar-ityMeAsurementTool)wasdevelopedandappliedtovarioussoftwaresystemtargets.WehaveevaluatedthesimilaritybetweenvariousversionsofBSDUNIXOS,andhaveperformedclusteranalysisofthesimilarityvaluestocreateadendrogramthatcorrectlyMarch3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

showsevolutionhistoryofBSDOS.Also,thesimilarityevaluationsofstudentcompilersystemprojectshavecon rmedtheabilityforplagiarismdetection.

Further,usingamethodsimilartomeasuringthesimilarity,atoolcalledDET(Di erenceExtractionTool)wasdevelopedforextractingthedi erencebetweentwosystems.Ap-plicationofDETittoseveralversionsofFreeBSDOShascon rmeditse ectivenessforcompressingonesoftwareversiontoanotherandidentifyingsystem-widedistinctions,ratherthansimplyusingdi .

SectionIIpresentsaformalde nitionofsimilarityanditsmetricSline.SectionIIIde-scribesapracticalmethodforcomputingSlineandshowstheimplementationtoolSMAT.SectionIVshowsapplicationsofSMATtoversionsofBSDUNIXOSandastudentproject.SectionVshowsanextensiontoadi erenceextractiontoolDET.ResultsofourworkandcomparisonwithrelatedresearcharegiveninSectionVI.ConcludingremarksaregiveninSectionVII.

II.SimilarityofSoftwareSystems

A.De nitions

Firstwewillgiveageneralde nitionofsoftwaresystemsimilarityandthenaconcretesimilaritymetric.

AsoftwaresystemPiscomposedofelementsp1,p2,···,pm,andPisrepresentedasaset{p1,p2,···,pm}.Inthesameway,anothersoftwaresystemQisdenotedby{q1,q2,···,qn}.Wewillchoosethetypeofelements,suchas lesandlines,basedonthede nitionsofthesimilaritymetricsdescribedlater.

Supposethatweareabletodeterminematchingbetweenpiandqj(1≤i≤m,1≤j≤n),andwecallthesetofallmatchedpair(pi,qj)CorrespondenceRs,whereRs P×Q.SimilaritySofPandQwithrespecttoRsisde nedasfollows.

S(P,Q)≡|{pi|(pi,qj)∈Rs}|+|{qj|(pi,qj)∈Rs}|

|P|+|Q|

AsshowninFig.1,thisde nitionmeansthatthesimilarityistheratioofthenumberofelementsinRstothetotalnumberofelementsinPandQ.IfRsbecomessmaller,Swilldecrease,andifRs=φthenS=0.Moreover,whenPandQareexactlythesamesystems, i(pi,qi)∈RsandthenS=1.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

Rs

q1

p2

qn

Fig.1.CorrespondenceofelementsRs.

B.SimilarityMetrics

Theabovede nitionofthesimilarityleavesroomforimplementingdi erentconcretesimilaritymetricsbychoosingtheelementtypesorcorrespondences.Here,weshowaconcretesimilaritymetricwhichwillbeusedinsubsequentdiscussions.

SimilarityMetricSlineusingequivalentlinematching:

Eachelementofasoftwaresystemisasinglelineofeachsource lecomposingthesystem.Forexample,ifasoftwaresystemXconsistsofsourcecode lesx1,x2,···andeachsourcecode lexiismadeupoflinesxi1,xi2,···,thenx11,x12,···,x21,x22,···,xi1,xi2,···,x(i+1)1,x(i+2)2,···aretheelements.Pair(xij,ymn)oftwolinesxijandymninsystemXandsys-temYisincorrespondencewhenxijandymnmatchasequivalentlines.Theequivalencyisdeterminedbytheduplicatedcodedetectionmethodand lecomparisonmethoddis-cussedindetaillater.Intuitively,twolineswithminordistinctionsuchasspace/commentmodi cationandidenti errenamearerecognizedasequivalent.

Slineisnota ectedby lerenamingorpathchanges.Modi cationofasmallpartinalarge ledoesnotgivegreatimpacttotheresultingvalue.Ontheotherhand, ndingequivalentlinesgenerallywouldbeatimeandspaceconsumingprocess.ApracticalapproachforthisisgiveninSectionIII.

Itispossibletoconsiderotherde parisontoothersuchapproachesarepresentedinSectionVI.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

III.MeasuringSline

A.Approach

ThekeyofSlineiscomputingthecorrespondence.Astraightforwardapproachwemightconsideristhat rstweconstructappended lesx1 x2 ···andy1 y2 ···whichareconcatenationofallsourcecode lesx1,x2,···andy1,y2,···forsystemsXandY,respectively.Thenweextractthelongestcommonsubsequence(LCS)betweenx1 x2 ···andy1 y2 ···bysometool,saydi [13],whichimplementsanLCS- ndingalgorithm[14],[15],[16].TheextractedLCSisusedasthecorrespondence.

However,thismethodisfragiletothechangeof leconcatenationordercausedbyrenaming lesandreorganizing lestructures,sincedi cannotfollowlineblockmovementtodi erentpositionsinthe les.Forexample,fortwo lesx1 x2andx2 x1,theLCSfoundbydi iseitherx1orx2(longeroneofthem).

Anotherapproachisthatwetrytoapplydi toallcombinationof lesbetweentwosys-tems.Thisapproachmightwork,butthescalabilitywouldbeanissue.Theperformanceappliedtohugesystemswiththousandsof leswouldbedoubtful.

Here,anapproachisproposedthate ectivelyusesbothdi andaclonedetectiontoolnamedCCFinder[17],[18].

CCFinderisatoolusedtodetectduplicatedcodeblocks(calledclones)insourcecodewritteninC,C++,Java,andCOBOL.Ite ectivelyperformslexicalanalysis,transforma-tionoftokens,computingduplicatedtokensequencesbyasu xtreealgorithm[19],andthenreportstheresults.Theclonedetectionismadealongwithnormalizationandparam-eterization,thatis,thelocationofwhitespacesandlinesbreaksareignored,commentsareremoves,andthedistinctionofidenti ernamesaredisregarded.Bythenormalizationandparameterization,codeblockswithminormodi cationaree ectivelydetectedasclones.ApplingCCFindertotwoconcatenated lesx1 x2 ···andy1 y2 ··· ndsallclonepairs(bx,by)wherebxisasubsequenceofx1 x2 ···.andbyisthatofy1 y2 ···.Thoseclonepairsfoundaremembersofthecorrespondence.

Codeclonesareonlynon-gappedones.Closelysimilarcodeblockswithagapblock(unmatchingtothem)suchasl1l2andl1lxl2arenotdetectedasalargerclonel1 l2butidenti edastwosmallerclonesl1andl2.Whenthelenghtsofl1andl2arelessthanthresholdofMarch3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

Software system XSoftware system Y

Fig.2.Howto ndacorrespondence.

CCFinder(usually20tokens),thenCCFinderreportsnoclonesatall.Therefore,tore-claimthosetokensdi isappliedtoallpairsoftwo lexiandyj,whereCCFinderdetectsaclonepair(bx,by)andbxisinxiandbyisinyj,respectively.Theresultofdi isthelongestcommonsubsequences,whichalsoareconsideredmembersofthecorrespondence.ThecombinedresultsofCCFinderanddi isincreasesSlinebyabout10%,comparedtousingonlyCCFinder.

B.ExampleofMeasurement

AsimpleexampleofcomputingSlinewithCCFinderanddi isgivenhere.ConsiderasoftwaresystemXanditsextendedsystemYasshowninFig.2.Xiscomposedoftwosourcecode lesAandB,andYiscomposedoffour lesA ,A ,B,andC.Here,A andA areevolvedversionsofA,andCisanewlycreated le.

At rst,CCFinderisappliedtodetectclonesbetweentwoconcatenated lesA BandA A B C.This ndsclonesbetweenAandA ,AandA ,andBandB .Assumethatnoclonesaredetectedbetweenothercombinationof les.Eachlinesintheclonesfoundareputintothecorrespondence.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

Fig.3.Similaritymeasuringprocess.

Next,di isappliedto lepairsAandA ,AandA ,andBandB .Then,thelinesintheresultingcommonsubsequencesbydi areaddedtothecorrespondenceobtainedbytheclonedetection.

Thisapproachhasbene tsinthesensethatwedonotneedtoperformdi onallthe lepaircombinations.Also,wecanchasemovementoflinesinsideoroutsideofthe les,whichcannotbedetectedbydi only.Also,thisapproachcanidentifyandcountthedirectivesandmacrosnotdetectedbyCCFinder.

C.SMAT

Basedonthisapproach,wehavedevelopedasimilarityevaluationtoolSMATwhiche ectivelycomputesSlinefortwosystems.Thefollowingisthedetailedprocessofthesystem.AnoverviewisillustratedinFig.3.

INPUTS:FilepathsoftwosystemsXandY,eachofwhichrepresentsthesubdirectoriescontainingallsourcecodes

OUTPUTS:SlineofXandY(0≤Sline≤1)

Step1Preprocessing:

Allcomments,whitespaces,andemptylinesareremoved,whichdonota ecttheexecutionMarch3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

oftheprograms.Thisstephelpstoimprovetheprecisionofthefollowingsteps,especiallyStep3.

Step2ExecutionofCCFinder:

WeexecuteCCFinderbetweentwoconcatenated Finderhasanoptionfortheminimumnumberoftokensofclonestobedetected,andwhichissetat20.Thisnumberisobtainedthroughexperiences.Smallernumbersgeneratemanymeaninglessclonesandlargernumbersincreasethechanceofoverlookingusefulclones.

Step3Executionofdi :

Executedi onany lepairxiandyjinXandYrespectively,whereatleastonecloneisdetectedbetweenxiandyj.

Step4ConstructionofCorrespondence:

ThelinesappearingintheclonesdetectedbyStep2andinthecommonsubsequencesfoundinStep3aremergedtodeterminthecorrespondencebetweenXandY.

Step5ComputingSline:

Slineiscalculatedusingitsde nition;i.e.,theratiooflinesinthecorrespondencetothoseinwholesystems.NotethatthenumberoflinesinthewholesystemsisoneafterStep1whereallcommentsandwhitespacesareremoved.

SMATworksonWindows2000forthesourcecode leswritteninC,C++,Java,andCOBOL.

Fortwosystems,eachofwhichhasm lesofnlines,FinderrequiresO(mnlog(mn))[17].di requiresO(n2logn)[13]forasingle lepairandwehavetoperformO(m2)executionofdi forall lepairs.Sointotal,O(m2n2logn)istheworstcasetimecomplexity.

However,inpractice,theexecutionofdi isnotperformedforall lepairs.Inmanycases,codeclonesarenotdetectedbetweenall lepairs,butonlyafew lepairs.

Practically,theexecutionspeedofSMATisfairlye cient,sinceitgrowssuper-linearly.Forexample,ittook329secondstocomputeSlineofabout500KlineCsourcecode lesintotalonPentiumIII1GHzCPUsystemwith2GBytesmemory,and980secondsfor1Mline les.Ontheotherhand,inthecaseofusingonlydi forall lepairs,ittookabout6hourstocomputeSlinefor500Kline les.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

Fig.4.BSDUNIXevolutionalhistory.

IV.ApplicationsofSMAT

A.BSDUNIXOSEvolution

A.1Targetsystems

ToexploretheapplicabilityofSlineandSMAT,wehaveusedmanyversionsofopen-sourceBSDUNIXoperatingsystems,namely4.4-BSDLite,4.4-BSDLite2[20],FreeBSD[21],NetBSD[22],OpenBSD[23].TheevolutionhistoriesoftheseversionsisshowninFig.4[24].Asshowninthis gure,4.4-BSDLiteistheoriginationoftheotherversions.NewversionsofFreeBSD,NetBSD,andOpenBSDarecurrentlybeingdevelopedinopensourcedevel-opmentstyle.23major-releaseversions,aslistedinFig.4,werechosenforcomputingSlineofallpaircombinations.Theevaluationwasperformedonlyonsourcecode lesrelatedtotheOSkernelswritteninC(i.e.,*.cor*.h les).

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

TABLEI

ThenumberoffilesandLOCofOS.FreeBSD

Version

No.of les

LOC

Version

No.of les

LOC

Version

No.of lesLOC

Version

No.of les

LOC2.08912288681.023174530262.04200898942Lite16763175942.0.510182750161.130916057902.149871007525Lite219314113732.110622972081.240828223122.2524510663552.211963692561.3538610291472.3531410791633.021426360051.4700213782742.4550711293714.025698785901.5739415183712.5581512328582.6607413292932.7629814384962.864141478035

TABLEII

PartofSlinevaluesbetweenBSDOSkernelfiles.FreeBSD2.0

FreeBSD2.0

FreeBSD2.0.5

FreeBSD2.1

FreeBSD2.2

FreeBSD3.0

FreeBSD4.0

4.4BSD-Lite

4.4BSD-Lite2

NetBSD1.0

NetBSD1.1

NetBSD1.2

NetBSD1.31.0000.8330.7940.5500.3150.2120.4190.2900.4400.3340.2550.205FreeBSD2.0.50.8331.0000.9430.6650.3920.2640.3770.2660.4290.3480.2690.227FreeBSD2.10.7940.9431.0000.7060.4210.2860.3620.2580.4110.3360.2650.225FreeBSD2.20.5500.6650.7061.0000.6030.4050.2260.1790.2910.2540.2250.201FreeBSD3.00.3150.3920.4210.6031.0000.6390.1380.1330.2200.1930.1900.208FreeBSD4.00.2120.2640.2860.4050.6391.0000.1010.1000.1400.1520.1580.1794.4BSD-Lite0.4190.3770.3620.2260.1380.1011.0000.6510.5400.4210.3310.259

A.2Results

TABLEIshowsthenumberof lesandtotalsourcecodelinesofeachversionafterthepreprocessingofStep1.TABLEIIshowspartoftheresultingvaluesSlineforpairsofeachversion.NotethatTABLEIIissymmetric,andthevaluesonthemaindiagonallinearealways1bythenatureofoursimilarity.

Slinevaluesbetweenaversionanditsimmediateancestor/descendantversionarehigherMarch3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

0Fig.5.SlineofFreeBSD2.2andotherversions.

thanthevaluesfornon-immediateancestor/descendantversions.Fig.5showsSlineevolu-tionbetweenFreeBSD2.2andotherFreeBSDversions.Thevaluesmonotonicallydeclinewithincreasingversiondistance.ThisindicatesthatthesimilaritymetricSlineproperlycapturesordinarycharacteristicsofsoftwaresystemsevolution.

Fig.6showsSlinebetweeneachversionofFreeBSDandsomeofNetBSD.Thesetwoversionstreamshavethesameorigin,4.4-BSDLite,anditisnaturallyassumedthatolderversionsbetweenthetwostreamshavehigherSlinevalues,sinceyoungerversionshavealotofindependentlyaddedcodes.ThisassumptionistrueforFreeBSD2.0through2.2.However,forFreeBSD3.0and4.0,theyoungestversionNetBSD1.3hashighervaluesthanotherNetBSDversions(Fig.6AandB).ThisisbecausethatbothFreeBSD3.0andNetBSD1.3importedalotofcodesfrom4.4-BSDLite2asshownFig.4.SMATclearlyspottedsuchanirregularnatureoftheevolution.

A.3ClusterAnalysis

Classi cationsweremadeofOSversionsusingaclusteranalysistechnique[25]withrespecttoSlinevaluesshownabove.Thedistanceusedfortheanalysiswas1 Sline,andtheaveragevaluewasthedistancebetweenclusters.ThedendrogramfromthisclusteranalysisisshowninFig.7.Thehorizontalaxisrepresentsthedistance.OSversionscategorizedontheleft-handsidearecloseroneswithhighsimilarityvaluestoeachother.Thisdendrogramre ectsverywelltheevolutionhistoryofBSDOSversionsdepictedMarch3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

Fig.6.SlinebetweenFreeBSDandNetBSD.

1

FreeBSD 2.0FreeBSD 2.0.5 FreeBSD 2.1 FreeBSD 2.2 FreeBSD 3.0 FreeBSD 4.0 4.4BSD Lite 4.4BSD Lite2 NetBSD 1.0 NetBSD 1.1 NetBSD 1.2 OpenBSD 2.0 OpenBSD 2.1 OpenBSD 2.2 OpenBSD 2.3 OpenBSD 2.4 OpenBSD 2.5 OpenBSD 2.6 OpenBSD 2.7 OpenBSD 2.8 NetBSD 1.3 NetBSD 1.4 NetBSD 1.50.50IIVIIIII

Fig.7.DendrogramusingsimilaritySline.

previouslybyFig.4.Further,asshowninFig.7,allFreeBSDversionsarecontainedinClusterIandallOpenBSDareinClusterII.FreeBSDandOpenBSDaredistinctgenealogicalsystemsthatdivergedataveryearlystageoftheirevolution,asshowninFig.4.ThedendrogramusingSlineobjectivelydisclosesit.

Also,wecanseetheclassi cationofNetBSDandOpenBSD.AllversionsofOpenBSDexceptfor2.0areinthesameclusterIII,andthisclusteriscombinedwithNetBSD1.1inclusterIVtogetherwithOpenBSD2.0.ThissuggeststhatallOpenBSDversionswerederivedfromNetBSD1.1.Thisiscon rmedbytheirevolutionhistory.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

TABLEIII

Slineofstudentexperiment.A

A

B

C

D

E

F

G

H10.0090.0240.0380.03400.0540.047B0.00910.0400.0010000.023C0.0240.04010.0600.0420.0880.1180.170D0.0380.0010.06010.0100.0400.0690.039E0.03400.0420.01010.0220.1720.237F000.0880.0400.022100G0.05400.1180.0690.172010.797H0.0470.0230.1700.0390.23700.7971

A.4SimilaritywithLinux

ThesimilaritybetweenFreeBSD4.0andLinux2.2.1[26]wasevaluated.ThesetwoUNIXOperatingSystemswerereleasedalmostatthesametime,buttheyareconsideredtosharenocommonancestors.TheresultingSlinevalueis0.031,whichisarelativelyverylowvalue(mostofthelinesinthecorrespondencearefordevice-dependentcodes).ThisresultindicatesthatSlineisverye ectiveindistinguishingdi erentsystemswithlittlesharedcode.

B.StudentProject

SMATwasalsoappliedtotheresultsfromanundergraduatestudentproject.StudentsdevelopedcompilerswritteninCforasubsetofPASCAL,underalectureoftheoryandpracticeofcompilerconstruction.Thestudentsturnedinalloftheobject lesandsourcecode lesafterall15testcaseshadbeenpassed.

Wehaverandomlychosen8studentresults(namedAtoH).Thetotalsourcecodesizeswerebetween3427and6866lines.TheresultsofSlinebetweenanytwocompilersareshowninTABLEIII.

Asyoucansee,mostofthesimilarityvalueswereverylow.Forexample,betweenAandCthevalueis0.024.ItisconsideredthatAandCwrotemanydistinctcodesindependentlyandthattheycreatedaccidentallyafewsimilarcodes,asshowninFig.8.Thesecodeportionsweredetectedassharedclonessincetheybothhavean“if...then...else”structurewithfunctioncallshavingtwoparameters.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

if(nenum==STRUE){

elseif(s==SMOD)generateCALL,LMOD);

elsegenerateLMULT);fprintf(ofp,”LEAGR1,1\n”);}else{

fprintf(ofp,”LEAGR1,0\n”);

}

Fig.8.SimilarcodesbetweenAandC.output lename[len+1]=NULL;

in le=fopen(argv[1],”r”);

out le=fopen(output lename,”w”);

if((!in le)——(!out le))

{

fprintf(stderr,”couldnotopen le\n”);

exit(1);

}output lename[len+1]=’\0’;fp=fopen(argv[1],”r”);out le=fopen(output lename,”w”);if((!fp)——(!out le)){fprintf(stderr,”couldnotopen le\n”);exit(1);}

Fig.9.SimilarcodesbetweenGandH.

Thehighestvalue,0.797,wasobtainedbetweenGandH.AsshowninFig.9,thecorrespondinglineshavedi erentlinebreaksandvariablenames,buttheyhavethesamesystemstructures.ThiscasewasconsideredplagiarismandSMATwasverye ectiveindetectingit.

V.ExtensiontoDifferenceExtractionTool

Asdiscussedtheabovesections,SMATverye ectivelycomputessimilaritymetricSlineoftwosoftwaresystems.However,theresultofSMATissimplyasimilarityvalue,andthereisnoreportorobservationofthedetailofthedi erenceofthesystems.

ingthisinformation,thedetaileddi er-encecanbereportedandthenonesystemcanbecompactlyrepresented(orcompressed)byanothersystem.

Adi erenceextractiontool(DET)oftwosoftwaresystemswasdeveloped.AnoverviewofDETfollows.

INPUTS:

Directoryofsourcecode lesforthetargetsystem

Directoryofsourcecode lesforthebasesystem

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

OUTPUTS:

Di erence lebetweenthetargetsystemandthebasesystem

Summaryreportof lecorrespondence

Step1:WeexecuteCCFinderonallconcatenated lesofthetargetsystemandthebasesystem.

Step2:Foreachtarget left,abase lefbispickedwhichcontainsmoreclonesagainstftthananyother les.Thenweperformdi betweenftandfb.Iftheresultofdi issmallerthanftitselfthenweconsiderthatftandfbmatchandthedi resultisoutput.Otherwiserawftisoutput.Ifnoclonesarefoundinft,wealsooutputftitself.

Step3:Reportasummarywhichcontainsthelistofmatched lenames,theiroriginalsizes,anddi resultsizes.

UnderlineconceptofDETisverysimilartoSMAT.ItperformsclonedetectionbyCCFinderbetweentwosystems,andthenextractsdi erencebydi between lepairswithclones.SMATexploresallpossible lepairswhereanysharedcloneexists,butDETtriesonlyone lepairwherethemostsharedclonesarefound.

DETalsorestoresthetargetsystemfromthegivendi erence letogetherwiththebasesystem.

DETwasappliedtovariousBSDUNIXversions.Forexample,itwasappliedtoFreeBSD2.0and2.0.5wherethetotalsizesofallkernelsourcecode leswereabout10.8MBytesand12.8MBytes,respectively.ThesizeoftheDETdi erence lefrom2.0to2.0.5is4.8MBytes,andfrom2.0.5to2.0is2.8MBytes.Thereasonwhythefor-merislargeristhatthenewerversion2.0.5containsmanynewlyadded leswhicharestraightforwardlyincludedinthedi erence le.

Thedi erenceextractedbyDETwassmallerthansimplyusingdi .Tocheckthis,weexecuted“di -nN”recursivelyintodirectoriesandmeasuredtheoutputsizes,whichwereabout5.8MByteseitherfrom2.0to2.0.5orfrom2.0.5to2.0.TheoutputsofDETwereabout1.2to5timessmallerthandi forthecasesofotherversions.

ThissuggestedthatDETcouldbeusefultoarchiveolderversionsbasedonthecurrentversionofasystem.Consideringthedramaticincreaseofdisccapacityandrapiddecreaseofitsprice,itmaybefeasibletostoreallversionsofasystemwithoutthecompression.March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

However,DETgeneratesasummaryreportcontainingcorrespondenceofold lesandnew les,whichgivesveryimportantcluesabout lenamechangesanddirectorystructuremodi cations.

DETisconsideredasuitabletooltotraceevolutionoflargesoftwaresystems.Also,DETcouldbeusefulformanaginglargeproductlineswheremanyversionswithminormodi cationsexist.

VI.DiscussionandRelatedWork

Aspresentedinprevioussections,oursimilarityde nition,similaritymetricSline,andmetricevaluationSMATworkedwell.Ourapproachprovidesapractical,meaningfulandusefulmeasureformaintainingandmanaginglargesoftwaresystems.

A.SimilarityDe nition

Thede nitionofsimilarityusedsymmetricinthesensethatthesimilarityforXandYandthatforYandXarethesame.Thisisbecausethesimilarityisde nedas(|x|+|y|)/(|X|+|Y|)wherexisthesetofX’selementsinthecorrespondenceandyisthatofY’selements.

Anotherde nitionofthesimilarityissuchthat|x|/|X|wherethecorrespondenceisdeterminedwithrespecttoY,butYandydonotappearexplicitlyinthesimilarityfor-mula[11].Thissimilarityde nitiongivesasinglesideviewofthesystemdi erence,whichwouldmakeitsuitableforinvestigatingcharacteristicsofindividualsystems.However,anoverviewofsystemevolutionisdi culttoarchivewiththesinglesidede nition.Forexample,tomakedendrogramsasshowninFig.7,itisnecessarytode nethedistanceoftwosystemsXandY.Withourapproach,thesimilarityisusedasthedistance.Inthecaseofthesinglesidede nition,anaverageoftwosimilarityvaluesmightbeusedsuchthat(|x|/|X|+|y|/|Y|)/2,butthisaveragevaluehaslessrationalmeaning.

B.MetricSline

ThecorrespondencewhichdeterminesSlineisamany-to-manymatchingbetweensourcelineslocatedwithin lesanddirectories.Thereasonsofthemany-to-manymatchingisthatwewouldliketotracethemovementofanysourcecodeblockwithin lesanddirectoriesMarch3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

asmuchaspossible,andobtaintheratioofsucceededandrevisedcodestooverallcodes.Itispossibletouseone-to-onematchinginthecorrespondence,butitcharacterizesthesimilaritymetrictoonaivelytocopiedcodes.AssumethatasystemXiscomposedofa lex1,andanewsystemX iscomposedoftwo lesx 1andx 2wherebothx 1andx 2arethesamecopiesofx1.Inourde nitionusingthemany-to-manymatching,thesimilarityis1.0,butusingtheone-to-onematchinggives0.5whichdoesnotre ectdevelopmente ortsproperly.

Actually,metricsSlinehaveveryhighcorrelationtowithdevelopmente orts.Thecor-relationcoe cientbetweenSlinevaluesandreleasedurationsofFreeBSDversionswas-0.973.Ontheotherhandthecorrelationcoe cientbetweenthesizeincreasesandthere-leasedurationswas0.528.Therefore,Slineshouldbeareasonablemeasuresofdevelopmente orts.

Anotherreasonforusingmany-to-manymatchingisperformance.Theone-to-oneap-proachneedssomemechanismtochoosethebestmatchingpairfrommanypossiblities,whichgenerallyisnotasimple,straightforwardprocess.

C.SimilarityMetricSfnusing lenamematching

AnalternativeandmuchsimplermetricSfnforcanbeemployedforsimilarity.

Considerasoftwaresystemscomposedofsourcecode les.Thecorrespondencebetweentwosuchsystemsisthesetofall lepairshavingthesame le(path)names.Thatis,ifa lepiinonesystemanda leqjinanothersystemhavethesame lenames,including lepaths,thenpair(pi,qj)isincludedinthecorrespondence.

ItisveryeasytocomputeSfn,bycheckingeach lepath.However,Sfnisveryfragiletorenamesandrestructuresofsourcecode les.Also,itcannotdetectchangesof lecontents.Furthermore,sinceSfndoesnotaccountforthesizesofeachsourcecode les,itmightproducevaluesfarfromreality.

Forexample,whenSfnwasappliedtothestudentprojectdescribedinSectionIV-B,ingSlineandSMAT,wewereabletodetectapossibleplagiarismbetweenGandH(Sline=0.797).However,theSfnbetweenGandHwas0.190,whichwastoolowtosuspectplagiarism.ThecorrelationbetweenSlineandSfnwas-0.004,meaningthattherewasnorelationbetweenthem.

March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

TABLEIV

Sfnofcompilersystems.A

A

B

C

D

E

F

G

H1000.1130000B010.6660.1250.6000.6660.1050.428C00.66610.1370.85710.1250.545D0.1130.1250.13710.1330.1370.1020.117E00.6000.8570.13310.8570.2350.500F00.66610.1370.85710.1250.545G00.1050.1250.1020.2350.12510.190H00.4280.5450.1170.5000.5450.1901

D.SMAT

SMATworkedverye cientlyforlargesoftwaresystems.TocomputeSline,executionofdi forallpossible lepairswouldhavebeenasimpleapproach.However,biningCCFinderanddi boostedtheperformanceofSMAT.Also,asmentionedbefore,themovementandmodi cationofsourcecodelinescanbetracedbetterbyCCFinder,whiche ectivelydetectscloneswithdi erentwhitespaces,comments,identi ernames,andsoon.Thematchingcomputationusingonlydi cannotchasethosechanges.

Therearealotofresearchesonclonedetectionandmanytoolshavebeendeveloped[27],

[28],[29].WewouldbeabletousethosetoolsinsteadofCCFinder.

E.RelatedWork

Therehasbeenalotofworkon ndingplagiarisminprograms.OttensteinusedHal-steadmetricvaluations[30]oftargetprogram lesforcomparison[31].Thereareotherapproacheswhichuseasetofmetricvaluestocharacterizesourceprograms[32],[33],[34].Also,structuralinformationhasbeenemployedtoincreaseprecisionofcomparison[35],

[36].Inordertoimprovebothprecisionande ciency,abstractedtextsequences(to-kensequences)canbeemployedforcomparison[8],[9],[37],[10].Sourcecodetextsaretranslatedintotokensequencesrepresentingprogramsstructures,andthelongestcommonsubsequencealgorithmisappliedtoobtainmatching.

Thesesystemsareaimedmainlyat ndingsimilarsoftwarecodeintheeducationenvi-March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

ronment.Thesimilaritymetricvaluescomputedbycomparisonofmetricsvaluesdonotshowtheratioofsimilarcodestonon-similarcodes,andthuswouldbelessintuitivelyaccurate.Also,scalabilityofthoseevaluationmethodstolargesoftwaresystemsuchasUNIXOSisnotknown.

Inreverseengineering eld,therehasbeenresearchonmeasuringsimilarityofcompo-nentsandrestructuringmodulesinasoftwaresystem,toimproveitsmaintenanceabilityandunderstandability[38],[39],[40].Suchsimilaritymeasuresarebasedonseveralmetricvaluessuchassharedidenti ernamesandfunctioninvocationrelations.Althoughtheseapproachesinvolveimportantviewsofsimilarity,theirobjectivesaretoidentifycompo-nentsandmodulesinsideasinglesystem,andcannotbeapplieddirectlytointer-systemsimilaritymeasurement.

AstudyonthesimilaritybetweendocumentsispresentedbyBroder[11].Inthisap-proach,asetof xed-lengthtokensequencesareextractedfromdocuments.ThentwosetsXandYareobtainedforeachdocumenttocomputetheirintersectionofthem.Thesimilarityisde nedas(|X|∩|Y|)/(|X|∪|Y|).

Thisapproachisverysuitablefore cientlycomputingtheresemblanceofalargecollec-tionofdocumentssuchasworld-widewebdocuments.However,choosingtokensequencesgreatlya ectstheresultingvalues.Tokenswithminormodi cationwouldnotbedetected.Therefore,thisisprobablyaninappropriateapproachforcomputingsubjectivesimilaritymetricforsourcecode les.

Manber[12]developedatooltoidentifysimilar lesinlargesystems.Thistoolusesasetofkeywordsandextractssubsequencesstartingwiththosekeywordsas ngerprints.A ngerprintsetXofatarget leisencodedandcomparedtoa ngerprintsetYofaquery le.Thesimilarityisde nedas|X∩Y|/|X|.

Thisapproachworksverye cientlyforbothsourceprogram lesanddocument lesandwould texplorationofsimilar lesinalargesystem.However,itisfragiletotheselectionofkeywords.Also,itwouldbetoosensitivetominormodi cationsofsourceprogram lessuchasidenti erchangesandcommentinsertions.

Thesemethodsareallquitedi erentfromthosedevelopedandpresentedherein,sincetheydonotperformcomparisononrawandoveralltextsequences,butratheronsam-March3,2002DRAFT

It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software

pledtextsequences.Samplingapproacheswouldgethighperformance,buttheresultingsimilarityvalueswouldbelesssigni cantthanourwholetextcomparisonapproach.

VII.Conclusion

Aproposedde nitionofsimilaritybetweentwosoftwaresystemswithrespecttocor-respondenceofsourcecodelineswasformulatedasasimilaritymetriccalledSline.AnSline-basedevaluationtoolSMATwasdevelopedandappliedtovarioussoftwaresystems.TheresultsshowedthatSlineandSMATareveryusefulforidentifyingtheoriginofthesystemsandtocharacterizetheirevolution.Furthermore,usingthecomputationprocessofSMAT,adi erenceextractiontool,DET,wasdevelopedwhichcompressesatargetsoftwaresystemrelativetoabasesystemandreportsthedi erence.

FurtherapplicationsofSMATtovarioussoftwaresystemsandproductlineswillbemadetoinvestigatetheirevolution.Fromamacrolevelanalysisviewpoint,categorizationandtaxonomyofsoftwaresystemsanalogoustomolecularphylogenyshouldbeanintriguingissuetopursue.Fromamicrolevelanalysisviewpoint,chasingspeci ccodeblocksthroughsystemevolutionwillbeinterestingtoperform.

References

[1]V.R.Basili,L.C.Briand,S.E.Condon,Y.-M.Kim,W.L.Melo,andJ.D.Valett,“Understanding

andpredictingtheprocessofsoftwaremaintenancerelease,”in18thInternationalConferenceonSoftwareEngineering,berlin,1996,pp.464–474.

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]L.A.BeladyandM.M.Lehman,“Amodeloflargeprogramdevelopment,”IBMSystemsJournal,vol.15,no.3,pp.225–252,1976.S.Cook,H.Ji,andR.Harrison,“Dynamicandstaticviewsofsoftwareevolution,”intheIEEEInternationalConferenceOnSoftwareMaintenance(ICSM2001),Florence,Italy,Nov.2001,pp.592–601.C.F.KemererandS.Slaughter,“Anempiricalapproachtostudyingsoftwareevolution,”IEEETransactionsonSoftwareEngineering,vol.25,no.4,pp.493–509,1999.TheFirstSoftwareProductLineConference(SPLC1),,”http://www.sei.cmu.edu/plp/conf/SPLC.html,Den-ver,Colorado,August2000.P.ClementsandL.Northrop,SoftwareProductLines:PracticesandPatterns,AddisonWesley,2001.A.BaxevanisandF.Ouellette,Eds.,Bioinformatics2ndedition,pp.323–358,JohnWileyandSons,Ltd.,England,2001.A.Aiken,“Moss(measureofsoftwaresimilarity)plagiarismdetectionsystem,”http://www.cs.berkeley.edu/moss/.L.Prechelt,G.Malpohl,andM.Philippsen,“Jplag:Findingplagiarismsamongasetofprograms,”Technical

Report2000-1,FakultatfurInformatik,UniversitatKarlsruhe,Germany,2000.

March3,2002DRAFT

本文来源:https://www.bwwdw.com/article/0olm.html

Top