Measuring Similarity of Large Software Systems Based on Source Code Correspondence
更新时间:2023-07-25 03:33:01 阅读量: 实用文档 文档下载
- measuring推荐度:
- 相关推荐
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
MeasuringSimilarityofLargeSoftware
SystemsBasedonSourceCode
Correspondence
TetsuoYamamoto,MakotoMatsushita,ToshihiroKamiya,andKatsuro
Inoue
TheauthorsarewiththeSoftwareEngineeringResearchGroup(C/OKatsuroInoue),DivisionofSoftwareScience,GraduateSchoolofEngineeringScience,OsakaUniversity,1-3Machikaneyama-cho,Toyonaka,Osaka560-8531,JAPAN.E-mail:{t-yamamt,matusita}@ics.es.osaka-u.ac.jp,kamiya@is.aist-nara.ac.jp,inoue@ics.es.osaka-u.ac.jp.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
Abstract
Itisanimportantandintriguingissuetoknowthequantitativesimilarityoflargesoftwaresystems.
Inthispaper,asimilaritymetricbetweentwosetsofsourcecode lesbasedonthecorrespondenceofoverallsourcecodelinesisproposed.ASoftwaresimilarityMeAsurementToolSMATwasdevelopedandappliedtovariousversionsofanoperatingsystem(BSDUNIXOS).TheresultingsimilarityvaluationsclearlyrevealedtheevolutionaryhistorycharacteristicsoftheBSDUNIXOperatingSystem.Also,asanextensionofSMAT,asystem-widedi erenceextractiontoolwasdeveloped,whiche ectivelycompressedasetofsourcecode lesrelativetoabaseset.
Keywords
BSDUNIX,CodeClones,Plagiarism,SoftwareMetrics
I.Introduction
Long-lifesoftwaresystemsevolvethroughmultiplemodi cations.Manydi erentver-sionsarecreatedanddelivered.Theevolutionisnotsimpleandstraightforward.Itiscommonthatoneoriginalsystemcreatesseveraldistinctsuccessorbranchesduringevo-lution.Severaldistinctversionsmaybeuni edlaterandmergedintoanotherversion.Tomanagethemanyversionscorrectlyande ciently,itisveryimportanttoknowobjectivelytheirrelationships.Therehasbeenvariouskindsofresearchonsoftwareevolution[1],[2],
[3],[4],mostofwhichfocusedonchangesofmetricvaluesforsize,quality,deliverytimeorprocess,etc.
Closelyrelatedsoftwaresystemsusuallyareidenti edasproductlines,sodevelopmentandmanagementofproductlinesareactivelydiscussed[5].Knowingdevelopmentrelationsandarchitecturalsimilarityamongsuchsystemsisakeytoe cientdevelopmentofnewsystemsandtowell-organizedmaintenanceofexistingsystems[6].
Wehavebeeninterestedinmeasuringthesimilaritybetweentwolargesoftwaresystems.Aquantitativeandobjectivemeasureforsimilarityisanimportantvehicleforknowingtheevolutionofsoftwaresystems,asisdoneintheBioinformatics eld.InBioinformatics,distancemetricsarebasedonthealignmentofDNAsequences.Phylogenetictreesusingthisdistancearebuilttoillustraterelationsamongspecies[7].TherearehugenumbersofsoftwaresystemsalreadydevelopedintheworldanditshouldbepossibletoidentifytheevolutionhistoryofsoftwareassetsinamannerlikethatdoneinBioinformatics.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
Variousresearchon ndingsoftwaresimilaritieshasbeenperformed,mostofwhichfocusedondetectingprogramplagiarism[8],[9],[10].Theusualapproachextractsseveralmetricvalues(orattributes)characterizingthetargetprogramsandthencomparesthosevalues,usuallyintheeducationalenvironmentwithlimitedapplicabilityelsewhere.
Therealsohasbeensomeresearchonidentifyingsimilarityinlargecollectionsofplain-textorHTMLdocuments[11],[12].Theseworksusesampledinformationsuchaskeywordsequencesor“ ngerprints”.Similarityisdeterminedbycomparingthesampledinforma-tion.
Itisimportantthatthesoftwaresimilaritymetricisnotbasedonsampledinformationastheattributevalue(or ngerprint),butratherre ecttheoverallsystemcharacteris-tics.Acollectionofallsourcecode lesusedtobuildasystemcontainsalltheessentialinformationofthesystem.Thus,weanalyzeandcompareoverallsourcecode lesofthesystem.Thisapproachrequiresmorecomputationpowerandmemoryspacethanusingsampledinformation,butthecurrentcomputinghardwareenvironmentallowsthisoverallsourcecodecomparisonapproach.
Inthispaper,asimilaritymetriccalledSline,isused,whichisde nedastheratioofsharedsourcecodelinestothetotalsourcecodelinesoftwosoftwaresystemsbeingevaluated.
Slinerequirescomputingmatchesbetweensourcecodelinesinthetwosystems,beyondtheboundariesof lesanddirectories.Anaiveapproachforthiswouldbetocompareallsource lepairsinbothsystems,witha lematchingprogramsuchasdi [13],butthecomparisonofall lepairswouldbeimpracticaltoapplytolargesystemswiththousandsof les.
Insted,anapproachisproposedthatimprovese ciencyandprecision.First,afast,codeclone(duplicatedcodeportion)detectionalgorithmisappliedtoall lesinthetwosystemsandthendi isappliedtothe lepairswherecodeclonesarefound.
Usingthisconcept,asimilaritymetricevaluationtoolcalledSMAT(Softwaresimilar-ityMeAsurementTool)wasdevelopedandappliedtovarioussoftwaresystemtargets.WehaveevaluatedthesimilaritybetweenvariousversionsofBSDUNIXOS,andhaveperformedclusteranalysisofthesimilarityvaluestocreateadendrogramthatcorrectlyMarch3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
showsevolutionhistoryofBSDOS.Also,thesimilarityevaluationsofstudentcompilersystemprojectshavecon rmedtheabilityforplagiarismdetection.
Further,usingamethodsimilartomeasuringthesimilarity,atoolcalledDET(Di erenceExtractionTool)wasdevelopedforextractingthedi erencebetweentwosystems.Ap-plicationofDETittoseveralversionsofFreeBSDOShascon rmeditse ectivenessforcompressingonesoftwareversiontoanotherandidentifyingsystem-widedistinctions,ratherthansimplyusingdi .
SectionIIpresentsaformalde nitionofsimilarityanditsmetricSline.SectionIIIde-scribesapracticalmethodforcomputingSlineandshowstheimplementationtoolSMAT.SectionIVshowsapplicationsofSMATtoversionsofBSDUNIXOSandastudentproject.SectionVshowsanextensiontoadi erenceextractiontoolDET.ResultsofourworkandcomparisonwithrelatedresearcharegiveninSectionVI.ConcludingremarksaregiveninSectionVII.
II.SimilarityofSoftwareSystems
A.De nitions
Firstwewillgiveageneralde nitionofsoftwaresystemsimilarityandthenaconcretesimilaritymetric.
AsoftwaresystemPiscomposedofelementsp1,p2,···,pm,andPisrepresentedasaset{p1,p2,···,pm}.Inthesameway,anothersoftwaresystemQisdenotedby{q1,q2,···,qn}.Wewillchoosethetypeofelements,suchas lesandlines,basedonthede nitionsofthesimilaritymetricsdescribedlater.
Supposethatweareabletodeterminematchingbetweenpiandqj(1≤i≤m,1≤j≤n),andwecallthesetofallmatchedpair(pi,qj)CorrespondenceRs,whereRs P×Q.SimilaritySofPandQwithrespecttoRsisde nedasfollows.
S(P,Q)≡|{pi|(pi,qj)∈Rs}|+|{qj|(pi,qj)∈Rs}|
|P|+|Q|
AsshowninFig.1,thisde nitionmeansthatthesimilarityistheratioofthenumberofelementsinRstothetotalnumberofelementsinPandQ.IfRsbecomessmaller,Swilldecrease,andifRs=φthenS=0.Moreover,whenPandQareexactlythesamesystems, i(pi,qi)∈RsandthenS=1.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
Rs
q1
p2
qn
Fig.1.CorrespondenceofelementsRs.
B.SimilarityMetrics
Theabovede nitionofthesimilarityleavesroomforimplementingdi erentconcretesimilaritymetricsbychoosingtheelementtypesorcorrespondences.Here,weshowaconcretesimilaritymetricwhichwillbeusedinsubsequentdiscussions.
SimilarityMetricSlineusingequivalentlinematching:
Eachelementofasoftwaresystemisasinglelineofeachsource lecomposingthesystem.Forexample,ifasoftwaresystemXconsistsofsourcecode lesx1,x2,···andeachsourcecode lexiismadeupoflinesxi1,xi2,···,thenx11,x12,···,x21,x22,···,xi1,xi2,···,x(i+1)1,x(i+2)2,···aretheelements.Pair(xij,ymn)oftwolinesxijandymninsystemXandsys-temYisincorrespondencewhenxijandymnmatchasequivalentlines.Theequivalencyisdeterminedbytheduplicatedcodedetectionmethodand lecomparisonmethoddis-cussedindetaillater.Intuitively,twolineswithminordistinctionsuchasspace/commentmodi cationandidenti errenamearerecognizedasequivalent.
Slineisnota ectedby lerenamingorpathchanges.Modi cationofasmallpartinalarge ledoesnotgivegreatimpacttotheresultingvalue.Ontheotherhand, ndingequivalentlinesgenerallywouldbeatimeandspaceconsumingprocess.ApracticalapproachforthisisgiveninSectionIII.
Itispossibletoconsiderotherde parisontoothersuchapproachesarepresentedinSectionVI.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
III.MeasuringSline
A.Approach
ThekeyofSlineiscomputingthecorrespondence.Astraightforwardapproachwemightconsideristhat rstweconstructappended lesx1 x2 ···andy1 y2 ···whichareconcatenationofallsourcecode lesx1,x2,···andy1,y2,···forsystemsXandY,respectively.Thenweextractthelongestcommonsubsequence(LCS)betweenx1 x2 ···andy1 y2 ···bysometool,saydi [13],whichimplementsanLCS- ndingalgorithm[14],[15],[16].TheextractedLCSisusedasthecorrespondence.
However,thismethodisfragiletothechangeof leconcatenationordercausedbyrenaming lesandreorganizing lestructures,sincedi cannotfollowlineblockmovementtodi erentpositionsinthe les.Forexample,fortwo lesx1 x2andx2 x1,theLCSfoundbydi iseitherx1orx2(longeroneofthem).
Anotherapproachisthatwetrytoapplydi toallcombinationof lesbetweentwosys-tems.Thisapproachmightwork,butthescalabilitywouldbeanissue.Theperformanceappliedtohugesystemswiththousandsof leswouldbedoubtful.
Here,anapproachisproposedthate ectivelyusesbothdi andaclonedetectiontoolnamedCCFinder[17],[18].
CCFinderisatoolusedtodetectduplicatedcodeblocks(calledclones)insourcecodewritteninC,C++,Java,andCOBOL.Ite ectivelyperformslexicalanalysis,transforma-tionoftokens,computingduplicatedtokensequencesbyasu xtreealgorithm[19],andthenreportstheresults.Theclonedetectionismadealongwithnormalizationandparam-eterization,thatis,thelocationofwhitespacesandlinesbreaksareignored,commentsareremoves,andthedistinctionofidenti ernamesaredisregarded.Bythenormalizationandparameterization,codeblockswithminormodi cationaree ectivelydetectedasclones.ApplingCCFindertotwoconcatenated lesx1 x2 ···andy1 y2 ··· ndsallclonepairs(bx,by)wherebxisasubsequenceofx1 x2 ···.andbyisthatofy1 y2 ···.Thoseclonepairsfoundaremembersofthecorrespondence.
Codeclonesareonlynon-gappedones.Closelysimilarcodeblockswithagapblock(unmatchingtothem)suchasl1l2andl1lxl2arenotdetectedasalargerclonel1 l2butidenti edastwosmallerclonesl1andl2.Whenthelenghtsofl1andl2arelessthanthresholdofMarch3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
Software system XSoftware system Y
Fig.2.Howto ndacorrespondence.
CCFinder(usually20tokens),thenCCFinderreportsnoclonesatall.Therefore,tore-claimthosetokensdi isappliedtoallpairsoftwo lexiandyj,whereCCFinderdetectsaclonepair(bx,by)andbxisinxiandbyisinyj,respectively.Theresultofdi isthelongestcommonsubsequences,whichalsoareconsideredmembersofthecorrespondence.ThecombinedresultsofCCFinderanddi isincreasesSlinebyabout10%,comparedtousingonlyCCFinder.
B.ExampleofMeasurement
AsimpleexampleofcomputingSlinewithCCFinderanddi isgivenhere.ConsiderasoftwaresystemXanditsextendedsystemYasshowninFig.2.Xiscomposedoftwosourcecode lesAandB,andYiscomposedoffour lesA ,A ,B,andC.Here,A andA areevolvedversionsofA,andCisanewlycreated le.
At rst,CCFinderisappliedtodetectclonesbetweentwoconcatenated lesA BandA A B C.This ndsclonesbetweenAandA ,AandA ,andBandB .Assumethatnoclonesaredetectedbetweenothercombinationof les.Eachlinesintheclonesfoundareputintothecorrespondence.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
Fig.3.Similaritymeasuringprocess.
Next,di isappliedto lepairsAandA ,AandA ,andBandB .Then,thelinesintheresultingcommonsubsequencesbydi areaddedtothecorrespondenceobtainedbytheclonedetection.
Thisapproachhasbene tsinthesensethatwedonotneedtoperformdi onallthe lepaircombinations.Also,wecanchasemovementoflinesinsideoroutsideofthe les,whichcannotbedetectedbydi only.Also,thisapproachcanidentifyandcountthedirectivesandmacrosnotdetectedbyCCFinder.
C.SMAT
Basedonthisapproach,wehavedevelopedasimilarityevaluationtoolSMATwhiche ectivelycomputesSlinefortwosystems.Thefollowingisthedetailedprocessofthesystem.AnoverviewisillustratedinFig.3.
INPUTS:FilepathsoftwosystemsXandY,eachofwhichrepresentsthesubdirectoriescontainingallsourcecodes
OUTPUTS:SlineofXandY(0≤Sline≤1)
Step1Preprocessing:
Allcomments,whitespaces,andemptylinesareremoved,whichdonota ecttheexecutionMarch3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
oftheprograms.Thisstephelpstoimprovetheprecisionofthefollowingsteps,especiallyStep3.
Step2ExecutionofCCFinder:
WeexecuteCCFinderbetweentwoconcatenated Finderhasanoptionfortheminimumnumberoftokensofclonestobedetected,andwhichissetat20.Thisnumberisobtainedthroughexperiences.Smallernumbersgeneratemanymeaninglessclonesandlargernumbersincreasethechanceofoverlookingusefulclones.
Step3Executionofdi :
Executedi onany lepairxiandyjinXandYrespectively,whereatleastonecloneisdetectedbetweenxiandyj.
Step4ConstructionofCorrespondence:
ThelinesappearingintheclonesdetectedbyStep2andinthecommonsubsequencesfoundinStep3aremergedtodeterminthecorrespondencebetweenXandY.
Step5ComputingSline:
Slineiscalculatedusingitsde nition;i.e.,theratiooflinesinthecorrespondencetothoseinwholesystems.NotethatthenumberoflinesinthewholesystemsisoneafterStep1whereallcommentsandwhitespacesareremoved.
SMATworksonWindows2000forthesourcecode leswritteninC,C++,Java,andCOBOL.
Fortwosystems,eachofwhichhasm lesofnlines,FinderrequiresO(mnlog(mn))[17].di requiresO(n2logn)[13]forasingle lepairandwehavetoperformO(m2)executionofdi forall lepairs.Sointotal,O(m2n2logn)istheworstcasetimecomplexity.
However,inpractice,theexecutionofdi isnotperformedforall lepairs.Inmanycases,codeclonesarenotdetectedbetweenall lepairs,butonlyafew lepairs.
Practically,theexecutionspeedofSMATisfairlye cient,sinceitgrowssuper-linearly.Forexample,ittook329secondstocomputeSlineofabout500KlineCsourcecode lesintotalonPentiumIII1GHzCPUsystemwith2GBytesmemory,and980secondsfor1Mline les.Ontheotherhand,inthecaseofusingonlydi forall lepairs,ittookabout6hourstocomputeSlinefor500Kline les.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
Fig.4.BSDUNIXevolutionalhistory.
IV.ApplicationsofSMAT
A.BSDUNIXOSEvolution
A.1Targetsystems
ToexploretheapplicabilityofSlineandSMAT,wehaveusedmanyversionsofopen-sourceBSDUNIXoperatingsystems,namely4.4-BSDLite,4.4-BSDLite2[20],FreeBSD[21],NetBSD[22],OpenBSD[23].TheevolutionhistoriesoftheseversionsisshowninFig.4[24].Asshowninthis gure,4.4-BSDLiteistheoriginationoftheotherversions.NewversionsofFreeBSD,NetBSD,andOpenBSDarecurrentlybeingdevelopedinopensourcedevel-opmentstyle.23major-releaseversions,aslistedinFig.4,werechosenforcomputingSlineofallpaircombinations.Theevaluationwasperformedonlyonsourcecode lesrelatedtotheOSkernelswritteninC(i.e.,*.cor*.h les).
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
TABLEI
ThenumberoffilesandLOCofOS.FreeBSD
Version
No.of les
LOC
Version
No.of les
LOC
Version
No.of lesLOC
Version
No.of les
LOC2.08912288681.023174530262.04200898942Lite16763175942.0.510182750161.130916057902.149871007525Lite219314113732.110622972081.240828223122.2524510663552.211963692561.3538610291472.3531410791633.021426360051.4700213782742.4550711293714.025698785901.5739415183712.5581512328582.6607413292932.7629814384962.864141478035
TABLEII
PartofSlinevaluesbetweenBSDOSkernelfiles.FreeBSD2.0
FreeBSD2.0
FreeBSD2.0.5
FreeBSD2.1
FreeBSD2.2
FreeBSD3.0
FreeBSD4.0
4.4BSD-Lite
4.4BSD-Lite2
NetBSD1.0
NetBSD1.1
NetBSD1.2
NetBSD1.31.0000.8330.7940.5500.3150.2120.4190.2900.4400.3340.2550.205FreeBSD2.0.50.8331.0000.9430.6650.3920.2640.3770.2660.4290.3480.2690.227FreeBSD2.10.7940.9431.0000.7060.4210.2860.3620.2580.4110.3360.2650.225FreeBSD2.20.5500.6650.7061.0000.6030.4050.2260.1790.2910.2540.2250.201FreeBSD3.00.3150.3920.4210.6031.0000.6390.1380.1330.2200.1930.1900.208FreeBSD4.00.2120.2640.2860.4050.6391.0000.1010.1000.1400.1520.1580.1794.4BSD-Lite0.4190.3770.3620.2260.1380.1011.0000.6510.5400.4210.3310.259
A.2Results
TABLEIshowsthenumberof lesandtotalsourcecodelinesofeachversionafterthepreprocessingofStep1.TABLEIIshowspartoftheresultingvaluesSlineforpairsofeachversion.NotethatTABLEIIissymmetric,andthevaluesonthemaindiagonallinearealways1bythenatureofoursimilarity.
Slinevaluesbetweenaversionanditsimmediateancestor/descendantversionarehigherMarch3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
0Fig.5.SlineofFreeBSD2.2andotherversions.
thanthevaluesfornon-immediateancestor/descendantversions.Fig.5showsSlineevolu-tionbetweenFreeBSD2.2andotherFreeBSDversions.Thevaluesmonotonicallydeclinewithincreasingversiondistance.ThisindicatesthatthesimilaritymetricSlineproperlycapturesordinarycharacteristicsofsoftwaresystemsevolution.
Fig.6showsSlinebetweeneachversionofFreeBSDandsomeofNetBSD.Thesetwoversionstreamshavethesameorigin,4.4-BSDLite,anditisnaturallyassumedthatolderversionsbetweenthetwostreamshavehigherSlinevalues,sinceyoungerversionshavealotofindependentlyaddedcodes.ThisassumptionistrueforFreeBSD2.0through2.2.However,forFreeBSD3.0and4.0,theyoungestversionNetBSD1.3hashighervaluesthanotherNetBSDversions(Fig.6AandB).ThisisbecausethatbothFreeBSD3.0andNetBSD1.3importedalotofcodesfrom4.4-BSDLite2asshownFig.4.SMATclearlyspottedsuchanirregularnatureoftheevolution.
A.3ClusterAnalysis
Classi cationsweremadeofOSversionsusingaclusteranalysistechnique[25]withrespecttoSlinevaluesshownabove.Thedistanceusedfortheanalysiswas1 Sline,andtheaveragevaluewasthedistancebetweenclusters.ThedendrogramfromthisclusteranalysisisshowninFig.7.Thehorizontalaxisrepresentsthedistance.OSversionscategorizedontheleft-handsidearecloseroneswithhighsimilarityvaluestoeachother.Thisdendrogramre ectsverywelltheevolutionhistoryofBSDOSversionsdepictedMarch3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
Fig.6.SlinebetweenFreeBSDandNetBSD.
1
FreeBSD 2.0FreeBSD 2.0.5 FreeBSD 2.1 FreeBSD 2.2 FreeBSD 3.0 FreeBSD 4.0 4.4BSD Lite 4.4BSD Lite2 NetBSD 1.0 NetBSD 1.1 NetBSD 1.2 OpenBSD 2.0 OpenBSD 2.1 OpenBSD 2.2 OpenBSD 2.3 OpenBSD 2.4 OpenBSD 2.5 OpenBSD 2.6 OpenBSD 2.7 OpenBSD 2.8 NetBSD 1.3 NetBSD 1.4 NetBSD 1.50.50IIVIIIII
Fig.7.DendrogramusingsimilaritySline.
previouslybyFig.4.Further,asshowninFig.7,allFreeBSDversionsarecontainedinClusterIandallOpenBSDareinClusterII.FreeBSDandOpenBSDaredistinctgenealogicalsystemsthatdivergedataveryearlystageoftheirevolution,asshowninFig.4.ThedendrogramusingSlineobjectivelydisclosesit.
Also,wecanseetheclassi cationofNetBSDandOpenBSD.AllversionsofOpenBSDexceptfor2.0areinthesameclusterIII,andthisclusteriscombinedwithNetBSD1.1inclusterIVtogetherwithOpenBSD2.0.ThissuggeststhatallOpenBSDversionswerederivedfromNetBSD1.1.Thisiscon rmedbytheirevolutionhistory.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
TABLEIII
Slineofstudentexperiment.A
A
B
C
D
E
F
G
H10.0090.0240.0380.03400.0540.047B0.00910.0400.0010000.023C0.0240.04010.0600.0420.0880.1180.170D0.0380.0010.06010.0100.0400.0690.039E0.03400.0420.01010.0220.1720.237F000.0880.0400.022100G0.05400.1180.0690.172010.797H0.0470.0230.1700.0390.23700.7971
A.4SimilaritywithLinux
ThesimilaritybetweenFreeBSD4.0andLinux2.2.1[26]wasevaluated.ThesetwoUNIXOperatingSystemswerereleasedalmostatthesametime,buttheyareconsideredtosharenocommonancestors.TheresultingSlinevalueis0.031,whichisarelativelyverylowvalue(mostofthelinesinthecorrespondencearefordevice-dependentcodes).ThisresultindicatesthatSlineisverye ectiveindistinguishingdi erentsystemswithlittlesharedcode.
B.StudentProject
SMATwasalsoappliedtotheresultsfromanundergraduatestudentproject.StudentsdevelopedcompilerswritteninCforasubsetofPASCAL,underalectureoftheoryandpracticeofcompilerconstruction.Thestudentsturnedinalloftheobject lesandsourcecode lesafterall15testcaseshadbeenpassed.
Wehaverandomlychosen8studentresults(namedAtoH).Thetotalsourcecodesizeswerebetween3427and6866lines.TheresultsofSlinebetweenanytwocompilersareshowninTABLEIII.
Asyoucansee,mostofthesimilarityvalueswereverylow.Forexample,betweenAandCthevalueis0.024.ItisconsideredthatAandCwrotemanydistinctcodesindependentlyandthattheycreatedaccidentallyafewsimilarcodes,asshowninFig.8.Thesecodeportionsweredetectedassharedclonessincetheybothhavean“if...then...else”structurewithfunctioncallshavingtwoparameters.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
if(nenum==STRUE){
elseif(s==SMOD)generateCALL,LMOD);
elsegenerateLMULT);fprintf(ofp,”LEAGR1,1\n”);}else{
fprintf(ofp,”LEAGR1,0\n”);
}
Fig.8.SimilarcodesbetweenAandC.output lename[len+1]=NULL;
in le=fopen(argv[1],”r”);
out le=fopen(output lename,”w”);
if((!in le)——(!out le))
{
fprintf(stderr,”couldnotopen le\n”);
exit(1);
}output lename[len+1]=’\0’;fp=fopen(argv[1],”r”);out le=fopen(output lename,”w”);if((!fp)——(!out le)){fprintf(stderr,”couldnotopen le\n”);exit(1);}
Fig.9.SimilarcodesbetweenGandH.
Thehighestvalue,0.797,wasobtainedbetweenGandH.AsshowninFig.9,thecorrespondinglineshavedi erentlinebreaksandvariablenames,buttheyhavethesamesystemstructures.ThiscasewasconsideredplagiarismandSMATwasverye ectiveindetectingit.
V.ExtensiontoDifferenceExtractionTool
Asdiscussedtheabovesections,SMATverye ectivelycomputessimilaritymetricSlineoftwosoftwaresystems.However,theresultofSMATissimplyasimilarityvalue,andthereisnoreportorobservationofthedetailofthedi erenceofthesystems.
ingthisinformation,thedetaileddi er-encecanbereportedandthenonesystemcanbecompactlyrepresented(orcompressed)byanothersystem.
Adi erenceextractiontool(DET)oftwosoftwaresystemswasdeveloped.AnoverviewofDETfollows.
INPUTS:
Directoryofsourcecode lesforthetargetsystem
Directoryofsourcecode lesforthebasesystem
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
OUTPUTS:
Di erence lebetweenthetargetsystemandthebasesystem
Summaryreportof lecorrespondence
Step1:WeexecuteCCFinderonallconcatenated lesofthetargetsystemandthebasesystem.
Step2:Foreachtarget left,abase lefbispickedwhichcontainsmoreclonesagainstftthananyother les.Thenweperformdi betweenftandfb.Iftheresultofdi issmallerthanftitselfthenweconsiderthatftandfbmatchandthedi resultisoutput.Otherwiserawftisoutput.Ifnoclonesarefoundinft,wealsooutputftitself.
Step3:Reportasummarywhichcontainsthelistofmatched lenames,theiroriginalsizes,anddi resultsizes.
UnderlineconceptofDETisverysimilartoSMAT.ItperformsclonedetectionbyCCFinderbetweentwosystems,andthenextractsdi erencebydi between lepairswithclones.SMATexploresallpossible lepairswhereanysharedcloneexists,butDETtriesonlyone lepairwherethemostsharedclonesarefound.
DETalsorestoresthetargetsystemfromthegivendi erence letogetherwiththebasesystem.
DETwasappliedtovariousBSDUNIXversions.Forexample,itwasappliedtoFreeBSD2.0and2.0.5wherethetotalsizesofallkernelsourcecode leswereabout10.8MBytesand12.8MBytes,respectively.ThesizeoftheDETdi erence lefrom2.0to2.0.5is4.8MBytes,andfrom2.0.5to2.0is2.8MBytes.Thereasonwhythefor-merislargeristhatthenewerversion2.0.5containsmanynewlyadded leswhicharestraightforwardlyincludedinthedi erence le.
Thedi erenceextractedbyDETwassmallerthansimplyusingdi .Tocheckthis,weexecuted“di -nN”recursivelyintodirectoriesandmeasuredtheoutputsizes,whichwereabout5.8MByteseitherfrom2.0to2.0.5orfrom2.0.5to2.0.TheoutputsofDETwereabout1.2to5timessmallerthandi forthecasesofotherversions.
ThissuggestedthatDETcouldbeusefultoarchiveolderversionsbasedonthecurrentversionofasystem.Consideringthedramaticincreaseofdisccapacityandrapiddecreaseofitsprice,itmaybefeasibletostoreallversionsofasystemwithoutthecompression.March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
However,DETgeneratesasummaryreportcontainingcorrespondenceofold lesandnew les,whichgivesveryimportantcluesabout lenamechangesanddirectorystructuremodi cations.
DETisconsideredasuitabletooltotraceevolutionoflargesoftwaresystems.Also,DETcouldbeusefulformanaginglargeproductlineswheremanyversionswithminormodi cationsexist.
VI.DiscussionandRelatedWork
Aspresentedinprevioussections,oursimilarityde nition,similaritymetricSline,andmetricevaluationSMATworkedwell.Ourapproachprovidesapractical,meaningfulandusefulmeasureformaintainingandmanaginglargesoftwaresystems.
A.SimilarityDe nition
Thede nitionofsimilarityusedsymmetricinthesensethatthesimilarityforXandYandthatforYandXarethesame.Thisisbecausethesimilarityisde nedas(|x|+|y|)/(|X|+|Y|)wherexisthesetofX’selementsinthecorrespondenceandyisthatofY’selements.
Anotherde nitionofthesimilarityissuchthat|x|/|X|wherethecorrespondenceisdeterminedwithrespecttoY,butYandydonotappearexplicitlyinthesimilarityfor-mula[11].Thissimilarityde nitiongivesasinglesideviewofthesystemdi erence,whichwouldmakeitsuitableforinvestigatingcharacteristicsofindividualsystems.However,anoverviewofsystemevolutionisdi culttoarchivewiththesinglesidede nition.Forexample,tomakedendrogramsasshowninFig.7,itisnecessarytode nethedistanceoftwosystemsXandY.Withourapproach,thesimilarityisusedasthedistance.Inthecaseofthesinglesidede nition,anaverageoftwosimilarityvaluesmightbeusedsuchthat(|x|/|X|+|y|/|Y|)/2,butthisaveragevaluehaslessrationalmeaning.
B.MetricSline
ThecorrespondencewhichdeterminesSlineisamany-to-manymatchingbetweensourcelineslocatedwithin lesanddirectories.Thereasonsofthemany-to-manymatchingisthatwewouldliketotracethemovementofanysourcecodeblockwithin lesanddirectoriesMarch3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
asmuchaspossible,andobtaintheratioofsucceededandrevisedcodestooverallcodes.Itispossibletouseone-to-onematchinginthecorrespondence,butitcharacterizesthesimilaritymetrictoonaivelytocopiedcodes.AssumethatasystemXiscomposedofa lex1,andanewsystemX iscomposedoftwo lesx 1andx 2wherebothx 1andx 2arethesamecopiesofx1.Inourde nitionusingthemany-to-manymatching,thesimilarityis1.0,butusingtheone-to-onematchinggives0.5whichdoesnotre ectdevelopmente ortsproperly.
Actually,metricsSlinehaveveryhighcorrelationtowithdevelopmente orts.Thecor-relationcoe cientbetweenSlinevaluesandreleasedurationsofFreeBSDversionswas-0.973.Ontheotherhandthecorrelationcoe cientbetweenthesizeincreasesandthere-leasedurationswas0.528.Therefore,Slineshouldbeareasonablemeasuresofdevelopmente orts.
Anotherreasonforusingmany-to-manymatchingisperformance.Theone-to-oneap-proachneedssomemechanismtochoosethebestmatchingpairfrommanypossiblities,whichgenerallyisnotasimple,straightforwardprocess.
C.SimilarityMetricSfnusing lenamematching
AnalternativeandmuchsimplermetricSfnforcanbeemployedforsimilarity.
Considerasoftwaresystemscomposedofsourcecode les.Thecorrespondencebetweentwosuchsystemsisthesetofall lepairshavingthesame le(path)names.Thatis,ifa lepiinonesystemanda leqjinanothersystemhavethesame lenames,including lepaths,thenpair(pi,qj)isincludedinthecorrespondence.
ItisveryeasytocomputeSfn,bycheckingeach lepath.However,Sfnisveryfragiletorenamesandrestructuresofsourcecode les.Also,itcannotdetectchangesof lecontents.Furthermore,sinceSfndoesnotaccountforthesizesofeachsourcecode les,itmightproducevaluesfarfromreality.
Forexample,whenSfnwasappliedtothestudentprojectdescribedinSectionIV-B,ingSlineandSMAT,wewereabletodetectapossibleplagiarismbetweenGandH(Sline=0.797).However,theSfnbetweenGandHwas0.190,whichwastoolowtosuspectplagiarism.ThecorrelationbetweenSlineandSfnwas-0.004,meaningthattherewasnorelationbetweenthem.
March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
TABLEIV
Sfnofcompilersystems.A
A
B
C
D
E
F
G
H1000.1130000B010.6660.1250.6000.6660.1050.428C00.66610.1370.85710.1250.545D0.1130.1250.13710.1330.1370.1020.117E00.6000.8570.13310.8570.2350.500F00.66610.1370.85710.1250.545G00.1050.1250.1020.2350.12510.190H00.4280.5450.1170.5000.5450.1901
D.SMAT
SMATworkedverye cientlyforlargesoftwaresystems.TocomputeSline,executionofdi forallpossible lepairswouldhavebeenasimpleapproach.However,biningCCFinderanddi boostedtheperformanceofSMAT.Also,asmentionedbefore,themovementandmodi cationofsourcecodelinescanbetracedbetterbyCCFinder,whiche ectivelydetectscloneswithdi erentwhitespaces,comments,identi ernames,andsoon.Thematchingcomputationusingonlydi cannotchasethosechanges.
Therearealotofresearchesonclonedetectionandmanytoolshavebeendeveloped[27],
[28],[29].WewouldbeabletousethosetoolsinsteadofCCFinder.
E.RelatedWork
Therehasbeenalotofworkon ndingplagiarisminprograms.OttensteinusedHal-steadmetricvaluations[30]oftargetprogram lesforcomparison[31].Thereareotherapproacheswhichuseasetofmetricvaluestocharacterizesourceprograms[32],[33],[34].Also,structuralinformationhasbeenemployedtoincreaseprecisionofcomparison[35],
[36].Inordertoimprovebothprecisionande ciency,abstractedtextsequences(to-kensequences)canbeemployedforcomparison[8],[9],[37],[10].Sourcecodetextsaretranslatedintotokensequencesrepresentingprogramsstructures,andthelongestcommonsubsequencealgorithmisappliedtoobtainmatching.
Thesesystemsareaimedmainlyat ndingsimilarsoftwarecodeintheeducationenvi-March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
ronment.Thesimilaritymetricvaluescomputedbycomparisonofmetricsvaluesdonotshowtheratioofsimilarcodestonon-similarcodes,andthuswouldbelessintuitivelyaccurate.Also,scalabilityofthoseevaluationmethodstolargesoftwaresystemsuchasUNIXOSisnotknown.
Inreverseengineering eld,therehasbeenresearchonmeasuringsimilarityofcompo-nentsandrestructuringmodulesinasoftwaresystem,toimproveitsmaintenanceabilityandunderstandability[38],[39],[40].Suchsimilaritymeasuresarebasedonseveralmetricvaluessuchassharedidenti ernamesandfunctioninvocationrelations.Althoughtheseapproachesinvolveimportantviewsofsimilarity,theirobjectivesaretoidentifycompo-nentsandmodulesinsideasinglesystem,andcannotbeapplieddirectlytointer-systemsimilaritymeasurement.
AstudyonthesimilaritybetweendocumentsispresentedbyBroder[11].Inthisap-proach,asetof xed-lengthtokensequencesareextractedfromdocuments.ThentwosetsXandYareobtainedforeachdocumenttocomputetheirintersectionofthem.Thesimilarityisde nedas(|X|∩|Y|)/(|X|∪|Y|).
Thisapproachisverysuitablefore cientlycomputingtheresemblanceofalargecollec-tionofdocumentssuchasworld-widewebdocuments.However,choosingtokensequencesgreatlya ectstheresultingvalues.Tokenswithminormodi cationwouldnotbedetected.Therefore,thisisprobablyaninappropriateapproachforcomputingsubjectivesimilaritymetricforsourcecode les.
Manber[12]developedatooltoidentifysimilar lesinlargesystems.Thistoolusesasetofkeywordsandextractssubsequencesstartingwiththosekeywordsas ngerprints.A ngerprintsetXofatarget leisencodedandcomparedtoa ngerprintsetYofaquery le.Thesimilarityisde nedas|X∩Y|/|X|.
Thisapproachworksverye cientlyforbothsourceprogram lesanddocument lesandwould texplorationofsimilar lesinalargesystem.However,itisfragiletotheselectionofkeywords.Also,itwouldbetoosensitivetominormodi cationsofsourceprogram lessuchasidenti erchangesandcommentinsertions.
Thesemethodsareallquitedi erentfromthosedevelopedandpresentedherein,sincetheydonotperformcomparisononrawandoveralltextsequences,butratheronsam-March3,2002DRAFT
It is an important and intriguing issue to know the quantitative similarity of large software systems. In this paper, a similarity metric between two sets of source code files based on the correspondence of overall source code lines is proposed. A Software
pledtextsequences.Samplingapproacheswouldgethighperformance,buttheresultingsimilarityvalueswouldbelesssigni cantthanourwholetextcomparisonapproach.
VII.Conclusion
Aproposedde nitionofsimilaritybetweentwosoftwaresystemswithrespecttocor-respondenceofsourcecodelineswasformulatedasasimilaritymetriccalledSline.AnSline-basedevaluationtoolSMATwasdevelopedandappliedtovarioussoftwaresystems.TheresultsshowedthatSlineandSMATareveryusefulforidentifyingtheoriginofthesystemsandtocharacterizetheirevolution.Furthermore,usingthecomputationprocessofSMAT,adi erenceextractiontool,DET,wasdevelopedwhichcompressesatargetsoftwaresystemrelativetoabasesystemandreportsthedi erence.
FurtherapplicationsofSMATtovarioussoftwaresystemsandproductlineswillbemadetoinvestigatetheirevolution.Fromamacrolevelanalysisviewpoint,categorizationandtaxonomyofsoftwaresystemsanalogoustomolecularphylogenyshouldbeanintriguingissuetopursue.Fromamicrolevelanalysisviewpoint,chasingspeci ccodeblocksthroughsystemevolutionwillbeinterestingtoperform.
References
[1]V.R.Basili,L.C.Briand,S.E.Condon,Y.-M.Kim,W.L.Melo,andJ.D.Valett,“Understanding
andpredictingtheprocessofsoftwaremaintenancerelease,”in18thInternationalConferenceonSoftwareEngineering,berlin,1996,pp.464–474.
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]L.A.BeladyandM.M.Lehman,“Amodeloflargeprogramdevelopment,”IBMSystemsJournal,vol.15,no.3,pp.225–252,1976.S.Cook,H.Ji,andR.Harrison,“Dynamicandstaticviewsofsoftwareevolution,”intheIEEEInternationalConferenceOnSoftwareMaintenance(ICSM2001),Florence,Italy,Nov.2001,pp.592–601.C.F.KemererandS.Slaughter,“Anempiricalapproachtostudyingsoftwareevolution,”IEEETransactionsonSoftwareEngineering,vol.25,no.4,pp.493–509,1999.TheFirstSoftwareProductLineConference(SPLC1),,”http://www.sei.cmu.edu/plp/conf/SPLC.html,Den-ver,Colorado,August2000.P.ClementsandL.Northrop,SoftwareProductLines:PracticesandPatterns,AddisonWesley,2001.A.BaxevanisandF.Ouellette,Eds.,Bioinformatics2ndedition,pp.323–358,JohnWileyandSons,Ltd.,England,2001.A.Aiken,“Moss(measureofsoftwaresimilarity)plagiarismdetectionsystem,”http://www.cs.berkeley.edu/moss/.L.Prechelt,G.Malpohl,andM.Philippsen,“Jplag:Findingplagiarismsamongasetofprograms,”Technical
Report2000-1,FakultatfurInformatik,UniversitatKarlsruhe,Germany,2000.
March3,2002DRAFT
- 1From Design to Integration of Transitic Systems A Component Based Approach
- 2A Framework for Role-Based Access Control in Group Communication Systems
- 3Design of an economics-based software infrastructure for secure utility computing on superc
- 4Performance Analysis for Bit Error Rate of DS- CDMA SensorNetwork Systems with Source Coding
- 5第九讲 软件复用与构件化软件开发(Software Reuse and Software Design based Component
- 6Agent Based Maintenance for Modularised Case Bases in Collaborative Multi-Expert Systems
- 7LMI-based robust control of uncertain discrete-time piecewise affine systems
- 82011Consensus of linear multi-agent systems with reduced-order observer-based protocols
- 9source insight使用
- 10Software Testing in the Cloud
- 教学能力大赛决赛获奖-教学实施报告-(完整图文版)
- 互联网+数据中心行业分析报告
- 2017上海杨浦区高三一模数学试题及答案
- 招商部差旅接待管理制度(4-25)
- 学生游玩安全注意事项
- 学生信息管理系统(文档模板供参考)
- 叉车门架有限元分析及系统设计
- 2014帮助残疾人志愿者服务情况记录
- 叶绿体中色素的提取和分离实验
- 中国食物成分表2020年最新权威完整改进版
- 推动国土资源领域生态文明建设
- 给水管道冲洗和消毒记录
- 计算机软件专业自我评价
- 高中数学必修1-5知识点归纳
- 2018-2022年中国第五代移动通信技术(5G)产业深度分析及发展前景研究报告发展趋势(目录)
- 生产车间巡查制度
- 2018版中国光热发电行业深度研究报告目录
- (通用)2019年中考数学总复习 第一章 第四节 数的开方与二次根式课件
- 2017_2018学年高中语文第二单元第4课说数课件粤教版
- 上市新药Lumateperone(卢美哌隆)合成检索总结报告
- Correspondence
- Similarity
- Measuring
- Software
- Systems
- Source
- Large
- Based
- Code
- 英语阅读理解100篇(中级篇)-11_英语题库
- 证券市场中投资者行为的进化博弈分析
- 塑料防雾剂的研究进展
- 2012年河南省中考原创仿真卷(二)政治
- 肿瘤化疗莫要忘了手足综合症
- 综合医院智能化管理系统的协调应用
- 财务部100问知识题库
- 装修半包都包括什么,工长教你小知识
- 持续腹腔冲洗治疗胰十二指肠切除术后吻合口瘘患者的护理
- 宗教旅游文献综述
- 罗兰贝格_科龙电器品牌战略和营销组织架构(全套文件)》D
- 2013-2014学林精品新华师大八年级数学上册期末试卷(6)(权威、经典)
- 全国二级建造师执业资格考试用书(第三版)《建设工程施工管理》2010年网上增值服务(1、2、3、4)打包
- 教师工作调动申请书
- 浅谈农村物流产业的发展
- 日本二维动画制作软件CoreRETAS的使用介绍
- 探析社会语言学视阈下的英语语言性别差异现象
- 新型三维力传感器的研制与应用
- 电和热(焦耳定律)实验专题训练(专刊C)1doc
- 护理类专业2013年山西省中等职业学校对口升学考试大纲