Software is prone to security vulnerabilities. Program analysis tools to detect them have limited effectiveness in practice. While large language models (or LLMs) have shown impressive code generation capabilities, they cannot do complex reasoning over code to detect such vulnerabilities, especially because this task requires whole-repository analysis. In this work, we propose IRIS, the first approach that systematically combines LLMs with static analysis to perform whole-repository reasoning to detect security vulnerabilities. We curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. These projects are complex, with an average of 300,000 lines of code and a maximum of up to 7 million. Out of 120 vulnerabilities in CWE-Bench-Java, IRIS detects 69 using GPT-4, while the state-of-the-art static analysis tool only detects 27. Further, IRIS also significantly reduces the number of false alarms (by more than 80% in the best case).
False negatives due to missing taint specifications of third-party library APIs. First, static taint analysis predominantly relies on specifications of third-party library APIs as sources, sinks, or sanitizers. In practice, developers and analysis engineers have to manually craft such specifications based on their domain knowledge and API documentation. This is a laborious and error-prone process that often leads to missing specifications and incomplete analysis of vulnerabilities. Further, even if such specifications exist for many libraries, they need to be periodically updated to capture changes in newer versions of such libraries and also cover new libraries that are developed.
Approach. We propose IRIS, the first neuro-symbolic approach for vulnerability detection that combines the strengths of static analysis and LLMs without suffering their limitations. Given a project to analyze for a given vulnerability class (or CWE), IRIS applies LLMs for mining CWE-specific taint specifications for third-party library APIs used in the project. IRIS augments CodeQL, a static analysis tool, with such specifications and uses it to detect security vulnerabilities. Our intuition here is that because LLMs have seen numerous usages of such library APIs across projects, they have an understanding of which APIs are relevant for different CWEs.
Dataset. We curate a dataset of manually vetted and compilable Java projects, CWE-Bench-Java, containing 120 vulnerabilities (one per project) across four common vulnerability classes. The projects in the dataset are complex, containing 300K lines of code on average, and 10 projects with more than a million lines of code, making it a challenging benchmark for vulnerability detection.
Results. We evaluate IRIS on CWE-Bench-Java using eight diverse open- and closed-source LLMs. Overall, IRIS obtains the best results with GPT-4, detecting 69 vulnerabilities, which is 42 (35%) more than CodeQL. Among open-source LLMs, DeepSeekCoder 7B, despite being much smaller than other LLMs, performs the best, detecting 67 vulnerabilities, followed by Llama 3 70B with 57 vulnerabilities. Further, our context-based filtering technique reduces false positive alerts by 80%.
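As a quick check of the headline comparison (a worked calculation from the numbers reported above, not an additional result, and assuming the percentage is taken over all 120 benchmark vulnerabilities):

\[
69 - 27 = 42, \qquad \frac{42}{120} = 0.35 = 35\%.
\]

Under this reading, the 35% is measured against the full benchmark rather than against CodeQL's own count of 27.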
Detecting this vulnerability poses several challenges. First, the cron-utils library consists of 13K SLOC (lines of code excluding blanks and comments), which needs to be analyzed to find this vulnerability. This process requires analyzing data and control flow across several internal methods and third-party APIs. Second, the analysis needs to identify relevant sources and sinks. In this case, the value parameter of the public isValid method may contain arbitrary strings when invoked, and hence may be a source of malicious data. Additionally, external APIs like buildConstraintViolationWithTemplate can execute arbitrary Java EL expressions, hence they should be treated as sinks that are vulnerable to Code Injection attacks. Finally, the analysis also requires identifying any sanitizers that block the flow of untrusted data, as well as intuitive understanding of contextual information of the source and sink to suppress false positives.
Modern static analysis tools, like CodeQL, are effective at tracing taint data flows across complex codebases. However, CodeQL fails to detect this vulnerability due to missing specifications. CodeQL includes many manually curated specifications for sources and sinks across more than 360 popular Java library modules. However, manually obtaining such specifications requires significant human effort to analyze, specify, and validate. Further, even with perfect specifications, CodeQL may often generate numerous false positives due to a lack of contextual reasoning, increasing the developer’s burden of triaging the results.
At a high level, IRIS takes a Java project $P$, the vulnerability class $C$ to detect, and a large language model LLM, as inputs. IRIS statically analyzes the project $P$, checks for vulnerabilities specific to $C$, and returns a set of potential security alerts $A$. Each alert is accompanied by a unique code path from a taint source to a taint sink that is vulnerable to $C$ (i.e., the path is unsanitized).
Two key challenges in taint analysis include: 1) identifying relevant taint specifications for each class $C$ that can be mapped to $\boldsymbol{V}^{C}_{\textit{source}}$, $\boldsymbol{V}^{C}_{\textit{sink}}$ for any project $P$, and 2) effectively eliminating false positive paths in $\textit{Unsanitized\_Paths}(V_{s},V_{t})$ identified by taint analysis. In the following sections, we discuss how we address each challenge by leveraging LLMs.
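To make the notion of unsanitized paths concrete, the following is a minimal sketch, assuming the data flow is available as a plain list of directed edges. The function name `unsanitized_paths`, the edge list, and the node labels are hypothetical illustrations; this is not how CodeQL or IRIS computes paths internally.

```python
def unsanitized_paths(edges, sources, sinks, sanitizers):
    """Enumerate simple source-to-sink paths that never pass through a sanitizer."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    paths = []
    for start in sorted(sources):
        if start in sanitizers:
            continue
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            if node in sinks:
                paths.append(path)
                continue
            for nxt in graph.get(node, []):
                # Skip sanitized flows and avoid revisiting nodes (cycles).
                if nxt in sanitizers or nxt in path:
                    continue
                stack.append((nxt, path + [nxt]))
    return paths


# Hypothetical data-flow graph loosely modeled on the cron-utils example.
edges = [
    ("isValid.value", "message"),
    ("message", "buildConstraintViolationWithTemplate"),
    ("isValid.value", "escape"),
    ("escape", "logger"),
]
paths = unsanitized_paths(
    edges,
    sources={"isValid.value"},
    sinks={"buildConstraintViolationWithTemplate", "logger"},
    sanitizers={"escape"},
)
print(paths)
```

The flow through the hypothetical `escape` node is dropped, so only the path into `buildConstraintViolationWithTemplate` is reported.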
A project may use various third-party APIs whose specifications (i.e., source or sink) may be unknown, reducing the effectiveness of taint analysis. In addition, internal APIs might accept untrusted input from downstream libraries as well. Hence, our goal is to automatically infer specifications for such APIs. We define a specification $S^{C}$ as a 3-tuple $\langle T, F, R \rangle$

5.1 Experimental Setup

LLM selection and CodeQL baseline. We select two closed-source LLMs from OpenAI: GPT-4 (gpt-4-0125-preview) and GPT-3.5 (gpt-3.5-turbo-0125) for our evaluation. We also select instruction-tuned versions of six state-of-the-art open-source LLMs via the Hugging Face API: Llama 3 8B and 70B, DeepSeekCoder 7B and 33B, Mistral 7B, and Gemma 7B. For the baseline, we use CodeQL 2.15.3 and its built-in Security queries specifically designed for each CWE.
Effectiveness across LLMs. Out of 120 vulnerabilities, 34 are not detected by IRIS with any LLM, 78 are detected by at least two LLMs, and 10 are detected by all. In fact, all but one vulnerability detected by CodeQL can be detected by IRIS with GPT-4. The undetected case is due to a missing source specification related to Java annotations, which we currently do not support. Moreover, 8 vulnerabilities are detected by only one LLM: GPT-4 (1), Mistral 7B (1), DeepSeekCoder 7B (4), and DeepSeekCoder 33B (2). Observing only small discrepancies, we conclude that LLMs share a lot of common knowledge and are generally applicable to specification inference.
Continuous taint specification inference is necessary. Our results show that there is a high number of both unique and recurring sources and sinks: about 900 unique and 700 recurring per CWE (tables in Appendix). This indicates that even if previously inferred specifications are useful, a significant number of new relevant APIs still remain and need to be labeled for effective vulnerability detection. This observation strongly motivates the design of IRIS, which infers these specifications on-the-fly for each project via LLMs, instead of relying on a fixed corpus of specifications like CodeQL.
We presented IRIS, a novel neuro-symbolic approach that combines LLMs with static analysis for vulnerability detection. We also curated a dataset, CWE-Bench-Java, containing 120 security vulnerabilities across four classes in real-world projects. Our experimental results show that IRIS can significantly surpass state-of-the-art static analysis tools, like CodeQL, in detecting critical vulnerabilities, even when using smaller LLMs. Further, IRIS’s context filtering techniques also drastically reduce the number of false positives. Overall, our results show that effectively combining LLMs with static analysis leads to more effective analysis and reduces the developer burden. However, there are still many vulnerabilities that cannot yet be detected by this approach. Hence, future approaches may explore a tighter integration of these two tools to further improve performance.
Limitations. Determining whether a vulnerability is detected is a difficult task. Hence, we rely on manual validation to find vulnerable method locations in each case, limiting the size of CWE-Bench-Java. However, because each project is still quite large, we believe CWE-Bench-Java will remain a challenging benchmark for future work. IRIS makes numerous calls to LLMs for specification inference and filtering false positives, increasing the potential cost of analysis. However, as our results show, even smaller models, like Llama 3 8B and DeepSeekCoder 7B, can perform well on these tasks and can potentially be deployed locally for a project. For any given project, the cost of labeling will consolidate quickly across versions, making IRIS a viable choice for potential integration into CI/CD pipelines. While our results on Java benchmarks are promising, it is unknown if IRIS will perform well on other languages. We plan to explore this further in future work.
While extracting external APIs, we filter out commonly-used Java libraries that are unlikely to contain any potential sources or sinks. Such libraries include testing libraries like JUnit and Hamcrest or mocking libraries like Mockito. While we filter out methods that are defined in the project, we specifically allow methods that are inherited from an external class or interface. An example is the getResource method of the generic class Class in the java.lang package, which takes a path as a string and accesses a file in the module. Many projects commonly inherit this class and use this method. If the input path is unchecked, it may lead to a Path-Traversal vulnerability if the path accesses resources outside the given module. Hence, detecting such API usages is crucial.
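The filtering rule above can be sketched as follows. This is an illustration only: it assumes the filter works on package-name prefixes, and the prefix list and the `keep_api` helper are hypothetical rather than the actual IRIS implementation.

```python
# Assumed package prefixes for the testing and mocking libraries named above.
EXCLUDED_PREFIXES = ("org.junit", "org.hamcrest", "org.mockito")


def keep_api(package, defined_in_project, inherited_from_external):
    """Decide whether a called method stays in the candidate API list."""
    if package.startswith(EXCLUDED_PREFIXES):
        return False  # testing/mocking libraries are unlikely sources or sinks
    if defined_in_project and not inherited_from_external:
        return False  # purely internal methods are not external API candidates
    return True


print(keep_api("org.mockito", False, False))       # filtered out
print(keep_api("java.lang", False, False))         # kept
print(keep_api("com.example.app", True, True))     # kept: inherited from an external type
```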
Taint sources are typically values returned by methods that obtain inputs from external sources, such as the response of an HTTP request or a command-line argument. Hence, we select external APIs that have a “non-void” return type as candidate sources. Another type of taint source is commonly seen in Java libraries: when a library is used by downstream code, tainted information may be passed into the library through function calls. Therefore, we also collect the formal parameters of public internal functions as source candidates. Because such candidates are numerous, we impose a further constraint that the public internal function must be directly invoked by a unit test case within the same repository. Here, test cases are identified by checking whether the path of the containing file includes src/test.
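A minimal sketch of these two selection rules follows, assuming method metadata has already been extracted into simple records. The `Method` record, its field names, and the example methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    return_type: str
    is_external: bool
    is_public: bool
    params: tuple = ()
    caller_files: tuple = ()  # paths of files that directly invoke the method


def is_test_file(path):
    # Test cases are identified by "src/test" appearing in the file path.
    return "src/test" in path


def source_candidates(methods):
    """Return (method, position) pairs that are candidate taint sources."""
    candidates = []
    for m in methods:
        if m.is_external:
            # Rule 1: external APIs with a non-void return type.
            if m.return_type != "void":
                candidates.append((m.name, "return"))
        elif m.is_public and any(is_test_file(p) for p in m.caller_files):
            # Rule 2: formal parameters of public internal functions
            # that are directly invoked by a unit test.
            candidates.extend((m.name, p) for p in m.params)
    return candidates


methods = [
    Method("ext.readInput", "String", True, True),
    Method("ext.log", "void", True, True),
    Method("Validator.isValid", "boolean", False, True,
           ("value",), ("src/test/java/ValidatorTest.java",)),
    Method("Validator.helper", "void", False, True,
           ("x",), ("src/main/java/App.java",)),
]
print(source_candidates(methods))
```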
On the other hand, taint sinks are typically arguments to an external API. This includes explicit arguments, such as the command argument passed to the Runtime.exec(String command) method, and the implicit this argument to non-static functions, such as the file variable in the function call file.delete(). This is the only type of sink that we consider within IRIS.
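The sink-candidate rule can be sketched in the same style. The helper name `sink_candidates` and the tuple representation are assumptions made for illustration.

```python
def sink_candidates(api_name, param_names, is_static):
    """List candidate sink positions for one external API."""
    # Every explicit argument is a candidate sink position.
    candidates = [(api_name, p) for p in param_names]
    # Non-static methods also receive an implicit `this` receiver.
    if not is_static:
        candidates.append((api_name, "this"))
    return candidates


print(sink_candidates("Runtime.exec", ["command"], is_static=False))
print(sink_candidates("File.delete", [], is_static=False))
```

For `file.delete()`, the only candidate is the receiver itself, which matches the file variable mentioned above.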
We note that this is not the entire story, as there might be other kinds of sources and sinks. Other types of source candidates include the formal parameters of protected but overridden internal functions (the req parameter in protected HTTPServeletResponse doGet(HTTPServeletRequest req)), arguments to an impure external function (the buffer argument to void read(byte[] buffer, int size)), etc. Sink candidates include the return value of public-facing functions, thrown exceptions, and even static methods without any parameter (System.exit()). Due to the complexity, we do not tackle such kinds of sources or sinks in this work. However, we plan to explore them in future work.
We hypothesize that since LLMs are pre-trained on internet-scale data, they have knowledge about the behavior of widely used libraries and their APIs. Hence, it is natural to ask whether LLMs can be used to identify APIs that are relevant as sources or sinks for any vulnerability class. If successful, LLMs can alleviate manual effort and drastically improve the effectiveness of static analysis tools.
We use templates to convert LLM-inferred specifications into CodeQL queries. There are three kinds of queries:
In contrast to VulDetected, computing the precision of the results is more challenging. For instance, even if a detected code path does not intersect with a fixed file or method, it may actually point to a true vulnerability (e.g., a different CVE in the same version) in the project. Moreover, even if there is a […] Hence, manual analysis is required to compute precision.
Finally, we manually check each fix commit and validate whether the commit actually contains a fix to the given CVE in a Java file. For instance, we found that in some cases the fix is in files written in other languages (such as Scala or JSP). While code written in other languages may flow to the Java components in the project during runtime or via compilation, it is not possible to reliably determine whether static analysis can correctly detect such a vulnerability. Hence, we exclude such CVEs. Further, we exclude cases where the vulnerability was in a dependency and the fix was just a version upgrade, or where the vulnerability was mis-classified. Finally, we end up with ($\star$) 120 projects that we evaluate with IRIS. For this task, we divide the CVEs between two co-authors of the project, who independently validate each case. The co-authors cross-check each other’s results and discuss together to come up with the final list of projects.
We compare CWE-Bench-Java with existing datasets for vulnerability detection in Java, C, and C++ codebases, on the following criteria:
We select two closed-source LLMs from OpenAI: GPT-4 (gpt-4-0125-preview) and GPT-3.5 (gpt-3.5-turbo-0125) for our evaluation. GPT-4 and GPT-3.5 queries used in the paper are performed through the OpenAI API during April and May of 2024.
We also select instruction-tuned versions of six state-of-the-art open-source LLMs via the Hugging Face API: Llama 3 8B and 70B, DeepSeekCoder 7B and 33B, Mistral 7B, and Gemma 7B. To run the open-source LLMs we use two groups of machines: a 2.50GHz Intel Xeon machine with 40 CPUs, four GeForce RTX 2080 Ti GPUs, and 750GB RAM, and another 3.00GHz Intel Xeon machine with 48 CPUs, 8 A100s, and 1.5T RAM.
We use CodeQL version 2.15.3 as the backbone of our static analysis.
For baseline comparison with CodeQL, we use the built-in Security queries specifically designed for each CWE that come with CodeQL 2.15.3. Note that there are multiple security queries for each CWE, and each produces alarms of a different level (error, warning, or recommendation). For each CWE, we take the union of alerts generated by all queries and do not differentiate between alarms of different levels. For instance, there are 3 queries from CodeQL for detecting CWE-22 vulnerabilities, namely TaintedPath, TaintedPathLocal, and ZipSlip. While TaintedPath and ZipSlip produce error-level alarms, TaintedPathLocal produces only recommendation-level alarms. To CodeQL’s advantage, all alarms are treated equally in our comparisons.
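The baseline aggregation described above amounts to a set union that discards the alarm level. The sketch below assumes alerts are given as (location, level) pairs keyed by query name. The alert locations are made up for illustration and are not real CodeQL output.

```python
def baseline_alerts(results_by_query):
    """Union the alerts of all queries for a CWE, ignoring alarm levels."""
    merged = set()
    for alerts in results_by_query.values():
        for location, _level in alerts:
            merged.add(location)
    return merged


# Hypothetical results for the three CWE-22 queries.
cwe22_results = {
    "TaintedPath": [("FileServer.java:42", "error")],
    "TaintedPathLocal": [("Cli.java:10", "recommendation"),
                         ("FileServer.java:42", "recommendation")],
    "ZipSlip": [("Unzip.java:77", "error")],
}
print(sorted(baseline_alerts(cwe22_results)))
```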
IRIS uses two prompts to label external and internal APIs. Recall that the prompts contain batched APIs. We use batch sizes of 20 and 30 for internal and external APIs, respectively. For the few-shot examples passed when labeling external APIs, we use 4 examples for CWE-22, 3 examples for CWE-78, 3 examples for CWE-79, and 3 examples for CWE-94. We use a temperature of 0, a maximum of 2048 tokens, and a top-p of 1 for inference with all the LLMs. For GPT-3.5 and GPT-4, we also fix a seed to mitigate randomness as much as possible.
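These settings can be collected into a small configuration together with the batching step. The sketch below is illustrative: the dictionary keys and the `batches` helper are hypothetical names, and no LLM client is invoked.

```python
# Values taken from the setup above; the key names are assumptions.
BATCH_SIZE = {"internal": 20, "external": 30}
FEW_SHOT_EXAMPLES = {"CWE-22": 4, "CWE-78": 3, "CWE-79": 3, "CWE-94": 3}
INFERENCE_PARAMS = {"temperature": 0, "max_tokens": 2048, "top_p": 1}


def batches(apis, kind):
    """Split a list of candidate APIs into prompt-sized batches."""
    size = BATCH_SIZE[kind]
    return [apis[i:i + size] for i in range(0, len(apis), size)]


# Hypothetical list of 65 candidate APIs.
apis = [f"api_{i}" for i in range(65)]
print([len(b) for b in batches(apis, "external")])
print([len(b) for b in batches(apis, "internal")])
```

Each batch would then be placed into one labeling prompt, so the number of LLM calls grows with the number of candidates divided by the batch size.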
We include the full table containing statistics to provide more details about projects and our analysis (Table \ref{tab:longtable}). For each project, we present its corresponding CWE ID, the lines of code (SLOC), the time it takes to run the full analysis, the number of candidate APIs, and the number of sources and sinks labeled by Llama 3 8B. We also color code cells of interest. For SLOC, we mark a cell as red if > 1M and yellow if > 100k. For Time, we mark a cell as red if ≥ 1h and yellow if ≥ 5m. For the number of candidates, we mark a cell as red if > 10k. Lastly, for sources and sinks, we mark a cell as red if the number is larger than 200.
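The color-coding thresholds can be written down directly. The helper names below are hypothetical, and time is assumed to be measured in seconds.

```python
def sloc_color(sloc):
    if sloc > 1_000_000:
        return "red"
    if sloc > 100_000:
        return "yellow"
    return None


def time_color(seconds):
    if seconds >= 3600:   # at least one hour
        return "red"
    if seconds >= 300:    # at least five minutes
        return "yellow"
    return None


def candidates_color(count):
    return "red" if count > 10_000 else None


def spec_color(count):
    # Applies to the number of labeled sources or sinks.
    return "red" if count > 200 else None


print(sloc_color(300_000), time_color(3600), candidates_color(12_000), spec_color(150))
```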