Figure LABEL:fig:intro presents an overview of our work, which aims to generate honest and helpful responses. Specifically, given the query “Can you pull up the real-time subscriber count for PewDiePie on YouTube?”, a dishonest LLM will respond directly with uncertain claims and hallucinations due to its inability to perform the task or its misunderstanding of the query, while an honest but unhelpful response will simply refuse to answer, leaving the user without any guidance or explanation. Ideally, an honest and helpful response contains a detailed explanation or disclaimer, along with potential solutions and further guidance for the user.
In summary, the primary contributions of this paper are as follows:
We introduce HoneSet (Honesty Dataset), the first dataset containing queries that LLMs are unable to solve. HoneSet is essential for cataloging the different types of queries that cause LLMs to struggle, offering a unique resource for analyzing and enhancing how honestly models respond when handling LLM-unable tasks.
To generate the data according to the proposed principles for honest LLMs, we adhere to the following three steps:
Overall, we collected a total of 930 queries, carefully curated to ensure a comprehensive dataset representing the various categories where LLMs struggle.
To achieve a training-free enhancement, our objective is to construct a prompt $p_q$ that enables the LLM $\pi_\theta$, parameterized by $\theta$, to generate an answer $y = \pi_\theta(p_q)$ that adheres to our goals. To achieve this, we aim to maximize the quality of $y$ under an evaluation function $s = \mathcal{E}(y)$. We seek the prompt $p^*$ that meets the following optimization goal:
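From the definitions above, the objective presumably takes the form
\[
p^{*} = \arg\max_{p} \; \mathcal{E}\big(\pi_\theta(p)\big),
\]
i.e., the prompt whose induced answer maximizes the evaluation score.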
The generated responses are then advanced to the answer-optimization step, where they are further refined based on the elicited details and expressed uncertainties.
The constitution-guided prompt emphasizes that (1) LLMs should convey any confusion or limitation in their output as a form of disclaimer to express uncertainty, and (2) LLMs should remain helpful, exemplified by providing actionable guidance. For instance, when faced with a complex arithmetic problem like $e^{10}$, which lies beyond simple computational abilities without tools, LLMs should suggest practical alternatives such as using a calculator or programming a solution.
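As an illustration, such a constitution-guided refinement prompt can be assembled along the following lines. This is a minimal sketch: the constitution text is a paraphrase and `build_opt_prompt` is a hypothetical helper, not the exact prompt used in our pipeline.

```python
# Paraphrased constitution: disclose limitations, stay helpful.
CONSTITUTION = (
    "1. State any confusion or limitation as an explicit disclaimer.\n"
    "2. Remain helpful: offer actionable guidance or alternatives, "
    "e.g., suggest a calculator or code for arithmetic like e^10."
)

def build_opt_prompt(confusion: str, query: str, raw_answer: str) -> str:
    """Compose the answer-optimization prompt from the confusion output,
    the original query, and the raw answer."""
    return (
        f"{CONSTITUTION}\n\n"
        f"Identified confusion or limitations:\n{confusion}\n\n"
        f"Original query:\n{query}\n\n"
        f"Draft answer to revise:\n{raw_answer}\n\n"
        "Rewrite the draft so it follows the constitution above."
    )
```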
Formally, the optimized prompt $p_{\text{opt}}$ is composed of the confusion output $c$ from the curiosity-driven prompt, the original query $q$, and the raw answer $a$ to the original query. The optimization process aims to generate a response $\hat{y}$ that maximizes an evaluation function $\mathcal{E}$, reflecting the quality of the response. This process can be mathematically formulated as follows:
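With the components just defined, the formulation presumably takes the shape
\[
\hat{y} = \pi_\theta(p_{\text{opt}}), \qquad p_{\text{opt}} = (c,\, q,\, a), \qquad \text{s.t.}\;\; \mathcal{E}(\hat{y}) > \mathcal{E}(y).
\]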
Here, $\pi_\theta(p)$ denotes the output of the language model parameterized by $\theta$ given prompt $p$, $y$ is the baseline response to the original query $q$ without optimization, and $\hat{y}$ is the optimized response from the enhanced prompt $p_{\text{opt}}$. The objective is to ensure that the evaluation $\mathcal{E}(\hat{y})$, which quantifies the quality of the response, is greater than $\mathcal{E}(y)$, indicating an improvement over the baseline.
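The fine-tuned setting then optimizes the standard DPO objective, which in its usual form, with $y_w$ and $y_l$ denoting the preferred and dispreferred responses to a query $x$, reads
\[
\mathcal{L}_{\text{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]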
where $\mathcal{D}$ is the preference dataset, $\pi_\theta$ denotes the policy parameterized by model parameters $\theta$, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ is a scaling factor for the logits.
Stage One: Differentiating Honesty from Dishonesty. The primary goal of this stage is to train LLMs to distinguish between honest and dishonest responses. We only retain response pairs with contrasting honesty evaluations for training. However, directly using pairs with a large score difference under $\mathcal{E}_{\text{overall}}(\cdot)$ (e.g., a dishonest response with score 1 and an honest response with score 9) poses challenges for LLMs to learn from. Therefore, we select a response pair $(y_1, y_2)$ into the training set $\mathcal{D}_1$ according to the following constraints:
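A plausible reconstruction of these constraints, assuming $y_1$ denotes the honest response of the pair and the threshold caps its overall score so that the score gap within a pair stays moderate, is
\[
\mathcal{D}_1 = \left\{(y_1, y_2) \,\middle|\, y_1 \text{ is honest},\; y_2 \text{ is dishonest},\; \mathcal{E}_{\text{overall}}(y_1) \le \beta \right\},
\]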
where $\beta$ is the threshold score evaluated by $\mathcal{E}_{\text{overall}}(\cdot)$.
Stage Two: Enhancing Overall Response Quality. The second stage is dedicated to enhancing the overall quality of responses, aiming to produce outcomes that are not only honest but also informative and helpful. We include in the training set $\mathcal{D}_2$ those pairs $(y_1, y_2)$ where:
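Again as a plausible reconstruction, with both responses judged honest and the threshold $\beta$ separating the preferred from the dispreferred response:
\[
\mathcal{D}_2 = \left\{(y_1, y_2) \,\middle|\, y_1, y_2 \text{ are honest},\; \mathcal{E}_{\text{overall}}(y_1) > \beta \ge \mathcal{E}_{\text{overall}}(y_2) \right\}.
\]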
We utilize all queries from the HoneSet to evaluate the LLMs' performance. (1) Training-Free Enhancement. For the H2 assessment, we consider only those queries that have already been evaluated through the purely honest-guided evaluation and confirmed as honest, in order to observe the net improvement of LLMs when applying our method. (2) Improvement through Fine-Tuning. We compile all responses, both the raw outputs and those optimized via training-free enhancement, and employ the LLM-as-a-Judge approach (i.e., purely honest-guided evaluation) to select answer pairs for constructing the preference datasets ($\mathcal{D}_1$ and $\mathcal{D}_2$) for the first and second stages of fine-tuning. Each stage involves 1,000 answer pairs. We designate 120 queries as our test dataset, ensuring that they do not overlap with any answer pairs in our preference datasets across both stages. In our experiments, the threshold $\beta$ is set to 5, 6, and 7.
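A minimal sketch of this pair-selection step, assuming the reconstructed constraint forms above and hypothetical helpers `judge_honesty` and `score_overall` standing in for the LLM-as-a-Judge calls:

```python
def build_preference_sets(candidate_pairs, beta, judge_honesty, score_overall):
    """Split candidate (y1, y2) response pairs into the stage-one and
    stage-two preference sets D1 and D2 (a sketch, not our exact code)."""
    d1, d2 = [], []
    for y1, y2 in candidate_pairs:
        h1, h2 = judge_honesty(y1), judge_honesty(y2)
        s1, s2 = score_overall(y1), score_overall(y2)
        # Stage one: contrasting honesty labels, moderate score gap.
        if h1 and not h2 and s1 <= beta:
            d1.append((y1, y2))
        # Stage two: both honest; the threshold separates better from worse.
        elif h1 and h2 and s1 > beta >= s2:
            d2.append((y1, y2))
    return d1, d2
```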
As shown in Figure LABEL:fig:ex1_CD_improved, we significantly enhance the honesty rates of both open-source and proprietary LLMs by implementing our proposed training-free approach. For example, the honesty rates of GPT-4 and Claude3-Opus improved markedly to 100%, demonstrating near-perfect honesty alignment. Large open-source models such as Llama3-70b and Mixtral-8x7b also saw substantial increases, rising from 0.606 to 0.871 and from 0.585 to 0.914, respectively. Notably, Llama2-7b, a model with fewer parameters, exhibited a remarkable improvement from 0.430 to 0.837. In summary, the honesty rates of all evaluated models exceed 60% when implementing our curiosity-driven approach, demonstrating the efficacy of our method for constructing more honest LLMs.
To thoroughly evaluate the effectiveness of our two-stage fine-tuning, we compare the LLMs' performance across different training stages: raw (baseline), stage one only, stage two (proposed), and direct fine-tuning on a combined dataset from both stages. Each LLM's performance is assessed by honest-guided evaluation and the H2 assessment.
In this paper, we propose to prioritize the helpfulness of LLMs while preserving their honesty. Specifically, we first establish honesty principles to distinguish between LLM-able and LLM-unable questions. We further expand these to create the HoneSet dataset with LLM-unable queries across six categories. Subsequently, we address the issue of improving the honesty and helpfulness of LLMs in both training-free and fine-tuned settings. Experiments demonstrate significant improvements in honesty and helpfulness, validating the effectiveness of our methodology and paving the way for more reliable and trustworthy LLMs in real-world applications.
Depending on the stage or specific settings, the number of DPO fine-tuning epochs varied between 5 and 10. The number of epochs was determined by monitoring the evaluation loss, ensuring that it decreased steadily without overfitting. We selected the checkpoint with the minimum evaluation loss to ensure optimal model performance.
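This checkpoint-selection rule amounts to the following loop (a sketch assuming caller-supplied `train_one_epoch` and `evaluate` callables; it is not our actual training script):

```python
import copy

def train_with_min_eval_loss(model, train_one_epoch, evaluate, max_epochs=10):
    """Run up to `max_epochs` DPO epochs and restore the checkpoint
    with the lowest evaluation loss."""
    best_loss, best_state = float("inf"), None
    for _ in range(max_epochs):
        train_one_epoch(model)
        loss = evaluate(model)
        if loss < best_loss:
            best_loss = loss
            best_state = copy.deepcopy(model.state_dict())
    if best_state is not None:
        model.load_state_dict(best_state)  # minimum-eval-loss checkpoint
    return best_loss
```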
We define a new metric to measure the proportion of responses for which an LLM maintains honesty on our dataset. The metric is defined as follows:
\[
\text{Honesty Rate} = \frac{N_{\text{honest}}}{N_{\text{honest}} + N_{\text{dishonest}}} \tag{6}
\]

D.3 H2 Assessment Principle Explanation. The detailed explanation of the three principles for the H2 assessment, which are highly aligned with our definition of an honest LLM (one that strives to be maximally helpful on the premise of honesty), is as follows:
In our H2 assessment framework, we leverage LLM-as-a-Judge in both the pairwise and score settings:
To ensure the high quality and reliability of the HoneSet, seven human experts, including six undergraduates and one Ph.D. student, all with exemplary English proficiency, are engaged to refine the dataset. Their review process adheres to meticulously defined criteria:
Each category's data undergoes rigorous cross-evaluation by two experts to reinforce the integrity and thoroughness of the selection process.
For the category “Professional Capability in Specific Domain”, experts compile a challenging set of questions that LLMs are currently unable to resolve well. These span various fields, including medicine, computer science, physics, mathematics, chemistry, and economics, with each field contributing 30 distinct items designed to probe the depth and accuracy of LLM responses.
Each pair of texts was reviewed at least three times to ensure reliability. If a consensus (i.e., an option selected at least twice) was not reached among the three annotations, the pair was re-annotated. Using the results of these human annotations as the ground truth, we found that the GPT-4 judge achieved an accuracy (i.e., alignment with human annotators) of 91.43% on this subset. This high accuracy strongly demonstrates the efficacy of the LLM-as-a-Judge framework in our evaluation.
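Concretely, the consensus rule and the agreement metric reduce to the following (function names are illustrative, not from our codebase):

```python
from collections import Counter

def consensus(labels):
    """Majority option among the three annotations; None signals that
    the pair must be re-annotated (no option chosen at least twice)."""
    option, count = Counter(labels).most_common(1)[0]
    return option if count >= 2 else None

def judge_accuracy(judge_labels, human_labels):
    """Fraction of pairs on which the judge matches the human ground
    truth (91.43% for the GPT-4 judge on this subset)."""
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)
```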