Offline reinforcement learning for job

Recent advances in deep learning have shown significant potential for solving combinatorial optimization problems in real time. Unlike traditional methods, deep learning can generate high-quality solutions efficiently, which is crucial for applications like routing and scheduling. However, existing approaches like deep reinforcement learning (RL) and behavioral cloning (BC) have notable limitations: deep RL suffers from slow learning, and behavioral cloning relies solely on expert actions, which can lead to generalization issues and neglect of the optimization objective. This paper introduces a novel offline RL method designed for combinatorial optimization problems with complex constraints, where the state is represented as a heterogeneous graph and the action space is variable. Our approach encodes actions in edge attributes and balances expected rewards with the imitation of expert solutions. We demonstrate the effectiveness of this method on job-shop scheduling and flexible job-shop scheduling benchmarks, achieving superior performance compared to state-of-the-art techniques.

The main goal of this paper is to propose more efficient methods for solving combinatorial optimization problems that incorporate difficult constraints and require real-time solutions. For this purpose, we introduce a new offline RL method that considers problems where the state space is represented as a heterogeneous graph and the action space is variable. Scheduling problems, such as the job-shop scheduling problem (JSSP) and the flexible JSSP (FJSSP), usually have more constraints than routing problems due to the need to account for the sequential ordering of operations, machine availability, and processing times. As a case study, we have used these two scheduling problems to demonstrate the effectiveness of our approach. The contributions of this paper are summarized as follows:

In this section, we review literature on real-time scheduling solutions, graph-based RL, and offline RL algorithms to identify key research gaps.

To organize the approaches that address scheduling problems in real time, we will first explore RL-based techniques, beginning with the methods that solve the JSSP and then moving on to the FJSSP. Subsequently, methods using other approaches, such as BC or self-supervised learning, will be examined.

Scheduling problems involve different types of entities (operations, jobs, and machines), so most approaches model the problem as a graph. This makes the problem easier to represent, although it requires neural networks that can process this type of representation. The most common way to generate solutions is constructively, where solutions are built iteratively: at each step, an element is selected based on its characteristics. For instance, in job scheduling, the process can be viewed as sequentially assigning operations to machines.

However, many current offline RL applications do not incorporate graph structures into their state representations. Graphs, with their complex and high-dimensional nature, present unique challenges in defining state spaces and designing effective action strategies. This underscores the need for specialized offline RL approaches capable of handling graph-structured data and variable action spaces. Adapting to graph-based scenarios remains an area requiring significant further research.

In essence, the FJSSP combines two problems: a machine selection (routing) problem, where the most suitable machine is chosen for each operation, and a sequencing (scheduling) problem, where the sequence of operations on a machine needs to be determined. Given an assignment of operations to machines, the completion time of a job $j_i$ is defined as $C_{j_i}$, and the makespan of a schedule is defined as $C_{max} = \max_{j_i \in \mathcal{J}} C_{j_i}$, which is the most common objective to minimize.
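As a small illustration of the makespan definition, the sketch below computes $C_{j_i}$ and $C_{max}$ for a hand-made schedule. The data layout (a dictionary mapping each job to a list of (machine, start, duration) tuples) and the function names are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: job id -> list of (machine, start_time, duration).
Schedule = Dict[str, List[Tuple[str, float, float]]]


def job_completion_time(ops: List[Tuple[str, float, float]]) -> float:
    """C_{j_i}: the time at which the last operation of a job finishes."""
    return max(start + duration for _machine, start, duration in ops)


def makespan(schedule: Schedule) -> float:
    """C_max: the largest completion time over all jobs."""
    return max(job_completion_time(ops) for ops in schedule.values())


example = {
    "j1": [("m1", 0, 3), ("m2", 3, 2)],  # last operation ends at 5
    "j2": [("m2", 0, 3), ("m1", 3, 4)],  # last operation ends at 7
}
print(makespan(example))  # 7
```

A scheduler that minimizes the makespan is therefore minimizing the finishing time of the slowest job, not the sum of all completion times.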

In offline RL, there is a dataset $\mathcal{D}$ that contains tuples of states, actions, and rewards. This dataset is used to train the policy without further interaction with the environment, addressing the challenges of direct interaction. By leveraging offline data, offline RL avoids the risk and expense associated with deploying exploratory policies in real-world settings. The dataset $\mathcal{D}$ allows the agent to learn from a wide variety of experiences, including rare or unsafe states that might be difficult to encounter through online exploration.

BC is another approach that trains a policy by imitating an expert's actions. This type of imitation learning uses supervised learning to teach the policy to replicate actions from a dataset. The effectiveness of this method largely depends on the quality of the dataset used for training. While BC can be straightforward, it does not account for future rewards and may struggle with generalization to new situations.
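To make the supervised nature of BC concrete, the following sketch scores a set of candidate actions and penalizes the policy when it assigns low probability to the action the expert took. Using a softmax over discrete candidates with a cross-entropy loss is an assumption chosen for illustration; the function names are hypothetical, and the loss actually used in the paper may differ.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw policy scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bc_loss(scores: List[float], expert_index: int) -> float:
    """Cross-entropy between the policy distribution and the expert's choice."""
    probs = softmax(scores)
    return -math.log(probs[expert_index])


# The loss shrinks as the policy scores the expert's action (index 0) higher.
print(bc_loss([0.0, 0.0], expert_index=0) > bc_loss([2.0, 0.0], expert_index=0))  # True
```

Note that nothing in this loss refers to rewards: the policy is only pushed toward the expert's choice, which is the limitation described above.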

In this section, we present our novel offline RL algorithm designed for combinatorial optimization problems with heterogeneous graph representations and variable action spaces. First, we model the JSSP and FJSSP as MDPs, capturing the complex dependencies between jobs and machines through a graph-based state representation. Additionally, we introduce a method for generating diverse experiences to enhance the policy's ability to solve these problems efficiently in real time.

Before introducing our offline RL method, we describe how the JSSP and FJSSP have been modeled as an MDP. This modeling incorporates two key concepts:

The MDP is structured through the definition of the state and action spaces, reward function, and transition function as follows:

Action space. The action space $\mathcal{A}_t$ at each time step $t$ consists of feasible job-machine pairs. When a job is selected, its first unscheduled operation is chosen. To prevent an excessive number of choices, the action space is constrained by defining $t_e$ as the earliest time a machine can start a new operation and masking actions where the start time exceeds $t_e \times p$, where $p$ is a parameter slightly greater than one.
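The masking rule can be sketched as follows. This is a minimal illustration under stated assumptions: candidates are given as (job, machine, start_time) tuples, $t_e$ is taken to be the smallest start time among them, and the default value of `p` is arbitrary. None of these names or values come from the paper.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (job, machine, earliest start time)


def feasible_actions(candidates: List[Candidate], p: float = 1.05) -> List[Candidate]:
    """Keep only the job-machine pairs whose start time is at most t_e * p."""
    # Assumption: t_e is the earliest start time over all candidate pairs.
    t_e = min(start for _job, _machine, start in candidates)
    return [c for c in candidates if c[2] <= t_e * p]


candidates = [("j1", "m1", 10.0), ("j2", "m2", 10.4), ("j3", "m1", 12.0)]
print(feasible_actions(candidates))
# [('j1', 'm1', 10.0), ('j2', 'm2', 10.4)]
```

With $p$ close to one, only actions that start almost immediately survive the mask, which keeps the number of choices small at every step.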

Transition function. The solution is constructed incrementally by assigning operations to machines. At each step, the policy can make multiple assignments, subject to two constraints: at most one operation per job and at most one operation per machine can be assigned. In other words, multiple operations from the same job cannot be assigned in one step, since only the first available operation of a job can be scheduled. Similarly, multiple operations cannot be assigned to a single machine simultaneously.
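One way to picture a step with several assignments is a greedy pass over scored job-machine pairs that skips any pair whose job or machine has already been used. The greedy ordering by score and the function name are assumptions for illustration; the paper does not specify here how the simultaneous assignments are chosen.

```python
from typing import List, Tuple


def select_assignments(scored_pairs: List[Tuple[float, str, str]]) -> List[Tuple[str, str]]:
    """Pick non-conflicting (job, machine) pairs, highest score first."""
    used_jobs, used_machines = set(), set()
    chosen = []
    for _score, job, machine in sorted(scored_pairs, key=lambda x: x[0], reverse=True):
        # Enforce at most one operation per job and per machine in this step.
        if job in used_jobs or machine in used_machines:
            continue
        chosen.append((job, machine))
        used_jobs.add(job)
        used_machines.add(machine)
    return chosen


pairs = [(0.9, "j1", "m1"), (0.8, "j1", "m2"), (0.7, "j2", "m1"), (0.6, "j2", "m2")]
print(select_assignments(pairs))  # [('j1', 'm1'), ('j2', 'm2')]
```

In this example a single call yields two assignments, so the model would be queried once instead of twice.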

Once an operation is assigned, it is removed from the graph, and the edges of the corresponding job are updated to reflect the next operation to be processed. Additionally, the features of the remaining nodes are updated, and a new operation is added to the graph if there are pending tasks. The reason for allowing multiple assignments at once is to reduce the number of times the model is used to generate a solution, as repeatedly calling the model can become problematic in larger instances, especially when real-time performance is required.

where $a$ and $\pi(s)$ are continuous matrices in $\mathbb{R}^{n}$ ranging from $-\infty$ to $\infty$.

In this equation, the parameter $\lambda$ adjusts the weight between maximizing the Q-value and minimizing the difference between the policy's actions and those from the dataset. This balancing act is critical for ensuring that the policy not only seeks high rewards but also remains grounded in the expert data, thus enhancing its generalization capabilities.
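The description above matches the policy objective of TD3+BC. Assuming that is the formulation meant here, with $a$ the dataset action, $\pi(s)$ the policy output, and $\lambda$ the weighting parameter, it can be written as:

$$
\pi = \arg\max_{\pi} \; \mathbb{E}_{(s,a)\sim\mathcal{D}} \left[ \lambda \, Q(s, \pi(s)) - \left( \pi(s) - a \right)^{2} \right]
$$

The first term rewards actions with a high Q-value, while the squared difference penalizes actions that drift away from those recorded in $\mathcal{D}$.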

The revised objective function is:

where $\lambda_{RL}$ and $\lambda_{BC}$ are adjustable parameters that control the influence of the reward maximization and behavior cloning terms, respectively. By fine-tuning these parameters, we can achieve a balanced approach that optimizes both the policy's performance and its adherence to expert behavior.
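The role of the two weights can be sketched with a batch-level loss. This is only an illustration under the assumption that the objective combines a Q-value term scaled by $\lambda_{RL}$ with a behavior cloning loss scaled by $\lambda_{BC}$; the function name, the use of batch means, and the default values are hypothetical.

```python
from typing import List


def policy_loss(
    q_values: List[float],
    bc_losses: List[float],
    lambda_rl: float = 1.0,
    lambda_bc: float = 1.0,
) -> float:
    """Loss to minimize: low when Q-values are high and the BC error is small."""
    mean_q = sum(q_values) / len(q_values)
    mean_bc = sum(bc_losses) / len(bc_losses)
    # Maximizing the Q-value is expressed as minimizing its negative.
    return -lambda_rl * mean_q + lambda_bc * mean_bc


print(policy_loss([1.0, 3.0], [0.5, 0.5], lambda_rl=1.0, lambda_bc=2.0))  # -1.0
```

Setting `lambda_rl` to zero reduces this sketch to pure behavior cloning, and setting `lambda_bc` to zero leaves only the reward maximization term.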

One of the significant challenges in applying this algorithm to combinatorial optimization problems modeled as graphs, such as those encountered in scheduling or routing, lies in the computation of the Q-value $Q(s, \pi(s))$. Unlike traditional environments where state and action spaces are fixed, graphs present a variable action space, making it difficult to apply standard neural network architectures directly.

To overcome this, we propose integrating the action information as an edge attribute within the graph structure. Specifically, in our scenario, where the nodes represent operations and machines in a scheduling problem, we already have edges linking these nodes with relevant attributes. By concatenating the action-related information with these existing attributes, we can preserve the flexibility of the graph representation while ensuring that the policy can effectively learn and apply the Q-value function.
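The idea can be sketched with plain dictionaries: each edge already carries a feature list, and the action value for that edge is appended to it. The edge keys, the feature values, and the default of 0.0 for edges without an action value are assumptions for this example, not details from the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (operation, machine)


def add_action_to_edges(
    edge_attrs: Dict[Edge, List[float]],
    action: Dict[Edge, float],
) -> Dict[Edge, List[float]]:
    """Append the action value of each edge to its existing attributes."""
    # Assumption: edges the policy did not score receive 0.0.
    return {edge: attrs + [action.get(edge, 0.0)] for edge, attrs in edge_attrs.items()}


edge_attrs = {("o1", "m1"): [4.0, 0.2], ("o1", "m2"): [6.0, 0.5], ("o2", "m1"): [3.0, 0.1]}
action = {("o1", "m1"): 0.9, ("o2", "m1"): 0.3}
print(add_action_to_edges(edge_attrs, action))
# {('o1', 'm1'): [4.0, 0.2, 0.9], ('o1', 'm2'): [6.0, 0.5, 0.0], ('o2', 'm1'): [3.0, 0.1, 0.3]}
```

The number of edges can change from one state to the next, but every edge keeps the same feature width, so a graph network acting as the Q-function can consume the state and action together without a fixed-size action vector.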

Table 8: Average computation time for the Taillard benchmark.

| Method | 15×15 | 20×15 | 20×20 | 30×15 | 30×20 | 50×15 | 50×20 | 100×20 | Mean |
|--------|-------|-------|-------|-------|-------|-------|-------|--------|------|
| BiSch  | 0.78  | 1.04  | 1.39  | 1.62  | 2.49  | 3.68  | 6.41  | 30.35  | 5.97 |
| ResSch | 0.50  | 0.89  | 0.97  | 19.93 | 2.27  | 4.75  | 5.93  | 19.76  | 4.63 |
| RLCP   | 5.60  | 6.32  | 7.29  | 8.99  | 10.90 | 16.96 | 20.64 | 63.75  | 18.81 |
| SPN    | 0.19  | 0.28  | 0.37  | 0.40  | 0.54  | 0.80  | 0.91  | 2.10   | 0.69 |
| L2S    | 9.30  | 10.10 | 10.90 | 12.70 | 14.00 | 16.20 | 22.80 | 50.20  | 18.28 |
| H-ORL  | 3.25  | 3.95  | 5.28  | 5.67  | 6.95  | 12.45 | 14.22 | 60.87  | 10.08 |

Appendix A: Node and edge features

In this appendix, we detail the features of the nodes and edges in the state representation.

For job-type and operation-type nodes, the features are:

For machine-type nodes, the features are:

Edges in the graph are characterized as follows:

Only two edge types carry specific features: operation-machine and job-machine edges. Features for operation-machine edges include:

For job-machine edges, features are analogous, focusing on the delay or gap caused by machine waiting times between operations, leading to idle time.

For the FJSSP, open-source implementations of the methods were used (except for ResSch), but we were unable to obtain times for LMLP or GGCT, as they did not publish inference times or provide their implementations. In this case, there are no major differences between the methods, since all of them utilize similar types of neural networks and model the problem in a comparable way.
