Recent advances in deep learning have shown significant potential for solving combinatorial optimization problems in real time. Unlike traditional methods, deep learning can generate high-quality solutions efficiently, which is crucial for applications like routing and scheduling. However, existing approaches such as deep reinforcement learning (RL) and behavioral cloning have notable limitations: deep RL suffers from slow learning, while behavioral cloning relies solely on expert actions, which can lead to generalization issues and neglect of the optimization objective. This paper introduces a novel offline RL method designed for combinatorial optimization problems with complex constraints, where the state is represented as a heterogeneous graph and the action space is variable. Our approach encodes actions in edge attributes and balances expected rewards with the imitation of expert solutions. We demonstrate the effectiveness of this method on job-shop scheduling and flexible job-shop scheduling benchmarks, achieving superior performance compared to state-of-the-art techniques.
The main goal of this paper is to propose more efficient methods for solving combinatorial optimization problems that incorporate difficult constraints and require real-time solutions. For this purpose, we introduce a new offline RL method that considers problems where the state space is represented as a heterogeneous graph and the action space is variable. Scheduling problems, such as the job-shop scheduling problem (JSSP) and the flexible JSSP (FJSSP), usually have more constraints than routing problems due to the need to account for the sequential ordering of operations, machine availability, and processing times. As a case study, we have used these two scheduling problems to demonstrate the effectiveness of our approach. The contributions of this paper are summarized as follows:
In this section, we review literature on real-time scheduling solutions, graph-based RL, and offline RL algorithms to identify key research gaps.
We organize the approaches that address scheduling problems in real time as follows. We first review RL-based techniques, beginning with the methods that solve the JSSP and then moving on to the FJSSP. Subsequently, we examine methods that use other approaches, such as BC or self-supervised learning.
Scheduling problems involve different types of entities (operations, jobs, and machines), so most approaches model the problem as a graph. This facilitates effective problem modeling, but it requires specific neural networks that can process this type of representation. The most common way to generate solutions is constructive: a solution is built iteratively, and at each step an element is selected based on its characteristics. For instance, in job scheduling, the process can be viewed as sequentially assigning operations to machines.
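The constructive procedure described above can be illustrated with a minimal sketch. The instance format, the function name `construct_schedule`, and the selection rule (earliest possible start, ties broken by shortest processing time) are assumptions made for illustration only; the reviewed methods replace this fixed rule with a learned policy.

```python
from typing import Dict, List, Tuple

# Hypothetical JSSP instance format: each job is an ordered list of
# (machine id, processing time) pairs.
Job = List[Tuple[int, float]]
Entry = Tuple[int, int, float, float]  # (job, machine, start, end)


def construct_schedule(jobs: List[Job]) -> Tuple[List[Entry], float]:
    """Build a schedule one operation at a time and return it with its makespan."""
    next_op = [0] * len(jobs)      # first unscheduled operation of each job
    job_ready = [0.0] * len(jobs)  # end time of each job's previous operation
    machine_ready: Dict[int, float] = {}
    schedule: List[Entry] = []
    remaining = sum(len(ops) for ops in jobs)
    while remaining:
        best = None
        for j, ops in enumerate(jobs):
            if next_op[j] == len(ops):
                continue  # job already finished
            machine, duration = ops[next_op[j]]
            start = max(job_ready[j], machine_ready.get(machine, 0.0))
            key = (start, duration, j)  # assumed rule: earliest start, then shortest
            if best is None or key < best[0]:
                best = (key, j, machine, start, duration)
        _, j, machine, start, duration = best
        end = start + duration
        schedule.append((j, machine, start, end))
        job_ready[j] = end
        machine_ready[machine] = end
        next_op[j] += 1
        remaining -= 1
    return schedule, max(entry[3] for entry in schedule)


jobs = [[(0, 3.0), (1, 2.0)], [(1, 2.0), (0, 4.0)]]
schedule, cmax = construct_schedule(jobs)
print(cmax)  # 7.0
```

Each loop iteration corresponds to one constructive step: the candidates are the first unscheduled operations of the unfinished jobs, and exactly one of them is assigned.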
However, many current offline RL applications do not incorporate graph structures into their state representations. Graphs, with their complex and high-dimensional nature, present unique challenges in defining state spaces and designing effective action strategies. This underscores the need for specialized offline RL approaches capable of handling graph-structured data and variable action spaces. Adapting to graph-based scenarios remains an area requiring significant further research.
In essence, the FJSSP combines two problems: a machine selection (or routing) problem, in which the most suitable machine is chosen for each operation, and a sequencing (or scheduling) problem, in which the sequence of operations on each machine needs to be determined. Given an assignment of operations to machines, the completion time of a job $j_{i}$ is denoted $C_{j_{i}}$, and the makespan of a schedule is defined as $C_{max}=\max\limits_{j_{i}\in\mathcal{J}}C_{j_{i}}$, which is the most common objective to minimize.
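As a small worked example of the makespan definition, the sketch below computes each job's completion time and then takes the maximum over jobs. The schedule format and the helper names `completion_times` and `makespan` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical schedule entry: (job id, machine id, start time, processing time).
Operation = Tuple[str, str, float, float]


def completion_times(schedule: List[Operation]) -> Dict[str, float]:
    """Map each job to its completion time, i.e. the end of its last operation."""
    times: Dict[str, float] = {}
    for job, _machine, start, duration in schedule:
        times[job] = max(times.get(job, 0.0), start + duration)
    return times


def makespan(schedule: List[Operation]) -> float:
    """C_max: the largest completion time over all jobs."""
    return max(completion_times(schedule).values())


schedule = [
    ("j1", "m1", 0.0, 3.0),
    ("j2", "m2", 0.0, 2.0),
    ("j1", "m2", 3.0, 2.0),
    ("j2", "m1", 3.0, 4.0),
]
print(completion_times(schedule))  # {'j1': 5.0, 'j2': 7.0}
print(makespan(schedule))          # 7.0
```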
In offline RL, there is a dataset $\mathcal{D}$ that contains tuples of states, actions, and rewards. This dataset is used to train the policy without further interaction with the environment, addressing the challenges of direct interaction. By leveraging offline data, offline RL avoids the risk and expense associated with deploying exploratory policies in real-world settings. The dataset $\mathcal{D}$ allows the agent to learn from a wide variety of experiences, including rare or unsafe states that might be difficult to encounter through online exploration.
BC is another approach that trains a policy by imitating an expert’s actions. This type of imitation learning uses supervised learning to teach the policy to replicate actions from a dataset. The effectiveness of this method largely depends on the quality of the dataset used for training. While BC can be straightforward, it does not account for future rewards and may struggle with generalization to new situations.
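To make the supervised nature of BC concrete, the following sketch computes a typical imitation loss for a single state: the negative log-likelihood of the expert's action under a softmax over the policy's scores. The softmax parameterization and the function name `bc_loss` are assumptions for illustration; note that no reward appears anywhere in the loss.

```python
import math
from typing import List


def bc_loss(logits: List[float], expert_index: int) -> float:
    """Negative log-likelihood of the expert action under a softmax policy."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[expert_index]


# Hypothetical scores for three feasible actions; the expert chose action 0.
print(bc_loss([2.0, 0.0, 0.0], expert_index=0))
print(bc_loss([0.0, 2.0, 0.0], expert_index=0))  # larger: the policy disagrees
```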
In this section, we present our novel offline RL algorithm designed for combinatorial optimization problems with heterogeneous graph representations and variable action spaces. First, we model the JSSP and FJSSP as MDPs, capturing the complex dependencies between jobs and machines through a graph-based state representation. Additionally, we introduce a method for generating diverse experiences to enhance the policy’s ability to solve these problems efficiently in real-time.
Before introducing our offline RL method, we describe how the JSSP and FJSSP have been modeled as an MDP. This modeling incorporates two key concepts:
The MDP is structured through the definition of the state and action spaces, reward function, and transition function as follows:
Action space. The action space $\mathcal{A}_{t}$ at each time step $t$ consists of feasible job-machine pairs. When a job is selected, its first unscheduled operation is chosen. To prevent an excessive number of choices, the action space is constrained by defining $t_{e}$ as the earliest time a machine can start a new operation and masking actions where the start time exceeds $t_{e}\times p$, where $p$ is a parameter slightly greater than one.
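The masking rule can be sketched as follows. The sketch assumes that $t_{e}$ is taken as the minimum start time over the currently feasible job-machine pairs; the function name `mask_actions`, the input format, and the default value of `p` are hypothetical.

```python
from typing import Dict, List, Tuple

Action = Tuple[int, int]  # (job index, machine index)


def mask_actions(start_times: Dict[Action, float], p: float = 1.1) -> List[Action]:
    """Keep only the actions whose start time does not exceed t_e * p."""
    if not start_times:
        return []
    # Assumption: t_e is the earliest start time among the feasible pairs.
    t_e = min(start_times.values())
    threshold = t_e * p
    return [action for action, start in start_times.items() if start <= threshold]


# Hypothetical earliest start times for three feasible job-machine pairs.
start_times = {(0, 0): 10.0, (1, 0): 10.5, (2, 1): 14.0}
print(mask_actions(start_times, p=1.1))  # [(0, 0), (1, 0)]
```

With $p$ only slightly greater than one, the policy chooses among actions that can start almost as early as the earliest one, which keeps the number of candidate edges small.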
Transition function. The solution is constructed incrementally by assigning operations to machines. At each step, the policy can make multiple assignments, subject to two constraints: at most one operation per job and at most one operation per machine can be assigned. In other words, multiple operations from the same job cannot be assigned in a single step, since only the first available operation of a job can be scheduled. Similarly, multiple operations cannot be assigned to a single machine simultaneously.
Once an operation is assigned, it is removed from the graph, and the edges of the corresponding job are updated to reflect the next operation to be processed. Additionally, the features of the remaining nodes are updated, and a new operation is added to the graph if there are pending tasks. The reason for allowing multiple assignments at once is to reduce the number of times the model is used to generate a solution, as repeatedly calling the model can become problematic in larger instances, especially when real-time performance is required.
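A minimal sketch of how several assignments could be extracted from a single model call while respecting the two constraints is given below. The greedy selection by descending score is an assumption made for illustration, as are the function name `select_assignments` and the score format; the text does not specify how the assignments are picked.

```python
from typing import Dict, List, Tuple

Action = Tuple[int, int]  # (job index, machine index)


def select_assignments(scores: Dict[Action, float]) -> List[Action]:
    """Pick job-machine pairs by descending score, using each job and machine once."""
    chosen: List[Action] = []
    used_jobs, used_machines = set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (job, machine), _score in ranked:
        if job in used_jobs or machine in used_machines:
            continue  # would violate the per-job or per-machine constraint
        chosen.append((job, machine))
        used_jobs.add(job)
        used_machines.add(machine)
    return chosen


# Hypothetical policy scores for the feasible pairs of one step.
scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.4, (2, 1): 0.6}
print(select_assignments(scores))  # [(0, 0), (2, 1)]
```

In this example one model call yields two assignments, so a full solution needs fewer forward passes than assigning a single operation per call.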
where $a$ and $\pi(s)$ are continuous matrices in $\mathbb{R}^{n}$ ranging from $-\infty$ to $\infty$.
In this equation, the parameter $\lambda$ adjusts the weight between maximizing the Q-value and minimizing the difference between the policy’s actions and those from the dataset. This balancing act is critical for ensuring that the policy not only seeks high rewards but also remains grounded in the expert data, thus enhancing its generalization capabilities.
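For reference, one objective consistent with this description can be written as below. This is a sketch under stated assumptions: the imitation term is taken to be the squared difference between $\pi(s)$ and the dataset action $a$, and the expectation is over pairs $(s,a)$ drawn from $\mathcal{D}$. The exact form used in the original equation may differ.

```latex
\pi = \arg\max_{\pi} \; \mathbb{E}_{(s,a)\sim\mathcal{D}}
      \left[ \lambda\, Q\big(s,\pi(s)\big) - \big(\pi(s) - a\big)^{2} \right]
```

A large $\lambda$ emphasizes the Q-value term, while a small $\lambda$ keeps the policy close to the actions in the dataset.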
The revised objective function is:
where $\lambda_{RL}$ and $\lambda_{BC}$ are adjustable parameters that control the influence of the reward maximization and behavior cloning terms, respectively. By fine-tuning these parameters, we can achieve a balanced approach that optimizes both the policy’s performance and its adherence to expert behavior.
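The role of the two weights can be illustrated with a short sketch that combines precomputed per-sample Q-values and imitation losses. The sign convention (an objective to be maximized), the batch averaging, and the function name `combined_objective` are assumptions; the form of the imitation loss itself is deliberately left as an input.

```python
from typing import Sequence


def combined_objective(
    q_values: Sequence[float],
    bc_losses: Sequence[float],
    lambda_rl: float,
    lambda_bc: float,
) -> float:
    """Batch mean of lambda_rl * Q(s, pi(s)) - lambda_bc * BC loss (to be maximized)."""
    if not q_values or len(q_values) != len(bc_losses):
        raise ValueError("q_values and bc_losses must be non-empty and of equal length")
    total = sum(lambda_rl * q - lambda_bc * loss for q, loss in zip(q_values, bc_losses))
    return total / len(q_values)


# Hypothetical per-sample values for a batch of two transitions.
print(combined_objective([2.0, 4.0], [0.5, 1.5], lambda_rl=0.5, lambda_bc=1.0))  # 0.5
print(combined_objective([2.0, 4.0], [0.5, 1.5], lambda_rl=0.0, lambda_bc=1.0))  # -1.0
```

Setting `lambda_rl` to zero reduces the objective to pure imitation, as in the second call, while increasing it gives more weight to the estimated return.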
One of the significant challenges in applying this algorithm to combinatorial optimization problems modeled as graphs, such as those encountered in scheduling or routing, lies in the computation of the Q-value $Q(s,\pi(s))$. Unlike traditional environments where state and action spaces are fixed, graphs present a variable action space, making it difficult to apply standard neural network architectures directly.
To overcome this, we propose integrating the action information as an edge attribute within the graph structure. Specifically, in our scenario, where the nodes represent operations and machines in a scheduling problem, we already have edges linking these nodes with relevant attributes. By concatenating the action-related information with these existing attributes, we can preserve the flexibility of the graph representation while ensuring that the policy can effectively learn and apply the Q-value function.
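The concatenation step can be sketched as follows. The edge naming, the feature values, the default of `0.0` for edges without an action value, and the function name `attach_action` are hypothetical; the sketch only shows that the action enters the graph as one extra entry per edge.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (operation node, machine node)


def attach_action(
    edge_attrs: Dict[Edge, List[float]],
    action: Dict[Edge, float],
) -> Dict[Edge, List[float]]:
    """Append the action value of each edge to its existing attribute vector."""
    # Assumption: edges without an entry in `action` receive the value 0.0.
    return {edge: feats + [action.get(edge, 0.0)] for edge, feats in edge_attrs.items()}


# Hypothetical operation-machine edges with two existing features each.
edge_attrs = {
    ("o1", "m1"): [3.0, 0.5],
    ("o1", "m2"): [4.0, 0.0],
    ("o2", "m1"): [2.0, 1.0],
}
action = {("o1", "m1"): 1.0}
augmented = attach_action(edge_attrs, action)
print(augmented[("o1", "m1")])  # [3.0, 0.5, 1.0]
print(augmented[("o2", "m1")])  # [2.0, 1.0, 0.0]
```

Because every edge gains exactly one extra feature, the attribute dimension stays fixed no matter how many edges (and therefore actions) the current state contains, which is what allows a graph-based Q-network to handle a variable action space.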
Appendix A Node and edge features

Table 8: Average computation time for the Taillard benchmark.

| Method | 15×15 | 20×15 | 20×20 | 30×15 | 30×20 | 50×15 | 50×20 | 100×20 | Mean |
|--------|-------|-------|-------|-------|-------|-------|-------|--------|------|
| BiSch  | 0.78  | 1.04  | 1.39  | 1.62  | 2.49  | 3.68  | 6.41  | 30.35  | 5.97 |
| ResSch | 0.50  | 0.89  | 0.97  | 19.93 | 2.27  | 4.75  | 5.93  | 19.76  | 4.63 |
| RLCP   | 5.60  | 6.32  | 7.29  | 8.99  | 10.90 | 16.96 | 20.64 | 63.75  | 18.81 |
| SPN    | 0.19  | 0.28  | 0.37  | 0.40  | 0.54  | 0.80  | 0.91  | 2.10   | 0.69 |
| L2S    | 9.30  | 10.10 | 10.90 | 12.70 | 14.00 | 16.20 | 22.80 | 50.20  | 18.28 |
| H-ORL  | 3.25  | 3.95  | 5.28  | 5.67  | 6.95  | 12.45 | 14.22 | 60.87  | 10.08 |

In this appendix, we detail the features of the nodes and edges in the state representation.
For job-type and operation-type nodes, the features are:
For machine-type nodes, the features are:
Edges in the graph are characterized as follows:
Only two edge types carry specific features: operation-machine and job-machine edges. Features for operation-machine edges include:
For job-machine edges, features are analogous, focusing on the delay or gap caused by machine waiting times between operations, leading to idle time.
For the FJSSP, open-source implementations of the methods were used (except for ResSch), but we were unable to obtain times for LMLP or GGCT, as they did not publish inference times or provide their implementations. In this case, there are no major differences between the methods, since all of them use similar types of neural networks and model the problem in a comparable way.