The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between runtime and allocation performance.
We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links.
We evaluate Teal using traffic matrices from Microsoft’s WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
Unfortunately, off-the-shelf deep learning models do not directly apply to TE. First, standard fully-connected neural networks fail to take into account the effects of WAN connectivity on traffic allocations, yielding solutions that are far from optimal. Second, the escalating scale of the TE problem makes it intractable to train a monolithic model to navigate the high-dimensional solution space. Finally, neural networks are unable to enforce constraints, leading to unnecessary traffic drops due to exceeded link capacities.
To address these challenges, we present a learning-accelerated TE scheme named Teal. First, Teal constructs a flow-centric graph neural network (GNN) to capture WAN topology and extract informative features from traffic flows for the downstream allocation task. Next, Teal allocates each demand individually using a shared policy (neural) network based on the learned features. Doing so reduces the problem scale from global traffic allocation to per-demand tasks, making the learning process more tractable (in a low-dimensional space) and feasible (fitting into GPU memory). To coordinate the independent allocations of demands and avoid contention for links, Teal leverages multi-agent reinforcement learning (RL) to train the end-to-end model (GNN and policy network) toward optimizing a central TE objective. Finally, Teal fine-tunes the model’s output allocations using ADMM (alternating direction method of multipliers), a highly parallelizable constrained optimization algorithm well suited for reducing constraint violations, such as oversubscribed links, and enhancing solution quality.
In their early years, cloud WANs only consisted of tens of datacenters, so it was feasible for commercial LP solvers to compute traffic allocations frequently. However, the rapid deployment of new datacenters has rendered the TE task prohibitively slow at scale, requiring hours for commercial solvers to allocate traffic on WANs with thousands of nodes. Consequently, WAN operators are seeking to accelerate TE optimization to keep pace with the growing size of the WAN.
To cope with the growing scale of TE, we argue that with judicious design, machine learning (ML) can significantly accelerate TE optimization. By training on a vast amount of historical traffic data, ML-based TE schemes also have the potential to attain near-optimal allocation performance.
Unlocking massive parallelism. Encoding a TE algorithm in neural networks transforms the traditionally iterative TE optimization (LP solvers or combinatorial algorithms) into the inference process of neural networks, where the input data (e.g., traffic demands on a WAN topology) is propagated in the forward direction through the neural network to compute the output (e.g., traffic splits on the preconfigured paths). This inference process unlocks massive parallelism because it mainly consists of highly parallelizable operations such as matrix multiplications.
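To illustrate this point (and not Teal's actual architecture), the following sketch allocates a large batch of demands in a single forward pass of a toy PyTorch model; the layer sizes, feature dimension, and number of candidate paths are placeholder assumptions.

```python
import torch

# Hypothetical sizes: 10,000 demands, 16 input features, 4 candidate paths each.
num_demands, feat_dim, num_paths = 10_000, 16, 4
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for a trained TE model: two dense layers followed by a softmax,
# so each demand receives split ratios over its candidate paths.
model = torch.nn.Sequential(
    torch.nn.Linear(feat_dim, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, num_paths),
    torch.nn.Softmax(dim=-1),
).to(device)

# One feature row per demand (e.g., demand volume plus learned features).
demand_features = torch.randn(num_demands, feat_dim, device=device)

# A single forward pass allocates every demand at once; the work is dominated
# by dense matrix multiplications that the GPU executes in parallel.
with torch.no_grad():
    traffic_splits = model(demand_features)  # shape: (num_demands, num_paths)
```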
Exploiting abundant training data. Operational WANs generate an abundance of traffic data that can be used to train neural networks. A carefully designed ML-based TE scheme is capable of discovering regularities in the training data (such as patterns in graph connectivity, link capacities, and traffic demands) and ultimately learns to optimize allocation performance with respect to operator-specified objectives.
While holding the promise of near-optimal performance and substantial speedup relative to LP-based TE methods, deep learning is not a panacea. In fact, using ML for TE optimization is not as straightforward as it may appear.
In this section, we present the design of Teal (Traffic Engineering Accelerated with Learning). The goal of Teal is to train a fast and scalable TE scheme through deep learning while achieving near-optimal traffic allocation on large-scale topologies. The rationale behind using deep learning is to harness the massive parallelism and hardware acceleration unlocked by neural networks. Moreover, every component of Teal is carefully designed to be parallelizable (fast on GPUs) and scalable (performant on large WANs).
For each WAN topology, Teal trains its “model” (FlowGNN and policy network) end to end to optimize an operator-provided TE objective; ADMM requires no training. All three key components of Teal (FlowGNN, multi-agent RL, and ADMM) are carefully designed to be highly parallelizable, enabling fast computation and scalable TE performance as the size of the WAN topology grows.
Despite the strengths of GNNs, the primary focus of TE is the assignment of flows, in which each flow originating from a demand follows a predefined path along a chain of network links (edges). TE is concerned with the interference between flows as they compete for link capacities. Hence, we put a spotlight on flows and explicitly represent flow-related entities (edges and paths) as the nodes in our TE-specific GNN, which we call FlowGNN.
Due to the absence of connections between PathNodes, the GNN layer is unable to make each PathNode aware of the other PathNodes associated with the same demand. To address this, we add a DNN layer after each GNN layer to coordinate flows (represented by their PathNodes) that stem from the same demand. The DNN layer, a fully-connected neural network, essentially transforms and updates the embeddings of the related PathNodes (e.g., $V^{1234}$, $V^{124}$, $V^{134}$, and $V^{1324}$ for the demand from node #1 to node #4). Specifically, these embeddings are fed into the DNN layer to obtain an equal number of updated embeddings, which are then stored back into the respective PathNodes.
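As a minimal sketch of this coordination step (not the paper's exact implementation; the embedding width and the fixed number of paths per demand are assumptions), the PathNode embeddings of each demand are concatenated, jointly transformed by a fully-connected layer, and written back:

```python
import torch

embed_dim, paths_per_demand = 32, 4  # assumed sizes

# Fully-connected layer that jointly transforms all PathNode embeddings of one
# demand, so each updated embedding reflects its sibling paths.
coord_layer = torch.nn.Linear(paths_per_demand * embed_dim,
                              paths_per_demand * embed_dim)

def coordinate_paths(path_embeddings: torch.Tensor) -> torch.Tensor:
    """path_embeddings: (num_demands, paths_per_demand, embed_dim) from a GNN layer."""
    num_demands = path_embeddings.shape[0]
    flat = path_embeddings.reshape(num_demands, -1)   # concatenate per demand
    updated = coord_layer(flat)                       # joint transformation
    return updated.reshape(num_demands, paths_per_demand, embed_dim)

# Example: updated embeddings for 100 demands, each with 4 candidate paths.
updated = coordinate_paths(torch.randn(100, paths_per_demand, embed_dim))
```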
Given the flow embeddings generated by FlowGNN as feature inputs, Teal creates a policy network to map these embeddings to traffic splits on the corresponding paths, materializing flow allocation. The FlowGNN and policy network constitute the “model” of Teal, which is trained end to end to optimize an operator-specified TE objective.
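For illustration, a minimal version of such a policy head might look as follows: it scores each path of a demand from its flow embedding and normalizes the scores into split ratios with a softmax. The layer sizes are assumptions, and the softmax (which always routes the full demand) may differ from Teal's actual output parameterization.

```python
import torch

embed_dim = 32  # assumed width of the FlowGNN flow embeddings

# Shared policy head: scores one path from its flow embedding; applied to every
# path of every demand, then normalized per demand into split ratios.
policy_head = torch.nn.Sequential(
    torch.nn.Linear(embed_dim, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

def allocate(path_embeddings: torch.Tensor) -> torch.Tensor:
    """path_embeddings: (num_demands, paths_per_demand, embed_dim)."""
    scores = policy_head(path_embeddings).squeeze(-1)  # (num_demands, paths_per_demand)
    return torch.softmax(scores, dim=-1)               # split ratios sum to 1 per demand

splits = allocate(torch.randn(100, 4, embed_dim))
```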
Despite the benefits, allocating each demand independently can result in a lack of coordination unless the policy network is trained, along with FlowGNN, to be aware of a central TE objective. This raises the question: what learning algorithm is suitable for training Teal’s model, which generates local traffic splits for each demand while optimizing a global TE objective? To address this question, we discuss several candidates below before landing on multi-agent RL.
Supervised learning: In an offline setting, LP solvers such as Gurobi can be used to compute the optimal traffic allocations, thereby providing ground-truth traffic splits for Teal’s model to learn using standard supervised learning. Nevertheless, generating these ground-truth labels for large WANs can be excessively time-consuming and incur substantial memory usage. E.g., Gurobi requires 5.5 hours to find the optimal allocations for a single traffic matrix on a 1,739-node topology, while consuming up to 18.1 GB of memory.
ADMM is well-suited to Teal for two reasons. First, unlike the widely used optimization methods in LP solvers, such as simplex and interior-point methods, ADMM does not require a constraint-satisfying solution to begin with. Instead, ADMM allows starting from a constraint-violating point (as Teal’s neural networks might output) and iteratively moves toward the feasible region. Second, ADMM is highly parallelizable because the minimization of the augmented Lagrangian can be decomposed into many subproblems, each solving for a single variable, e.g., created for each path or edge in TE. These subproblems can be solved in parallel and benefit from significant acceleration on GPUs.
Additionally, we note that using ADMM alone does not accelerate TE optimization. This is because when initialized randomly, ADMM still requires an excessive number of iterations to converge to an acceptable solution, forfeiting its fast speed within each iteration. In contrast, Teal’s neural networks can warm-start ADMM with a reasonably good solution, allowing ADMM to perform fine-tuning and attain a noticeable improvement in several iterations with a negligible impact on the overall runtime.
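Teal performs this fine-tuning with ADMM; as a simpler stand-in that conveys the warm-start idea, the sketch below applies a few projected-gradient steps to a capacity-violation penalty, starting from a (possibly infeasible) neural-network output. The tensor layout, step count, and penalty weight are assumptions.

```python
import torch

def fine_tune(splits, demands, path_to_edges, capacities,
              steps=5, lr=0.05, penalty=10.0):
    """Nudge a warm-start allocation toward feasibility (projected gradient, not ADMM).

    splits:        (num_demands, paths_per_demand) initial split ratios (may violate capacities)
    demands:       (num_demands,) demand volumes
    path_to_edges: (num_demands, paths_per_demand, num_edges) 0/1 path-edge incidence
    capacities:    (num_edges,) link capacities
    """
    x = splits.clone().requires_grad_(True)
    for _ in range(steps):
        # Traffic placed on each link by the current allocation.
        edge_load = torch.einsum("dp,d,dpe->e", x, demands, path_to_edges)
        overuse = torch.relu(edge_load - capacities)           # per-link overutilization
        # Maximize routed traffic while penalizing capacity violations.
        loss = -(x.sum(dim=1) * demands).sum() + penalty * overuse.sum()
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad
            x.clamp_(0.0, 1.0)                                 # split ratios stay in [0, 1]
            x /= x.sum(dim=1, keepdim=True).clamp(min=1.0)     # per-demand splits sum to <= 1
            x.grad.zero_()
    return x.detach()
```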
Traffic data. We collect traffic data over a 20-day period in 2022 from SWAN, the production inter-datacenter WAN at Microsoft. The total traffic observed in each 5-minute interval between every source-destination pair is considered as their demand. To translate these traffic demands from SWAN to other topologies, we map each new node pair to a random node pair in SWAN, and randomly sample disjoint sequences of traffic matrices, with 700 consecutive intervals for training, 100 for validation, and 200 for testing.
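As a hedged sketch of this data split (the mapping from new node pairs to random SWAN node pairs is omitted, and the function name and array layout are our own), one way to sample the disjoint consecutive sequences is:

```python
import numpy as np

def split_traffic_matrices(matrices, train=700, val=100, test=200, seed=0):
    """Sample disjoint consecutive sequences of traffic matrices.

    matrices: array of shape (num_intervals, n, n), one 5-minute demand matrix per interval.
    Returns (train, val, test) splits of consecutive, non-overlapping intervals.
    """
    rng = np.random.default_rng(seed)
    total = train + val + test
    start = rng.integers(0, len(matrices) - total + 1)  # random disjoint window
    window = matrices[start:start + total]
    return window[:train], window[train:train + val], window[train + val:]
```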
Baselines. We compare Teal against the following baselines.
Metrics. We consider the following performance metrics.
Small topologies (SWAN and UsCarrier). All the evaluated schemes can compute flow allocation on SWAN and UsCarrier within seconds, e.g., LP-all takes less than 1 second to determine the optimal allocation, eliminating the need for TE acceleration schemes. Nonetheless, we observe that when NCFlow is applied to UsCarrier, it can only meet 72.6% of the demand (vs. 96.2% for LP-all). In contrast to NCFlow’s suboptimal performance, Teal satisfies 92.6% of the demand, close to the optimal.
Kdl. On the larger Kdl topology with 754 nodes and 1,790 edges, Teal takes only 0.95 seconds on average to complete each flow allocation, which is 7× faster than NCFlow, 13× faster than POP, 27× faster than LP-top, and 616× faster than LP-all. Meanwhile, Teal satisfies 90.6% of the demand, nearly the same as the best-performing scheme LP-top (with a difference of only 0.14%). Among the other schemes, LP-all requires over 585 seconds for computation, exceeding the 5-minute time budget and forcing it to reuse stale routes from several intervals ago. As a result, LP-all only allocates 87.2% of the demand despite its ability to find the optimal solution if granted unlimited runtime. NCFlow and POP, on the other hand, produce flow allocations quickly as intended, yet they only satisfy 63.8% and 87.7% of the demand, respectively.
ASN. On the largest topology, ASN, with 1,739 nodes and 8,558 edges, Teal achieves an even more remarkable speedup relative to the other schemes. With an average computation time of 0.97 seconds, Teal is 394× faster than NCFlow, 625× faster than POP, and 197× faster than LP-top. LP-all is impractical to run on ASN due to its incredibly slow computation time of up to 5.5 hours per allocation (4 orders of magnitude slower than Teal), as well as the memory errors incurred. Not only does Teal attain substantial acceleration, it also allocates the most demand on average (89.4%), surpassing LP-top by 6%, POP by 14.5%, and NCFlow by 32%.
It is noteworthy that Teal achieves the above performance without having to retrain its neural network models, showcasing its generality across (transient) link failures. In the less common event of persistent link failures or planned network upgrades, such as new network nodes or links, we anticipate having sufficient time to retrain Teal within 6–10 hours and recover its allocation performance on the updated topology.
On Kdl, while LP-all requires over 500 seconds to compute each flow allocation and exceeds the allotted time budget, its output allocation is optimal and serves as a benchmark. Teal falls short of the optimal allocation by 4.8% with respect to the offline satisfied demand, but remains within 0.7% of the best feasible scheme LP-top, and outperforms NCFlow by a significant margin of 27% and POP by 2.8%. On ASN, Teal and LP-top achieve a similar level of offline satisfied demand, which is 30% higher than NCFlow and 11% higher than POP.
We perform an ablation study to assess the impact of Teal’s key features on its overall performance.
From the visualization, we observe an orange cluster of busy paths, which is a useful indicator for the subsequent allocation task: the downstream policy network can be trained to separate this cluster from the remaining paths and allocate more traffic to the paths that should be busy, thereby mimicking the optimal solution provided by LP-all. In other words, this cluster indicates that FlowGNN has roughly captured path congestion within the network in its learned embeddings.
In this work, we demonstrate that deep learning is an effective tool for scaling cloud WAN TE systems to large WAN topologies. We develop Teal, a learning-based TE scheme that combines carefully designed, highly parallelizable components (FlowGNN, multi-agent RL, and ADMM) to allocate traffic in WANs. Teal computes near-optimal traffic allocations with substantial acceleration over state-of-the-art TE schemes for large WAN topologies.
Ethics: This work does not raise any ethical issues.
We thank our anonymous reviewers for their insightful comments and Zhizhen Zhong for shepherding this work. We also thank Srikanth Kandula, Firas Abuzaid, Yang Zhou, Deepak Narayanan, Fiodar Kazhamiaka, Umesh Krishnaswamy, and Victor Bahl for their helpful feedback. This work was supported in part by the NSF grant CNS-2107078.
Appendices
Appendices are supporting material that has not been peer-reviewed.
The goal of cloud WAN traffic engineering (TE) algorithms is to efficiently utilize the expensive network resources between datacenters to achieve operator-defined performance goals, such as minimum latency, maximum throughput, and fairness between customer traffic flows.
Network. We represent the WAN topology as a graph $G = (V, E, c)$, where nodes ($V$) represent network sites (e.g., datacenters), edges ($E$) between the sites represent network links resulting from long-haul fiber connectivity, and $c: E \rightarrow \mathbb{R}^{+}$ assigns capacities to links. Let $n = |V|$ denote the number of network sites. Each network site can consist of either one or multiple aggregated routers.
Traffic allocations. A traffic allocation $\mathcal{F}$ allocates a demand $d \in D$ as flows across the assigned network paths $P_d$. Therefore, $\mathcal{F}_d$ is a mapping from the path set $P_d$ to non-negative split ratios, i.e., $\mathcal{F}_d : P_d \rightarrow [0, 1]$, such that $\mathcal{F}_d(p)$ is the fraction of traffic demand $d$ allocated on path $p$. The traffic allocation in time interval $i$ is denoted as $\mathcal{F}^{(i)}$.
Constraints. For any demand $d \in D$, $\sum_{p \in P_d} \mathcal{F}_d(p) \leq 1$ is maintained such that we only allocate as much traffic as the demands. Additionally, for every edge $e \in E$, we constrain the allocations by $c(e) \geq \sum_{p \ni e} \sum_{d \in D} \mathcal{F}_d(p) \cdot d$ to ensure that the traffic allocations do not exceed the capacity of network links.
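To make the notation concrete, the small sketch below (our own illustration; the array layout is an assumption) checks both the demand and capacity constraints for a candidate allocation:

```python
import numpy as np

def check_allocation(splits, demands, path_to_edges, capacities, tol=1e-6):
    """Verify the demand and capacity constraints of a traffic allocation.

    splits:        (num_demands, paths_per_demand) split ratios F_d(p)
    demands:       (num_demands,) demand volumes d
    path_to_edges: (num_demands, paths_per_demand, num_edges) 0/1 incidence (path p contains edge e)
    capacities:    (num_edges,) link capacities c(e)
    """
    # Demand constraint: sum_p F_d(p) <= 1 for every demand d.
    demand_ok = np.all(splits.sum(axis=1) <= 1 + tol)
    # Capacity constraint: c(e) >= sum_{p containing e} sum_d F_d(p) * d for every edge e.
    edge_load = np.einsum("dp,d,dpe->e", splits, demands, path_to_edges)
    capacity_ok = np.all(edge_load <= capacities + tol)
    return demand_ok and capacity_ok
```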
Surrogate loss. The surrogate loss that approximates the (non-differentiable) total feasible flow is defined as the total flow intended to be routed (disregarding link capacities), penalized by total link overutilization. Using the above notations, the surrogate loss can be formally expressed as
where we perform Monte Carlo sampling to estimate the counterfactual baseline, e.g., by drawing a number of random samples for $a_i' \sim \pi_{\theta}(\cdot \mid s_i)$. The gradient of $\theta$ is then given by
which is used for training the policy network with standard policy gradient. In practice, Teal trains FlowGNN and the policy network of COMA end to end, so $\theta$ represents all the parameters to learn in the end-to-end model, backpropagating gradients from the policy network to FlowGNN.
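As a hedged illustration of this training signal (not Teal's exact training code), the sketch below computes a COMA-style policy-gradient loss with a Monte Carlo counterfactual baseline; the `policy` and `reward_fn` interfaces are hypothetical.

```python
import torch

def coma_loss(policy, states, actions, reward_fn, num_samples=8):
    """Policy-gradient loss with a Monte Carlo counterfactual baseline.

    policy:    maps a per-demand state s_i to a torch.distributions.Distribution over actions
    states:    list of per-demand states s_i (one per agent)
    actions:   list of actions a_i actually taken (one per agent)
    reward_fn: maps the joint action of all agents to the central TE reward (a scalar tensor)
    """
    with torch.no_grad():
        joint_reward = reward_fn(actions)
    loss = torch.zeros(())
    for i, (s_i, a_i) in enumerate(zip(states, actions)):
        dist = policy(s_i)
        # Counterfactual baseline: expected reward when only agent i's action is
        # resampled, estimated with Monte Carlo samples a_i' ~ pi_theta(. | s_i).
        with torch.no_grad():
            baseline = torch.stack([
                reward_fn(actions[:i] + [dist.sample()] + actions[i + 1:])
                for _ in range(num_samples)
            ]).mean()
        advantage = joint_reward - baseline
        loss = loss - dist.log_prob(a_i).sum() * advantage  # REINFORCE-style term per agent
    return loss  # minimizing this backpropagates through the policy network and FlowGNN
```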
\[
= -\sum_{d \in D} \sum_{p \in P_d} \mathcal{F}_d(p) \cdot d \;+\; \lambda\, G(\mathcal{F}, z, s) \;+\; \frac{\rho}{2}\, \big\| G(\mathcal{F}, z, s) \big\|_2^2,
\]
where $G(\mathcal{F}, z, s) = (G_1, G_3, G_4)^{\top}$ and