Chinese Journal of Intelligent Science and Technology, 2020, 2(4): 314-326. doi: 10.11959/j.issn.2096-6652.202034
Special Issue: Deep Reinforcement Learning
LIU Zhaoyang¹, MU Chaoxu¹, SUN Changyin²
1. School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
2. School of Automation, Southeast University, Nanjing 210096, China
Revised: 2020-12-03; Online: 2020-12-15
About the authors
LIU Zhaoyang (1996-), male, Ph.D. candidate at the School of Electrical and Information Engineering, Tianjin University. His main research interests include reinforcement learning and multi-agent reinforcement learning.
MU Chaoxu (1984-), female, Ph.D., professor at the School of Electrical and Information Engineering, Tianjin University. Her main research interests include reinforcement learning, adaptive learning systems, nonlinear control, and optimization.
SUN Changyin (1975-), male, Ph.D., professor at the School of Automation, Southeast University, fellow of the Chinese Association of Automation, and director of its Technical Committee on Artificial Intelligence and Robotics Education. His main research interests include intelligent control and optimization, reinforcement learning, neural networks, and data-driven control. He serves on the editorial boards of IEEE Transactions on Neural Networks and Learning Systems, IEEE/CAA Journal of Automatica Sinica, Acta Automatica Sinica, Control Theory & Applications, and Chinese Journal of Intelligent Science and Technology. He received the National Science Fund for Distinguished Young Scholars in 2011, leads the Jiangsu provincial excellent scientific and technological innovation team on intelligent robot perception and control, was named a National Outstanding Scientific and Technological Worker in 2016 and a leading talent in scientific and technological innovation in the third batch of the National Ten Thousand Talents Program, served as a delegate to the Ninth National Congress of the China Association for Science and Technology, is the academic leader of the NSFC innovative research group on cooperative control theory and applications of autonomous unmanned systems, the chief scientist of the major project "human-in-the-loop hybrid-augmented intelligence" under the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" program of the Ministry of Science and Technology, and the leading scientist of a Jiangsu frontier-leading basic research project.
Deep reinforcement learning (DRL) is mainly applied to solve perception-decision problems and has become an important research branch in the field of artificial intelligence. Two kinds of DRL algorithms, based on the value function and on the policy gradient, were summarized, including the deep Q-network, policy gradient methods, and related improved algorithms. In addition, the applications of DRL in video games, navigation, multi-agent cooperation, and recommendation were reviewed in depth. Finally, a prospect for future research on DRL was given, together with some research suggestions.
Keywords: artificial intelligence; deep reinforcement learning; value function; policy gradient; navigation; cooperation; complex environment; generalization; robustness
Citation format
LIU Zhaoyang, MU Chaoxu, SUN Changyin. An overview on algorithms and applications of deep reinforcement learning[J]. Chinese Journal of Intelligent Science and Technology, 2020, 2(4): 314-326. doi: 10.11959/j.issn.2096-6652.202034
Figure 1: Network structure of DQN
Figure 2: Update process of the DQN algorithm
Meanwhile, the parameters of the neural network are updated by gradient descent. Experiments show that DQN not only reaches the level of human players on a variety of Atari 2600 games, but also exhibits strong adaptability and generality.
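As a minimal sketch of this update (not the original implementation), the following PyTorch function computes the TD target with a separate target network and applies one gradient-descent step on the squared TD error; the Q-network objects, the optimizer, and the replay-buffer minibatch are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch is assumed to be a minibatch (s, a, r, s_next, done) sampled
    # from the replay buffer; q_net and target_net share the same architecture.
    s, a, r, s_next, done = batch

    # Q(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # TD target from the target network: r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

    # Mean-squared TD error, minimized by gradient descent
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```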
DDQN adopts the same update procedure as DQN. Experimental results show that DDQN achieves better performance than DQN on most Atari 2600 games and obtains more stable policies.
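For comparison, a sketch of the Double DQN target under the same assumed `q_net`/`target_net` pair as above: the online network selects the greedy action and the target network evaluates it, which reduces the overestimation introduced by DQN's max operator.

```python
import torch

def ddqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    # Double DQN: action selection by the online network,
    # action evaluation by the target network.
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)
        return r + gamma * (1 - done) * q_eval
```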
Figure 3: Network structure of Dueling DQN
In practice, the advantage function is usually centered by subtracting the mean of the advantages of all actions in the current state, which gives the following action-value function:
$$Q(s,a;\theta,\alpha,\beta)=V(s;\theta,\beta)+\left(A(s,a;\theta,\alpha)-\frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a';\theta,\alpha)\right)$$
where $\theta$ denotes the shared network parameters, and $\beta$ and $\alpha$ are the parameters of the value stream and the advantage stream, respectively.
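A minimal sketch of such a dueling head with the mean-subtracted advantage (layer sizes and class name are illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits shared features into a state-value stream V(s) and an
    advantage stream A(s, a), then recombines them with mean subtraction."""
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)
        self.advantage = nn.Linear(feature_dim, num_actions)

    def forward(self, features):
        v = self.value(features)        # V(s): shape [batch, 1]
        a = self.advantage(features)    # A(s, a): shape [batch, num_actions]
        # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
        return v + a - a.mean(dim=1, keepdim=True)
```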
Policy-gradient-based DRL algorithms mainly include the policy gradient algorithm, the actor-critic (AC) algorithm, and various improved algorithms based on AC, such as the deep deterministic policy gradient (DDPG) algorithm, the asynchronous advantage actor-critic (A3C) algorithm, and the proximal policy optimization (PPO) algorithm.
The policy gradient algorithm optimizes the agent's policy directly and needs to collect complete trajectories τ to update the policy. In DRL, collecting such trajectory data is often difficult, and updating the policy on a per-trajectory basis introduces large variance. A feasible solution is to bring the AC structure from classical reinforcement learning into DRL. The AC structure consists of an actor and a critic: the actor updates the action based on the policy gradient, while the critic evaluates the action based on a value function. The advantage of the AC structure is that it turns the per-trajectory update of the policy gradient into a per-step update, so the policy can be evaluated and improved without waiting for the trajectory to end; this reduces the difficulty of data collection and the variance of the policy gradient.
The value-function part can also be replaced by the advantage function, which can be expressed as
$$A(s_t,a_t)=Q(s_t,a_t)-V(s_t)$$
or
$$A(s_t,a_t)=r_{t+1}+\gamma V(s_{t+1})-V(s_t)$$
Figure 4: Basic structure of A2C
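As a minimal sketch of the one-step advantage actor-critic update described above (not the authors' implementation), the TD error serves as the advantage estimate; here `actor(s)` is assumed to return a `torch.distributions.Categorical`, `critic(s)` a scalar state value, and a single optimizer over both networks' parameters is assumed.

```python
import torch

def actor_critic_step(actor, critic, optimizer, s, a, r, s_next, done, gamma=0.99):
    # Critic estimates state values; the TD error acts as the advantage.
    v_s = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
    advantage = r + gamma * (1 - done) * v_next - v_s

    # Actor: policy-gradient loss weighted by the (detached) advantage.
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * advantage.detach()).mean()

    # Critic: squared TD error regressed toward the TD target.
    critic_loss = advantage.pow(2).mean()

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```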
The SAC algorithm encourages exploration by maximizing entropy, which on the one hand prevents the agent from converging to a suboptimal policy and on the other hand improves the robustness of the algorithm; moreover, SAC achieves better performance than DDPG and PPO on a variety of continuous-control tasks.
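The entropy-maximization idea can be sketched, under stated assumptions, as the actor loss below (a fragment of SAC, not the full algorithm): `policy(states)` is assumed to return a reparameterized action sample and its log-probability, `q_net` scores state-action pairs, and `alpha` is the temperature that weights entropy against return.

```python
import torch

def sac_actor_loss(policy, q_net, states, alpha=0.2):
    # Sample actions and their log-probabilities from the stochastic policy.
    actions, log_prob = policy(states)
    q = q_net(states, actions)
    # Maximizing Q + alpha * entropy is equivalent to
    # minimizing alpha * log_prob - Q.
    return (alpha * log_prob - q).mean()
```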
Table 1: Application domains and research significance of several classes of DRL
Figure 5: Typical Atari 2600 game environments
Navigation is another important application of DRL. Its goal is to make the agent find an optimal path from a starting point to a target point while completing various tasks along the way, such as avoiding obstacles, collecting items, and navigating to multiple targets. In recent years, DRL-based research on maze navigation, indoor navigation, and street-view navigation has produced a series of results.
Figure 6: DRL navigation environments
Figure 7: Structure of the MADDPG algorithm
At present, research on DRL has made considerable progress, but the algorithms still suffer from problems such as insufficient sample efficiency, difficulty in designing reward values, and the exploration dilemma. On the application side, DRL research is mainly concentrated in virtual environments, and model-free DRL algorithms are difficult to apply in real-world environments, because they require a large amount of sampled data for training, whereas real-world samples are hard to obtain through trial and error. In addition, DRL algorithms also show insufficient generalization ability and weak robustness, which further limits their application in real life. Accordingly, future research on DRL can be carried out from the following aspects.
SUTTON R S, BARTO A G. Reinforcement learning: an introduction
LECUN Y, BENGIO Y, HINTON G. Deep learning
ZHAO D B, SHAO K, ZHU Y H, et al. Review of deep reinforcement learning and discussions on the development of computer Go
WAN L P, LAN X G, ZHANG H B, et al. A review of deep reinforcement learning theory and application
MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning
SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search
SILVER D, SCHRITTWIESER J, SIMONYAN K, et al. Mastering the game of Go without human knowledge
BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning
VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning
LIU Q, ZHAI J W, ZHANG Z Z, et al. A survey on deep reinforcement learning
LIU J W, GAO F, LUO X L. Survey of deep reinforcement learning based on value function and policy gradient
SUTTON R S. Learning to predict by the methods of temporal differences
WATKINS C J C H, DAYAN P. Q-learning
VAN HASSELT H, GUEZ A, SILVER D, et al. Deep reinforcement learning with double Q-learning
SCHAUL T, QUAN J, ANTONOGLOU I, et al. Prioritized experience replay
WANG Z, SCHAUL T, HESSEL M, et al. Dueling network architectures for deep reinforcement learning
NAIR A, SRINIVASAN P, BLACKWELL S, et al. Massively parallel methods for deep reinforcement learning
SILVER D, LEVER G, HEESS N, et al. Deterministic policy gradient algorithms
LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning
MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning
SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms
HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor
SHEN Y, HAN J P, LI L X, et al. AI in game intelligence: from multi-role game to parallel game
BADIA A P, PIOT B, KAPTUROWSKI S, et al. Agent57: outperforming the Atari human benchmark
KEMPKA M, WYDMUCH M, RUNC G, et al. ViZDoom: a Doom-based AI research platform for visual reinforcement learning
LAMPLE G, CHAPLOT D S. Playing FPS games with deep reinforcement learning
DOSOVITSKIY A, KOLTUN V. Learning to act by predicting the future
PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction
WU Y, ZHANG W, SONG K. Master-slave curriculum design for reinforcement learning
VINYALS O, EWALDS T, BARTUNOV S, et al. StarCraft II: a new challenge for reinforcement learning
ZAMBALDI V, RAPOSO D, SANTORO A, et al. Relational deep reinforcement learning
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need
RASHID T, SAMVELYAN M, DE WITT C S, et al. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning
YE D, LIU Z, SUN M, et al. Mastering complex control in MOBA games with deep reinforcement learning
OH J, CHOCKALINGAM V, SINGH S, et al. Control of memory, active perception, and action in Minecraft
JADERBERG M, MNIH V, CZARNECKI W M, et al. Reinforcement learning with unsupervised auxiliary tasks
MIROWSKI P, PASCANU R, VIOLA F, et al. Learning to navigate in complex environments
WANG Y, HE H, SUN C. Learning to navigate through complex dynamic environment with modular deep reinforcement learning
SHI H, SHI L, XU M, et al. End-to-end navigation strategy with deep reinforcement learning for mobile robots
SAVINOV N, RAICHUK A, MARINIER R, et al. Episodic curiosity through reachability
ZHU Y, MOTTAGHI R, KOLVE E, et al. Target-driven visual navigation in indoor scenes using deep reinforcement learning
TAI L, LIU M. Towards cognitive exploration through deep reinforcement learning for mobile robots
TAI L, PAOLO G, LIU M. Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation
WU Y, RAO Z, ZHANG W, et al. Exploring the task cooperation in multi-goal visual navigation
ZHANG W, ZHANG Y, LIU N. Map-less navigation: a single DRL-based controller for robots with varied dimensions
MIROWSKI P, GRIMES M K, MALINOWSKI M, et al. Learning to navigate in cities without a map
LI A, HU H, MIROWSKI P, et al. Cross-view policy learning for street navigation
HERMANN K M, MALINOWSKI M, MIROWSKI P, et al. Learning to follow directions in street view
CHANCÁN M, MILFORD M. CityLearn: diverse real-world environments for sample-efficient navigation policy learning
SUN C Y, MU C X. Important scientific problems of multi-agent deep reinforcement learning
OROOJLOOYJADID A, HAJINEZHAD D. A review of cooperative multi-agent deep reinforcement learning
OMIDSHAFIEI S, PAZIS J, AMATO C, et al. Deep decentralized multi-task multi-agent reinforcement learning under partial observability
MATIGNON L, LAURENT G J, LE FORT-PIAT N. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams
FOERSTER J, NARDELLI N, FARQUHAR G, et al. Stabilising experience replay for deep multi-agent reinforcement learning
PALMER G, TUYLS K, BLOEMBERGEN D, et al. Lenient multi-agent deep reinforcement learning
EVERETT R, ROBERTS S. Learning against non-stationary agents with opponent modelling and deep reinforcement learning
JIN Y, WEI S, YUAN J, et al. Stabilizing multi-agent deep reinforcement learning by implicitly estimating other agents' behaviors
LIU X, TAN Y. Attentive relational state representation in decentralized multi-agent reinforcement learning
GUPTA J K, EGOROV M, KOCHENDERFER M. Cooperative multi-agent control using deep reinforcement learning
LOWE R, WU Y, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments
FOERSTER J, FARQUHAR G, AFOURAS T, et al. Counterfactual multi-agent policy gradients
SUNEHAG P, LEVER G, GRUSLYS A, et al. Value-decomposition networks for cooperative multi-agent learning
MAO H, ZHANG Z, XIAO Z, et al. Modelling the dynamic joint policy of teammates with attention multi-agent DDPG
IQBAL S, SHA F. Actor-attention-critic for multi-agent reinforcement learning
FOERSTER J N, ASSAEL Y M, DE FREITAS N, et al. Learning to communicate with deep multi-agent reinforcement learning
SUKHBAATAR S, SZLAM A, FERGUS R. Learning multiagent communication with backpropagation
JIANG J, LU Z. Learning attentional communication for multi-agent cooperation
KIM D, MOON S, HOSTALLERO D, et al. Learning to schedule communication in multi-agent reinforcement learning
DAS A, GERVET T, ROMOFF J, et al. TarMAC: targeted multi-agent communication
SHANI G, HECKERMAN D, BRAFMAN R I, et al. An MDP-based recommender system
ZHAO X, XIA L, TANG J, et al. Deep reinforcement learning for search, recommendation, and online advertising: a survey
ZHAO X, XIA L, ZHANG L, et al. Deep reinforcement learning for page-wise recommendations
ZHENG G, ZHANG F, ZHENG Z, et al. DRN: a deep reinforcement learning framework for news recommendation