NeurIPS Spotlight | Decision models get a new pre-training paradigm and a unified framework

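The paper's first author is Lanqing Li, a research expert at Zhejiang Lab and a part-time PhD student at The Chinese University of Hong Kong, supervised by Prof. Pheng Ann Heng of the Department of Computer Science and Engineering at The Chinese University of Hong Kong. Hai Zhang, a master's student at Tongji University, is co-first author; his advisor, Prof. Junqiao Zhao, is the corresponding author.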

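Today, large language models exemplified by GPT are profoundly shaping how people work and live, yet they still struggle with many highly specialized and complex problems. In complex settings such as drug discovery and autonomous driving, an AI system's capacity for autonomous decision-making is key to solving the problem, and how to train large decision models efficiently remains an open question.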

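Reinforcement learning (RL), a classic method for training sequential decision-making models, is bound to become one of the core techniques for training and fine-tuning large decision models. Because tasks and data are complex, we would like training to move away from the online environment interaction of traditional RL and instead learn efficiently, offline and across multiple tasks, from massive amounts of historical data. This new paradigm is called "offline meta-reinforcement learning" (Offline Meta-RL).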

Problem background

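On the other hand, complex and ever-changing real-world scenarios make it increasingly necessary for agents to handle multiple tasks. The paradigm that lets an agent, like a human, learn several skills at once and generalize from one case to others is called "meta-reinforcement learning" (meta-RL).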

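Offline RL and meta-RL are two branches of reinforcement learning, each with its own strengths. The former dispenses with online environment interaction and can reuse historical data for training, which makes it safe and sample-efficient. The latter focuses on multi-task and transfer learning and excels at generalization. The two are complementary.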

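Around 2021, researchers therefore began combining the two paradigms to train more capable agents. One mainstream family of methods is called "context-based offline meta-reinforcement learning" (Context-Based Offline Meta-RL, COMRL). Its core idea is to treat a representation of the current task as additional state information and to train a universal policy that applies to any task or environment.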
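As a minimal sketch of this idea, the snippet below conditions a toy policy on both the state and a task representation z by simple concatenation. The function name `universal_policy`, the linear-tanh form, and all dimensions are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def universal_policy(state, z, weights, bias):
    """Toy deterministic policy: the task representation z is appended to the state."""
    augmented = list(state) + list(z)  # z acts as extra state information
    return [
        math.tanh(sum(w * x for w, x in zip(row, augmented)) + b)
        for row, b in zip(weights, bias)
    ]

# Hypothetical sizes and random parameters, for illustration only.
rng = random.Random(0)
state_dim, z_dim, action_dim = 4, 2, 3
weights = [[rng.gauss(0.0, 1.0) for _ in range(state_dim + z_dim)]
           for _ in range(action_dim)]
bias = [0.0] * action_dim
state = [rng.gauss(0.0, 1.0) for _ in range(state_dim)]

# Same state, two different task representations: one set of weights, two behaviors.
action_task_a = universal_policy(state, [1.0, 0.0], weights, bias)
action_task_b = universal_policy(state, [0.0, 1.0], weights, bias)
```

A single set of parameters serves every task; only the task representation fed alongside the state changes.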

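Within this framework, learning a robust and effective task representation Z becomes the central problem, and the most important challenge is context shift. The agent's training data is offline, that is, drawn from a fixed distribution, whereas the task context at test time is unknown and variable. As a result, the training and test sets can differ greatly in distribution along the state-action dimension or the task dimension, which places very high demands on the model's robustness and generalization.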

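To address this problem, existing mainstream methods such as FOCAL [1], CORRO [2], and CSRO [3] have successively proposed a range of optimization objectives that learn task representations using ideas such as metric learning and contrastive learning.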
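To make the metric-learning idea concrete, here is a hedged sketch of a generic pairwise distance objective: embeddings from the same task are pulled together and embeddings from different tasks are pushed apart. The function name, the inverse-distance repulsion term, and the constant `eps` are assumptions chosen for illustration; this is not presented as the exact loss of FOCAL, CORRO, or CSRO.

```python
def metric_learning_loss(embeddings, task_ids, eps=0.1):
    """Average pairwise loss: squared distance for same-task pairs,
    inverse squared distance for different-task pairs."""
    total, num_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            if task_ids[i] == task_ids[j]:
                total += sq_dist                # pull same-task embeddings together
            else:
                total += 1.0 / (sq_dist + eps)  # push different-task embeddings apart
            num_pairs += 1
    return total / num_pairs

# Toy 2-D embeddings forming two tight clusters (hypothetical data).
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
well_separated = metric_learning_loss(embeddings, task_ids=[0, 0, 1, 1])
entangled = metric_learning_loss(embeddings, task_ids=[0, 1, 0, 1])
```

When the clusters line up with the task labels the loss is small; when each cluster mixes both tasks, the loss is much larger.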

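However, existing methods focus mainly on empirical improvements to the loss function and lack systematic theoretical support and design guidance for task representation learning, and for context shift in particular.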

UNICORN: a unified theoretical framework based on information theory

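UNICORN's core innovation is to use information theory to define and deconstruct, systematically and for the first time, the problem of task representation learning in COMRL. It proceeds in three progressive steps: a mathematical definition, a causal decomposition, and a central theorem. Through rigorous proofs it unifies the optimization objectives of existing methods, and on that basis it proposes and validates two new algorithm implementations, with the aim of inspiring the design of further methods.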

1. Mathematical definition of task representation learning

2. Causal decomposition

3. Central theorem

The central theorem leads to two important conclusions that point the way for designing new methods in the COMRL field.

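Based on these insights, and to demonstrate the guiding value of the UNICORN framework, we propose two new algorithm implementations by approximating I(Z;M).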
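For intuition on what approximating I(Z;M) can look like, one standard option is the variational lower bound I(Z;M) ≥ H(M) + E[log q(M|Z)], where q is any classifier that predicts the task M from the representation Z. The sketch below evaluates an empirical version of that bound under a uniform task prior. The function name, the uniform-prior assumption, and the toy probabilities are illustrative, and this is not claimed to be the estimator the authors use.

```python
import math

def mi_lower_bound(pred_probs, task_ids, num_tasks):
    """Empirical variational lower bound on I(Z; M) for a uniform task prior.

    pred_probs[i][m] is an assumed classifier output q(m | z_i) and
    task_ids[i] is the true task index of sample i.
    """
    cross_entropy = -sum(
        math.log(probs[m]) for probs, m in zip(pred_probs, task_ids)
    ) / len(task_ids)
    # H(M) = log(num_tasks) for a uniform prior; the bound is H(M) - cross-entropy.
    return math.log(num_tasks) - cross_entropy

task_ids = [0, 1, 0, 1]
# A classifier that always identifies the task: the bound reaches H(M).
informative = mi_lower_bound([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]], task_ids, 2)
# A classifier that guesses uniformly: the bound collapses to zero.
uninformative = mi_lower_bound([[0.5, 0.5]] * 4, task_ids, 2)
```

Minimizing the classifier's cross-entropy with respect to the encoder that produces Z tightens this bound, which is why a task-classification loss can serve as a surrogate for maximizing I(Z;M).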

Experimental results

Broad applicability and robustness of UNICORN

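1. Behavior IID/OOD (the behavior policies of the training and test sets are sampled from the same distribution / from different distributions)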

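Conclusion: on in-distribution test sets UNICORN performs on par with SoTA, and on out-of-distribution test sets it significantly outperforms the other existing methods.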

2. Performance on datasets of different quality

Conclusion: UNICORN (especially the self-supervised version) reaches SoTA performance on datasets of every quality level.

3. Transferability across model architectures (test results when applied to Decision Transformer (DT))

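Conclusion: UNICORN shows a clear advantage over existing methods on both MLP and Decision Transformer architectures, and can be used as a plug-and-play module in other RL algorithms.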

4. Generalization to out-of-distribution tasks

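The left panel of the figure shows how the out-of-distribution tasks are constructed. Taking Ant-Dir as an example, the goal directions of the training tasks are sampled from the second and third quadrants, while the test tasks lie in the first and fourth quadrants, so the two sets do not overlap at all. The right panel shows the test results: self-supervised UNICORN is the only algorithm that achieves positive few-shot transfer.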

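Conclusion: using the autoencoder in self-supervised UNICORN for domain randomization and model-based RL makes it possible to extrapolate the agent's abilities to out-of-distribution tasks, which none of the other existing methods can do.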
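The out-of-distribution task split described for Ant-Dir can be sketched as follows. Training goal directions are drawn from the second and third quadrants and test goal directions from the first and fourth. The function name, the uniform sampling, the seed, and the task count of 20 are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def sample_goal_directions(num_tasks, split, rng):
    """Sample goal angles (radians) from disjoint half-planes for train and test."""
    if split == "train":
        low, high = math.pi / 2, 3 * math.pi / 2   # quadrants II and III
    elif split == "test":
        low, high = -math.pi / 2, math.pi / 2      # quadrants IV and I
    else:
        raise ValueError(f"unknown split: {split!r}")
    return [rng.uniform(low, high) for _ in range(num_tasks)]

rng = random.Random(0)
train_dirs = sample_goal_directions(20, "train", rng)  # hypothetical task count
test_dirs = sample_goal_directions(20, "test", rng)
```

Because the x-component of every training direction is non-positive and that of every test direction is non-negative, no test task falls inside the range covered by training.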

Future outlook for UNICORN

Providing a theoretical foundation for extending the capabilities of large decision models

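UNICORN provides a unified theoretical foundation and algorithm design principles for offline meta-reinforcement learning. It offers guidance for large-scale offline, multi-task pre-training and fine-tuning of large decision models, and thereby for further extending what those models can do. The technique can help address challenges that AI models face in frontier fields such as drug design, precision medicine, and embodied intelligence, including generalization, multi-objective optimization, and sample efficiency. The team is also exploring how to extend the UNICORN framework to more settings, such as online reinforcement learning.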

References:

[1] Lanqing Li, Rui Yang, and Dijun Luo. FOCAL: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization. ICLR 2021.

[2] Haoqi Yuan and Zongqing Lu. Robust task representations for offline meta-reinforcement learning via contrastive learning. ICML 2022.

[3] Yunkai Gao, et al. Context shift reduction for offline meta-reinforcement learning. NeurIPS 2023.
