{"id":702,"date":"2025-12-05T01:46:14","date_gmt":"2025-12-04T17:46:14","guid":{"rendered":"https:\/\/www.ndnlab.com\/?p=702"},"modified":"2025-12-08T09:55:07","modified_gmt":"2025-12-08T01:55:07","slug":"a-survey-on-self-play-methods-in-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.ndnlab.com\/?p=702","title":{"rendered":"A Survey on Self-Play Methods in Reinforcement Learning"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Key Points of the Paper<\/h2>\n\n\n\n<p>The paper proposes a unified framework for characterizing self-play methods (in which an agent interacts with itself or with past versions of itself to improve its policy). It classifies them along dimensions such as the policy update mechanism, opponent selection and population management, and game type (zero-sum\/non-zero-sum, transitive\/non-transitive). It also reviews representative algorithms, application scenarios, and theoretical and practical challenges, and lists future research directions and evaluation metrics.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"830\" height=\"453\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-10.png\"  class=\"wp-image-703\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-10.png 830w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-10-300x164.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-10-768x419.png 768w\" sizes=\"auto, (max-width: 830px) 100vw, 830px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe\" \/><\/figure>\n\n\n\n<h2 
class=\"wp-block-heading\">Formal Background (Models and Game Concepts)<\/h2>\n\n\n\n<p>Multi-agent \/ game framework: the paper builds on general multi-agent reinforcement learning (MARL) and game theory, distinguishing normal-form from extensive-form games, static from dynamic games, and stage games from repeated games. This lets it describe in a uniform way how self-play behaves and what it aims for under different game structures (for example, finding a Nash equilibrium, an evolutionarily stable strategy, or a weakly dominant strategy).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"830\" height=\"258\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-11.png\"  class=\"wp-image-704\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-11.png 830w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-11-300x93.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-11-768x239.png 768w\" sizes=\"auto, (max-width: 830px) 100vw, 830px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe1\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe1\" \/><\/figure>\n\n\n\n<p>Transitive vs 
non-transitive: transitivity is the chain-like dominance relation \u201cif A beats B and B beats C, then A beats C\u201d. Many adversarial tasks (especially multiplayer games and strategy games) are non-transitive, so naive gradient ascent or playing against a copy of oneself easily gets stuck in cycles or stops improving. The paper treats this as a key motivation for both self-play design (population\/league) and evaluation design.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Unified Framework and Taxonomy<\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"831\" height=\"317\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-12.png\"  class=\"wp-image-705\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-12.png 831w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-12-300x114.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-12-768x293.png 768w\" sizes=\"auto, (max-width: 831px) 100vw, 831px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe2\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe2\" \/><\/figure>\n\n\n\n<p>The paper proposes a framework that places self-play methods in a common coordinate system. Its main dimensions are:<\/p>\n\n\n\n<p>Source of the opponent\/population policies (Opponent\/Population)<\/p>\n\n\n\n<p>A single past version (vanilla self-play)<\/p>\n\n\n\n<p>Historical average policy \/ trajectory memory (fictitious-play \/ 
NFSP)<\/p>\n\n\n\n<p>Population\/league mechanisms (AlphaStar, OpenAI Five, etc.)<\/p>\n\n\n\n<p>Meta-strategy \/ meta-game (PSRO and other meta-game-based methods)<\/p>\n\n\n\n<p>Policy update \/ learning mode (Policy Update)<\/p>\n\n\n\n<p>Self-play updates driven directly by RL (PPO\/PG\/Q-learning)<\/p>\n\n\n\n<p>Imitation \/ supervised-learning components (for example, treating past winning trajectories as expert samples)<\/p>\n\n\n\n<p>Hybrid: supervised + reinforcement (such as NFSP's supervised policy + RL best-response)<\/p>\n\n\n\n<p>Target solution (Solution Concept)<\/p>\n\n\n\n<p>Approximating a Nash equilibrium \/ minimax solution<\/p>\n\n\n\n<p>Training a strong policy to beat a specific set of opponents (best-response \/ exploiters)<\/p>\n\n\n\n<p>Environment \/ game type (board games, card games, MOBA, video games, etc.): different tasks favor different self-play strategies.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"954\" height=\"920\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-13.png\"  class=\"wp-image-706\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-13.png 954w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-13-300x289.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-13-768x741.png 768w\" sizes=\"auto, (max-width: 954px) 100vw, 954px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe3\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe3\" 
\/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"953\" height=\"498\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-14.png\"  class=\"wp-image-707\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-14.png 953w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-14-300x157.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-14-768x401.png 768w\" sizes=\"auto, (max-width: 953px) 100vw, 953px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe4\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe4\" \/><\/figure>\n\n\n\n<p>Within this framework, the paper assigns a large number of algorithms (vanilla SP, fictitious SP, NFSP, PSRO, CFR and its variants, league\/population methods, time\/space saving CFR, MCCFR, NeuPL, FTW, etc.) to their corresponding categories and analyzes the strengths, weaknesses, and applicable scenarios of each.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"881\" height=\"1024\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-15-881x1024.png\"  class=\"wp-image-708\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-15-881x1024.png 881w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-15-258x300.png 258w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-15-768x893.png 768w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-15.png 974w\" sizes=\"auto, (max-width: 881px) 100vw, 881px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe5\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe5\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" 
width=\"944\" height=\"182\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-16.png\"  class=\"wp-image-709\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-16.png 944w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-16-300x58.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-16-768x148.png 768w\" sizes=\"auto, (max-width: 944px) 100vw, 944px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe6\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe6\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"935\" height=\"155\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-17.png\"  class=\"wp-image-710\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-17.png 935w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-17-300x50.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-17-768x127.png 768w\" sizes=\"auto, (max-width: 935px) 100vw, 935px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe7\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe7\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"987\" height=\"573\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-18.png\"  class=\"wp-image-711\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-18.png 987w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-18-300x174.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-18-768x446.png 768w\" sizes=\"auto, (max-width: 987px) 100vw, 987px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe8\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe8\" \/><\/figure>\n\n\n\n<h2 
class=\"wp-block-heading\">Main Self-Play Methods and Technical Highlights<\/h2>\n\n\n\n<p>The major families of methods the paper focuses on are listed below, with implementation- and algorithm-level highlights and engineering considerations.<\/p>\n\n\n\n<p>1) Vanilla Self-Play<\/p>\n\n\n\n<p>Core idea: the current policy plays against a copy of itself or against its latest version, and the policy is updated directly with RL (such as PPO or A2C) on the self-play data. The same play-against-the-latest-self pattern appears in AlphaZero-style pipelines, although those improve the policy with MCTS rather than policy gradients.<\/p>\n\n\n\n<p>Advantages: simple to implement, and it builds a curriculum automatically (the opponent gets harder as the policy improves).<\/p>\n\n\n\n<p>Disadvantages: in non-transitive or diverse adversarial environments it can fall into cycles, become exploitable, or degenerate (especially in complex multiplayer games).<\/p>\n\n\n\n<p>Engineering notes: requires version management (checkpoints, matchmaking rules) and stabilization measures (learning rate, entropy regularization, opponent sampling scheme).<\/p>\n\n\n\n<p>2) Fictitious Play \/ Neural Fictitious Self-Play (NFSP)<\/p>\n\n\n\n<p>Core idea: following classical fictitious play, train a best response against the historical average of the opponent's strategies, which serves as the opponent distribution. NFSP 
represents both the policy and the average policy with neural networks (the average policy is fitted by supervised learning to the historical policy distribution), and trains the RL best response separately from the supervised learner.<\/p>\n\n\n\n<p>Advantages: closer in theory to converging to a Nash equilibrium (in certain zero-sum\/symmetric settings), and it mitigates the non-transitivity problem.<\/p>\n\n\n\n<p>Implementation details: requires two networks (RL network + supervised average network), a carefully designed experience buffer (storing trajectories from which the average policy is learned), and coordination with the timing of RL updates.<\/p>\n\n\n\n<p>3) PSRO (Policy-Space Response Oracles) and Its Variants<\/p>\n\n\n\n<p>Core idea: abstract the game into a meta-game (a payoff matrix over a set of policies) and iterate: estimate the meta-game (payoffs obtained through self-play\/simulation), solve the meta-game (find a mixed strategy over the policy set), train an oracle (a best-response network) against the current mixed strategy, add it to the population, and repeat. PSRO is a modern generalization of Double Oracle and Fictitious Play.<\/p>\n\n\n\n<p>Technical points:<\/p>\n\n\n\n<p>Sample efficiency and variance control when estimating payoffs (requires a large amount of self-play)<\/p>\n\n\n\n<p>Choice of meta-solver (how to extract a mixed strategy from the payoff matrix; Nash solver \/ regularized solver)<\/p>\n\n\n\n<p>Oracle 
training method (RL, imitation) and population management (when to retire or keep policies)<\/p>\n\n\n\n<p>Applicability: well suited to non-transitive problems, allows explicit control over policy diversity, and fits complex adversarial tasks.<\/p>\n\n\n\n<p>4) Counterfactual Regret Minimization (CFR) and Its Variants (the successful paradigm for poker)<\/p>\n\n\n\n<p>Core idea: designed for imperfect-information games (such as poker). It estimates, online or offline, the regrets at each information set and minimizes them iteratively; in two-player zero-sum games the resulting average strategy converges to an approximate Nash equilibrium. MCCFR and other sampling variants are used for scaling, and time-saving \/ space-saving versions further improve efficiency.<\/p>\n\n\n\n<p>Engineering practice: large-scale poker systems (Pluribus, Libratus) use engineered versions of CFR or MCCFR, combined with large-scale parallelism, resource scheduling, and sampling strategies. The paper catalogs these variants and engineering techniques.<\/p>\n\n\n\n<p>5) Population \/ League Methods<\/p>\n\n\n\n<p>Examples: AlphaStar (league training with main agents and exploiters), OpenAI Five (self-play against the latest policy mixed with a pool of past versions), and variants of MuZero\/AlphaZero.<\/p>\n\n\n\n<p>Key idea: maintain a population of policies (including main policies, historical averages, dedicated 
exploiters, random opponents, etc.) and improve robustness through targeted training against different opponents, which avoids overfitting to any single opponent.<\/p>\n\n\n\n<p>Engineering notes: the policy selection\/sampling distribution, population growth and pruning rules, how to generate challenging opponents (exploiters), and how to keep training stable (avoiding population collapse).<\/p>\n\n\n\n<p>6) Hybrid Methods and Recent Innovations (NeuPL, FTW, R-NaD, etc.)<\/p>\n\n\n\n<p>Recent work explores modular designs, the use of meta-learning or neural methods to learn population management, and automated opponent generation and selection (the paper lists several concrete methods and evaluations). These methods often perform better in complex video games and multiplayer MOBAs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Representative Systems and Implementation Notes<\/h2>\n\n\n\n<p>The paper includes a table listing many representative systems (AlphaGo\/AlphaZero\/MuZero, OpenAI Five, AlphaStar, Pluribus, DeepStack, Libratus, TiZero, DeltaDou, etc.) and compares them along dimensions such as game category, player information, use of expert data, and self-play 
category. Key implementation lessons (distilled from these systems) include massively parallel self-play simulation, version management and checkpointing, dedicated exploiter\/league designs, and meta-game analysis (PSRO).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"894\" height=\"1024\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-19-894x1024.png\"  class=\"wp-image-712\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-19-894x1024.png 894w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-19-262x300.png 262w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-19-768x879.png 768w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/12\/image-19.png 993w\" sizes=\"auto, (max-width: 894px) 100vw, 894px\" title=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe9\" alt=\"A Survey on Self-Play Methods in Reinforcement Learning\u63d2\u56fe9\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Evaluation Metrics and Experimental Design<\/h2>\n\n\n\n<p>The paper discusses several metrics and practices for evaluating self-play policies, including:<\/p>\n\n\n\n<p>NashConv: measures the distance from a Nash equilibrium as the total gain players could obtain by unilaterally switching to a best response (commonly used in zero-sum games in particular).<\/p>\n\n\n\n<p>Exploitability \/ exploit rate: measures how readily an exploiter can beat the policy (lower means more robust).<\/p>\n\n\n\n<p>Head-to-head win rate \/ Elo \/ Glicko: direct measures of playing strength (but potentially misleading in non-transitive games).<\/p>\n\n\n\n<p>Population-level coverage \/ diversity 
metrics: measure how well the policy population covers the strategy space (useful for judging whether non-transitive cycles have been resolved).<\/p>\n\n\n\n<p>Sample efficiency \/ compute cost \/ parallelization overhead: critical in engineering practice.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Main Challenges (Research Gaps Summarized by the Paper)<\/h2>\n\n\n\n<p>The paper lists and discusses several problems that remain unsolved or insufficiently solved:<\/p>\n\n\n\n<p>Non-transitivity and cyclic dynamics: how can a training pipeline be built that improves steadily in highly non-transitive environments? (PSRO and population methods point the way, but their sample and compute costs are high.)<\/p>\n\n\n\n<p>Sample\/compute efficiency: large-scale self-play is very expensive; how can more sample-efficient meta-solvers, estimators, or simulators reduce the cost?<\/p>\n\n\n\n<p>Convergence and theoretical guarantees: many methods work in practice but lack broad theoretical convergence guarantees (especially in non-zero-sum and multiplayer settings).<\/p>\n\n\n\n<p>Incomplete evaluation benchmarks: no benchmark suite yet covers the full range of capabilities of modern self-play methods; a comprehensive 
benchmark is missing especially for multiplayer, non-zero-sum, and non-transitive settings.<\/p>\n\n\n\n<p>From simulation to reality (Sim2Real) and robustness in human-machine interaction: real-world applications (robotics, economic simulation) still face transfer problems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion and Contributions of the Paper<\/h2>\n\n\n\n<p>This survey systematizes the self-play research landscape, from formal definitions to a unified framework, from algorithm taxonomy to representative systems and evaluation practice, and from engineering experience to theoretical and practical challenges. It offers a clear \u201cmap\u201d that researchers and engineering teams can use to choose methods, design experiments, and identify research gaps. The paper also highlights the potential of PSRO \/ population-based methods for handling non-transitivity, as well as the pressing need for improvement in sample\/compute efficiency and theoretical guarantees.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Key Points of the Paper The paper proposes a unified framework for characterizing 
self-play methods (in which an agent interacts with itself or with past versions of itself to improve its policy). It classifies them along dimensions such as the policy update mechanism, opponent selection and population management, and game type (zero-sum\/non-zero-sum, transitive\/non-transitive). It also reviews representative algorithms, application scenarios, and theoretical and practical challenges, and lists future research directions and evaluation metrics. Formal Background (Models and Game Concepts) Multi-agent \/ game framework: the paper builds on general multi-agent reinforcement learning (MARL) and game theory, distinguishing normal-form from extensive-form games, static &hellip; <a href=\"https:\/\/www.ndnlab.com\/?p=702\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":703,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-702","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-rengongzhineng"],"_links":{"self":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts\/702","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=702"}],"version-history":[{"count":1,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts\/702\/revisions"}],"predecessor-version":[{"id":713,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts\/702\/revisions\/713"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/media\/703"}],"wp:attachment":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=702"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=702"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=702"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}