{"id":455,"date":"2025-08-14T21:41:21","date_gmt":"2025-08-14T13:41:21","guid":{"rendered":"https:\/\/www.ndnlab.com\/?p=455"},"modified":"2025-11-09T04:37:00","modified_gmt":"2025-11-08T20:37:00","slug":"cassini-network-aware-job-scheduling-in-machine-learning-clusters","status":"publish","type":"post","link":"https:\/\/www.ndnlab.com\/?p=455","title":{"rendered":"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters"},"content":{"rendered":"\n<p><strong>\u4f5c\u8005:<\/strong><a href=\"https:\/\/arxiv.org\/search\/cs?searchtype=author&amp;query=Rajasekaran,+S\">Sudarsanan Rajasekaran<\/a>&nbsp;(1),&nbsp;<a href=\"https:\/\/arxiv.org\/search\/cs?searchtype=author&amp;query=Ghobadi,+M\">Manya Ghobadi<\/a>&nbsp;(1),&nbsp;<a href=\"https:\/\/arxiv.org\/search\/cs?searchtype=author&amp;query=Akella,+A\">Aditya Akella<\/a>&nbsp;(2) ((1) Massachusetts Institute of Technology, (2) UT Austin)<\/p>\n\n\n\n<p class=\"has-text-align-left\"><strong>\u8bba\u6587\u6458\u8981\u539f\u6587\uff1a<\/strong>We present <a href=\"https:\/\/www.usenix.org\/conference\/nsdi24\/presentation\/rajasekaran\">CASSINI, a network-aware job scheduler for machine learning (ML) clusters<\/a>. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an affinity graph that finds a series of time-shift values to adjust the communication phases of a subset of jobs such that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that compared to the state-of-the-ar<\/p>\n\n\n\n<p class=\"has-text-align-left\">t ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6\u00d7 and 2.5\u00d7,respectively. Moreover, we show that CASSINI reduces the number of ECN marked packets in the cluster by up to 33\u00d7.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"871\" height=\"652\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/image.png\"  class=\"wp-image-530\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/image.png 871w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/image-300x225.png 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/image-768x575.png 768w\" sizes=\"auto, (max-width: 871px) 100vw, 871px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>\u7814\u7a76\u80cc\u666f<\/strong><\/h2>\n\n\n\n<p>\u8bad\u7ec3\u5927\u89c4\u6a21\u6a21\u578b\u9700\u8981\u4e00\u4e2a\u9ad8\u6548\u7684GPU\u96c6\u7fa4\u3002<\/p>\n\n\n\n<p>\u76f8\u5173\u7814\u7a76\u8868\u660e\u968f\u7740GPU\u6570\u91cf\u7684\u589e\u52a0\uff0c\u5206\u5e03\u5f0f\u8bad\u7ec3\u7684\u901a\u4fe1\u5f00\u9500\u5360\u603b\u65f6\u95f4\u5f00\u9500\u7684\u4e00\u5927\u90e8\u5206\u3002<\/p>\n\n\n\n<p>\u73b0\u6709\u6700\u5148\u8fdb\u7684\uff08state-of-the-art, SOTA\uff09\u6df1\u5ea6\u5b66\u4e60\u8c03\u5ea6\u5668\uff08Machine Learning Scheduler\uff09\u5728\u5206\u914d\u4efb\u52a1\u65f6\u5ffd\u7565\u4e86\u5206\u5e03\u5f0f\u8bad\u7ec3\u4efb\u52a1\u7684\u901a\u4fe1\u6a21\u5f0f\u3002<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>\u7814\u7a76\u52a8\u673a<\/strong><\/h2>\n\n\n\n<p>\u56e0\u6b64\uff0c\u672c\u6587\u5bf9\u5206\u5e03\u5f0f\u8bad\u7ec3\u4efb\u52a1\u5728\u4e0d\u540c\u5e76\u884c\u7b56\u7565\u4e0b\u7684\u901a\u4fe1\u6a21\u5f0f\u8fdb\u884c\u4e86\u5206\u6790\uff0c\u5206\u522b\u662f<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=%E6%95%B0%E6%8D%AE%E5%B9%B6%E8%A1%8C&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiLmlbDmja7lubbooYwiLCJ6aGlkYV9zb3VyY2UiOiJlbnRpdHkiLCJjb250ZW50X2lkIjoyNTk1Njk3MDMsImNvbnRlbnRfdHlwZSI6IkFydGljbGUiLCJtYXRjaF9vcmRlciI6MSwiemRfdG9rZW4iOm51bGx9.Qb0V8KmC5NhWk6AG5SRyhCWEt30H9rnFpNoeKfykPbw&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">\u6570\u636e\u5e76\u884c<\/a>\u3001<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=%E6%B5%81%E6%B0%B4%E7%BA%BF%E5%B9%B6%E8%A1%8C&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiLmtYHmsLTnur_lubbooYwiLCJ6aGlkYV9zb3VyY2UiOiJlbnRpdHkiLCJjb250ZW50X2lkIjoyNTk1Njk3MDMsImNvbnRlbnRfdHlwZSI6IkFydGljbGUiLCJtYXRjaF9vcmRlciI6MSwiemRfdG9rZW4iOm51bGx9.yknQ40JBeh5dazes2bj5D8COSwrVmY22ATSkkW3_tqA&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">\u6d41\u6c34\u7ebf\u5e76\u884c<\/a>\u3001<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=%E5%BC%A0%E9%87%8F%E5%B9%B6%E8%A1%8C&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiLlvKDph4_lubbooYwiLCJ6aGlkYV9zb3VyY2UiOiJlbnRpdHkiLCJjb250ZW50X2lkIjoyNTk1Njk3MDMsImNvbnRlbnRfdHlwZSI6IkFydGljbGUiLCJtYXRjaF9vcmRlciI6MSwiemRfdG9rZW4iOm51bGx9.MzXBrR8a9YyZOC7AkaomRWnnIZoO-9ded4HeqB9V_9A&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">\u5f20\u91cf\u5e76\u884c<\/a>\u4ee5\u53ca<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=3D%E6%B7%B7%E5%90%88%E5%B9%B6%E8%A1%8C&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiIzROa3t-WQiOW5tuihjCIsInpoaWRhX3NvdXJjZSI6ImVudGl0eSIsImNvbnRlbnRfaWQiOjI1OTU2OTcwMywiY29udGVudF90eXBlIjoiQXJ0aWNsZSIsIm1hdGNoX29yZGVyIjoxLCJ6ZF90b2tlbiI6bnVsbH0.B1E1ouJRT5xIHYB-jSlQ5n0V8A3mcaDlRM5MBRqO1hk&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">3D\u6df7\u5408\u5e76\u884c<\/a>\uff0c\u7ed3\u679c\u5982\u4e0b\u56fe\u6240\u793a\uff1a<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"384\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5c0c8394786bd6aa29df7698575bd43e_1440w.jpg\"  class=\"wp-image-473\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5c0c8394786bd6aa29df7698575bd43e_1440w.jpg 1440w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5c0c8394786bd6aa29df7698575bd43e_1440w-300x80.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5c0c8394786bd6aa29df7698575bd43e_1440w-1024x273.jpg 1024w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5c0c8394786bd6aa29df7698575bd43e_1440w-768x205.jpg 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe1\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe1\" \/><\/figure>\n\n\n\n<p>\u5b50\u56fe(a)-\u6570\u636e\u5e76\u884c\uff1a\u65e0\u9700\u901a\u4fe1\u7684\u524d\u53cd\u5411\u8ba1\u7b97\u4ee5\u53ca\u6700\u540e\u7684AllReduce\u901a\u4fe1\u3002<\/p>\n\n\n\n<p>\u5b50\u56fe(b)-\u6d41\u6c34\u7ebf\u5e76\u884c\uff1a\u4f7f\u7528<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=PipeDream&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiJQaXBlRHJlYW0iLCJ6aGlkYV9zb3VyY2UiOiJlbnRpdHkiLCJjb250ZW50X2lkIjoyNTk1Njk3MDMsImNvbnRlbnRfdHlwZSI6IkFydGljbGUiLCJtYXRjaF9vcmRlciI6MSwiemRfdG9rZW4iOm51bGx9.uLb5kbZtuPn2mAAhezT33tNi7Pr4VYuEofgBxEI-lss&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">PipeDream<\/a>\u6d41\u6c34\u7ebf\u5e76\u884c\u65b9\u6cd5\uff0c\u524d\u53cd\u5411\u8fc7\u7a0b\u4e2d\u6709\u6fc0\u6d3b\u503c\u7684\u901a\u4fe1\uff0c\u6700\u540e\u4f1a\u6709\u4e00\u4e2aEmbedding\u7684AllReduce\u901a\u4fe1\uff0c\u5f00\u9500\u8f83\u5927\u3002<\/p>\n\n\n\n<p>\u5b50\u56fe(c)-\u5f20\u91cf\u5e76\u884c\uff1a\u524d\u53cd\u5411\u8fc7\u7a0b\u90fd\u9700\u8981\u8fdb\u884c\u901a\u4fe1\uff0c\u5728\u8fed\u4ee3\u4e4b\u95f4\uff0c\u8fdb\u884c\u6570\u636e\u52a0\u8f7d\u65f6\uff0c\u6ca1\u6709\u901a\u4fe1\u5f00\u9500\u3002<\/p>\n\n\n\n<p>\u5b50\u56fe(d)-3D\u6df7\u5408\u5e76\u884c\uff1a\u5728\u524d\u53cd\u5411\u8fc7\u7a0b\u4e2d\u6709\u8bb8\u591a\u901a\u4fe1\u64cd\u4f5c\uff0c\u8fd9\u4e9b\u901a\u4fe1\u64cd\u4f5c\u5bf9\u7f51\u7edc\u5e26\u5bbd\u7684\u9700\u6c42\u4e0d\u540c\u3002<\/p>\n\n\n\n<p>\u5173\u952e\u53d1\u73b0\uff1a<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u7f51\u7edc\u5e26\u5bbd\u7684\u9700\u6c42\u5728\u8fed\u4ee3\u4e4b\u95f4\u662f\u91cd\u590d\u6709\u89c4\u5f8b\u7684\u3002<\/li>\n\n\n\n<li>\u5355\u6b21\u8fed\u4ee3\u8fc7\u7a0b\u4e2d\u5b58\u5728\u591a\u4e2aUp\u548cDown\u9636\u6bb5\uff08\u5b50\u56fe(d)\u4e2d1\u30013\u4e3aDown\u9636\u6bb5\uff0c2\u30014\u30015\u30016\u4e3aUp\u9636\u6bb5\uff09\uff0c\u4e0d\u540cUp\u9636\u6bb5\u5bf9\u7f51\u7edc\u5e26\u5bbd\u7684\u9700\u6c42\u53ef\u80fd\u4e0d\u540c\u3002<\/li>\n<\/ul>\n\n\n\n<p>\u57fa\u4e8e\u8fd9\u4e9b\u53d1\u73b0\uff0c\u672c\u6587\u5c31\u63d0\u51fa\u4e86<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=CASSINI&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiJDQVNTSU5JIiwiemhpZGFfc291cmNlIjoiZW50aXR5IiwiY29udGVudF9pZCI6MjU5NTY5NzAzLCJjb250ZW50X3R5cGUiOiJBcnRpY2xlIiwibWF0Y2hfb3JkZXIiOjEsInpkX3Rva2VuIjpudWxsfQ.mSKXnXJkSRZVUvIF0aM-HH8yW4JSHVQl4z9zpyRfrVc&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">CASSINI<\/a>\uff0c\u901a\u8fc7\u4ea4\u9519\uff08Interleaving\uff09\u96c6\u7fa4\u4e0a\u4e0d\u540c\u4efb\u52a1Up\u548cDown\u9636\u6bb5\uff0c\u907f\u514d\u901a\u4fe1\u7ade\u4e89\uff0c\u5b9e\u73b0\u4efb\u52a1\u8bad\u7ec3\u6548\u7387\u7684\u63d0\u5347\u3002<strong>CASSINI<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Single Link Multiple Tasks<\/strong><\/h3>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Geometric Abstraction<\/strong><\/h3>\n\n\n\n<p>\u4e3a\u4e86\u89e3\u51b3\u5355\u94fe\u8def\u591a\u4efb\u52a1\u7684\u95ee\u9898\uff0cCASSINI\u9996\u5148\u63d0\u51fa\u4e86\u4e00\u4e2a\u51e0\u4f55\u62bd\u8c61\uff0c\u5bf9\u4e8e\u5355\u4e2a\u4efb\u52a1\uff0c\u6211\u4eec\u4f7f\u7528\u4e00\u4e2a\u5468\u957f\u4e3a\u5176\u8fed\u4ee3\u65f6\u95f4\u7684\u5706\u6765\u8868\u793a\uff0c\u7136\u540e\u6839\u636e\u5176\u5728\u8be5\u94fe\u8def\u4e0a\u901a\u4fe1\u53d1\u751f\u7684\u65f6\u95f4\uff0c\u6211\u4eec\u5728\u5706\u4e0a\u8fdb\u884c\u76f8\u5e94\u7684\u6807\u6ce8\uff0c \u5c31\u5f97\u5230\u4e86\u4e00\u4e2a\u4efb\u52a1\u7684\u51e0\u4f55\u62bd\u8c61\uff0c\u5982\u4e0b\u56fe\u7684\u5de6\u8fb9\u6240\u793a\u3002\u5047\u8bbe\u6211\u4eec\u6709\u4e24\u4e2a\u76f8\u540c\u7684\u4efb\u52a1\u5728\u8be5\u94fe\u8def\u4e0a\u8fd0\u884c\uff0c\u6211\u4eec\u5c31\u53ef\u4ee5\u901a\u8fc7\u5bf9\u51e0\u4f55\u62bd\u8c61\u8fdb\u884c\u65cb\u8f6c\uff08\u65cb\u8f6c\u89d2\u5ea6\u5927\u5c0f\u5bf9\u5e94\u7684\u662f\u67d0\u4e2a\u4efb\u52a1\u5ef6\u8fdf\u5f00\u59cb\u7684\u65f6\u95f4\uff09\uff0c\u5c06\u4e24\u4e2a\u4efb\u52a1\u7684Up\u548cDown\u9636\u6bb5\u9519\u5f00\uff0c\u907f\u514d\u901a\u4fe1\u7ade\u4e89\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"368\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3f4b5d30da395c529d7639235cafd141_1440w.jpg\"  class=\"wp-image-466\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3f4b5d30da395c529d7639235cafd141_1440w.jpg 1440w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3f4b5d30da395c529d7639235cafd141_1440w-300x77.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3f4b5d30da395c529d7639235cafd141_1440w-1024x262.jpg 1024w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3f4b5d30da395c529d7639235cafd141_1440w-768x196.jpg 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe2\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe2\" \/><\/figure>\n\n\n\n<p>\u5982\u679c\u662f\u4e24\u4e2a\u4e0d\u540c\u7684\u4efb\u52a1\uff0c\u5176\u5177\u6709\u4e0d\u540c\u7684\u8fed\u4ee3\u65f6\u95f4\uff0c\u6211\u4eec\u5c31\u53d6\u4e24\u4e2a\u4efb\u52a1\u8fed\u4ee3\u65f6\u95f4\u7684\u6700\u5c0f\u516c\u500d\u6570\u4f5c\u4e3a\u5706\u7684\u5468\u957f\uff0c\u518d\u5728\u5706\u4e0a\u8fdb\u884c\u6807\u6ce8\uff0c\u5982\u4e0b\u56fe\u5de6\u8fb9\u7684(a)\u548c(b)\u6240\u793a\u3002\u5bf9\u5e94\u4e8eMoti\u4e2d3D\u5e76\u884c\u7684\u60c5\u51b5\uff0c\u5982\u679c\u4e0d\u540c\u7684Up\u9636\u6bb5\u6709\u4e0d\u540c\u7684\u5e26\u5bbd\u9700\u6c42\uff0c\u6211\u4eec\u4f7f\u7528\u989c\u8272\u7684\u6df1\u6d45\u8fdb\u884c\u533a\u5206\uff0c\u5982\u4e0b\u56fe\u53f3\u8fb9\u6240\u793a\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"353\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-419bfe3c3a09cf53c82870601279b9ee_1440w.jpg\"  class=\"wp-image-472\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-419bfe3c3a09cf53c82870601279b9ee_1440w.jpg 1440w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-419bfe3c3a09cf53c82870601279b9ee_1440w-300x74.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-419bfe3c3a09cf53c82870601279b9ee_1440w-1024x251.jpg 1024w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-419bfe3c3a09cf53c82870601279b9ee_1440w-768x188.jpg 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe3\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe3\" \/><\/figure>\n\n\n\n<p>\u81f3\u6b64\uff0c\u6211\u4eec\u5c31\u53ef\u4ee5\u4f7f\u7528\u51e0\u4f55\u62bd\u8c61\u7edf\u4e00\u63cf\u8ff0\u5355\u4e2a\u94fe\u8def\u4e0a\u591a\u4e2a\u4efb\u52a1\u7684\u5e26\u5bbd\u4f7f\u7528\u60c5\u51b5\uff08\u5c06\u591a\u4e2a\u4efb\u52a1\u4f7f\u7528\u76f8\u540c\u5468\u957f\u7684\u5706\u8868\u793a\uff09\uff0c\u7136\u540eCASSINI\u8bbe\u8ba1\u4e86\u4f18\u5316\u51fd\u6570\u6765\u4e3a\u5355\u4e2a\u94fe\u8def\u4e0a\u7684\u591a\u4e2a\u4efb\u52a1\u6c42\u89e3\u51fa\u4e00\u4e2a\u6700\u5927\u7684\u517c\u5bb9\u6027\u5f97\u5206\uff0c\u8fdb\u800c\u6c42\u51fa\u964d\u4f4e\u901a\u4fe1\u7ade\u4e89\u7684\u6700\u4f18\u65cb\u8f6c\u89d2\u3002\u5f62\u5f0f\u5316\u8868\u8fbe\u5982\u4e0b\uff1a<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"763\" height=\"794\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3774b25298c1ac25ec29041ec765601b_1440w.jpg\"  class=\"wp-image-460\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3774b25298c1ac25ec29041ec765601b_1440w.jpg 763w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-3774b25298c1ac25ec29041ec765601b_1440w-288x300.jpg 288w\" sizes=\"auto, (max-width: 763px) 100vw, 763px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe4\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe4\" \/><\/figure>\n\n\n\n<p>\u81f3\u6b64\uff0c\u6211\u4eec\u5c31\u53ef\u4ee5\u6c42\u89e3\u51fa\u5355\u94fe\u8def\u591a\u4efb\u52a1\u60c5\u51b5\u4e0b\uff0c\u4e3a\u4e86\u964d\u4f4e\u901a\u4fe1\u7ade\u4e89\uff0c\u5404\u4e2a\u4efb\u52a1\u5e94\u8be5\u65cb\u8f6c\u7684\u89d2\u5ea6\uff0c\u901a\u8fc7\u8fd9\u4e2a\u89d2\u5ea6\uff0c\u6211\u4eec\u53ef\u4ee5\u5229\u7528\u4e0b\u9762\u7684\u516c\u5f0f\u7b97\u51fa\u5404\u4e2a\u4efb\u52a1\u7684\u5f00\u59cb\u65f6\u95f4\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"679\" height=\"103\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5bee5f54b65880cabe6f23f256ed7ffb_1440w.png\"  class=\"wp-image-469\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5bee5f54b65880cabe6f23f256ed7ffb_1440w.png 679w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-5bee5f54b65880cabe6f23f256ed7ffb_1440w-300x46.png 300w\" sizes=\"auto, (max-width: 679px) 100vw, 679px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe5\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe5\" \/><\/figure>\n\n\n\n<p><strong>Multiple Link Multiple Tasks<\/strong><\/p>\n\n\n\n<p>\u63a5\u4e0b\u6765\uff0c\u672c\u6587\u5c06\u573a\u666f\u6269\u5c55\u5230\u591a\u94fe\u8def\u591a\u4efb\u52a1\u7684\u573a\u666f\uff0c\u4e5f\u5c31\u662f\u5b9e\u9645\u8bad\u7ec3\u96c6\u7fa4\u7684\u573a\u666f\u3002\u5728\u96c6\u7fa4\u4e2d\uff0c\u5f88\u5bb9\u6613\u5c31\u51fa\u73b0\u4e00\u4e2a\u4efb\u52a1\u4e0e\u591a\u4e2a\u4efb\u52a1\u7ade\u4e89\u94fe\u8def\u7684\u60c5\u51b5\uff0c\u5982\u4e0b\u56fe\u6240\u793a\uff0c\u548c\u7ade\u4e89\uff0c\u548c\u7ade\u4e89\uff0c\u5982\u679c\u57fa\u4e8e\u548c\u5206\u522b\u5229\u7528\u5355\u94fe\u8def\u591a\u4efb\u52a1\u7684\u4f18\u5316\u51fd\u6570\u6c42\u89e3\uff0c\u90a3\u4e48\u5c06\u4f1a\u6709\u4e24\u4e2a\u5f00\u59cb\u65f6\u95f4\uff0c\u56e0\u6b64\uff0c\u5c06\u5355\u94fe\u8def\u591a\u4efb\u52a1\u6269\u5c55\u5230\u591a\u4efb\u52a1\u591a\u94fe\u8def\u7684\u573a\u666f\uff0c\u8981\u89e3\u51b3\u7684\u5c31\u662f\u5982\u4f55\u7edf\u4e00\u591a\u4e2a\u94fe\u8def\u4e0a\u6c42\u89e3\u51fa\u7684\u5f00\u59cb\u65f6\u95f4\u7684\u95ee\u9898\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"772\" height=\"315\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-77581c9f467dc0a06b57022f557a4531_1440w.jpg\"  class=\"wp-image-459\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-77581c9f467dc0a06b57022f557a4531_1440w.jpg 772w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-77581c9f467dc0a06b57022f557a4531_1440w-300x122.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-77581c9f467dc0a06b57022f557a4531_1440w-768x313.jpg 768w\" sizes=\"auto, (max-width: 772px) 100vw, 772px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe6\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe6\" \/><\/figure>\n\n\n\n<p>\u4e3a\u6b64\uff0cCASSINI\u5c06\u4efb\u52a1\u548c\u94fe\u8def\u62bd\u8c61\u6210\u4e86\u5173\u8054\u56fe\uff08Affinity graph\uff09\uff0c\u5982\u4e0b\u56fe\u5de6\u8fb9\u6240\u793a\uff0cCASSINI\u5c06\u6240\u6709\u4efb\u52a1\u653e\u5728\u96c6\u5408\u4e2d\uff0c\u5c06\u6240\u6709\u94fe\u8def\u653e\u5728\u96c6\u5408\u4e2d\uff0c\u8fd9\u6837\uff0c\u56fe\u4e2d\u7684\u4e00\u6761\u8fb9\u5c31\u8868\u793a\u4efb\u52a1\u5728\u94fe\u8def\u4e0a\u8fd0\u884c\uff0c\u8fb9\u7684\u6743\u91cd\u4e3a\u5176\u5728\u94fe\u8def\u4e0a\u7684\u5f00\u59cb\u65f6\u95f4\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"754\" height=\"528\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-692ebf7a3085c90231e947473d91ccd7_1440w.jpg\"  class=\"wp-image-470\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-692ebf7a3085c90231e947473d91ccd7_1440w.jpg 754w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-692ebf7a3085c90231e947473d91ccd7_1440w-300x210.jpg 300w\" sizes=\"auto, (max-width: 754px) 100vw, 754px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe7\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe7\" \/><\/figure>\n\n\n\n<p>\u57fa\u4e8e\u5173\u8054\u56fe\uff0cCASSINI\u57fa\u4e8e\u5e7f\u5ea6\u4f18\u5148\u641c\u7d22\u7b97\u6cd5\uff08Breadth First Search, BFS\uff09\u8bbe\u8ba1\u4e86\u7b97\u6cd51\u6765\u6c42\u89e3\u5404\u4e2a\u4efb\u52a1\u7684\u5f00\u59cb\u65f6\u95f4\uff0c\u7b97\u6cd51\u5982\u4e0b\u6240\u793a\uff1a<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"761\" height=\"1014\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-225f467946209614bf3261621f0bbe3a_1440w.jpg\"  class=\"wp-image-462\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-225f467946209614bf3261621f0bbe3a_1440w.jpg 761w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-225f467946209614bf3261621f0bbe3a_1440w-225x300.jpg 225w\" sizes=\"auto, (max-width: 761px) 100vw, 761px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe8\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe8\" \/><\/figure>\n\n\n\n<p>\u7b97\u6cd51\u5728\u5173\u8054\u56fe\u7684\u5404\u4e2a\u8fde\u901a\u5b50\u56fe\u4e0a\u8fd0\u884c\uff0c\u901a\u8fc7\u4f20\u9012\u4e0d\u540c\u4efb\u52a1\u5728\u540c\u4e00\u94fe\u8def\u4e0a\u5f00\u59cb\u65f6\u95f4\u7684\u5dee\u503c\u6765\u4e3a\u5404\u4e2a\u4efb\u52a1\u786e\u5b9a\u5f00\u59cb\u65f6\u95f4\uff0c\u540c\u65f6\u4fdd\u6301\u5404\u4e2a\u4efb\u52a1\u5728\u94fe\u8def\u4e0a\u7684\u901a\u4fe1\u7ade\u4e89\u60c5\u51b5\u4e0d\u53d8\u3002\u8bba\u6587\u5728\u9644\u5f55\u4e2d\u8bc1\u660e\u4e86\u8be5\u7b97\u6cd5\u5728\u65e0\u73af\u56fe\u4e0a\u7684\u6b63\u786e\u6027\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Put It All Together<\/strong><\/h3>\n\n\n\n<p>\u672c\u90e8\u5206\u8bb2\u8ff0\u4e86\u5c06CASSINI\u96c6\u6210\u5230<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=Themis&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiJUaGVtaXMiLCJ6aGlkYV9zb3VyY2UiOiJlbnRpdHkiLCJjb250ZW50X2lkIjoyNTk1Njk3MDMsImNvbnRlbnRfdHlwZSI6IkFydGljbGUiLCJtYXRjaF9vcmRlciI6MSwiemRfdG9rZW4iOm51bGx9.0soj8zwrf_JIGP97xZPIBpxyy5Eiq3OPVfMj0b8zuRs&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">Themis<\/a>\uff08NSDI&#8217;20\u7684\u96c6\u7fa4\u8c03\u5ea6\u5de5\u4f5c\uff09\u4e2d\u5b9e\u73b0\u843d\u5730\u7684\u8fc7\u7a0b\u3002CASSINI\u96c6\u6210\u5230Themis\u540e\u7684\u603b\u4f53\u8fd0\u884c\u6d41\u7a0b\u5982\u4e0b\u56fe\u6240\u793a\uff0c\u76f8\u5f53\u4e8e\u5728\u539f\u672cThemis\u7684\u5de5\u4f5c\u6d41\u4e2d\u52a0\u5165\u4e86\u56fe\u4e2d\u84dd\u8272\u7684Cassini\u6a21\u5757\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"618\" height=\"821\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-c4bbec53aac11237b4150ee013854e31_1440w.jpg\"  class=\"wp-image-461\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-c4bbec53aac11237b4150ee013854e31_1440w.jpg 618w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-c4bbec53aac11237b4150ee013854e31_1440w-226x300.jpg 226w\" sizes=\"auto, (max-width: 618px) 100vw, 618px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe9\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe9\" \/><\/figure>\n\n\n\n<p>\u9ed8\u8ba4\u60c5\u51b5\u4e0b\uff0cThemis\u4f1a\u76f4\u63a5\u7ed9\u51fa\u4efb\u52a1\u8fd0\u884c\u7684GPU\u6570\u91cf\u548c\u5bf9\u5e94\u7684\u8bbe\u5907\u653e\u7f6e\u7b56\u7565\uff0c\u4f46\u662f\u7531\u4e8eCASSINI\u7684\u7b97\u6cd51\u7684\u6b63\u786e\u8fd0\u884c\u9700\u8981\u65e0\u73af\u56fe\u7684\u4fdd\u8bc1\uff0c\u6240\u4ee5\u8fd9\u91cc\u5c06Themis\u539f\u672c\u5de5\u4f5c\u6d41\u4e2d\u4e3a\u6bcf\u4e2a\u4efb\u52a1\u5206\u914dGPU\u6570\u91cf\u548c\u6bcf\u4e2a\u4efb\u52a1\u7684\u8bbe\u5907\u653e\u7f6e\u7b56\u7565\u89e3\u8026\uff0c\u4f7f\u5f97Themis\u80fd\u591f\u4e3a\u6bcf\u4e2a\u4efb\u52a1\u8fd4\u56deN\u4e2a\u53ef\u80fd\u7684\u653e\u7f6e\u7b56\u7565\uff0c\u6587\u4e2d\u4e5f\u63d0\u5230\u4e0d\u540c\u7684\u653e\u7f6e\u7b56\u7565\u4e0d\u4f1a\u5f71\u54cdThemis\u7684\u8bc4\u4ef7\u6307\u6807\uff0c\u6240\u4ee5\u5728\u6bcf\u4e00\u6b21\u8c03\u5ea6\u5185\u5bf9Themis\u6ca1\u6709\u5f71\u54cd\u3002<\/p>\n\n\n\n<p>\u57fa\u4e8eThemis\u7ed9\u51fa\u7684\u4efb\u52a1\u653e\u7f6e\u7b56\u7565\uff0cCASSINI\u8bbe\u8ba1\u4e86\u7b97\u6cd52\u6c42\u89e3\u8be5\u653e\u7f6e\u7b56\u7565\u4e0b\u96c6\u7fa4\u7684\u517c\u5bb9\u6027\u5f97\u5206\uff0c\u6839\u636e\u517c\u5bb9\u6027\u5f97\u5206\u9009\u51fa\u6700\u4f18\u7684\u653e\u7f6e\u7b56\u7565\uff0c\u6700\u540e\u57fa\u4e8e\u6700\u4f18\u7684\u653e\u7f6e\u7b56\u7565\u6c42\u89e3\u51fa\u6bcf\u4e2a\u4efb\u52a1\u7684\u5f00\u59cb\u65f6\u95f4\u3002<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"778\" height=\"1396\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-2a13d38a45cd218ef8d4d2d282df2f05_1440w.jpg\"  class=\"wp-image-463\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-2a13d38a45cd218ef8d4d2d282df2f05_1440w.jpg 778w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-2a13d38a45cd218ef8d4d2d282df2f05_1440w-167x300.jpg 167w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-2a13d38a45cd218ef8d4d2d282df2f05_1440w-571x1024.jpg 571w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-2a13d38a45cd218ef8d4d2d282df2f05_1440w-768x1378.jpg 768w\" sizes=\"auto, (max-width: 778px) 100vw, 778px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe10\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe10\" \/><\/figure>\n\n\n\n<p>\u6700\u540e\uff0c\u914d\u7f6eThemis\u4f7f\u7528\u6700\u4f18\u7684\u653e\u7f6e\u7b56\u7565\u53ca\u5404\u4e2a\u4efb\u52a1\u7684\u5f00\u59cb\u65f6\u95f4\u5373\u53ef\u3002<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>\u5b9e\u9a8c\u7ed3\u679c<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u5b9e\u9a8c\u8bbe\u7f6e<\/strong><\/h3>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u5b9e\u9a8c\u73af\u5883<\/strong><\/h3>\n\n\n\n<p>24\u53f0 ASUS ESC4000A-E10\u670d\u52a1\u5668\uff1a\u4e00\u5f2040GB A100\uff0c\u4e00\u5f2050Gbps Mellanox ConnectX5\u7f51\u5361<\/p>\n\n\n\n<p>\u4f7f\u7528Ubuntu 18.04 LTS\uff0cPyTorch 1.8.0\uff0cCUDA 11.1\u4ee5\u53caNCCL 2.11.4<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u7f51\u7edc\u62d3\u6251<\/strong><\/h3>\n\n\n\n<p>\u642d\u5efa\u4e86\u5982\u4e0b\u56fe\u6240\u793a\u7684\u7f51\u7edc\u62d3\u6251\uff1a<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"750\" height=\"288\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-4f5da47e52679aad7dd183d65ecc2c51_1440w.jpg\"  class=\"wp-image-457\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-4f5da47e52679aad7dd183d65ecc2c51_1440w.jpg 750w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-4f5da47e52679aad7dd183d65ecc2c51_1440w-300x115.jpg 300w\" sizes=\"auto, (max-width: 750px) 100vw, 750px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe11\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe11\" \/><\/figure>\n\n\n\n<p><strong>\u6a21\u578b\u8d1f\u8f7d<\/strong><\/p>\n\n\n\n<p>13\u79cdDNN\uff1aVGG11\u3001VGG13\u3001VGG19\u3001ResNet50\u3001WideResNet101\u3001BERT\u3001RoBERTa\u3001XLM\u3001CamemBERT\u3001GPT-1\u3001GPT-2\u3001GPT-3\u548cDLRM\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u5e76\u884c\u7b56\u7565<\/strong><\/h3>\n\n\n\n<p>\u6570\u636e\u5e76\u884c\uff08PyTorch DDP\uff09\uff1aVGG\u3001ResNet\u3001BERT<\/p>\n\n\n\n<p>\u6df7\u5408\u5e76\u884c\uff08Meta&#8217;s opensource codebase\uff09\uff1aDLRM<\/p>\n\n\n\n<p>\u6df7\u5408\u5e76\u884c\uff08Microsoft DeepSpeed\uff09\uff1aGPT<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Traces<\/strong><\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Poission trace: \u4efb\u52a1\u5230\u8fbe\u65f6\u95f4\u7b26\u5408\u6cca\u677e\u5206\u5e03\uff0c\u4fdd\u8bc1\u96c6\u7fa4GPU\u5229\u7528\u7387\u572880%-100%\u4e4b\u95f4\u3002<\/li>\n\n\n\n<li>Dynamic trace: \u4e00\u6279DNN\u4efb\u52a1\u5728\u96c6\u7fa4\u4e0a\u8fd0\u884c\uff0c\u53e6\u4e00\u6279DNN\u4efb\u52a1\u5230\u8fbe\u3002<\/li>\n\n\n\n<li>snapshot trace: \u96c6\u7fa4\u8fd0\u884c\u8fc7\u7a0b\u4e2d\u7684\u5feb\u7167\u3002<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u53c2\u4e0e\u6bd4\u8f83\u7684\u65b9\u6cd5<\/strong><\/h3>\n\n\n\n<p>Themis\u3001Th+CASSNI\u3001Pollux\u3001Po+CASSINI\u3001Ideal\u3001Random<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u6027\u80fd\u63d0\u5347<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"486\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-d411e533d8281e133cf9847db9610c0f_1440w.jpg\"  class=\"wp-image-465\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-d411e533d8281e133cf9847db9610c0f_1440w.jpg 1440w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-d411e533d8281e133cf9847db9610c0f_1440w-300x101.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-d411e533d8281e133cf9847db9610c0f_1440w-1024x346.jpg 1024w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-d411e533d8281e133cf9847db9610c0f_1440w-768x259.jpg 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe12\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe12\" \/><\/figure>\n\n\n\n<p>\u56fe\u7684\u5de6\u534a\u90e8\u5206\u5c55\u793a\u4e86\u96c6\u7fa4\u8fd0\u884c\u8fc7\u7a0b\u4e2d\u5404\u4e2a\u4efb\u52a1\u7684\u8fed\u4ee3\u65f6\u95f4\uff0c\u9664\u4e86DRLM\u5176\u4ed6DNN\u8d1f\u8f7d\u5747\u4f7f\u7528\u6570\u636e\u5e76\u884c\u3002\u53ef\u4ee5\u770b\u5230\uff0cCASSINI\u901a\u8fc7\u6d88\u9664\u4e0d\u540c\u4efb\u52a1\u5728\u94fe\u8def\u4e0a\u7684\u901a\u4fe1\u7ade\u4e89\uff0c\u5927\u5927\u964d\u4f4e\u4e86\u4efb\u52a1\u8fd0\u884c\u8fc7\u7a0b\u4e2d\u7684\u8fed\u4ee3\u65f6\u95f4\u3002\u4e0d\u8fc7\u8fd9\u91cc\u6709\u4e00\u4e2a\u7591\u60d1\u7684\u70b9\u662f\u56fe\u4e2d\u5e76\u6ca1\u6709\u5c55\u73b0\u51faCASSINI\u51b3\u7b56\u540e\u5ef6\u8fdf\u4efb\u52a1\u5f00\u59cb\u65f6\u95f4\u7684\u6548\u679c\uff0c\u770b\u8d77\u6765\u5404\u4e2a\u4efb\u52a1\u7684\u5f00\u59cb\u65f6\u95f4\u5728Themis\u548cTh+CASSINI\u4e2d\u662f\u4e00\u6837\u7684\u3002<\/p>\n\n\n\n<p>\u56fe\u7684\u53f3\u534a\u90e8\u5206\u5c55\u793a\u4e86\u8fed\u4ee3\u65f6\u95f4\u7684CDP\u66f2\u7ebf\uff0c\u53ef\u4ee5\u770b\u5230CASSINI\u80fd\u591f\u5c06Themis\u7684\u8c03\u5ea6\u7ed3\u679c\u63d0\u5347\u5230\u63a5\u8fd1\u4e13\u7528GPU\u96c6\u7fa4\u8bad\u7ec3\u7684\u6548\u679c\uff0c\u76f8\u6bd4\u5982Themis\uff0cTh+CASSINI\u5c06\u5e73\u5747\u7684\u8fed\u4ee3\u65f6\u95f4\u63d0\u5347\u4e861.6\u500d\uff0c\u5c06\u8fed\u4ee3\u65f6\u95f4\u768499\u5206\u4f4d\u6570\u63d0\u9ad8\u4e861.8\u500d\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u6a21\u578b\u5e76\u884c\u7684\u5f71\u54cd<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"330\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-47e92549f37ce7d1899bfec4b8c183bf_1440w.jpg\"  class=\"wp-image-458\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-47e92549f37ce7d1899bfec4b8c183bf_1440w.jpg 1440w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-47e92549f37ce7d1899bfec4b8c183bf_1440w-300x69.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-47e92549f37ce7d1899bfec4b8c183bf_1440w-1024x235.jpg 1024w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-47e92549f37ce7d1899bfec4b8c183bf_1440w-768x176.jpg 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe13\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe13\" \/><\/figure>\n\n\n\n<p>\u63a2\u7a76CASSINI\u5728\u6a21\u578b\u5e76\u884c\u60c5\u51b5\u4e0b\u7684\u6548\u679c\uff0c\u4f7f\u7528\u7684\u6a21\u578b\u662fDLRM\u548cGPT\u7cfb\u5217\u6a21\u578b\u3002<\/p>\n\n\n\n<p>\u76f8\u6bd4\u5982Themis\uff0cTh+CASSINI\u5c06\u5e73\u5747\u7684\u8fed\u4ee3\u65f6\u95f4\u63d0\u5347\u4e861.2\u500d\uff0c\u5c06\u8fed\u4ee3\u65f6\u95f4\u768499\u5206\u4f4d\u6570\u63d0\u9ad8\u4e861.6\u500d\u3002<\/p>\n\n\n\n<p>\u5b50\u56fe(b)-(e)\u7ed9\u51fa\u4e86\u96c6\u7fa4\u4e2d\u4e0d\u540c\u6a21\u578b\u6bcf\u4e2a\u8fed\u4ee3\u4ea7\u751f\u7684\u663e\u5f0f\u62e5\u585e\u901a\u77e5\uff08Explicit Congestion Notification\uff0cECN\uff09\u6570\u636e\u5305\u7684\u6570\u91cf\uff0cTh+CASSINI\u65b9\u6cd5\u76f8\u8f83\u4e8eThemis\u65b9\u6cd5\u5927\u5927\u964d\u4f4e\u4e86<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=ECN%E6%95%B0%E6%8D%AE%E5%8C%85&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiJFQ07mlbDmja7ljIUiLCJ6aGlkYV9zb3VyY2UiOiJlbnRpdHkiLCJjb250ZW50X2lkIjoyNTk1Njk3MDMsImNvbnRlbnRfdHlwZSI6IkFydGljbGUiLCJtYXRjaF9vcmRlciI6MSwiemRfdG9rZW4iOm51bGx9._UwVBaM8E_iREhUP_jECcBr0Oda-MhN9-CGkkqFuZJI&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">ECN\u6570\u636e\u5305<\/a>\u6570\u91cf\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>CASSINI\u964d\u4f4e\u901a\u4fe1\u7ade\u4e89\u7684\u6548\u679c<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"392\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-7099099ae85afe449854fc09a0044ea7_1440w.jpg\"  class=\"wp-image-467\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-7099099ae85afe449854fc09a0044ea7_1440w.jpg 1440w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-7099099ae85afe449854fc09a0044ea7_1440w-300x82.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-7099099ae85afe449854fc09a0044ea7_1440w-1024x279.jpg 1024w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-7099099ae85afe449854fc09a0044ea7_1440w-768x209.jpg 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe14\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe14\" \/><\/figure>\n\n\n\n<p>Dynamic trace\uff1a\u96c6\u7fa4\u4e0a\u672c\u6765\u8fd0\u884c\u7740VGG16\u3001RoBERTa\u4ee5\u53caDLRM\uff0c\u6b64\u65f6\u6765\u4e86\u4e00\u6279DLRM\u548cResNet50\u7684\u4efb\u52a1<\/p>\n\n\n\n<p>Pollux\u548cThemis\u7684\u653e\u7f6e\u4f7f\u5f97\u65b0\u6765\u7684DLRM\u4efb\u52a1\u4e0e\u539f\u6765\u7684\u4e0d\u517c\u5bb9DLRM\u7684\u4efb\u52a1\u5171\u4eab\u94fe\u8def\uff0c\u589e\u52a0\u4e86\u8fed\u4ee3\u65f6\u95f4\uff0c\u76f8\u6bd4\u4e8eThemis\uff0cTh+CASSINI\u5728\u5e73\u5747\u8fed\u4ee3\u65f6\u95f4\u548c\u8fed\u4ee3\u65f6\u95f4\u768499\u5206\u4f4d\u6570\u4e0a\u53d6\u5f97\u4e861.2\u548c2.2\u500d\u7684\u52a0\u901f\uff1b\u76f8\u6bd4\u4e8ePollux\uff0cPo+CASSINI\u5728\u5e73\u5747\u8fed\u4ee3\u65f6\u95f4\u548c\u8fed\u4ee3\u65f6\u95f4\u768499\u5206\u4f4d\u6570\u4e0a\u53d6\u5f97\u4e861.6\u548c2.5\u500d\u7684\u52a0\u901f\u3002<\/p>\n\n\n\n<p>\u5b50\u56fe(b)-(d)\u7ed9\u51fa\u4e86\u96c6\u7fa4\u4e2d\u4e0d\u540c\u6a21\u578b\u6bcf\u4e2a\u8fed\u4ee3\u4ea7\u751f\u7684ECN\u6570\u636e\u5305\u7684\u6570\u91cf\uff0c\u53ef\u4ee5\u770b\u5230Th+CASSINI\u4ee5\u53caPo+CASSINI\u65b9\u6cd5\u5728\u4e0d\u540c\u7684\u6a21\u578b\u4e0a\u76f8\u8f83\u4e8e\u539f\u6765\u7684\u65b9\u6cd5\u90fd\u53ef\u4ee5\u5b9e\u73b0ECN\u6570\u636e\u5305\u6570\u91cf\u7684\u964d\u4f4e\u3002\u5728DLRM\u6a21\u578b\u4e0a\uff0cCASSINI\u76f8\u8f83\u4e8eThemis\u548cPollux\u80fd\u591f\u5c06ECN\u6570\u636e\u5305\u6570\u91cf\u964d\u4f4e27\u548c33\u500d\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u90e8\u5206\u517c\u5bb9\u60c5\u51b5\u4e0b\u7684\u6027\u80fd<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"748\" height=\"457\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-8e27cb464cca449d383152f76e785d3a_1440w.jpg\"  class=\"wp-image-471\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-8e27cb464cca449d383152f76e785d3a_1440w.jpg 748w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-8e27cb464cca449d383152f76e785d3a_1440w-300x183.jpg 300w\" sizes=\"auto, (max-width: 748px) 100vw, 748px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe15\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe15\" \/><\/figure>\n\n\n\n<p>\u4e0a\u8868\u5c55\u793a\u4e86\u4e94\u4e2a\u4e0d\u540c\u7684\u65f6\u523b\u96c6\u7fa4\u4e2dDNN\u8bad\u7ec3\u4efb\u52a1\u5728\u67d0\u6761\u94fe\u8def\u4e0a\u7684\u517c\u5bb9\u6027\u5f97\u5206\uff0c\u53ef\u4ee5\u770b\u5230\uff0c\u5f53\u517c\u5bb9\u6027\u5f97\u5206\u964d\u4f4e\u81f30.6\u65f6\uff0cCASSINI\u5e26\u6765\u7684\u6027\u80fd\u63d0\u5347\u5c31\u53d8\u5f97\u76f8\u5f53\u6709\u9650\u3002\u4e3a\u4e86\u5206\u6790\u5176\u539f\u56e0\uff0c\u8bba\u6587\u5c55\u793a\u4e86\u4e94\u4e2a\u4e0d\u540c\u65f6\u523b\u8be5\u94fe\u8def\u7684\u4f7f\u7528\u60c5\u51b5\uff0c\u5982\u4e0b\u56fe\u6240\u793a\uff1a<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"373\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-260278b868dd9d015dcee6d4b312a928_1440w.jpg\"  class=\"wp-image-464\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-260278b868dd9d015dcee6d4b312a928_1440w.jpg 1440w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-260278b868dd9d015dcee6d4b312a928_1440w-300x78.jpg 300w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-260278b868dd9d015dcee6d4b312a928_1440w-1024x265.jpg 1024w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-260278b868dd9d015dcee6d4b312a928_1440w-768x199.jpg 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe16\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe16\" \/><\/figure>\n\n\n\n<p>\u53ef\u4ee5\u770b\u5230\uff0c\u5728\u524d\u56db\u4e2a\u65f6\u523b\uff0cCASSINI\u5f88\u597d\u7684\u5c06\u4e0d\u540c\u6a21\u578b\u7684\u901a\u4fe1\u4ea4\u9519\u5f00\u6765\uff0c\u4f46\u662f\u5f53\u517c\u5bb9\u6027\u5f97\u5206\u8f83\u4f4e\u65f6\uff0c\u4e0d\u540c\u6a21\u578b\u7684\u901a\u4fe1\u540c\u65f6\u8fdb\u884c\uff0c\u4ea7\u751f\u901a\u4fe1\u7ade\u4e89\uff0c\u5982\u4e0a\u56fe(e)\u6240\u793a\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>\u5355\u673a\u591a\u5361\u7684\u96c6\u7fa4\u5e26\u6765\u7684\u5f71\u54cd<\/strong><\/h3>\n\n\n\n<p>\u524d\u9762\u7684\u5b9e\u9a8c\u90fd\u662f\u5728\u5355\u673a\u5355\u5361\u7684\u96c6\u7fa4\u4e0a\u8fdb\u884c\u7684\uff0c\u73b0\u6709\u7684\u5206\u5e03\u5f0f\u8bad\u7ec3\u96c6\u7fa4\u5b58\u5728\u5355\u673a\u591a\u5361\u7684\u60c5\u51b5\uff0c\u5e76\u4e14\u504f\u5411\u4e8e\u5c06\u4efb\u52a1\u653e\u7f6e\u5728\u5355\u4e2a\u8282\u70b9\u5185\uff0c\u4f46\u662f\u968f\u7740\u4efb\u52a1\u89c4\u6a21\u7684\u589e\u5927\uff0c\u603b\u662f\u4f1a\u6709\u4efb\u52a1\u9700\u8981\u8de8\u8282\u70b9\u653e\u7f6e\uff0c\u56e0\u6b64\u4f1a\u5b58\u5728\u4e0d\u540c\u4efb\u52a1\u5171\u4eab\u94fe\u8def\u5e26\u5bbd\u7684\u60c5\u51b5\u3002\u5728\u8fd9\u79cd\u60c5\u51b5\u4e0b\uff0cCASSINI\u7684\u6027\u80fd\u6536\u76ca\u5728\u8de8\u8282\u70b9\u4efb\u52a1\u4e0a\u66f4\u52a0\u660e\u663e\u3002<\/p>\n\n\n\n<p>\u4e3a\u4e86\u9a8c\u8bc1CASSINI\u5728\u5355\u673a\u591a\u5361\u4e0a\u7684\u6027\u80fd\uff0c\u901a\u8fc7\u5bf9\u5b9e\u9a8c\u4f7f\u7528\u7684\u8bbe\u5907\u8fdb\u884c\u624b\u52a8\u8c03\u6574\uff0c\u8bba\u6587\u9020\u51fa\u4e86\u4e00\u4e2a\u5355\u8282\u70b9\u4e24\u5f20\u5361\u76846\u8282\u70b9\u96c6\u7fa4\uff0c\u5176\u7f51\u7edc\u62d3\u6251\u5982\u4e0b\u56fe\u5de6\u8fb9\u6240\u793a\uff1a<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"736\" height=\"399\" src=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-507ebaf572fffa32611e16c821ffbec3_1440w.jpg\"  class=\"wp-image-468\" srcset=\"https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-507ebaf572fffa32611e16c821ffbec3_1440w.jpg 736w, https:\/\/www.ndnlab.com\/wp-content\/uploads\/2025\/08\/v2-507ebaf572fffa32611e16c821ffbec3_1440w-300x163.jpg 300w\" sizes=\"auto, (max-width: 736px) 100vw, 736px\" title=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe17\" alt=\"CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters\u63d2\u56fe17\" \/><\/figure>\n\n\n\n<p>\u5728\u8fd9\u4e2a\u96c6\u7fa4\u4e0a\u4f7f\u7528\u6570\u636e\u5e76\u884c\u548c\u6a21\u578b\u5e76\u884c\u4efb\u52a1\u8fdb\u884c\u5b9e\u9a8c\uff0c\u5b9e\u9a8c\u7ed3\u679c\u8868\u660e\u76f8\u6bd4\u4e8eThemis\uff0cTh+CASSINI\u5728\u5e73\u5747\u8fed\u4ee3\u65f6\u95f4\u548c\u8fed\u4ee3\u65f6\u95f4\u768499\u5206\u4f4d\u6570\u4e0a\u53d6\u5f97\u4e861.4\u548c1.9\u500d\u7684\u52a0\u901f\u3002\u8fd9\u4e9b\u6536\u76ca\u662f\u4e00\u4e9b\u8de8\u8282\u70b9\u4efb\u52a1\u5e26\u6765\u7684\u3002\u4f8b\u5982\uff0cXLM\u6a21\u578b\u548cResNet50\u5747\u4f7f\u7528\u4e09\u4e2aGPU\u8fdb\u884c\u8bad\u7ec3\uff0c\u6b64\u65f6\u6765\u4e86\u4e00\u4e2aDLRM\u7684\u4efb\u52a1\uff0c\u9700\u8981\u8d85\u8fc73\u4e2aGPU\uff0cThemis\u5c31\u4f1a\u8ba9DLRM\u4e0e\u4e0d\u517c\u5bb9\u7684XLM\u6a21\u578b\u5171\u4eab\u4e00\u4e2a\u8282\u70b9\uff0c\u5e26\u6765\u901a\u4fe1\u7ade\u4e89\uff0c\u800cTh+CASSINI\u4f1a\u8ba9DLRM\u4e0e\u517c\u5bb9\u7684ResNet50\u6a21\u578b\u5171\u4eab\u4e00\u4e2a\u8282\u70b9\uff0c\u63d0\u9ad8\u4e24\u4e2a\u6a21\u578b\u7684\u8bad\u7ec3\u6548\u7387\u3002<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>\u603b\u7ed3<\/strong><\/h2>\n\n\n\n<p>\u4e2a\u4eba\u89c9\u5f97\u672c\u6587\u89e3\u51b3\u95ee\u9898\u7684\u601d\u8def\u5f88\u5de7\u5999\uff0c\u4e4d\u4e00\u60f3\uff0c\u591a\u94fe\u8def\u591a\u4efb\u52a1\u7684\u573a\u666f\u4e0b\uff0c\u786e\u5b9a\u964d\u4f4e\u901a\u4fe1\u7ade\u4e89\u60c5\u51b5\u4e0b\u5404\u4e2a\u4efb\u52a1\u7684\u5f00\u59cb\u65f6\u95f4\uff0c\u662f\u4e00\u4ef6\u5f88\u56f0\u96be\u7684\u4e8b\u60c5\u3002\u4e3a\u4e86\u89e3\u51b3\u8fd9\u4e2a\u95ee\u9898\uff0c\u672c\u6587\u4ece\u5355\u94fe\u8def\u591a\u4efb\u52a1\u51fa\u53d1\uff0c\u9996\u5148\u6c42\u89e3\u51fa\u5355\u94fe\u8def\u573a\u666f\u4e0b\u5404\u4e2a\u4efb\u52a1\u7684\u5f00\u59cb\u65f6\u95f4\uff0c\u6269\u5c55\u5230\u591a\u94fe\u8def\u573a\u666f\u65f6\uff0c\u901a\u8fc7\u6784\u5efa\u5173\u8054\u56fe\uff0c\u53ea\u4f7f\u7528\u7b80\u5355\u7684<a href=\"https:\/\/zhida.zhihu.com\/search?content_id=259569703&amp;content_type=Article&amp;match_order=1&amp;q=BFS%E7%AE%97%E6%B3%95&amp;zd_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ6aGlkYV9zZXJ2ZXIiLCJleHAiOjE3NTUzNTA0ODYsInEiOiJCRlPnrpfms5UiLCJ6aGlkYV9zb3VyY2UiOiJlbnRpdHkiLCJjb250ZW50X2lkIjoyNTk1Njk3MDMsImNvbnRlbnRfdHlwZSI6IkFydGljbGUiLCJtYXRjaF9vcmRlciI6MSwiemRfdG9rZW4iOm51bGx9.RAaAvC1b5G5OG0fJKrthlMXrTf9o-631btLEIX0tOq4&amp;zhida_source=entity\" target=\"_blank\" rel=\"noreferrer noopener\">BFS\u7b97\u6cd5<\/a>\u5c31\u89e3\u51b3\u4e86\u95ee\u9898\u3002\u5f53\u7136\u8fd9\u4e5f\u662f\u957f\u65f6\u95f4\u6c89\u6dc0\u7684\u7ed3\u679c\uff0c\u8fd9\u7bc7\u6587\u7ae0\u662f2022\u5e74\u7684\u4e00\u7bc7Workshop\u7684\u540e\u7eed\u3002\u5b9e\u9a8c\u7ed3\u679c\u4e5f\u8868\u660e\uff0c\u901a\u8fc7\u9519\u5f00\u4e0d\u540c\u4efb\u52a1\u5728\u540c\u4e00\u94fe\u8def\u4e0a\u7684\u901a\u4fe1\u64cd\u4f5c\uff0c\u672c\u6587\u5de5\u4f5cCASSINI\u80fd\u591f\u6709\u6548\u63d0\u9ad8\u5404\u4e2a\u4efb\u52a1\u7684\u8bad\u7ec3\u6027\u80fd\u3002<\/p>\n\n\n\n<p>\u672c\u6587\u7684\u89e3\u6cd5\u8981\u6c42\u80fd\u591f\u6784\u5efa\u51fa\u65e0\u73af\u5173\u8054\u56fe\uff0c\u4f46\u662f\u672c\u6587\u5e76\u6ca1\u6709\u8ba8\u8bba\u6784\u5efa\u65e0\u73af\u5173\u8054\u56fe\u7684\u53ef\u80fd\u6027\uff0c\u5982\u679c\u80fd\u591f\u518d\u8ba8\u8bba\u4e00\u4e0b\u65e0\u73af\u5173\u8054\u56fe\u6784\u5efa\u7684\u53ef\u80fd\u6027\uff0c\u8bc1\u660e\u53ef\u80fd\u6027\u8f83\u5927\u7684\u8bdd\uff0c\u611f\u89c9\u80fd\u8fdb\u4e00\u6b65\u63d0\u9ad8\u672c\u6587\u7684\u4ef7\u503c\u3002<\/p>\n\n\n\n<p>\u4e0a\u8ff0\u5185\u5bb9\u6765\u81ea<a href=\"https:\/\/zhuanlan.zhihu.com\/p\/1921513034762912747?share_code=1cMLB44skKMAX&amp;utm_psn=1939440401846088397\">https:\/\/zhuanlan.zhihu.com\/p\/1921513034762912747?share_code=1cMLB44skKMAX&amp;utm_psn=1939440401846088397<\/a><\/p>\n\n\n\n<p>\u8bba\u6587\u5730\u5740\uff1a <a href=\"https:\/\/arxiv.org\/abs\/2308.00852\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/2308.00852<\/a><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We present CASSINI, a network-aware job scheduler for<br \/>\nmachine learning (ML) clusters. CASSINI introduces a novel<br \/>\ngeometric abstraction to consider the communication pattern<br \/>\nof different jobs while placing them on network links. To<br \/>\ndo so, CASSINI uses an affinity graph that finds a series of<br \/>\ntime-shift values to adjust the communication phases of a<br \/>\nsubset of jobs such that the communication patterns of jobs<br \/>\nsharing the same network link are interleaved with each other.<br \/>\nExperiments with 13 common ML models on a 24-server<br \/>\ntestbed demonstrate that compared to the state-of-the-art ML<br \/>\nschedulers, CASSINI improves the average and tail completion<br \/>\ntime of jobs by up to 1.6\u00d7 and 2.5\u00d7, respectively. Moreover,<br \/>\nwe show that CASSINI reduces the number of ECN marked<br \/>\npackets in the cluster by up to 33\u00d7.<\/p>\n","protected":false},"author":2,"featured_media":530,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17,23],"tags":[],"class_list":["post-455","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-17","category-23"],"_links":{"self":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts\/455","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=455"}],"version-history":[{"count":3,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts\/455\/revisions"}],"predecessor-version":[{"id":531,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/posts\/455\/revisions\/531"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=\/wp\/v2\/media\/530"}],"wp:attachment":[{"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=455"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=455"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ndnlab.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=455"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}