About Colocation
The left figure below shows how resources are allocated to online and offline tasks in a cluster over a period of time. In the initial stage, online tasks consume few resources, so a large share of computing resources is allocated to lower-priority offline tasks. When the resource demand of online tasks surges due to a special event (an emergency, a trending search, etc.), Gödel immediately allocates resources to online tasks, and the allocation to offline tasks drops rapidly. After the peak, online tasks reduce their resource requests, and the scheduler shifts resources back to offline tasks. By combining offline pools with dynamic resource transfer, ByteDance maintains a consistently high resource utilization rate. During the evening peak hours, the average resource utilization rate of the cluster exceeds 60%, and it stays at around 40% during the daytime trough.
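The priority-driven shift described above can be illustrated with a minimal sketch. This is not Gödel's actual scheduling algorithm; the `Allocation` class, the `allocate` function, and the capacity and demand numbers are all hypothetical, chosen only to show online demand being served first while offline tasks absorb whatever capacity remains.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """Resources (e.g. CPU cores) granted to each task class. Hypothetical type."""
    online: int
    offline: int

    def utilization(self, capacity: int) -> float:
        # Fraction of the cluster capacity that is currently allocated.
        return (self.online + self.offline) / capacity


def allocate(capacity: int, online_demand: int, offline_demand: int) -> Allocation:
    """Serve high-priority online demand first; offline tasks get the remainder.

    Assumption: a single resource dimension and instant preemption of
    offline tasks, which is a simplification of any real scheduler.
    """
    online = min(online_demand, capacity)
    offline = min(offline_demand, capacity - online)
    return Allocation(online, offline)


if __name__ == "__main__":
    capacity = 100       # hypothetical cluster size
    offline_demand = 80  # offline tasks always have work queued
    # Online demand before, during, and after a traffic surge.
    for online_demand in (20, 90, 30):
        a = allocate(capacity, online_demand, offline_demand)
        print(online_demand, a.online, a.offline, a.utilization(capacity))
```

In this toy model the offline allocation shrinks as soon as online demand rises and recovers once it falls, which mirrors the pattern in the figure while keeping overall utilization high.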
[2022] Xin Tao: ByteDance’s cloud-native machine learning system implementation / 辛涛:字节跳动机器学习系统云原生落地实践
[2022] ByteDance YARN Cloud Native Evolution Practice / 字节跳动 YARN 云原生化演进实践
[2023] ByteDance Spark's practice of supporting model inference on tens of thousands of GPUs / 字节跳动 Spark 支持万卡模型推理实践
[2023] ByteDance Spark Shuffle large-scale cloud-native evolution practice / 字节跳动 Spark Shuffle 大规模云原生化演进实践
[2023] ByteDance Flink’s large-scale cloud-native practice / 字节跳动 Flink 大规模云原生化实践
[2024] How ByteDance improves resource efficiency for large-scale Spark jobs / 字节跳动如何对大规模 Spark 作业进行资源提效
Paper link: Towards Resource Efficiency: Practical Insights into Large-Scale Spark Workloads at ByteDance
Paper link: ResLake: Towards Minimum Job Latency and Balanced Resource Utilization in Geo-distributed Job Scheduling
[2022] ByteDance’s evolution of cloud-native technology / 字节跳动的云原生技术历程演进
[2023] ByteDance’s multi-cloud and cloud-native practice / 字节跳动的多云云原生实践之路
[2022] ByteDance’s large-scale K8s cluster management practice / 字节跳动大规模 K8s 集群管理实践
KubeWharf: A practice-driven cloud-native project set / KubeWharf: 一个实践驱动的云原生项目集
Katalyst: ByteDance’s cloud-native cost optimization practice / Katalyst:字节跳动云原生成本优化实践
KubeZoo: ByteDance’s lightweight multi-tenant open source solution / KubeZoo:字节跳动轻量级多租户开源解决方案