diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..ffd145a Binary files /dev/null and b/.DS_Store differ diff --git a/cluster-trace-microservices-v2021/README.md b/cluster-trace-microservices-v2021/README.md index 37ba7a5..41cd1ae 100644 --- a/cluster-trace-microservices-v2021/README.md +++ b/cluster-trace-microservices-v2021/README.md @@ -1,5 +1,6 @@ # Overview of Microservices Traces -The released traces contain the detailed runtime metrics of nearly twenty thousand microservices. They are collected from Alibaba production clusters of over ten thousand [bare-metal nodes](https://dl.acm.org/doi/10.1145/3373376.3378507) during twelve hours in 2021. + +The released traces contain the detailed runtime metrics of nearly twenty thousand microservices. They are collected from Alibaba production clusters of over ten thousand [bare-metal nodes](https://dl.acm.org/doi/10.1145/3373376.3378507) during twelve hours in 2021. We conduct a characterization analysis on the trace in a paper, [Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis](http://cloud.siat.ac.cn/pdca/socc2021-AlibabaTraceAnalysis.pdf), published in SoCC’21. And we would encourage anybody who uses this trace to cite our paper. @@ -15,37 +16,38 @@ We conduct a characterization analysis on the trace in a paper, [Characterizing # Cluster Architecture -The production cluster contains a large number of bare-metal (BM) nodes and is running in the Alibaba cloud. Users could submit their offline jobs through different controls, which will require resources from uniformed resource management (URM), or send microservices (MS) requirements to URM directly. To improve resource efficiency, URM will place containers of offline jobs and MS on the same BM node. More specifically, containers of offline jobs are scheduled into secure containers, which could be regarded as lightweight virtual machine. 
This can enhance isolation and mitigate interference from offline jobs to provide performance guarantees for MS containers. +The production cluster contains a large number of bare-metal (BM) nodes and is running in the Alibaba cloud. Users could submit their offline jobs through different controls, which will require resources from uniformed resource management (URM), or send microservices (MS) requirements to URM directly. To improve resource efficiency, URM will place containers of offline jobs and MS on the same BM node. More specifically, containers of offline jobs are scheduled into secure containers, which could be regarded as lightweight virtual machine. This can enhance isolation and mitigate interference from offline jobs to provide performance guarantees for MS containers. ![Clusterarchitecture](./figures/Clusterarchitecture.png) - # MS architecture in Alibaba -As shown in this figure, users usually send a web request to the entering MS first, e.g., MS A, which will trigger a series of calls between related microservices. We define the set of these calls as a call graph. As such, a call graph contains multiple calls between different pairs of microservices. Here, a pair of microservices contain one upstream microservice (UM) and one downstream microservice (DM). - +As shown in this figure, users usually send a web request to the entering MS first, e.g., MS A, which will trigger a series of calls between related microservices. We define the set of these calls as a call graph. As such, a call graph contains multiple calls between different pairs of microservices. Here, a pair of microservices contain one upstream microservice (UM) and one downstream microservice (DM). ![MSarchitecture](./figures/msArchitecture.png) -Microservices can be categorized into two types, stateless services (e.g., a circle in the above Figure) and stateful services (e.g., a rectangle or hexagon). 
Stateless services are isolated from state data while stateful services need to store data in some locations, such as Database (DB) and Memcached (MC). There exist three types of communication paradigms between a pair of microservices, i.e., inter-process communication, remote invocation, and indirect communication. +Microservices can be categorized into two types, stateless services (e.g., a circle in the above Figure) and stateful services (e.g., a rectangle or hexagon). Stateless services are isolated from state data while stateful services need to store data in some locations, such as Database (DB) and Memcached (MC). There exist three types of communication paradigms between a pair of microservices, i.e., inter-process communication, remote invocation, and indirect communication. # Introduction of Trace Data + The traces include four parts of data as follows: -NodeTable: BM Node runtime information. It records CPU and memory utilization of 1300+ BM nodes in a production cluster. +node: BM Node runtime information. It records CPU and memory utilization of 1300+ BM nodes in a production cluster. + +MS_Resource_Table: MS runtime information. It records CPU and memory utilization of 90000+ containers for 1300+ MSs in the same production cluster. -MS_Resource_Table: MS runtime information. It records CPU and memory utilization of 90000+ containers for 1300+ MSs in the same production cluster. - -MS_RT_Qps_Table: Microservice call rate (MCR) and response time (RT) information. It records MCR and RT of the calls via different communication paradigms among 1300+ MSs with 90000+ containers in the same production cluster. +MS_Metrics_Table: Microservice call rate (MCR) and response time (RT) information. It records MCR and RT of the calls via different communication paradigms among 1300+ MSs with 90000+ containers in the same production cluster. -MS_CallGraph_Table: MS Call Graphs information. 
Due to the large-scale data size, we sample the call graph based on the rate of 0.5%. It contains about more than twenty million call graphs among 20000+ MSs with in more than ten clusters. +MS_CallGraph_Table: MS Call Graphs information. Due to the large-scale data size, we sample the call graph based on the rate of 0.5%. It contains about more than twenty million call graphs among 20000+ MSs with in more than ten clusters. + +Users could run the following command to fetch data. -Users could run the following command to fetch data. > bash fetchData.sh -It includes node/Node.tar.gz, MSCallGraph/MSCallGraph_*.tar.gz, MSResource/MSResource_*.tar.gz, MSRTQps/MSRTQps_*.tar.gz. +It includes node/Node.tar.gz, MSCallGraph/MSCallGraph_*.tar.gz, MSResource/MSResource_*.tar.gz, MSRTQps/MSRTQps_*.tar.gz. Size of each directory: + - 1.1Gi node - 25Gi MSCallGraph - 16Gi MSResource @@ -53,87 +55,86 @@ Size of each directory: Usage In each directory (node,MSCallGraph,MSResource,MSRTQps), please execute: + > for file in `ls *.tar.gz`; do tar -xzf $file; done node: -|columns | Example Entry | -| ---------- | :-----------: | -| timestamp | 1000 | -| nodeid | ff1fb31957db767c5be4de2855488f128532efc2df0a673c6fa3e7718d10f355 | -| cpu_utilization | 0.7236219289416774 | -| memory_utilization | 0.738378997619996 | - -- timestamp: Timestamp of recorded metrics. Range from 0 to 43200000 for twelve hours (12 * 60 * 60 * 1000). The recording interval is the 30s (30 * 1000). + +| columns | Example Entry | +| ------------------ | :--------------------------------------------------------------: | +| timestamp | 1000 | +| nodeid | ff1fb31957db767c5be4de2855488f128532efc2df0a673c6fa3e7718d10f355 | +| cpu_utilization | 0.7236219289416774 | +| memory_utilization | 0.738378997619996 | + +- timestamp: Timestamp of recorded metrics. Range from 0 to 43200000 for twelve hours (12 * 60 * 60 * 1000). The recording interval is the 30s (30 * 1000). - nodeid: The specific id of BM node. 
It could be joined with nodeid in MS_Resource_Table. - cpu_utilization: CPU utilization of BM node. - memory_utilization: Memory utilization of BM node. MS_Metrics_Table: -|columns | Example Entry | -| ---------- | :-----------: | -| timestamp | 0 | -| msname | 99f2e7b501f50db9b4089242a5d3e1aba334c32e0c718b8d79281529a9489b15 | -| msinstanceid | 4d1cf65970444ef3ba9870468d7ecf9c17a93782134464767fbdc4aeb6f162eb | -| nodeid | ecd8a876344d673d8e934f566aae38aec05e572d43ab8d58624bd59e3ec43928 | -| cpu_utilization | 0.1299166666654249 | -| memory_utilization | 0.6126489639282227 | - +| columns | Example Entry | +| ------------------ | :--------------------------------------------------------------: | +| timestamp | 0 | +| msname | 99f2e7b501f50db9b4089242a5d3e1aba334c32e0c718b8d79281529a9489b15 | +| msinstanceid | 4d1cf65970444ef3ba9870468d7ecf9c17a93782134464767fbdc4aeb6f162eb | +| nodeid | ecd8a876344d673d8e934f566aae38aec05e572d43ab8d58624bd59e3ec43928 | +| cpu_utilization | 0.1299166666654249 | +| memory_utilization | 0.6126489639282227 | -- timestamp: Mentioned in NodeTable. The recording interval is the 60s (60 * 1000). +- timestamp: Mentioned in NodeTable. The recording interval is the 60s (60 * 1000). - msname: The name of MS, to be joined with MSName in MS_MCR_RT_Table, and DM and UM in MS_CallGraph_Table. MSName only contains stateless services, as stateful services run in other dedicated clusters. -- msinstanceid: The specific container id of MS. An MS may have more than one container. -- nodeid: The specific BM node in which MSInstanceID runs. +- msinstanceid: The specific container id of MS. An MS may have more than one container. +- nodeid: The specific BM node in which MSInstanceID runs. - cpu_utilization: CPU utilization of MSInstanceID. - memory_utilization: Memory utilization of MSInstanceID. 
- MS_MCR_RT_Table: -|columns | Example Entry | -| ---------- | :-----------: | -| timestamp | 6600000 | -| msname | 1e5dd1f5843e50b9282fb99c58d8fe9c6e3d712d9e601a7a5264ca4ff7d96773 | -| msinstanceid | 0c14b76dd7faa42a7e8f5fa5c59ad20e23ff8ce2981c1ff86544e76e72f1cca2 | -| metrics | consumerRPC_RT | -| value | 9.14142215173143 | - - + +| columns | Example Entry | +| ------------ | :--------------------------------------------------------------: | +| timestamp | 6600000 | +| msname | 1e5dd1f5843e50b9282fb99c58d8fe9c6e3d712d9e601a7a5264ca4ff7d96773 | +| msinstanceid | 0c14b76dd7faa42a7e8f5fa5c59ad20e23ff8ce2981c1ff86544e76e72f1cca2 | +| metrics | consumerRPC_RT | +| value | 9.14142215173143 | + - timestamp: Mentioned in MS_Metrics_Table. -- msname: Mentioned in MS_Metrics_Table. -- msinstanceid: Mentioned in MS_Metrics_Table. -- metrics: Calls Rate with different communication paradigms and corresponding RT. The value of metrics for an MS is an aggregation of all its DMs and UMs. To distinguish whether an MS is DM or UM, the Metrics are recorded with a prefix before communication paradigms. For example, RPC is named consumerRPC and providerRPC, meaning an MS as the consumer calling its DM and as the provider being called by its UM respectively. Correspondingly, MQ could be classified into two groups from an MS's point of view, namely, providerMQ, and consumerMQ. For the former, MQ is a provider that sends messages to the third party whereas, the latter is a consumer that fetches messages from the third party. As MSs in this table are all stateless services, they are only UMs to read or write stateful services. - In summary, these metrics include consumerRPC_MCR, providerRPC_MCR, HTTP_MCR, providerMQ_MCR, consumerMQ_MCR, consumerRPC_RT, providerRPC_RT, HTTP_RT, providerMQ_RT, and consumerMQ_RT. +- msname: Mentioned in MS_Metrics_Table. +- msinstanceid: Mentioned in MS_Metrics_Table. +- metrics: Calls Rate with different communication paradigms and corresponding RT. 
The value of metrics for an MS is an aggregation of all its DMs and UMs. To distinguish whether an MS is DM or UM, the Metrics are recorded with a prefix before communication paradigms. For example, RPC is named consumerRPC and providerRPC, meaning an MS as the consumer calling its DM and as the provider being called by its UM respectively. Correspondingly, MQ could be classified into two groups from an MS's point of view, namely, providerMQ, and consumerMQ. For the former, MQ is a provider that sends messages to the third party whereas, the latter is a consumer that fetches messages from the third party. As MSs in this table are all stateless services, they are only UMs to read or write stateful services. + In summary, these metrics include consumerRPC_MCR, providerRPC_MCR, HTTP_MCR, providerMQ_MCR, consumerMQ_MCR, consumerRPC_RT, providerRPC_RT, HTTP_RT, providerMQ_RT, and consumerMQ_RT. - value: The value of Metrics. For example, the value of metric providerRPC_MCR and providerRPC_RT characterize the number of calls per second and the average of response time respectively. Here, the response time is measured by millisecond (ms). MS_CallGraph_Table: -|columns | Example Entry | -| ---------- | :-----------: | -| timestamp | 16397576 | -| traceid | 015101cd15919399974329000e | -| rpcid | 0.1.1.2.50 | -| um | 35114acfb54c54fb9618f23cd28bbc57c765f597df140977d7030dcc52775ed4 | -| rpctype | rpc | -| interface | af42b5e3e0eb334d38619733586d78d1414f6549f24d31b39a5294454638bc59 | -| dm | b65fdc9bfef6b4974c3e90e1ec7b92d30e639789da5a78c1d4685857e19c75a0 | -| rt | 13 | - - - - -- timestamp: Mentioned in MS_CallGraph_Table. 
+ +| columns | Example Entry | +| --------- | :--------------------------------------------------------------: | +| timestamp | 16397576 | +| traceid | 015101cd15919399974329000e | +| rpcid | 0.1.1.2.50 | +| um | 35114acfb54c54fb9618f23cd28bbc57c765f597df140977d7030dcc52775ed4 | +| rpctype | rpc | +| interface | af42b5e3e0eb334d38619733586d78d1414f6549f24d31b39a5294454638bc59 | +| dm | b65fdc9bfef6b4974c3e90e1ec7b92d30e639789da5a78c1d4685857e19c75a0 | +| rt | 13 | + +- timestamp: Mentioned in MS_CallGraph_Table. - traceid: Each call graph has a unique traceID. -- rpcid: Each call is identified by a unique rpcID, which contains the ID information of a pair of UM and DM. For example, rpcID 0.1.1 and 0.1.2 denote two calls that two different DMs are called by the same UM, which is the DM in the call with rpcID 0.1. Note that, the call via remote invocation is recorded twice with the same rpcID in the UM and DM independently. +- rpcid: Each call is identified by a unique rpcID, which contains the ID information of a pair of UM and DM. For example, rpcID 0.1.1 and 0.1.2 denote two calls that two different DMs are called by the same UM, which is the DM in the call with rpcID 0.1. Note that, the call via remote invocation is recorded twice with the same rpcID in the UM and DM independently. - um: The name of UM. - rpctype: The communication paradigms. We record rpc_type as "DB" and "MC" for the calls via inter-process communication if DM is DB and MC respectively. - interface: The interface of DM is called by UM. The calls via remote invocation or HTTP have the interface. - DM: The name of DM. -- rt: Response time of the call. It is measured by millisecond (ms). If rt is less than 1 ms, e.g. rt of read/write MC, the value will be recorded as 0. 
For call via RPC, the value of RT could be a positive integer and negative integer, which are recorded in UM and DM respectively, and represents UM RT (from UM sending a request to receiving a reply) and the opposite of DM RT (from DM receiving the request to sending the reply) respectively. The RT of a call via MQ is the interval from DM fetching the message to finishing it. For call via HTTP, the UM and DM RT is also recorded as positive integer and negative integer respectively. +- rt: Response time of the call. It is measured by millisecond (ms). If rt is less than 1 ms, e.g. rt of read/write MC, the value will be recorded as 0. For call via RPC, the value of RT could be a positive integer and negative integer, which are recorded in UM and DM respectively, and represents UM RT (from UM sending a request to receiving a reply) and the opposite of DM RT (from DM receiving the request to sending the reply) respectively. The RT of a call via MQ is the interval from DM fetching the message to finishing it. For call via HTTP, the UM and DM RT is also recorded as positive integer and negative integer respectively. # Discussion + - How to identify a specific service in the traces? -In practice, microservice architecture adopts proxy modules like Nginx to forward users' requests to an entering MS. The entering MS, e.g., interface of entering MS A in Fig.2, containing multiple interfaces, and each interface provides a specific service. As such, each interface of MS called by proxy MS could be labelled as an online service. As revealed in the paper, we classify the call graphs of each online service into different classes when analyzing the dynamics of microservices. It is worth noting that, some call graphs could even contain two proxy modules, and the name of the second proxy MS are usually recorded as '(?)' or ''. +In practice, microservice architecture adopts proxy modules like Nginx to forward users' requests to an entering MS. 
The entering MS, e.g., interface of entering MS A in Fig.2, containing multiple interfaces, and each interface provides a specific service. As such, each interface of MS called by proxy MS could be labelled as an online service. As revealed in the paper, we classify the call graphs of each online service into different classes when analyzing the dynamics of microservices. It is worth noting that, some call graphs could even contain two proxy modules, and the name of the second proxy MS are usually recorded as '(?)' or ''. - Missing items in traces. - -In these traces, it happens that some metrics in MS_CallGraph_Table are lost. For example, the name of some MS is recorded as NAN, '(?)' or '' in the traces. As the call via RPC will be recorded twice in MS_CallGraph_Table, some metrics related to rpcID could be found from another record even if one is missing. + +In these traces, it happens that some metrics in MS_CallGraph_Table are lost. For example, the name of some MS is recorded as NAN, '(?)' or '' in the traces. As the call via RPC will be recorded twice in MS_CallGraph_Table, some metrics related to rpcID could be found from another record even if one is missing. diff --git a/cluster-trace-microservices-v2021/fetchData.sh b/cluster-trace-microservices-v2021/fetchData.sh index 3b251bd..d0b910e 100644 --- a/cluster-trace-microservices-v2021/fetchData.sh +++ b/cluster-trace-microservices-v2021/fetchData.sh @@ -11,6 +11,7 @@ mkdir MSRTQps mkdir MSCallGraph cd Node + command="wget -c --retry-connrefused --tries=0 --timeout=50 ${url}/node/Node_0.tar.gz" ${command} diff --git a/cluster-trace-microservices-v2022/README.md b/cluster-trace-microservices-v2022/README.md index 30b103b..c07ebb8 100644 --- a/cluster-trace-microservices-v2022/README.md +++ b/cluster-trace-microservices-v2022/README.md @@ -1,8 +1,6 @@ # Overview of Microservices Traces -The released traces contain the detailed runtime metrics of nearly twenty thousand microservices. 
They are collected from Alibaba production clusters of over ten thousand [bare-metal nodes](https://dl.acm.org/doi/10.1145/3373376.3378507) during a week in 2022. The traces have more diverse metrics within a longer time window than the microservices traces in 2021. We will release the new version of traces in **the coming days**. - -We design a proactive microservice auto-scaler for individual microservices with workload uncertainty learning in a paper, accepted by SoCC 2022. And we would encourage anybody who uses this trace to cite our paper. +The released traces contain the detailed runtime metrics of nearly twenty thousand microservices. They are collected from Alibaba production clusters of over ten thousand [bare-metal nodes](https://dl.acm.org/doi/10.1145/3373376.3378507) during 13 days in 2022. In comparison to the previous trace version (v2021), this updated trace offers an extended duration (13 days) and includes additional information, such as the service ID within the call graph. ```BibTeX @inproceedings{luo2022Prediction, @@ -27,87 +25,133 @@ As shown in this figure, users usually send a web request to the entering MS fir Microservices can be categorized into two types, stateless services (e.g., a circle in the above Figure) and stateful services (e.g., a rectangle or hexagon). Stateless services are isolated from state data while stateful services need to store data in some locations, such as Database (DB) and Memcached (MC). There exist three types of communication paradigms between a pair of microservices, i.e., inter-process communication, remote invocation, and indirect communication. -# Introduction of Trace Data +# Overview of Trace Data The traces include four parts of data as follows: -NodeTable: BM Node runtime information. It records CPU and memory utilization of 1300+ BM nodes in a production cluster. +Node: BM Node runtime information. It records CPU and memory utilization of 40000+ BM nodes in a production cluster. 
+ +MSResource: MS runtime information. It records CPU and memory utilization of 470000+ containers for 28000+ MSs in the same production cluster. + +MSRTMCR: Microservice call rate (MCR) and response time (RT) information. It records MCR and RT of the calls via different communication paradigms among 28000+ MSs with 470000+ containers in the same production cluster. + +MSCallGraph: MS Call Graphs information. It contains about more than twenty million call graphs among 17000+ MSs with in more than ten clusters. + +Note: The value of resource utilization and the MCR has been **normalized** by Max-min method. + +# Trace Data Download -MS_Resource_Table: MS runtime information. It records CPU and memory utilization of 90000+ containers for 1300+ MSs in the same production cluster. +User can use the following script to download the trace with different intervals. -MS_RT_Qps_Table: Microservice call rate (MCR) and response time (RT) information. It records MCR and RT of the calls via different communication paradigms among 1300+ MSs with 90000+ containers in the same production cluster. +> bash fetchData.sh start_date=0d0 end_date=1d1 -MS_CallGraph_Table: MS Call Graphs information. Due to the large-scale data size, we sample the call graph based on the rate of 0.5%. It contains about more than twenty million call graphs among 20000+ MSs with in more than ten clusters. +Where the `start_date` and `end_date` follow the following format: `${day}d${hour}`, and they are `[start_date, end_date)`. The day and hour are all started by 0, data will be saved in `data/MSCallGraph`, `data/MSResource`, `data/Node` and `data/MSRTMCR` respectively. -The link to download trace is coming soon. 
+Size of each directory (compressed) for an hour: -node: +- ~10Mi Node +- ~4Gi MSCallGraph +- ~700Mi MSResource +- ~3Gi MSRTMCR -| columns | Example Entry | -| ------------------ | :--------------------------------------------------------------: | -| timestamp | 1000 | -| nodeid | ff1fb31957db767c5be4de2855488f128532efc2df0a673c6fa3e7718d10f355 | -| cpu_utilization | 0.7236219289416774 | -| memory_utilization | 0.738378997619996 | +The size of all files for 13 days is about 2T. -- timestamp: Timestamp of recorded metrics. Range from 0 to 43200000 for twelve hours (12 * 60 * 60 * 1000). The recording interval is the 30s (30 * 1000). +Usage +In each directory (Node,MSCallGraph,MSResource,MSRTMCR), please execute: + +> for file in `ls *.tar.gz`; do tar -xzf $file; done + +# Introduction of Trace Data + +Node: + +| columns | Example Entry | +| ------------------ | :---------------: | +| timestamp | 60000 | +| nodeid | NODE_10632 | +| cpu_utilization | 0.266488095525847 | +| memory_utilization | 0.159064258887333 | + +- timestamp: Timestamp of recorded metrics. The recording interval is the 60s (60 * 1000). - nodeid: The specific id of BM node. It could be joined with nodeid in MS_Resource_Table. -- cpu_utilization: CPU utilization of BM node. -- memory_utilization: Memory utilization of BM node. +- cpu_utilization: **Normalized** CPU utilization of BM node. +- memory_utilization: **Normalized** memory utilization of BM node. 
-MS_Metrics_Table: +MSResource: -| columns | Example Entry | -| ------------------ | :--------------------------------------------------------------: | -| timestamp | 0 | -| msname | 99f2e7b501f50db9b4089242a5d3e1aba334c32e0c718b8d79281529a9489b15 | -| msinstanceid | 4d1cf65970444ef3ba9870468d7ecf9c17a93782134464767fbdc4aeb6f162eb | -| nodeid | ecd8a876344d673d8e934f566aae38aec05e572d43ab8d58624bd59e3ec43928 | -| cpu_utilization | 0.1299166666654249 | -| memory_utilization | 0.6126489639282227 | +| columns | Example Entry | +| ------------------ | :-----------------: | +| timestamp | 180000 | +| msname | MS_21881 | +| msinstanceid | MS_21881_POD_0 | +| nodeid | NODE_11517 | +| cpu_utilization | 0.21995999999530616 | +| memory_utilization | 0.833001454671224 | -- timestamp: Mentioned in NodeTable. The recording interval is the 60s (60 * 1000). +- timestamp: Mentioned in Node. The recording interval is the 60s (60 * 1000). - msname: The name of MS, to be joined with MSName in MS_MCR_RT_Table, and DM and UM in MS_CallGraph_Table. MSName only contains stateless services, as stateful services run in other dedicated clusters. - msinstanceid: The specific container id of MS. An MS may have more than one container. - nodeid: The specific BM node in which MSInstanceID runs. -- cpu_utilization: CPU utilization of MSInstanceID. -- memory_utilization: Memory utilization of MSInstanceID. - -MS_MCR_RT_Table: - -| columns | Example Entry | -| ------------ | :--------------------------------------------------------------: | -| timestamp | 6600000 | -| msname | 1e5dd1f5843e50b9282fb99c58d8fe9c6e3d712d9e601a7a5264ca4ff7d96773 | -| msinstanceid | 0c14b76dd7faa42a7e8f5fa5c59ad20e23ff8ce2981c1ff86544e76e72f1cca2 | -| metrics | consumerRPC_RT | -| value | 9.14142215173143 | - -- timestamp: Mentioned in MS_Metrics_Table. -- msname: Mentioned in MS_Metrics_Table. -- msinstanceid: Mentioned in MS_Metrics_Table. 
-- metrics: Calls Rate with different communication paradigms and corresponding RT. The value of metrics for an MS is an aggregation of all its DMs and UMs. To distinguish whether an MS is DM or UM, the Metrics are recorded with a prefix before communication paradigms. For example, RPC is named consumerRPC and providerRPC, meaning an MS as the consumer calling its DM and as the provider being called by its UM respectively. Correspondingly, MQ could be classified into two groups from an MS's point of view, namely, providerMQ, and consumerMQ. For the former, MQ is a provider that sends messages to the third party whereas, the latter is a consumer that fetches messages from the third party. As MSs in this table are all stateless services, they are only UMs to read or write stateful services. +- cpu_utilization: **Normalized** CPU utilization of MSInstanceID. +- memory_utilization: **Normalized** memory utilization of MSInstanceID. + +MSRTMCR: + +| columns | Example Entry | +| --------------- | :--------------------: | +| timestamp | 60000 | +| msname | MS_73317 | +| msinstanceid | MS_73317_POD_1797 | +| nodeid | NODE_3619 | +| providerrpc_rt | 10.119451170298627 | +| providerrpc_mcr | 1.216773932801612e-05 | +| consumerrpc_rt | 9.996974281391829 | +| consumerrpc_mcr | 7.169679436055699e-12 | +| writemc_rt | 0.0 | +| writemc_mcr | 0.0 | +| readmc_rt | 0.40625 | +| readmc_mcr | 3.142596113773332e-07 | +| writedb_rt | 0.0 | +| writedb_mcr | 0.0 | +| readdb_rt | 0.9693548387096775 | +| readdb_mcr | 6.088779970435831e-06 | +| consumermq_rt | 0.0 | +| consumermq_mcr | 0.0 | +| providermq_rt | 24.4218009478673 | +| providermq_mcr | 2.0721493125192907e-06 | +| http_mcr | 0.0 | +| http_rt | 0.0 | + +- timestamp: Mentioned in Node. The recording interval is the 60s (60 * 1000). +- msname: Mentioned in MSResource. +- msinstanceid: Mentioned in MSResource. +- nodeid: Mentioned in Node. 
+- Other columns: The value of corresponding RT and calls rate with different communication paradigms. For example, the value of metric providerRPC_MCR and providerRPC_RT characterize the number of calls per second and the average of response time respectively. Here, the response time is measured by millisecond (ms) and the MCR is **normalized** through max-min in range from 0 to 1. The value of metrics for an MS is an aggregation of all its DMs and UMs. To distinguish whether an MS is DM or UM, the Metrics are recorded with a prefix before communication paradigms. For example, RPC is named consumerRPC and providerRPC, meaning an MS as the consumer calling its DM and as the provider being called by its UM respectively. Correspondingly, MQ could be classified into two groups from an MS's point of view, namely, providerMQ, and consumerMQ. For the former, MQ is a provider that sends messages to the third party whereas, the latter is a consumer that fetches messages from the third party. As MSs in this table are all stateless services, they are only UMs to read or write stateful services. In summary, these metrics include consumerRPC_MCR, providerRPC_MCR, HTTP_MCR, providerMQ_MCR, consumerMQ_MCR, consumerRPC_RT, providerRPC_RT, HTTP_RT, providerMQ_RT, and consumerMQ_RT. -- value: The value of Metrics. For example, the value of metric providerRPC_MCR and providerRPC_RT characterize the number of calls per second and the average of response time respectively. Here, the response time is measured by millisecond (ms). 
- -MS_CallGraph_Table: - -| columns | Example Entry | -| --------- | :--------------------------------------------------------------: | -| timestamp | 16397576 | -| traceid | 015101cd15919399974329000e | -| rpcid | 0.1.1.2.50 | -| um | 35114acfb54c54fb9618f23cd28bbc57c765f597df140977d7030dcc52775ed4 | -| rpctype | rpc | -| interface | af42b5e3e0eb334d38619733586d78d1414f6549f24d31b39a5294454638bc59 | -| dm | b65fdc9bfef6b4974c3e90e1ec7b92d30e639789da5a78c1d4685857e19c75a0 | -| rt | 13 | - -- timestamp: Mentioned in MS_CallGraph_Table. + +MSCallGraph: + +| columns | Example Entry | +| ------------ | :--------------: | +| timestamp | 115352 | +| traceid | T_11560863075 | +| service | S_153587416 | +| rpc_id | 0.1 | +| um | MS_58845 | +| uminstanceid | MS_58845_POD_0 | +| rpctype | rpc | +| interface | xOuy6-80Vt | +| dm | MS_71712 | +| dminstanceid | MS_71712_POD_244 | +| rt | 2.0 | + +- timestamp: Mentioned in Node. - traceid: Each call graph has a unique traceID. +- service: Online service id. A specific online service provides a function for users. For example, the online shopping application can provide multiple online services, including ordering, goods searching, delivering and so on. - rpcid: Each call is identified by a unique rpcID, which contains the ID information of a pair of UM and DM. For example, rpcID 0.1.1 and 0.1.2 denote two calls that two different DMs are called by the same UM, which is the DM in the call with rpcID 0.1. Note that, the call via remote invocation is recorded twice with the same rpcID in the UM and DM independently. - um: The name of UM. +- uminstanceid: The specific container id of um MS. An MS may have more than one container. - rpctype: The communication paradigms. We record rpc_type as "DB" and "MC" for the calls via inter-process communication if DM is DB and MC respectively. - interface: The interface of DM is called by UM. The calls via remote invocation or HTTP have the interface. -- DM: The name of DM. 
-- rt: Response time of the call. It is measured by millisecond (ms). If rt is less than 1 ms, e.g. rt of read/write MC, the value will be recorded as 0. For call via RPC, the value of RT could be a positive integer and negative integer, which are recorded in UM and DM respectively, and represents UM RT (from UM sending a request to receiving a reply) and the opposite of DM RT (from DM receiving the request to sending the reply) respectively. The RT of a call via MQ is the interval from DM fetching the message to finishing it. For call via HTTP, the UM and DM RT is also recorded as positive integer and negative integer respectively. +- dm: The name of DM. +- dminstanceid: The specific container id of dm MS. An MS may have more than one container. \ No newline at end of file diff --git a/cluster-trace-microservices-v2022/fetchData.sh b/cluster-trace-microservices-v2022/fetchData.sh new file mode 100644 index 0000000..f2284d9 --- /dev/null +++ b/cluster-trace-microservices-v2022/fetchData.sh @@ -0,0 +1,49 @@ +#!/bin/bash +prepare_dir() { + mkdir -p data/NodeMetrics data/MSMetrics data/MSRTMCR data/CallGraph +} + +# $1 = start_day, $2 = end_day +# $3 = start_hour, $4 = end_hour +fetch_data() { + declare -a file_names=( + "data/CallGraph/CallGraph" "data/MSMetrics/MSMetrics" + "data/NodeMetrics/NodeMetrics" "data/MSRTMCR/MSRTMCR" + ) + declare -a remote_paths=( + "CallGraph/CallGraph" "MSMetricsUpdate/MSMetricsUpdate" + "NodeMetricsUpdate/NodeMetricsUpdate" "MCRRTUpdate/MCRRTUpdate" + ) + declare -a ratios=(3 30 720 3) + start_hour=$(($1 * 24 * 60 + $3 * 60)) + end_hour=$(($2 * 24 * 60 + $4 * 60)) + for i in $(seq 0 3); do + start_idx=$(($start_hour / ${ratios[$i]})) + end_idx=$(($end_hour / ${ratios[$i]} - 1)) + if [[ $i == 2 && $(($end_hour % ${ratios[$i]})) != 0 ]]; then + end_idx=$(($end_idx + 1)) + fi + for idx in $(seq $start_idx $end_idx); do + file_name="${file_names[$i]}_$idx.tar.gz" + remote_path="${remote_paths[$i]}_$idx.tar.gz" + 
url="https://aliopentrace.oss-cn-beijing.aliyuncs.com/v2022MicroservicesTraces/$remote_path" + command="wget -c --retry-connrefused --tries=0 --timeout=50 -O $file_name $url" + $command + done + done +} + +for ARGUMENT in "$@"; do + KEY=$(echo $ARGUMENT | cut -f1 -d=) + + KEY_LENGTH=${#KEY} + VALUE="${ARGUMENT:$KEY_LENGTH+1}" + + export "$KEY"="$VALUE" +done +start_day=$(expr $(echo $start_date | cut -f1 -dd) + 0) +start_hour=$(expr $(echo $start_date | cut -f2 -dd) + 0) +end_day=$(expr $(echo $end_date | cut -f1 -dd) + 0) +end_hour=$(expr $(echo $end_date | cut -f2 -dd) + 0) +prepare_dir +fetch_data $start_day $end_day $start_hour $end_hour \ No newline at end of file diff --git a/cluster-trace-v2023/README.md b/cluster-trace-v2023/README.md new file mode 100644 index 0000000..b0130f9 --- /dev/null +++ b/cluster-trace-v2023/README.md @@ -0,0 +1,106 @@ +# Overview of Unified Scheduler Traces + +The released traces contain the detailed runtime metrics of nearly nine thousand machines and over five hundred thousand pods belonging to over ten thousand applications. They are collected from Alibaba production clusters during a week in 2022. The traces provide some new information about the clusters with the unified scheduler. We characterize the unified scheduling clusters with the trace datasets and design an optimization method to help to improve the resource utilization of the clusters. The paper has been accepted by EuroSys'23. And we would encourage anybody who uses this trace to cite our paper. 
+ +```bibtex +@inproceedings{lu2023understanding, + title={Understanding and Optimizing Workloads for Unified Resource Management in Large Cloud Platforms}, + author={Lu, Chengzhi and Xu, Huanle and Ye, Kejiang and Xu, Guoyao and Zhang, Liping and Yang, Guodong and Xu, Chengzhong}, + booktitle={Proceedings of the Eighteenth European Conference on Computer Systems}, + pages={416--432}, + year={2023} +} +``` + +# Cluster Architecture + +Unified scheduling in Alibaba data centers has fully unified the scheduling of e-commerce, search and promotion, MaxCompute, and Ant businesses. Before an application (such as a MapReduce job or a long-running web service) can run its tasks under unified scheduling, it first follows its application-specific task execution plan, reconstructing a set of task requests according to the Unified Request format, and then submits these requests (with affinity requirements) to the API Server. After receiving the task requests, the unified scheduler selects an appropriate physical host to place each task following its scheduling policy. Specifically, the scheduler first selects the nodes satisfying the affinity as the candidate nodes for the task and then ranks these candidate nodes according to their balance between the resource utilization and the SLO (Service Level Objective) requirement of the task request. Then an agent located on the selected host will start a pod that consists of several containers to run the newly scheduled task. + +![unified_scheduling_framework](./figures/unified_scheduling_framework.png) + +# Tables + +We collect node-, pod-, and container-level metrics from the clusters and use a max-min scaler to normalize numeric values such as *_usage, *_util and *_bytes. + +### 1. 
node running metrics + +| Column Name | Description | Type | Example Entry | +| ------------------------- | ----------------------------------------------- | ------ | -------------------------------- | +| collect_timestamp | Timestamp, the number of seconds from the start | int | 46309 | +| node_name | the node name | string | 104a3526662b2768549d116ee5a36b24 | +| node_cpu_usage | the cpu usage of the node | float | 0.0059184 | +| node_memory_util | the memory utilization of the node | float | 0.33276688 | +| node_memory_total_bytes | the memory capacity of the node | float | 0.35017976 | +| node_network_bandwidth | the network bandwidth of the node | float | 0.04762132 | +| node_network_receive_bps | the total receive bytes of the node | float | 0.0000237 | +| node_network_transmit_bps | the total transmit bytes of the node per sampl | float | 0.00007433 | +| node_cpu_cores | the number of cpu cores of the node | float | 0.5 | +| node_disk_io_usage | the io utilization of the node | float | 0.0003389 | + +`collect_timestamp`: Some `collect_timestamp` values may be negative because the entry was collected before the start of the trace. + +### 2. pod meta info + +| Column Name | Description | Type | Example Entry | +| ---------------- | ------------------------------------------------------------ | ------ | -------------------------------- | +| create_timestamp | Pod creation timestamp, the number of seconds from the start | int | 2202266 | +| pod | the pod name | string | 65c3924ccc3145e4a348f3328d40c606 | +| cpu | CPU request number | float | 0.0390625 | +| memory | memory request bytes | float | 0.01589419 | +| disk | disk request number | float | 0.0 | +| cpu_limit | CPU limit number | float | 0.046875 | +| memory_limit | Memory limit bytes | float | 0.01589419 | +| disk_limit | disk limit number | float | 0.0 | +| qos | QoS of the pod | string | "BE" | + +`create_timestamp`: Some `create_timestamp` values may be negative because the pod was created before the start of the trace. 
+ +`qos`: Some `qos` values are empty because the pod has no SLO requirement. + +### 3. pod running metrics + +| Column Name | Description | Type | Example Entry | +| ----------------------------------- | ----------------------------------------------- | ------ | ----------------------------------------------------- | +| collect_timestamp | Timestamp, the number of seconds from the start | float | -3390 | +| node_name | the node name | string | be8cf1b4733b1292cb6af34d24f5f723 | +| pod | the pod name | string | bbe70f56eb2e6b7f3833531e6b837602 | +| app_group | the application name that the pod belongs to | string | 329437a0e697b328a4ea5b8a98d54e2c | +| pod_cpu_limit_usage | the cpu limit usage of the pod | float | 0.00055937 | +| pod_cpu_request_util | the cpu request utilization of the pod | float | 0.05074873 | +| pod_memory_util | the memory utilization of the pod | float | 0.50453097 | +| pod_memory_usage_bytes | the memory usage bytes of the pod | float | 0.00940145 | +| pod_disk_read_bps_total | the pod disk read BPS | float | 0.0 | +| pod_disk_write_bps_total | the pod disk write BPS | float | 0.04751506 | +| pod_disk_read_iops_total | the pod disk read IOPS | float | 0.0 | +| pod_disk_write_iops_total | the pod disk write IOPS | float | 0.27036923 | +| pod_network_transmit_bytes_ps_total | the pod network transmission BPS | float | 0.05314832 | +| pod_network_receive_bytes_ps_total | the pod network receive BPS | float | -1.0 | +| cpu_psi | CPU PSI | string | 0.0;0.01;0.05;3113525409.0 | +| mem_psi | Memory PSI | string | 0.0;0.0;0.0;0.0;0.0;0.0;56.0;57.0 | +| disk_psi | Disk PSI | string | 0.0;0.0;0.0;0.0; 0.0;0.0;531401.0;536175.0 | +| web_qps | the web query per second of the pod | float | 0.0 | +| qps | the total query per second of the pod | float | 0.0 | +| rt | the main response time of the pod | float | 0.0 | + +[`psi`](https://docs.kernel.org/accounting/psi.html) (Pressure Stall Information) identifies and quantifies the disruptions caused by such resource 
crunches and the time impact it has on complex workloads or even entire systems. + +`cpu_psi`: separated by `;`, the items represent `cpu_avg10`, `cpu_avg60`, `cpu_avg300`, `cpu_psi_total` respectively. + +`mem_psi`: separated by `;`, the items represent `mem_avg10_full`, `mem_avg10_some`, `mem_avg60_full`, `mem_avg60_some`, `mem_avg300_full`, `mem_avg300_some`, `mem_total_full`, `mem_total_some` respectively. + +`disk_psi`: separated by `;`, the items represent `disk_avg10_full`, `disk_avg10_some`, `disk_avg60_full`, `disk_avg60_some`, `disk_avg300_full`, `disk_avg300_some`, `disk_total_full`, `disk_total_some` respectively. + +`web_qps` and `qps`: We separate the qps of a pod into web qps and total qps. Web QPS can be seen as the number of requests that may be generated by the user and will invoke all the services in the applications (like [microservices](https://github.com/alibaba/clusterdata/tree/master/cluster-trace-microservices-v2022)). Total QPS is the total number of requests that come from all other pods in the cluster. + +Some metrics may be -1 in the trace data, which indicates that the metric was not available at that time. + +# Fetch DataSet + +Users can use the following script to download the trace. + +``bash fetchdata.sh`` + +The node and pod running metrics are divided into multiple compressed packages, each containing one hour of running metrics. 
Size of each directory (compressed) for an hour: + +* ~140Mi Node +* ~1.5Gi Pod diff --git a/cluster-trace-v2023/fetchdata.sh b/cluster-trace-v2023/fetchdata.sh new file mode 100644 index 0000000..051ea02 --- /dev/null +++ b/cluster-trace-v2023/fetchdata.sh @@ -0,0 +1,52 @@ +#!/bin/bash +prepare_dir() { + mkdir -p data/NodeResourceUsage data/PodMetaInfo data/PodResourceUsage +} + +# you can change the start_idx and end_idx to fetch the data you want +fetch_data() { + # get node resource usage + get_node_resource_usage() { + local_path="data/NodeResourceUsage/node_resource_usage" + remote_paths="NodeResourceUsage/node_resource_usage" + start_idx=0 + end_idx=215 + for idx in $(seq $start_idx $end_idx); do + file_name="${local_path}_$idx.tar.gz" + remote_path="${remote_paths}_$idx.tar.gz" + url="https://aliopentrace.oss-cn-beijing.aliyuncs.com/v2023UnifiedSchedulerTraces/$remote_path" + command="wget -c --retry-connrefused --tries=0 --timeout=50 -O $file_name $url" + $command + done + } + # get pod meta info + get_pod_meta_info() { + local_path="data/PodMetaInfo/pod_meta_info" + remote_paths="PodMetaInfo/pod_meta_info" + file_name="${local_path}.tar.gz" + remote_path="${remote_paths}.tar.gz" + url="https://aliopentrace.oss-cn-beijing.aliyuncs.com/v2023UnifiedSchedulerTraces/$remote_path" + command="wget -c --retry-connrefused --tries=0 --timeout=50 -O $file_name $url" + $command + } + + # get pod resource usage + get_pod_resource_usage() { + local_path="data/PodResourceUsage/pod_resource_usage" + remote_paths="PodResourceUsage/pod_resource_usage" + start_idx=-1 + end_idx=215 + for idx in $(seq $start_idx $end_idx); do + file_name="${local_path}_$idx.tar.gz" + remote_path="${remote_paths}_$idx.tar.gz" + url="https://aliopentrace.oss-cn-beijing.aliyuncs.com/v2023UnifiedSchedulerTraces/$remote_path" + command="wget -c --retry-connrefused --tries=0 --timeout=50 -O $file_name $url" + $command + done + } + get_node_resource_usage + get_pod_meta_info + get_pod_resource_usage 
+} +prepare_dir +fetch_data \ No newline at end of file diff --git a/cluster-trace-v2023/figures/.DS_Store b/cluster-trace-v2023/figures/.DS_Store new file mode 100644 index 0000000..561b960 Binary files /dev/null and b/cluster-trace-v2023/figures/.DS_Store differ diff --git a/cluster-trace-v2023/figures/unified_scheduling_framework.png b/cluster-trace-v2023/figures/unified_scheduling_framework.png new file mode 100644 index 0000000..a30d265 Binary files /dev/null and b/cluster-trace-v2023/figures/unified_scheduling_framework.png differ
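The `cpu_psi`, `mem_psi` and `disk_psi` columns described in the cluster-trace-v2023 README are stored as `;`-separated strings, so they usually need to be split before analysis. The following is a minimal sketch of one way to do that, not part of the released tooling: the helper name `parse_psi` and the `*_PSI_FIELDS` lists are hypothetical, and the field order is taken from the README text. How an unavailable PSI value is encoded is not stated in the README, so the sketch does not handle that case specially.

```python
# Hypothetical helper for splitting the PSI columns of the pod running metrics.
# Field order follows the cluster-trace-v2023 README; the names below are
# illustrative and not part of the trace release.

CPU_PSI_FIELDS = ["cpu_avg10", "cpu_avg60", "cpu_avg300", "cpu_psi_total"]

MEM_PSI_FIELDS = [
    "mem_avg10_full", "mem_avg10_some",
    "mem_avg60_full", "mem_avg60_some",
    "mem_avg300_full", "mem_avg300_some",
    "mem_total_full", "mem_total_some",
]

DISK_PSI_FIELDS = [
    "disk_avg10_full", "disk_avg10_some",
    "disk_avg60_full", "disk_avg60_some",
    "disk_avg300_full", "disk_avg300_some",
    "disk_total_full", "disk_total_some",
]


def parse_psi(raw, fields):
    """Split a ';'-separated PSI string into a {field name: float} dict."""
    # Strip each item, since some example entries contain stray spaces.
    parts = [item.strip() for item in raw.split(";")]
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} items, got {len(parts)}: {raw!r}"
        )
    return {name: float(value) for name, value in zip(fields, parts)}


if __name__ == "__main__":
    # Example entries copied from the README table.
    print(parse_psi("0.0;0.01;0.05;3113525409.0", CPU_PSI_FIELDS))
    print(parse_psi("0.0;0.0;0.0;0.0;0.0;0.0;56.0;57.0", MEM_PSI_FIELDS))
```

Raising on a length mismatch, rather than padding or truncating, is a deliberate choice here so that rows with an unexpected layout surface early instead of silently shifting values between fields.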