- Flexible library for parallel computing in Python
- Dynamic task scheduling
- "Big Data" collections: parallel arrays, dataframes, and lists -> run on top of dynamic task schedulers
import dask.dataframe as dd
df = dd.read_csv('who.csv')
df.groupby(df.user_id).value.mean().compute()
- Dask DataFrame <- Pandas
- Dask Array <- NumPy
- Dask Bag <- iterators, Toolz, and PySpark
- Dask Delayed <- for loops; wraps custom code
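As a sketch of the last item, `dask.delayed` wraps ordinary Python functions so calls build a task graph instead of running immediately (`inc` and `add` here are made-up example functions, not part of Dask):

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

# Nothing runs yet; these calls only record a task graph.
lazy = add(inc(1), inc(2))

# compute() executes the graph, in parallel where possible.
result = lazy.compute()
print(result)  # 5
```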
- The concurrent.futures interface provides general submission of custom tasks
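A minimal sketch of that futures interface, assuming the `dask.distributed` package is installed and using an in-process local client (`square` is a made-up example function):

```python
from dask.distributed import Client

def square(x):
    return x * x

# Start a scheduler and workers inside this process; real deployments
# would instead point Client at a running cluster address.
client = Client(processes=False)

# submit/map mirror the stdlib concurrent.futures.Executor API.
future = client.submit(square, 4)
futures = client.map(square, range(4))

one = future.result()          # 16
many = client.gather(futures)  # [0, 1, 4, 9]
print(one, many)
client.close()
```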
- installs with conda or pip
- extends the size of convenient datasets from "fits in memory" to "fits on disk"
- scale to a cluster of 100s of machines -> resilient, elastic, data local, and low latency
- represents parallel computations with task graphs
Note) visualize a task graph
import dask.array as da
x = da.ones((15, 15), chunks=(5, 5))
y = x + x.T
# y.compute()
y.visualize(filename='transpose.svg')

Just a very brief introduction to Dask...
