
Conversation


@gmodena gmodena commented May 5, 2021

Bump Spark driver memory to account for larger result sets.
The memory upper bound was found to allow the job to complete on enwiki.

This change is experimental and is meant to enable analysis/experimentation.

@gmodena gmodena requested a review from clarakosi May 5, 2021 12:31

@clarakosi clarakosi left a comment


I tested it on stat1005 for enwiki and still ran into memory issues.

For reference, the run was: 39607674-87fa-4ee5-9158-c008c150c505

This commit adds some tweaks to spark init,
memory limits and garbage collection policies
needed to meet enwiki memory requirements.

gmodena commented May 27, 2021

@clarakosi I have been able to reproduce the memory errors you reported. There are a few things to unpack.

I was able to get the process to complete and generate valid data by significantly increasing the Spark driver memory (64G) and tweaking related memory settings.

tl;dr: given the very large memory footprint, I would not introduce the query change for now. I'd stick to filtering results by language.

How the error manifests

The query change triggered the following chain of failures:

  1. The memory footprint of the results serialized and returned by each individual worker exceeds the default limit.
  2. The total memory footprint of the result sets collected to the driver exceeds the driver's memory.
  3. The default Java garbage collector chokes on the resulting large heap allocation.

The result is an OOM on the driver and, after the memory increase, GC failures.

Tuning

I tweaked the following (in order):

  1. Increase the max size of a result set returned by the workers (spark.driver.maxResultSize) to 2G.
  2. Increase the total memory available to the driver (spark.driver.memory) to 64G.
  3. Switch from MarkAndSweep to the G1 garbage collector on the driver. G1 is better suited for handling large heaps (note: the current config is chatty, and will output verbose GC info to the console).

WARNING: with pyspark in client mode (which is how we run it from the CLI), spark.driver.memory must be set before the JVM starts. That is, we should pass the value like spark-submit --driver-memory <size>, and not configure it in the SparkConf builder (whose configs are applied after JVM startup). wmfdata does not account for this behavior; the sample code changes in this PR drop wmfdata in favor of manually starting Spark via findspark and setting PYSPARK_SUBMIT_ARGS.
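
A minimal sketch of that setup (the app name is illustrative, and this assumes SPARK_HOME is discoverable on the stat host; the values match the tuning above):

```python
import os

# Driver memory must be fixed before the JVM starts, so pass it (and the
# other settings) through PYSPARK_SUBMIT_ARGS instead of SparkConf.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 64g "
    "--conf spark.driver.maxResultSize=2g "
    "--conf spark.driver.extraJavaOptions=-XX:+UseG1GC "
    "pyspark-shell"
)

import findspark
findspark.init()  # locate the Spark installation and make pyspark importable

from pyspark.sql import SparkSession

# Hypothetical app name; the remaining session config can stay in the builder.
spark = SparkSession.builder.appName("imagerec-analysis").getOrCreate()
```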

Mitigation

IMHO we should look at query optimizations before committing to system changes. 64G is around 10% of the total memory available on stat hosts, and not sustainable in the long run.
The memory footprint is impacted by SerDe and by moving data from the JVM to CPython. In particular, with our current config, toPandas() calls introduce additional memory overhead and could potentially double the footprint.

Enabling Arrow serialization (available since Spark 2.3) might help reduce the JVM <-> CPython SerDe footprint. See https://issues.apache.org/jira/browse/SPARK-13534.
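
Something along these lines (the config key is the Spark 2.x name, and the query is just a placeholder):

```python
# Sketch: route toPandas() through Arrow record batches instead of
# row-by-row pickling between the JVM and CPython.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.sql("SELECT ...")  # placeholder for the actual query
pdf = df.toPandas()           # conversion now uses Arrow for the transfer
```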

