
Conversation


@gmodena gmodena commented May 5, 2021

Bump Spark driver memory to account for larger result sets.
The memory upper bound was found to allow the job to complete on enwiki.

This change is experimental and is meant to enable analysis/experimentation.

@gmodena gmodena requested a review from clarakosi May 5, 2021 12:31

@clarakosi clarakosi left a comment


I tested it on stat1005 for enwiki and still ran into memory issues.

For reference, the run was: 39607674-87fa-4ee5-9158-c008c150c505

This commit adds some tweaks to spark init,
memory limits and garbage collection policies
needed to meet enwiki memory requirements.

gmodena commented May 27, 2021

@clarakosi I have been able to reproduce the memory errors you reported. There are a few things to unpack.

I was able to get the process to complete and generate valid data by significantly increasing the Spark driver memory (64G) and tweaking related memory settings.

tl;dr: given the very large memory footprint, I would not introduce the query change for now. I'd stick to filtering results by language.

How the error manifests

The query change triggered the following chain of failures:

  1. The memory footprint of the results serialized and returned by each individual worker exceeds the default limit.
  2. The total memory footprint of the result sets collected to the driver exceeds the driver's memory.
  3. The default Java garbage collector chokes on the resulting large heap allocation.

The result is an OOM on the driver and, after the memory increase, GC failures.

Tuning

I tweaked the following (in order):

  1. Increase the max size of a result set returned by the workers (spark.driver.maxResultSize) to 2G.
  2. Increase the total memory available to the driver (spark.driver.memory) to 64G.
  3. Switch from MarkAndSweep to the G1 garbage collector on the driver. G1 is better suited for handling large heaps (note: the current config is chatty, and will output verbose GC info to the console).

WARNING: with pyspark in client mode (which is how we run it from the CLI), spark.driver.memory must be set before the JVM starts. That is, we should pass the value like spark-submit --driver-memory <size>, and not configure it in the SparkConf builder (whose configs are applied after JVM startup). wmfdata does not account for this behavior; the sample code changes in this PR drop wmfdata in favor of manually starting Spark via findspark and setting PYSPARK_SUBMIT_ARGS.
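
A minimal sketch of that setup (the app name is illustrative, and this assumes SPARK_HOME is discoverable on the stat host; the values match the tuning above):

```python
import os

# Driver memory must be fixed before the JVM starts, so pass it (and the
# other settings) through PYSPARK_SUBMIT_ARGS instead of SparkConf.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 64g "
    "--conf spark.driver.maxResultSize=2g "
    "--conf spark.driver.extraJavaOptions=-XX:+UseG1GC "
    "pyspark-shell"
)

import findspark
findspark.init()  # locate the Spark installation and make pyspark importable

from pyspark.sql import SparkSession

# Hypothetical app name; the remaining session config can stay in the builder.
spark = SparkSession.builder.appName("imagerec-analysis").getOrCreate()
```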

Mitigation

IMHO we should look at query optimizations before committing to system changes. 64G is around 10% of the total memory available on stat hosts, and not sustainable in the long run.
The memory footprint is impacted by SerDe and by moving data from the JVM to CPython. In particular, with our current config, toPandas() calls introduce additional memory overhead and could potentially double the footprint.

Enabling Arrow serialization (available since Spark 2.3) might help reduce the JVM <-> CPython SerDe footprint. See https://issues.apache.org/jira/browse/SPARK-13534.
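
Something along these lines (the config key is the Spark 2.x name, and the query is just a placeholder):

```python
# Sketch: route toPandas() through Arrow record batches instead of
# row-by-row pickling between the JVM and CPython.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.sql("SELECT ...")  # placeholder for the actual query
pdf = df.toPandas()           # conversion now uses Arrow for the transfer
```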

