Sunday, March 18, 2018

How to prevent memory leak when testing with HiveContext in PySpark

I use PySpark to do some data processing and rely on HiveContext for the window function.
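For context, the processing is roughly of this shape (a minimal sketch; the column names and the ranking logic are illustrative, not the actual job):

from pyspark import SparkContext
from pyspark.sql import HiveContext, Window
from pyspark.sql import functions as F

sc = SparkContext("local[2]", appName="window-example")
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)],
    ["user_id", "amount"],
)

# Rank each user's rows by amount, descending; the window function is
# the part that relies on HiveContext.
w = Window.partitionBy("user_id").orderBy(F.desc("amount"))
ranked = df.withColumn("rank", F.row_number().over(w))
ranked.show()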

To test the code, I use TestHiveContext, basically copying the implementation from the PySpark source code:

https://spark.apache.org/docs/preview/api/python/_modules/pyspark/sql/context.html

@classmethod
def _createForTesting(cls, sparkContext):
    """(Internal use only) Create a new HiveContext for testing.

    All test code that touches HiveContext *must* go through this method. Otherwise,
    you may end up launching multiple derby instances and encounter with incredibly
    confusing error messages.
    """
    jsc = sparkContext._jsc.sc()
    jtestHive = sparkContext._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
    return cls(sparkContext, jtestHive)

My test classes then inherit from a base class that exposes this context.
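Concretely, the base class looks roughly like this (a sketch assuming unittest; the class name and the local master string are illustrative):

import unittest

from pyspark import SparkContext
from pyspark.sql import HiveContext


class HiveTestCase(unittest.TestCase):
    """Base class that gives each test access to a TestHiveContext."""

    def setUp(self):
        self.sc = SparkContext("local[2]", appName="hive-tests")
        # The classmethod shown above; it wraps the JVM-side
        # org.apache.spark.sql.hive.test.TestHiveContext.
        self.sqlContext = HiveContext._createForTesting(self.sc)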

This worked fine for a while. However, as I added more tests I started noticing intermittent out-of-memory failures, and now I can't run the test suite without hitting one:

"java.lang.OutOfMemoryError: Java heap space"

I explicitly stop the SparkContext after each test runs, but that does not appear to kill the HiveContext. I therefore believe a new HiveContext is created every time a test runs and the old ones are never removed, which results in the memory leak.
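The per-test teardown is roughly this, continuing the base-class sketch above:

    def tearDown(self):
        # Stops the SparkContext; the TestHiveContext created in setUp is
        # not explicitly released on the JVM side.
        self.sc.stop()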

Any suggestions for how to tear down the base class so that it kills the HiveContext?
