Empty RDD while reading compressed bson files

From: Shantanu Alshi <shantanuals@xxxxxxxxx>
To: mongodb-user <mongodb-user@xxxxxxxxxxxxxxxx>
Date: Wed, 27 Apr 2016 02:47:47 -0700 (PDT)
Hi,


I am using PySpark to analyse a BSON file. When I run my program on an 
uncompressed BSON file, it runs perfectly fine.
However, running the same program on the same file after gzip compression 
gives me an empty RDD.

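from datetime import datetime
from json import JSONEncoder

from bson.objectid import ObjectId
from pyspark import SparkConf, SparkContext

pyspark_location = 'lib/pymongo_spark.py'
HDFS_HOME = 'hdfs://1.1.1.1/'
INPUT_FILE = 'really_large_bson.gz'


class BsonEncoder(JSONEncoder):
    # Serialise ObjectId and datetime values, which json cannot handle itself.
    def default(self, obj):
        if isinstance(obj, ObjectId):
            return str(obj)
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return JSONEncoder.default(self, obj)


def setup_spark_with_pymongo(app_name='PySprkApp'):
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    # Ship pymongo_spark.py with the job.
    sc.addPyFile(pyspark_location)
    return sc


def main():
    spark_context = setup_spark_with_pymongo()
    filename = HDFS_HOME + INPUT_FILE
    # Imported here, after addPyFile() has made the module available.
    import pymongo_spark
    # activate() adds BSONFileRDD to the SparkContext.
    pymongo_spark.activate()
    rdd = spark_context.BSONFileRDD(filename)
    print(rdd.first())   # ValueError: RDD is empty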


I am using mongo-java-driver.jar 3.2.2, mongo-hadoop-spark.jar 1.5.2, 
pymongo_spark and pymongo-3.2.2
The deployed Spark version is 1.6.1 and Hadoop 2.6.4.


I am aware that the current library does not support splitting compressed 
BSON files; however, in my opinion it should still work with a single split. 
I have hundreds of these files to analyse, so decompressing all of them does 
not seem a viable option.
Can anyone please point me in a direction to proceed?
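
To illustrate what I mean by a single split: a gzipped BSON file can, in principle, be read sequentially from start to finish, because every BSON document begins with a little-endian int32 giving its total size. The following is only a minimal sketch in plain Python (no Spark). It is not how mongo-hadoop reads files, and the function names are hypothetical ones I made up for illustration.

```python
import gzip
import io
import struct


def iter_raw_bson_documents(stream):
    """Yield each BSON document found in `stream` as raw bytes.

    `stream` is any binary file-like object, e.g. one returned by gzip.open().
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated BSON length prefix")
        # The prefix is the total document size, including these 4 bytes.
        (size,) = struct.unpack("<i", header)
        if size < 5:
            raise ValueError("invalid BSON document size: %d" % size)
        body = stream.read(size - 4)
        if len(body) != size - 4:
            raise ValueError("truncated BSON document")
        yield header + body


def count_documents_in_gzip(path):
    """Count the documents in a gzip-compressed BSON file (hypothetical helper)."""
    with gzip.open(path, "rb") as stream:
        return sum(1 for _ in iter_raw_bson_documents(stream))
```

Each yielded chunk could then be handed to a BSON decoder. The point is only that no splitting is needed when one reader consumes the whole compressed stream.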

-- 
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.org/manual/support/
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+unsubscribe@xxxxxxxxxxxxxxxx.
To post to this group, send email to mongodb-user@xxxxxxxxxxxxxxxx.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/da7f6d65-352a-4cc1-8615-99b5816ba4e5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.