This S3 bucket contains a copy of the Enron Email Corpus formatted for MongoDB, provided in a compressed "mongodump" format. It is 256 MB compressed and expands to about 1.5 GB of MongoDB data (comprising 517,425 emails). As a large, public domain dataset, you may find it useful for testing things like MapReduce, Aggregation, and MongoHadoop, as well as for seeding machine learning databases (a common use for this corpus in the "real world").

You can download the compressed mongodump data here. Import it using the standard MongoDB mongorestore tool.
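A typical restore session looks something like the following. The archive and database names here are illustrative, not the actual ones from the download; substitute the filename you downloaded and check the directory layout after unpacking.

```shell
# Unpack the compressed dump (filename is illustrative -- use the one you downloaded)
tar xjf enron_mongo.tar.bz2

# Restore into a running mongod instance; the dump directory
# name may differ depending on how the archive was packed
mongorestore --db enron dump/enron_mail
```

After the restore completes, the collection is queryable from the mongo shell like any other database.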

The work to produce a MongoDB version of this set was originally done by Bryan Nehl (@k0emt), as described in his blog post. It has since been cleaned up a bit to use Python's email parsing modules to separate out the headers and the email body, making it more useful for analysis. A sample document:

    {
        "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
        "body" : "Here is our forecast\n\n ",
        "subFolder" : "allen-p/_sent_mail",
        "mailbox" : "maildir",
        "filename" : "1.",
        "headers" : {
            "X-cc" : "",
            "From" : "",
            "Subject" : "",
            "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
            "Content-Transfer-Encoding" : "7bit",
            "X-bcc" : "",
            "To" : "",
            "X-Origin" : "Allen-P",
            "X-FileName" : "pallen (Non-Privileged).pst",
            "X-From" : "Phillip K Allen",
            "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
            "X-To" : "Tim Belden ",
            "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
            "Content-Type" : "text/plain; charset=us-ascii",
            "Mime-Version" : "1.0"
        }
    }

If you'd like to reproduce the import yourself from the raw data, the modified copy of Bryan's Python script is available as well.
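The core of the header/body separation can be sketched with Python's standard-library email module. This is a minimal illustration of the technique, not the actual import script; the raw message below is a made-up fragment shaped like the sample document above.

```python
from email.parser import Parser

# A raw RFC 2822 message, as found in the maildir files (invented example)
raw = (
    "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Here is our forecast\n"
)

# Parse the message, then split it into a headers dict and a body string,
# mirroring the "headers" and "body" fields of the sample document
msg = Parser().parsestr(raw)
doc = {
    "body": msg.get_payload(),
    "headers": dict(msg.items()),
}

print(doc["headers"]["Message-ID"])  # <18782981.1075855378110.JavaMail.evans@thyme>
print(doc["body"])                   # Here is our forecast
```

In the real script, a document of this shape would then be inserted into MongoDB along with the mailbox, sub-folder, and filename fields.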