online learning, blog, anil singh

«

»

Feb 29

MongoDB Basics Primer Series - Step 3

Objective: Get familiar with Aggregation Framework in MongoDB.

Prerequisite: MongoDB Basics Primer Series - Step 1MongoDB Basics Primer Series - Step 2

 

I hope your journey thus far has been smooth and CRUD operations on MongoDB do not unsettle you any more. Now it is time to learn 'GROUP BY' a.k.a Aggregation Pipeline in MongoDB. "Aggregation Pipeline" is analogus to Aggregation in the RDBMS. You can also use Map-reduce for the purpose of aggregation but it adds complexity. Hence Aggregation Pipeline is the preferred way to approach aggregation in MongoDB.

Aggregation Framework was introduced in MongoDB 2.2. It is a multi-stage pipeline that works on the documents to produce aggregated results. Use aggregate method of the collection for the purpose.

The syntax for aggregation pipeline is,

db.<collection name>.aggregate([{stage 1}, {stage 2}, {stage 3}, {stage 4},......], {<options>})

Here are some of the stages that a Aggregate Pipeline has,

  • $match : Filters-in only the matching documents.
  • $project : Re-shapes the documents flowing into it. You can add new fields, rename fields and remove fields.
  • $sort : Arranges the documents by sort key.
  • $limit : Limits the documents to the specified number.
  • $skip : Skips the specified number of documents.
  • $group : Groups the documents by the specified grouping key.
  • $out : Writes the aggregation pipeline result to a collection on the disk. This stage, if required, must be applied at the end of the pipeline.

Some of the commonly used  options in a Aggregation pipeline are,

  • allowDiskUse : It allows the usage of disk in case a stage exceeds the max. memory limit. It is optional.
  • explain : Displays the query plan. It is optional.

In the examples below, I will take you through each of these stages and options.

 

Upload sample data.

For the purpose of illustrating the above stages in action, I will use a simple data set so that the focus remains on the stages and is not marred by the data complexity. Start Mongo Database Server, mongod. Flip to Step 1 of the series if still struggling. Use mongoimport and upload, IndianCensus2011 into the database as follows,

mongoimport -d MyDB -c Population --drop < IndianCensus2011.json

The data from json file IndianCensus2011.json is uploaded to collection 'Population' in database 'MyDB'. '--drop' option drops the collection if it already exists, thereby you can upload the file multiple times without generating multiple instances of the same documents.

 

Look at the data model

Check the uploaded data by running the mongo shell.

> mongo

> use MyDB

> db.Population.findOne()

Following is the output,

20160227_01

For the purpose of illustrating the Aggregation Pipeline, I have picked up only a small fraction of data from Indian Population Census 2011. It does not represent the actual entire population of any State when you sum up. It contains urban and rural population of males and females in Districts of various States. The data is stored in document form in the database.  The structure of these documents is as follows,

  1. "_id" - It is system generated mandatory field. The values are automatically generated as no data is provided for the field in the upload file.
  2. "district" - It contains the name of the district.
  3. "state" - It contains the name of the state to which the district belongs.
  4.  "rural_male_pop" - It contains the Rural Male Population in the district.
  5. "rural_female_pop" - It contains the Rural Female Population in the district.
  6. "urban_male_pop" - It contains the Urban Male Population in the district.
  7. "urban_female_pop" - It contains the Urban Female Population in the district.

 

Working with Aggregation Queries

  1. Using $group - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}]) - This pipeline has only one stage '$group'. It groups the data by 'state' and sums up 'urban_male_pop'.
  2. Using $sort - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}},{$sort:{"State Urban Male Population":-1}}]) - This pipeline has two stages. After the $group stage, the $sort stage sorts the documents in descending order of 'State Urban Male Population'. Try to arrange in ascending order yourself, isn't it easy!!
  3. Using $match - >  db.Population.aggregate([{$match:{"state":{$in:["Haryana", "Punjab"]}}}]) - In this single stage pipeline, the documents containing states, 'Haryana' and 'Punjab' are filtered in.
  4. Using $match and $group - > db.Population.aggregate([{$match:{"state":{$in:["Haryana", "Punjab"]}}}, {$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}]) - This clubs the above two illustrations for $match and $group. So now, you have the 'State Urban Male Population' for the States of 'Haryana' and 'Punjab'. Try sorting the data on the names of the states yourself. To cut down on the amount of data that would undergo processing at later stages, use $match at the beginning of a pipeline.
  5. Using $project - > db.Population.aggregate([{$match:{"state":"Haryana"}}, {$project:{"_id":0,"State":"$state", "District":"$district","Total Population In District":{$sum:["$rural_male_pop", "$rural_female_pop","$urban_male_pop", "$urban_female_pop" ]}}}]) - Look at the document structure in the above figure. The data on population is segregated across rural and urban population of males and females in respective fields. Therefore, in order to get the total population of a district we need to sum these fields. Here I have used the $project stage to add up these fields to get 'Total Population In District' in the state of Haryana. As seen below, the new documents have the following structure, 20160227_02 In the next illustration, I will use this stage as input to $group stage to find the 'Total Population' in each State.
  6. Using $project, $group and $sort - > db.Population.aggregate([{$project:{"_id":0,"State":"$state", "District":"$district","Total Population In District":{$sum:["$rural_male_pop", "$rural_female_pop","$urban_male_pop", "$urban_female_pop" ]}}}, {$group:{"_id":"$State", "Total Population":{$sum:"$Total Population In District"}}}, {$project:{"_id":0, "State":"$_id", "Total Population":1}}, {$sort:{"State":1}}]) - This is a 4 stage pipeline with two $project, one $group and one $sort stages. The output is the population at State level. See the second $project stage. The field '_id' is suppressed. Remember, I covered this in the Step 2 of this series!! 20160227_03 Try to find out average population across Districts in each State. That shouldn't be hard!!
  7. Using $out - > db.Population.aggregate([{$project:{"_id":0,"State":"$state", "District":"$district","Total Population In District":{$sum:["$rural_male_pop", "$rural_female_pop","$urban_male_pop", "$urban_female_pop" ]}}}, {$group:{"_id":"$State", "Total Population":{$sum:"$Total Population In District"}}}, {$project:{"_id":0, "State":"$_id", "Total Population":1}}, {$sort:{"State":1}}, {$out:"StatePopulation"}]) - $out if needed, should be the last stage of a pipeline. It creates/ overwrites a collection containing the result of the query. In this case you will find that a new collection 'StatePopulation' is created. Check the data in the collection yourself. Does everything look good?
  8. Using option, explain - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}], {explain:true}) - Here the query plan is printed. None of the stages are run on the server.
  9. Using option, allowDiskUse - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}], {allowDiskUse:true}) -  In case working on large data sets, use this option to allow using disk in case the memory limit is exceeded by any stage in the pipeline. The current memory limit for a stage is 100 MB.

 

Aggregation pipeline is a quick and simple way of building real-time analytics on MongoDB. With enough ammo now, you can frame your pipelines with the innumerable operators at your disposal to fulfill a use case.

 

In the upcoming Step 4 of this series, I will talk about how to make your queries run faster by introducing indexes. Till then, build up on the newly acquired know-how on Aggregation Pipeline..........try to breach the memory limit for a stage and check whether allowDiskUse can come to your rescue!!

 

 

Recommended for further reading,

  1. Read more on MongoDB Aggregation Pipeline here
  2. Read more on MongoDB Aggregation Pipeline Operators here

 

EmailWhatsAppTwitterLinkedInGoogle+FacebookShare

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>