
Apr 18

Wealth of India - Natural Resources - Acacia

One of the essential requirements for the rapid economic growth of a nation is the development of its own natural resources. Fortunately, India is blessed with abundant natural resources, and forest products and medicinal plants form an important group of them. To fully utilize this group, which is one of the greatest assets of India, we need sound basic scientific knowledge about it. It is, therefore, natural for scientists engaged in this arena to work hard at studying the various parts of herbs, which have been used as food and medicine for as long as man has been on earth.

Acacia Tree

Fig. 1. Acacia Catechu Tree

Acacia Catechu Tree

Fig. 2. Acacia Catechu Tree (close-up)

Ayurveda recognizes the use of Acacia Catechu heartwood extract powder, known as CATECHIN or KATHA, which has powerful astringent and anti-oxidant properties. The water extract of Khair heartwood contains mainly Katha (catechin) and Cutch (catechu tannic acid). Katha is a principal ingredient in the preparation of Paan, along with other ingredients which include Betel leaf, slaked lime (chuna), areca nut and occasionally tobacco. When combined with slaked lime in Paan, Katha gives the characteristic red colouration on chewing.

Chewing Paan has been an integral part of the culture of many South Asian countries.

Cutch, the other product found in the heartwood, is used largely as a tanning material and as a preservative for fishing nets, imparting them a longer life.

Catechin Manufacturing Process

Catechin Molecular Diagram

Fig. 3. Molecular Structure of Catechin (C15H14O6)

The bark and sapwood of the Khair tree are removed from the wood with hand axes. The log is cut into billets with a saw machine, and the heartwood billets are converted into chips with a chipping machine. The chips are boiled with water in autoclaves using live steam for complete extraction. The water extracts obtained are filtered through a filter press and concentrated in an evaporator to the desired thickness. The concentrated juice is cooled to room temperature and stored in a cold room at low temperature for 15 days for crystallization, after which it is filtered to remove Cutch as a by-product. The cake is then plated, cut into the required sizes and kept in a drying chamber at low temperature for drying. It is finally dried at ambient temperature and packed in boxes of varying capacity, usually 20 kg.

The following figure depicts the processing  steps involved in procuring Catechin from the Khair Tree.

 

Catechin Extraction

Fig. 4. Manufacturing Process of Catechin

 

The finished product can be in powder or biscuit form and has a higher catechin content than the initial raw product. By using specific processes, the catechin content can be varied depending upon the requirement and the intended use. Catechin enriched to 90% is used for therapeutic purposes. It gives excellent results in curing dental, oral and throat infections and is used as an astringent for reducing oozing from chronic ulcers and wounds.

 

Amba Industries

 

Don't forget to subscribe to our newsletters and keep abreast of the latest blogs on this platform.

 


The contributing writer, Dr. R. B. Singh, holds a Ph.D. in Organic Chemistry and has more than 40 years of consulting industrial experience in Pharmaceuticals & Phytochemicals. He is associated with many India-based companies manufacturing catechin and other pharmaceutical products. For any queries on the subject, he can be reached at drsingh47@gmail.com.

Apr 09

MongoDB Basics Primer Series - Step 4 (Indexes)

Objective: Make your queries run faster on MongoDB by introducing indexes.

Prerequisite: MongoDB Basics Primer Series - Step 1, Step 2, Step 3.

 

As promised in Step 3 of this series, this step will take you through indexing in MongoDB. The index concepts in MongoDB are very similar to those in any conventional RDBMS that you might have used. Indexing is the single biggest tunable performance factor in a database, and MongoDB is no exception.

An index enables the query to focus on the right set of documents instead of scanning all the documents in a collection.

 

Upload sample data

For the purpose of illustrating indexes in action, let us load some sample data. Start mongod, the MongoDB data server process. You should have become adept at running the process by now!! Refer to the earlier posts in this series if needed, or run the following in the Terminal window.

$ mongod --dbpath Data/ --logpath mongod.log --storageEngine wiredTiger --fork

Create the Data directory before running the above command. Once the process is started, download the JSON file, IndianFestivalsHolidays.json, and import the data into MongoDB as follows,

mongoimport -d MyDB -c IndianFestivalsHolidays --drop < IndianFestivalsHolidays.json

The data from IndianFestivalsHolidays.json is loaded into collection IndianFestivalsHolidays in database MyDB. Run a count on the collection; there should be 61 documents.
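As a quick sanity check, assuming the import completed without errors, run a count in the Mongo shell,

> db.IndianFestivalsHolidays.count()

It should return 61.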

 

The Data Model

The data contains the festival holidays in India for the calendar year 2016. The loaded documents have the following structure,

Document structure

Fig. 1. Loaded Sample Data

Here I am using a GUI-based MongoDB client, MongoChef Professional. Remember!! I discussed this in Step 2 of this series. Anyway, if you haven't tried it yet, you can try it now.

The documents have the following fields,

  1. "_id" - If not provided by the user, as is the case, it is generated by the system. The values are unique.
  2. "On" - It contains the date.
  3. "Day" - It contains the day of the week.
  4. "Month" - It contains the month name corresponding to the date.
  5. "Occasion" - It is an array containing one or more events/ festivals.

 

In the Thick of Indexes

Now that you have the data loaded, figure out the currently available indexes in the collection.

In the Mongo shell or MongoChef IntelliShell run,

> db.IndianFestivalsHolidays.getIndexes()

MongoDB primary index

Fig. 2. Existing Indexes

As you see, there is already an index on the "_id" field. This is the primary index created automatically by MongoDB, and it is the reason why you must provide unique values if you supply this field yourself. Any other index on the documents is called a Secondary Index.

Let's look for Occasion "Diwali" in the collection,

> db.IndianFestivalsHolidays.find({Occasion:"Diwali"})

Now run explain on the above query to see how the query travelled to get the required data. In its default mode, explain provides the execution plan of the query without executing it; with "executionStats", as used below, it also runs the query and reports runtime statistics. It helps to identify the stages where collection scans are being done, whether indexes are being used, etc. A well-designed query must have minimal collection scans and must make good use of available indexes.

> db.IndianFestivalsHolidays.find({Occasion:"Diwali"}).explain("executionStats")

On running the above explain, you will get the following output,

Query execution plan in MongoDB

Fig. 3. Query Execution Plan

Look at the fields marked with the red dots. "stage":"COLLSCAN" indicates that the query is scanning the entire collection to get the required data. "nReturned": NumberInt(1) indicates that 1 document is returned by the query. Finally, look at "totalDocsExamined": NumberInt(61), which indicates that the query is examining 61 documents, which is in fact the total number of documents in the collection. The large gap between "nReturned" and "totalDocsExamined" suggests that an index will help make the query more efficient. SO LET'S BUILD ONE!! The narrower the gap between "nReturned" and "totalDocsExamined", the better.

Since the query focuses on the field "Occasion", the index must be built on this field.

> db.IndianFestivalsHolidays.createIndex({"Occasion":1})

Now check the available indexes in the collection again,

> db.IndianFestivalsHolidays.getIndexes()

Secondary index in MongoDB

Fig. 4. New index on field "Occasion"

As you see, there are now two indexes in the collection. The new index is "Occasion_1" on the field "Occasion". With the new index in place, run explain on the query again to check the difference the index has made to query efficiency.

> db.IndianFestivalsHolidays.find({Occasion:"Diwali"}).explain("executionStats")

Since the output is a long one, I'll present it in two snapshots,

Index scan in MongoDB

Fig. 5. Index scan introduced

Here you see "COLLSCAN" changes to "IXSCAN" for field "stage". "IXSCAN" is Index Scan.

MongoDB query execution plan with index scan

Fig. 6. Query statistics

As you see here, the total documents examined has dropped from 61 to 1 after the introduction of the index. Look at "executionTimeMillis"; the value has dropped from the earlier 209 ms to 0. This is why it is said that the right index does wonders for a query!!

Once done, you can drop the index and try creating ones that satisfy the needs of the queries you wish to run on the database.

> db.IndianFestivalsHolidays.dropIndex({Occasion:1})
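Equivalently, you can drop the index by the name reported in the getIndexes() output,

> db.IndianFestivalsHolidays.dropIndex("Occasion_1")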

Now check the available indexes in the collection.

In case you wish to drop all the indexes in the collection, run

> db.IndianFestivalsHolidays.dropIndexes()

All indexes except the one on the field "_id" get dropped.

 

With index basics in place, you are all set to speed up your queries in MongoDB. In the next step of the series, I'll cover index types and properties. Till then, tinker on your MongoDB box and try to do something new each day.

 

Don't forget to subscribe to our newsletters and keep abreast of the latest blogs on this platform.

 

 

Recommended for further reading,

  1. Introduction to Indexes in MongoDB
  2. Running explain in MongoDB

 

 

Mar 15

How I multiply my Dracaena Fragrans Victoriae

20160313_01

Come spring, plant lovers cannot resist the opportunity of multiplying their existing plant stock, especially the plants that can be vegetatively propagated through stem cuttings. The success rate is highest during spring owing to better vascular movement. I'm an ardent gardener myself.

My first encounter with Dracaena Fragrans Victoriae (DFV), a.k.a. the Corn Plant, was in the summer of 1998, when I was in college and visited a nursery just before the summer break. I didn't know its name then. Planted in a 6 inch pot, it rose a foot above the soil, with one cluster of glossy oblong leaves, green and yellow stripes running through their length, at the top end of the round woody stem. It looked beautiful. Over the years, I have multiplied it into 6 independent plants from the one I had. I propagated it again last week. At a 100% success rate, that will take the count to 10, :)

 

About the plant

The plant has one or more woody stems resembling canes, each with a crown of leaves drooping downwards. As the plant grows, the crown of leaves moves upwards and the older leaves wither, exposing a beautiful round woody stem. The stem will branch if the crown is cut.

With its variegated, shiny long leaves, it adds to the ambiance and thrives well indoors. DFV tolerates anything from low light to direct sunlight and is a low-maintenance plant. It is doing well in my terrace garden at Bahadurgarh (28.68°N 76.92°E), where the temperature varies from 4 to 45 deg. Celsius through the year. Once a year, remove the top soil and add manure. As the plant grows, shift it into a bigger pot. Keep the soil moist; do not overwater or underwater.

 

How I propagate

  1. The soil mixture - The potting mix must contain soil and manure in equal quantity. Use a trowel to mix well. Get a 12" flower pot. Cover the drainage hole at the bottom of the pot with small stones and pebbles to ensure that it is not blocked by the potting mix and the excess water is able to seep out. Fill the pot with the mix till 1.5" from the top. This ensures that the water does not overflow during watering.
  2. Identify the mother stem - Pick a DFV stem that is tall enough and woody. The older the stem, the better the chances of rooting.
    20160313_02

    Fig. 1. Identify a suitable stem

    Use a sharp pruning shear or secateur to cut the stem at the bottom. The cut must be clean, which is only possible with a sharp tool; a blunt one will rupture the stem instead.

    20160313_03

    Fig. 2. Cut the stem from the bottom

  3. Cut the stem into smaller pieces - Once the stem is detached from the mother plant, lay it on the ground. Remove the crown, which contains the green fleshy stem and leaves.
    20160313_04

    Fig. 3. The cut stem

    20160313_05

    Fig. 4. Stem cut into 8"-10" sized pieces

    Assess the number of cuttings that can be made out of the stem. 8" to 10" is a good size. Now using the secateur, cut the long stem into smaller sized pieces. I was able to get 4 pieces out of the stem.

  4. Pot the cuttings - Now into the flower pot push each of the cuttings half their length into the soil. The cuttings must be equidistant from each other.
    20160313_06

    Fig. 5. Potted cuttings

    Water the pot with a sprinkler. The excess of water will seep out through the drainage hole at the bottom of the pot.
    You can apply rooting hormone powder to the base of these cuttings before pushing them into the soil to aid rooting, but it is not really required as it is SPRING now.

  5. Post-potting care - Place the pot in shade, most likely under some bigger plant in the garden, to avoid direct sunlight. Keep the soil moist. Sprinkle water onto the cuttings and the soil in the early mornings and evenings. If you take good care of the cuttings, then in around 30 to 40 days roots will form at the base of each potted cutting and green shoots will appear near the top ends. You will have your new DFVs ready to be planted into separate pots.

 

Gardening brings you closer to nature and is a great learning experience. The joy and satisfaction you derive from seeing your efforts take shape is no less than what you feel on the successful compilation of your code on your desktop.

 

In my next blog, I will share how well my Cycad pups from the previous year are doing. Till then, go look for things you can do to add some green to your surroundings.

 

 

Mar 11

Neo4j Basics Primer Series - Step 2

 

Objective: Learn graph database concepts and the first dive into Cypher.

Prerequisite: Neo4j Basics Primer Series - Step 1

 

A graph database such as Neo4j is based on Graph Theory, which defines a graph as a set of vertices and edges. These vertices and edges are Nodes and Relationships respectively in Neo4j. The relationships connect the nodes and form natural paths. Unlike the contrived joins in an RDBMS, these relationships are rich and explanatory, as they can be described by a set of properties. Nodes have properties too. Therefore, Neo4j is a Property Graph Database.

Two edges are adjacent if they share a common vertex, and two vertices are adjacent if there is an edge between them. Adjacency is an important aspect that gives graph databases an upper hand when dealing with connected data. These natural links between nodes act as pointers or connectors between the nodes; this is known as Index-Free Adjacency, which aids traversals without the need for indexes. Indexes are required only to speed up the lookup for the starting node/s. Once those are located, index-free adjacency takes care of the rest of the traversal.

A Graph is traversable if you can draw a path between all the vertices without retracing the same path.

Even the tree data structure used to store indexes and data is a graph, an Acyclic Graph.

Neo4j consists of the following key elements,

  • Nodes : represent entities; are equivalent to rows in a table in an RDBMS; are represented by circles.
  • Relationships : connect two nodes; are represented by (directed) lines between the nodes.
  • Labels : are names which organize Nodes into various groups/ types/ roles. These are equivalent to tables in an RDBMS. A node can belong to more than one label. Besides, labels are used to define constraints and create indexes (a small sketch follows the note below).
  • Properties : These are the attributes defined on the nodes and relationships. These are equivalent to the columns which describe rows in an RDBMS. Properties are key:value pairs. Every relationship must also have exactly one Relationship Type; the type is mandatory, though it is a construct of its own rather than an ordinary property. Properties give a richer semantic description to the relationships than is possible in an RDBMS.

Note: There may be zero or more Labels for a Node. There may be zero or more properties on a Node or a Relationship.
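As a small illustration of labels driving constraints and indexes, the statements below are a minimal sketch in Neo4j 2.3 Cypher, assuming the Student label and name property used in the sample data later in this post,

create constraint on (s:Student) assert s.name is unique;

create index on :Student(name);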


Why use Neo4j

Neo4j is Agile, Fast, ACID compliant, Scalable and Production-ready. You need nothing more than a whiteboard to start modelling in Neo4j; you are defining the data model even as you discuss the domain. The platform is best suited for data that is highly connected.


Cypher

Cypher is Neo4j's query language, playing the role that standard SQL plays for an RDBMS. There is no standard defined for a NoSQL query language yet; each NoSQL database has its own. Neo4j has Cypher, Couchbase has N1QL (pronounced Nickel) and so on. Cypher is a declarative, pattern-matching query language, i.e. you tell it the pattern you want to find, and the how part is taken care of by Cypher.

Nodes are represented by round brackets, (). A node identifier is (n). A labelled node identifier is (n:Student). A specific node is (n:Student {name:"Anil"}). This representation denotes a Student node with name (property) "Anil".

Relationships are represented by dashes. An undirected relationship is --. A directed relationship is <-- or -->. A specific relationship between two nodes is -[r:STUDIES_IN]->. Here 'r' is the identifier and 'STUDIES_IN' is the relationship type.
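Putting the node and relationship notations together, a complete pattern (using the illustrative STUDIES_IN type above) would read,

match (n:Student {name:"Anil"}) -[r:STUDIES_IN]-> (g:Grade) return n, r, g;

which finds the student named "Anil", the grade he studies in, and the relationship connecting the two.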


The Sample Data

Start the database engine; refer to Step 1 of this series if required. Download the 20160310_SampleData.cql.tar.gz file containing Cypher queries to create nodes and relationships. This is the sample data on which I will take you through your Cypher learning. On the Terminal window, change directory to the one where you downloaded the aforementioned file and

  1. Unzip and untar the file,
    $ tar -zxvf 20160310_SampleData.cql.tar.gz
  2. Open the Terminal window and run,
    $ neo4j-shell -file 20160310_SampleData.cql > output.txt
    Check output.txt to figure out whether everything ran well or not. Alternatively, you can run these queries in Neo4j Browser or simply copy paste onto the Neo4j shell and press enter to execute. You can opt for any of the methods that you find convenient.

Once done, in the Neo4j Browser editor run, match (n) return n; Do you see output similar to the one below,

 

20160310_AllNodes

Fig.1 Sample Data Visualization


Ok, time to juggle with the sample data using Cypher

  1. match (n)
    return n;
    Retrieves all nodes in the database. 'match' searches for the specified node/s or pattern in the database. 'return' is similar to 'select' in RDBMS; it determines the projection.
  2. //This is a comment
    match (n)
    return n;
    Use '//' to add comments to your queries. These are very useful to describe your queries, especially  when you mark them as 'Favorite' in the Neo4j Browser.
  3. match (n:Student)
    return n.name, n.age;
    Retrieves name and age of all students. To get properties, use the notation, <node identifier>.<node property>. Similarly you can fetch properties of relationships.
  4. match (n:Student) --> (g:Grade)
    return n,g;
    Retrieves all students and grades. Reverse the direction of the arrow to '<--' , what did you get?
  5. match (n:Student) --> (g:Grade)
    return n.name, g.name order by g.name, n.name;
    Arranges the output by grade and student name in ascending order. Use 'desc' to arrange in descending order.
  6. match (n:Student) -- (g:Grade)
    return n,g;
    Note the missing arrowhead. This checks for the relationship in either direction. The result is the same as that of the directed query in 4 above.
  7. match (n:Student) --> (g:Grade {name:2})
    return n,g;
    Retrieves all the students in grade '2'.
  8. match (n:Student) --> (g:Grade)
    where g.name=2
    return n,g;
    This is another way of writing the above query by using the 'where' clause instead.
  9. match (n:Student) -- (g:Grade)
    where not g.name=2
    return n,g;
    Using 'not' in 'where' clause to filter out grade 2.
  10. match (s:Subject)
    return distinct s.name;
    Retrieves all subjects taught in the school (grade 2 and grade 3). 'distinct' is used to remove the repeating or duplicate subject names.
  11. match (s:Subject)
    return distinct s.name as Subject;
    Using alias 'Subject' on the returned subject names.
  12. match (n:Student) --> (g:Grade)
    where g.name=2
    return count(*) as NoOfStudents;
    Counts the number of students in grade 2.
  13. match (n:Student) --> (g:Grade)
    return g as Grade,count(*) as NoOfStudents;
    Counts the number of students in each grade. Neo4j doesn't need to be told about the grouping key, it figures that out automatically.
  14. match (n:Student) --> (g:Grade)
    return g as Grade, collect(n.name) as Students, count(*) as NoOfStudents;
    Here, all the students counted in the aggregate are collected into an array/ list against each grade. Hence, apart from the count of students, you also get the names of the students in each grade.
  15. match (n:Student)
    return n limit 4;
    Limits the students listed to 4 out of the total 6.
  16. match (n:Student)
    return n skip 4 limit 10;

    Skips 4 and returns 2 student nodes as there are only 6 student nodes.
  17. match (n:Student) -[r:StudiesIn]-> (g:Grade)
    return n.name as Name, type(r) as Relation, g.name as Grade
    union all
    match (t:Teacher)-[r:IS_CLASS_TEACHER]-> (g:Grade)
    return t.name as Name, type(r) as Relation, g.name as Grade;
    Neo4j supports union all just like SQL in RDBMS. The query retrieves the names of class teachers and students in each grade.
  18. match (n)-[r]->()
    return distinct type(r) as Relationships;
    Fetches all the relationships that exist in the database.
  19. match (n:Student)
    where n.name contains 'mit'
    return n;
    Fetches Student nodes whose name contains 'mit'.
  20. match (n:Student)
    where n.name ends with 't'
    return n;
    Fetches Student nodes whose name end with 't'.
  21. match (n:Student)
    where n.name in ['Amit','Anu', 'Jaya']
    return n;
    Retrieves Student nodes with names 'Amit' or 'Anu' or 'Jaya'.
  22. merge (n:Student {name:'Anil', age:7, sex:'M'})
    return n;
    If the node does not exist, creates a new Student node.
  23. match (n:Student {name:'Anil', age:7, sex:'M'}), (g:Grade {name:2})
    merge (n)-[r:StudiesIn]-> (g)
    return n, r, g
    If the student node 'Anil' and the grade node '2' exist, then the relationship 'StudiesIn' is created between them if it doesn't already exist.
  24. Now try deleting a student 'Amit' who studies in grade '2',
    match (n:Student {name:'Amit'})-[r:StudiesIn]->(g:Grade {name:2})
    delete n;
    What is your observation? Can you relate this to RDBMS world? Now try again with the below query,
    match (n:Student {name:'Amit'})-[r:StudiesIn]->(g:Grade {name:2})
    detach delete n;
    This time the node gets deleted along with the relationship it was holding with the grade. A simple delete would be successful only if the node is an isolated node i.e. does not have any relationships. For a connected node, use 'detach delete'.
  25. This is the BIG ONE!!! Deletes all nodes and relationships.
    match (n)

    detach delete n;
    Check the data again, did you get any?
    Load the sample data again in case you wish to work on the same data set. You can create your own as well.

 

After working through the above queries, you will find that Neo4j is a lot simpler than it looks from the outside. People from the RDBMS world will find similarities between SQL and Cypher when it comes to the various clauses and operators.

Now that you have had your first dive, in the upcoming Step 3 of this series I will talk more on Cypher queries, especially the advanced parts. Till then, you have enough ammo for practicing and learning Cypher. Wish you a pleasant Weekend!!

 

Recommended for further reading,

  1. Neo4j concepts
  2. Read about Cypher

 

 

Feb 29

MongoDB Basics Primer Series - Step 3

Objective: Get familiar with Aggregation Framework in MongoDB.

Prerequisite: MongoDB Basics Primer Series - Step 1, MongoDB Basics Primer Series - Step 2

 

I hope your journey thus far has been smooth and CRUD operations on MongoDB do not unsettle you any more. Now it is time to learn 'GROUP BY', a.k.a. the Aggregation Pipeline, in MongoDB. The Aggregation Pipeline is analogous to aggregation in an RDBMS. You can also use Map-reduce for the purpose of aggregation, but it adds complexity. Hence the Aggregation Pipeline is the preferred way to approach aggregation in MongoDB.

The Aggregation Framework was introduced in MongoDB 2.2. It is a multi-stage pipeline that works on the documents to produce aggregated results. Use the aggregate method of the collection for the purpose.

The syntax for aggregation pipeline is,

db.<collection name>.aggregate([{stage 1}, {stage 2}, {stage 3}, {stage 4},......], {<options>})

Here are some of the stages that an Aggregation Pipeline can have,

  • $match : Filters in only the matching documents.
  • $project : Re-shapes the documents flowing into it. You can add new fields, rename fields and remove fields.
  • $sort : Arranges the documents by sort key.
  • $limit : Limits the documents to the specified number.
  • $skip : Skips the specified number of documents.
  • $group : Groups the documents by the specified grouping key.
  • $out : Writes the aggregation pipeline result to a collection on the disk. This stage, if required, must be applied at the end of the pipeline.

Some of the commonly used options in an Aggregation Pipeline are,

  • allowDiskUse : It allows the usage of disk in case a stage exceeds the max. memory limit. It is optional.
  • explain : Displays the query plan. It is optional.

In the examples below, I will take you through each of these stages and options.

 

Upload sample data

For the purpose of illustrating the above stages in action, I will use a simple data set so that the focus remains on the stages and is not marred by data complexity. Start the Mongo database server, mongod. Flip to Step 1 of the series if you are still struggling. Use mongoimport to upload IndianCensus2011.json into the database as follows,

mongoimport -d MyDB -c Population --drop < IndianCensus2011.json

The data from the json file IndianCensus2011.json is uploaded into collection 'Population' in database 'MyDB'. The '--drop' option drops the collection if it already exists, so you can upload the file multiple times without generating multiple instances of the same documents.

 

Look at the data model

Check the uploaded data by running the mongo shell.

> mongo

> use MyDB

> db.Population.findOne()

Following is the output,

20160227_01

For the purpose of illustrating the Aggregation Pipeline, I have picked up only a small fraction of the data from the Indian Population Census 2011; it does not add up to the actual population of any State. It contains the urban and rural population of males and females in the Districts of various States. The data is stored in document form in the database. The structure of these documents is as follows,

  1. "_id" - It is system generated mandatory field. The values are automatically generated as no data is provided for the field in the upload file.
  2. "district" - It contains the name of the district.
  3. "state" - It contains the name of the state to which the district belongs.
  4.  "rural_male_pop" - It contains the Rural Male Population in the district.
  5. "rural_female_pop" - It contains the Rural Female Population in the district.
  6. "urban_male_pop" - It contains the Urban Male Population in the district.
  7. "urban_female_pop" - It contains the Urban Female Population in the district.

 

Working with Aggregation Queries

  1. Using $group - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}]) - This pipeline has only one stage '$group'. It groups the data by 'state' and sums up 'urban_male_pop'.
  2. Using $sort - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}},{$sort:{"State Urban Male Population":-1}}]) - This pipeline has two stages. After the $group stage, the $sort stage sorts the documents in descending order of 'State Urban Male Population'. Try to arrange in ascending order yourself, isn't it easy!!
  3. Using $match - >  db.Population.aggregate([{$match:{"state":{$in:["Haryana", "Punjab"]}}}]) - In this single stage pipeline, the documents containing states, 'Haryana' and 'Punjab' are filtered in.
  4. Using $match and $group - > db.Population.aggregate([{$match:{"state":{$in:["Haryana", "Punjab"]}}}, {$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}]) - This clubs the above two illustrations for $match and $group. So now, you have the 'State Urban Male Population' for the States of 'Haryana' and 'Punjab'. Try sorting the data on the names of the states yourself. To cut down on the amount of data that would undergo processing at later stages, use $match at the beginning of a pipeline.
  5. Using $project - > db.Population.aggregate([{$match:{"state":"Haryana"}}, {$project:{"_id":0,"State":"$state", "District":"$district","Total Population In District":{$sum:["$rural_male_pop", "$rural_female_pop","$urban_male_pop", "$urban_female_pop" ]}}}]) - Look at the document structure in the above figure. The data on population is segregated across rural and urban population of males and females in respective fields. Therefore, in order to get the total population of a district we need to sum these fields. Here I have used the $project stage to add up these fields to get 'Total Population In District' in the state of Haryana. As seen below, the new documents have the following structure, 20160227_02 In the next illustration, I will use this stage as input to $group stage to find the 'Total Population' in each State.
  6. Using $project, $group and $sort - > db.Population.aggregate([{$project:{"_id":0,"State":"$state", "District":"$district","Total Population In District":{$sum:["$rural_male_pop", "$rural_female_pop","$urban_male_pop", "$urban_female_pop" ]}}}, {$group:{"_id":"$State", "Total Population":{$sum:"$Total Population In District"}}}, {$project:{"_id":0, "State":"$_id", "Total Population":1}}, {$sort:{"State":1}}]) - This is a 4 stage pipeline with two $project, one $group and one $sort stages. The output is the population at State level. See the second $project stage. The field '_id' is suppressed. Remember, I covered this in the Step 2 of this series!! 20160227_03 Try to find out average population across Districts in each State. That shouldn't be hard!!
  7. Using $out - > db.Population.aggregate([{$project:{"_id":0,"State":"$state", "District":"$district","Total Population In District":{$sum:["$rural_male_pop", "$rural_female_pop","$urban_male_pop", "$urban_female_pop" ]}}}, {$group:{"_id":"$State", "Total Population":{$sum:"$Total Population In District"}}}, {$project:{"_id":0, "State":"$_id", "Total Population":1}}, {$sort:{"State":1}}, {$out:"StatePopulation"}]) - $out, if needed, should be the last stage of a pipeline. It creates/ overwrites a collection containing the result of the query. In this case you will find that a new collection 'StatePopulation' is created. Check the data in the collection yourself. Does everything look good?
  8. Using option, explain - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}], {explain:true}) - Here the query plan is printed. None of the stages are run on the server.
  9. Using option, allowDiskUse - > db.Population.aggregate([{$group:{"_id":"$state", "State Urban Male Population":{$sum:"$urban_male_pop"}}}], {allowDiskUse:true}) -  In case working on large data sets, use this option to allow using disk in case the memory limit is exceeded by any stage in the pipeline. The current memory limit for a stage is 100 MB.
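Notice that the stage list earlier mentions $limit and $skip, which none of the illustrations above use. As a minimal sketch, assuming the 'StatePopulation' collection created by $out in illustration 7, the following pipeline returns the five most populous states,

> db.StatePopulation.aggregate([{$sort:{"Total Population":-1}}, {$limit:5}])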

 

The Aggregation Pipeline is a quick and simple way of building real-time analytics on MongoDB. With enough ammo now, you can frame your pipelines with the innumerable operators at your disposal to fulfill a use case.

 

In the upcoming Step 4 of this series, I will talk about how to make your queries run faster by introducing indexes. Till then, build on the newly acquired know-how on the Aggregation Pipeline......try to breach the memory limit for a stage and check whether allowDiskUse can come to your rescue!!

 

 

Recommended for further reading,

  1. Read more on MongoDB Aggregation Pipeline here
  2. Read more on MongoDB Aggregation Pipeline Operators here

 

Feb 21

Neo4j Basics Primer Series - Step 1

Objective: Setup Neo4j on your workstation.

Prerequisite: Hunger to learn.

 

Google, Facebook, Twitter, LinkedIn......are an important part of our lives today. Ever thought of the databases on which these applications run? Some aspects, if not all, of these applications run on proprietary GRAPH database technologies. How about getting hands-on with one such database technology, Neo4j?

Neo4j might not be the most popular database overall, but it tops the list of Graph Databases. Check www.db-engines.com for the rankings. That means you would be laying your hands on a technology that is very relevant in the current marketplace!!! I see that gleam in your eyes.

Keeping with the spirit of the 'Basics Primer Series', this series too is aimed to be an authoritative, step-by-step reference guide for a non-starter who wishes to embark on the journey to learn Neo4j. The series will safeguard the learner from the information deluge on the subject, which might leave him perplexed, and will take a very structured approach in disseminating the information.

Let us kickstart,

 

What do we need to set up a working Neo4j environment,

  1. Identify the platform that you are on. I personally prefer to work on Ubuntu. Not just because it's free, but because most production installations are on one or other Unix flavours that match Ubuntu very closely; besides, the Long Term Support (LTS) that you get on Ubuntu is quite inviting. I use Ubuntu 14.04. To the undergrads reading this primer, I would highly recommend using a Linux flavour of their choice, for reasons they will really appreciate once they are into their careers. In case you have a Windows machine, you can run Ubuntu on a virtual machine using Oracle VirtualBox or VMware Player.
  2. Download a copy of the latest Neo4j for Ubuntu. It is always good to learn on the latest GA release; the current one is 2.3.2. The Community Edition is good to start with. It doesn't have the scalability and other advanced features required for a Production-worthy setup, but those are quite non-essential for this basics series.
  3. Starting the Database Server,
    1. Open the Terminal window (Ctrl + Alt + T). Unzip and untar the downloaded archive - $ tar -zxvf neo4j-community-2.3.2-unix.tar.gz - You get a directory 'neo4j-community-2.3.2'. This directory contains all the Neo4j binaries, jars etc.
    2. Move it to your home directory - $ mv neo4j-community-2.3.2 ~
    3. Update the PATH variable with location to Neo4j executables - $ echo export PATH=$PATH:~/neo4j-community-2.3.2/bin >> ~/.bashrc - This will enable you to run Neo4j programs and utilities from any location on the shell.
    4. Make the new setting effective either by re-opening the Terminal window or run - $ source ~/.bashrc
    5. Run the database server - $ neo4j start - A database server process spawns out.
    6. Check the server status - $ neo4j status - You should get the message 'Neo4j Server is running at pid <pid number>', if everything went well.
  4. Connecting to the Database Server
    1. Type 'http://localhost:7474' into your browser window. This is the Neo4j Browser. You can run your queries here as well as visualize the results in colorful Graphs.
    2. On the web page, key-in the Username/ password -- neo4j/neo4j and change the password.
    3. An alternate method of connecting to the database server is using Neo4j shell - $ neo4j-shell - Starts the shell. I recommend using the Neo4j Browser during the initial learning phase as it helps relate to the basic concepts of the database - Nodes, Relationships, Labels, Properties - faster than when using the bare bones but effective shell.
  5. Using Neo4j Browser and the first steps
    1. The Neo4j Browser is rich in features. For now I would limit our discussion to the Editor and the Run/Play button. Look at the screenshot below, you type  your queries in the area marked as 'Editor' and press the button marked as 'Run' to execute the query. 01_Browser
    2. Simplifying things....to give you a glimpse into the look and feel of data modeling in Neo4j, let me represent Students studying in various Grades. Copy, Paste and Run each of the statements in the file SampleData in the Neo4j Browser Editor; an illustrative statement is sketched just after this list. The statements must be run one at a time. The file contains Cypher (the equivalent of SQL in RDBMS) queries that create a sample data set in order to give you a sneak peek into the Neo4j world. You might be surprised to see the circles and arrows getting created with each successful execution of the queries, but wait a little more for the BIG PICTURE to come ALIVE!!
  6. The BIG PICTURE, now that you have a sample data set to work on, it's time to look at it. In the Neo4j Browser Editor run - match (n) return n; - This is the Cypher equivalent of SQL's 'select * from <table name>' in RDBMS. It displays all the Nodes (Student and Grade) and Relationships (StudiesIn) in the database. 02_Data
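I haven't reproduced the SampleData file here, but a typical statement in it would be along these lines (a sketch with hypothetical property values, matching the Student/Grade model above),

create (s:Student {name:'Anil', age:7}) -[:StudiesIn]-> (g:Grade {name:2});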

Did you see the connectedness in the data? The relationship 'StudiesIn' is quite evident even when you are looking at the data and not a design diagram. Relationships are first-class citizens of the graph data model. This is what makes a GRAPH DATABASE stand out when it comes to dealing with connected data. Graph databases are the database of choice in areas like Routing, Bio-Informatics, Social Relationships, E-Commerce etc.

I hope you got a kick out of this brief start-up on Neo4j. In Step 2 of this series, I will take you through the concepts and how to write Cypher queries. Till then, for the curious minds: now that you have set up your Neo4j environment, start tinkering, and I will catch up soon.

 

 

Recommended for further reading,

  1. Neo4j concepts
  2. Read about Cypher

 

Feb 15

MongoDB Basics Primer Series - Step 2

Objective: First dive into the MongoDB CRUD operations and know third-party tools to connect to MongoDB.

 

Prerequisite: MongoDB Basics Primer Series - Step 1

 

I hope your previous run-through of Step 1 of this series was smooth. Now that you have set up the environment yourself and done some tinkering, you must be brimming with confidence. It is time to put forward some basic concepts of MongoDB.

MongoDB is a NoSQL database as it is based on the following NoSQL pillars,

  1. Non-relational data model. It is a JSON (document) store. You must have gone through the JSON in Step 1, right?
  2. Cluster friendly, highly scalable.
  3. Schemaless. The structure of the documents is not required to be pre-declared and need not be consistent across multiple documents in the same collection. Remember the RDBMS world!!! All rows in a table have the same fields whether a given row needs them or not, giving rise to sparse data, generic column names and wide tables.

 

Load Data into MongoDB

To start firing queries at the server, we need some data. So let's prepare to load some from a JSON file,

  1. Open a Terminal window and create a Temp directory to hold upload data file - $ mkdir ~/MongoDbWorkSpace/Temp
  2. Add MongoDB utilities and programs to the PATH variable so that you need not go to the bin folder every time to run them - $ echo export PATH=$PATH:~/MongoDB_3.2.1/bin >> ~/.bashrc - '>>' appends to the '.bashrc' file.
  3. Close the Terminal window and open a new one or just run the following in the current Terminal - $ source ~/.bashrc - This refreshes the bash env. to include the new settings.
  4. Check whether you have access to MongoDB utilities and programs from anywhere in the shell - $ which mongo - Did it point to the right path?
  5. Start the MongoDB server aka the mongod process, refer to  Step 1 of the series, in case you are still struggling.
  6. The real thing.....download students.json (right-click and 'Save link as...' into the newly created Temp folder). Now import the data from this JSON file into the database - $ mongoimport -d MyDB -c Students ~/MongoDbWorkSpace/Temp/students.json - Did you realize!! You just did a successful import of 200 documents into MongoDB. WOW, that's a significant step in your journey. The '-d' option specifies the database into which the data must be loaded; in case the database doesn't exist, like in this case, it will be created. The data was loaded into the collection 'Students', specified against the '-c' option.

 

A Short Talk on the Uploaded Data

It is essential to understand what we have just loaded. The data contains the scores of students in homework, quizzes and exams. There are 3 fields, '_id', 'name' and 'scores'. It is important to note that the field 'scores' contains an array of documents; these embedded documents have 2 fields, 'type' and 'score'. Getting interesting!!

Open a Terminal window and connect to MongoDB - $ mongo MyDB - MyDB is the database into which you uploaded the data a moment ago.

Now read a document - > db.Students.findOne() - and study the output.

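The output is a single student document along these lines (the scores here are made-up, illustrative values),

{ "_id" : 0, "name" : "Bao Ziglar", "scores" : [ { "type" : "exam", "score" : 88.5 }, { "type" : "quiz", "score" : 50.6 }, { "type" : "homework", "score" : 30.8 }, { "type" : "homework", "score" : 6.8 } ] }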

 

You traveled well this far, now let us get into the thick of the jungle...ready.....steady GO

 

Firing the Queries

  1. > db.Students.findOne() - Fetches only one document from the entire lot.
  2. > db.Students.find() - Displays 20 documents at a time, for the rest, type 'it' (iterate) on the shell.
  3. > db.Students.find().pretty() - The method 'pretty()' formats the output and makes it more readable.
  4. > db.Students.find().limit(5) - Limits the output to only 5 documents.
  5. > db.Students.find().skip(10).limit(5) - Skips the first 10 documents and then limits the output to 5 documents.
  6. > db.Students.find({}, {name:1}) - Projects fields 'name' and '_id'.
  7. > db.Students.find({}, {name:1, _id:0}) - This one suppresses field '_id' as well. Unlike other fields, '_id' needs to be explicitly suppressed.
  8. > db.Students.find({name:"Bao Ziglar"}).pretty() - Picks up documents where name is "Bao Ziglar". Did you see, there are two students with this name?
  9. > db.Students.find({name:"Bao Ziglar"}, {name:1}) - Picks up documents where name is "Bao Ziglar" and projects fields 'name' and '_id'.
  10. > db.Students.find({scores:{$elemMatch:{type:"exam", score:{$gt:97}}}}) - Picks the documents with documents in array 'scores' having 'type' as 'exam' and 'score' above 97. There are 4 such documents.
  11. > db.Students.find({name:"Bao Ziglar"}, {name:1}).count() - Counts the number of matching documents.
  12. > db.Students.find({},{name:1}).sort({name:1}) - Sorts the output by 'name' in ascending order.
  13. > db.Students.find({},{name:1}).sort({name:-1}) - Sorts the output by 'name' but in descending order. Look carefully at the sorted data; there is an interesting observation in the first two lines.
  14. > db.Students.insert({ "_id" : 200, "name" : "Bob Light", "scores" : [ { "type" : "exam", "score" : 88.11742562118049 }, { "type" : "quiz", "score" : 50.61295450928224 }, { "type" : "homework", "score" : 30.86823689842918 }, { "type" : "homework", "score" : 6.861613903793295 } ] }) - Inserts a new document in the collection. So now you have 201 documents in the collection, isn't it!! Run a count on the collection and check. Try inserting the same document again. What do you get? Duplicate key error!!! right, because in a collection all '_id' must have unique values.
  15. > db.Students.update({_id:200}, {$set:{name:"Bob Tyler"}}) - Updates the 'name' for document with '_id'  200.
  16. > db.Students.update({_id:200}, {$set:{grade:"A"}}) - Did you notice? The field 'grade' did not exist, so it was added to the document. Of all the 201 documents, this is the only one with the field 'grade' now. Ain't this different from what you observe in an RDBMS? Remember, a Document corresponds to a Row and a Collection corresponds to a Table. (More on multi-document updates in the note just after this list.)
  17. > db.Students.save({_id:200, name:"Bob Tyler"}) - Since '_id' 200 exists, the save method replaced the document. Now you have only two fields in the document. save() replaces the document if a match on '_id' is found, else it inserts a new one.
  18. > db.Students.remove({_id:200}) - Deletes the document with '_id' 200. Try to run a find on the document, did you get it? How many documents do you have in the collection now? (clue: run a count on the collection).
  19. > db.Students.remove({}) - Deletes all the documents in the collection 'Students', but the collection still exists. Just check, does it? (clue: run 'show collections')
  20. > db.Students.drop() - Drops the collection from the database. Check whether the collection exists now.
  21. > db.dropDatabase() - Drops the database. So now you do not have your database 'MyDB' anymore. Run mongoimport, as above, to reinstate the working set and rerun the queries.
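One gotcha worth a note before you move on: by default, update() modifies only the first matching document. To update all matching documents, pass the multi option. A minimal sketch (using the 'grade' field added in query 16 above),

> db.Students.update({}, {$set:{grade:"B"}}, {multi:true})

Without {multi:true}, only one of the documents would get the new field value.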

Don't just read through each of the statements above; run them on your Mongo shell as well. It will help you get familiar with the various CRUD operations in MongoDB.

 

Third-party tools to connect to MongoDB

MongoChef is one such UI-based tool that helps you view the databases and run queries. You can use it for free under a non-commercial license, though with limited features.


The interface is neat, responsive and sophisticated. You can save your queries as scripts which you can recall later. In case you find it difficult to read the result sets in the Mongo shell, I recommend you use this one.

Robomongo is not far behind. You can use this tool as well in your initial learning phase.

3_RoboMongo

Although the above-mentioned tools provide an effective alternative to the Mongo shell, do remember that they are not a replacement for it, :).

 

I hope this Step 2 of the MongoDB Basics Primer Series helps you get further acquainted with the database. Read more about the queries and run them on your environment. In case you are switching over from the RDBMS world, reflect back on the various queries you had framed and figure out how they can be represented and run on MongoDB. Always remember, it would be unwise to pit an RDBMS against a NoSQL datastore, as they cater to different use cases.

In the upcoming Step 3 of the series, I will cover more on the CRUD operations and touch upon the Aggregation Framework. Till then, fire queries on your MongoDB server.

 

 

 

Recommended for further reading,

  1. Read more on MongoDB CRUD here.
  2. Read more on mongoimport here.

 

Feb 10

MongoDB Basics Primer Series - Step 1

 

This primer series on MongoDB is aimed to be an authoritative, step-by-step reference guide for a non-starter who wishes to embark on the journey to learn MongoDB, the most popular NoSQL database today according to www.db-engines.com. Learning a popular technology also ensures that you cement your relevance in the job market, isn't it!!

In order to safeguard the learner from the information deluge on the subject, which might leave him perplexed, this guide will take a very structured approach in disseminating the information.

Let us kickstart,

What do we need to set up a working MongoDB environment,

  1. Identify the platform that you are on. I personally prefer to work on Ubuntu. Not just because it's free, but because most production installations are on one or other Unix flavours that match Ubuntu very closely; besides, the Long Term Support (LTS) that you get on Ubuntu is quite inviting. I use Ubuntu 14.04. To the undergrads reading this primer, I would highly recommend using a Linux flavour of their choice, for reasons they will really appreciate once they are into their careers.
  2. Download a copy of the latest MongoDB for your platform. It is always good to learn on the latest GA releases, the current one is 3.2.1.
  3. Starting the Server,
    1. Open the Terminal window (Ctrl + Alt + T). Unzip and untar the downloaded archive - $ tar -zxvf mongodb-linux-x86_64-ubuntu1404-3.2.1.tgz
    2. Rename the folder to a shorter name and move it to your home directory - $ mv mongodb-linux-x86_64-ubuntu1404-3.2.1 ~/MongoDB_3.2.1
    3. Create directories for storing Mongo work files -  $ mkdir ~/MongoDbWorkSpace ~/MongoDbWorkSpace/Data ~/MongoDbWorkSpace/Logs - This creates directories for storing MongoDB data files and logs.
    4. Reach out to the bin directory in the install base - $ cd ~/MongoDB_3.2.1/bin - Takes you to the bin directory.
    5. Start the server - $ ./mongod --dbpath ~/MongoDbWorkSpace/Data --storageEngine wiredTiger --logpath ~/MongoDbWorkSpace/Logs/mongod.log --fork - mongod is the server process. Here I am using "wiredTiger", the new storage engine introduced in version 3.0, which offers greater capabilities than the earlier and still available mmapv1 (more on this later in the series). The server log is generated in mongod.log; it will come to your rescue in case something goes wrong and you need to debug. I am using the --fork option so that control comes back to the shell and the process runs in the background, letting me continue using the same Terminal window.
    6. Once the server successfully starts, it will return a process id. You can further drill down into the Data and Log directories and look at the newly created directories and files.
  4. Connecting to the Server,
    1. Again reach out to the bin directory in the install base - $ cd ~/MongoDB_3.2.1/bin - Takes you to the bin directory.
    2. Connect using the shell - $ mongo - voila, you are now connected to the MongoDB Server. The Server is available on the default port number, 27017. You can also specify a port number different from the default by using the --port <number> option when starting the server, i.e. the mongod process, as in step 3(e).
  5. Working the Mongo Shell,
    1. You are now good to run any of the shell commands
      1. > show dbs - Lists the available databases.
      2. > use test - Attaches you to the test database. You can attach to a non-existent database; the database actually gets created once you create a collection in it.
      3. > db - Lists the database that you are attached to.
      4. > db.myFirstCollection.insert({a:"Hello World"}) - Inserts a document in the collection myFirstCollection. In case the collection didn't exist, it creates the collection as well. Documents are analogous to rows in tables in the RDBMS world.
      5. > show collections - Lists the collections in the database that you are attached to. Collections are analogous to tables in the RDBMS world.
      6. > db.myFirstCollection.find() - Lists all the documents in the collection.
      7. > db.help() - Lists the available helper methods on the db object.
      8. > db.shutdownServer() - Shuts down the server; but attach to the admin db first by running 'use admin'.
      9. > exit - To exit the mongo shell.

MongoDB is a document data store. These documents are represented as JSON.
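Since everything you store in MongoDB is a JSON document, the format itself is worth a quick look. A minimal, illustrative document (hypothetical fields) showing a nested document and an array,

{ "name" : "Anil", "languages" : [ "English", "Hindi" ], "address" : { "city" : "Bahadurgarh" } }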

In this primer you have been initiated into how to configure and start using MongoDB. In the upcoming Step 2 of this series, I will discuss more on "MongoDB CRUD and Administrative Commands" and alternate ways of connecting to the Server. Till then, tinker with the newly set up MongoDB environment.

 

Recommended for further reading,

  1. Read more about JSON at, www.json.org
  2. MongoDB shell help at, docs.mongodb.org

 

Feb 07

My first blog…..the First Steps

 

01_FirstSteps

 

 

 

On this sunny winter afternoon, relaxing in my sunlit balcony, I was pondering the question: what should my first blog be like? Pick something......but what............ah, the easiest would be to pick something on technology. Well, that's what I have been doing for the last decade and a half; shouldn't something come pouring out from the technology front, especially something on databases: DB2, Oracle, Sybase, Netezza, MongoDB, Couchbase, Neo4j....etc. Hmmm...sounds good. Doesn't every first step in the IT world begin with a "Hello World"!!! Yes it does, so here is my first blog,

 

 

Hello World, here I come.....

 

- Anil K. Singh