Source- Canva Team

Mongo Maths

The mathematics behind MongoDB Scoring

There is no such thing as too much Data. The more data the better. As Clive Humbly rightly said,

Data is the new Oil.

But as data grows, the need to search the precise data becomes increasingly important. Whatsoever we search on the internet, be it a book, a pizza, or a laptop, there are an enormous number of choices available. A customer would be more satisfied if he/she finds what they are looking for with minimal effort. This customer demand let to the birth of Search bar functionality in any application. This need for quickly getting the required data in the large pool of choices led to the development of building efficient search engines. These search engines primarily focus on:

  1. Searching the required data from a large selection
  2. Getting the search result in optimal time
  3. The search result should be relevant to user requirements. For example: if the user is searching “Apple Laptops” then he should not get “Apple fruit” in the search result

Many vendors provide search functionality like Elasticsearch, Solr, MongoDB, etc. Today, we discuss one approach to searching with MongoDB.

Before jumping into the MongoDB specific implementation of searching, let’s discuss the few terminologies related to searching.

  1. Text Search — Text search refers to the technique to search the available documents or data sets against the search criteria like the input of the user. For example, consider a food delivery platform where you enter the search string “Pizza” to get all the matching results
  2. Stop Words — The words that are filtered out from original documents before performing the search. For example: is, the, and, etc.
  3. Tokenization- The process of breaking a sentence into multiple tokens/words
  4. Stemming — The process of reducing the words to their base word
  5. Document Scoring — Every document which matches the search text is assigned a score based on how closely they match with the searched text

MongoDB performs text searches by creating a text index.

For example, A shopping application like Amazon can have a product collection storing all their product. For simplicity, consider a product collection having the following document structure:

{
"name" : "Programming Laptop",
"description" :"8 GB/512 GB SSD/15 inch"
}

Let’s insert multiple products in this collection next.

db.product.insertMany([{
"name" : "Programming Laptops - Dell Laptop",
"description" :"8 GB/512 GB SSD/intel Core i5 8th gen"
},{
"name" : "Programming in C",
"description" :"Programming in C | Third Edition | By Pearson"
},{
"name" : "programmer Laptop",
"description" :"Code|Half Sleeve T Shirt for Men"
},{
"name" : "Laptop",
"description" :"latest laptop"
},{
"name" : "Titan",
"description" :"premium watch"
}])

For searching in this collection, you would need to create a text index as follows:

db.product.createIndex({"name":"text"})

Now to get the search result for any search keyword, you need to run the following query. For example: To search for Laptop:

db.product.find({$text: {$search: "Laptop"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

The above query returns the following documents sorted by the document score.

{
"_id" : ObjectId("5f40f2aaa2a276d5a15fac3d"),
"name" : "Programming Laptops - Dell Laptop",
"description" : "8 GB/512 GB SSD/intel Core i5 8th gen",
"score" : 1.125
}
{
"_id" : ObjectId("5f40f2aaa2a276d5a15fac40"),
"name" : "Laptop",
"description" : "latest laptop",
"score" : 1.1
}
{
"_id" : ObjectId("5f40f2aaa2a276d5a15fac3f"),
"name" : "programmer Laptop",
"description" : "Code|Half Sleeve T Shirt for Men",
"score" : 0.75
}

That was the most basic implementation of the search. MongoDB provides a lot of more customizations like:

  • Searching on multiple fields in the document- For example, We only searched against the name of the product. We could have also searched against both name and description fields of the product.
db.product.createIndex({"name":"text","description":"text"}
  • Assigning custom weights for every search field in text index- In the above example, we can assign a higher weightage for name fields compared to the description field.
db.product.createIndex({"name":"text","description":"text"}, {"weights": { name: 3, description:1 }})
  • Searching every String fields using wildcard searches- You can specify wildcard while creating text index. It will allow you to search across all the string fields in the document.
db.product.createIndex({"$**":"text"})
  • Searching exact phrases- You can search for exact phrases in MongoDB using the following command:
db.product.find({$text: {$search: "\"programmer Laptop\""}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}}).pretty()
  • Excluding some terms from the search- For example: To search Laptop product excluding “Lenovo” you can run the following query:
db.product.find({$text: {$search: "Laptop -Lenovo"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}}).pretty()
  • Support for multiple languages, case sensitive searches, diacritic sensitive searches, etc.

You can get the complete list of supported features here MongoDB Documentation.

When returning the search result, mongo generated a score for every document. The higher the score more is the chances of resemblance to the searched text. Hence it becomes crucial to understand the mathematics behind this score.

For every score calculation, mongo performs a series of following operations (in order) :

  1. Tokenization: Mongo will tokenize the search string using whitespace and most punctuation as delimiters, and perform a logical OR of all such tokens. (List of Delimiters mongo delimiters).Mongo skips tokenization in phrase searches.
  2. Stop Word removal: Mongo will remove the stop words from the documents before calculating scores. (List of English stop words)​.
  3. Stemming: Next, mongo will reduce every word of the text index document to the root word.
  4. Score calculation: Finally, mongo will compute the matching score of the document against the searched text.

Mongo uses the following algorithm (code reference) to score any document:

Step 1: Let the search text = S
Step 2: Break S into tokens (If you are not doing Phrase search). Let's say T1,T2..Tn. Apply Stemming to each token
Step 3: For every search token, calculate score per index field of text index as follows:

score = (weight * data.freq * coeff * adjustment);

Where :
weight = user Defined Weight for any field. Default is 1 when no weight is specified
data.freq = how frequently the search token appeared in the text
coeff = ​(0.5 * data.count / numTokens) + 0.5
data.count
= Number of matching token
numTokens = Total number of tokens in the text
adjustment = 1 (By default).If the search token is exactly equal to the document field then adjustment = 1.1
Step 4: Final score of document is calculated by adding all tokens scores per text index field
Total Score = score(T1) + score(T2) + .....score(Tn)

For a demonstration, consider the product collection with the below-mentioned products. On searching this data-set for the “program” keyword following documents are returned:

Let’s understand the score calculation by taking examples:

  • Example 1: Single keyword search
Search String: “program” keyword
Sample Document: {"name" : "Programming with Program"}
Index Defined: $text index on "name" field
Score Calculation:Step 1: Tokenize search string.Apply Stemming.
Token 1: "program"
Step 2: For every search token obtained in Step 1, do steps 3-11:

Step 3: Take Sample Document and Remove Stop Words -> "Programming Program"
Step 4: Apply Stemming -> "program program"
Step 5: Calculate data.count per search token
data.count(program) = 2 (as program appears twice)
Step 6: Calculate total number of token in document
numTokens = 2
Step 7: Calculate coefficient per search token
coeff(program) =​ 0.5*(2⁄2) + 0.5 = 1.0
Step 8: Calculate adjustment per search token
adjustment(program) = 1 (As the search string is not equal to sample document)
Step 9: weight = 1 (As no special weight is assigned while creating text index)
Step 10: Calculate frequency of Data (data.freq) per search token:
a. Count the frequency of every token in sample document
b. [program => 2]
c. Data.freq(program) = 1/(2^0) + 1⁄(2^1) = 1.5
Step 11: Calculate score per search token:
score = (weight * data.freq * coeff * adjustment);
score(program) = (1 * 1.5 * 1.0 * 1.0) = 1.5
Step 12: Add individual score for every token of search string to get total score
Total score = score(program) = 1.5
  • Example 2: Multi-keyword search
Search String: “Programming books” keyword
Sample Document: {"name" : "Programming books: Programming with Java"}
Index Defined: $text index on "name" field
Score Calculation:Step 1: Tokenize search string. Apply Stemming.
Token 1: "program"
Token 2: "book"
Step 2: For every search token obtained in Step 1, do steps 3-11:

Step 3: Take Sample Document and Remove Stop Words -> "Programming books Programming Java"
Step 4: Apply Stemming -> "Program book Program Java"
Step 5: Calculate data.count per token
data.count(Programming) = 2 (as program appears twice)
data.count(books) = 1 (as book appears once)
Step 6: Calculate total number of token in document = numTokens = 4
Step 7: Calculate coefficient per token
coeff(Programming) =​ 0.5*(2⁄4) + 0.5 = 0.75
coeff(
books) =​ 0.5*(1⁄4) + 0.5 = 0.625
Step 8: Calculate adjustment per search token
adjustment(Programming) = 1
adjustment(books) = 1
Step 9: weight = 1 (As no special weight is assigned while creating text index)
Step 10: Frequency of Data (data.freq):
a. Count the frequency of every token in sample document
b. [program => 2] [book => 1] [Java => 1]
c. data.freq(Programming) = 1/(2^0) + 1⁄(2^1) = 1.5
d. data.freq(books) = 1/(2^0) = 1
Step 11: Calculate score for the token:
score = (weight * data.freq * coeff * adjustment);
score(
Programming) = (1 * 1.5 * 0.75 * 1.0) = 1.125
score(books) = (1 * 1 * 0.625 * 1.0) = 0.625
Step 12: Add individual score for every token of search string to get total scoreTotal score = score(Programming) + score(books) = 1.125 + 0.625 = 1.75
  • Example 3: Multi-field Index with custom weights
Search String: “program” keyword
Sample Document: {"name" : "Program: programming app", "description" : "program"}
Index Defined: $text index on "name" field and "description" field. Weight of name field = 3
Weight of descrption field = 1
Score Calculation:Step 1: Tokenize search string.Apply Stemming.
Token 1: "program"
Step 2: For every search token obtained in Step 1, do steps 3-11:

Step 3: Take Sample Document and Remove Stop Words ->
{"name" : "Program programming app", "description": "program"}
Step 4: Apply Stemming ->
{"name" : "program program app", "description": "program"}
Step 5: Calculate data.count per search token per index field
data.count(program)(name field) = 2
data.count(program)(description field) = 1
Step 6: Calculate total number of token in document per field
numTokens(name field) = 3
numTokens(description field) = 1
Step 7: Calculate coefficient per search token per field
coeff(program)(name field) =​ 0.5*(2⁄3) + 0.5 = (5/6)
coeff(program)(description field)=​ 0.5*(1⁄1) + 0.5 = 1
Step 8: Calculate adjustment per search token per field
adjustment(program)(name field) = 1
adjustment(program)(description field) = 1.1
Step 9: Input weight per field
weight(name field) = 3
weight(description field) = 1
Step 10: Calculate frequency of Data (data.freq) per search token per field:
a. Count the frequency of every token in sample document
b. Frequency in name field: [program => 2][app=>1]
c. Frequency in description field: [program => 1]
d. Data.freq(program)(name field) = 1/(2^0) + 1⁄(2^1) = 1.5
e. Data.freq(program)(description field) = 1/(2^0)=1
Step 11: Calculate score per search token per field:
score = (weight * data.freq * coeff * adjustment);
score(program)(name field) = (3 * 1.5 * (5/6) * 1.0) = 3.75
score(program)(description field) = (1 * 1 * 1 * 1.1) = 1.1
Step 12: Add individual score for every token of search string to get total scoreTotal score = score(program)(name field) + score(program)(description field) = 3.75 + 1.1 = 4.85

In this article, we visited one approach to searching using MongoDB. Of course, there are a lot many other ways to search which we didn’t cover in this article. Keep watching this space for more content. Till then, Happy reading !! 😃

--

--

Ananya

Software Developer | Technical Writer | Technology Enthusiast