Let’s define the problem at a high level first.

  1. User enters a query.
  2. The search engine returns the most relevant document (HTML, PDF, or any other textual content).

How do we define relevance? As a first pass, we can make it binary: a document is relevant if it contains the query and irrelevant if it doesn’t:

document = "The quick brown fox jumps over the lazy dog."
query = "..." # comes from the user
is_relevant = document.include?(query)

Entering “Quick” instead of “quick” shouldn’t matter.

document = "The quick brown fox jumps over the lazy dog.".downcase
query = "...".downcase # comes from the user
is_relevant = document.include?(query)

Neither should the order of the words matter, e.g. “dog lazy” vs. “lazy dog”.

is_relevant = query
  .split
  .all? { |word| document.include?(word) }

A word from the query may not be present in the document at all, and that’s okay.

is_relevant = query
  .split
  .any? { |word| document.include?(word) }

Coming up with a Ranking System

Putting documents into two buckets, relevant and irrelevant, won’t be very useful when there are billions of documents to sort through, as in the case of the web.

It would be nice if we could give documents a ranking score. It’s hard to do this semantically; we would need an AI model for that. Instead, we can use a quantitative measure: a document gets a higher score if it contains the words in the query more frequently. If the query contains the word “quick”, for example, a document that contains it 50 times is more relevant than one that contains it only once.

relevance = query
  .split
  .map { |word| document.scan(word).count }
  .sum

With an example query and document:

query = "the frequency of"
document = "the frequency of the word."
relevance = query
  .split
  .map { |word| document.scan(word).count }
  .sum

For the query “the frequency of”, the document “the frequency of the word and the importance of the term and the length of the word.” would have a higher score than the document “the frequency of the word” just because it contains the words “the” and “of” more times.
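To see the bias in numbers, here’s a quick check of the raw-count score on those two documents:

```ruby
query = "the frequency of"
long_document = "the frequency of the word and the importance of the term and the length of the word."
short_document = "the frequency of the word."

# Sum the raw occurrence counts of each query word.
score = ->(document) { query.split.map { |word| document.scan(word).count }.sum }

score.call(long_document)  # => 10
score.call(short_document) # => 4
```

The longer document wins purely on repetitions of “the” and “of”.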

Instead of just counting the occurrences, we can use their frequencies. This keeps each word’s contribution between 0 and 1 and prevents the score from blowing up as the document gets longer. This is called term frequency.

number_of_words = document.split.count
relevance = query
  .split
  .map { |word| document.scan(word).count / number_of_words.to_f }
  .sum

By factoring out the common denominator:

relevance = query
  .split
  .map { |word| document.scan(word).count }
  .sum / number_of_words.to_f

Not All Words Are Created Equal

Not all words carry the same weight. In the sentence ‘the quick brown fox jumps over the lazy dog,’ removing ‘the’ preserves the core meaning; removing ‘fox’ destroys it. Common articles like ‘the,’ ‘a,’ and ‘an’ appear in almost every document, making them useless for differentiating one text from another.

Storing them in a stop-word list only works if we’re dealing exclusively with English. There is a language-agnostic approach: calculating how frequently a word appears across all the documents.

$$ \frac{\text{number of documents}}{\text{number of documents containing the word}} $$

The word “the” would appear in, say, 99 documents out of 100, making its importance $100/99 \approx 1.01$. A word like “fox” might appear in only 10 documents, giving it an importance score of $100/10 = 10$.

If the word doesn’t appear in any document at all then we get $100/0 = \infty$. Adding 1 to the numerator and denominator solves it.

$$ \frac{\text{number of documents}\ +\ 1}{\text{number of documents containing the word}\ +\ 1} $$

Why did I add +1 to the numerator too? In the extreme case that the word appears in all documents, the old formula would’ve given 1, and I’d like to preserve that. It’s not that important, though; in the limit, that +1 doesn’t matter.
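Plugging in the numbers from the hypothetical 100-document corpus above (smoothed_ratio is a throwaway helper for illustration, not part of the engine):

```ruby
# Smoothed document-frequency ratio: (N + 1) / (n + 1).
def smoothed_ratio(number_of_documents, number_of_documents_containing_word)
  (number_of_documents + 1) / (number_of_documents_containing_word + 1).to_f
end

smoothed_ratio(100, 99) # => 1.01  ("the" appears almost everywhere)
smoothed_ratio(100, 10) # ~9.18    ("fox" is rarer, so more important)
smoothed_ratio(100, 0)  # => 101.0 (an unseen word stays finite instead of dividing by zero)
```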

To calculate the importance of a word, we can use this fraction along with the term frequency we calculated before.

def importance(word, documents)
  number_of_documents = documents.count
  number_of_documents_containing_word = documents.count { |document| document.include?(word) }
  (number_of_documents + 1) / (number_of_documents_containing_word + 1).to_f
end

def relevance(query, document, documents)
  number_of_words = document.split.count
  query
    .split
    .map { |word| document.scan(word).count * importance(word, documents) }
    .sum / number_of_words.to_f
end

We multiply the term frequency by the importance of the word. Less important words will contribute less to the relevance score.

The importance formula is almost something called IDF (Inverse Document Frequency). There is only one difference: IDF takes the logarithm of the fraction. Why? If a word occurs too frequently or too rarely, its raw ratio can dominate the scores. Taking the logarithm dampens this effect.

The update is simple:

def importance(word, documents)
  number_of_documents = documents.count
  number_of_documents_containing_word = documents.count { |document| document.include?(word) }
+ Math.log((number_of_documents + 1) / (number_of_documents_containing_word + 1).to_f)
end
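To see the dampening, here are the same hypothetical smoothed ratios from the 100-document corpus passed through the logarithm:

```ruby
Math.log(101 / 100.0) # ~0.01: "the" contributes almost nothing
Math.log(101 / 11.0)  # ~2.22: "fox" still stands out, but far less extremely
Math.log(101 / 1.0)   # ~4.62: even an unseen word can no longer dominate
```

The spread between common and rare words shrinks from two orders of magnitude to a small constant factor.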

Partial Queries

What if the user enters a partial query like “tes”? The search engine should match the words “test”, “testing”, “tested” too. Whole words aren’t granular enough. Stemming reduces a word to its root form. For example, the stem of “fishing”, “fished” and “fisher” is “fish”.

Stemming is tedious and assumes a document uses a single language. N-grams offer a more general, though less semantic, alternative. The 2-grams of “quick” are: “qu”, “ui”, “ic” and “ck”. The 3-grams are: “qui”, “uic” and “ick”. It’s like a sliding window that moves to the right.
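Sliding n-grams are easy to generate with Ruby’s each_cons, for example:

```ruby
# Slide a window of the given size over the characters of a word.
"quick".chars.each_cons(2).map(&:join) # => ["qu", "ui", "ic", "ck"]
"quick".chars.each_cons(3).map(&:join) # => ["qui", "uic", "ick"]
```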

We wouldn’t search for “quick” by entering “uic” or “ick” (there are exceptions to this, e.g. long words like “irresponsibility”). So instead of moving the window, we’re going to extend it. For example, for the word “quick”, we get “qu”, “qui”, “quic”, “quick”. This achieves roughly what stemming does, but without knowing the language.

The average length of a root word in European languages is said to be roughly 5 characters, so I’m going to use a window size between 2 and 5.

def n_grams(document, min_n: 2, max_n: 5)
  words = document.split
  # This will store every prefix term we generate.
  terms = []
  words.each do |word|
    # The word can be shorter than the max window size, so we take the minimum.
    max_window_size = [max_n, word.length].min
    # For every window size, take a prefix of the word and add it to the terms.
    (min_n..max_window_size).each do |window_size|
      terms << word[...window_size]
    end
  end
  terms
end
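As a quick check, here’s what the expansion produces for a two-word document (the method is repeated in a condensed but equivalent form so the snippet runs on its own):

```ruby
# Condensed, behaviorally equivalent version of n_grams above.
def n_grams(document, min_n: 2, max_n: 5)
  document.split.flat_map do |word|
    (min_n..[max_n, word.length].min).map { |size| word[...size] }
  end
end

n_grams("quick brown")
# => ["qu", "qui", "quic", "quick", "br", "bro", "brow", "brown"]
```

Words shorter than min_n, like “a”, simply produce no terms.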

A word isn’t the smallest unit anymore so these need an update:

def importance(term, documents)
  number_of_documents = documents.count
+ number_of_documents_containing_term = documents.count { |document| n_grams(document).include?(term) }
  Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
end

def relevance(query, document, documents)
+ number_of_terms = n_grams(document).count
+ query_terms = n_grams(query)
  query_terms
+   .map { |term| n_grams(document).count(term) * importance(term, documents) }
    .sum / number_of_terms.to_f
end

Punctuation isn’t important either:

def clean(document)
  document.gsub(/[[:punct:]]/, " ")
end

Full Engine

The API could look like this:

search_engine = SearchEngine.new
search_engine.add_document("first_document", "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh")
search_engine.add_document("second_document", "Everyone,\n\nM-m-m-m-my red stapler has gone missing. H-h-has a-an-anyone seen it?\n\nMilton")
search_engine.add_document("third_document", "Peter,\n\nYeah, I'm going to need you to come in on Saturday. Don't forget those reports.\n\nLumbergh")

results = search_engine.search("tps repor")
puts results
<<-OUTPUT
[
  { :name => "first_document", :relevance => SCORE_1, :content => "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh" },
  { :name => "third_document",  :relevance => SCORE_2, :content => "Peter,\n\nYeah, I'm going to need you to come in on Saturday. Don't forget those reports.\n\nLumbergh" }
]
OUTPUT

Where SCORE_1 should be higher than SCORE_2 and the documents should be sorted by relevance.

Here is the full implementation:

class SearchEngine
  def initialize
    @documents = []
  end

  def add_document(name, document)
    @documents << { name: name, content: clean(document) }
  end

  def search(query)
    document_contents = @documents.map { |document| document[:content] }
    @documents
      .map { |document| { name: document[:name], relevance: relevance(query, document[:content], document_contents), content: document[:content] } }
      .sort_by { |document| -document[:relevance] }
  end

  private

  def clean(document)
    document.gsub(/[[:punct:]]/, " ").downcase
  end

  def n_grams(document, min_n: 2, max_n: 5)
    words = document.split
    terms = []
    words.each do |word|
      # the word can be smaller than the max window size like the word 'and' so we take the minimum.
      max_window_size = [max_n, word.length].min
      (min_n..max_window_size).each do |window_size|
        terms << word[...window_size]
      end
    end
    terms
  end

  def importance(term, documents)
    number_of_documents = documents.count
    # `count` can take a block and counts the elements for which the block returns true.
    number_of_documents_containing_term = documents.count { |document| n_grams(document).include?(term) }
    Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
  end

  def relevance(query, document, documents)
    # Instead of just using `.split` we generate the n-grams and work on that.
    number_of_terms = n_grams(document).count
    query_terms = n_grams(query)
    query_terms
      .map { |term| n_grams(document).count(term) * importance(term, documents) }
      .sum / number_of_terms.to_f
  end
end

Performance

There is a lot of data processing here, so a systems programming language like C, Rust, or Zig would be a better fit than Ruby.

There is another optimization we could make to the current version. We process the documents from scratch for every query, but most of the time the documents we’re searching through change far less often than queries arrive.

We can preprocess the documents and store the n-grams and the importance of each term. This way, we can just use the stored data and calculate the relevance of the query. Most of the work will be a lookup.

OK, how are we going to store this preprocessed data? The trick is to look at the relevance method line by line. number_of_terms = n_grams(document).count: we need a mapping between documents and their number of terms. We can have a documents collection and store the number of terms for each document. Such a document could look like this:

{
  "_id": "first_document",
  "content": "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh",
  "number_of_terms": 20
}

On the web, _id would be the URL of the document, but here we’re using a simple string. The database indexes the _id field, so we can access a document quickly.

query_terms = n_grams(query): We can’t do much about this because the query is completely dynamic.

n_grams(document).count(term): We need a mapping between terms and their appearances. Since a term can appear in multiple documents, we need another mapping, from document names to the number of appearances of the term in that document; documents where the term doesn’t appear simply have no entry. This structure could look like this:

{
  "_id": "tes",
  "appearances": {
    "first_document": 2,
    "third_document": 1
  }
}

Here I’m using the term’s name as the _id because it’s unique.

importance(term, documents): We can just add another field to the above structure:

{
  "_id": "tes",
  "appearances": {
    "first_document": 2,
    "third_document": 1
  },
+ "importance": 0.5
}

By the way, this is called an inverted index. You can read more about it on Wikipedia.

Let’s put these into code. I’m going to use MongoDB because its document structure is already JSON-like.

def initialize(url)
  @client = Mongo::Client.new(url)
end

The initialize method takes a URL to connect to the MongoDB server.

def search(query)
  @client[:documents]
    .find
    .map { |document| { name: document[:_id], relevance: relevance(query, document), content: document[:content] } }
    .sort_by { |document| -document[:relevance] }
    .reject { |document| document[:relevance] == 0 }
end

search is almost the same; we get the documents from the database instead of an instance variable, and drop documents with zero relevance.

def add_document(name, document)
  document = clean(document)
  terms = n_grams(document)
  number_of_terms = terms.count
  @client[:documents].insert_one({ _id: name, content: document, number_of_terms: number_of_terms })
  terms.each do |term|
    @client[:terms].update_one(
      { _id: term },
      # Increment the count of this document by one.
      { "$inc": { "appearances.#{name}": 1 } },
      upsert: true,
    )
  end
end

Here, after we save the document, we increment each term’s counter for this document by one, or set it to one if it doesn’t exist yet; that’s what upsert does.

def importance(term)
  appearances = @client[:terms].find(_id: term).first["appearances"]
  number_of_documents = @client[:documents].count
  number_of_documents_containing_term = appearances.count
  Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
end

Not much to say here, except that we no longer have to scan all the documents to find how many contain the term; we can just look at the length of the appearances field.

def relevance(query, document)
  # The document passed in from search already carries its term count.
  number_of_terms = document["number_of_terms"]
  query_terms = n_grams(query)
  query_terms.map { |term| term_relevance(term, document[:_id]) }.sum / number_of_terms.to_f
end

def term_relevance(term, document_id)
  term_record = @client[:terms].find(_id: term).first
  # The term may never have been indexed, or may not appear in this document.
  return 0 if term_record.nil? || term_record["appearances"][document_id].nil?
  importance(term) * term_record["appearances"][document_id]
end

relevance doesn’t have to take the documents as an argument now; it can just look at the database. I refactored the block that calculates the relevance of a term into a separate method called term_relevance for readability. If the term doesn’t appear in the document, we return 0; otherwise, we calculate the relevance as before by multiplying the importance by the number of appearances.

The full implementation looks like this:

require "mongo"

class SearchEngine
  def initialize(url)
    @client = Mongo::Client.new(url)
  end

  def add_document(name, document)
    document = clean(document)
    terms = n_grams(document)
    number_of_terms = terms.count
    @client[:documents].insert_one({ _id: name, content: document, number_of_terms: number_of_terms })
    terms.each do |term|
      @client[:terms].update_one(
        { _id: term },
        # Increment the count of this document by one.
        { "$inc": { "appearances.#{name}": 1 } },
        upsert: true,
      )
    end
  end

  def search(query)
    @client[:documents]
      .find
      .map { |document| { name: document[:_id], relevance: relevance(query, document), content: document[:content] } }
      .sort_by { |document| -document[:relevance] }
      .reject { |document| document[:relevance] == 0 }
  end

  private

  def clean(document)
    document.gsub(/[[:punct:]]/, " ").downcase
  end

  def n_grams(document, min_n: 2, max_n: 5)
    words = document.split
    terms = []
    words.each do |word|
      # the word can be smaller than the max window size like the word 'and' so we take the minimum.
      max_window_size = [max_n, word.length].min
      (min_n..max_window_size).each do |window_size|
        terms << word[...window_size]
      end
    end
    terms
  end

  def importance(term)
    appearances = @client[:terms].find(_id: term).first["appearances"]
    number_of_documents = @client[:documents].count
    number_of_documents_containing_term = appearances.count
    Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
  end

  def relevance(query, document)
    # The document passed in from search already carries its term count.
    number_of_terms = document["number_of_terms"]
    query_terms = n_grams(query)
    query_terms.map { |term| term_relevance(term, document[:_id]) }.sum / number_of_terms.to_f
  end

  def term_relevance(term, document_id)
    term_record = @client[:terms].find(_id: term).first
    # The term may never have been indexed, or may not appear in this document.
    return 0 if term_record.nil? || term_record["appearances"][document_id].nil?
    importance(term) * term_record["appearances"][document_id]
  end
end

Conclusion

If you want to go deeper into search engines, look at BM25, PageRank, and Word2Vec. Maybe the next step would be using AI models, but I’m not sure that would improve the search experience. For the last few years, search engines have been getting shittier every day, so maybe it’s not such a good idea.

Inspiration