Indexes - Optimize Queries¶
Note
The indexes explanation is taken almost entirely from the excellent tutorial constructed by the SQLModel folks and their tutorial website. Imitation is the sincerest form of flattery. Or, direct borrowing?
We just saw how to get some data WHERE
a condition is true. For example, where the roots column nicodemus is "bc".
If we just create the tables and the data as we have been doing, when we SELECT
some data using WHERE
, the database would have to scan through each one of the records to find the ones that match. This is not a problem with 3 roots as in these examples.
But imagine that your database has thousands or millions of records, if every time you want to find the roots with nicodemus="bc" it has to scan through all of the records to find all the possible matches, then that becomes problematic, as it would be too slow.
I'll show you how to handle it with a database index.
The change in the code is extremely small, but it's useful to understand what's happening behind the scenes, so I'll show you how it all works and what it means.
If you already executed the previous examples and have a database with data, remove the database file before running each example, that way you won't have duplicate data and you will be able to get the same results.
What is an Index¶
In general, an index is just something we can have to help us find things faster. It normally works by having things in order. Let's think about some real-life examples before even thinking about databases and code.
An Index and a Dictionary¶
Imagine a dictionary, a book with definitions of words. 📔. 😅
Let's say that you want to find a word, for example the word "database". You take the dictionary, and open it somewhere, for example in the middle. Maybe you see some definitions of words that start with m
, like manual
, so you conclude that you are in the letter m
in the dictionary.
You know that in the alphabet, the letter d
for database
comes before the letter m
for manual
.
So, you know you have to search in the dictionary before the point you currently are. You still don't know where the word database
is, because you don't know exactly where the letter d
is in the dictionary, but you know that it is not after that point, you can now discard the right half of the dictionary in your search.
Next, you open the dictionary again, but only taking into account the half of the dictionary that can contain the word you want, the left part of the dictionary. You open it in the middle of that left part and now you arrive maybe at the letter f
.
You know that d
from database
comes before f
. So it has to be before that. But now you know that database
is not after that point, and you can discard the dictionary from that point onward.
Now you have a small section of dictionary to search (only a quarter of dictionary can have your word). You take that quarter of the pages at the start of the dictionary that can contain your word, and open it in the middle of that section. Maybe you arrive at the letter c
.
You know the word database
has to be after that and not before that point, so you can discard the left part of that block of pages.
You repeat this process a few more times, and you finally arrive at the letter d
, you continue with the same process in that section for the letter d
and you finally find the word database
. 🎉
You had to open the dictionary a few times, maybe 5 or 10. That's actually very little work compared to what it could have been.
Technical Details
Do you like fancy words? Cool! Programmers tend to like fancy words. 😅
That algorithm I showed you above is called Binary Search.
It's called like that because you search something by splitting the dictionary (or any ordered list of things) in two ("binary" means "two") parts. And you do that process multiple times until you find what you want.
An Index and a Novel¶
Let's now imagine you are reading a novel. And someone told you that at some point, they mention a database, and you want to find that chapter.
How do you find the word "database" there? You might have to read the entire book to find where the word "database" is located in the book. So, instead of opening the book 5 or 10 times, you would have to open each of the 500 pages and read them one by one until you find the word. You might enjoy the book, though. 😅
But if we are only interested in quickly finding information (as when working with SQL databases), then reading each of the 500 pages is too inefficient when there could be an option to open the book in 5 or 10 places and find what you're looking for.
A Technical Book with an Index¶
Now let's imagine you are reading a technical book. For example, with several topics about programming. And there's a couple of sections where it talks about a database.
This book might have a book index: a section in the book that has some names of topics covered and the page numbers in the book where you can read about them. And the topic names are sorted in alphabetic order, pretty much like a dictionary (a book with words, as in the previous example).
In this case, you can open that book in the end (or in the beginning) to find the book index section, it would have only a few pages. And then, you can do the same process as with the dictionary example above.
Open the index, and after 5 or 10 steps, quickly find the topic "database" with the page numbers where that is covered, for example "page 253 in Chapter 5". Now you used the dictionary technique to find the topic, and that topic gave you a page number.
Now you know that you need to find "page 253". But by looking at the closed book you still don't know where that page is, so you have to find that page. To find it, you can do the same process again, but this time, instead of searching for a topic in the index, you are searching for a page number in the entire book. And after 5 or 10 more steps, you find the page 253 in Chapter 5.
After this, even though this book is not a dictionary and has some particular content, you were able to find the section in the book that talks about a "database" in a few steps (say 10 or 20, instead of reading all the 500 pages).
The main point is that the index is sorted, so we can use the same process we used for the dictionary to find the topic. And then that gives us a page number, and the page numbers are also sorted! 😅
When we have a list of sorted things we can apply the same technique, and that's the whole trick here, we use the same technique first for the topics in the index and then for the page numbers to find the actual chapter.
Such efficiency! 😎
What are Database Indexes¶
Database indexes are very similar to book indexes.
Database indexes store some info, some keys, in a way that makes it easy and fast to find (for example sorted), and then for each key they point to some data somewhere else in the database.
Let's see a more clear example. Let's say you have this table in a database:
id | root | nicodemus | english | originator |
---|---|---|---|---|
1 | a | a | hello. | Lawrence Nicodemus |
2 | bc | buts | boots. | Lawrence Nicodemus |
3 | bs | tbilso | dragonfly. | null |
And let's imagine you have many more rows, many more roots. Probably thousands.
If you tell the SQL database to get you a word by a specific nicodemus orthography, for example tbilso
(by using the nicodemus
in the WHERE
part of the SQL query), the database will have to scan all the roots, checking one by one to find all the ones where nicodemus=tbilso
.
In this case, there's only one, but there's nothing limiting the database from having more records with the same word. And because of that, the database would continue searching and checking each one of the records, which would be very slow.
But now let's say that the database has an index for the column nicodemus
. The index could look something like this, we could imagine that the index is like an additional special table that the database manages automatically:
nicodemus | id |
---|---|
a | 1 |
buts | 3 |
tbilso | 2 |
It would have each nicodemus
field from the roots
table in order. It would not be sorted by id
, but by nicodemus
(in alphabetical order, as nicodemus
is a string). So, first it would have a
, then buts
, and last tbilso
. It would also include the id
of each word. Remember that this could have thousands of words.
Then the database would be able to use more or less the same ideas in the examples above with the dictionary and the book index.
It could start somewhere (for example, in the middle of the index). It could arrive at some word there in the middle, like buts
. And because the index has the nicodemus
fields in order, the database would know that it can discard all the previous index rows and only search in the following index rows.
name | id |
---|---|
a | 1 |
buts | 2 |
tbilso | 3 |
And that way, as with the example with the dictionary above, instead of reading thousands of words, the database would be able to do a few steps, say 5 or 10 steps, and arrive at the row of the index that has tbilso
, even if the table (and index) has thousands of rows:
name | id |
---|---|
a | 1 |
buts | 2 |
✨ tbilso ✨ | 3 |
Then by looking at this index row, it would know that the id
for tbilso
in the roots
table is 3
.
So then it could search that id
in the roots
table using more or less the same technique.
That way, in the end, instead of reading thousands of records, the database only had to do a few steps to find the word/root we wanted.
Updating the Index¶
As you can imagine, for all this to work, the index would need to be up to date with the data in the database.
If you had to update it manually in code, it would be very cumbersome and error-prone, as it would be easy to end up in a state where the index is not up to date and points to incorrect data. 😱
Here's the good news: when you create an index in a SQL Database, the database takes care of updating it automatically whenever it's necessary. 😎🎉
If you add new records to the roots
table, the database will automatically update the index. It will do the same process of finding the right place to put the new index data (those 5 or 10 steps described above), and then it will save the new index information there. The same would happen when you update or delete data.
Defining and creating an index is very easy with SQL databases. And then using it is even easier... it's transparent. The database will figure out which index to use automatically, the SQL queries don't even change.
So, in SQL databases indexes are great! And are super easy to use. Why not just have indexes for everything? .....Because indexes also have a "cost" in computation and storage (disk space).
Index Cost¶
There's a cost associated with indexes. 💰
When you don't have an index and add a new row to the table roots
, the database has to perform 1 operation to add the new root row at the end of the table.
But if you have an index for the root nicodemus entries, now the database has to perform the same 1 operation to add that row plus some extra 5 or 10 operations in the index, to find the right spot for the name, to then add that index record there.
And if you have an index for nicodemus
, one for the root
, and one for the english
, now the database has to perform the same 1 operation to add that row plus some extra 5 or 10 operations in the index times 3, for each of the indexes. This means that now adding one row takes something like 31 operations.
This also means that you are exchanging the time it takes to read data for the time it takes to write data plus some extra space in the database.
If you have queries that get data out of the database comparing each one of those fields (for example using WHERE
), then it makes total sense to have indexes for each one of them. Because 31 operations while creating or updating data (plus the space of the index) is much, much better than the possible 500 or 1000 operations to read all the rows to be able to compare them using each field.
But if you never have queries that find records by the root
(you never use root
in the WHERE
part) it probably doesn't make sense to have an index for the root
field/column, as that will increase the computational and space cost of writing and updating the database.
Create an Index with SQL¶
Phew, that was a lot of theory and explanations. 😅
The most important thing about indexes is understanding them, how, and when to use them.
Let's now see the SQL syntax to create an index. It is very simple:
CREATE INDEX ix_roots_nicodemus
ON roots (nicodemus)
This means, more or less:
Hey SQL database 👋, please
CREATE
anINDEX
for me.I want the name of the index to be
ix_roots_nicodemus
.This index should be
ON
the tableroots
, it refers to that table.The column I want you to use for it is
nicodemus
.
Primary Key and Indexes¶
Because the id
is already the primary key, the database will automatically create an internal index for it.
Info
We have another brief explainer about primary and foreign keys here.
The database always creates an internal index for primary keys automatically, as those are the primary way to organize, store, and retrieve data. 🤓
But if you want to be frequently querying the SQL database for any other field (e.g. using any other field in the WHERE
section), you will probably want to have at least an index for that.
Recap¶
Indexes are very important to improve reading performance and speed when querying the database. 🏎
Creating and using them is very simple and easy. The most important part is to understand how they work, when to create them, and for which columns.