Emulating SQLite with vanilla Python
In an ideal world, when attempting to get something done, our brain should remain at the same level of abstraction as the task at hand. As the saying goes, there is no need to reinvent the wheel, or to unmount a wheel and study its functioning: simply get yourself a robust, tested wheel that gets the job done, and use it! The same goes for software engineering tasks. Some tools have been around for so long that we can safely take them for granted.
But when learning, “unmounting the wheel” can be a rewarding experience. Or, perhaps, trying to build your own wheel. As a matter of fact, as part of my journey with the Qwasar Silicon Valley learning community, I recently had the opportunity to implement a partial replica of the SQLite command-line interface (CLI) and data manipulation operations, using only modules of the Python Standard Library. This is an important restriction, because it can be argued that the data manipulation part of this project would have been trivial if I had used a tool such as the pandas library.
In this post, I invite you to follow me through some of the challenges I encountered and important choices I had to make during my work on this project. More specifically, these will pertain to the following:
- Table format, including separator characters
- Types and conversion, including escape sequences
- An efficient way to perform a JOIN
- Ordering the results by multiple keys
It should be stated that I gave myself a time budget of approximately one week to complete the project. So while there was some basic functionality that I wanted to cover, I did not intend in any way whatsoever to implement a tool that would come close to matching the actual SQLite software.
To help you follow through the rest of this piece, I invite you to read the README that is attached to my project on GitHub. If you’re short on time, simply take a look at the railroad diagrams for each of the five implemented statements (DESCRIBE, SELECT, INSERT, UPDATE and DELETE). That should be enough to give you an overview of the scope of the project, in terms of what I included and what I decided to leave out.
Table format and separator characters
This first issue is a relatively simple one, but failing to address it would have resulted in a great deal of work and hassle: in the files storing the contents of a table, what characters should I use, respectively, to separate the records and to separate the columns within a record?
Now the starting idea that my mentor gave me was to keep things simple and use a comma-separated value (CSV) format. Following the CSV convention, records would have been separated with a line feed character (‘\n’) and columns would have been separated with a comma. Simple, right?
But sure enough, this simplicity leads to a host of problems. What if some field of some record were to contain a line feed or a comma? Well, the usual way to deal with this would be to encode these characters when storing them — perhaps with HTML encoding, the character’s Unicode code point, or percent encoding — and decode them back when retrieving the data.
Being lazy and always on the lookout for a cleaner, simpler solution, I started to ask myself whether I could avoid this encoding-decoding work. And it turns out that there was a way!
And that way was through a pair of ASCII control codes. From 28 to 31, ASCII defines four characters that are seldom referred to nowadays (so it seems to me, at least). They are, respectively, the file separator, the group separator, the record separator and the unit separator. Wikipedia mentions that they “can be used as delimiters to mark fields of data structures,” while also adding that “Python’s splitlines string method treats FS, GS and RS, but not US, as separators in addition to the line-breaking characters.”
With all that in mind, I set out to use the unit separator (US; 0x1F) as my column separator, and the record separator (RS; 0x1E)… well you guessed it, as my record separator!
There are obviously some drawbacks to this solution. First of all, it creates a format that is — to the best of my knowledge — unheard of. This means that no data could be imported into the database without some form of processing. Also, compared to plain CSV files, it makes the raw data pretty much unreadable by humans.
A solution to that drawback — which I have not tried to apply — might be, if possible at all, to instruct your text editor to display US and RS as more common, more readable characters.
The other noticeable drawback is that there could still be a need to store the US and RS characters as part of one or more records in the database, which would pretty much defeat the purpose of using them as separator characters. We would then be back to the problem stated at the beginning of this section.
However, at the end of the day, it clearly appeared to me that the benefits of using this original format far outweighed the risks, especially in the context of a toy project such as this one.
Types, conversions and escape sequences
Another issue that quickly comes to mind when trying to build an SQLite replica is the issue of types. SQLite currently supports the following storage classes: NULL, INTEGER, REAL, TEXT and BLOB.
This is not a trivial issue. Properly managing types means maintaining a database schema, checking the correctness of the input, associating each column with its type, and so on.
Once again, given my limited time budget, I started looking for a solution that would be almost as good as the thorough, canonical one, but with largely reduced complexity.
So I ended up deciding not to maintain and enforce types. Instead, all stored data is to be considered of type TEXT. However, a simple conversion is to be performed on input data and on stored data whenever it is needed for comparison purposes. This situation arises when implementing the WHERE clause of the SELECT, DELETE and UPDATE statements, and the ORDER BY clause of the SELECT statement. The function performing the conversion is simple enough:
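A minimal sketch of that conversion function, here called `converted` (the name it is referred to by later in this post):

```python
def converted(value):
    # Try int first, then float; fall back to the original text
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value
```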
It first tries to convert the value to an integer. If that conversion fails, a conversion to float is attempted. If that also fails, the value remains as text.
Thanks to these conversions, the natural ordering of numerical types can be observed, i.e. a comparison such as 2 < 10 is true, whereas ‘2’ < ‘10’ would be false.
Note that the order of attempted conversion matters. Trying to convert ‘10’ to float will succeed, but ‘10’ needs to be converted to int. Fortunately, trying to convert ‘10.1’ to int will fail. We can therefore apply the conversion order specified above.
It should be noted that such a simple conversion scheme entails a drawback: all data that can be converted to a numerical type will be so converted. This is not always desirable. In other words, there might be cases where you would like ‘10’ and ‘2’ to remain as text so that ‘10’ < ‘2’ can be true. Because of the operations set out above, this is not possible.
On another note, I decided to support the usage of escape sequences (such as \t, etc.) in the CLI queries. The main use case for that arises when the user needs to include double quotes inside a value element (which itself has to be surrounded with double quotes), such as in the following query:
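For instance, a query of roughly this shape (the table and column names here are hypothetical):

```
INSERT INTO quotes (author, text) VALUES ("Descartes", "He wrote: \"Cogito, ergo sum.\"")
```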
But it can also be useful if the user needs to include tabulations, line feeds, etc. Escape sequences are decoded before the query text is submitted to the query runner module — except \", which needs to remain encoded for parsing purposes further down the road:
How to perform a JOIN efficiently
If not crafted properly, the JOIN operation can certainly be quite slow. And at first, mine was indeed slow. Allow me to explain why, and how I fixed the issue.
First, referring back to my railroad diagram for the SELECT statement (in the introduction above), note that I allow only one JOIN per query, and that the only operator that can be used within the ON condition is the equality operator (=).
Initially, I observed that a JOIN clause without any ON clause (i.e. without any restriction) will simply generate a cross join. More formally, given tables A and B, the cross join of A and B is the list of all concatenations a + b, where (a, b) is an element of the cartesian product A x B.
Great! So in order to process a JOIN, I could first generate a cross join, and then apply the ON clause, if any, by filtering the output of the cross join.
Sure. But it is easy to notice that cross joins have a relatively high time complexity. Is there any way to reduce that complexity? Let us perform an analysis in order to find out. The time complexity of a cross join is simply O(|A| * |B|), where |A| and |B| denote the numbers of rows in tables A and B.
Now suppose we have the following query:
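Something along these lines, with xyz standing in for the join column:

```
SELECT * FROM players p JOIN batting b ON b.xyz = p.xyz
```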
Suppose further that there are n distinct values in the xyz columns of tables ‘players’ and ‘batting’. And suppose also, for the sake of simplicity, that for each distinct value v of the xyz columns, there are m rows in each of ‘players’ and ‘batting’ where xyz = v. Hence, since each table has n*m rows, the time complexity of the cross join operation can be expressed as O((n*m)^2) = O(n^2 * m^2).
Now let us pause for a second and examine the meaning of the ON clause. We can express that meaning as “from the cross join of p and b, only keep those rows where b.xyz = p.xyz”. But an implementation deriving from this formulation would preserve the time complexity of a full cross join.
Here’s a better way to proceed:
- For each of the ‘players’ and ‘batting’ tables, group the rows by value of the xyz column. This can be done in O(n*m) time with a hash map (called “dictionary” in Python).
- Instead of performing one cross join on the complete ‘players’ and ‘batting’ tables, perform a cross join on the pair of groups (grouped_players[v], grouped_batting[v]) for each v in the n distinct values of the xyz columns. Under the assumptions set out above, grouped_players[v] and grouped_batting[v] both contain m rows for any v. The time complexity of the n cross joins will therefore be n * O(m^2) = O(n*m^2).
- Simply concatenate the n cross joins.
The enhanced implementation is accomplished with the following Python code:
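A sketch of that grouped approach, with rows represented as tuples and hypothetical helper and parameter names:

```python
from collections import defaultdict

def group_by_key(rows, key_index):
    # Group rows into a dictionary keyed by the value of one column
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return groups

def grouped_join(left_rows, right_rows, left_key, right_key):
    # Cross-join only the pairs of groups that share a key value,
    # then concatenate the results
    grouped_left = group_by_key(left_rows, left_key)
    grouped_right = group_by_key(right_rows, right_key)
    joined = []
    for value, left_group in grouped_left.items():
        for left_row in left_group:
            for right_row in grouped_right.get(value, ()):
                joined.append(left_row + right_row)
    return joined
```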
The time complexity of this improved JOIN operation is O(n*m) for the grouping plus O(n*m^2) for the n smaller cross joins, i.e. O(n*m^2) overall.
Notice that compared to the naive implementation, we have reduced the time complexity by a factor of n. In other words, if n = 1000 (i.e. if the ON key can take 1000 distinct values), the enhanced operation could be up to 1000 times faster than the naive one.
Ordering the results by multiple keys
A feature that was important for me to include was the possibility of ordering the results of a SELECT query by multiple keys. A good example of a query that uses this feature is the following. Since this data is about baseball, note that HR stands for “home run.”
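The query looked something like this (the join condition is hypothetical; the ORDER BY keys are the ones discussed below):

```
SELECT nameFirst, nameLast, yearID, HR
FROM players p JOIN batting b ON b.playerID = p.playerID
ORDER BY HR DESC, nameLast, nameFirst, yearID DESC
```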
The purpose of this query is to retrieve the data sorted by the number of home runs in a given season. Therefore, it is obvious that the first ordering key shall be “HR DESC”. But as you can see, I also added three other keys: “nameLast”, “nameFirst” and “yearID DESC.”
Now sorting is pretty straightforward in Python. If I only needed to sort with a single key, I would do it like so:
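Assuming `key_index` and `descending` come from the parsed ORDER BY clause (both names hypothetical), and `converted` is the conversion function from earlier:

```python
def converted(value):
    # int if possible, else float, else unchanged text (see above)
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

rows = [['b', '10'], ['a', '2'], ['c', '100']]
key_index, descending = 1, False  # hypothetical parse of the ORDER BY clause

rows.sort(key=lambda row: converted(row[key_index]), reverse=descending)
# rows is now ordered 2, 10, 100 rather than '10', '100', '2'
```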
(See the Types, conversions and escape sequences section above for the definition of the “converted” function.)
Let us now introduce multiple keys. If we were only dealing with numerical types and no reverse sorting, sorting could be done rather simply:
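A sketch under those simplifying assumptions (numerical keys only, with hypothetical column positions):

```python
rows = [
    ['aaron', 2001, 30],
    ['baker', 1999, 30],
    ['aaron', 1999, 45],
]
# ORDER BY column 2 DESC, column 1 ASC
order_keys = [(2, True), (1, False)]

def sort_key(row):
    # Negate the value of every key that must be sorted in reverse
    return tuple(-row[i] if reverse else row[i] for i, reverse in order_keys)

rows.sort(key=sort_key)
```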
Note that the “sort” method call doesn’t specify a “reverse” argument anymore. That’s because the sorting can be done reversed on some keys and non-reversed on some other keys. So instead, for each key, the value is negated if and only if the sorting needs to be done reversed. Here’s a snippet to demonstrate the “sort_key” function:
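For instance, with two hypothetical ordering keys (column 2 descending, column 1 ascending):

```python
order_keys = [(2, True), (1, False)]

def sort_key(row):
    # Negated first element because column 2 is sorted in reverse
    return tuple(-row[i] if reverse else row[i] for i, reverse in order_keys)

print(sort_key(['aaron', 1999, 45]))  # prints (-45, 1999)
```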
When comparing two rows for the purposes of sorting, the rows will first be ordered according to the first element of the tuple output by sort_key. Only if the element is equal for both rows will the ordering be done according to the second element of the tuple.
That’s great, but it obviously doesn’t work as soon as we are dealing with non-numerical types:
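Indeed, the unary minus is not defined for strings:

```python
try:
    -"smith"  # cannot negate a string to reverse its sort order
except TypeError as error:
    print(error)
```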
Fortunately, there is another way. The idea is to take advantage of the fact that Python’s sorting is stable, which means that when multiple records have the same key, their original order is preserved.
The stability property has this very fortunate consequence: if you need to sort by columns a, b, c, then simply sort by column c, then b, then a:
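A sketch of this multi-pass approach (column positions are hypothetical):

```python
rows = [
    ['smith', 'john', 30],
    ['adams', 'jane', 45],
    ['smith', 'adam', 30],
]
# ORDER BY column 2 DESC, column 0 ASC, column 1 ASC
order_keys = [(2, True), (0, False), (1, False)]

# Sort by the last key first: since Python's sort is stable,
# each earlier pass survives as a tie-breaker for the later ones
for key_index, reverse in reversed(order_keys):
    rows.sort(key=lambda row: row[key_index], reverse=reverse)
```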
One last thing before we’re done with the ordering.
If we sort in ascending order by a text column, the rows with an empty (NULL) value will appear first:
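For example:

```python
rows = [['b'], [''], ['a']]
rows.sort(key=lambda row: row[0])
# rows is now [[''], ['a'], ['b']]: the empty value comes first
```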
Had we sorted in descending order, the rows with a NULL key would have (as desired) appeared last. But how do we make them appear last whether or not we sort in descending order?
We modify the “key” argument of the sort function in the following way:
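A sketch of that modified key function, with hypothetical `key_index` and `descending` settings:

```python
key_index, descending = 0, False  # hypothetical ORDER BY settings

def sort_key(row):
    value = row[key_index]
    # Empty (NULL) keys get the boolean that sorts them last:
    # True when ascending (since False < True), False when descending
    return ((value == '') != descending, value)

rows = [['b'], [''], ['a']]
rows.sort(key=sort_key, reverse=descending)
# ascending: [['a'], ['b'], ['']]
```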
Let’s examine the consequences of that first tuple element. If we are sorting in ascending order, the “sort_key” function will output (True, ‘’) for empty key values and (False, value) for non-empty ones. Since False < True, rows with empty key values will be sorted last.
If we are sorting in descending order, the “sort_key” function will output (False, ‘’) for empty key values and (True, value) for non-empty ones. Again, False < True, but this time reverse=True, so rows with empty key values will also be sorted last.
Thank you for reading through this piece. Feel free to show your appreciation or provide suggestions in the comments!