Day 1: Our Notes¶
This page includes our notes on today's tutorials. You can use them to review, to practice, or you can ignore them entirely!
1. Why use a database¶
You might want to review the following questions (also shared on our workshop home page):
2. Prepare your Spreadsheet¶
We'll need to think about the information in your spreadsheet in terms of what's in each column, and how can we arrange the columns into two (or more) related tables.
Imagine we start with a table like this one:
id | root | nicodemus | english | originator |
---|---|---|---|---|
1 | a | a | hello. | Lawrence Nicodemus |
2 | bc | buts | boots. | Lawrence Nicodemus |
3 | bs | tbilso | dragonfly. |
Are some columns limited in the values that might appear in them? For example, you might want to separate your originators from the words information in separate tables. That way you can manage the originators on their own (for example, you will only have to make sure each person's name is spelled correctly in one row in one table, not in lots of places).
There may be other opportunities to separate information into several related tables. Depending on how many different categories of grammar or symbols, those might be candidates to extract into a separate table.
You can try different strategies as you get started, so don't worry about making a mistake now!
3. Define your columns¶
Take a look at your column headers in your spreadsheet. These will likely become the name of the corresponding table in your database - but be careful. Your column names in a spreadsheet can be almost anything - they can have multiple words with spaces, dashes or other punctuation. Database column names need to be more regular than this. Here's a good rule of thumb: make your column names all lower case, and if you need to have more than one word in a column name connect the words with underscore, like this like first_name, whole_word, english_translation.
As you think about your spreadsheet, it will be helpful to identify the type of information that gets stored in each column. Databases know about several different types of information, here are the most likely types you might want to use in your tables:
Tip
The official SQLite documentation describing the data types you can use is here
- Integer (abbreviated 'int') is any whole number including 0. Integer data are particularly useful for the column you will designate as a primary key in your table. An integer column will sort from high to low or low to high, as you'd expect them to do (see our discussion of collation sequences for an example of how this is different than how numbers sort if they're not designated as integers!)
Tip
If you want your database to do fancy mathematics, you'll want to know more about different ways that you can store and use numerical data - for example as Real (for 'real number') in SQLite. You could look here to get a start. We won't deal with math so much, as it's not typically an important part of a language application.
-
Text (abbreviated 'txt') is a sequences of letters, numbers, punctuation marks, and may even include blanks. Strings will sort based on a default collation sequence unless you specify something different. Text may need to be entered between "double" or 'single' quote marks because of the possibility that they might contain spaces. Depending on the database software you're using, you might also see this type called String (str)or even Varchar (short for 'character string of variable length). SQLite uses text for this kind of data, so that's what we'll call it.
-
Blob a blob is a bunch of stuff, stored exactly as it was input. Blob fields are useful if you want to store really long text passages, structured data like arrays or hashes, encoded graphics, and the like. Blobs won't have an obvious sort order, and they won't show up nicely in a table-type display.
-
Boolean is a logical operator type. In people language, words that work like 'true', 'false', 'and', 'or', and 'not' are booleans. In database language, there can be more - but in SQLite the primary boolean is a simple 'true' (encoded as '1') or 'false' (encoded as '0'). You could use a boolean field to indicate that you want a row to display (1), or to be hidden (0), in some situation, for example.
-
Date, Time and Timestamp are data types that database software use and that are very handy to include in your tables if you want to keep track of changes over time. You can find specific information about how these data types SQLite here.
4. Identify required vs optional columns¶
Since we're talking about database tables now (and not about stuff like online forms), we want to be careful about what we mean by a required versus an optional column.
In our spreadsheet example above, we can see that every row includes values for 'root', 'nicodemus', and 'english'; but not every row has a value for 'originator'. This probably means that in our database tables we'll want to require non-null values (all of those are probably text fields) for the first three columns, but allow NULL values for the last one.
Null is a data type that's used when a column in your table needs allow completely empty cells. Empty cells will look blank on the screen, but there are actually several kinds of cells that will look blank on the screen, and if you're not careful they can get weirdly confusing, because database systems will treat them differently from each other:
- cells that contain one or more space characters
- cells that contain an empty string
- cells that are NULL
Null is a super interesting and super-weird data type, and it's important to remember that to display a blank your table will probably need to store a null.
We can think of the columns that allow NULL values as optional, and that don't as required; and we'll have to tell our database software which columns are of which type. Usually, column that requires a value will be satisfied by fields containing an empty string, or even one or more spaces.
This means that if we were to load our spreadsheet into a single database table it'd really look like this:
id | root | nicodemus | english | originator |
---|---|---|---|---|
1 | a | a | hello. | Lawrence Nicodemus |
2 | bc | buts | boots. | Lawrence Nicodemus |
3 | bs | tbilso | dragonfly. | NULL |
Tip
Now's a great time to head over to this tutorial 🚀
5. Create Primary and Foreign Keys¶
One thing that's lacking in the spreadsheet example is a column called id (although when you look at your spreadsheet in MS Excel or Google Sheets or whatever, the line numbers kind of look like ids). For reasons discussed here, we'll want to add that column, and make sure that each row gets a unique number.
This column will hold the primary key by which we can identify any row in the table. And we can use this value to relate this table to another table in our database, by adding a column for a foreign key in that table.
The good news is that the database application will create our primary keys automatically when we pull our data from the spreadsheet into the tables. It can also help us make sure we add only valid foreign keys as we go.
6. Meet Structured Query Language (SQL)¶
As we do this work, we'll be using Structured Query Language (abbreviated SQL, pronounced 'ess-kyu-el'. When we're talking about a relational database, we usually pronounce the same abbreviation 'sequel'. Weird but true.). We'll tell you the commands you need, but if you'd like to read more, the link above will help.
7. Miscellaneous Terms and Concepts¶
We may discuss these topics as we go, and/or you may want to refer to them outside of class.
7.1. Data Normalization¶
Data normalization refers to the process we all have to go through when we move information from some external source into a database table.
As you try to add your spreadsheet information to your database tables you might run into problems because the form or order of your spreadsheet information is inconsistent. In order for the database table to allow you to enter your information, the database application is going to enforce the rules you gave it ruthlessly.
The data normalization step is actually one of the great advantages of building your own database! It will ensure that your information is clean and consistent, and that will also make it easy to use for lots of other purposes (including building more complicated language technologs!)
7.2. Encodings¶
Info
Computers have to convert letters (and spaces, and punctuation marks, and line beginnings and endings) to numbers before knowing how to display them to humans. There are lots of different ways to do this.
Here's more information about encodings that we might discuss today.
7.3 Collation Sequences, Character Sets, and Cases¶
Info
The marks used in different writing system might also come in a particular order. For English, we all learned 'alphabetical order', but that isn't the correct order for everyone. Furthermore, the ordering of upper and lower case letters with respect to each other; the order of non-letters (numerals, punctuation marks, etc) with respect to the other characters may need to be specified.
Here's more information about character sets, collation sequences and cases that we might discuss today.
Back to Day 1¶
If you'd like you can head back to Day 1. 🚀