Note: this should have been done by participants before the start of the workshop.
There are a number of different database management systems for working with relational data. We're going to use SQLite today, but basically everything we teach you will apply to the other database systems as well (e.g., MySQL, PostgreSQL, MS Access, Filemaker Pro). The only things that will differ are the details of exactly how to import and export data and the details of data types.
Let's all open the database we downloaded in SQLite Manager by clicking on the open file icon.
You can see the tables in the database by looking at the left hand side of the screen under Tables.
To see the contents of a table, click on that table and then click on the Browse and search tab in the right hand section of the screen.
If we want to write a query, we click on the Execute SQL tab.
The dataset for this lesson is cleaned Gapminder data from Jennifer Bryan https://github.com/jennybc/gapminder
country
) and INTEGER for fields with non-decimal numbers (year
,
pop
) and FLOAT for fields with decimal numbers ('lifeExp', 'gdpPercap')Challenge
Import the country table - countries.csv
In this lesson, we have two tables rather than the one combined table. Rather than repeating the continent information along with the country in the surveys table, we have a separate table that contains that information. Then we can put that information back together whenever we need to by matching country to continent. We'll go over how to join that information at the end. This is a very valuable feature of SQL.
You can also use this same approach to append new data to an existing table.
Let's start by using the surveys table. Here we have data on all the countries and years
Let’s write an SQL query that selects only the year column from the surveys table.
Go to the 'Execute SQL' tab. In the box type:
SELECT year FROM surveys;
We have capitalized the words SELECT and FROM because they are SQL keywords. SQL is case insensitive, but it helps for readability – good style.
If we want more information, we can just add a new column to the list of fields, right after SELECT:
SELECT year, pop FROM surveys;
Or we can select all of the columns in a table using the wildcard *
SELECT * FROM surveys;
If we want only the unique values so that we can quickly see what countries have
been surveyed we use DISTINCT
SELECT DISTINCT country FROM surveys;
If we select more than one column, then the distinct pairs of values are returned
SELECT DISTINCT country, year FROM surveys;
We can also do calculations with the values in a query. For example, if we wanted to look at the population in thousands of people, to make it easier to look at, we could divide population by 1000
SELECT country, year, pop/1000.0 from surveys;
When we run the query, the expression pop / 1000.0
is evaluated for each row
and appended to that row, in a new column. Expressions can use any fields, any
arithmetic operators (+ - * /) and a variety of built-in functions (). For
example, we could round the values of gpdPercap to two decimal values to make them easier to read.
SELECT country, year, ROUND(gpdPercap, 2) FROM surveys;
We can also rename the column using AS
SELECT country, year, ROUND(gdpPercap, 2) AS gdp FROM surveys;
Challenge
Write a query that returns the country, year, population in millions of people, rounded to two decimal places and life expectancy rounded to one decimal place. As a bonus rename the columns.
Databases can also filter data – selecting only the data meeting certain criteria. For example, let’s say we only want data for the country Iceland. We need to add a WHERE clause to our query:
SELECT * FROM surveys WHERE country="Iceland";
We can do the same thing with numbers. Here, we only want the data since 2000:
SELECT * FROM surveys WHERE year >= 2000;
We can use more sophisticated conditions by combining tests with AND and OR. For example, suppose we want the data on Iceland starting in the year 2000:
SELECT * FROM surveys WHERE (year >= 2000) AND (country = "Iceland");
Note that the parentheses aren’t needed, but again, they help with readability. They also ensure that the computer combines AND and OR in the way that we intend.
If we wanted to get data for three countries, Iceland, and two other island countries Madagascar and Dominican Republic we could combine the tests using OR:
SELECT * FROM surveys WHERE (country = "Iceland") OR (country = "Madagascar") OR (country = "Dominican Republic");
Challenge
Write a query that returns the country, year, life expectancy and population in thousands of people for any field with a life expectancy greater than 70.
How many records are there?
How many records are there if you change lifeExp to 75?Write a query that returns the country, year, life expectancy and population in thousands of people for any field with a life expectancy greater than 70 and before 1990.
How many records are there?
How many records are there if you just look at the year 1952?
How many records are there if you just look at the year 2007?
Now, lets combine the above queries to get data for the 3 countries from
the year 2000 on. This time, let’s use IN as one way to make the query easier
to understand. It is equivalent to saying WHERE (country = "Iceland") OR (country
= "Madagascar") OR (country = "Dominican Republic")
, but reads more neatly:
SELECT * FROM surveys WHERE (year >= 2000) AND (country IN ("Iceland", "Madagascar", "Dominican Republic"));
SELECT *
FROM surveys
WHERE (year >= 2000) AND (country IN ("Iceland", "Madagascar", "Dominican Republic"));
We started with something simple, then added more clauses one by one, testing their effects as we went along. For complex queries, this is a good strategy, to make sure you are getting what you want. Sometimes it might help to take a subset of the data that you can easily see in a temporary database to practice your queries on before working on a larger or more complicated database.
We can also sort the results of our queries by using ORDER BY. Let's now use our countries table and put them in order. Here instead of saying 'FROM surveys' we say 'FROM countries' because we're referring to that table instead.
SELECT * FROM countries ORDER BY country ASC;
The keyword ASC tells us to order it in Ascending order. We could alternately use DESC to get descending order.
SELECT * FROM countries ORDER BY country DESC;
ASC is the default.
We can also sort on several fields at once. To truly be alphabetical within each continent, we can order by continent and then country.
SELECT * FROM countries ORDER BY continent ASC, country ASC;
Challenge
Write a query that returns country, year, and population in millions of people from the surveys table, sorted with the largest populations at the top.
Another note for ordering. We don’t actually have to display a column to sort by it. For example, let’s say we want to look at countries with population higher than a million and order the countries by their population, but we only want to see gdpPercap.
SELECT country, gdpPercap FROM surveys WHERE (pop > 1000000) ORDER BY pop ASC;
We can do this because sorting occurs earlier in the computational pipeline than field selection.
The computer is basically doing this:
More on SQL order of execution - SQL Query Order of Operations
The order of the clauses when we write a query is dictated by SQL: SELECT, FROM, WHERE, ORDER BY and we often write each of them on their own line for readability.
Challenge
Let's try to combine what we've learned so far in a single query. Using the surveys table write a query to display country, year lifeExp, and population in millions (rounded to two decimal places), for the year 2007, ordered by life expectancy with highest life expectancy at the top.
BREAK
Aggregation allows us to combine results by grouping records based on value and calculating combined values in groups.
Let’s go to the surveys table and find out how many entries there are. Using the wildcard simply counts the number of records (rows)
SELECT COUNT(*) FROM surveys
We can also find out the overall population (maybe the number of people who have been on earth since 1952?).
SELECT COUNT(*), SUM(pop) FROM surveys;
Let's output this value in millions of people, rounded to 3 decimal places.
SELECT ROUND(SUM(pop)/1000000.0, 3) FROM surveys
There are many other aggregate functions included in SQL
Let's use one query to output the total population, average population, and the min and max population.
SELECT pop, AVG(pop), MIN(pop), MAX(pop) FROM surveys;
Now, let's see how many countries were surveyed in each continent. We do this using a GROUP BY clause
SELECT continent, COUNT(*)
FROM countries
GROUP BY continent
GROUP BY tells SQL what field or fields we want to use to aggregate the data. If we want to group by multiple fields, we give GROUP BY a comma separated list.
Challenge
Let's go back to the surveys table. Write queries that return:
- How many countries were counted in each year.
- Average gdpPercap of each country
We can order the results of our aggregation by a specific column, including the aggregated column. Let’s count the number of countries in each continent, ordered by the count
SELECT continent, COUNT(*)
FROM countries
GROUP BY continent
ORDER BY COUNT(continent)
To combine data from two tables we use the SQL JOIN
command, which comes after
the FROM
command.
We also need to tell the computer which columns provide the link between the two
tables using the word ON
. What we want is to join the data with the same
species codes.
SELECT *
FROM surveys
JOIN countries ON surveys.country = countries.country
ON is like WHERE
, it filters things out according to a test condition. We use
the table.colname
format to tell the manager what column in which table we are
referring to.
We often won't want all of the fields from both tables, so anywhere we would
have used a field name in a non-join query, we can use table.colname
.
For example, what if we wanted information on the population of each country with the continent information.
SELECT surveys.country, countries.continent, surveys.year, surveys.pop
FROM surveys
JOIN countries ON surveys.country = countries.country
Challenge:
Write a query that returns the country, continent, and the life expectancy for each record
Joins can be combined with sorting, filtering, and aggregation. So, let's combine things as a final challenge.
Challenge
Write a query that returns the continent, the average life expectancy and the maximum population for each continent.
Bonus: Do this just for the year 1952 then for the year 2007
As queries get more complex names can get long and unwieldy. To help make things clearer we can use aliases to assign new names to things in the query.
We can alias both table names:
SELECT surv.year, surv.country
FROM surveys AS surv
JOIN countries AS co ON surv.country = co.country
And column names:
SELECT surv.year AS yr, surv.country AS con
FROM surveys AS surv
JOIN countries AS co ON surv.country = co.country
The AS
isn't technically required, so you could do
SELECT surv.year yr
FROM surveys surv
but using AS
is much clearer so it's good style to include it.
Data type | Description |
---|---|
CHARACTER(n) | Character string. Fixed-length n |
VARCHAR(n) or CHARACTER VARYING(n) | Character string. Variable length. Maximum length n |
BINARY(n) | Binary string. Fixed-length n |
BOOLEAN | Stores TRUE or FALSE values |
VARBINARY(n) or BINARY VARYING(n) | Binary string. Variable length. Maximum length n |
INTEGER(p) | Integer numerical (no decimal). |
SMALLINT | Integer numerical (no decimal). |
INTEGER | Integer numerical (no decimal). |
BIGINT | Integer numerical (no decimal). |
DECIMAL(p,s) | Exact numerical, precision p, scale s. |
NUMERIC(p,s) | Exact numerical, precision p, scale s. (Same as DECIMAL) |
FLOAT(p) | Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation. |
REAL | Approximate numerical |
FLOAT | Approximate numerical |
DOUBLE PRECISION | Approximate numerical |
DATE | Stores year, month, and day values |
TIME | Stores hour, minute, and second values |
TIMESTAMP | Stores year, month, day, hour, minute, and second values |
INTERVAL | Composed of a number of integer fields, representing a period of time, depending on the type of interval |
ARRAY | A set-length and ordered collection of elements |
MULTISET | A variable-length and unordered collection of elements |
XML | Stores XML data |
Different databases offer different choices for the data type definition.
The following table shows some of the common names of data types between the various database platforms:
Data type | Access | SQLServer | Oracle | MySQL | PostgreSQL |
---|---|---|---|---|---|
boolean | Yes/No | Bit | Byte | N/A | Boolean |
integer | Number (integer) | Int | Number | Int / Integer | Int / Integer |
float | Number (single) | Float / Real | Number | Float | Numeric |
currency | Currency | Money | N/A | N/A | Money |
string (fixed) | N/A | Char | Char | Char | Char |
string (variable) | Text (<256) / Memo (65k+) | Varchar | Varchar / Varchar2 | Varchar | Varchar |
binary object OLE Object Memo Binary (fixed up to 8K) | Varbinary (<8K) | Image (<2GB) Long | Raw Blob | Text Binary | Varbinary |