When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight. Once you’ve identified the primary keys in your tables, it’s good practice to verify that they do indeed uniquely identify each observation. For example, origin is part of the weather primary key, and is also a foreign key for the airports table. For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.Ī primary key uniquely identifies an observation in its own table.įor example, planes$tailnum is a primary key because it uniquely identifiesĪ foreign key uniquely identifies an observation in another table.įor example, flights$tailnum is a foreign key because it appears in theįlights table where it matches each flight to a unique plane.Ī variable can be both a primary key and a foreign key. In other cases, multiple variables may be needed. For example, each plane is uniquely identified by its tailnum.
In simple cases, a single variable is sufficient to identify an observation. A key is a variable (or set of variables) that uniquely identifies an observation.
The variables used to connect each pair of tables are called keys. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. If you’ve used a database before, you’ve almost certainly used SQL. The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. Set operations, which treat observations as if they were set elements. Whether or not they match an observation in the other table. Mutating joins, which add new variables to one data frame from matchingįiltering joins, which filter observations from one data frame based on There are three families of verbs designed to work with relational data: To work with relational data you need verbs that work with pairs of tables.
Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.
All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair. Relations are always defined between a pair of tables. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. It’s rare that a data analysis involves only a single table of data.