Treat Date as Continuous Variable R
Working with dates
Working with dates in R requires more attention than working with other object classes. Below, we offer some tools and examples to make this process less painful. Luckily, dates can be wrangled easily with practice, and with a set of helpful packages such as lubridate.
Upon import of raw data, R often interprets dates as character objects - this means they cannot be used for general date operations such as making time series and calculating time intervals. To make matters more difficult, there are many ways a date can be formatted and you must help R know which part of a date represents what (month, day, hour, etc.).
Dates in R are their own class of object - the Date class. It should be noted that there is also a class that stores objects with date and time. Date time objects are formally referred to as POSIXt, POSIXct, and/or POSIXlt classes (the difference isn't important). These objects are informally referred to as datetime classes.
- It is important to make R recognize when a column contains dates.
- Dates are an object class and can be tricky to work with.
- Here we present several ways to convert date columns to Date class.
Preparation
Load packages
This code chunk shows the loading of packages required for this page. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.
# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(
  lubridate,  # general package for handling and converting dates
  linelist,   # has function to "guess" messy dates
  aweek,      # another option for converting dates to weeks, and weeks to dates
  zoo,        # additional date/time functions
  tidyverse,  # data management and visualization
  rio)        # data import/export
Import data
We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow along step-by-step, see the instructions in the Download handbook and data page. We assume the file is in the working directory so no sub-folders are specified in this file path.
linelist <- import("linelist_cleaned.xlsx")
Current date
You can get the current "system" date or system datetime of your computer by doing the following with base R.
# get the system date - this is a DATE class
Sys.Date()
## [1] "2021-12-15"
# get the system time - this is a DATETIME class
Sys.time()
## [1] "2021-12-15 20:25:52 EST"
With the lubridate package these can also be returned with today() and now(), respectively. date() returns the current date and time with weekday and month names.
Convert to Date
After importing a dataset into R, date column values may look like "1989/12/30", "05/06/2014", or "13 Jan 2020". In these cases, R is likely still treating these values as Character values. R must be told that these values are dates… and what the format of the date is (which part is Day, which is Month, which is Year, etc).
Once told, R converts these values to class Date. In the background, R will store the dates as numbers (the number of days from its "origin" date 1 Jan 1970). You will not interface with the date number often, but this allows for R to treat dates as continuous variables and to allow special operations such as calculating the distance between dates.
By default, values of class Date in R are displayed as YYYY-MM-DD. Later in this section we will discuss how to change the display of date values.
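As a minimal sketch of what "stored as numbers" means in practice, the base R snippet below shows the underlying day counts and how ordinary arithmetic then works on dates. The example dates are made up for illustration and are not taken from the linelist dataset.

```r
# Illustrative dates (assumed values, not from the linelist dataset)
dates <- as.Date(c("1970-01-01", "1970-01-11", "2020-03-01"))

# The underlying numbers: days since the origin 1970-01-01
day_numbers <- as.numeric(dates)
day_numbers
# 0 10 18322

# Because dates are numeric underneath, arithmetic and summaries work directly
gap_days <- as.numeric(dates[3] - dates[2])  # number of days between two dates
mid_date <- mean(dates[1:2])                 # midpoint of two dates, still class Date
```

This is what allows a Date column to be used like a continuous variable, for example on a plot axis or in a regression, without any manual conversion.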
Below we present two approaches to converting a column from character values to class Date.
TIP: You can check the current class of a column with base R function class(), like class(linelist$date_onset).
base R
as.Date() is the standard, base R function to convert an object or column to class Date (note capitalization of "D").
Use of as.Date() requires that:
- You specify the existing format of the raw character date or the origin date if supplying dates as numbers (see section on Excel dates)
- If used on a character column, all date values must have the same exact format (if this is not the case, try guess_dates() from the linelist package)
First, check the class of your column with class() from base R. If you are unsure or confused about the class of your data (e.g. you see "POSIXct", etc.) it can be easiest to first convert the column to class Character with as.character(), and then convert it to class Date.
Second, within the as.Date() function, use the format = argument to tell R the current format of the character date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R's standard date formats ("YYYY-MM-DD" or "YYYY/MM/DD") the format = argument is not necessary.
To format =, provide a character string (in quotes) that represents the current date format using the special "strptime" abbreviations below. For example, if your character dates are currently in the format "DD/MM/YYYY", like "24/04/1968", then you would use format = "%d/%m/%Y" to convert the values into dates. Putting the format in quotation marks is necessary. And don't forget any slashes or dashes!
# Convert to class date
linelist <- linelist %>%
  mutate(date_onset = as.Date(date_of_onset, format = "%d/%m/%Y"))
Most of the strptime abbreviations are listed below. You can see the complete list by running ?strptime.
%d = Day number of month (5, 17, 28, etc.)
%j = Day number of the year (Julian day 001-366)
%a = Abbreviated weekday (Mon, Tue, Wed, etc.)
%A = Full weekday (Monday, Tuesday, etc.)
%w = Weekday number (0-6, Sunday is 0)
%u = Weekday number (1-7, Monday is 1)
%W = Week number (00-53, Monday is week start)
%U = Week number (00-53, Sunday is week start)
%m = Month number (e.g. 01, 02, 03, 04)
%b = Abbreviated month (Jan, Feb, etc.)
%B = Full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = Time zone (character)
TIP: The format = argument of as.Date() is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.
TIP: Be sure that in the format = argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.
Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
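To make the two tips above concrete, here is a small base R sketch with made-up character dates. It shows a few format strings, a standard layout that needs no format, and what happens when the format does not match the data.

```r
# Day/month/year separated by slashes
d1 <- as.Date("24/04/1968", format = "%d/%m/%Y")

# Day-month-year separated by dashes: the separator in format must match the data
d2 <- as.Date("05-06-2014", format = "%d-%m-%Y")

# R's standard layouts need no format argument
d3 <- as.Date("1989/12/30")

# A format that does not match the data yields NA rather than an error
d_bad <- as.Date("24/04/1968", format = "%m/%d/%Y")  # there is no month 24

format(c(d1, d2, d3))
# "1968-04-24" "2014-06-05" "1989-12-30"
is.na(d_bad)
# TRUE
```

Because a mismatched format fails silently with NA, it can be worth counting missing values before and after the conversion.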
lubridate
Converting character objects to dates can be made easier by using the lubridate package. This is a tidyverse package designed to make working with dates and times more simple and consistent than in base R. For these reasons, lubridate is often considered the gold-standard package for dates and time, and is recommended whenever working with them.
The lubridate package provides several different helper functions designed to convert character objects to dates in an intuitive, and more lenient way than specifying the format in as.Date(). These functions are specific to the rough date format, but allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January) - they are named after abbreviations of date formats.
# install/load lubridate
pacman::p_load(lubridate)
The ymd() function flexibly converts date values supplied as year, then month, then day.
# read date in year-month-day format
ymd("2020-10-11")
## [1] "2020-10-11"
The mdy() function flexibly converts date values supplied as month, then day, then year.
# read date in month-day-year format
mdy("10/11/2020")
## [1] "2020-10-11"
The dmy() function flexibly converts date values supplied as day, then month, then year.
# read date in day-month-year format
dmy("11 10 2020")
## [1] "2020-10-11"
If using piping, the conversion of a character column to dates with lubridate might look like this:
linelist <- linelist %>%
  mutate(date_onset = lubridate::dmy(date_onset))
Once complete, you can run class() to verify the class of the column.
# Check the class of the column
class(linelist$date_onset)
Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
Note that the above functions work best with 4-digit years. 2-digit years can produce unexpected results, as lubridate attempts to guess the century.
To convert a 2-digit year into a 4-digit year (all in the same century) you can convert to class character and then combine the existing digits with a prefix using str_glue() from the stringr package (see Characters and strings). Then convert to date.
two_digit_years <- c("15", "15", "16", "17")
str_glue("20{two_digit_years}")
## 2015
## 2015
## 2016
## 2017
Combine columns
You can use the lubridate functions make_date() and make_datetime() to combine multiple numeric columns into one date column. For example, if you have numeric columns onset_day, onset_month, and onset_year in the data frame linelist:
linelist <- linelist %>%
  mutate(onset_date = make_date(year = onset_year, month = onset_month, day = onset_day))
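For a self-contained illustration, the sketch below applies make_date() to a small, hypothetical data frame (here called onsets, standing in for linelist) so the result can be inspected without the full dataset.

```r
library(lubridate)

# Hypothetical data frame standing in for linelist (values are made up)
onsets <- data.frame(
  onset_year  = c(2014, 2014, 2015),
  onset_month = c(5, 11, 2),
  onset_day   = c(3, 27, 14)
)

# Combine the three numeric columns into one Date column
onsets$onset_date <- make_date(
  year  = onsets$onset_year,
  month = onsets$onset_month,
  day   = onsets$onset_day
)

class(onsets$onset_date)
# "Date"
format(onsets$onset_date)
# "2014-05-03" "2014-11-27" "2015-02-14"
```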
Excel dates
In the background, most software stores dates as numbers. R stores dates from an origin of 1st January, 1970. Thus, if you run as.numeric(as.Date("1970-01-01")) you will get 0.
Microsoft Excel stores dates with an origin of either December 30, 1899 (Windows) or January 1, 1904 (Mac), depending on your operating system. See this Microsoft guidance for more information.
Excel dates often import into R as these numeric values instead of as characters. If the dataset you imported from Excel shows dates as numbers or characters like "41369"… use as.Date() (or lubridate's as_date() function) to convert, but instead of supplying a "format" as above, supply the Excel origin date to the argument origin =.
This will not work if the Excel date is stored in R as a character type, so make sure the number is class Numeric!
NOTE: You should provide the origin date in R's default date format ("YYYY-MM-DD").
# An example of providing the Excel 'origin date' when converting Excel number dates
data_cleaned <- data %>%
  mutate(date_onset = as.numeric(date_onset)) %>%                 # ensure class is numeric
  mutate(date_onset = as.Date(date_onset, origin = "1899-12-30")) # convert to date using Excel origin
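As a quick check of the approach on a single value, the base R sketch below converts the serial number "41369" mentioned above, assuming it came from Excel on Windows (origin 1899-12-30).

```r
# Excel serial number that arrived in R as text
raw_value <- "41369"

# Convert to numeric first, then supply the Windows Excel origin
excel_date <- as.Date(as.numeric(raw_value), origin = "1899-12-30")
excel_date
# "2013-04-05"

# Round trip: days between the Excel origin and the resulting date
round_trip <- as.numeric(excel_date - as.Date("1899-12-30"))
round_trip
# 41369
```

If the workbook used the 1904 date system instead, the same call with origin = "1904-01-01" would apply.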
Messy dates
The function guess_dates() from the linelist package attempts to read a "messy" date column containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(). If guess_dates() is not yet available on CRAN for R 4.0.2, try installing via pacman::p_load_gh("reconhub/linelist").
For example, guess_dates() would see a vector of the following character dates "03 Jan 2018", "07/03/1982", and "08/20/85" and convert them to class Date as: 2018-01-03, 1982-03-07, and 1985-08-20.
linelist::guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85"))
## [1] "2018-01-03" "1982-03-07" "1985-08-20"
Some optional arguments for guess_dates() that you might include are:
- error_tolerance - the proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1 or 10%)
- last_date - the last valid date (defaults to current date)
- first_date - the first valid date. Defaults to fifty years before the last_date.
# An example using guess_dates on the column date_onset
linelist <- linelist %>%                 # the dataset is called linelist
  mutate(
    date_onset = linelist::guess_dates(  # the guess_dates() from package "linelist"
      date_onset,
      error_tolerance = 0.1,
      first_date = "2016-01-01"
    )
  )
Working with date-time class
As previously mentioned, R also supports a datetime class - a column that contains date and time information. As with the Date class, these often need to be converted from character objects to datetime objects.
Convert dates with times
A standard datetime object is formatted with the date first, which is followed by a time component - for example 01 Jan 2020, 16:30. As with dates, there are many ways this can be formatted, and there are numerous levels of precision (hours, minutes, seconds) that can be supplied.
Luckily, lubridate helper functions also exist to help convert these strings to datetime objects. These functions are extensions of the date helper functions, with _h (only hours supplied), _hm (hours and minutes supplied), or _hms (hours, minutes, and seconds supplied) appended to the end (e.g. dmy_hms()). These can be used as shown:
Convert datetime with only hours to datetime object
ymd_h("2020-01-01 16hrs")
## [1] "2020-01-01 16:00:00 UTC"
Convert datetime with hours and minutes to datetime object
dmy_hm("01 January 2020 16:20")
## [1] "2020-01-01 16:20:00 UTC"
Convert datetime with hours, minutes, and seconds to datetime object. The string is written day first, so dmy_hms() is the matching helper.
dmy_hms("01 January 2020, 16:20:40")
## [1] "2020-01-01 16:20:40 UTC"
You can supply time zone but it is ignored. See section later in this page on time zones.
dmy_hms("01 January 2020, 16:20:40 PST")
## [1] "2020-01-01 16:20:40 UTC"
When working with a data frame, time and date columns can be combined to create a datetime column using str_glue() from stringr package and an appropriate lubridate function. See the page on Characters and strings for details on stringr.
In this example, the linelist data frame has a column in format "hours:minutes". To convert this to a datetime we follow a few steps:
- Create a "clean" time of admission column with missing values filled-in with the column median. We do this because lubridate won't operate on missing values. Combine it with the column date_hospitalisation, and then use the function ymd_hm() to convert.
# packages
pacman::p_load(tidyverse, lubridate, stringr)

# time_admission is a column in hours:minutes
linelist <- linelist %>%

  # when time of admission is not given, assign the median admission time
  mutate(
    time_admission_clean = ifelse(
      is.na(time_admission),                 # if time is missing
      median(time_admission, na.rm = TRUE),  # assign the median (ignoring missing values)
      time_admission                         # if not missing keep as is
    )
  ) %>%

  # use str_glue() to combine date and time columns to create one character column
  # and then use ymd_hm() to convert it to datetime
  mutate(
    date_time_of_admission = str_glue("{date_hospitalisation} {time_admission_clean}") %>%
      ymd_hm()
  )
Convert times alone
If your data contain only a character time (hours and minutes), you can convert and manipulate them as times using strptime() from base R. For example, to get the difference between two of these times:
# raw character times
time1 <- "13:45"
time2 <- "15:20"

# Times converted to a datetime class
time1_clean <- strptime(time1, format = "%H:%M")
time2_clean <- strptime(time2, format = "%H:%M")

# Difference is of class "difftime" by default, here converted to numeric hours
as.numeric(time2_clean - time1_clean, units = "hours")   # difference in hours
## [1] 1.583333
Note however that without a date value provided, it assumes the date is today. To combine a string date and a string time together see how to use stringr in the section just above. Read more about strptime() here.
To convert single-digit numbers to double-digits (e.g. to "pad" hours or minutes with leading zeros to achieve 2 digits), see this "Pad length" section of the Characters and strings page.
Working with dates
lubridate can also be used for a variety of other functions, such as extracting aspects of a date/datetime, performing date arithmetic, or calculating date intervals.
Here we define a date to use for the examples:
# create object of class Date
example_date <- ymd("2020-03-01")
Date math
You can add certain numbers of days or weeks using their respective function from lubridate.
# add 3 days to this date
example_date + days(3)
## [1] "2020-03-04"
# add 7 weeks and subtract two days from this date
example_date + weeks(7) - days(2)
## [1] "2020-04-17"
Date intervals
The difference between dates can be calculated by:
- Ensure both dates are of class Date
- Use subtraction to return the "difftime" difference between the two dates
- If necessary, convert the result to numeric class to perform subsequent mathematical calculations
Below the interval between two dates is calculated and displayed. You can find intervals by using the subtraction "minus" symbol on values that are class Date. Note, however, that the class of the returned value is "difftime" as displayed below, and must be converted to numeric.
# find the interval between this date and Feb 20 2020
output <- example_date - ymd("2020-02-20")
output    # print
## Time difference of 10 days
class(output)
## [1] "difftime"
To do subsequent operations on a "difftime", convert it to numeric with as.numeric().
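The conversion step can be sketched in base R as follows. The two dates mirror the example above. The units = argument is optional but makes the resulting number unambiguous.

```r
onset <- as.Date("2020-02-20")
hosp  <- as.Date("2020-03-01")

delay <- hosp - onset
class(delay)   # "difftime"
units(delay)   # "days"

# Convert to plain numbers, stating the unit explicitly
delay_days  <- as.numeric(delay, units = "days")    # 10
delay_weeks <- as.numeric(delay, units = "weeks")   # 10 / 7

# difftime() lets you choose the unit up front
delay_hours <- as.numeric(difftime(hosp, onset, units = "hours"))  # 240
```

Stating the unit is a defensive habit: subtraction of datetimes (rather than dates) picks its unit automatically, so the bare number can otherwise mean seconds, minutes, hours, or days.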
This can all be brought together to work with data - for example:
pacman::p_load(lubridate, tidyverse)  # load packages

linelist <- linelist %>%

  # convert date of onset from character to date objects by specifying dmy format
  mutate(date_onset = dmy(date_onset),
         date_hospitalisation = dmy(date_hospitalisation)) %>%

  # keep only cases with onset in March
  filter(month(date_onset) == 3) %>%

  # find the difference in days between onset and hospitalisation
  mutate(days_onset_to_hosp = date_hospitalisation - date_onset)
In a data frame context, if either of the above dates is missing, the operation will fail for that row. This will result in an NA instead of a numeric value. When using this column for calculations, be sure to set the na.rm = argument to TRUE. For example:
# calculate the median number of days to hospitalisation for all cases where data are available
median(linelist$days_onset_to_hosp, na.rm = TRUE)
Date display
Once dates are the correct class, you often want them to display differently, for example to display as "Friday 05 January" instead of "2018-01-05". You may also want to adjust the display in order to then group rows by the date elements displayed - for example to group by month-year.
format()
Adjust date display with the base R function format(). This function accepts a character string (in quotes) specifying the desired output format in the "%" strptime abbreviations (the same syntax as used in as.Date()). Below are most of the common abbreviations.
Note: using format() will convert the values to class Character, so this is generally used towards the end of an analysis or for display purposes only! You can see the complete list by running ?strptime.
%d = Day number of month (5, 17, 28, etc.)
%j = Day number of the year (Julian day 001-366)
%a = Abbreviated weekday (Mon, Tue, Wed, etc.)
%A = Full weekday (Monday, Tuesday, etc.)
%w = Weekday number (0-6, Sunday is 0)
%u = Weekday number (1-7, Monday is 1)
%W = Week number (00-53, Monday is week start)
%U = Week number (00-53, Sunday is week start)
%m = Month number (e.g. 01, 02, 03, 04)
%b = Abbreviated month (Jan, Feb, etc.)
%B = Full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = Time zone (character)
An example of formatting today's date:
# today's date, with formatting
format(Sys.Date(), format = "%d %B %Y")
## [1] "15 December 2021"
# easy way to get full date and time (default formatting)
date()
## [1] "Wed Dec 15 20:25:53 2021"
# formatted combined date, time, and time zone using str_glue() function
str_glue("{format(Sys.Date(), format = '%A, %B %d %Y, %z %Z, ')}{format(Sys.time(), format = '%H:%M:%S')}")
## Wednesday, December 15 2021, +0000 UTC, 20:25:53
# using format() to display the year and week number
format(Sys.Date(), "%Y Week %W")
## [1] "2021 Week 50"
Note that if using str_glue(), be aware that within the expected double quotes " you should only use single quotes (as above).
Month-Year
To convert a Date column to Month-year format, we suggest you use the function as.yearmon() from the zoo package. This converts the date to class "yearmon" and retains the proper ordering. In contrast, using format(column, "%Y %B") will convert to class Character and will order the values alphabetically (incorrectly).
Below, a new column yearmonth is created from the column date_onset, using the as.yearmon() function. The default (correct) ordering of the resulting values is shown in the table.
# create new column
test_zoo <- linelist %>%
  mutate(yearmonth = zoo::as.yearmon(date_onset))

# print table
table(test_zoo$yearmonth)
##
## Apr 2014 May 2014 Jun 2014 Jul 2014 Aug 2014 Sep 2014 Oct 2014 Nov 2014 Dec 2014 Jan 2015 Feb 2015 Mar 2015 Apr 2015
##        7       64      100      226      528     1070     1112      763      562      431      306      277      186
In contrast, you can see how only using format() does achieve the desired display format, but not the correct ordering.
# create new column
test_format <- linelist %>%
  mutate(yearmonth = format(date_onset, "%b %Y"))

# print table
table(test_format$yearmonth)
##
## Apr 2014 Apr 2015 Aug 2014 Dec 2014 Feb 2015 Jan 2015 Jul 2014 Jun 2014 Mar 2015 May 2014 Nov 2014 Oct 2014 Sep 2014
##        7      186      528      562      306      431      226      100      277       64      763     1112     1070
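If adding zoo is not an option, one workaround (an addition here, not part of the approach described above) is to format with the numeric year first and the numeric month second. The values are still class Character, but alphabetical order then coincides with chronological order. The dates below are made up for illustration.

```r
# Illustrative dates (assumed values)
dates <- as.Date(c("2014-12-03", "2015-01-15", "2014-04-20", "2014-04-02"))

# Year first, zero-padded month second: sorts correctly even as text
year_month <- format(dates, "%Y-%m")

month_levels <- sort(unique(year_month))
month_levels
# "2014-04" "2014-12" "2015-01"

# Counts per month, in chronological order
month_counts <- table(year_month)
```

The trade-off is readability: "2014-04" is less friendly in a table than "Apr 2014", and the result cannot be used for date arithmetic.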
Note: if you are working within a ggplot() and want to adjust how dates are displayed only, it may be sufficient to provide a strptime format to the date_labels = argument in scale_x_date() - you can use "%b %Y" or "%Y %b". See the ggplot tips page.
zoo also offers the function as.yearqtr(), and you can use scale_x_yearmon() when using ggplot().
Epidemiological weeks
lubridate
See the page on Grouping data for more extensive examples of grouping data by date. Below we briefly describe grouping data by weeks.
We generally recommend using the floor_date() function from lubridate, with the argument unit = "week". This rounds the date down to the "start" of the week, as defined by the argument week_start =. In lubridate the default week start is 7 (for Sundays) unless the option lubridate.week.start has been set, but you can specify any day of the week as the start (e.g. 1 for Mondays). floor_date() is versatile and can be used to round down to other time units by setting unit = to "second", "minute", "hour", "day", "month", or "year".
The returned value is the start date of the week, in Date class. Date class is useful when plotting the data, as it will be easily recognized and ordered correctly by ggplot().
If you are only interested in adjusting dates to display by week in a plot, see the section in this page on Date display. For example when plotting an epicurve you can format the date display by providing the desired strptime "%" nomenclature. For example, use "%Y-%W" or "%Y-%U" to return the year and week number (given Monday or Sunday week start, respectively).
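The effect of week_start = is easiest to see on a single date. The sketch below uses an arbitrary Wednesday and sets week_start explicitly in each call, so the result does not depend on any option set in the session.

```r
library(lubridate)

d <- ymd("2020-03-04")   # a Wednesday (arbitrary example date)

# Round down to the Monday that starts the week
wk_monday <- floor_date(d, unit = "week", week_start = 1)
wk_monday
# "2020-03-02"

# Round down to the Sunday that starts the week
wk_sunday <- floor_date(d, unit = "week", week_start = 7)
wk_sunday
# "2020-03-01"

# The same function rounds down to other units
mon_start <- floor_date(d, unit = "month")
mon_start
# "2020-03-01"
```

Setting week_start explicitly in analysis code is a reasonable habit, since it documents which convention the weekly counts follow.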
Weekly counts
See the page on Grouping data for a thorough explanation of grouping data with count(), group_by(), and summarise(). A brief example is below.
- Create a new 'week' column with mutate(), using floor_date() with unit = "week"
- Get counts of rows (cases) per week with count(); filter out any cases with missing date
- Finish with complete() from tidyr to ensure that all weeks appear in the data - even those with no rows/cases. By default the count values for any "new" rows are NA, but you can make them 0 with the fill = argument, which expects a named list (below, n is the name of the counts column).
# Make aggregated dataset of weekly case counts
weekly_counts <- linelist %>%
  drop_na(date_onset) %>%             # remove cases missing onset date
  mutate(weekly_cases = floor_date(   # make new column, week of onset
    date_onset,
    unit = "week")) %>%
  count(weekly_cases) %>%             # group data by week and count rows per group (creates column 'n')
  tidyr::complete(                    # ensure all weeks are present, even those with no cases reported
    weekly_cases = seq.Date(          # re-define the "weekly_cases" column as a complete sequence,
      from = min(weekly_cases),       # from the minimum date
      to = max(weekly_cases),         # to the maximum date
      by = "week"),                   # by weeks
    fill = list(n = 0))               # fill-in NAs in the n counts column with 0
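To see what each step does without the full dataset, here is the same pipeline run on a tiny, made-up data frame (toy_cases is a hypothetical name). It has two onsets in one week, none in the next, one in the week after, and one missing date.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical mini linelist: three dated cases and one missing onset date
toy_cases <- tibble(
  date_onset = ymd(c("2020-03-02", "2020-03-04", "2020-03-17", NA))
)

toy_weekly <- toy_cases %>%
  drop_na(date_onset) %>%                                  # drop the row with no date
  mutate(week = floor_date(date_onset, unit = "week",
                           week_start = 1)) %>%            # Monday-start weeks
  count(week) %>%                                          # rows per week, in column n
  complete(
    week = seq.Date(min(week), max(week), by = "week"),    # add the empty week
    fill = list(n = 0)                                     # and give it a count of 0
  )

toy_weekly
# weeks 2020-03-02, 2020-03-09, 2020-03-16 with counts 2, 0, 1
```

Without complete(), the week of 2020-03-09 would simply be absent, which can distort an epicurve or a rolling average.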
Epiweek alternatives
Note that lubridate also has functions week(), epiweek(), and isoweek(), each of which has slightly different start dates and other nuances. Generally speaking though, floor_date() should be all that you need. Read the details for these functions by entering ?week into the console or reading the documentation here.
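The differences between these three functions show up mainly around the turn of the year. As a small sketch, the date below (chosen for illustration) is the first Sunday of January 2021, where the three conventions disagree.

```r
library(lubridate)

d <- ymd("2021-01-03")   # a Sunday

wk_plain <- week(d)      # complete 7-day periods since 1 January, plus one
wk_iso   <- isoweek(d)   # ISO 8601 week: Monday start
wk_epi   <- epiweek(d)   # US CDC convention: Sunday start

c(week = wk_plain, isoweek = wk_iso, epiweek = wk_epi)
# week 1, isoweek 53, epiweek 1
```

Under the ISO convention this Sunday still belongs to the last week of 2020, so week numbers alone are ambiguous without the matching year (see isoyear() and epiyear() in lubridate).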
You might consider using the package aweek to set epidemiological weeks. You can read more about it on the RECON website. It has the functions date2week() and week2date() in which you can set the week start day with week_start = "Monday". This package is easiest if you want "week"-style outputs (e.g. "2020-W12"). Another advantage of aweek is that when date2week() is applied to a date column, the returned column (week format) is automatically of class Factor and includes levels for all weeks in the time span (this avoids the extra step of complete() described above). However, aweek does not have the functionality to round dates to other time units such as months, years, etc.
Another alternative for time series which also works well to show a "week" format ("2020 W12") is yearweek() from the package tsibble, as demonstrated in the page on Time series and outbreak detection.
Converting dates/time zones
When data are recorded in different time zones, it can often be important to standardise them to a single time zone. This can present a further challenge, as the time zone component of data must be coded manually in most cases.
In R, each datetime object has a timezone component. By default, all datetime objects will carry the local time zone for the computer being used - this is generally specific to a location rather than a named timezone, as time zones will often change in locations due to daylight savings time. It is not possible to accurately compensate for time zones without a time component of a date, as the event a date column represents cannot be attributed to a specific time, and therefore time shifts measured in hours cannot be reasonably accounted for.
To deal with time zones, there are a number of helper functions in lubridate that can be used to change the time zone of a datetime object from the local time zone to a different time zone. Time zones are set by attributing a valid tz database time zone to the datetime object. A list of these can be found here - if the location you are using data from is not on this list, nearby large cities in the time zone are available and serve the same purpose.
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
# assign the current time to a column
time_now <- Sys.time()
time_now
## [1] "2021-12-15 20:25:53 EST"
# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
time_london_real <- with_tz(time_now, "Europe/London")

# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
time_london_local <- force_tz(time_now, "Europe/London")

# note that as long as the computer that was used to run this code is NOT set to London time,
# there will be a difference in the times
# (the number of hours difference from the computer's time zone to London)
time_london_real - time_london_local
## Time difference of 5 hours
This may seem largely abstract, and is often not needed if the user isn't working across time zones.
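Because Sys.time() depends on the computer running the code, a fixed timestamp makes the difference between the two functions easier to reproduce. The sketch below assumes a reading taken in New York. The timestamp and the zone "America/New_York" are illustrative choices.

```r
library(lubridate)

# A fixed moment, recorded on a New York clock (illustrative value)
t_ny <- ymd_hms("2021-12-15 20:25:53", tz = "America/New_York")

# with_tz(): same moment in time, shown on a London clock
t_london_real <- with_tz(t_ny, "Europe/London")
format(t_london_real, "%Y-%m-%d %H:%M:%S")
# "2021-12-16 01:25:53"

# force_tz(): same clock reading, relabelled as London time (a different moment)
t_london_local <- force_tz(t_ny, "Europe/London")
format(t_london_local, "%Y-%m-%d %H:%M:%S")
# "2021-12-15 20:25:53"

# The two results differ by the New York to London offset
offset_hours <- as.numeric(difftime(t_london_real, t_london_local, units = "hours"))
offset_hours
# 5
```

As a rule of thumb, with_tz() suits data that were recorded correctly but need to be displayed elsewhere, while force_tz() suits data whose time zone label was wrong or missing on import.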
Lagging and leading calculations
lead() and lag() are functions from the dplyr package which help find previous (lagged) or subsequent (leading) values in a vector - typically a numeric or date vector. This is useful when doing calculations of change/difference between time units.
Let's say you want to calculate the difference in cases between a current week and the previous one. The data are initially provided as weekly counts.
When using lag() or lead() the order of rows in the data frame is very important! Pay attention to whether your dates/numbers are ascending or descending.
First, create a new column containing the value of the previous (lagged) week.
- Control the number of units back/forward with n = (must be a non-negative integer)
- Use default = to define the value placed in non-existing rows (e.g. the first row for which there is no lagged value). By default this is NA.
- Use order_by = to supply the column that defines the ordering if your rows are not already ordered by your reference column
counts <- counts %>%
  mutate(cases_prev_wk = lag(cases_wk, n = 1))
Next, create a new column which is the difference between the two cases columns:
counts <- counts %>%
  mutate(cases_prev_wk = lag(cases_wk, n = 1),
         case_diff = cases_wk - cases_prev_wk)
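As a worked illustration, the sketch below builds a small, made-up counts table (toy_counts is a hypothetical name, with the column cases_wk as above) and applies the same two steps. Rows are sorted by week first so that "previous" really means the previous week.

```r
library(dplyr)

# Hypothetical weekly counts (values are made up)
toy_counts <- tibble(
  week     = as.Date("2020-03-02") + 7 * 0:3,
  cases_wk = c(5, 8, 6, 10)
)

toy_counts <- toy_counts %>%
  arrange(week) %>%                               # make sure weeks are in ascending order
  mutate(
    cases_prev_wk = lag(cases_wk, n = 1),         # value from the row above
    case_diff     = cases_wk - cases_prev_wk      # change since the previous week
  )

toy_counts$case_diff
# NA 3 -2 4
```

The first row has no previous week, so both new columns are NA there unless default = is supplied.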
You can read more about lead() and lag() in the documentation here or by entering ?lag in your console.
Source: https://epirhandbook.com/en/working-with-dates.html