Last updated: 2019-03-03

Checks: 2 0

Knit directory: ncdc_storm_events/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
Ignored:    .Rproj.user/

Untracked files:
Untracked:  docs/

Unstaged changes:
Modified:   .gitignore



Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd da6a233 Tim Trice 2018-11-25 Complete documentation
Rmd 5fbe2e1 Tim Trice 2018-11-14 Initiate workflowr project

ncdc_storm_events is a project that downloads the NCDC Storm Events database from the National Oceanic and Atmospheric Administration.

The NCDC Storm Events Database is a collection of observations for significant weather events. The “database” contains information on property damage, loss of life, intensity of systems and more.

The dataset is updated somewhat regularly, but is not real-time. As of this writing, the dataset covers the time period January, 1950, to Aug, 2018.

There are three tables within the database:

• details

• fatalities

• locations

The details dataset is the heaviest raw dataset, weighing over 1.1G when combined and saved as CSV. locations is much smaller; only 75M (CSV) with fatalities a featherweight at 1.2K.

details also happens to be the dirtiest. With 51 columns, at least a dozen of these are unnecessary - mostly related to date or time observations (there are 13 variables total, all redundant, for date/time info).

fatalities has the same issue with dates and times as details, but not nearly on the same scale.

locations, though treated as its own dataset, is very comparable to the location data within details and much of this information is redundant.

The entire database, imo, is a great project for exploring, tidying and cleaning somewhat large datasets. The challenge comes in ensuring data you may remove (what seems to be redundant) should in fact be removed.

## Data Source

The datasets are broken down by table, then further broken down by year of the observations. They are stored in csv.gz format on the NCDC NOAA FTP server.

Each file is named like,

StormEvents_{TABLE}-ftp_v1.0_d{YEAR}_c{LAST_MODIFIED}.csv.gz

where TABLE is one of details, locations, or fatalities, YEAR is the year of the observations, and LAST_MODIFIED is the last datetime modification of the archive file.

Files are downloaded with ./code/01_get_data.R. All csv.gz datasets are bound together into one dataframe for each of the three tables. These raw data files are saved in the data directory.

## Tidying Data

All three datasets can be tidied to some extent using ./code/02_tidy_data.R. The details dataset is the worse with over a dozen date and/or time variables. These variables are dropped and BEGIN_DATE_TIME and END_DATE_TIME are reformatted to YYYY-MM-DD HH:MM:SS format as a character string. Timezone information is not saved. Though it is included in the dataset, it is near-completely incorrect.

Additionally, date and time variables in fatalities and locations are also modified to remove redundancy or, in the case of locations which matches the date/time values in details, have been completely removed.

The damage variables in details have also been modified. The raw data uses alphanumeric characters; for example, “2k” or “2K” for $2,000 and “10B” or “10b” for$10,000,000,000. These have been cleaned to integer values.

Lastly, the narrative variables (EPISODE_NARRATIVE and EVENT_NARRATIVE) are split out to their own respective dataset to avoid redundancy and reduce the size of the other datasets.

Where the raw data is well over 1.2G, the entire dataset, after tidying, sits at 539M.

All tidied datasets are located in the output directory.

## Removing data file history from git

When updating data, this repo will use BGP Repo-Cleaner to remove the history of data files from the repository. When this is done, the repository will need to be removed from production environments in favor of a fresh clone