‘archetyper’ initializes data mining and data science projects by generating common workflow components as well as the peripheral files needed to support technical best practices.
The lifecycle of a data mining project generally includes the following components:
- Integration
- Exploration
- Enrichment
- Modeling
- Evaluation
- Presentation
- Deployment
Additionally, a well-formed data mining project will include:
- Centralized code for common libraries, functions, and constants
- Version control (e.g. git)
- Unit testing
- A readme file
- Adherence to syntax and style conventions (linting)
- Externalized properties for secret information (when necessary)
- Logging
- Use of relative directories
generate() will create a new project with the files and directories needed to support the data mining and data science workflow.
generate("majestic_12")
list.files("majestic_12")
[1] "data_input" "data_output" "data_working" "docs" "majestic_12.Rproj"
[6] "models" "R" "readme.md" ".gitignore"
The R code for the data workflow will be in the R/ directory.
list.files("majestic_12/R/")
[1] "0_test.R" "1_integrate.R" "2_enrich.R" "3_model.R"
[5] "4_evaluate.R" "5_present.Rmd" "api.R" "common.R"
[9] "explore.R" "lint.R" "mediator.R" "utilities.R"
The base workflow files are 1_integrate.R, 2_enrich.R, 3_model.R, 4_evaluate.R, 5_present.Rmd, and api.R.
1_integrate.R is responsible for acquiring and integrating data across data sources, as well as standardizing the data types and naming conventions of columnar data. It is recommended that the output of the integration step be prepared as a tidy, two-dimensional tibble, with rows as observations and columns as raw candidate features. The output should be saved in version-controlled feather format to the data_working/ directory for downstream use, as sketched below.
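For illustration, a minimal sketch of persisting the integrated tibble, assuming the ‘arrow’ package for feather I/O (the object and file names are hypothetical):

library(arrow)
library(tibble)

# Integrated data: rows are observations, columns are raw candidate features.
integrated_df <- tibble(hospital_id = c("H1", "H2"), psi_score = c(0.5, -1.2))

# Persist to data_working/ in feather format for downstream steps.
write_feather(integrated_df, "data_working/majestic_12_integrated.feather")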

2_enrich.R
is responsible for reading integrated data from the data_working/ directory and enriching the raw features into more informative features specific to the model being applied. Tasks in this stage might include feature engineering, outlier removal, imputation, feature selection, and the assignment of testing partition labels (sketched below). As with the integration step, the output of the enrichment step should be a tidy tibble saved as a version-controlled feather file in the data_working/ directory for downstream use.
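A hedged sketch of assigning testing partition labels (the partition column name and split ratio are assumptions, not ‘archetyper’ defaults):

library(arrow)
library(dplyr)

enriched_df <- read_feather("data_working/majestic_12_integrated.feather") %>%
  # Label roughly 80% of observations for training, the rest for testing.
  mutate(partition = if_else(runif(n()) < 0.8, "train", "test"))

write_feather(enriched_df, "data_working/majestic_12_enriched.feather")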

3_model.R
should read the training partitions from the enriched dataset stored in the data_working/ directory and train the desired model(s). Model objects generated within the 3_model.R file should be stored in the models/ directory using consistent version control and naming conventions, as in the sketch below.
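A minimal sketch of the modeling step (the model formula and file names are illustrative only):

library(arrow)
library(dplyr)

enriched_df <- read_feather("data_working/majestic_12_enriched.feather")
train_df <- enriched_df %>% filter(partition == "train") %>% select(-partition)

# Train a simple linear model and persist it to the models/ directory.
fit <- lm(psi_score ~ ., data = train_df)
saveRDS(fit, "models/majestic_12_model.rds")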

4_evaluate.R
should read the testing partitions from the enriched dataset stored in the data_working/ directory and apply the trained model from the models/ directory. Model results are appended to the testing dataset and persisted as a .csv file in the data_output/ directory (a sketch follows). A feather file is not generated in this step, as this data may be shared directly with non-technical stakeholders. Visualizations related to performance on the testing dataset, such as ROC curves and box plots, and model metadata (e.g. coefficients) may be generated at this phase and persisted in the data_working/ directory.
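A hedged sketch of the evaluation step (column and file names continue the hypothetical example above):

library(arrow)
library(dplyr)

enriched_df <- read_feather("data_working/majestic_12_enriched.feather")
test_df <- filter(enriched_df, partition == "test")
fit <- readRDS("models/majestic_12_model.rds")

# Append predictions and persist as plain text for non-technical stakeholders.
test_df$prediction <- predict(fit, newdata = test_df)
write.csv(test_df, "data_output/majestic_12_results.csv", row.names = FALSE)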

5_present.Rmd
is an R Markdown document template that demonstrates the assembly of data and charts stored within the data_output/ directory.

explore.R
exists outside of the sequential workflow and is used primarily for transient data analysis needed for data integration and feature preparation.

api.R
is a RESTful API template generated by ‘archetyper’. The API can be tested locally or deployed to an internal server. Note that the ‘archetyper’ API template does not enforce authentication or authorization; developers should therefore consult with their security team prior to deploying an API within their organization. The RESTful service in the api.R file is implemented using the ‘plumber’ package, along the lines of the sketch below.
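A minimal sketch of a ‘plumber’ endpoint in the style of api.R (the endpoint path, port, and model file name are assumptions):

library(plumber)

#* Predict from a JSON payload using a previously trained model.
#* @post /predict
function(dt) {
  fit <- readRDS("models/majestic_12_model.rds")
  predict(fit, newdata = as.data.frame(dt))
}

# To serve the file locally (run from the project root):
# plumber::pr_run(plumber::pr("R/api.R"), port = 8000)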

Additional files are created to serve supporting functions:
0_test.R stores unit and integration tests, with an example unit test using the ‘testthat’ package; a hypothetical test is sketched below.
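For illustration only (the function under test, clean_column_names(), is hypothetical and not part of ‘archetyper’):

library(testthat)

# A hypothetical utility that standardizes column names to snake_case.
clean_column_names <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

test_that("column names are standardized to snake_case", {
  expect_equal(clean_column_names("Hospital Type"), "hospital_type")
})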

mediator.R
is responsible for the contiguous execution of each component, with conditions to stop execution upon failure at any step. The mediator.R file is named after the Gang of Four mediator pattern, as it orchestrates calls to each component and enforces isolation between those components. The mediator.R file also executes the unit tests prior to the execution of the workflow. In the final step, the 5_present.Rmd file is executed, with its output (a PDF document, for example) stored in the docs/ directory.

common.R
is responsible for loading libraries common across components (e.g. ‘dplyr’, ‘stringr’, ‘magrittr’), storing project constants, sourcing shared utilities, and initializing a centralized logger. Each component in the workflow sources the common.R file; a hedged sketch of such a logger follows.
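A minimal sketch of a centralized ‘log4r’ logger as common.R might define one (the threshold and appender are assumptions):

library(log4r)

# Centralized logger shared by every component that sources common.R.
logger <- logger(threshold = "INFO", appenders = console_appender())

info(logger, "common.R sourced; logger initialized.")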

utilities.R
is designed to store functions that are necessary across components. The common.R file sources the utilities.R file; in turn, each file in the workflow that sources common.R also has access to the common utility functions.

.gitignore
ignores files that should not be stored in version control, including config.yml. The base .gitignore file included in the ‘archetyper’ template was generated with the ‘gitignore’ R package.

lint.R
includes linting commands for each file in the workflow to enforce proper style and syntax conventions.

readme.md
is designed to provide project documentation and details on how the user will interact with the code.

.Rproj
is a file indicating that the directory contains an R project, and it includes metadata related to the project itself. This file also serves as the marker of the root directory so that relative directories can be used in the project’s .R files. Relative directories are supported by the ‘here’ package.

The ‘archetyper’ package also generates a directory structure designed to logically separate the data artifacts produced throughout the workflow. These directories include:
- data_input/ is used to store unprocessed data read in through the 1_integrate.R file.
- data_working/ stores working data throughout the data mining workflow. This data includes versioned snapshots of both integrated and enriched data, as well as other files and objects that might be necessary to present findings from the model.
- models/ stores version-controlled model objects.
- data_output/ stores version-controlled plain-text files with the appended model results.
- docs/ stores the output of R Markdown files (and other documents resulting from the analysis).

For traceability, files and objects (e.g. models) throughout the project are named according to a standard naming convention:
[ project_name ]_[ file_name ]_[ YYYY_MM_DD_HH:MM ].[ file_extension ]
This structure, in conjunction with the persistent state of each component, allows each component script to be run independently without sourcing all the preceding components.
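For illustration, a hypothetical helper that produces such a file name (versioned_path() is not part of ‘archetyper’):

# Build a timestamped file name following the project naming convention.
versioned_path <- function(dir, project_name, file_name, file_extension) {
  stamp <- format(Sys.time(), "%Y_%m_%d_%H:%M")
  file.path(dir, paste0(project_name, "_", file_name, "_", stamp, ".", file_extension))
}

versioned_path("data_working", "majestic_12", "integrated", "feather")
# e.g. "data_working/majestic_12_integrated_2024_05_01_09:30.feather"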
A database connection type of “odbc” or “jdbc” can be passed to generate() via the db_connection argument to create scaffolding for database connections.
ODBC
A db_connection argument of ‘odbc’ will generate a connection code snippet in the 1_integrate.R file.
library(DBI)
library(odbc)

# Connect to a DSN defined in the odbc configuration files.
con <- dbConnect(odbc::odbc(), "dev_database")
sql <- "select my_value from my_table"
result_df <- dbGetQuery(con, sql)
A file to store the database DML and DDL (for data preparation occurring in the database prior to the integration step) is additionally generated when using ‘odbc’:
dml_ddl.sql stores SQL statements and other database scripts used to prepare and extract data from the source systems.

Note that when using odbc, the user must update the appropriate odbc configuration files (e.g. odbcinst.ini, odbc.ini).
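A hypothetical odbc.ini entry defining the "dev_database" data source (the driver and server values are placeholders for whatever database is in use):

[dev_database]
Driver = PostgreSQL
Server = localhost
Port = 5432
Database = dev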
JDBC
The following files and directories will be created with a ‘jdbc’ argument:
dml_ddl.sql stores SQL statements and other database scripts used to prepare and extract data from the source systems.

config.yml stores secret information, such as user database credentials.

drivers/ is a directory for storing database driver jars. Note that the user must provide the jars, and the classpath names to the jars, specific to the database type being accessed.

A db_connection argument of ‘jdbc’ will additionally generate a connection code snippet in the 1_integrate.R file. Note that the credentials are sourced from the config.yml file so that they are not exposed in the source code. The config.yml file is listed in the .gitignore file so that it is not committed to source control.
library(RJDBC)

# Credentials are read from config.yml via the 'config' package.
db_credentials <- config::get("dev_database")
drv <- RJDBC::JDBC(driverClass = db_credentials$driver_class,
                   classPath = Sys.glob("drivers/*"))
con <- dbConnect(drv, db_credentials$connection_string,
                 db_credentials$username, db_credentials$password)
sql <- "select my_value from my_table"
result_df <- dbGetQuery(con, sql)
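For illustration, a config.yml consumed by config::get("dev_database") might look like the following (all values are placeholders):

default:
  dev_database:
    driver_class: "org.postgresql.Driver"
    connection_string: "jdbc:postgresql://localhost:5432/dev"
    username: "my_user"
    password: "my_password"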
When using a JDBC connection, the user must provide appropriate driver JARs in the drivers/ directory, as well as user credentials, class path, and connection string in the config.yml file.
The exclude argument prevents specified files from being generated:
generate(project_name = project_name, path = project_directory, exclude = c("api.R", "utilities.R", "readme.md", "lint.R", ".gitignore"))
list.files(project_path)
[1] "data_input" "data_output" "data_working" "docs" "majestic_12.Rproj"
[6] "models" "R"
list.files(project_path_r)
[1] "0_test.R" "1_integrate.R" "2_enrich.R" "3_model.R" "4_evaluate.R" "5_present.Rmd"
[7] "common.R" "explore.R" "mediator.R"
The ‘archetyper’ project is pre-packaged with a working demo project that predicts hospital readmission rates based on publicly-available structural characteristics and complication rates.
The demo project can be generated by running the generate_demo() function.
archetyper::generate_demo()
list.files("hospital_readmissions_demo/")
[1] "data_input" "data_output" "data_working" "docs" "hospital_readmissions_demo.Rproj"
[6] "models" "R" "readme.md" ".gitignore"
Once the demo project has been created, the project should be opened in RStudio.
The full data mining and data science lifecycle can be triggered by executing the mediator.R file. The contents of the mediator.R file are below:
cat(readChar("hospital_readmissions_demo/R/mediator.R", 1e5))
##--------------------------------------------------------------------------
## The mediator file will execute the linear data processing work-flow. -
##--------------------------------------------------------------------------
source("R/common.R")
tryCatch({
info(logger, "running tests...")
source("R/0_test.R")
info(logger, "gathering and integrating data...")
source("R/1_integrate.R")
info(logger, "enriching base data...")
source("R/2_enrich.R")
info(logger, "building model(s)...")
source("R/3_model.R")
info(logger, "applying model(s) to test partitions...")
source("R/4_evaluate.R")
info(logger, "building presentation materials...")
rmarkdown::render("R/5_present.Rmd", "pdf_document", output_dir = "docs")
info(logger, "workflow is complete.")
},
error = function(cond) {
log4r::error(logger, str_c("Script error: ", cond))
}
)
Note that the file includes a centralized logger to distinguish levels of severity (using the ‘log4r’ package) as well as relative directory references (using the ‘here’ package). Comments in all files were created using the ‘bannerCommenter’ package.
A set of publicly-available data files will be loaded by the 1_integrate.R file. The integration step joins and transforms the source files according to Tidy Data principles, and persists the integrated data into the data_working/ directory.
The enrichment step creates features better suited for the modeling process by applying feature engineering methods (such as scaling and centering the numeric features), outlier removal (using Cook’s Distance), imputation (using predictive mean matching, PMM), and feature selection (by removing highly-correlated features). Additionally, training labels are assigned through stratified random sampling. The enriched results are stored in feather format in the data_working/ directory.
In the modeling step, a stepwise linear regression is applied to the training partition of the enriched dataset. The model coefficients, performance statistics, and the model itself are stored in the data_output/ and models/ directories, respectively.
In the evaluation step, the testing partitions of the enriched dataset are applied to the trained model from the models/ directory. The testing data with the appended predictions, along with performance statistics from the testing dataset, are stored in the data_output/ directory as a .csv file.
Finally, an R Markdown report is produced, using files sourced from the data_output/ directory.
The api.R file defines a sample RESTful API that uses the trained model from the models/ directory. The demo API can be called with the sample request body below:
{
"dt": {"state": "AL",
"hospital_type": "acute_care_hospitals",
"hospital_ownership": "government_hospital_district_or_authority",
"emergency_services": "Yes",
"ehr_interop": "Y",
"denominator": 1.4791,
"PSI_10": -1.3895,
"PSI_11": 0.597,
"PSI_12": -0.9487,
"PSI_13": -0.102,
"PSI_14": -1.1131,
"PSI_15": -0.9439,
"PSI_3": 0.029,
"PSI_6": -0.5752,
"PSI_8": -1.4081,
"PSI_9": -0.3028,
"denominator_ln": 1.3489}
}
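A minimal sketch of serving the demo API locally with the ‘plumber’ package (the port is an assumption; the endpoint path is whatever the generated api.R defines, and the working directory is assumed to be the demo project root):

library(plumber)

# Serve the demo API, then POST the JSON body above to the endpoint
# defined in api.R.
pr_run(pr("R/api.R"), port = 8000)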
Note that the demo project was designed simply to illustrate the functionality of the ‘archetyper’ project; it was not designed to produce a production-ready or publication-worthy model.