U.S. Environmental Protection Agency (EPA)’s Toxicity Forecaster (ToxCastTM) program makes in vitro medium- and high-throughput screening assay data publicly available for prioritization and hazard characterization of thousands of chemicals. Invitrodb version 4.2 represents the second full release of the ToxCast database in a new schema designed to accommodate the R package dependency, tcplFit2, which is used in curve-fitting and hit-calling of multi-concentration response data. This invitrodb schema was beta-tested with our research release of invitrodb v4.0 that accompanied the publication “The ToxCast Pipeline: Updates to Curve-Fitting Approaches and Database Structure” (Feshuk et al., 2023) and release of tcpl v3.0.1.
The ToxCast Data Analysis Pipeline (tcpl) is an R package that manages, curve-fits, plots, and stores ToxCast data to populate its linked MySQL database, invitrodb. Tcpl is a flexible analysis pipeline and is capable of efficiently processing and storing large volumes of data. The diverse data, received in heterogeneous formats from numerous vendors, are transformed to a standard computable format and loaded into the tcpl database by vendor-specific R scripts. Once data is loaded into the database, ToxCast utilizes generalized processing functions provided in this package to process, normalize, model, qualify, and visualize the data. Check out tcpl’s vignette for more comprehensive documentation describing ToxCast data processing, retrieval, and interpretation.
Invitrodb v4.2 includes the following notable updates:
Data Updates
Never-before released assay endpoints and data include:
New data for existing endpoints include:
New gene mappings as well as references to publications describing assays or results have been linked.
GD211 assay description documents are available. Information is stored in an UDPATED assay_descriptions table.
Analytical QC data has been added. Information is stored in the NEW chemical_analytical_qc table.
Administered Equivalent Doses has been stored in the UPDATED mc7 table. These are pre-calculated estimates of in vivo AEDs from in vitro bioactive concentrations (AC50, ACC, BMD) using a subset of models from the [High-throughput Toxicokinetics R package (httk)]
Update to Assay Lists
Schema changes
tcplfit2
-dependent curve-fitting. The “logc” column in mc3
(log10-micromolar) has been changed to “conc” (micromolar), so
processing now avoids having to switch between logged and unlogged
concentration values between levels.Curve-fitting
tcplfit2
has been updated to
address concerns with the biphasic polynomial 2 (poly2) model, benchmark
dose (BMD) bounds if 10-fold lower or 10-fold higher than the lowest and
highest tested concentration tested, respectively. For tcpl
implementation, “dt4” (original 4 degrees of freedom t-distribution) is
the default error distribution for all model fits.Directionality of Response
tcplfit2
continuous hit
call (hitc) calculation is the product of 3 proportional weights. It is
evident by its non-normal distribution of all hitc in the database that
hitc can be easily skewed to either 0 or 1. Borderline efficacy may be
observed with high hitc, underscoring the point that hitc should not be
interpreted as a likelihood of a positive hit. Rather, hitc as currently
estimated should likely be binarized into active and inactive. Future
work will investigate whether additional proportional weights should be
added to the continuous hitc calculation, such as to consider noise,
relative efficacy (i.e. scaled top) within endpoint, and/or degree of
equivocality (relativity of top_over_cutoff).Invitrodb v4.1 was released in September 2023 and provided bioactivity data for the CompTox Chemicals Dashboard v2.3 release from December 2023 to present. The database package can be accessed here: https://doi.org/10.23645/epacomptox.6062623.v11. For this release note analysis, the following data will be reviewed
Endpoint-level annotation information combined with invitrodb.gene and invitrodb.intended_target tables to provide information about technological design and biological assay coverage.
Chemical level annotations as well as estimates of cytotoxicity, using invitrodb.cytotox table. This table stores the output of running tcplCytoPt() with default configuration and provides the lower bound estimate of the median, and the median, cytotoxicity threshold concentration estimate for the CompTox Chemicals Dashboard.
Multi-concentration screening data at levels 5 and 6 (mc5 and mc6), which contain the summary of curve-fitting and hit-calling from use of tcplFit2. Data includes the winning model parameters and cautionary flags.
Single-concentration data at level 2 (sc2). The sc2 table contains the summary of the cutoff (threshold for positive), estimate of baseline (baseline median absolute deviation, or bmad), the maximum median effect for the single concentration screened (max_med), and ultimately the hitc (requires that max_med > coff).
Invitrodb v4.2 has been released in FALL 2024. Updates to the bioactivity data available on the CompTox Chemicals Dashboard and CTX APIs is anticipated in 2025. The database package can be accessed here: https://doi.org/10.23645/epacomptox.6062623.v12. For this release note, data as described in ‘Load invitrodb v4.1’ section is output for v4.2.
This section examines changes in processed data between versions at the aggregate database-level.
71 new endpoints and 55 chemicals were added in invitrodb v4.2 when compared to invitrodb v4.1. Additional metrics are provided in the table below.
#create database comparison table
Output <- c("Assay Sources",
"Assays",
"Assay Components",
"Assay Component Endpoints",
"MC Samples",
"SC Samples",
"Chemicals",
"MC Endpoint-Samples (M4IDs)",
"SC Endpoint-Samples (S2IDs)",
"Genes")
v4.1 <- c(length(unique(ace.old$asid)),
length(unique(ace.old$aid)),
length(unique(ace.old$acid)),
length(unique(ace.old$aeid)),
length(unique(old.mc5$spid)),
length(unique(old.sc2$spid)),
length(unique(chmn.old$chnm)),
length(unique(old.mc5$m4id)),
length(unique(old.sc2$s2id)),
length(unique(gene.old$gene_id)))
v4.2 <- c(length(unique(ace.new$asid)),
length(unique(ace.new$aid)),
length(unique(ace.new$acid)),
length(unique(ace.new$aeid)),
length(unique(new.mc5$spid)),
length(unique(new.sc2$spid)),
length(unique(chmn.new$chnm)),
length(unique(new.mc5$m4id)),
length(unique(new.sc2$s2id)),
length(unique(gene.new$gene_id)))
Change <- v4.2-v4.1
Table <- data.frame(Output, v4.1, v4.2, Change)
Since its inception in 2013, ToxCast is continually growing as a data resource as reflected by generally increasing DTXSID, endpoint, and gene coverage.
In v4.1 and v4.2, hit calls (hitc) are continuous as the product of three proportional weights: 1) probability the median response exceeds the cutoff, 2) and probability the top of model exceeds the cutoff, 3) and probability winning model’s AIC is less than that of the constant model.
For current ToxCast work, a \(hitc\) greater than or equal to 0.90 is labeled active, whereas anything less was considered inactive. This threshold of 0.90 was based on other \(tcplfit2\) implementations with in vitro screening data (Nyffeler et al., 2023) and reflects the apparent bimodal nature of the \(hitc\) distribution, where a preponderance of the \(hitc\) fall between 0 and 0.1 and 0.9 and 1.0. Users may interpret the continuous \(hitc\) into active or inactive designations based on different thresholds. Further testing through implementation of this new functionality may reveal appropriate thresholds for different applications or assays. The “unable to fit” series appear as model “none” with a hitc of 0, therefore inactive. In the tcpl v3.2 update, the mc4 bidirectional false method was replaced with new mc5 overwrite bidirectional methods. Instead of forcing data to fit in the positive analysis direction with the bidirectional_false method, all data now is fit bidirectionally, then the overwrite methods will multiply the continuous hit call value by -1 if top is not in intended direction. Previously “inactive” endpoint samples may appear with negative hitc if response was observed in unintended direction. Negative hitc can interpreted as ‘not applicable’ (NA) and may indicate assay interference.
Comparing the relative proportions of activity calls, v4.1 included 90% inactive and 10% active where v4.2 includes 70.9% inactive, 9.8% active, and 19.3% not applicable.
#binary continuous hitcalls (hitc) into activity calls (actc)
new.mc5[hitc >= 0.9, actc:= 'active']
new.mc5[hitc >= 0 & hitc < 0.9, actc:= 'inactive']
new.mc5[hitc < 0, actc:= 'NA']
old.mc5[hitc >= 0.9, actc:= 'active']
old.mc5[hitc >= 0 & hitc < 0.9, actc:= 'inactive']
old.mc5[hitc < 0, actc:= 'NA']
#subset to only endpoint/samples in v4.1
new.mc5$endpoint_sample <- paste0(new.mc5$aeid, "_", new.mc5$spid)
old.mc5$endpoint_sample <- paste0(old.mc5$aeid, "_", old.mc5$spid)
crossover.mc5 <- new.mc5[endpoint_sample %in% old.mc5$endpoint_sample,]
agg_hitc_old <- old.mc5 %>%
group_by(actc) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
agg_hitc_old$version <- "v4.1"
setnames(agg_hitc_old, "actc", "group")
agg_hitc_new <- new.mc5 %>%
group_by(actc) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
agg_hitc_new$version <- "v4.2"
setnames(agg_hitc_new, "actc", "group")
agg_hitc<- rbind(agg_hitc_old, agg_hitc_new)
agg_hitc$group <- factor(agg_hitc$group, levels = c("inactive", "active", "NA"))
ggplot(agg_hitc, aes(x = version, y= n, fill = group)) +
geom_bar(colour="black", stat = "identity", position = position_stack()) +
theme_minimal() +
scale_fill_manual(values=c("#FDE333", "#3AC96D", "#007896")) +
guides(fill = guide_legend(title = "Activity call")) +
labs(y = "Endpoint-sample Proportion", x= "Version") +
theme(axis.text = element_text(size=14),
axis.title = element_text(size=16))+
geom_text(aes(label = paste0(n, " (", labels, ")")), position = position_stack(vjust=0.5), size=5)
When examined, the distribution of the v4.2 continuous hitc is not normally distributed, and values skew towards 0 and 1. All data is bidirectionally fit regardless of the intended direction of biological effect. If a particular direction is considered not biologically interpretable, a negative hit call value will indicate this.
The hitc itself is not a probability of a hit. It is a continuous value, but essentially one that should be interpreted in a binary way. Given the distribution of values is not a product of probabilities, so it is appropriate to binarize these values into active and inactive designations. Future work is being done to update hit calling logic so hitc more accurately reflect borderline or noisy cases.
ggplot(data=new.mc5) +
geom_histogram(aes(x=hitc), bins=20)+
xlab('Hitc')+
ylab('Frequency')+
scale_x_continuous(breaks=seq(-1,1,0.2))+
theme_bw()+
theme(axis.title.x = element_text(face='bold', size=14),
axis.title.y = element_text(face='bold', size=14),
axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12)) + theme_minimal()
Winning model selection between versions can be visualized. This does not distinguish between positive and negative hit calls. In v3.5, a hit call of -1 correspond to a “NA” model fitting. These are concentration-response series that tcpl was unable to fit, usually due to limited number of concentrations tested. In v3.5, there were 428 “-1” hit calls. In v4.0, these series may appear fit in model “none” with a hit call of 0, therefore inactive. In the new tcpl paradigm, constant model can never win.
Models included are: “hill”, “gnls”(gain-loss), “cnst”(constant), “pow”(power), “poly1”(polynomial-1: linear), “poly2”(polynomial-2: quadratic), “exp2”(exponential-2), “exp3”(exponential-3), “exp4”(exponential-4), “exp5”(exponential-5), NA (unable to fit in v3.5), and “none”(unable to fit in v4.0).
modl_agg_old <- old.mc5 %>%
group_by(old_modl) %>%
filter(!is.na(old_modl)) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
modl_agg_old$version <- "v4.1"
setnames(modl_agg_old, "old_modl", "modl")
modl_agg_new <- crossover.mc5 %>%
group_by(new_modl) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
modl_agg_new$version <- "v4.2"
setnames(modl_agg_new, "new_modl", "modl")
modl_agg_comp<- rbind(modl_agg_old, modl_agg_new)
modl_agg_comp$modl<-factor(modl_agg_comp$modl, levels = c("hill", "gnls", "cnst", "pow", "poly1", "poly2", "exp2", "exp3", "exp4", "exp5", "none"))
ggplot(modl_agg_comp, aes(x = version, y= n, fill = modl)) + geom_bar(colour="black", stat = "identity", position = position_fill(reverse = TRUE)) +
scale_fill_manual(values=c("#FDE333", "#C8DF32", "#AF8B52", "#3AC96D", "#00B983", "#00A691", "#009097", "#007896", "#005F8E", "#30437F","#45256B", "#4B0055")) +
guides(fill = guide_legend(title = "Winning Model")) +
theme(axis.text.x = element_text(angle = 65,
vjust = 1,
hjust = 1), text=element_text(size=20),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.key.size = unit(1, 'cm'), #change legend key size
legend.key.height = unit(1, 'cm'), #change legend key height
legend.key.width = unit(1, 'cm'), #change legend key width
legend.title = element_text(size=20), #change legend title font size
legend.text = element_text(size=20)) + #change legend text font size
labs(y = "Endpoint-sample Proportion", x="Version") + theme_minimal()
This data was filtered to remove endpoint-samples that were inactive between both versions. Models included: “hill”, “gnls”(Gain-Loss), “pow”(power), “poly1”(polynomial-1: linear), “poly2”(polynomial-2: quadratic), “exp2”(exponential-2), “exp3”(exponential-3), “exp4”(exponential-4), “exp5”(exponential-5).
The number of actives with poly2 winning model selected increased from 0.6% in v4.1 to 18.2358% in v4.2 as evident in the figure below.
modl_agg_old <- old.mc5 %>%
filter(old_actc %in% 'active') %>%
group_by(old_modl) %>%
filter(!is.na(old_modl)) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
modl_agg_old$version <- "v4.1"
setnames(modl_agg_old, "old_modl", "modl")
modl_agg_new <- crossover.mc5 %>%
filter(new_actc %in% 'active') %>%
group_by(new_modl) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
modl_agg_new$version <- "v4.2"
setnames(modl_agg_new, "new_modl", "modl")
modl_agg_comp<- rbind(modl_agg_old, modl_agg_new)
modl_agg_comp$modl<-factor(modl_agg_comp$modl, levels = c("hill", "gnls", "cnst", "pow", "poly1", "poly2", "exp2", "exp3", "exp4", "exp5", "none"))
ggplot(modl_agg_comp, aes(x = version, y= n, fill = modl)) + geom_bar(colour="black", stat = "identity", position = position_fill(reverse = TRUE)) +
scale_fill_manual(values=c("#FDE333", "#C8DF32", "#AF8B52", "#3AC96D", "#00B983", "#00A691", "#009097", "#007896", "#005F8E", "#30437F","#45256B", "#4B0055")) +
guides(fill = guide_legend(title = "Winning Model")) +
theme(axis.text.x = element_text(angle = 65,
vjust = 1,
hjust = 1), text=element_text(size=20),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.key.size = unit(1, 'cm'), #change legend key size
legend.key.height = unit(1, 'cm'), #change legend key height
legend.key.width = unit(1, 'cm'), #change legend key width
legend.title = element_text(size=20), #change legend title font size
legend.text = element_text(size=20)) + #change legend text font size
labs(y = "Endpoint-sample Proportion", x="Version") + theme_minimal()
datatable(modl_agg_comp,
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
Cautionary flags include:
We can examine the number of flags for active endpoint-samples.
new.mc5[is.na(flag.length), flag.length := 0]
new.mc5.pos <- new.mc5[actc == 'active',]
ggplot(data=new.mc5.pos)+
geom_bar(aes(x=flag.length), stat='count')+
theme_bw()+
ylab('Endpoint-sample Count')+
ggtitle('Flag Count for Active Endpoint-Samples')+
theme(plot.title= element_text(hjust=0.5))+
scale_x_continuous(name='Flag Count',breaks=seq(0,7,1))
Most active endpoint-samples have 0, 1, or 2 flags.
Below is a table of the counts of m4ids (n) that have a particular fit category (fitc) and flag combination, showing only those m4ids with 3 or more flags and hitc > 0.9 (only positives).
new.mc5.pos.ct <- new.mc5[,list(
behavior.ct = .N
), by=list(flag.length, fitc, actc)] %>% data.table()
new.mc5.pos.ct <- new.mc5.pos.ct[!is.na(fitc)]
new.mc5.pos.ct <- new.mc5.pos.ct[order(-behavior.ct)]
all <- ggplot(data=new.mc5.pos.ct, aes(x=flag.length, y = behavior.ct, fill=as.factor(fitc)))+
geom_bar(position='dodge', stat='identity')+
theme_bw()+
facet_grid(rows = 'actc')+
scale_fill_viridis(name = 'Fit category',discrete=TRUE)+
ylab('Count of curve fitc and flag count')+
xlab('Flag count')+
scale_x_continuous(name='Flag Count',breaks=seq(0,7,1))+
coord_cartesian(xlim=c(0,7.5))
gt3flags <- ggplot(data=new.mc5.pos.ct[flag.length>2 & fitc %in% c(36,40)], aes(x=flag.length, y = behavior.ct, fill=as.factor(fitc)))+
geom_bar(position='dodge', stat='identity')+
theme_bw()+
facet_grid(rows = 'actc')+
scale_fill_viridis(name = 'Fit category',option="G", discrete=TRUE)+
ylab('Count of fitc and flag count')+
xlab('Flag count > 2 and Fitc in 36 or 40')+
scale_x_continuous(name='Flag Count',breaks=seq(3,8,1))+
coord_cartesian(xlim=c(3.5,8.5))
plot_grid(all, gt3flags, ncol=2, labels=c('A','B'), label_size = 14)
Point of departure potency estimates were based on modeled active concentration series, including ACC (activity concentration at cutoff), AC10 (activity concentration at 10% of maximal response), and AC50 (activity concentration at 50% of maximal response). tcplFit2 modeling has incorporated new potency and uncertainty estimates, based on the BMDExpress transcriptomics dose-response modeling software, which includes benchmark dose (BMD) estimation (that is the dose at which a defined Benchmark Response (BMR) level occurs). The BMR is defined to be one standard deviation shift from the baseline response for the endpoint(s) of interest. As a new update, the baseline response is estimated using the two lowest concentrations across all tested chemicals or neutral controls, depending on what the endpoint used.
To examine the potency distributions, data was first filtered to remove endpoint-samples that were inactives and only endpoint-samples in both versions. The following analysis can he potency analysis examines \(log_{10}\) potency values, or the log-transformed experimental concentrations in uM. For a more equitable comparison, this analysis will only consider the v4.2 mc5 with endpoint samples included in v4.1.
Compared to \(log_{10}\) potency distributions, v4.2 distributions show BMDs have fewer extreme values given BMD bounds that were implemented in tcplfit2. Other potency estimates remain consistent between versions, with slightly variations likely attributed to minor shifts in baseline median absolute deviations (BMADs).
colnames(crossover.mc5) <- str_replace(colnames(crossover.mc5), "new_", "")
colnames(old.mc5) <- str_replace(colnames(old.mc5), "old_", "")
potency <-rbind(crossover.mc5, old.mc5, fill=TRUE)
potency[hitc >= 0.9, actc:= 'active']
potency[hitc >= 0 & hitc < 0.9, actc:= 'inactive']
potency[hitc < 0, actc:= 'NA']
potency_active <- potency[actc=='active',]
potency_agg_comp_long <- as.data.table(melt(potency_active,
# ID variables - all the variables to keep but not split apart on
id.vars=c("spid" ,"aeid", "version", "actc"),
# The source columns
measure.vars=c( "bmd","ac50", "acc", "ac10"),
# Name of the destination column that will identify the original
# column that the measurement came from
variable.name="potency",
value.name="value"))
#filter to remove instances if NA in either one version (limit to values used in stat test comparison)
potency_agg_comp_long <- potency_agg_comp_long %>%
group_by(spid, aeid, potency) %>%
filter(!any(is.na(value)))
ggplot(data = potency_agg_comp_long, aes(x = value, y = potency, fill=version, na.rm=TRUE)) +
geom_boxplot(outlier.alpha = 0.2) + theme_minimal() +
theme(strip.background = element_blank(),
strip.text.y = element_text(angle=360)) +
scale_x_log10()+
scale_fill_manual(values = c("gray", "white")) +
labs(y= NULL, x = "Potency (uM)") +
guides(fill = guide_legend(title = "Version"))
Activity call in relation to each of these potency values can be considered.
potency_agg_comp_long_all <- as.data.table(melt(potency,
# ID variables - all the variables to keep but not split apart on
id.vars=c("spid" ,"aeid", "version", "actc"),
# The source columns
measure.vars=c( "bmd","ac50", "acc", "ac10"),
# Name of the destination column that will identify the original
# column that the measurement came from
variable.name="potency",
value.name="value"))
ggplot(data = potency_agg_comp_long_all, aes(x = value, y = potency, color=actc, fill=version, na.rm=TRUE)) +
geom_boxplot(outlier.alpha = 0.2) + theme_minimal() +
theme(strip.background = element_blank(),
strip.text.y = element_text(angle=360)) +
scale_x_log10()+
scale_fill_manual(values = c("gray", "white")) +
scale_color_manual(values=c("#3AC96D", "#FDE333", "#007896")) +
labs(y= NULL, x = "Potency (uM)") +
guides(color = guide_legend(title = "Activity Call"), fill = guide_legend(title = "Version"))
#binarize hitc into activity calls (actc)
new.sc2[hitc == -1, actc:= 'NA']
new.sc2[hitc == 0, actc:= 'inactive']
new.sc2[hitc == 1, actc:= 'active']
old.sc2[hitc == -1, actc:= 'NA']
old.sc2[hitc == 0, actc:= 'inactive']
old.sc2[hitc == 1, actc:= 'active']
#subset to only endpoint/samples in v4.1
new.sc2$endpoint_sample <- paste0(new.sc2$aeid, "_", new.sc2$spid)
old.sc2$endpoint_sample <- paste0(old.sc2$aeid, "_", old.sc2$spid)
crossover.sc2 <- new.sc2[endpoint_sample %in% old.sc2$endpoint_sample,]
sc_agg_hitc_old <- old.sc2 %>%
group_by(actc) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
sc_agg_hitc_old$version <- "v4.1"
setnames(sc_agg_hitc_old, "actc", "group")
sc_agg_hitc_new <- new.sc2 %>%
group_by(actc) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
sc_agg_hitc_new$version <- "v4.2"
setnames(sc_agg_hitc_new, "actc", "group")
sc_agg_hitc_cross <- crossover.sc2 %>%
group_by(actc) %>%
count() %>%
ungroup() %>%
mutate(perc = `n` / sum(`n`)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
sc_agg_hitc_cross$version <- "v4.2"
setnames(sc_agg_hitc_cross, "actc", "group")
sc_agg_hitc<- rbind(sc_agg_hitc_old, sc_agg_hitc_new)
sc_agg_hitc$group <- factor(sc_agg_hitc$group, levels = c("inactive", "active", "NA"))
sc_agg_hitc_cross<- rbind(sc_agg_hitc_old, sc_agg_hitc_cross)
sc_agg_hitc_cross$group <- factor(sc_agg_hitc_cross$group, levels = c("inactive", "active", "NA"))
For a more equitable comparison, this analysis will only consider the v4.2 single-conc data with endpoint samples included in v4.1. As with multi-conc data, negative hit calls were introduced after bidirectional fitting to describe responses in the non-biologically intrepretable direction.
ggplot(sc_agg_hitc_cross, aes(x = version, y= n, fill = group)) +
geom_bar(colour="black", stat = "identity", position = position_stack()) +
theme_minimal() +
scale_fill_manual(values=c("#FDE333", "#3AC96D", "#007896")) +
guides(fill = guide_legend(title = "Activity Call")) +
labs(y = "Endpoint-sample Proportion", x= "Version") +
theme(axis.text = element_text(size=14),
axis.title = element_text(size=16))+
geom_text(aes(label = paste0(n, " (", labels, ")")), position = position_stack(vjust=0.5), size = 4)
23 new gene mappings were added. These new genes associated include:
new.genes <- setdiff(gene.new$gene_id, gene.old$gene_id)
datatable(gene.new[gene_id %in% new.genes, c('entrez_gene_id','gene_name', 'gene_symbol')],
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
This section examines changes in processed data between versions at the assay source-level.
Herein we explore the relative changes in activity calls (actc) by assay source between invitrodb v4.1 and this release, invitrodb v4.2. For a more equitable comparison, this analysis will only consider the v4.2 mc5 with endpoint samples included in v4.1. As in Database-level Review, MC Hit Calls, a hitc greater than or equal to 0.90 is labeled active, whereas anything less was considered inactive. A negative hitc for responses observed in unintended direction can interpreted as ‘not applicable’ (NA).
tcplConf(user='_dataminer', pass='pass', db='prod_internal_invitrodb_v4_2', drvr='MySQL', host='ccte-mysql-res.epa.gov')
ids <- tcplLoadAeid(fld="asid",val=1:40, add.fld = c("asid", "asnm", "aid", "anm", "acid", "acnm"))
potency_as <- merge.data.table(potency,
ids[,c('aeid','asnm')],
by = 'aeid',
all.x=TRUE)
as_review <- potency_as %>%
group_by(asnm, version, actc) %>%
count() %>%
group_by(asnm) %>%
mutate(perc = `n` / sum(`n`)*100)
ggplot(as_review, aes(x = "", y = perc, fill = actc)) +
geom_bar(stat="identity", color="black") + facet_grid(rows=vars(as_review$asnm), cols=vars(as_review$version)) +
scale_fill_manual(values=c("#3AC96D","#FDE333", "#007896")) +
guides(fill = guide_legend(title = "Flip Direction")) + coord_flip() + theme_void()
datatable(as_review,
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
This section examines changes in processed data between versions at the individual endpoint-level.
The 67 new endpoints released in v4.2 with multi-conc data include:
new.mc.aeids <- setdiff(new.mc5$aeid, old.mc5$aeid)
datatable(ace.new[aeid %in% new.mc.aeids, c('aeid','assay_component_endpoint_name')],
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
111710 more endpoint-samples (m4ids) were added in v4.2. Excluding new endpoints, new data (9535 m4ids) was added to 169 existing endpoints including:
new.data.in.mc.aeids <- new.mc5[!endpoint_sample %in% old.mc5$endpoint_sample,]
new.data.in.mc.aeids <- new.data.in.mc.aeids[!aeid %in% new.mc.aeids,]
datatable(distinct(new.data.in.mc.aeids[!aeid %in% new.mc.aeids, c('aeid','aenm')]),
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
In addition to more screening, legacy well type (wllt) filters have been removed, so additional control data (wllts other than t, p, or n) have been fully processed. Note: these extraneous wllts do not count towards chemical counts if they do not have EPA registered sample ids.
For a more equitable comparison, this analysis will only consider the v4.2 mc5 with endpoint samples included in v4.1. As in Database-level Review, MC Hit Calls, a hitc greater than or equal to 0.90 is labeled active, whereas anything less was considered inactive. A negative hitc for responses observed in unintended direction can interpreted as ‘not applicable’ (NA).
colnames(crossover.mc5) <- str_replace(colnames(crossover.mc5), "new_", "")
colnames(old.mc5) <- str_replace(colnames(old.mc5), "old_", "")
potency <-rbind(crossover.mc5, old.mc5, fill=TRUE)
endpoint_review <- potency %>%
group_by(aeid, aenm, version, actc) %>%
count() %>%
group_by(aeid) %>%
mutate(perc = (`n` / sum(`n`) *100)) %>%
mutate(actc_endpoint = paste0(actc, ": aeid ", aeid, " (", aenm, ")")) %>%
ungroup()
endpoint_review_wide <-pivot_wider(endpoint_review, id_cols="actc_endpoint", names_from ="version", values_from ="perc")
setDT(endpoint_review_wide)
endpoint_review_wide[,perc.change := (v4.2 - v4.1)]
datatable(endpoint_review_wide,
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
In addition to updating existing assay lists in invitrodb, new assay lists have been added in invitrodb v4.2. The table below summarizes the available assay lists and endpoint counts per list. Note that endpoints can be associated with multiple assay lists.
assay_lists <- tcplQuery("SELECT name, description, count(aeid) FROM assay_list
INNER JOIN assay_list_aeid ON assay_list.assay_list_id=assay_list_aeid.assay_list_id
group by assay_list.assay_list_id;")
datatable(assay_lists,
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
The 8 new endpoints released in v4.2 with single-conc data include:
new.sc.aeids <- setdiff(new.sc2$aeid, old.sc2$aeid)
datatable(ace.new[aeid %in% new.sc.aeids, c('aeid','assay_component_endpoint_name')],
filter='top',
options=list(pagelength=25, autoWidth=FALSE, scrollX=TRUE, initComplete = JS(
"function(settings, json) {",
"$('body').css({'font-family': 'Calibri'});",
"}"
)))
invitrodb v4.2 was processed using ToxCast Data Analysis Pipeline (tcpl) R Package v3.2 with the tcplfit2 Concentration-Response Modeling Utility R Package v0.1.7.
In this version update, tcpl has been integrated with the Computational Toxicology and Exposure (CTX) APIs) via (ctxR). See tcpl’s vignette for more guidance on data retrieval and plotting capabilities with tcpl.
Note: As of Sept 2024, invitrodb v4.1 data is available on the CTX Bioactivity API. Check here for current version information.
tcplPlot is tcpl’s single
flexible plotting function, allowing for interactive yet consistent
visualization of concentration-response curves via customizable
parameters. The standalone plotting utility is built with the R
libraries plotly and
ggplot2 to display the additional
curve-fitting models. The tcplPlot function
requires the selection of a type (type
), field
(fld
), and value (val
) to load the necessary
data and display the associated plots. Notable Plotting updates
include:
lvl
selection is no longer required and is
replaced by type
. type = "mc"
plots available
multi-concentration-response series fit by all models and highlights the
winning model with activity hit call presented. type = "sc"
plots available single-concentration data including response values,
maximum median, and cutoff with activity hit call presented.flag = FALSE
. If type = "sc"
, setting
flag = TRUE
will result in warning, since invitrodb does
not store flags for single-concentration data.yuniform
parameter is used for toggling automatic uniform y-axis scaling whereas
yrange
parameter is used for toggling user-specified
uniform y-axis scaling.val
and compare.val
.
tcplPlot also now supports advanced
comparison plotting across data connections. If working on a local
invitrodb instance, additional dose-response data may be available, data
reprocessed, methods adjusted, etc. Users may wish to compare data
released in different versions, such as comparing CTX Bioactivity API
data or versioned database (invitrodb v4.1 and later) to one’s local
invitrodb database. Using the utility function
tcplPlotLoadData while
connected to one data source, users can switch to a new data
configuration and pass along data via
tcplPlot’s dat
parameter and use
compare.val
for their new connection’s data.See tcpl’s Data Retrieval with invitrodb vignette for more guidance on new plotting features.
The highest level assumption in the in vitro to in vivo extrapolation (IVIVE) approach employed here is that the in vitro bioactive concentration in a ToxCast assay endpoint is roughly equivalent to a human plasma concentration in vivo. For a review of IVIVE and httk models for it, please see: Breen et al, 2021
For invitrodb v4.2 onward, a new multi-concentration level 7 (MC7) table contains pre-generated AED values using several potency metrics from invitrodb and a subset of models from the High-throughput Toxicokinetics R package httk . As implemented, this MC7 table provides users with pre-calculated estimates of the in vivo human administered dose (mg/kg/day) based on the in vitro bioactive concentrations as seen in ToxCast screening data. AEDs can also be calculated ad hoc for ToxCast potency estimates using the High-throughput Toxicokinetics R package (httk). See tcpl’s vignette under Data Interpretation>Administered Equivalent Doses for more information.
Estimates of chemical concentrations that elicit cytotoxicity and/or cell stress have been informative for contextualizing bioactivity screening data in ToxCast by providing information on the likelihood that these data may be confounded by assay interference resulting from cytotoxicity and/or cell stress, particularly when a parallel or in-well estimate of cell viability is unavailable. As such, general estimates of the median and lower bound concentrations that might elicit cytotoxicity and/or cell stress in vitro have previously been calculated using the tcplCytoPt function, which considers activity across a suite of cell-based assays based on updates to previous work (Judson et al., 2016).
These derived concentration threshold values have been released in the “cytotox” table of invitrodb and are also provided on the [CompTox Chemicals Dashboard] (https://comptox.epa.gov/dashboard/batch-search) Bioactivity Summary Plot, as shown in the BPA example below. In some use cases, this cytotoxicity-associated “burst” threshold may be used to infer activity that may represent a false positive response ascribed to assay interference, with the stringency of this interpretation subject to the use case. See tcpl’s vignette, under Data Interpretation>Cytotoxicity Burst Distribution, for more information.
Notable updates include:
In v4.2, a slightly higher global MAD was calculated at 0.1703165 compared to 0.159707 in v4.1. The figures below depict impacts to distributions of the median and lower bounds of the cytotoxicity burst in v4.2 compared to v4.1. In the table, a median and median absolute deviation of these values is calculated per version (with default median and lower bounds are excluded).
Given ToxCast includes a heterogeneous set of assays across a diverse biological space, annotations in the database help flexibly aggregate and differentiate processed data whereas assay documentation aligned with international standardization efforts can make ToxCast data more useful and interpretable for use in decision-making. Assay Element and Auxiliary Annotation structure is described in tcpl’s Introduction vignette, whereas how to access databased annotation information is described in the Data Retrieval vignettes.
The OECD Guidance Document 211 (GD211) is a standard for comprehensive assay documentation describing non-guideline in vitro test methods and their interpretation. This template is intended to harmonize non-guideline, in vitro method descriptions to allow assessment of the relevance of the test method for biological responses of interest and the quality of the data produced. Unlike the assay element annotations which are often short in a standardized format or use a controlled term list, the assay_descriptions fields have no character limit for text. Information has been released in the updated “assay_descriptions” table of invitrodb and are also provided within a compiled report of these assay description documents, available on the ToxCast Downloadable Data page.
There is high value in understanding the outcomes of solubilization and chemical stability in the vehicle chosen to solubilize the chemical, i.e. a chemical’s applicability domain for in vitro screening. This informs what chemicals and samples should screened in future experiments. It also helps inform future structural models to understand which chemicals will be stable and detectable in solubilization, and further provide insight into possible degradation products that could be synthesized or purchased. Most critically, this information promotes understanding of uncertainty in estimates of initial experimental concentration of chemicals.
To establish a resource of applicability domain information at the substance and sample level, a retrospective analysis of the analytical QC data for the ToxCast/Tox21 chemical library was conducted. This involved reviewing legacy reports from gas chromatography-mass spectrometry (GCMS), liquid chromatography-mass spectrometry (LCMS), and Nuclear Magnetic Resonance (NMR) experiments. Additional Analytical QC, such as for the PFAS chemical library, and integration efforts are ongoing. The new “chemical_analytical_qc” table of invitrodb stores this information, which will eventually be surfaced within the [CompTox Chemicals Dashboard] (https://comptox.epa.gov/dashboard/batch-search) ToxCast Bioactivity module.
v4.2 include 7878 substance-level and 24159 sample-level calls. 6231 (79.0936786%) substance-level calls pass, whereas 19246 (79.6638934%) of sample-level calls pass.
Consult full table for additional details, including cautionary flags and annotations, as well as tcpl’s vignette, under Data Interpretation>Analytical QC & Applicability Domain, for more information.
ToxCast provides a standard for consistent and reproducible data processing for diverse, targeted bioactivity assay data with readily available documentation and a unified FAIR software approach. As part of EPA’s commitment to gather and share its chemical data in open and transparent ways, all ToxCast chemical data is publicly available. For more information, visit the ToxCast Home Page
Here is information about R session used to generate this release note:
print(sessionInfo())
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 8.10 (Ootpa)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.15.so; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cowplot_1.1.3 scales_1.3.0 stringr_1.5.1 viridis_0.6.5
## [5] viridisLite_0.4.2 tcpl_3.1.0 tidyr_1.3.1 plotly_4.10.4
## [9] openxlsx_4.2.5.2 jtools_2.2.2 kableExtra_1.4.0 ggthemes_5.1.0
## [13] ggrepel_0.9.5 gridExtra_2.3 ggplot2_3.5.1 DT_0.33
## [17] dplyr_1.1.4 data.table_1.15.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.1 blob_1.2.4
## [4] fastmap_1.1.1 lazyeval_0.2.2 digest_0.6.35
## [7] timechange_0.3.0 lifecycle_1.0.4 RSQLite_2.3.6
## [10] magrittr_2.0.3 compiler_4.4.0 rlang_1.1.3
## [13] sass_0.4.9 tools_4.4.0 utf8_1.2.4
## [16] yaml_2.3.8 knitr_1.46 ctxR_1.0.0
## [19] labeling_0.4.3 htmlwidgets_1.6.4 bit_4.0.5
## [22] xml2_1.3.6 RColorBrewer_1.1-3 withr_3.0.0
## [25] purrr_1.0.2 numDeriv_2016.8-1.1 grid_4.4.0
## [28] fansi_1.0.6 colorspace_2.1-0 cli_3.6.2
## [31] rmarkdown_2.26 crayon_1.5.2 chron_2.3-61
## [34] generics_0.1.3 rstudioapi_0.15.0 httr_1.4.7
## [37] DBI_1.2.2 cachem_1.0.8 pander_0.6.5
## [40] parallel_4.4.0 vctrs_0.6.5 jsonlite_1.8.8
## [43] hms_1.1.3 bit64_4.0.5 tcplfit2_0.1.6
## [46] systemfonts_1.0.5 crosstalk_1.2.1 jquerylib_0.1.4
## [49] proto_1.0.0 glue_1.7.0 lubridate_1.9.3
## [52] stringi_1.8.4 gtable_0.3.5 munsell_0.5.1
## [55] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
## [58] R6_2.5.1 evaluate_0.23 gsubfn_0.7
## [61] highr_0.10 RMariaDB_1.3.1 memoise_2.0.1
## [64] bslib_0.7.0 Rcpp_1.0.12 zip_2.3.0
## [67] svglite_2.1.3 sqldf_0.4-11 xfun_0.43
## [70] pkgconfig_2.0.3