1 Introduction

U.S. Environmental Protection Agency (EPA)’s Toxicity Forecaster (ToxCastTM) program makes in vitro medium- and high-throughput screening assay data publicly available for prioritization and hazard characterization of thousands of chemicals. Invitrodb version 4.2 represents the second full release of the ToxCast database in a new schema designed to accommodate the R package dependency, tcplFit2, which is used in curve-fitting and hit-calling of multi-concentration response data. This invitrodb schema was beta-tested with our research release of invitrodb v4.0 that accompanied the publication “The ToxCast Pipeline: Updates to Curve-Fitting Approaches and Database Structure” (Feshuk et al., 2023) and release of tcpl v3.0.1.

Conceptual overview of the ToxCast Pipeline
Conceptual overview of the ToxCast Pipeline

The ToxCast Data Analysis Pipeline (tcpl) is an R package that manages, curve-fits, plots, and stores ToxCast data to populate its linked MySQL database, invitrodb. Tcpl is a flexible analysis pipeline and is capable of efficiently processing and storing large volumes of data. The diverse data, received in heterogeneous formats from numerous vendors, are transformed to a standard computable format and loaded into the tcpl database by vendor-specific R scripts. Once data is loaded into the database, ToxCast utilizes generalized processing functions provided in this package to process, normalize, model, qualify, and visualize the data. Check out tcpl’s vignette for more comprehensive documentation describing ToxCast data processing, retrieval, and interpretation.

2 Highlights

Invitrodb v4.2 includes the following notable updates:

  • Data Updates

    • Never-before released assay endpoints and data include:

    • New data for existing endpoints include:

      • New data for 169 existing endpoints, including 9535 new endpoint-samples (m4ids). Legacy well type (wllt) filters have been removed, so additional control data (wllts other than t, p, or n) have been fully processed.
    • New gene mappings as well as references to publications describing assays or results have been linked.

    • GD211 assay description documents are available. Information is stored in an UDPATED assay_descriptions table.

    • Analytical QC data has been added. Information is stored in the NEW chemical_analytical_qc table.

    • Administered Equivalent Doses has been stored in the UPDATED mc7 table. These are pre-calculated estimates of in vivo AEDs from in vitro bioactive concentrations (AC50, ACC, BMD) using a subset of models from the [High-throughput Toxicokinetics R package (httk)]

  • Update to Assay Lists

    • Assay lists, anchored by use in CCTE publications and/or projects, were updated or added in invitrodb v4.2. These assay lists will be available as a feature in the CompTox Chemicals Dashboard ToxCast bioactivity grid for filtering.
  • Schema changes

    • Concentration values have been simplified to accommodate tcplfit2-dependent curve-fitting. The “logc” column in mc3 (log10-micromolar) has been changed to “conc” (micromolar), so processing now avoids having to switch between logged and unlogged concentration values between levels.
    • Deprecated tables have been removed: chemical_library, pfas, chemical_assay_count
    • Existing tables have been updated: mc4, mc7, assay_descriptions, and sample
      • Sample table was updated to include additional chemical provenance information per OHT201 requirements.
    • New tables added: chemical_lists and chemical_analytical_qc
      • These chemical_lists contains lists curated by the US EPA in the Distributed Structure-Searchable Toxicity (DSSTox) database. The tcplLoadChemList function allows the user to subdivide the chemical IDs based on presence in different chemical lists.
  • Curve-fitting

    • Fitting functions within tcplfit2 has been updated to address concerns with the biphasic polynomial 2 (poly2) model, benchmark dose (BMD) bounds if 10-fold lower or 10-fold higher than the lowest and highest tested concentration tested, respectively. For tcpl implementation, “dt4” (original 4 degrees of freedom t-distribution) is the default error distribution for all model fits.
    • The mc5_fit_categories table has been updated with the current fit categories (fitc).
    • Model_type has been updated to reflect nature of fitting done (i.e. bidirectional, gain, or loss) as reflected by directionality of response.
  • Directionality of Response

    • Single conc data processing changes to account for directionality. Instead of forcing data to fit in the positive analysis direction with the bidirectional_false method, all data now is fit bidirectionally, then the overwrite methods will multiply the continuous hit call value (hitc) by -1 if top is not in intended direction.
    • The mc4 bidirectional_false method has been replaced with new ow_bidirectional_gain and ow_bidirectional_loss. Instead of forcing data to fit in the positive analysis direction with the bidirectional_false method, all data now is fit bidirectionally, then the overwrite methods will multiply the continuous hitc value by -1 if top is not in intended direction.
    • Summary statistics in the mc4 table appropriately consider the directionality of the response. Specifically, resp_max should be based on the maximum response in the positive or negative direction if bidirectional=TRUE and only the positive direction if bidirectional=FALSE. This known issue previously had not affected the hitc or the potency estimates. Flagging logic has been corrected to incorporate directionality.

3 Known Issues in invitrodb v4.2

  • The probability of top being above cutoff assumes top occurs at max concentration, which is not always the case for a subset of poly2 fits. Re-parameterization of the toplikelihood calculation may be needed in the future.
  • Current implementation of the tcplfit2 continuous hit call (hitc) calculation is the product of 3 proportional weights. It is evident by its non-normal distribution of all hitc in the database that hitc can be easily skewed to either 0 or 1. Borderline efficacy may be observed with high hitc, underscoring the point that hitc should not be interpreted as a likelihood of a positive hit. Rather, hitc as currently estimated should likely be binarized into active and inactive. Future work will investigate whether additional proportional weights should be added to the continuous hitc calculation, such as to consider noise, relative efficacy (i.e. scaled top) within endpoint, and/or degree of equivocality (relativity of top_over_cutoff).

4 Loading data

4.1 Set tcplConf and connections

  • tcplConf() from the library(tcpl) is used to connect to the MySQL database, invitrodb v4.1. Data is also accessible via CCD or the CCTE Bioactivity APIs.
  • library(RMySQL) is no longer supported with tcpl and will automatically be detached prior to use.

4.2 Load invitrodb v4.1

Invitrodb v4.1 was released in September 2023 and provided bioactivity data for the CompTox Chemicals Dashboard v2.3 release from December 2023 to present. The database package can be accessed here: https://doi.org/10.23645/epacomptox.6062623.v11. For this release note analysis, the following data will be reviewed

  • Endpoint-level annotation information combined with invitrodb.gene and invitrodb.intended_target tables to provide information about technological design and biological assay coverage.

  • Chemical level annotations as well as estimates of cytotoxicity, using invitrodb.cytotox table. This table stores the output of running tcplCytoPt() with default configuration and provides the lower bound estimate of the median, and the median, cytotoxicity threshold concentration estimate for the CompTox Chemicals Dashboard.

  • Multi-concentration screening data at levels 5 and 6 (mc5 and mc6), which contain the summary of curve-fitting and hit-calling from use of tcplFit2. Data includes the winning model parameters and cautionary flags.

  • Single-concentration data at level 2 (sc2). The sc2 table contains the summary of the cutoff (threshold for positive), estimate of baseline (baseline median absolute deviation, or bmad), the maximum median effect for the single concentration screened (max_med), and ultimately the hitc (requires that max_med > coff).

4.3 Load invitrodb v4.2

Invitrodb v4.2 has been released in FALL 2024. Updates to the bioactivity data available on the CompTox Chemicals Dashboard and CTX APIs is anticipated in 2025. The database package can be accessed here: https://doi.org/10.23645/epacomptox.6062623.v12. For this release note, data as described in ‘Load invitrodb v4.1’ section is output for v4.2.

5 Database-level Review

This section examines changes in processed data between versions at the aggregate database-level.

5.1 Version Differences

71 new endpoints and 55 chemicals were added in invitrodb v4.2 when compared to invitrodb v4.1. Additional metrics are provided in the table below.

#create database comparison table
Output <- c("Assay Sources", 
            "Assays",
            "Assay Components", 
            "Assay Component Endpoints", 
            "MC Samples",
            "SC Samples",
            "Chemicals",
            "MC Endpoint-Samples (M4IDs)",
            "SC Endpoint-Samples (S2IDs)",
            "Genes")
v4.1 <- c(length(unique(ace.old$asid)),
          length(unique(ace.old$aid)), 
          length(unique(ace.old$acid)),
          length(unique(ace.old$aeid)), 
          length(unique(old.mc5$spid)),
          length(unique(old.sc2$spid)),
          length(unique(chmn.old$chnm)),
          length(unique(old.mc5$m4id)),
          length(unique(old.sc2$s2id)),
          length(unique(gene.old$gene_id)))

v4.2 <- c(length(unique(ace.new$asid)),
          length(unique(ace.new$aid)), 
          length(unique(ace.new$acid)),
          length(unique(ace.new$aeid)), 
          length(unique(new.mc5$spid)),
          length(unique(new.sc2$spid)),
          length(unique(chmn.new$chnm)),
          length(unique(new.mc5$m4id)),
          length(unique(new.sc2$s2id)),
          length(unique(gene.new$gene_id)))

Change <- v4.2-v4.1
Table <- data.frame(Output, v4.1, v4.2, Change)

5.2 Database Change Over Time

Since its inception in 2013, ToxCast is continually growing as a data resource as reflected by generally increasing DTXSID, endpoint, and gene coverage.

5.3 MC Hit Calls

In v4.1 and v4.2, hit calls (hitc) are continuous as the product of three proportional weights: 1) probability the median response exceeds the cutoff, 2) and probability the top of model exceeds the cutoff, 3) and probability winning model’s AIC is less than that of the constant model.

For current ToxCast work, a \(hitc\) greater than or equal to 0.90 is labeled active, whereas anything less was considered inactive. This threshold of 0.90 was based on other \(tcplfit2\) implementations with in vitro screening data (Nyffeler et al., 2023) and reflects the apparent bimodal nature of the \(hitc\) distribution, where a preponderance of the \(hitc\) fall between 0 and 0.1 and 0.9 and 1.0. Users may interpret the continuous \(hitc\) into active or inactive designations based on different thresholds. Further testing through implementation of this new functionality may reveal appropriate thresholds for different applications or assays. The “unable to fit” series appear as model “none” with a hitc of 0, therefore inactive. In the tcpl v3.2 update, the mc4 bidirectional false method was replaced with new mc5 overwrite bidirectional methods. Instead of forcing data to fit in the positive analysis direction with the bidirectional_false method, all data now is fit bidirectionally, then the overwrite methods will multiply the continuous hit call value by -1 if top is not in intended direction. Previously “inactive” endpoint samples may appear with negative hitc if response was observed in unintended direction. Negative hitc can interpreted as ‘not applicable’ (NA) and may indicate assay interference.

Comparing the relative proportions of activity calls, v4.1 included 90% inactive and 10% active where v4.2 includes 70.9% inactive, 9.8% active, and 19.3% not applicable.

#binary continuous hitcalls (hitc) into activity calls (actc)
new.mc5[hitc >= 0.9, actc:= 'active']
new.mc5[hitc >= 0 & hitc < 0.9, actc:= 'inactive']
new.mc5[hitc < 0, actc:= 'NA']

old.mc5[hitc >= 0.9, actc:= 'active']
old.mc5[hitc >= 0 & hitc < 0.9, actc:= 'inactive']
old.mc5[hitc < 0, actc:= 'NA']

#subset to only endpoint/samples in v4.1
new.mc5$endpoint_sample <- paste0(new.mc5$aeid, "_", new.mc5$spid)
old.mc5$endpoint_sample <- paste0(old.mc5$aeid, "_", old.mc5$spid)
crossover.mc5 <- new.mc5[endpoint_sample %in% old.mc5$endpoint_sample,]

agg_hitc_old <- old.mc5 %>% 
  group_by(actc) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
agg_hitc_old$version <- "v4.1"
setnames(agg_hitc_old, "actc", "group")

agg_hitc_new <- new.mc5 %>% 
  group_by(actc) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
agg_hitc_new$version <- "v4.2"
setnames(agg_hitc_new, "actc", "group")

agg_hitc<- rbind(agg_hitc_old, agg_hitc_new)
agg_hitc$group <- factor(agg_hitc$group, levels = c("inactive", "active", "NA"))
ggplot(agg_hitc, aes(x = version, y= n, fill = group)) + 
  geom_bar(colour="black", stat = "identity", position = position_stack()) + 
  theme_minimal() +
  scale_fill_manual(values=c("#FDE333", "#3AC96D", "#007896")) +
  guides(fill = guide_legend(title = "Activity call")) +
  labs(y = "Endpoint-sample Proportion", x= "Version") +
  theme(axis.text = element_text(size=14),
        axis.title = element_text(size=16))+
  geom_text(aes(label = paste0(n, " (", labels, ")")), position = position_stack(vjust=0.5), size=5) 

When examined, the distribution of the v4.2 continuous hitc is not normally distributed, and values skew towards 0 and 1. All data is bidirectionally fit regardless of the intended direction of biological effect. If a particular direction is considered not biologically interpretable, a negative hit call value will indicate this.

The hitc itself is not a probability of a hit. It is a continuous value, but essentially one that should be interpreted in a binary way. Given the distribution of values is not a product of probabilities, so it is appropriate to binarize these values into active and inactive designations. Future work is being done to update hit calling logic so hitc more accurately reflect borderline or noisy cases.

ggplot(data=new.mc5) +
  geom_histogram(aes(x=hitc), bins=20)+ 
  xlab('Hitc')+
  ylab('Frequency')+
  scale_x_continuous(breaks=seq(-1,1,0.2))+
  theme_bw()+
  theme(axis.title.x = element_text(face='bold', size=14),
        axis.title.y = element_text(face='bold', size=14),
        axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) + theme_minimal()

5.4 Impacts of Poly2 Update on Winning Model Selection

5.4.1 Winning Model, Aggregate

Winning model selection between versions can be visualized. This does not distinguish between positive and negative hit calls. In v3.5, a hit call of -1 correspond to a “NA” model fitting. These are concentration-response series that tcpl was unable to fit, usually due to limited number of concentrations tested. In v3.5, there were 428 “-1” hit calls. In v4.0, these series may appear fit in model “none” with a hit call of 0, therefore inactive. In the new tcpl paradigm, constant model can never win.

Models included are: “hill”, “gnls”(gain-loss), “cnst”(constant), “pow”(power), “poly1”(polynomial-1: linear), “poly2”(polynomial-2: quadratic), “exp2”(exponential-2), “exp3”(exponential-3), “exp4”(exponential-4), “exp5”(exponential-5), NA (unable to fit in v3.5), and “none”(unable to fit in v4.0).

modl_agg_old <- old.mc5 %>% 
  group_by(old_modl) %>% 
  filter(!is.na(old_modl)) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
modl_agg_old$version <- "v4.1"
setnames(modl_agg_old, "old_modl", "modl")

modl_agg_new <- crossover.mc5 %>% 
  group_by(new_modl) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
modl_agg_new$version <- "v4.2"
setnames(modl_agg_new, "new_modl", "modl")

modl_agg_comp<- rbind(modl_agg_old, modl_agg_new)

modl_agg_comp$modl<-factor(modl_agg_comp$modl, levels = c("hill", "gnls", "cnst", "pow", "poly1", "poly2", "exp2", "exp3", "exp4", "exp5", "none")) 

ggplot(modl_agg_comp, aes(x = version, y= n, fill = modl)) + geom_bar(colour="black", stat = "identity", position = position_fill(reverse = TRUE)) + 
  scale_fill_manual(values=c("#FDE333", "#C8DF32", "#AF8B52", "#3AC96D", "#00B983", "#00A691", "#009097", "#007896", "#005F8E", "#30437F","#45256B", "#4B0055")) +
  guides(fill = guide_legend(title = "Winning Model")) +
  theme(axis.text.x = element_text(angle = 65, 
                                   vjust = 1, 
                                   hjust = 1), text=element_text(size=20),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.key.size = unit(1, 'cm'), #change legend key size
        legend.key.height = unit(1, 'cm'), #change legend key height
        legend.key.width = unit(1, 'cm'), #change legend key width
        legend.title = element_text(size=20), #change legend title font size
        legend.text = element_text(size=20)) + #change legend text font size
  labs(y = "Endpoint-sample Proportion", x="Version") + theme_minimal()

5.4.2 Winning Model, Actives Only

This data was filtered to remove endpoint-samples that were inactive between both versions. Models included: “hill”, “gnls”(Gain-Loss), “pow”(power), “poly1”(polynomial-1: linear), “poly2”(polynomial-2: quadratic), “exp2”(exponential-2), “exp3”(exponential-3), “exp4”(exponential-4), “exp5”(exponential-5).

The number of actives with poly2 winning model selected increased from 0.6% in v4.1 to 18.2358% in v4.2 as evident in the figure below.

modl_agg_old <- old.mc5 %>% 
  filter(old_actc %in% 'active') %>% 
  group_by(old_modl) %>% 
  filter(!is.na(old_modl)) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
modl_agg_old$version <- "v4.1"
setnames(modl_agg_old, "old_modl", "modl")

modl_agg_new <- crossover.mc5 %>% 
  filter(new_actc %in% 'active') %>% 
  group_by(new_modl) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
modl_agg_new$version <- "v4.2"
setnames(modl_agg_new, "new_modl", "modl")

modl_agg_comp<- rbind(modl_agg_old, modl_agg_new)

modl_agg_comp$modl<-factor(modl_agg_comp$modl, levels = c("hill", "gnls", "cnst", "pow", "poly1", "poly2", "exp2", "exp3", "exp4", "exp5", "none")) 

ggplot(modl_agg_comp, aes(x = version, y= n, fill = modl)) + geom_bar(colour="black", stat = "identity", position = position_fill(reverse = TRUE)) + 
  scale_fill_manual(values=c("#FDE333", "#C8DF32", "#AF8B52", "#3AC96D", "#00B983", "#00A691", "#009097", "#007896", "#005F8E", "#30437F","#45256B", "#4B0055")) +
  guides(fill = guide_legend(title = "Winning Model")) +
  theme(axis.text.x = element_text(angle = 65, 
                                   vjust = 1, 
                                   hjust = 1), text=element_text(size=20),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        legend.key.size = unit(1, 'cm'), #change legend key size
        legend.key.height = unit(1, 'cm'), #change legend key height
        legend.key.width = unit(1, 'cm'), #change legend key width
        legend.title = element_text(size=20), #change legend title font size
        legend.text = element_text(size=20)) + #change legend text font size
  labs(y = "Endpoint-sample Proportion", x="Version") + theme_minimal()

datatable(modl_agg_comp,
          filter='top', 
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

5.5 Caution Flag Review

Cautionary flags include:

We can examine the number of flags for active endpoint-samples.

new.mc5[is.na(flag.length), flag.length := 0]
new.mc5.pos <- new.mc5[actc == 'active',]

ggplot(data=new.mc5.pos)+
  geom_bar(aes(x=flag.length), stat='count')+
  theme_bw()+
  ylab('Endpoint-sample Count')+
  ggtitle('Flag Count for Active Endpoint-Samples')+
  theme(plot.title= element_text(hjust=0.5))+
  scale_x_continuous(name='Flag Count',breaks=seq(0,7,1))

Most active endpoint-samples have 0, 1, or 2 flags.

Below is a table of the counts of m4ids (n) that have a particular fit category (fitc) and flag combination, showing only those m4ids with 3 or more flags and hitc > 0.9 (only positives).

  • Ongoing work will be needed to understand if flags correspond to curve-fits of lower confidence.
  • In general, 1-3 flags may not signify poor fits; indeed some most common flags is simply fewer than 2 replicates per concentration or less than 50% efficacy (which may be commonly observed for assays that are not focused on pharmacological receptor interaction). Flags may encompass assay design.
  • Combining flags and fit category may provide appropriate measures to clean data sets for further interpretation for particular use cases. See the Data Interpretation sections in the tcpl package.
new.mc5.pos.ct <- new.mc5[,list(
  behavior.ct = .N
), by=list(flag.length, fitc, actc)] %>% data.table()

new.mc5.pos.ct <- new.mc5.pos.ct[!is.na(fitc)]
new.mc5.pos.ct <- new.mc5.pos.ct[order(-behavior.ct)]

all <- ggplot(data=new.mc5.pos.ct, aes(x=flag.length, y = behavior.ct, fill=as.factor(fitc)))+
       geom_bar(position='dodge', stat='identity')+
  theme_bw()+
  facet_grid(rows = 'actc')+
  scale_fill_viridis(name = 'Fit category',discrete=TRUE)+
  ylab('Count of curve fitc and flag count')+
  xlab('Flag count')+
scale_x_continuous(name='Flag Count',breaks=seq(0,7,1))+
  coord_cartesian(xlim=c(0,7.5))

gt3flags <- ggplot(data=new.mc5.pos.ct[flag.length>2 & fitc %in% c(36,40)], aes(x=flag.length, y = behavior.ct, fill=as.factor(fitc)))+
       geom_bar(position='dodge', stat='identity')+
  theme_bw()+
  facet_grid(rows = 'actc')+
  scale_fill_viridis(name = 'Fit category',option="G", discrete=TRUE)+
  ylab('Count of fitc and flag count')+
  xlab('Flag count > 2 and Fitc in 36 or 40')+
scale_x_continuous(name='Flag Count',breaks=seq(3,8,1))+
  coord_cartesian(xlim=c(3.5,8.5))

plot_grid(all, gt3flags, ncol=2, labels=c('A','B'), label_size = 14)

5.6 Impacts to Potency Distributions

Point of departure potency estimates were based on modeled active concentration series, including ACC (activity concentration at cutoff), AC10 (activity concentration at 10% of maximal response), and AC50 (activity concentration at 50% of maximal response). tcplFit2 modeling has incorporated new potency and uncertainty estimates, based on the BMDExpress transcriptomics dose-response modeling software, which includes benchmark dose (BMD) estimation (that is the dose at which a defined Benchmark Response (BMR) level occurs). The BMR is defined to be one standard deviation shift from the baseline response for the endpoint(s) of interest. As a new update, the baseline response is estimated using the two lowest concentrations across all tested chemicals or neutral controls, depending on what the endpoint used.

To examine the potency distributions, data was first filtered to remove endpoint-samples that were inactives and only endpoint-samples in both versions. The following analysis can he potency analysis examines \(log_{10}\) potency values, or the log-transformed experimental concentrations in uM. For a more equitable comparison, this analysis will only consider the v4.2 mc5 with endpoint samples included in v4.1.

Compared to \(log_{10}\) potency distributions, v4.2 distributions show BMDs have fewer extreme values given BMD bounds that were implemented in tcplfit2. Other potency estimates remain consistent between versions, with slightly variations likely attributed to minor shifts in baseline median absolute deviations (BMADs).

colnames(crossover.mc5) <- str_replace(colnames(crossover.mc5), "new_", "")
colnames(old.mc5) <- str_replace(colnames(old.mc5), "old_", "")
potency <-rbind(crossover.mc5, old.mc5, fill=TRUE)

potency[hitc >= 0.9, actc:= 'active']
potency[hitc >= 0 & hitc < 0.9, actc:= 'inactive']
potency[hitc < 0, actc:= 'NA']
potency_active <- potency[actc=='active',]

potency_agg_comp_long <- as.data.table(melt(potency_active,
                                  # ID variables - all the variables to keep but not split apart on
                                  id.vars=c("spid" ,"aeid", "version", "actc"),
                                  # The source columns
                                  measure.vars=c( "bmd","ac50", "acc", "ac10"),
                                  # Name of the destination column that will identify the original
                                  # column that the measurement came from
                                  variable.name="potency",
                                  value.name="value"))

#filter to remove instances if NA in either one version (limit to values used in stat test comparison)
potency_agg_comp_long <- potency_agg_comp_long %>% 
  group_by(spid, aeid, potency) %>% 
    filter(!any(is.na(value))) 

ggplot(data = potency_agg_comp_long, aes(x = value, y = potency, fill=version, na.rm=TRUE)) +
  geom_boxplot(outlier.alpha = 0.2) + theme_minimal() +
  theme(strip.background = element_blank(),
        strip.text.y = element_text(angle=360)) +
  scale_x_log10()+
  scale_fill_manual(values = c("gray", "white")) +
  labs(y= NULL, x = "Potency (uM)") +
  guides(fill = guide_legend(title = "Version")) 

Activity call in relation to each of these potency values can be considered.

potency_agg_comp_long_all <- as.data.table(melt(potency,
                                  # ID variables - all the variables to keep but not split apart on
                                  id.vars=c("spid" ,"aeid", "version", "actc"),
                                  # The source columns
                                  measure.vars=c( "bmd","ac50", "acc", "ac10"),
                                  # Name of the destination column that will identify the original
                                  # column that the measurement came from
                                  variable.name="potency",
                                  value.name="value"))

ggplot(data = potency_agg_comp_long_all, aes(x = value, y = potency, color=actc, fill=version, na.rm=TRUE)) +
  geom_boxplot(outlier.alpha = 0.2) + theme_minimal() +
  theme(strip.background = element_blank(),
        strip.text.y = element_text(angle=360)) +
  scale_x_log10()+
  scale_fill_manual(values = c("gray", "white")) +
   scale_color_manual(values=c("#3AC96D", "#FDE333", "#007896")) +
  labs(y= NULL, x = "Potency (uM)") +
  guides(color = guide_legend(title = "Activity Call"), fill = guide_legend(title = "Version"))

5.7 SC Hit Calls

#binarize hitc into activity calls (actc)
new.sc2[hitc == -1, actc:= 'NA']
new.sc2[hitc == 0, actc:= 'inactive']
new.sc2[hitc == 1,  actc:= 'active']

old.sc2[hitc == -1, actc:= 'NA']
old.sc2[hitc == 0, actc:= 'inactive']
old.sc2[hitc == 1, actc:= 'active']

#subset to only endpoint/samples in v4.1
new.sc2$endpoint_sample <- paste0(new.sc2$aeid, "_", new.sc2$spid)
old.sc2$endpoint_sample <- paste0(old.sc2$aeid, "_", old.sc2$spid)
crossover.sc2 <- new.sc2[endpoint_sample %in% old.sc2$endpoint_sample,]

sc_agg_hitc_old <- old.sc2 %>% 
  group_by(actc) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
sc_agg_hitc_old$version <- "v4.1"
setnames(sc_agg_hitc_old, "actc", "group")

sc_agg_hitc_new <- new.sc2 %>% 
  group_by(actc) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
sc_agg_hitc_new$version <- "v4.2"
setnames(sc_agg_hitc_new, "actc", "group")

sc_agg_hitc_cross <- crossover.sc2 %>% 
  group_by(actc) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(perc = `n` / sum(`n`)) %>% 
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))
sc_agg_hitc_cross$version <- "v4.2"
setnames(sc_agg_hitc_cross, "actc", "group")

sc_agg_hitc<- rbind(sc_agg_hitc_old, sc_agg_hitc_new)
sc_agg_hitc$group <- factor(sc_agg_hitc$group, levels = c("inactive", "active", "NA"))

sc_agg_hitc_cross<- rbind(sc_agg_hitc_old, sc_agg_hitc_cross)
sc_agg_hitc_cross$group <- factor(sc_agg_hitc_cross$group, levels = c("inactive", "active", "NA"))

For a more equitable comparison, this analysis will only consider the v4.2 single-conc data with endpoint samples included in v4.1. As with multi-conc data, negative hit calls were introduced after bidirectional fitting to describe responses in the non-biologically intrepretable direction.

ggplot(sc_agg_hitc_cross, aes(x = version, y= n, fill = group)) + 
  geom_bar(colour="black", stat = "identity", position = position_stack()) + 
  theme_minimal() +
  scale_fill_manual(values=c("#FDE333", "#3AC96D", "#007896")) +
  guides(fill = guide_legend(title = "Activity Call")) +
  labs(y = "Endpoint-sample Proportion", x= "Version") +
  theme(axis.text = element_text(size=14),
        axis.title = element_text(size=16))+
  geom_text(aes(label = paste0(n, " (", labels, ")")), position = position_stack(vjust=0.5), size = 4)

5.8 Gene Coverage

23 new gene mappings were added. These new genes associated include:

new.genes <- setdiff(gene.new$gene_id, gene.old$gene_id)

datatable(gene.new[gene_id %in% new.genes, c('entrez_gene_id','gene_name', 'gene_symbol')],
          filter='top',
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

6 Assay Source-level Review

This section examines changes in processed data between versions at the assay source-level.

6.1 MC Assay Sources

Herein we explore the relative changes in activity calls (actc) by assay source between invitrodb v4.1 and this release, invitrodb v4.2. For a more equitable comparison, this analysis will only consider the v4.2 mc5 with endpoint samples included in v4.1. As in Database-level Review, MC Hit Calls, a hitc greater than or equal to 0.90 is labeled active, whereas anything less was considered inactive. A negative hitc for responses observed in unintended direction can interpreted as ‘not applicable’ (NA).

tcplConf(user='_dataminer', pass='pass', db='prod_internal_invitrodb_v4_2', drvr='MySQL', host='ccte-mysql-res.epa.gov')
ids <- tcplLoadAeid(fld="asid",val=1:40, add.fld = c("asid", "asnm", "aid", "anm", "acid", "acnm"))

potency_as <- merge.data.table(potency,
                        ids[,c('aeid','asnm')],
                        by = 'aeid',
                        all.x=TRUE)

as_review <- potency_as %>% 
  group_by(asnm, version, actc)  %>%
  count() %>% 
  group_by(asnm) %>%
  mutate(perc = `n` / sum(`n`)*100) 

ggplot(as_review, aes(x = "", y = perc, fill = actc)) +
  geom_bar(stat="identity", color="black") + facet_grid(rows=vars(as_review$asnm), cols=vars(as_review$version)) +
  scale_fill_manual(values=c("#3AC96D","#FDE333", "#007896")) +
  guides(fill = guide_legend(title = "Flip Direction")) + coord_flip() + theme_void()

datatable(as_review,
          filter='top', 
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

7 Endpoint-level Review

This section examines changes in processed data between versions at the individual endpoint-level.

7.1 New MC Endpoints

The 67 new endpoints released in v4.2 with multi-conc data include:

new.mc.aeids <- setdiff(new.mc5$aeid, old.mc5$aeid)

datatable(ace.new[aeid %in% new.mc.aeids, c('aeid','assay_component_endpoint_name')],
          filter='top',
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

111710 more endpoint-samples (m4ids) were added in v4.2. Excluding new endpoints, new data (9535 m4ids) was added to 169 existing endpoints including:

new.data.in.mc.aeids <- new.mc5[!endpoint_sample %in% old.mc5$endpoint_sample,]
new.data.in.mc.aeids <- new.data.in.mc.aeids[!aeid %in% new.mc.aeids,] 

datatable(distinct(new.data.in.mc.aeids[!aeid %in% new.mc.aeids, c('aeid','aenm')]),
          filter='top',
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

In addition to more screening, legacy well type (wllt) filters have been removed, so additional control data (wllts other than t, p, or n) have been fully processed. Note: these extraneous wllts do not count towards chemical counts if they do not have EPA registered sample ids.

7.2 MC Activity Review

For a more equitable comparison, this analysis will only consider the v4.2 mc5 with endpoint samples included in v4.1. As in Database-level Review, MC Hit Calls, a hitc greater than or equal to 0.90 is labeled active, whereas anything less was considered inactive. A negative hitc for responses observed in unintended direction can interpreted as ‘not applicable’ (NA).

colnames(crossover.mc5) <- str_replace(colnames(crossover.mc5), "new_", "")
colnames(old.mc5) <- str_replace(colnames(old.mc5), "old_", "")

potency <-rbind(crossover.mc5, old.mc5, fill=TRUE)

endpoint_review <- potency %>% 
  group_by(aeid, aenm, version, actc)  %>%
  count() %>% 
  group_by(aeid) %>%
  mutate(perc = (`n` / sum(`n`) *100)) %>%
  mutate(actc_endpoint = paste0(actc, ": aeid ", aeid, " (", aenm, ")")) %>%
  ungroup()

endpoint_review_wide <-pivot_wider(endpoint_review, id_cols="actc_endpoint", names_from ="version", values_from ="perc")
setDT(endpoint_review_wide)
endpoint_review_wide[,perc.change := (v4.2 - v4.1)]

datatable(endpoint_review_wide,
          filter='top', 
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

7.3 Assay Lists

In addition to updating existing assay lists in invitrodb, new assay lists have been added in invitrodb v4.2. The table below summarizes the available assay lists and endpoint counts per list. Note that endpoints can be associated with multiple assay lists.

assay_lists <- tcplQuery("SELECT name, description, count(aeid) FROM assay_list
                         INNER JOIN assay_list_aeid ON assay_list.assay_list_id=assay_list_aeid.assay_list_id 
                         group by assay_list.assay_list_id;")

datatable(assay_lists,
          filter='top', 
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

7.4 SC New Endpoints

The 8 new endpoints released in v4.2 with single-conc data include:

new.sc.aeids <- setdiff(new.sc2$aeid, old.sc2$aeid)

datatable(ace.new[aeid %in% new.sc.aeids, c('aeid','assay_component_endpoint_name')],
          filter='top',
          options=list(pagelength=25, autoWidth=FALSE,  scrollX=TRUE, initComplete = JS(
    "function(settings, json) {",
    "$('body').css({'font-family': 'Calibri'});",
    "}"
  )))

8 Notable tcpl Updates

invitrodb v4.2 was processed using ToxCast Data Analysis Pipeline (tcpl) R Package v3.2 with the tcplfit2 Concentration-Response Modeling Utility R Package v0.1.7.

8.1 API integration

In this version update, tcpl has been integrated with the Computational Toxicology and Exposure (CTX) APIs) via (ctxR). See tcpl’s vignette for more guidance on data retrieval and plotting capabilities with tcpl.

Note: As of Sept 2024, invitrodb v4.1 data is available on the CTX Bioactivity API. Check here for current version information.

8.2 Plotting

tcplPlot is tcpl’s single flexible plotting function, allowing for interactive yet consistent visualization of concentration-response curves via customizable parameters. The standalone plotting utility is built with the R libraries plotly and ggplot2 to display the additional curve-fitting models. The tcplPlot function requires the selection of a type (type), field (fld), and value (val) to load the necessary data and display the associated plots. Notable Plotting updates include:

  • Level lvl selection is no longer required and is replaced by type. type = "mc" plots available multi-concentration-response series fit by all models and highlights the winning model with activity hit call presented. type = "sc" plots available single-concentration data including response values, maximum median, and cutoff with activity hit call presented.
  • Flags can be output on plots. The default option is flag = FALSE. If type = "sc", setting flag = TRUE will result in warning, since invitrodb does not store flags for single-concentration data.
  • Custom y-axis scaling is now possible. The yuniform parameter is used for toggling automatic uniform y-axis scaling whereas yrange parameter is used for toggling user-specified uniform y-axis scaling.
  • Comparison plotting allows users to plot 2 concentration-response curves on the same axis when they supply a list or vector of the same length to val and compare.val. tcplPlot also now supports advanced comparison plotting across data connections. If working on a local invitrodb instance, additional dose-response data may be available, data reprocessed, methods adjusted, etc. Users may wish to compare data released in different versions, such as comparing CTX Bioactivity API data or versioned database (invitrodb v4.1 and later) to one’s local invitrodb database. Using the utility function tcplPlotLoadData while connected to one data source, users can switch to a new data configuration and pass along data via tcplPlot’s dat parameter and use compare.val for their new connection’s data.

See tcpl’s Data Retrieval with invitrodb vignette for more guidance on new plotting features.

9 Administered Equivalent Doses

The highest level assumption in the in vitro to in vivo extrapolation (IVIVE) approach employed here is that the in vitro bioactive concentration in a ToxCast assay endpoint is roughly equivalent to a human plasma concentration in vivo. For a review of IVIVE and httk models for it, please see: Breen et al, 2021

For invitrodb v4.2 onward, a new multi-concentration level 7 (MC7) table contains pre-generated AED values using several potency metrics from invitrodb and a subset of models from the High-throughput Toxicokinetics R package httk . As implemented, this MC7 table provides users with pre-calculated estimates of the in vivo human administered dose (mg/kg/day) based on the in vitro bioactive concentrations as seen in ToxCast screening data. AEDs can also be calculated ad hoc for ToxCast potency estimates using the High-throughput Toxicokinetics R package (httk). See tcpl’s vignette under Data Interpretation>Administered Equivalent Doses for more information.

10 Cytotoxicity Burst Distrubution

Estimates of chemical concentrations that elicit cytotoxicity and/or cell stress have been informative for contextualizing bioactivity screening data in ToxCast by providing information on the likelihood that these data may be confounded by assay interference resulting from cytotoxicity and/or cell stress, particularly when a parallel or in-well estimate of cell viability is unavailable. As such, general estimates of the median and lower bound concentrations that might elicit cytotoxicity and/or cell stress in vitro have previously been calculated using the tcplCytoPt function, which considers activity across a suite of cell-based assays based on updates to previous work (Judson et al., 2016).

These derived concentration threshold values have been released in the “cytotox” table of invitrodb and are also provided on the [CompTox Chemicals Dashboard] (https://comptox.epa.gov/dashboard/batch-search) Bioactivity Summary Plot, as shown in the BPA example below. In some use cases, this cytotoxicity-associated “burst” threshold may be used to infer activity that may represent a false positive response ascribed to assay interference, with the stringency of this interpretation subject to the use case. See tcpl’s vignette, under Data Interpretation>Cytotoxicity Burst Distribution, for more information.

Notable updates include:

  • Endpoints annotated as “burst assays” therefore used for cytotoxicity burst calculation have changed. v3.5 had 91 burst endpoints, whereas v4.0 currently had 73 given excluded bidirectional BSK and APR endpoints. For v4.1, logic in the tcplCytoPt function has been updated so losses in cell viability for these bidirectional endpoints can be included with any cell proliferation responses excluded. Total number of endpoints included in the burst calculation is now 90.
  • If a chemical is a hit in fewer than 5% of burst assay endpoints screened or is a hit in only 1 assay endpoint, the default cytotox_median (3 on the log10 scale or 1000 micromolar on the arithmetic scale) will be assigned because we lack enough data to compute a median and assume some estimate of variance in cell stress/cytotoxicity data.
  • Chemicals in invitrodb that have not been tested in any “burst assays” are assigned this default value of \(1\) millimolar as well. This was chosen as an arbitrary high concentration given it’s outside the typical testing range for ToxCast data and based on the principle that all compounds are toxic if given in high enough concentration.

In v4.2, a slightly higher global MAD was calculated at 0.1703165 compared to 0.159707 in v4.1. The figures below depict impacts to distributions of the median and lower bounds of the cytotoxicity burst in v4.2 compared to v4.1. In the table, a median and median absolute deviation of these values is calculated per version (with default median and lower bounds are excluded).

11 OECD GD211 Assay Description Documents

Given ToxCast includes a heterogeneous set of assays across a diverse biological space, annotations in the database help flexibly aggregate and differentiate processed data whereas assay documentation aligned with international standardization efforts can make ToxCast data more useful and interpretable for use in decision-making. Assay Element and Auxiliary Annotation structure is described in tcpl’s Introduction vignette, whereas how to access databased annotation information is described in the Data Retrieval vignettes.

The OECD Guidance Document 211 (GD211) is a standard for comprehensive assay documentation describing non-guideline in vitro test methods and their interpretation. This template is intended to harmonize non-guideline, in vitro method descriptions to allow assessment of the relevance of the test method for biological responses of interest and the quality of the data produced. Unlike the assay element annotations which are often short in a standardized format or use a controlled term list, the assay_descriptions fields have no character limit for text. Information has been released in the updated “assay_descriptions” table of invitrodb and are also provided within a compiled report of these assay description documents, available on the ToxCast Downloadable Data page.

12 Analytical QC and Applicability Domain

There is high value in understanding the outcomes of solubilization and chemical stability in the vehicle chosen to solubilize the chemical, i.e. a chemical’s applicability domain for in vitro screening. This informs what chemicals and samples should screened in future experiments. It also helps inform future structural models to understand which chemicals will be stable and detectable in solubilization, and further provide insight into possible degradation products that could be synthesized or purchased. Most critically, this information promotes understanding of uncertainty in estimates of initial experimental concentration of chemicals.

To establish a resource of applicability domain information at the substance and sample level, a retrospective analysis of the analytical QC data for the ToxCast/Tox21 chemical library was conducted. This involved reviewing legacy reports from gas chromatography-mass spectrometry (GCMS), liquid chromatography-mass spectrometry (LCMS), and Nuclear Magnetic Resonance (NMR) experiments. Additional Analytical QC, such as for the PFAS chemical library, and integration efforts are ongoing. The new “chemical_analytical_qc” table of invitrodb stores this information, which will eventually be surfaced within the [CompTox Chemicals Dashboard] (https://comptox.epa.gov/dashboard/batch-search) ToxCast Bioactivity module.

v4.2 include 7878 substance-level and 24159 sample-level calls. 6231 (79.0936786%) substance-level calls pass, whereas 19246 (79.6638934%) of sample-level calls pass.

Consult full table for additional details, including cautionary flags and annotations, as well as tcpl’s vignette, under Data Interpretation>Analytical QC & Applicability Domain, for more information.

13 Reproducibility

ToxCast provides a standard for consistent and reproducible data processing for diverse, targeted bioactivity assay data with readily available documentation and a unified FAIR software approach. As part of EPA’s commitment to gather and share its chemical data in open and transparent ways, all ToxCast chemical data is publicly available. For more information, visit the ToxCast Home Page

Here is information about R session used to generate this release note:

print(sessionInfo())
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 8.10 (Ootpa)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.15.so;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] cowplot_1.1.3     scales_1.3.0      stringr_1.5.1     viridis_0.6.5    
##  [5] viridisLite_0.4.2 tcpl_3.1.0        tidyr_1.3.1       plotly_4.10.4    
##  [9] openxlsx_4.2.5.2  jtools_2.2.2      kableExtra_1.4.0  ggthemes_5.1.0   
## [13] ggrepel_0.9.5     gridExtra_2.3     ggplot2_3.5.1     DT_0.33          
## [17] dplyr_1.1.4       data.table_1.15.4
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    farver_2.1.1        blob_1.2.4         
##  [4] fastmap_1.1.1       lazyeval_0.2.2      digest_0.6.35      
##  [7] timechange_0.3.0    lifecycle_1.0.4     RSQLite_2.3.6      
## [10] magrittr_2.0.3      compiler_4.4.0      rlang_1.1.3        
## [13] sass_0.4.9          tools_4.4.0         utf8_1.2.4         
## [16] yaml_2.3.8          knitr_1.46          ctxR_1.0.0         
## [19] labeling_0.4.3      htmlwidgets_1.6.4   bit_4.0.5          
## [22] xml2_1.3.6          RColorBrewer_1.1-3  withr_3.0.0        
## [25] purrr_1.0.2         numDeriv_2016.8-1.1 grid_4.4.0         
## [28] fansi_1.0.6         colorspace_2.1-0    cli_3.6.2          
## [31] rmarkdown_2.26      crayon_1.5.2        chron_2.3-61       
## [34] generics_0.1.3      rstudioapi_0.15.0   httr_1.4.7         
## [37] DBI_1.2.2           cachem_1.0.8        pander_0.6.5       
## [40] parallel_4.4.0      vctrs_0.6.5         jsonlite_1.8.8     
## [43] hms_1.1.3           bit64_4.0.5         tcplfit2_0.1.6     
## [46] systemfonts_1.0.5   crosstalk_1.2.1     jquerylib_0.1.4    
## [49] proto_1.0.0         glue_1.7.0          lubridate_1.9.3    
## [52] stringi_1.8.4       gtable_0.3.5        munsell_0.5.1      
## [55] tibble_3.2.1        pillar_1.9.0        htmltools_0.5.8.1  
## [58] R6_2.5.1            evaluate_0.23       gsubfn_0.7         
## [61] highr_0.10          RMariaDB_1.3.1      memoise_2.0.1      
## [64] bslib_0.7.0         Rcpp_1.0.12         zip_2.3.0          
## [67] svglite_2.1.3       sqldf_0.4-11        xfun_0.43          
## [70] pkgconfig_2.0.3