The U.S. Environmental Protection Agency's ToxCast research program uses high throughput screening (HTS) for profiling bioactivity and predicting the toxicity of large numbers of chemicals. ToxCast Phase I tested 309 well-characterized chemicals in more than 500 assays for a wide range of molecular targets and cellular responses. Of the 309 environmental chemicals in Phase I, 256 were linked to high-quality rat multigeneration reproductive toxicity studies in the relational Toxicity Reference Database. Reproductive toxicants were defined here as having achieved a reproductive lowest-observed-adverse-effect level of less than 500 mg kg−1 day−1. Eight-six chemicals were identified as reproductive toxicants in the rat, and 68 of those had sufficient in vitro bioactivity to model. Each assay was assessed for univariate association with the identified reproductive toxicants. Significantly associated assays were linked to gene sets and used for the subsequent predictive modeling. Using linear discriminant analysis and fivefold cross-validation, a robust and stable predictive model was produced capable of identifying rodent reproductive toxicants with 77% ± 2% and 74% ± 5% (mean ± SEM) training and test cross-validation balanced accuracies, respectively. With a 21-chemical external validation set, the model was 76% accurate, further indicating the model's potential for prioritizing the many thousands of environmental chemicals with little to no hazard information. The biological features of the model include steroidal and nonsteroidal nuclear receptors, cytochrome P450 enzyme inhibition, G protein-coupled receptors, and cell signaling pathway readouts—mechanistic information suggesting additional targeted, integrated testing strategies and potential applications of in vitro HTS to risk assessment.
Current chemical evaluations in the United States range from those providing either little to no evidence of safety for most industrial chemicals to an expensive battery of animal tests for food-use pesticides that offers little mechanistic insights. No in vivo toxicology test uses more animals than the rat multigeneration reproductive test. It has been estimated that 70% of the total cost and 90% of the animal use for compliance with Registration, Evaluation, Authorization, and Restriction of Chemicals (REACH) will be due to reproductive toxicity testing . Addressing the existing chemical evaluation bottleneck can only be achieved through progressive changes to the current animal testing paradigm. A promising resource for addressing this bottleneck is computational toxicology, a field that integrates tools from computer science, bio- and chemi-informatics, molecular biology, and high throughput screening (HTS). Currently prescribed in vivo tests for chemical toxicity are resource-intensive, particularly for multigeneration reproductive and prenatal developmental assessment. Policy directives such as the Cosmetics Directive of the European Union (EU) call for the elimination of animals for evaluating reproductive toxicity by 2013 for cosmetic products and development of alternative methods for safety evaluation. In the past, significantly less attention has been spent modeling or predicting chemical-induced reproductive toxicity relative to efforts spent modeling cancer and other endpoints. Reasons for the meager effort in this area include a lack of reference animal toxicity data to model as well as the molecular and physiological complexity of maternal-fetal interactions, life stages, and generational sensitivities . Recent efforts capturing in vivo reproductive toxicity studies into databases and in vitro bioactivity profiling have enabled the development of predictive, mechanistic, and pathway-based models for these complex reproductive outcomes.
The Toxicity Reference Database (ToxRefDB) has been the primary tool for storing and accessing high-quality toxicology studies and is available online for searching and download . ToxRefDB has characterized thousands of studies using a standardized vocabulary, a uniform structure across study types, and a high level of internal and external quality control (QC) for the extraction of endpoints useful in developing predictive models . The primary study for assessing reproductive effects of chemicals is the multigeneration reproductive test (Office of Prevention, Pesticides, and Toxic Substances 870.3800 and Organization for Economic Cooperation and Development [OECD] 416), which is typically conducted under continuous exposure to male and female rats from 10 wk premating through lactation in the second generation. From multigeneration reproductive studies in ToxRefDB, we have the capacity to identify individual or aggregated endpoints for predictive modeling across hundreds of chemicals and have made comparisons across generations to identify adverse impacts on developmentally sensitive reproductive endpoints based on the prevalence of specific endpoints at later generations compared to the first generation . Generational comparisons using ToxRefDB have also been part of the OECD evaluation of the proposed Extended One-Generation Reproductive Toxicity Study (EOGRTS). ToxRefDB was the primary database used in the large-scale retrospective analysis aimed at evaluating the impact of the second generation on risk assessments and classification and labeling (C&L) in Europe . However, the acceptance of the EOGRTS in lieu of the existing two-generation test will not alleviate the chemical testing bottleneck for the many thousands of chemicals in commerce. One set of solutions to this testing bottleneck are alternative methods for chemical prioritization and intelligent, targeted testing decisions.
The use of alternative methods as part of an integrated reproductive and developmental toxicity testing strategy is currently being developed as a battery of in silico, in vitro, and in vivo tests [7, 8]. One component of this toolbox is the large-scale bioactivity profiling of chemicals in HTS and high-content assays. The ToxCast research project of the U.S. Environmental Protection Agency (EPA) has produced a substantial amount of HTS data on environmental chemicals for developing predictive models of toxicity . Phase I of ToxCast profiled 309 toxicologically well-characterized chemicals in more than 500 assays using nine technologies, including cell-free HTS assays and cell-based assays. ToxCast HTS data and multigeneration reproductive toxicity data from ToxRefDB provide an effective dataset for developing predictive toxicology models. In the present study, we present a robust and stable predictive model of chemically induced reproductive toxicity that demonstrates external predictivity useful for targeting testing prioritizations and significantly advancing predictive and computational toxicology.
MATERIALS AND METHODS
Phase I of the U.S. EPA's ToxCast program employed a chemical library containing 320 samples consisting of 309 unique structures, with five duplicates that were differently sourced and three triplicates as technical repeats for internal QC. The rationale for chemical selection was based on several criteria: extensive chronic, cancer, multigeneration reproductive, and developmental assay data available (95% of compounds met this criteria); soluble in dimethyl sulfoxide (DMSO; −1 < log P < 6 [i.e., log of the octanol/water partition coefficient]; 97.5% met this criteria); molecular weight range of 250-1000 (90% met this criteria); and commercially available with purity of greater than 90% (98% met this criteria). These criteria were largely satisfied with a diverse set of pesticide active ingredients that have had guideline in vivo toxicology studies conducted as part of their registration process with the U.S. EPA. Several other miscellaneous chemicals of environmental concern meeting these criteria were also included in the library. Despite its large representation of pesticidal actives, the Phase I chemical library spans a wide range of property values and is quite structurally diverse, representing more than 40 chemical functional classes (e.g., pyrazole, sulfonamide, organochlorine, and pyrethroid) and more than 24 known pesticidal mode-of-action classes (e.g., phenylurea herbicides, organophosphate insecticides, and dinitroaniline herbicides). A complete listing of the quality-reviewed and structure-annotated chemical library is available for download as a Structure Data Format (SDF) file at the DSSTox website .
Chemicals comprising the ToxCast Phase I library were commercially procured and plated by BioFocus DPI. Supplier-provided certificates of analysis indicated a purity of greater than 97% for the majority of chemicals (87%) and of greater than 90% for all but a few instances of technical grade or known mixtures. Follow-up analysis of an original solution plate by BioFocus DPI using liquid chromatography/mass spectrometry subsequent to assay screening has confirmed mass identification, stability, and purity for more than 83% of the chemical library. For the majority of the remaining chemicals, currently employed methods of analysis are known or suspected to be inadequate for confirming sample purity, and for the remaining 8% of the chemicals, follow-up studies have provided some evidence of sample decomposition in DMSO over time. A QC summary result mapped to chemical solution sample is provided on the ToxCast website in association with assay results . All chemicals were included in the analysis regardless of analytic results but were accounted for throughout the analysis process.
In Vivo (Class Data)
Multigeneration reproductive toxicity testing study design and treatment group information along with all treatment-related effects were manually collected into the U.S. EPA's ToxRefDB. The database structure, standardized vocabulary and ontology, and QC procedures have been described previously . To date, ToxRefDB has captured 393 acceptable reproductive studies across 353 chemicals, equating to 14, 347, and 32 one-, two-, and three- generation studies, respectively. An acceptable study can be defined as any study that adequately followed the multigeneration testing guideline, primarily determined by regulatory toxicologists from the U.S. EPA's Office of Pesticide Programs and for which the review of the study contains sufficient information for complete entry into ToxRefDB. Of the 309 ToxCast chemicals, 256 chemicals have been linked to an acceptable reproductive study entered in ToxRefDB, with 242 exact structural matches, 4 close structural matches presumed to be toxicological equivalents (e.g., parent-to-salt, salt-to-parent, or different isomeric forms) not already linked to a ToxCast chemical, 4 close structural matches already linked to a ToxCast chemical (e.g., fluazifop-butyl and fluazifop-p-butyl), and 6 parent-to-metabolite pairs (e.g., diethylhexyl phthalate, phthalic acid, and mono-2-ethylhexyl ester). An additional 39 chemicals have unacceptable reproductive studies, whereas 14 chemicals have no data available in ToxRefDB.
In ToxRefDB, 650 unique effects were observed across the entire multigeneration reproductive toxicity study dataset, ranging from body weight decreases to organ weight changes to litter survival to fertility decrements. Each unique effect was mapped to one of three multigeneration study categories: parental (e.g., body weight, liver weight, and other systemic toxicities), reproductive (e.g., primarily fertility and early offspring survival), and offspring (e.g., offspring weight, longer-term offspring survival, and other systemic offspring toxicities during their juvenile period). Specifically, 120 effects were directly related to reproductive outcomes, and another 175 effects indicated adverse offspring outcomes, with the remainder being systemic parental effects . Based on the review of each study, primarily by regulatory toxicologists from the U.S. EPA's Office of Pesticide Programs, lowest-observed-adverse-effect levels (LOAELs) were established for the parental, offspring, and reproductive categories based on the weight of evidence and expert judgment of the reviewer. The reproductive LOAEL (rLOAEL) was used to delineate a positive and negative set for reproductive toxicity based on a 500 mg kg−1 day−1. This cutoff value approximates the testing limit of 1000 mg kg−1 day−1 in the reproductive test guideline and accounts for the large uncertainty around the dose intake measurements and standard conversions used in capturing the dosing information across hundreds of chemicals and more than 30 yr of toxicity testing. Any chemical with an rLOAEL of less than or equal to the cutoff was considered to be positive for reproductive toxicity, and any chemical with an rLOAEL of greater than the cutoff or that was not assigned an rLOAEL by the study reviewer was considered to be negative for reproductive toxicity. Specific effects within this endpoint category include reproductive performance measures (e.g., fertility, mating, and gestational interval), male and female reproductive tract effects (e.g., testis, epididymis, ovary, and uterus pathology and weight, along with sperm measures and morphology), and sexual developmental landmarks (e.g., preputial separation, vaginal opening, and anogenital distance). Teratogenic endpoints from prenatal toxicity testing were not included as part of the definition of a reproductive toxicant for the purposes of this modeling effort. Additional information regarding the treatment groups, including the life stage and generation of the animals and the administered dose, were captured in ToxRefDB to provide additional context for each chemical's reproductive toxicity potential.
In Vitro (Features)
As part of the ToxCast research program, the chemical library was tested in more than 500 assays across nine technologies, including cell-free HTS assays and cell-based assays in multiple human and rodent primary and derived cell lines. A complete overview of the assays, assay selection, analysis methods, quality measures, and assay annotation have been previously published . In general, concentration at half-maximal efficacy (AC50) values or lowest-effective concentrations (LECs) were derived for each assay and time point, where applicable. The complete dataset, including AC50/LEC values and corresponding concentration response data for all chemical-assay measurement pairs, is available from the U.S. EPA's ToxCast website . For the purpose of predictive modeling, assays form the input features and can be thought of as the right side of the equation, where some linear combination of these assays or sets of assays is equal to the class data (the reference in vivo endpoint).
The AC50/LEC values were −log3 transformed (−log3[AC50/1000]), and a value of zero was given to all negative assay results. A log3 transformation and setting negatives to 1000 μM was used over a log10 transformation and setting negatives to 1 mol/L, as has been done in previous publications of ToxCast results , to enhance the scoring range between high- and low-potency active chemicals and to decrease the distance between active (i.e., achieving an AC50 and defined as a hit in the assay) and inactive chemicals. Therefore, the “assay score” where the AC50 was 100 μM would have a value of roughly two, whereas one where the AC50 was 100 nM would have a value of roughly eight. A “gene score” or “gene-set score” was derived based on the average assay score across a set of closely related assays (e.g., assays mapped to a single gene or gene family). Any chemical active in 10 or fewer assays (≤2% aggregate active) was removed from the initial model development due to the lack of information provided by the chemical's bioactivity fingerprint to discern active and inactive for any toxicity. The rationale for excluding the chemicals with little or no in vitro activity is based on the following logic: Specific chemicals may lack activity in in vitro assays for a number of reasons, including chemical degradation, aqueous insolubility, lack of metabolic activation, or volatility. Such chemicals would be characterized by little to no activity across a broad range of in vitro assays. Because this behavior is, at least to some extent, relevant only to the in vitro systems, these chemicals are not good candidates for including in a model predicting in vivo activity. They were thus excluded from the training set. However, their exclusion is making no statement of a chemical's true reproductive toxicity potential.
Model (Class and Features)
The first step in the development of a predictive model was univariate feature selection. Each assay was compared to the training set of chemicals, both positive and negative for reproductive toxicity, using continuous and dichotomous statistical methodology, including linear (Pearson) correlation test, chi-square test, and t-test, with the level of significance returned as P-values. Each assay with a P-value of less than 0.1 from any method passed the initial feature selection filter. The resulting assays were then grouped by gene or assay family, as described above, to form the input features for subsequent modeling. In some instances, assays that were not statistically significantly associated but that provided orthogonal or complimentary readouts for the same target were included in model development. This was performed for various nuclear receptor targets in which cell-based transcription factor assays were significantly associated with reproductive toxicity, whereas the more specific cell-free binding assays were not because of the low number of active chemicals. The highly specific assays provide increased evidence that a chemical interacts with a particular target. Significantly associated assays that were part of a large assay family or that were highly correlated to other higher prioritized assays, based on relative P-value and correlation, were excluded to minimize the total number of assays moving into the model development phase. For example, as part of ToxCast, 54 G protein-coupled receptor (GPCR)-binding assays were evaluated, with 18 being significantly associated with reproductive toxicity. Of those 18 GPCR assays, five were selected based on having the greatest correlation collectively; adding further GPCR assays only lowered the overall association to reproductive toxicity.
Based on the selection of a small and balanced feature set, the prediction of reproductive toxicity potential was performed using linear discriminant analysis (LDA). Fivefold cross-validation was used to explore the stability of the resultant model, a process of developing the model using 80% of the chemical set and testing the model accuracy with the remaining 20% and repeating five times until all data have been used as both training and test datasets. The resulting cross-validation statistics are presented as the average and standard deviation of the training- and test-set balanced accuracies across the five runs. Additionally, a subset of chemicals with positive findings in unacceptable studies within ToxRefDB and chemicals with clear literature evidence of reproductive toxicity or no reproductive toxicity were used to assess the forward predictivity of the resultant model and to serve as an initial external validation set.
The quality and forward predictivity of any model is limited by the quality of the feature and class data being used in the model development process. Therefore, strict and transparent methods were used for identifying the training set used in the initial modeling effort from both in vivo (i.e., class) data and in vitro (i.e., feature) data perspectives. Of the 256 chemicals linked to an acceptable reproductive study, 86 reported an rLOAEL of less than 500 mg kg−1 day−1 (Table 1). The additional 12 chemicals that reported an rLOAEL from an unacceptable reproductive study were not incorporated into the initial model development process but were used for model assessment and external validation of the model. Six chemicals had an rLOAEL above the 500 mg kg−1 day−1 cutoff and were considered to be negative for modeling purposes; these chemicals included fluoxastrobin (862 mg kg−1 day−1), trifloxysulfuron sodium (631 mg kg−1 day−1), propoxycarbazone sodium (1314 mg kg−1 day−1), oxasulfuron (1115 mg kg−1 day−1), isoxaben (1000 mg kg−1 day−1), and propamocarb hydrochloride (1000 mg kg−1 day−1). The toxicity profile for these chemicals primarily consisted of high-dose systemic parental and offspring toxicities leading to confounding sexual developmental landmark findings and early offspring survival decrements. Of the 98 chemicals identified as reproductive toxicants (i.e., 86 from acceptable and 12 from unacceptable studies), 49 chemicals had treatment-related effects on the male and/or female reproductive tract, 51 caused decrements in reproductive performance, 67 affected early offspring survival, and 18 altered sexual development. A combined model of reproductive toxicity is presented, as opposed to individual models of each endpoint class, because of the large overlap in chemicals across these endpoint classes, the lack of gender-specific phenotypes, and the lack of mechanistic information in the guideline multigeneration reproductive studies.
Ninety-eight chemicals, linked to 86 acceptable and 12 unacceptable studies in ToxRefDB, achieved a reproductive LOAEL (rLOAEL ≤ 500 mg/kg per day) and used as the positive class set for the training and testing of the predictive model.
A significant number of ToxCast chemicals had little to no in vitro activity across hundreds of assays. Aggregate activity for each chemical was calculated as the number of actives divided by the total number of assays used in this analysis (n = 512). A 2% activity cutoff was established based on the minimal impact of aggregate in vitro activity on the sensitivity and, to a limited degree, specificity of resulting models. In total, 62 chemicals were identified as falling below the 2% cutoff and were not used in the initial model development process. Table 2 summarizes the chemical counts for each chemical group based on in vivo reproductive study acceptability/availability and aggregate in vitro activity. The entire chemical library was split into these groups to identify a chemical set with the capacity to develop a stable and robust model without the negative impacts of low in vivo multigeneration study quality or potential limited amenability to in vitro screening. Thus, chemical group A was selected for the initial development of the predictive reproductive toxicity model, including internal cross-validation. Groups B, C, and D were used to evaluate the stability and identify the current weaknesses, limitations, and gaps of the model. Groups E and F have also provided insight regarding the forward predictivity of the model based on available open-literature reproductive toxicity studies. In conjunction with Table 2, a schematic of the full decision process, including chemical groupings, class definitions (i.e., positive or negative for reproductive toxicity), and final summary model statistics, is provided as an overview and guidepost to the remaining, more detailed results (Fig. 1).
Chemical groupings based on aggregate in vitro activity across the over 500 ToxCast assays and in vivo reproductive study acceptability/availability within ToxRefDB.*
Of the 206 chemicals used in the initial development of the predictive reproductive toxicity model (i.e., chemical group A), 68 were identified as reproductive toxicants—roughly one third of the total. In relating the in vitro bioactivity to these reproductive toxicants, a set of assays and genes were identified as significant indicators of reproductive toxicity based on their univariate association. In total, 36 of more than 500 assays were selected for model development and subsequently mapped to genes or gene sets (Table 3). The primary genes identified were nuclear receptors, both steroidal and nonsteroidal, and included the androgen receptor (AR), estrogen receptor alpha (ERα; ESR1), and peroxisome proliferator-activated receptors alpha and gamma (PPARA and PPARG, respectively). These molecular targets have extensive literature detailing their role in normal reproductive function as well as reproductive and endocrine toxicity. A number of cytochrome P450 enzyme (CYP) inhibition assays, including aromatase (CYP19A1), were also significantly associated with the reproductive toxicants. Interestingly, besides the human aromatase assay, rat CYP assays had increased association to the endpoint as compared to the human CYP assays. For the purposes of the model and based on the increase in overall statistical correlation, all associated rodent CYP assay scores, as well as aromatase, were averaged and used as a single feature, called CYP. In addition to these genes and assay sets, individual assays representing cell-based markers of growth factor stimulation and cell signaling, including epidermal growth factor 1 (EGFR1), transforming growth factor beta 1 (TGFB1), vesicular monoamine transporter 2 (VMAT2), and nuclear factor kappa light-chain enhancer of activated B cells (NFKB), were other positive indicators of reproductive toxicity potential. These assays were also averaged together as a miscellaneous set of assays and called OTHER. As part of ToxCast, 54 GPCR-binding assays were evaluated, with 18 being significantly associated with reproductive toxicity. Of those 18 GPCR assays, five were selected based on having the greatest correlation collectively; adding further GPCR assays only lowered the overall association to reproductive toxicity. Assays targeting the pregnane X receptor (PXR, NR1L2) were negatively correlated with reproductive toxicity potential and used in the model development process with the expectation of providing some indication of the metabolic clearance of the chemical or representing general nuclear receptor promiscuity.
Feature selection statistics based on univariate correlations and associations between individual assays or genes/gene-sets and reproductive toxicants in chemical group A, dichotomously represented (i.e., 1 for positive and 0 for negative).
Using the combination of the selected gene/gene-set scores, a multivariate linear classifier was developed using LDA and fivefold cross validation. The feature set included PPARA (average −log3[AC50/1000] across three assays), AR (average −log3[AC50/1000] across three assays), ESR1 (average −log3[AC50/1000] across seven assays), PPARG (average −log3[AC50/1000] across four assays), CYP (average −log3[AC50/1000] across seven assays), GPCR (average −log3[AC50/1000] across five assays), OTHER (average −log3[AC50/1000] across four assays), and NR1L2 (average −log3[AC50/1000] across three assays) for a total of eight features. Figure 2 demonstrates the relative impact on classification rates between individual assays, genes/gene sets, and the final model. In general, we find that aggregating multiple related assays into a single feature increased the classification rate and yielded a more balanced and stable model. Grouping the assays by gene and gene sets also allows assays with low hit prevalence that would otherwise not be included in the model to contribute to the overall assessment of whether a chemical interacts with a specific molecular target.
Using the eight gene and gene-set features, a robust (i.e., high predictivity with high balanced accuracy; >70%) and stable (i.e., high test cross-validation and external validation accuracies; >70%) classifier or predictive model was generated as shown by the resulting model statistics (Table 4). The cross-validation balanced accuracy (equal to the average of sensitivity and specificity) for the training and test sets, averaged across all five runs, was 77% and 74% for the training set and the test set, respectively, with a standard deviation of 2% and 5%, respectively. Conversely, using the single most significantly associated assay per gene or gene set resulted in training and test balanced accuracies of 71% and 64%, respectively, illustrating the loss in predictivity and model stability when relying on a single assay to represent a molecular target or pathway. After demonstrating stability across the cross-validation runs, a model generated using all 206 group A chemicals was optimized, resulting in a balanced accuracy of 80% (P = 4.2E-17), indicating a highly predictive model for reproductive hazard.
Performance metrics for the predictive model of reproductive toxicity, including cross-validation and optimized model statistics and weighting of model input features.
Chemical group B was not included in the initial model development because of the lack of in vitro bioactivity across hundreds of assays. Interestingly, a comparable prevalence of reproductive toxicants was observed in chemical group B, with 18 of the 50 chemicals characterized as active (36% active vs. 64% inactive). Only 20 chemicals in group B were active across any of the 33 assays or seven input features that positively indicated reproductive toxicity. If the model is applied to chemical group B only, the balanced accuracy is 54%, with a very low sensitivity of 11%. If the model is applied to chemical groups A and B, balanced accuracy and sensitivity drop to 75% and 66%, respectively. The diminished model performance, especially in terms of sensitivity when including low in vitro activity chemicals, provides justification for considering these chemicals as outside the domain of in vitro biological applicability, akin to the domain analysis performed in structure-activity studies, and provides no evidence as to the safety or toxicity of the chemical. In real-world applications of this reproductive toxicity model, chemicals could be identified for follow-up analysis ranging from traditional animal toxicity testing to additional in vitro screening attempting to address confounding issues such as chemical decomposition, aqueous insolubility, or volatility to the application of purely in silico models.
Chemical groups C and D are comprised of 39 chemicals that have been tested in guideline reproductive studies that were deemed to be unacceptable for a variety of reasons, including quality of the review, dose selection, and guideline adherence. It would not be expected that these studies were deemed to be unacceptable because of false-positive findings; therefore, the 12 chemicals designated as reproductive toxicants were used to demonstrate external predictivity of the model. Examples of such chemicals include the putative antiandrogen prochloraz  and the possible endocrine-disrupting chemical fenitrothion , both of which were predicted to be positive for reproductive toxicity. In total, 7 of the 12 reproductive toxicants in chemicals groups C or D were predicted to be positive. The same presumption for the positive findings cannot be extended to the negative findings across studies flagged as unacceptable. For example, the male reproductive toxicant boric acid  caused only limited reproductive effects in the unacceptable guideline multigeneration reproductive study and showed little in vitro activity (chemical group D), possibly as a result of limited amenability to HTS.
Chemical groups E and F have no guideline-based multigeneration reproductive toxicity study entered into ToxRefDB and, in most cases, have never had such a study performed. However, of the 14 chemicals in groups E and F, nine were linked to reproductive toxicity tests available in the open literature. Varying sources and degrees of evidence can be found for reproductive toxicity: methoxychlor and its metabolite 2,2-bis-(p-hydroxyphenyl)-1,1,1-trichloroethane (HPTE) based on positive findings in numerous pubertal and other in vivo assays [16–19], bromoxynil based on EU labeling as a reproductive toxicant (R62), methyl cellosolve (2-methoxyethanol) based on reproductive findings in multiple systemic repeat-dose and multigenerational studies , and monocrotophos based on male and female reproductive toxicity across multiple studies [21, 22]. Equivocal evidence of reproductive toxicity could be found for alachlor  based on non-dose-dependent effects on ovarian weight and pregnancy index effects, which did not result in an rLOAEL being determined. Dimethyl phthalate and its metabolite methyl hydrogen phthalate [24, 25] as well as butralin  were considered to be negative for reproductive toxicity based on the available studies. The model correctly divided this subset of chemicals as reproductive toxicants or not with the exception of monocrotophos, which was in the low in vitro activity group (chemical group F). Interestingly, alachlor, which showed limited evidence of reproductive toxicity, was predicted to be positive and was just above the cutoff or model intercept, which could readily be interpreted as an equivocal prediction. In summary, five of six chemicals with literature evidence of reproductive or endocrine toxicity were accurately predicted, whereas all three negative chemicals were accurately predicted.
The remaining five chemicals had no reproductive toxicity information available in the literature and were candidates for forward predictions. Based on the model, symclosene and phenoxyethanol were predicted to be negative, but it should be noted that the chemicals had low confidence in their purity from the analytical QC and/or low in vitro activity. Three chemicals with no reproductive toxicity data were predicted to be positives, including diniconazole, niclosamide, and clorophene. Diniconazole, similar to many of the other conazoles, demonstrated CYP inhibition, which was highly associated with decrements in early offspring survival. Niclosamide displayed fairly potent PPARG agonist activity in multiple assays (top 5 of 309 chemicals for aggregate PPARG activity), which was associated most with male and female reproductive tract effects. AR binding was observed for clorophene at potencies similar to the CYP inhibition findings, which were both associated with delays in sexual development and decrements in reproductive performance. These results provide examples of how in vitro screening leading to targeted testing could be used to identify chemicals as potential reproductive toxicants based on model predictions. Additionally, the components of the predictive model have increased associations with specific endpoints and can help make recommendations about study design, including incorporating more sensitive or mechanistic endpoints into the study. A summary of the external validation (i.e., chemicals that are not used in training or testing the model and that have sufficient ToxRefDB or literature data to confidently classify the chemical as a reproductive toxicant or not) and forward validation (i.e., chemicals for which a prediction has been made but that have no available evidence of whether the chemical is a reproductive toxicant) chemical sets demonstrates the forward predictivity of the model and provides examples of predictions made on chemicals with no reproductive toxicity information available (Table 5). Of the 21 external validation chemicals, 12 were accurately predicted as reproductive toxicants, 5 were incorrectly predicted as negative, and 4 were accurately predicted to be negative, resulting in an external validation accuracy of 76% and a balanced accuracy of 85%.
External validation chemical set used to test the forward predictivity of the model.
In practice, the use of a predictive reproductive toxicity model can assist in prioritizing further targeted testing. Using chemical group A, we demonstrate the utility of this model in decision making and how it could assist in alleviating the current chemical testing bottleneck. Depending on prioritization goals, increasing or decreasing the optimal balanced cutoff would alter the specificity, sensitivity, and predictivity of the applied model (Fig. 3). Using a high cutoff, testing the top 30 scoring chemicals would yield 26 reproductive toxicants. On the other hand, to identify the vast majority of reproductive toxicants (57 of the 66 total reproductive toxicants), one would have to test the top 136 of 206 scoring chemicals. If the prioritization task was to follow up with an expensive and time-consuming multigeneration reproductive study in a short period of time, then a more specific approach (i.e., higher cutoff) might be more appropriate. If the prioritization task was to follow up with a medium throughput assay capable of testing many chemicals, then a more sensitive approach (i.e., lower cutoff) could be used, ensuring the testing strategy catches as many potential reproductive toxicants as possible. A maximum sensitivity of 86% and a maximum specificity of 97% are achieved dependent on the cutoff, which can be adjusted to the prioritization task.
Beyond the accurate prediction of reproductive toxicants identified solely from animal studies, we have compiled the available EU C&L for reproductive toxicity (R60&62 for fertility and R61&63 for developmental toxicity) in Table 6. Of the 206 group A chemicals, 19 have been reviewed for EU classification, and of these 19 chemicals, 7 have been classified for fertility (R60&62), 8 for developmental toxicity (R61&63), and 4 for neither. In all, 14 of the 15 R60&63 classified chemicals were predicted by the current model to be positive. Only the metabolically activated benomyl was a false negative using the predictive model . All four nonclassified chemicals were predicted to be negative, but it should be noted that these chemicals could have been unclassified as a result of insufficient data to assess C&L. As opposed to the risk assessment process, in which quantitative dose-response information is needed, the C&L process evaluates the intrinsic hazard of a substance. The output of the predictive reproductive toxicity model appears to be well suited to C&L.
Comparison of predictive model results to classification and labeling for reproductive toxicity.
The results of the present analysis demonstrate that in vitro HTS data can be used to predict developmentally sensitive reproductive toxicity in the rat. The capacity to use ToxCast HTS data, costing roughly $20,000–$30,000 per chemical for more than 500 assays, in predicting the reproductive toxicity of hundreds to thousands of chemicals could transform the way in which chemicals are prioritized and selected for targeted reproductive toxicity testing. Reproductive toxicity testing is animal intensive, time-consuming, and costly. Current testing requirements are expanding internationally beyond conventional pesticides to industrial chemicals and other chemical domains. Past, present, and future multigeneration reproductive studies characterize reproductive toxicity through the integrated assessment of more than 100 potential endpoints across varying life stages and generations. Even with these large numbers of measured endpoints, the imprecise nature of many of the endpoints limits the ability to identify gender and life-stage specificity, let alone mechanisms of action. The complexity of the biology, physiology, and study design are primary reasons for using molecular and cellular markers to model reproductive toxicity, but these complexities are also the reasons why previous modeling efforts have not shown dramatic success. Therefore, we have focused not only on the model development but also on the detailed capture and uniform assessment of the reference in vivo reproductive toxicity information leading to a predictive and biologically relevant model that can be applied not just to testing prioritization but also to refinement or even replacement.
The overall accuracy and predictivity of the current model based on the cross-validation statistics and examples of forward predictivity demonstrate its potential for use in an integrated evaluation strategy for environmental chemicals. Additionally, the model is specific to reproductive toxicity and is not modeling general systemic toxicity, as evident by the lack of concordance with the systemic parental and offspring LOAELs. It should be noted that the ToxCast assay data are being used concurrently to develop independent predictive models of cancer as well as systemic and developmental toxicities. Once further model performance assessment has been performed on models developed using ToxCast data, the models could be combined into an integrated testing strategy. As a starting point in this process, the current reproductive toxicity model underwent performance-based assessment demonstrating its strengths and limitations. For example, chemicals that require metabolic activation, such as benomyl or molinate, will not be predicted as a reproductive toxicant by this model, at least not until HTS data using metabolically competent systems are available [27, 28]. Additionally, chemicals such as boric acid, which likely causes its male reproductive toxicity through nonmolecular interactions, demonstrate limitations of the current model [15, 28] and point to the larger issue of chemical domain of applicability. The ToxCast Phase I chemical set contains a large number of conventional pesticides. The ToxCast Phase II chemical library contains approximately 700 chemicals with more diverse structural and use characteristics, including on-the-market and failed pharmaceuticals, food additives, antimicrobials, and other industrial chemicals. ToxCast Phase II will provide a robust external validation set testing the forward predictivity of the current model and evaluating the model's chemical domain of applicability. An advantage of developing predictive models using quantitative HTS data linked to genes, proteins, and pathways is the ability to identify gaps in the mechanisms covered by the model. Additionally, chemicals predicted to be reproductive toxicants that caused minimal reproductive toxicity in the multigenerational study have at times instead been shown to cause reproductive-related effects in either chronic, developmental, or other types of studies. Examples of reproductive related effects for triclosan and bensulide [29, 30] from other study types demonstrate the difficulty in definitively calling training-set chemicals positive or negative for reproductive toxicity.
Among the 21 chemicals selected for external validation, the model provided accurate predictions for 16 of the chemicals. The five chemicals with inaccurate predictions provide valuable insight regarding potential limitations or gaps of the model. Interestingly, the five chemicals had a common phenotypic profile with respect to reproductive toxicity. Tribufos, spiroxamine, tefluthris, disulfoton, and esfenvalerate all caused reduced early offspring survival, particularly litter size decrease, with little to no accompanying effects on reproductive performance or reproductive tract pathology. The rLOAELs for all five chemicals were set at the high dose tested based on the early offspring survival effects, and the parental and offspring LOAELs were set at the lower dose levels. Based on the inclusive definition used for defining a positive for reproductive toxicity for model development, all five were considered to be positive, but all five lack evidence of specific fertility-related or developmentally sensitive reproductive outcomes. Nonetheless, a gap in model predictivity has been identified and could potentially be filled using additional assay technologies, physical-chemical properties and structural descriptors, or acute or short-term in vivo studies.
The model development process identified biologically plausible features and pathways from more than 500 assays mapped to hundreds of genes and spanning many reproductively relevant modes of action. PPARα activity was clearly associated with reproductive toxicity, with all 10 PPARα agonists in the training set (chemical group A) causing reproductive toxicity. Putative PPARα agonists (lactofen , imazalil , diclofop-methyl , DEHP , MEHP , and PFOA ) and environmental chemicals identified as potential PPARα agonists through the ToxCast research program (fluazinam, emamectin benzoate, vinclozolin, and fenthion) span many chemical classes yet share a relatively common reproductive toxicity profile—namely, a decrease in reproductive performance (i.e., decreased fertility) in 8 of the 10 chemicals. Although a mechanistic link between PPAR activity and fertility or other reproductive impairments remains unclear , the role of PPAR in steroid metabolism and its activity in reproductive tissues allows us to infer that it is a plausible target for disruption of endocrine signaling and altered gametogenesis.
The AR and ESR1 activity was also associated with reproductive toxicity. The ToxCast receptor profiling identified most, if not all, of the known antiandrogenic and estrogenic chemicals in the current dataset, including well-studied chemicals such as vinclozolin, bisphenol A, methoxychlor, HPTE, and clorophene. The role of potency in determining a chemical's relative reproductive toxicity potential needs to be explored further, considering that five of the top seven scoring ESR1 activators (i.e., active across multiple ESR1 assays and at relatively low concentrations) did not cause substantial reproductive toxicity in vivo, including flumetralin, fenhexamid, fludioxonil, pyridaben, and endosulfan. Additionally, the impact of weak or partial nuclear receptor agonists and antagonists on reproductive toxicity potential and other toxicities needs to be explored further.
Inhibition of the CYP enzyme, as compared to gene induction, was significantly more associated with reproductive toxicity. Alterations in steroid metabolism through CYP induction have been previously associated with reproductive impairment ; however, the nonspecific inhibition of CYPs may be a surrogate for a chemical's capacity to disturb steroid metabolism, including inhibition of key CYPs involved in steroidogenesis (e.g., CYP19 and CYP17). Related to CYP activity, NR1L2 interestingly displayed a negative correlation/association with reproductive toxicity. In general, NR1L2 lowered the false-positive rate of the model by lowering the model score of chemicals with nonspecific and low-potency nuclear receptor activity. Robust NR1L2 activity is an indication of potent xenosensing and potentially rapid metabolism.
The pyrethroid class of pesticides has shown limited reproductive toxicity in guideline toxicity studies, although limited evidence does link pyrethroid exposure to decreased human sperm quality . Of the 10 pyrethroids in chemical group A, only resmethrin was considered to be a reproductive toxicant based on the criteria described in this manuscript. All 10 pyrethroids displayed low-potency activity across one or more of the selected features, including AR, ESR1, and PPARG, but not CYP. Without the down-weighting based on each of their NR1L2 activities, the pyrethroids would have all been predicted to be reproductive toxicants.
A major component of the model not directly related to nuclear receptor biology and xenobiotic/steroid metabolism was GPCR binding. Numerous GPCR-binding assays were significantly associated with reproductive toxicity. Those chosen to represent the GPCR family were selected for statistical, not biological, reasons, because the literature contains limited information on their role in reproduction, in contrast to their well-characterized role in nervous system function.
Platforms measuring EGFR, TGFB1, and NFKB activity were also associated with reproductive toxicity and make up the OTHER feature. All three gene products have been shown to modulate the relative sensitivity of developmental toxicants, especially aryl-hydrocarbon receptor signaling [37, 38], and may be indicative of altered xenobiotic metabolism, cellular proliferation, cell-cell signaling, or potential epigenetic effects [39, 40].
Overall, the key targets in the model identify plausible modes of action leading to reproductive toxicity covering antiandrogenic, estrogenic, cholesterol/steroid metabolism, limited coverage of disruption of steroidogenesis, and altered xenobiotic metabolism modes of action.
Limited efforts have been made toward the development of models predictive of reproductive toxicity, due in part to the lack of reference data with which to model. One resource for predictive models have come from structure-based methods (i.e., quantitative structure-activity relationship [QSAR] models), but the accuracy and predictivity of the resultant models have been limited. A comprehensive effort toward the prediction of reproductive and developmental toxicity was undertaken by the U.S. Food Drug Administration . The resultant QSAR models were developed for endpoints such as sperm effects, female reproductive toxicity, and male reproductive toxicity and generally were highly specific, with an average specificity across all generated models being 88%. However, the average balanced accuracy across all models was 58%, with the maximum balanced accuracy for any single model being 68% for predictive trans-species female reproductive toxicity. It is difficult to assess the true accuracy and forward predictivity of the models based solely on the summary statistics, but the balanced accuracy values provide the most direct and unbiased comparison to the current model. Most likely, the limitation lies in the physiological complexity of reproductive toxicity and structural diversity of reproductive toxicants. The current predictive model has improved accuracy compared to any published QSAR model of reproductive toxicity and provides additional mechanistic information and indications of specific reproductive effects. The model also can be extended to include new data either covering the gene targets in the current model or new gene targets of other potential reproductive toxicity modes of action. Additional international efforts are underway with the goal of using alternative testing approaches in the detection of reproductive toxicants and have shown promise on limited chemical sets . However, the current ToxCast-based approach utilizes hundreds of diverse biological-chemical activities associated with many potential modes of action leading to reproductive toxicity. The output of the current model provides a binary classification. Applications beyond hazard identification and testing prioritization may require dose-response and even mechanistic information. To accomplish this, research is underway incorporating toxicokinetic information into the modeling process using primary rodent and human hepatocytes, plasma protein binding, and pharmacokinetic modeling intended to reverse engineer the expected oral dose required to achieve a particular in vitro bioactivity level . Experimentally and computationally deriving dosimetry relevant to in vivo exposures has the potential to provide quantitative dose-response information that can be incorporated into the modeling process. For example, the in vitro constitutive androstane receptor (CAR, NR1L3) and NR1L2 activity on a set of conazole fungicides in ToxCast Phase I demonstrated the dose-response relationship between the equivalent in vivo levels required to observe the NR1L3/NR1L2 activity and the known dose levels causing rodent liver toxicity . Examples such as this provide a path toward incorporating in vitro assay data into the risk assessment process, but they also demonstrate the amount of prior knowledge currently required to perform such an analysis. The vast majority of environmental chemicals have little to no prior toxicity data, and those that do commonly lack information regarding potential modes of action or human relevance. The reality is that among the thousands of environmental chemicals, few will ever have a multigeneration reproductive study run. Over the past 30 years, only 500 chemicals have been run in multigeneration reproductive studies because of the high animal and financial burdens for such large-scale animal testing [1, 6]. A practical solution and pressing need, especially with regards to reproductive toxicity testing, is for prioritization tools, such as the current model, to make more informed reproductive toxicity testing decisions.
Cross-validation and external validation sets used to develop and assess the quality of the reproductive toxicity model helped identify strengths and weaknesses of the present model and will help focus future research. Using HTS assays as the input into the model provided mechanistic insights and helped further characterize the predicted chemicals beyond negative and positive prediction outcomes. However, a subset of chemicals were deemed to be outside the domain of applicability based on low in vitro activity as a result of physical-chemical characteristics, biological gaps, chemical decomposition, or volatility. Further research is needed to better characterize this chemical subset to expand the current model's domain of applicability. A limited number of chemicals selected from the ToxCast Phase I chemical set were used for external validation and provided supporting evidence of the model quality. A large set of chemicals from ToxCast Phase II will have a full complement of in vitro bioactivity data, and rodent reproductive toxicity studies will be used to further evaluate, validate, and expand the predictive model. Additionally, ToxCast Phase II contains a library of failed pharmaceuticals with preclinical and clinical toxicity outcomes as well as reference chemicals with known mechanisms of reproductive toxicity. In addition to diversifying the current chemical library, these chemicals will aid in the expansion of predictive reproductive toxicity model development toward mechanistic and human reproductive toxicity models useful in risk assessment applications.
The ability of this predictive reproductive toxicity model to externally predict numerous chemicals with biological and structural diversity demonstrates suitability for chemical testing prioritization. Although the model does not provide quantitative dose-response information, it does provide accurate predictions of a chemical's reproductive toxicity potential. Because the model is based on HTS data, it is amenable to screening and prioritizing thousands of chemicals. Additionally, the biological features of the model provide mechanistic insights regarding modes of action useful in developing an integrated testing strategy for reproductive toxicity.
The authors thank Dr. Andrew Hotchkiss (U.S. EPA/Office of Research and Development/National Center for Environmental Assessment) for his review of this manuscript. The authors also thank the Office of Pesticide Programs for continuing support of ToxRefDB's development and expansion and the government contract support, government contractors, and collaborators of ToxCast. Additionally, the authors thank our Tox21 partners at the National Institutes of Health Chemical Genomics Center, especially Drs. Menghang Xia and Ruili Huang, for generation of chemical-nuclear receptor interaction data.