Columbia Cycle Day Prediction
Summary of the Request
As much data as possible to assist with answering the below research question(s). The data needed would be from non-anovulatory women not taking oral contraceptives who could be considered “healthy.” The ideal variables (at least initially) include progesterone metabolites, estradiol, estrone, and estrogen metabolites (vs. only estrone metabolites listed in the data request document?) and any other covariates available. Below I attempted to create research questions from the recording of the 8/2 meeting. Please comment and correct anything I may have misinterpreted.
Important to note here that the assumption of a “healthy” status is a delicate assumption given that, in general, laboratory testing is ordered when a clinician is attempting to diagnose an issue. With the current exclusion criteria, this shouldn’t be a big deal and including non-healthy (but still cycling) women should, in theory, increase the generalizability of the model. A naive assumption is that a large portion of the women completing DUTCH cycle mapping are experiencing infertility or some potentially cyclical symptomatology.
Research Question 1 (Most Pressing)
How accurately does the model created using hormone concentrations from 18 women predict cycle day in a new cohort of women?
I see this as essentially looking for a test set for the model previously trained on 18 women. Because this test set will be much larger, I wonder if it would be prudent to retrain the model based on a larger number of training observations and then use a smaller test set (like the standard 80/20 or 75/25 train-test split)?
Research Question 2
Can we create a reasonably accurate model for predicting cycle day from a larger number of predictors? Or, alternatively, is there a more accurate model that can be created using a larger number of predictors?
A key question here might be how important is the interpretability of the model? Or how is feature/predictor importance quantified in the current approach? Would alternative approaches result in substantially different rankings for relative feature/predictor importance?
Potential Research Question 3
How accurately can the model predict cycle phase (luteal vs. follicular)?
This is a somewhat implied research question (and was hinted at by Dr. Herbstman in the meeting), but it could be particularly interesting and useful in some settings. Additionally this would present an opportunity to use non-cycle map data (which would greatly increase the number of observations available) to test hypotheses related to models created to predict cycle day, with the assumption being that if a model can accurately predict cycle day within 2-3 days, then it should be extremely accurate at predicting cycle phase (with an interesting potential hypothesis of some bias to the errors). This could potentially represent a broader application of the resulting model.
In reference to the above note, in reading through the manuscript, the way in which the error is assigned (for theta = 28, theta hats of 26 and 2 are both assigned an error of 2, with sign indicating the direction of the error…) does not optimize for cycle phase determination.
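To make the point above concrete, here is a minimal sketch of a signed circular error for theta = 28, alongside a deliberately simplified phase label. The function names, the fixed 28-day cycle, and the day-14 phase boundary are assumptions for illustration only; they are not taken from the manuscript.

```python
def signed_circular_error(true_day, predicted_day, cycle_length=28):
    """Signed shortest distance from true_day to predicted_day around the cycle."""
    diff = (predicted_day - true_day) % cycle_length
    if diff > cycle_length / 2:
        diff -= cycle_length  # wrap so that "late" predictions become negative
    return diff


def phase(day, boundary_day=14):
    """Hypothetical simplification: days up to boundary_day are follicular."""
    return "follicular" if day <= boundary_day else "luteal"


true_day = 28
for predicted_day in (26, 2):
    error = signed_circular_error(true_day, predicted_day)
    same_phase = phase(true_day) == phase(predicted_day)
    print(f"theta hat = {predicted_day}: error = {error:+d}, same phase = {same_phase}")
```

Both predictions are 2 days off in absolute terms, but under these assumptions only the prediction of day 2 crosses the luteal/follicular boundary, which is why a loss based purely on circular distance would not necessarily favor correct phase assignment.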
Options
Option 1
Use data collected from 12/2015 - 1/2020 under previous IRB. This includes data from 2435 women who completed DUTCH Cycle Mapping. A brief EDA using this/these data is below.
Considerations
Inclusion/exclusion criteria will likely remove a moderate number of observations given the goal of the research (see below for details in context like age, cycle length, etc.)
Because the data are completely de-identified human subject data, this approach would only require a data transfer agreement (vs. a data use agreement + IRB approval).
However, this would still likely require a modification to the IRB in place
- Note
This is an important trend related to effort and return on investment for expended effort regarding IRB modification/submission (i.e., does it make sense to put in a little more effort regarding the IRB to set up a longer term and more useful data mining resource?)
Option 2
Use all available cycle map data (from 12/2015 to present). I’m still working on a SQL query to get an accurate number of how many additional women this would add; however, a reasonable estimate would be 1.5 - 2X the number in option 1.
Could also consider adding non-cycle map data here to explore the accuracy of cycle phase prediction…
Considerations
This increased number is important to consider here given the number of observations that will likely be lost to inclusion/exclusion criteria (again…age, cycle length, etc.)
This would require a new IRB application, or, more accurately, an IRB exemption since all data would be de-identified (or potentially a modification?).
- Note
This brings up an important point, which is the mechanism for sharing data. This seems like a good project for REDCap. We have a REDCap instance as I’m sure you all do as well and this may facilitate more streamlined and compliant data sharing.
The time commitment here for pulling and cleaning the data is more than the near zero time for option 1 (however, this is a super cool project and I’d be willing to dedicate a solid amount of non-paid/non-work/free time to helping out because this kind of stuff is super fun for me!).
“How much more time?” you might ask. Not quite sure until I get started and see what shape the data is currently in, but at least a couple of weeks. Another potential approach is to use the option 1 data as training data and use the data from 1/2020 - present as test data.
Option 3
Use all available cycle map data (from 12/2015 to present) + non-cycle map data (for cycle phase prediction) + collect cycle day information from incoming non-cycle map (so cross-sectional samples…would be self-reported, but if the question is asked prospectively this could increase accuracy or at least allow for the comparison of self-reported data and model-generated data, especially if the model is performing well).
Considerations
The cycle phase prediction piece is “easy” data to get (relatively)
Collecting cycle day information from incoming non-cycle map data would add some additional effort, but could greatly increase the importance of the results and the perceived validity
- The much better alternative here would be a prospective study where cycle day is more rigorously collected (perhaps via a cycle monitoring app), but that isn’t currently feasible (but maybe there’s a grant application in the future??)
This would require a new IRB application/exemption/modification
Option 2 could be used prior to this option or while data collection is ongoing and this could then be a follow up study. There is also an opportunity here to put a model into “production” for broader use and then collect exponentially more data.
Questions
How does this model perform compared to other types of supervised learning models (gradient boosted decision trees, random forest, multi-class SVM, etc.)?
- This might be more relevant for the potential research question 3 which reduces this to a binary classification problem, but a multi-class SVM approach is an interesting consideration.
How is variation in cycle length accounted for or is having a 28-day cycle required for inclusion (if so, this would eliminate a substantial number of observations)?
- Note
It is important to note here that, in general, cycle length will vary quite a bit; so if the existing model is inflexible to cycle lengths that deviate from 28 days, the clinical utility may be limited
There is dependence among samples from the same patient. How is this handled?
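One common way to keep within-patient dependence from inflating test accuracy is to split at the patient level rather than the row level, so that all samples from a given woman fall on the same side of the split. The sketch below is only an illustration of that idea: `split_by_patient` is a hypothetical helper, the patient IDs are made up, and I don't know whether the existing model was evaluated this way.

```python
import random


def split_by_patient(rows, test_fraction=0.2, seed=0):
    """Split rows so every row from a given patient lands in the same set."""
    patient_ids = sorted({row["patient_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [row for row in rows if row["patient_id"] not in test_ids]
    test = [row for row in rows if row["patient_id"] in test_ids]
    return train, test


# Illustrative rows only: 10 made-up patients with 3 samples each.
rows = [
    {"patient_id": f"P{p:02d}", "day_of_cycle": day}
    for p in range(10)
    for day in (4, 15, 25)
]
train, test = split_by_patient(rows, test_fraction=0.2)
print(len(train), "training rows;", len(test), "test rows")
```

With an 80/20 split on patients rather than rows, no woman contributes samples to both the training and test sets.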
Brief EDA
Below are the first 100 entries so that you can get an idea of the shape of the data.
To Do
Join other data with age + …
Variable Exploration
Age

There are 9 rows with NA values for age. All 9 belong to the same patient (patient_id 611866), so age is missing for 1 person.
A maximum age would need to be set as exclusion criteria to exclude women who are likely to be peri or postmenopausal.
# A tibble: 9 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 321471 321471 611866 NA 20.4 34
2 321471 321471 611866 NA 20.4 22
3 321471 321471 611866 NA 20.4 29
4 321471 321471 611866 NA 20.4 25
5 321471 321471 611866 NA 20.4 11
6 321471 321471 611866 NA 20.4 15
7 321471 321471 611866 NA 20.4 4
8 321471 321471 611866 NA 20.4 20
9 321471 321471 611866 NA 20.4 17
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
# A tibble: 2,435 × 2
patient_id age_at_collection
<chr> <dbl>
1 020303 44
2 020313 45
3 020323 30
4 020333 16
5 020357 44
6 020367 41
7 020372 42
8 020382 40
9 021104 24
10 021114 38
# ℹ 2,425 more rows
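Because each cycle map contributes several rows per woman, missingness is worth counting at both the row level and the patient level. A minimal sketch of that check follows; `count_missing` is a hypothetical helper, and the rows are a small illustrative subset built from the values shown above (nine NA rows for patient 611866 plus two patients with recorded ages), not the full data set.

```python
def count_missing(rows, field):
    """Return (number of rows, number of distinct patients) where field is missing."""
    missing_rows = [row for row in rows if row[field] is None]
    missing_patients = {row["patient_id"] for row in missing_rows}
    return len(missing_rows), len(missing_patients)


# Illustrative subset only; None stands in for NA.
rows = [{"patient_id": "611866", "age_at_collection": None} for _ in range(9)]
rows += [
    {"patient_id": "020303", "age_at_collection": 44},
    {"patient_id": "020313", "age_at_collection": 45},
]

n_rows, n_patients = count_missing(rows, "age_at_collection")
print(f"{n_rows} rows missing age, from {n_patients} patient(s)")
```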
BMI

Might consider excluding those with extreme BMIs, as these values may either be inaccurate (self-reported) or influence cycle characteristics.
The extremely low BMIs seem particularly concerning…it is likely that these are inaccurate
Day of Cycle

The above demonstrates the issue with variable cycle lengths and with potential self-reporting errors. Exclusion criteria will need to be established here.
Creatinine

The large peak at 1 suggests that this was potentially an “imputed” value (at some point in the lab methodology). This likely means that this variable might not be very useful as a predictor unless all observations with a value of 1 are excluded.
Alpha-Pregnanediol

From the warning message on the histogram, it appears that there are some missing values
# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Decision would need to be made here around how to handle missing values.
Beta-Pregnanediol

Missing values here as well…would like to see if these are the same as the missing values for alpha pregnanediol
# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Yes. These are the same patients/accessions with missing alpha and beta pregnanediol values.
Estrone

# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Again, same 5 patients + accessions with missing values
Estradiol

# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Again, same 5 patients + accessions with missing values
Estriol

# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Again, same 5 patients + accessions with missing values
2-OHE1

# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Again, same 5 patients + accessions with missing values
4-OHE1

# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Again, same 5 patients + accessions with missing values
16-OHE1

# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
2-Methoxy-E1

# A tibble: 5 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
Again, same 5 patients + accessions with missing values
2-OHE2

# A tibble: 7 × 18
accession_id accession_id1 patient_id age_at_collection bmi day_of_cycle
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 048414 048414 130321 44 18.9 4
2 048434 048434 030698 52 19.8 4
3 048474 048474 130351 42 20.2 4
4 048484 048484 517933 39 24.8 4
5 048493 048493 130311 21 22.1 4
6 732860 732860 928021 33 23.6 4
7 732880 732880 928031 32 19.9 4
# ℹ 12 more variables: creatinine <dbl>, a_pregnanediol <dbl>,
# b_pregnanediol <dbl>, estrone <dbl>, estradiol <dbl>, estriol <dbl>,
# x2_ohe1 <dbl>, x4_ohe1 <dbl>, x16_ohe1 <dbl>, x2_meth_e1 <dbl>,
# x2_ohe2 <dbl>, single_or_quad <dbl>
There are 7 patients missing values for 2-OHE2; 5 of them are the same patients that have been missing other values. The other 2 observations with missing values are 2 new patients + accessions.
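The overlap check done by eye above can also be expressed as set operations on accession IDs. This is a small sketch using the accession IDs from the printed tables; the dictionary layout is my own choice for illustration, and only two of the analytes are shown.

```python
# Accession IDs with missing values, copied from the tables above.
missing_accessions = {
    "a_pregnanediol": {"048414", "048434", "048474", "048484", "048493"},
    "x2_ohe2": {
        "048414", "048434", "048474", "048484", "048493",
        "732860", "732880",
    },
}

# Accessions missing both analytes vs. those missing only 2-OHE2.
shared = missing_accessions["a_pregnanediol"] & missing_accessions["x2_ohe2"]
only_2ohe2 = missing_accessions["x2_ohe2"] - missing_accessions["a_pregnanediol"]

print("Missing both:", sorted(shared))
print("Missing 2-OHE2 only:", sorted(only_2ohe2))
```

If the same 5 accessions are missing every analyte, dropping them entirely is likely simpler than imputing, whereas the 2 accessions missing only 2-OHE2 may need a separate decision.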