Welcome to Phase 2 of the capstone project. This section is the first of two parts covering the model training stage of the model development cycle. You continue to play the role of a bioinformatics professor. The questions relate to the challenges faced by the teams working on the two projects introduced in the first section.
Your two research teams have begun working on the projects and have some preliminary results. Both teams have e-mailed you summaries of their progress so far, shown below.
Hi,

We are super excited to get this project kicked off! We have implemented the data pipeline and trained a few preliminary models, but there is still lots of room for improvement. Here is what we’ve done so far:

We split the data randomly into a training and test set. We are placing 90% of the data into the training set and 10% of the data into the test set. Additionally, the images were initially massive, on the order of 3000 by 3000 pixels. So, we resized the images to 224 by 224 pixels. We are using the ResNet-50 CNN architecture. During training, we are applying data augmentation. Concretely, on a given image, with 50% probability, we are zooming in on a small, randomly selected region before feeding it to the model. Here is an example:
So far, we have seen the following training curves from our model. Neither the training loss nor the test loss goes down very much.
As you can see, there is plenty of room for improvement. We’ll keep working on it, but let us know if you have any suggestions. Thanks.
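For readers who want to see what the augmentation in Team 1's e-mail might look like in code, the following is a minimal sketch, not the team's actual implementation. It assumes images are NumPy arrays of shape (height, width, channels). The function name `random_zoom`, the crop-size bounds `min_frac` and `max_frac`, and the nearest-neighbour resize are all illustrative assumptions; the e-mail only specifies the 50% probability and the "small, randomly selected region".

```python
import numpy as np

def random_zoom(image, rng, p=0.5, min_frac=0.3, max_frac=0.6):
    """With probability p, crop a random region and resize it back to the
    original size (nearest-neighbour). Otherwise return the image unchanged.

    min_frac and max_frac are assumed bounds on the crop side length,
    expressed as a fraction of the original side length.
    """
    if rng.random() >= p:
        return image

    h, w = image.shape[:2]
    frac = rng.uniform(min_frac, max_frac)
    crop_h, crop_w = max(1, int(h * frac)), max(1, int(w * frac))

    # Pick the top-left corner so the crop stays inside the image.
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    crop = image[top:top + crop_h, left:left + crop_w]

    # Nearest-neighbour upsampling back to (h, w) via index mapping.
    rows = np.arange(h) * crop_h // h
    cols = np.arange(w) * crop_w // w
    return crop[rows][:, cols]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
zoomed = random_zoom(image, rng, p=1.0)      # always zoom
untouched = random_zoom(image, rng, p=0.0)   # never zoom
```

Note that with small crop fractions the zoomed region may not contain the part of the image that determines the label, which is worth keeping in mind when reading the training curves.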
Hello,

We are in the process of cleaning up the COVID EHR data and expect to get a model training soon. We attempted to train a set of preliminary models, but ran into some data issues. We were wondering if you could take a look at some of the problems we’ve found in the data and let us know what you think.

First, we noticed that the EHR data is actually quite sparse relative to what we thought we had. We only have about 3,000 EHR records, not 30,000 as we originally thought. This leaves us with about 300 COVID-positive and 2,700 COVID-negative exams. We might not be able to train a model on this data alone.

We are also noticing some very strange patterns in the data, particularly in the lab values. For example, see the following histogram of D-DIMER lab values found for each exam across the entire dataset. The x-axis is the D-DIMER lab value, and the y-axis is the number of exams with that value. We use a log scale on the y-axis to improve readability.
We saw this in several CSV columns, including the Ferritin and Procalcitonin lab values. **We suspect that there is some underlying phenomenon affecting all three lab values.**

Another issue we ran into was missing column values. We can’t create a feature vector for Logistic Regression if we are missing some values. How do you suggest we proceed regarding both the large outlier values and the missing values? Below is an example of the data once again, this time a sample of 30 exams (with the observed symptoms excluded). Note that NaN in the CSV means that the value is missing. Please take a look and let us know if you see something that we might have missed.
In [1]:
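import pandas as pd

# Columns to display: identifiers, dates, demographics, then lab values and vitals
columns = ['pat_deid', 'intubation_date', 'IP_admission_date', 'IP_discharge_date', 'clinic', 'birth_date', 'death_date', 'gender', 'ethnicity', 'race_new', 'LYMAB', 'CK', 'CR', 'LDH', 'TNI', 'DDIMER', 'FERRITIN', 'PROCTL', 'PT', 'BUN', 'CRP', 'SPO2', 'FIO2', 'NA']
# Show a sample of 30 exams (row positions 5 through 34)
pd.read_csv('COVID_19_sample_data.csv')[columns].iloc[5:35]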
Out[1]:
index | pat_deid | intubation_date | IP_admission_date | IP_discharge_date | clinic | birth_date | death_date | gender | ethnicity | race_new | LYMAB | CK | CR | LDH | TNI | DDIMER | FERRITIN | PROCTL | PT | BUN | CRP | SPO2 | FIO2 | NA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 8f9539f2-e6ad-4e00-ad45-4abc2bff2214 | NaN | 2020-03-04 | 2020-03-14 | Clinic B | 1965-11-11 | NaN | F | nonhispanic | white | 1.0 | 51.6 | NaN | 232.4 | 0.0020 | 400000.0 | 503900.0 | 0.0 | NaN | 16.4 | 0.88 | 88.1 | NaN | 136.9 |
6 | d5dd13c4-c31e-419c-8c02-47e4ca1ac5e2 | NaN | 2020-03-02 | 2020-03-20 | Clinic B | 2018-08-16 | NaN | F | nonhispanic | white | 1.1 | 52.9 | NaN | 300.7 | 0.0030 | 700000.0 | 579600.0 | 100.0 | 10.3 | 16.2 | 0.83 | 80.6 | 32.8 | 140.0 |
7 | 91369e11-b944-4132-be0f-af46e880936b | NaN | 2020-03-02 | 2020-03-21 | Clinic C | 1972-09-22 | NaN | M | nonhispanic | white | 1.2 | 45.0 | NaN | NaN | 0.0047 | 600.0 | 562.3 | 0.2 | 12.8 | 10.7 | 0.85 | 84.4 | NaN | 141.4 |
8 | c70992c9-ff13-467b-9032-1901506edeef | NaN | 2020-02-29 | 2020-03-05 | Clinic C | 1959-06-17 | 2020-03-11 | M | nonhispanic | white | 0.8 | 84.2 | NaN | 321.8 | 0.0092 | 1100.0 | 877.6 | 0.3 | 13.5 | 7.6 | NaN | 88.6 | 88.3 | 141.0 |
9 | c70992c9-ff13-467b-9032-1901506edeef | 2020-03-05 | 2020-03-05 | 2020-03-12 | Clinic B | 1959-06-17 | 2020-03-11 | M | nonhispanic | white | 0.5 | 29.7 | NaN | 391.9 | 0.0553 | 17900000.0 | 1786400.0 | 200.0 | 12.7 | 11.5 | 0.94 | 75.6 | 88.3 | 143.2 |
10 | 9ec7d743-96e7-47c8-b2ee-6336633beb39 | NaN | 2020-03-10 | 2020-03-23 | Clinic B | 1969-03-22 | NaN | M | nonhispanic | white | 1.1 | 35.7 | NaN | 312.5 | 0.0045 | 600000.0 | 615900.0 | 100.0 | 11.1 | 15.6 | 0.98 | 83.7 | NaN | 143.8 |
11 | 9ec7d743-96e7-47c8-b2ee-6336633beb39 | NaN | 2020-03-23 | 2020-03-25 | Clinic C | 1969-03-22 | NaN | M | nonhispanic | white | 1.2 | 24.4 | NaN | 254.3 | 0.0021 | 600.0 | 406.2 | 0.1 | 11.8 | 15.6 | 0.80 | NaN | NaN | 143.8 |
12 | a527bcf0-3746-476c-90f2-dbab8868385e | NaN | 2020-03-03 | 2020-03-17 | Clinic C | 1978-11-27 | NaN | F | nonhispanic | white | 1.1 | 25.7 | NaN | 228.1 | 0.0038 | 500.0 | 459.5 | 0.0 | 12.8 | 13.5 | 1.14 | 77.7 | NaN | 138.4 |
13 | 7f4ef129-1511-47a9-a9b7-8b0b2d02ad50 | NaN | 2020-03-12 | 2020-03-25 | Clinic C | 1952-05-06 | NaN | F | nonhispanic | white | 1.0 | 29.7 | NaN | 321.5 | 0.0038 | 700.0 | 450.1 | 0.2 | 12.1 | 15.0 | 1.20 | 75.8 | NaN | 139.6 |
14 | 7078ae9a-4c79-4b30-b127-f76aabb6763e | NaN | 2020-02-17 | 2020-03-07 | Clinic B | 1968-04-26 | NaN | F | hispanic | white | 0.9 | 48.7 | NaN | 306.5 | NaN | 600000.0 | 509000.0 | 100.0 | 12.0 | 18.4 | 1.14 | 77.3 | NaN | 139.6 |
15 | a5c39700-6bf3-4984-af46-31344695e21b | NaN | 2020-03-05 | 2020-03-13 | Clinic A | 1940-01-09 | 2020-03-15 | M | nonhispanic | white | 0.7 | 85.5 | NaN | 312.9 | 0.0257 | NaN | 1552.8 | NaN | NaN | 14.9 | 1.23 | 79.2 | 64.5 | 143.8 |
16 | a5c39700-6bf3-4984-af46-31344695e21b | 2020-03-12 | 2020-03-12 | 2020-03-16 | Clinic C | 1940-01-09 | 2020-03-15 | M | nonhispanic | white | 0.5 | 170.6 | NaN | 390.5 | 0.0238 | 15000.0 | 1755.9 | 0.2 | 14.0 | 14.9 | 1.15 | 75.6 | 40.9 | 143.8 |
17 | ddb2d5e2-643e-4374-ac19-f6ca3c0d16f5 | NaN | 2020-02-25 | 2020-03-09 | Clinic C | 1967-12-24 | NaN | M | nonhispanic | white | NaN | 51.5 | NaN | 234.4 | 0.0029 | 500.0 | 527.4 | 0.1 | 11.5 | 12.0 | 0.90 | 88.9 | 42.0 | 138.8 |
18 | 21505aac-f219-43a8-ab3c-f57c6d8f1d1f | NaN | 2020-03-08 | 2020-03-21 | Clinic B | 1940-05-03 | NaN | F | nonhispanic | white | 1.1 | 38.7 | NaN | 229.3 | 0.0029 | 600000.0 | 536500.0 | NaN | 12.6 | 8.4 | 1.07 | 77.0 | NaN | 137.4 |
19 | 7992bf94-feee-4728-9187-2c911df2819b | NaN | 2020-03-03 | 2020-03-17 | Clinic C | 2004-07-04 | NaN | F | nonhispanic | white | 1.0 | 27.3 | NaN | 238.9 | 0.0022 | 600.0 | NaN | 0.1 | 11.0 | 10.9 | 0.98 | NaN | NaN | 138.4 |
20 | d2f6d528-39db-4b7e-8389-abd27af9a710 | NaN | 2020-02-29 | 2020-03-12 | Clinic B | 1996-06-26 | NaN | F | nonhispanic | white | 1.1 | 31.5 | NaN | 254.3 | 0.0034 | 700000.0 | 455300.0 | 0.0 | 10.3 | 14.6 | 1.08 | 76.6 | NaN | 138.4 |
21 | fa0b58e6-6817-4d49-8211-1dd34abf0c15 | NaN | 2020-03-11 | 2020-03-28 | Clinic C | 2008-11-21 | NaN | M | nonhispanic | white | 0.9 | 23.0 | NaN | 232.0 | 0.0030 | 600.0 | 542.0 | 0.1 | 10.8 | 10.6 | 0.88 | 86.3 | NaN | 141.6 |
22 | b83237f3-9ff5-491e-aab4-d63ccff85f85 | NaN | 2020-03-13 | 2020-03-30 | Clinic C | 2012-11-17 | NaN | M | nonhispanic | white | 1.1 | 47.2 | NaN | 329.8 | 0.0038 | 700.0 | 535.0 | 0.0 | 12.4 | 10.1 | 1.13 | 75.3 | NaN | 137.3 |
23 | 46988a9c-9c86-429a-bc4a-b3d14ff321b0 | NaN | 2020-03-11 | 2020-03-21 | Clinic B | 1957-03-13 | NaN | M | nonhispanic | asian | NaN | 37.0 | NaN | 235.7 | 0.0030 | NaN | 535000.0 | 100.0 | 10.2 | 19.1 | 0.86 | 79.1 | NaN | 137.5 |
24 | 46988a9c-9c86-429a-bc4a-b3d14ff321b0 | 2020-03-20 | 2020-03-21 | 2020-03-24 | Clinic B | 1957-03-13 | NaN | M | nonhispanic | asian | 1.2 | 29.6 | NaN | 304.2 | 0.0044 | 500000.0 | 553400.0 | 0.0 | 11.8 | 9.2 | 1.13 | 98.7 | 86.8 | 143.5 |
25 | 785b484d-7060-4d17-bf18-ef8bbafc6f04 | NaN | 2020-02-28 | 2020-03-10 | Clinic B | 1942-08-24 | NaN | F | nonhispanic | white | 0.8 | 21.3 | NaN | 226.3 | 0.0023 | 500000.0 | 468600.0 | NaN | 12.6 | 17.2 | 0.91 | 81.7 | NaN | 138.2 |
26 | edad31f3-5a08-4678-8d31-271a41a2aad5 | NaN | 2020-03-05 | 2020-03-13 | Clinic C | 1940-01-09 | 2020-03-19 | M | nonhispanic | white | 0.6 | 78.6 | NaN | 306.4 | 0.0256 | 2600.0 | 1764.2 | 0.2 | 13.8 | 18.0 | 1.14 | 83.0 | 62.0 | 141.2 |
27 | edad31f3-5a08-4678-8d31-271a41a2aad5 | 2020-03-12 | 2020-03-12 | 2020-03-20 | Clinic C | 1940-01-09 | 2020-03-19 | M | nonhispanic | white | 0.3 | 184.4 | NaN | 370.1 | 0.0639 | 20600.0 | 1804.0 | 0.3 | 11.5 | 7.7 | 1.23 | 84.4 | NaN | 142.2 |
28 | 4607a669-4a97-4f0a-9661-856569905047 | NaN | 2020-03-09 | 2020-03-21 | Clinic C | 1993-11-26 | NaN | F | nonhispanic | white | 1.1 | 48.7 | NaN | NaN | 0.0037 | 600.0 | 590.9 | 0.1 | 12.5 | 13.7 | 0.97 | 81.1 | NaN | 142.4 |
29 | c1800ba1-7cba-45d7-bdc4-0e0b583932e4 | NaN | 2020-02-23 | 2020-03-08 | Clinic A | 2018-01-20 | NaN | M | hispanic | white | NaN | 25.7 | NaN | 306.4 | 0.0030 | 600.0 | 464.0 | 0.0 | 10.9 | 7.6 | NaN | 77.8 | NaN | 141.7 |
30 | d2718050-2e9c-4d5b-842e-52d910c1563f | NaN | 2020-03-04 | 2020-03-17 | Clinic C | 1997-06-01 | NaN | M | nonhispanic | white | 1.2 | 50.6 | NaN | 339.8 | 0.0047 | 600.0 | 634.5 | 0.2 | 10.2 | 13.6 | 1.08 | 81.5 | NaN | 137.9 |
31 | d2718050-2e9c-4d5b-842e-52d910c1563f | NaN | 2020-03-17 | 2020-03-22 | Clinic A | 1997-06-01 | NaN | M | nonhispanic | white | 1.3 | 40.5 | NaN | 186.6 | 0.0038 | 500.0 | 322.6 | 0.1 | 10.8 | 16.5 | 0.85 | 99.6 | NaN | 139.6 |
32 | 818566cb-c89b-42d8-a6af-1a1ef13ed7cf | NaN | 2020-03-08 | 2020-03-20 | Clinic C | 1984-10-11 | NaN | F | nonhispanic | white | 0.9 | 27.0 | NaN | 239.3 | 0.0023 | 500.0 | 474.7 | 0.1 | 11.8 | 9.8 | 0.94 | 85.7 | NaN | 137.9 |
33 | 000e7adf-cbaa-4fad-ab2f-658c32f7d4d3 | NaN | 2020-03-12 | 2020-03-16 | Clinic B | 1959-01-03 | 2020-03-15 | M | nonhispanic | white | 0.6 | 175.1 | NaN | 326.5 | 0.0063 | 1600000.0 | 1010100.0 | 400.0 | 12.5 | 16.4 | 1.37 | 81.4 | 79.6 | 143.0 |
34 | 5a2f02ce-0286-45ae-b992-05331cb88379 | NaN | 2020-03-11 | 2020-03-29 | Clinic C | 1973-06-30 | NaN | F | nonhispanic | white | 1.0 | 27.7 | NaN | 349.5 | NaN | 600.0 | NaN | 0.2 | 10.7 | 14.8 | 0.86 | 87.6 | NaN | 143.7 |
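As one possible first step in looking at the issues Team 2 raises, the sketch below summarizes missing values and compares lab values across groups. It is only an illustration under stated assumptions: the small DataFrame re-enters six rows from the table above by hand (rows 5, 7, 10, 11, 15, and 23) rather than reading the real CSV, and grouping by the `clinic` column is just one of several groupings that could be tried. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Six rows copied by hand from the sample table (index, clinic, three labs).
sample = pd.DataFrame(
    {
        "clinic": ["Clinic B", "Clinic C", "Clinic B",
                   "Clinic C", "Clinic A", "Clinic B"],
        "DDIMER": [400000.0, 600.0, 600000.0, 600.0, np.nan, np.nan],
        "FERRITIN": [503900.0, 562.3, 615900.0, 406.2, 1552.8, 535000.0],
        "PROCTL": [0.0, 0.2, 100.0, 0.1, np.nan, 100.0],
    },
    index=[5, 7, 10, 11, 15, 23],
)

labs = ["DDIMER", "FERRITIN", "PROCTL"]

# Fraction of missing (NaN) entries in each lab column.
missing_fraction = sample[labs].isna().mean()

# Median of each lab column within each group; NaN values are skipped,
# and a group with no observed values yields NaN.
median_by_clinic = sample.groupby("clinic")[labs].median()

print(missing_fraction)
print(median_by_clinic)
```

Running the same two summaries on the full dataset, and with other grouping columns, would show whether the very large values are spread evenly or concentrated in particular subsets of exams.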
In the following quiz, you will answer questions examining the issues of Team 1 and Team 2.