
Phase 2: Model Training, Part 1

Welcome to Phase 2 of the capstone project. This section is the first of two parts covering the model training stage of the model development cycle. You continue to play the role of a bioinformatics professor. The questions relate to the challenges faced by the teams working on the two projects introduced in the first section.

Your two research teams have begun working on the projects and have some preliminary results. Both teams have e-mailed you summaries of their progress so far, which are shown below.

Project 1: CXR-based COVID-19 Detector

Hi,

We are super excited to get this project kicked off! We have implemented the data pipeline and trained a few preliminary models, but there is still lots of room for improvement. Here is what we’ve done so far:

We split the data randomly into a training and test set. We are placing 90% of the data into the training set and 10% of the data into the test set.

Additionally, the images were initially massive, on the order of 3000 by 3000 pixels. So, we resized the images to 224 by 224 pixels.

We are using the ResNet-50 CNN architecture.

During training, we are applying data augmentation. Concretely, on a given image, with 50% probability, we are zooming in on a small, randomly selected region before feeding it to the model. Here is an example:

So far, we have seen the following training curves from our model. Neither the training loss nor the test loss goes down very much.

As you can see, there is plenty of room for improvement. We’ll keep working on it, but let us know if you have any suggestions. Thanks.
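The augmentation Team 1 describes can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function name `random_zoom`, the `crop_frac` parameter (the e-mail says only that the region is "small"), and the nearest-neighbour resizing are all assumptions made for the example.

```python
import numpy as np


def random_zoom(image, rng, p=0.5, crop_frac=0.3):
    """With probability p, zoom in on a random region and resize it back.

    crop_frac is a hypothetical parameter: the side length of the cropped
    region as a fraction of the original side length.
    """
    h, w = image.shape[:2]
    if rng.random() >= p:
        return image  # leave the image unchanged

    # Pick a random crop window that fits inside the image.
    ch, cw = max(1, int(h * crop_frac)), max(1, int(w * crop_frac))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = image[top:top + ch, left:left + cw]

    # Nearest-neighbour upsampling back to the original height and width.
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[rows][:, cols]


rng = np.random.default_rng(0)
image = np.arange(224 * 224).reshape(224, 224)  # stand-in for a resized CXR
zoomed = random_zoom(image, rng, p=1.0)         # force the zoom branch
unchanged = random_zoom(image, rng, p=0.0)      # force the pass-through branch
```

With a small `crop_frac`, the model sees only a fraction of the original radiograph on about half of its training steps, which is worth keeping in mind when reading the training curves.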

Project 2: EHR-based Intubation Predictor

Hello,

We are in the process of cleaning up the COVID EHR data and expect to get a model training soon. We attempted to train a set of preliminary models but ran into some data issues. We were wondering if you could take a look at some of the problems we’ve found in the data and let us know what you think.

First, we noticed that the EHR data is actually quite sparse relative to what we thought we had. We only have about 3,000 EHR records, not 30,000 as we originally thought. This leaves us with about 300 COVID-positive and 2,700 COVID-negative exams. We might not be able to train a model on this data alone.

Second, we are noticing some very strange patterns in the data, particularly in the lab values. For example, see the following histogram of D-DIMER lab values found for each exam across the entire dataset. The x-axis is the D-DIMER lab value, and the y-axis is the number of exams with that value. We use a log scale on the y-axis to improve readability.

We saw this in several CSV columns, including the Ferritin and Procalcitonin lab values. **We suspect that there is some underlying phenomenon affecting all three lab values.**

Another issue we ran into was missing column values. We can’t create a feature vector for Logistic Regression if some values are missing. How do you suggest we proceed regarding both the large outlier values and the missing values? Below is an example of the data once again, this time a sample of 30 exams (with the observed symptoms excluded). Note that NaN in the CSV means that the value is missing. Please take a look and let us know if you see something that we might have missed.

In [1]:

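import pandas as pd

# Columns to display (the observed symptoms are excluded)
columns = [
    'pat_deid', 'intubation_date', 'IP_admission_date', 'IP_discharge_date', 'clinic',
    'birth_date', 'death_date', 'gender', 'ethnicity', 'race_new', 'LYMAB', 'CK', 'CR',
    'LDH', 'TNI', 'DDIMER', 'FERRITIN', 'PROCTL', 'PT', 'BUN', 'CRP',
    'SPO2', 'FIO2', 'NA',
]

# Show a sample of 30 exams (row positions 5 through 34)
pd.read_csv('COVID_19_sample_data.csv')[columns].iloc[5:35]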

Out[1]:

pat_deid intubation_date IP_admission_date IP_discharge_date clinic birth_date death_date gender ethnicity race_new LYMAB CK CR LDH TNI DDIMER FERRITIN PROCTL PT BUN CRP SPO2 FIO2 NA
5 8f9539f2-e6ad-4e00-ad45-4abc2bff2214 NaN 2020-03-04 2020-03-14 Clinic B 1965-11-11 NaN F nonhispanic white 1.0 51.6 NaN 232.4 0.0020 400000.0 503900.0 0.0 NaN 16.4 0.88 88.1 NaN 136.9
6 d5dd13c4-c31e-419c-8c02-47e4ca1ac5e2 NaN 2020-03-02 2020-03-20 Clinic B 2018-08-16 NaN F nonhispanic white 1.1 52.9 NaN 300.7 0.0030 700000.0 579600.0 100.0 10.3 16.2 0.83 80.6 32.8 140.0
7 91369e11-b944-4132-be0f-af46e880936b NaN 2020-03-02 2020-03-21 Clinic C 1972-09-22 NaN M nonhispanic white 1.2 45.0 NaN NaN 0.0047 600.0 562.3 0.2 12.8 10.7 0.85 84.4 NaN 141.4
8 c70992c9-ff13-467b-9032-1901506edeef NaN 2020-02-29 2020-03-05 Clinic C 1959-06-17 2020-03-11 M nonhispanic white 0.8 84.2 NaN 321.8 0.0092 1100.0 877.6 0.3 13.5 7.6 NaN 88.6 88.3 141.0
9 c70992c9-ff13-467b-9032-1901506edeef 2020-03-05 2020-03-05 2020-03-12 Clinic B 1959-06-17 2020-03-11 M nonhispanic white 0.5 29.7 NaN 391.9 0.0553 17900000.0 1786400.0 200.0 12.7 11.5 0.94 75.6 88.3 143.2
10 9ec7d743-96e7-47c8-b2ee-6336633beb39 NaN 2020-03-10 2020-03-23 Clinic B 1969-03-22 NaN M nonhispanic white 1.1 35.7 NaN 312.5 0.0045 600000.0 615900.0 100.0 11.1 15.6 0.98 83.7 NaN 143.8
11 9ec7d743-96e7-47c8-b2ee-6336633beb39 NaN 2020-03-23 2020-03-25 Clinic C 1969-03-22 NaN M nonhispanic white 1.2 24.4 NaN 254.3 0.0021 600.0 406.2 0.1 11.8 15.6 0.80 NaN NaN 143.8
12 a527bcf0-3746-476c-90f2-dbab8868385e NaN 2020-03-03 2020-03-17 Clinic C 1978-11-27 NaN F nonhispanic white 1.1 25.7 NaN 228.1 0.0038 500.0 459.5 0.0 12.8 13.5 1.14 77.7 NaN 138.4
13 7f4ef129-1511-47a9-a9b7-8b0b2d02ad50 NaN 2020-03-12 2020-03-25 Clinic C 1952-05-06 NaN F nonhispanic white 1.0 29.7 NaN 321.5 0.0038 700.0 450.1 0.2 12.1 15.0 1.20 75.8 NaN 139.6
14 7078ae9a-4c79-4b30-b127-f76aabb6763e NaN 2020-02-17 2020-03-07 Clinic B 1968-04-26 NaN F hispanic white 0.9 48.7 NaN 306.5 NaN 600000.0 509000.0 100.0 12.0 18.4 1.14 77.3 NaN 139.6
15 a5c39700-6bf3-4984-af46-31344695e21b NaN 2020-03-05 2020-03-13 Clinic A 1940-01-09 2020-03-15 M nonhispanic white 0.7 85.5 NaN 312.9 0.0257 NaN 1552.8 NaN NaN 14.9 1.23 79.2 64.5 143.8
16 a5c39700-6bf3-4984-af46-31344695e21b 2020-03-12 2020-03-12 2020-03-16 Clinic C 1940-01-09 2020-03-15 M nonhispanic white 0.5 170.6 NaN 390.5 0.0238 15000.0 1755.9 0.2 14.0 14.9 1.15 75.6 40.9 143.8
17 ddb2d5e2-643e-4374-ac19-f6ca3c0d16f5 NaN 2020-02-25 2020-03-09 Clinic C 1967-12-24 NaN M nonhispanic white NaN 51.5 NaN 234.4 0.0029 500.0 527.4 0.1 11.5 12.0 0.90 88.9 42.0 138.8
18 21505aac-f219-43a8-ab3c-f57c6d8f1d1f NaN 2020-03-08 2020-03-21 Clinic B 1940-05-03 NaN F nonhispanic white 1.1 38.7 NaN 229.3 0.0029 600000.0 536500.0 NaN 12.6 8.4 1.07 77.0 NaN 137.4
19 7992bf94-feee-4728-9187-2c911df2819b NaN 2020-03-03 2020-03-17 Clinic C 2004-07-04 NaN F nonhispanic white 1.0 27.3 NaN 238.9 0.0022 600.0 NaN 0.1 11.0 10.9 0.98 NaN NaN 138.4
20 d2f6d528-39db-4b7e-8389-abd27af9a710 NaN 2020-02-29 2020-03-12 Clinic B 1996-06-26 NaN F nonhispanic white 1.1 31.5 NaN 254.3 0.0034 700000.0 455300.0 0.0 10.3 14.6 1.08 76.6 NaN 138.4
21 fa0b58e6-6817-4d49-8211-1dd34abf0c15 NaN 2020-03-11 2020-03-28 Clinic C 2008-11-21 NaN M nonhispanic white 0.9 23.0 NaN 232.0 0.0030 600.0 542.0 0.1 10.8 10.6 0.88 86.3 NaN 141.6
22 b83237f3-9ff5-491e-aab4-d63ccff85f85 NaN 2020-03-13 2020-03-30 Clinic C 2012-11-17 NaN M nonhispanic white 1.1 47.2 NaN 329.8 0.0038 700.0 535.0 0.0 12.4 10.1 1.13 75.3 NaN 137.3
23 46988a9c-9c86-429a-bc4a-b3d14ff321b0 NaN 2020-03-11 2020-03-21 Clinic B 1957-03-13 NaN M nonhispanic asian NaN 37.0 NaN 235.7 0.0030 NaN 535000.0 100.0 10.2 19.1 0.86 79.1 NaN 137.5
24 46988a9c-9c86-429a-bc4a-b3d14ff321b0 2020-03-20 2020-03-21 2020-03-24 Clinic B 1957-03-13 NaN M nonhispanic asian 1.2 29.6 NaN 304.2 0.0044 500000.0 553400.0 0.0 11.8 9.2 1.13 98.7 86.8 143.5
25 785b484d-7060-4d17-bf18-ef8bbafc6f04 NaN 2020-02-28 2020-03-10 Clinic B 1942-08-24 NaN F nonhispanic white 0.8 21.3 NaN 226.3 0.0023 500000.0 468600.0 NaN 12.6 17.2 0.91 81.7 NaN 138.2
26 edad31f3-5a08-4678-8d31-271a41a2aad5 NaN 2020-03-05 2020-03-13 Clinic C 1940-01-09 2020-03-19 M nonhispanic white 0.6 78.6 NaN 306.4 0.0256 2600.0 1764.2 0.2 13.8 18.0 1.14 83.0 62.0 141.2
27 edad31f3-5a08-4678-8d31-271a41a2aad5 2020-03-12 2020-03-12 2020-03-20 Clinic C 1940-01-09 2020-03-19 M nonhispanic white 0.3 184.4 NaN 370.1 0.0639 20600.0 1804.0 0.3 11.5 7.7 1.23 84.4 NaN 142.2
28 4607a669-4a97-4f0a-9661-856569905047 NaN 2020-03-09 2020-03-21 Clinic C 1993-11-26 NaN F nonhispanic white 1.1 48.7 NaN NaN 0.0037 600.0 590.9 0.1 12.5 13.7 0.97 81.1 NaN 142.4
29 c1800ba1-7cba-45d7-bdc4-0e0b583932e4 NaN 2020-02-23 2020-03-08 Clinic A 2018-01-20 NaN M hispanic white NaN 25.7 NaN 306.4 0.0030 600.0 464.0 0.0 10.9 7.6 NaN 77.8 NaN 141.7
30 d2718050-2e9c-4d5b-842e-52d910c1563f NaN 2020-03-04 2020-03-17 Clinic C 1997-06-01 NaN M nonhispanic white 1.2 50.6 NaN 339.8 0.0047 600.0 634.5 0.2 10.2 13.6 1.08 81.5 NaN 137.9
31 d2718050-2e9c-4d5b-842e-52d910c1563f NaN 2020-03-17 2020-03-22 Clinic A 1997-06-01 NaN M nonhispanic white 1.3 40.5 NaN 186.6 0.0038 500.0 322.6 0.1 10.8 16.5 0.85 99.6 NaN 139.6
32 818566cb-c89b-42d8-a6af-1a1ef13ed7cf NaN 2020-03-08 2020-03-20 Clinic C 1984-10-11 NaN F nonhispanic white 0.9 27.0 NaN 239.3 0.0023 500.0 474.7 0.1 11.8 9.8 0.94 85.7 NaN 137.9
33 000e7adf-cbaa-4fad-ab2f-658c32f7d4d3 NaN 2020-03-12 2020-03-16 Clinic B 1959-01-03 2020-03-15 M nonhispanic white 0.6 175.1 NaN 326.5 0.0063 1600000.0 1010100.0 400.0 12.5 16.4 1.37 81.4 79.6 143.0
34 5a2f02ce-0286-45ae-b992-05331cb88379 NaN 2020-03-11 2020-03-29 Clinic C 1973-06-30 NaN F nonhispanic white 1.0 27.7 NaN 349.5 NaN 600.0 NaN 0.2 10.7 14.8 0.86 87.6 NaN 143.7
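
One way to start probing the outliers and missing values in this sample is to compare lab values across clinics and to count the NaN entries per column. The sketch below retypes a handful of rows from the table above (indices 5, 6, 7, 8, 10, 12, 15, and 31) rather than loading the CSV; the variable names are illustrative, and the choice of a per-clinic median comparison is an assumption about what is worth checking, not a prescribed method.

```python
import pandas as pd

nan = float("nan")

# Eight rows copied from the sample output above (indices 5, 6, 7, 8, 10, 12, 15, 31).
sample = pd.DataFrame({
    "clinic": ["Clinic B", "Clinic B", "Clinic C", "Clinic C",
               "Clinic B", "Clinic C", "Clinic A", "Clinic A"],
    "DDIMER": [400000.0, 700000.0, 600.0, 1100.0, 600000.0, 500.0, nan, 500.0],
    "FERRITIN": [503900.0, 579600.0, 562.3, 877.6, 615900.0, 459.5, 1552.8, 322.6],
    "PROCTL": [0.0, 100.0, 0.2, 0.3, 100.0, 0.0, nan, 0.1],
})
labs = ["DDIMER", "FERRITIN", "PROCTL"]

# Median of each lab value per clinic (NaN entries are skipped by default).
medians = sample.groupby("clinic")[labs].median()

# Each clinic's median relative to the smallest clinic median for that lab.
ratio = medians / medians.min()

# Fraction of missing entries in each lab column.
missing_fraction = sample[labs].isna().mean()

print(medians)
print(ratio.round(1))
print(missing_fraction)
```

In this small excerpt, the Clinic B medians come out roughly a thousand times larger than those of the other clinics, which would be consistent with a difference in reporting units. That is only a hypothesis drawn from eight rows and would need to be checked against the full dataset.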

In the following quiz, you will answer questions examining the issues of Team 1 and Team 2.

