A note for the professor and TAs: refer to your email for a notebook containing more information about the data set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Read in the data set from the Excel file:
df = pd.read_excel('dataset_proj.xlsx')
The goal of this project is to determine what factors influence booking rates at Medmo Inc., a healthcare technology company. Medmo facilitates radiology appointments for its clients, both B2B and B2C. Every scan request that comes in through its online portal contains a multitude of information that could be helpful in drawing correlations and creating prediction models. I, Zoe Birnbaum, will analyze the company's data set of all appointment requests to discover correlations between variables, identify what influences the likelihood that a requested appointment is booked, and determine whether we can predict booking based on such variables.
This data includes all appointments requested through the company's portal from April 2018 through April 2024. Attached to each request is information regarding the geographic location of the patient, type of scan/appointment, B2B vs. B2C, stage of the request (scheduled, completed, closed, etc.), date and time of the request (and of the booking, if applicable), and more. I chose to work with this data set because of my experience working with the company and my personal interest in the healthcare technology field. The set provides a sufficient amount of data, information, and variables for me to pursue my goal for this project.
Research Questions
Questions I hope to answer with this data:
RQ1: What factors are correlated with successful bookings? Based on those factors, can we predict whether a request will be booked?
RQ2: Is the complexity of a request correlated with the time taken to book the appointment after the request is made?
RQ1 will look at all requests, and RQ2 will look at only requests that are scheduled and beyond (only those requests have additional data on booking, such as time taken). In the data-cleaning section, I will create a DataFrame for each RQ (df1 and df2, respectively).
Why this data set is correct for answering the above
The data set contains variables for all factors included in my research questions, and I have over 80k scan requests to work with, which should yield statistically meaningful results.
Hypotheses:
1. The following factors can predict a successful appointment request: the patient's state, the scan modality, the number of previous cancelations, and the complexity of the request.
2. The complexity of a request and the time taken to book the appointment (measured from the time the appointment was requested) are positively correlated.
Since each observation is identified by the request ID, I'll make TestRequestID the index:
df.set_index("TestRequestID", inplace=True)
I'll drop the Latitude, Longitude, and ZipCode columns, and just use the data set's State column for location purposes.
df = df.drop(columns=['Latitude', 'Longitude', 'ZipCode'])
The column HasReportFile is currently represented as a Boolean (1 and 0), but would be more readable with values "Yes" and "No" instead.
df['HasReportFile'] = df['HasReportFile'].map({
    0: "No",
    1: "Yes"
})
Since I'll be looking at the time between DateRequested and DateBooked, I'm going to create a new column in the DataFrame called BookingTime that represents this difference:
df["BookingTime"] = df["DateBooked"] - df["DateRequested"]
An important metric in analyzing this data is the complexity of a scan request. Additional tasks and comments on a scan request, such as needing to ask a physician for a script, increase the complexity and I hypothesize that this a) reduces the likelihood that a request will become scheduled, and b) increases BookingTime. Let's create an algorithm to calculate complexity of a scan request:
def calculate_request_complexity(dfRow):
    score = 0
    # The weight of complexity for each task/comment was provided to me by the company:
    score += (dfRow['ExternalTasks'] * 10)
    score += (dfRow['InternalTasks'] * 5)
    score += (dfRow['B2BComments'] * 2)
    score += (dfRow['InternalComments'] * 2)
    score += (dfRow['ICComments'] * 1)
    return score
I'll add a Complexity column to the DataFrame so I can use this measure in the future:
df['Complexity'] = df.apply(calculate_request_complexity, axis=1)
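As an aside, a per-row apply can be slow on 80k+ rows. A minimal vectorized sketch of the same computation (using the same weights as above) would be:

# Vectorized equivalent (sketch): computes the same weighted sum without a per-row apply.
weights = {"ExternalTasks": 10, "InternalTasks": 5,
           "B2BComments": 2, "InternalComments": 2, "ICComments": 1}
complexity_vectorized = sum(df[col] * w for col, w in weights.items())
print((complexity_vectorized == df['Complexity']).all())  # should print True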
Dropping StatusID, as it's redundant with HasReportFile, and I will stick with the latter in this data set. Also dropping the other booking-related date columns (FirstBooking, LastBooking, DateRequested, DateBooked), since the new BookingTime column is what I'll be using:
df = df.drop(columns=['StatusID', 'FirstBooking', 'LastBooking', 'DateRequested', 'DateBooked'])
Time to handle the NaNs/missing data. These are the columns that contain such values and how many:
print(df.isnull().any())
StageID               False
StageName             False
Modality              False
ChannelSource         False
AppointmentDate        True
State                  True
Cancelations          False
InternalTasks         False
ExternalTasks         False
PriorAuthStatus       False
TimeToOrderEntry       True
TimeToA1Completion     True
InternalComments      False
B2BComments           False
ICComments            False
HasReportFile         False
BookingTime            True
Complexity            False
dtype: bool
print("AppointmentDate: ", len(df[df['AppointmentDate'].isnull() == True]))
print("State: ", len(df[df['State'].isnull() == True]))
print("TimeToOrderEntry: ", len(df[df['TimeToOrderEntry'].isnull() == True]))
print("TimeToA1Completion: ", len(df[df['TimeToA1Completion'].isnull() == True]))
print("BookingTime: ", len(df[df['BookingTime'].isnull() == True]))
AppointmentDate:  35840
State:  595
TimeToOrderEntry:  27951
TimeToA1Completion:  70925
BookingTime:  35838
BookingTime and AppointmentDate only contain data for requests that are scheduled and beyond, which is why there are so many null values. We are not looking at either variable for RQ1, so I will not include those columns in df1, and I will double-check that this problem is solved when df2 is created.
TimeToA1Completion and TimeToOrderEntry contain too many NaNs, and I am not looking at either variable in either RQ, so I will drop those columns as well.
State, however, is very relevant to RQ1, so I will drop the rows that are missing that data. The number of observations dropped (595) is negligible relative to the total (80k+), unlike the other four NaN-heavy variables, which is why I am taking this approach with State.
df = df.drop(columns=['TimeToA1Completion', 'TimeToOrderEntry'])
df = df.dropna(subset=['State'])
Time to create df1 and df2. Recall, df1 is for RQ1, which is looking at all requests (many of which have not been booked), and df2 is for RQ2, which is looking at only requests that have been scheduled and/or completed. In other words, df2 is a subset of df1, with the addition of "booking" variables that would have been null in df1.
df1 = df.drop(columns=['AppointmentDate', 'BookingTime'])
df2 = df[df['StageID'] >= 4].copy()  # .copy() avoids a SettingWithCopyWarning when columns are added to df2 later
# StageID of 4 = scheduled, StageID of 5 = completed
Let's double check both DataFrames for null values:
print(df1.isnull().any())
print(df2.isnull().any())
StageID             False
StageName           False
Modality            False
ChannelSource       False
State               False
Cancelations        False
InternalTasks       False
ExternalTasks       False
PriorAuthStatus     False
InternalComments    False
B2BComments         False
ICComments          False
HasReportFile       False
Complexity          False
dtype: bool
StageID             False
StageName           False
Modality            False
ChannelSource       False
AppointmentDate      True
State               False
Cancelations        False
InternalTasks       False
ExternalTasks       False
PriorAuthStatus     False
InternalComments    False
B2BComments         False
ICComments          False
HasReportFile       False
BookingTime          True
Complexity          False
dtype: bool
print("AppointmentDate: ", len(df2[df2['AppointmentDate'].isnull() == True]))
print("BookingTime: ", len(df2[df2['BookingTime'].isnull() == True]))
AppointmentDate:  3
BookingTime:  1
Again, out of 80,000+ observations, these amounts are negligible so they can be dropped.
df2 = df2.dropna(subset=['BookingTime', 'AppointmentDate'])
print(df2.isnull().any())
StageID             False
StageName           False
Modality            False
ChannelSource       False
AppointmentDate     False
State               False
Cancelations        False
InternalTasks       False
ExternalTasks       False
PriorAuthStatus     False
InternalComments    False
B2BComments         False
ICComments          False
HasReportFile       False
BookingTime         False
Complexity          False
dtype: bool
Perfect! The data is ready to work with.
State variable:
Since I am interested in the relationship between successful bookings and geographic location, I'll start by taking a look at data related to requests by state:
df1[["State"]].describe()
|        | State |
|--------|-------|
| count  | 85326 |
| unique | 50    |
| top    | NY    |
| freq   | 53913 |
New York leads the states in total number of requests at 53,913, which is 63.18% of all requests:
percent = (df1['State'].value_counts()["NY"]) / (len(df1))
print(f"{percent:.2%}")
63.18%
Next, I'll generate a graph showing the percentage distribution of requests by state:
state_counts = df1['State'].value_counts()
state_percentages = state_counts / (len(df1)) * 100
state_percentages.plot.bar(figsize=(20, 10), ylabel='Percentage of test requests', xlabel='State')
<Axes: xlabel='State', ylabel='Percentage of test requests'>
The leading states are New York, Florida, and New Jersey. The following joint and marginal distribution table shows the relationship between StageID and State for State = (NJ | NY | FL):
top_states = df[df['State'].isin(['NJ', 'NY', 'FL'])]
pd.crosstab(top_states.StageID, top_states.State, normalize=True, margins=True)
| StageID | FL       | NJ       | NY       | All      |
|---------|----------|----------|----------|----------|
| 0       | 0.114878 | 0.019356 | 0.268086 | 0.402320 |
| 4       | 0.003031 | 0.000076 | 0.025163 | 0.028271 |
| 5       | 0.162505 | 0.013600 | 0.393304 | 0.569409 |
| All     | 0.280413 | 0.033033 | 0.686554 | 1.000000 |
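Because New York dominates the request volume, a conditional distribution (normalizing within each state) may make cross-state comparison easier. A minimal sketch, reusing top_states from above:

# Conditional distribution (sketch): each column sums to 1, so stages are
# directly comparable across states despite NY's much larger request volume.
pd.crosstab(top_states.StageID, top_states.State, normalize="columns")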
Throughout this report I will refer to requests as "successful", so let's define that term. A successful request is a scan that is at the scheduling stage or beyond: it is either scheduled or completed. What percentage of scan requests falls into this category?
df_scheduled_completed = df1[(df1["StageName"] == "Scheduled") | (df1["StageName"] == "Completed")]
print(f"{(len(df_scheduled_completed) / len(df1)):.2%}")
58.59%
The "best-case scenario" for a scan request is one that is not only scheduled/completed, but has a report file attached. Let's see what percent of requests are at the best-case level:
best_case = df_scheduled_completed[df_scheduled_completed["HasReportFile"] == "Yes"]
print(f"{(len(best_case) / len(df1)):.2%}")
35.63%
Modality variable:
Next, let's look at the concentrations of scan types. Below is the number of successful scan requests per type of scan (the Modality variable):
counts = df_scheduled_completed['Modality'].value_counts()
counts.plot.bar(figsize=(20, 10), ylabel='Number of Scan Requests', xlabel='Type of Scan (Modality)')
<Axes: xlabel='Type of Scan (Modality)', ylabel='Number of Scan Requests'>
From this bar graph, we can see that Ultrasound is by far the most popular scan out of all those scheduled and completed, followed by MRI and Mammogram. Let's see what percentage this is vs. the percentage of Ultrasounds in all scans:
percent_ultrasound_success = len(df_scheduled_completed[df_scheduled_completed["Modality"] == "Ultrasound"])/len(df_scheduled_completed)
print(f"{percent_ultrasound_success:.2%}")
39.44%
percent_all_ultrasounds = len(df1[df1["Modality"] == "Ultrasound"])/len(df1)
print(f"{percent_all_ultrasounds:.2%}")
35.47%
Ultrasounds make up a bit more than a third of all scan requests (35.47%). The share of Ultrasounds among the "successful" group (39.44%) is only slightly higher than in the "all scans" group.
Another variable of interest in this project is Cancelations, which represents the number of times the order was canceled (if any). I hypothesized that zero previous cancelations influence successful bookings, but let's see how many cancelations there are at each level of StageID:
dfx = df[['StageID', 'Cancelations']]
dfx.groupby('StageID').sum()
| StageID | Cancelations |
|---------|--------------|
| 0       | 16995        |
| 4       | 636          |
| 5       | 8008         |
The number of cancelations among successful bookings (StageID 4 and 5) is lower than among unsuccessful ones.
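Since the stage groups contain very different numbers of requests, raw sums can mislead; a per-request average is a fairer comparison. A minimal sketch, reusing dfx from above:

# Average cancelations per request at each stage (sketch): normalizes the
# sums above by the number of requests in each stage group.
dfx.groupby('StageID')['Cancelations'].mean()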
As mentioned before, I added Complexity to the DataFrame(s). Let's take a closer look at that variable and how it relates to other variables of interest:
df['Complexity'].describe()
count    85326.000000
mean        32.751471
std         34.935679
min          2.000000
25%         12.000000
50%         22.000000
75%         40.000000
max        596.000000
Name: Complexity, dtype: float64
The scatter plot below visualizes the relationship between Complexity, Cancelations, and StageID, where all variables are standardized to be on the same scale:
df_std = df[['Complexity', 'Cancelations', 'StageID']]
df_std = ((df_std - df_std.mean())/df_std.std())
df_std.plot.scatter(x="Complexity", y="Cancelations", c="StageID",
                    cmap="plasma", alpha=.5);
We can see that successful requests are highly concentrated in the low-complexity, zero-cancelation region. This is promising for my first hypothesis; however, model creation and evaluation are still necessary.
To perform more advanced calculations on BookingTime, I will convert these values from Timedelta objects into floats, where the unit is days:
df2["BookingTimeInts"] = (df2["BookingTime"].dt.days +
(df2["BookingTime"].dt.components["hours"]/24) + # 24 hours per day
(df2["BookingTime"].dt.components["minutes"]/1440) + # 1440 minutes per day
(df2["BookingTime"].dt.components["seconds"])) # 86400 seconds per day
This next scatter plot shows the relationship between complexity and time taken to book, also standardized to be on the same scale:
df2_std = df2[['Complexity', 'BookingTimeInts']]
df2_std = ((df2_std - df2_std.mean())/df2_std.std())
plt.scatter(df2_std["Complexity"], df2_std["BookingTimeInts"])
plt.xlabel("Complexity of Request")
plt.ylabel("Time Taken to Book (from Time Requested) (In Days)")
plt.show()
What can we see? Complexity and booking time appear positively related, but with substantial noise. To formalize these observations, I will build two prediction models:
I will use the state, modality, cancellations, and complexity score (IVs) to predict whether an appointment will be booked (DV). I will use a K-Nearest-Neighbors Classifier, as this DV is categorical.
I will use the complexity score to predict the time between when the request is made and appointment is booked. I will use a K-Nearest-Neighbors Regressor, as this DV is continuous.
As mentioned above, I am going to implement a simple 5-nearest-neighbors classifier to predict the StageID of a scan request:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction import DictVectorizer
# Define the training data and represent features as a list of dictionaries
features = ["State", "Modality", "Cancelations", "Complexity"]
x_train_dict = df1[features].to_dict(orient="records")
y_train = df1["StageID"]
# dummy encoding:
vec = DictVectorizer(sparse=False)
vec.fit(x_train_dict)
x_train = vec.transform(x_train_dict)
# scale the features:
scaler = StandardScaler()
scaler.fit(x_train)
x_train_std = scaler.transform(x_train)
# Fit the 5-nearest neighbors model:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train_std, y_train)
y_train_pred = model.predict(x_train_std)
y_train_pred
array([0, 0, 0, ..., 0, 5, 5])
What will the model predict for a request that matches the first hypothesis?
x_new_dict = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 5  # recall 50th percentile of Complexity is ~22
}]
x_new = vec.transform(x_new_dict)
x_new_std = scaler.transform(x_new)
print(model.predict(x_new_std))
model.predict_proba(x_new_std)
[0]
array([[0.6, 0. , 0.4]])
With a 5-nearest-neighbors model, the prediction is that such a request is of StageID == 0 (which corresponds to StageName == Closed) with a 60% probability. I will evaluate accuracy at each k and choose the k with the highest accuracy to use for the model:
def get_accuracy(k):
    model = KNeighborsClassifier(n_neighbors=k)
    pipeline = Pipeline([
        ("scaler", scaler),
        ("model", model)
    ])
    accuracy = cross_val_score(pipeline, x_train, y_train, cv=10, scoring="accuracy").mean()
    return accuracy
ks = pd.Series(range(1, 31))
ks.index = range(1, 31)
accuracies = ks.apply(get_accuracy)
accuracies.plot.line()
accuracies.sort_values()
1     0.551650
2     0.551755
4     0.596981
3     0.599513
5     0.618253
10    0.619143
11    0.620632
6     0.621300
7     0.622882
8     0.624136
12    0.624488
13    0.628168
15    0.628261
9     0.628706
23    0.629621
21    0.629773
17    0.630019
27    0.630523
22    0.630547
14    0.630605
25    0.630640
19    0.631004
26    0.631050
16    0.631590
29    0.631683
24    0.631976
20    0.632187
30    0.632222
28    0.632773
18    0.633395
dtype: float64
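Rather than reading the table by eye, the best k can also be selected programmatically; a minimal sketch:

# Select the k with the highest cross-validated accuracy (sketch):
best_k = accuracies.idxmax()
print(best_k)  # 18, matching the table above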
k=18 optimizes accuracy:
# Fit the 18-nearest neighbors model:
model = KNeighborsClassifier(n_neighbors=18)
model.fit(x_train_std, y_train)
# Use the model to predict on the same test point
print(model.predict(x_new_std))
model.predict_proba(x_new_std)
[0]
array([[0.83333333, 0. , 0.16666667]])
With the value of k that optimizes accuracy, the prediction is the same, but with a higher probability. I subjectively chose the value of 5 for "low complexity" on the new test point, so I will now use the model to predict whether requests with Complexity == 0, 12, and 20 yield the same result. I chose these numbers because they are other reasonable ways to define "low" based on the distribution of Complexity:
x_new_dict_0 = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 0
}]
x_new_0 = vec.transform(x_new_dict_0)
x_new_0_std = scaler.transform(x_new_0)
print(model.predict(x_new_0_std))
model.predict_proba(x_new_0_std)
[0]
array([[0.61111111, 0. , 0.38888889]])
x_new_dict_12 = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 12
}]
x_new_12 = vec.transform(x_new_dict_12)
x_new_12_std = scaler.transform(x_new_12)
print(model.predict(x_new_12_std))
model.predict_proba(x_new_12_std)
[5]
array([[0.38888889, 0. , 0.61111111]])
x_new_dict_20 = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 20
}]
x_new_20 = vec.transform(x_new_dict_20)
x_new_20_std = scaler.transform(x_new_20)
print(model.predict(x_new_20_std))
model.predict_proba(x_new_20_std)
[5]
array([[0., 0., 1.]])
If we combine and average the probabilities of these low-complexity requests being at stage Closed or Completed...
prob_0 = (0.833 + 0.611 + 0.389 + 0) / 4
prob_5 = (0.167 + 0.389 + 0.611 + 1) / 4
print(prob_0, prob_5)
0.45825 0.54175
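For reproducibility, the same averages can be computed directly from the model rather than from the rounded printed values; a minimal sketch:

# Recompute the averaged class probabilities straight from the model (sketch):
low_complexities = [5, 0, 12, 20]
probs = [model.predict_proba(scaler.transform(vec.transform([{
            "State": "NY", "Modality": "Ultrasound",
            "Cancelations": 0, "Complexity": c}])))[0]
         for c in low_complexities]
print(np.mean(probs, axis=0))  # average probability per class (StageID 0, 4, 5)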
We can see that it is more likely that a low-complexity request will be Completed rather than Closed. Below, I will compute both the precision and recall of this model: precision is the proportion of requests predicted to be successful that truly are successful, whereas recall is the proportion of truly successful requests that the model correctly identifies:
pipeline = Pipeline([
    ("scaler", scaler),
    ("model", model)
])
# Precision:
cross_val_score(pipeline, x_train, (y_train >= 4),
                cv=10, scoring="precision").mean()
0.7173601228151975
# Recall:
cross_val_score(pipeline, x_train, (y_train >= 4),
                cv=10, scoring="recall").mean()
0.6962038847769554
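As a cross-check on these scores, precision and recall can also be derived from a confusion matrix over cross-validated predictions; a minimal sketch:

# Cross-check (sketch): build a confusion matrix from cross-validated
# predictions and recompute precision and recall from its cells.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_pred = cross_val_predict(pipeline, x_train, (y_train >= 4), cv=10)
tn, fp, fn, tp = confusion_matrix((y_train >= 4), y_pred).ravel()
print("precision:", tp / (tp + fp))  # of requests predicted successful, the share that truly are
print("recall:   ", tp / (tp + fn))  # of truly successful requests, the share the model finds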
In conclusion, scan requests in which the location of the requested appointment is New York, the type of scan is an Ultrasound, there have been no previous cancelations, and complexity is low (below the 50th percentile of all Complexity scores) are likely to become successful, with ~54% probability. This is supported by the model's moderate (but balanced) precision and recall.
As mentioned above, I am going to implement a K-nearest-neighbors regressor to predict BookingTime based on Complexity:
def get_kNN_pred(x_new, n):
    """Given a new observation, returns the n-nearest-neighbors prediction."""
    dists = np.sqrt(((x_train - x_new)**2).sum(axis=1))
    inds_sorted = dists.sort_values().index[:n]
    return y_train.loc[inds_sorted].mean()
x_train = df2[["Complexity"]]
y_train = df2["BookingTimeInts"]
x_new = pd.DataFrame()
x_new["Complexity"] = np.arange(0, 600, 5)
The graph below shows the scatter plot of Complexity and BookingTime from the DataFrame, overlaid with the regression model's prediction line for k=100:
# Predictive model for k=100 nearest neighbors:
y_pred_100 = x_new.apply(get_kNN_pred, axis=1, args=(100,))
y_pred_100.index = x_new["Complexity"]
ax = df2.plot.scatter(x="Complexity", y="BookingTimeInts", color="black", alpha=.2)
y_pred_100.plot.line(color="blue", ax=ax)
<Axes: xlabel='Complexity', ylabel='BookingTimeInts'>
Complexity and BookingTime are positively correlated, supporting my second hypothesis. Next, I will evaluate the strength of this correlation:
index_array = np.array(y_pred_100.index)
values_array = np.array(y_pred_100.values)
np.corrcoef(index_array, values_array)[0, 1]
0.9343630142678436
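Note that this r is computed on the smoothed model predictions rather than the raw observations; the raw correlation, which is expected to be weaker since individual requests are noisy, can be checked with a one-line sketch:

# Correlation on the raw (unsmoothed) observations, for comparison (sketch):
df2["Complexity"].corr(df2["BookingTimeInts"])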
Based on this regression model, the correlation between BookingTime and Complexity is positive and very strong. To evaluate the performance of this model, I'll compute the cross-validation error at different levels of k:
from sklearn.neighbors import KNeighborsRegressor
x_dict = df2[["Complexity"]].to_dict(orient="records")
y = df2["BookingTimeInts"]
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mae = np.mean(-cross_val_score(
        pipeline, x_dict, y,
        cv=10, scoring="neg_mean_absolute_error"
    ))
    return mae
ks = pd.Series(range(1, 101))
ks.index = range(1, 101)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line()
test_errs.sort_values()
100    20.999402
99     21.022531
98     21.037882
97     21.054166
96     21.069855
         ...
5      33.259823
4      34.549659
3      35.233311
2      36.569867
1      38.069530
Length: 100, dtype: float64
From this graph, we can see that the (cross-validation) test error decreases as k increases and starts to plateau around k=100, where the error is ~21. I predict it will continue to plateau as k increases, but let's double-check with a very large value of k:
k = pd.Series(range(30000, 30001))
k.index = range(30000, 30001)
test_err = k.apply(get_cv_error)
print(test_err)
30000 19.721953 dtype: float64
The test error decreases as k increases and effectively plateaus once k reaches about 100; even at a very large k (30,000), the error improves only slightly (~19.7 vs. ~21).
Findings
The bulk of all scan requests are from New York and are for Ultrasounds.
Cancelations are more prevalent in unsuccessful scan requests than in successful ones.
The following values were able to predict a successful scan request: State = NY, Modality = Ultrasound, Cancelations = 0, and low Complexity (below the 50th percentile).
However, the precision (0.717) and recall (0.696) of this classification model, along with the modest probability of the above prediction (54%), indicate that future work may be needed to create a better model for this prediction.
The complexity of a scan request and the time taken to book the appointment are strongly and positively correlated (r = 0.934).
Limitations
As displayed in some of the graphs and tables in this report, there was no representation of StageID values 1-3 in this data set (respectively: Requested, Qualified, and Imaging Center Assigned). Even though the StageID variable was primarily treated in this report as successful vs. unsuccessful, the lack of representation of these intermediate values makes this a non-representative sample.