Influences on Medmo Booking Rates¶

Name: Zoe Birnbaum¶

Link to Github Pages: https://zbirnbaum.github.io/¶

A note for the professor and TAs: refer to your email for a notebook containing more information about the data set.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read in the data set from the Excel file:

In [2]:
df = pd.read_excel('dataset_proj.xlsx')

Project Goals¶

The goal of this project is to determine what factors influence booking rates at Medmo Inc., a healthcare technology company. Medmo facilitates radiology appointments for its clients, both B2B and B2C. Every scan request that comes in through its online portal contains a multitude of information that could be helpful for drawing correlations and building prediction models. I, Zoe Birnbaum, will analyze the company's data set of all appointment requests to identify which variables influence the likelihood that a requested appointment is booked, and whether we can predict booking from those variables.

The Data Set¶

This data includes all requested appointments from the company's portal from April 2018 through April 2024. Attached to each request is information about the patient's geographic location, the type of scan/appointment, B2B vs. B2C channel, the stage of the request (scheduled, completed, closed, etc.), the date and time of the request (and of booking, if applicable), and more. I chose to work with this data set because of my experience working with the company and my personal interest in healthcare technology. The set provides enough observations and variables to support the goals of this project.

Research Questions

Questions I hope to answer with this data:

  1. What factors are correlated with successful bookings? Based on those factors, can we predict whether a request will be booked?

    • Geographic location of the patient/appointment request
    • Type of scan
    • Number of previous cancellations (if any)
    • Complexity of the request (additional tasks, etc.)
  2. Is the complexity of a request correlated with the time taken to book the appointment after it is requested?

RQ1 will look at all requests, and RQ2 will look only at requests that are scheduled and beyond (only those requests have additional data on booking, such as time taken). In the data-cleaning section, I will create a separate DataFrame for each RQ (df1 and df2, respectively).

Why this data set is suited to answering the above

The data set contains variables for all of the factors listed under my research questions. I have over 80k scan requests to work with, which should be enough to yield statistically meaningful results.

Hypotheses:

  1. The following factors can predict a successful appointment request:

    • The location for the requested appointment is New York
    • The type of scan is an Ultrasound
    • Request is of low complexity (minimal additional tasks, etc. -- see 'Request Complexity' section)
    • 0 previous appointment cancelations
  2. The complexity of a request and the time taken to book the appointment -- from the time the appointment was requested -- are positively correlated.

Cleaning up the data¶

Since each observation is identified by the request ID, I'll make TestRequestID the index:

In [8]:
df.set_index("TestRequestID", inplace = True)

I'll drop the Latitude, Longitude, and ZipCode columns, and just use the data set's State column for location purposes.

In [9]:
df = df.drop(columns=['Latitude', 'Longitude', 'ZipCode'])

The column HasReportFile is currently represented as Boolean (1 and 0), but would be more readable with values "Yes" and "No" instead.

In [10]:
df['HasReportFile'] = df['HasReportFile'].map({
    0: "No",
    1: "Yes"
})

Since I'll be looking at the time between DateRequested and DateBooked, I'm going to create a new column in the DataFrame called BookingTime that represents this difference:

In [11]:
df["BookingTime"] = df["DateBooked"] - df["DateRequested"]

An important metric in analyzing this data is the complexity of a scan request. Additional tasks and comments on a scan request, such as needing to ask a physician for a script, increase its complexity, and I hypothesize that this a) reduces the likelihood that a request becomes scheduled, and b) increases BookingTime. Let's define a function to calculate the complexity of a scan request:

In [12]:
def calculate_request_complexity(dfRow):
  score = 0
  # The weight of complexity for each task/comment was provided to me by the company:
  score += (dfRow['ExternalTasks'] * 10)
  score += (dfRow['InternalTasks'] * 5)
  score += (dfRow['B2BComments'] * 2)
  score += (dfRow['InternalComments'] * 2)
  score += (dfRow['ICComments'] * 1)
  return score
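
As a quick sanity check, here is the score for a hypothetical request (not a row from the data set) with one external task, one internal task, two B2B comments, and three IC comments:

example_row = pd.Series({
    'ExternalTasks': 1,      # hypothetical values, for illustration only
    'InternalTasks': 1,
    'B2BComments': 2,
    'InternalComments': 0,
    'ICComments': 3
})
# 1*10 + 1*5 + 2*2 + 0*2 + 3*1 = 22
print(calculate_request_complexity(example_row))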

I'll add a Complexity column to the DataFrame so I can use this measure in the future:

In [13]:
df['Complexity'] = df.apply(calculate_request_complexity, axis=1)

Dropping StatusID, as it's redundant with HasReportFile, and I will stick with the latter in this data set. I am also dropping the other booking-related variables, since the new BookingTime column is what I'll be using:

In [14]:
df = df.drop(columns=['StatusID', 'FirstBooking', 'LastBooking', 'DateRequested', 'DateBooked'])

Time to handle the NaNs/missing data. The output below shows which columns contain such values, and the following cell counts how many:

In [15]:
print(df.isnull().any())
StageID               False
StageName             False
Modality              False
ChannelSource         False
AppointmentDate        True
State                  True
Cancelations          False
InternalTasks         False
ExternalTasks         False
PriorAuthStatus       False
TimeToOrderEntry       True
TimeToA1Completion     True
InternalComments      False
B2BComments           False
ICComments            False
HasReportFile         False
BookingTime            True
Complexity            False
dtype: bool
In [16]:
print("AppointmentDate: ", len(df[df['AppointmentDate'].isnull() == True]))
print("State: ", len(df[df['State'].isnull() == True]))
print("TimeToOrderEntry: ", len(df[df['TimeToOrderEntry'].isnull() == True]))
print("TimeToA1Completion: ", len(df[df['TimeToA1Completion'].isnull() == True]))
print("BookingTime: ", len(df[df['BookingTime'].isnull() == True]))
AppointmentDate:  35840
State:  595
TimeToOrderEntry:  27951
TimeToA1Completion:  70925
BookingTime:  35838

BookingTime and AppointmentDate only contain data for requests that are scheduled and beyond, which is why they have so many null values. Neither variable is used in RQ1, so I will not include those columns in df1, and I'll double-check that the problem is resolved once df2 is created.

TimeToA1Completion and TimeToOrderEntry contain too many NaNs, and neither variable is used in either RQ, so I will drop those columns as well.

State, however, is very relevant to RQ1, so I will drop the rows missing that value. The 595 observations dropped are negligible relative to the total (80k+), unlike the other four NaN-heavy variables, which is why I am taking this approach with State.

In [17]:
df = df.drop(columns=['TimeToA1Completion', 'TimeToOrderEntry'])
In [18]:
df = df.dropna(subset=['State'])

Time to create df1 and df2. Recall that df1 is for RQ1, which looks at all requests (many of which were never booked), and df2 is for RQ2, which looks only at requests that have been scheduled and/or completed. In other words, df2 covers a subset of df1's observations, with the addition of the booking variables that would have been null (and are therefore dropped) in df1.

In [19]:
df1 = df.drop(columns=['AppointmentDate', 'BookingTime'])
In [20]:
df2 = df[df['StageID'] >= 4]
# StageID of 4 = scheduled, StageID of 5 = completed

Let's double check both DataFrames for null values:

In [21]:
print(df1.isnull().any())
print(df2.isnull().any())
StageID             False
StageName           False
Modality            False
ChannelSource       False
State               False
Cancelations        False
InternalTasks       False
ExternalTasks       False
PriorAuthStatus     False
InternalComments    False
B2BComments         False
ICComments          False
HasReportFile       False
Complexity          False
dtype: bool
StageID             False
StageName           False
Modality            False
ChannelSource       False
AppointmentDate      True
State               False
Cancelations        False
InternalTasks       False
ExternalTasks       False
PriorAuthStatus     False
InternalComments    False
B2BComments         False
ICComments          False
HasReportFile       False
BookingTime          True
Complexity          False
dtype: bool
In [22]:
print("AppointmentDate: ", len(df2[df2['AppointmentDate'].isnull() == True]))
print("BookingTime: ", len(df2[df2['BookingTime'].isnull() == True]))
AppointmentDate:  3
BookingTime:  1

Again, out of 80,000+ observations, these counts are negligible, so the rows can be dropped.

In [23]:
df2 = df2.dropna(subset=['BookingTime', 'AppointmentDate'])
In [24]:
print(df2.isnull().any())
StageID             False
StageName           False
Modality            False
ChannelSource       False
AppointmentDate     False
State               False
Cancelations        False
InternalTasks       False
ExternalTasks       False
PriorAuthStatus     False
InternalComments    False
B2BComments         False
ICComments          False
HasReportFile       False
BookingTime         False
Complexity          False
dtype: bool

Perfect! The data is ready to work with.

Preliminary Data Workings¶

The State variable:¶

Since I am interested in the relationship between successful bookings and geographic location, I'll start by taking a look at data related to requests by state:

In [25]:
df1[["State"]].describe()
Out[25]:
         State
count    85326
unique      50
top         NY
freq     53913

New York leads all states in total requests at 53,913, which is 63.18% of all requests:

In [26]:
percent = (df1['State'].value_counts()["NY"]) / (len(df1))
print(f"{percent:.2%}")
63.18%
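
Equivalently, value_counts(normalize=True) returns this share directly:

print(f"{df1['State'].value_counts(normalize=True)['NY']:.2%}")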

Next, I'll generate a graph showing the percentage distribution of requests by state:

In [27]:
state_counts = df1['State'].value_counts()
state_percentages = state_counts / (len(df1)) * 100
In [28]:
state_percentages.plot.bar(figsize=(20, 10), ylabel='Percentage of test requests', xlabel='State')
Out[28]:
<Axes: xlabel='State', ylabel='Percentage of test requests'>

The leading states are New York, Florida, and New Jersey. The following joint and marginal distribution table shows the relationship between StageID and State for State in (NY, NJ, FL):

In [350]:
top_states = df[df['State'].isin(['NJ', 'NY', 'FL'])]
pd.crosstab(top_states.StageID, top_states.State, normalize=True, margins=True)
Out[350]:
State          FL        NJ        NY       All
StageID
0        0.114878  0.019356  0.268086  0.402320
4        0.003031  0.000076  0.025163  0.028271
5        0.162505  0.013600  0.393304  0.569409
All      0.280413  0.033033  0.686554  1.000000

"Successful" Requests¶

Throughout this report I will refer to requests as "successful," so let's define that term and gather some information about it. A successful request is a scan that is at the scheduling stage or beyond: it is scheduled or completed. What percentage of scan requests falls into this category?

In [30]:
df_scheduled_completed = df1[(df1["StageName"] == "Scheduled") | (df1["StageName"] == "Completed")]
print(f"{(len(df_scheduled_completed) / len(df1)):.2%}")
58.59%
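
Since StageID 4 and 5 correspond to Scheduled and Completed (as noted in the cleaning section), the same share can be computed from StageID in one line:

print(f"{(df1['StageID'] >= 4).mean():.2%}")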

The "best-case scenario" for a scan request is one that is not only scheduled/completed, but has a report file attached. Let's see what percent of requests are at the best-case level:

In [31]:
best_case = df_scheduled_completed[df_scheduled_completed["HasReportFile"] == "Yes"]
print(f"{(len(best_case) / len(df1)):.2%}")
35.63%

The Modality variable:¶

Next, let's look at the concentrations of scan types. Here is the number of successful scan requests per type of scan (the Modality variable):

In [32]:
counts = df_scheduled_completed['Modality'].value_counts()
counts.plot.bar(figsize=(20, 10), ylabel='Number of Scan Requests', xlabel='Type of Scan (Modality)')
Out[32]:
<Axes: xlabel='Type of Scan (Modality)', ylabel='Number of Scan Requests'>

From this bar graph, we can see that Ultrasound is by far the most common scan among those scheduled and completed, followed by MRI and Mammogram. Let's compare the share of Ultrasounds in the successful group with their share among all scans:

In [33]:
percent_ultrasound_success = len(df_scheduled_completed[df_scheduled_completed["Modality"] == "Ultrasound"])/len(df_scheduled_completed)
print(f"{percent_ultrasound_success:.2%}")
39.44%
In [34]:
percent_all_ultrasounds = len(df1[df1["Modality"] == "Ultrasound"])/len(df1)
print(f"{percent_all_ultrasounds:.2%}")
35.47%

Ultrasounds make up a bit more than a third of all scan requests, and their share barely differs between the "all scans" group and the "successful" group.

Cancelations¶

Another variable of interest in this project is Cancelations, which counts how many times the order was canceled (if at all). I hypothesized that zero previous cancelations predict a successful booking, but first let's see the total cancelations at each level of StageID:

In [359]:
dfx = df[['StageID', 'Cancelations']]
dfx.groupby('StageID').sum()
Out[359]:
         Cancelations
StageID
0               16995
4                 636
5                8008

Successful bookings (StageID 4 and 5) account for fewer total cancelations (8,644) than unsuccessful ones (16,995), even though successful requests make up the majority of observations (~59%).
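
Raw totals are affected by how many requests sit at each stage; a per-request rate (a quick sketch using groupby) makes the comparison cleaner:

# Mean cancelations per request at each stage, normalizing for group size
print(df.groupby('StageID')['Cancelations'].mean())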

Request Complexity¶

As mentioned before, I added Complexity to the DataFrame(s). Let's take a closer look at that variable and how it relates to the other variables of interest:

In [37]:
df['Complexity'].describe()
Out[37]:
count    85326.000000
mean        32.751471
std         34.935679
min          2.000000
25%         12.000000
50%         22.000000
75%         40.000000
max        596.000000
Name: Complexity, dtype: float64

The scatter plot below visualizes the relationship between Complexity, Cancelations, and StageID, where all variables are standardized to be on the same scale:

In [53]:
df_std = df[['Complexity', 'Cancelations', 'StageID']]
df_std = ((df_std - df_std.mean())/df_std.std())
In [54]:
df_std.plot.scatter(x="Complexity", y="Cancelations", c="StageID",
                     cmap="plasma", alpha=.5);

We can see that successful requests are highly concentrated in the low-complexity, zero-cancelations region. This is promising support for my first hypothesis; however, model creation and evaluation are still necessary.

To compute more than simple calculations on BookingTime, I will convert these values from Timedelta objects into floats, where the unit is days:

In [302]:
df2["BookingTimeInts"] = (df2["BookingTime"].dt.days +
 (df2["BookingTime"].dt.components["hours"]/24) +   # 24 hours per day
 (df2["BookingTime"].dt.components["minutes"]/1440) +   # 1440 minutes per day
 (df2["BookingTime"].dt.components["seconds"]))   # 86400 seconds per day

This next scatter plot shows the relationship between complexity and time taken to book, also standardized to be on the same scale:

In [340]:
df2_std = df2[['Complexity', 'BookingTimeInts']]
df2_std = ((df2_std - df2_std.mean())/df2_std.std())
In [341]:
plt.scatter(df2_std["Complexity"], df2_std["BookingTimeInts"])

plt.xlabel("Complexity of Request")
plt.ylabel("Time Taken to Book (from Time Requested) (In Days)")

plt.show()

The scatter plot suggests a positive association between complexity and booking time, though with considerable spread, so predictive models are needed to quantify these relationships:

Models Overview:¶

  1. I will use the state, modality, cancelations, and complexity score (IVs) to predict whether an appointment will be booked (DV). I will use a K-Nearest-Neighbors Classifier, as this DV is categorical.

  2. I will use the complexity score to predict the time between when the request is made and appointment is booked. I will use a K-Nearest-Neighbors Regressor, as this DV is continuous.

RQ1: will a request become successful?¶

As mentioned above, I am going to implement a simple 5-Nearest-Neighbors classifier to predict the StageID of a scan request:

In [86]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction import DictVectorizer
In [153]:
# Define the training data and represent features as a list of dictionaries
features = ["State", "Modality", "Cancelations", "Complexity"]
x_train_dict = df1[features].to_dict(orient="records")
y_train = df1["StageID"]
In [155]:
# dummy encoding:
vec = DictVectorizer(sparse=False)
vec.fit(x_train_dict)
x_train = vec.transform(x_train_dict)

# scale the features:
scaler = StandardScaler()
scaler.fit(x_train)
x_train_std = scaler.transform(x_train)
In [166]:
# Fit the 5-nearest neighbors model:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train_std, y_train)
y_train_pred = model.predict(x_train_std)
y_train_pred
Out[166]:
array([0, 0, 0, ..., 0, 5, 5])

What will the model predict for a request that matches the first hypothesis?

In [167]:
x_new_dict = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 5   # recall 50th percentile of Complexity is ~22
}]
x_new = vec.transform(x_new_dict)
x_new_std = scaler.transform(x_new)

print(model.predict(x_new_std))
model.predict_proba(x_new_std)
[0]
Out[167]:
array([[0.6, 0. , 0.4]])

With a 5-nearest-neighbors model, the prediction is that such a request ends at StageID == 0 (which corresponds to StageName == Closed), with a 60% probability. I will evaluate accuracy at each k and choose the k with the highest accuracy to use for the model:

In [113]:
def get_accuracy(k):
    model = KNeighborsClassifier(n_neighbors=k)
    pipeline = Pipeline([
        ("scaler", scaler),
        ("model", model)
    ])
    accuracy = cross_val_score(pipeline, x_train, y_train, cv=10, scoring="accuracy").mean()
    return accuracy
In [114]:
ks = pd.Series(range(1, 31))
ks.index = range(1, 31)
accuracies = ks.apply(get_accuracy)

accuracies.plot.line()
accuracies.sort_values()
Out[114]:
1     0.551650
2     0.551755
4     0.596981
3     0.599513
5     0.618253
10    0.619143
11    0.620632
6     0.621300
7     0.622882
8     0.624136
12    0.624488
13    0.628168
15    0.628261
9     0.628706
23    0.629621
21    0.629773
17    0.630019
27    0.630523
22    0.630547
14    0.630605
25    0.630640
19    0.631004
26    0.631050
16    0.631590
29    0.631683
24    0.631976
20    0.632187
30    0.632222
28    0.632773
18    0.633395
dtype: float64
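
Rather than scanning the sorted list, Series.idxmax() returns the accuracy-maximizing k directly:

best_k = accuracies.idxmax()
print(best_k, accuracies[best_k])   # 18 0.633395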

k=18 maximizes cross-validation accuracy:

In [168]:
# Fit the 18-nearest neighbors model:
model = KNeighborsClassifier(n_neighbors=18)
model.fit(x_train_std, y_train)

# Use the model to predict on the same test point
print(model.predict(x_new_std))
model.predict_proba(x_new_std)
[0]
Out[168]:
array([[0.83333333, 0.        , 0.16666667]])

With the accuracy-optimizing value of k, the prediction is the same, but with a higher probability. I subjectively chose the value of 5 for "low complexity" on the new test point, so I will now use the model to predict whether requests with Complexity of 0, 12, and 20 yield the same result. I chose these numbers because they are other reasonable definitions of "low": 0 means no tasks or comments at all, 12 is the 25th percentile, and 20 sits just below the median of ~22.

In [169]:
x_new_dict_0 = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 0
}]
x_new_0 = vec.transform(x_new_dict_0)
x_new_0_std = scaler.transform(x_new_0)

print(model.predict(x_new_0_std))
model.predict_proba(x_new_0_std)
[0]
Out[169]:
array([[0.61111111, 0.        , 0.38888889]])
In [172]:
x_new_dict_12 = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 12
}]
x_new_12 = vec.transform(x_new_dict_12)
x_new_12_std = scaler.transform(x_new_12)

print(model.predict(x_new_12_std))
model.predict_proba(x_new_12_std)
[5]
Out[172]:
array([[0.38888889, 0.        , 0.61111111]])
In [173]:
x_new_dict_20 = [{
    "State": "NY",
    "Modality": "Ultrasound",
    "Cancelations": 0,
    "Complexity": 20
}]
x_new_20 = vec.transform(x_new_dict_20)
x_new_20_std = scaler.transform(x_new_20)

print(model.predict(x_new_20_std))
model.predict_proba(x_new_20_std)
[5]
Out[173]:
array([[0., 0., 1.]])
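
The three cells above repeat the same pattern; for reference, a compact loop over the candidate complexity values produces the same predictions:

for c in [0, 12, 20]:
    x_c = scaler.transform(vec.transform([{
        "State": "NY",
        "Modality": "Ultrasound",
        "Cancelations": 0,
        "Complexity": c
    }]))
    print(c, model.predict(x_c), model.predict_proba(x_c))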

If we combine and average the probabilities of these low-complexity requests being at stage Closed or Completed...

In [174]:
prob_0 = (0.833 + 0.611 + 0.389 + 0) / 4
prob_5 = (0.167 + 0.389 + 0.611 + 1) / 4

print(prob_0, prob_5)
0.45825 0.54175

We can see that a low-complexity request is more likely to be Completed than Closed. Below, I will compute both the precision and the recall of this model: precision is the proportion of requests predicted to be successful that truly are successful, whereas recall is the proportion of truly successful requests that the model identifies as successful:

In [188]:
pipeline = Pipeline([
    ("scaler", scaler),
    ("model", model)
])
In [190]:
# Precision:
cross_val_score(pipeline, x_train, (y_train >=4),
                cv=10, scoring="precision").mean()
Out[190]:
0.7173601228151975
In [191]:
# Recall:
cross_val_score(pipeline, x_train, (y_train >=4),
                cv=10, scoring="recall").mean()
Out[191]:
0.6962038847769554

In conclusion, scan requests in which the location of the requested appointment is New York, the type of scan is an Ultrasound, there have been no previous cancelations, and complexity is low (below the 50th percentile of all Complexity scores) are likely to become successful, with a ~54% probability. This is supported by the model's moderate but balanced precision and recall.

RQ2: is there a correlation between complexity and time taken to book?¶

As mentioned above, I am going to implement a K-Nearest-Neighbors Regressor to predict BookingTime based on Complexity:

In [303]:
# Training data for the regression: Complexity -> BookingTime (in days)
x_train = df2[["Complexity"]]
y_train = df2["BookingTimeInts"]

def get_kNN_pred(x_new, n):
    """Given a new observation, return the mean of its n nearest neighbors."""
    dists = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    inds_sorted = dists.sort_values().index[:n]
    return y_train.loc[inds_sorted].mean()

# Grid of Complexity values to predict over
x_new = pd.DataFrame()
x_new["Complexity"] = np.arange(0, 600, 5)

The graph below overlays the k=100 regression line on a scatter plot of Complexity and BookingTime from the DataFrame:

In [304]:
y_pred_100 = x_new.apply(get_kNN_pred, axis=1, args=(100,))

# Predictive model for k=100 nearest neighbors:
y_pred_100.index = x_new["Complexity"]
df2.plot.scatter(x="Complexity", y="BookingTimeInts", color="black", alpha=.2)
y_pred_100.plot.line(color="blue")
Out[304]:
<Axes: xlabel='Complexity', ylabel='BookingTimeInts'>

Complexity and BookingTime are positively correlated, supporting my second hypothesis. Next, I will evaluate the strength of this correlation:

In [316]:
index_array = np.array(y_pred_100.index)
values_array = np.array(y_pred_100.values)
In [319]:
np.corrcoef(index_array, values_array)[0, 1]
Out[319]:
0.9343630142678436
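
Note that this r is computed between the Complexity grid and the model's smoothed predictions, so it describes the fitted curve rather than the raw scatter. For comparison, the raw-data correlation can be computed in one line:

print(df2["Complexity"].corr(df2["BookingTimeInts"]))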

Based on this regression model, the correlation between BookingTime and Complexity along the fitted curve is positive and very strong. To evaluate the performance of this model, I'll compute the cross-validation error at different levels of k:

In [336]:
from sklearn.neighbors import KNeighborsRegressor

x_dict = df2[["Complexity"]].to_dict(orient="records")
y = df2["BookingTimeInts"]

def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    # cross_val_score clones the pipeline, so the vectorizer and scaler are
    # refit on the single Complexity feature within each fold:
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mae = np.mean(-cross_val_score(
        pipeline, x_dict, y,
        cv=10, scoring="neg_mean_absolute_error"
    ))
    return mae

ks = pd.Series(range(1, 101))
ks.index = range(1, 101)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()
Out[336]:
100    20.999402
99     21.022531
98     21.037882
97     21.054166
96     21.069855
         ...    
5      33.259823
4      34.549659
3      35.233311
2      36.569867
1      38.069530
Length: 100, dtype: float64

From this graph, we can see that the (cross-validation) test error decreases as k increases and starts to plateau around k=100, where the error is ~21. I predict it will stay flat as k increases further, but let's double-check with a very large value of k:

In [337]:
k = pd.Series(range(30000, 30001))
k.index = range(30000, 30001)
test_err = k.apply(get_cv_error)
In [338]:
print(test_err)
30000    19.721953
dtype: float64

Even at k=30,000, the test error drops only slightly further (from ~21.0 to ~19.7), confirming that the error decreases as k grows and essentially plateaus beyond k ≈ 100.

Conclusion¶

Findings

  1. The bulk of all scan requests are from New York and are for Ultrasounds.

  2. Cancelations are more prevalent among unsuccessful scan requests than among successful ones.

  3. The following values were able to predict a successful scan request:

    • The appointment request's location is New York
    • The type of scan is an Ultrasound
    • 0 previous cancelations of this order
    • Low (score <= 20) complexity

    However, the precision (0.717) and recall (0.696) of this classification model, along with the probability of the above statement (54%), indicate that future work is needed to build a stronger model for this prediction.

  4. The complexity of a scan request and the time taken to book the appointment are strongly and positively correlated along the fitted regression curve (r=0.934).

Limitations

As displayed in some of the graphs and tables in this report, StageID values 1-3 (respectively: Requested, Qualified, Imaging Center Assigned) have no representation in this data set. Although this report primarily treated StageID as successful vs. unsuccessful, the absence of these intermediate stages means the sample may not be representative of all requests.