Vivek Farias, Associate Professor of Operations Management, MIT Sloan School of Management
We fall back on our experience to make decisions – but how much of the experiential data is codified? It is messy experiential data that is difficult to organise.
If you want to build a sustainable business that leverages data, how do you dynamically utilise incoming data to guide decision making? The question we need to ask about models and data-driven decisions is: what happens in a world with poorly organised data, where you don’t have time to go back into the process of model building and decision making?
Vivek will discuss two disparate problem areas – one in the consumer internet and one in liquid biopsy.
Problems motivated by the consumer internet
At Amazon, Walmart, Spotify, they are looking to build increasingly granular models of customer preference – eg tracking individual customer actions.
There are two ingredients for this: 1) there is an arms race for increasingly granular data and 2) creating effective algorithms to leverage the data. The difficulty is folding all these different sources of data (eg Pinterest, in-store data) into something useful. How do you build software infrastructure that can immediately utilise a new source of data, without revisiting the model, the decision making etc?
Vivek described how this problem is around 70 years old and can be summed up in the term “tensor recovery”. By solving this problem, Vivek has opened the door to a useful solution for scalability.
Vivek showed a matrix which was a holy grail for Amazon, Walmart etc – where customers are on one axis, products are on another axis – and the result is the probability that someone will buy a product. You have a second similar matrix on observed transactions. If you have data on observed transactions, you can create purchase probabilities – but the problem is that the matrix on observed transactions has very sparse data.
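To make the sparsity point concrete, here is a minimal sketch of how an observed-transactions matrix might be built from a purchase log. The customer names, products and the `transactions` list are invented for illustration and are not from the talk; the point is only that most customer/product cells are never observed.

```python
from collections import defaultdict

# Hypothetical transaction log: one (customer, product) pair per purchase.
transactions = [
    ("alice", "book"), ("alice", "lamp"),
    ("bob", "book"),
    ("carol", "kettle"),
]

customers = sorted({c for c, _ in transactions})
products = sorted({p for _, p in transactions})

# Store only the observed cells; every other cell is "unknown", not "zero".
observed = defaultdict(int)
for customer, product in transactions:
    observed[(customer, product)] += 1

# Fraction of the customer x product matrix that has any data at all.
total_cells = len(customers) * len(products)
density = len(observed) / total_cells
print(f"{len(observed)} of {total_cells} cells observed (density {density:.2f})")
```

At real scale the density is far lower than in this toy log, which is why the purchase probabilities cannot simply be read off the observed matrix.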
The amount of data that you have fundamentally limits the complexity of the model you can estimate. So if the data you have is sparse, you have to settle for a much simpler model.
So if you are Spotify, with 10^7 users and 10^10 transactions, you have 1000 transactions per user and are “rank bound” to 1000 (a model with 1000 factors). This is a complex model. Spotify has actually published its models as open source, and the rank is indeed between 40 and 1000.
At Amazon there are 10^8 active users with 10^8 transactions, so there is 1 transaction per user and they are rank bound to 1: a 1 factor model. This isn’t good enough for Amazon to do advanced predictions, because the model is too simple to be predictive. They want granular prediction. Vivek asks: can we beat this with “side information”? Eg Amazon has data on 1) email usage and 2) activities on the website, and this is side information. Vivek assumes that if you want to build a predictive model, you will use side information that gives you signal on the behaviour of your customer.
These sources of side information are likely to be dynamic – eg what happens if Twitter “dies” and a new source of side information appears? You want to create a model that can easily adapt to these changes.
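The rank-bound arithmetic for the two companies can be checked in a few lines. This is only the rule of thumb as stated in the talk (transactions per user), and the function name `rank_bound` is made up for this sketch.

```python
def rank_bound(num_users: int, num_transactions: int) -> int:
    """Rule-of-thumb number of factors: observed transactions per user, at least 1."""
    return max(1, num_transactions // num_users)

# Figures quoted in the talk.
spotify = rank_bound(num_users=10**7, num_transactions=10**10)
amazon = rank_bound(num_users=10**8, num_transactions=10**8)

print(spotify)  # 1000
print(amazon)   # 1
```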
To solve this problem of tensor recovery, Vivek’s team created a new algorithm called “Slice Learning”, and it scales to TB sized datasets.
Xiami.com music streaming software – a warm-up application
Vivek’s team looked at whether they could predict which songs a user would “love” (a website action similar to a “like”), where the data was very sparse at around 1 “love” / month / user.
Recovering the “love” slice: measured by area under the curve (AUC), the naive ML approach scored 79%, the prior state-of-the-art approach scored 86%, and the slice learning approach scored 94%.
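For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive example (a song the user did “love”) is scored above a randomly chosen negative one. A minimal sketch of that definition follows; the labels and scores are invented and have nothing to do with the Xiami data.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical data: 1 = user "loved" the song, 0 = did not.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.2, 0.1]
result = auc(labels, scores)
print(round(result, 3))  # 0.833
```

A score of 50% corresponds to random ranking and 100% to a perfect one, so the figures quoted above are ranking quality rather than classification accuracy.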
Problems of building a liquid biopsy
Liquid biopsies take a blood sample and look for circulating cancer cells / proteins / markers, prior to the standard markers being present.
The problem here is that you are trying to build a model with tiny amounts of data – in the hundreds and thousands of “blood draws”.
The approach is to use proteomic liquid biopsies, looking for certain biomarkers – eg PSA (prostate), CA-125 (ovarian), CEA (colorectal), CA 19-9 (pancreatic). There is too much variation in any single protein across patients, making it hard to build a predictive test.
So Vivek’s approach is this: rather than look at a single protein, they take matrix-valued observations, with multiplex measurements across thousands of proteins and across many diverse reagents. Tensor recovery with slice learning becomes useful because the protein matrix is sparse, with 10% of proteins found in only a single patient.
In their experiment, they used 3 reagents with 900 proteins detected in two patient groups – 1) a training group of 45 patients with a variety of different cancers and 2) a test group of 15 patients. The results showed incredible specificity and sensitivity in both the training and test groups. This is a regime of very small patient numbers where you cannot do deep learning – but slice learning is perfect.
- You need to think about how you can organise your data into a tensor
- (customer x product) x interactions
- (reagent x protein) x patients
- (disease x patient) x time
- Then fill in the missing data, at scale.
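The two steps above (arrange the data as a tensor, then fill in what is missing) can be sketched on a toy example. This is NOT the Slice Learning algorithm from the talk, whose details are not given here; it swaps in a generic stand-in, iterative truncated-SVD imputation on an unfolding of the tensor. The tensor sizes, the factor vectors and the function name `low_rank_fill` are all assumptions made for illustration.

```python
import numpy as np

def low_rank_fill(tensor, mask, rank=1, iters=200):
    """Fill unobserved cells by repeated truncated SVD of the customer-mode unfolding.

    A generic stand-in for tensor recovery, not the Slice Learning algorithm.
    """
    n_customers = tensor.shape[0]
    observed = tensor.reshape(n_customers, -1)   # customers x (products * interactions)
    seen = mask.reshape(n_customers, -1)
    estimate = np.where(seen, observed, 0.0)     # start missing cells at zero
    for _ in range(iters):
        u, s, vt = np.linalg.svd(estimate, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Keep what was observed; overwrite only the missing cells.
        estimate = np.where(seen, observed, low_rank)
    return estimate.reshape(tensor.shape)

# Toy (customer x product) x interactions tensor with exact rank-1 structure.
customer_taste = np.array([1.0, 2.0, 3.0, 4.0])
product_appeal = np.array([1.0, 0.5, 2.0])
interaction_rate = np.array([1.0, 0.5])          # slice 0 = purchases, slice 1 = views
truth = np.einsum("i,j,k->ijk", customer_taste, product_appeal, interaction_rate)

# Hide a few cells of the purchase slice; the views slice acts as side information.
mask = np.ones(truth.shape, dtype=bool)
for cell in [(0, 0, 0), (1, 2, 0), (3, 1, 0)]:
    mask[cell] = False

filled = low_rank_fill(truth * mask, mask, rank=1)
print("max error on hidden cells:", np.abs(filled - truth)[~mask].max())
```

In this deliberately easy case the hidden purchase cells are recovered because the fully observed views slice shares the same customer and product structure. Real data is noisy and far sparser, which is the regime the talk’s algorithm is aimed at.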