by Maciek Wasiak
“I am not a Data Scientist and I just built a predictive model!”
Chances are that if you are reading this post, you have seen at least one sales pitch or conference presentation from one of the mainstream (or less mainstream) Predictive Modelling software vendors. The pitch typically follows the same enticing pattern: load the table, click a few buttons and here is your Model. It’s that simple. It’s meant to lure you into thinking – “if you buy our software, you can get your own models in just a few clicks.” I have witnessed presenters saying – “look, I am not a Data Scientist and I just built a churn model”. One might wonder why companies are complaining about the shortage of Data Science skills when so many sales reps can build Predictive Models in minutes... It is a well-oiled sales machine, polished and perfected over the last 15 years. Eventually one of the examined tools wins the procurement process and licences are bought.
What happens next? More often than not – nothing.
The models that were supposed to rain down upon us just don’t materialize. I have watched two telecoms buy SAS/SPSS software that ended up lying idle for the first nine months. A utilities provider bought a number of SPSS seats and within 12 months had produced only one model. A bank bought a comprehensive SAS suite and also invested in an extremely powerful server; after one model delivered by external consultants, not a single model was built for the following three years. I could go on. These cases are not unusual – they are actually the norm.
So a good question to pose would be: why do companies burn hundreds of thousands of dollars on stuff they don’t use? The answer is: because during the sales pitch they sat through, nobody explained how to obtain that little table used in the demo – let’s call it a Modelling Table.
“Typically it takes months of manual coding to transform the source data into a structured Modelling Table”
You see – Modelling Tables don’t just sit in your database. As I wrote here – building Modelling Tables is NOT a part of most Data Science curricula, so not many analysts truly know how to do it. Sometimes the key stakeholders in the process aren’t even aware that it is needed at all.
The truth is – typically the data in your database is nowhere near the format required by Predictive Modelling algorithms. If you just load your tables from the database to the Predictive Modelling tool, it simply won’t work:
All Predictive Analytics algorithms require a rigidly structured table consisting of aggregates, otherwise called Features. The process of converting the source data into the Modelling Table is called Feature Engineering.
And here it gets a bit techie – the Features describe customers’ behaviours prior to certain events, and those events are usually timed differently for each customer. This is NOT how the data is stored in databases. Typically, it takes weeks or even months of painful manual work to transform the source data into a structured Modelling Table. However, once you have it (and this is where the sales pitch starts), it is rather straightforward to build a predictive model. The pitch is not quite a lie – but it’s a long road from honesty all the same if someone wants to sell you a Predictive Analytics tool.
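To make “Features prior to differently timed events” a little more concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the transaction rows, the per-customer reference dates, the 30-day window and the feature names are all assumptions, not taken from any real project or tool. The point is only that each customer’s aggregates must be cut off at that customer’s own event date.

```python
from datetime import date, timedelta

# Hypothetical source data: one row per transaction, the way it
# might sit in a database table (customer_id, date, amount).
transactions = [
    ("c1", date(2015, 1, 5), 20.0),
    ("c1", date(2015, 2, 20), 35.0),
    ("c1", date(2015, 3, 10), 15.0),  # after c1's reference date: must be ignored
    ("c2", date(2015, 3, 1), 50.0),
    ("c2", date(2015, 3, 25), 10.0),
]

# Hypothetical event dates (e.g. the churn date), different for each customer.
reference_dates = {"c1": date(2015, 3, 1), "c2": date(2015, 4, 1)}


def build_modelling_table(transactions, reference_dates, window_days=30):
    """Return one row of aggregates (Features) per customer.

    Only transactions strictly before the customer's own reference
    date are used, so no information from after the event leaks in.
    """
    rows = []
    for customer_id, ref in sorted(reference_dates.items()):
        history = [
            (d, amount)
            for cid, d, amount in transactions
            if cid == customer_id and d < ref
        ]
        window_start = ref - timedelta(days=window_days)
        recent = [amount for d, amount in history if d >= window_start]
        last_txn = max((d for d, _ in history), default=None)
        rows.append({
            "customer_id": customer_id,
            "txn_count_total": len(history),
            "txn_count_recent": len(recent),
            "spend_recent": sum(recent),
            "days_since_last_txn": (ref - last_txn).days if last_txn else None,
        })
    return rows


modelling_table = build_modelling_table(transactions, reference_dates)
for row in modelling_table:
    print(row)
```

Even this toy version has to handle a different cut-off per customer and a rolling window. A real project would likely repeat the same logic across many source tables and many more Features, which is where the weeks and months go.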
Here is what the process looks like on a real-life Predictive Analytics project:
So back to the sales pitch – a more honest one would look something like this:
“We will start with loading the Modelling Table.
There is something you need to know – to get that table we had a team of Data Scientists plus ETL developers, and it took us 4 months to get it right. If I talked more about this, it would ruin the pitch completely, so let’s just watch the last 5 minutes of this multi-month project.
Here it is, I’ll do a few clicks and we get the Model, ta da!”
Skipping this middle part is what makes such a pitch deceptive, misleading and harmful to clients.
Dirty Little Secret
“They know about it but you are not supposed to”
The enormity of the effort burnt in the pre-modelling stage is what Tom Davenport called the “dirty little secret” of the industry.
One of the main reasons why it is a “secret” is that sales pitches consistently avoid the topic. They know about it but you are not supposed to, as it would effectively kill or, at best, postpone the sale.
I am not blaming only one side of the table – there are other reasons too, e.g. Data Scientists being truly embarrassed by how much time they spend on manual data prep. If you ask us what we do – it’s “Machine Learning, Predictive Modelling and other AI stuff”, while in fact for the last 6 months we have been trying to join the damn tables and create aggregates in SQL.
So much for the sexiest job of the 21st century.
Getting ready for the next sales pitch
The world of data analytics is changing – and not necessarily for the better from a Data Scientist’s point of view.
There are dozens of modelling tools available, including free ones, and these, coupled with current computing power, enable us to build Predictive Models really fast. Running Machine Learning algorithms has never been easier.
However, with the explosion of new data sources, it is more and more challenging to keep pace when aggregating the source data into a format digestible by Predictive Modelling algorithms. Getting the data ready for Modelling, it would seem, has never been harder.
And that, not the Modelling, is the real challenge for Data Scientists.
So next time you are watching a Predictive Modelling sales pitch, perhaps you could consider grilling the M.C. a little on how exactly they obtained their Modelling Table and how you can get one within your organisation.
Because once you have it... you really can be just a few clicks away from the model.
COMMENT by George Mathew, President & COO @ Alteryx. Rebooted entrepreneur @ the intersection of Analytics, Big Data, & Cloud
Hi Maciek, You should have done a bit more homework before lumping Alteryx in w/ the rest of the folks in the market. First and foremost, our strength is Self-Service Data Prep/Blending. We solve for the creation of the perfect ‘Analytic Dataset’ (what you called the modelling table), before tackling the downstream model activities. BTW, we are doing that for ~2000 customers, tens of thousands of users with an NPS score of ~50. Anyway, good luck w/ your business, but please get your facts right before writing misleading posts about ours.
RESPONSE by Maciek Wasiak, Automated Predictive Analytics
Firstly – I think that all analytical toolsets are great, simply because they allow us to do cool stuff, and by competing with each other they make the job progressively easier. And Alteryx is doing a good thing by making data prep easier. But can you point out what facts I am getting wrong? While ‘doing the homework’ I met your staff for a demo of ‘data blending’ 3 days ago, just to make sure that I was not missing any tech you might have developed recently. Your man actually volunteered the opinion that ‘data blending’ in Alteryx is a point-and-click way of doing stuff the users do in Excel. Which, oddly for a sales meeting, seems very true. When I described the Feature Engineering task, he admitted this was beyond his knowledge (fair enough) and promised to get back to me with his engineer by the end of the day. I haven’t heard anything from him since.
Regarding the Predictive Analytics sales pitches – how Alteryx does them is there for everyone to see. Please type “Alteryx Predictive Analytics” into YouTube, and here are the links:
the link number 1: https://www.youtube.com/watch?v=wgVzN_G7vm4
the link number 2: https://www.youtube.com/watch?v=UvoaWwPlHew
the link number 3: https://www.youtube.com/watch?v=CP6Q1i6ZYvE&list=PLfSLx4WE4q52uMPPHw4i25C44U8gYcUv-
all starting with the flat Modelling Table (Analytical Dataset – whatever) without a word about how that table was obtained… those are ‘facts’ created by your company, not me. – Respectfully, Maciek