
Dirty Secrets

July 24, 2016

by Maciek Wasiak

“I am not a Data Scientist and I just built a predictive model!”


Chances are that if you are reading this post, you have seen at least one sales pitch or conference presentation from one of the mainstream (or less mainstream) Predictive Modelling software vendors. The pitch typically follows the same enticing pattern: load the table, click a few buttons and here is your Model. It’s that simple. It’s meant to lure you into thinking, “if you buy our software, you can get your own models in just a few clicks.” I have witnessed presenters saying, “look, I am not a Data Scientist and I just built a churn model.” One might wonder why companies complain about the shortage of Data Science skills when so many sales reps can build Predictive Models in minutes… It is a well-oiled sales machine, polished and perfected over the last 15 years. Eventually one of the examined tools wins the procurement process and licences are bought.

What happens next? More often than not – nothing.

The Reality

The models that were supposed to rain down upon us just don’t materialize. I have watched two telecoms buy SAS/SPSS software that then lay idle for the first nine months. A utilities provider bought a number of SPSS seats and within 12 months had produced only one model. A bank bought a comprehensive SAS suite and invested in an extremely powerful server; after one model delivered by external consultants, not a single model was built for the following three years. I could go on. These cases are not unusual – they are the norm.

So a good question to pose would be: why do companies burn hundreds of thousands of dollars on software they don’t use? The answer: because during the sales pitch they attended, someone failed to explain how to obtain that little table – let’s call it a Modelling Table – which was used in the demo.

“Typically it takes months of manual coding to transform the source data into a structured Modelling Table”

You see – Modelling Tables don’t just sit in your database. As I wrote here, building Modelling Tables is NOT part of most Data Science curriculums, so not that many analysts truly know how to do it. Sometimes the key stakeholders in the process aren’t even aware that it is needed at all.

The truth is that, typically, the data in your database is nowhere near the format required by Predictive Modelling algorithms. If you just load your tables from the database into a Predictive Modelling tool, it simply won’t work.


All Predictive Analytics algorithms require a rigidly structured table consisting of aggregates – otherwise called Features – and the process of converting the source data into that Modelling Table is called Feature Engineering.

And here it gets a bit techie: the Features describe customers’ behaviours prior to certain events, usually timed differently for each customer. This is NOT how the data is stored in databases. Typically, it takes weeks or even months of painful manual work to transform the source data into a structured Modelling Table. However, once you have it – and this is where the sales pitch starts – it is rather straightforward to build a predictive model. The pitch is not quite a lie, but it’s a long road from honesty all the same when someone wants to sell you a Predictive Analytics tool.
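To give a flavour of what that transformation involves, here is a deliberately tiny sketch in Python/pandas – the tables, columns and 90-day window are made up for illustration, and a real project juggles dozens of such sources – turning raw call records into per-customer aggregates aligned to each customer’s own reference date:

```python
import pandas as pd

# Hypothetical source data: one row per call, one row per customer with an
# individually timed reference date (e.g. the churn/observation date).
calls = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "call_date": pd.to_datetime(
        ["2016-05-01", "2016-06-20", "2016-04-15", "2016-06-01", "2016-07-01"]),
    "duration_min": [12.0, 3.5, 7.2, 45.0, 1.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "reference_date": pd.to_datetime(["2016-07-01", "2016-06-15"]),
})

# Keep only behaviour observed in the 90 days BEFORE each customer's own
# reference date -- this per-customer time alignment is exactly what the raw
# source tables do not give you.
joined = calls.merge(customers, on="customer_id")
window = joined[
    (joined["call_date"] < joined["reference_date"])
    & (joined["call_date"] >= joined["reference_date"] - pd.Timedelta(days=90))
]

# Aggregate the windowed records into one row per customer: these aggregates
# are the Features that make up the Modelling Table.
features = window.groupby("customer_id").agg(
    calls_90d=("call_date", "count"),
    total_minutes_90d=("duration_min", "sum"),
    avg_call_minutes_90d=("duration_min", "mean"),
).reset_index()

modelling_table = customers.merge(features, on="customer_id", how="left")
feature_cols = ["calls_90d", "total_minutes_90d", "avg_call_minutes_90d"]
modelling_table[feature_cols] = modelling_table[feature_cols].fillna(0)
print(modelling_table)
```

Multiply this by every source system, every candidate feature and every data-quality quirk, and the weeks-to-months estimate above stops looking like an exaggeration.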

Here is what the process looks like on a real-life Predictive Analytics project:

[Figure: the real-life Predictive Analytics project process]

So back to the sales pitch – a more honest one would look something like this:

“We will start with loading the Modelling Table.
There is something you need to know – to get that table we had a team of Data Scientists plus ETL developers, and it took us 4 months to get it right. If I talked more about this, it would ruin the pitch completely, so let’s just watch the last 5 minutes of this multi-month project.
Here it is, I’ll do a few clicks and we get the Model, ta da!”

Skipping this middle part is what makes such a pitch deceptive, misleading and harmful to clients.

Dirty Little Secret

“They know about it
but you are not supposed to”
The enormity of the effort burnt in the pre-modelling stage is what Tom Davenport called the “dirty little secret” of the industry.

One of the main reasons why it is a “secret” is the sales pitches that consistently avoid the topic. They know about it, but you are not supposed to, as it would effectively kill or, at best, postpone the sale.

I am not blaming only one side of the table – there are other reasons too, e.g. Data Scientists being genuinely embarrassed by how much time they spend on manual data prep. If you ask us what we do, it’s “Machine Learning, Predictive Modelling and other AI stuff”, while in fact for the last six months we have been trying to join the damn tables and create aggregates in SQL.

So much for the sexiest job of the 21st century.

Getting ready for the next sales pitch

The world of data analytics is changing – and not necessarily for the better from a Data Scientist’s point of view.

There are tens of modelling tools available, including free ones, and these, coupled with current computing power, enable us to build Predictive Models really fast. Running Machine Learning algorithms has never been easier.

However, with the explosion of new data sources, it is more and more challenging to keep pace when aggregating the source data into a format digestible by Predictive Modelling algorithms. Getting the data ready for Modelling, it would seem, has never been harder.
And that – not the Modelling – is the real challenge for Data Scientists.

So next time you are watching a Predictive Modelling sales pitch, perhaps you could consider grilling the M.C. a little on how exactly they obtained their Modelling Table and how you can get it within your organisation.
Because once you have it… you can really be just a few clicks away from the model.

Cheers

Maciek

COMMENT by George Mathew, President & COO @ Alteryx. Rebooted entrepreneur @ the intersection of Analytics, Big Data, & Cloud

Hi Maciek, You should have done a bit more homework before lumping Alteryx in w/ the rest of the folks in the market. First and foremost, our strength is Self-Service Data Prep/Blending. We solve for the creation of the perfect ‘Analytic Dataset’ (what you called the modelling table) before tackling the downstream model activities. BTW, we are doing that for ~2000 customers, tens of thousands of users, with an NPS score of ~50. Anyway, good luck w/ your business, but please get your facts right before writing misleading posts about ours.

RESPONSE by Maciek Wasiak, Automated Predictive Analytics

Firstly – I think that all analytical toolsets are great, simply because they allow us to do cool stuff and, by competing with each other, they make the job progressively easier. And Alteryx is doing a good thing by making data prep easier. But can you point out what facts I am getting wrong? While ‘doing the homework’ I met your staff for a demo of ‘data blending’ three days ago, just to make sure I wasn’t missing any tech you might have developed recently. Your man actually volunteered the opinion that ‘data blending’ in Alteryx is a point-and-click way of doing the stuff users do in Excel. Which, oddly for a sales meeting, seems very true. When I described the Feature Engineering task, he admitted it was beyond his knowledge (fair enough) and promised to get back to me with his engineer by the end of the day. I haven’t heard anything from him since.

Regarding the Predictive Analytics sales pitches – how Alteryx is doing it is there for everyone to see. Just type “Alteryx Predictive Analytics” into YouTube; here are:

the link number 1: https://www.youtube.com/watch?v=wgVzN_G7vm4

the link number 2: https://www.youtube.com/watch?v=UvoaWwPlHew

the link number 3: https://www.youtube.com/watch?v=CP6Q1i6ZYvE&list=PLfSLx4WE4q52uMPPHw4i25C44U8gYcUv-

all starting with a flat Modelling Table (Analytical Dataset – whatever) without a word on how that table was obtained… those are ‘facts’ created by your company, not by me. – Respectfully, Maciek

ALL COMMENTS

David Bloch

Innovation, Analytics and Commercial Strategist

As one of those ~2000 customers, I can safely say that if people view Alteryx solely as a “hey, this does predictive models” tool, they don’t understand what the tool set out to achieve. 60% of the job in Analytics/Data Science/whatever you want to call it is collecting data, cleansing it, structuring it and getting it ready for models to be run over the top of it. 20% of the job is understanding the business well enough to build your hypotheses and business challenge statements in a way that means the data you do collect is worth modelling. 15% is how you sell the results of your models to people who can then act on the insight. Only about 5% comes down to the models/algorithms you use. Just because you “code” in R or Python when doing your random forest or Poisson variance models doesn’t mean you’re not just calling a pre-existing function.
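To make that last point concrete: once a clean Modelling Table exists, the “modelling” step that gets demoed really is little more than a pre-existing function call. A minimal sketch with scikit-learn – the file name and the column names (customer_id, churned) are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumes the hard 95% is already done: a flat modelling table with one row
# per customer, engineered feature columns and a binary "churned" target.
df = pd.read_csv("modelling_table.csv")
X = df.drop(columns=["customer_id", "churned"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# The "predictive modelling" part that gets demoed: a pre-existing function.
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```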

1mo

David Pope

Various roles/leadership positions at SAS

Maciek – I work for SAS (although this response is my own) and noticed that other analytics vendors have already responded much the same way I had planned to. That is to say, I agree with your content, but I don’t understand how you can lump SAS into the same bucket. I can point to years of whitepapers, documentation and user/customer-written papers (or, in data-scientist terms, “data”) published by SAS on this exact topic: that analytic data prep is the most challenging part. This is why SAS always tries to broaden “bake-offs” to include the entire analytics lifecycle, because we tend to win if it is done fairly. Here are just one or two historical references. A SAS user/customer wrote the first edition of the book “Data Preparation for Analytics Using SAS” back in 2006. The SAS language, for that matter, has always been made up of a “data step” and then “procedures” (in SAS-speak, PROCs), and it has a 40-year track record of being a leader in this space. SAS has been talking about the ABT (analytic base table) since, I believe, before the term was even widely used/accepted, as well as identifying the need for an IT-focused “analytical data steward” to assist the “data scientist”. Finally, our most recent messaging on this exact topic discusses the importance of DATA, DISCOVERY and DEPLOYMENT – just google “Analytics in Action” and you will see SAS content on it.

1mo

Maciek Wasiak

Automated Predictive Analytics

Hi David, thanks for the kind words, but I cannot agree with what you are saying. There are ample demonstrations of SAS Enterprise Miner (YouTube, SlideShare, etc.), but after reviewing ca. 20 of them, not one gives as much as a whiff of where the Modelling Table came from. That includes the most-watched piece by SAS [https://youtu.be/Nj4L5RFvkMg], 60k views, which starts with loading (ta-da!) a ready-to-go Modelling Table. While I am at it – that demo, despite what it tries to sell, shows not a Predictive Modelling problem but a classification problem without the time element – which, for us, is a critical distinction when building the Modelling Table, causing headaches and contributing to the ETL effort. Yet another nuance meant to alter the perception of reality. It may be that from time to time someone from SAS writes something about data prep for Predictive Analytics. But the decision makers (rightly or wrongly) are not the audience for this and, if anything, it is totally lost in the sea of “this is so quick and easy” communication. Regards
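For readers wondering what that “time element” means in practice, here is a small, hypothetical illustration (made-up dates and column names): in a genuinely predictive set-up, the target is defined from an outcome window after each customer’s individually timed reference date, while the features may only use data from before it – which is part of what makes building the Modelling Table harder than a static classification exercise.

```python
import pandas as pd

# Hypothetical per-customer reference dates and churn events.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "reference_date": pd.to_datetime(["2016-07-01", "2016-06-15", "2016-05-20"]),
    "churn_date": pd.to_datetime(["2016-08-10", pd.NaT, "2016-05-25"]),
})

# Target = did the customer churn within the 60-day OUTCOME window starting at
# their reference date. Features (built elsewhere) may only use data from
# BEFORE that date, otherwise the future leaks into the inputs.
outcome_window = pd.Timedelta(days=60)
customers["churned_60d"] = (
    (customers["churn_date"] >= customers["reference_date"])
    & (customers["churn_date"] < customers["reference_date"] + outcome_window)
).astype(int)

print(customers[["customer_id", "churned_60d"]])
```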

1mo

Dr. Olav Laudy

Chief Data Scientist, IBM Analytics, Asia-Pacific

Rather than a non-informative sales pitch and bashing other companies, share your knowledge on HOW to do it: https://www.linkedin.com/pulse/data-science-logic-dr-olav-laudy

1mo

Julien Nel

Data Scientist at Idiro Analytics

Or when the salesman is selling SPSS and promises amazing models using Modeler… but then, if the client wants to productionise those models… did they mention the additional licence for C&DS??

1mo

Mark Rabkin

Director of Business Development at Zementis

The best way to handle moving models to production is to export them to the open industry-standard Predictive Model Markup Language (PMML) and use Zementis to execute and scale the models in any information technology environment.
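For context, a minimal sketch of that export route, assuming the open-source sklearn2pmml package (which wraps the JPMML converter and therefore needs a Java runtime installed); the file and column names are illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Train on a prepared modelling table (illustrative file/column names), then
# serialise the fitted pipeline to PMML for a scoring engine to execute.
df = pd.read_csv("modelling_table.csv")
X, y = df.drop(columns=["customer_id", "churned"]), df["churned"]

pipeline = PMMLPipeline([("classifier", RandomForestClassifier(n_estimators=200))])
pipeline.fit(X, y)

# Writes a PMML document; the conversion calls out to the Java-based JPMML
# converter, so a Java runtime must be available on the machine.
sklearn2pmml(pipeline, "churn_model.pmml")
```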

1mo

Jarlath Quinn

Pre-Sales Director at Smart Vision Europe Ltd

It is utter nonsense to suggest that companies routinely spend hundreds of thousands of dollars on predictive analytics software because of a cute demo by a sales person. Nor is it fair or true to suggest that companies such as SAS or IBM (who own SPSS) are particularly likely to try to pull the wool over prospects’ eyes when demonstrating their respective technologies. In over 20 years of working in a customer-facing technical role for SPSS Inc, IBM and SAS, I have never, ever known a customer to invest in technology like this without a full due-diligence process. It is FAR more common that, prior to investment, the customer insists on a proof-of-concept project (often for free!). Moreover, the CRISP-DM and SEMMA methodologies are pretty much always placed front and centre in any presentation. Why do customers sometimes not fully exploit ANY technology after they buy it? There’s no dirty little secret, and you don’t need to be a data scientist or a pundit to answer that question. The answer is the usual reasons – people leave posts, other projects take priority, internal politics resist change, and sometimes it’s simply because no one gets fired if it doesn’t happen. In fact, that would make a more interesting and useful article.

1mo

Maciek Wasiak

Automated Predictive Analytics

Jarlath, you are one of those who actually sell SAS and SPSS solutions. So you must be speaking true.

1mo

Joe Cunningham

Data Science & Analytics Executive

Data Preparation is like Voldemort – suppliers do not speak of it by name. Dare you remove the invisibility cloak from these proceedings? Well done Mr. Potter.

1mo

David Rimmer

Manager – Customer Data & Insights at Air New Zealand

This is a bit of scaremongering. I built an automated modelling dataset from the many star schemas in our data warehouse, denormalised and fully cleansed, including about a hundred imputed values. It took about 2 days to write and about the same to test and validate. The problem is that businesses involve the wrong people in the sales process. If more technical staff were involved in RFPs, it would keep software vendors on their toes.
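A hedged sketch of the kind of automation described above, with hypothetical table and column names: dimension tables from a star schema are joined onto a customer-grain fact table, and missing values are imputed with simple defaults.

```python
import pandas as pd

# Hypothetical star-schema extracts already pulled from the warehouse.
fact_customer = pd.read_csv("fact_customer.csv")   # one row per customer
dim_plan = pd.read_csv("dim_plan.csv")             # plan_id -> plan attributes
dim_region = pd.read_csv("dim_region.csv")         # region_id -> region attributes

# Denormalise: join the dimension tables onto the customer-grain fact table.
dataset = (
    fact_customer
    .merge(dim_plan, on="plan_id", how="left")
    .merge(dim_region, on="region_id", how="left")
)

# Simple imputation: column medians for numerics, a sentinel for categoricals.
numeric_cols = dataset.select_dtypes(include="number").columns
dataset[numeric_cols] = dataset[numeric_cols].fillna(dataset[numeric_cols].median())
object_cols = dataset.select_dtypes(include="object").columns
dataset[object_cols] = dataset[object_cols].fillna("unknown")

dataset.to_csv("modelling_dataset.csv", index=False)
```

How far such a script gets you depends heavily on whether the source data already sits at the right grain with the time alignment discussed in the post.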

1mo

Ciarán Cody-Kenny

Data Scientist at Schibsted Media Group

Completely agree. This article is experience talking, thanks for sharing.

1mo

Maciek Wasiak

Automated Predictive Analytics

Heh Ciaran, I feel so old 😉

1mo

Roger Fried

Senior Data Scientist

I’ll take no position on the blog itself or its contents, but this feels a bit like the Greek myth where a golden apple was tossed into a room with a message designed to ensure a storm of conflict. The comments are more fun than the blog. 😉

1mo

Anthony O’Farrell

Data Mining Specialist at CSRA

An entertaining read again sir! The more I hear someone talk about how easy data analytics is, the more I’m sure that person is either (a) not a data practitioner or (b) deceitful.

1mo

Sarat Patro

GM/Senior Managing Consultant, IBM Cognitive & Digital Analytics

Well, the same thing happens with a lot of other IT investments – ERP, EDW, Digital, and the list goes on and on. Analytics is a very niche area and doesn’t take in a lot of investment if one really cares about the numbers. Why blame SAS or SPSS? If you don’t have folks who can do feature engineering, then why blame the tool? A lot of EDW investments ride on the Analytics promise as the pot of gold at the end of the journey, so analytics and the tools – be it SAS, SPSS, you name it – end up being the fall guy for it. IMHO.

1mo

Cesar Patino

Analytics and Information Management Executive & Advisor, helping companies to improve business…

Maciek Wasiak, I totally agree that Data Preparation is the hard work of creating predictive models. I used to say that good Data Prep is at least 50% of project success, and it is made very clear in the CRISP-DM and SEMMA methodologies (used by SPSS and SAS respectively). I have been working with SAS and SPSS since 2003, in different roles of the business (consulting, presales, sales, post-sales, alliances), and I never told customers and partners that Predictive Modelling starts after Data Preparation. Actually, besides warning about the importance of Data Preparation (including proper sourcing and Data Quality), I advised them on the following issues: 1) understand and define the business problem properly; 2) hire a statistician (today aka Data Scientist) to use the software; 3) integrate the predictive model into the business process (otherwise you’ll have a nice model without ROI). I really believe a good long-term relationship is based on trust and confidence. I understand you have created and are now trying to sell your own software, but not all sales guys are liars.

1mo

Maciek Wasiak

Automated Predictive Analytics

Cesar I do not accuse all sales guys of lying, not at all, where did you get this idea 🙂 (This sounds unintentionally sarcastic while being honest (this sounds even more sarcastic, it’s like a loop, I can’t win)); break;

1mo

Kevin Gray

Marketing Science and Analytics

I’ve had horrible experiences with untutored people using abuser-friendly software such as SPSS. One consequence is that their errors – some quite serious – can become part of the research protocols of the organization or industry. Here are some examples in marketing research: https://www.linkedin.com/pulse/article/analytics-easy-kevin-gray

1mo

Rachel Fairhurst MCICM

Senior Portfolio Risk Manager at Future Williams & Glyn Team – RBS

I think this is also why modelling teams should never sit in a bubble – understanding whether the features you have created will work in a real-life operational environment (and aren’t against regulations, outcome variables, all sorts of pitfalls) is also a really important set of experiences that needs to be developed. Nice article Maciek

1mo

Maciek Wasiak

Automated Predictive Analytics

Thanks Rachel! The whole regulatory constraints are a big pain for analysts when modelling risk, ain’t they? There is way more freedom when working in the marketing part of the house.

1mo

Pradip Mohan

Business Intelligence SME at Theta (NZ)

Thanks for your article. Tools alone cannot be the solution. It is knowledge of the data, as well as of the business problem, that is most important. Most tools have similar features and can be applied once the initial data preparation is complete.

1mo

Maciek Wasiak

Automated Predictive Analytics

Hi All, I expressed an opinion based on my experience and a lot of talks with my colleagues, but I would be very interested in hearing from the wider audience whether this issue bothers anyone. (Thanks Julien Renault!)

1mo

Cliff Ashford

Software Development Director

lol – couldn’t agree more, but of course it’s not just machine learning – this happens across the board when non-specialist staff are selecting the tools to buy

1mo

Ajay Raina

Director & Chief Architect at Wipro – Healthcare and Lifesciences portfolio

If tool companies could tone down their message to make it easier to learn the data, connect the dots, and draw inferences and correlations before the modelling, I think that would address the technology problem. The other part is that this process needs to be better governed, with the right data SMEs driving more focused efforts and test-driving the data for more nuggets of insight. Those nuggets of insight need to be shared with the business and accordingly productionised via the right tools. Tools are just enablers here; we have to look at it from a people, process and technology standpoint.

1mo

Niall Walsh

Automating Predictive Analytics

What Maciek has held off saying is that Xpanse have developed a solution that automatically transforms the source data into a structured Modelling Table… and yes, this is going to fundamentally change the way predictive analytics projects get delivered!

1mo

Julien Rooke

Business Intelligence & Data Warehousing Professional | Qlik Certified | SAP Certified | Data…

I have seen ‘shelfware’ for years – it is not just in predictive modelling that things are simplified to make them look attractive and then, once bought, prove unusable. I also think a number of BI companies have taken a similar approach – showing beautiful charts and analysis without explaining how much hard work goes into getting the data ready.

1mo

DAIHO ALBERT

DIRECTOR, PRODUCT STRATEGY AND MANAGEMENT

I think it is a combination of the sales pitch, institutions’ insistence on references and background checks while choosing a system, and, yes, those applications which have been trying to please all clients by building in ad-hoc requirements without actually looking to the future. In some regions, reference and background checking does more harm than good. The sales person uses the reference, whereas those who actually know the issues do not admit them, for obvious reasons.

1mo

Robbie Cook

Information Management Consultant

Great post that brings attention where it is needed. I would suggest that the difficulty of creating a dataset suitable for modelling is a bit overstated here, but it is non-trivial and critical and should not be overlooked by any organisation. Oddly, success in this space often depends more on basic data management than on technology, yet again we see IT vendors proclaiming that their tools can magically transform poorly managed data into meaningful and consistent data assets at the mere flick of a switch. Improving the value of your data is not rocket science; it just takes a commitment to the basics of data management, and once this is achieved the value of your information assets can be unlocked via any one of the contemporary toolkits on offer.

1mo

Milad Falahi

Senior Data Scientist, Any platform you want!!

I feel your pain. The truth is, “All models are wrong but some are useful” – George Box… Yesterday I had to take apart all the nuts and bolts of Theano to adapt it to my problem; I ended up with red eyes and a model that was finally working. The reality is that it doesn’t matter what model or platform you use – you will need to make the model yours!

1mo

Shrish Kaoley

Independent Consultant at Self Employed

Hello Maciek, I truly appreciate your attempt to highlight a very important and critical step before proceeding to modelling (one that gets ‘muted’ during the sales process)! I have been in DWHing and have worked extensively on ETL, so I know and understand the tedium involved in ‘preparing and massaging’ the data. I have also worked on analytics software and modelling, and hence understand the criticality of well-prepared data! The point that comes to my mind is having a ‘relevant’ pitch for the ‘relevant’ audience. Whilst analytics capabilities are what the business user needs, the preparation is mostly a task undertaken by the IT function. Not having one (a relevant pitch) only aggravates the problem, damaging the vendor’s perception, followed by the waste of time and resources for the users! Thanks for placing this upfront…

1mo

Sarat Patro

GM/Senior Managing Consultant, IBM Cognitive & Digital Analytics

Guys, this is an open forum for tech discussion – at least I thought so – so please stop plugging your products in these discussions… two cents 🙁

1mo

Tim Woodruff

Business Analyst

Great few points about the influence of Sales departments on so-called predictive analytics. So in summary, analytic data prep or “feature engineering” is a huge part of what data scientists do, yet is not part of a typical DS curriculum. Dirty little secret, indeed #hugedata #scienceforall #experts #fastbuck

12h

Ernests Stals

CEO and co-founder at Dripit.io

“In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.” – @BigDataBorat p.s. Good read!

12h

Berni Simmons

Sales & Marketing Director – Predictive Analytics, Mobile Technology, Marketing Services & Sales…

I’m in fierce agreement with Jarlath Quinn – there’s no sleight of hand, and to assume otherwise underestimates the integrity and intellect of both buyer and seller. I’ve spent 20 years ‘selling’ advanced analytics – it was ‘statistics’ back then. Organisations that make advanced analytics and modelling work are those that prove the value first and then incrementally invest based on proven return. Making advanced analytics work requires significant effort and focus (there is no shortage of technology options). For clarity, Jarlath Quinn and I share a good deal of professional history.

1mo

Phillip McBride

Business Technology Executive

You point out the difference between a ‘Pitcher’ and a ‘Partner’. The discovery approach defines which one you get.

1mo

Milad Falahi

Senior Data Scientist, Any platform you want!!

Phillip: it’s not a matter of partnership. 90% of data science problems don’t have an out-of-the-box solution, so the model always needs adapting, and 100% of problems have dirty data, so the data always needs tidying (and adapting to the chosen model). I don’t know why people get their hopes up over a generic demo on a non-real, super-tidy, tiny dataset by any vendor…

1mo

Milad Falahi

Senior Data Scientist, Any platform you want!!

Dr. Olav Laudy, I wasn’t trying to convince anybody, and all vendors already know that “a brand is no longer what we tell the consumer it is – it is what consumers tell each other it is”… Unfortunately there are still customers who go to a vendor for a solution without doing simple research on Kaggle or DataScienceCentral into the best solution for the problem… it’s like shopping in a store without checking Amazon reviews; hit and miss!

1mo

Dr. Olav Laudy

Chief Data Scientist, IBM Analytics, Asia-Pacific

Milad Falahi, you seem to do exactly what you accuse the large vendors of: making excessive claims without a single argument to support them. Your opinion: irrelevant and discarded.

8h

Christophe Cop

Datascientist at Cronos Groep

Great read. And I can agree that preparing the data is the ugly tedious part of data science.

3h

John O. Jones

Bayesian Network Analyst and Gardener

Use the data you have. Even simple models can help guide policies and decisions in the business. Data scientists should spend 80% of their time explaining how decisions affect business performance – that is where real value is produced. They can run a looping process of adding features and information. If they are helpful, they will help make the best decisions based on current information. Maybe we cover our butts by saying “I need more information before I can help.”