The data warehouse is central to the modern Unified Data Infrastructure stack. Selecting a cloud-based data warehouse can therefore be one of the most consequential investments your Data Science team can make.

Modern Unified Data Architecture. Image taken from article by a16z

The wrong solution can stymie productivity while burning through your team's yearly budget. The right solution can multiply your team's productivity k-fold! (pun totally intended)

In the Spring of 2022, my team was in the market for a cloud-based data warehousing solution. The process we followed in evaluating candidates for our data warehouse allowed us to scale our data operations to match our company's growth, while saving us up to 40% of ongoing operational cost.

While I created this process to evaluate third-party cloud-based data warehouses, the steps generalize to evaluating any third-party data/SaaS tool. When I shared it internally, every other team in my company found the process very useful and adopted it for their own third-party software search.

I am sharing it broadly as a guide in the hope that it benefits others as well.

Here are the key steps we followed.

Step 1 - Start with your 'why'?

(With apologies to Simon Sinek)

Why do you need a data warehouse/3rd party SaaS?
Having a good, evidence-based answer to this question is perhaps the single most important step in the process.

In our case, our platform was not originally designed with a data science team's needs in mind. Querying our production database directly created performance issues for customers. We had to create custom API endpoints each time we needed access to specific parts of our database.

As a team that needed instant access to data for analytics and ML, this procedure was unscalable. It resulted in lost man-hours creating and maintaining REST endpoints. Data access through those endpoints was also more than an order of magnitude slower than if we had access to a fast, queryable warehouse. That answered our 'Why'?

Step 2 - List down all of your requirements

What are the key requirements that you have from this solution?
List out all of your requirements and rank them based on how crucial they are to meet your business objectives (Must-have, should-have, nice-to-have).
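
As a rough illustration, the ranking can live in something as simple as a short script or spreadsheet. The sketch below is one way to do it in Python; the requirement names are hypothetical examples, not our actual list.

```python
from dataclasses import dataclass

# Priority tiers from the text, ordered from most to least crucial.
PRIORITY_ORDER = {"must-have": 0, "should-have": 1, "nice-to-have": 2}


@dataclass(frozen=True)
class Requirement:
    name: str
    priority: str  # one of the keys in PRIORITY_ORDER


def rank_requirements(requirements):
    """Sort requirements by tier, keeping the original order within a tier."""
    for req in requirements:
        if req.priority not in PRIORITY_ORDER:
            raise ValueError(f"Unknown priority: {req.priority!r}")
    # sorted() is stable, so ties keep their input order.
    return sorted(requirements, key=lambda r: PRIORITY_ORDER[r.priority])


# Hypothetical requirements, for illustration only.
requirements = [
    Requirement("Dashboards for business users", "nice-to-have"),
    Requirement("Fast SQL access without touching production", "must-have"),
    Requirement("Native ML tooling", "should-have"),
    Requirement("Security compliance", "must-have"),
]

ranked = rank_requirements(requirements)
for req in ranked:
    print(f"{req.priority:<13} {req.name}")
```

Keeping the tier on every requirement makes it easy to filter for must-haves later, when you screen vendors and plan the PoC.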

Step 3 - Do your homework / market research

List out key vendors of this software and their capabilities from publicly available information. Focus on your requirements and leave out cost for the moment (I'll tell you why shortly).
Don't spend too much time doing this. The idea is to get a rough feel for who the key players are and their relative capabilities. Doing this will quickly help you weed out solutions that do not meet your requirements.
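
A minimal sketch of this weeding-out step, assuming you have captured each vendor's publicly documented capabilities as a set of feature tags. The vendors and tags below are placeholders, not real product data.

```python
# Hypothetical must-haves and vendor capabilities, for illustration only.
must_haves = {"sql_access", "ml_support", "security_compliance"}

vendor_capabilities = {
    "Vendor A": {"sql_access", "security_compliance"},
    "Vendor B": {"sql_access", "ml_support", "security_compliance", "dashboards"},
    "Vendor C": {"sql_access", "ml_support", "security_compliance"},
}


def screen_vendors(capabilities, required):
    """Return (shortlist, rejected), where rejected maps vendor -> missing must-haves."""
    shortlist, rejected = [], {}
    for vendor, features in capabilities.items():
        missing = required - features  # set difference: must-haves not covered
        if missing:
            rejected[vendor] = sorted(missing)
        else:
            shortlist.append(vendor)
    return shortlist, rejected


shortlist, rejected = screen_vendors(vendor_capabilities, must_haves)
print("Shortlist:", shortlist)
print("Rejected:", rejected)
```

Recording why a vendor was dropped (the missing must-haves) is handy later, when you present your findings internally.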

Below is a snapshot of our key requirements and a pre-analysis of 3 popular data warehousing solutions - AWS Redshift, Databricks and Snowflake.

Key requirements that we had versus 3 popular data warehousing solutions


It was clear from this process that Redshift wasn't going to make the cut for us, so we decided to focus on Snowflake vs. Databricks. At this stage, Databricks ticked more boxes for us than Snowflake.

The one concern we had at this stage was that Databricks is thought of more as a cloud-based ML platform. But it is building its warehousing capability rapidly and has a lot of ML features natively. So we decided to proceed.

Step 4 - Make contact with vendors

I used to (and still do) have data SaaS vendors contacting me all the time based on my LinkedIn info. I had half-heartedly agreed to engage with a vendor in Fall 2021, knowing that we'd be in the market in Spring '22.

Resist the temptation to engage unless you've completed all of the first three steps.
You'll be on a weaker footing if you haven't, and won't be able to effectively cut through the firehose of sales bullshit that's about to come your way.

Step 5 - Begin Proof-of-Concept (PoC) with vendors.

Before you begin, tell each vendor you've shortlisted that this is a competitive evaluation. Most vendors are happy to provide you with credits for one-month trials.

Knowing that they're in a competitive eval brings out a different, competitive side to vendor salespersons that you don't see if they're the only ones in the trial.

Step 6 - Create your PoC plan

Work backwards, from requirements to broad goals (KPIs) to measurable outcomes (metrics) to data needed to compute metrics.
Try to hit every must-have requirement outlined above in your PoC.
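
One way to keep yourself honest here is to write the plan down as data and check it for gaps. The sketch below assumes a simple record per requirement; the entries are hypothetical examples rather than our actual plan.

```python
from dataclasses import dataclass, field


@dataclass
class PlanItem:
    requirement: str
    kpi: str          # broad goal
    metric: str       # measurable outcome
    data_needed: list = field(default_factory=list)  # data required to compute the metric


# Hypothetical plan entries, for illustration only.
poc_plan = [
    PlanItem(
        requirement="Fast SQL access",
        kpi="Analysts get answers quickly",
        metric="median query latency (s)",
        data_needed=["representative queries", "production-scale tables"],
    ),
    PlanItem(
        requirement="Predictable spend",
        kpi="Cost stays within budget",
        metric="cost per query (USD)",
        data_needed=["billing export", "query log"],
    ),
]

must_haves = ["Fast SQL access", "Predictable spend", "Security compliance"]


def uncovered_must_haves(plan, required):
    """List must-haves that have no plan item with both a metric and its data."""
    covered = {item.requirement for item in plan if item.metric and item.data_needed}
    return [req for req in required if req not in covered]


gaps = uncovered_must_haves(poc_plan, must_haves)
print("Must-haves not yet covered by the PoC plan:", gaps)
```

Here the check flags "Security compliance" as a must-have with no metric attached yet, which is exactly the kind of gap you want to find before the trial clock starts.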

Step 7 - Create your PoC to reflect actual, operational conditions.

This means that you'll likely have to create dummy data in your native storage layer at the scale of your existing data. It also means that you'll have to measure your performance metrics (query performance, inference time, cost/query) using realistic data loads.
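
To make this concrete, here is a small sketch of the pattern: generate synthetic rows, then time a representative query several times and take the median. It uses Python's built-in SQLite as a local stand-in; in a real PoC you would point the same harness at each vendor's own connector and use far larger volumes. The `events` table and its columns are made up for illustration.

```python
import random
import sqlite3
import statistics
import time


def build_dummy_table(conn, n_rows, seed=0):
    """Fill a hypothetical `events` table with n_rows of synthetic data."""
    rng = random.Random(seed)  # seeded so every vendor sees the same data
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    rows = ((rng.randrange(1000), rng.random() * 100) for _ in range(n_rows))
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()


def time_query(conn, sql, repeats=5):
    """Run the query `repeats` times and return the median latency in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


conn = sqlite3.connect(":memory:")
build_dummy_table(conn, n_rows=50_000)

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
latency = time_query(conn, "SELECT user_id, SUM(amount) FROM events GROUP BY user_id")
print(f"{row_count} rows, median latency {latency:.4f}s")
```

Using the median over several runs smooths out one-off effects such as cold caches, which matters when you compare vendors on the same workload.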

Step 8 - Stress-test your PoC

Repeat your PoC (if possible) at the data volumes you project 'n' years into the future, based on the Compound Annual Growth Rate (CAGR) of data in your organization.

Cost scaling curves for vendor 1 versus vendor 2 at 100% CAGR data growth

Costs scale with the size of data. You want a clear picture of which vendor solution best optimizes the (cost, performance) combo into the future, not just the present. It is entirely possible that a vendor that appears cheap at your present data size ends up scaling costs super-linearly, while a vendor with a superior product doesn't blow up costs.
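
The arithmetic behind such a projection is simple compound growth: size after n years = current size x (1 + CAGR)^n. The sketch below projects data size at 100% CAGR and compares two cost curves. Both cost functions and all of the numbers are invented purely to illustrate the crossover effect; in practice you would fit them to what you measure in your PoC.

```python
def project_data_size(current_tb, cagr, years):
    """Compound growth: projected size (TB) for year 0 through `years`."""
    return [current_tb * (1 + cagr) ** year for year in range(years + 1)]


# Hypothetical cost models (USD/month), not real vendor pricing.
def vendor_1_cost(size_tb):
    """Assumed: cheap at small sizes, but super-linear in data size."""
    return 50 * size_tb ** 1.5


def vendor_2_cost(size_tb):
    """Assumed: pricier at small sizes, but linear in data size."""
    return 250 * size_tb


sizes = project_data_size(current_tb=4, cagr=1.0, years=5)  # 100% CAGR

crossover_year = None
for year, size in enumerate(sizes):
    c1, c2 = vendor_1_cost(size), vendor_2_cost(size)
    print(f"year {year}: {size:6.0f} TB  vendor 1 ${c1:10,.0f}  vendor 2 ${c2:10,.0f}")
    if crossover_year is None and c1 > c2:
        crossover_year = year  # first year vendor 1 becomes the more expensive one

print("Vendor 1 overtakes vendor 2 in cost at year:", crossover_year)
```

Under these made-up assumptions, vendor 1 is cheaper for the first three data points and then becomes the more expensive option, which is the pattern a present-day price comparison alone would miss.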

This is also why you shouldn't focus on "published" cost in the homework phase.
Try and evaluate "hidden" costs during your PoC. These can include the cost of security compliance, the cost of support and the cost of other third party tools that may be needed (or not) to make the warehouse work. The cost of features you might need in the near future should also be accounted for.

Step 9 - Ignore the Account Executives, focus on the Sales engineers.

Account Execs have one mantra - Always Be Closing.
They will try all sorts of tricks to make the sale: underselling the competitor, sharing research on why their solution is better, contacting you on LinkedIn, email, phone, text message.

Your sole focus at this stage should be on evaluating the software against your requirements. During the PoC, the Sales engineer is your best friend. Get a real feel for using the product by engaging deeply with the product and the Sales engineer.
Be polite with the account executive by all means, but largely ignore them at this stage.

Step 10 - Present your findings to all internal stakeholders (especially finance)

Make a detailed presentation of all of your findings within the company after evaluating all vendors. Be objective, but also make your preference clearly heard.

If you've been methodical and objective throughout the evaluation process, this will show when you present internally. Coupled with a genuine need for this piece of software (the "Why?"), this will go a long way in allaying concerns of your higher-ups about the purchase.

Step 11 - Present your findings to vendors and negotiate hard.

You don't need to, but it's a good idea to present all of your findings to all vendors that you have engaged with during your PoC. It's common courtesy to tell them what you found out about their product. But it also makes a lot of sense to tell them about what you found lacking.

Account execs have motivation to negotiate with you because you've told them that this is a competitive evaluation. Depending on how this is set up in your company, the finance team might be involved in negotiations, instead of you.
This is a very fun stage. Get involved if possible as you learn a lot during negotiations.

Step 12 - Make the purchase!

This is the final stage. You need to move swiftly once everything is decided. Don't tarry. Close very quickly with your internal finance team once you have the go-ahead. Budgets are always more fungible than you think!

Aside: Yes, I am aware that this is a 12 step process. No, I didn't intend for it to be a 12 step process. But, just like the original 12 steps, the process works best when all steps are followed, preferably in order.

Recap


Here's a TL;DR summary of the process. I'd like to emphasize that while some details of your own evaluation process might vary, following this process will help you make an objective, fully informed decision on the right vendor for your requirements.


I hope you find this process useful in your own 3rd party data science / SaaS tool search. I'm on Twitter (@antaraxia_kk) where I write about the intersection of data science, AI and Web3 / Crypto.

If you like what you read, please subscribe to this (completely free) blog and you'll receive articles written by me directly in your email inbox. Thank you!