William: Welcome to Data Decoded, an IBM podcast series dedicated to demystifying the world of data. From data lakes to master data management to data science and big data and everything in between, this is the podcast for the data professional and all those who understand they are in the business of data. If you don’t think you’re in the business of data, as we are recording this the business of data is all over the news right now, and so I think people are really becoming aware that their assets are way more than what they can touch and feel – the tangibles – it’s their data that is becoming a real prime asset.
I’m your host William McKnight, President of McKnight Consulting. I sat down with our guest today at the IBM Think conference and I found it so fascinating that I wanted to talk with him again, but also I wanted to share it with you. And we’re going to focus today on the announcements that were made at Think – at least some of them, there were so many. But my guest today is Seth Dobrin, Vice President and Chief Data Officer for IBM Analytics. Welcome Seth.
Seth: Thanks William, for having me. I appreciate the opportunity.
William: You got it. Well, first, I think the audience might be surprised or at least intrigued by the fact that IBM has a Chief Data Officer. I’ve actually learned there are three of them but can you tell us about your role?
Seth: Yeah, absolutely. Yes. IBM has three Chief Data Officers. So we have a Global Chief Data Officer who is responsible for our enterprise data and data science and AI transformation – that’s the cultural aspects.
He’s also responsible for building out an enterprise data platform, leveraging IBM technologies. His name’s Inderpal Bhandari and once he came on, he quickly realized that IBM is a massive company. In fact, it’s probably five companies if you look at the different divisions – any one of which could be a Fortune 500 company.
So to help drive the transformation within IBM, he brought on a couple of other Chief Data Officers in some of the larger business units. So I have a colleague of mine, Eugene Kolker, who is the Chief Data Officer for our Global Technology Services business and then myself, who is the Chief Data Officer for our Analytics software business unit.
I was really hired to lead the internal transformation for the Analytics business unit, but I’ve also spent a lot of time out talking to clients – sharing my experiences, having taken a Fortune 500 company down this journey in the past. I have spent a lot of time with the Offering Management teams to help them think about what are the amazing products that I would want to buy if I was the Chief Data Officer so how could we simplify our portfolio and our offerings to be something that people would be really, really excited by.
William: Well that last role really speaks to why you would have so much insight into some of these data related announcements at Think and the first one I want to tackle here, is the IBM Cloud Private for Data.
But before we get into that, you have some unique ideas there, Seth, about the future of an enterprise as it relates to Cloud. Can you take us down that road a little bit and make clear what enterprises are going to be doing with Cloud in the near future?
Seth: Yeah, absolutely. If you look at enterprises that are very mature in this space of Cloud and Analytics and just data in general, the vast majority of them, and I might even venture to say all of them, are not on a single cloud. The vast majority of them are on multiple clouds – they’re on Amazon, they’re on Azure, they’re on Google. They’re on IBM Cloud, they probably have a private cloud as well. And they probably use Salesforce or Workday or something to that extent to manage their talent resources and their sales resources.
And so you know IBM has, over the last year, realized that the future is not IBM’s Cloud. If we go to our clients and we say you need to move everything to IBM’s Cloud, we’re being a little bit delusional. And so we settled on the fact that we need to help our clients operate seamlessly across multiple cloud environments and that’s exactly what IBM Cloud Private and IBM Cloud Private for Data are designed to do. They’re containerized, Dockerized platforms.
As most of your audience probably knows, Docker containers need to be managed and orchestrated. And to do that management and orchestration of those containers, we’re leveraging a platform called Kubernetes. Kubernetes is an open source technology that allows us to orchestrate all of these containers in a single cloud and, in fact, across multiple clouds.
And so with IBM Cloud Private for Data, we’re enabling our clients to create a private cloud on any cloud, where they can manage the resources available. So for instance, William, if you’re on my team and I’m the administrator of the IBM Cloud Private for Data environment, I would say okay William, you get a hundred cores, you get a terabyte of RAM and you get a petabyte of storage. You can go into that environment, you can spin up a Hortonworks cluster and set aside a certain amount of resources.
You can also go and spin up a Data Science Experience cluster and assign that a certain amount of resources, and you can do this in minutes, instead of days or hours or weeks or even months in some companies. And then the next day, you can go and reassign these resources. As long as you stay within the bounds of what I’ve assigned to you, you essentially can do what you need to do your job, which really helps when you’re in a private cloud environment.
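To make the allocation model above concrete, here is a minimal Python sketch. It is not the IBM Cloud Private for Data API: the `Quota` and `Workspace` classes, the `assign` method and the cluster names are all hypothetical, and the only figures taken from the conversation are the hundred cores, the terabyte of RAM and the petabyte of storage.

```python
from dataclasses import dataclass, field


@dataclass
class Quota:
    """Resources an administrator grants to one user (hypothetical model)."""
    cores: int
    ram_gb: int
    storage_tb: int


@dataclass
class Workspace:
    """A user's slice of the private cloud (hypothetical model)."""
    quota: Quota
    # cluster name -> resources currently assigned to that cluster
    clusters: dict = field(default_factory=dict)

    def _used(self, skip=None):
        """Sum resources over all clusters, optionally ignoring one."""
        items = [q for name, q in self.clusters.items() if name != skip]
        return Quota(
            sum(q.cores for q in items),
            sum(q.ram_gb for q in items),
            sum(q.storage_tb for q in items),
        )

    def assign(self, name, cores, ram_gb, storage_tb):
        """Create or resize a cluster; refuse if it would exceed the quota."""
        used = self._used(skip=name)
        if (used.cores + cores > self.quota.cores
                or used.ram_gb + ram_gb > self.quota.ram_gb
                or used.storage_tb + storage_tb > self.quota.storage_tb):
            raise ValueError(f"{name!r} would exceed the assigned quota")
        self.clusters[name] = Quota(cores, ram_gb, storage_tb)


# The figures from the conversation: 100 cores, 1 TB RAM, 1 PB storage.
ws = Workspace(Quota(cores=100, ram_gb=1024, storage_tb=1024))
ws.assign("hadoop", cores=60, ram_gb=512, storage_tb=800)
ws.assign("data-science", cores=40, ram_gb=512, storage_tb=200)

# The next day: shrink one cluster and grow the other, still inside the bounds.
ws.assign("hadoop", cores=30, ram_gb=256, storage_tb=800)
ws.assign("data-science", cores=70, ram_gb=768, storage_tb=200)
```

The point of the sketch is only the rule Seth states: a user can reshuffle resources freely, and the one check that matters is that the total stays within what the administrator assigned.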
And thinking about going across all these other clouds, these public clouds, each one of them has a unique skill set that’s required to operate on them, am I right? And in today’s world, if I have people that I want to work across all these environments, not only do I need to have skilled data scientists or skilled developers, but they need to have the additional skill set of operating in all of these different cloud environments, which makes that already difficult-to-find talent even more difficult.
So what IBM Cloud Private for Data does is essentially abstract away the complexity of having your entire talent pool understand all these different clouds. It reduces that to a small percentage, the administrators who are administering all of these environments. But because it’s containerized, API-based and microservice-based, you can still have access to the reasons you go to these various clouds. You can still call APIs from Amazon’s cloud, for example, and integrate them into our IBM Cloud Private for Data.
William: Well I can see how that will really facilitate innovation. I love how you can quick-start while your ideas are hot and get after it with some real resources there. And I understand you have some governance over the top, which I think would be pretty important.
Seth: Yeah, so inside of IBM Cloud Private for Data is basically our entire software portfolio that you can use on-demand and you only pay for what you use. Part of that is our Unified Governance and Data Integration portfolio, which enables you to organize your data into a source of truth and deliver the data in a way that is available for everyone, so you can easily create a data catalog.
So our Unified Governance platform leverages an open source metadata framework called Apache Atlas, and Apache Atlas has the ability to reach into any data repository and get visibility into the metadata that’s in there. One of the big challenges with creating a data catalog or shop-for-data experience in any enterprise is that each data repository has its own proprietary metadata framework.
They all want to be the master so none of them can be the slave and so it’s really difficult to get a good understanding of what data you have if you don’t know what metadata you have. And so Apache Atlas essentially creates a metadata virtualization framework that’s built on open source and so connectors are there and if there’s not a connector there, the client can easily build one to connect into Apache Atlas.
And Atlas ties into our Unified Governance & Data Integration framework. And in fact, this open governance consortium that manages Apache Atlas is IBM, it’s Hortonworks. SAS is involved and ING is involved.
Many of our clients and even some of our competitors are jumping on this project because it’s such an important way to enable enterprises, or frankly anyone who operates across multiple environments, to get access to the metadata that is needed to survive in this data-driven world. If you don’t know what data you have, you can’t get insights from it.
William: Indeed, that makes a lot of sense. Now before we leave this topic, can you tell us about Genesys?
Seth: IBM is in the process of building a new cloud – and this comes out of technology from IBM Research. Having a completely fiber optic based system and having a switchless system really allows you to completely separate compute and storage, and allows you to accelerate connections between all these different environments.
If you look at the fastest cloud in the world today – I believe that’s Facebook’s Cloud, and right now that operates at 30 gigabits. IBM’s new cloud in Genesys is spec’d out to be at ten times that. And so we’re looking at an order-of-magnitude jump over the fastest cloud in the world – oh, which by the way Facebook is the only one that can use. When Genesys is launched, it’ll be available to anyone.
William: Well there’s a lot there to unpack in regards to what you guys have been doing with Cloud, so really congrats on that. I liked hearing about the separation of compute and storage; that’s so important to the bottom line of an organization as they get into the Cloud.
Let’s talk about this other announcement that was made, called Data Science Elite – I’m really curious about that. Can you tell us about Data Science Elite?
Seth: Yeah, so the Data Science Elite Team came out of traveling around the world talking to clients, literally, and the realization that most large enterprises do not know how to effectively implement data science in their enterprise.
Essentially what happens is some senior executive goes out and talks to one of their buddies or goes to a conference and hears how, you know, so-and-so has a hundred data scientists, and so they come back and they say, “Well, I want 200 data scientists.” And they hire data scientists and they say, “Go do data science stuff.”
And they hire people from Kaggle competitions or at universities. Doing data science in a company, in a large company, is fundamentally different than doing data science as part of a Kaggle competition or in a boot camp or university for many reasons. First off is, data never looks like what it looks like in these competitions or at a university – it’s quite messy and you know, maybe not fully understood.
And then there’s this whole concept of operationalizing data science. If you deliver the output of the machine learning model or a decision optimization model as an Excel spreadsheet or CSV file – really, you’re not creating any value. How do you integrate it into processes? How do you integrate it into systems? How do you surface it in a way that an end user, someone who’s ultimately making the decision or driving a process, can consume most effectively?
And then one thing that most senior executives don’t realize and even some people that are leading these data science or analytics teams, is that when you use machine learning – you create a self-fulfilling prophecy. When you apply machine learning, you’re going to change the data and when you change the data over time, the model will stop working.
So the more models your data scientists build, the more quickly their time is going to get consumed with monitoring models to make sure they are within specification and then re-training them when they fall out of spec. And so the Data Science Elite Team was designed to sit down with clients for a short period of time, at no cost to them, for 30, 60, 90 days, depending on the instance, around a specific use case – so we want to sit down with them and solve a real world problem.
We will teach them everything from how do you identify a decision and break that decision down into its component parts – most decisions we’re solving will not be a single machine learning model. They’ll be two, three, five, ten or many more. And then how do you tackle that problem in an agile, iterative manner to deliver value as soon as possible.
Then we even walk them through how do you operationalize it, how do you deploy as an API instead of a CSV file, how do you automatically monitor and retrain your models so that your data scientists are only spending time with the ones where they need to reengineer what we call “features,” which are the important parts of the data or the way we pull data together in a unique way that drives the machine learning model, or where the underlying model itself needs to change. That’s the only time the data scientists will be called in to look at it, and that’s still interesting to them, so it’s not going to be something that’s going to unnecessarily consume their time.
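The monitor-and-retrain loop described above can be sketched in a few lines of Python. This is an illustrative assumption, not the Data Science Elite Team’s tooling: the `ModelMonitor` class, the rolling-accuracy check, the window size and the 0.8 threshold are all hypothetical choices made for the example.

```python
from collections import deque


class ModelMonitor:
    """Track recent prediction accuracy and flag when retraining is due."""

    def __init__(self, window=100, min_accuracy=0.8):
        # One True/False entry per scored prediction; old entries fall off.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether a prediction matched what actually happened."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Share of correct predictions in the current window, or None."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def out_of_spec(self):
        """True once a full window falls below the accuracy threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.min_accuracy


monitor = ModelMonitor(window=10, min_accuracy=0.8)

for _ in range(10):
    monitor.record(predicted=1, actual=1)   # model still matches reality
in_spec_before = not monitor.out_of_spec()

for _ in range(5):
    monitor.record(predicted=1, actual=0)   # the data has shifted
needs_retrain = monitor.out_of_spec()       # signal to trigger retraining
```

In a real deployment the `needs_retrain` signal would kick off an automated retraining job, and a data scientist would only be pulled in when retraining on fresh data no longer brings the model back within spec.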
William: Well, sounds like a very relevant service. When you said I need 200 data scientists I just shuddered because it’s so hard just to get one sometimes, so it’s good to have your pool. Now do your scientists only work on IBM products when they go into these engagements?
Seth: Obviously IBM’s in the business of making money, and so when we go into these clients we go in with our data science portfolio, which is primarily built on our Data Science Experience platform, and we call it DSX. Data Science Experience is built on open source technologies, backed by Spark. You can work in Jupyter or other notebooks, and you can leverage Python or Scala to write your code. And then we have some value adds on top of it – so from the perspective of IBM technology, yes, but it’s built on open source and you write your code in open source and so it’s portable. And we teach people how to do data science, but we hope that our value adds are so significant, and we demonstrate those during the engagement, that people are going to want to stay with our products, and that’s what we’ve seen. We launched this team unofficially January 1st and officially at Think.
Since January 1st, we’ve gone through several engagements and with every one of these engagements, the client has enjoyed and learned a lot from the relationship but more importantly for IBM, they bought our Data Science Experience platform.
Now with that said, the data needs to come from somewhere and we do not require people to have an entirely, you know, blue portfolio of all IBM products. DSX works with Cloudera, with a self-managed distribution, it works with Oracle and Teradata and Db2 and the data system; it will either directly integrate with it or you can create an API or a direct connection to it just like you would anything else.
William: Well this kind of jump start package sounds really relevant to organizations today because data science, that’s really where competitive advantages are being forged today, and so companies really need to take a look at this and get going down this path and this can certainly help, I can see that.
Now, great data science as we know it is based on great data, and great data over the years has been stored in DB2 in all its various forms. I’ve been tracking DB2 forever. And there were some announcements about DB2; in particular, one I want to talk about here is the DB2 Event Store. So how does that fit in with the DB2 family of databases?
Seth: Actually, I’d like to start a step back. So most of your listeners probably don’t realize that IBM is the single largest contributor to open source projects if you look across all the different open source frameworks. And what we call DB2 Event Store is a new database that’s built on open source technologies.
It’s built on an S3 object store. It leverages Apache Parquet. It leverages Apache Spark. And it’s what we call a hybrid transactional analytical or HTAP database, and so it can do millions of transactions per hour with just a three node cluster, but you can also run analytics directly on it.
If you look at HBase, which is the backend of Hadoop in most instances, each column in HBase is a Parquet file, and so we’ve taken that concept and on top of that we put Apache Spark, and that creates this highly analytical data structure where you can access the data that’s been indexed, leveraging Spark, in these Parquet files. Each of these Parquet files is essentially an independent schema.
Today we have DB2 that sits in front, a relational database that acts as a landing zone. In the near future, and in fact in IBM Cloud Private for Data, it is an in-memory database where the data lands and that acts as the landing zone for DB2 Event Store. And so today we’re really positioning it as an IoT type system where you have highly transactional workloads – like The Weather Company type, where you’ve got all these transactions coming from phones and devices from all over the world, many millions of transactions per minute or per hour.
But in the future it’s really the foundation of our databases. We’re going to re-architect – and we’re in the process of doing this – all of our databases leveraging this concept of object store with Parquet with Spark to give the same type of look and feel you have with DB2 or any other database, but in a much lighter-weight and easier to manage environment.
William: That’s really fascinating and I love how you’re bringing together some really relevant things out there that I’ve been hearing about. I’ve been seeing workloads get beyond this hard line between operational and analytical and I don’t know where to put them sometimes anymore, so I think this notion of HTAP is really growing, so that’s great that you’re putting it in a database right there.
You mention Parquet a few times there, that each has an independent schema, it’s immutable, and so on – isn’t Parquet really the idea that you bring in that column store mentality to a distributed file system?
Seth: Yeah, that’s exactly what it is, and I guess one of the other advantages of Parquet files is that they’re highly compressed. So as we get back to this multi-cloud environment and you want to move data across all these environments, if the data is not highly compressed… Number one, your latency is going to be poor. Number two, most of the cloud environments let you put your data in for free, but as soon as you want to take it out is when they charge you. If you don’t have highly compressed data, the cost of moving data around is going to get out of hand very quickly.
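The cost argument above is simple arithmetic, and a small Python sketch shows its shape. All of the numbers here are made up for illustration: the 10,000 GB data set, the 4x compression ratio and the price of 0.10 per GB are assumptions, not figures from the conversation or from any cloud provider’s price list, and `egress_cost` is a hypothetical helper.

```python
def egress_cost(raw_gb, compression_ratio, price_per_gb):
    """Cost of moving data out of a cloud after compression.

    raw_gb            -- size of the uncompressed data in gigabytes
    compression_ratio -- raw size divided by stored size (1.0 = uncompressed)
    price_per_gb      -- what the provider charges per gigabyte taken out
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_gb / compression_ratio * price_per_gb


# Illustrative numbers only: the same data moved raw and compressed 4x.
uncompressed = egress_cost(raw_gb=10_000, compression_ratio=1.0, price_per_gb=0.10)
compressed = egress_cost(raw_gb=10_000, compression_ratio=4.0, price_per_gb=0.10)
savings = uncompressed - compressed
```

Because the charge scales with bytes transferred, the bill shrinks by the same factor as the data does, and the same reduction applies to transfer time on a fixed-bandwidth link.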
William: Well I think Parquet is going to be that version of the distributed file system. It’s really going to stick around for a while.
Seth: Yeah, yeah, I think so too, I think it’s going to be like Spark. It’s been around for a while and isn’t going anywhere soon unlike some of these other open source technologies that kind of wax and wane.
William: Right, well that’s great to take advantage of that. Also there was a lot of Watson branding, of course, throughout Think and one of them was Watson Studio. Is this a new release of Watson – can you clarify that?
Seth: Yeah, so Watson Studio is the melding of several things that have been held separate inside of IBM for a while. So if you look at what we came to market with, we had what we called Watson Data Platform, which had our hosted or cloud offerings. We had IBM Cloud, Bluemix, and all these different Watson services.
And so what we’ve done is an internal reorganization as well as a go-to-market reorganization of all of those things I just mentioned into one platform, and we call that platform Watson Studio. And so as a data scientist, I can still go to datascience.ibm.com and get access to what we used to call Data Science Experience, but integrated with that now are all the IBM Cloud services that used to be held separately before. There is very well-integrated access to all the different Watson services, Watson Knowledge Studio, Watson Visual Recognition. So you can actually, in that environment, build truly cognitive applications or AI applications that leverage all of these different services in there.
It ties directly into IBM Cloud Private for Data and that’s how we’re really enabling clients to bridge to IBM’s Cloud, because one thing I didn’t mention when I was talking about IBM Cloud Private for Data, Kubernetes, and Docker is that it’s actually IBM’s architecture. So the entire company of IBM is built on that architecture, so when clients use IBM Cloud Private for Data on their private cloud or someone else’s cloud, it will integrate seamlessly with those properties that live on IBM’s Cloud – including all the Watson services that are there and including Watson Studio. It’s a coming together of all the different technologies that makes it easier to consume.
William: I see, alright. Is there a great use of this Watson Studio that you can share so listeners can maybe customize this into their situation?
Seth: So we just launched it a month ago, so we don’t have any great use cases to share that were built completely on Watson Studio, but if you go to datascience.ibm.com, and you create an account and log into the Data Science component, there are notebook examples in there, starter kits for how you can get going. They range across everything from how do you do visual recognition – can I fly a drone around and fly it in front of William McKnight and know it’s William McKnight. Stuff like that is in there.
There are some examples of how our Global Business Services, our consulting services, has brought all these different pieces together for clients around retail – so how we helped retailers understand how foot traffic is going to change for out-of-the-ordinary things. For example, there’s an art festival and it’s going to draw more people downtown, or there’s an art festival down the street and the streets are going to be closed in front of your shop so you will lose traffic. And so helping retailers understand things like that in the context of the weather and the context of the supply chain. They use these concepts on Watson Studio to solve these real-world problems for clients.
William: Well the more factors you are bringing together, the more you are getting a full picture and I think Watson Studio is really poised to bring that full picture to enterprises today. Thanks a lot for sharing today and thanks a lot listeners for tuning into another episode of Data Decoded. You can find this podcast and more at ibmbigdatahub.com and you can contact me at McKnight Consulting Group at McKnightcg.com. Til next time, thanks for listening to Data Decoded.