Big Data and Analytics Hub spoke with IBM Distinguished Engineer John Thomas (@johnjaithomas) about the importance of tuning information architecture so that algorithms meet enterprise needs, and about how machine learning can be applied most effectively in hybrid scenarios in 2018. The following is part two of a two-part interview (read part one).
BDAH: As we’ve discussed, the multicloud is a reality for many of our clients. What are some of the challenges of doing data science in the multicloud, and how do you overcome them?
JT: Whatever the driving force, the multicloud is happening. Sometimes what gets lost in this discussion is that the multicloud isn’t always public, but a combination of public and private cloud, a combination of on- and off-premises cloud. Once you have that understanding, then you can begin to think of what kind of data science and machine learning makes sense in a multicloud environment.
BDAH: What has been your experience dealing with unstructured data?
JT: When it comes to speech or vision, deep learning techniques are ideal. You have challenges with the volume of data, which means long training times. Acceleration is important when you get to building and training models that deal with these very large quantities of data. If you’re dealing with neural nets, you need to look at various acceleration mechanisms. For example, the PowerAI machines from IBM use Power processors coupled with GPUs, giving a significant improvement in training speed.
Beyond that, there are other challenges. One is the lack of labeled data for the model to train on. At IBM, we have a number of techniques we are working on to address this. Transfer learning is becoming very popular in industry. Another challenge is dealing with very large volumes of text data and performing efficient ML on it. It is important to go beyond statistical analysis of your text corpus: adding lexical meaning and semantic understanding, annotating and refining that understanding through mechanisms that let a domain expert interact with the models, and then consuming the results in downstream machine learning platforms.
BDAH: Data warehouses are aligned to analysis of structured data, but a lot of learning is coming from the unstructured side. How do you combine what you can do with a data warehouse with what you might need from a data lake? How do you bridge the gaps to find more learnings or a larger corpus of data?
JT: I think there’s a place for a variety of repositories and a variety of data management systems. I think a fit-for-purpose approach for persistence and management of data is essential. Structured data is best kept in a relational, structured environment, but unstructured data may be best kept in a very different environment. What we really need is a logical data lake approach: polyglot persistence, meaning that if you have a variety of types of data, all of them are supported. Warehouses, NoSQL databases, Hadoop environments, object stores: they are all first-class citizens in a polyglot persistence model. What is important is being able to work with all these different data sources seamlessly, having governance across all of them, and building and applying your ML models with data that comes from any or all of these sources.
BDAH: I like what you said about how there’s no AI without IA. Am I correct in assuming that it’s all for naught if you don’t have a system to connect the raw material?
JT: You can do AI projects without an architecture in place, but can you scale? Does it meet enterprise requirements? The sexy stuff these days is the algorithms and the models, but if you don’t have an information architecture in place, they can take you only so far in terms of operationalizing AI for the enterprise.
In 2018, we will see more hybrid ML scenarios. The main point I would like to make about hybrid ML is this: let data gravity and SLAs dictate where the model building happens, and let consumption channels dictate where the model gets deployed. Try not to move tons of data around, inside or outside the enterprise. There’s a reason data gravity is important, and you have to conform to it. Otherwise, you end up with costly and complex approaches. In short, let data gravity and consumption channels dictate your hybrid ML models.
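As a rough illustration of that rule of thumb, the Python sketch below picks a training location from where most of the data already sits (optionally restricted to locations that satisfy an SLA) and derives deployment locations from where the consumers are. Everything in it is hypothetical and chosen only for illustration: the `DataSource` type, the function names, the location labels, and the sizes are assumptions, not part of the interview or of any IBM product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class DataSource:
    """A hypothetical description of one data repository."""
    name: str
    location: str   # e.g. "on_prem" or "public_cloud" (illustrative labels)
    size_gb: float


def choose_training_location(
    sources: Iterable[DataSource],
    allowed: Optional[Set[str]] = None,
) -> str:
    """Pick the location holding the most data (data gravity).

    `allowed` is an assumed stand-in for SLA constraints: if given,
    only those locations are considered.
    """
    totals: Dict[str, float] = {}
    for source in sources:
        if allowed is not None and source.location not in allowed:
            continue
        totals[source.location] = totals.get(source.location, 0.0) + source.size_gb
    if not totals:
        raise ValueError("no data source satisfies the constraints")
    # Train where the largest volume of data already lives.
    return max(totals, key=lambda loc: totals[loc])


def choose_deployment_locations(channels: Dict[str, str]) -> List[str]:
    """Deploy the model next to each consumption channel.

    `channels` maps a channel name to the location it runs in.
    """
    return sorted(set(channels.values()))


if __name__ == "__main__":
    sources = [
        DataSource("warehouse", "on_prem", 5000.0),
        DataSource("clickstream", "public_cloud", 800.0),
    ]
    channels = {"mobile_app": "public_cloud", "branch_kiosk": "on_prem"}
    print("train in:", choose_training_location(sources))
    print("deploy to:", choose_deployment_locations(channels))
```

In this toy setup the model is built on premises, because that is where most of the data sits, while the trained model is deployed to both locations, because consumers exist in both. A real decision would weigh far more than raw volume, so treat this only as a way to make the heuristic concrete.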
Discover more about machine learning and hybrid data management with this interactive view.