Navigating the Complexities of MLOps: A Guide for Businesses
nmb@konfitech.com
Generative AI is one of the most significant technologies to emerge in recent years. For companies, it unlocks great opportunities to automate repetitive tasks and improve processes, greatly increasing productivity and freeing people to focus on the more complex work that creates competitive advantage. For many small businesses, access to this technology can mean the difference between disappearing and winning in the market. At Konfitech we aim to make this opportunity accessible to all, which is why we have written this short guide to MLOps, every step of which we can support you through.
We will therefore take you through the layers involved and the key considerations for assessing AI readiness.
Tooling in this area is complex, and aligning business strategy and risk profile with the increased security risks that AI brings is hard for most businesses to navigate. It is therefore natural that many seek an expert partner for these matters, but they still need a solid understanding of the topic to manage the process, from the underlying hardware to the type of machine learning tool and the interface with users.
Now that AI and ML have reached a plateau of productivity, they have moved out of computer scientists' labs and can provide commercial value through concrete use cases. One famous example is Klarna, which replaced a large part of its service and support department with AI, with a predicted 40 million USD in savings.
As companies keep increasing their investment in AI, three main problem areas arise: technology, people, and the intersection between them.
On the technology side there are two high-level components that cause trouble: infrastructure and data. The infrastructure landscape is changing and evolving at a rapid pace to sustain ongoing innovation, so non-experts risk adopting a technology that may be obsolete within a few years.
Data is also a challenge. Because companies have historically spread their systems and data to reduce risk, much of the data out there is siloed. Implementing AI and ML requires not only a lot of data, but high-quality data that can be compiled into a common format.
On the people side there are two main areas: talent and operations. The talent needed for AI has not yet reached the required level of maturity and readiness. The AI wave requires not only deep technical experts but people with a deep understanding of the business and its various stakeholders, as the biggest benefits ultimately come from leveraging AI to increase core business productivity. The operating model is vital here: organisations need to scale both the talent and its interactions with the business, and also operate the cloud model itself, which brings scalability issues to any AI implementation.
Although addressing these problems requires investment, it is hard to argue against the ROI: the tangible value AI can bring is high, with an extremely long lifetime.
When choosing a tech stack to build your AI application, you generally must select the chips that will run your workloads, which is why AI has historically been considered high cost: training typically requires many GPUs, with NVIDIA being the most popular supplier (and the reason its stock has soared recently).
If you do decide to buy the chips and run the workloads locally, you need to consider factors such as:
- Unit interconnection: as projects scale they require more computing power, so this is something you need to consider from the start of any AI project.
- Supported software: you need to make sure the computing hardware and the software you will use for your application are aligned (see the sketch after this list).
- Licensing: proprietary platforms and their licensing terms, such as NVIDIA's CUDA ecosystem, can bring constraints into the picture when building your application.
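As a minimal illustration of the software-alignment point, assuming your stack uses PyTorch (an assumption, since no framework is named above), a few lines can verify that the framework actually sees the hardware before you commit to a stack:

```python
# Minimal sketch: verify that the installed framework can see the hardware.
# Assumes PyTorch; other frameworks expose similar checks.
import torch

if torch.cuda.is_available():
    print(f"CUDA devices found: {torch.cuda.device_count()}")
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device visible; training would fall back to CPU.")
```

Running a check like this early catches hardware and driver mismatches before they become expensive.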
At Konfitech, however, we generally recommend building most of your solution or application in the cloud. This lets you set aside several of the points above: scaling is taken care of by the cloud's on-demand model, and using software already offered on the cloud ensures it is supported.
Two big success factors of any ML project are the amount of data and the amount of training the model gets, and both are easier and cheaper to handle in the cloud than on-premises on your own chips.
In general, there are many ways to run MLOps in the cloud, across the various configurations available: private, public, hybrid, and multi-cloud. Depending on your needs and available resources, one approach will generally suit you better than another; the easiest route, however, is to build on some of the out-of-the-box solutions available on the public cloud platforms. These have been around for a long time and offer a huge diversity of instance types and ways of doing compute that can be aligned with your business requirements and budget.
The public clouds also offer tooling across the whole lifecycle of an MLOps project, from initial data ingestion through cleaning and processing to training the model. This keeps management overhead low and gives your engineers a continuous workflow with strong support from the big cloud suppliers; a minimal sketch of that lifecycle follows below.
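To make those stages concrete, here is a hedged sketch of the lifecycle in plain Python with pandas and scikit-learn; the file name and the "churned" column are hypothetical placeholders, and on a real cloud each step would typically use the platform's managed services instead:

```python
# Minimal sketch of an MLOps lifecycle: ingest -> clean -> train -> evaluate.
# The CSV path and column names ("churned", etc.) are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Ingest: in a cloud setup this would read from object storage instead.
df = pd.read_csv("customer_data.csv")

# Clean: drop incomplete rows and keep numeric features only.
df = df.dropna()
X = df.drop(columns=["churned"]).select_dtypes("number")
y = df["churned"]

# Train and evaluate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```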
Public clouds also require no initial investment, but as your project's compute requirements grow, so does the cost of the cloud services. The cost grows further if you need compute available in multiple geographical regions, since data mobility adds to the bill.
Security concerns are also present: you are sending your data away over the internet, where it is vulnerable to attack.
However, these are points to be wary of rather than deal-breakers: they are mitigable, and the pay-per-use model has real benefits. The ease of deployment through the marketplace solutions available on the clouds is remarkable; going to AWS Marketplace or Azure Marketplace can let you deploy MLOps environments in minutes.
If you go with a hybrid cloud, running some workloads publicly and some privately, it makes sense to split the workload according to your business needs. You can run the more computationally heavy and expensive tasks on-premises, where cost does not scale with compute since you already own the hardware, and use the public cloud as a data storage and processing layer that pulls data out of on-premises silos, delegating compute back to on-premises while the cloud serves as the interface for interaction. Most drawbacks of a hybrid model come from the difficulty of setup, since the environments interact heavily and that interaction must be managed. You also have three attack surfaces to secure: the on-prem environment, the cloud, and the interface between them, so hybrid setups demand strong security skills. A toy routing sketch follows below.
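As a toy illustration of that split, here is a hedged sketch of a placement rule; the Workload fields and the 100 GPU-hour threshold are hypothetical, not a recommendation:

```python
# Hedged sketch: route workloads between on-prem and public cloud.
# The Workload fields and the 100 GPU-hour threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    gpu_hours: float         # estimated compute cost driver
    data_is_sensitive: bool  # compliance constraint

def placement(w: Workload) -> str:
    """Keep sensitive data and heavy compute on-prem; burst the rest to cloud."""
    if w.data_is_sensitive or w.gpu_hours > 100:
        return "on-prem"
    return "public-cloud"

jobs = [
    Workload("model-training", gpu_hours=500, data_is_sensitive=True),
    Workload("report-generation", gpu_hours=2, data_is_sensitive=False),
]
for job in jobs:
    print(f"{job.name} -> {placement(job)}")
```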
The last common cloud compute scenario is the multi-cloud environment: using two or more public clouds, whose different instance types and prices let you optimise for your needs. Here there is no up-front investment, you can always run on the lowest-cost compute, and you split vendor risk and avoid lock-in, giving your business an edge in flexibility (a sketch of this cost optimisation follows below). Although this seems like a very attractive way of running a cloud environment, it carries significant overhead in API management and integration work, with more services and domains to manage in different places.
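To illustrate the lowest-cost-compute idea, here is a hedged sketch that picks the cheapest provider for a job; the provider names and hourly rates are invented, not real pricing:

```python
# Hedged sketch: pick the cheapest cloud for a job. All rates are made up.
hourly_gpu_rates = {
    "cloud-a": 3.10,  # hypothetical USD per GPU-hour
    "cloud-b": 2.75,
    "cloud-c": 2.95,
}

def cheapest_provider(gpu_hours: float) -> tuple[str, float]:
    provider = min(hourly_gpu_rates, key=hourly_gpu_rates.get)
    return provider, hourly_gpu_rates[provider] * gpu_hours

provider, cost = cheapest_provider(gpu_hours=200)
print(f"Run on {provider}: estimated {cost:.2f} USD")
```

In practice this kind of selection has to be weighed against the integration overhead just mentioned; the cheapest hour is not free if it costs engineering time to reach.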
What, then, are some important considerations when finding the right cloud for your machine learning operating model?
- What is your AI readiness? If you are new to the game, a public cloud with out-of-the-box solutions is easy to implement, use, and maintain while you experiment. A more experienced AI organisation can leverage hybrid or multi-cloud approaches to enable the business.
- What is your security and compliance profile? The type of data you will handle and the regulatory frameworks you fall under can greatly impact not only your cloud environment but also your choice of cloud tools.
- What is your current situation? If your organisation already has substantial data-centre capacity on-premises or in the cloud, do not try to completely reinvent your cloud strategy and operating model. Adapt to what you have and make use of existing resources while you build and experiment; the technology is still developing rapidly, so locking yourself in now may be unwise.
- What is your human capital? Do you have the labour capacity to manage a new environment and its user administration in addition to developing the AI?
- How much data do you need? The more data, the more compute you need. Are your current resources sufficient, or do you need to invest in more compute?
- What is the business strategy? Alignment is crucial in any technology implementation. The business needs to be on board: are they looking to migrate clouds or change key vendors, and how will that impact the choice of cloud for MLOps?
- Which service providers are available to you? The provider's current service portfolio and roadmap must align with the business outcomes you require. Do the cloud's capabilities match the workload, standards, and architecture requirements you would have if you built on it?
- What are the resilience and performance requirements for your use cases? When your application runs, how critical is it that it stays up, and how would downtime impact the core business? Consider contractual safeguards such as SLAs and other agreements with the vendor to ensure resilience and performance are up to par.
- What will this cost the business, and what is the return? Once you have considered the questions above, you can map out your computational use cases and infrastructure requirements across the cloud configurations under consideration. This makes it easier to compare costs between service providers: their cost models differ greatly, so a fair comparison is hard until you have clear use cases and know how they would be built. A simple comparison sketch follows after this list.
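As a hedged illustration of such a comparison, this sketch totals compute, storage, and egress for one use case under two providers; every rate below is an invented placeholder, not real cloud pricing:

```python
# Hedged sketch: compare a use case's monthly cost under two providers.
# All rates below are invented placeholders, not real cloud pricing.
use_case = {"gpu_hours": 300, "storage_gb": 2000, "egress_gb": 150}

providers = {
    "provider-a": {"gpu_hour": 3.00, "storage_gb": 0.020, "egress_gb": 0.09},
    "provider-b": {"gpu_hour": 2.60, "storage_gb": 0.025, "egress_gb": 0.12},
}

def monthly_cost(rates: dict, usage: dict) -> float:
    """Sum the three cost components for one use case."""
    return (usage["gpu_hours"] * rates["gpu_hour"]
            + usage["storage_gb"] * rates["storage_gb"]
            + usage["egress_gb"] * rates["egress_gb"])

for name, rates in providers.items():
    print(f"{name}: {monthly_cost(rates, use_case):.2f} USD/month")
```

The point of the exercise is that the cheapest GPU hour does not always win once storage and data-mobility costs are included.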
Overall, most cloud providers do offer capabilities for machine learning and artificial intelligence, but how well those offerings fit your specific scenario varies a lot, which makes it extremely important to work through the points above before selecting a vendor.
MLOps itself, then, is the intersection of developing and operating machine learning on the platform: like DevOps, but applied to a machine learning solution. It is a set of practices for automating and simplifying the workloads discussed above. Much of today's MLOps practice runs on technologies such as Kubernetes, on top of which various platforms offer end-to-end solutions for running machine learning in the cloud; a small sketch of submitting a training job follows below.
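To give a feel for what running ML on Kubernetes looks like, here is a hedged sketch that submits a training job with the official Kubernetes Python client; the image name, namespace, and command are hypothetical placeholders:

```python
# Hedged sketch: submit an ML training job to a Kubernetes cluster.
# The image, namespace, and command are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # reads your local kubeconfig for cluster access

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="trainer",
                    image="registry.example.com/ml/train:latest",
                    command=["python", "train.py"],
                )],
            )
        ),
        backoff_limit=2,  # retry a failed training pod up to twice
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
print("Training job submitted.")
```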
When you do this, you either build your technology stack on an end-to-end platform or assemble it from the various components out there, covering data, development and deployment, testing, and monitoring.
No matter which model you choose for building your application, this stack can be hard to manage. The "DevOps" part alone involves integrations, tool management, versioning, dependencies, and more. On top of that come the machine learning components: data ingestion, data cleaning, data storage, model training, and model deployment. Add the concerns around security and compliance for the model and the data being used, and running some of these models becomes extremely complex; a small tracking sketch follows below.
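As one small example of taming the versioning part of that stack, here is a hedged sketch using MLflow, a common open-source experiment tracker (one illustrative choice among several, not a tool named above); the parameter and metric names are placeholders:

```python
# Hedged sketch: track model versions and metrics with MLflow,
# one common open-source choice. Parameter/metric names are placeholders.
import mlflow

with mlflow.start_run(run_name="baseline-model"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("dataset_version", "2024-05-01")
    mlflow.log_metric("test_accuracy", 0.87)
# Each run is recorded, so experiments stay comparable and reproducible.
```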
This means companies will sometimes have to add open-source models and other components to their stack, so that they have more control over and insight into what is going on inside the model than some of the public models out there allow.
Many choose to outsource the management and development of these models to managed service partners such as Konfitech. As the market matures we will see more production-grade and better-documented solutions, but the need for open-source models, programs, and integrations is undeniable. With the increased management overhead, we expect demand for managed service partners in this area to rise; alongside making smart platform choices, this is probably the most economical way of running MLOps.
Contact us today to build your MLOps solution: https://www.konfitech.com/contact-us
Stay up to date on our newest insights: https://www.konfitech.com/konfitechs-blog
Or get the insights sent to you via e-mail: https://konfitech.substack.com