The process and best practices of maintenance in Google Cloud
by Livia Cirnu, on Jul 10, 2020 4:04:50 PM
Every development team has a long list of software features requested by the business units. Whether the goal is to build a new product or to improve an existing application, developers are needed. But as the business grows, IT managers may find that the maintenance tasks resulting from the developers' work take up more and more of their time. Once the customer base is growing and the business has seen its first usage spikes, it is time to start thinking about professionalising maintenance.
Bringing someone in-house might seem like an easy option if your business can find the right talent to manage its operations. Another option is to outsource to a larger remote maintenance support team. In that case, your company has to deal with a ticketing system handled by various engineers in different time zones, and you will still need an in-house person to manage things closely. The third option is to find a dedicated consultant who can provide exactly what your organisation needs, communicate closely with the internal team and help exactly when help is needed.
Although working in the cloud allows you to easily deploy applications without having to hire an operations team, some maintenance always remains, and it cannot be neglected if you want to ensure business continuity. The more visibility and monitoring there is in place, the easier it is to grow, innovate and save costs. In this article, we will share what maintenance means in the context of Google Cloud and how we monitor cloud operations for our clients.
The maintenance scope
Daily maintenance includes monitoring data pipelines, data quality and billing. We also include quality assurance among these activities. More specifically, we check data freshness (was the data delivered on time?), data completeness (did we receive all the files?) and data accuracy (do the files contain the information we expected, in the appropriate format?).
Traditionally, a software development team designed and built the solutions and an operations team maintained the environment. Today we rely on DevOps practices. With DevOps, the Development and Operations teams are combined to ensure that the software and data development cycles run smoothly.
As part of our maintenance efforts, the first category to monitor is the data pipelines. A data pipeline is the infrastructure that delivers data and applies the necessary transformations, so it is important to keep an eye on its health.
The basic health indicators are:
- Status of the job: success or failure?
- Latency: how long does it take for the job to complete?
Take a Dataflow job as an example. To cover the checks specified above, the maintenance team takes these actions:
- Configure alerts on Stackdriver to send emails when a Dataflow job fails;
- Collect metrics on the run time and the amount of data ingested by the process.
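To make the two health indicators concrete, here is a minimal sketch in plain Python. It does not call Stackdriver or Dataflow; the `JobRun` record, the job names and the one-hour limit are all hypothetical and only illustrate how status and latency could be evaluated once the metrics have been collected.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class JobRun:
    """One finished pipeline run (illustrative record, not a Dataflow API type)."""
    name: str
    succeeded: bool
    duration: timedelta
    rows_ingested: int


def health_issues(run: JobRun, max_duration: timedelta) -> list[str]:
    """Return readable problems for a run; an empty list means the run is healthy."""
    issues = []
    # Indicator 1: status of the job.
    if not run.succeeded:
        issues.append(f"{run.name}: job failed")
    # Indicator 2: latency, compared against an agreed limit.
    if run.duration > max_duration:
        issues.append(f"{run.name}: took {run.duration}, limit is {max_duration}")
    return issues


# Hypothetical runs collected for one day.
runs = [
    JobRun("orders_ingest", True, timedelta(minutes=12), 48_000),
    JobRun("clicks_ingest", False, timedelta(minutes=3), 0),
    JobRun("crm_export", True, timedelta(minutes=95), 1_200_000),
]
limit = timedelta(hours=1)

alerts = [issue for run in runs for issue in health_issues(run, limit)]
total_rows = sum(run.rows_ingested for run in runs)

for alert in alerts:
    print(alert)
print(f"rows ingested today: {total_rows}")
```

In practice the alert text would go to whatever notification channel the team uses, such as the automated emails mentioned above.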
After checking whether the infrastructure works as expected, the next thing we monitor is the data quality. We mentioned earlier that data can be assessed through its freshness, completeness and accuracy (whether it satisfies the business requirements).
The starting point is to check the data the moment it arrives for the following criteria:
- Arrival time
The maintenance team has created scripts that automate these checks. When one of the checks fails, an alert is triggered.
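The team's actual scripts are not shown here. As a rough illustration only, a check covering freshness, completeness and accuracy for one file delivery could look like the sketch below. The function name, file names, deadline and required columns are assumptions made up for the example.

```python
from datetime import datetime


def check_delivery(received, expected_files, deadline, required_columns):
    """Check one delivery for freshness, completeness and accuracy.

    `received` maps a file name to (arrival time, list of column names).
    Returns a list of failure messages; an empty list means all checks passed.
    """
    failures = []
    # Completeness: did every expected file arrive?
    missing = sorted(set(expected_files) - set(received))
    if missing:
        failures.append(f"completeness: missing {', '.join(missing)}")
    for name, (arrived_at, columns) in sorted(received.items()):
        # Freshness: did the file arrive before the agreed deadline?
        if arrived_at > deadline:
            failures.append(f"freshness: {name} arrived late")
        # Accuracy (format): does the file contain the columns we expect?
        absent = [col for col in required_columns if col not in columns]
        if absent:
            failures.append(f"accuracy: {name} lacks {', '.join(absent)}")
    return failures


# Hypothetical delivery for one morning.
received = {
    "customers.csv": (datetime(2020, 7, 10, 5, 30), ["id", "updated_at", "email"]),
    "orders.csv": (datetime(2020, 7, 10, 7, 15), ["id", "amount"]),
}
failures = check_delivery(
    received,
    expected_files=["customers.csv", "orders.csv", "products.csv"],
    deadline=datetime(2020, 7, 10, 6, 0),
    required_columns=["id", "updated_at"],
)
for failure in failures:
    print(failure)
```

Any non-empty result would be the trigger for an alert.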
One of the critical maintenance operations for the business is monitoring billing costs and any significant changes in expenses. If we see a suspicious spike in the billing report, we investigate:
- which environment incurred the expenses (dev, test or prod)
- which Google Cloud products/resources caused it
- which processes from these products are involved
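The three questions above can be sketched as a small aggregation over billing rows. This is only an illustration under stated assumptions: the row layout, the baseline figures and the 1.5x threshold are invented for the example, and a real report would read its rows from a billing export rather than a hard-coded list.

```python
from collections import defaultdict


def find_cost_spikes(rows, baseline, threshold=1.5):
    """Flag (environment, product) pairs whose cost exceeds threshold x baseline.

    `rows` are (environment, product, process, cost) tuples for one day.
    `baseline` maps (environment, product) to a typical daily cost.
    Pairs without a baseline are flagged too, since they are new spend.
    """
    totals = defaultdict(float)
    by_process = defaultdict(lambda: defaultdict(float))
    for env, product, process, cost in rows:
        totals[(env, product)] += cost
        by_process[(env, product)][process] += cost

    spikes = []
    for key, total in sorted(totals.items()):
        usual = baseline.get(key)
        if usual is None or total > threshold * usual:
            # Name the process that contributed most to the spike.
            top_process = max(by_process[key], key=by_process[key].get)
            spikes.append((key[0], key[1], top_process, round(total, 2)))
    return spikes


# Hypothetical costs for one day and typical daily costs.
rows = [
    ("prod", "BigQuery", "daily_report", 40.0),
    ("prod", "BigQuery", "adhoc_queries", 95.0),
    ("prod", "Dataflow", "orders_ingest", 22.0),
    ("dev", "BigQuery", "experiments", 6.0),
]
baseline = {
    ("prod", "BigQuery"): 60.0,
    ("prod", "Dataflow"): 20.0,
    ("dev", "BigQuery"): 5.0,
}

spikes = find_cost_spikes(rows, baseline)
for env, product, process, cost in spikes:
    print(f"{env} / {product}: {cost} (largest contributor: {process})")
```

Each flagged entry names the environment, the product and the largest contributing process, which is the information needed before contacting the project owner.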
After that, we get in contact with the developers or product owners in charge of that particular project. Below you can see an example of what the report looks like and which Google Cloud products contributed to the overall costs.
The maintenance process
Reports and automated alerts
There are many visualisation applications that can be used for maintenance and monitoring, such as Google Data Studio, Looker and Tableau. To monitor the Google Cloud environments, we work with Google Data Studio, which allows us to create custom dashboards and easily detect any errors or over-consumption. In addition, the reports can be shared with other team members and used for forecasting resource usage and costs, and the tool is free of charge.
The monitoring process can be automated in Google Cloud by setting up alerts that check data accuracy or when the tables were last updated. We set up the warning system and monitoring alerts via Stackdriver, a Google Cloud product. With Stackdriver we can monitor data pipelines, for instance: when a Dataflow job is lagging behind or has failed, we are notified via an automated email.
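As an illustration of the "last updated" check, the sketch below flags tables that have not been modified within an allowed window. It is plain Python and assumes the last-modified timestamps have already been collected, for example from table metadata; the table names and the 24-hour window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_modified, max_age, now=None):
    """Return names of tables not updated within `max_age`, stalest first.

    `last_modified` maps a table name to its last update time (timezone-aware).
    """
    now = now or datetime.now(timezone.utc)
    overdue = {
        name: now - modified
        for name, modified in last_modified.items()
        if now - modified > max_age
    }
    return sorted(overdue, key=overdue.get, reverse=True)


# Hypothetical metadata snapshot taken at 09:00 UTC.
now = datetime(2020, 7, 10, 9, 0, tzinfo=timezone.utc)
last_modified = {
    "sales.orders": datetime(2020, 7, 10, 6, 0, tzinfo=timezone.utc),
    "sales.customers": datetime(2020, 7, 8, 6, 0, tzinfo=timezone.utc),
    "web.sessions": datetime(2020, 7, 9, 2, 0, tzinfo=timezone.utc),
}

stale = stale_tables(last_modified, timedelta(hours=24), now=now)
print(stale)
```

Passing `now` explicitly keeps the check reproducible; when it is omitted, the current UTC time is used.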
For each project, we use Confluence to document the processes, what kind of checks we need to perform and how to act in case something fails. The daily tasks include checking the following:
- Cron jobs
- Scheduled queries
- The status of open Jira tickets
- File delivery in certain buckets
Maintaining the production environment helps to identify performance deviations or availability issues that might affect use of the systems. We work closely with the developers and communicate with them about expected behaviour or bugs that have arisen. While the developers should focus on new releases, they are inevitably part of the monitoring team that ensures all the other processes run smoothly.
Monthly Service Review
A best practice is to take maintenance to a more strategic level by reviewing it periodically with senior IT management. This review can lead to structural improvements. These are the elements we usually cover in a service review:
- Action points/minutes from the previous review
- Maintenance activities and tools
- Incident management
- Service requests
- SLA (service level agreement) overview
- Maintenance meeting - Topics & Minutes
- Billing report
- Service Improvement Plan
This kind of review gives us a helicopter view of the maintenance, the technology and the cooperation between the teams.
The Service Improvement Plan may consist of topics such as those shown in the image below.
Maintenance is just as crucial for a seamless software experience as the application itself. Organisations can no longer afford to run the risk of system failures that would affect productivity and cost money. We can help you maintain a secure and reliable cloud environment without burdening your budget.
Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients and frees up their time to focus on decision making rather than programming.