AVROTROS

A best practice for collecting and publishing data in Google Cloud Platform in 3 steps

About AVROTROS

AVROTROS is a media organisation within the Nederlandse Publieke Omroep (NPO), the Dutch public broadcasting organisation. With dozens of television, radio and online shows, 8 sites, 2 apps and over 100 social media accounts, AVROTROS is a key player in the Dutch market.

Program budgets are largely determined by the reach per show (on television and online), so the most important KPI on all platforms is unique and total reach. AVROTROS has a strong ambition to expand its customer reach further on the basis of data. Data-driven workflows should give publishers insight into the customer reach of their shows and other formats, and should also support growth hacking.

The Goal

They had a working solution in place on Google Cloud and wanted to take their architecture to the next level, so they challenged Crystalloids to design and build a future-proof solution.

They were looking for a solution that is:

  • Easy to use by both developers and analysts
  • Easy to monitor
    • Issues should be easy to detect, before the data users even notice them.
  • Easy to maintain
    • New information can be added and existing information changed while the data stays available to its users.
  • Easy to expand
    • When a new data source is added, the same solution can be applied to it.
  • Easy to reload
    • When a step fails, it should be easy to redo that step individually, without depending on the rest of the process.
  • Easy to secure
    • Groups of people should only have access to the data that has been defined for them.

The Result

With this setup, monitoring can be applied to each individual step, and operations engineers can act on it to help manage the production lifecycle.

The solution has a pay-per-use pricing model, so there are no unnecessary costs. And because the process runs serverless, there are no servers to configure.

 

TJITZE ZIJLSTRA, Growth Hacker AVROTROS

“In just four weeks we have been able to learn a lot. The load-normalise-publish process that Crystalloids has built will be an important foundation of our Datahub. We will bring all remaining data sources to the Datahub in the same way which is a major step in the development of our Datahub and the data driven way of working.”

 

The Process

At Crystalloids we work in a structured way to get maximum results in the shortest time.

Preparation phase:

  1. Information gathering about the existing architecture and high level goals.
  2. Writing the backlog together with the Product Owner, developers and Commercial Director
  3. Backlog refinement to get all necessary details right, such as the Definition of Done
  4. Backlog prioritisation by business importance
  5. Estimation of work for the selected user stories

Implementation phase:

  1. Design the architecture
  2. Implement the solution
  3. Demo working software in two sprints of two weeks
  4. Document the code and hand over the code and documentation
  5. Review customer satisfaction

The Solution

Instead of loading data from a source into BigQuery and building reports directly on these BigQuery tables, the best practice we implemented is to divide this process into three steps:

1. Load

This step only takes data from the source and writes it to BigQuery in its raw form.

No transformations are applied to the data, so you can be sure you are looking at the same data that was received from the source. It can be useful to write the data not only to BigQuery but also to a file in Google Cloud Storage. For example, data received in JSON format can be stored there exactly as it arrived.
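As an illustration of the load step, here is a minimal sketch in plain Python. It is not the code Crystalloids built: the function names and the column names (`source`, `loaded_at`, `payload`) are assumptions made for this example, and the actual write to BigQuery or Cloud Storage is left out. The point it shows is that each payload is wrapped in a row with some load metadata, while the payload itself stays exactly as received.

```python
import json
from datetime import datetime, timezone


def to_raw_rows(payloads, source, loaded_at=None):
    """Wrap each payload string from the source in a raw row, untouched."""
    loaded_at = loaded_at or datetime.now(timezone.utc)
    return [
        {
            "source": source,                    # hypothetical column name
            "loaded_at": loaded_at.isoformat(),  # hypothetical column name
            "payload": payload,                  # exact string as received
        }
        for payload in payloads
    ]


def to_ndjson(rows):
    """Serialise rows as newline-delimited JSON, one row per line."""
    return "\n".join(json.dumps(row) for row in rows)
```

Newline-delimited JSON is chosen here because BigQuery load jobs accept it and it is also a convenient format for a file in Cloud Storage, so the same output could feed both destinations.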

2. Normalise

This step takes the raw data and makes the transformations that are needed to be able to use the data. Make sure that:

  • all data is deduplicated: no transformed table contains two identical rows,
  • all data is current: only the most recent version of each record is present,
  • all data is stored at the lowest granularity: raw data is not aggregated, only stored at its lowest level.
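The three rules above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: in the real solution the logic would more likely run as a query in BigQuery, and the field names in the usage example (`show_id`, `date`, `views`, `loaded_at`) are hypothetical. The function keeps one row per key, namely the row with the highest value in the version field, and it never aggregates.

```python
def normalise(raw_rows, key_fields, version_field):
    """Keep one row per key: the row with the highest version_field value.

    Exact duplicates share both key and version, so only one copy survives.
    Rows are never combined or summed, so the granularity stays unchanged.
    """
    latest = {}
    for row in raw_rows:
        key = tuple(row[field] for field in key_fields)
        current = latest.get(key)
        if current is None or row[version_field] > current[version_field]:
            latest[key] = row
    return list(latest.values())


raw = [
    {"show_id": 1, "date": "2024-01-01", "views": 100, "loaded_at": "2024-01-02T06:00:00"},
    {"show_id": 1, "date": "2024-01-01", "views": 100, "loaded_at": "2024-01-02T06:00:00"},  # duplicate
    {"show_id": 1, "date": "2024-01-01", "views": 120, "loaded_at": "2024-01-03T06:00:00"},  # newer version
    {"show_id": 2, "date": "2024-01-01", "views": 50, "loaded_at": "2024-01-02T06:00:00"},
]
clean = normalise(raw, key_fields=("show_id", "date"), version_field="loaded_at")
```

In this example `clean` holds two rows: the newer version for show 1 (120 views) and the single row for show 2.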

3. Publish

Once the data is normalised and all entities are stored in tables, you are ready to start gathering the data that you want to use in your reports or for specific analyses. This should happen in a separate publish step. Here you also have the possibility to combine multiple entities and/or sources.
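To make the publish step concrete, here is a small sketch in plain Python that combines two normalised entities into one report-ready table. The entities (`shows`, `daily_views`), their fields and the function name are assumptions for this example, not taken from the AVROTROS Datahub. Aggregation happens only here, in the publish step, while the normalised tables keep their lowest granularity.

```python
from collections import defaultdict


def publish_views_per_show(shows, daily_views):
    """Join the shows entity with daily views and total the views per title."""
    titles = {show["show_id"]: show["title"] for show in shows}
    totals = defaultdict(int)
    for row in daily_views:
        totals[titles[row["show_id"]]] += row["views"]
    # Sort by title so the published table has a stable row order.
    return [
        {"title": title, "total_views": total}
        for title, total in sorted(totals.items())
    ]


shows = [{"show_id": 1, "title": "Show A"}, {"show_id": 2, "title": "Show B"}]
daily_views = [
    {"show_id": 1, "date": "2024-01-01", "views": 120},
    {"show_id": 1, "date": "2024-01-02", "views": 80},
    {"show_id": 2, "date": "2024-01-01", "views": 50},
]
report = publish_views_per_show(shows, daily_views)
```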

To orchestrate these steps and make sure that each step is only executed once the previous one has finished, we used Google Workflows. Within this workflow, each step of the solution is handled by a Cloud Function. In this case the data had to be loaded from the source daily, so the first step is triggered by Cloud Scheduler, which can start the process every day at the same time.
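The ordering guarantee described above can be sketched with a tiny runner in plain Python. This is a stand-in for illustration only: it is not Google Workflows syntax, and `run_pipeline` and its `start_at` parameter are hypothetical names. Each step is a callable (in the real setup, a call to a Cloud Function), a step runs only if the previous one succeeded, and a failed run can be resumed from the failed step without repeating the earlier ones.

```python
def run_pipeline(steps, start_at=None):
    """Run (name, callable) steps in order and stop at the first failure.

    start_at optionally names the step to resume from, so that a failed
    step can be redone without rerunning the steps before it.
    """
    names = [name for name, _ in steps]
    start = names.index(start_at) if start_at else 0
    completed = []
    for name, step in steps[start:]:
        try:
            step()
        except Exception as error:
            # Later steps are skipped: they depend on this one having finished.
            return {"completed": completed, "failed": name, "error": str(error)}
        completed.append(name)
    return {"completed": completed, "failed": None, "error": None}
```

If the normalise step fails, the result reports `failed: "normalise"` with only the load step completed, and calling `run_pipeline(steps, start_at="normalise")` afterwards retries from that point.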
