How to migrate BigQuery datasets across regions on Google Cloud Platform?

by Robin Laurens, on Mar 9, 2018, 2:49:47 PM


On May 25th of this year, the new EU data protection regulation (GDPR) will be enforced, which means that if you are not compliant by then, you risk fines from the European Union of up to 4% of your annual global turnover.

One of the critical points in the regulation is the protection of European citizens' personal data. If you use BigQuery for your data queries, this means that the datasets you are using cannot be stored in any region other than the European Union. If they are, it might be a good idea to start migrating.

Unfortunately, migrating BigQuery datasets across regions on Google Cloud Platform is not as easy as it seems. Therefore, our lead developers provide a technical explanation of how to solve the problem, using two different approaches:

How to migrate BigQuery datasets across regions on Google Cloud Platform

The Problem

You have a few, or possibly many, BigQuery tables or datasets on Google Cloud Platform (hereafter referred to as GCP) in one location/region which, for whatever reason, now need to be moved to another location/region. Unfortunately, GCP currently offers no out-of-the-box copy or export functionality that makes this job easy for one table, let alone for many tables.

The Solution

Out of the various approaches that solve this problem, we will discuss two solutions: one using Google Cloud Dataflow and one using Google Cloud Datalab.

Cloud Dataflow

Using Cloud Dataflow, the problem can be solved in the following steps (a code sketch follows the list):

  1. Create a Dataflow pipeline, providing the source and target datasets (the target dataset needs to exist beforehand).
  2. The pipeline iterates over the list of all tables/views using getTablesList(sourceProjectId, sourceDataSet).
  3. It reads each table from the source dataset using BigQueryIO.Read.from.
  4. For each table/view, it copies the schema with TableSchema schema = BigqueryFactory.getBigquery().tables().get(sourceProjectId, sourceDataSet, table).execute().getSchema();
  5. Finally, it writes the rows to the destination dataset using BigQueryIO.Write.to. Don't forget to include withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED); it takes care of creating the target tables.
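
To make these steps more concrete, below is a minimal sketch of the same idea. Note that the snippets above reference the Java Dataflow SDK (BigQueryIO); the sketch uses the equivalent Apache Beam Python SDK transforms, copies a single table, and assumes a flat (non-nested) schema. All project, dataset, table and bucket names are placeholders.

```python
# Minimal sketch: copy one BigQuery table into a dataset in another region
# with a Dataflow (Apache Beam Python SDK) pipeline. All names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery

PROJECT = 'my-project'          # placeholder project
SOURCE_DATASET = 'us_dataset'   # dataset in the old region
TARGET_DATASET = 'eu_dataset'   # dataset in the EU (must exist beforehand)
TABLE = 'my_table'              # placeholder table


def schema_as_string(project, dataset, table):
    """Fetch the source table schema and render it as 'name:TYPE,...'.

    Assumes a flat schema; nested/repeated fields need a TableSchema object.
    """
    client = bigquery.Client(project=project)
    src = client.get_table('{}.{}.{}'.format(project, dataset, table))
    return ','.join('{}:{}'.format(f.name, f.field_type) for f in src.schema)


options = PipelineOptions(
    runner='DataflowRunner',
    project=PROJECT,
    region='europe-west1',                    # region where the job runs
    temp_location='gs://my-temp-bucket/tmp',  # read-side export bucket (placeholder)
)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadSource' >> beam.io.ReadFromBigQuery(
           table='{}:{}.{}'.format(PROJECT, SOURCE_DATASET, TABLE))
     | 'WriteTarget' >> beam.io.WriteToBigQuery(
           '{}:{}.{}'.format(PROJECT, TARGET_DATASET, TABLE),
           schema=schema_as_string(PROJECT, SOURCE_DATASET, TABLE),
           # CREATE_IF_NEEDED mirrors withCreateDisposition(...) in the Java
           # snippet above and takes care of creating the target table.
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
           # Write-side load bucket (placeholder). For a cross-region copy,
           # each temporary bucket must be in a location compatible with its
           # dataset's location.
           custom_gcs_temp_location='gs://my-eu-temp-bucket/tmp'))
```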

Cloud Datalab

Using a Jupyter notebook in Cloud Datalab, the problem is solved with the Python google.datalab.bigquery and google.datalab.storage modules in the following steps (a code sketch follows the list):

  1. Get a source_tables_to_copy list from Google BigQuery containing the tables that are not yet present in the target_dataset or whose contents differ from those in the source_dataset.
  2. For each source_table in the source_tables_to_copy list, do:
    1. Create a new empty target_table in the target_dataset using the source_table.name and source_table.table_schema.
    2. Export the source_table in JSON format to a temporary Google Cloud Storage storage_object.
    3. Import the storage_object to the target_table.
    4. Remove the temporary Google Cloud Storage storage_object.
    5. Remove the source_table if applicable.
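
As a concrete illustration of the loop above, here is a minimal sketch. Instead of the Datalab-specific google.datalab modules named above, it uses the standard google-cloud-bigquery and google-cloud-storage clients, which expose the same export/load operations and also run outside a notebook. All project, dataset and bucket names are placeholders, and the "changed tables" comparison from step 1 is left out for brevity.

```python
# Minimal sketch: copy every table from a source dataset to a target dataset
# in another region via a temporary export to Google Cloud Storage.
# All project, dataset and bucket names are placeholders.
from google.cloud import bigquery, storage

PROJECT = 'my-project'           # placeholder project
SOURCE_DATASET = 'us_dataset'    # dataset in the old region
TARGET_DATASET = 'eu_dataset'    # dataset in the EU (must exist beforehand)
TEMP_BUCKET = 'my-temp-bucket'   # temporary bucket; its location must be
                                 # compatible with both datasets' locations

bq = bigquery.Client(project=PROJECT)
gcs = storage.Client(project=PROJECT)

target_tables = {t.table_id for t in bq.list_tables(TARGET_DATASET)}

for item in bq.list_tables(SOURCE_DATASET):
    name = item.table_id
    if name in target_tables:
        continue  # step 1 (simplified): only copy tables missing from the target

    source_table = bq.get_table('{}.{}.{}'.format(PROJECT, SOURCE_DATASET, name))
    uri = 'gs://{}/{}.json'.format(TEMP_BUCKET, name)

    # Step 2.2: export the source table to GCS as newline-delimited JSON.
    extract_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
    bq.extract_table(source_table, uri, job_config=extract_config).result()

    # Steps 2.1 and 2.3: load the export into a new target table, reusing the
    # source table's name and schema (the load job creates the table).
    load_config = bigquery.LoadJobConfig(
        schema=source_table.schema,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON)
    bq.load_table_from_uri(
        uri, '{}.{}.{}'.format(PROJECT, TARGET_DATASET, name),
        job_config=load_config).result()

    # Step 2.4: remove the temporary storage object.
    gcs.bucket(TEMP_BUCKET).blob('{}.json'.format(name)).delete()

    # Step 2.5 (optional): remove the source table once the copy is verified.
    # bq.delete_table(source_table)
```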

If you need to migrate BigQuery tables across locations/regions and could use a helping hand, do not hesitate to contact us.

Authors: S. Khattak & B. Marseille


About Crystalloids

Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients and lets them focus more on decision-making and less on programming.

For more information, please visit www.crystalloids.com or follow us on LinkedIn.

Topics: Google Cloud, BigQuery
