How to migrate BigQuery datasets across regions on Google Cloud Platform?
by Robin Laurens, on Mar 9, 2018 2:49:47 PM
On May 25th of this year, the new EU data protection regulation (GDPR) will come into force, which means that if you are not in compliance by then, you risk fines from the European Union of up to 4% of your annual worldwide turnover.
One of the critical points in the regulation is the protection of European citizens' personal data. If you use BigQuery for your data queries, this means that the datasets you are using cannot be stored in any region other than the European Union. If they are, it might be a good idea to start migrating.
Unfortunately, migrating BigQuery datasets across regions on Google Cloud Platform is not as easy as it seems. Therefore, our lead developers will provide you with a technical explanation of how to solve the problem, using two different approaches:
How to migrate BigQuery datasets across regions on Google Cloud Platform
You have a few, or possibly many, BigQuery tables or datasets on Google Cloud Platform (hereafter referred to as GCP) in one location/region which now, for whatever reason, need to be moved to another location/region. Unfortunately, Google Cloud Platform currently offers no out-of-the-box copy or export functionality that makes this job easy for one table, let alone for many tables.
Out of the various approaches that solve this problem, we will discuss two solutions, one using Google Cloud Dataflow and one using Google Cloud Datalab:
Using Cloud Dataflow, the migration can be done in the following steps:
- Create a Dataflow pipeline, providing the source and target dataset (the target dataset needs to be in place beforehand).
- The pipeline iterates over the list of all tables/views obtained with getTablesList(sourceProjectId, sourceDataSet).
- It reads each table from the source dataset using BigQueryIO.Read.from.
- For each table/view, it copies the schema with TableSchema schema = BigqueryFactory.getBigquery().tables().get(sourceProjectId, sourceDataSet, table).execute().getSchema();
- Finally, it writes to the destination dataset using BigQueryIO.Write.to. Don't forget to include withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED); it takes care of creating the tables for you.
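The per-table iteration above boils down to building, for every table in the source dataset, the fully-qualified table specs ("project:dataset.table") that BigQueryIO.Read.from and BigQueryIO.Write.to expect. A minimal sketch in plain Python (the pipeline itself needs the Beam SDK and GCP credentials, so only the spec construction is shown; all project, dataset and table names below are illustrative):

```python
def table_specs(source_project, source_dataset,
                target_project, target_dataset, table_names):
    """Build (source, target) table-spec pairs in the
    "project:dataset.table" form that BigQueryIO.Read.from and
    BigQueryIO.Write.to expect, one pair per table to migrate."""
    return [
        ("{}:{}.{}".format(source_project, source_dataset, name),
         "{}:{}.{}".format(target_project, target_dataset, name))
        for name in table_names
    ]


# Example: two tables moving from a US-located dataset to an
# EU-located one (names are made up for illustration).
specs = table_specs("my-project", "us_dataset",
                    "my-project", "eu_dataset",
                    ["customers", "orders"])
```

In the pipeline, each pair would feed one read/write pass over a single table.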
Using Cloud Datalab, the same migration can be scripted as follows:
- Get a source_tables_to_copy list from Google BigQuery containing the tables that are not present in the target_dataset or that differ from their counterpart in the target_dataset.
- For each source_table in the source_tables_to_copy list, do:
- Create a new, empty target_table in the target_dataset using the source_table.name and source_table.table_schema.
- Export the source_table in JSON format to a temporary storage_object on Google Cloud Storage.
- Import the storage_object into the target_table.
- Remove the temporary storage_object from Google Cloud Storage.
- Remove the source_table if applicable.
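The first step above, deciding which tables actually need copying, can be sketched as a pure Python helper. Schemas are represented here as simple lists of (field name, field type) tuples for illustration; in Datalab you would build these from the table metadata returned by the BigQuery API:

```python
def tables_to_copy(source_tables, target_tables):
    """Return the names of tables that must be migrated: those absent
    from the target dataset, or present there with a different schema.

    Both arguments map table name -> schema, where a schema is any
    comparable value (here: a list of (field_name, field_type) tuples).
    """
    return sorted(
        name for name, schema in source_tables.items()
        if target_tables.get(name) != schema
    )


# Example: "orders" is missing from the target and "users" has a
# changed column type, so both need copying; "events" is up to date.
source = {
    "users": [("id", "INTEGER"), ("email", "STRING")],
    "orders": [("id", "INTEGER")],
    "events": [("ts", "TIMESTAMP")],
}
target = {
    "users": [("id", "STRING"), ("email", "STRING")],
    "events": [("ts", "TIMESTAMP")],
}
```

The export/import round-trip through Cloud Storage then runs once per name this helper returns.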
If you need to migrate BigQuery tables across locations/regions but need a helping hand, do not hesitate to contact us.
Authors: S. Khattak & B. Marseille
Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients, letting them focus more on decision making and less on programming.