Drupal migrations: Understanding the ETL process

31 days of Drupal migrations Drupal migrations: Understanding the ETL process

The Migrate API is a very flexible and powerful system that allows you to collect data from different locations and store them in Drupal. It is, in fact, a full-blown extract, transform, and load (ETL) framework. For instance, it could produce CSV files. Its primary use is to create Drupal content entities: nodes, users, files, comments, etc. The API is thoroughly documented, and their maintainers are very active in the #migration slack channel for those needing assistance. The use cases for the Migrate API are numerous and vary greatly. Today we are starting a blog post series that will cover different migrate concepts so that you can apply them to your particular project.

Understanding the ETL process

Extract, transform, and load (ETL) is a procedure where data is collected from multiple sources, processed according to business needs, and its result stored for later use. This paradigm is not specific to Drupal. Books and frameworks abound on the topic. Let’s try to understand the general idea by following a real life analogy: baking bread. To make some bread, you need to obtain various ingredients: wheat flour, salt, yeast, etc. (extracting). Then, you need to combine them in a process that involves mixing and baking (transforming). Finally, when the bread is ready, you put it into shelves for display in the bakery (loading). In Drupal, each step is performed by a Migrate plugin:

The extract step is provided by source plugins.
The transform step is provided by process plugins.
The load step is provided by destination plugins.

As it is the case with other systems, Drupal core offers some base functionality which can be extended by contributed modules or custom code. Out of the box, Drupal can connect to SQL databases including previous versions of Drupal. There are contributed modules to read from CSV files, XML documents, JSON and SOAP feeds, WordPress sites, LibreOffice Calc and Microsoft Office Excel files, Google Sheets, and much more.

The list of core process plugins is impressive. You can concatenate strings, explode or implode arrays, format dates, encode URLs, look up already migrated data, among other transform operations. Migrate Plus offers more process plugins for HTML parsing, DOM manipulation, string replacement, transliteration, etc.

Drupal core provides destination plugins for content and configuration entities. Most of the time, targets are content entities like nodes, users, taxonomy terms, comments, files, etc. It is also possible to import configuration entities like field and content type definitions. This is often used when upgrading sites from Drupal 6/7 to Drupal 8/9. Via a combination of source, process, and destination plugins, it is possible to write Commerce Product Variations, Paragraphs, and more.

Technical note: The Migrate API defines another plugin type: id_map. They are used to map source IDs to destination IDs. This allows the system to keep track of records that have been imported and roll them back if needed.

Drupal migrations: a two step process

Performing a Drupal migration is a two step process: writing the migration definitions and executing them. Migration definitions are written in YAML format. These files contain information on how to fetch data from the source, how to process the data, and how to store it in the destination. It is important to note that each migration file can only specify one source and one destination. That is, you cannot read from a CSV file and a JSON feed using the same migration definition file. Similarly, you cannot create nodes and users from the same file. However, you can use as many process plugins as needed to convert your data from the format defined in the source to the format expected in the destination.

A typical migration project consists of several migration definition files. Although not required, it is recommended to write one migration file per entity bundle. If you are migrating nodes, that means writing one migration file per content type. The reason is that different content types will have different field configurations. It is easier to write and manage migrations when the destination is homogeneous. In this case, a single content type will have the same fields for all the elements to process in a particular migration.

Once all the migration definitions have been written, you need to execute the migrations. The most common way to do this is using the Migrate Tools module which provides Drush commands and a user interface (UI) to run migrations. Note that the UI for running migrations only detect those that have been defined as configuration entities using the Migrate Plus module. This is a topic we will cover in the future. For now, we are going to stick to Drupal core’s mechanisms of defining migrations. Contributed modules like Migrate Scheduler, Migrate Manifest, and Migrate Run offer alternatives for executing migrations.

Did you know that Drupal had an ETL framework baked in? Were you aware there are plugins already available to fetch data from so many sources? Can you think of a use case where the destination would not be Drupal’s database? Did you know that performing a migration is a two step process? Please share your answers in the comments. Also, I would be grateful if you shared this blog post with your friends and colleagues. Tomorrow, you will learn to write and execute a basic migration.

Back to Course

Next Lesson