Migrating Microsoft Excel and LibreOffice Calc files into Drupal

Today we will learn how to migrate content from Microsoft Excel and LibreOffice Calc files into Drupal using the Migrate Spreadsheet module. We will give instructions on getting the module and its dependencies. Then, we will present how to configure the module for spreadsheets with or without a header row. There are two example migrations: images and paragraphs. Let’s get started.

Example configuration for Microsoft Excel and LibreOffice Calc migration

Getting the code

You can get the full code example at https://github.com/dinarcon/ud_migrations The module to enable is `UD Google Sheets, Microsoft Excel, and LibreOffice Calc source migration` whose machine name is `ud_migrations_sheets_sources`. It comes with four migrations: `udm_google_sheets_source_node.yml`, `udm_libreoffice_calc_source_paragraph.yml`, `udm_microsoft_excel_source_image.yml`, and `udm_backup_csv_source_node.yml`. The image migration uses a Microsoft Excel file as source. The paragraph migration uses a LibreOffice Calc file as source. The CSV migration is a backup in case the Google Sheet is not available. To execute the last one you would need the Migrate Source CSV module.

You can get the Migrate Google Sheets module using composer: `composer require ‘drupal/migrate_spreadsheet:^1.0’. This module depends on the `PHPOffice/PhpSpreadsheet` library and many PHP extensions including `ext-zip`. Check this page for a full list of dependencies. If any required extension is missing the installation will fail. If your Drupal site is not composer-based, you will not be able to use Migrate Spreadsheet, unless you go around a lot of hoops.

Understanding the example set up

This migration will reuse the same configuration from the introduction to paragraph migrations example. Refer to that article for details on the configuration. The destinations will be the same content type, paragraph type, and fields. The source will be changed in today’s example, as we use it to explain Microsoft Excel and LibreOffice Calc migrations. The end result will again be nodes containing an image and a paragraph with information about someone’s favorite book. The major difference is that we are going to read from different sources.

Note: You can literally swap migration sources without changing any other part of the migration.  This is a powerful feature of ETL frameworks like Drupal’s Migrate API. Although possible, the example includes slight changes to demonstrate various plugin configuration options. Also, some machine names had to be changed to avoid conflicts with other examples in the demo repository.

Understanding the source document and plugin configuration

In any migration project, understanding the source is very important. For Microsoft Excel and LibreOffice Calc migrations, the primary thing to consider is whether or not the file contains a row of headers. Also, a workbook (file) might contain several worksheets (tabs). You can only migrate from one worksheet at a time. The example documents have two worksheets: `UD Example Sheet` and `Do not peek in here`. We are going to be working with the first one.

The `spreadsheet` source plugin exposes seven configuration options. The values to use might change depending on the presence of a header row, but all of them apply for both types of document. Here is a summary of the available configurations:

  • `file` is required. It stores the path to the document to process. You can use a relative path from the Drupal root, an absolute path, or stream wrappers.
  • `worksheet` is required. It contains the name of the one worksheet to process.
  • `header_row` is optional. This number indicates which row contains the headers. Contrary to CSV migrations, the row number is not zero-based. So, set this value to `1` if headers are on the first row, `2` if they are on the second, and so on.
  • `origin` is optional and defaults to `A2`. It indicates which non-header cell contains the first value you want to import. It assumes a grid layout and you only need to indicate the position of the top-left cell value.
  • `columns` is optional. It is the list of columns you want to make available for the migration. In case of files with a header row, use those header values in this list. Otherwise, use the default title for columns: `A`, `B`, `C`, etc. If this setting is missing, the plugin will return all columns. This is not ideal, especially for very large files containing more columns than needed for the migration.
  • `row_index_column` is optional. This is a special column that contains the row number for each record. This can be used as a unique identifier for the records in case your dataset does not provide a suitable value. Exposing this special column in the migration is up to you. If so, you can come up with any name as long as it does not conflict with header row names set in the `columns` configuration. Important: this is an autogenerated column, not any of the columns that comes with your dataset.
  • `keys` is optional and, if not set, it defaults to the value of `row_index_column`. It contains an array of column names that uniquely identify each record. For files with a header row, you can use the values set in the `columns` configuration. Otherwise, use default column titles like `A`, `B`, `C`, etc. In both cases, you can use the `row_index_column` column if it was set. Each value in the array will contain database storage details for the column.

Note that nowhere in the plugin configuration you specify the file type. The same setup applies for both Microsoft Excel and LibreOffice Calc files. The library will take care of detecting and validating the proper type.

Migrating spreadsheet files with a header row

This example is for the paragraph migration and uses a LibreOffice Calc file. The following snippets shows the `UD Example Sheet` worksheet and the configuration of the source plugin:

book_id, book_title, Book author
B10, The definite guide to Drupal 7, Benjamin Melançon et al.
B20, Understanding Drupal Views, Carlos Dinarte
B30, Understanding Drupal Migrations, Mauricio Dinarte
source:
  plugin: spreadsheet
  file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_book_paragraph.ods
  worksheet: 'UD Example Sheet'
  header_row: 1
  origin: A2
  columns:
    - book_id
    - book_title
    - 'Book author'
  row_index_column: 'Document Row Index'
  keys:
    book_id:
      type: string

The name of the plugin is `spreadsheet`. Then you use the `file` configuration to indicate the path to the file. In this case, it is relative to the Drupal root. The `UD Example Sheet` is set as the `worksheet` to process. Because the first row of the file contains the header rows, then `header_row` is set to `1` and `origin` to `A2`.

Then specify which `columns` to make available to the migration. In this case, we listed all of them so this setting could have been left unassigned. It is better to get into the habit of being explicit about what you import. If the file were to change and more columns were added, you would not have to update the file to prevent unneeded data to be fetched. The `row_index_column` is not actually used in the migration, but it is set to show all the configuration options in the example. The values will be `1`, `2`, `3`, etc.  Finally, the `keys` is set the column that serves as unique identifiers for the records.

The rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows the process and destination sections for the LibreOffice Calc paragraph migration.

process:
  field_ud_book_paragraph_title: book_title
  field_ud_book_paragraph_author: 'Book author'
destination:
  plugin: 'entity_reference_revisions:paragraph'
  default_bundle: ud_book_paragraph

Migrating spreadsheet files without a header row

Now let’s consider an example of a spreadsheet file that does not have a header row. This example is for the image migration and uses a Microsoft Excel file. The following snippets shows the `UD Example Sheet` worksheet and the configuration of the source plugin:

P01, https://agaric.coop/sites/default/files/pictures/picture-15-1421176712.jpg
P02, https://agaric.coop/sites/default/files/pictures/picture-3-1421176784.jpg
P03, https://agaric.coop/sites/default/files/pictures/picture-2-1421176752.jpg
source:
  plugin: spreadsheet
  file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_photos.xlsx
  worksheet: 'UD Example Sheet'
  # The file does not have a headers row.
  header_row: null
  origin: A1
  # If no header row is available, you use the spreadsheet's column names: A, B, C, etc.
  # If you do not manually add a list of columns, all columns that contain data in the worksheet would be returned.
  # The same names need to used in the process section.
  columns:
    - A # This column contains the photo ID. Example: 'P01'.
    - B # This column contains the photo URL.
  row_index_column: null
  keys:
    A:
      type: string

The `plugin`, `file`, amd `worksheet` configurations follow the same pattern as the paragraph migration. The difference for files with no header row is reflected in the other parameters. `header_row` is set to `null` to indicate the lack of headers and `origin` is to `A1`. Because there are no column names to use, you have to use the ones provided by the spreadsheet. In this case, we want to use the first two columns: `A` and `B`. Contrary to CSV migrations, the `spreadsheet` plugin does not allow you to define aliases for unnamed columns. That means that you would have to use `A`, `B` in the process section to refer to this columns.

`row_index_column` is set to `null` because it will not be used. And finally, in the `keys` section, we use the `A` column as the primary key. This might seem like an odd choice. Why use that value if you could use the `row_index_column` as the unique identifier for each row? If this were an isolated migration, that would be a valid option. But this migration is referenced from the node migration explained in the previous example. The lookup is made based on the values stored in the `A` column. If we used the index of the row as the unique identifier, we would have to update the other migration or the lookup would fail. In many cases, that is not feasible nor desirable.

Except for the name of the columns, the rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows part of the process and destination section for the Microsoft Excel image migration.

process:
  psf_destination_filename:
    plugin: callback
    callable: basename
    source: B # This is the photo URL column.
destination:
  plugin: 'entity:file'

Refer to this entry to know how to run migrations that depend on others. In this case, you can execute them all by running: `drush migrate:import –tag=’UD Sheets Source’`. And that is how you can use Microsoft Excel and LibreOffice Calc files as the source of your migrations. This example is very interesting because each of the migration uses a different source type. The node migration explained in the previous post uses a Google Sheet. This is a great example of how powerful and flexible the Migrate API is.

What did you learn in today’s blog post? Have you migrated from Microsoft Excel and LibreOffice Calc files before? If so, what challenges have you found? Did you know the source plugin configuration is not dependent on the file type? Share your answers in the comments. Also, I would be grateful if you shared this blog post with others.