Today we will learn how to migrate content from Microsoft Excel and LibreOffice Calc files into Drupal using the Migrate Spreadsheet module. We will give instructions on getting the module and its dependencies. Then, we will present how to configure the module for spreadsheets with or without a header row. There are two example migrations: images and paragraphs. Let’s get started.
You can get the full code example at https://github.com/dinarcon/ud_migrations The module to enable is `UD Google Sheets, Microsoft Excel, and LibreOffice Calc source migration` whose machine name is `ud_migrations_sheets_sources`. It comes with four migrations: `udm_google_sheets_source_node.yml`, `udm_libreoffice_calc_source_paragraph.yml`, `udm_microsoft_excel_source_image.yml`, and `udm_backup_csv_source_node.yml`. The image migration uses a Microsoft Excel file as source. The paragraph migration uses a LibreOffice Calc file as source. The CSV migration is a backup in case the Google Sheet is not available. To execute the last one you would need the Migrate Source CSV module.
You can get the Migrate Google Sheets module using composer: `composer require ‘drupal/migrate_spreadsheet:^1.0’. This module depends on the `PHPOffice/PhpSpreadsheet` library and many PHP extensions including `ext-zip`. Check this page for a full list of dependencies. If any required extension is missing the installation will fail. If your Drupal site is not composer-based, you will not be able to use Migrate Spreadsheet, unless you go around a lot of hoops.
This migration will reuse the same configuration from the introduction to paragraph migrations example. Refer to that article for details on the configuration. The destinations will be the same content type, paragraph type, and fields. The source will be changed in today’s example, as we use it to explain Microsoft Excel and LibreOffice Calc migrations. The end result will again be nodes containing an image and a paragraph with information about someone’s favorite book. The major difference is that we are going to read from different sources.
Note: You can literally swap migration sources without changing any other part of the migration. This is a powerful feature of ETL frameworks like Drupal’s Migrate API. Although possible, the example includes slight changes to demonstrate various plugin configuration options. Also, some machine names had to be changed to avoid conflicts with other examples in the demo repository.
In any migration project, understanding the source is very important. For Microsoft Excel and LibreOffice Calc migrations, the primary thing to consider is whether or not the file contains a row of headers. Also, a workbook (file) might contain several worksheets (tabs). You can only migrate from one worksheet at a time. The example documents have two worksheets: `UD Example Sheet` and `Do not peek in here`. We are going to be working with the first one.
The `spreadsheet` source plugin exposes seven configuration options. The values to use might change depending on the presence of a header row, but all of them apply for both types of document. Here is a summary of the available configurations:
Note that nowhere in the plugin configuration you specify the file type. The same setup applies for both Microsoft Excel and LibreOffice Calc files. The library will take care of detecting and validating the proper type.
This example is for the paragraph migration and uses a LibreOffice Calc file. The following snippets shows the `UD Example Sheet` worksheet and the configuration of the source plugin:
book_id, book_title, Book author B10, The definite guide to Drupal 7, Benjamin Melançon et al. B20, Understanding Drupal Views, Carlos Dinarte B30, Understanding Drupal Migrations, Mauricio Dinarte
source: plugin: spreadsheet file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_book_paragraph.ods worksheet: 'UD Example Sheet' header_row: 1 origin: A2 columns: - book_id - book_title - 'Book author' row_index_column: 'Document Row Index' keys: book_id: type: string
The name of the plugin is `spreadsheet`. Then you use the `file` configuration to indicate the path to the file. In this case, it is relative to the Drupal root. The `UD Example Sheet` is set as the `worksheet` to process. Because the first row of the file contains the header rows, then `header_row` is set to `1` and `origin` to `A2`.
Then specify which `columns` to make available to the migration. In this case, we listed all of them so this setting could have been left unassigned. It is better to get into the habit of being explicit about what you import. If the file were to change and more columns were added, you would not have to update the file to prevent unneeded data to be fetched. The `row_index_column` is not actually used in the migration, but it is set to show all the configuration options in the example. The values will be `1`, `2`, `3`, etc. Finally, the `keys` is set the column that serves as unique identifiers for the records.
The rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows the process and destination sections for the LibreOffice Calc paragraph migration.
process: field_ud_book_paragraph_title: book_title field_ud_book_paragraph_author: 'Book author' destination: plugin: 'entity_reference_revisions:paragraph' default_bundle: ud_book_paragraph
Now let’s consider an example of a spreadsheet file that does not have a header row. This example is for the image migration and uses a Microsoft Excel file. The following snippets shows the `UD Example Sheet` worksheet and the configuration of the source plugin:
P01, https://agaric.coop/sites/default/files/pictures/picture-15-1421176712.jpg P02, https://agaric.coop/sites/default/files/pictures/picture-3-1421176784.jpg P03, https://agaric.coop/sites/default/files/pictures/picture-2-1421176752.jpg
source: plugin: spreadsheet file: modules/custom/ud_migrations/ud_migrations_sheets_sources/sources/udm_photos.xlsx worksheet: 'UD Example Sheet' # The file does not have a headers row. header_row: null origin: A1 # If no header row is available, you use the spreadsheet's column names: A, B, C, etc. # If you do not manually add a list of columns, all columns that contain data in the worksheet would be returned. # The same names need to used in the process section. columns: - A # This column contains the photo ID. Example: 'P01'. - B # This column contains the photo URL. row_index_column: null keys: A: type: string
The `plugin`, `file`, amd `worksheet` configurations follow the same pattern as the paragraph migration. The difference for files with no header row is reflected in the other parameters. `header_row` is set to `null` to indicate the lack of headers and `origin` is to `A1`. Because there are no column names to use, you have to use the ones provided by the spreadsheet. In this case, we want to use the first two columns: `A` and `B`. Contrary to CSV migrations, the `spreadsheet` plugin does not allow you to define aliases for unnamed columns. That means that you would have to use `A`, `B` in the process section to refer to this columns.
`row_index_column` is set to `null` because it will not be used. And finally, in the `keys` section, we use the `A` column as the primary key. This might seem like an odd choice. Why use that value if you could use the `row_index_column` as the unique identifier for each row? If this were an isolated migration, that would be a valid option. But this migration is referenced from the node migration explained in the previous example. The lookup is made based on the values stored in the `A` column. If we used the index of the row as the unique identifier, we would have to update the other migration or the lookup would fail. In many cases, that is not feasible nor desirable.
Except for the name of the columns, the rest of the migration is almost identical to the CSV example. Small changes were made to prevent machine name conflicts with other examples in the demo repository. For reference, the following snippet shows part of the process and destination section for the Microsoft Excel image migration.
process: psf_destination_filename: plugin: callback callable: basename source: B # This is the photo URL column. destination: plugin: 'entity:file'
Refer to this entry to know how to run migrations that depend on others. In this case, you can execute them all by running: `drush migrate:import –tag=’UD Sheets Source’`. And that is how you can use Microsoft Excel and LibreOffice Calc files as the source of your migrations. This example is very interesting because each of the migration uses a different source type. The node migration explained in the previous post uses a Google Sheet. This is a great example of how powerful and flexible the Migrate API is.
What did you learn in today’s blog post? Have you migrated from Microsoft Excel and LibreOffice Calc files before? If so, what challenges have you found? Did you know the source plugin configuration is not dependent on the file type? Share your answers in the comments. Also, I would be grateful if you shared this blog post with others.