Apart from Version Control API, I've also been working on another piece of code, and after a long time of fussing around with architecture and internals it's now slowly getting to a state where one can get excited about it. I didn't make it in time to show it off at DrupalCon DC, but now the module can be found on drupal.org: Say hello to Transformations, your favorite new uncut diamond for performing data transformations in Drupal (including, but not limited to, import/export).
Experience has shown that starting with the underlying concepts always triggers the question "So what does it actually do?", so let's start with the use cases that led to the module's inception:
Now, I hear someone arguing like, "Scope creep! When you pack all that functionality into a single module, it will get bloated, hard to use and unmaintainable." And to a certain extent, I agree with that. As far as I can see, a major reason that Import/Export API failed is that it tried to establish a single common data format that described all possible information, and moreover, every data backend (Drupal nodes, CSV, XML) was required to cope with all that information. In other words, the architecture of Import/Export API might have required too much of a monolithic approach, which led to the maintainability problems that were at least part of the reason why the project stalled.
Does that mean that generic import/export systems are unfeasible? I don't believe so, and I hope Transformations' architecture takes enough precautions to avoid the fate of Import/Export API.
The key idea to Transformations is that we'll never be able to capture all possible data formats and structures in a single module. What we can do, though, is to decompose that data into bite-size pieces, and define operations that can process those. (For example, operations transforming a Unix timestamp into an ISO date string or into a PHP DateTime object are pretty straightforward.) Building on that, we can assemble the pieces into larger pieces of data, and we can also define operations on those. (Extract the first three dates from a list of dates? Sure, easy enough.) In essence, importing and exporting data is nothing more than decomposing data, processing it according to a given set of rules, and reassembling it in a different form.
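To make the "bite-size pieces" idea a bit more tangible, here is a minimal sketch in Python. It is not Transformations code (the module itself is written in PHP), and the function names are made up for illustration. It only shows the two examples mentioned above: small operations on single values, and an operation on a list assembled from them.

```python
from datetime import datetime, timezone


def timestamp_to_datetime(ts: int) -> datetime:
    """Operation on a single value: Unix timestamp -> datetime object (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def datetime_to_iso_date(dt: datetime) -> str:
    """Operation on a single value: datetime object -> ISO date string."""
    return dt.date().isoformat()


def first_n(items: list, n: int) -> list:
    """Operation on a larger piece of data: keep the first n list items."""
    return items[:n]


# Decompose (a list of timestamps), process (convert each one),
# and reassemble (a shorter list of date strings).
timestamps = [0, 86400, 172800, 259200]
dates = [datetime_to_iso_date(timestamp_to_datetime(ts)) for ts in timestamps]
first_three = first_n(dates, 3)
print(first_three)  # ['1970-01-01', '1970-01-02', '1970-01-03']
```

None of these operations knows anything about nodes, CSV or XML, which is the point: each one handles one small kind of data.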
The key idea to Transformations is that we don't want to know all those rules up front; there are just too many ways in which data can be decomposed, processed and reassembled. So instead of trying to cover all use cases, let's provide users with the necessary tools to define their transformations by themselves. Let's not define a single common data format that all data needs to conform to - all we need is a set of data formats, plus the knowledge of which operations work on which data, plus a way to wire them up. If you think that wiring up operations sounds like Yahoo! Pipes, you're on the right track - only that Transformations can actually deal with structure information (schemas), and that my current user interface is way more crude than Yahoo's nice JavaScript wizardry.
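As a rough illustration of "a set of data formats, plus the knowledge of which operations work on which data, plus a way to wire them up", consider the following Python sketch. The Operation and Pipeline classes are hypothetical and much simpler than the module's real schema handling. Each operation just declares an input and an output type, and the pipeline refuses to wire together operations that don't fit.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Operation:
    """A named processing step that declares what it consumes and produces."""
    name: str
    input_type: type
    output_type: type
    run: Callable[[Any], Any]


class Pipeline:
    """Wires operations together and checks that adjacent ones are compatible."""

    def __init__(self) -> None:
        self.operations: List[Operation] = []

    def append(self, op: Operation) -> "Pipeline":
        if self.operations:
            previous = self.operations[-1]
            if previous.output_type is not op.input_type:
                raise TypeError(
                    f"cannot wire '{previous.name}' into '{op.name}': "
                    f"{previous.output_type.__name__} is not "
                    f"{op.input_type.__name__}"
                )
        self.operations.append(op)
        return self

    def execute(self, data: Any) -> Any:
        # Feed the output of each operation into the next one.
        for op in self.operations:
            data = op.run(data)
        return data


split_lines = Operation("split lines", str, list, lambda text: text.splitlines())
parse_rows = Operation("parse rows", list, list,
                       lambda lines: [line.split(",") for line in lines])
count_rows = Operation("count rows", list, int, len)

pipeline = Pipeline().append(split_lines).append(parse_rows)
rows = pipeline.execute("title,price\nWidget,9.99")
print(rows)  # [['title', 'price'], ['Widget', '9.99']]
```

Appending split_lines after count_rows would raise a TypeError at wiring time, before any data is processed. Real schemas carry far more structure than a bare Python type, but the principle of checking compatibility while building the pipeline is the same.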
The key idea to Transformations is that if we want a generic import/export module, it's important not to provide a solution but to provide a framework. Transformations is a framework for creating data transformation pipelines. ETL for Drupal, you might say - a braindead attempt to eliminate a whole class of import/export modules, and a potential basis for a Yahoo! Pipes clone on Drupal. Transformations provides the means for developers and users to build their own stuff.
Oh, and Transformations is incomplete, unpolished and (beware!) object-oriented. Until now, I have worked to get the concepts right. Next step: building on those foundations and extending them so that more concrete use cases can be built on top. For now, it can be used for CSV import/export in a way that is more complicated than Node Import will ever be. But the flexibility of the framework also promises possibilities that go way beyond nodes and two-dimensional tables.
Still reading? Still interested? Then go and check it out.
[1] "We" as in "my company, Pro.Karriere", or more specifically, my boss and visionary entrepreneur, Klaus.
Comments
Nice Tip
What a great tip. I am a developer who just started a Drupal project. The project entails rebuilding a site using Drupal. The original CMS was a proprietary system that stored data as XML. The client wants the content of the old site to be ported over to Drupal as well. This tool may help save a lot of pain with the data porting. Maybe we can automate it, as opposed to reentering the content manually. Thanks
ETL?
If you have experience with ETL tools I would definitely like to see a comparison of what this module can/can't do compared to other (non-Drupal) ETL tools.
Are there Open Source MySQL ETL tools which we should be considering too?
Differences
There are a number of other ETL solutions (for example, Pentaho Data Integration, Talend, Clover, Apatar or SnapLogic) but whether those are suitable for your needs depends on the specific use cases.
What Transformations can't do is, at the moment, pretty much everything that the other tools can - which stems from a lack of operations for the framework. Plus a nice UI. But the basic principles behind the API are sound, and the above points are design advantages that cannot really be replicated by the other ones. I understand though that Transformations might not be the right choice for everyone, and that's ok. By focusing on Drupal, I hope to fill a need that couldn't be tackled with external ETL systems (and at the same time improve schema integration capabilities).
Have you seen patterns?
Have you taken a look at the Patterns module? It seems they are also working along the same lines of easing import (and eventual export) of items in Drupal.
Yep, also cool
Patterns seems to get big, and is certainly interesting. Its targeted use case is a different one, however - Patterns is mostly about site configuration and also does a bit of "data" content, while Transformations is mostly about data processing and (by principle) could also be extended to deal with site configuration and deployment. Patterns is certainly more targeted and comes with specific interfaces for its use cases, while Transformations is just a framework to build stuff upon.
Also, the underlying technology is completely different. Patterns standardizes known data structures (with fixed PHP, XML or YAML exports) while Transformations does not assume or prescribe any kind of data structures and just pipes the data through a number of operations - it will do what the pipeline creator tells it to. That kind of approach opens up more possibilities, but also comes with higher complexity and imposes more requirements on the user... unless someone writes a more targeted user interface on top of Transformations, which is entirely possible as it's a nice API to use.
So Patterns is definitely cool but not really directly comparable to Transformations; the two take very different directions.
Intriguing
This sounds like it could really go somewhere. I once wrote an XML importer because I needed nested data, but it was so clunky and hard to use that I just gave up on it.
So the big question I have now is "How might Transformations handle merging changes from successive imports?" A prime example might be a price list from a manufacturer. Every 6 months or so they raise the prices, so you have to go in and fix all of your product nodes. Could Transformations use each import of the price list to find the products that are actually in the system and save the new price?
I'll definitely be playing with this some more.
I have partially-completed a
I have partially-completed a generic CSV import module of the kind that Transformations is (quite rightly) designed to replace. (There is a reason why whenever you hear about such things they're always partially-completed.)
I've tackled this problem by using the Unique Field module to tag one or more fields per content type, or combinations thereof, as having unique values (in your example, say a product ID number). This allows me to check (by programmatically running a view using the value(s) I want to insert in these fields as arguments) for existing nodes I should update rather than create during the import process without having to build those smarts into my import module (or having to feed node ids back into the foreign database), and has the side benefit of preventing users from manually creating duplicates.
Unique Field is okay for my immediate purposes, but certainly needs some love before it could fit into an industrial-strength solution. As a general principle though you want your import system to be as dumb as possible and let the rest of Drupal do the heavy lifting of finding corresponding nodes, access control, validating data, etc.
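The update-or-create decision described above boils down to a lookup on the unique value. Here is a minimal Python sketch of that logic, with a plain dictionary standing in for the view-based lookup. The import_rows function and the "sku" field are made up for illustration and are not part of Unique Field or any Drupal API.

```python
from typing import Dict, List, Tuple


def import_rows(store: Dict[str, dict], rows: List[dict],
                key: str) -> Tuple[int, int]:
    """Update records whose unique key already exists, create the rest.

    Returns a (created, updated) pair of counters.
    """
    created = updated = 0
    for row in rows:
        unique_value = row[key]
        if unique_value in store:
            # Existing record: overwrite only the fields present in the row.
            store[unique_value].update(row)
            updated += 1
        else:
            store[unique_value] = dict(row)
            created += 1
    return created, updated


store = {"A-1": {"sku": "A-1", "title": "Widget", "price": 10.0}}
rows = [
    {"sku": "A-1", "price": 11.0},
    {"sku": "B-2", "title": "Gadget", "price": 5.0},
]
created, updated = import_rows(store, rows, "sku")
print(created, updated)  # 1 1
```

The importer itself stays dumb: all it needs from the outside is a way to ask whether a record with a given unique value already exists.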
Filters
One way to deal with such a situation would be to have an operation that takes a list of [insert original data items here], filters it so that only existing items are left in the list, and returns the corresponding node as an additional result value. With that new (original data + existing node) list, one can map the data to the node like you would do it with newly created nodes - the "set fields in node object" operation just overwrites the fields that are actually set as input.
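Sketched in Python for the price list example, that could look roughly like the following. Both functions are hypothetical stand-ins for operations that don't exist in this form, and nodes are plain dictionaries here rather than Drupal node objects.

```python
from typing import Dict, List, Optional, Tuple

Node = Dict[str, object]


def filter_existing(items: List[dict],
                    nodes_by_sku: Dict[str, Node]) -> List[Tuple[dict, Node]]:
    """Keep only items that match an existing node, paired with that node."""
    return [
        (item, nodes_by_sku[item["sku"]])
        for item in items
        if item["sku"] in nodes_by_sku
    ]


def set_fields(node: Node, **fields: Optional[object]) -> Node:
    """Overwrite only the fields that are actually set (not None) as input."""
    for name, value in fields.items():
        if value is not None:
            node[name] = value
    return node


nodes_by_sku = {"A-1": {"title": "Widget", "price": 10.0}}
price_list = [
    {"sku": "A-1", "price": 12.5},  # exists in the system: gets updated
    {"sku": "Z-9", "price": 3.0},   # unknown product: filtered out
]

pairs = filter_existing(price_list, nodes_by_sku)
for item, node in pairs:
    # The price list has no title column, so the title input stays unset.
    set_fields(node, price=item.get("price"), title=item.get("title"))

print(nodes_by_sku["A-1"])  # {'title': 'Widget', 'price': 12.5}
```

The existing title survives because the corresponding input was never set, which is what makes the same "set fields" step usable for both new and existing nodes.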
Pipeline execution currently lacks a bit in the area of conditionally doing stuff when something is the case (e.g. node exists) and doing something else (or nothing) when it's not the case. For find-and-update pipelines, it might be necessary to improve the execution logic in a few places. Plus I need to write a generic filter operation that uses other operations (or pipelines) to determine whether an item should stay in or be filtered out.
Fundamentally though, there's nothing that speaks against this use case.