The previous example had many disadvantages, and with advancing insight I've started reworking the example to become easier to manage. This example builds on the previous examples yet again and changes the pre-processing stage.

The Data Vault 2.0 book contains a picture that displays how you'd extract data from a source system, hash its business keys and then move the data into the data vault. It is never really explained very well in the book where the hashing should take place when you have multiple source tables with inter-table relationships in 3rd normal form. The book uses a huge flat file which is then decomposed into different datasets, and hashes all the keys in the source system query, which I think is wrong and not scalable. Take, for example:

- an actor made up of (actor_id, first_name, last_name), where (first_name, last_name) is the chosen business key;
- a film made up of (film_id, title, release_year, …), where (title, release_year) is the chosen business key;
- a film_actor table made up of (film_id, actor_id), which associates actors with films.

film_actor describes the relationship between films and actors, so it would become a link table. Actor and film can easily be loaded in parallel, but there is ambiguity about how the business keys for the film_actor link table should be generated: a join to both actor and film is needed to get the fields required to generate the business key. If you have a full dump of the source system, then you can perform the join in staging, and you can also load the link table in parallel to the hubs. If you do an incremental load on any involved table, then you can no longer guarantee that the hub records are in staging, which implies a lookup, and lookups are to be avoided according to the methodology. (A sketch of this hashing problem follows at the end of this post.)

The same is the case for some satellite tables which are extensions to hub tables (usually a 1:1 relationship), for example a customerdetails table with its own records, potentially versioned by 'date', where the primary key points to a customerid in the customer table. The business key is in customer, but it's not guaranteed that the customer record is in staging when a record in customerdetails changes.

In my view the loading process is the biggest pain of the data vault. As soon as the raw data vault is complete, all the downstream processes are completely clear and workable.

This new example relies on a PSA as well. It is typically discouraged to use a PSA as a strategic mechanism for the data vault, but there are new trains of thought (because of Big Data and data lakes) where the PSA is embraced; see 'Embrace your Persistent Staging Area' by Roelant Vos.

Extract data from the source either as a full table dump or as an 'incremental' dump. The full dumps go to a preprocessing directory; the incremental dumps go straight to the PSA (a small routing sketch follows below).

My extract doesn't extract table by table; it groups certain things together, which I call "entities" in the implementation. An entity can be considered an artefact involved in an instance of a business process, so for example an invoice, an order, or a product sheet. In the current example, I'm using advanced SQL functions (SQL Server 2016) to extract nested records. The extract is actually done in JSON, which is handled perfectly in Python. Another method could be XML, or any other method of exporting data that you can use, as long as it allows you to do the 'nesting'.

The extract takes the data "as is", only applying data type transformations to deal with the availability of data types in the output format. For example, JSON doesn't have a datetime data type, so the output will be a string. These should obviously be standardized: dates and times with timezone info, the precision of the output of floats, and so on (see the serialization sketch below).

The entities from the incremental area are now processed. An entity usually contains some specific data about the main entity itself, plus collections of sub-records related to the main subject. There are also references to other entities that need to be resolved. For example, an invoice has a date, a last printed date, etc., a list containing a single customer, and a list of invoicelines.

The output is an entity that has all the surrogate keys of the source system removed. For example, when a customer is created in a store, we don't have all the store data available in the customer entity, so we use the store_id in the customer entity to resolve the store business key and store that instead. We'll use indexes to resolve the business keys for such elements (a resolution sketch follows below).
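To make the link-key ambiguity concrete, here is a minimal Python sketch of the hashing, assuming SHA-256 over a delimited, upper-cased key string; the normalization rules, the delimiter, and the sample data are my assumptions, not something the post prescribes:

```python
import hashlib

def hash_key(*parts: str) -> str:
    """Hash a business key: trim and upper-case each part, join with a
    delimiter, then SHA-256. These normalization rules are assumptions."""
    normalized = ";".join(p.strip().upper() for p in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hub keys can be computed from each table in isolation:
actor = {"actor_id": 1, "first_name": "Uma", "last_name": "Thurman"}
film = {"film_id": 7, "title": "Pulp Fiction", "release_year": 1994}
hk_actor = hash_key(actor["first_name"], actor["last_name"])
hk_film = hash_key(film["title"], str(film["release_year"]))

# The link is the problem: film_actor only carries surrogate ids, so a
# join (simulated here with dict lookups over a full dump) must happen first.
actors = {a["actor_id"]: a for a in [actor]}
films = {f["film_id"]: f for f in [film]}
film_actor = {"film_id": 7, "actor_id": 1}
a = actors[film_actor["actor_id"]]
f = films[film_actor["film_id"]]
hk_link = hash_key(a["first_name"], a["last_name"],
                   f["title"], str(f["release_year"]))
print(hk_actor, hk_film, hk_link, sep="\n")
```

With a full dump the lookup dictionaries can always be built; with an incremental dump the joined-to rows may be missing from staging, which is exactly the ambiguity described above.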
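The routing of the two dump types could be sketched like this; the directory names and the filename convention for recognizing incremental dumps are assumptions:

```python
import shutil
from pathlib import Path

PREPROCESS_DIR = Path("preprocessing")  # full dumps are post-processed here first
PSA_DIR = Path("psa")                   # incremental dumps go straight to the PSA

def route_dump(dump: Path) -> Path:
    """Send a full table dump to the preprocessing directory and an
    incremental dump straight to the Persistent Staging Area."""
    target = PSA_DIR if "incremental" in dump.name else PREPROCESS_DIR
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(dump), str(target / dump.name)))
```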
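The nested entity extract and the type standardization can also be sketched in Python. The post does the nesting with SQL Server 2016 JSON functions, so this only illustrates the shape of the output; the invoice fields and the six-digit float precision are illustrative assumptions:

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

def standardize(value):
    """JSON has no datetime or decimal type, so standardize on output:
    ISO-8601 strings with timezone info, floats with a fixed precision
    (the choice of six digits is an assumption)."""
    if isinstance(value, datetime):
        return value.astimezone(timezone.utc).isoformat()
    if isinstance(value, Decimal):
        return round(float(value), 6)
    raise TypeError(f"unhandled type: {type(value)!r}")

# An entity: data about the main subject plus collections of sub-records.
invoice = {
    "invoice_date": datetime(2017, 3, 1, 12, 30, tzinfo=timezone.utc),
    "last_printed_date": datetime(2017, 3, 2, 9, 0, tzinfo=timezone.utc),
    "customer": [{"first_name": "Uma", "last_name": "Thurman"}],  # single-item list
    "invoicelines": [
        {"description": "rental", "amount": Decimal("2.99")},
        {"description": "late fee", "amount": Decimal("1.50")},
    ],
}
print(json.dumps(invoice, default=standardize, indent=2))
```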
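Finally, resolving surrogate keys through indexes, as in the customer/store example, might look like the following; the index layout and the names are assumptions:

```python
# Index built from the store dumps: surrogate key -> business key.
store_index = {
    1: "AMSTERDAM-01",
    2: "ROTTERDAM-01",
}

def resolve_customer(customer: dict) -> dict:
    """Strip the source system's surrogate keys from a customer record and
    replace store_id with the store's business key via the index."""
    entity = {k: v for k, v in customer.items() if not k.endswith("_id")}
    entity["store"] = store_index[customer["store_id"]]
    return entity

customer = {"customer_id": 42, "store_id": 1,
            "first_name": "Uma", "last_name": "Thurman"}
print(resolve_customer(customer))
# {'first_name': 'Uma', 'last_name': 'Thurman', 'store': 'AMSTERDAM-01'}
```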