

This is pretty effective in the data warehousing case, where the underlying data is only updated periodically, e.g. once a day. If your query takes a long time to run, a materialized view should act as a cache: it is physically stored on disk, and the underlying table is never touched when the view is queried.
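For illustration, here’s a minimal sketch of that pattern in Redshift SQL. The view name, the columns, and the atomic.events source table are placeholders rather than anything taken from this thread:

```sql
-- Materialize an expensive aggregation once, then serve reads from the stored result.
CREATE MATERIALIZED VIEW daily_page_views AS
SELECT
    DATE_TRUNC('day', collector_tstamp) AS view_date,
    page_urlpath,
    COUNT(*)                            AS page_views
FROM atomic.events
GROUP BY 1, 2;

-- Re-run after each batch load; queries against the view read the materialized
-- data on disk rather than scanning atomic.events.
REFRESH MATERIALIZED VIEW daily_page_views;
```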
Here’s the explain output for the query in my original post, immediately after running a full vacuum and analyze:

> XN Limit (cost=10000314618031461801.89 rows=100 width=57)
> XN Merge (cost=10000314618031582361.57 rows=48223972 width=57)

Redshift supports UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints, but they are informational only and are not enforced. But materialized views will work for quite a while. If we get to the point where we need to add another layer with incremental updates, that definitely seems like the best approach.
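As a quick, hedged illustration of those informational-only constraints (the table and values below are made up for the example), Redshift will accept the declaration but happily store duplicates:

```sql
-- Redshift accepts the PRIMARY KEY declaration but does not enforce it.
CREATE TABLE events_dedup (
    event_id         CHAR(36) PRIMARY KEY,
    collector_tstamp TIMESTAMP
);

-- Both inserts succeed, so duplicate keys are still possible;
-- the constraint only serves as a hint to the query planner.
INSERT INTO events_dedup VALUES ('00000000-0000-0000-0000-000000000000', '2021-01-01 00:00:00');
INSERT INTO events_dedup VALUES ('00000000-0000-0000-0000-000000000000', '2021-01-01 00:00:00');
```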

Yes, it does show DS_DIST_NONE (not on the first line, but where it shows the join), so perhaps there’s no problem as long as we are always filtering by collector_tstamp.

To Colm’s point, processing your data in an incremental way is going to be a HUGE benefit, even more so as your data grows: a materialized view will reprocess the data each time the view is refreshed, whereas an incremental approach would only process the new events since the last refresh. Finally, a word of warning whichever route you take: be careful about joining any contexts that may contain multiple entities (yauaa never will, but you may have some custom ones that do), as even after de-duplication this can still lead to a 1:many join.

Hash joins aren’t intrinsically bad (unlike the much-feared nested loops); they only hurt if the data needs distributing. Does the first line show a DS_DIST_NONE type hash join? That is the best one, as it basically means Redshift doesn’t need to redistribute your data to do the join. After vacuuming, did you also run analyze in your warehouse? I think you won’t be able to get a merge join without changing your sortkey, but given you need collector_tstamp anyway this is unlikely to be worth it.

Additionally, adding the tstamps to the join (one for each table involved) as Ryan suggests is also advisable, but for a different reason: the delivery semantics are at-least-once, so you can have duplicates. Typically most dupes won’t have these tstamps duplicated as well, so this avoids a potential cartesian explosion.

A final word on it: in the overall workflow design, we recommend using incremental data modeling to produce derived tables, which is where most querying activity would happen. This way you solve the efficiency issue only once. Snowplow gives you event-level data, which is awesome for lots of things, but regular querying isn’t one of them!
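To make the tstamp-join advice concrete, here’s a hedged sketch of joining a context table back to events. The context table name, the agent_class column, and the 7-day window are assumptions based on the usual Snowplow Redshift layout (root_id / root_tstamp), not something quoted from this thread:

```sql
-- Sketch: join a context table back to events, constraining on both id and tstamp.
SELECT
    e.event_id,
    e.collector_tstamp,
    y.agent_class                            -- yauaa field, assumed for illustration
FROM atomic.events e
JOIN atomic.nl_basjes_yauaa_context_1 y      -- adjust to your own shredded table name
  ON  y.root_id     = e.event_id
  AND y.root_tstamp = e.collector_tstamp     -- extra condition stops duplicate event_ids fanning out
WHERE e.collector_tstamp >= DATEADD(day, -7, CURRENT_DATE);
```

The extra root_tstamp = collector_tstamp condition is the point here: even if an event_id appears more than once, the join stays roughly 1:1 because duplicates rarely share the same timestamps as well.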

I think things should improve if you add a WHERE e.collector_tstamp > … filter. It’s been a while since I’ve done any serious work with Redshift SQL, so this might be outdated, but if my memory is correct, the table scans in question won’t be limited unless you specifically restrict the sortkey; LIMIT only restricts the number of results produced.
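A minimal before/after sketch of that point, assuming collector_tstamp is the sortkey on an atomic.events table (names are placeholders):

```sql
-- LIMIT alone still scans the whole table; only the result set is truncated.
SELECT *
FROM atomic.events
ORDER BY collector_tstamp DESC
LIMIT 100;

-- Restricting the sortkey lets Redshift skip blocks via zone maps,
-- so far fewer rows are read before the limit is applied.
SELECT *
FROM atomic.events
WHERE collector_tstamp >= DATEADD(day, -1, GETDATE())
ORDER BY collector_tstamp DESC
LIMIT 100;
```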
