Retroactive User Recognition on Clickhouse
Retroactive User Recognition feature needs to update data that was already loaded to a data warehouse
Clickhouse by its nature doesn't support UPDATE
queries, but it provides different solutions with important specifics.
Data Deduplication#
Retroactive User Recognition on Clickhouse works with tables using ReplacingMergeTree or ReplicatedReplacingMergeTree engines and relies on its data deduplication mechanism.
Data deduplication occurs in the background at an unknown time, so you canβt plan for it. Some data may remain unprocessed.
Basically speaking, that means that SELECT
query will contain some duplicated rows when query time period intersects with
User Recognition anonymous events TTL period (Read more on TTL)
To avoid duplicated data in results you need to use FINAL
modifier after table name:
select * from events_with_user_recognition FINAL order by _timestamp desc;
When FINAL
is specified, ClickHouse fully merges the data before returning the result and thus performs all data transformations that happen during merges for the given table engine.
Configuration#
When Jitsu creates table in Clickhouse it automatically setups table engine compatible with Retroactive User Recognition.
If you need Retroactive User Recognition to work with existing table you need to make sure that correct ENGINE
and sorting key (ORDER BY
)
are set for table:
ENGINE = ReplacingMergeTree
or ENGINE = ReplicatedReplacingMergeTree
ORDER_BY eventn_ctx_event_id
Otherwise, User Recognition needs to be turned off for that destination to avoid data duplication.
Example:
CREATE TABLE jitsu.events_with_user_recognition
(
`_timestamp` DateTime,
`api_key` String,
`event_type` String,
...
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(_timestamp)
ORDER BY eventn_ctx_event_id
SETTINGS index_granularity = 8192