Redshift wlm per user
1/26/2024

The main issue we had to solve was our data schema. We were storing all of the events for an app in a single table. While this standardized the schema across all of our customers, it meant that querying on a particular event or property often required scanning through billions of irrelevant rows.

After a few months of work, we’ve retired our old static schema, and now have dynamic schemas that update as new events and properties are sent to Redshift. The new dynamic schema makes querying far more efficient and has drastically reduced query times: we’ve seen speed improvements of 10-30X. We also decided to give each Amplitude customer a dedicated Redshift cluster to improve load performance and prevent customers from using up each other’s resources. We thought we’d share some of the changes that we made and how we manage clusters with dynamic schemas. We also learned a lot about the ins and outs of Redshift, and want to share those lessons as well.

Our Legacy Redshift Schema

At Amplitude, we collect event data. As opposed to traditional pageview-centric web analytics, we track actions that your users take. You can define whatever event types you like, as well as user and event properties. User properties are tied to a specific user ID (things like device type and subscription plan level), while event properties are tied to a specific event type. We provide Redshift to give our customers direct access to their data so that they can perform ad-hoc queries and answer any question about user behavior they can think up.

Our old Redshift structure did the job, but not always as quickly as our customers would like. Probably the biggest problem with our old structure was that we didn’t partition the data: all of the events for an app went into one massive table. Imagine querying billions of rows of data when you’re only interested in getting a list of users who just did a specific event, like ‘Buy From Cart’ or ‘Complete Game’. A lot of time was being wasted scanning over irrelevant rows.

Storing properties in JSON strings

Another shortcoming of the previous schema, which slowed down query times considerably, was that all of the user and event properties were dumped into JSON columns. So we had one column, user_properties, that held a JSON string of user properties, and another column, event_properties, containing a JSON string of event properties. If you wanted to query on any user or event property, you had to use the json_extract_path_text function to parse the entire JSON and extract the specific property you were interested in. We observed that this json_extract_path_text function was consistently the most CPU-intensive computation on our Redshift clusters.

[Image: a portion of our old Redshift schema.]

As you can see, the event_properties and user_properties columns were each a giant JSON string that could contain up to 65535 characters. To query on one of these properties, a customer had to use the expensive and slow json_extract_path_text function.

Breaking out individual tables for each event type

Rather than having one table for all of an app’s events, each app’s data is now partitioned by event type. Events belonging to a specific event type go to their own table; for example, there would be one table containing all ‘Add To Cart’ events, and another table containing all ‘Checkout’ events. This means that when you want to query on a specific event type, the query will only need to scan through the data for that event type, rather than all of them.

Now, this next part wouldn’t have been possible without our first decision to split each event type into its own table. Remember how we were storing all event properties in a single JSON string? Now that we are splitting events into their own tables, we can extract event properties into their own individual columns in their corresponding event type table.
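To make the JSON-column problem concrete, here is a minimal sketch of the old query pattern, using SQLite's `json_extract` in place of Redshift's `json_extract_path_text` (the table and property names are hypothetical, not Amplitude's actual schema):

```python
import json
import sqlite3

# In-memory stand-in for the old single-table schema: every event's
# properties live in one JSON string column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT, event_properties TEXT)")
conn.execute(
    "INSERT INTO events VALUES (?, ?)",
    ("Add To Cart", json.dumps({"item": "book", "price": 12.5})),
)

# To read any single property, the query must parse the entire JSON
# blob on every row -- the cost the article attributes to
# json_extract_path_text on Redshift.
row = conn.execute(
    "SELECT json_extract(event_properties, '$.item') FROM events"
).fetchone()
print(row[0])  # book
```

Extracting properties into real columns removes this per-row parse, which is why the schema change below pays off.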
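The per-event-type partitioning described above can be sketched the same way: one table per event type with properties as real columns, so a question about one event type never touches rows of another (table and column names are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per event type instead of a single monolithic events table,
# with event properties broken out into individual columns.
conn.execute("CREATE TABLE add_to_cart (user_id TEXT, item TEXT)")
conn.execute("CREATE TABLE checkout (user_id TEXT, total REAL)")
conn.executemany(
    "INSERT INTO add_to_cart VALUES (?, ?)",
    [("u1", "book"), ("u2", "pen")],
)
conn.execute("INSERT INTO checkout VALUES (?, ?)", ("u1", 12.5))

# "Which users did Add To Cart?" scans only that event type's table,
# never the Checkout rows.
users = [r[0] for r in conn.execute("SELECT DISTINCT user_id FROM add_to_cart")]
print(sorted(users))  # ['u1', 'u2']
```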