18 Nov 2012

Google Analytics Limits and the "(Other)" Bucket

Each Standard Report in Google Analytics (GA) is pre-calculated daily into tables called "dimension value aggregates". Each pre-calculated report stores only 50,000 rows per day. The top 49,999 rows get actual values. The last (50,000th) row gets the value "(other)" and holds the sum of all the remaining row values.
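To make the roll-up concrete, here is a minimal Python sketch of the behaviour described above. It is not GA's actual implementation: the function name aggregate_daily is hypothetical, and ranking rows by the metric value itself is an assumption.

```python
from collections import Counter

ROW_LIMIT = 50_000  # rows stored per pre-calculated report per day (figure from the post)


def aggregate_daily(counts, row_limit=ROW_LIMIT):
    """Keep the top (row_limit - 1) dimension values and sum the rest into "(other)".

    `counts` maps a dimension value (e.g. a landing page) to its metric value.
    Ranking by the metric value is an assumption made for illustration.
    """
    ranked = Counter(counts).most_common()  # highest metric value first
    if len(ranked) <= row_limit:
        return dict(ranked)  # under the limit: no "(other)" row is needed
    kept = dict(ranked[: row_limit - 1])
    kept["(other)"] = sum(value for _, value in ranked[row_limit - 1:])
    return kept


# Tiny example with a row limit of 3 instead of 50,000.
sample = {"/a": 5, "/b": 3, "/c": 2, "/d": 1}
print(aggregate_daily(sample, row_limit=3))
```

Note that the total across all rows is unchanged by the roll-up, which is why report totals stay correct even when "(other)" appears.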


It's a “good problem to have”: more than 50,000 Landing Pages per day. Wow!

In the above illustration (with a one-day date range), we see the "(other)" bucket in the Landing Pages report because we are sending more than 50,000 distinct landing pages per day for this standard report. Generally this works fine. The totals are always correct. Also, most people only view the top 100 results and never jump to the 49,999th row. But when I try to do a long-tail analysis of Landings vs. Bounces with Estimated True Value, so as to arrive at a list of pages to improve, I hit a bottleneck. The problem gets worse when we select a date range spanning several weeks.

For multi-day reports, a page that is grouped into the "(other)" category on one day may not be grouped into "(other)" on another day. So when running a report over a multi-day date range, you may run into inconsistencies, because a long-tail page (or other dimension value) may fall into the "(other)" bucket on some days and get its own row on others.

Further, for multi-day standard reports, the maximum number of aggregated rows per day is 1M/D, where D is the number of days in the query. For example:
A report for the past 30 days would process a maximum of 33,333 rows per day (i.e. 1,000,000/30).
A report for the past 60 days would process a maximum of 16,666 rows per day (i.e. 1,000,000/60).
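
The 1M/D arithmetic can be sketched in a few lines of Python. The function name rows_per_day is hypothetical, and capping the result at the 50,000-row daily limit for short date ranges is my assumption from combining the two limits described in this post.

```python
MAX_ROWS_PER_QUERY = 1_000_000  # the "1M" in 1M/D
DAILY_ROW_LIMIT = 50_000        # rows stored per pre-calculated report per day


def rows_per_day(days):
    """Maximum aggregated rows per day for a standard report spanning `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    # Assumption: the stored daily limit still applies when 1M/D exceeds it.
    return min(DAILY_ROW_LIMIT, MAX_ROWS_PER_QUERY // days)


for d in (1, 30, 60):
    print(d, rows_per_day(d))
```

Under these assumptions the per-day limit starts shrinking below 50,000 once the date range is longer than 20 days.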

Is there a way to get around the "(Other)" bucket issue?

Yes, we can partially circumvent the "(Other)" bucket issue. I say partially because we will only see unsampled data up to 250K visits, after which GA's sampling algorithm kicks in.

We can create an advanced segment that matches all sessions and apply that segment to a standard report. For example, we can create an advanced segment on the dimension Visitor Type that matches the regular expression .* (this is NOT the same as applying the "All Visits" segment).
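
To see why .* works as a match-everything condition, here is a small Python check. It uses Python's re module, which I am assuming behaves like GA's regex matching for this simple pattern, and the Visitor Type values listed are only illustrative.

```python
import re

MATCH_ALL = re.compile(r".*")  # "any character, zero or more times"

# Illustrative Visitor Type values; the empty string stands in for a missing value.
visitor_types = ["New Visitor", "Returning Visitor", ""]

# Every value matches, so a segment built on this pattern keeps every session.
matched = [v for v in visitor_types if MATCH_ALL.fullmatch(v)]
print(len(matched) == len(visitor_types))
```

Because nothing is filtered out, the segment changes how the report is computed, not which sessions it contains.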


Let us see the original report with this Advanced Segment applied.

Wow, it works!

In cases where the report query cannot be satisfied by existing aggregates (i.e. pre-aggregated tables), GA goes back to the raw session data to compute the requested information. This applies to reports with advanced segments too: they use the raw session and hit data to re-calculate the report on the fly.

Typically, advanced segments are used to include or exclude sessions from being processed. But when we create a segment that matches all sessions, we simply bypass the pre-calculated reports and force the entire report to be re-calculated.

A few points to note: the numbers in pre-calculated and on-the-fly reports may differ, as each type of report has different limits. Pre-calculated reports only store 50k rows of data per day but process all sessions (visits).

Reports calculated on the fly can return up to 1 million rows of data, but they only process 250k sessions (visits). After 250k visits, sampling kicks in. 250k is the default sampling threshold, which can be raised with the slider up to a maximum of 500k.

So this solution works best when we have fewer than 500k visits in our date range. (We can find the number of sessions in the date range by looking at the visits metric in the traffic overview report.)
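
As a rough decision aid, the thresholds above can be turned into a small Python check. The function name workaround_status and its return labels are hypothetical, and treating a visit count exactly at the threshold as unsampled is an assumption.

```python
DEFAULT_SAMPLING_THRESHOLD = 250_000  # default from the post
MAX_SAMPLING_THRESHOLD = 500_000      # highest slider setting from the post


def workaround_status(visits, threshold=DEFAULT_SAMPLING_THRESHOLD):
    """Say whether the match-all segment report would be sampled for `visits` sessions."""
    if not 0 < threshold <= MAX_SAMPLING_THRESHOLD:
        raise ValueError("threshold must be between 1 and 500,000")
    if visits <= threshold:
        return "unsampled"
    if visits <= MAX_SAMPLING_THRESHOLD:
        return "unsampled if the slider is raised"
    return "sampled"


for v in (120_000, 400_000, 900_000):
    print(v, workaround_status(v))
```

If the result is "sampled", shortening the date range until the visits count drops under the threshold is one way to keep the report unsampled.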

References:
How Sampling Works in Google Analytics
http://blog.intrapromote.com/google-analytics-50000-row-limit/
https://plus.google.com/112976464453422312311/posts/FtFtkCCXkr3
https://plus.google.com/112976464453422312311/posts/BCwbdDsXwet