- 🇨🇦Canada joelpittet Vancouver
When someone runs into this problem again we can look at it in D10. I'll close this for now.
Thanks for all your work on the D7 version
The following statistics were extracted from a particular SQL instance. The site is a big one, using menu position rules extensively, having 720 active ones for now. Biggest issues in it are bytes sent and average execution time.
<!--break-->SQL query statistics from percona-tools: totals, top 10 slowest queries and the details for the slowest one:
# Totals:
# Overall: 68.51M total, 611 unique, 196.49 QPS, 0.11x concurrency _______
# Time range: 2016-04-21 11:05:35 to 2016-04-25 11:56:51
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 38523s 51us 267s 562us 1ms 64ms 125us
# Lock time 3616s 0 429ms 52us 66us 65us 47us
# Rows sent 4.52G 0 2.92M 70.84 420.77 1.38k 1.96
# Rows examine 10.56G 0 2.92M 165.54 420.77 3.13k 4.96
# Rows affecte 28.96k 0 7.24k 0.00 0 0.90 0
# Bytes sent 901.28G 11 664.23M 13.79k 13.78k 268.08k 313.99
# Query size 5.17G 6 79.56k 80.97 102.22 110.15 54.21
# Profile
# Rank Query ID Response time Calls R/Call V/M Item
# ==== ================== ================ ======== ======= ===== ========
# 1 0x8C2FFCA0307FA532 14136.2340 36.7% 2650059 0.0053 0.00 SELECT menu_position_rules
# 2 0x9D0CB602DD025CCE 2931.5101 7.6% 3995441 0.0007 0.00 SELECT domain_alias
# 3 0x8BA7189ACA371127 2807.3743 7.3% 8248377 0.0003 0.00 SELECT domain
# 4 0x67A347A2812914DF 1592.6305 4.1% 10216 0.1559 16... SELECT CUSTOM-TABLE-IN-NON-DRUPAL-PROJECT (Ignore).
# 5 0x0A47FF1E378DDAD2 1466.4478 3.8% 2221700 0.0007 0.00 SELECT domain_alias
# 6 0x6CA6B1186679597D 1276.9851 3.3% 35092 0.0364 0.00 SELECT menu_links menu_router
# 7 0x412A031543B4EA48 1199.9383 3.1% 4254064 0.0003 0.00 SELECT languages
# 8 0x0A971376736289F0 1139.7765 3.0% 11610182 0.0001 0.24 SELECT domain_export
# 9 0xE7091269E98B03C3 955.1257 2.5% 11469 0.0833 0.00 SELECT redirect
# 10 0xF598075701F7C596 893.9671 2.3% 7357200 0.0001 0.00 SELECT date_format_locale
# Query 1: 7.60 QPS, 0.04x concurrency, ID 0x8C2FFCA0307FA532 at byte 3879110581
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.00
# Time range: 2016-04-21 11:05:35 to 2016-04-25 11:56:51
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 3 2650059
# Exec time 36 14136s 2ms 4s 5ms 9ms 5ms 5ms
# Lock time 4 173s 35us 4ms 65us 76us 8us 63us
# Rows sent 39 1.77G 716 716 716 716 0 716
# Rows examine 33 3.53G 1.40k 1.40k 1.40k 1.40k 0 1.40k
# Rows affecte 0 0 0 0 0 0 0 0
# Bytes sent 78 707.67G 280.01k 280.01k 280.01k 280.01k 0 280.01k
# Query size 3 187.02M 74 74 74 74 0 74
# String:
# Databases DB
# Hosts web1 (885378/33%)... 2 more
# Last errno 0
# Users DBUser
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms ################################################################
# 10ms ###
# 100ms #
# 1s #
# 10s+
# Tables
# SHOW TABLE STATUS FROM `DB` LIKE 'menu_position_rules'
# SHOW CREATE TABLE `DB`.`menu_position_rules`
# EXPLAIN /*!50100 PARTITIONS*/
SELECT * FROM menu_position_rules WHERE enabled = '1' ORDER BY weight, rid
Notes:
On this particular portal, there is plenty of writes, so query cache is hindering performance and it's disabled. This results into the fact that this particular query is executed always. Even if that is the case, the slowness comes mostly from the network data transfer, not the query execution within the server.
This is because it does select * every time, it will continue to degrade as more rules are going to be added in future. Out of the whole SQL network traffic generated by the portal, this query s responsible for 25-30% of it on average (or 707 GB for 4 days based on the aggregates). This is the most time-consuming query on the portal, utilizing ~36% of the total SQL server time spent on queries.
There are 2 directions that can be taken on:
1. Offload DB from the query by adding persistent cache around it. This should be relatively fast and easy. Moving the network delays from the DB to the cache layer. There is one cave-at there will be 1 big cache item that might just move the bottleneck to the cache and nothing more. It would be best to have the rules granulated and have many cache items with small sizes (if possible). As this cache is based on the list of rules, any rule that is changed, deleted or updated should invalidate the whole cache data, so cache consistency is left-out to the CRUD methods menu_position_(add|edit|delete)_rule methods. Cache is cleared also when rules are reordered.
2. Optimize the query itself. We should aim in 2 directions here. First reduce columns requested, as * is really non-scalable for long. Second, we are filtering based on the ENABLED flag. Can we have some more filters added, so we can reduce the result set rows count? (This is implemented in comment 6).
Note that 1 and 2 are not mutually exclusive, on the contrary, they supplement each other.
Discussion, , Reviews, RTBC, Commit.
Patch in comment 9 has points 1 and 2 implemented. With functionality related to point 2 on my workload I am experiencing 10-20% result-set size reduction, depending on page context values.
New administrative setting option to handle enable/disable of the persistent cache logic.
Reamed the old parameter to reflect it's core feature - static caching (use_static_cache). Affecting only the function itself.
API addition - added one more optional parameter (last) to menu_position_evaluate_rules method - use_cache defaulting to TRUE. It is ignored if the setting value is disabled.
None.
Closed: outdated
1.0
Code
It affects performance. It is often combined with the Needs profiling tag.
Not all content is available!
It's likely this issue predates Contrib.social: some issue and comment data are missing.
When someone runs into this problem again we can look at it in D10. I'll close this for now.
Thanks for all your work on the D7 version