Created on 3 May 2016, over 8 years ago
Updated 29 April 2023, over 1 year ago

Problem/Motivation

The following statistics were extracted from a particular SQL instance. The site is a big one, using menu position rules extensively, having 720 active ones for now. Biggest issues in it are bytes sent and average execution time.

<!--break-->

SQL query statistics from percona-tools: totals, top 10 slowest queries and the details for the slowest one:

# Totals:
# Overall: 68.51M total, 611 unique, 196.49 QPS, 0.11x concurrency _______
# Time range: 2016-04-21 11:05:35 to 2016-04-25 11:56:51
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time         38523s    51us    267s   562us     1ms    64ms   125us
# Lock time          3616s       0   429ms    52us    66us    65us    47us
# Rows sent          4.52G       0   2.92M   70.84  420.77   1.38k    1.96
# Rows examine      10.56G       0   2.92M  165.54  420.77   3.13k    4.96
# Rows affecte      28.96k       0   7.24k    0.00       0    0.90       0
# Bytes sent       901.28G      11 664.23M  13.79k  13.78k 268.08k  313.99
# Query size         5.17G       6  79.56k   80.97  102.22  110.15   54.21

# Profile
# Rank Query ID           Response time    Calls    R/Call  V/M   Item
# ==== ================== ================ ======== ======= ===== ========
#    1 0x8C2FFCA0307FA532 14136.2340 36.7%  2650059  0.0053  0.00 SELECT menu_position_rules
#    2 0x9D0CB602DD025CCE  2931.5101  7.6%  3995441  0.0007  0.00 SELECT domain_alias
#    3 0x8BA7189ACA371127  2807.3743  7.3%  8248377  0.0003  0.00 SELECT domain
#    4 0x67A347A2812914DF  1592.6305  4.1%    10216  0.1559 16... SELECT CUSTOM-TABLE-IN-NON-DRUPAL-PROJECT (Ignore).
#    5 0x0A47FF1E378DDAD2  1466.4478  3.8%  2221700  0.0007  0.00 SELECT domain_alias
#    6 0x6CA6B1186679597D  1276.9851  3.3%    35092  0.0364  0.00 SELECT menu_links menu_router
#    7 0x412A031543B4EA48  1199.9383  3.1%  4254064  0.0003  0.00 SELECT languages
#    8 0x0A971376736289F0  1139.7765  3.0% 11610182  0.0001  0.24 SELECT domain_export
#    9 0xE7091269E98B03C3   955.1257  2.5%    11469  0.0833  0.00 SELECT redirect
#   10 0xF598075701F7C596   893.9671  2.3%  7357200  0.0001  0.00 SELECT date_format_locale

# Query 1: 7.60 QPS, 0.04x concurrency, ID 0x8C2FFCA0307FA532 at byte 3879110581
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.00
# Time range: 2016-04-21 11:05:35 to 2016-04-25 11:56:51
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          3 2650059
# Exec time     36  14136s     2ms      4s     5ms     9ms     5ms     5ms
# Lock time      4    173s    35us     4ms    65us    76us     8us    63us
# Rows sent     39   1.77G     716     716     716     716       0     716
# Rows examine  33   3.53G   1.40k   1.40k   1.40k   1.40k       0   1.40k
# Rows affecte   0       0       0       0       0       0       0       0
# Bytes sent    78 707.67G 280.01k 280.01k 280.01k 280.01k       0 280.01k
# Query size     3 187.02M      74      74      74      74       0      74
# String:
# Databases    DB
# Hosts        web1 (885378/33%)... 2 more
# Last errno   0
# Users        DBUser
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  ################################################################
#  10ms  ###
# 100ms  #
#    1s  #
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `DB` LIKE 'menu_position_rules'
#    SHOW CREATE TABLE `DB`.`menu_position_rules`
# EXPLAIN /*!50100 PARTITIONS*/
SELECT * FROM menu_position_rules WHERE enabled = '1' ORDER BY weight, rid

Notes:
On this particular portal, there is plenty of writes, so query cache is hindering performance and it's disabled. This results into the fact that this particular query is executed always. Even if that is the case, the slowness comes mostly from the network data transfer, not the query execution within the server.

This is because it does select * every time, it will continue to degrade as more rules are going to be added in future. Out of the whole SQL network traffic generated by the portal, this query s responsible for 25-30% of it on average (or 707 GB for 4 days based on the aggregates). This is the most time-consuming query on the portal, utilizing ~36% of the total SQL server time spent on queries.

Proposed resolution

There are 2 directions that can be taken on:
1. Offload DB from the query by adding persistent cache around it. This should be relatively fast and easy. Moving the network delays from the DB to the cache layer. There is one cave-at there will be 1 big cache item that might just move the bottleneck to the cache and nothing more. It would be best to have the rules granulated and have many cache items with small sizes (if possible). As this cache is based on the list of rules, any rule that is changed, deleted or updated should invalidate the whole cache data, so cache consistency is left-out to the CRUD methods menu_position_(add|edit|delete)_rule methods. Cache is cleared also when rules are reordered.

2. Optimize the query itself. We should aim in 2 directions here. First reduce columns requested, as * is really non-scalable for long. Second, we are filtering based on the ENABLED flag. Can we have some more filters added, so we can reduce the result set rows count? (This is implemented in comment 6).

Note that 1 and 2 are not mutually exclusive, on the contrary, they supplement each other.

Remaining tasks

Discussion, , Reviews, RTBC, Commit.

Patch in comment 9 has points 1 and 2 implemented. With functionality related to point 2 on my workload I am experiencing 10-20% result-set size reduction, depending on page context values.

User interface changes

New administrative setting option to handle enable/disable of the persistent cache logic.

API changes

Reamed the old parameter to reflect it's core feature - static caching (use_static_cache). Affecting only the function itself.

API addition - added one more optional parameter (last) to menu_position_evaluate_rules method - use_cache defaulting to TRUE. It is ignored if the setting value is disabled.

Data model changes

None.

📌 Task
Status

Closed: outdated

Version

1.0

Component

Code

Created by

🇧🇬Bulgaria ndobromirov

Live updates comments and jobs are added and updated live.
  • Performance

    It affects performance. It is often combined with the Needs profiling tag.

Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇨🇦Canada joelpittet Vancouver

    When someone runs into this problem again we can look at it in D10. I'll close this for now.

    Thanks for all your work on the D7 version

Production build 0.71.5 2024