Problem/Motivation
When administrators rebuild node access permissions in batch mode, the process is:
1. Initiate rebuild.
2. Delete permissions.
3. Regenerate permissions. Depending on the site, this can be very slow (I've seen D7 sites take hours).
4. Rebuild finished - site is operational.
The problem is that step 2 is relatively fast, so it removes everything from the node_access table almost immediately, while step 3 is (VERY) slow. This leaves the system in a broken/invalid state for a period of time that varies with the size of the site.
The current recommendation is to put the site into maintenance mode during the rebuild, as users will otherwise get many "403 Forbidden" responses. The bigger the site, the longer this window is and the more users are unhappy, as the site is effectively inaccessible (down).
I am marking this as:
- Bug - because I consider it one. Feel free to change the categorization.
- Major - because the site is not operational while the process runs. The bigger the site, the bigger the negative impact.
Proposed resolution
The solution comes from the double buffer design pattern:
Do not break the system until the new state is ready on the side, then swap the old state for the new one quickly.
The idea is to have a second database table (for example, node_access_temp) that is used only for this case. The process would change as follows:
1. Initiate rebuild.
2. Clean everything from node_access_temp, for safety (expected to be empty).
3. Rebuild new permissions in node_access_temp. This will be slow (as it currently is), but the site will be operational with the old permissions.
4. Clean node_access. This is fast, as it currently is.
5. Transfer the node_access_temp state to node_access. I expect this to be much faster, as it is limited only by the storage layer; no high-level APIs are involved. Options:
- INSERT ... SELECT or something similar.
- Drop node_access, rename node_access_temp to node_access, recreate node_access_temp.
- Keep a state value that points to the active node_access table used for managing access on the site; the switch can then be atomic, by changing the pointer value.
- Other ideas?
6. Clean up node_access_temp, as the data is already active.
7. Rebuild finished - site is operational.
This way the rebuild can take an arbitrarily long time, but the switch takes only as long as transferring the data from node_access_temp to node_access, greatly reducing the time during which the system's access data is in an invalid state. If we manage to make the switch fully atomic, the rebuild will not require downtime (maintenance mode) at all.
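To make the proposed flow concrete, here is a minimal sketch of the "INSERT ... SELECT" variant. It is not Drupal code: it uses Python's sqlite3 module with an in-memory database, a simplified four-column schema, and the hypothetical helper names rebuild_into_temp() and swap_in(), all of which are assumptions for illustration only. The point it shows is that the slow phase touches only node_access_temp, while the delete-and-copy on node_access happens inside one transaction.

```python
import sqlite3

# Simplified, hypothetical schema; the real node_access table has more columns.
SCHEMA = """
CREATE TABLE node_access (nid INTEGER, gid INTEGER, realm TEXT, grant_view INTEGER);
CREATE TABLE node_access_temp (nid INTEGER, gid INTEGER, realm TEXT, grant_view INTEGER);
"""


def rebuild_into_temp(conn, grants):
    """Slow phase: write regenerated grants to the staging table only."""
    conn.execute("DELETE FROM node_access_temp")  # safety clean, expected empty
    conn.executemany("INSERT INTO node_access_temp VALUES (?, ?, ?, ?)", grants)
    conn.commit()


def swap_in(conn):
    """Fast phase: replace the live grants inside a single transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM node_access")
        conn.execute("INSERT INTO node_access SELECT * FROM node_access_temp")
        conn.execute("DELETE FROM node_access_temp")


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO node_access VALUES (1, 0, 'all', 1)")  # old state
conn.commit()

# The rebuild runs while the old grant is still being served.
rebuild_into_temp(conn, [(1, 0, "all", 1), (2, 0, "all", 1)])
live_during_rebuild = conn.execute("SELECT COUNT(*) FROM node_access").fetchone()[0]

swap_in(conn)
live_after_swap = conn.execute("SELECT COUNT(*) FROM node_access").fetchone()[0]
```

Whether other connections ever see an empty node_access during the swap depends on the database engine and its transaction isolation, so this sketch illustrates the ordering of the steps and should not be read as a guarantee for any particular backend.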
Remaining tasks
Discussion, patch, review, RTBC, commit.
User interface changes
None.
API changes
None, this is an implementation detail.
Data model changes
New temporary DB table. No structural changes to existing systems.