Duplicate entry error for paths containing trailing spaces in redirect_404 table

Created on 3 March 2021, almost 4 years ago
Updated 10 February 2023, almost 2 years ago

Problem/Motivation

ERROR 1062 (23000) at line 16479: Duplicate entry '/newsletter -de' for key 'PRIMARY'

Duplicate entry error for paths containing trailing spaces in redirect_404 table when importing a database dump from amazon rds into mariadb/mysql.

Documentation for Mariadb / mysql database states behavior for varchar keys https://mariadb.com/kb/en/varchar/
If a unique index consists of a column where trailing pad characters are stripped or ignored, inserts into that column where values differ only by the number of trailing pad characters will result in a duplicate-key error

Proposed resolution

Escape path value before doing the upsert on the redirect_404 table

🐛 Bug report
Status

Active

Version

1.0

Component

Code

Created by

🇩🇪Germany volkerk

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • 🇺🇸United States loopy1492

    Similar error on both sql-sync and fin db import on our site.

    > ERROR 1062 (23000) at line 25846: Duplicate entry '/consumer/ca_ren.htm​-en' for key 'redirect_404.PRIMARY'

    I ran sql-sync with --extra-dump='--no-tablespaces --insert-ignore' but got:
    > ERROR 1227 (42000) at line 18: Access denied; you need (at least one of) the SUPER, SYSTEM_VARIABLES_ADMIN or SESSION_VARIABLES_ADMIN privilege(s) for this operation

    Clearly I do not have permissions for on our shared hosting environment.

    At any rate, I went into admin/config/search/redirect/404 and actually searched for this redirect. I was shocked/not shocked to see that there actually were two of these in there.

    Apparently, this module uses the url as the unique key for this field instead of a unique ID. That probably should be changed.

  • 🇺🇸United States loopy1492

    So, it seems that telling the module to do nothing with one of these urls does not delete it from the table. Instead, it marks it as resolved. Interestingly, when you mark one as resolved, that stops both it and the duplicate from displaying on the page. You cannot resolve the other one. It wouldn't matter anyways because both are still in the database table. Time to break out the big guns:

    fin drush @nydfs.dev sql-query "SELECT * FROM redirect_404 WHERE path LIKE '%consumer\/ca_ren%'"

    The response:

    /consumer/ca_ren.htm en 1 1675924164 0 0
    /consumer/ca_ren.htm​ en 1 1675924164 1 0

    Okay!

    fin drush @nydfs.dev sql-query "DELETE FROM redirect_404 WHERE path LIKE '%consumer\/ccra_registration.htm%

    Of course, when I do that, ANOTHER site comes up as a duplicate. What's a boy to do? Oh right....

    fin drush @nydfs.dev sql-query "DELETE FROM redirect_404" && echo "GET REKT!"

  • 🇺🇸United States joegl

    Apparently, this module uses the url as the unique key for this field instead of a unique ID. That probably should be changed; either that or some kind of check needs to be performed that denies the user the creation of a duplicate entry.

    I am guessing the issue has to do with collation of a database and how it stores unique characters. Because if the unique key is the URL, then duplicate entries by definition shouldn't exist. The only reason they exist is because they're not duplicates, and one is probably using a special character of some kind. The problem is when the database gets dumped and imported, somewhere along that process the unique character which makes two entries separate gets stripped or modified and thus the entries now match and are considered duplicates.

    In my case, it was because the difference in MySQL versions between the production site and my local where I was importing and how certain characters were treated between those versions.

  • Status changed to Postponed: needs info almost 2 years ago
  • 🇨🇭Switzerland berdir Switzerland

    I don't see what this module could change to fix that, as #16 explains, this is a collation problem and different databases with different settings.

    If you get this on local environments for development, it's perfectly fine to just ignore this table unless you for some reason want to work specifically on this. On staging/test environments, you likely want to ensure that you use the same collation.

  • Status changed to Active almost 2 years ago
  • 🇨🇦Canada joelpittet Vancouver

    @Berdir, we are running into this, the path is used as an index, and MySQL 8 changes what can be in an index and seems to normalize the values where MySQL 5.7 didn't care that there was some weird character some hacker tried to put in the path should also be an index.

    Maybe we can either remove the index/primary key on path, or apply a filter to remove characters that can't be real paths as per some RFC (didn't google yet)

  • 🇨🇦Canada joelpittet Vancouver

    Some examples of paths where this MySQL 8 key collision occurred

    
    /programs/vancouver)
    /programs/vancouver)
    
    
    /∼user/demos/mdp/vi.html
    /~user/demos/mdp/vi.html
    
    
    /user/demos/mdp/vi.html
    /user/demos/mdp/vi.html
    

    Can you spot the funny characters?

  • 🇨🇭Switzerland berdir Switzerland

    > that can't be real paths as per some RFC (didn't google yet)

    Well, they can be in the path or the server wouldn't be able to receive and log them.

    But yeah, fair enough, I'm open to trimming and otherwise removing characters, but not sure how easy that part is, there's plenty of international characters that might actually be used in an URL.

  • 🇨🇦Canada joelpittet Vancouver

    @Berdir, how about URL encoding them?

    There may be some that are worth stripping, I'll look around and see if there is a spec to reference.

  • 🇨🇦Canada joelpittet Vancouver

    This is a really good answer, but still trying to grok how to implement it. https://stackoverflow.com/a/36667242/80281 Basically what you said written as RFC 3987 is out-of-date from reality.

  • 🇨🇦Canada joelpittet Vancouver

    We should have tests to show a sampling of characters that should work. And maybe if they are valid characters for URLs we ensure they save properly in MySQL 8.0?
    Like maybe we don't index the path, which would be a performance issue I imagine.

    There is probably a nice middle ground here... but lets start with test coverage to set the expectations.

  • 🇨🇦Canada sseto

    I'm also getting this from trying to dump from 5.7 to 8.0. Is it safe to just delete the data from the table `redirect_404` and mysqldump that way?

  • 🇺🇸United States loopy1492

    Just got it on another site. Need to copy the db to a different environment upstream, and wipe the redirect_404 table before I can download it and use it locally.

  • 🇺🇦Ukraine Taran2L Lviv

    URL encoding should be the way here, as @joelpittet suggested

  • 🇺🇸United States scott_earnest

    We have been copying our database to a lower environment, then running a database update, then sql-sync to local.

    This command will clear the problem records:
    drush @stage sql-query "DELETE FROM redirect_404 WHERE path LIKE '%my-duplicate-key-redirect%'"

    Or we have dropped the PRIMARY KEY, this also works:
    drush @stage sql-query 'ALTER TABLE redirect_404 DROP PRIMARY KEY'

    This is our current workaround.

  • 🇺🇸United States tedfordgif

    Adding core issue for reference on handling case sensitivity, but with only one mention of "collation".

  • 🇺🇸United States tedfordgif

    Another approach might be to simply diverge from core's handling of URLs, and make redirect_404 everything-sensitive, e.g. throw 'binary' on the 'path' field in the redirect_404 schema. That would be unexpected for users/devs, and is probably not correct, but would have the same effect as the URL-encoding suggestion (absent normalization).

    This is not the right ticket to discuss it, but another place we end up with inconsistencies is that public files don't get served by Drupal, but the 404 page for the same does. Should core have a redirect if the file_managed uri matches the path modulo collation? That would also mean fewer redirect_404 records would get created.

    The issue with allowing the database collation to determine matches is that we may want different behavior in different circumstances. For example, if a single character in /some/long-URL-slüg has a diacritic mark like that, we probably want to serve the same page, especially if the canonical path has the diacritic and the request path does not. However, in a short url slug, such as you might see on /a/view-page-with/tags/slüg, some languages probably have word collisions we wouldn't want. Maybe the correct approach is to always have unambiguous short slugs (add context to the "machine name", e.g. garden_slug).

  • 🇺🇸United States loopy1492

    Still busted.

  • 🇦🇺Australia sonnykt Melbourne, Australia

    The issue is still occurring with MySQL 8:

    MySQL [v8032]> SELECT VERSION();
    +-----------+
    | VERSION() |
    +-----------+
    | 8.0.32    |
    +-----------+
    1 row in set (0.000 sec)
    
    MySQL [v8032]> SELECT @@character_set_database, @@collation_database;
    +--------------------------+----------------------+
    | @@character_set_database | @@collation_database |
    +--------------------------+----------------------+
    | utf8mb4                  | utf8mb4_0900_ai_ci   |
    +--------------------------+----------------------+
    1 row in set (0.001 sec)
    
    MySQL [v8032]> SHOW CREATE TABLE redirect_404\G
    *************************** 1. row ***************************
           Table: redirect_404
    Create Table: CREATE TABLE `redirect_404` (
      `path` varchar(191) NOT NULL COMMENT 'The path of the request.',
      `langcode` varchar(12) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL DEFAULT 'und' COMMENT 'The language of this request.',
      `count` int unsigned NOT NULL DEFAULT '0' COMMENT 'The number of requests with that path and language.',
      `timestamp` int unsigned NOT NULL DEFAULT '0' COMMENT 'The timestamp of the last request with that path and language.',
      `resolved` int NOT NULL DEFAULT '0' COMMENT 'Boolean indicating whether or not this path has a redirect assigned.',
      `daily_count` int unsigned NOT NULL DEFAULT '0' COMMENT 'The number of requests with that path and language in a day.',
      PRIMARY KEY (`path`,`langcode`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='Stores 404 requests.'
    1 row in set (0.001 sec)
    
    #*** one trailing space 0x20
    MySQL [v8032]> INSERT INTO redirect_404 VALUES ('/test-trailing ','en',0,0,0,0);
    Query OK, 1 row affected (0.004 sec)
    
    #** two trailing spaces 0x20
    MySQL [v8032]> INSERT INTO redirect_404 VALUES ('/test-trailing  ','en',0,0,0,0);
    Query OK, 1 row affected (0.002 sec)
    
    #** one trailing nbsp 0xC2A0
    MySQL [v8032]> INSERT INTO redirect_404 VALUES ('/test-trailing ','en',0,0,0,0);
    ERROR 1062 (23000): Duplicate entry '/test-trailing -en' for key 'redirect_404.PRIMARY'
    
    MySQL [v8032]> SELECT "a " = "a  ";
    +--------------+
    | "a " = "a  " |
    +--------------+
    |            1 |
    +--------------+
    1 row in set (0.001 sec)
    
    MySQL [v8032]> SELECT "a " = "a ";
    +--------------+
    | "a " = "a "  |
    +--------------+
    |            0 |
    +--------------+
    1 row in set (0.001 sec)
    

    The special character is 0xC2A0 (UTF-8 non-breaking space).

  • 🇵🇹Portugal joao.ramos.costa

    Faced the same issue with mysql 8.

    ERROR 1062 (23000) at line 11049: Duplicate entry '/mypage page/example/1.pdf-en' for key 'redirect_404.PRIMARY'

  • 🇺🇸United States loopy1492

    With Acquia moving over to mysql 8 soon, folks on that platform may want to petition Acquia to contribute to this module.

  • 🇺🇸United States loopy1492

    still busted

  • Pipeline finished with Success
    23 days ago
    Total: 179s
    #354216
  • Pipeline finished with Success
    23 days ago
    Total: 204s
    #354336
  • 🇺🇸United States loopy1492

    While I've been developing a solution, I can confirm @joao.ramos.costa that we're getting this issue still with upstream MySQL 8. For some reason, locally with Docksal running mysql 8.0.28, it thinks the full-width kana characters are the same as the medium width ones.

  • Pipeline finished with Success
    23 days ago
    Total: 180s
    #354409
  • Pipeline finished with Success
    23 days ago
    Total: 227s
    #354479
  • Pipeline finished with Success
    23 days ago
    Total: 202s
    #354502
  • Pipeline finished with Success
    23 days ago
    Total: 182s
    #354548
  • 🇺🇸United States loopy1492

    I started by changing how the URLs were stored, but then I realized we'd have to change how they were accessed as well. Different characters that happen to look the same are still different after all. I didn't want to risk the possibility that there were intentional uses of these characters.

    I went back to my original thought that we simply need a proper unique key. The merge request/patch creates an update that creates a new id field, assigns an id value to all the existing 404 entries, then sets that ID as the primary key, then sets the value as auto-incrementing.

  • 🇺🇸United States loopy1492

    Testing steps:

    • Deploy the previous tag to dev
    • Copy the database and files from prod to dev
    • Fire up a local site. You will might get a database sync error if your local build includes a sql-sync. That is fine.
    • Now, on the upstream DEV SITE, log into the remote site
    • From the admin screen, paste these urls into the address bar and visit them one at a time:

    /example/path/
    /example/path
    /example%2Fpath
    /example%20path/
    /example%20path
    /example path
    /example/path)
    /example/path)
    /example)/path
    /example)/path
    /∼example/path
    /~example/path

    • Note that the two parentheses are different… U+FF09 and U+0029
    • Note that the two tildes are different as well… U+223C and U+007E
    • Check the /admin/config/search/redirect/404 page. There should be a bunch of new records in there.
    • Deploy the branch with the patch to the dev server
    • Run drush cr && drush updb -y on the dev server
    • Run drush sql-query "DESCRIBE redirect_404;", then…
    • Run drush sql-query "select * from redirect_404;"
    • Make note of the count (the third field).
    • Hit ./~example/path followed by running drush cr several times. We have to do this because cache protects the resource from incrementing after the first hit.
    • Run fin drush sql-query "select * from redirect_404;" and check to make sure that path has incremented.
    • Now run drush sql-sync @thesite.dev @self -y
    • You should not get a database sync error – it should successfully import the database.
  • 🇵🇹Portugal joao.ramos.costa

    Hi @loopy1492, I will soon test your steps, although I think we should add them as kernel/unit tests. As for the table auto-incremental unique id, here's my two cents: Seems indeed like the most wise choice; otherwise, we might get lost in the infinite creativity of character combinations.

Production build 0.71.5 2024