Add new dictionary to cater for common words not in core dictionaries

Created on 18 December 2024, 2 months ago

Problem/Motivation

Some common programming words and terms, such as lando and ddev are used in many contrib projects but do not appear in core dictionaries. These had been hard-coded in the default cspell.json config, but are now added via prepare-cspell.php

Other words, such as nightwatchjs and endapply were in core dictionaries but have been deleted, causing contrib projects to fail unexpectedly.

Proposed resolution

Add a new dictionary to /assets containing these lost words

Remaining tasks

User interface changes

API changes

Data model changes

šŸ“Œ Task
Status

Active

Component

gitlab-ci

Created by

šŸ‡¬šŸ‡§United Kingdom jonathan1055

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @jonathan1055
  • Merge request !310#3494834 Add extra dictionary ā†’ (Merged) created by jonathan1055
  • Pipeline finished with Success
    2 months ago
    Total: 51s
    #372753
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    Ready for testing and feedback. But I undertand if you do not think this is our job to maintain. However, it is gitlab_templates decision to keep up with the core version, so we are causing contrib projects to fail when we bump to the next version. This won't be the last time that words are dropped.

  • Pipeline finished with Success
    2 months ago
    Total: 52s
    #372761
  • Pipeline finished with Success
    2 months ago
    Total: 51s
    #372762
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    The name "extra-dictionary.txt" is not very imaginative. If anyone has a better idea, let me know. It is not just for words dropped from core, so I did not want to include 'core' in the name.

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    Iā€™m thinking that while we have a handful of them only we might want to keep them in the array that will also populate the artifacts, instead of as a separate file.

    That way if maintainers see the cspell config file with those words they might choose to keep them or not.

    If or when we have more words we might consider a separate file but if we want to do a quick-fix Iā€™d do it in the most simple way. Thoughts?

  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    OK. I will revert the initial work, and just add the words via the array in .cspell.json.

    But first, I wanted to record the tests using this extra dictionary, just to see how it worked.
    Here's a test with new words, using existing gitlab templates 1.6.11 running core 11.1
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373557

    Using MR310 with new dictionary
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373584
    The job passes and the cspell.json artifact has the new dictionary

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    Thanks for running the tests. With the external file approach, we'd need to include it as well in the artifacts, and for example, in the second run, the cspell.json file has "path": "./extra-dictionary.txt", but the artifact is not in there.

    I think that if we see the number of words growing to more than 15-20, we can do the extra file/artifact in a follow-up, but for now it might just be easier to add the files to the array as we had "ddev" and "lando". The only reason that I'm pushing for this simpler approach is that the solution would be lower risk and we can merge it and create a new release before tomorrow.

  • Pipeline finished with Success
    2 months ago
    Total: 51s
    #373610
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    I have pushed the changes, and re-tested on the same branch:
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373633
    The cspell.json artifact has the new words (and not the new dictionary)

    Ready for review.

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    Looks good to me. Thanks so much!

    As per #7, I'm happy if a follow-up issue is created to move those array of words into a separate file, where we can improve this and test it properly, but without the pressure of getting it out quick that this issue has.

  • Pipeline finished with Skipped
    2 months ago
    #373718
  • šŸ‡ŖšŸ‡øSpain fjgarlin

    Merged. I'll tag a new release and make it default-ref.

  • šŸ‡¦šŸ‡¹Austria drunken monkey Vienna, Austria

    Would it be possible to also include varchar there? See, e.g., šŸ“Œ Fix CI failures Active .
    The two other words re-added there to the dictionary, CURLOPT and RETURNTRANSFER, seem less common, but Iā€™d expect varchar is a biggy.

  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    OK, maybe we need to look at the whole file. This is not an ideal thing to do, but as we have started on this path, it needs fuller investigation.

    Using the following

    FILE1=/path/to/local/drupal-10.4-dev/core/misc/cspell/drupal-dictionary.txt
    FILE2=/path/to/local/drupal-11.0-dev/core/misc/cspell/drupal-dictionary.txt
    echo $FILE1 && cat $FILE1 && wc -l < $FILE1;
    echo $FILE2 && cat $FILE2 && wc -l < $FILE2;
    echo '===\nin first, not second';cat $FILE1 $FILE2 $FILE2 | sort | uniq --unique
    

    has 18 words in D10 and 20 in Drupal 11, none are dropped. So that's all OK.

    However has 823 words in D10 but only 718 in Drupal 11. There are 3 new words but attached is the list of 108 dropped words. Now we need to decide what we do with this info.

  • šŸ‡ŗšŸ‡øUnited States tr Cascadia

    I brought up the varchar issue in the core queue: #3493141-25: Bump cspell to 8.16.1 ā†’
    This cspell error started showing up with the Drupal 11.1 rollout that uses cspell 8.16.1 instead of the 8.13.0 that was used for Drupal 11.0. More details in that issue ...

    Not sure that gitlab_templates is the right place to address this, but no-one is answering my questions in core when I ask where this should be addressed...

    Re #13: I also noticed that CURLOPT started to generate an error, but I just threw in some cspell:ignore/cspell:disable for the few usages of curl in my modules. In principle because core requires curl, the core cspell configuration should not be complaining when I use perfectly valid curl constants, but in practice I've always had to explicitly add ignores for things like CURLOPT_FOLLOWLOCATION.

    Regardless, I think these are all related. I don't know what changed between 11.0 and 11.1 that's causing this - I would not expect it to be a problem with gitlab_templates because I do not get the failures testing against 11.0 but I do get them testing against 11.1.

  • šŸ‡¦šŸ‡¹Austria drunken monkey Vienna, Austria

    I have to admit that I donā€™t really understand this whole problem, or the whole mechanism at work here. Why did Core remove those words, and why does this not cause CSpell errors in the Core pipelines?

    However, from a contrib perspective it seems clear that this is a regression and at least some of those words should be added back. (Though, by the above, I have no idea whether to do this in Core or in this project.) Looking through your list (Thanks!) I see lots of words that are actual nonsense, for which it makes sense to remove them and let contrib deal with any instances of them in their code.

    However, a few words Iā€™d say are common enough (some are even on Wiktionary, I linked them below) to warrant re-adding and I think Iā€™d be confused if any of them were suddenly reported as spelling errors:

    That being said, it might be the case that some of those are now just part of the inbuilt CSpell dictionary, or something like that, so they donā€™t need to be included anymore? I quickly tested in the Search API module and this indeed seems to be the case: only ā€œbodylessā€, ā€œkernighanā€, ā€œpermissionlessā€ and ā€œvarcharā€ are actually reported as errors.
    I would suggest readding those four words in some way but, again, I donā€™t actually know where to best do that.

  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    I have just spent several hours investigating this problem, trying to track the changes to core cspell verions against core releases and the dictionary changes, and I believe I have just made a huge breakthrough!

    Summary

    The 'varchar' string seems to have been added/moved between internal cspell dictionaries. In cspell 8.13 it must have been in a dictionary that we use but in cspell 8.16 it is now in a dictionary 'sql' that Gitlab Templates has not previously used. Adding this dictionary will solve the problem. In addition, 'curlopt' and 'endapply' are in a dictionary called 'cpp' which we do not use. Neither of these dictionaries are specified in the core .cspell.json so they were also not specified in the Gitlab Templates list of dictionaries (because the inital work was based on the core config). I have no idea why the core cspell job is not reporting these words. But it will be an easy fix here.

    History

    šŸ“Œ Update JavaScript dependencies for Drupal 11.0-rc Fixed
    cspell version ^8.8.1 updated to ^8.10.4
    The dictionary.txt was entirely rebuilt which dropped 38 words including varchar (no words added)
    only 3 files changed in the commit - dictionary.txt, core/package.json and core/yarn.lock. All three of these files are in the .cspell.json config "ignorePaths" so what changes have caused the words to be dropped? Either (1) the words are now included in one of the cspell built-in dictionaries, or (2) other core changes had been done previously without a dictioinary rebuild, or (3) a combination of both.
    10 July Committed to 11.x and 11.0.x same dictionary changes on each. (no backport to 10.x)

    šŸ“Œ Update JavaScript dependencies for Drupal 11.0.0 Fixed
    Update JavaScript dependencies for Drupal 11.0.0
    cspell 8.10.4 updated to 8.13.0. 5 words dropped
    1 Aug committed to 11.x and 11.1.x (No backport to 10.x)

    7 Aug Gitlab Templates switched up to 11.0 for the 'current' variant.

    šŸ“Œ Update cspell to latest Active
    cspell ^8.13.0 -> ^8.16.0
    dictionary.txt 32 dropped words (eg autoloader and backported), 1 added 'lspeak' (but this word does not appear in any core file. We have lolspeak but not lspeak, so why was it added?)
    25 Nov committed to 11.1.x and 11.x (same changes)
    NOT backported to 11.0.x which remains at cspell 8.13.0
    backported to 10.4 and 10.5 cspell upgrade was a bigger jump 8.0.0 -> 8.16.0.
    81 words dropped (and lspeak added). varchar is still in the file for 10.x

    11 Dec gitlab templates (TR project honeypot) https://git.drupalcode.org/project/honeypot/-/pipelines/366142
    current core 11.0.8 running cspell 8.13.0 and varchar is OK

    šŸ“Œ Bump cspell to 8.16.1 Active
    cspell 8.16.0 -> 8.16.1
    11 Dec in 11.1.x and 11.x this change removed 'endapply', 'nightwatchjs' and 'testgroups'
    Backport to 10.5.x removed nightwatchjs and testgroups but endapply was left in.
    Backport to 10.4.x did not remove endapply, nightwatchjs and testgroups. varchar is still in.

    18 Dec Gitlab templates updated current from 11.0.9 to 11.1.0
    gitlab templates https://git.drupalcode.org/project/honeypot/-/pipelines/373004
    current core 11.1.0 running cspell 8.16.1 and varchar fails

    Details

    https://cspell.org/docs/dictionaries/ lists the dictionaries, but this is not all of them. cspell provides a trace parameter which allows you to find which dictionaries a word is contained in. This shows many more dictionaries, and specifically highlights the ones where 'varchar' and 'curlopt' are stotred. And for a bonus, we also get all the curl constants, so for example 'FOLLOWLOCATION' is now in a dictionary and TR will not have to add this to his projects custom words.

    The question still remains as to why core jobs do not fail. Maybe that list of dictionaries in core/.csepll.json is somehow augmented to use every intenal dictionary? Core does run cspell via yarn, whereas Gitlab Templates runs is via npx, but surely that can't make a difference?

    I will create a MR with the extra dictionaries ready for testing.

  • Merge request !313#3494834 Add more internal dictionaries ā†’ (Merged) created by jonathan1055
  • Pipeline finished with Success
    about 2 months ago
    Total: 55s
    #387581
  • šŸ‡ŗšŸ‡øUnited States tr Cascadia

    Wow! Great work @jonathan1055 ! Thanks for looking into this.

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    @tr - you can easily modify your .gitlab-ci.yml file to point to the fork created for the MR of this issue. See this: https://project.pages.drupalcode.org/gitlab_templates/info/testing-mrs/

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    @jonathan1055 - thanks so much for the detailed investigation and insight. I'd be happy to add those dictionaries as soon as you say it's ready if it's going to help the rest of contrib with all these cspell issues.

  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    Hi TR,
    Thanks for the offer of testing, that's great. You can see my example here
    https://git.drupalcode.org/issue/scheduler-3356800/-/blob/3356800-spelli...
    Also I've skipped everything except CSPELL to save resources.
    https://git.drupalcode.org/issue/scheduler-3356800/-/blob/3356800-spelli...

    fjgarlin,
    While we are making changes to the spelling config, I also would like to help to solve the deadlock on šŸ› Rename ColourDefinitionSniff to ColorDefinitionSniff Needs work and related issues, by adding the six sniff names/error codes with spelling errors that cannot be renamed (which would break BC). The spell checker is intended to help maintainers by highlighting real typos, not make our lives harder by adding extra cspell:ignore everywhere. I know it is not strictly Gitlab Templates job to make these additions, but if we don't we risk people simply turning off the cspell job altogether, or adding more files to the ignore path, neither of which is a good outcome. If you'd like a separate issue for that, let me know, but it would be easy to do it here too.

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    Happy for those six names to be added too and for that to happen on this issue. I'd like to see how the core issue develops separately, but we can defo add the suggested fix to the contrib templates regardless.

  • Pipeline finished with Failed
    about 2 months ago
    Total: 130s
    #388477
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    I have added those six phpcs sniff names. Are you bothered about the array being on two lines not six? It passes all phpcs checks like this.

    Also, some of the previously added words (ddev, endapply and curlopt) are in the cpp (c++) dictionary so those can be removed. We could also remove 'lando' as that was a temporary addition for an earlier file that has now changed.

    Ready for feedback.

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    The array syntax is weird, but if the CI check is happy, I'm ok with it.

    The changes look good to me. I just added a comment about modifying a comment and maybe recovering a word back, just for clarity.

  • Pipeline finished with Success
    about 2 months ago
    Total: 52s
    #388550
  • Pipeline finished with Success
    about 2 months ago
    Total: 51s
    #388574
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    I have put back 'ddev' and updated the comment as requested.

    Unfortunately I've found one downside to including the 'cpp' dictionary. It allows the British spelling of 'colour' and also the incorrect version of 'uneccessary' with only 1 'n'. We can very easily add these into the flagWords list, just like we already do with 'e-mail', 'grey' etc copying what Core does. But we'd want to try to find other incorrect words that would now go unreported.

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    I'd rather not add "cpp" and add a set of words. "cpp" seems to contain nearly 40k words (https://github.com/streetsidesoftware/cspell-dicts/blob/main/dictionarie...).

    We could grab the dictionaries from D10 and D11 and add the diff, or at least see which is the diff and then decide what to do. Very similar, if not almost the same, to what you did in #17 ( https://www.drupal.org/files/issues/2025-01-06/words-in-D11.0-dropped-fr... ā†’ ).

  • Pipeline finished with Success
    about 2 months ago
    Total: 52s
    #389094
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    I'm not an expert on C++ but that dictionary of 40k words contains a lot of total rubbish and you are right that we don't want to enable it. Yes, we can get the dropped words and check through, as many of them won't be needed. Shall we use the approach that I started in the first three commits on the MR310, by having our .txt dictionary files in /assets.

  • šŸ‡ŖšŸ‡øSpain fjgarlin

    Yeah, now that we have more words it make sense to have our own dictionary with the words instead of a massive array. We'll need to expose it as artifact too.

  • šŸ‡ŗšŸ‡øUnited States tr Cascadia
  • Pipeline finished with Success
    about 1 month ago
    Total: 122s
    #390897
  • Pipeline finished with Failed
    about 1 month ago
    Total: 52s
    #390942
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    Thanks for linking to that issue, I've made some more progress, from #10 #3498323-10: Add missing words to dictionary eg varchar ā†’

    I have pushed a few tempoary changes to this MR, to force a fault by adding 'varchar' in a .php and a .sh file. The verbose job log shows that extra dictionaries are being checked on the .php file and the word is not reported because it is in the 'fullstack' dictionary. However it is reported in the .sh file
    https://git.drupalcode.org/issue/gitlab_templates-3494834/-/jobs/3958908...

    If the word was in our custom dictionary then that would be used for every file and the word would never get reported. I think that is the behavior we want.

  • Pipeline finished with Success
    about 1 month ago
    Total: 54s
    #391036
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    I have now added 'fullstack' to the dictionary list and the job passes.
    https://git.drupalcode.org/issue/gitlab_templates-3494834/-/jobs/3960072

    In the failed job, the extra internal dictionaries are added by default, even if not specifically enabled in the .cspell.json file. The list for a .sh file is

    Checked: scripts/set-default-tag.sh, File type: shellscript, Language: en-US ... Issues: 1
    Dictionaries Used:
    aws, companies, cryptocurrencies, filetypes, public-licenses,
    softwareTerms, coding-compound-terms, computing-acronyms,
    software-term-suggestions, web-services, allowed-words, fonts, html, php,
    misc, typescript, node, css, bash, npm, lorem-ipsum,
    en_us, en-us-common-misspellings, shellscript
    

    This is because of the default behaviour of how cspell is designed. It checks some extra dictionaries based on the file type.

    The list of dictionaries specified in .cspell.json can be thought of as "always used these regardless of the file type, in addition to the defauts we automatically add". I think that behavior can be avoided by using --no-default-configuration but I have not tested that yet.

    The passing job had one extra dictionary in the list:

    Dictionaries Used:
    aws, companies, cryptocurrencies, filetypes, public-licenses,
    softwareTerms, coding-compound-terms, computing-acronyms,
    software-term-suggestions, web-services, allowed-words, fonts, html, php,
    misc, typescript, node, css, bash, npm, lorem-ipsum,
    fullstack, en_us, en-us-common-misspellings, shellscript
    

    because I added 'fullstack' into the array in .cspell.json.

    My hunch is that following the core changes in 11.1 the cspell job would report a failure for 'varchar' if it was written in a .sh file or one of the other file types which does not load the 'fullstack' extra dictionary.

    So I think our way forward is to add the sql and fullstack dictionaries into our list, so that they are checked regardless of the file type. Also we could add some specific curlopt constants into our extra allowed words, if that would be helpful.

    'endapply' was validly dropped from the core dictionary between 11.0 and 11.1 because that word was used in .twig files in 11.0 but does not appear anywhere in the 11.1 source files.

    'curlopt' was dropped from the dictionarry but Core 11.1 has introduced some // cspell:ignore curlopt returntransfer and // cspell:ignore curle curlopt customrequest failonerror postfields
    I don't know why that is better than having those two words in the dictionary where everyoone could use them. We can add those back if we decide its a good idea.

  • Pipeline finished with Success
    about 1 month ago
    Total: 51s
    #391105
  • Pipeline finished with Success
    about 1 month ago
    Total: 52s
    #392073
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    Here is a test which checks three more files -

    • 14 words in 10.3 dropped in 10.4
    • 108 words in 10.4 dropped in 11.0
    • 64 words in 11.0 dropped in 11.1

    Total dropped = 186
    https://git.drupalcode.org/issue/scheduler-3356800/-/pipelines/392151
    You can see the different numbers of reported word in the 3 variants just for info, but it is the regular cspell (at core 11.1) which gives us the full picture. Of the 186 dropped words, some of these, for example varchar, are now in one or more of the built-in dictionaries, and only 95 are reported as unknown - in the attached.

    Of these 95, we clearly want to add 'curopt' and 'returntransfer' and we could add other curlopt constants listed here -
    https://github.com/streetsidesoftware/cspell-dicts/blob/main/dictionarie...
    I don't know which of these are commonly used. Maybe we shouldn't be adding anything that had not previosuly been in core dictionary?

    We now need to review this list of 95 to see if any others are common enough in Contrib to warrant us adding them in. I think the list will be small, but worth checking through carefully.

  • Pipeline finished with Success
    about 1 month ago
    Total: 112s
    #393036
  • Pipeline finished with Success
    about 1 month ago
    Total: 78s
    #393503
  • Pipeline finished with Success
    about 1 month ago
    Total: 53s
    #393591
  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    I have sifted through the 95 words, looking at the obvious oddities and identifying where they hads been used in previous core releases. Where they were obviously only for core files, mostly in tests or Drupal 6 migration files that no longer exist in D11, I have removed them from the checking list. The list is now down to 44, attached.

    https://git.drupalcode.org/issue/scheduler-3356800/-/jobs/3991461

    I have not checked the original use of these remaining 44, but someone else is welcome to do that. I have a few suggestions that we may want to add back

    • backlinks (because backlink is a valid word and suggested alternative)
    • mdhash
    • uploaders (because uploader is a valid word in software terms)
    • vocabs (because vocab is a valid word in en-us)

    I have pushed a change to add these four (but can undo it if you disagree) and this is now ready for review. Then we can get it merged and out for use by all contrib.

  • Pipeline finished with Success
    about 1 month ago
    Total: 114s
    #393592
  • šŸ‡ŖšŸ‡øSpain fjgarlin

    The changes look good to me, we are basically adding a few words which we know can be really common and the "SQL" and "fullstack" dictionaries, which also seem like good and generic enough to add to contrib.

    Downstream pipelines.

    RTBC from my end.

  • Pipeline finished with Skipped
    about 1 month ago
    #394456
  • šŸ‡ŖšŸ‡øSpain fjgarlin

    Merged.

  • šŸ‡¬šŸ‡§United Kingdom jonathan1055

    Thanks for merging. It will be good to have this in the upcoming release that you are doing.

  • šŸ‡¦šŸ‡¹Austria drunken monkey Vienna, Austria

    Thanks a lot, @jonathan1055, great job!

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024