Cspell: sanitize suggested words for dictionary

Created on 6 April 2024, about 1 year ago

Problem/Motivation

Currently the Cspell job generates an artifact with a txt file with the words to put in a dictionary ".cspell-project-words.txt" in a project.

But in this txt file, the list of words is from Cspell output:
- unsorted
- as it is found in the code: capitalized, lowercased, uppercased, any step in between.
- duplicated

Proposed resolution

Add a manipulation to sanitize this output.

In my project skeleton, I have a small script that do this sanitization: remove duplicate, lowercase, sort

https://gitlab.com/florenttorregrosa-drupal/docker-drupal-project/-/blob...

cat ${DICTIONARY_FILE_PATH} | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u -o ${DICTIONARY_FILE_PATH}

Maybe the best will be, if a project already have a ".cspell-project-words.txt" file, also provide a file with merged existing and new words in this project dictionary.

Remaining tasks

- Discuss if maintainers want such addition
- Provide MR

Feature request
Status

Active

Component

gitlab-ci

Created by

🇫🇷France Grimreaper France 🇫🇷

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @Grimreaper
  • 🇪🇸Spain fjgarlin

    I think it's a great feature to work on. I updated the issue description with the goal.

  • 🇬🇧United Kingdom jonathan1055

    Hi Grimreaper,
    This is a good idea. I have noticed that we get duplicated words with different case, so that should be improved. But the list is already sorted alpahbetically and de-duplicated (apart from case). Can you link to a job where this is not working for you?

  • 🇫🇷France Grimreaper France 🇫🇷

    Hi,

    Thanks for your feedbacks :)

    the list is already sorted alpahbetically and de-duplicated (apart from case)

    I checked https://git.drupalcode.org/project/ui_suite_bootstrap/-/jobs/1151883, it is effectively sorted and deduplicated. My bad, I saw duplicated words due to case and did not check further, sorry.

    Also, I think I kept in mind https://git.drupalcode.org/project/gin_lb/-/jobs/1120341, which at this moment did not had the "_cspell_unrecognized_words.txt" file generated.

    So I guess "only" case handling and suggesting a file with merged existing and new words in a project vocabulary is what remains to be done.

  • 🇪🇸Spain fjgarlin

    Updated the issue description to mention the merging part.

  • 🇬🇧United Kingdom jonathan1055

    Thanks for the links. I see in ui_suite_bootstrap you have lots of latin words. The number will reduce significantly when you start using the improvement from #3436206: Add lorem-ipsum spelling dictionary . This link shows everything that is committed to 'main' but have not yet reached 'default-ref'

  • 🇫🇷France Grimreaper France 🇫🇷

    Thanks for the feedback.

    This made me think, is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words.

    Like PHPStan that can tell when an ignore is not needed.

    In case, the answer is not trivial or already known, I will search for that when I will be back on the CI subject for my skeleton.

  • 🇬🇧United Kingdom jonathan1055

    is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words. Like PHPStan that can tell when an ignore is not needed

    That would be very useful, I've not seen an option, but please yes do some research on it.

  • 🇬🇧United Kingdom jonathan1055

    Also, following up from #6, now that #3436206: Add lorem-ipsum spelling dictionary is committed and gitlab_tempates default-ref updated, you can probably remove lots of the latin words from your custom dictionary. Or you could leave them there for now, and use your project as a test for any work we do here on identifying non-required ignored words.

  • 🇫🇷France Grimreaper France 🇫🇷

    Hi,

    Yes, I will currently keep it that way in ui_suite_bootstrap to do testing when I will have time to search for a system of detection of no more used words.

  • Assigned to Grimreaper
  • 🇫🇷France Grimreaper France 🇫🇷
  • 🇬🇧United Kingdom jonathan1055

    The suggested script in the issue summary does work. Could also add in the current dictionary, then could diff to make a patch for it. I used:

    export WORDS=dic-words.txt
    export DICTIONARY_FILE_PATH=dic.txt
    export DICTIONARY_FILE_PATH2=dic2.txt
    echo "Aa\naa\nAA" > $DICTIONARY_FILE_PATH 
    cat $DICTIONARY_FILE_PATH $WORDS
    cat ${DICTIONARY_FILE_PATH} $WORDS | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u -o ${DICTIONARY_FILE_PATH2}
    

    There are some other choices for LC_ALL, so we need decide on that.

  • 🇬🇧United Kingdom jonathan1055

    Sorry, just to be clear, I'm not trying to take the work over. It's assigned to you :-)

  • 🇫🇷France Grimreaper France 🇫🇷

    No problem :)

    Currently I am not working on this issue. I assigned it to myself to not forget about this issue. I am currently too spread out on too many issues.

    So if you want to do it no problem :)

  • Issue was unassigned.
  • 🇪🇸Spain fjgarlin

    Marking as unassigned based on #14.

  • Merge request !303#3439240 Lower-case deduplicated words file → (Merged) created by jonathan1055
  • Pipeline finished with Success
    4 months ago
    Total: 52s
    #363458
  • Pipeline finished with Success
    4 months ago
    #363550
  • Pipeline finished with Success
    4 months ago
    Total: 112s
    #363563
  • Pipeline finished with Success
    4 months ago
    #363566
  • 🇬🇧United Kingdom jonathan1055

    I have pushed some initial commits. Summary of what is done:

    1. The existing artifact _cspell_unrecognized_words.txt is now converted to all lower-case and deduplicated (it was already sorted by cspell)
    2. New artifact _cspell_updated_project_words.txt contains everything in the current project dictionary .cspell-project-words.txt plus the new words in _cspell_unrecognized_words.txt, all converted to lowercase and re-sorted
    3. If the project has no dictionary, then _cspell_updated_project_words.txt will just be the same as _cspell_unrecognized_words.txt
    4. If there are no new unrecognized words then _cspell_updated_project_words.txt is empty (this logic could be changed, as it may be helpful to provide a sorted, lowercased de-duped version of the file)

    My initial attempt to read two files together using
    tr '[:upper:]' '[:lower:]' < $PROJECT_DICTIONARY < $WORDS_FILE worked fine locally, but in the pipeline job only the second file seems to be read. So I changed it to
    cat $PROJECT_DICTIONARY $WORDS_FILE | tr '[:upper:]' '[:lower:]' and this works as intended.

    Tested when there are no spelling errors - the two artifact files are empty.
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3637729

    Tested when two new words are added, in different versions of case.
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3638728

    Ready for review and feedback. As usual there is debug to remove. Probably need to add a line or two into the documentation.

  • 🇬🇧United Kingdom jonathan1055

    From #7 and #8

    Is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words. Like PHPStan that can tell when an ignore is not needed

    We could do that easily here. Just make another run of cspell without the project dictionary, with no log output, but writing all unrecognized words to a file, then compare that with the project dictionary to show which words are no longer needed.

  • 🇪🇸Spain fjgarlin

    Re #18, while we could do that, it kind of sounds like a feature that should be in "cspell", same as the one in "phpstan". A quick google search shows this https://www.npmjs.com/package/cspell-check-unused-words, which we could try to apply.

    Agree that we just need to add an explanation about the files in the documentation, but the code overall looks good.

    Back to NW for the documentation and just in case we want to try that plugin (if it works great, if it doesn't maybe a follow-up).

  • 🇬🇧United Kingdom jonathan1055

    Looking at that package, the readme and the test source implies it does the check against config 'words' list.

    Over time, the word ignore list in the "cspell.json" file will become quite large. ...

    https://github.com/complete-ts/cspell-check-unused-words/blob/main/tests
    This would only be partially useful, as it would not help anyone who is using a project-words dictionary file. I tried to look at the source to see if it also allows checking against certain dictionary files, but could not work out if it does.

    Does that change your answer to us providing this functionality here? I guess to be complete, our solution would have to get the words list specfied in the projects .cspell.json and also the environment variable.

  • Pipeline finished with Success
    4 months ago
    Total: 55s
    #364398
  • Pipeline finished with Success
    4 months ago
    Total: 53s
    #364412
  • 🇪🇸Spain fjgarlin

    My biggest concern is to go into a rabbit hole where we start checking dictionaries, as a project can have many, and those dictionaries (core, lorem ipsum...) might have unused words.

    If we stick to only the project ones, how can we tell the project's dictionaries (which can be named differently) from any other ones, and then see how it mixes up with the words variables and/or configuration file.

    For me, the work done so far is good enough for a RTBC as it is an improvement, but I'd let you comment on my above concerns to see if we draw the line here or we continue.

  • Pipeline finished with Success
    4 months ago
    Total: 57s
    #370039
  • 🇬🇧United Kingdom jonathan1055

    Yes absolutely, we would not do anything with the other dictionaries, only the project's custom file. I think it might be helpful if we had a variable for the name of the default dictionary, instead of hard-coding it, then we could use that file as "the" project dictionary, and it would be more flexible. I'm going to investigate that work here, but leave the actual processing the same as now, given that you have reviewed it to this point . Then I will open a follow-up issue to look at the work of reporting unmatched words in the project dictionary + config + words list.

  • Pipeline finished with Failed
    4 months ago
    Total: 51s
    #371096
  • 🇬🇧United Kingdom jonathan1055

    I have pushed the changes to use a variable _CSPELL_DICTIONARY instead of hard-coding '.cspell-project-words.txt'

    Using the default value, lots of debug included, and it runs as expected
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731335

    Custom name for project file 'project-custom-name.txt' but not using the variable. The dictionary is not used (as expected)
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731484

    Now with _CSPELL_DICTIONARY defined in the pipeline UI form as 'project-custom-name.txt'
    https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731544

  • 🇪🇸Spain fjgarlin

    I made a minor comment in the MR. I think we can remove the debug and it should be good to RTBC. But NW for now just for the previous things.

  • Pipeline finished with Success
    4 months ago
    Total: 52s
    #372596
  • Pipeline finished with Success
    4 months ago
    Total: 50s
    #373745
  • Pipeline finished with Success
    4 months ago
    Total: 53s
    #373757
  • 🇬🇧United Kingdom jonathan1055

    I've addressed the MR comments, but left the debug in for now. Re-testing on the branch as in #23 but in the reverse order

    Project's before script renames the project dictionary to 'project-custom-name.txt' and then use pipeline form variable CSPELL_DICTIONARY = 'project-custom-name.txt'. The renamed project dictionary is used.
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373806

    Custom name for project file 'project-custom-name.txt' but not using the variable. The dictionary is not used (as expected)
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373821

    Project file restored to default value, dictionary is used
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373833

    The debug in the cspell pipeline job will come out. But with the debug added in the prepare-cspell.php script, I think it would be useful to leave this in, but add a --verbose parameter (off by default) so that we can easily re-use this, instead of deleting and then re-adding next time. We have the same in scripts/unformatted-links.php?ref_type=heads#L7-17

  • 🇪🇸Spain fjgarlin

    Agree, happy to add the verbose option to the prepare-cspell.php script.

    Other than this, it looks good. Just setting to NW to clean up some of the debug and to put some behind a verbose flag.

  • Pipeline finished with Success
    4 months ago
    Total: 52s
    #374002
  • Pipeline finished with Success
    4 months ago
    Total: 53s
    #374023
  • 🇬🇧United Kingdom jonathan1055

    I have added the --verbose flag in prepare-cspell.php. I also changed the single positional argument for the test suffix to be a named argument --suffix. This is only used locally, when manually running the script, so I though it was OK to change the way it is called. It keeps the code simple and consistent with the other arguments. Each argument also has it's single letter shortcut -v and -s

    The extra output in the job step is gone, but I have used the --verbose script in the execution, just to check it works in the pipeline, then will remove that too.

    Other minor changes:

    • add one echo in the job to say that the full updated dictionary has been created as an artifact
    • change in assets/internal/.cspell.json to only ignore the .cspell.json file not the entire assets folder. We don't want spelling errors in the assets files. This now checks 71 files not 62
    • Added array_filter( ) when processing $cspell_json['words'] to remove any empty array values

    I will run one test then remove the --verbose flag.

  • Pipeline finished with Success
    4 months ago
    Total: 51s
    #374114
  • Pipeline finished with Success
    4 months ago
    Total: 54s
    #374129
  • 🇬🇧United Kingdom jonathan1055

    Test using --verbose
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/374059

    After removing the debug lines I could see more clearly the output being written to the log, and decided it was better to have the hint url link at the end. I also realised that I had accidentally deleted one of the required echo & cat lines which displays the cleaned-up list of words, so I have put that back in.

    Final test with --verbose removed.
    https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/374130

    Ready for review.

  • 🇪🇸Spain fjgarlin

    Downstream pipelines: https://git.drupalcode.org/issue/gitlab_templates-3439240/-/pipelines/37...

    I reviewed the code again and it looks good. RTBC.

  • 🇪🇸Spain fjgarlin

    Merged to main. Thanks so much!

  • 🇬🇧United Kingdom jonathan1055

    It was good to work on these enhancements, and also make some other improvements to the process along the way. Thanks for all the reviews - I try to make it easy to re-review by being clear in the comments and commit messages about what has changed since your previous review, and also putting changes into separate commits when they are distinct.

    From the comments in #18-22, now that we have the variable for the project dictionary file, I may open a follow-up issue to report ignored words that are no longer needed. But there are other things to work on before that.

  • 🇫🇷France Grimreaper France 🇫🇷

    Hi,

    Thanks both of you for the work done here and sorry that I not had the time to work on it.

  • 🇪🇸Spain fjgarlin

    @jonathan1055 - re #32: that goes a long way to reviewing the changes. I usually check the commit by commit and the issue comment and that helps me a lot. Then at the end of the issue I do an overall review again just in case. I appreciate the clarity of information/changes and tests between reviews.

    Happy for the follow-up to be open and worked on.

    --

    @grimreaper - no worries, even opening useful issues and providing possible fixes or code snippets you're also helping, so we appreciate that too.

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024