Cspell: sanitize suggested words for dictionary

Issue created by @Grimreaper
Comment over 1 year ago →
🇪🇸Spain fjgarlin
I think it's a great feature to work on. I updated the issue description with the goal.
Comment over 1 year ago →
🇬🇧United Kingdom jonathan1055
Hi Grimreaper,
This is a good idea. I have noticed that we get duplicated words with different case, so that should be improved. But the list is already sorted alpahbetically and de-duplicated (apart from case). Can you link to a job where this is not working for you?
Comment over 1 year ago →
🇫🇷France Grimreaper France 🇫🇷
Hi,

Thanks for your feedbacks :)

the list is already sorted alpahbetically and de-duplicated (apart from case)

I checked https://git.drupalcode.org/project/ui_suite_bootstrap/-/jobs/1151883, it is effectively sorted and deduplicated. My bad, I saw duplicated words due to case and did not check further, sorry.

Also, I think I kept in mind https://git.drupalcode.org/project/gin_lb/-/jobs/1120341, which at this moment did not had the "_cspell_unrecognized_words.txt" file generated.

So I guess "only" case handling and suggesting a file with merged existing and new words in a project vocabulary is what remains to be done.
Comment over 1 year ago →
🇪🇸Spain fjgarlin
Updated the issue description to mention the merging part.
Comment over 1 year ago →
🇬🇧United Kingdom jonathan1055
Thanks for the links. I see in ui_suite_bootstrap you have lots of latin words. The number will reduce significantly when you start using the improvement from #3436206: Add lorem-ipsum spelling dictionary → . This link shows everything that is committed to 'main' but have not yet reached 'default-ref'
Comment over 1 year ago →
🇫🇷France Grimreaper France 🇫🇷
Thanks for the feedback.

This made me think, is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words.

Like PHPStan that can tell when an ignore is not needed.

In case, the answer is not trivial or already known, I will search for that when I will be back on the CI subject for my skeleton.
Comment over 1 year ago →
🇬🇧United Kingdom jonathan1055
is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words. Like PHPStan that can tell when an ignore is not needed

That would be very useful, I've not seen an option, but please yes do some research on it.
Comment over 1 year ago →
🇬🇧United Kingdom jonathan1055
Also, following up from #6, now that #3436206: Add lorem-ipsum spelling dictionary → is committed and gitlab_tempates default-ref updated, you can probably remove lots of the latin words from your custom dictionary. Or you could leave them there for now, and use your project as a test for any work we do here on identifying non-required ignored words.
Comment about 1 year ago →
🇫🇷France Grimreaper France 🇫🇷
Hi,

Yes, I will currently keep it that way in ui_suite_bootstrap to do testing when I will have time to search for a system of detection of no more used words.
Assigned to Grimreaper
Comment about 1 year ago →
🇫🇷France Grimreaper France 🇫🇷

🇬🇧United Kingdom jonathan1055

The suggested script in the issue summary does work. Could also add in the current dictionary, then could diff to make a patch for it. I used:

export WORDS=dic-words.txt
export DICTIONARY_FILE_PATH=dic.txt
export DICTIONARY_FILE_PATH2=dic2.txt
echo "Aa\naa\nAA" > $DICTIONARY_FILE_PATH 
cat $DICTIONARY_FILE_PATH $WORDS
cat ${DICTIONARY_FILE_PATH} $WORDS | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u -o ${DICTIONARY_FILE_PATH2}

There are some other choices for LC_ALL, so we need decide on that.

Comment about 1 year ago →
🇬🇧United Kingdom jonathan1055
Sorry, just to be clear, I'm not trying to take the work over. It's assigned to you :-)
Comment about 1 year ago →
🇫🇷France Grimreaper France 🇫🇷
No problem :)

Currently I am not working on this issue. I assigned it to myself to not forget about this issue. I am currently too spread out on too many issues.

So if you want to do it no problem :)
Issue was unassigned.
Comment about 1 year ago →
🇪🇸Spain fjgarlin
Marking as unassigned based on #14.
Merge request !303#3439240 Lower-case deduplicated words file → (Merged) created by jonathan1055
Pipeline finished with Success
7 months ago
Total: 52s
#363458
Pipeline finished with Success
7 months ago
#363550
Pipeline finished with Success
7 months ago
Total: 112s
#363563
Pipeline finished with Success
7 months ago
#363566
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
I have pushed some initial commits. Summary of what is done:

The existing artifact _cspell_unrecognized_words.txt is now converted to all lower-case and deduplicated (it was already sorted by cspell)

New artifact _cspell_updated_project_words.txt contains everything in the current project dictionary .cspell-project-words.txt plus the new words in _cspell_unrecognized_words.txt, all converted to lowercase and re-sorted

If the project has no dictionary, then _cspell_updated_project_words.txt will just be the same as _cspell_unrecognized_words.txt

If there are no new unrecognized words then _cspell_updated_project_words.txt is empty (this logic could be changed, as it may be helpful to provide a sorted, lowercased de-duped version of the file)

My initial attempt to read two files together using
tr '[:upper:]' '[:lower:]' < $PROJECT_DICTIONARY < $WORDS_FILE worked fine locally, but in the pipeline job only the second file seems to be read. So I changed it to
cat $PROJECT_DICTIONARY $WORDS_FILE | tr '[:upper:]' '[:lower:]' and this works as intended.

Tested when there are no spelling errors - the two artifact files are empty.
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3637729

Tested when two new words are added, in different versions of case.
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3638728

Ready for review and feedback. As usual there is debug to remove. Probably need to add a line or two into the documentation.
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
From #7 and #8

Is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words. Like PHPStan that can tell when an ignore is not needed

We could do that easily here. Just make another run of cspell without the project dictionary, with no log output, but writing all unrecognized words to a file, then compare that with the project dictionary to show which words are no longer needed.
Comment 7 months ago →
🇪🇸Spain fjgarlin
Re #18, while we could do that, it kind of sounds like a feature that should be in "cspell", same as the one in "phpstan". A quick google search shows this https://www.npmjs.com/package/cspell-check-unused-words, which we could try to apply.

Agree that we just need to add an explanation about the files in the documentation, but the code overall looks good.

Back to NW for the documentation and just in case we want to try that plugin (if it works great, if it doesn't maybe a follow-up).
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
Looking at that package, the readme and the test source implies it does the check against config 'words' list.

Over time, the word ignore list in the "cspell.json" file will become quite large. ...

https://github.com/complete-ts/cspell-check-unused-words/blob/main/tests
This would only be partially useful, as it would not help anyone who is using a project-words dictionary file. I tried to look at the source to see if it also allows checking against certain dictionary files, but could not work out if it does.

Does that change your answer to us providing this functionality here? I guess to be complete, our solution would have to get the words list specfied in the projects .cspell.json and also the environment variable.
Pipeline finished with Success
7 months ago
Total: 55s
#364398
Pipeline finished with Success
7 months ago
Total: 53s
#364412
Comment 7 months ago →
🇪🇸Spain fjgarlin
My biggest concern is to go into a rabbit hole where we start checking dictionaries, as a project can have many, and those dictionaries (core, lorem ipsum...) might have unused words.

If we stick to only the project ones, how can we tell the project's dictionaries (which can be named differently) from any other ones, and then see how it mixes up with the words variables and/or configuration file.

For me, the work done so far is good enough for a RTBC as it is an improvement, but I'd let you comment on my above concerns to see if we draw the line here or we continue.
Pipeline finished with Success
7 months ago
Total: 57s
#370039
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
Yes absolutely, we would not do anything with the other dictionaries, only the project's custom file. I think it might be helpful if we had a variable for the name of the default dictionary, instead of hard-coding it, then we could use that file as "the" project dictionary, and it would be more flexible. I'm going to investigate that work here, but leave the actual processing the same as now, given that you have reviewed it to this point . Then I will open a follow-up issue to look at the work of reporting unmatched words in the project dictionary + config + words list.
Pipeline finished with Failed
7 months ago
Total: 51s
#371096
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
I have pushed the changes to use a variable _CSPELL_DICTIONARY instead of hard-coding '.cspell-project-words.txt'

Using the default value, lots of debug included, and it runs as expected
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731335

Custom name for project file 'project-custom-name.txt' but not using the variable. The dictionary is not used (as expected)
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731484

Now with _CSPELL_DICTIONARY defined in the pipeline UI form as 'project-custom-name.txt'
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731544
Comment 7 months ago →
🇪🇸Spain fjgarlin
I made a minor comment in the MR. I think we can remove the debug and it should be good to RTBC. But NW for now just for the previous things.
Pipeline finished with Success
7 months ago
Total: 52s
#372596
Pipeline finished with Success
7 months ago
Total: 50s
#373745
Pipeline finished with Success
7 months ago
Total: 53s
#373757
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
I've addressed the MR comments, but left the debug in for now. Re-testing on the branch as in #23 but in the reverse order

Project's before script renames the project dictionary to 'project-custom-name.txt' and then use pipeline form variable CSPELL_DICTIONARY = 'project-custom-name.txt'. The renamed project dictionary is used.
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373806

Custom name for project file 'project-custom-name.txt' but not using the variable. The dictionary is not used (as expected)
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373821

Project file restored to default value, dictionary is used
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373833

The debug in the cspell pipeline job will come out. But with the debug added in the prepare-cspell.php script, I think it would be useful to leave this in, but add a --verbose parameter (off by default) so that we can easily re-use this, instead of deleting and then re-adding next time. We have the same in scripts/unformatted-links.php?ref_type=heads#L7-17
Comment 7 months ago →
🇪🇸Spain fjgarlin
Agree, happy to add the verbose option to the prepare-cspell.php script.

Other than this, it looks good. Just setting to NW to clean up some of the debug and to put some behind a verbose flag.
Pipeline finished with Success
7 months ago
Total: 52s
#374002
Pipeline finished with Success
7 months ago
Total: 53s
#374023
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
I have added the --verbose flag in prepare-cspell.php. I also changed the single positional argument for the test suffix to be a named argument --suffix. This is only used locally, when manually running the script, so I though it was OK to change the way it is called. It keeps the code simple and consistent with the other arguments. Each argument also has it's single letter shortcut -v and -s

The extra output in the job step is gone, but I have used the --verbose script in the execution, just to check it works in the pipeline, then will remove that too.

Other minor changes:

add one echo in the job to say that the full updated dictionary has been created as an artifact

change in assets/internal/.cspell.json to only ignore the .cspell.json file not the entire assets folder. We don't want spelling errors in the assets files. This now checks 71 files not 62

Added array_filter( ) when processing $cspell_json['words'] to remove any empty array values

I will run one test then remove the --verbose flag.
Pipeline finished with Success
7 months ago
Total: 51s
#374114
Pipeline finished with Success
7 months ago
Total: 54s
#374129
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
Test using --verbose
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/374059

After removing the debug lines I could see more clearly the output being written to the log, and decided it was better to have the hint url link at the end. I also realised that I had accidentally deleted one of the required echo & cat lines which displays the cleaned-up list of words, so I have put that back in.

Final test with --verbose removed.
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/374130

Ready for review.
Comment 7 months ago →
🇪🇸Spain fjgarlin
Downstream pipelines: https://git.drupalcode.org/issue/gitlab_templates-3439240/-/pipelines/37...

I reviewed the code again and it looks good. RTBC.
Comment 7 months ago →
System Message

fjgarlin → committed 1fef4ecc on main authored by jonathan1055 →
Issue #3439240 by jonathan1055, fjgarlin, grimreaper: Cspell: sanitize...
Comment 7 months ago →
🇪🇸Spain fjgarlin
Merged to main. Thanks so much!
Comment 7 months ago →
🇬🇧United Kingdom jonathan1055
It was good to work on these enhancements, and also make some other improvements to the process along the way. Thanks for all the reviews - I try to make it easy to re-review by being clear in the comments and commit messages about what has changed since your previous review, and also putting changes into separate commits when they are distinct.

From the comments in #18-22, now that we have the variable for the project dictionary file, I may open a follow-up issue to report ignored words that are no longer needed. But there are other things to work on before that.
Comment 7 months ago →
🇫🇷France Grimreaper France 🇫🇷
Hi,

Thanks both of you for the work done here and sorry that I not had the time to work on it.
Comment 7 months ago →
🇪🇸Spain fjgarlin
@jonathan1055 - re #32: that goes a long way to reviewing the changes. I usually check the commit by commit and the issue comment and that helps me a lot. Then at the end of the issue I do an overall review again just in case. I appreciate the clarity of information/changes and tests between reviews.

Happy for the follow-up to be open and worked on.

--

@grimreaper - no worries, even opening useful issues and providing possible fixes or code snippets you're also helping, so we appreciate that too.
Comment 6 months ago →
System Message
Automatically closed - issue fixed for 2 weeks with no activity.

Cspell: sanitize suggested words for dictionary

Problem/Motivation

Proposed resolution

Remaining tasks

Merge Requests

!303Cspell: sanitize suggested words for dictionary
Merged

Comments & Activities

Cspell: sanitize suggested words for dictionary

Problem/Motivation

Proposed resolution

Remaining tasks

Merge Requests

!303Cspell: sanitize suggested words for dictionaryMerged

Comments & Activities

!303Cspell: sanitize suggested words for dictionary
Merged