- Issue created by @Grimreaper
- 🇪🇸Spain fjgarlin
I think it's a great feature to work on. I updated the issue description with the goal.
- 🇬🇧United Kingdom jonathan1055
Hi Grimreaper,
This is a good idea. I have noticed that we get duplicated words with different case, so that should be improved. But the list is already sorted alpahbetically and de-duplicated (apart from case). Can you link to a job where this is not working for you? - 🇫🇷France Grimreaper France 🇫🇷
Hi,
Thanks for your feedbacks :)
the list is already sorted alpahbetically and de-duplicated (apart from case)
I checked https://git.drupalcode.org/project/ui_suite_bootstrap/-/jobs/1151883, it is effectively sorted and deduplicated. My bad, I saw duplicated words due to case and did not check further, sorry.
Also, I think I kept in mind https://git.drupalcode.org/project/gin_lb/-/jobs/1120341, which at this moment did not had the "_cspell_unrecognized_words.txt" file generated.
So I guess "only" case handling and suggesting a file with merged existing and new words in a project vocabulary is what remains to be done.
- 🇬🇧United Kingdom jonathan1055
Thanks for the links. I see in ui_suite_bootstrap you have lots of latin words. The number will reduce significantly when you start using the improvement from #3436206: Add lorem-ipsum spelling dictionary → . This link shows everything that is committed to 'main' but have not yet reached 'default-ref'
- 🇫🇷France Grimreaper France 🇫🇷
Thanks for the feedback.
This made me think, is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words.
Like PHPStan that can tell when an ignore is not needed.
In case, the answer is not trivial or already known, I will search for that when I will be back on the CI subject for my skeleton.
- 🇬🇧United Kingdom jonathan1055
is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words. Like PHPStan that can tell when an ignore is not needed
That would be very useful, I've not seen an option, but please yes do some research on it.
- 🇬🇧United Kingdom jonathan1055
Also, following up from #6, now that #3436206: Add lorem-ipsum spelling dictionary → is committed and gitlab_tempates default-ref updated, you can probably remove lots of the latin words from your custom dictionary. Or you could leave them there for now, and use your project as a test for any work we do here on identifying non-required ignored words.
- 🇫🇷France Grimreaper France 🇫🇷
Hi,
Yes, I will currently keep it that way in ui_suite_bootstrap to do testing when I will have time to search for a system of detection of no more used words.
- Assigned to Grimreaper
- 🇬🇧United Kingdom jonathan1055
The suggested script in the issue summary does work. Could also add in the current dictionary, then could
diff
to make a patch for it. I used:export WORDS=dic-words.txt export DICTIONARY_FILE_PATH=dic.txt export DICTIONARY_FILE_PATH2=dic2.txt echo "Aa\naa\nAA" > $DICTIONARY_FILE_PATH cat $DICTIONARY_FILE_PATH $WORDS cat ${DICTIONARY_FILE_PATH} $WORDS | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u -o ${DICTIONARY_FILE_PATH2}
There are some other choices for LC_ALL, so we need decide on that.
- 🇬🇧United Kingdom jonathan1055
Sorry, just to be clear, I'm not trying to take the work over. It's assigned to you :-)
- 🇫🇷France Grimreaper France 🇫🇷
No problem :)
Currently I am not working on this issue. I assigned it to myself to not forget about this issue. I am currently too spread out on too many issues.
So if you want to do it no problem :)
- Issue was unassigned.
- 🇬🇧United Kingdom jonathan1055
I have pushed some initial commits. Summary of what is done:
- The existing artifact
_cspell_unrecognized_words.txt
is now converted to all lower-case and deduplicated (it was already sorted by cspell) - New artifact
_cspell_updated_project_words.txt
contains everything in the current project dictionary.cspell-project-words.txt
plus the new words in_cspell_unrecognized_words.txt
, all converted to lowercase and re-sorted - If the project has no dictionary, then
_cspell_updated_project_words.txt
will just be the same as_cspell_unrecognized_words.txt
- If there are no new unrecognized words then
_cspell_updated_project_words.txt
is empty (this logic could be changed, as it may be helpful to provide a sorted, lowercased de-duped version of the file)
My initial attempt to read two files together using
tr '[:upper:]' '[:lower:]' < $PROJECT_DICTIONARY < $WORDS_FILE
worked fine locally, but in the pipeline job only the second file seems to be read. So I changed it to
cat $PROJECT_DICTIONARY $WORDS_FILE | tr '[:upper:]' '[:lower:]'
and this works as intended.Tested when there are no spelling errors - the two artifact files are empty.
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3637729Tested when two new words are added, in different versions of case.
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3638728Ready for review and feedback. As usual there is debug to remove. Probably need to add a line or two into the documentation.
- The existing artifact
- 🇬🇧United Kingdom jonathan1055
From #7 and #8
Is there a mechanism in Spellcheck that tell if a word in a dictionary is not found in the analyzed codebase? to curate the list of custom words. Like PHPStan that can tell when an ignore is not needed
We could do that easily here. Just make another run of cspell without the project dictionary, with no log output, but writing all unrecognized words to a file, then compare that with the project dictionary to show which words are no longer needed.
- 🇪🇸Spain fjgarlin
Re #18, while we could do that, it kind of sounds like a feature that should be in "cspell", same as the one in "phpstan". A quick google search shows this https://www.npmjs.com/package/cspell-check-unused-words, which we could try to apply.
Agree that we just need to add an explanation about the files in the documentation, but the code overall looks good.
Back to NW for the documentation and just in case we want to try that plugin (if it works great, if it doesn't maybe a follow-up).
- 🇬🇧United Kingdom jonathan1055
Looking at that package, the readme and the test source implies it does the check against config 'words' list.
Over time, the word ignore list in the "cspell.json" file will become quite large. ...
https://github.com/complete-ts/cspell-check-unused-words/blob/main/tests
This would only be partially useful, as it would not help anyone who is using a project-words dictionary file. I tried to look at the source to see if it also allows checking against certain dictionary files, but could not work out if it does.Does that change your answer to us providing this functionality here? I guess to be complete, our solution would have to get the words list specfied in the projects .cspell.json and also the environment variable.
- 🇪🇸Spain fjgarlin
My biggest concern is to go into a rabbit hole where we start checking dictionaries, as a project can have many, and those dictionaries (core, lorem ipsum...) might have unused words.
If we stick to only the project ones, how can we tell the project's dictionaries (which can be named differently) from any other ones, and then see how it mixes up with the words variables and/or configuration file.
For me, the work done so far is good enough for a RTBC as it is an improvement, but I'd let you comment on my above concerns to see if we draw the line here or we continue.
- 🇬🇧United Kingdom jonathan1055
Yes absolutely, we would not do anything with the other dictionaries, only the project's custom file. I think it might be helpful if we had a variable for the name of the default dictionary, instead of hard-coding it, then we could use that file as "the" project dictionary, and it would be more flexible. I'm going to investigate that work here, but leave the actual processing the same as now, given that you have reviewed it to this point . Then I will open a follow-up issue to look at the work of reporting unmatched words in the project dictionary + config + words list.
- 🇬🇧United Kingdom jonathan1055
I have pushed the changes to use a variable
_CSPELL_DICTIONARY
instead of hard-coding '.cspell-project-words.txt'Using the default value, lots of debug included, and it runs as expected
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731335Custom name for project file 'project-custom-name.txt' but not using the variable. The dictionary is not used (as expected)
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731484Now with
_CSPELL_DICTIONARY
defined in the pipeline UI form as 'project-custom-name.txt'
https://git.drupalcode.org/issue/scheduler-3445052/-/jobs/3731544 - 🇪🇸Spain fjgarlin
I made a minor comment in the MR. I think we can remove the debug and it should be good to RTBC. But NW for now just for the previous things.
- 🇬🇧United Kingdom jonathan1055
I've addressed the MR comments, but left the debug in for now. Re-testing on the branch as in #23 but in the reverse order
Project's before script renames the project dictionary to 'project-custom-name.txt' and then use pipeline form variable
CSPELL_DICTIONARY
= 'project-custom-name.txt'. The renamed project dictionary is used.
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373806Custom name for project file 'project-custom-name.txt' but not using the variable. The dictionary is not used (as expected)
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373821Project file restored to default value, dictionary is used
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/373833The debug in the cspell pipeline job will come out. But with the debug added in the prepare-cspell.php script, I think it would be useful to leave this in, but add a
--verbose
parameter (off by default) so that we can easily re-use this, instead of deleting and then re-adding next time. We have the same in scripts/unformatted-links.php?ref_type=heads#L7-17 - 🇪🇸Spain fjgarlin
Agree, happy to add the verbose option to the
prepare-cspell.php
script.Other than this, it looks good. Just setting to NW to clean up some of the debug and to put some behind a verbose flag.
- 🇬🇧United Kingdom jonathan1055
I have added the
--verbose
flag in prepare-cspell.php. I also changed the single positional argument for the test suffix to be a named argument--suffix
. This is only used locally, when manually running the script, so I though it was OK to change the way it is called. It keeps the code simple and consistent with the other arguments. Each argument also has it's single letter shortcut-v
and-s
The extra output in the job step is gone, but I have used the --verbose script in the execution, just to check it works in the pipeline, then will remove that too.
Other minor changes:
- add one echo in the job to say that the full updated dictionary has been created as an artifact
- change in
assets/internal/.cspell.json
to only ignore the .cspell.json file not the entire assets folder. We don't want spelling errors in the assets files. This now checks 71 files not 62 - Added array_filter( ) when processing $cspell_json['words'] to remove any empty array values
I will run one test then remove the --verbose flag.
- 🇬🇧United Kingdom jonathan1055
Test using
--verbose
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/374059After removing the debug lines I could see more clearly the output being written to the log, and decided it was better to have the hint url link at the end. I also realised that I had accidentally deleted one of the required echo & cat lines which displays the cleaned-up list of words, so I have put that back in.
Final test with
--verbose
removed.
https://git.drupalcode.org/issue/scheduler-3445052/-/pipelines/374130Ready for review.
- 🇪🇸Spain fjgarlin
Downstream pipelines: https://git.drupalcode.org/issue/gitlab_templates-3439240/-/pipelines/37...
I reviewed the code again and it looks good. RTBC.
-
fjgarlin →
committed 1fef4ecc on main authored by
jonathan1055 →
Issue #3439240 by jonathan1055, fjgarlin, grimreaper: Cspell: sanitize...
-
fjgarlin →
committed 1fef4ecc on main authored by
jonathan1055 →
- 🇬🇧United Kingdom jonathan1055
It was good to work on these enhancements, and also make some other improvements to the process along the way. Thanks for all the reviews - I try to make it easy to re-review by being clear in the comments and commit messages about what has changed since your previous review, and also putting changes into separate commits when they are distinct.
From the comments in #18-22, now that we have the variable for the project dictionary file, I may open a follow-up issue to report ignored words that are no longer needed. But there are other things to work on before that.
- 🇫🇷France Grimreaper France 🇫🇷
Hi,
Thanks both of you for the work done here and sorry that I not had the time to work on it.
- 🇪🇸Spain fjgarlin
@jonathan1055 - re #32: that goes a long way to reviewing the changes. I usually check the commit by commit and the issue comment and that helps me a lot. Then at the end of the issue I do an overall review again just in case. I appreciate the clarity of information/changes and tests between reviews.
Happy for the follow-up to be open and worked on.
--
@grimreaper - no worries, even opening useful issues and providing possible fixes or code snippets you're also helping, so we appreciate that too.
Automatically closed - issue fixed for 2 weeks with no activity.