Recognition of Spanish contributors with 2 last names

Created on 16 May 2024, about 1 month ago
Updated 13 June 2024, 13 days ago

We're working for a university that needs some improvements to the BibTeX modules. They download data from Zotero, and some of the fields are not being imported correctly, so we'll be building some features to extend the bibtex module. We will keep the issues separate so you can approve any of them individually.

In this specific issue, we have noticed that when two last names aren't separated by a "y" (the Spanish "and"), the adci library doesn't recognize them as two last names: it marks the first last name as the middle name and the second as the only last name. As we are working for a Spanish university, they need the authors to be recognized properly.

I know the proposed solution is not optimal: we mark the entity so that, once processed, it is reprocessed. But this is important, so that if someone is named Calabrés Garcia, Pau, the name isn't changed to Calabrés y Garcia, Pau.
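
As a rough illustration of the approach (a minimal sketch in Python, not the code from the merge request; `mark_surnames` and `unmark_last_name` are hypothetical names), the idea is to insert the "y" only when exactly two bare surnames precede the comma, remember that the name was marked, and strip the conjunction again after parsing:

```python
from typing import Tuple

MARKER = " y "  # the Spanish conjunction the parser already understands


def mark_surnames(name: str) -> Tuple[str, bool]:
    """Insert ' y ' between two bare surnames in 'Surnames, Given' form.

    Returns the (possibly modified) name and a flag saying whether it was marked.
    """
    surnames, sep, given = name.partition(",")
    parts = surnames.split()
    # Only touch exactly two tokens; 'X y Z' and 'X de Z' have three and are skipped.
    if sep and len(parts) == 2:
        return f"{parts[0]}{MARKER}{parts[1]},{given}", True
    return name, False


def unmark_last_name(last_name: str, was_marked: bool) -> str:
    """Undo the marking after parsing, but only for names we marked ourselves."""
    if was_marked:
        return last_name.replace(MARKER, " ", 1)
    return last_name


marked, flag = mark_surnames("Calabrés Garcia, Pau")
print(marked, flag)
print(unmark_last_name("Calabrés y Garcia", flag))
```

A name that already contains "y" has three tokens before the comma, so it is left alone and its flag stays false, which is what keeps a genuine "Calabrés y Garcia" from being altered on the second pass.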

🐛 Bug report
Status

Needs review

Version

3.0

Component

Code

Created by

🇪🇸Spain paucala

Merge Requests

Comments & Activities

  • Issue created by @paucala
  • Merge request !30 "add suport to spanish contributors" → (Open) created by paucala
  • 🇺🇸United States AndrewGearhart

    @paucala Since you need to reparse the names anyway (the additional step you mentioned to remove the ' y '), I would recommend using an unusual boundary character as the basis for your replacement instead. 'y' can plausibly appear in a real name, for example: "Thomas y Jesus Jimenez Corporation". If you instead had the names entered as "Pau Calabrés|Garcia", the parser would parse this as a first and a last name, and you could then reparse the names in the database to replace the '|' pipe character with a space.
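
    A minimal sketch of that placeholder idea, in Python for illustration only: `naive_parse` is a hypothetical stand-in for a whitespace-based parser (it is not the adci library), and `bundle_surnames` and `restore_surnames` are made-up helper names.

```python
from typing import Dict, List

PLACEHOLDER = "|"  # a character that should never occur in a real name


def bundle_surnames(given: str, surnames: List[str]) -> str:
    """Build 'Given Sur1|Sur2' so the surnames form a single token."""
    return f"{given} {PLACEHOLDER.join(surnames)}"


def naive_parse(full_name: str) -> Dict[str, str]:
    """Stand-in parser: first token is the first name, last token the last name."""
    tokens = full_name.split()
    return {
        "first": tokens[0],
        "middle": " ".join(tokens[1:-1]),
        "last": tokens[-1],
    }


def restore_surnames(last_name: str) -> str:
    """Reparse step: turn the placeholder back into a space."""
    return last_name.replace(PLACEHOLDER, " ")


parsed = naive_parse(bundle_surnames("Pau", ["Calabrés", "Garcia"]))
parsed["last"] = restore_surnames(parsed["last"])
print(parsed)
```

    Without the placeholder, the same stand-in parser would file "Calabrés" under the middle name, which is the behaviour reported in this issue.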

    I'm working on some pretty significant revisions for names in a module that I plan to release shortly (likely before the end of May 2024). I'd like to confirm how the name is entered, which is it:

    1. Calabrés Garcia, Pau
    2. Pau Calabrés Garcia
    3. Pau Calabrés y Garcia
    4. Calabrés y Garcia, Pau

    My understanding is that you would like to enter it as #2, but you are planning to enter it as either #3 or #4.

    One thing I'm considering is adding more logic to the name pattern, so that it prioritizes fields based on the number of entries and anchors on common entries. This problem isn't limited to Spanish, either. It affects some Scottish names too, along with other name structures the parser doesn't handle well, for instance:

    • William Mc Gonnell
    • Maria de la Cruz

    If your dataset _isn't_ using middle names, the name parser could use a pattern such as:
    @prefix[#prefix_list] @first_name[1] @last_name[1-3]
    (from the available tokens) - @prefix, @leading_title, @first_name, @middle_name, @last_name, @nick, @suffix
    My theoretical breakdown of this would be that it would:

    1. greedily search for a prefix that is in the prefix_list for a prefix
    2. look for the next item to be a first_name
    3. look for any other items 1-3 to include as a last name
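
    The three steps above could be sketched roughly as follows. This is Python for illustration only, not the module's parser: `PREFIX_LIST` and `parse_name` are hypothetical, and the sketch assumes the dataset has no middle names.

```python
from typing import Dict

# Hypothetical prefix list; a real one would be configurable.
PREFIX_LIST = {"dr", "dr.", "prof", "prof.", "mr", "mr.", "mrs", "mrs.", "ms", "ms."}


def parse_name(full_name: str) -> Dict[str, str]:
    """Apply '@prefix[#prefix_list] @first_name[1] @last_name[1-3]' to a name."""
    tokens = full_name.split()

    # Step 1: greedily consume leading tokens that appear in the prefix list.
    prefixes = []
    while tokens and tokens[0].lower() in PREFIX_LIST:
        prefixes.append(tokens.pop(0))

    # Step 2: the next token is the first name.
    first_name = tokens.pop(0) if tokens else ""

    # Step 3: the remaining 1-3 tokens together form the last name.
    if not 1 <= len(tokens) <= 3:
        # Too few or too many tokens: refuse rather than guess.
        raise ValueError(f"cannot fit {full_name!r} into the pattern")

    return {
        "prefix": " ".join(prefixes),
        "first_name": first_name,
        "last_name": " ".join(tokens),
    }


print(parse_name("Maria de la Cruz"))
print(parse_name("William Mc Gonnell"))
print(parse_name("Dr. Pau Calabrés Garcia"))
```

    In this sketch a name with more than three tokens after the first name raises an error instead of silently dropping or misfiling a token.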

    I'm still working out how to expand the parser to support optional items, and how it should determine groupings beyond item count and order. For instance, what should it do with a middle name if one were entered under my example pattern, as in Maria Angelica de la Cruz? Should Angelica be discarded? This is one of the reasons that, in my upcoming module, I am now storing the name as entered. When the parsing goes wrong, an editor can then use the stored name to determine how it should actually be broken apart on the edit form. Ultimately, while the parser can get better at this, it might not get _everything_ correct, and the name parts on the contributor edit screen are manually editable. Clearly, that is not a great solution when you might be dealing with thousands of entries, but it's better than having indecipherable broken data!

  • 🇪🇸Spain paucala

    Hi @andrewgearhart

    In fact, I've been working with the first case:

    • Calabrés Garcia, Pau

    I have also tested other forms, such as:

    • Calabrés de Garcia, Pau
    • del Calabrés de Garcia, Pau
    • de la Calabrés Garcia, Pau

    I think the main problem is that the current library (adci) doesn't work well with comma-separated fields. As I understand it, names written in both Latin and Germanic languages accept this structure as valid in academic environments. The library should split the input as last names, ",", first names. But it reads only the first word as the last name...

    The patch I committed doesn't add the "y" if there's already one.

    Are you working on a submodule for bibtex? Something to replace the adci library? Or is it something unrelated?

  • 🇨🇦Canada mediameriquat

    I see a similar problem in French, where two last names are very common nowadays.

    I worked around the problem by using a non-breaking space to bundle the two last names together.

    An advanced text editor such as Notepad++ can be very useful for tweaking BibTeX and RIS files before importing the data. This does not solve the issue as reported above, but my point is that Bibcite can do a great job for your organization in spite of its multiple flaws.

  • This relates more to the name parser library than to bibcite itself.
    You can use this patch as a temporary solution if it works correctly for you.
    In fact, I don't understand in which cases it is a middle name and in which a second last name. How should we separate this?

  • Another problem: if we add a configuration key to the parser package, we don't have a configuration page in the module for it.

  • 🇪🇸Spain paucala

    Hi @AardWolf
    I think the problem is very much related to the bibcite module, as long as the library is deprecated and doesn't get updates.
    Of course, I know some of the patches I've been making for this project aren't the best solution and may not be merged, but the issue should still be reviewed.

    Answering your question, "How should we separate this?":

    For a non-Spanish speaker, it may be difficult when the name is something like Javier Manuel García Castellón, i.e. Name1 Name2 LastName1 LastName2.
    But in academia it is accepted that the surnames go first, separated from the given name(s) by a comma. So the easy solution should be that the first part is interpreted as the last name, and the comma-separated part as the name. For the example above, that would be:
    García Castellón, Javier Manuel

    The problem is that, right now, the library only does this if the first and second last names are separated by a conjunction ("y", which in Spanish means "and").
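
    The comma rule described in this comment could be sketched like this. It is Python for illustration only and not how the adci library works; `split_inverted_name` is a hypothetical name, and treating any given name after the first as a middle name is an assumption of the sketch.

```python
from typing import Dict


def split_inverted_name(name: str) -> Dict[str, str]:
    """Split 'Surnames, Given names' at the first comma.

    Everything before the comma is the last name, whatever the number of words.
    """
    surnames, sep, given = name.partition(",")
    if not sep:
        raise ValueError("expected the form 'Surnames, Given names'")
    given_tokens = given.split()
    return {
        "last_name": " ".join(surnames.split()),
        "first_name": given_tokens[0] if given_tokens else "",
        # Assumption: any further given names are stored as the middle name.
        "middle_name": " ".join(given_tokens[1:]),
    }


print(split_inverted_name("García Castellón, Javier Manuel"))
print(split_inverted_name("de la Calabrés Garcia, Pau"))
```

    With this rule no conjunction is needed, because the comma alone marks where the surnames end.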

  • Sorry, which library doesn't get updates? Did you try to create an issue here? https://github.com/ADCI/full-name-parser/issues
    Do you see on Packagist that it is deprecated? https://packagist.org/packages/adci/full-name-parser

    "So the easy solution should be that the first part is interpreted as the last name, and the comma-separated part as the name."

    What about the middle name?

  • 🇪🇸Spain paucala

    Sorry, I don't know why I read that it was deprecated. I will try to open the issue there if I have time.
    About your question, I don't really understand what you mean by middle name. Isn't it the second name? In the example I provided before, Javier would be the first name and Manuel the second or middle name. That's how I understood it.
    https://dictionary.cambridge.org/dictionary/english-spanish/middle-name
