Author field is not saved on aggregator item

Created on 5 March 2019, almost 6 years ago
Updated 3 October 2023, about 1 year ago

laminas-feed only parses <author> fields that contain an email address or conform to the format <author>email@example.com (author name)</author>. Arbitrary data not conforming to that format is discarded. Aggregator only uses the author name portion of the field, ignoring the email address. So even if there is an email address with no name, we treat that as being empty.

Remaining Tasks

Decide what to do about this.

Original report

Working with aggregator I found that the author field is not working properly, the author tag content is not stored on the aggregator item on database. Looking the code I found a XML on the test module (/core/modules/aggregator/tests/modules/aggregator_test/aggregator_test_rss091.xml) with an author field so I think this worked on the past, but now is failing, using this test XML as feed shows how the field author is not stored.

πŸ› Bug report
Status

Fixed

Version

2.0

Component

Code

Created by

πŸ‡ͺπŸ‡ΈSpain duban Santiago de Compostela

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Comments & Activities

Not all content is available!

It's likely this issue predates Contrib.social: some issue and comment data are missing.

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MySQL 8
    last update over 1 year ago
    139 pass, 2 fail
  • Status changed to Needs work over 1 year ago
  • πŸ‡ΊπŸ‡ΈUnited States dcam
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 9.5.x + Environment: PHP 8.1 & MariaDB 10.3.22
    last update over 1 year ago
    140 pass, 2 fail
  • πŸ‡ΊπŸ‡ΈUnited States dcam

    As I dug into this the first thing that I noticed is that my site is importing authors. The difference is that the items it's importing have the <dc:creator> tag instead of <author>. My own test feed is generated by Drupal 10's default frontpage view. The RSS version is 2.0. But the one that's being tested by the module is version 0.91. It seems like we aren't compatible with this older version of RSS.

    I think we could keep the scope of this issue as it is, but this makes me think that we may need to spin off another issue to test different RSS versions separately.

  • πŸ‡ΊπŸ‡ΈUnited States dcam

    I did further research. Essentially, laminas-feed has strict requirements for the contents of the <author> field. I believe they are following the guidance from https://www.rssboard.org/rss-2-0-1:

    <author> is an optional sub-element of <item>.

    It's the email address of the author of the item. For newspapers and magazines syndicating via RSS, the author is the person who wrote the article that the <item> describes. For collaborative weblogs, the author of the item might be different from the managing editor or webmaster. For a weblog authored by a single individual it would make sense to omit the <author> element.

    <author>lawyer@boyer.net (Lawyer Boyer)</author>

    See the relevant lines of code from laminas-feed at https://github.com/laminas/laminas-feed/blob/4efdf913e7bb93310f9afd5d426.... They just try to do a best effort attempt to parse <author> fields, expecting that it will contain an email address and possibly a person's name in parenthesis. This matches the syntax example shown above. They start by checking for an '@', assuming that it is an email address. If they don't find one, then they don't bother doing anything else with the data. It's effectively discarded.

    There's a further problem in how Aggregator uses the author data. Any email address is stored in an email key. If there's a name in parenthesis, that gets stored in the name key. We only use the name key from the results. So even if there's data in the email key, we ignore it. This may have been done because the laminas-feed docs only demonstrate using the name key.

    Based on this, it doesn't seem to be an RSS version-specific problem. laminas-feed simply doesn't support random data in an <author> field. It's probably fine in <dc:creator> fields because laminas-feed hands off that processing to a separate library. So I guess it's our move to decide how we want to handle this.

  • πŸ‡ΊπŸ‡ΈUnited States dcam

    I expanded the test from #8 to check for authors that are compatible with laminas-feed. So my new assertion should pass. It does locally.

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MariaDB 10.3.22
    last update over 1 year ago
    139 pass, 4 fail
  • πŸ‡ΊπŸ‡ΈUnited States dcam
  • πŸ‡ΊπŸ‡ΈUnited States dcam

    I updated DeleteFeedItemTest to account for the new item.

  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MariaDB 10.3.22
    last update over 1 year ago
    140 pass, 2 fail
  • πŸ‡ΊπŸ‡ΈUnited States dcam

    I'm open to suggestions about what to do about this. On one hand this is probably undesirable behavior. On the other hand we would have to do extra work outside of what laminas-feed does for us in order to import arbitrary <author> data. Do we just document the problem and say we don't support arbitrary <author> data? Or do we try to fix the problem?

  • πŸ‡ΊπŸ‡ΈUnited States dcam

    Based on these findings I don't think this qualifies as being Major.

  • Status changed to Needs review about 1 year ago
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    153 pass, 2 fail
  • πŸ‡ΊπŸ‡ΈUnited States dcam

    I'm uploading my work to make sure it's saved. I know it won't pass the new test.

    I decided to go beyond what laminas-feed does and try to import any data that's in the <author>, if it's populated. To do this I've created an extension for laminas Entry parsing.

    Unfortunately, it isn't working and I don't know why. The extension and function for getting the author are called. But there's something wrong with my implementation or the underlying code. The base AbstractEntry class contains an entryKey property that tells any extension which <item> from the feed is being parsed. For some reason, in this new extension the entryKey is 1 (technically a value of 0), either because that's the default or because of some bug. But we're trying to parse number 4. So the extension gets the XML for entry 1, tries to parse the author from it (it has none), and fails the test.

  • The last submitted patch, 17: 3037567-17.patch, failed testing. View results β†’
    - codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

  • Status changed to Needs work about 1 year ago
  • πŸ‡ΊπŸ‡ΈUnited States dcam

    Inspiration for this came from the official docs, https://docs.laminas.dev/laminas-feed/reader/#writing-laminasfeedreader-..., and from the other extensions. Unfortunately, I don't see where we're doing anything different. I'm starting to think that it's because the RSS Entry class only sets the entryKey for extensions that it expects out-of-the-box. We may have to set the key ourselves. I'd already thought of that, but dismissed it as a workaround hack. But that may be the only way to get the job done.

  • Status changed to Needs review about 1 year ago
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 10.1.x + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    158 pass
  • πŸ‡ΊπŸ‡ΈUnited States dcam

    It just needed one small change to the check for existing author data.

  • πŸ‡ΊπŸ‡ΈUnited States dcam

    And the moral of the story is: don't debug late at night when you're tired.

  • Status changed to RTBC about 1 year ago
  • πŸ‡¦πŸ‡ΊAustralia larowlan πŸ‡¦πŸ‡ΊπŸ.au GMT+10

    Looks good to me, now I need to finish reading the great gatsby for I too want to get out and walk eastward toward the park through the soft twilight

    • dcam β†’ committed 08a55031 on 2.x
      Issue #3037567 by dcam, larowlan: Author field is not saved on...
  • Open in Jenkins β†’ Open on Drupal.org β†’
    Core: 9.5.x + Environment: PHP 8.1 & MariaDB 10.3.22
    last update about 1 year ago
    158 pass
    • dcam β†’ committed ff107d20 on 1.x
      Issue #3037567 by dcam, larowlan: Author field is not saved on...
  • Status changed to Fixed about 1 year ago
  • πŸ‡ΊπŸ‡ΈUnited States dcam

    Thank you to everyone who worked on this!

  • Automatically closed - issue fixed for 2 weeks with no activity.

Production build 0.71.5 2024