Are we implementing oEmbed discovery correctly/efficiently?

Created on 1 March 2022, about 3 years ago
Updated 16 February 2023, about 2 years ago

Discovery as described in https://oembed.com/#section4 is currently only kicking in if we get a URL which does not match any of the URL schemes defined for an endpoint, see \Drupal\media\OEmbed\UrlResolver::getProviderByUrl().

I'm working with a provider which currently does not provide <link rel="alternate" type="application/json+oembed" href="..." /> discovery tags on the actual resource URL, but they do provide that tag on a different URL (other subdomain). The URL delivered for the resource correctly contains the endpoint URL and the full actual resource URL as the url query parameter. If you request it, it does return the full oEmbed JSON object we expect.
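To make the discovery step concrete, here is a minimal Python sketch (not Drupal's implementation) of pulling oEmbed discovery links out of a page. The hostnames `provider.example` and `player.provider.example`, the class name and the function name are illustrative assumptions. The two `type` values are the ones listed in the oEmbed specification.

```python
from html.parser import HTMLParser

# Link types the oEmbed specification lists for discovery.
OEMBED_TYPES = {"application/json+oembed", "text/xml+oembed"}


class OEmbedLinkParser(HTMLParser):
    """Collects the href of every oEmbed discovery <link> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags also end up here.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        link_type = (attr.get("type") or "").lower()
        if "alternate" in rel and link_type in OEMBED_TYPES and attr.get("href"):
            self.hrefs.append(attr["href"])


def discover_oembed_urls(html):
    """Return all discovery hrefs found in the given markup."""
    parser = OEmbedLinkParser()
    parser.feed(html)
    return parser.hrefs


# Hypothetical page markup, similar to what the "secondary" URL serves.
page = (
    '<html><head><link rel="alternate" type="application/json+oembed" '
    'href="https://provider.example/oembed?url='
    'https%3A%2F%2Fplayer.provider.example%2Fv%2F123" /></head></html>'
)
print(discover_oembed_urls(page))
```

The discovered href already contains both the endpoint and the real resource URL, which is why it can be requested as-is.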

If I specify the URL scheme to the player / resource URL and our editors input it when creating a Media entity in Drupal, everything works as expected.
However, the provider clearly exposes the other URL in their interface, so our editors are likely to see that and try to create a Media entity using that - which leads to an "unknown provider" error.

If I add the secondary URL as another scheme for the same endpoint, it won't work. The endpoint only accepts the actual player / resource URL and won't return any JSON for the secondary URL.
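The secondary URL ends up in discovery because it matches none of the provider's URL schemes. A rough sketch of wildcard scheme matching, written in Python rather than Drupal's PHP and using hypothetical hostnames, could look like this. The real matching code in core may differ in detail.

```python
import re


def scheme_to_regex(scheme):
    """Turn an oEmbed URL scheme with * wildcards into an anchored regex."""
    # Escape everything, then turn the escaped wildcard back into ".*".
    return re.compile("^" + re.escape(scheme).replace(r"\*", ".*") + "$")


def matches_any_scheme(url, schemes):
    """True if the URL matches at least one of the schemes."""
    return any(scheme_to_regex(scheme).match(url) for scheme in schemes)


# Hypothetical scheme for the player / resource URL.
schemes = ["https://player.provider.example/v/*"]

print(matches_any_scheme("https://player.provider.example/v/123", schemes))
print(matches_any_scheme("https://www.provider.example/videos/123", schemes))
```

Adding a scheme for the secondary URL would make the second call match too, but the endpoint would still reject that URL, which is the problem described above.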

Since the secondary URL page does contain a working <link/> for discovery, as shown above, I thought we'd be fine anyway if the editors did this, and we almost are.
What happens is, Drupal:

  1. starts the main process using Drupal\media\Plugin\media\Source\OEmbed::getMetadata()
  2. wants to know the URL to the oEmbed resource so uses UrlResolver::getResourceUrl() on the user input (the "secondary" URL in my case)
  3. checks the schemes for matches to the input URL using UrlResolver::getProviderByUrl()
  4. finds no matching schemes
  5. falls back on discovery in the protected method UrlResolver::discoverResourceUrl() and requests the input URL directly using an HTTP client
  6. parses markup, sees the <link/> and pops that URL back up to UrlResolver::discoverResourceUrl()
  7. requests the discovered URL with ResourceFetcher::fetchResource() (it's pointing to the endpoint with the correct url parameter.)
  8. gets valid oEmbed JSON it can parse
  9. grabs the name of the provider
  10. looks up the provider definition based on the provider name
  11. creates a complete \Drupal\media\OEmbed\Resource instance, passing in the now found provider definition
  12. returns that [oEmbed] resource up to UrlResolver::getProviderByUrl()
  13. throws away the resource and returns just the provider definition up to UrlResolver::getResourceUrl()
  14. still inside UrlResolver::getResourceUrl() it wants to find out which endpoint is appropriate for this resource url using UrlResolver::getEndpointMatchingUrl()
  15. iterates through any defined endpoints in the provider, testing the URL schemes for matches to the input ("secondary") URL (or in most normal cases a valid resource URL which just wasn't listed in the schemes and we needed the discovery process for)
  16. finds no matching URL schemes and falls back to whichever endpoint is listed first and pops that back up to UrlResolver::getResourceUrl()
  17. makes another request to that endpoint - using the secondary URL as the url parameter
  18. (For this provider, the request URL built from the input string does not match the expected endpoint URL format, so it is not a request that the resource fetcher has cached. Even if it did match the requested format, the URL would not have been fetched before; otherwise we would have known to call it earlier and would not have ended up in the discovery phase.)
  19. ends up with a 404 because the secondary URL used on that endpoint is not actually a valid resource and throws an exception
  20. OEmbed::getMetadata() catches that exception, shows the validation error "The provided URL does not represent a valid oEmbed resource." and returns NULL, leading to a form error.
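The steps above can be condensed into a small simulation. This is a Python sketch with an in-memory fake web, not the Drupal code: the function names only loosely mirror the PHP methods, schemes are simplified to URL prefixes, and all URLs are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://provider.example/oembed"
PLAYER_URL = "https://player.provider.example/v/123"
SECONDARY_URL = "https://www.provider.example/videos/123"
DISCOVERED_URL = ENDPOINT + "?" + urlencode({"url": PLAYER_URL})

# Fake web: URL -> response. Anything not listed is treated as a 404.
FAKE_WEB = {
    SECONDARY_URL: {"discovery_href": DISCOVERED_URL},
    DISCOVERED_URL: {"provider_name": "Example", "html": "<iframe></iframe>"},
}

# Provider definitions; schemes are plain prefixes in this sketch.
PROVIDERS = {
    "Example": {
        "endpoint": ENDPOINT,
        "schemes": ["https://player.provider.example/v/"],
    },
}

requests_made = []


def http_get(url):
    """Record the request and return the fake response, or fail like a 404."""
    requests_made.append(url)
    if url not in FAKE_WEB:
        raise LookupError("404: " + url)
    return FAKE_WEB[url]


def get_provider_by_url(url):
    """Steps 3-13: match schemes, else discover, but return only the provider."""
    for provider in PROVIDERS.values():
        if any(url.startswith(prefix) for prefix in provider["schemes"]):
            return provider
    page = http_get(url)
    resource = http_get(page["discovery_href"])
    # The fully fetched resource is dropped here; only the provider survives.
    return PROVIDERS[resource["provider_name"]]


def get_resource_url(url):
    """Steps 14-16: rebuild the request from the first endpoint and the input URL."""
    provider = get_provider_by_url(url)
    return provider["endpoint"] + "?" + urlencode({"url": url})


def get_metadata(url):
    """Steps 17-20: fetch the rebuilt URL and swallow the failure."""
    try:
        return http_get(get_resource_url(url))
    except LookupError:
        return None
```

Calling `get_metadata(SECONDARY_URL)` in this sketch makes three requests: the page, the discovered URL (which succeeds), and a rebuilt endpoint URL carrying the secondary URL (which fails), so the result is `None` even though a valid resource was fetched along the way.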

I can see a sort of elegance to doing the discovery process inside the getProviderByUrl() method and just returning the provider there since that's what that method is looking for, avoiding calling code having to care about the internal discovery steps we needed to go through to actually find a valid provider.
However, it also means the calling code does not know that the correct resource has already been found, parsed, and thrown away. It must go through the entire URL scheme matching again as outlined above, guess which endpoint to use even though we already know none of them matched, and hope to find the embed code we already had as part of the resource fetched earlier.

Should we not move the discovery handling out of getProviderByUrl()?
The closest candidate locations for handling the discovery process would be getResourceUrl() and OEmbedResourceConstraintValidator::validate(), but then we have nearly the same issue: asking for a resource, not finding the provider, falling back on discovery, getting a full resource object, extracting the URL and then throwing it away to just keep the provider. For the constraint validator I think that may actually be fine, at least if the resource fetcher caches the response, but otherwise it just bumps the problem up a level.

\Drupal\media\OEmbed\UrlResolverInterface has no other methods, so we're out of candidate locations here. Doing it a level higher (other than in the validator) would mean we're all the way up in the OEmbed media source class, but maybe that's not so bad. It knows about the intricacies of oEmbed anyway, and could fall back on trying to fetch the resource directly using the discovery URL if UrlResolver::getResourceUrl() didn't work.

It would not require big API changes if we basically just moved the protected methods doing the discovery process up there, but it would mean anyone using the media.oembed.url_resolver service directly would not automatically fall back to using the discovery process.

There are a few alternatives if we want to preserve that behavior, such as extending the interface UrlResolverInterface either with new optional parameters to disable the automatic use of discovery when desiring to do it manually, or perhaps create a new service just for this purpose.
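One way to picture the optional-parameter alternative is the sketch below. It is Python, not a proposed patch: the `allow_discovery` flag, the `discover` callable and the prefix-based schemes are all assumptions made for illustration. The key difference from the current behavior is that a discovered URL is returned as-is instead of being rebuilt from the first endpoint.

```python
from urllib.parse import urlencode


class ProviderNotFound(Exception):
    """Raised when neither scheme matching nor discovery yields a URL."""


def get_resource_url(url, providers, discover, allow_discovery=True):
    """Return the oEmbed resource URL for the given input URL.

    providers: list of dicts with "endpoint" and "schemes" (URL prefixes).
    discover: callable returning the discovered href for a URL, or None.
    allow_discovery: hypothetical flag letting callers handle discovery themselves.
    """
    for provider in providers:
        if any(url.startswith(prefix) for prefix in provider["schemes"]):
            return provider["endpoint"] + "?" + urlencode({"url": url})
    if allow_discovery:
        discovered = discover(url)
        if discovered:
            # Keep the URL from the <link> tag instead of guessing an endpoint.
            return discovered
    raise ProviderNotFound(url)
```

Callers that pass `allow_discovery=False` get an exception for unmatched URLs and can then run discovery on their own terms, while everyone else keeps the automatic fallback.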

📌 Task
Status

Active

Version

10.1 ✨

Component
Media

Last updated 4 days ago

Created by

🇸🇪 Sweden twod


Comments & Activities


  • 🇫🇷 France pmunch

    @TwoD, thanks for your detailed review.

    I ran into this problem while implementing the discovery mechanism on a site whose pages contain video resources that I want to consume in Drupal as oEmbed content (media remote video field).

    The oEmbed specification on oembed.com is really light, but I agree that "discovery" means looking into the page's discovery tags to identify (and request) the actual media source, which of course could be provided by a different domain than the one implementing the discovery tags.

    To sum it up from an integrator pov:

    1. site foo.com serves pages with video content provided by streaming provider bar.com, which implements a functional oembed endpoint
    2. bar.com player code is integrated as an iframe into foo.com pages
    3. foo.com pages having such a player implement discovery tags pointing to bar.com oembed endpoint and media
    4. drupal site baz.com wants to integrate content displayed on a foo.com page via oembed, and has foo.com and bar.com declared as custom providers using module oembed_providers
    5. when pasting the foo.com url into the media remote video field and trying to save the media => error "The provided URL does not represent a valid oEmbed resource"
    6. when pasting the bar.com media url, which also implements discovery tags (self-pointed) into the media remote video field, media is embedded correctly.

    How to reproduce easily:

    1. create a page containing discovery tags pointing to any youtube video, e.g. <link rel="alternate" type="application/json+oembed" href="https://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKIDu6a9COmg" title="Panopticom" />
    2. host this page on your_domain.com
    3. in drupal, set your_domain.com as custom provider using module drupal/oembed_providers (see instructions)
    4. paste the your_domain.com url into the media remote video field => error "The provided URL does not represent a valid oEmbed resource"

    Obviously this is a grey zone for many oEmbed implementations (same happens in wordpress core embed block).

    The oEmbed sourcing mechanism should work like this:

    1. an url is pasted into the remote video field
    2. drupal checks discovery tags of the url page
    3. drupal requests the oembed response using the href attribute of the oembed discovery tag, whose domain MAY differ from that of the pasted url
    4. if no discovery tags are found, drupal (or modules) may fall back on some supported oEmbed directories

    Finally, there are two useful online oEmbed checkers:
    - https://charhey.com/oembed: works correctly according to oEmbed sourcing mechanism above
    - http://debug.iframely.com: although more popular, this one doesn't

    I think this problem should really be addressed, as pasting a youtube url to integrate a video has become a very common practice, which should be made easily accessible to other media providers.

    I'm available for any help testing if needed.
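The discovery-first order proposed in the comment above can be sketched as follows. This is an illustrative Python outline, not Drupal or WordPress code: `discover` and `directory_lookup` are hypothetical callables standing in for the tag lookup and the provider directory, and the page path on your_domain.com is made up. The href is the one from the reproduction steps.

```python
from urllib.parse import parse_qs, urlsplit


def resolve_resource_url(pasted_url, discover, directory_lookup):
    """Discovery first, provider directory second."""
    href = discover(pasted_url)  # check the page's discovery tags
    if href:
        return href  # may point at a different domain than the pasted URL
    return directory_lookup(pasted_url)  # may return None if nothing is known


def describe(pasted_url, resource_url):
    """Show which hosts are involved and which media URL is requested."""
    resource = urlsplit(resource_url)
    query = parse_qs(resource.query)
    return {
        "pasted_host": urlsplit(pasted_url).netloc,
        "endpoint_host": resource.netloc,
        "media_url": query.get("url", [None])[0],
    }


# Hypothetical page carrying the discovery tag from the reproduction steps.
PASTED = "https://your_domain.com/page"
HREF = (
    "https://www.youtube.com/oembed?format=json"
    "&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKIDu6a9COmg"
)

resource_url = resolve_resource_url(PASTED, {PASTED: HREF}.get, lambda url: None)
print(describe(PASTED, resource_url))
```

In this sketch the pasted host and the endpoint host differ, and the media URL comes from the discovery tag rather than from the pasted URL.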
