UserData only returns strings for scalar values

Created on 9 March 2023, almost 2 years ago
Updated 2 April 2024, 9 months ago

Problem/Motivation

The user.data services stores and retrieves all scalar values as strings.

In versions of Drupal prior to 8.2 user data was stored as part of the user and (as far back as at least D4.6) fully serialized so that any int/bool/array/object/string/etc would always be serialized for storage in the database and deserialized upon recovery from the database.

Steps to reproduce

> $userData = \Drupal::service('user.data');
= Drupal\user\UserData {#9804}

> $userData->set('test', 1, 'testbool', FALSE);
= null

> $userData->set('test', 1, 'testint', 3);
= null

> $userData->set('test', 1, 'testfloat', (10/3));
= null

> $b = $userData->get('test', 1, 'testbool');
= ""

> $i = $userData->get('test', 1, 'testint');
= "3"

> $f = $userData->get('test', 1, 'testfloat');
= "3.3333333333333"

> var_dump(compact('b','i','f'));

array(3) {
  'b' =>
  string(0) ""
  'i' =>
  string(1) "3"
  'f' =>
  string(15) "3.3333333333333"
}

Proposed resolution

Serialize all data except strings.

Remaining tasks

User interface changes

None

API changes

Signature change: None
Returned data: We will now see INT/BOOL returned, this is within the signature of MIXED.

Data model changes

Scalar values, except strings, will now be stored in the users_data table in a serialized format.

Release notes snippet

The user.data service now allows the storing/retreiving of scalar values without casting to strings.

🐛 Bug report
Status

Needs work

Version

11.0 🔥

Component
User module 

Last updated 1 day ago

Created by

🇺🇸United States brad.bulger

Live updates comments and jobs are added and updated live.
Sign in to follow issues

Merge Requests

Comments & Activities

  • Issue created by @brad.bulger
  • Status changed to Closed: works as designed about 1 year ago
  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    The only way to store and return other datatypes is by wrapping them in a non-scalar datatype

    That is not true. Since non-scalar values are passed to serialize(), they can always be stored. There is no need to pass them in an array; it is sufficient to pass them to UserData::set() as they are.

    public function set($module, $uid, $name, $value) {
      $serialized = (int) (!is_scalar($value));
      if ($serialized) {
        $value = serialize($value);
      }
      $this->connection->merge('users_data')
        ->keys(['uid' => $uid,
          'module' => $module,
          'name' => $name,
        ])
        ->fields(['value' => $value,
          'serialized' => $serialized,
        ])
        ->execute();
    }

    It is sufficient to verify what the following code prints.

    function set($value) {
      if (!is_scalar($value)) {
        $value = serialize($value);
      }
      var_dump($value);
    }
    
    set(FALSE);
    set(array(10));
    set((object) array("r200");
    
    bool(false)
    string(15) "a:1:{i:0;i:10;}"
    string(38) "O:8:"stdClass":1:{s:1:"0";s:4:"r200";}"
    

    There is no need to document the function stores only strings, as that does not limit which values can be passed to $value.

    The only "issue" is that FALSE would eventually be stored in the database as 0, but that is rarely a problem. If that would be documented for UserData::set(), it should be documented for every method/function that store the value passed as argument in a database table, as a Boolean value will always be stored as either a string or an integer.

  • Status changed to Active about 1 year ago
  • 🇺🇸United States brad.bulger

    The issue is to spell out that any non-string scalar values will be stored as strings. Something like

    mixed $value: The value to store. Non-scalar values are serialized automatically. Scalar values are stored as strings.

    I don't doubt that this is as designed, the issue is to make it explicit.

    Actually using the real service demonstrates this easily:

    > $userData = \Drupal::service('user.data');
    = Drupal\user\UserData {#9804}
    
    > $userData->set('test', 1, 'testbool', FALSE);
    = null
    
    > $userData->set('test', 1, 'testint', 3);
    = null
    
    > $userData->set('test', 1, 'testfloat', (10/3));
    = null
    
    > $b = $userData->get('test', 1, 'testbool');
    = ""
    
    > $i = $userData->get('test', 1, 'testint');
    = "3"
    
    > $f = $userData->get('test', 1, 'testfloat');
    = "3.3333333333333"
    
    > var_dump(compact('b','i','f'));
    
    array(3) {
      'b' =>
      string(0) ""
      'i' =>
      string(1) "3"
      'f' =>
      string(15) "3.3333333333333"
    }
    = null
    

    If there are other places where values are handled in the same way, then yes, it would be helpful to note this there as well.

  • 🇺🇸United States cmlara

    I hit this again myself, and the repercussions would have necessitated a security advisory if the issue made it to a tagged release.

    At the very least, I agree with @brad.bulger, this should be documented, the API should clearly define its limits.

    The user.data service is not presented as a 'raw' storage it presents itself currently as being capable of returning data the same as it was submitted which is not the case here. It is presented in a manner that if I put in an integer I expect an integer to be returned from get()

    I don't doubt that this is as designed,

    Honestly I'm not sure this was actually designed. Even in core we have issues such as AccountInterface::id(): int; returning strings because of the database translating scalar value to strings.

    Given that Core is moving to use more use of typedef's and contrib is starting to adopt strict_types we may want to consider going further and serializing everything except strings, although I'm not sure of the BC risks of doing so and a documentation fix is at least progress.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    The value to store. Non-scalar values are serialized automatically. Scalar values are stored as strings.

    Every value is stored as string, even non-scalar values. As it is written, it seems that only scalar values are stored as strings, which is not true.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    I would also remove The only way to store and return other datatypes is by wrapping them in a non-scalar datatype - eg [FALSE] will return an array with a boolean FALSE value. from the issue summary, since that is not true.

  • 🇺🇸United States cmlara

    Digging deeper,

    I actually think this wasn't intentional and is a bug

    Before
    https://www.drupal.org/node/1852360
    https://git.drupalcode.org/project/drupal/-/commit/4d0d5c5712802690ac387...

    The $user->data was populated by de-serializing $data. $data was also a multi-dimensional array which meant that any testing wouldn't have seen a 'root level' value that wouldn't be serialized because it was always in an array, the same as the IS comment that to store FALSE you need to use a non-scalar like [FALSE].

    After the changes serialize was only called on non-scalar values.

    The Change Record shows examples of retrieving a value with I would contend is an implication one would be working with the full range of scalar values

    // Get a value.
    $data = Drupal::service('user.data')->get('mymodule', $account->id(), 'key') ?: 1;

    Additionally looking at the testing (unless I'm misreading) it looks like the conversions used assertEquals, against integer values (I believe assertSame didn't exist yet either) implying that the goal was to see the full scalar range.

    Given the lack of strict_types at the time, and how all of PHP works with implicit casting, to me this looks like it wasn't noticed that the database cased everything to strings similar to our current issues with entities.

    @brad.bulger I don't want to hijack your issue, how would you feel about us changing this to a bug report against when serialize is called instead of this being a documentation fault?

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    Basing on #347988-22: Move $user->data into own table , the serialized column was added for similarity with other table schemas already used.
    I cannot find any comment that explains why that column is necessary for those tables, but passing only non-scalar values to serialize()/unserialize() introduced a slight difference between the Drupal 7 code and the Drupal 8 code.

  • 🇺🇸United States brad.bulger

    @cmlara I suppose that it is unlikely that the comment would be updated now while a code change that moots it is being considered. That would be the only reason to split the bug off, I think.

  • 🇺🇸United States cmlara

    Thanks @brad.bulger. Changing this to a bug report (we can revert back if this gets unwieldy or if it looks like a fix will not be merged).

    I cannot find any comment that explains why that column is necessary for those tables,

    I haven't looked for confirmation, however if they intend to store serialized and unserialzed data it makes sense, we wouldn't want a user placing a string that could exploit an issue like 📌 Harden Drupal against PHP Gadget Chain Drupal9/RCE1 Closed: won't fix .

    There is some minor talks about performance and micro optimization in the issue thread.

    Identified #347988-55: Move $user->data into own table as the comment where this change was inserted into the patches which also looks like it was really the first serious D8 patch with this new infrastructure design.

    Also it looks like as far bask as at least D4.6 everything would have been serialized https://api.drupal.org/api/drupal/modules%21user.module/function/user_sa...

    The real big question is: what is the BC impact if we change !is_scalar($value) to !is_string($value)?

    Opened a quick MR to see if anything breaks with this change. Even if no fails I'm sure we will want tests to be added before this gets moved forward.

  • Pipeline finished with Failed
    about 1 year ago
    Total: 797s
    #68958
  • 🇺🇸United States cmlara

    Single failure, its in a migration test validating data migrated from D6. It has been modified since (converted to assertSame()), however the original source is #2268897: Write per user contact settings D6=>D8 migration , 2 years after the user.data change was made.

    No discussion on it regarding the scalars being returned as strings, however assertIdentical() was explicitly used in that test to validate the data returned by the user.data service.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    UserData::get() returns NULL when $module, $name, and $uid are passed but no value is found.

    // If $module, $uid, and $name were passed, return the value.
    if (isset($name) && isset($uid)) {
      $result = $result->fetchAllAssoc('uid');
        if (isset($result[$uid])) {
        return $result[$uid]->serialized ? unserialize($result[$uid]->value) : $result[$uid]->value;
      }
      return NULL;
    }
    

    If also NULL is serialized, the code that calls UserData::get() would not be able to understand if no value has been found, or NULL was effectively the stored value.

  • 🇺🇸United States cmlara

    If also NULL is serialized, the code that calls UserData::get() would not be able to understand if no value has been found or NULL was effectively the stored value.

    Good thinking. NULL is non-scalar and is already going through the serialize code. The change currently proposed wouldn't change the current action of NULL. I would suggest we can keep that out of scope for this issue and deal with it in a followup if we feel it is a concern.

    Your comment does cause me to realize, there is one 'non documented' change this would cause. unserialize() returns FALSE when it has an error and is currently the only time the current code could return a BOOL. This isn't officially documented on the method signature of userData->get() however if any contrib was using === FALSE they would now no longer be able to tell the difference between a deserialization fault and FALSE being stored.

    The table itself is considered non-api and shouldn't be accessed by anyone expect the user.data service. IMHO we shouldn't hit that in the real world inside of the standard API, though I mention it in case it requires us to change the BC plans.

    If we changed serialization to embed everything inside of a new "StoredUserDataObject" we could detect the difference between a stored FALSE and a unserialize() FALSE though that would have a much larger data modal change.

    Looking closer at the failed test:
    Realize this is where the user.data migration plugin was added, it uses the user.data service to migrate the data, inherently the test itself isn't testing only the migration its testing the internal of how user.data processes data at the time.

    This issue could benefit from a really good unit test of the service running all the standard variables through set and validating the database layer gets the same request, and through get() with mocked result data. I will try and get some time to write one up.

  • 🇺🇸United States brad.bulger

    Can users detect the E_NOTICE from an unserialize() error with an error handler? That wouldn't mean that they are now, of course.

  • 🇺🇸United States cmlara

    Can users detect the E_NOTICE from an unserialize() error with an error handler? That wouldn't mean that they are now, of course.

    Working on writing up unit tests and a quick attempt (at least in unit tests) says it would seem it is possible.
    Also error_get_last() would seem to be an option as well (as long as no other errors occur between the time of calling unserialize and checking for the error)

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    Actually, the fact UserData::get() could return FALSE because the string passed to unserialize() is not unserializeable is always a problem, not just when the code does a strict comparison between the returned value and a Boolean value.
    For the code calling UserData::get(), the fact the data has been serialized when written to the database is a private implementation detail; it is therefore impossible to know if FALSE is returned as error code.

    This is different in cache backends, where get() is documented to return FALSE in case of failure.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    As a side note, to understand when unserialize() returns FALSE because an error, code similar to the following one would work.

    $data = @unserialize($serialized);
    
    if ($data === FALSE) {
      if ($serialized !== serialize(FALSE)) {
        // unserialize() returned an error.
      }
    }
    
  • 🇺🇸United States cmlara

    Updated MR with a more substantial set of Unit tests for the user.data service.
    Also updated the one failing test from migrations to account for the fact that data will no be properly converted.
    if ($serialized !== serialize(FALSE)) { could bprobalby even be if ($serialized !== 'b:0;' {

    Should we be making changes to have that throw an exception in this issue or a followup?

  • Pipeline finished with Success
    about 1 year ago
    Total: 603s
    #69674
  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    Should we be making changes to have that throw an exception in this issue or a followup?

    I would leave that to a follow-up issue. Throwing an exception would avoid the code calling UserData::get() has to understand whenever NULL and FALSE are really the stored values or values that indicate the absence of stored data or errors; it is also a change that is not backward compatible.

  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    As for serializing data when the data is not a string (instead of when the data is not a scalar value), that is what the database back-end for the cache already does, in DatabaseBackend::doSetMultiple().

    if (!is_string($item['data'])) {
      $fields['data'] = serialize($item['data']);
      $fields['serialized'] = 1;
    }
    else {
      $fields['data'] = $item['data'];
      $fields['serialized'] = 0;
    }
    

    I would say that is the correct change.

  • Status changed to Needs review about 1 year ago
  • 🇺🇸United States cmlara

    Setting NR.

    Not sure if this will need a CR

  • Pipeline finished with Success
    about 1 year ago
    Total: 617s
    #70077
  • 🇮🇹Italy apaderno Brescia, 🇮🇹

    users.data column replaced with user_data API does even mention the user.data service serialize only some datatypes, contrary to the existing column, which is described as The serialized data column in the {users} table in that change record.
    I guess a change record should not be created for this change.

  • Status changed to Needs work 11 months ago
  • 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10

    Was asked to have a look at this by the needs review queue initiative.

    The code change looks self contained and fine.

    My concern is with mocking DB queries. Can we do the testing in a kernel test instead?

  • Status changed to Needs review 11 months ago
  • 🇺🇸United States cmlara

    No conflict rebased on 11.x
    Converted to Kernel test.
    Added one additional test case to providerUserDataGet() that I've learned about since working on this issue (integer like strings) for extra integrity/future type-proofing of the tests.

  • Pipeline finished with Success
    11 months ago
    Total: 1738s
    #101844
  • Status changed to RTBC 11 months ago
  • 🇺🇸United States smustgrave

    Verified test coverage here https://git.drupalcode.org/issue/drupal-3347042/-/jobs/884570

    Thanks for taking a look @larowlan!

    Believe all feedback has been addressed

  • Status changed to Needs work 10 months ago
  • 🇦🇺Australia larowlan 🇦🇺🏝.au GMT+10

    Left another comment on the review, thanks for making this a kernel test, gives me much more confidence.

    Can we get a change record here. There's a BC break here for those who're using strict equality on returns and assuming everything is a string.

  • Status changed to Needs review 10 months ago
  • 🇺🇸United States cmlara

    Added CR https://www.drupal.org/node/3427506

    Converted test to a single set()/get() callback combo.
    Removed tests that check 'internals' (serialization injection, unserialize failure, missing classes).
    Added a test for a Float value.

  • Pipeline finished with Success
    10 months ago
    Total: 526s
    #118742
  • Status changed to RTBC 10 months ago
  • 🇺🇸United States smustgrave

    Feedback appears to be addressed.

    CR contains a before/after example which is always a plus.

  • Status changed to Needs work 9 months ago
  • 🇬🇧United Kingdom alexpott 🇪🇺🌍

    This issue scares me because of the subtle bugs in can introduce. For example if you do a set with something equal to TRUE and then do === '1' code that consumes this value. Obviously this code is icky but it would have work and now it won't and we have no idea of the impact.

    We also need to update \Drupal\user\UserDataInterface. \Drupal\user\UserDataInterface::set says that The value to store. Non-scalar values are serialized automatically. so that needs to change but also I think we should be careful about BC. I think code should opt into scalar type safety

    i.e. change \Drupal\user\UserDataInterface::set($module, $uid, $name, $value) to \Drupal\user\UserDataInterface::set($module, $uid, $name, $value, bool $preserve_scalar_type = FALSE)

    And then in the code to something like:

        $serialized = (int) !is_scalar($value);
        if ($serialized) {
          $value = serialize($value);
        }
        elseif (!is_string($value)) {
          if ($preserve_scalar_type) {
            $serialized = 1;
            $value = serialize($value);
          }
          else {
            @trigger_error('Some message about passing TRUE and ensuring the code that uses the value does the right thing', E_USER_DEPRECATED); 
          }
        }
    

    That way code can opt-in and eventually we can drop the flag for the behaviour and rely on PHPStan to tell everyone to remove it.

  • 🇺🇸United States cmlara

    @alexpott
    Is it acceptable if we make the decision in get()?

    UserDataInterface::get($module, $uid, $name, bool $cast_scalar_to_string = TRUE);

    This way the data is preserved (as the documentation calls for) and it becomes a per-retrieval choice on destroying the native type. This allows the stored data to more quickly become up-to-date while allowing individuals consumers to more slowly update.

Production build 0.71.5 2024