creator as non-classic data object by abaevbog · Pull Request #161 · zotero/dataserver

abaevbog · 2023-07-21T23:41:50Z

Removed ClassicDataObject as a parent class of Zotero_Creators
Removed all load()-ing of Zotero_Creators. An object is constructed with data right away.
One join in loadCreators to load all creators on GET requests, instead of sequential $item->load().
Some minor refactoring to try to avoid extra queries
One bulk-insert query for creators instead of multiple $creator->save() calls on create or update

abaevbog · 2023-07-24T16:18:14Z

One of the consequences of removing libraryID from the creators table is that the items from the same shard essentially "share" creators across libraries. So if 2 items are created with the same creator's information, we have 1 creator entry, and 2 itemCreators entries linking this creator to items.

I suppose this is a good thing, since we don't have to store duplicate data about authors across libraries. But something to account for is how to deal with the deletion of obsolete creators. On every delete of an item, we can check if its creator belongs to any other items, and delete them if not. Or, we can run a cron job every 30 minutes or so to fetch and delete creators that are not linked to any items.

I think the second option is better, since the first option requires performing extra queries on every single DELETE Api call, which puts extra load on the dataserver

dstillman · 2023-07-24T20:14:53Z

Sorry, should've noted this earlier — we should just do what we do for itemData and inline the values in itemCreators. That'll mean a bit of data duplication, but it'll be faster, and we really don't want to have to purge these in an event.

(The empty itemDataValueHash column in itemData is a holdover from many years ago when we had a separate itemDataValues table like the client.)

dstillman · 2023-07-24T20:18:06Z

+  `fieldMode` tinyint(1) unsigned DEFAULT NULL,
+  PRIMARY KEY (`creatorID`),
+  KEY `name` (`lastName`(7),`firstName`(6))
+) ENGINE=InnoDB DEFAULT CHARSET=utf8;


This will need to be a PHP script like the others that does a proper migration of existing data. It should set the shard to down, wait 20 seconds, do the migration, and bring the shard up.

Right. I only kept it as .sql until now for convenience.
I'll have it wrapped into the same algorithm as this update, except with state=down.

dstillman · 2023-07-24T20:18:08Z

  `lastUpdated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `version` int(10) unsigned NOT NULL DEFAULT '0',
  `shardID` smallint(5) unsigned NOT NULL,
+  `hasData` TINYINT( 1 ) NOT NULL DEFAULT '0',


We need to add this, but it shouldn't be in this PR.

Can I push just this one line to master?

abaevbog · 2023-07-24T20:29:04Z

inline the values in itemCreators

Got it. So then, we completely drop the creators table, and the itemCreators will become something like this:

CREATE TABLE itemCreators (
itemID bigint unsigned NOT NULL,
firstName varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
lastName varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
fieldMode tinyint(1) unsigned DEFAULT NULL,
creatorTypeID smallint(5) unsigned NOT NULL,
orderIndex smallint(5) unsigned NOT NULL,
PRIMARY KEY (itemID,orderIndex),
KEY creatorTypeID (creatorTypeID),
KEY name (lastName(7),firstName(6))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

With a many-to-one relationship with the items table

abaevbog · 2023-07-24T23:53:12Z

In my last commit, I got rid of creators table, moved all data storage about creators into the itemCreators table, and removed some code dealing with creators that we no longer need. For example, we no longer need the for-loop when we add creators, so I grouped creator insertion into one query. I think there are a few other things can simplify now but that's a start.

php cript to update the database added copies of files before changes from this PR with old_ prefix zotero_autoload checks the shard and uses old files for shards that were not migrated yet

abaevbog · 2023-07-25T21:22:46Z

Added a php script to move data from creators and itemCreators into the new itemCreators table on a shard-by-shard basis.

To make it possible to migrate a few shards at a time, I added the php models without changes from this PR with an old_ prefix. Then, in ApiController.php, a global variable $shardID is set - based on it and a hardcoded array of shardIDs, zotero_autoload in header.php decides to use old_* files for shards that were not updated yet or usual files for updated shards. I had no problems running tests or using the API after migrating one shard out of 2 locally.

abaevbog · 2023-07-31T19:43:48Z

Brief summary of changes up to this point:

Added migration script to remove creators and tags tables and migrate to schema where all data is stored in itemCreators and itemTags.
In header.php, a temporary global variable shards is used to specify which shards have been migrated. Based on this, there is some conditional behavior in ItemController.php and TagController.php, depending on if the current shard was migrated or not.
When a method from any affected model is called, auto_loader checks current shard against migrated shards and selects either model/...inc.php - model with recent changes or model/old_...inc.php - model using old schema.
Both for creators and tags, load(), init(), and save() functions are removed. All data loading happens in the item model via loadCreators and loadTags directly into objects via constructor, and data saving/updating is done via static bulkInsert and bulkDelete.
loadTags, loadCreators, bulkCreate and bulkInsert batch multiple queries into one, so fetching or inserting 10 tags is 1 large transaction instead of 10.
changed property is set in loadTags, loadCreators - there is a check if the payload is different from existing tags/creators or not.
Zotero_Tag and Zotero_Creator properties now match the SQL columns, so there is some refactoring (e.g. in Cite.inc.php) to account for that.
All code for Tag and Creator models that is not used is removed. I thought it's easier to see changes that way.
Other minor changes, mainly in SQL joins, to work with new schema

I ran some benchmarking on a shard post-upgrade vs a shard before upgrade with profiler on check the performance of adding 5000 tags and then 5000 creators.

Before: tags ~ 30 sec, creators ~ 30 sec. ~20,000 small DB queries that took 3-7 seconds.
After: tags ~ 40 sec, creators ~ 20 sec. ~30 larger DB queries that took 0.4-0.6 seconds.

So the database queries are better (since I cut down the # of queries a lot) but there is some inefficiency elsewhere that slows things down.

dstillman

Some initial notes

dstillman · 2023-08-01T11:22:50Z

+	Zotero_Admin_DB::query("ALTER TABLE `itemTags` DROP CONSTRAINT `itemTags_ibfk_2`;", false, $shardID);
+
+	// Create new itemTags table
+	Zotero_Admin_DB::query("CREATE TABLE `itemTagsNew` ( `tagID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL, `type` tinyint(1) unsigned NOT NULL DEFAULT '0', `version` int(10) unsigned NOT NULL DEFAULT '1', PRIMARY KEY (`tagID`, `itemID`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID);


Why did you leave version here? Because of ?since= support for tags? I think we can just take that out — I suspect it's not used, or is used rarely enough that we can give a brief warning that it's going away. It's only mentioned as an aside in the Syncing documentation, and I'm not seeing any recent usage of it.

I'd say the same with tagID as with creatorID above, though I'm having second thoughts there. We could decide that the data savings are worth keeping a creators table with a creatorID and libraryID (but still no dates, key, or version). We don't do that for item data values and we won't do it for creators, but tags are much more likely to be assigned to many items, so the extra data usage could be significant. We would have to deal with purging when a tag was removed or an item was deleted, though, so it might not be worth it. Maybe need to run some queries there on real data to see how much data we would save.

There's no index on the tag here. We'll have to check the queries used for tag-based API searches to see if it can even use an index or if it will only use the libraryID and scan for tags. But it'd be the same with the name index on itemCreators — we either need it for both or for neither. (Before, in separate tables, we had to do name-based lookups to actually get the creatorID/tagID, so the indexes were definitely used.)

Yes, there is a test that calls /tags?newer=, so I assumed we need to keep it around. If we don't want it - that's great, I'll remove it from here and from my mocha tests PR.

Yeah, some context from real world data would certainly be helpful. Maybe if we could see the output of select count(*) from itemTags and divide it by select count(*) from tags, we'd get an idea of roughly many itemTag relationships there are for each tag. And then if we multiply that by select SUM(length(name)) from tags, we'd have a rough estimate of how much extra memory duplicating tag names would cause.

Queries related to tags that were affected are:

In /items?tag= where we select the itemIDs that fit the tag filter SELECT itemID FROM items JOIN itemTags USING (itemID) WHERE LOWER(itemTags.name) IN (?)

In tags/?tag= where we select tagIDs that fit the query before fetching their actual data and constructing objects SELECT SQL_CALC_FOUND_ROWS DISTINCT tagID FROM itemTags JOIN items USING (itemID) WHERE libraryID=? AND namein (?) ANDtype in (?);

Currently, these do a full table scan, which is not good. There was a UNIQUE KEY uniqueTags (libraryID, name, type) in the old creators table that was used for such queries. We can add KEY nameType (name, type) to current itemTags and these queries will use that key. I checked manually with explain sql statement and it seems to work.

With creators, the name key indeed does not do anything now because the creators never fetched by themselves and are only loaded as a part of an item via SELECT * FROM itemCreators WHERE itemID=?.

To sum up, I guess we should add KEY nameType (name, type) to itemTags and remove name from creators?

dstillman · 2023-08-01T11:24:51Z

 CREATE TABLE `syncDeleteLogKeys` (
  `libraryID` int(10) unsigned NOT NULL,
-  `objectType` enum('collection','creator','item','relation','search','setting','tag','tagName') NOT NULL,
+  `objectType` enum('collection','item','relation','search','setting','tag','tagName') NOT NULL,


tag should be removed

dstillman · 2023-08-01T11:25:28Z

Need all the schema changes here

dstillman · 2023-08-01T11:28:47Z

+	// Creators 
+	echo "Migrating creators\n";
+	// Drop foreign key constraint
+	Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_1`;", false, $shardID);


No semicolons needed in SQL from PHP (throughout)

dstillman · 2023-08-01T11:29:17Z

 	}
 }
-
+$GLOBALS['updatedShards'] = [1];


Should be empty in the committed code

dstillman · 2023-08-01T11:31:43Z

+	public static function bulkDelete($libraryID, $itemID, $creatorOrdersArray) {
+		$placeholders = implode(', ', array_fill(0, sizeOf($creatorOrdersArray), '?'));
+		$sql = "DELETE FROM itemCreators WHERE itemID=? AND orderIndex IN ($placeholders)";
+		Zotero_DB::query($sql, array_merge([$itemID],$creatorOrdersArray), Zotero_Shards::getByLibraryID($libraryID));


Missing space after comma

dstillman · 2023-08-01T11:47:42Z

+		$names = array_map(function($tag) { 
+			return $tag->name; 
+		}, $tags);
+		$existingTagsSql = "SELECT t.tagID, t.name,  MAX(t.version) as `version` from itemTags t JOIN items i USING (itemID) WHERE libraryID = ? AND name IN ($placeholders) GROUP BY tagID, `name`;"; 


House style: Capitalized acronyms in variables: existingTagsSql → existingTagsSQL

House style: Capitalized table aliases in SQL: itemTags t → itemTags IT, items i → items I

Remove extra space after comma

Only need backticks for reserved words (e.g., groups)

dstillman · 2023-08-01T11:51:06Z

-							// Use a real tag name, in case case differs
-							if (!$title) {
-								$title = "Items of Tag ‘" . $tag->name . "’";
+						if (in_array($GLOBALS['shardID'], $GLOBALS['updatedShards']) ) {


No need to have this same check everywhere. You can just do it once in header.inc.php and define() a constant that's available globally (like, e.g., Z_ENV_BASE_PATH).

dstillman · 2023-08-01T12:24:47Z

+	Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_2`;", false, $shardID);
+
+	// Create new itemCreators table
+	Zotero_Admin_DB::query("CREATE TABLE `itemCreatorsNew` ( `creatorID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `firstName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `lastName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `fieldMode` tinyint(1) UNSIGNED DEFAULT NULL, `creatorTypeID` smallint(5) UNSIGNED NOT NULL, `orderIndex` smallint(5) UNSIGNED NOT NULL, PRIMARY KEY (`creatorID`, `itemID`), KEY `creatorTypeID` (`creatorTypeID`), KEY `name` (`lastName`(7),`firstName`(6)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID);


Sorry, I led you astray originally, and didn't clarify later. I originally said there'd still be creatorID in creators, but since we're inlining the data in itemCreators, there's no reason for there to be a creatorID at all — it's not serving any purpose. And the primary key should be on itemID, orderIndex, since that's what's unique. (It was actually wrong to begin with — it should've been on itemID, orderIndex before like in the client, not itemID, creatorID, orderIndex.)

…t, formating

creator as non-classic data object

9a15ff7

abaevbog requested a review from dstillman July 21, 2023 23:41

small libraryID fix

0561025

bulk-insert new creators + minor cleanup

e03e1dc

dstillman reviewed Jul 24, 2023

View reviewed changes

store all data in itemCreators table

170704d

abaevbog added 2 commits July 25, 2023 20:58

php db migration script + use old files in autoload

dd18ed3

php cript to update the database added copies of files before changes from this PR with old_ prefix zotero_autoload checks the shard and uses old files for shards that were not migrated yet

removed typo from header

422ac13

abaevbog added 5 commits July 27, 2023 00:43

removing unused code, prevent updating unchanged creators

baf5c93

removing tags as classic data objects - in progress

8a25c08

tag tests passing, move old model files

ed68a2e

itemTags primary key including itemID

ea9030c

final cleanup, tests passing

327f68e

abaevbog force-pushed the classic_data_objects branch from 71a2a24 to 327f68e Compare July 31, 2023 19:19

moved checking of previously existing items outside of the loop

2956830

dstillman reviewed Aug 1, 2023

View reviewed changes

no creatorID, added key to itemTags, tag insert performace improvemen…

938f1fe

…t, formating

Conversation

abaevbog commented Jul 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abaevbog commented Jul 24, 2023

Uh oh!

dstillman commented Jul 24, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

abaevbog commented Jul 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abaevbog commented Jul 24, 2023

Uh oh!

abaevbog commented Jul 25, 2023

Uh oh!

abaevbog commented Jul 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dstillman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

abaevbog commented Jul 21, 2023 •

edited

Loading

abaevbog commented Jul 24, 2023 •

edited

Loading

abaevbog commented Jul 31, 2023 •

edited

Loading