creator as non-classic data object#161
Conversation
|
One of the consequences of removing I suppose this is a good thing, since we don't have to store duplicate data about authors across libraries. But something to account for is how to deal with the deletion of obsolete creators. On every delete of an I think the second option is better, since the first option requires performing extra queries on every single DELETE Api call, which puts extra load on the dataserver |
|
Sorry, should've noted this earlier — we should just do what we do for (The empty |
| `fieldMode` tinyint(1) unsigned DEFAULT NULL, | ||
| PRIMARY KEY (`creatorID`), | ||
| KEY `name` (`lastName`(7),`firstName`(6)) | ||
| ) ENGINE=InnoDB DEFAULT CHARSET=utf8; No newline at end of file |
There was a problem hiding this comment.
This will need to be a PHP script like the others that does a proper migration of existing data. It should set the shard to down, wait 20 seconds, do the migration, and bring the shard up.
There was a problem hiding this comment.
Right. I only kept it as .sql until now for convenience.
I'll have it wrapped into the same algorithm as this update, except with state=down.
| `lastUpdated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', | ||
| `version` int(10) unsigned NOT NULL DEFAULT '0', | ||
| `shardID` smallint(5) unsigned NOT NULL, | ||
| `hasData` TINYINT( 1 ) NOT NULL DEFAULT '0', |
There was a problem hiding this comment.
We need to add this, but it shouldn't be in this PR.
There was a problem hiding this comment.
Can I push just this one line to master?
Got it. So then, we completely drop the CREATE TABLE With a many-to-one relationship with the items table |
|
In my last commit, I got rid of |
php cript to update the database added copies of files before changes from this PR with old_ prefix zotero_autoload checks the shard and uses old files for shards that were not migrated yet
|
Added a php script to move data from To make it possible to migrate a few shards at a time, I added the php models without changes from this PR with an |
71a2a24 to
327f68e
Compare
|
Brief summary of changes up to this point:
I ran some benchmarking on a shard post-upgrade vs a shard before upgrade with profiler on check the performance of adding 5000 tags and then 5000 creators. Before: tags ~ 30 sec, creators ~ 30 sec. ~20,000 small DB queries that took 3-7 seconds. So the database queries are better (since I cut down the # of queries a lot) but there is some inefficiency elsewhere that slows things down. |
| Zotero_Admin_DB::query("ALTER TABLE `itemTags` DROP CONSTRAINT `itemTags_ibfk_2`;", false, $shardID); | ||
|
|
||
| // Create new itemTags table | ||
| Zotero_Admin_DB::query("CREATE TABLE `itemTagsNew` ( `tagID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL, `type` tinyint(1) unsigned NOT NULL DEFAULT '0', `version` int(10) unsigned NOT NULL DEFAULT '1', PRIMARY KEY (`tagID`, `itemID`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID); |
There was a problem hiding this comment.
-
Why did you leave
versionhere? Because of?since=support for tags? I think we can just take that out — I suspect it's not used, or is used rarely enough that we can give a brief warning that it's going away. It's only mentioned as an aside in the Syncing documentation, and I'm not seeing any recent usage of it. -
I'd say the same with
tagIDas withcreatorIDabove, though I'm having second thoughts there. We could decide that the data savings are worth keeping acreatorstable with acreatorIDandlibraryID(but still no dates, key, or version). We don't do that for item data values and we won't do it for creators, but tags are much more likely to be assigned to many items, so the extra data usage could be significant. We would have to deal with purging when a tag was removed or an item was deleted, though, so it might not be worth it. Maybe need to run some queries there on real data to see how much data we would save. -
There's no index on the tag here. We'll have to check the queries used for tag-based API searches to see if it can even use an index or if it will only use the libraryID and scan for tags. But it'd be the same with the
nameindex onitemCreators— we either need it for both or for neither. (Before, in separate tables, we had to do name-based lookups to actually get the creatorID/tagID, so the indexes were definitely used.)
There was a problem hiding this comment.
- Yes, there is a test that calls
/tags?newer=, so I assumed we need to keep it around. If we don't want it - that's great, I'll remove it from here and from my mocha tests PR.
There was a problem hiding this comment.
- Yeah, some context from real world data would certainly be helpful. Maybe if we could see the output of
select count(*) from itemTagsand divide it byselect count(*) from tags, we'd get an idea of roughly manyitemTagrelationships there are for each tag. And then if we multiply that byselect SUM(length(name)) from tags, we'd have a rough estimate of how much extra memory duplicating tag names would cause.
There was a problem hiding this comment.
- Queries related to tags that were affected are:
-
In
/items?tag=where we select the itemIDs that fit the tag filterSELECT itemID FROM items JOIN itemTags USING (itemID) WHERE LOWER(itemTags.name) IN (?) -
In
tags/?tag=where we select tagIDs that fit the query before fetching their actual data and constructing objectsSELECT SQL_CALC_FOUND_ROWS DISTINCT tagID FROM itemTags JOIN items USING (itemID) WHERE libraryID=? ANDnamein (?) ANDtypein (?);
Currently, these do a full table scan, which is not good. There was a UNIQUE KEY uniqueTags (libraryID, name, type) in the old creators table that was used for such queries. We can add KEY nameType (name, type) to current itemTags and these queries will use that key. I checked manually with explain sql statement and it seems to work.
With creators, the name key indeed does not do anything now because the creators never fetched by themselves and are only loaded as a part of an item via SELECT * FROM itemCreators WHERE itemID=?.
To sum up, I guess we should add KEY nameType (name, type) to itemTags and remove name from creators?
| CREATE TABLE `syncDeleteLogKeys` ( | ||
| `libraryID` int(10) unsigned NOT NULL, | ||
| `objectType` enum('collection','creator','item','relation','search','setting','tag','tagName') NOT NULL, | ||
| `objectType` enum('collection','item','relation','search','setting','tag','tagName') NOT NULL, |
There was a problem hiding this comment.
Need all the schema changes here
| // Creators | ||
| echo "Migrating creators\n"; | ||
| // Drop foreign key constraint | ||
| Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_1`;", false, $shardID); |
There was a problem hiding this comment.
No semicolons needed in SQL from PHP (throughout)
| } | ||
| } | ||
|
|
||
| $GLOBALS['updatedShards'] = [1]; |
There was a problem hiding this comment.
Should be empty in the committed code
| public static function bulkDelete($libraryID, $itemID, $creatorOrdersArray) { | ||
| $placeholders = implode(', ', array_fill(0, sizeOf($creatorOrdersArray), '?')); | ||
| $sql = "DELETE FROM itemCreators WHERE itemID=? AND orderIndex IN ($placeholders)"; | ||
| Zotero_DB::query($sql, array_merge([$itemID],$creatorOrdersArray), Zotero_Shards::getByLibraryID($libraryID)); |
| $names = array_map(function($tag) { | ||
| return $tag->name; | ||
| }, $tags); | ||
| $existingTagsSql = "SELECT t.tagID, t.name, MAX(t.version) as `version` from itemTags t JOIN items i USING (itemID) WHERE libraryID = ? AND name IN ($placeholders) GROUP BY tagID, `name`;"; |
There was a problem hiding this comment.
- House style: Capitalized acronyms in variables:
existingTagsSql→existingTagsSQL - House style: Capitalized table aliases in SQL:
itemTags t→itemTags IT,items i→items I - Remove extra space after comma
- Only need backticks for reserved words (e.g.,
groups)
| // Use a real tag name, in case case differs | ||
| if (!$title) { | ||
| $title = "Items of Tag ‘" . $tag->name . "’"; | ||
| if (in_array($GLOBALS['shardID'], $GLOBALS['updatedShards']) ) { |
There was a problem hiding this comment.
No need to have this same check everywhere. You can just do it once in header.inc.php and define() a constant that's available globally (like, e.g., Z_ENV_BASE_PATH).
| Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_2`;", false, $shardID); | ||
|
|
||
| // Create new itemCreators table | ||
| Zotero_Admin_DB::query("CREATE TABLE `itemCreatorsNew` ( `creatorID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `firstName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `lastName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `fieldMode` tinyint(1) UNSIGNED DEFAULT NULL, `creatorTypeID` smallint(5) UNSIGNED NOT NULL, `orderIndex` smallint(5) UNSIGNED NOT NULL, PRIMARY KEY (`creatorID`, `itemID`), KEY `creatorTypeID` (`creatorTypeID`), KEY `name` (`lastName`(7),`firstName`(6)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID); |
There was a problem hiding this comment.
Sorry, I led you astray originally, and didn't clarify later. I originally said there'd still be creatorID in creators, but since we're inlining the data in itemCreators, there's no reason for there to be a creatorID at all — it's not serving any purpose. And the primary key should be on itemID, orderIndex, since that's what's unique. (It was actually wrong to begin with — it should've been on itemID, orderIndex before like in the client, not itemID, creatorID, orderIndex.)
load()-ing of Zotero_Creators. An object is constructed with data right away.joininloadCreatorsto load all creators on GET requests, instead of sequential$item->load().$creator->save()calls on create or update