Related to #3289
When the "Duplicate without user-selected canonical" filter is applied to the sitemap in Google Search Console, three items appear where Google thinks the content is similar to another item in the sitemap. In the Google Search Console URL inspection, the "User-declared canonical" and "Google-selected canonical" appear very similar to each other. E-mail sent to the erahelp team for advice (Jan 31; ref. #3289 (comment)).
The following week, Google Search Console reported additional items. These items appear to be older (i.e., not added in the last week).
Question: is there a way to test for duplicates more efficiently than Google?
Attempt 1: use the checksum column of the Active Storage active_storage_blobs table to verify each attachment is unique, and thereby find any duplicate items.
irb(main):016:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_attachments")[0]
=> {"count"=>351758}
irb(main):017:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_blobs")[0]
=> {"count"=>388052}
irb(main):018:0> ActiveRecord::Base.connection.execute("select count(*) from theses")[0]
=> {"count"=>34452}
irb(main):019:0> ActiveRecord::Base.connection.execute("select count(*) from items")[0]
=> {"count"=>44781}
The number of blobs and attachments seems high relative to the number of items and theses. Could this be related to #3248?
Let's test this: each active_storage_blobs checksum should appear only once per unique attachment, right?
irb(main):014:0> ActiveRecord::Base.connection.execute("SELECT a_blob.checksum, count(*) FROM active_storage_blobs as a_blob GROUP BY a_blob.checksum HAVING count(a_blob.checksum)>1").count()
=> 26520
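For reference, Active Storage stores a blob's checksum as the Base64-encoded MD5 digest of the file contents, which is why equal checksums are a reliable signal of identical files. A minimal sketch of that computation using only the Ruby standard library (the sample strings are made up for illustration):

```ruby
require "digest"

# Active Storage records the Base64-encoded MD5 of the file contents as
# the blob checksum, so identical bytes produce identical checksums
# (up to MD5 collisions).
def blob_checksum(data)
  Digest::MD5.base64digest(data)
end

same_a = blob_checksum("identical file contents")
same_b = blob_checksum("identical file contents")
other  = blob_checksum("different file contents")

puts same_a           # a 24-character Base64 string, like the checksums below
puts same_a == same_b # true
puts same_a == other  # false
```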
Why are there so many blobs with the same checksum? Let's filter the attachment count by record_type.
ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd5fcc98 status=PGRES_TUPLES_OK ntuples=374 nfields=1 cmd_tuples=374>
Why the decrease in numbers?
irb(main):065:0> results = ActiveRecord::Base.connection.execute("SELECT record_type FROM active_storage_attachments group by record_type")
=> #<PG::Result:0x00007f97dd922fd0 status=PGRES_TUPLES_OK ntuples=7 nfields=1 cmd_tuples=7>
irb(main):067:0> results.values
=>
[["ActiveStorage::Blob"],
["ActiveStorage::VariantRecord"],
["Community"],
["DraftItem"],
["DraftThesis"],
["Item"],
["Thesis"]]
Maybe it is due to the filter on record_type? The answer seems to be "yes" based on the query below: without the record_type filter, the duplicate count jumps back up.
ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97de6f34e8 status=PGRES_TUPLES_OK ntuples=34829 nfields=1 cmd_tuples=34829>
From the list of duplicated checksums, let's find all the record_ids that have attachments pointing at a duplicated checksum (Item or Thesis record types with attachment name = "files"). This output will also return draft items if they are attached to a duplicated checksum.
irb(main):055:0> results = ActiveRecord::Base.connection.execute("SELECT * FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97e5553780 status=PGRES_TUPLES_OK ntuples=1118 nfields=16 cmd_tuples=1118>
{"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"name"=>"files",
"created_at"=>"2020-08-07T16:41:30.790Z",
"record_type"=>"Item",
"fileset_uuid"=>"819ebccf-476e-43e7-8712-17dcf5c0ddf8",
"record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
"blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
"filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"metadata"=>"{\"identified\":true,\"analyzed\":true}",
"byte_size"=>63687,
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"service_name"=>"local"},
{"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"name"=>"files",
"created_at"=>"2020-08-07T16:41:30.790Z",
"record_type"=>"DraftItem",
"fileset_uuid"=>nil,
"record_id"=>"f4244872-c911-46ee-a56c-f013accec73d",
"blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
"filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"metadata"=>"{\"identified\":true,\"analyzed\":true}",
"byte_size"=>63687,
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"service_name"=>"local"},
{"id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
"name"=>"files",
"created_at"=>"2020-12-23T17:44:09.337Z",
"record_type"=>"DraftItem",
"fileset_uuid"=>nil,
"record_id"=>"ad7db058-22e2-44f7-a71f-53f2097cbd4d",
"blob_id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
"key"=>"lujere8kg5fpoad2cpslxec22q7o",
"filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"metadata"=>"{\"identified\":true,\"analyzed\":true}",
"byte_size"=>63687,
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"service_name"=>"local"}]
Let's filter out the DraftItem and DraftThesis records.
irb(main):058:0> results = ActiveRecord::Base.connection.execute("SELECT record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND (a.record_type = 'Item' OR a.record_type='Thesis') and a.name='files' and b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97dd710710 status=PGRES_TUPLES_OK ntuples=835 nfields=16 cmd_tuples=835>
irb(main):059:0> results.values
...
{"record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
"record_type"=>"Item",
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"created_at"=>"2020-08-07T16:41:30.790Z",
"filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
{"record_id"=>"20121c56-ed6e-4b1c-986e-7014401e83c5",
"record_type"=>"Item",
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"created_at"=>"2020-12-23T17:44:09.337Z",
"filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}
Let's write this to a CSV file
require 'csv'

CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end
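The export above writes values only; a header row would make the resulting sheet easier to read. A small sketch using CSV.generate (the in-memory counterpart of CSV.open), with made-up rows:

```ruby
require "csv"

# Hypothetical rows shaped like the query results above.
rows = [
  { "record_id" => "item-1", "record_type" => "Item", "checksum" => "zylBLkb9" },
  { "record_id" => "item-2", "record_type" => "Item", "checksum" => "zylBLkb9" },
]

# CSV.generate builds the same text in memory that CSV.open writes to disk;
# emitting the keys of the first row first gives the sheet a header line.
csv_text = CSV.generate do |csv|
  csv << rows.first.keys
  rows.each { |row| csv << row.values }
end

puts csv_text
```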
Let's check if there are records (Item & Thesis) with multiple attachments with the same checksum (i.e., a file attached to a record multiple times):
irb(main):050:0> results = ActiveRecord::Base.connection.execute("SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd75b5d0 status=PGRES_TUPLES_OK ntuples=12 nfields=2 cmd_tuples=12>
irb(main):051:0> results.as_json
=>
[{"checksum"=>"Zl2l9EGyAmAksQ5tS+B/iA==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
{"checksum"=>"CgOhJouY1r4jDj0HQAXm2w==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
{"checksum"=>"1rjlByqkT2HV70S4rQsqTw==", "record_id"=>"ebef783e-da83-478b-8467-d5cc99709272"},
{"checksum"=>"HljKqB6+a4z7yjfqllx4kQ==", "record_id"=>"6e08f261-6dd1-4498-a4be-c3cfb1e46393"},
{"checksum"=>"qtt8obM5FdBOZKgmhECXFg==", "record_id"=>"9b515e88-e783-44b4-89ef-7ea4d3dc02fc"},
{"checksum"=>"MxCGonF7giYkVKN4dx+rKg==", "record_id"=>"4653e6af-12d3-4767-be78-2713dec4d1f6"},
{"checksum"=>"BY3gLzkSPTF18A1JX67kBg==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
{"checksum"=>"xzbqWsiVx+4hVQiY64SAVA==", "record_id"=>"25520184-11ad-4465-a039-18695e42a92d"},
{"checksum"=>"aL/ITdTrXnTksq0IU5ABcA==", "record_id"=>"fae6573a-934c-4b60-93e6-38a7ea82651c"},
{"checksum"=>"KBSFkcK7+prvJwAqpJTAaQ==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
{"checksum"=>"36r3b8SWaEEHSDmUDWYUZw==", "record_id"=>"fc272f44-17bb-4980-bb31-3b86097abab0"},
{"checksum"=>"wXGvEO+P2YiZJuaB6CPwYA==", "record_id"=>"0b800874-9bc7-49e1-9f08-aeba43cfde77"}]
Are these intentional?
Let's output this nicely, in a format similar to the duplicate-records query above.
results = ActiveRecord::Base.connection.execute("SELECT record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id and (b.checksum,a.record_id) IN (SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1)")
CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end
Google sheet shared: https://docs.google.com/spreadsheets/d/1khOWEk2XusG98vafWBgzACmbM-a5TR7K4Xzy1VcZl6M/edit#gid=1219983193