Opened 14 years ago

Closed 13 years ago

Last modified 13 years ago

#8374 closed patch (Won't Fix)

update mythvideo's scan.php to support hashes

Reported by: Dave Miller <mythtv@…> Owned by: Rob Smith
Priority: minor Milestone: 0.25
Component: Plugin - MythWeb Version: Master Head
Severity: medium Keywords:
Cc: Ticket locked: no

Description

The attached patches update scan.php in mythweb's video module to support handling the hashes that are now kept on video files. They also fix a minor potential security issue where the path being scanned isn't properly shell-escaped before being passed to the find command. Even in the non-security case, the lack of shell-escape could have undesired consequences if your storage path happens to contain any spaces or other funny characters.

From what I can tell, php appears to not support unsigned 64-bit integers at all, which are required for proper calculation of the hash. So I had to write a python script to do the hash calculation (it borrows the hash routine from JAMU.py and just makes it executable standalone from the command line) and dump it to standard out, then we shell out to it. This will probably need some massaging, as I don't know where the proper place is to put such a script, and right now scan.php with this patch just depends on it being in your default PATH (which I'm assuming probably isn't the best way to do it).

I'm more than happy to update the patches if there are suggestions of a better way to do it.

As patched, the new scanning process is: 1) Search for new files first. If it exists in the DB already, continue like before. 2) If there isn't a file with this filename in the DB already, then we hash the file, and look to see if there's a file with the same hash already (this step is skipped if the file is too small to hash). If there's a matching hash, then we update the filename on that entry in the database, on the assumption that the file's been moved. 3) If we still haven't matched anything at this point, it's a new file and it's added to the database, hash included. 4) After the above is done, so we can catch any files that were moved, then we go back and scan for files in the database that are no longer on the filesystem and remove them from the database.

Attachments (3)

scan-trunk.diff (4.4 KB) - added by Dave Miller <mythtv@…> 14 years ago.
Patch against trunk
scan-0-23.diff (4.4 KB) - added by Dave Miller <mythtv@…> 14 years ago.
patch against 0.23 branch
mythhash (1.5 KB) - added by Dave Miller <mythtv@…> 14 years ago.
mythhash python script

Download all attachments as: .zip

Change History (14)

Changed 14 years ago by Dave Miller <mythtv@…>

Attachment: scan-trunk.diff added

Patch against trunk

Changed 14 years ago by Dave Miller <mythtv@…>

Attachment: scan-0-23.diff added

patch against 0.23 branch

Changed 14 years ago by Dave Miller <mythtv@…>

Attachment: mythhash added

mythhash python script

comment:1 Changed 14 years ago by anonymous

What about http://www.php.net/manual/en/ref.bc.php ?

Or just shoving it in a string and formatting it that way? There doesn't look like there's any reason it requires all the math if it's just concatting them and outputting as a hex value

comment:2 Changed 14 years ago by robertm

Status: newinfoneeded_new

You can perform hashes using the Myth protocol, this is the preferred way to do it (and should only take a couple lines of code instead of large-ish changes like this). Please see:

http://www.mythtv.org/wiki/QUERY_FILE_HASH_%28Myth_Protocol%29

I suspect this is how the MythWeb devs would prefer to see this implemented.

comment:3 in reply to:  2 ; Changed 14 years ago by Dave Miller <mythtv@…>

Replying to anonymous:

Or just shoving it in a string and formatting it that way? There doesn't look like there's any reason it requires all the math if it's just concatting them and outputting as a hex value

It's not a matter of formatting it, it's a matter of calculating it to begin with. It's not a pair of crc32s concatenated to each other, it's a 64-bit checksum of the first and last megabyte of the file. You take the first and last megabyte of the file 64 bits at a time and add it together, ignoring any carry on each addition. *if* your PHP does 64-bit ints (which it apparently only does if running on a 64-bit OS) it only does signed ints, which means the addition breaks when you go over 63 bits. The BCMath stuff looks like it'd probably solve this, however it's completely unnecessary because...

Replying to robertm:

You can perform hashes using the Myth protocol, this is the preferred way to do it (and should only take a couple lines of code instead of large-ish changes like this).

Asking the backend to do it for us indeed sounds like a much better way. "largish" changes like this patch would still be needed (it's not as large as it looks, the code's being reordered, not massively changed) it just won't need the extra python file.

Browsing through the Myth protocol, it doesn't look like it'll do the actual database update for us yet, but it *will* let us do the file search, which can also replace the call out to the unix 'find' command, thus eliminating both shell execs. Not only that but it'll relay to secondary backends if they're accessible, removing the need to limit ourselves to storage groups that reside on the backend we're running on. The only downside is our find command on the local filesystem gives us the entire tree in one shot, where the directory listing generated by the Myth protocol only gives us the current directory, so we'll have to recurse it ourselves and repeat calls to the backend server to get each of the subdirectories and so forth. I like the idea of being able to scan all of the backends and not just the one we're running on. (Heck, maybe mythweb doesn't even need to be running on the backend anymore, and will still work that way!)

comment:4 in reply to:  3 Changed 14 years ago by robertm

Status: infoneeded_newnew

Replying to Dave Miller <mythtv@…>:

Replying to robertm:

You can perform hashes using the Myth protocol, this is the preferred way to do it (and should only take a couple lines of code instead of large-ish changes like this).

Asking the backend to do it for us indeed sounds like a much better way. "largish" changes like this patch would still be needed (it's not as large as it looks, the code's being reordered, not massively changed) it just won't need the extra python file.

Browsing through the Myth protocol, it doesn't look like it'll do the actual database update for us yet, but it *will* let us do the file search, which can also replace the call out to the unix 'find' command, thus eliminating both shell execs. Not only that but it'll relay to secondary backends if they're accessible, removing the need to limit ourselves to storage groups that reside on the backend we're running on. The only downside is our find command on the local filesystem gives us the entire tree in one shot, where the directory listing generated by the Myth protocol only gives us the current directory, so we'll have to recurse it ourselves and repeat calls to the backend server to get each of the subdirectories and so forth. I like the idea of being able to scan all of the backends and not just the one we're running on. (Heck, maybe mythweb doesn't even need to be running on the backend anymore, and will still work that way!)

You're right. I only gave it a casual look before I responded, I see that your basic logic is necessary regardless. Xris and Kormoc, when you get to this one, the logic above is basically correct and analogous to what we do in MythVideo?. Dave, you are correct that at this point there should be no reason to use any local file finding in MythWeb as it supports Storage Groups only now. You can get a queried file list with the protocol on all backends. This is how MythVideo? does its scanning. Between hashing, properly querying all backends (get the contents of the "Video" SGs, get a file list on the relevant backend hosts of each dir), and parsing filenames properly (see the regexp using in MythVideo? to determine season/episode/title/subtitle), MythWeb would be scanning things well enough that at least a metadata lookup will work properly (It does not in the case of television right now).

http://www.mythtv.org/wiki/QUERY_SG_GETFILELIST_%28Myth_Protocol%29

Obviously you are under no obligation to add these things, just getting them out there as prerequisites for baseline functionality of the MW scanner.

comment:5 Changed 14 years ago by robertm

Status: newassigned
Summary: update mythvideo's scan.php to support hash tagsupdate mythvideo's scan.php to support hashes

comment:6 Changed 14 years ago by Rob Smith

Milestone: unknown0.24

comment:7 Changed 14 years ago by Rob Smith

Milestone: 0.240.25

I disabled scanning in the module so pushing this to 0.25

comment:8 Changed 14 years ago by stuartm

Milestone: 0.25

Milestone 0.25 deleted

comment:9 Changed 13 years ago by robertm

Note that I have added a hash function to the new Content API service.

backendserverip:6544/Content/GetHash?StorageGroup=Videos&FileName=subdir/file.ext

This is probably what MythWeb should be using as it's baked directly into the core library hashing methods.

comment:10 Changed 13 years ago by Raymond Wagner

Resolution: Won't Fix
Status: assignedclosed

Thank you for the patches but we're going a different direction with this one. The backend protocol now includes a SCAN_VIDEOS command that should be called to initiate a scan on the backend, rather than maintain a separate scanner in MythWeb.

comment:11 Changed 13 years ago by Dave Miller <mythtv@…>

Totally agreed. I was trying to initiate the scan from external scripts, and only started messing with MythWeb because it seemed to be the only way to trigger it externally. If there's a *real* way to do it, I don't need it in MythWeb anymore. :)

Note: See TracTickets for help on using tickets.