Opened 14 years ago
Closed 13 years ago
Last modified 13 years ago
#8374 closed patch (Won't Fix)
update mythvideo's scan.php to support hashes
Reported by: | Dave Miller <mythtv@…> | Owned by: | Rob Smith |
---|---|---|---|
Priority: | minor | Milestone: | 0.25 |
Component: | Plugin - MythWeb | Version: | Master Head |
Severity: | medium | Keywords: | |
Cc: | Ticket locked: | no |
Description
The attached patches update scan.php in mythweb's video module to support handling the hashes that are now kept on video files. They also fix a minor potential security issue where the path being scanned isn't properly shell-escaped before being passed to the find command. Even in the non-security case, the lack of shell-escape could have undesired consequences if your storage path happens to contain any spaces or other funny characters.
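For illustration, here is the same quoting pitfall in Python terms (the patch itself presumably uses PHP's escapeshellarg; the function name here is hypothetical):

```python
import shlex

def find_video_files(storage_path):
    """Build a `find` command line with the path safely quoted.

    Without quoting, a path like '/srv/my videos' would be split by
    the shell into two arguments, and any shell metacharacters in the
    path could be interpreted -- which is the security concern above.
    """
    return 'find %s -type f' % shlex.quote(storage_path)
```

In new code it is usually better to avoid the shell entirely, e.g. `subprocess.run(['find', storage_path, '-type', 'f'])`, so that no quoting is needed at all.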
From what I can tell, PHP appears not to support unsigned 64-bit integers at all, which are required for proper calculation of the hash. So I had to write a Python script to do the hash calculation (it borrows the hash routine from JAMU.py and just makes it executable standalone from the command line) and dump the result to standard output, and scan.php shells out to it. This will probably need some massaging, as I don't know the proper place to put such a script; right now scan.php with this patch just depends on it being in your default PATH (which I'm assuming probably isn't the best way to do it).
I'm more than happy to update the patches if there are suggestions of a better way to do it.
As patched, the new scanning process is:
1. Search for new files first. If a file already exists in the DB, continue as before.
2. If there isn't a file with this filename in the DB already, hash the file and look for an existing entry with the same hash (this step is skipped if the file is too small to hash). If there's a matching hash, update the filename on that entry in the database, on the assumption that the file has been moved.
3. If we still haven't matched anything at this point, it's a new file, and it's added to the database, hash included.
4. Only after the above is done, so that any moved files have already been caught, do we go back and scan for files in the database that are no longer on the filesystem and remove them from the database.
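A rough Python sketch of those four steps (a dict stands in for the database table; all names and the size threshold here are illustrative, not the patch's actual code):

```python
MIN_HASH_SIZE = 64 * 1024  # assumed threshold below which files are not hashed

def scan(fs_files, db, hash_file):
    """fs_files maps filename -> size; db maps filename -> record dict."""
    # Steps 1-3: walk the filesystem listing first.
    for path, size in fs_files.items():
        if path in db:                      # 1) already known: nothing to do
            continue
        file_hash = hash_file(path) if size >= MIN_HASH_SIZE else None
        moved = None
        if file_hash is not None:           # 2) same hash elsewhere? file was moved
            for old_path, rec in db.items():
                if rec.get('hash') == file_hash:
                    moved = old_path
                    break
        if moved is not None:
            db[path] = db.pop(moved)        # update the filename on the old row
        else:
            db[path] = {'hash': file_hash}  # 3) genuinely new file
    # 4) prune only now, so moved files were already rescued in step 2
    for path in list(db):
        if path not in fs_files:
            del db[path]
```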
Attachments (3)
Change History (14)
Changed 14 years ago by
Attachment: | scan-trunk.diff added |
---|
comment:1 Changed 14 years ago by
What about http://www.php.net/manual/en/ref.bc.php ?
Or just shove it in a string and format it that way? It doesn't look like there's any reason it requires all the math if it's just concatenating them and outputting as a hex value.
comment:2 follow-up: 3 Changed 14 years ago by
Status: | new → infoneeded_new |
---|
You can perform hashes using the Myth protocol; this is the preferred way to do it (and should only take a couple of lines of code instead of large-ish changes like this). Please see:
http://www.mythtv.org/wiki/QUERY_FILE_HASH_%28Myth_Protocol%29
I suspect this is how the MythWeb devs would prefer to see this implemented.
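For reference, Myth protocol messages are framed as an 8-character, space-padded ASCII payload length followed by the payload, with fields joined by the `[]:[]` separator (see the wiki page above). A rough sketch of building (not sending) the QUERY_FILE_HASH message; the protocol version handshake that must precede any command is omitted:

```python
def myth_frame(*fields):
    """Frame a Myth protocol command: fields joined with '[]:[]',
    prefixed with the payload length left-justified in 8 characters."""
    payload = '[]:[]'.join(fields)
    return '%-8d%s' % (len(payload), payload)

# The framed bytes would then be written to the backend's control socket
# (after the MYTH_PROTO_VERSION handshake, not shown here).
msg = myth_frame('QUERY_FILE_HASH', 'subdir/file.ext', 'Videos')
```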
comment:3 follow-up: 4 Changed 14 years ago by
Replying to anonymous:
> Or just shove it in a string and format it that way? It doesn't look like there's any reason it requires all the math if it's just concatenating them and outputting as a hex value.
It's not a matter of formatting it, it's a matter of calculating it to begin with. It's not a pair of crc32s concatenated together; it's a 64-bit checksum of the first and last megabyte of the file. You take the first and last megabyte of the file 64 bits at a time and add the words together, ignoring any carry on each addition. *If* your PHP does 64-bit ints (which it apparently only does when running on a 64-bit OS), they're signed, which means the addition breaks once you go over 63 bits. The BCMath stuff looks like it would probably solve this, however it's completely unnecessary because...
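A rough Python sketch of the calculation just described (the chunk size follows the "first and last megabyte" wording here; the authoritative routine is the one in JAMU.py, which should be checked for details such as whether the file size is also folded in):

```python
import os
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def myth_file_hash(path, chunk_size=1024 * 1024):
    """Sum the first and last chunk of the file as unsigned 64-bit
    little-endian words, discarding carries. Python ints are arbitrary
    precision, so we mask back to 64 bits after each addition -- the
    step that plain signed 64-bit PHP ints cannot do."""
    size = os.path.getsize(path)
    total = 0
    for offset in (0, max(0, size - chunk_size)):
        with open(path, 'rb') as f:
            f.seek(offset)
            data = f.read(chunk_size)
        data += b'\0' * (-len(data) % 8)  # pad to a whole number of words
        for (word,) in struct.iter_unpack('<Q', data):
            total = (total + word) & MASK64
    # Note: for files smaller than one chunk this sums the same bytes
    # twice; whether that matches the canonical routine needs checking.
    return '%016x' % total
```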
Replying to robertm:
> You can perform hashes using the Myth protocol; this is the preferred way to do it (and should only take a couple of lines of code instead of large-ish changes like this).
Asking the backend to do it for us does indeed sound like a much better way. "Largish" changes like this patch would still be needed (it's not as large as it looks; the code's being reordered, not massively changed); it just won't need the extra Python file.
Browsing through the Myth protocol, it doesn't look like it will do the actual database update for us yet, but it *will* let us do the file search, which can also replace the call out to the Unix 'find' command, thus eliminating both shell execs. Not only that, but it will relay to secondary backends if they're accessible, removing the need to limit ourselves to storage groups that reside on the backend we're running on. The only downside is that our find command on the local filesystem gives us the entire tree in one shot, whereas the directory listing generated by the Myth protocol only gives us the current directory, so we'll have to recurse ourselves, repeating calls to the backend server for each subdirectory. I like the idea of being able to scan all of the backends and not just the one we're running on. (Heck, maybe MythWeb doesn't even need to be running on the backend anymore, and would still work that way!)
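That recursion can be sketched generically; `list_dir` here is a hypothetical callable wrapping one directory-listing round trip to the backend, returning the subdirectories and files of a single directory:

```python
def walk_backend(list_dir, path=''):
    """Flatten a one-level-at-a-time listing into full relative paths.

    Each call to list_dir(path) stands in for one query to the backend
    (one round trip per directory, as described above)."""
    subdirs, files = list_dir(path)
    found = ['%s/%s' % (path, f) if path else f for f in files]
    for d in subdirs:
        sub = '%s/%s' % (path, d) if path else d
        found.extend(walk_backend(list_dir, sub))
    return found
```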
comment:4 Changed 14 years ago by
Status: | infoneeded_new → new |
---|
Replying to Dave Miller <mythtv@…>:
> Replying to robertm:
>> You can perform hashes using the Myth protocol; this is the preferred way to do it (and should only take a couple of lines of code instead of large-ish changes like this).
>
> Asking the backend to do it for us does indeed sound like a much better way. "Largish" changes like this patch would still be needed (it's not as large as it looks; the code's being reordered, not massively changed); it just won't need the extra Python file.
>
> Browsing through the Myth protocol, it doesn't look like it will do the actual database update for us yet, but it *will* let us do the file search, which can also replace the call out to the Unix 'find' command, thus eliminating both shell execs. Not only that, but it will relay to secondary backends if they're accessible, removing the need to limit ourselves to storage groups that reside on the backend we're running on. The only downside is that our find command on the local filesystem gives us the entire tree in one shot, whereas the directory listing generated by the Myth protocol only gives us the current directory, so we'll have to recurse ourselves, repeating calls to the backend server for each subdirectory. I like the idea of being able to scan all of the backends and not just the one we're running on. (Heck, maybe MythWeb doesn't even need to be running on the backend anymore, and would still work that way!)
You're right. I only gave it a casual look before I responded; I see that your basic logic is necessary regardless. Xris and Kormoc, when you get to this one, the logic above is basically correct and analogous to what we do in MythVideo. Dave, you are correct that at this point there should be no reason to use any local file finding in MythWeb, as it supports Storage Groups only now. You can get a queried file list with the protocol on all backends; this is how MythVideo does its scanning. Between hashing, properly querying all backends (get the contents of the "Video" SGs, get a file list on the relevant backend hosts of each dir), and parsing filenames properly (see the regexp used in MythVideo to determine season/episode/title/subtitle), MythWeb would be scanning things well enough that at least a metadata lookup would work properly (it does not in the case of television right now).
http://www.mythtv.org/wiki/QUERY_SG_GETFILELIST_%28Myth_Protocol%29
Obviously you are under no obligation to add these things, just getting them out there as prerequisites for baseline functionality of the MW scanner.
comment:5 Changed 14 years ago by
Status: | new → assigned |
---|---|
Summary: | update mythvideo's scan.php to support hash tags → update mythvideo's scan.php to support hashes |
comment:6 Changed 14 years ago by
Milestone: | unknown → 0.24 |
---|
comment:7 Changed 14 years ago by
Milestone: | 0.24 → 0.25 |
---|
I disabled scanning in the module, so pushing this to 0.25.
comment:9 Changed 13 years ago by
Note that I have added a hash function to the new Content API service.
backendserverip:6544/Content/GetHash?StorageGroup=Videos&FileName=subdir/file.ext
This is probably what MythWeb should be using as it's baked directly into the core library hashing methods.
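A sketch of building that URL from Python, with the query string percent-encoded (hostname and parameters are the ones from the example above; actually fetching it is left as a comment):

```python
from urllib.parse import urlencode

def gethash_url(host, storage_group, filename, port=6544):
    """Build the Content service GetHash URL shown above.

    The parameters are percent-encoded, which matters for filenames
    containing spaces, '&', or other reserved characters."""
    qs = urlencode({'StorageGroup': storage_group, 'FileName': filename})
    return 'http://%s:%d/Content/GetHash?%s' % (host, port, qs)

# from urllib.request import urlopen
# hash_text = urlopen(gethash_url('backendserverip', 'Videos', 'subdir/file.ext')).read()
```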
comment:10 Changed 13 years ago by
Resolution: | → Won't Fix |
---|---|
Status: | assigned → closed |
Thank you for the patches, but we're going a different direction with this one. The backend protocol now includes a SCAN_VIDEOS command that should be called to initiate a scan on the backend, rather than maintaining a separate scanner in MythWeb.
Patch against trunk