Code Comments
I need to get a list of all the files that end with '.html' in a directory and all of its subdirectories. I then want to search through each file and remove the ones from the list that contain '<%perl>' or '<%init>'. How can I do this?

Thanks for any help.

-- 
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548

On Fri, 30 Jul 2004, Andrew Gaffney wrote:

> I need to get a list of all the files that end with '.html' in a
> directory and all of its subdirectories. I then want to search through
> each file and remove the ones from the list that contain '<%perl>' or
> '<%init>'. How can I do this? Thanks for any help.

From a Unix command line, you could do something like this:

    $ find /path/to/htdocs -type f | xargs egrep -li '<%(perl|init)>'

The above line results in a list of all the files that have either '<%perl>' or '<%init>' in them.

From here, you can go a step further by deleting them all. Because files with spaces in their name (or their path) can break this horribly, I'll use `sed` to wrap each line in quotes before removing them:

    $ find /path/to/htdocs -type f | \
    >   xargs egrep -li '<%(perl|init)>' | \
    >   sed 's/\(.*\)/"\1"/' | \
    >   xargs rm -i

This should also prompt you before taking any action, in case you realize that you really wanted one of these files. If you want to just proceed blindly -- and my but you're brave if you do -- then delete the "-i" from the last line.

Of course, you probably wanted to do this in Perl, but sometimes things are just as easy to do with shell tools, and this seems like a good example. Unless you want to do this all the time -- in which case go ahead & script it in Perl -- a shell one-liner like this should be fine.

And of course this all breaks down if you're using Windows, in which case unless you're a fan of Cygwin you can just ignore all of this :)

-- 
Chris Devers

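The listing step above can also be written so that file names with spaces survive the first pipe as well, and so that only '.html' files are examined. This is a minimal sketch, not the command from the post: the function name `list_html_with_markers` is hypothetical, and it assumes a `find` with `-print0` and an `xargs` with `-0` (both GNU and BSD versions have them).

```shell
# Hypothetical helper (name not from the thread): print every .html file
# under the directory "$1" that contains '<%perl>' or '<%init>'.
list_html_with_markers() {
    # -name '*.html' restricts the search to .html files.
    # -print0 / -0 pass file names NUL-terminated, so spaces in a name
    # or path are not split into separate arguments.
    # grep -l prints matching file names, -i ignores case,
    # -E enables the (a|b) alternation syntax.
    find "$1" -type f -name '*.html' -print0 |
        xargs -0 grep -liE '<%(perl|init)>'
}
```

Called as `list_html_with_markers /path/to/htdocs`, it only prints names and deletes nothing.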
Chris Devers wrote:
> On Fri, 30 Jul 2004, Andrew Gaffney wrote:
>
> From a Unix command line, you could do something like this:
>
>   $ find /path/to/htdocs -type f | xargs egrep -li '<%(perl|init)>'
>
> The above line results in a list of all the files that have either
> '<%perl>' or '<%init>' in them.
>
> From here, you can go a step further by deleting them all. Because files
> with spaces in their name (or their path) can break this horribly, I'll
> use `sed` to wrap each line in quotes before removing them:
>
>   $ find /path/to/htdocs -type f | \
>
> This should also prompt you before taking any action, in case you
> realize that you really wanted one of these files. If you want to just
> proceed blindly -- and my but you're brave if you do -- then delete the
> "-i" from the last line.

I think you misunderstand. I don't want to delete the files that contain '<%perl>' or '<%init>'. I just want to make a list of all .html files in a directory tree and remove the ones that contain '<%perl>' or '<%init>' from my list.

-- 
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548

On Fri, 30 Jul 2004, Andrew Gaffney wrote:

> I think you misunderstand. I don't want to delete the files that
> contain '<%perl>' or '<%init>'. I just want to make a list of all
> .html files in a directory tree and remove the ones that contain
> '<%perl>' or '<%init>' from my list.

Then yes, I misunderstood. This version should do what you want:

    $ find /path/to/htdocs -type f | xargs egrep -liv '<%(perl|init)>'

It's exactly like the first one I sent before, but I've added "-v" to the egrep arguments, which inverts the meaning from "all files with this pattern" to "all files NOT with this pattern". In this case, that's what you're trying to get.

If you then want to remove / delete files, tack on the sed & rm commands I had in the earlier version, but it sounds like you just mean "omit from the list" rather than "remove from the hard drive".

-- 
Chris Devers

Chris Devers wrote:
> On Fri, 30 Jul 2004, Andrew Gaffney wrote:
>
> Then yes, I misunderstood. This version should do what you want:
>
>   $ find /path/to/htdocs -type f | xargs egrep -liv '<%(perl|init)>'
>
> It's exactly like the first one I sent before, but I've added "-v" to
> the egrep arguments, which inverts the meaning from "all files with this
> pattern" to "all files NOT with this pattern". In this case, that's what
> you're trying to get.
>
> If you then want to remove / delete files, tack on the sed & rm commands
> I had in the earlier version, but it sounds like you just mean "omit
> from the list" rather than "remove from the hard drive".

That still doesn't appear to do what I want. I believe it is showing me all files where *all* lines don't contain '<%perl>' or '<%init>'. Since not *all* lines contain either one of those, all files still show in the list.

-- 
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548

On Fri, 30 Jul 2004, Andrew Gaffney wrote:

> Chris Devers wrote:
>
> That still doesn't appear to do what I want. I believe it is showing
> me all files where *all* lines don't contain '<%perl>' or '<%init>'.
> Since not *all* lines contain either one of those, all files still
> show in the list.

Okay, let's try again then:

    $ grep -li '<title>' *html   # print all html files with '<title>'
    20things.html
    bookmarks.html
    gas.html
    gas_form.html
    itunes.html
    noise.html
    $

    $ grep -Li '<title>' *html   # print all html files WITHOUT '<title>'
    HEADER.shtml
    $

The two sets don't intersect, which is apparently what you meant.

If you want to refine this further, try `egrep --help` or `man egrep`. I should have tested what I sent before sending it, but ten seconds of skimming over the documentation on your own should have been enough to show you these lines from `egrep --help`:

    $ egrep --help | grep -i 'files.*match.*print'
      -L, --files-without-match only print FILE names containing no match
      -l, --files-with-matches  only print FILE names containing matches
    $

So, as with many Unix commands, shift-L inverts the usual sense of L, meaning that '-L' gets you the opposite of what '-l' does.

Now have we got it? :-)

-- 
Chris Devers

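The difference between `-l -v` and `-L` discussed above can be checked on two throwaway files. This is a small sketch under stated assumptions: the file names `mason.html` and `plain.html` and their contents are made up for illustration, and `-L` is assumed to be available (it is in GNU and BSD grep, but it is not required by POSIX).

```shell
# Build two made-up sample files in a temporary directory.
demo=$(mktemp -d)
printf '<html>\n<%%init>my $x = 1;</%%init>\n</html>\n' > "$demo/mason.html"
printf '<html>\n<p>static</p>\n</html>\n' > "$demo/plain.html"

# -l -v lists files with at least one line that does NOT match.
# Both files have such a line, so both are listed.
with_lv=$(cd "$demo" && grep -livE '<%(perl|init)>' *.html)

# -L lists files in which NO line matches, so only plain.html is listed.
with_L=$(cd "$demo" && grep -LiE '<%(perl|init)>' *.html)

printf '%s\n' "-l -v lists:" "$with_lv" "-L lists:" "$with_L"
rm -rf "$demo"
```

This is why the `-liv` version still showed every file: almost any HTML file has some line without the marker, so `-v` selects a line in every one of them.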
In DOS:

    > perl -n0 -e "push @b, $ARGV unless /<%(?:perl|init)>/; END{print \"@b\"}" file1.html file2.html file3.html

In *nix (untested):

    > perl -n0 -e 'push @b, $ARGV unless /<%(?:perl|init)>/; END{print "@b"}' *.html

"Andrew Gaffney" <agaffney@skylineaero.com> wrote in message
> That still doesn't appear to do what I want. I believe it is showing me all
> files where *all* lines don't contain '<%perl>' or '<%init>'. Since not *all*
> lines contain either one of those, all files still show in the list.

Chris Devers wrote:
> On Fri, 30 Jul 2004, Andrew Gaffney wrote:
>
> Okay, let's try again then:
>
>   $ grep -li '<title>' *html   # print all html files with '<title>'
>   20things.html
>   bookmarks.html
>   gas.html
>   gas_form.html
>   itunes.html
>   noise.html
>   $
>
>   $ grep -Li '<title>' *html   # print all html files WITHOUT '<title>'
>   HEADER.shtml
>   $
>
> The sets are non-intersecting, and so apparently what you meant.
>
> If you want to refine this further, try `egrep --help` or `man egrep`.
> I should have tested what I sent before sending it, but ten seconds of
> skimming over the documentation on your own should have been enough to
> show you these lines from `egrep --help`:
>
>   $ egrep --help | grep -i 'files.*match.*print'
>     -L, --files-without-match only print FILE names containing no match
>     -l, --files-with-matches  only print FILE names containing matches
>   $
>
> So, as with many Unix commands, shift-L inverts the usual sense of L,
> meaning that '-L' gets you the opposite of what '-l' does.
>
> Now have we got it? :-)

I think it is a problem with the regex. If I change it to:

    grep -RLi '<%init>' * | grep '.html'

I get all files that don't have '<%init>', but it doesn't work with the '<%(init|perl)>'. That regex doesn't seem to match anything.

-- 
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548

On Fri, 30 Jul 2004, Andrew Gaffney wrote:

> I think it is a problem with the regex. If I change it to:
>
>   grep -RLi '<%init>' * | grep '.html'
>
> I get all files that don't have '<%init>', but it doesn't work with
> the '<%(init|perl)>'. That regex doesn't seem to match anything.

More man page material: I was using `egrep` for the earlier examples, not `grep`. On my computer (a Mac), `egrep` is equivalent to `grep -E`; either way, this pulls in an extended regex parser that, in this case, is being used to match multiple patterns (by|doing|this). Plain `grep` treats '(', '|' and ')' as literal characters, which is why your pattern matched nothing.

Hence, these two lines are equivalent:

    egrep 'pattern|anotherpattern' *
    grep -E 'pattern|anotherpattern' *

Also, the line you ended up with --

    grep -RLi '<%init>' * | grep '.html'

-- should be equivalent to this one --

    grep -RLi '<%init>' *html

-- without needing the second grep statement.

And to weave the multiple pattern matching back in, you can do these:

    egrep -RLi '<%(init|perl)>' *html
    grep -RLiE '<%(init|perl)>' *html

Both of these should match files that have neither of the two patterns you were asking about: /<%init>/ nor /<%perl>/.

Make sense?

-- 
Chris Devers

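The point about regex flavours can be checked directly. The sketch below is illustrative only (the variable names are made up): it feeds the string '<%init>' to plain `grep` and to `grep -E`, and also shows what lowercase `-e` is actually for, namely supplying more than one pattern.

```shell
sample='<%init>'

# Basic regex (plain grep): '(', '|' and ')' are literal characters,
# so this pattern only matches the literal text '<%(init|perl)>'.
if printf '%s\n' "$sample" | grep -q '<%(init|perl)>'; then
    bre=match
else
    bre=no-match
fi

# Extended regex (grep -E, same as egrep): (a|b) means "a or b".
if printf '%s\n' "$sample" | grep -qE '<%(init|perl)>'; then
    ere=match
else
    ere=no-match
fi

# Lowercase -e names a pattern; repeating it gives "either pattern"
# without needing extended syntax.
if printf '%s\n' "$sample" | grep -q -e '<%init>' -e '<%perl>'; then
    multi=match
else
    multi=no-match
fi

echo "basic: $bre, extended: $ere, two -e patterns: $multi"
# prints "basic: no-match, extended: match, two -e patterns: match"
```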
Chris Devers wrote:
> On Fri, 30 Jul 2004, Andrew Gaffney wrote:
>
> More man page material: I was using `egrep` for the earlier examples,
> not `grep`. On my computer (a Mac), `egrep` is equivalent to `grep -E`;
> either way, this pulls in an extended regex parser that, in this case,
> is being used to match multiple patterns (by|doing|this).
>
> Hence, these two lines are equivalent:
>
>   egrep 'pattern|anotherpattern' *
>   grep -E 'pattern|anotherpattern' *
>
> Also, the line you ended up with --
>
>   grep -RLi '<%init>' * | grep '.html'
>
> -- should be equivalent to this one --
>
>   grep -RLi '<%init>' *html
>
> -- without needing the second grep statement.

It isn't, though. I had the '-R' flag in, which means I want it to search subdirectories also. The '*html' gets interpreted by the shell and it ends up not recursing.

> And to weave the multiple pattern matching back in, you can do these:
>
>   egrep -RLi '<%(init|perl)>' *html
>   grep -RLiE '<%(init|perl)>' *html

I ended up with "egrep -RLi '<%(init|perl)>' * | egrep '.html$'" which seems to get me exactly what I wanted.

> Both of these should match files that have neither of the two patterns
> you were asking about: /<%init>/ nor /<%perl>/.
>
> Make sense?

Yes. Thanks for the help.

-- 
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548
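For reference, the final result can also be had without the second `egrep`, by letting `find` do both the recursion and the '.html' filtering. This is a sketch of an alternative, not the command the thread settled on; the function name `list_html_without_markers` is hypothetical, and `-L` is assumed to be supported by the local grep.

```shell
# Hypothetical helper (name not from the thread): print every .html file
# under the directory "$1" that contains neither '<%perl>' nor '<%init>'.
list_html_without_markers() {
    # find recurses and keeps only names ending in '.html', so the shell
    # never expands a glob and no second filtering grep is needed.
    # '{} +' hands the files to grep in batches; because find passes each
    # name as one argument, spaces in paths are handled correctly.
    # grep -L prints the names of files in which no line matches.
    find "$1" -type f -name '*.html' -exec grep -LiE '<%(perl|init)>' {} +
}
```

Called as `list_html_without_markers /path/to/htdocs`. One small difference from the piped version: the regex '.html$' has an unescaped dot, so it would also accept a name ending in 'xhtml', whereas the glob '*.html' requires a literal dot.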