| Author |
Message |
Dutch_Master LXF regular
Joined: Tue Mar 27, 2007 2:49 am Posts: 2354
|
Posted: Sun Mar 25, 2012 10:18 pm Post subject: De-internet-archive-scripting webpages |
|
|
Some time ago I downloaded a website from the Internet Archive, as it turned out the owner had discontinued it. The archive inserts some ***** scripting that I greatly dislike, and it keeps linking back to the archive. I could remove all instances by hand in the HTML code, but with 100+ pages I'd think there's a better (quicker!) solution. I assume sed or awk is required, but knowing nought about either, what's the best one-liner (or script, that's fine) to get me running? (s'cuse the pun ) |
|
nelz Moderator

Joined: Mon Apr 04, 2005 12:52 pm Posts: 8002 Location: Warrington, UK
|
Posted: Mon Mar 26, 2012 1:55 am Post subject: |
|
|
If the code is in the same place in each file, you can remove a range of lines with
| Code: | for i in *.html; do
sed -i x-yd "$i"
done |
Where x and y are the line numbers of the first and last lines of the script. Otherwise we'd need to see an example to work out how to identify the lines to be deleted. _________________ Unix is user-friendly. It's just very selective about who its friends are. |
|
Dutch_Master LXF regular
Joined: Tue Mar 27, 2007 2:49 am Posts: 2354
|
Posted: Mon Mar 26, 2012 2:32 am Post subject: |
|
|
The problem is that although the archive script puts a lot of its code in the same place in each file, it also rewrites every link in a page as an absolute link (with the http prefix). On top of that, I introduced some issues of my own when storing the files, which only became apparent when I opened the HTML code.... Right now, I think your script will remove the bulk of the added code and I'd have to manually edit the absolute links back to relative links. Thanks again Nelz!
[edit: cried victory too soon. After replacing the x and y with numerical values I got the error
| Code: | sed: -e expression #1, char 3: unknown command: `-' |
I've got as far as
| Code: | for i in *.html; do sed i\ {14-216}d $i; done |
This clears the files completely. Luckily I had a backup.... ]
[edit2: here's a simple sample] |
|
nelz Moderator

Joined: Mon Apr 04, 2005 12:52 pm Posts: 8002 Location: Warrington, UK
|
Posted: Mon Mar 26, 2012 10:34 am Post subject: |
|
|
My bad, I was working from failing memory: the range is written x,yd (with a comma), not x-yd, which sed rejects as an unknown command.
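For anyone following along, here is a minimal sketch of the comma form, run against a throwaway file rather than real pages. The file name demo.html and the range 2,4 are made up for illustration; substitute the real first and last line numbers of the inserted script. It assumes a sed that supports -i with an attached backup suffix, such as GNU sed.

```shell
#!/bin/sh
# Work in a scratch directory so no real pages are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical six-line page; lines 2-4 stand in for the unwanted script.
printf '%s\n' '<html>' '<script>' 'unwanted();' '</script>' \
    '<p>content</p>' '</html>' > demo.html

# Delete lines 2 through 4 in place. Note the comma between the two addresses.
# The file name is quoted so that names containing spaces survive the loop.
for i in *.html; do
    sed -i.bak '2,4d' "$i"
done

# Show what is left; the untouched original is kept as demo.html.bak.
cat demo.html
```

Trying this on a scratch copy first is cheap insurance, since a wrong range silently removes the wrong lines.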
You can also give an extension to -i and sed will create backups of the original files with that extension. This will remove the toolbar stuff and save a backup
| Code: | sed -i.bak '/BEGIN WAYBACK TOOLBAR/,/END WAYBACK TOOLBAR/d' 'Start page.html' |
You can change the links to relative with something like
| Code: | sed 's|http://web\.archive\.org/web/20041103050546/http://web\.utanet\.at/smiderkr/asr/||g' |
on all files in the same directory, but it gets messy if the pages are stored in multiple subdirectories. _________________ Unix is user-friendly. It's just very selective about who its friends are. |
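One way the subdirectory case might be handled is sketched below: walk the tree with find, strip the toolbar block, and replace the archive prefix with one "../" per directory level so links still resolve relative to the site root. Everything here is an assumption for illustration: the scratch layout, the make_page helper, the sample page content and the exact toolbar comment text are invented, and the dots in the prefix are left unescaped, so they match any character. Treat it as a starting point, not a tested recipe for the real site.

```shell
#!/bin/sh
# Hypothetical scratch copy of the site, with one page in a subdirectory.
site=$(mktemp -d)
mkdir -p "$site/sub"
prefix='http://web.archive.org/web/20041103050546/http://web.utanet.at/smiderkr/asr/'

# Helper (illustration only): write a small page carrying a toolbar block
# and one archive-style absolute link.
make_page() {
    printf '%s\n' '<html>' \
        '<!-- BEGIN WAYBACK TOOLBAR INSERT -->' \
        '<script>toolbar();</script>' \
        '<!-- END WAYBACK TOOLBAR INSERT -->' \
        "<a href=\"${prefix}index.html\">home</a>" \
        '</html>' > "$1"
}
make_page "$site/Start page.html"
make_page "$site/sub/other.html"

cd "$site" || exit 1
find . -name '*.html' -type f | while IFS= read -r f; do
    # Build one "../" for each directory level below the site root.
    rel=${f#./}
    up=''
    while [ "$rel" != "${rel#*/}" ]; do
        up="../$up"
        rel=${rel#*/}
    done
    # Remove the toolbar block, then swap the archive prefix for the
    # relative path back to the site root. Originals are kept as *.bak.
    sed -i.bak \
        -e '/BEGIN WAYBACK TOOLBAR/,/END WAYBACK TOOLBAR/d' \
        -e "s|$prefix|$up|g" \
        "$f"
done

cat "$site/sub/other.html"
```

Using | as the delimiter avoids escaping every slash in the URL. The sketch assumes no file name contains a newline and that all archived links share the single prefix shown in the thread; pages saved with a different timestamp in the URL would need their own pattern.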
|
Dutch_Master LXF regular
Joined: Tue Mar 27, 2007 2:49 am Posts: 2354
|
Posted: Mon Mar 26, 2012 2:06 pm Post subject: |
|
|
Thanks Nelz, I'll give it a try later.  |
|
Dutch_Master LXF regular
Joined: Tue Mar 27, 2007 2:49 am Posts: 2354
|
Posted: Thu Jun 21, 2012 3:07 pm Post subject: |
|
|
And again I find a familiar question, in LXF159 this time. Thx guys!
(haven't pursued it yet, I had some RSI complaints in my wrist back then... ) |
|
|