Theremin World Archiver

Posted: 6/2/2020 7:30:54 AM
gerd

Joined: 11/25/2017

I continue the discussion of how to download a Theremin World thread from here:
lets-design-and-build-a-mostly-digital-theremin Page=214

I've done it with a Java application. Source code is here: ThereminWorldArchiver.zip

It downloads all selected pages and resources and stores them in a local cache, so pages that are already stored won't be downloaded again the next time.
To convert HTML to PDF I simply used Chrome.
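
The caching idea can be sketched roughly like this. This is only a minimal illustration, not the code from ThereminWorldArchiver.zip: the class name `PageCache`, the `Fetcher` interface and the `page_N.html` file naming are assumptions made for the example. Treating an empty cached file as missing is also a choice of this sketch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a download cache; not taken from the real archiver. */
class PageCache {

    /** Stand-in for the real HTTP download (hypothetical interface). */
    interface Fetcher {
        String fetch(int page) throws IOException;
    }

    private final Path cacheDir;
    private final Fetcher fetcher;

    PageCache(Path cacheDir, Fetcher fetcher) {
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    /** Returns the HTML of a page, downloading it only if no usable cached copy exists. */
    String load(int page) throws IOException {
        Path file = cacheDir.resolve("page_" + page + ".html"); // assumed file naming
        // An empty file is treated like a missing one (assumption of this sketch).
        if (Files.exists(file) && Files.size(file) > 0) {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        }
        String html = fetcher.fetch(page);
        Files.createDirectories(cacheDir);
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        return html;
    }
}
```

With something like this, a second run only touches the network for pages that are not yet in the cache directory.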

I've also added these features:
- Title, thread number, and start and stop pages are stored in a config file.
- It is possible to exclude single posts, or all posts from a specified user (troll/spam).
- Instead of the box on the left side with detailed user information, I've used a single horizontal line with the user name, date and number.
- Embedded YouTube videos have been replaced by a preview image with a link.
- If the first line of a post is strong (bold), it is used as an h1 heading; with CSS it would then be possible to add a page break before each h1 heading.

Example of a config file:

Code:
title: D-Lev_Theremin
thread:28554
from:1
to:214
-host:http://illegal.images.com
-user:Troll_1
-user:Troll_2
-post:42
-post:43
-post:44
-post:142
-post:143
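
For anyone curious how such a file can be read, here is a minimal Java sketch. It is an illustration only and may differ from what the archiver really does: the class `ArchiverConfig` and its field names are made up for the example. The `+post` key is the include rule mentioned elsewhere in this thread. Each line is split at the first colon only, so a value like `http://illegal.images.com` stays intact.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for the config format shown above; the real program may differ. */
class ArchiverConfig {
    final Map<String, String> settings = new LinkedHashMap<>(); // title, thread, from, to
    final List<String> excludedHosts = new ArrayList<>();
    final List<String> excludedUsers = new ArrayList<>();
    final List<Integer> excludedPosts = new ArrayList<>();
    final List<Integer> includedPosts = new ArrayList<>();

    static ArchiverConfig parse(List<String> lines) {
        ArchiverConfig cfg = new ArchiverConfig();
        for (String raw : lines) {
            String line = raw.trim();
            int colon = line.indexOf(':'); // split at the FIRST colon only
            if (line.isEmpty() || colon < 0) {
                continue; // skip blank or malformed lines
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            switch (key) {
                case "-host": cfg.excludedHosts.add(value); break;
                case "-user": cfg.excludedUsers.add(value); break;
                case "-post": cfg.excludedPosts.add(Integer.parseInt(value)); break;
                case "+post": cfg.includedPosts.add(Integer.parseInt(value)); break;
                default:      cfg.settings.put(key, value); // title, thread, from, to
            }
        }
        return cfg;
    }
}
```

Reading the file itself would then be something like `ArchiverConfig.parse(Files.readAllLines(Paths.get("config.txt")))`.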

Posted: 6/2/2020 1:15:30 PM
pitts8rh

From: Minnesota USA

Joined: 11/27/2015

Hi gerd,

I've downloaded the source code, but could you give some more details on what to do with the three different scripts? My attempts to run them have failed, but I'm probably doing something wrong.

Posted: 6/2/2020 3:01:27 PM
gerd

Joined: 11/25/2017

Unzip the archive.

Footer.html and header.html are more or less empty HTML templates, but you can add things there, for example a title or some CSS.

The config.txt is the file that controls the export process. Here is an example that exports your "Build Project: Dewster's D-Lev Digital Theremin" thread.
It has the ID 32389.
I want to export all pages (from 1 to 14).
With -host:http://zz-sex.oldtemecula.com I've prevented pictures from being downloaded from a sex site.
With -user:gerd you can exclude all of my posts. It would still be possible to include a single post (by me) with +post:NUMBER.
With -post:135 I've excluded a troll post.

Code:
title: D-Lev_Build_Project
thread:32389
from:1
to:14
-host:http://zz-sex.oldtemecula.com
-user:gerd
-post:135
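
How the user and post rules combine can be sketched in a few lines of Java. This is a simplified illustration, not the archiver's actual code: the class `PostFilter` and its methods are hypothetical. That `+post` also wins over `-post` is an assumption of the sketch; only the override of `-user` is described above.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the exclude/include rules; not the archiver's real code. */
class PostFilter {
    private final Set<String> excludedUsers = new HashSet<>();
    private final Set<Integer> excludedPosts = new HashSet<>();
    private final Set<Integer> includedPosts = new HashSet<>();

    PostFilter excludeUser(String user) { excludedUsers.add(user); return this; } // -user:NAME
    PostFilter excludePost(int number)  { excludedPosts.add(number); return this; } // -post:NUMBER
    PostFilter includePost(int number)  { includedPosts.add(number); return this; } // +post:NUMBER

    /** Decides whether a post with the given number and author goes into the export. */
    boolean keep(int postNumber, String user) {
        if (includedPosts.contains(postNumber)) {
            return true;  // an explicit +post overrides every exclusion (assumption)
        }
        if (excludedPosts.contains(postNumber)) {
            return false; // single post excluded
        }
        return !excludedUsers.contains(user); // otherwise drop only excluded users
    }
}
```

For the config above, `new PostFilter().excludeUser("gerd").excludePost(135)` would drop all of my posts plus post 135, and adding `.includePost(N)` would bring a single one of my posts back.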


I use Eclipse to compile and run the Java program.
File->Import->General/Existing Projects into workspace. Then select the extracted folder.
Run->Run As->Java Application

Posted: 6/2/2020 3:18:11 PM
gerd

Joined: 11/25/2017

Here is a precompiled JAR file:

ThereminWorldArchiver.jar

Copy it into the folder with the config.txt and run it from the console with 

Code:
java -jar ThereminWorldArchiver.jar

 
A Java runtime must be installed.

Posted: 6/2/2020 4:48:04 PM
pitts8rh

From: Minnesota USA

Joined: 11/27/2015

Hey, that's working!  I was just wading through installing Eclipse and trying to figure it out, and you came through with the easier method.  And you customized my config file for me (you must have been reading my mind!).

Thanks for doing this.

Posted: 6/2/2020 5:06:21 PM
dewster

From: Northern NJ, USA

Joined: 2/17/2012

I know virtually nothing about Java but was able to compile & run this in Linux Mint.  Had to install "default-jdk" from the package manager, then it compiled in Geany (I love Geany!) but it wouldn't run correctly until I put all the files in the src directory (good error messages).  Running it a second time is really quick as it doesn't re-download files it already has (as you said).  Thank you gerd, this is fantastic!  How long did it take you to write it?

"with "minus"host:http://zz-sex.oldtemecula.com I've prevented to download pictures from a sex site."  - gerd

I wouldn't click that with a ten foot pole...

"I was just wading through installing Eclipse and trying to figure it out..."  - pitts8rh

I highly recommend Geany.  For me it's at that "just right" level of simple enough to just use, just complex enough to do what it needs to do, and it works great in both Win10 and Linux.  I wrote a HAL language description for it so I could do Hive assembly with the editor:

https://geany.org/

I've demoed a bunch of IDEs and most are way too complex and not ready to go out of the box.  The Geany editor reminds me a lot of Notepad++, which I was really missing on Linux until I found Geany.  (Notepad++ has the most dead-simple new language setup dialog that I've ever seen.)  The Geany "find in files" search is really nice, though nothing I've found yet can replace TextCrawler on Win.

Posted: 6/11/2020 3:11:00 PM
Jason

From: Hillsborough, NC (USA)

Joined: 2/13/2005

Would it be easier if I just expose an API to download a thread? Or a forums RSS feed for each thread? Someone’s going to have to archive and republish this whole site some day... this crusty old C# code isn’t going to last forever, and I can’t imagine rewriting the whole site over from scratch for a 4th time. 

Posted: 1/26/2021 3:29:44 PM
dewster

From: Northern NJ, USA

Joined: 2/17/2012

"Would it be easier if I just expose an API to download a thread?"  - Jason

Jason, that would be great!

The absolutely wonderful TW archiver that gerd wrote and graciously provided to us isn't working for me anymore.  I've had some HD issues here and there messing with my files, so it could be on my end.  Just wondering if anyone else is having trouble with it?  I don't know where to start with debugging this.

[EDIT] I got it to work again gerd - sorry if I bothered you!  Some time in the past the script downloaded blank pages which messed it up.  I deleted the associated zero-length html files in the "ThereminWorldArchiver/content/28554/pages/" directory, and also deleted the top level html file "Let's Design and Build a (mostly) Digital Theremin!.html" and now it's fine when I run it in Geany.  Thanks again for this fantastic program!
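
In case anyone else hits this, the cleanup I did by hand could be sketched in Java roughly as below. It is not part of gerd's program: the class `EmptyPageCleaner` and its method are made up for illustration, and it only looks at `.html` files directly inside the given directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that removes zero-byte cached pages; not part of the archiver. */
class EmptyPageCleaner {

    /** Deletes zero-byte .html files directly inside dir and returns the paths removed. */
    static List<Path> deleteEmptyHtml(Path dir) throws IOException {
        List<Path> empty = new ArrayList<>();
        // First collect the candidates, then delete, so the listing is not
        // modified while it is still being read.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.html")) {
            for (Path file : files) {
                if (Files.isRegularFile(file) && Files.size(file) == 0) {
                    empty.add(file);
                }
            }
        }
        for (Path file : empty) {
            Files.delete(file);
        }
        return empty;
    }
}
```

Pointing it at the pages directory, e.g. `EmptyPageCleaner.deleteEmptyHtml(Paths.get("ThereminWorldArchiver/content/28554/pages"))`, should make the next run download the blank pages again, assuming the archiver only checks whether the file exists.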
