Changelog

Unreleased

Full Changelog

3.5.0 2024-10-29

Full Changelog

Removals

Removed HBase modules from contrib. #621

Fixes

ConfigurableExtractorJS: Set default value (false) for strict property. #612
ExtractorHTML: Treat cite attribute as a navlink instead of embed. #608
Building no longer requires the builds.archive.org repository. #614
Updated to new URL of the restlet repository.

Dependency Upgrades

Removed hbase, joda-time, log4j
commons-io 2.14.0
kafka-clients 3.8.0
ftpserver-core 1.2.0
jetty 9.4.56.v20240826
webarchive-commons 1.1.10

3.4.0-20240909 2024-09-09

Full Changelog

Compatibility Note

Checkpoints and crawl state created with older versions of Heritrix will not be loadable as kryo has been significantly updated. Replaying the recovery log may be an alternative in some cases.

New Features

JDK 22 support
Added ConfigurableExtractorJS for more flexible JavaScript extraction. (#602)
Added HostnameQueueAssignmentPolicyWithLimits with optional name length limits. (#598)
ExtractorHTML can now extract more variants of alternative resolution image URLs. (#605)
ExtractorHTTP can now be configured with extra inferred paths (#597)
ExtractorYoutubeDL metadata records can now be optionally logged to crawl.log (#593)

Removals

Removed ExtractorChrome from contrib (#601)

Fixes

Reduced false positive speculative URLs from meta tags (#595)
Fixed BdbModule resource leak on job teardown (f4280012ae5f23763f1e19d196a245ae49f9b697)
Corrected function name in ScriptedProcessor Javadoc. (#599)
Updated Maven builds to use HTTPS for resolving dependencies.
Reset CrawlURI status for hasPrerequisite() so that it isn't preserved between attempts (#600)
Fixed older junit3 tests not being run (#592)
Increased DiskSpaceMonitor default pause threshold to 8 GiB (#499)
Stopped logging authentication failures when header is missing (#539)
Fixed console still showing job running after crash (#549)

Dependency Upgrades

Transitioned PDFParser and ExtractorPDF to pdfbox (#575)
Transitioned ExtractorYoutubeDL to yt-dlp
commons-net 3.9.0
com.rabbitmq:amqp-client 5.18.0
dnsjava 3.6.0
groovy 4.0.21
kryo 5.6.0
spring-expression 5.3.39

3.4.0-20220727 (2022-07-27)

Full Changelog

Fixed bugs:

ExtractorHTML matches srcset attribute case-sensitively #477
Overcrawling due to sitemap links acting like transclusions #469
"java.lang.NoClassDefFoundError: Could not initialize class org.archive.util.CLibrary" on Apple Silicon #467
Heritrix crashing on malformed Content-Length header #449
Java version check throws StringIndexOutOfBoundsException on exact major versions #439
dnsjava NIO selector thread stuck at 100% after terminating job #425
Do not treat all URLs from link/@href tags as embeds. #263
BdbCookieStore not implemented iterator at RetryExec #200
"RIS already open for ToeThread..." exception during https pages crawl over proxy #191

Closed issues:

Heritrix not ignoring robots.txt #479
JDK18: ExtractorMultipleRegexTest fails due to Groovy asm incompatibility #473
Setting of maxLogFileSize in the BDBModule is ineffective #464
Question about memory usage #462
Build failing via maven-assembly-plugin: group id is too big #447
Do not require DNS when using a web proxy #211

Merged pull requests:

Bump jsch from 0.1.52 to 0.1.54 in /commons #492 (dependabot[bot])
Bump spring-core from 5.3.19 to 5.3.20 in /commons #491 (dependabot[bot])
Bump jsch from 0.1.52 to 0.1.54 in /modules #490 (dependabot[bot])
Add robotsTxtOnly robots policy #489 (ato)
Removed a potential NPE in hashCode method to CrawlURI which was fata… #488 (csrster)
Bump gson from 2.8.6 to 2.8.9 in /contrib #486 (dependabot[bot])
Bump spring-core from 5.3.18 to 5.3.19 in /commons #480 (dependabot[bot])
ExtractorHTML: Fix srcset by normalizing elementContext() to lowercase #478 (ato)
Issue211: support dns over https if local DNS is not working / available #476 (ClemensRobbenhaar)
Bump spring-beans from 5.3.14 to 5.3.18 in /commons #475 (dependabot[bot])
TransclusionDecideRule: Don't treat sitemap links ('M') as transclusions #470 (ato)
Use Files.createLink() and Files.createSymbolicLink() instead of JNA #468 (ato)
Fix name of parameter in setMaxLogFileSize #465 (ClemensRobbenhaar)
Add conf to not allow TLDs as seeds found via redirect from other seeds #461 (kris-sigur)
Bump spring-core from 5.3.3 to 5.3.14 in /commons #460 (dependabot[bot])
ExtractorHTML: Determine LINK tag type by parsing REL attribute #459 (ato)
Fix issue#191: "RIS already open for ToeThread..." exception during https pages crawl over proxy #457 (ClemensRobbenhaar)
FetchHTTP: Handle null characters in the Content-Length header #452 (ato)
Add Dockerfile #450 (Querela)
Resolve gid too big #448 (ldko)
FetchDNS: Keep dnsjava selector thread out of ToePool #444 (ato)
Enabled configurable url-matching and extraction for sitemaps. #441 (csrster)

3.4.0-20210923 (2021-09-23)

Full Changelog

Fixed bugs:

ExtractorChrome exception on images as data uris #430
Thread-safety issues with the CookieStore #427
Cookies being sent to wrong site #259

Closed issues:

Trying to get in touch regarding a security issue #429
Upgrade HTTP Client to 4.5.x #245

Merged pull requests:

Add safer cookie iteration #434 (anjackson)
ExtractorChrome bug fixes #431 (ato)
UI: Refactor duplicate template rendering code #424 (ato)

3.4.0-20210803 (2021-08-03)

Full Changelog

Fixed bugs:

Jobs can get stuck STOPPING with "Interrupt leaving unfinished CrawlURI" #420
Groovy version is incompatible with JDK 16+ #419
module java.base does not export sun.security.tools.keytool to unnamed module @1ece4432 #417
Distribution package has broken filesystem permissions #413
Add WARC-IP-Address header to WARCWriterChainProcessor #396

Merged pull requests:

Don't extract data URIs #423 (ato)
ToeThread: ensure currentCuri is finished before exiting #421 (ato)
JDK 16 compatibility #418 (ato)
ExtractorChrome: reduce request duplication between browser and frontier #416 (ato)
Upgrade maven-assembly-plugin to 3.3.0 to fix file permissions #414 (ato)
ExtractorChrome: Capture requests made by the browser #411 (ato)
Warc writer stats fixes #410 (ato)
Fix WARC-IP-Address and use a common server-ip CrawlURI attribute for all protocols #409 (ato)
Add basic syntax highlighting to the crawl.log viewer #408 (ato)
Fix a couple of boring maven warnings #407 (ato)
Fix and document the -r option which runs a named job on startup #406 (ato)
Speed up test suite #405 (ato)
Switch from Travis CI to Github Actions #404 (ato)
Add ExtractorChrome to contrib #403 (ato)
Upgrade httpclient to 4.5 #397 (anjackson)

3.4.0-20210621 (2021-06-21)

Full Changelog

Merged pull requests:

Remove dependency on mg4j #402 (ato)
Graceful UI shutdown #401 (kris-sigur)
Remove unnecessary fiddling with VIA path in ExtractorRobotsTxt #400 (kris-sigur)

3.4.0-20210618 (2021-06-18)

Full Changelog

Merged pull requests:

Switch to properties that enforce Java 8 compatibility. #399 (anjackson)

3.4.0-20210617 (2021-06-17)

Full Changelog

Closed issues:

Ensure valid checkpoints can be created when recovering from errors #392

Merged pull requests:

Annotate nested sitemap links as sitemaps #398 (kris-sigur)
Update AMQP client library to address security warning. #394 (anjackson)
Only update last checkpoint stats if the checkpoint completed, for #392. #393 (anjackson)
Sync changelog with release. #391 (anjackson)

3.4.0-20210527 (2021-05-27)

Full Changelog

Fixed bugs:

Upgrade dnsjava to cope with Azure CNAME lists #344
Spring instantiation broken for MatchesListRegexDecideRule #337

Closed issues:

Browse Bean template errors on editing Regex Pattern #378
BrowseBeans broken under Java 11 #376
Usable variables, e.g. for warcWriter template #363
Heritrix 3.3 out-of-the-box archives pages with meta noindex #351
Error Binding hostname or ip to Web UI #339
Add support for the SFTP protocol #319
java.nio.BufferUnderflowException in BdbMultipleWorkQueues.get #278
Upgrade dependencies to spring 4.x.x #254

Merged pull requests:

Update changelog. #390 (anjackson)
Update dependencies 2021 05 26 #389 (anjackson)
Bring changelog up to date #386 (anjackson)
Allow tuning of BDB-JE evictor and cleaner threads. #384 (anjackson)
Update to latest version of dnsjava, for #344 #383 (anjackson)
Avoid error when bean properties have no url available #379 (ldko)
Handle empty Optionals when browsing beans #377 (ato)
Fix misspell in comments #368 (webdev4422)
Upgrade to Spring 5.3.3 #366 (ato)
ait youtube-dl options #359 (galgeek)
Strip quotes from URL value. #352 (BitBaron)
Fixes leaky file handles #348 (adam-miller)
youtube-dl --no-playlist #341 (galgeek)
Revert "Warc convention for storing ftp responses has been to use a WARC reso…" #336 (ato)
Fixes extractor multiple regex matcher recycle #335 (adam-miller)
Warc convention for storing ftp responses has been to use a WARC reso… #334 (adam-miller)
Remove deprecated sudo setting. #333 (dengliming)
don't youtubedl receivedFromAMQP #330 (galgeek)
youtube-dl no cache dir #329 (galgeek)
best medium-ish size #327 (galgeek)
Recycle the regex Matcher after use. #317 (adam-miller)
Support for extracting URLs in sitemaps #262 (kris-sigur)

3.4.0-20200518 (2020-05-18)

Full Changelog

Closed issues:

Cannot find class [ExtractorYoutubeDL] #322
Checkpoints 'spoiled' when used to resume crawls #277

Merged pull requests:

Fix match result is always false in MatchesListRegexDecideRule #328 (morokosi)
Add real crawlStatus in the crawlReport #326 (clawia)
youtube-dl: request best medium-ish size format #325 (galgeek)
Add parsing for HTML tags (data-*) #323 (clawia)
Add support for the SFTP protocol #320 (bnfleb)

3.4.0-20200304 (2020-03-04)

Full Changelog

Fixed bugs:

exception logged when opening/saving crawler-beans.cxml via web interface editor #305
Java interface text editor error when saving crawler-beans.cxml #293
Unable to upload crawler-beans.cxml with curl #282
CookieStoreTest.testConcurrentLoad fails randomly #274

Closed issues:

Contrib project has a maven dependency with an older version of guava library. #311
BloomFilter64bitTest is slow #299
ObjectIdentityBdbManualCacheTest is slow #297
HTTPS console inaccessible via browser #279
JDK11 support: ssl errors from console #275
JDK11 support: FetchHTTPTest: ssl handshake_failure #268
JDK11 support: org.archive.util.ObjectIdentityBdbCacheTest failures #267
JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid #266
JDK11 support: tools.jar #265
JDK11 support: jaxb #264

Merged pull requests:

Use the Wayback Machine to repair a link to Oracle docs. #315 (anjackson)
Utilize the d parameter #314 (hennekey)
Exclude hbase-client's guava 12 transitive dependency #312 (ato)
Fix stream closed exception for Paged view #308 (ldko)
Fix stream closed exception by not closing output stream #306 (ato)
Replace custom Base32 encoding #304 (hennekey)
Replace constant with accessor methods #303 (hennekey)
limit ExtractorYoutubeDL heap usage #302 (nlevitt)
fix logging config #301 (nlevitt)
Use Guava instead of custom bloom filter implementation #300 (hennekey)
Speed up ObjectIdentityBdbManualCacheTest #298 (hennekey)
Set JUnit version to latest #296 (hennekey)
Disable test that connects to wwwb-dedup.us.archive.org #295 (ato)
Fix 'Method Not Allowed' on POST of config editor form #294 (ato)
Crawltrap regex timeout #290 (csrster)
Bdb frontier access #289 (csrster)
Attempt to filter out embedded images. #288 (csrster)
change trough dedup date type to varchar. #287 (nlevitt)
Add support for forced queue assignment and parallel queues #286 (adam-miller)
Warc writer chain #285 (nlevitt)
Fix jobdir PUT #283 (ato)
Upgrade BDB JE to version 7.5.11 - IMPORTANT CHANGE #281 (anjackson)
Mitigate random CookieStore.testConcurrentLoad test failures #280 (ato)
JDK11 support: upgrade to Jetty 9.4.19, Restlet 2.4.0 and drop JDK 7 support #276 (ato)
JDK11 support: remove unused class ObjectIdentityBdbCache and tests #273 (ato)
JDK11 support: upgrade maven-surefire-plugin to 2.22.2 #272 (ato)
JDK11 support: exclude tools.jar from hbase-client dependency #271 (ato)
Travis fixes #270 (ato)
JDK11 support: explicitly depend on JAXB #269 (ato)
WIP: ExtractorYoutubeDL #257 (nlevitt)
Update README and add LICENSE.txt #256 (ruebot)

3.4.0-20190418 (2019-04-18)

Full Changelog

Fixed bugs:

Invalid format exception in scanJobLog #239
Domain name lookup failures get cached forever #234
Allow failed lookups to expire, for #234. #235 (anjackson)

Closed issues:

Failed DNS requests remain enqueued #252
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" #236
Make FetchHistoryProcessor 304 handler more robust #229
ToeThread death when using HighestUriPrecedenceProvider #221
Google Drive robots.txt broken #193

Merged pull requests:

set of frontier management changes to support CrawlHQ module #253 (dvanduzer)
fix some trough dedup bugs #251 (nlevitt)
Remove suffix from warcWriter since it is no longer used. #249 (ruebot)
Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly" #248 (ato)
Upgrade httpclient to 4.5.7 and handle cookies more compliantly #246 (anjackson)
Update README.md #244 (mikeizbicki)
Handle commas more compliantly when parsing srcset #243 (ato)
Trough dedup #242 (nlevitt)
Ensure we start parsing full lines, for #239. #240 (anjackson)
Add CHANGELOG; address #233. #238 (ruebot)

3.4.0-20190207 (2019-02-07)

Full Changelog

Fixed bugs:

Add checks to guard against server sending 304 in error #230 (anjackson)

Merged pull requests:

Add synchronized statements for #221. #231 (anjackson)

3.4.0-20190205 (2019-02-05)

Full Changelog

Fixed bugs:

HTML extractor does not handle the base href correctly when it's relative #208
Heritrix3 (including pre-built binaries) Fails to Bootstrap with Java8 due to Changes in Java stdlib #176
Heritrix3 Fails to Build from Source #175
Missing OneLineSimpleLayout class file #173

Closed issues:

BdbFrontier thread safety #212
HTTP response only results in garbage bytes #206
Possibly stalled crawl #203
Where do i find the crawled information (Contents) after crawling is completed #199
-j option cannot handle spaces in directory names? #182
heritrix doesn't scrape rewrite srcset urls correctly #177
Possible race-condition when first using the WARC writers? #167
can you integration with spring boot #162
Noisy alerts about 401s without auth challenge #158
Can't see all beans in scripts #157
How to configure warcWriter with MirrorWriter? #156
Requesting inaccurate paths from js causes routing errors #155

Merged pull requests:

do not checkpoint if crawl job has not started #227 (nlevitt)
namespace scope log logger to crawl job #226 (nlevitt)
un-threadlocal the HConnection #224 (nlevitt)
reset HBaseAdmin on error #223 (nlevitt)
keep trying to start up hbase dedup forever #222 (nlevitt)
implement PredicatedDecideRule.onlyDecision() #220 (nlevitt)
use non-deprecated hbase api #219 (nlevitt)
Correct spelling mistakes. #218 (EdwardBetts)
Update API with note about checkpoint launching. #217 (anjackson)
Extend API to simplify using the latest checkpoint #215 (anjackson)
Ensure frontier work queues are updated safely across threads. #213 (anjackson)
fix exception starting DecideRuleSequence logging #210 (nlevitt)
HtmlExtractor: allow relative hrefs in the base element #209 (anjackson)
Fix link to User Guide #207 (maurice-schleussinger)
Add parameter to allow even distribution for parallel queues. #205 (adam-miller)
catch exceptions scoping outlinks to stop them from derailing process… #197 (nlevitt)
fix for test failures in a workspace on NFS-mounted filesystem #196 (kngenie)
limit max size of form input #194 (galgeek)
Enforce robots.txt character limit per char not per line #192 (ato)
Allow JavaDNS to be disabled as part of resolving outstanding build and test issues #190 (anjackson)
WARCLimitEnforcer.java - Add support for multiple warc writers. #189 (adam-miller)
treat a failed fetch (e.g. socket timeout) of robots.txt the same way… #187 (nlevitt)
reduce batch size to 400 and avoid ridiculously long log lines #186 (nlevitt)
escape strings in sql posted to trough #185 (nlevitt)
trough feed #180 (nlevitt)
Add parsing for srcset attributes #179 (BitBaron)
KafkaCrawlLogFeed had been using lots of heap because each callback i… #178 (nlevitt)
AMQP fine control #171 (anjackson)
fix for race-condition when first using the WARC writers https://gith… #168 (nlevitt)
Don't wait to receive Umbra urls if Heritrix sends no url to Umbra #166 (galgeek)
AMQP URL Waiter #165 (galgeek)
Fixes for apparent build errors (extends #154) #164 (nlevitt)
Kafka 0.9 #163 (nlevitt)
No link extraction on URI not successfully downloaded #161 (kris-sigur)
Fixes issue #158 : Noisy alerts about 401s without auth challenge #159 (kris-sigur)
Fixes for apparent build errors #154 (anjackson)
Switch to Java 7 #152 (anjackson)
Make Content-Location header url INFERRED not REFER hop type since C… #151 (vonrosen)
various changes to amqp publish and receive #150 (nlevitt)
Update to ExtractorHTML.java for cond. comments #149 (eleclerc)
Don't canonicalize source tag so that SourceSeedDecideRule will work.… #148 (vonrosen)
More fixes for multipart form submission #146 (vonrosen)
Make some urls with whitespace acceptable to JavaScript extractor. #145 (vonrosen)
run received urls through the candidates processor, to check scope an… #144 (nlevitt)
handle login forms with <input type="text"> fields in addition to use… #143 (nlevitt)
Form login multipart #142 (nlevitt)
Disable SNI for a request if that request failed due to an SNI error … #141 (vonrosen)
handle multiple clauses for same user agent in robots.txt #139 (nlevitt)
crawl level and host level limits on *novel* (not deduplicated) bytes and urls #138 (nlevitt)
SourceSeedDecideRule, SeedLimitsEnforcer #137 (nlevitt)
Register seeds sent in via AMQP #136 (anjackson)
Allow KnowledgableExtractorJS to parse out youtube watch from youtube… #135 (vonrosen)
Add maximum to number of cookies to store for domain to BdbCookieStore #133 (vonrosen)
try very hard to start url consumer, and therefore bind the queue to … #132 (nlevitt)
set isRunning=true so that stop() gets called to avoid leaking connec… #131 (nlevitt)
catch exceptions and log error in StatisticsTracker.run(), to make su… #130 (nlevitt)
load keytool utility main class dynamically, trying both the old and … #129 (nlevitt)
AMQPUrlReceiver changes to support RabbitMQ >= 3.3 #128 (anjackson)
'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing #126 (caofangkun)
Amqp declarations fix #125 (ldko)
Allow realm to be set by server for basic auth. #124 (vonrosen)
Hosts report #123 (kris-sigur)
only submit checkbox and radio button form fields if they are on by d… #122 (nlevitt)
new contrib module KnowledgableExtractorJS, a subclass of ExtractorJS th... #121 (nlevitt)
for ARI-4267 accept possible uris with two dots in the filename part if ... #120 (nlevitt)
Fix for HER-2082 #119 (adam-miller)
Fix for ServerNotModified WARC revisit records incorrectly record WARC-Payload-Digest #118 (kris-sigur)
avoid java.lang.NullPointerException at org.archive.modules.writer.Write... #117 (nlevitt)
make sure log4j is configured when running unit tests, to avoid log4j er... #116 (nlevitt)
Set character set to UTF-8 when passing through files. #115 (kris-sigur)
remove RecordingOutputStreamTest.java (moving to webarchive-commons) #114 (nlevitt)
Amqp receiver deadlock #112 (nlevitt)
somewhat ugly fix to handle exceptions from the bean browser like java.l... #111 (nlevitt)
Upgrade to HttpClient 4.3.6 #110 (kris-sigur)
so that it can appear in the crawl log, add contentSize to CrawlURI extr... #109 (nlevitt)
kafka crawl log feed #108 (nlevitt)
Handle case where form does not have an action defined. #107 (vonrosen)
seriously, fix extraInfo handling in AMQPCrawlLogFeed #106 (nlevitt)
fix extraInfo handling in AMQPCrawlLogFeed #105 (nlevitt)
change field names to match new druid config #104 (nlevitt)
CandidatesProcessor.java #103 (adam-miller)
avoid deadlock in AMQPUrlReceiver hopefully #102 (nlevitt)
Remove forcefetch for AMQP received urls so they don't get crawled twice... #101 (vonrosen)
Allow discovery of urls in content attribute of meta tags. #100 (vonrosen)
AMQPCrawlLogFeed, DecideRuleSequenceWithAMQPFeed, DecideRuleSequence.logExtraInfo #99 (nlevitt)
Fix for HER-2074 #97 (kris-sigur)
new cookie store system to address HER-2070 "cookie monster" bug #96 (nlevitt)
FIX corner-case of bean browser failing due to an exception from hashCode() #95 (kngenie)
do not require "+" (plus sign) before @OPERATOR_CONTACT_URL@ in user-age... #94 (nlevitt)
Allow urls in JavaScript between unicode quotes to be detected. #93 (vonrosen)
remove more unused classes #92 (nlevitt)
FetchHTTP.java #91 (adam-miller)
Move Wayback-dedup module to heritrix-contrib #90 (kngenie)
Don’t let exception from property getter fail entire bean-browser. #89 (kngenie)
fix bug in CrawlURI.compare() discovered by Kenji, add unit test CrawlUR... #88 (nlevitt)
Allow xml extractor to handle urls in CDATA. #87 (vonrosen)
remove unused Transform* classes #86 (nlevitt)
switch to mainline iipc webarchive-commons latest release #84 (nlevitt)
oops! count novel urls/bytes for hosts report, etc #83 (nlevitt)
Fix for HER-2071 #82 (kris-sigur)
Hbase cdh5 #81 (nlevitt)
ExtractorHTML when a/@href links include the attribute data-remote="true... #80 (nlevitt)
Revisit redux #79 (nlevitt)
treat content as html and extract links if it looks like html, even if m... #78 (nlevitt)
Force urls received from AMQP to be recrawled so custom http headers can... #77 (vonrosen)
HER-2039 remove class Link, use CrawlURI #76 (nlevitt)
in CrawlURI.createCrawlURI(), avoid clobbering inherited data with data ... #75 (nlevitt)
Fix for https://webarchive.jira.com/browse/ARI-3943 #74 (vonrosen)
Treat codebase as link hops, not embeds #73 (kris-sigur)
add A_ANNOTATIONS to persistentKeys so that CrawlURI doesn't lose its an... #72 (nlevitt)
avoid calling CheckpointService.hasAvailableCheckpoints() when crawl not... #71 (nlevitt)
for ARI-3712, add extracted links relative to both via and base, and annotate with "extractorSWFRelToVia", "extractorSWFRelToBase", or "extractorSWFRelToBoth" if resulting link is the same whether relative to base or via #70 (nlevitt)
For https://webarchive.jira.com/browse/ARI-3865 #69 (vonrosen)
handle exception determining whether to apply overlay #68 (nlevitt)
don't log severe with stack trace on normal amqp shutdown #67 (nlevitt)
oops, make "exit java process" button work again #66 (nlevitt)
shut down the starter-restarter thread at crawl finish!! #65 (nlevitt)
Via surt prefixed decide rule #64 (adam-miller)
Contrib - ExtractorPDFContent #63 (adam-miller)
Ari 3765 gracefully handle amqp server going up and down #62 (nlevitt)
HER-2065 synchronize on inactiveQueuesByPrecedence inside of synchronize... #61 (nlevitt)
Cosmetics #60 (nlevitt)
fix unit test now that we accept speculative urls with query params with... #59 (nlevitt)
for ARI-3723, accept speculative urls with query params with no value #58 (nlevitt)
AMQPUrlReceiver - improve handling of case where rabbitmq is unreachable... #57 (nlevitt)
fix FormLoginProcessor checkpointing #56 (nlevitt)
oops, update test to expect post data as url-encoded query string #54 (nlevitt)
Fix form login #53 (nlevitt)
Implicitly add the ${} around groovyExpression. When cxml contains ${}, ... #52 (nlevitt)
Expression deciderule #51 (nlevitt)
Replace deprecated routines in guava #50 (shriphani)
Youtube march 2014 #49 (nlevitt)
Umbra #48 (nlevitt)
Adjusting Youtube itag priority #47 (adam-miller)
switch dependency from ia-web-commons 1.1.1-SNAPSHOT to webarchive-commo... #46 (nlevitt)
Update youtube itags #45 (nlevitt)
update httpcomponents, should address NPE we've seen https://issues.apac... #44 (nlevitt)
fix job.log file handler was left open when jobdir is removed #43 (martinsbalodis)
Adding the queue declaration and binding to the UrlReceiver #42 (eldondev)
Fix slow cookies #41 (nlevitt)
For https://webarchive.jira.com/browse/HER-2064 #40 (vonrosen)
progress and formatting changes #39 (nlevitt)
Umbra - AMQPUrlReceiver.java receive urls via amqp and add to frontier, related changes #38 (nlevitt)
fix HER-2063 - omit port in Host request header when it is default for t... #37 (nlevitt)
Avoid the exception below by handling bad charsets in FetchHTTP. Restore... #36 (nlevitt)
whoops! send escaped path+query on http request line; had been sending r... #35 (nlevitt)
fix NullPointerException in case of 401 with no auth challenge (includes... #34 (nlevitt)
First pass at a processor to publish crawluris to AMQP channels #33 (eldondev)
Switch to BasicHttpClientConnectionManager instead of #32 (nlevitt)
make http proxy port configurable in cxml, avoiding this: org.springfram... #31 (nlevitt)
Fix bdb cookie store #30 (nlevitt)
HER-2062 Fix for WorkQueueFrontier.deleteURIs handling of retired queues #29 (kris-sigur)
switch to httpcomponents, get rid of archive-overlay-commons-httpclient #28 (nlevitt)
rename dist/README.md to dist/README.txt so that maven bundles it in the... #27 (nlevitt)

3.2.0 (2014-01-10)

Full Changelog

Merged pull requests:

update readme for 3.2.0 release #26 (nlevitt)
bump version number to 3.2.0 for release #25 (nlevitt)
for url-agnostic dedup, follow "Proposal for Standardizing the Recording... #24 (nlevitt)
fix HER-1979 so heritrix can run on windows xp #23 (nlevitt)
HER-1726: Templatize HTML #21 (adam-miller)
Her 2031 - Improve login-form submission options #20 (gojomo)
BeanLookupBindings for simpler script access to beans #19 (travisfw)
Fix for HER-2018: XML representation for /engine/job/<jobName>/beans returns incorrect url for named beans #17 (adam-miller)
Fix for HER-2017 XML representation of beans uses root node of type "script" #16 (adam-miller)
Reuse htmllinkcontext #15 (kngenie)
suppress unused warnings for serialVersionUid #14 (travisfw)
have TooManyPathSegmentsDecideRule count path segments only #13 (travisfw)
generics warnings fixes #12 (travisfw)
New reports #11 (travisfw)
ScriptedDecideRule#getEngine() rewrite for better synchronization and thread local mgmt #10 (travisfw)

3.1.1 (2012-05-02)

Full Changelog

Merged pull requests:

Publicsuffixes2 #9 (kngenie)
Ip address set decide rule #7 (travisfw)
HER-2001: Use the CodeMirror editor for crawl config and script console #6 (ato)
HER-1998 #5 (adam-miller)
sort script engines in script console #4 (travisfw)

3.0.0 (2009-12-05)

Full Changelog

* This Changelog was automatically generated by github_changelog_generator
