3.5.0 (2024-10-29)
- Removed HBase modules from contrib. #621
- ConfigurableExtractorJS: Set default value (false) for strict property. #612
- ExtractorHTML: Treat cite attribute as a navlink instead of embed. #608
- Building no longer requires the builds.archive.org repository. #614
- Updated to new URL of the restlet repository.
- Removed hbase, joda-time, log4j
- commons-io 2.14.0
- kafka-clients 3.8.0
- ftpserver-core 1.2.0
- jetty 9.4.56.v20240826
- webarchive-commons 1.1.10
3.4.0-20240909 (2024-09-09)
Checkpoints and crawl state created with older versions of Heritrix will not be loadable as kryo has been significantly updated. Replaying the recovery log may be an alternative in some cases.
- JDK 22 support
- Added ConfigurableExtractorJS for more flexible JavaScript extraction. (#602)
- Added HostnameQueueAssignmentPolicyWithLimits with optional name length limits. (#598)
- ExtractorHTML can now extract more variants of alternative resolution image URLs. (#605)
- ExtractorHTTP can now be configured with extra inferred paths (#597)
- ExtractorYoutubeDL metadata records can now be optionally logged to crawl.log (#593)
- Removed ExtractorChrome from contrib (#601)
- Reduced false positive speculative URLs from meta tags (#595)
- Fixed BdbModule resource leak on job teardown (f4280012ae5f23763f1e19d196a245ae49f9b697)
- Corrected function name in ScriptedProcessor Javadoc. (#599)
- Updated Maven builds to use HTTPS for resolving dependencies.
- Reset CrawlURI status for hasPrerequisite() so that it isn't preserved between attempts (#600)
- Fixed older junit3 tests not being run (#592)
- Increased DiskSpaceMonitor default pause threshold to 8 GiB (#499)
- Stopped logging authentication failures when header is missing (#539)
- Fixed console still showing job running after crash (#549)
- Transitioned PDFParser and ExtractorPDF to pdfbox (#575)
- Transitioned ExtractorYoutubeDL to yt-dlp
- commons-net 3.9.0
- com.rabbitmq:amqp-client 5.18.0
- dnsjava 3.6.0
- groovy 4.0.21
- kryo 5.6.0
- spring-expression 5.3.39
3.4.0-20220727 (2022-07-27)
Fixed bugs:
- ExtractorHTML matches srcset attribute case-sensitively #477
- Overcrawling due to sitemap links acting like transclusions #469
- "java.lang.NoClassDefFoundError: Could not initialize class org.archive.util.CLibrary" on Apple Silicon #467
- Heritrix crashing on malformed Content-Length header #449
- Java version check throws StringIndexOutOfBoundsException on exact major versions #439
- dnsjava NIO selector thread stuck at 100% after terminating job #425
- Do not treat all URLs from link/@href tags as embeds. #263
- BdbCookieStore not implemented iterator at RetryExec #200
- "RIS already open for ToeThread..." exception during https pages crawl over proxy #191
Closed issues:
- Heritrix not ignoring robots.txt #479
- JDK18: ExtractorMultipleRegexTest fails due to Groovy asm incompatibility #473
- Setting of maxLogFileSize in the BDBModule is ineffective #464
- Question about memory usage #462
- Build failing via maven-assembly-plugin: group id is too big #447
- Do not require DNS when using a web proxy #211
Merged pull requests:
- Bump jsch from 0.1.52 to 0.1.54 in /commons #492 (dependabot[bot])
- Bump spring-core from 5.3.19 to 5.3.20 in /commons #491 (dependabot[bot])
- Bump jsch from 0.1.52 to 0.1.54 in /modules #490 (dependabot[bot])
- Add robotsTxtOnly robots policy #489 (ato)
- Removed a potential NPE in hashCode method to CrawlURI which was fata… #488 (csrster)
- Bump gson from 2.8.6 to 2.8.9 in /contrib #486 (dependabot[bot])
- Bump spring-core from 5.3.18 to 5.3.19 in /commons #480 (dependabot[bot])
- ExtractorHTML: Fix srcset by normalizing elementContext() to lowercase #478 (ato)
- Issue211: support dns over https if local DNS is not working / available #476 (ClemensRobbenhaar)
- Bump spring-beans from 5.3.14 to 5.3.18 in /commons #475 (dependabot[bot])
- TransclusionDecideRule: Don't treat sitemap links ('M') as transclusions #470 (ato)
- Use Files.createLink() and Files.createSymbolicLink() instead of JNA #468 (ato)
- Fix name of parameter in setMaxLogFileSize #465 (ClemensRobbenhaar)
- Add conf to not allow TLDs as seeds found via redirect from other seeds #461 (kris-sigur)
- Bump spring-core from 5.3.3 to 5.3.14 in /commons #460 (dependabot[bot])
- ExtractorHTML: Determine LINK tag type by parsing REL attribute #459 (ato)
- Fix issue#191: "RIS already open for ToeThread..." exception during https pages crawl over proxy #457 (ClemensRobbenhaar)
- FetchHTTP: Handle null characters in the Content-Length header #452 (ato)
- Add Dockerfile #450 (Querela)
- Resolve gid too big #448 (ldko)
- FetchDNS: Keep dnsjava selector thread out of ToePool #444 (ato)
- Enabled configurable url-matching and extraction for sitemaps. #441 (csrster)
3.4.0-20210923 (2021-09-23)
Fixed bugs:
- ExtractorChrome exception on images as data uris #430
- Thread-safety issues with the CookieStore #427
- Cookies being sent to wrong site #259
Merged pull requests:
- Add safer cookie iteration #434 (anjackson)
- ExtractorChrome bug fixes #431 (ato)
- UI: Refactor duplicate template rendering code #424 (ato)
3.4.0-20210803 (2021-08-03)
Fixed bugs:
- Jobs can get stuck STOPPING with "Interrupt leaving unfinished CrawlURI" #420
- Groovy version is incompatible with JDK 16+ #419
- module java.base does not export sun.security.tools.keytool to unnamed module @1ece4432 #417
- Distribution package has broken filesystem permissions #413
- Add WARC-IP-Address header to WARCWriterChainProcessor #396
Merged pull requests:
- Don't extract data URIs #423 (ato)
- ToeThread: ensure currentCuri is finished before exiting #421 (ato)
- JDK 16 compatibility #418 (ato)
- ExtractorChrome: reduce request duplication between browser and frontier #416 (ato)
- Upgrade maven-assembly-plugin to 3.3.0 to fix file permissions #414 (ato)
- ExtractorChrome: Capture requests made by the browser #411 (ato)
- Warc writer stats fixes #410 (ato)
- Fix WARC-IP-Address and use a common server-ip CrawlURI attribute for all protocols #409 (ato)
- Add basic syntax highlighting to the crawl.log viewer #408 (ato)
- Fix a couple of boring maven warnings #407 (ato)
- Fix and document the -r option which runs a named job on startup #406 (ato)
- Speed up test suite #405 (ato)
- Switch from Travis CI to Github Actions #404 (ato)
- Add ExtractorChrome to contrib #403 (ato)
- Upgrade httpclient to 4.5 #397 (anjackson)
3.4.0-20210621 (2021-06-21)
Merged pull requests:
- Remove dependency on mg4j #402 (ato)
- Graceful UI shutdown #401 (kris-sigur)
- Remove unnecessary fiddling with VIA path in ExtractorRobotsTxt #400 (kris-sigur)
3.4.0-20210618 (2021-06-18)
3.4.0-20210617 (2021-06-17)
Closed issues:
- Ensure valid checkpoints can be created when recovering from errors #392
Merged pull requests:
- Annotate nested sitemap links as sitemaps #398 (kris-sigur)
- Update AMPQ client library to address security warning. #394 (anjackson)
- Only update last checkpoint stats if the checkpoint completed, for #392. #393 (anjackson)
- Sync changelog with release. #391 (anjackson)
3.4.0-20210527 (2021-05-27)
Fixed bugs:
- Upgrade dnsjava to cope with Azure CNAME lists #344
- Spring instantiation broken for MatchesListRegexDecideRule #337
Closed issues:
- Browse Bean template errors on editing Regex Pattern #378
- BrowseBeans broken under Java 11 #376
- Usable variables, e.g. for warcWriter template #363
- Heritrix 3.3 out-of-the-box archives pages with meta noindex #351
- Error Binding hostname or ip to Web UI #339
- Add support for the SFTP protocol #319
- java.nio.BufferUnderflowException in BdbMultipleWorkQueues.get #278
- Upgrade dependencies to spring 4.x.x #254
Merged pull requests:
- Update changelog. #390 (anjackson)
- Update dependencies 2021 05 26 #389 (anjackson)
- Bring changelog up to date #386 (anjackson)
- Allow tuning of BDB-JE evictor and cleaner threads. #384 (anjackson)
- Update to latest version of dnsjava, for #344 #383 (anjackson)
- Avoid error when bean properties have no url available #379 (ldko)
- Handle empty Optionals when browsing beans #377 (ato)
- Fix misspell in comments #368 (webdev4422)
- Upgrade to Spring 5.3.3 #366 (ato)
- ait youtube-dl options #359 (galgeek)
- Strip quotes from URL value. #352 (BitBaron)
- Fixes leaky file handles #348 (adam-miller)
- youtube-dl --no-playlist #341 (galgeek)
- Revert "Warc convention for storing ftp responses has been to use a WARC reso…" #336 (ato)
- Fixes extractor multiple regex matcher recycle #335 (adam-miller)
- Warc convention for storing ftp responses has been to use a WARC reso… #334 (adam-miller)
- Remove deprecated sudo setting. #333 (dengliming)
- don't youtubedl receivedFromAMQP #330 (galgeek)
- youtube-dl no cache dir #329 (galgeek)
- best medium-ish size #327 (galgeek)
- Recycle the regex Matcher after use. #317 (adam-miller)
- Support for extracting URLs in sitemaps #262 (kris-sigur)
3.4.0-20200518 (2020-05-18)
Merged pull requests:
- Fix match result is always false in MatchesListRegexDecideRule #328 (morokosi)
- Add real crawlStatus in the crawlReport #326 (clawia)
- youtube-dl: request best medium-ish size format #325 (galgeek)
- Add parsing for HTML tags (data-*) #323 (clawia)
- Add support for the SFTP protocol #320 (bnfleb)
3.4.0-20200304 (2020-03-04)
Fixed bugs:
- exception logged when opening/saving crawler-beans.cxml via web interface editor #305
- Java interface text editor error when saving crawler-beans.cxml #293
- Unable to upload crawler-beans.cxml with curl #282
- CookieStoreTest.testConcurrentLoad fails randomly #274
Closed issues:
- Contrib project has a maven dependency with an older version of guava library. #311
- BloomFilter64bitTest is slow #299
- ObjectIdentityBdbManualCacheTest is slow #297
- HTTPS console inaccessible via browser #279
- JDK11 support: ssl errors from console #275
- JDK11 support: FetchHTTPTest: ssl handshake_failure #268
- JDK11 support: org.archive.util.ObjectIdentityBdbCacheTest failures #267
- JDK11 support: ClassNotFoundException: javax.transaction.xa.Xid #266
- JDK11 support: tools.jar #265
- JDK11 support: jaxb #264
Merged pull requests:
- Use the Wayback Machine to repair a link to Oracle docs. #315 (anjackson)
- Utilize the d parameter #314 (hennekey)
- Exclude hbase-client's guava 12 transitive dependency #312 (ato)
- Fix stream closed exception for Paged view #308 (ldko)
- Fix stream closed exception by not closing output stream #306 (ato)
- Replace custom Base32 encoding #304 (hennekey)
- Replace constant with accessor methods #303 (hennekey)
- limit ExtractorYoutubeDL heap usage #302 (nlevitt)
- fix logging config #301 (nlevitt)
- Use Guava instead of custom bloom filter implementation #300 (hennekey)
- Speed up ObjectIdentityBdbManualCacheTest #298 (hennekey)
- Set JUnit version to latest #296 (hennekey)
- Disable test that connects to wwwb-dedup.us.archive.org #295 (ato)
- Fix 'Method Not Allowed' on POST of config editor form #294 (ato)
- Crawltrap regex timeout #290 (csrster)
- Bdb frontier access #289 (csrster)
- Attempt to filter out embedded images. #288 (csrster)
- change trough dedup date type to varchar. #287 (nlevitt)
- Add support for forced queue assignment and parallel queues #286 (adam-miller)
- Warc writer chain #285 (nlevitt)
- Fix jobdir PUT #283 (ato)
- Upgrade BDB JE to version 7.5.11 - IMPORTANT CHANGE #281 (anjackson)
- Mitigate random CookieStore.testConcurrentLoad test failures #280 (ato)
- JDK11 support: upgrade to Jetty 9.4.19, Restlet 2.4.0 and drop JDK 7 support #276 (ato)
- JDK11 support: remove unused class ObjectIdentityBdbCache and tests #273 (ato)
- JDK11 support: upgrade maven-surefire-plugin to 2.22.2 #272 (ato)
- JDK11 support: exclude tools.jar from hbase-client dependency #271 (ato)
- Travis fixes #270 (ato)
- JDK11 support: explicitly depend on JAXB #269 (ato)
- WIP: ExtractorYoutubeDL #257 (nlevitt)
- Update README and add LICENSE.txt #256 (ruebot)
3.4.0-20190418 (2019-04-18)
Fixed bugs:
- Invalid format exception in scanJobLog #239
- Domain name lookup failures get cached forever #234
- Allow failed lookups to expire, for #234. #235 (anjackson)
Closed issues:
- Failed DNS requests remain enqueued #252
- SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" #236
- Make FetchHistoryProcessor 304 handler more robust #229
- ToeThread death when using HighestUriPrecedenceProvider #221
- Google Drive robots.txt broken #193
Merged pull requests:
- set of frontier management changes to support CrawlHQ module #253 (dvanduzer)
- fix some trough dedup bugs #251 (nlevitt)
- Remove suffix from warcWriter since it is no longer used. #249 (ruebot)
- Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly" #248 (ato)
- Upgrade httpclient to 4.5.7 and handle cookies more compliantly #246 (anjackson)
- Update README.md #244 (mikeizbicki)
- Handle commas more compliantly when parsing srcset #243 (ato)
- Trough dedup #242 (nlevitt)
- Ensure we start parsing full lines, for #239. #240 (anjackson)
- Add CHANGELOG; address #233. #238 (ruebot)
3.4.0-20190207 (2019-02-07)
3.4.0-20190205 (2019-02-05)
Fixed bugs:
- HTML extractor does not handle the base href correctly when it's relative #208
- Heritrix3 (including pre-built binaries) Fails to Bootstrap with Java8 due to Changes in Java stdlib #176
- Heritrix3 Fails to Build from Source #175
- Missing OneLineSimpleLayout class file #173
Closed issues:
- BdbFrontier thread safety #212
- HTTP response only results in garbage bytes #206
- Possibly stalled crawl #203
- Where do i find the crawled information (Contents) after crawling is completed #199
- -j option cannot handle spaces in directory names? #182
- heritrix doesn't scrape rewrite srcset urls correctly #177
- Possible race-condition when first using the WARC writers? #167
- can you integration with spring boot #162
- Noisy alerts about 401s without auth challenge #158
- Can't see all beans in scripts #157
- How to configure warcWriter with MirrorWriter? #156
- Requesting inaccurate paths from js causes routing errors #155
Merged pull requests:
- do not checkpoint if crawl job has not started #227 (nlevitt)
- namespace scope log logger to crawl job #226 (nlevitt)
- un-threadlocal the HConnection #224 (nlevitt)
- reset HBaseAdmin on error #223 (nlevitt)
- keep trying to start up hbase dedup forever #222 (nlevitt)
- implement PredicatedDecideRule.onlyDecision() #220 (nlevitt)
- use non-deprecated hbase api #219 (nlevitt)
- Correct spelling mistakes. #218 (EdwardBetts)
- Update API with note about checkpoint launching. #217 (anjackson)
- Extend API to simplify using the latest checkpoint #215 (anjackson)
- Ensure frontier work queues are updated safely across threads. #213 (anjackson)
- fix exception starting DecideRuleSequence logging #210 (nlevitt)
- HtmlExtractor: allow relative hrefs in the base element #209 (anjackson)
- Fix link to User Guide #207 (maurice-schleussinger)
- Add parameter to allow even distribution for parallel queues. #205 (adam-miller)
- catch exceptions scoping outlinks to stop them from derailing process… #197 (nlevitt)
- fix for test failures in a workspace on NFS-mounted filesystem #196 (kngenie)
- limit max size of form input #194 (galgeek)
- Enforce robots.txt character limit per char not per line #192 (ato)
- Allow JavaDNS to be disabled as part of resolving outstanding build and test issues #190 (anjackson)
- WARCLimitEnforcer.java - Add support for multiple warc writers. #189 (adam-miller)
- treat a failed fetch (e.g. socket timeout) of robots.txt the same way… #187 (nlevitt)
- reduce batch size to 400 and avoid ridiculously long log lines #186 (nlevitt)
- escape strings in sql posted to trough #185 (nlevitt)
- trough feed #180 (nlevitt)
- Add parsing for srcset attributes #179 (BitBaron)
- KafkaCrawlLogFeed had been using lots of heap because each callback i… #178 (nlevitt)
- AMQP fine control #171 (anjackson)
- fix for race-condition when first using the WARC writers https://gith… #168 (nlevitt)
- Don't wait to receive Umbra urls if Heritrix sends no url to Umbra #166 (galgeek)
- AMQP URL Waiter #165 (galgeek)
- Fixes for apparent build errors (extends #154) #164 (nlevitt)
- Kafka 0.9 #163 (nlevitt)
- No link extraction on URI not successfully downloaded #161 (kris-sigur)
- Fixes issue #158 : Noisy alerts about 401s without auth challenge #159 (kris-sigur)
- Fixes for apparent build errors #154 (anjackson)
- Switch to Java 7 #152 (anjackson)
- Make Content-Location header url INFERRED not REFFER hop type since C… #151 (vonrosen)
- various changes to amqp publish and receive #150 (nlevitt)
- Update to ExtractorHTML.java for cond. comments #149 (eleclerc)
- Don't canonicalize source tag so that SourceSeedDecideRule will work.… #148 (vonrosen)
- More fixes for multipart form submission #146 (vonrosen)
- Make some urls with whitespace acceptable to JavaScript extractor. #145 (vonrosen)
- run received urls through the candidates processor, to check scope an… #144 (nlevitt)
- handle login forms with <input type="text"> fields in addition to use… #143 (nlevitt)
- Form login multipart #142 (nlevitt)
- Disable SNI for a request if that request failed due to an SNI error … #141 (vonrosen)
- handle multiple clauses for same user agent in robots.txt #139 (nlevitt)
- crawl level and host level limits on *novel* (not deduplicated) bytes and urls #138 (nlevitt)
- SourceSeedDecideRule, SeedLimitsEnforcer #137 (nlevitt)
- Register seeds send in via AMQP #136 (anjackson)
- Allow KnowledgableExtractorJS to parse out youtube watch from youtube… #135 (vonrosen)
- Add maximum to number of cookies to store for domain to BdbCookieStore #133 (vonrosen)
- try very hard to start url consumer, and therefore bind the queue to … #132 (nlevitt)
- set isRunning=true so that stop() gets called to avoid leaking connec… #131 (nlevitt)
- catch exceptions and log error in StatisticsTracker.run(), to make su… #130 (nlevitt)
- load keytool utility main class dynamically, trying both the old and … #129 (nlevitt)
- AMQPUrlReceiver changes to support RabbitMQ >= 3.3 #128 (anjackson)
- 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing #126 (caofangkun)
- Amqp declarations fix #125 (ldko)
- Allow realm to be set by server for basic auth. #124 (vonrosen)
- Hosts report #123 (kris-sigur)
- only submit checkbox and radio button form fields if they are on by d… #122 (nlevitt)
- new contrib module KnowledgableExtractorJS, a subclass of ExtractorJS th... #121 (nlevitt)
- for ARI-4267 accept possible uris with two dots in the filename part if ... #120 (nlevitt)
- Fix for HER-2082 #119 (adam-miller)
- Fix for ServerNotModified WARC revisit records incorrectly record WARC-Payload-Digest #118 (kris-sigur)
- avoid java.lang.NullPointerException at org.archive.modules.writer.Write... #117 (nlevitt)
- make sure log4j is configured when running unit tests, to avoid log4j er... #116 (nlevitt)
- Set character set to UTF-8 when passing through files. #115 (kris-sigur)
- remove RecordingOutputStreamTest.java (moving to webarchive-commons) #114 (nlevitt)
- Amqp receiver deadlock #112 (nlevitt)
- somewhat ugly fix to handle exceptions from the bean browser like java.l... #111 (nlevitt)
- Upgrade to HttpClient 4.3.6 #110 (kris-sigur)
- so that it can appear in the crawl log, add contentSize to CrawlURI extr... #109 (nlevitt)
- kafka crawl log feed #108 (nlevitt)
- Handle case where form does not have an action defined. #107 (vonrosen)
- seriously, fix extraInfo handling in AMQPCrawlLogFeed #106 (nlevitt)
- fix extraInfo handling in AMQPCrawlLogFeed #105 (nlevitt)
- change field names to match new druid config #104 (nlevitt)
- CandidatesProcessor.java #103 (adam-miller)
- avoid deadlock in AMQPUrlReceiver hopefully #102 (nlevitt)
- Remove forcefetch for AMQP received urls so they don't get crawled twice... #101 (vonrosen)
- Allow discovery of urls in content attribute of meta tags. #100 (vonrosen)
- AMQPCrawlLogFeed, DecideRuleSequenceWithAMQPFeed, DecideRuleSequence.logExtraInfo #99 (nlevitt)
- Fix for HER-2074 #97 (kris-sigur)
- new cookie store system to address HER-2070 "cookie monster" bug #96 (nlevitt)
- FIX corner-case of bean browser failing due to an exception from hashCode() #95 (kngenie)
- do not require "+" (plus sign) before @OPERATOR_CONTACT_URL@ in user-age... #94 (nlevitt)
- Allow urls in JavaScript between unicode quotes to be detected. #93 (vonrosen)
- remove more unused classes #92 (nlevitt)
- FetchHTTP.java #91 (adam-miller)
- Move Wayback-dedup module to heritrix-contrib #90 (kngenie)
- Don’t let exception from property getter fail entire bean-browser. #89 (kngenie)
- fix bug in CrawlURI.compare() discovered by Kenji, add unit test CrawlUR... #88 (nlevitt)
- Allow xml extractor to handle urls in CDATA. #87 (vonrosen)
- remove unused Transform* classes #86 (nlevitt)
- switch to mainline iipc webarchive-commons latest release #84 (nlevitt)
- oops! count novel urls/bytes for hosts report, etc #83 (nlevitt)
- Fix for HER-2071 #82 (kris-sigur)
- Hbase cdh5 #81 (nlevitt)
- ExtractorHTML when a/@href links include the attribute data-remote="true... #80 (nlevitt)
- Revisit redux #79 (nlevitt)
- treat content as html and extract links if it looks like html, even if m... #78 (nlevitt)
- Force urls received from AMQP to be recrawled so custom http headers can... #77 (vonrosen)
- HER-2039 remove class Link, use CrawlURI #76 (nlevitt)
- in CrawlURI.createCrawlURI(), avoid clobbering inherited data with data ... #75 (nlevitt)
- Fix for https://webarchive.jira.com/browse/ARI-3943 #74 (vonrosen)
- Treat codebase as link hops, not embeds #73 (kris-sigur)
- add A_ANNOTATIONS to persistentKeys so that CrawlURI doesn't lose its an... #72 (nlevitt)
- avoid calling CheckpointService.hasAvailableCheckpoints() when crawl not... #71 (nlevitt)
- for ARI-3712, add extracted links relative to both via and base, and annotate with "extractorSWFRelToVia", "extractorSWFRelToBase", or "extractorSWFRelToBoth" if resulting link is the same whether relative to base or via #70 (nlevitt)
- For https://webarchive.jira.com/browse/ARI-3865 #69 (vonrosen)
- handle exception determining whether to apply overlay #68 (nlevitt)
- don't log severe with stack trace on normal amqp shutdown #67 (nlevitt)
- oops, make "exit java process" button work again #66 (nlevitt)
- shut down the starter-restarter thread at crawl finish!! #65 (nlevitt)
- Via surt prefixed decide rule #64 (adam-miller)
- Contrib - ExtractorPDFContent #63 (adam-miller)
- Ari 3765 gracefully handle amqp server going up and down #62 (nlevitt)
- HER-2065 synchronize on inactiveQueuesByPrecedence inside of synchronize... #61 (nlevitt)
- Cosmetics #60 (nlevitt)
- fix unit test now that we accept speculative urls with query params with... #59 (nlevitt)
- for ARI-3723, accept speculative urls with query params with no value #58 (nlevitt)
- AMQPUrlReceiver - improve handling of case where rabbitmq is unreachable... #57 (nlevitt)
- fix FormLoginProcessor checkpointing #56 (nlevitt)
- oops, update test to expect post data as url-encoded query string #54 (nlevitt)
- Fix form login #53 (nlevitt)
- Implicitly add the ${} around groovyExpression. When cxml contains ${}, ... #52 (nlevitt)
- Expression deciderule #51 (nlevitt)
- Replace deprecated routines in guava #50 (shriphani)
- Youtube march 2014 #49 (nlevitt)
- Umbra #48 (nlevitt)
- Adjusting Youtube itag priority #47 (adam-miller)
- switch dependency from ia-web-commons 1.1.1-SNAPSHOT to webarchive-commo... #46 (nlevitt)
- Update youtube itags #45 (nlevitt)
- update httpcomponents, should address NPE we've seen https://issues.apac... #44 (nlevitt)
- fix job.log file handler was left open when jobdir is removed #43 (martinsbalodis)
- Adding the queue declaration and binding to the UrlReceiver #42 (eldondev)
- Fix slow cookies #41 (nlevitt)
- For https://webarchive.jira.com/browse/HER-2064 #40 (vonrosen)
- progress and formatting changes #39 (nlevitt)
- Umbra - AMQPUrlReceiver.java receive urls via amqp and add to frontier, related changes #38 (nlevitt)
- fix HER-2063 - omit port in Host request header when it is default for t... #37 (nlevitt)
- Avoid the exception below by handling bad charsets in FetchHTTP. Restore... #36 (nlevitt)
- whoops! send escaped path+query on http request line; had been sending r... #35 (nlevitt)
- fix NullPointerException in case of 401 with no auth challenge (includes... #34 (nlevitt)
- First pass at a processor to publish crawluris to AMQP channels #33 (eldondev)
- Switch to BasicHttpClientConnectionManager instead of #32 (nlevitt)
- make http proxy port configurable in cxml, avoiding this: org.springfram... #31 (nlevitt)
- Fix bdb cookie store #30 (nlevitt)
- HER-2062 Fix for WorkQueueFrontier.deleteURIs handling of retired queues #29 (kris-sigur)
- switch to httpcomponents, get rid of archive-overlay-commons-httpclient #28 (nlevitt)
- rename dist/README.md to dist/README.txt so that maven bundles it in the... #27 (nlevitt)
3.2.0 (2014-01-10)
Merged pull requests:
- update readme for 3.2.0 release #26 (nlevitt)
- bump version number to 3.2.0 for release #25 (nlevitt)
- for url-agnostic dedup, follow "Proposal for Standardizing the Recording... #24 (nlevitt)
- fix HER-1979 so heritrix can run on windows xp #23 (nlevitt)
- HER-1726: Templatize HTML #21 (adam-miller)
- Her 2031 - Improve login-form submission options #20 (gojomo)
- BeanLookupBindings for simpler script access to beans #19 (travisfw)
- Fix for HER-2018: XML representation for /engine/job/<jobName>/beans returns incorrect url for named beans #17 (adam-miller)
- Fix for HER-2017 XML representation of beans uses root node of type "script" #16 (adam-miller)
- Reuse htmllinkcontext #15 (kngenie)
- suppress unused warnings for serialVersionUid #14 (travisfw)
- have TooManyPathSegmentsDecideRule count path segments only #13 (travisfw)
- generics warnings fixes #12 (travisfw)
- New reports #11 (travisfw)
- ScriptedDecideRule#getEngine() rewrite for better synchronization and thread local mgmt #10 (travisfw)
3.1.1 (2012-05-02)
Merged pull requests:
- Publicsuffixes2 #9 (kngenie)
- Ip address set decide rule #7 (travisfw)
- HER-2001: Use the CodeMirror editor for crawl config and script console #6 (ato)
- HER-1998 #5 (adam-miller)
- sort script engines in script console #4 (travisfw)
3.0.0 (2009-12-05)
* This Changelog was automatically generated by github_changelog_generator