---
layout: post
title: S3CMD
permalink: /docs/s3cmd
redirect_from:
 - /s3cmd.md/
 - /docs/s3cmd.md/
---

While the preferred and recommended management client for AIStore is its own CLI, Amazon's s3cmd client can also be used, with certain minor limitations.

But first:

A quick example using s3cmd to operate on any bucket

AIStore is a multi-cloud, multi-backend solution: an AIS cluster can simultaneously access ais://, s3://, gs://, etc. buckets.

For background on supported Cloud and non-Cloud backends, please see Backend Providers.

However:

When we use 3rd party clients, such as s3cmd and aws, we must impose a certain limitation: buckets in question must be unambiguously resolvable by name.

The following shows the (native) ais and (Amazon's) s3cmd CLIs, which in many cases can be used interchangeably. There is a single bucket named abc, and we access it using both clients.

But again, if we want to use s3cmd (or aws, etc.), there must be a single abc bucket across all providers.

Notice that with s3cmd we must always use the s3:// prefix.

$ ais ls ais:
$ ais create ais://abc
"ais://abc" created (see https://github.com/NVIDIA/aistore/blob/main/docs/bucket.md#default-bucket-properties)

$ ais bucket props set ais://abc checksum.type=md5
Bucket props successfully updated
"checksum.type" set to: "md5" (was: "xxhash")

$ s3cmd put README.md s3://abc
upload: 'README.md' -> 's3://abc/README.md'  [1 of 1]
 10689 of 10689   100% in    0s     3.13 MB/s  done
upload: 'README.md' -> 's3://abc/README.md'  [1 of 1]
 10689 of 10689   100% in    0s     4.20 MB/s  done

$ s3cmd rm s3://abc/README.md
delete: 's3://abc/README.md'

Similarly:

$ ais ls s3:
aws://my-s3-bucket
...

$ s3cmd put README.md s3://my-s3-bucket
upload: 'README.md' -> 's3://my-s3-bucket/README.md'  [1 of 1]
 10689 of 10689   100% in    0s     3.13 MB/s  done
upload: 'README.md' -> 's3://my-s3-bucket/README.md'  [1 of 1]
 10689 of 10689   100% in    0s     4.20 MB/s  done

$ s3cmd rm s3://my-s3-bucket/README.md
delete: 's3://my-s3-bucket/README.md'
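
The requirement that buckets be unambiguously resolvable by name can be checked mechanically. The following is a minimal Python sketch, not part of AIStore or s3cmd: it takes bucket URIs in the form printed by `ais ls` and reports names that exist under more than one provider. The function name, and the treatment of `s3` and `aws` as one provider, are assumptions made for illustration.

```python
from collections import defaultdict

# Assumption for this sketch: `s3://` and `aws://` denote the same backend,
# as suggested by `ais ls s3:` printing `aws://my-s3-bucket`.
PROVIDER_ALIASES = {"s3": "aws"}


def ambiguous_bucket_names(bucket_uris):
    """Return the sorted bucket names that exist under more than one provider."""
    providers_by_name = defaultdict(set)
    for uri in bucket_uris:
        provider, sep, name = uri.partition("://")
        if not sep:
            raise ValueError(f"expected provider://bucket, got {uri!r}")
        provider = PROVIDER_ALIASES.get(provider, provider)
        providers_by_name[name.rstrip("/")].add(provider)
    return sorted(n for n, p in providers_by_name.items() if len(p) > 1)


if __name__ == "__main__":
    # A bucket listing like the one in the examples above.
    print(ambiguous_bucket_names(["ais://abc", "aws://my-s3-bucket"]))
    # The same name under two providers: s3cmd could not tell them apart.
    print(ambiguous_bucket_names(["ais://abc", "aws://abc"]))
```

An empty result means every listed bucket can be addressed through s3cmd by name alone.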


s3cmd Configuration

Run s3cmd --configure when using s3cmd for the very first time, when your AWS access credentials have changed, or when you want to change certain s3cmd defaults (also shown below).

NOTE: it is important to have the s3cmd client properly configured.

For example:

# s3cmd --configure

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
Access Key [ABCDABCDABCDABCDABCD]: EFGHEFGHEFGHEFGHEFGH
Secret Key [abcdabcdABCDabcd/abcde/abcdABCDabc/ABCDe]: efghEFGHefghEFGHe/ghEFGHe/ghEFghef/hEFGH
Default Region [us-east-2]:

Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
S3 Endpoint [s3.amazonaws.com]:

Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
if the target S3 system supports dns based buckets.
DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]:

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password:
Path to GPG program [/usr/bin/gpg]:

When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP, and can only be proxied with Python 2.7 or newer
Use HTTPS protocol [Yes]:

On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name:

New settings:
  Access Key: EFGHEFGHEFGHEFGHEFGH
  Secret Key: efghEFGHefghEFGHe/ghEFGHe/ghEFghef/hEFGH
  Default Region: us-east-2
  S3 Endpoint: s3.amazonaws.com
  DNS-style bucket+hostname:port template for accessing a bucket: %(bucket)s.s3.amazonaws.com
  Encryption password:
  Path to GPG program: /usr/bin/gpg
  Use HTTPS protocol: True
  HTTP Proxy server name:
  HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] n
Save settings? [y/N] y
Configuration saved to '/home/.s3cfg'

It is also a good idea to note the version of s3cmd you have, e.g.:

$ s3cmd --version
s3cmd version 2.0.1

Getting Started

In this section we walk through the most basic (and simplified) steps to get s3cmd working conveniently with AIStore.

1. AIS Endpoint

With the s3cmd client configuration safely stored in $HOME/.s3cfg, the next step is to figure out the AIS endpoint.

The AIS cluster must be running, of course.

The endpoint consists of a gateway's hostname and its port followed by /s3 suffix.

AIS clusters usually run multiple gateways all of which are equivalent in terms of supporting all operations and providing access (to their respective clusters).

For example: given AIS gateway at 10.10.0.1:51080 (where 51080 would be the gateway's listening port), AIS endpoint then would be 10.10.0.1:51080/s3.

NOTE the /s3 suffix. It is important to have it in all subsequent s3cmd requests to AIS, and the surest way to achieve that is to have it in the endpoint.
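
To make the endpoint format concrete, here is a small Python sketch that assembles it from a gateway's hostname and port. The helper name `ais_s3_endpoint` is hypothetical and exists only for this illustration.

```python
def ais_s3_endpoint(host, port, suffix="/s3"):
    """Build the endpoint string: gateway host, its port, then the /s3 suffix."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"{host}:{port}{suffix}"


if __name__ == "__main__":
    # The gateway address used as the example in this section.
    print(ais_s3_endpoint("10.10.0.1", 51080))
```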

2. How to have s3cmd call the AIS endpoint

But then the question is how to pass the AIS endpoint to s3cmd commands. There are essentially two ways:

  1. s3cmd command line
  2. s3cmd configuration

For command line (related) examples, see, for instance, this multipart upload test. In particular, the following settings:

s3endpoint="localhost:8080/s3"
host="--host=$s3endpoint"
host_bucket="--host-bucket=$s3endpoint/%(bucket)"

Separately, note that by default aistore handles S3 API at its AIS_ENDPOINT/s3 endpoint (e.g., localhost:8080/s3). However, any aistore cluster is configurable to accept S3 API calls at its root as well. That is, without the "/s3" suffix shown above.

Back to running s3cmd though - the second, and arguably the easiest, way is exemplified by the diff below:

# diff -uN .s3cfg.orig $HOME/.s3cfg
--- .s3cfg.orig   2022-07-18 09:42:36.502271267 -0400
+++ .s3cfg        2022-07-18 10:14:50.878813029 -0400
@@ -29,8 +29,8 @@
 gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
 gpg_passphrase =
 guess_mime_type = True
-host_base = s3.amazonaws.com
-host_bucket = %(bucket)s.s3.amazonaws.com
+host_base = 10.10.0.1:51080/s3
+host_bucket = 10.10.0.1:51080/s3
 human_readable_sizes = False
 invalidate_default_index_on_cf = False
 invalidate_default_index_root_on_cf = True

Here we hack s3cmd configuration: replace Amazon's default s3.amazonaws.com endpoint with the correct one, and be done.

From this point on, s3cmd will be calling AIStore at 10.10.0.1:51080, with /s3 suffix causing the latter to execute special handling (specifically) designed to support S3 compatibility.
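
The same edit can be scripted. Below is a minimal Python sketch that rewrites `host_base` and `host_bucket` in an s3cmd configuration file using the standard `configparser` module. It assumes the settings live in a `[default]` section, as written by `s3cmd --configure`; the function name is hypothetical. Rewriting the file this way drops any comments in it, so keeping a backup (such as the `.s3cfg.orig` in the diff above) is advisable.

```python
import configparser


def point_s3cfg_at_ais(cfg_path, endpoint):
    """Set host_base and host_bucket in an s3cmd config file to `endpoint`."""
    # interpolation=None keeps values such as %(bucket)s untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)
    # Assumption: s3cmd settings are stored under a [default] section.
    if not parser.has_section("default"):
        parser.add_section("default")
    parser.set("default", "host_base", endpoint)
    parser.set("default", "host_bucket", endpoint)
    with open(cfg_path, "w") as cfg_file:
        parser.write(cfg_file)
```

For example, `point_s3cfg_at_ais("/path/to/.s3cfg", "10.10.0.1:51080/s3")` would produce the two `+` lines shown in the diff.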

3. Alternatively

Alternatively, instead of hacking .s3cfg once and for all we could use --host and --host-bucket command-line options (of the s3cmd). For instance:

$ s3cmd put README.md s3://mmm/saved-readme.md --no-ssl --host=10.10.0.1:51080/s3 --host-bucket=10.10.0.1:51080/s3

Compare with the identical PUT example in section 5 below.

It goes without saying that, as long as .s3cfg keeps pointing to s3.amazonaws.com, the --host and --host-bucket options must be explicitly specified in every s3cmd command.
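
When the options have to be repeated on every command, a thin wrapper can help. The Python sketch below only builds the argument list; it uses the `--host`, `--host-bucket`, and `--no-ssl` options shown in this document, while the function name and its parameters are hypothetical.

```python
def s3cmd_argv(subcommand, *args, endpoint=None, use_https=True):
    """Build an s3cmd command line, optionally pointing it at an AIS endpoint."""
    argv = ["s3cmd", subcommand, *args]
    if not use_https:
        # The cluster serves plain HTTP, so turn HTTPS off in the client.
        argv.append("--no-ssl")
    if endpoint is not None:
        argv += [f"--host={endpoint}", f"--host-bucket={endpoint}"]
    return argv


if __name__ == "__main__":
    # Reproduces the PUT command above as a list of arguments.
    print(s3cmd_argv("put", "README.md", "s3://mmm/saved-readme.md",
                     endpoint="10.10.0.1:51080/s3", use_https=False))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a machine where s3cmd is installed.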

4. Note and, possibly, update AIS configuration

This next step actually depends on the AIStore configuration - the configuration of the cluster we intend to use with s3cmd client.

Specifically, there are two config knobs of interest:

# ais config cluster net.http.use_https
PROPERTY                 VALUE
net.http.use_https       false

# ais config cluster checksum.type
PROPERTY         VALUE
checksum.type    xxhash

Note that HTTPS is the s3cmd default, and so if AIStore runs on HTTP every single s3cmd command must have the --no-ssl option.

Setting net.http.use_https=true requires an AIS cluster restart. In other words, HTTPS is configurable, but for the HTTP => HTTPS change to take effect the AIS cluster must be restarted.

NOTE the --no-ssl flag, e.g.: s3cmd ls --no-ssl to list buckets.

$ s3cmd ls --host=10.10.0.1:51080/s3

If the AIS cluster in question is deployed with HTTP (the default) and not HTTPS:

$ ais config cluster net.http
PROPERTY                         VALUE
net.http.server_crt              server.crt
net.http.server_key              server.key
net.http.write_buffer_size       65536
net.http.read_buffer_size        65536
net.http.use_https               false # <<<<<<<<< (NOTE) <<<<<<<<<<<<<<<<<<
net.http.skip_verify             false
net.http.chunked_transfer        true

we need to turn HTTPS off in the s3cmd client using its --no-ssl option.

For example:

$ s3cmd ls --host=10.10.0.1:51080/s3 --no-ssl
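
The decision can also be derived from the cluster configuration itself. The following Python sketch parses the two-column PROPERTY/VALUE table printed by `ais config cluster` (in the layout shown above, which is assumed to be stable) and returns the extra s3cmd flag that is needed. Both function names are hypothetical.

```python
def parse_ais_config_table(text):
    """Parse a PROPERTY/VALUE table, as printed above, into a dict."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and properties without a value.
        if len(fields) < 2 or fields[0] == "PROPERTY":
            continue
        # Anything after the second column (e.g. a trailing comment) is ignored.
        values[fields[0]] = fields[1]
    return values


def extra_s3cmd_flags(config_text):
    """Return ["--no-ssl"] when the cluster is deployed with HTTP, else []."""
    props = parse_ais_config_table(config_text)
    return [] if props["net.http.use_https"] == "true" else ["--no-ssl"]


if __name__ == "__main__":
    sample = "PROPERTY                 VALUE\nnet.http.use_https       false\n"
    print(extra_s3cmd_flags(sample))
```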

Second, there is the other important knob mentioned above: checksum.type=xxhash (where xxhash is the AIS default).

However:

When using s3cmd with AIStore, it is strongly recommended to update the checksum to md5.

The following will update checksum type globally, on the level of the entire cluster:

# This update will cause all subsequently created buckets to use `md5`.
# But note: all existing buckets will keep using `xxhash`, as per their own - per-bucket - configuration.

$ ais config cluster checksum.type
PROPERTY         VALUE
checksum.type    xxhash

# ais config cluster checksum.type=md5
{
    "checksum.type": "md5"
}

Alternatively, and preferably, update a specific bucket's property (e.g. ais://nnn below):

$ ais bucket props set ais://nnn checksum.type=md5

Bucket props successfully updated
"checksum.type" set to: "md5" (was: "xxhash")
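
As background, and as our assumption rather than something stated in this document: S3 clients commonly compare a locally computed MD5 digest with the ETag returned for a single-part upload, which is a likely reason for the md5 recommendation. The Python sketch below shows how such a digest is computed in chunks; the function name is hypothetical.

```python
import hashlib


def md5_hexdigest(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    # read() returns b"" at end of stream, which stops the iteration.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import io
    print(md5_hexdigest(io.BytesIO(b"hello")))
```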

5. Create bucket and PUT/GET objects using s3cmd

Once the steps above are done, the rest is easy. Just start using s3cmd as described, for instance:

# Create bucket `mmm` using `s3cmd` make-bucket (`mb`) command:
$ s3cmd mb s3://mmm --no-ssl
Bucket 's3://mmm/' created

# And double-check it using AIS CLI:
$ ais ls ais:
AIS Buckets (2)
  ais://mmm
  ...

Don't forget to change the bucket's checksum to md5 (needed if and only if the default cluster-level checksum is not md5):

$ ais bucket props set ais://mmm checksum.type=md5

PUT:

$ s3cmd put README.md s3://mmm/saved-readme.md --no-ssl

GET:

$ s3cmd get s3://mmm/saved-readme.md /tmp/copied-readme.md --no-ssl
download: 's3://mmm/saved-readme.md' -> '/tmp/copied-readme.md'  [1 of 1]

And so on.

6. Multipart upload using s3cmd

In this section, we use updated .s3cfg to avoid typing much longer command lines that contain --host and --host-bucket options.

In other words, we simplify s3cmd commands using the following local configuration update:

$ diff -uN ~/.s3cfg.orig ~/.s3cfg
--- /root/.s3cfg.orig
+++ /root/.s3cfg
@@ -31,6 +31,8 @@
 guess_mime_type = True
 host_base = s3.amazonaws.com
 host_bucket = %(bucket)s.s3.amazonaws.com
+host_base = localhost:8080/s3
+host_bucket = localhost:8080/s3
 human_readable_sizes = False
 invalidate_default_index_on_cf = False
 invalidate_default_index_root_on_cf = True

NOTE: localhost:8080 (above) can be replaced with any legitimate (http or https) address of any AIS gateway. The latter may, but does not have to, be specified with the environment variable AIS_ENDPOINT.

The following further assumes that abc is an AIStore bucket, while my-s3-bucket is an S3 bucket that this AIStore cluster can access.

The cluster must be deployed with AWS credentials to list, read, and write my-s3-bucket.

# Upload 50MB aisnode executable in 5MB chunks
$ s3cmd put /go/bin/aisnode s3://abc --multipart-chunk-size-mb=5

# Notice the `ais://` prefix:
$ ais ls ais://abc
NAME      SIZE
aisnode   50.98MiB

# With Amazon clients, we must always use the s3:// prefix:
$ s3cmd ls s3://abc
2022-08-22 13:04  53452800   s3://abc/aisnode

# Confirm via `ls`:
$ ls -al /go/bin/aisnode
-rwxr-xr-x 1 root root 53452800 Aug 22 12:17 /root/gocode/bin/aisnode*

Uploading to s3://my-s3-bucket looks identical, with one notable difference: the s3:// (or aws://) prefix is used consistently, including with the ais CLI:

# Upload 50MB aisnode executable in 7MB chunks
$ s3cmd put /go/bin/aisnode s3://my-s3-bucket --multipart-chunk-size-mb=7

$ ais ls s3://my-s3-bucket
NAME      SIZE
aisnode   50.98MiB

$ s3cmd ls s3://my-s3-bucket
2022-08-22 13:04  53452800   s3://my-s3-bucket/aisnode
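
To see what the chunk-size option implies for the uploads above, the Python sketch below splits an object size into part sizes. It assumes that one "MB" in `--multipart-chunk-size-mb` means 1024 * 1024 bytes; the function name is hypothetical.

```python
MIB = 1024 * 1024  # assumption: the chunk-size option counts in units of 2**20 bytes


def multipart_part_sizes(total_bytes, chunk_size_mb):
    """Split an object of total_bytes into parts of at most chunk_size_mb each."""
    chunk = chunk_size_mb * MIB
    full_parts, remainder = divmod(total_bytes, chunk)
    return [chunk] * full_parts + ([remainder] if remainder else [])


if __name__ == "__main__":
    size = 53452800  # the aisnode executable from the listings above
    print(len(multipart_part_sizes(size, 5)))  # 11 parts with 5MB chunks
    print(len(multipart_part_sizes(size, 7)))  # 8 parts with 7MB chunks
    print(round(size / MIB, 2))                # 50.98, matching `ais ls`
```

Under this assumption, the 5MB upload consists of ten full parts plus a final part of 1024000 bytes.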

Use s3cmd multipart to show any/all ongoing uploads to s3://my-s3-bucket (or any other bucket):

$ s3cmd multipart s3://my-s3-bucket

S3 URI and Further References

Note that s3cmd expects an S3 URI, something like s3://bucket-name.

In other words, s3cmd does not recognize any prefix other than s3://.

In the examples above, the mmm and nnn buckets are, actually, AIS buckets with no remote backends.

Nevertheless, when using s3cmd we have to reference them as s3://mmm and s3://nnn, respectively.
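
The renaming is purely mechanical, as the following Python sketch illustrates: whatever the provider prefix, the bucket (and optional object name) is re-expressed with the `s3://` scheme. The helper name is hypothetical, and the result is only meaningful when the bucket name is unique across providers.

```python
def to_s3cmd_uri(bucket_uri):
    """Rewrite provider://bucket[/object] into the s3:// form that s3cmd expects."""
    provider, sep, rest = bucket_uri.partition("://")
    if not sep or not provider or not rest:
        raise ValueError(f"expected provider://bucket, got {bucket_uri!r}")
    return f"s3://{rest}"


if __name__ == "__main__":
    for uri in ("ais://mmm", "ais://nnn", "aws://my-s3-bucket/README.md"):
        print(uri, "->", to_s3cmd_uri(uri))
```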

S3CMD with AIStore Authentication (AuthN)

When Auth is enabled on AIStore, it expects a JWT token for each request. Unfortunately, using the --add-header option in s3cmd doesn't work because the header gets overwritten with the signature and signing algorithm when the actual request is made.

To overcome this, you can modify the S3.py file in s3cmd to include the JWT token directly in the Authorization header before the request is sent.

Example:

In the S3.py file (found in the S3CMD GitHub repository), add the following line before the request is sent:

self.headers["Authorization"] = "Bearer <token>"

Git Diff for reference:

$ git diff
diff --git a/S3/S3.py b/S3/S3.py
index d4cac8f..9fa1496 100644
--- a/S3/S3.py
+++ b/S3/S3.py
@@ -210,6 +210,7 @@ class S3Request(object):
         resource['uri'] = s3_quote(resource['uri'], quote_backslashes=False, unicode_output=True)
         # Get the final uri by adding the uri parameters
         resource['uri'] += format_param_str(self.params)
+        self.headers["Authorization"] = "Bearer <token>"
         return (self.method_string, resource, self.headers)

Adding this line ensures that the Authorization header contains the correct token for requests to the AIStore server.
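
Rather than hard-coding the token in the source, the patched line could call a small helper that reads it at run time. The Python sketch below is one possible variation under stated assumptions: the environment variable name `S3CMD_AIS_JWT` is made up for this example and is not recognized by s3cmd or AIStore.

```python
import os

# Hypothetical variable name, made up for this sketch.
TOKEN_ENV_VAR = "S3CMD_AIS_JWT"


def bearer_authorization(environ=None):
    """Return the Authorization header value built from the token in the environment."""
    environ = os.environ if environ is None else environ
    token = environ.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"{TOKEN_ENV_VAR} is not set or is empty")
    return f"Bearer {token}"
```

With this helper available inside S3.py, the added line would become `self.headers["Authorization"] = bearer_authorization()`.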

For a table summarizing AIS/S3 compatibility and further discussion, please see: