Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] EC2 AMIs do not start user-data provisioning or amazon-init.service #137

Open
hjkatz opened this issue May 24, 2024 · 29 comments
Open

Comments

@hjkatz
Copy link

hjkatz commented May 24, 2024

While launching 23.11 AMIs in the us-west-2 region we observe that not every instance launched starts the amazon-init.service nor seems to set the key pair properly for the root user.

We are presented with the following:

Warning: Permanently added '35.95.86.153' (ED25519) to the list of known hosts.
([email protected]) Password:

This happens ~70% of the time, other times the instance comes up just fine.

AMI ID: ami-0f1471461676aaa00

This also means that we see no logs in the boot logs in the AWS console. Which makes debugging very difficult.

Another note: Rebooting the instance a few times seems to get past the problem.

Observed Heuristic: If the ssh keys are not set up by like ~30s after boot, then the instance never comes up properly at all.

@arianvp
Copy link
Member

arianvp commented May 24, 2024

first of all could you check if the same problem occurs on 24.05beta? https://nixos.github.io/amis/

23.11 will be EOL in a few days

This also means that we see no logs in the boot logs in the AWS console. Which makes debugging very difficult.

This part I don't understand. amazon-init.service has nothing to do with boot logs in the AWS console. Those are collected directly from the serial console.

However in my experience those are quite eventually consistent. It can takes minutes before they populate (Even on non-NixOS)

A better way is to directly connect to the serial console instead in the AWS Console or through the aws cli

@arianvp
Copy link
Member

arianvp commented May 24, 2024

Note that the ssh key is set up by

https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/ec2-data.nix#L21 not amazon-init.service

Both are ordered after but have no strict requirement https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/amazon-image.nix#L72

So I think what might be happening is that https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/ec2-metadata-fetcher.sh is failing for some reason.

@arianvp
Copy link
Member

arianvp commented May 24, 2024

the output of the ec2-metadata-fetcher.sh is posted to the serial console, so it should be available when connecting to the serial console in EC2 at least.

@arianvp
Copy link
Member

arianvp commented May 24, 2024

To get debug logs programatically (Only works for Nitro instances!) this command can be used:

aws ec2 get-console-output --region us-west-2 --latest --instance-id $id

Output of that command on a failing instance would be very useful.

In the meantime I'm gonna spawn 10 instances and see if I can reproduce

aws ec2 run-instances --region us-west-2 --key-name [email protected] --image-id  ami-0f1471461676aaa00 --instance-type t3.nano --security-groups launch-wizard-1  --count 10 

@hjkatz
Copy link
Author

hjkatz commented May 24, 2024

first of all could you check if the same problem occurs on 24.05beta? nixos.github.io/amis

23.11 will be EOL in a few days

This also means that we see no logs in the boot logs in the AWS console. Which makes debugging very difficult.

This part I don't understand. amazon-init.service has nothing to do with boot logs in the AWS console. Those are collected directly from the serial console.

However in my experience those are quite eventually consistent. It can takes minutes before they populate (Even on non-NixOS)

A better way is to directly connect to the serial console instead in the AWS Console or through the aws cli

Understood, thank you for explaining.

Yes, I'll give it a try on the new 24.05 beta.

FWIW we did do a test where user-data was this script:

#!/usr/bin/env bash
echo hello

on 23.11 and saw the same issue.

I'll try this out on the newer AMI version 24.05 and see how it behaves.

@arianvp
Copy link
Member

arianvp commented May 24, 2024

I spawned 10 instances in a row and non of this reproduce what you're reporting. Can you provide more info about your deployment?

The commands I used
#!/bin/sh
id=$(aws ec2 run-instances --region us-west-2 --key-name [email protected] --image-id  ami-0f1471461676aaa00 --instance-type t3.nano --security-groups launch-wizard-1 --query 'Instances[0].InstanceId' --output text)

ip=$(aws ec2 describe-instances --region us-west-2 --instance-ids $id --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)

sleep 30
ssh root@$ip

aws ec2 get-console-output --region us-west-2 --latest --instance-id $id

aws ec2 terminate-instances --region us-west-2 --instance-ids $id

Did you set a NixOS config in user-data? And What instance type are you using?

@hjkatz
Copy link
Author

hjkatz commented May 24, 2024

@arianvp I was able to reproduce with a minimal user-data just echo hello

We're using instance type: c6id.8xlarge in region us-west-2

@arianvp
Copy link
Member

arianvp commented May 24, 2024

Also with that user data I'm not able to reproduce on a t3.nano instance.

id=$(aws ec2 run-instances --region us-west-2 --user-data file://user-data.sh --key-name [email protected] --image-id  ami-0f1471461676aaa00 --instance-type t3.nano --security-groups launch-wizard-1 --query 'Instances[0].InstanceId' --output text)

Last thing I can think of. Did you maybe disable IMDSv2 on your instance and are using the legacy IMDSv1?

Otherwise it must be something with that specific instance type. Which is kinda odd.

@arianvp
Copy link
Member

arianvp commented May 24, 2024

So yeh please provide the output of the following command

aws ec2 get-console-output --region us-west-2 --latest --instance-id $id

On a machine that didn't get the ssh key set up properly. Shortly after booting up the instance .

And share it here.

It should give us a clue what is failing.

@hjkatz
Copy link
Author

hjkatz commented May 24, 2024

@arianvp Here you go!

No. We're requiring IMDSv2 via the launch template.

Output: https://gist.github.com/hjkatz/d454f2ccb0f86e39217b9293d69dadf8

Command: aws ec2 get-console-output --profile ngrok-infra --region us-west-2 --latest --instance-id i-04e2b3d87804e3a73 | jq '.Output' -r

@arianvp
Copy link
Member

arianvp commented May 24, 2024

Are you sure this is the log of a server that failed to set up?

Because all three services succeeded in that run:

[ OK ] Finished fetch-ec2-metadata.service. (Fetches the IMDS metadata)
[ OK ] Finished Apply EC2 Data. (apply-ec2-data.service which applies ssh keys)
[ OK ] Finished Reconfigure the system from EC2 userdata on startup. (amazon-init.service user data run)

The only thing fishy in those logs is that it took all 3 retries to fetch the IMDSv2 Token. Though it did succeed on the last try.

[   15.741638] fetch-ec2-metadata-start[1399]: (attempt 1/3) getting an EC2 instance metadata service v2 token...
[   15.814428] fetch-ec2-metadata-start[1402]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0 ms: Couldn't connect to server
[   16.820736] fetch-ec2-metadata-start[1399]: (attempt 2/3) getting an EC2 instance metadata service v2 token...
[   16.824142] fetch-ec2-metadata-start[1405]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0 ms: Couldn't connect to server
[   17.825681] fetch-ec2-metadata-start[1399]: (attempt 3/3) getting an EC2 instance metadata service v2 token...

I'm wondering if the amount of retries for some reason is not sufficient on your instance type and we need to bump up the amount of retries. And that in 70% of the cases this is failing. However it's super weird that the IMDS is so slow to start up.

E.g. on my t3 instance the IMDS call succeeds immediately:

[   20.686667] fetch-ec2-metadata-start[1003]: (attempt 1/3) getting an EC2 instance metadata service v2 token...
[   21.219427] fetch-ec2-metadata-start[1003]: (attempt 1/10) validating the EC2 instance metadata service v2 token...
[   21.225272] fetch-ec2-metadata-start[1003]: getting EC2 instance metadata...

@arianvp
Copy link
Member

arianvp commented May 24, 2024

IMDS not being available 16 seconds into the boot sounds like the kind of thing I'd open an AWS Support Ticket for to be honest. This is extremely puzzling if this is indeed what is failing for you.

@hjkatz
Copy link
Author

hjkatz commented May 24, 2024

Agreed that IMDS not being available immediately is odd and a huge smell.

Actually after pulling those logs I found out that this server did setup successfully and was launched using my echo hello user-data. So I'm trying again.

Here are the new interesting logs: https://gist.github.com/hjkatz/daa4935f532f8947e0ce48892e3955b9

...
[   28.716698] fetch-ec2-metadata-start[1406]: (attempt 9/10) validating the EC2 instance metadata service v2 token...
[   28.720560] fetch-ec2-metadata-start[1506]: curl: (22) The requested URL returned error: 401
[   29.721970] fetch-ec2-metadata-start[1406]: (attempt 10/10) validating the EC2 instance metadata service v2 token...
[   29.725800] fetch-ec2-metadata-start[1513]: curl: (22) The requested URL returned error: 401
[   30.728080] fetch-ec2-metadata-start[1406]: getting EC2 instance metadata...
[   30.731982] fetch-ec2-metadata-start[1515]: curl: (22) The requested URL returned error: 401
[   30.736337] fetch-ec2-metadata-start[1517]: curl: (22) The requested URL returned error: 401
[   30.740249] fetch-ec2-metadata-start[1518]: curl: (22) The requested URL returned error: 401
[   30.743971] fetch-ec2-metadata-start[1519]: curl: (22) The requested URL returned error: 401
[  OK  ] Finished fetch-ec2-metadata.service.
         Starting Apply EC2 Data...
[  OK  ] Finished Apply EC2 Data.
         Starting SSH Daemon...
[  OK  ] Started SSH Daemon.
         Starting Print SSH Host Key...
-----BEGIN SSH HOST KEY FINGERPRINTS-----
256 SHA256:J5vpSDmh8fXONHxYT7dfKOrWC28zWGglO6RjavRsFRU [email protected] (ED25519)
4096 SHA256:pRyFHB2EoKabHysZKNptZNSyy+B+zkmfXpITaBC1QWU [email protected] (RSA)
-----END SSH HOST KEY FINGERPRINTS-----
[  OK  ] Finished Print SSH Host Key.
[  OK  ] Reached target Multi-User System.
         Starting Reconfigure the system from EC2 userdata on startup...
[  OK  ] Finished Reconfigure the system from EC2 userdata on startup.

It does appear to be an IMDS issue. I'll try to follow up with AWS support too.

Though I don't think that NixOS should continue booting through this failure. It says it put the keys there, but in reality we still get a password prompt for root.

Also just for my case why not add a backoff + retry longer? I don't see a huge downside personally.

BTW those keys' fingerprints as loaded/printed do not match the key pair we've uploaded and use for our servers.

@arianvp
Copy link
Member

arianvp commented May 24, 2024

Also just for my case why not add a backoff + retry longer? I don't see a huge downside personally.

Yep that makes sense. The IMDS fetcher is defined here
https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/virtualisation/ec2-metadata-fetcher.sh

So you'll need to modify that and then build and upload a new AMI yourself.

(Side-note; I wonder if the script can be simplified if we add a --retry-connrefused to the curl call instead of doing a while loop)

BTW those keys' fingerprints as loaded/printed do not match the key pair we've uploaded and use for our servers.

Those are the host key fingerprints. Not the fingerprints of the client keys you upload. The host key is what you use to authenticate against the server (not the other way around) and they're generated fresh on startup. The idea is that you can retrieve the fingerprint from the console output and compare them with the fingerprint that is printed the first time you connect to the instance

e.g. the

$ ssh [email protected]
The authenticity of host '34.222.101.176 (34.222.101.176)' can't be established.
ED25519 key fingerprint is SHA256:/cDtKeoOXMDi7CdtH6H1OnhvJCTJa7OgjAFGPRaY1xg.

prompt.

@hjkatz
Copy link
Author

hjkatz commented May 24, 2024

Beautiful, thank you for the information and pointers to the script. I'll find time this weekend to open a PR.

@arianvp
Copy link
Member

arianvp commented May 24, 2024

Though I don't think that NixOS should continue booting through this failure. It says it put the keys there, but in reality we still get a password prompt for root.

Yeh I agree that fetch-ec2-metadata.service should fail if it cant reach the IMDS. And the dependent services ideally fail too

@arianvp
Copy link
Member

arianvp commented May 24, 2024

Hey @mswilson long shot. Does having IMDS not being reachable up to 20 seconds into boot for c6id instances ring a bell to you at all?

@mswilson
Copy link

mswilson commented May 24, 2024 via email

@hjkatz
Copy link
Author

hjkatz commented May 24, 2024

cc @arianvp, the PR: NixOS/nixpkgs#314427

I would love any advice for building the AMI to test this. This is my first PR for the nixpkgs ecosystem and I do not run a NixOS machine personally.

@arianvp
Copy link
Member

arianvp commented May 25, 2024

Are you sure that networking is gettingfully initialized?

I am pretty sure the network is properly initialized before we attempt to reach IMDS:

[  OK  ] Started DHCP Client.
[  OK  ] Reached target Network is Online.

is printed before the imds fetcher script starts which means we successfully got a DHCP lease.

I'd love to debug this more but the instance type is really expensive.
I'll ask the NixOS infra team if I can use their AWS Credits to debug on this instance type.
Otherwise a donation to https://github.com/sponsors/arianvp would be very welcome to cover the costs :')

I would love any advice for building the AMI to test this. This is my first PR for the nixpkgs ecosystem and I do not run a NixOS machine personally.

Could you join https://app.element.io/#/room/#aws:nixos.org ? I think Matrix is a better place to help you out with this.
I also made a PR to add some docs on this: https://github.com/NixOS/nixpkgs/pull/314523/files

@arianvp
Copy link
Member

arianvp commented May 25, 2024

I tried to reproduce on your instance type and I can't. IMDS is always instantly responds. and I am always able to SSH.

I spawned 15 instances of type c6id.8xlarge and non of them had the issue you're describing.

All of them retrieved the IMDS data on first try like follows:

[   24.148384] fetch-ec2-metadata-start[1466]: (attempt 1/3) getting an EC2 instance metadata service v2 token...
[   24.850824] fetch-ec2-metadata-start[1466]: (attempt 1/10) validating the EC2 instance metadata service v2 token...
[   24.854566] fetch-ec2-metadata-start[1466]: getting EC2 instance metadata...

At this point I really don't know what is going on in your environment. But I can not reproduce any of the behaviour you're reporting I'm afraid.

I suggest contacting AWS support and I am going to close this issue as "can not reproduce".

The script that I used:

#!/usr/bin/env bash

set -x

instancetype=c6id.8xlarge
export AWS_REGION=us-west-2

id=$(aws ec2 run-instances --user-data file://user-data.sh --key-name [email protected] --image-id  ami-0f1471461676aaa00 --instance-type "$instancetype" --security-groups launch-wizard-1 --query 'Instances[0].InstanceId' --output text)

terminate_instance() {
  aws ec2 terminate-instances --instance-ids "$id"
  sleep 10
  aws ec2 get-console-output --latest --instance-id "$id" --query 'Output' --output text | tee /dev/stderr > "logs/$id"
}

trap terminate_instance EXIT

ip=$(aws ec2 describe-instances --instance-ids "$id" --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)

aws ec2 wait instance-running --instance-id "$id"

until ssh -oStrictHostKeyChecking=accept-new "root@$ip" echo hello; do sleep 5; done

@arianvp arianvp closed this as completed May 25, 2024
@arianvp
Copy link
Member

arianvp commented May 25, 2024

(Feel free to reopen if you have more info)

@arianvp arianvp reopened this May 28, 2024
@arianvp
Copy link
Member

arianvp commented May 28, 2024

I have successfully been able to reproduce your issue.

It reproduces as soon as you launch an instance in an ipv6 dual-stack subnet.

The reason why it's happening:

dhcpcd reports ready as soon as it receives any lease (ipv6 or ipv4). The instance receives the ipv6 lease first
This causes network-online.target to become ready which in turn fires off the metadata fetcher. But at this point there is no ipv4 connectivity yet and the fetcher fails.

I think the easiest and most correct fix here is that we should adjust the script to try both the IPv6 and IPv4 IMDS address.

@hjkatz
Copy link
Author

hjkatz commented May 28, 2024

@arianvp Hooray for us!

We were successfully able to avoid the issue when turning off ipv6 IP assignment on our EC2 instances. We launched ~40 of them and all came up without issues.

Would you like to take a stab at that ipv6 adjustments to the metadata script?

@hjkatz
Copy link
Author

hjkatz commented May 28, 2024

Check out the PR here: NixOS/nixpkgs#314427

Happy to make adjustments and work back and forth there.

@hjkatz hjkatz closed this as completed May 28, 2024
@hjkatz
Copy link
Author

hjkatz commented May 28, 2024

Did not mean to close

@hjkatz hjkatz reopened this May 28, 2024
@commiterate
Copy link

commiterate commented Sep 6, 2024

Given the complicated nature of amazon-init and some of the other AWS EC2 scripts that exist in Nixpkgs, I'm wondering if we should just deprecate those in favor of the existing cloud-init module and other modules the Amazon Linux team has open sourced over the years. I don't see anything wrong with cloud-init running the following user data script to apply a NixOS config:

#!/bin/sh

# /bin/sh should be set up by the NixOS image to point to bash-interactive in the Nix store.

# Regenerate the hardware configuration?
nixos-generate-config

# Overwrite the existing configuration.
cat > /etc/nixos/configuration.nix << END_OF_FILE
{ config, pkgs, ... }:

{
  imports = [
    ./hardware-configuration.nix
  ];

  nix = {
    settings = {
      experimental-features = [
        "nix-command"
        "flakes"
      ];
    };
  };

  # ...
}
END_OF_FILE

# Rebuild the system.
nixos-rebuild switch

# Maybe create a flake.nix if they would rather a flake configuration instead.

Alternatively, the user script uses a wget (assuming GNU coreutils is available) or nix-shell --packages awscli2 --command "aws s3 cp s3://... /etc/nixos/configuration.nix" to download a NixOS configuration from an AWS S3 bucket.

That way, the NixOS AMIs provide:

  • amazon-ec2-utils
    • Provides udev rules for consistent device naming.
    • Added to services.udev.packages instead of environment.systemPackages.
  • cloud-init
    • Handles EC2 user data scripts.
  • amazon-ec2-net-utils
    • Handles network setup.
  • amazon-ssm-agent
    • Handles AWS SSM functionality.
  • aws-ec2-instance-connect-config
    • Handles AWS EC2 Instance Connect functionality.
    • Edit: Scratch this, the provided scripts hardcode the locations of all binaries it calls (curl, echo) and also only works for IMDS over IPv4.

All of these properly wait for IMDS and are maintained by upstream (i.e. cloud-init + AWS) instead of Nixpkgs maintainers.

Here's the Amazon Linux 2023 package list for reference:

https://docs.aws.amazon.com/linux/al2023/ug/image-comparison.html

The only thing missing would be a systemd oneshot unit that looks for an EC2 key pair from IMDS and registers it with the SSH daemon.

Personally, I don't really view that as high priority given that they can use SSM Session Manager with the SSM agent installed (also needs an SSM IAM role for the instance profile, but AWS provides a managed policy for that) or they can use EC2 Instance Connect to push an ephemeral SSH key through IAM-protected APIs.

Long-lived credentials like EC2 key pairs are bad practice to begin with.

The only other optional things I can think of are:

  • AWSTOE (open sourcing request)
    • For AWS EC2 Image Builder.
  • AWS CloudWatch Agent (Nixpkgs init PR)
    • For pushing logs + metrics + traces to AWS CloudWatch and AWS X-Ray.
    • Alternatively use fluentd (NixOS module exists) or fluent-bit (no module yet).

There's also the AWS CodeDeploy Agent, though the deploy pattern for that with mutable environments feels like something most NixOS users won't use.

Instead, I imagine NixOS users are more likely to bake immutable AMIs ahead of time and then use EC2 Instance Refresh to replace ASG instances. I have a separate feature request to the AWS CloudFormation team to allow triggering that feature natively with a CloudFormation stack update.

aws-cloudformation/cloudformation-coverage-roadmap#2119

To support setups that bake a NixOS disk image and upload those to S3 for EC2's ImportImage or ImportSnapshot + RegisterSnapshot workflows, I have another feature request to Image Builder for both CloudFormation support and expanding ImportImage to support AArch64 (also expand EC2 Image Builder's ImportVmImage to encompass it all so we get an ARN-able AMI that can be targeted with Image Builder lifecycle rules).

aws/ec2-image-builder-roadmap#103

@commiterate
Copy link

commiterate commented Sep 25, 2024

On a separate note for the IPv4 and IPv6 lease race condition, it looks like systemd-networkd has some options around that.

systemd/systemd#16700

https://search.nixos.org/options?channel=unstable&show=systemd.network.wait-online.anyInterface

I think even in IPv6-only subnets, IMDS is always available over IPv4?

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html

When the IPv6 endpoint is enabled, the IPv4 endpoint remains enabled.

https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-ip-address-range

Resources in IPv6-only subnets are assigned IPv4 link-local addresses from CIDR block 169.254.0.0/16. These addresses are used to communicate with VPC services like Instance Metadata Service (IMDS).

@arianvp
Copy link
Member

arianvp commented Sep 25, 2024

Ah nice!

We just have to configure networkd to wait on both ipv6 and ipv4 then I think.

Resources in IPv6-only subnets are assigned IPv4 link-local addresses from CIDR block 169.254.0.0/16.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants