Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove archived repos #1704

Open
hbontempo-br opened this issue Oct 8, 2022 · 8 comments
Open

Remove archived repos #1704

hbontempo-br opened this issue Oct 8, 2022 · 8 comments

Comments

@hbontempo-br
Copy link
Contributor

hbontempo-br commented Oct 8, 2022

First of all: Great service!

Last week I received I "CodeTriage misses you" email (yeah, I know... 😞) that suggested me to engage with an archived repo.

I don't think that any archived repository should be suggested to a user (both on email suggestions and through the web application)

Expected Behavior

Archived repos probably should be removed from the platform.
They shouldn't be suggested on daily emails or through the web interface "Open source projects on GitHub that need your help." functionatlity

Current Behavior

Archived repos are handled the same way as any other repo.

Possible Solution

Periodically check if the repo was archived then:

  • Record it:
    • Save to a flag column in the database
    • Add a default_scope to the model

OR

  • Delete it

(Probably could expand the the existing Checks if repos have been deleted on GitHub task if necessary to unnecessary requests to GH's api)

Steps to Reproduce (for bugs)

Example: https://www.codetriage.com/simplabs/excellent

@schneems
Copy link
Member

schneems commented Oct 8, 2022 via email

@hbontempo-br
Copy link
Contributor Author

hbontempo-br commented Oct 10, 2022

Didn't though about about this other scenarios.

I didn't found anything explicitly on GH's documentation on what to expect when a repo is rename or transferred, so I did some testing with my accounts.
Renaming and transferring the repo private shouldn't be a big problem to track, despite all the changes the repo can still be accessed using the repo endpoint with the old address (unless the a new repo is created with the name of old repo).
Making the repo private results in the same behaviour as a delete repo (response 404).


Update 2022-10-10

After @avcwisesa's comment I went back to check my testing and I was wrong. In my laziness I used the gh api commands thinking they would return the same thing a curl, but they don't.
Here is the correct behavior for the request GET https://api.github.com/repos/USER_NAME/REPO_NAME:

  • Public repo:

    • Response status: 200
    • Response body: as described by the documentation
  • Archived repo:

    • Response status: 200
    • Response body: the same as the previous case but with 'archived' key set to true
  • Renamed OR Transferred:

    • Response 301
    • Body:
    {
      "message": "Moved Permanently",
      "url": "https://api.github.com/repositories/1111111",
      "documentation_url": "https://docs.github.com/v3/#http-redirects"
    }
    • When following the address I received the same body as described in the GET https://api.github.com/repos/USER_NAME/REPO_NAME request
  • Deleted OR Private repo:

    • Response status: 404
    • Response body:
    {
     "message": "Not Found",
     "documentation_url": "https://docs.github.com/rest/reference/repos#get-a-repository"
    }

In summary:

  • tracking with deleted/private is already working
  • tracking archived repos should be easy as well
  • tracking renamed/transferred repos apparently is easy, just start saving the id of the repo and used to in the api instead of the USER_NAME/REPO_NAME, but I'm not sure if this is a good idea since it's not in the GH documentation

@avcwisesa
Copy link
Contributor

I also encountered a slight doubt when using CodeTriage the first time related to moved/transferred repo not sure which one is the correct one.

image

I saw when the old address is accessed, Github will return 301 with the new location.

curl -v https://github.com/pydata/pandas
< HTTP/2 301
< server: GitHub.com
< date: Mon, 10 Oct 2022 06:28:41 GMT
< content-type: text/html; charset=utf-8
< vary: X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, Accept-Encoding, Accept, X-Requested-With
< location: https://github.com/pandas-dev/pandas
< cache-control: no-cache

@schneems
Copy link
Member

I think that what's happening is the client https://github.com/zombocom/git_hub_bub/blob/master/README.md is automatically following the 301 redirect.

It uses excon under the hood and here's where I'm doing that location following https://github.com/zombocom/git_hub_bub/blob/5003d6ef8d6a23fae3c0a534c49d10057265eb1b/lib/git_hub_bub/request.rb#L46.

Essentially in this case it's hitting the old API endpoint, which is giving me a 301 and I'm following that 301 to return actual valid issues. So in this case subscribing to either is "correct" in that you're getting the same issues. However it's confusing that there's two of them.

What I think we need to detect that case is to make a request to the API for each repo (without following the 301 redirect) to determine if it's been moved or not. If it has been moved, loop through all the users and move them over to the new repo ( create a new repo if not already done. Then there's also updating dependent to point at the new repo as well:

has_many :repo_subscriptions
has_many :users, through: :repo_subscriptions
has_many :subscribers, through: :repo_subscriptions, source: :user
has_many :doc_classes, dependent: :destroy
has_many :doc_methods, dependent: :destroy
delegate :open_issues, to: :issues
has_many :repo_labels
has_many :labels, through: :repo_labels
.

@schneems
Copy link
Member

I'm wondering if there are any Rails gems or methods to handle that for us instead of having to do it all manually. (i.e. to update all foreign keys that to point at a new database value).

@avcwisesa
Copy link
Contributor

What I think we need to detect that case is to make a request to the API for each repo (without following the 301 redirect) to determine if it's been moved or not.

+1, maybe can put this check on new repo submission & as periodic job?

If it has been moved, loop through all the users and move them over to the new repo ( create a new repo if not already done. Then there's also updating dependent to point at the new repo as well:

Some stuff I want to clarify about that thought:

  • When a registered repo in CodeTriage got transferred/renamed. Do we want to create a new repo or update the existing instance with the new name/user_name?
  • Does subscribed users need to be notified? (in case of a rename)

If we're only updating the name & user_name, then not need to touch the repos with updated name & user and maybe can have rake task to deduplicate the repo 🤔

@hbontempo-br
Copy link
Contributor Author

What I think we need to detect that case is to make a request to the API for each repo (without following the 301 redirect) to determine if it's been moved or not.

Since git_hub_bub deals with the redirects why not use this feature and just compare the information returned by the RepoFetcher with the one we have in the Repo?


Suggestion: Let's break all this discussion in smaller chuncks? (maybe even break this issue into multiple)

In my mind we could split the problem in:

  1. Keeping track of changes
    (Pretty confidant that I can do a PR until the weekend so we can discuss over)
  2. Dealing with duplicated repos
  3. Handling notification to subscribers

@schneems
Copy link
Member

In my mind we could split the problem in:

100% sounds good. I'm totally fine with breaking into multiple pieces.

Since git_hub_bub deals with the redirects why not use this feature and just compare the information returned by the RepoFetcher with the one we have in the Repo?

That's perfectly valid as well. GitHub returns URLs to resources, so you could compare the expected versus actual URL without needing an extra API request.

When a registered repo in CodeTriage got transferred/renamed. Do we want to create a new repo or update the existing instance with the new name/user_name?

I think renaming it is probably the easiest thing if we can, then we don't have to deal with merging/porting over data. It becomes a problem if someone else ALREADY added the other repo before we can move it ourself.

Does subscribed users need to be notified? (in case of a rename)

I would say no, if we need to change this policy later we can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants