
[services][blob] Add upload blob HTTP handler
ClosedPublic

Authored by bartek on Apr 18 2023, 12:04 AM.
Tags
None
Subscribers

Details

Summary

This diff implements blob uploading using multipart/form-data. Details are in the Linear task.

Depends on D7482

Test Plan

Used the Postman client to upload a large file (over 5MB, to exceed the single S3 part size). The file was uploaded successfully.
Also checked failure cases: an already existing blob hash, invalid form-data field order, and missing fields.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

bartek held this revision as a draft.
bartek published this revision for review. Apr 18 2023, 8:23 AM

Might be nice to factor out all of the fallible S3 code into something that returns an S3 error; then we can just map the error once from the calling code.

while let Some(mut blob_field) = payload.try_next().await? {
  let field_name = blob_field.name();
  if field_name != "blob_data" {
    warn!(
      field_name,
      "Malformed request: 'blob_data' multipart field expected."
    );
    return Err(ErrorBadRequest("Bad request"));
  }

  upload_blob_to_s3(blob_field).await.map_err(handle_s3_error)?;
}
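
As a rough illustration of that suggestion (handle_s3_error and its exact signature are hypothetical here, not code from this diff), the mapping could live in one small helper so the handler deals with S3 failures in a single place:

use actix_web::error::{Error as HttpError, ErrorInternalServerError};
use tracing::error;

// Hypothetical helper: log the SDK error and translate it into a generic
// 500 response, so the calling code only needs a single map_err.
fn handle_s3_error(err: aws_sdk_s3::Error) -> HttpError {
  error!("S3 operation failed: {:?}", err);
  ErrorInternalServerError("Internal error")
}

The handler then only needs the single .map_err(handle_s3_error) shown in the snippet above.
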
services/blob/src/http/handlers/blob.rs
129–132 ↗(On Diff #25252)

Feels like this is a bit fragile. Not a blocker, just a concern.

169–194 ↗(On Diff #25252)

My assumption is that you would need a separate upload_session per blob; if that's true, we could simplify this a bit.

194 ↗(On Diff #25252)

This will kill the server if it does get hit; we should probably map_err.

This revision now requires changes to proceed.Apr 18 2023, 10:07 AM

I should've explained this in the diff description.

Basically, the whole multipart stream consists of multiple Fields (form fields), where each field has its own stream of data bytes. For each field, I iterate over the stream and join the bytes to get the full field content.
The expected input format is:

  • a blob_hash field - its content is joined into a single string
  • multiple blob_data fields (the multipart spec allows fields to repeat) - they're concatenated together and uploaded to S3 as a single blob. Along the way, the data is buffered and uploaded as S3 parts, which are independent of the input partitioning.

Note that this endpoint uploads a single blob, regardless of the input partitioning.
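
A minimal sketch of that flow, assuming actix_multipart's streaming interface; handle_upload and the error messages are placeholders rather than the actual handler from this diff:

use actix_multipart::Multipart;
use actix_web::error::{Error, ErrorBadRequest};
use futures_util::TryStreamExt;

async fn handle_upload(mut payload: Multipart) -> Result<(), Error> {
  // The first field must be `blob_hash`; its chunks are joined into one value.
  let mut hash_field = payload
    .try_next()
    .await?
    .ok_or_else(|| ErrorBadRequest("Missing blob_hash"))?;
  if hash_field.name() != "blob_hash" {
    return Err(ErrorBadRequest("Expected blob_hash first"));
  }
  let mut blob_hash = Vec::new();
  while let Some(chunk) = hash_field.try_next().await? {
    blob_hash.extend_from_slice(&chunk);
  }
  if blob_hash.is_empty() {
    return Err(ErrorBadRequest("Empty blob_hash"));
  }

  // All remaining fields must be `blob_data`; their bytes form one blob and
  // get re-buffered into S3 parts, independent of the input chunking.
  while let Some(mut data_field) = payload.try_next().await? {
    if data_field.name() != "blob_data" {
      return Err(ErrorBadRequest("Expected blob_data"));
    }
    while let Some(_chunk) = data_field.try_next().await? {
      // buffer `_chunk`; flush to S3 as a part once the buffer is big enough
    }
  }
  Ok(())
}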

services/blob/src/http/handlers/blob.rs
169–194 ↗(On Diff #25252)

Nope, there's a single S3 session for all of them. The upload_session should be created only if the input contains some data

194 ↗(On Diff #25252)

This one should never fail, because upload_session is set to Some right above, and the ? operator would have already returned an error there.
If this hits, something is really wrong.

But I'll try to refactor the code anyway to simplify it a bit
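
For illustration, the lazy single-session approach could look roughly like this; UploadSession, upload_all, and their methods are stand-ins for this sketch, not the types from this diff:

// Stand-in type just for this sketch; the real session would wrap the S3
// multipart upload state (upload ID, accumulated part numbers, etc.).
struct UploadSession;

impl UploadSession {
  async fn start() -> Result<Self, std::io::Error> {
    Ok(UploadSession)
  }
  async fn add_part(&mut self, _part: Vec<u8>) -> Result<(), std::io::Error> {
    Ok(())
  }
  async fn finish(self) -> Result<(), std::io::Error> {
    Ok(())
  }
}

async fn upload_all(parts: Vec<Vec<u8>>) -> Result<(), std::io::Error> {
  // One session for the whole blob, created lazily on the first part, so
  // empty input never opens an S3 multipart upload.
  let mut session: Option<UploadSession> = None;
  for part in parts {
    if session.is_none() {
      session = Some(UploadSession::start().await?);
    }
    // `session` was set to Some right above, so this expect cannot fail.
    let current = session.as_mut().expect("session initialized above");
    current.add_part(part).await?;
  }
  if let Some(s) = session {
    s.finish().await?;
  }
  Ok(())
}

The expect mirrors the "set to Some right above" reasoning from the earlier reply, while keeping the invariant visible if the refactor ever changes it.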

Simplify the upload code. Extract the data handling logic into a separate function.

services/blob/src/http/handlers/blob.rs
224 ↗(On Diff #25509)

Tracked in https://linear.app/comm/issue/ENG-2398/blob-service-add-dedicated-s3-errors

Changing the finish_upload() signature will force us to look at this call site and adjust it appropriately, so this TODO won't get lost.

services/blob/src/s3.rs
200 ↗(On Diff #25509)
services/blob/src/http/handlers/blob.rs
135 ↗(On Diff #25509)

Just a nit, but do we expect blob_hash to be potentially large? If this field could be delivered entirely at once, the code would be a little simpler.

171 ↗(On Diff #25509)

Looks like an intermediate failure will force the client to retry the upload process from scratch. Will already-uploaded data from the previous attempt be cleaned up automatically, or left orphaned?

services/blob/src/http/handlers/blob.rs
135 ↗(On Diff #25509)

In practice, it will probably always be sent as one chunk, but the library doesn't guarantee that, so we need to collect it this way.

actix_multipart provides a high-level interface for that, but it isn't suitable for our use case, where we stream the data part by part to S3.
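
For reference, the chunk-collecting step could look roughly like this; collect_field_to_string is a placeholder name, not a helper from this diff:

use actix_multipart::Field;
use actix_web::error::{Error, ErrorBadRequest};
use futures_util::TryStreamExt;

// Even a small text field may arrive split across several chunks, so the
// bytes are accumulated before being interpreted as a string.
async fn collect_field_to_string(mut field: Field) -> Result<String, Error> {
  let mut buf = Vec::new();
  while let Some(chunk) = field.try_next().await? {
    buf.extend_from_slice(&chunk);
  }
  String::from_utf8(buf).map_err(|_| ErrorBadRequest("Invalid UTF-8"))
}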

171 ↗(On Diff #25509)

S3 will take care of cleaning uploads that weren't confirmed as complete by finish_upload().

This revision is now accepted and ready to land. Apr 24 2023, 12:41 PM