
[services][blob] Add upload blob HTTP handler
ClosedPublic

Authored by bartek on Apr 18 2023, 12:04 AM.
Tags
None
Subscribers

Details

Summary

This diff implements blob uploading using multipart/form-data. Details are in the Linear task.

Depends on D7482

Test Plan

Used the Postman client to upload a large file (over 5MB, to exceed the single S3 part size). The file was uploaded successfully.
Also checked failure cases: an already existing blob hash, invalid form-data field order, and missing fields.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

bartek held this revision as a draft.
bartek published this revision for review. Apr 18 2023, 8:23 AM

Might be nice to factor out all of the fallible S3 code into something that returns an S3 error; then we can just map the error once from the calling code.

while let Some(mut blob_field) = payload.try_next().await? {
  let field_name = blob_field.name();
  if field_name != "blob_data" {
    warn!(
      field_name,
      "Malformed request: 'blob_data' multipart field expected."
    );
    return Err(ErrorBadRequest("Bad request"));
  }

  upload_blob_to_s3(blob_field).await.map_err(handle_s3_error)?;
}
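
As a rough illustration of that suggestion (handle_s3_error and its exact signature are hypothetical here, not code from this diff), the mapping could live in one small helper so the handler deals with S3 failures in a single place:

use actix_web::error::{Error as HttpError, ErrorInternalServerError};
use tracing::error;

// Hypothetical helper: log the SDK error and translate it into a generic
// 500 response, so the calling code only needs a single map_err.
fn handle_s3_error(err: aws_sdk_s3::Error) -> HttpError {
  error!("S3 operation failed: {:?}", err);
  ErrorInternalServerError("Internal error")
}

The handler then only needs the single .map_err(handle_s3_error) shown in the snippet above.
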
services/blob/src/http/handlers/blob.rs
129–132 ↗(On Diff #25252)

Feels like this is a bit fragile. Not a blocker, just a concern.

169–194 ↗(On Diff #25252)

My assumption is that you would need a separate upload_session per blob; if that's true, we could simplify this a bit.

194 ↗(On Diff #25252)

This will kill the server if it does get hit; we should probably map_err.

This revision now requires changes to proceed.Apr 18 2023, 10:07 AM

I should've explained this in the diff description.

Basically, the whole multipart stream consists of multiple Fields (form fields), where each field has its own stream of data bytes. For each field, I iterate over the stream and join the bytes to get the full field content.
The expected input format is:

  • a blob_hash field - its content is joined into a single string
  • multiple blob_data fields (the multipart spec allows fields to repeat) - they're concatenated together and uploaded to S3 as a single blob. Along the way, the data is buffered and uploaded as S3 parts, which are independent of the input partitioning.

Note that this endpoint uploads a single blob, regardless of the input partitioning.
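
A minimal sketch of that flow, assuming actix_multipart's streaming interface; handle_upload and the error messages are placeholders rather than the actual handler from this diff:

use actix_multipart::Multipart;
use actix_web::error::{Error, ErrorBadRequest};
use futures_util::TryStreamExt;

async fn handle_upload(mut payload: Multipart) -> Result<(), Error> {
  // The first field must be `blob_hash`; its chunks are joined into one value.
  let mut hash_field = payload
    .try_next()
    .await?
    .ok_or_else(|| ErrorBadRequest("Missing blob_hash"))?;
  if hash_field.name() != "blob_hash" {
    return Err(ErrorBadRequest("Expected blob_hash first"));
  }
  let mut blob_hash = Vec::new();
  while let Some(chunk) = hash_field.try_next().await? {
    blob_hash.extend_from_slice(&chunk);
  }
  if blob_hash.is_empty() {
    return Err(ErrorBadRequest("Empty blob_hash"));
  }

  // All remaining fields must be `blob_data`; their bytes form one blob and
  // get re-buffered into S3 parts, independent of the input chunking.
  while let Some(mut data_field) = payload.try_next().await? {
    if data_field.name() != "blob_data" {
      return Err(ErrorBadRequest("Expected blob_data"));
    }
    while let Some(_chunk) = data_field.try_next().await? {
      // buffer `_chunk`; flush to S3 as a part once the buffer is big enough
    }
  }
  Ok(())
}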

services/blob/src/http/handlers/blob.rs
169–194 ↗(On Diff #25252)

Nope, there's a single S3 session for all of them. The upload_session should be created only if the input contains some data

194 ↗(On Diff #25252)

This one should never fail, because upload_session is set to Some right above, and the ? operator would have already returned an error there.
If this hits, something is really wrong.

But I'll try to refactor the code anyway to simplify it a bit
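
For illustration, the lazy single-session approach could look roughly like this; UploadSession, upload_all, and their methods are stand-ins for this sketch, not the types from this diff:

// Stand-in type just for this sketch; the real session would wrap the S3
// multipart upload state (upload ID, accumulated part numbers, etc.).
struct UploadSession;

impl UploadSession {
  async fn start() -> Result<Self, std::io::Error> {
    Ok(UploadSession)
  }
  async fn add_part(&mut self, _part: Vec<u8>) -> Result<(), std::io::Error> {
    Ok(())
  }
  async fn finish(self) -> Result<(), std::io::Error> {
    Ok(())
  }
}

async fn upload_all(parts: Vec<Vec<u8>>) -> Result<(), std::io::Error> {
  // One session for the whole blob, created lazily on the first part, so
  // empty input never opens an S3 multipart upload.
  let mut session: Option<UploadSession> = None;
  for part in parts {
    if session.is_none() {
      session = Some(UploadSession::start().await?);
    }
    // `session` was set to Some right above, so this expect cannot fail.
    let current = session.as_mut().expect("session initialized above");
    current.add_part(part).await?;
  }
  if let Some(s) = session {
    s.finish().await?;
  }
  Ok(())
}

The expect mirrors the "set to Some right above" reasoning from the earlier reply, while keeping the invariant visible if the refactor ever changes it.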

Simplify the upload code. Extract the data handling logic into a separate function.

services/blob/src/http/handlers/blob.rs
224 ↗(On Diff #25509)

Tracked in https://linear.app/comm/issue/ENG-2398/blob-service-add-dedicated-s3-errors

Changing the finish_upload() signature will force us to look at this call site and adjust it appropriately, so this TODO won't get lost.

services/blob/src/s3.rs
200 ↗(On Diff #25509)
services/blob/src/http/handlers/blob.rs
135 ↗(On Diff #25509)

Just a nit, but do we expect blob_hash to be potentially large? If this field could be delivered entirely at once, the code would be a little simpler.

171 ↗(On Diff #25509)

Looks like an intermediate failure will force the client to retry the upload process from scratch. Will already-uploaded data from the previous attempt be cleaned up automatically, or left orphaned?

services/blob/src/http/handlers/blob.rs
135 ↗(On Diff #25509)

In practice, it will probably always be sent as one chunk, but the library doesn't guarantee that, so we need to collect it this way.

actix_multipart provides a high-level interface for that, but it isn't suitable for our use case, where we stream the data part by part to S3.
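
For reference, the chunk-collecting step could look roughly like this; collect_field_to_string is a placeholder name, not a helper from this diff:

use actix_multipart::Field;
use actix_web::error::{Error, ErrorBadRequest};
use futures_util::TryStreamExt;

// Even a small text field may arrive split across several chunks, so the
// bytes are accumulated before being interpreted as a string.
async fn collect_field_to_string(mut field: Field) -> Result<String, Error> {
  let mut buf = Vec::new();
  while let Some(chunk) = field.try_next().await? {
    buf.extend_from_slice(&chunk);
  }
  String::from_utf8(buf).map_err(|_| ErrorBadRequest("Invalid UTF-8"))
}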

171 ↗(On Diff #25509)

S3 will take care of cleaning uploads that weren't confirmed as complete by finish_upload().

This revision is now accepted and ready to land. Apr 24 2023, 12:41 PM