
downstairs error retries will pause all IO in the guest #1918

Description

@leftwo

We will retry a failed write/flush in the downstairs forever.

However, this can cause a backlog that is visible in the upstairs, and it will pause IO until either the upstairs gives up on the downstairs or the guest gives up on its outstanding IO and marks the volume read-only. I think we can do better here.

Here is what I see happening.

Regular IO is in flight to the Downstairs.

    UUID  SESSION DS0 DS1 DS2   UPW   DSW   NEXT_JOB  DELTA   WRITE_BO    IP0   IP1   IP2     D0    D1    D2
7de543f2 c2b329b0 ACT ACT ACT     1  1602     545941   5454       4096      1     1     1   1601  1601  1601
7de543f2 c2b329b0 ACT ACT ACT     1  6321     550660   4719      94208      2    23     2   6319  6298  6319
7de543f2 c2b329b0 ACT ACT ACT     1  1175     555519   4859       8192      1     2     1   1174  1173  1174
7de543f2 c2b329b0 ACT ACT ACT     1   330     561066   5547       8192      2     1     2    328   329   328
7de543f2 c2b329b0 ACT ACT ACT     1  5963     566699   5633      69632      4    17     4   5959  5946  5959
7de543f2 c2b329b0 ACT ACT ACT     2 10008     570744   4045      81920      2    21     2  10006  9987 10006
7de543f2 c2b329b0 ACT ACT ACT     0  5746     576483   5739      12288      1     3     3   5745  5743  5743

I then use up space on a downstairs region so that writes will fail:

271FVPY0 # mkfile 1g ballast3; sleep 5; rm ballast3
ballast3: initialized 115474432 of 1073741824 bytes: Disc quota exceeded    

I see IOs pause in the upstairs. The DELTA column here is what to watch.

    UUID  SESSION DS0 DS1 DS2   UPW   DSW   NEXT_JOB  DELTA   WRITE_BO    IP0   IP1   IP2     D0    D1    D2
7de543f2 c2b329b0 ACT ACT ACT     1  5445    4137388   2163     667648      1     1   163   5444  5444  5282
7de543f2 c2b329b0 ACT ACT ACT     1  8009    4139952   2564     643072      1     1   157   8008  8008  7852
7de543f2 c2b329b0 ACT ACT ACT     0   151    4142095   2143     618496      0     0   151    151   151     0
Fill of dataset starts here
7de543f2 c2b329b0 ACT ACT ACT     1   738    4144308   2213    1478656      1   361   125    737   377   613
7de543f2 c2b329b0 ACT ACT ACT     1  2054    4145624   1316    3284992      1     1   802   2053  2053  1252
7de543f2 c2b329b0 ACT ACT ACT     1  2772    4146342    718    5689344      1     1  1389   2771  2771  1383
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579    237    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509

At this point no more IOs are being completed by the upstairs.
In the propolis logs, we see the flush check failures:

16:58:42.269Z WARN propolis-server (vm_state_driver): flush check fired despite having jobs; resetting it

But we don't see any errors coming back from the downstairs, just the flush check failure repeating
while the dataset on the downstairs is full.

After five seconds we have deleted the ballast file and freed up space on the downstairs, but we now
get a storm of messages once the IO in the downstairs finishes. I believe these are all the errors
that were generated while the IO could not complete:

16:58:46.838Z ERRO propolis-server (vm_state_driver): job id 4145115 saw error IoError("EDQUOT: Disc quota exceeded")
     = downstairs
    client = 2
    session_id = c2b329b0-03f4-4de1-ab26-319fff4a82fe

... many messages deleted

16:58:54.453Z ERRO propolis-server (vm_state_driver): job id 4145115 saw error IoError("writing context slots failed: E
DQUOT: Disc quota exceeded")
     = downstairs
    client = 2
    session_id = c2b329b0-03f4-4de1-ab26-319fff4a82fe

That's roughly 10 seconds of IO error logs.
From the propolis log we can count 777,659 errors printed,
all from the ~5 seconds when we could not
write our changes to disk.

I think this additional time is also preventing IOs from resuming. Looking at the full DTrace output
(assuming one line per second), the stall lasts longer than 5 seconds: there is a block of 12 seconds
with no delta, except for the 36 at the 5-second mark, which I suspect is when we deleted
the ballast file:

    UUID  SESSION DS0 DS1 DS2   UPW   DSW   NEXT_JOB  DELTA   WRITE_BO    IP0   IP1   IP2     D0    D1    D2
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579    237    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615     36    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     1   936    4147556    941     634880      1     1   155    935   935   781
7de543f2 c2b329b0 ACT ACT ACT     1  3377    4149997   2441     692224      1     1   169   3376  3376  3208

Things I think we need to figure out next:

  • Why does one stuck downstairs pause IO on the other two?
    Have we just reached the number of outstanding IOs we can hold in memory, and need to
    kick out a downstairs to move forward?

  • It looks like we store up error messages while in the error loop, and don't send any until the error is
    resolved. This is a problem: our unlimited message queue now grows without bound. I can see this
    by watching the downstairs memory usage while it is under error; it grows until the
    error condition ends.

  • I think printing all the errors we received may itself have some effect, as the downstairs should be
    ready to start processing IO again after 5 seconds. I'm not sure whether it's the 700,000 messages
    to send, the 700,000 to log, or both that cause this extra delay before IO resumes.
