
downstairs error retries will pause all IO in the guest #1918

Description

@leftwo

We will retry a failed write/flush in the downstairs forever.

However, this can cause a backlog that is visible in the upstairs, and it will pause IO until either the upstairs gives up on the downstairs or the guest gives up on its outstanding IO and marks the volume read-only. I think we can do better here.

Here is what I see happening.

Regular IO is in flight to the Downstairs.

    UUID  SESSION DS0 DS1 DS2   UPW   DSW   NEXT_JOB  DELTA   WRITE_BO    IP0   IP1   IP2     D0    D1    D2
7de543f2 c2b329b0 ACT ACT ACT     1  1602     545941   5454       4096      1     1     1   1601  1601  1601
7de543f2 c2b329b0 ACT ACT ACT     1  6321     550660   4719      94208      2    23     2   6319  6298  6319
7de543f2 c2b329b0 ACT ACT ACT     1  1175     555519   4859       8192      1     2     1   1174  1173  1174
7de543f2 c2b329b0 ACT ACT ACT     1   330     561066   5547       8192      2     1     2    328   329   328
7de543f2 c2b329b0 ACT ACT ACT     1  5963     566699   5633      69632      4    17     4   5959  5946  5959
7de543f2 c2b329b0 ACT ACT ACT     2 10008     570744   4045      81920      2    21     2  10006  9987 10006
7de543f2 c2b329b0 ACT ACT ACT     0  5746     576483   5739      12288      1     3     3   5745  5743  5743

I then use up space on a downstairs region so that writes will fail:

271FVPY0 # mkfile 1g ballast3; sleep 5; rm ballast3
ballast3: initialized 115474432 of 1073741824 bytes: Disc quota exceeded    

I see IOs pause in the upstairs. The DELTA column here is what to watch.

    UUID  SESSION DS0 DS1 DS2   UPW   DSW   NEXT_JOB  DELTA   WRITE_BO    IP0   IP1   IP2     D0    D1    D2
7de543f2 c2b329b0 ACT ACT ACT     1  5445    4137388   2163     667648      1     1   163   5444  5444  5282
7de543f2 c2b329b0 ACT ACT ACT     1  8009    4139952   2564     643072      1     1   157   8008  8008  7852
7de543f2 c2b329b0 ACT ACT ACT     0   151    4142095   2143     618496      0     0   151    151   151     0
Fill of dataset starts here
7de543f2 c2b329b0 ACT ACT ACT     1   738    4144308   2213    1478656      1   361   125    737   377   613
7de543f2 c2b329b0 ACT ACT ACT     1  2054    4145624   1316    3284992      1     1   802   2053  2053  1252
7de543f2 c2b329b0 ACT ACT ACT     1  2772    4146342    718    5689344      1     1  1389   2771  2771  1383
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579    237    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509

At this point no more IOs are being completed by the upstairs.
In the propolis logs, we see the flush check failures:

16:58:42.269Z WARN propolis-server (vm_state_driver): flush check fired despite having jobs; resetting it

But we don't see any errors coming back from the downstairs, just the flush check failure repeating
while the dataset on the downstairs is full.

After five seconds we have deleted the ballast file and freed up space on the downstairs, but we now
get a storm of messages once the IO in the downstairs finishes. I believe these are all the errors
that were generated while the IO could not complete:

16:58:46.838Z ERRO propolis-server (vm_state_driver): job id 4145115 saw error IoError("EDQUOT: Disc quota exceeded")
     = downstairs
    client = 2
    session_id = c2b329b0-03f4-4de1-ab26-319fff4a82fe

... many messages deleted

16:58:54.453Z ERRO propolis-server (vm_state_driver): job id 4145115 saw error IoError("writing context slots failed: E
DQUOT: Disc quota exceeded")
     = downstairs
    client = 2
    session_id = c2b329b0-03f4-4de1-ab26-319fff4a82fe

That's roughly 10 seconds of IO error logs.
From the propolis log we can count 777,659 errors printed,
all from the ~5 seconds when we could not
write our changes to disk.

I think this additional time is also preventing IOs from resuming. Looking at the full DTrace output
(assuming one line per second), the stall lasts longer than 5 seconds: there is a block of 12 seconds
with no delta, except for the 36 at the 5-second mark, which I suspect is when we deleted
the ballast file:

    UUID  SESSION DS0 DS1 DS2   UPW   DSW   NEXT_JOB  DELTA   WRITE_BO    IP0   IP1   IP2     D0    D1    D2
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579    237    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3009    4146579      0    6144000      0     0  1500   3009  3009  1509
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615     36    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     0  3045    4146615      0    6144000      0     0  1500   3045  3045  1545
7de543f2 c2b329b0 ACT ACT ACT     1   936    4147556    941     634880      1     1   155    935   935   781
7de543f2 c2b329b0 ACT ACT ACT     1  3377    4149997   2441     692224      1     1   169   3376  3376  3208

Things I think we need to figure out next:

  • Why does one stuck downstairs pause IO on the other two?
    Have we just reached the number of outstanding IOs we can hold in memory, and need to
    kick out a downstairs to move forward?

  • It looks like we store up error messages while in the error loop, and don't send any until the error is
    resolved. This is a problem: our unlimited message queue now grows without bound. I can see this
    by watching the downstairs memory usage while it is under error; it grows until the
    error condition ends.

  • I think printing all the errors we received may itself have some effect, as the downstairs should be
    ready to start processing IO again after 5 seconds. I'm not sure whether it's the 700,000 messages
    to send, the 700,000 to log, or both that cause this extra delay before IO resumes.
