We will retry a failed write/flush in the downstairs forever.
However, this can cause a backup that is visible in the upstairs, and it will pause IO until either the upstairs gives up on the downstairs, or the guest gives up on its outstanding IO and marks the volume RO. I think we can do better here.
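As a rough illustration of the alternative, here is a minimal sketch of a bounded retry: after a fixed number of failures the error is returned to the caller, which could then fault that downstairs rather than letting IO back up forever. This is not Crucible's actual code; `do_write`, `MAX_RETRIES`, and the retry delay are all hypothetical.

```rust
use std::{io, thread, time::Duration};

// Hypothetical cap; today the downstairs retry loop never gives up.
const MAX_RETRIES: u32 = 5;

/// Sketch: retry a failing write a bounded number of times, then surface the
/// error so the caller can fault this downstairs and keep the other two going.
fn write_with_retry(mut do_write: impl FnMut() -> io::Result<()>) -> io::Result<()> {
    let mut attempts = 0;
    loop {
        match do_write() {
            Ok(()) => return Ok(()),
            Err(e) if attempts < MAX_RETRIES => {
                attempts += 1;
                eprintln!("write failed ({e}), retry {attempts}/{MAX_RETRIES}");
                thread::sleep(Duration::from_millis(100));
            }
            // Out of retries: give up instead of retrying forever.
            Err(e) => return Err(e),
        }
    }
}
```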
Here is what I see happening.
Regular IO is in flight to the Downstairs.
UUID SESSION DS0 DS1 DS2 UPW DSW NEXT_JOB DELTA WRITE_BO IP0 IP1 IP2 D0 D1 D2
7de543f2 c2b329b0 ACT ACT ACT 1 1602 545941 5454 4096 1 1 1 1601 1601 1601
7de543f2 c2b329b0 ACT ACT ACT 1 6321 550660 4719 94208 2 23 2 6319 6298 6319
7de543f2 c2b329b0 ACT ACT ACT 1 1175 555519 4859 8192 1 2 1 1174 1173 1174
7de543f2 c2b329b0 ACT ACT ACT 1 330 561066 5547 8192 2 1 2 328 329 328
7de543f2 c2b329b0 ACT ACT ACT 1 5963 566699 5633 69632 4 17 4 5959 5946 5959
7de543f2 c2b329b0 ACT ACT ACT 2 10008 570744 4045 81920 2 21 2 10006 9987 10006
7de543f2 c2b329b0 ACT ACT ACT 0 5746 576483 5739 12288 1 3 3 5745 5743 5743
I then use up the space on one downstairs region so writes to it will fail:
271FVPY0 # mkfile 1g ballast3; sleep 5; rm ballast3
ballast3: initialized 115474432 of 1073741824 bytes: Disc quota exceeded
I see IOs pause in the upstairs. The DELTA column here is what to watch.
UUID SESSION DS0 DS1 DS2 UPW DSW NEXT_JOB DELTA WRITE_BO IP0 IP1 IP2 D0 D1 D2
7de543f2 c2b329b0 ACT ACT ACT 1 5445 4137388 2163 667648 1 1 163 5444 5444 5282
7de543f2 c2b329b0 ACT ACT ACT 1 8009 4139952 2564 643072 1 1 157 8008 8008 7852
7de543f2 c2b329b0 ACT ACT ACT 0 151 4142095 2143 618496 0 0 151 151 151 0
Fill of dataset starts here
7de543f2 c2b329b0 ACT ACT ACT 1 738 4144308 2213 1478656 1 361 125 737 377 613
7de543f2 c2b329b0 ACT ACT ACT 1 2054 4145624 1316 3284992 1 1 802 2053 2053 1252
7de543f2 c2b329b0 ACT ACT ACT 1 2772 4146342 718 5689344 1 1 1389 2771 2771 1383
7de543f2 c2b329b0 ACT ACT ACT 0 3009 4146579 237 6144000 0 0 1500 3009 3009 1509
7de543f2 c2b329b0 ACT ACT ACT 0 3009 4146579 0 6144000 0 0 1500 3009 3009 1509
At this point, no more IOs are being completed by the Upstairs.
In the propolis logs, we see the flush check failures:
16:58:42.269Z WARN propolis-server (vm_state_driver): flush check fired despite having jobs; resetting it
But we don't see any errors coming back from the downstairs, just the flush check failure repeating
while the downstairs dataset is full.
After five seconds, we have deleted the ballast file and freed up space on the downstairs, but we now have a
storm of messages that come back once the IO in the downstairs finishes. I believe these are all the errors
that were generated during the time the IO could not complete:
16:58:46.838Z ERRO propolis-server (vm_state_driver): job id 4145115 saw error IoError("EDQUOT: Disc quota exceeded")
= downstairs
client = 2
session_id = c2b329b0-03f4-4de1-ab26-319fff4a82fe
... many messages deleted
16:58:54.453Z ERRO propolis-server (vm_state_driver): job id 4145115 saw error IoError("writing context slots failed: EDQUOT: Disc quota exceeded")
= downstairs
client = 2
session_id = c2b329b0-03f4-4de1-ab26-319fff4a82fe
So that's roughly 10 seconds of IO error logging.
From the propolis log we can count that 777659 errors were printed,
all from the ~5 seconds of time where we could not
write our changes to disk.
I think this additional time is also preventing IOs from resuming. If we look at the full DTrace output,
the stall lasts longer than 5 seconds: assuming 1 line per second here, we have a block of 12 seconds
with no delta (except for the 36 at 5 seconds, which I wonder is when we delete
the ballast file):
UUID SESSION DS0 DS1 DS2 UPW DSW NEXT_JOB DELTA WRITE_BO IP0 IP1 IP2 D0 D1 D2
7de543f2 c2b329b0 ACT ACT ACT 0 3009 4146579 237 6144000 0 0 1500 3009 3009 1509
7de543f2 c2b329b0 ACT ACT ACT 0 3009 4146579 0 6144000 0 0 1500 3009 3009 1509
7de543f2 c2b329b0 ACT ACT ACT 0 3009 4146579 0 6144000 0 0 1500 3009 3009 1509
7de543f2 c2b329b0 ACT ACT ACT 0 3009 4146579 0 6144000 0 0 1500 3009 3009 1509
7de543f2 c2b329b0 ACT ACT ACT 0 3009 4146579 0 6144000 0 0 1500 3009 3009 1509
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 36 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 0 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 0 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 0 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 0 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 0 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 0 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 0 3045 4146615 0 6144000 0 0 1500 3045 3045 1545
7de543f2 c2b329b0 ACT ACT ACT 1 936 4147556 941 634880 1 1 155 935 935 781
7de543f2 c2b329b0 ACT ACT ACT 1 3377 4149997 2441 692224 1 1 169 3376 3376 3208
Things I think we need to figure out next are:

- Why does one stuck downstairs pause IO on the other two? Have we just reached the number of outstanding IOs we can hold in memory, and need to kick out a downstairs to move forward? (See the first sketch after this list.)
- It looks like we store up error messages while in the error loop, and don't send any until the error is resolved. This is a problem, as it means our unlimited queue depth for messages now grows without bound. I can see this if I watch the downstairs memory usage while under error: it grows until the error condition ends. (See the second sketch after this list.)
- I think there may be some effect from printing all the errors we received, as the downstairs should be ready to start processing IO again after 5 seconds. I'm not sure if it's because there are 700,000 messages to send, or 700,000 to log (or both), that gives us this extra delay before IO resumes.
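For the first question, one direction would be a cap on how far one downstairs can fall behind its peers before the upstairs faults it instead of letting all IO back up. This is only a sketch of the idea under assumed names; `Client`, `ClientState`, and `MAX_OUTSTANDING_JOBS` are not Crucible's actual types or limits.

```rust
/// Hypothetical per-client state tracked by the upstairs.
#[derive(Debug, PartialEq)]
enum ClientState {
    Active,
    Faulted,
}

struct Client {
    state: ClientState,
    outstanding_jobs: usize, // jobs submitted but not yet acked by this downstairs
}

/// Hypothetical limit: once one downstairs has this many more unfinished jobs
/// than its peers, stop waiting for it so IO can resume on the remaining two.
const MAX_OUTSTANDING_JOBS: usize = 10_000;

fn check_for_stuck_client(clients: &mut [Client; 3]) {
    let min_outstanding = clients
        .iter()
        .map(|c| c.outstanding_jobs)
        .min()
        .unwrap_or(0);
    for c in clients.iter_mut() {
        if c.state == ClientState::Active
            && c.outstanding_jobs - min_outstanding > MAX_OUTSTANDING_JOBS
        {
            // Mark the laggard faulted; it could be repaired and reactivated
            // once the error condition on that downstairs clears.
            c.state = ClientState::Faulted;
        }
    }
}
```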
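For the second item, the memory growth points at an unbounded queue holding responses and error messages while the connection can't drain them. A bounded channel gives the producer backpressure instead of unbounded growth; here is a minimal, generic sketch using the standard library (assumed names and sizes, not the actual Crucible message path):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Unbounded mpsc::channel() lets the producer run ahead without limit,
    // which is how a multi-second error loop can queue hundreds of thousands
    // of messages in memory. sync_channel(N) instead blocks the producer once
    // N messages are queued, so memory use is capped while the consumer is stuck.
    let (tx, rx) = mpsc::sync_channel::<String>(1024);

    let producer = thread::spawn(move || {
        for i in 0..1_000_000u32 {
            // Blocks once 1024 messages are waiting (backpressure); stops if
            // the receiver has gone away.
            if tx.send(format!("job {i} saw error EDQUOT")).is_err() {
                break;
            }
        }
    });

    // Slow consumer: only drains a few messages, then drops the receiver.
    for msg in rx.iter().take(10) {
        println!("{msg}");
    }
    drop(rx);

    let _ = producer.join();
}
```

The trade-off is that a full bounded queue stalls the error loop, which may be exactly what we want here: the work stops accumulating in memory instead of piling up until the error clears.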