Development features of isoreader

This vignette introduces some of the development features of the isoreader package and is aimed primarily at code contributors interested in expanding its functionality or helping with bug fixes.

Adding new file format readers

The easiest way to test a new file format reader is to register a reader function for a specific file extension with iso_register_dual_inlet_file_reader or iso_register_continuous_flow_file_reader. Both require an extension (e.g. ".ext") and the name of the new reader function (e.g. "new_reader"), and optionally take a description. Both return a data frame listing all registered readers. Replacing an existing reader with a different function requires an explicit overwrite = TRUE flag.

All reader functions must accept an isoreader data structure object (ds) as the first argument and a list of reader-specific options (options) as the second argument, and should return the structure with data filled in so that downstream isoreader operations work smoothly. The following minimal example registers a new_reader function that simply prints the layout of the provided data structure skeleton ds.

new_reader <- function(ds, options = list()) {
  isoreader:::log_message("this is the new reader!")
  str(ds)
  return(ds)
}

# register new reader
readers <- iso_register_dual_inlet_file_reader(".new.did", "new_reader")
knitr::kable(readers)

type	call	extension	func	cacheable	post_read_check	description	software	env
dual inlet	iso_read_dual_inlet	.caf	iso_read_caf	TRUE	TRUE	Dual Inlet file format (older)	Isodat	isoreader
dual inlet	iso_read_dual_inlet	.did	iso_read_did	TRUE	TRUE	Dual Inlet file format (newer)	Isodat	isoreader
dual inlet	iso_read_dual_inlet	.txt	iso_read_nu	TRUE	TRUE	Dual Inlet file format	Nu	isoreader
continuous flow	iso_read_continuous_flow	.cf	iso_read_cf	TRUE	TRUE	Continuous Flow file format (older)	Isodat	isoreader
continuous flow	iso_read_continuous_flow	.dxf	iso_read_dxf	TRUE	TRUE	Continuous Flow file format (newer)	Isodat	isoreader
continuous flow	iso_read_continuous_flow	.iarc	iso_read_flow_iarc	TRUE	TRUE	Continuous Flow data archive	ionOS	isoreader
scan	iso_read_scan	.scn	iso_read_scn	TRUE	TRUE	Scan file format	Isodat	isoreader
continuous flow	iso_read_continuous_flow	.cf.rds	iso_read_rds	FALSE	FALSE	R Data Storage	isoreader	isoreader
dual inlet	iso_read_dual_inlet	.di.rds	iso_read_rds	FALSE	FALSE	R Data Storage	isoreader	isoreader
scan	iso_read_scan	.scan.rds	iso_read_rds	FALSE	FALSE	R Data Storage	isoreader	isoreader
dual inlet	iso_read_dual_inlet	.new.did	new_reader	TRUE	TRUE	NA	NA	R_GlobalEnv


# copy an example file from the package with the new extension
iso_get_reader_example("dual_inlet_example.did") |> file.copy(to = "example.new.did")
#> [1] TRUE

# read the file
iso_read_dual_inlet("example.new.did", read_cache = FALSE)
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file 'example.new.did' with '.new.did' reader...
#> Info: this is the new reader!
#> List of 7
#>  $ version          :Classes 'package_version', 'numeric_version'  hidden list of 1
#>   ..$ : int [1:3] 1 4 1
#>  $ read_options     :List of 4
#>   ..$ file_info        : logi TRUE
#>   ..$ method_info      : logi TRUE
#>   ..$ raw_data         : logi TRUE
#>   ..$ vendor_data_table: logi TRUE
#>  $ file_info        : tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
#>   ..$ file_id      : chr "example.new.did"
#>   ..$ file_root    : chr "."
#>   ..$ file_path    : chr "example.new.did"
#>   ..$ file_subpath : chr NA
#>   ..$ file_datetime: POSIXct[1:1], format: NA
#>   ..$ file_size    : int 134446
#>  $ method_info      : list()
#>  $ raw_data         : tibble [0 × 0] (S3: tbl_df/tbl/data.frame)
#>  Named list()
#>  $ vendor_data_table: tibble [0 × 0] (S3: tbl_df/tbl/data.frame)
#>  Named list()
#>  $ bgrd_data        : tibble [0 × 0] (S3: tbl_df/tbl/data.frame)
#>  Named list()
#>  - attr(*, "class")= chr [1:2] "dual_inlet" "iso_file"
#>  - attr(*, "problems")= tibble [0 × 3] (S3: tbl_df/tbl/data.frame)
#>   ..$ type   : chr(0) 
#>   ..$ func   : chr(0) 
#>   ..$ details: chr(0)
#> Info: finished reading 1 files in 0.21 secs
#> Dual inlet iso file 'example.new.did': 0 cycles, 0 ions ()
file.remove("example.new.did")
#> [1] TRUE

Note that for parallel processing to work during the read process (parallel = TRUE), isoreader needs to know where to find the new reader function. It figures this out automatically as long as the function name is unique. If this fails (or to be on the safe side), specify the environment explicitly during reader registration, e.g. env = "R_GlobalEnv" or env = "newpackage". Also note that isoreader does not automatically know where to find functions called from within the new reader function if they are not part of base R, so it is recommended to make all outside calls explicit (e.g. dplyr::filter(...)) to preempt this problem. For info messages and warnings to work with the progress bar and in parallel reads, use isoreader:::log_message(...) and isoreader:::log_warning(...) instead of base R’s message(...) and warning(...).
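
As a minimal sketch of the registration advice above, the following registers a reader with an explicit environment and overwrite flag. The function name parallel_safe_reader is hypothetical and exists only for illustration; the sketch assumes the function is defined in the global environment of the current R session.

```r
library(isoreader)

# hypothetical reader used only for illustration
parallel_safe_reader <- function(ds, options = list()) {
  # log through isoreader so the message works with the progress bar and in parallel reads
  isoreader:::log_message("parallel_safe_reader called")
  # any helper that is not part of base R would be called with its namespace,
  # e.g. dplyr::filter(...)
  return(ds)
}

# overwrite = TRUE because ".new.did" may already be registered with a different function;
# env states explicitly where the reader function lives
readers <- iso_register_dual_inlet_file_reader(
  ".new.did", "parallel_safe_reader",
  overwrite = TRUE,
  env = "R_GlobalEnv"
)

# inspect the registration entry for the new extension
readers[readers$extension == ".new.did", c("extension", "func", "env")]
```

For a reader that lives in a package, env would name that package instead of "R_GlobalEnv".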

If you have designed and tested a new reader, please consider contributing it to the isoreader GitHub repository via pull request.

Processing hooks

Isoreader defines two processing hooks, one at the beginning and one at the end of reading an individual file. This is useful for integration into pipelines that require additional output (such as GUIs) and sometimes also for debugging. The expressions are evaluated in the context of the isoreader:::read_iso_file function and have access to all parameters passed to this function, e.g. file_n and path. As with new readers, use isoreader:::log_message(...) and isoreader:::log_warning(...) instead of base R’s message(...) and warning(...) so that info messages and warnings work with the progress bar and in parallel reads. The main difference between the two is that log_message() honors the quiet = TRUE flag passed to the main iso_read...() call, whereas log_warning() always shows its message regardless of the quiet setting.

# expression evaluated before each file is read
isoreader:::set_read_file_event_expr({
  isoreader:::log_message(sprintf("starting file #%d, named '%s'", file_n, basename(path)))
})
# expression evaluated after each file has been read
isoreader:::set_finish_file_event_expr({
  isoreader:::log_message(sprintf("finished file #%d", file_n))
})

c(
  iso_get_reader_example("dual_inlet_example.did"),
  iso_get_reader_example("dual_inlet_example.caf")
) |> iso_read_dual_inlet(read_cache = FALSE)
#> Info: preparing to read 2 data files (all will be cached)...
#> Info: reading file 'dual_inlet_example.did' with '.did' reader...
#> Info: starting file #1, named 'dual_inlet_example.did'
#> Info: finished file #1
#> Info: reading file 'dual_inlet_example.caf' with '.caf' reader...
#> Info: starting file #2, named 'dual_inlet_example.caf'
#> Info: finished file #2
#> Info: finished reading 2 files in 7.54 secs
#> Data from 2 dual inlet iso files: 
#> # A tibble: 2 × 6
#>   file_id                file_path_  file_subpath raw_data file_info method_info
#>   <chr>                  <chr>       <chr>        <glue>   <chr>     <chr>      
#> 1 dual_inlet_example.did dual_inlet… NA           7 cycle… 16 entri… standards,…
#> 2 dual_inlet_example.caf dual_inlet… NA           8 cycle… 22 entri… standards,…

isoreader:::initialize_options() # reset all isoreader options

Debugging isoreader

The best way to start debugging an isoreader call is to switch the package into debug mode using the internal isoreader:::iso_turn_debug_on() function. This enables debug messages, turns caching off by default so files are always read anew, and makes the package keep more information in the isofile objects. Errors inside file readers are still caught (and tracked in the problems) unless you call iso_turn_debug_on(catch_errors = FALSE), in which case errors are not caught and stop the processing, so you get the full traceback and the debugging options of your IDE.
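
A minimal sketch of the catch_errors = FALSE workflow, assuming a deliberately failing reader. The names failing_reader and the ".fail.did" extension are hypothetical and only serve to trigger an error; the example file is copied to a temporary directory so nothing is written to the working directory.

```r
library(isoreader)

# hypothetical reader that always fails, for illustration only
failing_reader <- function(ds, options = list()) {
  stop("deliberate failure for illustration")
}
readers <- iso_register_dual_inlet_file_reader(
  ".fail.did", "failing_reader", overwrite = TRUE
)

# copy a package example file to a temporary location with the new extension
path <- file.path(tempdir(), "example.fail.did")
file.copy(iso_get_reader_example("dual_inlet_example.did"), path, overwrite = TRUE)

# debug mode without error catching: the reader error is expected to propagate
isoreader:::iso_turn_debug_on(catch_errors = FALSE)
result <- tryCatch(
  iso_read_dual_inlet(path, quiet = TRUE),
  error = function(e) e
)
inherits(result, "error")

# clean up
file.remove(path)
isoreader:::initialize_options() # reset all isoreader options
```

The tryCatch() wrapper is only there to keep the sketch self-contained. In an interactive session one would call iso_read_dual_inlet() directly and inspect the traceback.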

Debugging binary file reads (Isodat)

Errors during the binary file reads usually indicate the approximate position in the file where the error was encountered. The easiest way to get started on figuring out what the file looks like at that position is to use a binary file editor and jump to the position. For a sense of the interpreted structure around that position, one can use iso_print_source_file_structure() which shows what binary patterns isoreader recognized. This binary representation of the source file is only available if the file is read while in debug mode, otherwise file objects would get unnecessarily large:

# turn on debug mode
isoreader:::iso_turn_debug_on()
#> Info: debug mode turned on, error catching turned on, caching turned off
# read example file
ex <- iso_get_reader_example("dual_inlet_example.did") |>  
  iso_read_dual_inlet(quiet = TRUE)
# retrieve source structure and print a part of it
bin <- ex |> iso_get_source_file_structure() 
bin |> iso_print_source_file_structure(length = 500)
#> # Textual representation of the partial structure (bytes 1 - 504) of the isodat file.
#> # Print more/less by specifying the 'start', 'length' or 'end' parameters.
#> 0000001: <CFileHeader>{unknown-4: 'fe f7 31 01'}
#> 0000022:   <06-000>{text-10: 'CBlockData'}{text-18: 'CDualInletDocument'}<4x00>
#> 0000094:   <03-000>{unknown-2: '2f 00'}{text-20: 'Acquisition-1568.did'}{text-11: 'File Header'}<4x00>
#> 0000174:   <02-000>
#> 0000178:   <02-000>
#> 0000182: <CTimeObject>
#> 0000199:   <03-000>{unknown-2: '2f 00'}{text-4: 'Date'}{text-4: 'Date'}<4x00>
#> 0000233:   <01-000>{unknown-4: '4a 2b 4e 54'}
#> 0000241: <CStr>
#> 0000251:   <02-000>{text-18: 'RW2000TemplateName'}
#> 0000295:   <02-000>{text-84: 'C:\Thermo\Isodat NT\Global\User\Dual Inlet System\Result Workshop\Default Result.IRW'}
#> 0000471: <CDataIndex>
#> 0000487:   <03-000>{unknown-2: '2f 00'}{text-0: 'NA'}{text-0: 'NA'}<4x00>

This structure representation shows recognized control elements in <...> and data elements in {...}. Data elements are converted to a text or numeric representation if the interpretation is unambiguous, and shown as plain hexadecimal characters if the nature of the data cannot be determined with certainty. You can adjust start and length to look at different parts of the binary file, or save the structure to a text file with save_to_file.
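
As an alternative to a separate binary file editor, the raw bytes at a given position can also be inspected directly in base R. The following is a minimal sketch; peek_bytes is a hypothetical helper and not part of isoreader. It is demonstrated on a small temporary file that holds the four bytes 'fe f7 31 01' (shown as an unknown element in the printout above) after two padding bytes.

```r
# hypothetical helper (not part of isoreader): return n_bytes raw bytes
# starting at the 1-based byte position `start`
peek_bytes <- function(path, start, n_bytes = 16L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  seek(con, where = start - 1, origin = "start") # seek() positions are 0-based
  readBin(con, what = "raw", n = n_bytes)
}

# self-contained demo: two padding bytes followed by 'fe f7 31 01'
tmp <- tempfile(fileext = ".bin")
writeBin(as.raw(c(0x00, 0x00, 0xfe, 0xf7, 0x31, 0x01)), tmp)

bytes <- peek_bytes(tmp, start = 3, n_bytes = 4)
paste(as.character(bytes), collapse = " ")
#> [1] "fe f7 31 01"

file.remove(tmp)
```

With a real data file, path would point to the file and start to the approximate position reported in the error message.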

For an overview of all the elements (blocks) identified in the binary file as a tibble, use:

bin$blocks |> head(20)
#> # A tibble: 20 × 8
#>    block_idx start   end   len data_len type    priority block               
#>        <int> <int> <int> <int>    <dbl> <chr>      <int> <chr>               
#>  1         1     1    17    17       11 C block        1 CFileHeader         
#>  2         2    18    21     4        4 unknown        5 fe f7 31 01         
#>  3         3    22    25     4        0 x-000          3 06-000              
#>  4         4    26    49    24       10 text           2 CBlockData          
#>  5         5    50    89    40       18 text           2 CDualInletDocument  
#>  6         6    90    93     4        0 0000+          4 4x00                
#>  7         7    94    97     4        0 x-000          3 03-000              
#>  8         8    98    99     2        2 unknown        5 2f 00               
#>  9         9   100   143    44       20 text           2 Acquisition-1568.did
#> 10        10   144   169    26       11 text           2 File Header         
#> 11        11   170   173     4        0 0000+          4 4x00                
#> 12        12   174   177     4        0 x-000          3 02-000              
#> 13        13   178   181     4        0 x-000          3 02-000              
#> 14        14   182   198    17       11 C block        1 CTimeObject         
#> 15        15   199   202     4        0 x-000          3 03-000              
#> 16        16   203   204     2        2 unknown        5 2f 00               
#> 17        17   205   216    12        4 text           2 Date                
#> 18        18   217   228    12        4 text           2 Date                
#> 19        19   229   232     4        0 0000+          4 4x00                
#> 20        20   233   236     4        0 x-000          3 01-000

While this provides all elements, the top level structure is provided by the so-called control blocks:

bin$blocks |> dplyr::filter(type == "C block") |> head(20)
#> # A tibble: 20 × 8
#>    block_idx start   end   len data_len type    priority block                  
#>        <int> <int> <int> <int>    <dbl> <chr>      <int> <chr>                  
#>  1         1     1    17    17       11 C block        1 CFileHeader            
#>  2        14   182   198    17       11 C block        1 CTimeObject            
#>  3        22   241   250    10        4 C block        1 CStr                   
#>  4        27   471   486    16       10 C block        1 CDataIndex             
#>  5        35   513   535    23       17 C block        1 CSeqLineIndexData      
#>  6        43   588   598    11        5 C block        1 CData                  
#>  7       113  1133  1157    25       19 C block        1 CDualInletBlockData    
#>  8       121  1240  1261    22       16 C block        1 CMeasurmentInfos       
#>  9       129  1324  1350    27       21 C block        1 CISLScriptMessageData  
#> 10       163  1945  1967    23       17 C block        1 CMeasurmentErrors      
#> 11       172  2038  2060    23       17 C block        1 CDualInletRawData      
#> 12       180  2099  2114    16       10 C block        1 CBlockData             
#> 13       188  2269  2302    34       28 C block        1 CIntegrationUnitTransf…
#> 14       199  2361  2380    20       14 C block        1 CIntensityData         
#> 15       639  5299  5319    21       15 C block        1 CDualInletShout        
#> 16       655  5460  5485    26       20 C block        1 CTwoDoublesArrayData   
#> 17       827  6412  6433    22       16 C block        1 CStatusArrayData       
#> 18       885  6747  6764    18       12 C block        1 COutlierData           
#> 19      6461 40876 40902    27       21 C block        1 CResultDataSimpleList  
#> 20      6469 40951 40973    23       17 C block        1 CResultDataSimple
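
To see which elements sit under which control block, each block can be labelled with the most recent control block that precedes it. The following base R sketch uses a small data frame with values copied from the first 17 rows of the blocks printout above, so it runs without a data file. The control_block column is a hypothetical addition for illustration, and the sketch assumes that every element belongs to the closest preceding control block, as the indentation of the structure printout suggests.

```r
# first 17 rows of bin$blocks, copied from the printout above (subset of columns)
blocks <- data.frame(
  block_idx = 1:17,
  start = c(1L, 18L, 22L, 26L, 50L, 90L, 94L, 98L, 100L, 144L, 170L,
            174L, 178L, 182L, 199L, 203L, 205L),
  type = c("C block", "unknown", "x-000", "text", "text", "0000+", "x-000",
           "unknown", "text", "text", "0000+", "x-000", "x-000", "C block",
           "x-000", "unknown", "text"),
  block = c("CFileHeader", "fe f7 31 01", "06-000", "CBlockData",
            "CDualInletDocument", "4x00", "03-000", "2f 00",
            "Acquisition-1568.did", "File Header", "4x00", "02-000", "02-000",
            "CTimeObject", "03-000", "2f 00", "Date"),
  stringsAsFactors = FALSE
)

# cumsum() counts how many control blocks have started so far;
# rows before the first control block (none here) would get NA
is_control <- blocks$type == "C block"
control_names <- blocks$block[is_control]
blocks$control_block <- c(NA_character_, control_names)[cumsum(is_control) + 1L]

# list the text elements found under each control block
is_text <- blocks$type == "text"
text_by_control <- split(blocks$block[is_text], blocks$control_block[is_text])
text_by_control
```

The same steps should apply to the full bin$blocks tibble when it is available from a file read in debug mode.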

To look at a specific control block, provide its start position to iso_print_source_file_structure():

cdata <- bin$blocks |> dplyr::filter(block == "CData")
cdata
#> # A tibble: 1 × 8
#>   block_idx start   end   len data_len type    priority block
#>       <int> <int> <int> <int>    <dbl> <chr>      <int> <chr>
#> 1        43   588   598    11        5 C block        1 CData

bin |> iso_print_source_file_structure(start = cdata$start, length = 500)
#> # Textual representation of the partial structure (bytes 588 - 1098) of the isodat file.
#> # Print more/less by specifying the 'start', 'length' or 'end' parameters.
#> 0000588: <CData>
#> 0000599:   <03-000>{unknown-2: '2f 00'}{text-3: '158'}{text-4: 'Line'}<4x00>{unknown-2: '0b 80'}
#> 0000633:   <03-000>{unknown-2: '2f 00'}{text-1: '1'}{text-11: 'Peak Center'}<4x00>{unknown-2: '0b 80'}
#> 0000677:   <03-000>{unknown-2: '2f 00'}{text-1: '1'}{text-11: 'Pressadjust'}<4x00>{unknown-2: '0b 80'}
#> 0000721:   <03-000>{unknown-2: '2f 00'}{text-1: '1'}{text-10: 'Background'}<4x00>{unknown-2: '0b 80'}
#> 0000763:   <03-000>{unknown-2: '2f 00'}{text-11: 'CIT Carrara'}{text-12: 'Identifier 1'}<4x00>{unknown-2: '0b 80'}
#> 0000829:   <03-000>{unknown-2: '2f 00'}{text-2: '13'}{text-12: 'Identifier 2'}<4x00>{unknown-2: '0b 80'}
#> 0000877:   <03-000>{unknown-2: '2f 00'}{text-5: '49077'}{text-8: 'Analysis'}<4x00>{unknown-2: '0b 80'}
#> 0000923:   <03-000>{unknown-2: '2f 00'}{text-0: 'NA'}{text-7: 'Comment'}<4x00>{unknown-2: '0b 80'}
#> 0000957:   <03-000>{unknown-2: '2f 00'}{text-0: 'NA'}{text-11: 'Preparation'}<4x00>{unknown-2: '0b 80'}
#> 0000999:   <03-000>{unknown-2: '2f 00'}{text-0: 'NA'}{text-11: 'Post Script'}<4x00>{unknown-2: '0b 80'}
#> 0001041:   <03-000>{unknown-2: '2f 00'}{text-16: 'CO2_multiply_16V'}{text-6: 'Method'}

2023-07-31
