Usage
PureImage is a command-line Java application. Once you have downloaded and extracted (or installed) PureImage, you can run it in one of these ways:
- with the pureimage command, if the deb package is installed (on Debian GNU/Linux);
- with the prepared pureimage.sh script, if you use the ZIP package;
- directly from the command line, following the syntax described below, if you want to change, for example, the output directory.
The most important part of the configuration is the de-identification profile. Its content depends on your goals: the required level of anonymization and the file formats you process determine which rules the profile needs. Examples are available on the profile page and in the downloaded PureImage package. The bundled profile is not comprehensive, but it can be a useful starting point for writing your own rules.
Command Line Syntax
The same information is printed when you run PureImage with the --help argument.
usage: java -Xms128m -Xmx512m -jar pureimage-2024.11.jar [-a] [-c <file>] [-d <dir>] [-dl <file>] [-dlo <file>] [-dlof <filter>] [-do <dir>] [-dt <dir>] [-h] [-m <mode>] [-nc] [-o <file>] [-ot <type>] [-pre] [--remove-source-file] [-tc <count>] [-v]
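As a rough illustration of the syntax above, the following sketch assembles two typical invocations, one for classification and one for de-identification. The directory and file names (./data, ./out, result.tsv, my-config.properties) are placeholders chosen for this example, not defaults shipped with PureImage. The commands are only printed so that they can be reviewed before being run.

```shell
#!/bin/sh
# Sketch only: all paths and the config file name are placeholders (assumptions).
JAR=pureimage-2024.11.jar
JAVA_OPTS="-Xms128m -Xmx512m"

# Classification (the default mode): write the results to a file and
# store copies of the files that are not clean in the output directory.
classify_cmd="java $JAVA_OPTS -jar $JAR -m classify -d ./data -do ./out -o result.tsv -ot notClean"

# De-identification with a custom configuration file.
deidentify_cmd="java $JAVA_OPTS -jar $JAR -m deidentify -c ./my-config.properties -d ./data -do ./out -ot deidentified"

# Print the commands for review; run one with: eval "$classify_cmd"
echo "$classify_cmd"
echo "$deidentify_cmd"
```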
Options
- -a,--annotate
- Enable annotation in blind output files.
- -c,--config file
- The PureImage configuration file.
- -d,--data dir
- Input data directory or file. The whole directory is processed recursively.
- -dl,--data-list file
- File with a list of input data files for (selective) processing, one file per line, with a relative or absolute path. The input data option (-d) can be used in addition to append further files for processing.
- -dlo,--data-list-output file
- Load the list of input files from an existing result file, which is parsed for this purpose.
- -dlof,--data-list-output-filter filter
- Filter the loaded result file by one or more filters. Both the 'clean' and the 'clean*' syntax can be used in a filter: the first accepts only files with the clean status, while the second accepts any status that starts with the word 'clean' (e.g. cleanNotSure). Use quotation marks to prevent the shell from expanding the star (*).
- -do,--dir-output dir
- Output directory where the processing results are stored. The default is the ./out/ directory in the working directory.
- -dt,--dir-tmp dir
- Temporary directory where temporary processing files are stored. The default is the system temp directory (e.g. /tmp/ or C:\Users\<name>\AppData\Local\Temp\).
- -h,--help
- Print this message.
- -m,--mode mode
- Processing mode. Values: (c)lassify or (d)eidentify. Default value is classify.
- -nc,--not-clean
- Do not clean temporary files automatically.
- -o,--output file
- Output file with results.
- -ot,--output-type type
- Output type, i.e. which files are stored in the output directory. Possible values are: none (default), all, clean, notClean, cleanUnsure, cleanOrUnsure, cleanAndDeidentified, deidentified, failed.
- -pre,--pre-processing
- Enable pre-processing of files. It is used for text identification only.
- --remove-source-file
- Remove the source file after de-identification.
- -tc,--thread-count count
- Number of threads used for processing input files. The default is the number of available threads minus 1.
- -v,--verbose
- Verbose output.
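The -dl and -dlof options can be illustrated with a short sketch. It writes a list file with one path per line and shows how the filter is quoted. The names ./data, files.txt, result.tsv and the helper make_data_list are assumptions made for this example, and the PureImage commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only: ./data, files.txt and result.tsv are placeholder names.
DATA_DIR=./data
LIST=files.txt

# Write one path per line, as the -dl option expects.
make_data_list() {
    # $1 = directory to scan, $2 = list file to write
    find "$1" -type f | sort > "$2"
}

if [ -d "$DATA_DIR" ]; then
    make_data_list "$DATA_DIR" "$LIST"
    echo "java -jar pureimage-2024.11.jar -dl $LIST -o result.tsv"
fi

# Process again only the files whose status in an earlier result file
# starts with 'clean'. The single quotes keep the shell from expanding the star.
echo "java -jar pureimage-2024.11.jar -dlo result.tsv -dlof 'clean*'"
```

Keeping the list file outside the scanned directory avoids listing the list itself as an input file.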
Run PureImage on Debian GNU/Linux
The pureimage command is available once the pureimage_2024.11-1_all.deb package is installed.
The configuration file is located in the /etc/mre/pureimage/ directory. You can modify this file directly or work on a copy. Be careful: changes can affect the classification results as well as the time and resources required to process each file.
The best practice is to use a copy of the configuration file in production.
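One possible way to work with a copy is sketched here. The destination directory, the helper name copy_config and the file name configuration.properties are assumptions for this example; check the actual contents of /etc/mre/pureimage/ on your system. The PureImage command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: the file name configuration.properties and the target
# directory are assumptions; check what /etc/mre/pureimage/ contains.
SRC_DIR=/etc/mre/pureimage
DEST_DIR="$HOME/pureimage-config"

copy_config() {
    # $1 = source directory, $2 = destination directory
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}

if [ -d "$SRC_DIR" ]; then
    copy_config "$SRC_DIR" "$DEST_DIR"
    # Edit the copy, then point PureImage at it with -c.
    echo "pureimage -c $DEST_DIR/configuration.properties -d ./data"
else
    echo "$SRC_DIR not found; is the deb package installed?" >&2
fi
```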
Run PureImage from the ZIP file
The pureimage.sh script is available as soon as you extract the pureimage_2024.11.zip archive.
First, execute the ./pureimage.sh script without arguments. It will produce output like this:
    /tmp/pureimage-2024.11$ ./pureimage.sh
    # =================================================== #
    # PureImage                                           #
    # - default configuration.properties                  #
    # - pre-configured classification mode                #
    # - pre-configured input/output/tmp local directories #
    # =================================================== #
    PureImage uses these external tools:
    - tool /usr/bin/convert -- OK
    - tool /usr/bin/dcm2pnm -- OK
    - tool /usr/bin/parallel -- OK
    - tool /usr/bin/tesseract -- OK
    - tool /usr/local/bin/binarizewolfjolion -- OK
    Please, place your data into the ./data/ directory and re-run the command ./pureimage.sh to do the classification task.
The script notifies you when any of the external tools is missing. In that case, install the missing package or fix the path in configuration.properties.
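If you want to check the external tools yourself before running the script, a sketch along these lines may help. The tool names are taken from the sample output above; the function name check_tools and the MISSING label are made up for this example and are not produced by PureImage.

```shell
#!/bin/sh
# Tool names come from the sample output above; the function name and
# the MISSING label belong to this sketch, not to PureImage.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            echo "- tool $path -- OK"
        else
            echo "- tool $tool -- MISSING"
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of missing tools (0 means all found).
    return "$missing"
}

check_tools convert dcm2pnm parallel tesseract binarizewolfjolion \
    || echo "Install the missing tools or fix their paths in configuration.properties"
```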
It also prepares the ./data/, ./output/ and ./tmp/ directories for you.
Next, copy your files to the ./data/ directory and execute ./pureimage.sh again. It will classify the input files.
The script is pre-configured so that copies of the NOT CLEAN files are stored in the ./output/ directory. There will also be a TSV file with the classification results.
The temporary ./tmp/ directory is where the data processing takes place. Be careful: the processing performs very intensive disk operations. You can use in-memory storage for the ./tmp/ directory, but do not combine the --not-clean option with a huge amount of data and in-memory storage.
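When PureImage is run directly from the command line, one way to try in-memory storage is to pass a memory-backed directory with -dt. The sketch below assumes a Linux system where /dev/shm is a tmpfs mount; the directory name pureimage-tmp and the helper pick_tmp_dir are placeholders for this example. The commands are printed so that they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: assumes a Linux system where /dev/shm is a tmpfs mount.
# The directory name pureimage-tmp is a placeholder.
pick_tmp_dir() {
    # $1 = preferred in-memory location, $2 = on-disk fallback
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "$1/pureimage-tmp"
    else
        echo "$2"
    fi
}

TMP_DIR=$(pick_tmp_dir /dev/shm ./tmp)

# Create the directory, then pass it with -dt. Leave out --not-clean so
# that temporary files are removed automatically and do not fill the memory.
echo "mkdir -p $TMP_DIR"
echo "java -Xms128m -Xmx512m -jar pureimage-2024.11.jar -d ./data -dt $TMP_DIR"
```

Checking the free space of the chosen location (for example with df) before a large run is a sensible precaution, since everything stored there counts against available memory.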