Usage

PureImage is a command-line Java application. Once you have downloaded and extracted PureImage, you can run it in one of these ways:

  • with the pureimage command, if the deb package is installed (on Debian GNU/Linux);
  • with the prepared pureimage.sh script, if you use the ZIP package;
  • directly from the command line, following its syntax, if you want to change e.g. the output directory.

The most important part of the configuration is the de-identification profile. Its content depends on what you want to achieve: the required level of anonymization and the file format determine which de-identification profile PureImage needs. Some examples can be found on the profile page and in the downloaded PureImage package. The bundled profile is not comprehensive, but it can be a useful starting point for writing your own rules.

Command Line Syntax

You can also display this information with the --help argument.

usage: java -Xms128m -Xmx512m -jar pureimage-2024.11.jar  [-a] [-c
       <file>] [-d <dir>] [-dl <file>] [-dlo <file>] [-dlof <filter>] [-do
       <dir>] [-dt <dir>] [-h] [-m <mode>] [-nc] [-o <file>] [-ot <type>]
       [-pre] [--remove-source-file] [-tc <count>] [-v]

Options

-a,--annotate
Enable annotation in blind output files.
-c,--config file
The PureImage configuration file.
-d,--data dir
Input data directory or file. The whole directory is processed recursively.
-dl,--data-list file
File with the list of input data files for (selective) processing, one file per line, with a relative or absolute path. The input data option (-d) can be used to add further files for processing.
-dlo,--data-list-output file
Load the list of input files by parsing an existing result file.
-dlof,--data-list-output-filter filter
Filter the loaded result file by the given filter or filters. Both the 'clean' and the 'clean*' syntax can be used in the filter. The first accepts only files with the clean status; the second accepts any status that starts with the word 'clean' (e.g. cleanNotSure). Be sure to use quotation marks to prevent the shell from expanding the star (*).
-do,--dir-output dir
Output directory where the processing results are stored. The default is the ./out/ directory in the working directory.
-dt,--dir-tmp dir
Temporary directory where temporary processing files are stored. The default is the system temp directory (e.g. /tmp/ or C:\Users\<name>\AppData\Local\Temp\).
-h,--help
Print this message.
-m,--mode mode
Processing mode. Values: (c)lassify or (d)eidentify. Default value is classify.
-nc,--not-clean
Do not clean temporary files automatically.
-o,--output file
Output file with results.
-ot,--output-type type
Set the output type, i.e. which files are stored in the output directory. Possible values are: none (default), all, clean, notClean, cleanUnsure, cleanOrUnsure, cleanAndDeidentified, deidentified, failed.
-pre,--pre-processing
Enable file pre-processing. It is used for text identification only.
--remove-source-file
Remove the source file after de-identification.
-tc,--thread-count count
Number of threads used for processing input files. The default is the number of available threads minus 1.
-v,--verbose
Verbose output.
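
The options above can be combined into a two-pass workflow: classify first, then de-identify only the files whose status starts with 'clean'. The following is a minimal sketch, not a verified recipe. It assumes that the file written with -o is the kind of result file that -dlo reads, and that -dlo can be combined with -m deidentify. The names classification.tsv, ./data and ./out, and the location of configuration.properties, are placeholders. The commands are printed in every case and executed only when the jar is present in the working directory.

```shell
#!/bin/sh
# Two-pass sketch: classify everything, then de-identify a filtered subset.
# CONFIG, RESULT and the directory names are placeholders, not PureImage defaults.
set -eu

JAR="pureimage-2024.11.jar"          # jar name taken from the usage line
CONFIG="./configuration.properties"  # assumed path to your configuration copy
RESULT="./classification.tsv"        # hypothetical name of the result file

# Print a command, and execute it only when the jar is actually present.
run() {
  echo "+ $*"
  if [ -f "$JAR" ]; then
    "$@"
  fi
}

# Pass 1: classify the whole ./data tree and write the result file.
run java -Xms128m -Xmx512m -jar "$JAR" \
  -c "$CONFIG" -d ./data -m classify -o "$RESULT"

# Pass 2: reload the result file, keep statuses starting with "clean",
# and de-identify those files. The filter is quoted so that the shell
# passes the star to PureImage instead of expanding it.
run java -Xms128m -Xmx512m -jar "$JAR" \
  -c "$CONFIG" -dlo "$RESULT" -dlof 'clean*' -m deidentify \
  -do ./out -ot deidentified
```

Printing each command before running it makes it easy to check the quoting of the filter before any data is touched.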

Run PureImage on Debian GNU/Linux

The pureimage command is available once the pureimage_2024.11-1_all.deb package is installed.

The configuration file is located in the /etc/mre/pureimage/ directory. You can copy this file or modify it. Be careful: changes can affect classification as well as the time and resources required to process each file.

The best practice is to use a copy of the configuration file in production.
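
Working from a copy could look like the sketch below. It rests on several assumptions: that the packaged file is named configuration.properties (the name used in the ZIP package), that the pureimage command accepts the same options as the jar, and that ./pureimage-work is an acceptable working directory. The working directory and the result file name are hypothetical.

```shell
#!/bin/sh
# Sketch: run the Debian-packaged command against a private copy of the config.
set -eu

SYSTEM_CONFIG="/etc/mre/pureimage/configuration.properties"  # file name is an assumption
WORK_DIR="${WORK_DIR:-./pureimage-work}"                     # hypothetical working directory
MY_CONFIG="$WORK_DIR/configuration.properties"

# Create the working directory together with its input data directory.
mkdir -p "$WORK_DIR/data"

# Copy the packaged configuration once; never overwrite local edits.
if [ -f "$SYSTEM_CONFIG" ] && [ ! -f "$MY_CONFIG" ]; then
  cp "$SYSTEM_CONFIG" "$MY_CONFIG"
fi

# Run only when the package is installed and the copy exists.
if command -v pureimage >/dev/null 2>&1 && [ -f "$MY_CONFIG" ]; then
  pureimage -c "$MY_CONFIG" -d "$WORK_DIR/data" \
    -do "$WORK_DIR/out" -o "$WORK_DIR/result.tsv"
else
  echo "pureimage or $MY_CONFIG not found; nothing was run"
fi
```

Keeping the copy next to the data leaves the packaged file untouched, so a package upgrade cannot silently change the settings used for a given run.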

Run PureImage from the ZIP file

The pureimage.sh script is available as soon as you extract the pureimage_2024.11.zip archive.

First, execute the ./pureimage.sh script without arguments. It will produce output like this:

/tmp/pureimage-2024.11$ ./pureimage.sh

# =================================================== #
# PureImage                                           #
# - default configuration.properties                  #
# - pre-configured classification mode                #
# - pre-configured input/output/tmp local directories #
# =================================================== #

PureImage uses these external tools:
 - tool /usr/bin/convert -- OK
 - tool /usr/bin/dcm2pnm -- OK
 - tool /usr/bin/parallel -- OK
 - tool /usr/bin/tesseract -- OK
 - tool /usr/local/bin/binarizewolfjolion -- OK

Please, place your data into the ./data/ directory and re-run 
the command ./pureimage.sh to do the classification task.

The script notifies you when any external tool is missing. In that case, install the missing package or fix the path in configuration.properties.

It also prepares the ./data/, ./output/ and ./tmp/ directories for you.

Next, copy your files to the ./data/ directory and execute ./pureimage.sh again. It will classify the input files.

The script is pre-configured to store copies of NOT CLEAN files in the ./output/ directory. A TSV file with the classification results is also produced.

The temporary ./tmp/ directory is where the data processing takes place. Be careful: processing performs very intensive disk operations. You can use in-memory storage for the ./tmp/ directory, but do not combine the --not-clean option with large amounts of data and in-memory storage.
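
One way to get an in-memory temporary directory when calling the jar directly is to point -dt at a directory under /dev/shm, which is a memory-backed tmpfs on most Linux systems. This is a sketch under that assumption, not part of the PureImage distribution; the directory prefix pureimage-tmp, the result file name and the configuration path are hypothetical. If /dev/shm is not writable, the sketch falls back to the regular temp directory.

```shell
#!/bin/sh
# Sketch: use a memory-backed temporary directory via the -dt option.
set -eu

JAR="pureimage-2024.11.jar"          # jar name taken from the usage line
CONFIG="./configuration.properties"  # assumed path to your configuration copy

# Prefer /dev/shm (tmpfs on most Linux systems); otherwise use the disk.
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
  TMP_DIR=$(mktemp -d /dev/shm/pureimage-tmp.XXXXXX)
else
  TMP_DIR=$(mktemp -d "${TMPDIR:-/tmp}/pureimage-tmp.XXXXXX")
fi
echo "temporary directory: $TMP_DIR"

# Run only when the jar is present. --not-clean is deliberately left out,
# so temporary files do not pile up in memory.
if [ -f "$JAR" ]; then
  java -Xms128m -Xmx512m -jar "$JAR" \
    -c "$CONFIG" -d ./data -dt "$TMP_DIR" -o ./result.tsv
fi

# Remove the temporary directory itself once processing has finished.
rm -rf "$TMP_DIR"
```

Files under /dev/shm count against available RAM, so it is worth checking the free space there before processing a large data set.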