Spacy / Prodigy - Prediction pauses without any Error

Hello Everyone!

I am building an automated, machine-learning-based prediction system: I feed in a document and get predictions for its paragraphs from a text classification model pre-trained with Prodigy.

In some cases, the document contains text (tabular data, which I am not interested in) that freezes spaCy without returning a prediction (no error is thrown):


Even after two days, no prediction is returned and no error is thrown, so my automation fails.

Is it possible to define a time-out option in spaCy that would let it skip a prediction after a pre-defined computational time-out (say, one minute)?

Text on which the spaCy prediction freezes:

model('METEREOLOGICAL CONDITIONS :\r\n\r\n Annual Frequency of Occurrence of Maximum Current Speeds for Each of the 16 Characteristic Current Profiles 1234567891011121314151601.0000.9211.0001.0001.0001.0000.9030.4920.4781.0000.7071.0000.4951.0000.3700.89431.0000.9211.0001.0001.0001.0000.9030.4920.4781.0000.7071.0000.4951.0000.3700.89470.9760.8920.9760.9740.9800.9800.8850.4570.4650.9600.6900.9550.4750.9630.3460.885110.9630.8780.9590.9500.9670.9660.8780.4410.4620.9280.6860.9250.4710.9230.3460.887160.9530.8700.9340.9130.9480.9520.8770.4290.4630.8830.6940.8850.4750.8620.3640.898220.9450.8660.8960.8580.9130.9340.8880.4060.4640.8110.7160.8200.4900.7610.4060.924280.9380.8670.8400.7910.8690.9120.9120.3800.4650.7070.7580.7300.5180.6300.4760.963360.9260.8760.7760.7110.8280.8770.9530.3570.4780.5910.8180.6400.5570.4990.5620.996460.9030.8930.7000.6090.7950.8120.9910.3460.5080.4820.8940.5630.6090.3770.6661.000570.8650.9160.6290.4990.7810.7191.0000.3600.5510.3910.9650.5100.6730.2770.7750.946690.8130.9430.5710.3960.7890.6120.9590.4060.6080.3281.0000.4800.7510.2060.8790.838810.7570.9760.5250.3040.8280.4940.8850.4820.6960.3000.9950.4930.8550.1690.9610.680930.6911.0000.4910.2320.9010.3750.7730.6130.8300.3120.9580.5680.9580.1571.0000.4501050.6030.9770.4720.1890.9670.2710.6050.7920.9560.3670.8580.7001.0000.1650.9590.2531170.4970.8850.4580.1730.9820.1960.4280.9461.0000.4420.6900.8510.9510.1880.8210.1381290.3940.7340.4430.1720.9230.1510.2921.0000.9380.5160.5060.9590.8170.2240.6350.0971410.3100.5640.4270.1850.8070.1300.2080.9580.8040.5690.3600.9940.6440.2620.4590.0901530.2450.4080.4010.2000.6690.1310.1570.8480.6380.5730.2550.9400.4740.2810.3300.0901650.1960.2840.3690.2100.5400.1390.1300.7050.4900.5380.1910.8230.3420.2880.2530.1001890.1390.1640.2560.1690.3140.1180.1110.3960.2540.3540.1390.4960.1840.2220.1820.099211 and below0.0410.0300.0900.0730.0810.0530.0430.1190.0620.1290.0430.1600.0400.0910.0590.045Occurrence Percentage of each 
profile23.0616.037.826.176.005.895.365.124.694.184.093.943.912.321.090.').cats

Thank you,
Swarup Selvaraj.

I think this could be related to a bug in the tokenizer, actually (see here). This should already be fixed on spacy-nightly. So as a first sanity check, you could set up the nightly version in a separate environment, try parsing that string with a blank model and see if it hangs.

The current version of Prodigy isn’t officially compatible with the upcoming spaCy v2.1 yet (and the nightly currently has a small bug in the text classifier). So you’d have to wait until the stable release or at least the next nightly release.

(Also, in the future, there’s no need to let spaCy run for 2 days. Honestly, if you don’t get anything back in like, a few minutes – excluding model startup time – stop the process, there’s likely a problem. At least if you’re using spaCy’s built-in components.)
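If you want to enforce such a time-out yourself rather than stopping the process by hand, one option (not built into spaCy) is to run each prediction in a child process and kill it if it doesn't finish in time. Here is a minimal sketch: `predict` is a hypothetical stand-in that just sleeps to simulate the hang; in real use it would call your loaded model (e.g. `queue.put(model(text).cats)`), and the one-minute default is only an example value.

```python
import multiprocessing
import time

# Fork context: the child inherits the parent's memory, so an already
# loaded model does not have to be re-loaded on every call. (On Windows
# only "spawn" exists, and the model must be loaded inside the child.)
ctx = multiprocessing.get_context("fork")

def predict(text, queue):
    # Stand-in for the real call, e.g. queue.put(model(text).cats).
    # Here we simulate a tokenizer hang with a long sleep.
    time.sleep(120)
    queue.put({"LABEL": 0.5})

def predict_with_timeout(text, timeout=60.0):
    """Run `predict` in a child process; give up after `timeout` seconds."""
    queue = ctx.Queue()
    proc = ctx.Process(target=predict, args=(text, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():          # still running after the deadline -> it hung
        proc.terminate()         # kill the stuck worker
        proc.join()
        return None              # caller treats None as "skip this paragraph"
    return queue.get()

if __name__ == "__main__":
    # The stand-in sleeps far longer than the 1-second timeout,
    # so this returns None instead of blocking for two days.
    print(predict_with_timeout("some paragraph", timeout=1.0))
```

Note that `terminate()` kills the worker outright, so any state in that process is lost and a fresh worker is started for the next call; if model start-up is expensive, a persistent worker with a watchdog is the more robust variant of the same idea.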


For the issue above, I tried using spacy-nightly, but I could not load my previous models (trained with the previous spaCy version); I probably need to re-train them with spacy-nightly.

Before I re-train all models: do you think we could try using the last spacy-nightly release (Feb 25, 2019) with Prodigy?

That way we could use the Prodigy batch-train recipe to re-train all models and use spacy-nightly to overcome the tokenizer issue above.

Thank you,
kind regards,

claudio nespoli

I tried spacy-nightly but I would need to re-train all my mo