C:\Users\Damiano>python -m prodigy stats
? Prodigy stats
Version            1.4.0
Location           C:\Program Files\Python3.6.4\lib\site-packages\prodigy
Prodigy Home       C:\Users\Damiano\.prodigy
Platform           Windows-10-10.0.16299-SP0
Python Version     3.6.4
Database Name      SQLite
Database Id        sqlite
Total Datasets     8
Total Sessions     23
Hmm! On Windows I could imagine a few different problems. I should’ve noticed your path in your earlier snippet.
Did you install via conda or pip? Could you run this benchmark and tell me what it says?
from timeit import default_timer as timer
import numpy

def main():
    m = 20000
    n = 10000
    k = 5000
    A = numpy.zeros((m, k), dtype='f')
    A += numpy.random.uniform(size=A.size).reshape(A.shape)
    B = numpy.zeros((k, n), dtype='f')
    B += numpy.random.uniform(size=B.size).reshape(B.shape)
    start = timer()
    C = numpy.dot(A, B)
    end = timer()
    print(end - start, C.sum())

if __name__ == '__main__':
    main()
On my machine, the timing of this benchmark varies greatly depending on how numpy is configured. If I install numpy via conda, the matrix multiplication is performed via Intel’s mkl library, and the calculation finishes in 12.2s. If I install numpy via pip with default settings, it’s linked against a version of OpenBLAS that does not detect my CPU correctly. That version takes 55.6s to complete. On Windows, I’m not sure what numpy does by default. It’s possible it’s fallen back to a reference BLAS library with no assembly kernels, which can be 10-20x slower than it should be.
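If you’re not sure which BLAS your numpy build is linked against, a quick way to check is to print numpy’s build configuration and look for an mkl_info or openblas_info section:

import numpy

# Prints the build configuration, including which BLAS/LAPACK
# libraries numpy was linked against (e.g. MKL or OpenBLAS).
numpy.show_config()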
The next versions of spaCy and Thinc address this by shipping OpenBLAS’s matrix multiplication function within Thinc. But in the meantime, if you’ve installed numpy via pip, try using conda to install numpy. You can still install the Prodigy wheel via pip — but just install numpy into the environment first.
Hi @honnibal
Maybe I have solved my problem by leaving this stupid OS… I have just installed Ubuntu 16.04.
I will install numpy via conda and maybe run Win 10 Pro in a virtual machine.
@honnibal I am trying to use Prodigy, but I get many “out of memory” errors when using ner.batch-train and ner.train-curve.
Before, I was using big documents (around 600-800 tokens long), but now the texts are smaller. I have created a custom recipe that shows 100 characters before/after the entity to confirm, so max 250 characters now.
Matt, another problem is that we cannot split a dataset. If memory is the problem, we would have to split the annotations, but that does not seem possible right now, correct?
600-800 tokens isn’t that long. When I wrote the beam search I did assume one sentence per text, so I was thinking 50-60 words maximum. If possible, it would be better to let Prodigy use sentence boundary detection rather than impose a character-based cutoff, which will cut words part way through.
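For example, pre-splitting long documents into sentence-sized tasks before loading them into Prodigy could look roughly like this sketch (the file names and model here are just placeholders, assuming a .jsonl source with a “text” field):

import json
import spacy

# Any model that sets sentence boundaries (here via the parser).
nlp = spacy.load('en_core_web_sm')

with open('documents.jsonl', encoding='utf8') as f_in, \
        open('sentences.jsonl', 'w', encoding='utf8') as f_out:
    for line in f_in:
        task = json.loads(line)
        doc = nlp(task['text'])
        for sent in doc.sents:
            # One task per sentence instead of one task per document.
            f_out.write(json.dumps({'text': sent.text}) + '\n')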
You can do prodigy db-out on your dataset to get a .jsonl file, and then split that up and feed it back in with prodigy db-in. I hope that’s not necessary though – I don’t think the size of the dataset is the problem.
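If you do end up splitting, the exported .jsonl is just one JSON record per line, so a minimal sketch like the following would cut it in half; each part can then be loaded into its own dataset with prodigy db-in (file names are placeholders):

# Split an exported annotations file into two halves, line by line.
with open('annotations.jsonl', encoding='utf8') as f:
    lines = f.readlines()

half = len(lines) // 2
with open('part1.jsonl', 'w', encoding='utf8') as f_out:
    f_out.writelines(lines[:half])
with open('part2.jsonl', 'w', encoding='utf8') as f_out:
    f_out.writelines(lines[half:])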
This sets the dimensions within the neural network to much smaller values, which should make things faster. Hopefully these settings make it less painful to run the model while we figure out whether there’s a memory leak.